KR100331689B1

KR100331689B1 - Method for speaker adaptive training in speech recognition system

Info

Publication number: KR100331689B1
Application number: KR1020000035081A
Authority: KR
Inventors: 이교혁
Original assignee: 송문섭; 주식회사 현대큐리텔
Priority date: 2000-06-24
Filing date: 2000-06-24
Publication date: 2002-04-09
Also published as: KR20020001066A

Abstract

The present invention provides a speaker adaptation training method for a speech recognition system. The method comprises: a first step of receiving speech data from many unspecified speakers and using that data to train a speaker-independent model for the words to be recognized, which serves as the initial model of a portable device; a second step of receiving, on the portable device, speech data from a specific speaker at least once per vocabulary item to be recognized; and a third step of modifying the initial speaker-independent model with the specific speaker's speech data to generate a speaker-dependent model for that speaker. Because the speaker-dependent model is generated from only the speaker-independent model and a small amount of training data from the specific speaker, a speaker-dependent model based on a hidden Markov model (HMM) can be implemented on a portable device.

Description

Method for speaker adaptive training in speech recognition system

The present invention relates to a speaker adaptation training method for a speech recognition system, and in particular to a method suited to implementing an HMM-based speaker-dependent model on a portable device by generating the speaker-dependent model for a specific person from only a speaker-independent model and a small amount of training data from that speaker.

In general, speech recognition means having a machine recognize human speech. For speech recognition technology to be applied in everyday life, real-time processing is essential along with a high recognition rate. Recognizing human speech requires a large amount of computation, which makes real-time processing difficult; the hidden Markov model (hereinafter "HMM") was proposed to address this.

Speech recognition systems use a variety of methods, including recognition based on dynamic time warping (hereinafter "DTW"), recognition based on neural networks, and recognition based on HMMs. DTW is used mainly for isolated word recognition, and HMMs are used for connected word recognition.
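As a rough illustration of the DTW approach mentioned above, the following sketch computes a dynamic time warping distance between two one-dimensional sequences. The function name `dtw_distance` and the absolute-difference local cost are assumptions made for this example; a practical recognizer would compare sequences of feature vectors rather than scalars.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] is the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]
```

Two utterances of the same word spoken at different speeds can still align with low cost, which is why the technique suits isolated word matching against stored templates.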

HMMs for speech recognition mainly use a first-order Markov process, in which the next state is determined by the immediately preceding state. The temporal constraint of speech is that it can be expressed as a function of time: an utterance begins, changes continuously, and ends. This lets the Markov model be simplified to one that allows only left-to-right transitions. The rationale for using an HMM is that the temporal state of the vocal organs, or the phoneme being produced, is hidden, and recognition amounts to inferring those phonemes or states from the observed speech output.
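The left-to-right constraint can be pictured as a transition matrix whose entries below the diagonal are all zero. The sketch below builds such a matrix; the function name, the restriction to self-loops and single-step moves, and the `stay_prob` value are illustrative assumptions, not values taken from the patent.

```python
def left_to_right_transitions(num_states, stay_prob=0.6):
    """Build a transition matrix where a state may only stay or move one state right."""
    matrix = [[0.0] * num_states for _ in range(num_states)]
    for s in range(num_states):
        if s == num_states - 1:
            matrix[s][s] = 1.0  # the final state can only loop on itself
        else:
            matrix[s][s] = stay_prob            # remain in the same state
            matrix[s][s + 1] = 1.0 - stay_prob  # advance to the next state
    return matrix
```

Because no entry allows a move back to an earlier state, any state sequence the model generates is non-decreasing in time, matching the begin, change, end progression of an utterance.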

For example, HMM-based recognition of the word '아버지' (abeoji, "father") extracts feature vectors (parameters) from the signals of its syllables '아' (a), '버' (beo), and '지' (ji). If the parameters were two-dimensional, each observation would be a point in a two-dimensional plane, and these points would be modeled with a Gaussian probability distribution (in practice the parameters have 20 to 28 dimensions). The probabilities given by these distributions are then computed to score '아', '버', and '지'.
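To illustrate how a Gaussian distribution scores a feature vector, the following sketch evaluates the log-density of a vector under a Gaussian with diagonal covariance. The diagonal-covariance simplification and the name `log_gaussian_diag` are assumptions for this example; the patent does not state the covariance structure.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log-density of vector x under a Gaussian with per-dimension variances."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        # Each dimension contributes its normalizer and squared-distance term.
        total += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * total
```

A feature vector close to the mean of the distribution for '아' receives a higher log-density under that distribution than under the distributions for '버' or '지', which is the basis for scoring each syllable.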

HMM-based speech recognition can be divided into a training process and a recognition process. In training, feature vectors are extracted from speech and used to adjust the model parameters until they converge, and the result is stored. In recognition, the speech under test is compared against each stored reference model, and the model with the highest probability determines the recognized speech.

To make the parameter values converge during training, methods such as the Baum-Welch re-estimation formulas, based on the expectation-maximization (EM) method, are used. Such methods find a local maximum rather than the global maximum. Because the parameter space is multidimensional, the parameters converge to one of several local maxima depending on the initial values, so a good choice of initial values allows convergence to a local maximum closer to the global maximum.

To recognize speech with an HMM, the speech is first divided into frames and windowed (that is, the audio signal is multiplied by a window function), and linear prediction coefficients (LPC) are then extracted as feature vectors in a preprocessing stage. Training is then performed on these feature vectors: all speech data is first divided into initial states, and the model is then made to converge using the re-estimation formulas, the Viterbi algorithm, and the like.
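A minimal sketch of the framing and windowing stage described above, assuming a Hamming window and hypothetical `frame_len` and `hop` parameters (the patent does not specify the window type or the frame sizes). LPC extraction is left out of this example.

```python
import math

def frame_and_window(signal, frame_len, hop):
    """Split a signal into overlapping frames and apply a Hamming window to each.

    frame_len must be at least 2; trailing samples that do not fill a frame are dropped.
    """
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
        for n in range(frame_len)
    ]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Multiply each sample of the frame by the matching window coefficient.
        frames.append([signal[start + n] * window[n] for n in range(frame_len)])
    return frames
```

Each returned frame would then be passed to a feature extractor (LPC in the patent) to produce one feature vector per frame.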

HMM-based speech recognition models the words to be recognized using a large amount of training data. At recognition time, the probability of the input signal is computed under each model produced in training, and the word is recognized from these probabilities.

Speech recognition methods include speaker-independent recognition, which achieves at least a certain level of performance on the voices of many unspecified speakers, and speaker-dependent recognition, which achieves that level of performance only on the voice of a specific speaker.

Of these, speaker-dependent recognition requires a large amount of speech data from the specific speaker during training. Building a model with a given level of precision for the words to be recognized therefore requires a correspondingly large amount of training data.

However, given the nature of portable devices such as mobile phones and personal computers, collecting a large amount of training data from a specific speaker poses considerable problems. Even if such data were collected, the model would have to be retrained with a maximum likelihood (ML) algorithm based on Baum-Welch re-estimation, which has the drawback of requiring additional memory to store the retraining program.

Because of these problems, speaker-dependent recognition on conventional portable devices has mostly used DTW rather than HMMs; an HMM-based approach increases manufacturing cost and causes considerable inconvenience to the user. DTW-based recognition, however, has the limitation that its recognition rate falls below that of HMMs as the recognition vocabulary grows.

The present invention is proposed to solve the above problems of the prior art. Its object is to provide a speaker adaptation training method for a speech recognition system that makes it possible to implement an HMM-based speaker-dependent model on a portable device by generating the speaker-dependent model for a specific person from only a speaker-independent model and a small amount of training data from that speaker.

To achieve this object, a speaker adaptation training method for a speech recognition system according to an embodiment of the present invention is as follows.

The method is characterized, in its technical configuration, by performing: a first step of receiving speech data from many unspecified speakers and using it to train a speaker-independent model for the words to be recognized, thereby generating the speaker-independent model to be used as the initial model of a portable device; a second step of receiving, on the portable device, speech data from a specific speaker at least once per vocabulary item to be recognized; and a third step of modifying the initial speaker-independent model using the specific speaker's speech data to generate a speaker-dependent model for that person.

FIG. 1 is a block diagram of a speech recognition system to which the present invention is applied.

FIG. 2 is a flowchart showing the speaker adaptation training method of a speech recognition system according to the present invention.

FIG. 3 shows the third step of FIG. 2 in detail.

Description of reference numerals for main parts of the drawings:

10: speech feature extraction unit
20: HMM model unit
30: pattern recognition unit
40: post-processing unit

An embodiment according to the technical idea of the speaker adaptation training method of the present invention is described below.

FIG. 1 is a block diagram of a speech recognition system to which the present invention is applied.

As shown there, the system comprises: a speech feature extraction unit (10) that extracts features useful for recognition from the input speech; an HMM (hidden Markov model) model unit (20) that probabilistically models the varying statistical characteristics of the input speech; a pattern recognition unit (30) that, according to the models of the HMM model unit (20), recognizes a pattern in the feature vectors output by the speech feature extraction unit (10) and outputs the recognized candidate word; and a post-processing unit (40) that post-processes the candidate word output by the pattern recognition unit (30) to decide whether to reject it or accept it as the recognition result.
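The hand-off from the pattern recognition unit (30) to the post-processing unit (40) can be sketched as picking the best-scoring word and then applying an acceptance test. The patent does not say how the post-processing unit decides, so the fixed log-likelihood threshold below, and the names `recognize` and `reject_threshold`, are assumptions for illustration only.

```python
def recognize(scores, reject_threshold):
    """Pick the best-scoring candidate word, or return None if it is rejected.

    scores maps each vocabulary word to the log-likelihood of the input
    under that word's model (assumed to be computed elsewhere).
    """
    candidate = max(scores, key=scores.get)  # role of the pattern recognition unit
    if scores[candidate] < reject_threshold:  # role of the post-processing unit
        return None
    return candidate
```

For instance, with scores of -10.0 for "call" and -25.0 for "stop", a threshold of -20.0 accepts "call", while a stricter threshold of -5.0 rejects the input.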

FIG. 2 is a flowchart showing the speaker adaptation training method of a speech recognition system according to the present invention.

As shown there, the method performs: a first step (ST11, ST12) of receiving speech data from many unspecified speakers and using it to train a speaker-independent model for the words to be recognized, thereby generating the speaker-independent model to be used as the initial model of a portable device; a second step (ST13) of receiving, on the portable device, speech data from a specific speaker at least once per vocabulary item to be recognized; and a third step (ST14) of modifying the initial speaker-independent model using the specific speaker's speech data to generate a speaker-dependent model for that person.

The third step comprises: a step (ST21) of taking the speaker-independent model of the first step and the specific speaker's speech data of the second step, and performing Viterbi decoding with the speaker-independent model to segment the data by state; a step (ST22) of clustering the segmented speech data so that each state has as many clusters as it has mixture components, and setting the mean observation vector of each cluster as that cluster's center; and a step (ST23, ST24) of combining the centers of the clustered speech data with the means of the Gaussian mixture of the speaker-independent model of the first step to generate the speaker-dependent model of the portable device.

In the third step, when the speaker-dependent model of the portable device is generated, all of its parameters other than the means are taken unchanged from the speaker-independent model of the first step.

The operation of the speaker adaptation training method configured as above is described in detail below with reference to the accompanying drawings.

First, the portable device records the voice of the specific person, at least once per vocabulary item to be recognized. The small amount of training data collected from that person is then used to modify only the means among the parameters of the speaker-independent model, producing a speaker-dependent model for that person. In other words, the portable device starts with the speaker-independent model as its initial model and uses a small amount of training data from the specific person to turn it into a speaker-dependent model.

To this end, a speaker-independent model is first built from speech data consisting of utterances by many unspecified speakers. Viterbi decoding is then performed with this speaker-independent model to label the speech collected from the specific person in units of states. The data belonging to each state is clustered into as many clusters as there are mixture components. The centroids of the clustered data and the means of the Gaussian mixture of the speaker-independent model are then combined in a suitable ratio to produce the speaker-dependent model for the specific speaker. The variances, initial probabilities, and transition probabilities are taken unchanged from the speaker-independent model.
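The state labeling step can be sketched with a standard Viterbi pass that returns the most likely state for every frame. The function below is a generic textbook formulation working on log-probabilities, not code from the patent; the name `viterbi_segment` and the input layout (`log_emit[t][s]` for frame t and state s) are assumptions.

```python
def viterbi_segment(log_emit, log_trans, log_init):
    """Return the most likely state index for each frame (log-domain Viterbi)."""
    num_frames, num_states = len(log_emit), len(log_init)
    # delta[t][s]: best log-score of any path that ends in state s at frame t.
    delta = [[log_init[s] + log_emit[0][s] for s in range(num_states)]]
    back = []  # back[t-1][s]: best predecessor of state s at frame t
    for t in range(1, num_frames):
        row, ptr = [], []
        for s in range(num_states):
            prev = max(range(num_states),
                       key=lambda p: delta[-1][p] + log_trans[p][s])
            row.append(delta[-1][prev] + log_trans[prev][s] + log_emit[t][s])
            ptr.append(prev)
        delta.append(row)
        back.append(ptr)
    # Trace back from the best final state to recover the full path.
    state = max(range(num_states), key=lambda s: delta[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path
```

Grouping the frames by the returned state index gives the per-state data that the clustering step then operates on.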

Accordingly, in the first step, a speaker-independent model for the words to be recognized is trained using speech data collected from many unspecified speakers. The trained speaker-independent model is used as the initial model of the portable device.

In the second step, the portable device collects speech data from the specific speaker at least once per vocabulary item to be recognized.

In the third step, the speaker-independent model of the first step and the specific speaker's speech data of the second step are used to create a speaker-dependent model for the specific speaker.

The third step is described in more detail below.

First, let λ denote the HMM representation of the speaker-independent model, and let $O = (O_1, O_2, \ldots, O_N)$ denote the N speech utterances collected from the specific person. The HMM representation of the speaker-dependent model for the specific speaker, to be estimated from λ and O, is denoted λ'.

The N collected speech utterances are Viterbi-decoded with the speaker-independent model (λ) and segmented by state. The segmented speech data is then clustered so that each state has as many clusters as it has mixture components, and the mean observation vector of each cluster is set as that cluster's center ($\mu_{osm}$). The center of each cluster is the arithmetic mean given by Equation 1:

$$\mu_{osm} = \frac{1}{M} \sum_{i=1}^{M} o_i \qquad \text{(Equation 1)}$$

Here M is the number of observation vectors in state s and mixture m, $o_i$ is the i-th of those observation vectors, and $\mu_{osm}$ is the center of the observation vectors in state s and mixture m.
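The patent does not name a clustering algorithm. As one simple possibility, the sketch below assigns each observation vector of a state to the nearest speaker-independent mixture mean and then averages each group to obtain the cluster centers. The nearest-mean assignment, the squared Euclidean distance, and the name `cluster_centroids` are all assumptions made for this example.

```python
def cluster_centroids(frames, si_means):
    """Group one state's frames by nearest mixture mean and average each group.

    Returns one centroid per mixture component, or None for a component
    that received no adaptation data.
    """
    clusters = [[] for _ in si_means]
    for x in frames:
        nearest = min(
            range(len(si_means)),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, si_means[m])),
        )
        clusters[nearest].append(x)
    centroids = []
    for members in clusters:
        if not members:
            centroids.append(None)  # nothing observed for this mixture component
            continue
        dim = len(members[0])
        # Arithmetic mean of the member vectors, one dimension at a time.
        centroids.append(
            [sum(v[d] for v in members) / len(members) for d in range(dim)]
        )
    return centroids
```

With only one or a few utterances per word, some mixture components will often receive no frames at all, which is why the sketch returns `None` for them instead of a centroid.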

Using the arithmetic mean obtained from Equation 1, the means of the Gaussian mixture of the speaker-independent model are modified as in Equation 2 to generate the speaker-dependent model for the specific speaker.

Here the two quantities are, respectively, the mean of state s and mixture m in the speaker-independent model λ and the mean of state s and mixture m in the speaker-dependent model λ'.
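The formula of Equation 2 is not reproduced in this text, and the description says only that the cluster centers and the speaker-independent means are combined in a suitable ratio. One plausible reading, offered as an assumption and not as the patent's actual formula, is a linear interpolation with a hypothetical weight $\alpha$:

```latex
\hat{\mu}_{sm} = \alpha\,\mu_{osm} + (1 - \alpha)\,\mu_{sm}, \qquad 0 \le \alpha \le 1
```

In this sketch $\mu_{sm}$ stands for the speaker-independent mean of state s and mixture m, $\hat{\mu}_{sm}$ for the adapted speaker-dependent mean, and $\mu_{osm}$ for the cluster center of the specific speaker's data; the first two symbols and $\alpha$ are notation introduced here. Setting $\alpha = 0$ keeps the speaker-independent mean, and $\alpha = 1$ replaces it outright with the cluster center.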

Replacing the means of the speaker-independent model with the means estimated by Equation 2 in this way yields the speaker-dependent model for the specific person.

Because the nature of portable devices makes it difficult to collect a large amount of speech data from a specific speaker for training the speech recognition system, all parameters other than the means (initial probabilities, transition probabilities, variances, and so on) are taken unchanged from the speaker-independent model.
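The mean-only update can be sketched as copying the speaker-independent model and overwriting just its means. The dictionary layout of the model, the name `adapt_model`, and the interpolation `weight` are assumptions for this example; the patent states only that the cluster centers and the original means are combined in a suitable ratio.

```python
import copy

def adapt_model(si_model, centroids, weight=0.5):
    """Return a speaker-dependent copy of si_model with only the means updated.

    si_model["means"][s][m] is the mean vector of state s, mixture m.
    centroids[s][m] is the matching cluster center from the new speaker,
    or None when no data was assigned to that mixture component.
    """
    sd_model = copy.deepcopy(si_model)  # variances, transitions, etc. are kept
    for s, state_centroids in enumerate(centroids):
        for m, center in enumerate(state_centroids):
            if center is None:
                continue  # no adaptation data: keep the speaker-independent mean
            old = si_model["means"][s][m]
            sd_model["means"][s][m] = [
                weight * c + (1.0 - weight) * o for c, o in zip(center, old)
            ]
    return sd_model
```

Since only the means change, the device needs no Baum-Welch retraining code, and the update costs a single pass over the mixture components.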

In this way, the present invention implements an HMM-based speaker-dependent model on a portable device by generating the speaker-dependent model for a specific person from only the speaker-independent model and a small amount of training data from that speaker.

A preferred embodiment of the present invention has been described above, but various changes, modifications, and equivalents may be used. It is clear that the invention can be applied equally by suitably modifying the above embodiment. The invention is applicable to any speech recognition system operated where it is difficult to collect sufficient speech data from a specific speaker, including portable devices such as mobile phones and personal computers. The above description therefore does not limit the scope of the invention, which is defined by the following claims.

As described above, the speaker adaptation training method of the present invention addresses the following problem of the prior art: when collecting speech data for a speaker-dependent recognition system aimed at a single speaker, it is difficult in practice to collect enough data, so DTW is used, and DTW performs worse than HMMs. By modifying only the means of the speaker-independent model to produce a speaker-dependent model for the specific speaker, the invention makes it easier to implement an HMM-based speaker-dependent speech recognition system, which achieves a higher recognition rate than DTW.

Claims

1. A speaker adaptation training method for a speech recognition system, comprising:

a first step of receiving speech data from many unspecified speakers and using the input data to train a speaker-independent model for the words to be recognized, thereby generating a speaker-independent model to be used as the initial model of a portable device;

a second step of receiving, on the portable device, speech data of a specific speaker at least once per vocabulary item to be recognized; and

a third step of generating a speaker-dependent model for the specific person by modifying the speaker-independent model using the speech data of the specific speaker.

2. The method of claim 1, wherein the third step comprises:

receiving the speaker-independent model of the first step and the speech data of the specific speaker of the second step, and performing Viterbi decoding with the speaker-independent model to segment the data by state;

clustering the segmented speech data so that each state has as many clusters as it has mixture components, and setting the mean observation vector of each cluster as the center of that cluster; and

generating the speaker-dependent model of the portable device by combining the centers of the clustered speech data with the Gaussian mixture of the speaker-independent model of the first step.

3. The method of claim 2, wherein, in the third step, all parameters of the speaker-dependent model other than the means are taken unchanged from the speaker-independent model of the first step when the speaker-dependent model of the portable device is generated.