KR100766061B1

KR100766061B1 - apparatus and method for speaker adaptive

Info

Publication number: KR100766061B1
Application number: KR1020050120301A
Authority: KR
Inventors: 전형배
Original assignee: 한국전자통신연구원
Priority date: 2005-12-09
Filing date: 2005-12-09
Publication date: 2007-10-11
Also published as: KR20070060581A

Abstract

The present invention provides a speech recognition system in which the user's accumulated information and acoustic model are updated each time the user uses the service, so that the acoustic model becomes increasingly user-dependent the more the user uses the service. The method comprises: (a) generating an acoustic model for each user through a user identification process and registering it in a user acoustic model DB; (b) for a user's utterance, loading the user acoustic model from the previously stored user acoustic model DB, performing speech recognition, and outputting the information required for speaker adaptation; (c) receiving the output information and accumulating the observation data required for speaker adaptation in a user accumulator DB; and (d) performing speaker adaptation using the information accumulated so far at the time the user ends the speech recognition service.

Speaker adaptation server, Speech recognition server, Acoustic model

Description

Apparatus and method for speaker adaptation

FIG. 1 is a block diagram showing a speaker adaptation system using a speaker adaptation server according to the present invention.

FIG. 2 is a flowchart showing a speaker adaptation method using a speaker adaptation server according to the present invention.

FIG. 3 is a flowchart showing in more detail the process of generating a user acoustic model through supervised speaker adaptation in the user registration step of the speaker adaptation method using a speaker adaptation server according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

110: user    120: speech recognition server

130: speaker adaptation server    140: user acoustic model DB

150: user accumulator DB    160: application program

The present invention relates to a speech recognition system, and more particularly to a method that uses a speaker adaptation server to improve the recognition rate over that of a general speaker-independent speech recognition system.

Because a general speech recognition system performs recognition for unspecified speakers, a speaker-independent acoustic model is trained on speech data collected from many training speakers. Such a system is called a speaker-independent speech recognition system.

In practice, however, a particular speaker typically keeps using the same speech recognition system, and a speaker-dependent speech recognition system, which uses a speaker-dependent acoustic model trained on that speaker's speech data, outperforms a speaker-independent system. A speaker adaptation method is therefore needed that uses the particular speaker's speech to transform the speaker-independent acoustic model into a speaker-dependent one.

Meanwhile, for a speech recognition service in which it is never known in advance which speaker will use it, speaker adaptation is performed using the speaker's speech signal captured during service. Because the system does not know what the user actually said, adaptation must rely on unsupervised learning, which somewhat degrades adaptation performance.

In contrast, for services that a particular speaker uses after registering in advance, such as securities-trading, financial-transaction, and telematics speech recognition services, the best recognition performance is sought by performing speaker adaptation at the speaker registration stage, storing the resulting acoustic model for each user, and loading that speaker's speaker-dependent acoustic model during service. This improves the performance of the speech recognition system.

In a system that generates the speaker-dependent acoustic model at the speaker registration stage, the system prepares in advance a list of utterances with a suitably balanced phoneme distribution and presents it to the user, so that supervised speaker adaptation can be performed. This improves adaptation performance and makes the approach well suited to practical speaker-registration speech recognition systems. Its drawback is that the user acoustic model never changes after the initial registration, so recognition performance does not improve however steadily the user uses the service.

Speaker adaptation is generally performed using MAP (Maximum A Posteriori), MLLR (Maximum Likelihood Linear Regression), Eigenvoice, and similar methods. To obtain the best adaptation performance, an appropriate method must be selected according to the amount of observation data available at the adaptation stage.

The Eigenvoice method is a fast speaker adaptation method that adapts close to a speaker-dependent acoustic model even when only a small amount of adaptation speech is available, small enough not to inconvenience the user, and it is also known to perform well.

When a sufficiently large amount of observation data is available, on the other hand, MAP adaptation is known to achieve performance close to that of a speaker-dependent acoustic model.

MLLR adaptation can be expected to perform best when the amount of observation data lies between the amounts suited to the other two methods.
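The ordering described above (Eigenvoice for little data, MLLR for an intermediate amount, MAP for plenty) can be sketched as a simple selection rule. This is only an illustration: the function name and both threshold values are hypothetical, since the patent gives no numeric boundaries.

```python
def choose_adaptation_method(seconds_of_speech, small=10.0, large=300.0):
    """Pick an adaptation method from the amount of accumulated speech.

    `small` and `large` are illustrative assumptions, not values from the
    patent; only the ordering Eigenvoice < MLLR < MAP comes from the text.
    """
    if seconds_of_speech < 0:
        raise ValueError("amount of speech must be non-negative")
    if seconds_of_speech < small:
        return "Eigenvoice"  # very little data: fast adaptation
    if seconds_of_speech < large:
        return "MLLR"        # intermediate amount of data
    return "MAP"             # plenty of data: approaches speaker-dependent


for amount in (3.0, 60.0, 1200.0):
    print(amount, choose_adaptation_method(amount))
```

In a deployed system the boundaries would have to be tuned empirically on held-out adaptation data.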

The present invention has been devised to solve the above problems. Its object is to provide a speech recognition system in which the user's accumulated information and acoustic model are updated each time the user uses the service, so that the acoustic model becomes increasingly user-dependent the more the user uses the service.

Another object of the present invention is to provide a speaker adaptation method using a speaker adaptation server that can raise the recognition rate of a speech recognition system above that of a conventional speaker-independent speech recognition system.

To achieve the above objects, a speaker adaptation method according to the present invention comprises: (a) generating an acoustic model for each user through a user identification process and registering it in a user acoustic model DB; (b) for a user's utterance, loading the user acoustic model from the previously stored user acoustic model DB, performing speech recognition, and outputting the information required for speaker adaptation; (c) receiving the output information and accumulating the observation data required for speaker adaptation in a user accumulator DB; and (d) performing speaker adaptation using the information accumulated so far at the time the user ends the speech recognition service.

Preferably, the information required for the speaker adaptation process is the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data.

Preferably, step (a) comprises: determining, from an entered user ID, whether the service user is an already registered user; and, if the user is a first-time user, carrying out user registration in the service system, generating an initial user acoustic model through a supervised speaker adaptation process, and registering the generated user acoustic model in the user acoustic model DB.

Preferably, generating the acoustic model comprises: when a first-time user registers in the service system, creating an initial user accumulator for the first-time user's speaker-adapted acoustic model generation process; receiving the user's utterances by having the user read aloud, one sentence at a time, an utterance list with a suitably balanced phoneme distribution presented for supervised speaker adaptation; extracting speech feature data from the speech data uttered by the user according to a speech feature format; generating the corresponding phoneme sequence from the utterance list and aligning the extracted speech feature data with the phoneme sequence by Viterbi alignment; accumulating the observation information of each phoneme for the speech feature data using the time information of each phoneme obtained from the alignment; generating the user's acoustic model by a speaker adaptation method and registering it in the user acoustic model DB; and registering the initial user accumulator, in which the observation information has been accumulated, in the user accumulator DB as the user accumulator.

Preferably, step (b) comprises: loading the user acoustic model from the user acoustic model DB; when the user speaks to use the speech recognition service, performing speech recognition and outputting the information required for the speaker adaptation process; and having the application program operate on the basis of the speech recognition result while the speech recognition server delivers the information required for speaker adaptation to the speaker adaptation server.

Preferably, step (c) comprises: loading from the user accumulator DB the user accumulator that was accumulated during the initial supervised speaker adaptation or further accumulated during previous service sessions; measuring the confidence of the recognition result using the output speech recognition information; and comparing the measured confidence with a predetermined threshold and accumulating the observed information in the user accumulator according to the result.

Preferably, the speech recognition information is the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data, and the information loaded from the user accumulator is the accumulated speech feature data and occupation probabilities.

Preferably, the speaker adaptation process is one of the MAP (Maximum A Posteriori), MLLR (Maximum Likelihood Linear Regression), and Eigenvoice methods.

Preferably, the accumulating is performed only when the confidence is greater than the threshold.

Preferably, step (d) comprises: a first determination step of determining whether the service is to be ended; if the first determination is that the user continues to use the service, detecting the user's speech, and if it is that the user ends the service, delivering a service end signal to the speaker adaptation server at the time the service ends; a second determination step in which the speaker adaptation server, on receiving the service end signal, determines whether the current user is continuing to use the service; if the second determination is that the user continues to use the service, repeating from the confidence measurement, and if it is that the user ends the service, generating a user acoustic model using the user accumulator accumulated so far and updating the user acoustic model DB with it; and updating the user accumulator DB with the user accumulator accumulated so far.

To achieve the above objects, a speaker adaptation apparatus according to the present invention comprises: a speech recognition server that performs speaker-dependent speech recognition on the speech data uttered by the user, based on a predefined user acoustic model, and generates the information required for the speaker adaptation process; a speaker adaptation server that, using the generated information, loads the previously used user accumulator and accumulates observation data, performs speaker adaptation using the accumulated information when the speech recognition service ends to generate a new user acoustic model, and updates the newly generated user acoustic model and the user accumulator accumulated so far; and an application program that drives the corresponding function using the speech recognition result information from the speech recognition server.

Preferably, the information required for the speaker adaptation process is the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data.

Other objects, features, and advantages of the present invention will become apparent from the detailed description of the embodiments with reference to the accompanying drawings.

A preferred embodiment of the speaker adaptation method using a speaker adaptation server according to the present invention is described below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a speaker adaptation system using a speaker adaptation server according to the present invention.

As shown in FIG. 1, the system comprises: a speech recognition server (120) that performs speaker-dependent speech recognition on the speech data uttered by a user of the speech recognition service, based on the user acoustic model predefined in the user acoustic model DB (140), and generates the information required for the speaker adaptation process; a speaker adaptation server (130) that, using the information generated by the speech recognition server (120), loads the previously used user accumulator from the user accumulator DB (150) and accumulates observation data, performs speaker adaptation using the accumulated information when the speech recognition service ends, updates the user acoustic model DB (140) with the newly generated user acoustic model, and updates the user accumulator DB (150) with the user accumulator accumulated so far; and an application program (160) that drives the corresponding function using the speech recognition result information from the speech recognition server.

The information required for the speaker adaptation process preferably consists of the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data.

The speaker adaptation method using the speaker adaptation server according to the present invention, configured as above, is described in detail below with reference to the accompanying drawings.

FIG. 2 is a flowchart showing a speaker adaptation method using a speaker adaptation server according to the present invention.

Referring to FIG. 2, when the speech recognition service is first started, the system generates an acoustic model for each user through a user identification process and registers it in the user acoustic model DB (140) (S100).

In more detail, the system first checks whether the service user is an already registered user (S110). A new user goes through the user registration procedure, whereas an existing user enters a user ID. From this, the system determines whether the service user is a first-time user (S120).

If the determination (S120) is that the user is a first-time user, the user goes through the user registration process in the service system (S130). After completing registration, the first-time user generates an initial user acoustic model through the supervised speaker adaptation process (S140), and the generated user acoustic model is registered in the user acoustic model DB (140). For an already registered user, the user acoustic model generated at registration, or updated during previous service use, is loaded from the user acoustic model DB (140) (S210).

The method of generating an acoustic model through the supervised speaker adaptation process is described in detail later with reference to FIG. 3.

Next, for the user's utterance, the speech recognition server (120) loads the user acoustic model from the previously stored user acoustic model DB (140), performs speech recognition, and outputs the information required for the speaker adaptation process (S200). This information includes the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data.

In more detail, for an existing user, the user acoustic model is first loaded from the stored user acoustic model DB (S210). When the user then speaks to use the speech recognition service (S220), the speech recognition server (120) performs speech recognition (S230) and outputs the information required for the speaker adaptation process (S250). The results of the speech recognition are delivered to the speaker adaptation server (130) and the application program (160) (S240, S250). The application program (160) operates according to the delivered speech recognition result (S260), and the speech recognition server (120) delivers to the speaker adaptation server (130) the information required for speaker adaptation that is produced in the course of recognition, such as the recognized phoneme sequence, the time information of each phoneme, and the speech feature data for the corresponding time (S240).
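As a rough illustration of the information handed from the speech recognition server to the speaker adaptation server, the phoneme sequence, its time information, and the feature data could be bundled as below. The class and field names, and the use of frame indices for time information, are assumptions made for this sketch; the patent does not define a message format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PhonemeSegment:
    phoneme: str       # recognized phoneme label
    start_frame: int   # first frame of the phoneme (inclusive)
    end_frame: int     # frame after the last one (exclusive)


@dataclass
class AdaptationInfo:
    """Hypothetical payload sent to the speaker adaptation server."""
    user_id: str
    segments: List[PhonemeSegment]   # phoneme sequence with time information
    features: List[List[float]]      # one feature vector per frame

    def frames_for(self, segment: PhonemeSegment) -> List[List[float]]:
        """Return the feature vectors that fall inside one phoneme segment."""
        return self.features[segment.start_frame:segment.end_frame]

    def is_consistent(self) -> bool:
        """Check that segments are contiguous and cover every frame."""
        expected_start = 0
        for seg in self.segments:
            if seg.start_frame != expected_start or seg.end_frame <= seg.start_frame:
                return False
            expected_start = seg.end_frame
        return expected_start == len(self.features)


info = AdaptationInfo(
    user_id="user-1",
    segments=[PhonemeSegment("a", 0, 2), PhonemeSegment("b", 2, 3)],
    features=[[0.1], [0.2], [0.3]],
)
print(info.is_consistent(), info.frames_for(info.segments[1]))
```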

The speaker adaptation server (130) then receives the information output from the speech recognition server (120) and accumulates the observation data required for speaker adaptation (S300). To do so, it loads the previously used user accumulator from the user accumulator DB (150) and adds the new accumulated information to the existing accumulated information.

In more detail, the user accumulator that was accumulated during the service user's initial supervised speaker adaptation, or further accumulated during previous service sessions, is first loaded from the user accumulator DB (150) in preparation for accumulating information about the new speech data (S310).

The confidence of the recognition result is then measured using the main speech recognition information received from the speech recognition server (120) (S320). This main information includes the phoneme sequence of the recognition result, the time information of the phoneme sequence, and the speech feature data.

The measured confidence is then compared with a predetermined threshold, and the observed information is accumulated in the user accumulator only when the confidence is greater than the threshold (S340).
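The gating rule can be sketched in a few lines. The patent does not say how the confidence is computed or what threshold is used, so the sketch takes the confidence as a given number and uses a hypothetical default threshold of 0.5; the function name is also an assumption.

```python
def gate_and_accumulate(pending, accumulated, confidence, threshold=0.5):
    """Append `pending` observations to `accumulated` only when the
    recognition confidence is strictly greater than the threshold.

    Returns True if the observations were accumulated, False otherwise.
    The default threshold is an illustrative assumption.
    """
    if confidence > threshold:
        accumulated.extend(pending)
        return True
    return False


store = []
print(gate_and_accumulate([("a", [0.1])], store, confidence=0.9))  # accepted
print(gate_and_accumulate([("b", [0.2])], store, confidence=0.3))  # rejected
print(store)
```

Discarding low-confidence utterances matters here because the in-service adaptation is unsupervised: a misrecognized phoneme sequence would otherwise contribute statistics to the wrong phoneme models.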

The common speaker adaptation methods MAP (Maximum A Posteriori), MLLR (Maximum Likelihood Linear Regression), and Eigenvoice require the observed speech feature data and the occupation probability with which each feature vector occupies the corresponding phoneme model. In the observation accumulation step (S340), the occupation probability is therefore computed, and the product of the occupation probability and the speech feature data value, together with the frequency with which the corresponding phoneme model was observed, is added to the values already accumulated in the loaded user accumulator.
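A minimal sketch of such an accumulator follows, assuming one mean vector per phoneme model. For each phoneme it keeps the summed occupation probability (the observation frequency) and the sum of occupation probability times feature vector. The class and method names are hypothetical.

```python
from collections import defaultdict


class UserAccumulator:
    """Per-phoneme sufficient statistics (hypothetical layout)."""

    def __init__(self, dim):
        self.dim = dim
        # Sum of occupation probabilities per phoneme model.
        self.occupancy = defaultdict(float)
        # Sum of (occupation probability * feature vector) per phoneme model.
        self.weighted_sum = defaultdict(lambda: [0.0] * dim)

    def add(self, phoneme, feature, gamma=1.0):
        """Accumulate one frame; gamma is its occupation probability."""
        if len(feature) != self.dim:
            raise ValueError("feature dimension mismatch")
        self.occupancy[phoneme] += gamma
        sums = self.weighted_sum[phoneme]
        for k, x in enumerate(feature):
            sums[k] += gamma * x


acc = UserAccumulator(dim=2)
acc.add("a", [1.0, 2.0], gamma=1.0)
acc.add("a", [3.0, 4.0], gamma=0.5)
print(acc.occupancy["a"], acc.weighted_sum["a"])
```

Because only sums are stored, statistics from a new session can be added to those loaded from the user accumulator DB without keeping the raw speech.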

Next, when the user ends the speech recognition service, the speaker adaptation server (130) performs speaker adaptation using the information accumulated so far. It then updates the user acoustic model DB (140) with the newly generated user acoustic model for that speaker, and updates the user accumulator DB (150) with the user accumulator accumulated so far.

In more detail, it is first determined whether the service is to be ended (S410). If the determination (S410) is that the user continues to use the service, the user's speech is detected; if it is that the user ends the service, a service end signal is delivered to the speaker adaptation server (130) at the time the service ends (S420).

When the speaker adaptation server (130) receives the service end signal, it determines whether the current user is continuing to use the service (S350).

If the determination (S350) is that the user continues to use the service, the process is repeated from the confidence measurement.

If the determination (S350) is that the user ends the service, a user acoustic model is generated by the speaker adaptation method using the user accumulator accumulated so far, and the user acoustic model DB is updated with the new user acoustic model (S360).

In addition, the user accumulator accumulated so far replaces the existing user accumulator in the user accumulator DB, so that the DB holds a user accumulator containing more information (S370).
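The end-of-session update (S360, S370) might look like the sketch below if MAP adaptation of the Gaussian means is chosen. The standard MAP mean update is used, new mean = (tau * prior mean + sum of gamma * x) / (tau + sum of gamma). The prior weight `tau`, the use of the speaker-independent means as the prior, the dictionary-based DBs, and the function names are all assumptions for illustration; the patent leaves the choice of adaptation method open.

```python
def map_update_means(prior_means, occupancy, weighted_sum, tau=10.0):
    """MAP re-estimation of per-phoneme mean vectors.

    prior_means:  {phoneme: mean vector} used as the prior (assumed here to
                  be the speaker-independent means)
    occupancy:    {phoneme: summed occupation probability}
    weighted_sum: {phoneme: sum of occupation probability * feature vector}
    tau:          prior weight (illustrative value)
    """
    new_means = {}
    for phoneme, mu0 in prior_means.items():
        n = occupancy.get(phoneme, 0.0)
        s = weighted_sum.get(phoneme, [0.0] * len(mu0))
        new_means[phoneme] = [(tau * m + sk) / (tau + n) for m, sk in zip(mu0, s)]
    return new_means


def end_session(user_id, prior_means, occupancy, weighted_sum,
                model_db, accumulator_db, tau=10.0):
    """Regenerate the user model and persist both model and accumulator."""
    model_db[user_id] = map_update_means(prior_means, occupancy, weighted_sum, tau)
    accumulator_db[user_id] = {
        "occupancy": dict(occupancy),
        "weighted_sum": {p: list(v) for p, v in weighted_sum.items()},
    }
    return model_db[user_id]


model_db, accumulator_db = {}, {}
means = end_session(
    "user-1",
    prior_means={"a": [0.0], "b": [3.0]},
    occupancy={"a": 10.0},
    weighted_sum={"a": [20.0]},
    model_db=model_db,
    accumulator_db=accumulator_db,
)
print(means)
```

A phoneme with no observations keeps its prior mean, while a frequently observed one moves toward the user's own average, which matches the intended behavior of the model improving with use.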

FIG. 3 is a flowchart showing in more detail the process of generating a user acoustic model through supervised speaker adaptation in the user registration step of the speaker adaptation method using a speaker adaptation server according to the present invention.

Referring to FIG. 3, when a first-time user registers in the service system (S130), an initial user accumulator is created for the first-time user's speaker-adapted acoustic model generation process (S141). The user is then presented with an utterance list prepared in advance so that its phonemes are suitably distributed.

That is, an utterance list with a suitably balanced phoneme distribution is presented for supervised speaker adaptation (S142), the user is asked to read the list aloud one sentence at a time, and the speaker adaptation server (130) receives the user's utterances (S143).

Speech features are then extracted from the speech data uttered by the user according to an appropriate speech feature format (S144).

The speaker adaptation server (130) then generates the corresponding phoneme sequence from the presented utterance list (S142) and aligns the speech feature data extracted from the received user utterances with the phoneme sequence by Viterbi alignment (S145). Because no user acoustic model exists yet at this point, the speaker-independent acoustic model is loaded and used for the alignment (S149).
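A heavily simplified sketch of Viterbi forced alignment is given below. It assumes one state per phoneme, a left-to-right topology with no skips, no transition probabilities, and a unit-variance Gaussian score per phoneme mean. A real acoustic model would use multi-state HMMs with mixture densities, so this shows only the dynamic-programming idea; all names are hypothetical.

```python
import math


def frame_score(frame, mean):
    """Log-likelihood up to a constant for a unit-variance Gaussian."""
    return -sum((x - m) ** 2 for x, m in zip(frame, mean))


def viterbi_align(features, phonemes, means):
    """Align frames to a known phoneme sequence.

    Returns a list of (phoneme, start_frame, end_frame) with end exclusive.
    Every phoneme receives at least one frame.
    """
    T, N = len(features), len(phonemes)
    if N == 0 or T < N:
        raise ValueError("need at least one frame per phoneme")
    neg_inf = -math.inf
    score = [[neg_inf] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    score[0][0] = frame_score(features[0], means[phonemes[0]])
    for t in range(1, T):
        for j in range(N):
            stay = score[t - 1][j]                      # remain in phoneme j
            move = score[t - 1][j - 1] if j > 0 else neg_inf  # advance from j-1
            if stay >= move:
                best, back[t][j] = stay, j
            else:
                best, back[t][j] = move, j - 1
            if best > neg_inf:
                score[t][j] = best + frame_score(features[t], means[phonemes[j]])
    # Backtrack from the last phoneme at the last frame.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # Collapse the frame-level path into phoneme segments.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or path[t] != path[t - 1]:
            segments.append((phonemes[path[start]], start, t))
            start = t
    return segments


si_means = {"a": [0.0], "b": [1.0], "c": [2.0]}  # stand-in speaker-independent model
frames = [[0.0], [0.1], [1.0], [0.9], [2.0]]
print(viterbi_align(frames, ["a", "b", "c"], si_means))
```

The resulting segment boundaries play the role of the per-phoneme time information that the next step uses to decide which frames are accumulated for which phoneme model.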

Using the time information of each phoneme obtained from the alignment, the observation information of each phoneme for the speech feature data is accumulated in the accumulator (S146). As in the observation accumulation step (S340) of the speaker adaptation method above, the accumulated observation information consists of the product of the speech feature data value and the occupation probability with which the feature data occupies the corresponding phoneme model, and the frequency with which that phoneme model was observed. It is accumulated in the initial accumulator created in the initial user accumulator creation step.

When observation information has been accumulated for the entire presented utterance list, the speaker adaptation server 130 generates the user's acoustic model by a speaker adaptation method and registers the generated user acoustic model in the user acoustic model DB 140 (S147).
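As an illustration of how a user acoustic model could be derived from accumulated statistics, the sketch below applies the standard MAP update to Gaussian mean vectors. It is an assumption that MAP is the method used at this step and that only means are adapted; the function name `map_adapt_means` and the prior weight `tau` are hypothetical and do not appear in this description.

```python
import numpy as np

def map_adapt_means(si_means, weighted_sum, count, tau=10.0):
    """MAP update of mean vectors from accumulated statistics.

    si_means:     phoneme -> speaker-independent mean vector
    weighted_sum: phoneme -> sum of occupancy * feature vector
    count:        phoneme -> summed occupancy
    tau:          hypothetical prior weight; larger values keep the
                  adapted mean closer to the speaker-independent mean
    """
    adapted = {}
    for phoneme, mu in si_means.items():
        mu = np.asarray(mu, dtype=float)
        n = count.get(phoneme, 0.0)
        s = weighted_sum.get(phoneme, np.zeros_like(mu))
        # Interpolate between the prior mean and the user's data.
        adapted[phoneme] = (tau * mu + s) / (tau + n)
    return adapted
```

With this update, a phoneme that has never been observed keeps its speaker-independent mean, while a phoneme with a large count moves toward the average of the user's own data. This is one way to see why accumulating more observations pushes the model toward a speaker-dependent one.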

In addition, the initial user accumulator holding the observation information used in the supervised speaker adaptation acoustic model generation process is registered in the user accumulator DB 150 as the user accumulator (S148).

The preferred embodiment of the present invention has been disclosed above through the detailed description and the drawings. The terms used serve only to describe the present invention and are not intended to restrict their meaning or to limit the scope of the invention set forth in the claims. Those of ordinary skill in the art will therefore understand that various modifications and other equivalent embodiments are possible. Accordingly, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

The speaker adaptation apparatus and method according to the present invention, as described above, have the following effects.

Existing speaker-dependent speech recognition services generate the user acoustic model by performing supervised speaker adaptation during user registration. By contrast, the speaker adaptation service using the speaker adaptation server presented above generates the user acoustic model by supervised speaker adaptation for a first-time user, updates the user acoustic model in an unsupervised manner while the service is in use, and continuously maintains the user accumulator. The user's observation information can therefore be accumulated to a degree sufficient for generating a speaker-dependent acoustic model, which makes it possible to build a speaker-dependent speech recognition service whose recognition performance improves the more the user uses it.

Claims

(a) generating a user acoustic model for each user through a user identification process and registering the user acoustic model in the user acoustic model DB;

(b) loading the pre-registered user acoustic model from the user acoustic model DB with respect to the user's speech to perform voice recognition and outputting information required for speaker adaptation;

(c) accumulating observation data necessary for speaker adaptation in a user accumulator DB by receiving the output information;

and (d) performing speaker adaptation, at the time when the user ends the voice recognition service, using the information accumulated so far.

The method of claim 1,

The information required for the speaker adaptation is a phoneme sequence of a recognition result, time information of the phoneme sequence, and voice feature data.

The method of claim 1, wherein step (a) comprises:

Determining whether the user is a registered user based on an identification number input from a user;

If the determination shows that the user is not a registered user, carrying out the user registration process in the service system, generating an initial user acoustic model through a supervised speaker adaptation process, and registering the generated user acoustic model in the user acoustic model DB.

The method of claim 3, wherein the generation of the acoustic model comprises:

Creating an initial user accumulator for performing a process of generating a speaker adaptation acoustic model of the user when a user who is not already registered is registered with the service system;

Presenting an utterance list in which phonemes are suitably distributed for supervised speaker adaptation, and receiving the user's utterances of the list one sentence at a time;

Extracting voice feature data according to a voice feature format for the voice data spoken by the user;

Generating the corresponding phoneme sequence according to the utterance list, and aligning the extracted voice feature data with the phoneme sequence by Viterbi alignment;

Accumulating observation information of each phoneme of voice feature data using time information of each phoneme obtained through the phoneme string alignment process;

Generating a user acoustic model of the user by a speaker adaptation method, and registering the generated user acoustic model in the user acoustic model DB;

And registering the initial user accumulator in which the observation information is accumulated as a user accumulator in a user accumulator DB.

The method of claim 1, wherein step (b) comprises:

Loading a user acoustic model from the user acoustic model DB;

Performing voice recognition when the user speaks to the speech recognition service, and outputting the information necessary for the speaker adaptation process to the speaker adaptation server;

And executing an application program based on a result of the speech recognition being performed.

The method of claim 1, wherein step (c) comprises:

Loading, from the user accumulator DB, a user accumulator accumulated in the initial supervised speaker adaptation process or a user accumulator accumulated in a previous service session;

Measuring the reliability of a speech recognition result by using the information required for the speaker adaptation;

And comparing the measured reliability with a predetermined threshold value and accumulating the observed information in a user accumulator according to the comparison result.

The method of claim 6,

The information necessary for the speaker adaptation is a phoneme sequence of a recognition result, time information of the phoneme sequence, and voice feature data, and the observation information loaded from the user accumulator is the product of the occupancy probability of the corresponding phoneme and the voice feature data value, together with the frequency with which the corresponding phoneme model was observed.

The method of claim 6,

The speaker adaptation process is one of the Maximum A Posteriori (MAP), Maximum Likelihood Linear Regression (MLLR), and Eigenvoice methods.

The method of claim 6,

The accumulating step accumulates only when the reliability is greater than a threshold value.

The method of claim 1, wherein step (d) comprises:

A first determination step of determining whether to terminate the service;

Detecting the user's voice utterance if, as a result of the first determination, the user continues to use the service, and transmitting a service termination signal to the speaker adaptation server at the end of the service if, as a result of the first determination, the user terminates the service;

A second determination step of determining whether a current user continues to use a service when the speaker adaptation server receives the service termination signal;

Proceeding to the reliability measurement step if, as a result of the second determination, the user continues to use the service, and, if it is determined that the user has terminated the service, generating a user acoustic model using the accumulated user accumulator and updating the user acoustic model DB with the generated user acoustic model;

And updating the user accumulator DB with the user accumulator accumulated to date.

A speech recognition server that generates the information required for the speaker adaptation process by performing speaker-dependent speech recognition, based on a predefined user acoustic model, on the voice data uttered by the user;

And a speaker adaptation server that loads the previously used user accumulator and accumulates observation data using the generated information, performs speaker adaptation using the accumulated information when the voice recognition service ends, and updates the newly generated user acoustic model and the user accumulator accumulated so far.

The method of claim 11,

The information required for the speaker adaptation process is a phoneme sequence of a recognition result, time information of a phoneme sequence, and voice feature data.