KR20030075330A

KR20030075330A - Channel Mis-match Compensation apparatus and method for Robust Speaker Verification system

Info

Publication number: KR20030075330A
Application number: KR1020020014492A
Authority: KR
Inventors: 정희석; 진세훈
Original assignee: 정희석
Priority date: 2002-03-18
Filing date: 2002-03-18
Publication date: 2003-09-26
Also published as: KR100435441B1

Abstract

PURPOSE: An apparatus and method for compensating for channel mismatch in speaker recognition, with user mobility taken into account, are provided to prevent the true-speaker recognition rate from dropping because of channel mismatch. CONSTITUTION: An initializing element sets a center value for the input vectors of the voice section of sampled voice data. A center difference calculating element obtains the difference between the mean center value of a common codebook and the center value set by the initializing element. A compensating element compensates for the mismatch using the calculated difference. A speaker registering element registers the speaker using the compensated value.

Description

Channel Mis-match Compensation apparatus and method for Robust Speaker Verification system in speaker recognition considering user mobility

The present invention relates to an apparatus and method for compensating for channel mismatch in speaker recognition with user mobility taken into account. More particularly, it relates to a channel mismatch compensation apparatus and method that improve the success rate of speaker authentication by compensating for the channel mismatch between training and recognition using the difference between the mean of the common codebook centers and the mean of the personal codebook centers.

Speaker recognition technology can be divided, according to the number of target speakers, into speaker verification and speaker identification.

FIG. 1A is a flowchart schematically illustrating the overall process of a conventional speaker verification method, and FIG. 1B is a flowchart schematically illustrating the overall process of a conventional speaker identification method. They are described below.

First, the speaker verification method is described.

When the voice of the speaker to be verified is input in step S101, a feature vector of the input voice is extracted in step S103. Separately, when the speaker enters his or her ID in step S105, a speaker model for the entered speaker ID is generated in step S106.

Next, in step S107, the similarity between the extracted feature vector and the generated speaker model is calculated, and in step S109 a threshold is used to decide whether the input voice matches the speaker ID.

Then, in step S111, the match result is output and the process ends.

The speaker identification method shown in FIG. 1B is as follows.

First, when the voice to be identified is input in step S151, a feature vector of the input voice is extracted in step S153. Separately, N speaker models are input in step S155.

Next, in step S157, the similarity between the extracted feature vector and the N speaker models is calculated. Because the feature vector is compared with each of the N speaker models in turn, the result is N similarity values.

Then, in step S159, the maximum of the N similarity values is selected, and in step S161 the speaker identification result is output and the process ends.
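The difference between the two flows can be sketched in a few lines of Python. This is only an illustration: the cosine similarity measure, the function names, and the toy vectors are assumptions, since the source does not specify how similarity is computed.

```python
from math import sqrt

def cosine_similarity(a, b):
    """Similarity between two equal-length feature vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify(features, claimed_model, threshold):
    """Verification: one comparison with the claimed model, then a threshold test."""
    return cosine_similarity(features, claimed_model) >= threshold

def identify(features, models):
    """Identification: N comparisons, then selection of the maximum similarity."""
    scores = {name: cosine_similarity(features, m) for name, m in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical speaker models and test utterance.
models = {"alice": [1.0, 0.0, 0.5], "bob": [0.0, 1.0, 0.5]}
utterance = [0.9, 0.1, 0.5]

accepted = verify(utterance, models["alice"], threshold=0.9)
best, scores = identify(utterance, models)
```

Verification needs a speaker ID and a threshold, whereas identification needs neither and always returns the closest of the N models.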

The speaker verification method is described in more detail below.

The speaker verification algorithm follows the process of a general pattern recognition system. That is, given features that express the characteristics of an object to be recognized, the goal is to recognize the object on the basis of prior knowledge about it.

A general pattern recognition system is completed through training, testing, and implementation stages. In the training stage, the model parameters for an object are estimated by associating the features of the given object with the actual object. In the testing stage, the model is tuned or adapted through tests on a large amount of data. Finally, the implementation stage is the stage of actual use, in which features of an unknown object are received and its relationship to the modeled objects is determined.

FIG. 2 is another flowchart schematically illustrating the overall process of a conventional speaker verification method, described in detail below.

Conceptually, a pattern recognition system consists of two main parts, a feature extractor and a classifier. A general speaker verification method with a more specific process added can be configured as shown in FIG. 2.

(1) Preprocessing (S201): Preprocessing extracts and processes feature parameters from the voice information of each speaker arriving through the input stage, in order to increase the system's ability to discriminate between speakers. It can be subdivided into two steps: extracting and segmenting the speech interval of the input voice, and extracting the speech and speaker feature parameters.

Such feature parameters include the LPC cepstrum, delta cepstrum, mel cepstrum, pitch, and formants.

(2) Training and recognition (S203): In training and recognition, the system learns the speaker-specific feature parameters extracted in preprocessing so that it can recognize and verify them, and then carries out the verification. This can be subdivided into a step of generating speech and speaker models from the constructed database and a final step of applying a decision rule to accept or reject.

Training and recognition are performed with algorithms such as vector quantization (VQ), dynamic time warping (DTW), hidden Markov models (HMM), and neural networks.

However, the conventional techniques described above overemphasize certain parameter characteristics. They perform adequately on utterance patterns that stem from habits or intonation expressing acquired individuality, and they reduce the false rejection rate during identity verification, but the overemphasis raises the false acceptance rate for voice mimicry and so degrades the performance of the speaker verification system.

Meanwhile, a speaker verification system shows a high recognition rate when the authentication test is performed on the same system used for training, because there is no difference in characteristics. When it is deployed over a network in real life, however, many errors occur.

That is, when a speaker trains on some system over a network to generate his or her own model and later wants to authenticate, verification is not always performed on that same system.

In such cases the result is affected by the inherent characteristics of the voice input device, such as the microphone, and of the channel, such as the sound card, and the resulting bias causes a serious misrecognition rate.

Meanwhile, there have been numerous studies on compensating for such channel mismatch. Most of them attribute the mismatch to two factors: noise in the speaking environment and the channel. In general, environmental noise distorts the speech signal as additive noise. Letting x(k) be the distorted speech signal, s(k) the speech signal before distortion, and n(k) the additive noise, this is expressed as [Equation 1]: $x(k) = s(k) + n(k)$.

Such additive noise can be removed effectively by the spectral subtraction method proposed by Steve. However, the additive noise must be locally stationary, and must satisfy the assumption that it changes to a different stationary noise only after a sufficient amount of time.
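As a rough illustration of spectral subtraction, the following Python sketch works directly on magnitude spectra. It assumes the spectra have already been computed frame by frame and that a noise-only stretch is available for the noise estimate; the function name, the flooring at zero, and the sample values are illustrative choices, not details taken from the source.

```python
def spectral_subtraction(noisy_frames, noise_frames, floor=0.0):
    """Subtract an average noise magnitude spectrum from each noisy frame.

    noisy_frames: magnitude spectra (lists of floats) of speech plus noise.
    noise_frames: magnitude spectra taken from a noise-only stretch.
    Negative results are clipped to `floor`.
    """
    bins = len(noise_frames[0])
    noise_mean = [
        sum(frame[b] for frame in noise_frames) / len(noise_frames)
        for b in range(bins)
    ]
    return [
        [max(frame[b] - noise_mean[b], floor) for b in range(bins)]
        for frame in noisy_frames
    ]

# Hypothetical two-bin spectra.
noise_only = [[1.0, 2.0], [1.0, 2.0]]
noisy = [[3.0, 2.5], [0.5, 4.0]]
cleaned = spectral_subtraction(noisy, noise_only)
```

The noise estimate is reused for every speech frame, which is why the method depends on the noise being locally stationary.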

Meanwhile, distortion of the speech signal by the channel appears as a convolution with the channel component in the time domain, and is expressed as [Equation 2]: $x(k) = s(k) * h(k)$.

Here, h(k) is the channel component, and represents the effect of the speech signal being filtered by the channel.

To express the channel component convolved with the speech as a term added to the speech component, the logarithm of the frequency-transformed signal is taken, which gives a representation in the cepstral domain. Letting X(z), S(z), and H(z) be the frequency transforms of the respective signals, this is expressed as [Equation 3]: $\log X(z) = \log S(z) + \log H(z)$.

If the channel distortion component is assumed to change slowly and the long-term average of the clean speech cepstrum is zero, the channel effect can be removed by computing the average over the whole interval in the cepstral domain and subtracting it. In other words, because the influence of the channel appears as a term added to the clean speech cepstrum, the channel cepstrum can be estimated by averaging the cepstra of the filtered speech, and the channel effect can be compensated by removing the estimated channel cepstrum.

This method is called cepstrum mean subtraction (CMS), and is used very effectively in speech recognition and speaker recognition for speech that has passed through a channel such as a telephone line.
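A minimal Python sketch of cepstrum mean subtraction follows. It assumes cepstral vectors are given as lists of floats, and it uses a toy "clean" sequence that is constructed to have zero mean so that the stated assumption holds exactly; the names and values are illustrative.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient mean over all frames from every frame.

    Returns the normalized frames and the mean, which serves as the
    estimate of the (assumed constant) channel cepstrum.
    """
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dims)]
    normalized = [[f[d] - mean[d] for d in range(dims)] for f in frames]
    return normalized, mean

# Toy clean cepstra with zero long-term mean (the assumption CMS relies on).
clean = [[1.0, -2.0], [-1.0, 2.0], [0.5, -0.5], [-0.5, 0.5]]
channel = [0.3, -0.7]  # hypothetical constant channel offset
observed = [[c + h for c, h in zip(frame, channel)] for frame in clean]

recovered, estimate = cepstral_mean_subtraction(observed)
```

When the clean cepstra do not average to zero, the same subtraction also removes part of the speech, which is the weakness discussed next in the text.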

However, for the cepstral mean of clean speech to be zero, voiced sounds, unvoiced sounds, plosives, and so on must be acoustically balanced. When this condition is not met, cepstrum mean subtraction has the drawback that speech components other than the channel component are also subtracted.

To remedy this drawback, methods such as PFCMS (Pole-Filtered Cepstral Mean Subtraction), ACW (Adaptive Component Weighted cepstrum), and MMCMNFW (Modified Mean Cepstrum Mean Normalization with Frequency Warping) have been proposed. In addition to these methods, which directly remove additive noise or channel components, methods for extracting speech features that are robust to noise and channel effects have also been proposed.

Besides the conventional LPCC and MFCC, PLP (Perceptual Linear Prediction) is known to be a robust feature that represents speech well in noisy environments, because its analysis of spectral information takes auditory characteristics into account. Techniques for removing or compensating for noise or channel effects in existing speech features are also known.

Cepstral weighting, in particular band-pass filtering, achieves good performance by relatively attenuating the high-order region, which is sensitive to noise. CMS, described above, is known as the most effective channel normalization method for removing slowly varying channel distortion in the cepstral domain. In addition, techniques such as RASTA (RelAtive SpecTrAl), first proposed by Hermansky in 1990, are known to remove channel components by converting the spectrum to the log domain and then filtering it.

The methods described so far operate in the feature extractor, the front end, before the extracted feature vectors are processed by the classifier.

More recently, methods that design the classifier of the speaker recognizer to be robust to distortion from additive noise and channel components have been proposed and used effectively. These are commonly classified as model-based channel compensation methods, of which the use of an affine transform is representative.

From the viewpoint of HMMs and GMMs, the statistical classifiers widely used in speech and speaker recognition, changes in noise and channel components are known to move the speech features, represented by means and variances, within the vector space. This movement can be expressed as scaling, rotation, translation, and so on. The transformation is generally estimated from a stereo speech database in which the undistorted speech and the speech distorted by noise and channel components are collected simultaneously. The estimated transformation is then applied to the classifier so that input speech can be modeled robustly against noise and channel components.

However, this approach also has the problem that a stereo speech database capable of estimating the various transformations must be built.

To solve the problems of the prior art described above, an object of the present invention is to provide a channel mismatch compensation apparatus and method that improve the success rate of speaker authentication by compensating for the channel mismatch between training and recognition using the difference between the mean of the common codebook centers and the mean of the personal codebook centers.

FIG. 1A is a flowchart schematically illustrating the overall process of a conventional speaker verification method;

FIG. 1B is a flowchart schematically illustrating the overall process of a conventional speaker identification method;

FIG. 2 is another flowchart schematically illustrating the overall process of a conventional speaker verification method;

FIG. 3 is a diagram illustrating the concept of compensating in advance, during training, for the order-by-order difference between the center value of the common codebook and the center value of the training data, according to an embodiment of the present invention;

FIG. 4 is a flowchart illustrating a channel mismatch compensation method used when training on a speaker's voice, according to an embodiment of the present invention; and

FIG. 5 is a flowchart illustrating a channel mismatch compensation method used when verifying a speaker's voice, according to an embodiment of the present invention.

To achieve the above object, the present invention provides an apparatus for compensating for channel mismatch during training in a speaker learning system with user mobility taken into account, the apparatus comprising: initialization means for setting respective center values for all input vectors of the speech interval of input sampled speech data; center value difference calculating means for obtaining the difference between the average center value of a common codebook and the center value set by the initialization means; compensation means for compensating for the difference between the average center value of the common codebook and the center value for the input vectors, using the center value difference calculated by the center value difference calculating means; and speaker registration means for performing a speaker registration process using the value compensated by the compensation means.

The invention also provides a method of compensating for channel mismatch during training in a speaker learning system with user mobility taken into account, the method comprising: a first step of setting respective center values for all input vectors of the speech interval of input sampled speech data; a second step of obtaining the difference between the average center value of a common codebook and the center value set in the first step; a third step of compensating for the difference between the average center value of the common codebook and the center value for the input vectors, using the center value difference calculated in the second step; and a fourth step of performing a speaker registration process using the value compensated in the third step.

The invention also provides an apparatus for compensating for channel mismatch during verification in a speaker verification system with user mobility taken into account, the apparatus comprising: weight determining means for obtaining the mean of the input vectors of the silent interval of input sampled speech data and determining a weight by a sigmoid function; center value setting means for setting one center value for all input vectors of the speech interval of the input sampled speech data; center value difference calculating means for obtaining the difference between the average center value of a common codebook and the center value set by the center value setting means; compensation means for compensating the input vectors using the center value difference calculated by the center value difference calculating means and the weight determined by the weight determining means; and speaker verification means for performing a speaker verification process using the input vectors compensated by the compensation means.

The invention further provides a method of compensating for channel mismatch during verification in a speaker verification system with user mobility taken into account, the method comprising: a first step of obtaining the mean of the input vectors of the silent interval of input sampled speech data and determining a weight by a sigmoid function; a second step of setting respective center values for all input vectors of the speech interval of the input sampled speech data; a third step of obtaining the difference between the average center value of a common codebook and the center value set in the second step; a fourth step of compensating the input vectors using the center value difference calculated in the third step and the weight determined in the first step; and a fifth step of performing a speaker verification process using the input vectors compensated in the fourth step.

Hereinafter, an apparatus and method for compensating for channel mismatch in speaker recognition with user mobility taken into account, according to an embodiment of the present invention, are described in more detail with reference to the accompanying drawings.

The method proposed in the present invention removes the channel mismatch condition between training and recognition by compensating for the difference between the mean of the common codebook centers and the mean of the personal codebook centers.

FIG. 3 is a diagram illustrating the concept of compensating in advance, during training, for the order-by-order difference between the center value of the common codebook and the center value of the training data, according to an embodiment of the present invention.

The feature parameters of a speaker's voice vary greatly with the characteristics of the system being accessed, that is, the sound card, microphone, background noise, and so on. As shown in FIG. 3, this change in the feature parameters creates an offset from the speaker region, which causes false rejection of the true speaker.

However, this difference has little effect on the shape of the distribution of the speaker region. That is, the channel-induced change in the feature parameters affects the mean of the speaker region but not its variance. Accordingly, as shown in FIG. 3, the proposed technique compensates in advance, during training, for the order-by-order difference between the center value of the common codebook and the center value of the training data, and at verification time likewise compensates for the order-by-order difference between the center value of the common codebook and the center value of the training data before verifying. This resolves the sharp drop in the true-speaker recognition rate caused by channel mismatch.
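The statement that the channel moves the mean but not the variance can be checked with a short derivation. As a simplifying assumption not spelled out in the source, let the channel add a constant vector $b$ to every clean feature vector $x_i$, $i = 1, \dots, N$ (the symbols here are illustrative):

$$
y_i = x_i + b
\quad\Longrightarrow\quad
\bar{y} = \frac{1}{N}\sum_{i=1}^{N} (x_i + b) = \bar{x} + b,
$$

$$
\frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2
= \frac{1}{N}\sum_{i=1}^{N} \bigl((x_i + b) - (\bar{x} + b)\bigr)^2
= \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2 .
$$

Under this assumption the offset $b$ cancels in every deviation from the mean, so shifting the mean back is sufficient to realign the speaker region.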

However, unconditional mean compensation would lead to authentication errors in favor of impostors, so the mean must be compensated with an appropriate weight proportional to the channel variation. Therefore, the distribution of the silent interval, excluding the speech interval, is taken into account, and the compensation is weighted by a nonlinear function of the difference in variation from the training channel.

FIG. 4 is a flowchart illustrating a channel mismatch compensation method used when training on a speaker's voice, according to an embodiment of the present invention.

(1) Step S401: Initialization. One center value is set for all input vectors of the speech interval, as expressed in [Equation 4].

In [Equation 4], the count term is the number of all input vectors.

(2) Step S402: The difference (bias) between the average center value of the common codebook and the center value of the input vectors is obtained, as expressed in [Equation 5].

Here, p = 0, 1, ..., k.

(3) Step S403: The difference between the average center value of the common codebook and the center value of the input vectors is compensated, as expressed in [Equation 6].

(4) Step S404: The speaker registration process is performed using the compensated center value difference.
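Steps S401 to S403 can be sketched in Python as follows. The bodies of [Equation 4] to [Equation 6] are not available in this text, so the forms used here are assumptions: the center is taken as the arithmetic mean of the input vectors, the bias as the common codebook's average center minus that center, and compensation as adding the bias to every input vector. All function and variable names are hypothetical.

```python
def mean_vector(vectors):
    """Per-coefficient arithmetic mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[p] for v in vectors) / len(vectors) for p in range(dims)]

def compensate_enrollment(speech_vectors, common_codebook):
    """Shift the enrollment vectors so their center matches the common codebook.

    Assumed forms only; the source equations are not reproduced here.
    """
    center = mean_vector(speech_vectors)                       # S401 (assumed)
    common_center = mean_vector(common_codebook)               # average center
    bias = [m - c for m, c in zip(common_center, center)]      # S402 (assumed)
    compensated = [
        [x + b for x, b in zip(v, bias)] for v in speech_vectors
    ]                                                          # S403 (assumed)
    return compensated, bias

# Hypothetical two-dimensional data.
common_codebook = [[0.0, 0.0], [2.0, 2.0]]
speech = [[3.0, 0.0], [5.0, 2.0]]
compensated, bias = compensate_enrollment(speech, common_codebook)
```

Step S404 would then train the speaker's personal codebook on the compensated vectors; that part is omitted because the source does not describe the registration procedure itself.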

(1) Step S501: Initialization. The mean of the input vectors of the silent interval is obtained, and a weight is determined by a sigmoid function, as shown in [Equation 7].

In [Equation 7], the two noise terms are the background noise at the time of the individual speaker's registration and the background noise, and w is the weight to be obtained in this step S501.

(2) Step S502: One center value is set for all input vectors of the speech interval, as expressed in [Equation 8].

In [Equation 8], the terms are the input vector and the number of all input vectors.

(3) Step S503: The difference between the average center value of the common codebook and the center value of the input vectors is obtained, as expressed in [Equation 9].

Here, p = 0, 1, ..., k.

(4) Step S504: The difference between the average center value of the common codebook and the center value of the input vectors is compensated with the weight applied, as expressed in [Equation 10].

(5) Step S505: The speaker verification process is performed.
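The verification-time flow (S501 to S504) can be sketched in the same way. The exact forms of [Equation 7] to [Equation 10] are not available in this text, so everything specific here is an assumption: the weight is a logistic function of the Euclidean distance between the enrollment and test silence means, `slope` and `midpoint` are hypothetical tuning parameters, and compensation adds the weighted bias to each input vector.

```python
from math import exp, sqrt

def mean_vector(vectors):
    """Per-coefficient arithmetic mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[p] for v in vectors) / len(vectors) for p in range(dims)]

def sigmoid_weight(enroll_noise, test_noise, slope=4.0, midpoint=2.0):
    """Weight in (0, 1) that grows with the change in background noise (assumed form)."""
    distance = sqrt(sum((a - b) ** 2 for a, b in zip(enroll_noise, test_noise)))
    return 1.0 / (1.0 + exp(-slope * (distance - midpoint)))

def compensate_verification(speech_vectors, silence_vectors,
                            enroll_noise, common_codebook):
    """Apply weighted mean compensation to the test vectors (assumed forms)."""
    test_noise = mean_vector(silence_vectors)                  # S501: silence mean
    w = sigmoid_weight(enroll_noise, test_noise)               # S501: weight
    center = mean_vector(speech_vectors)                       # S502
    common_center = mean_vector(common_codebook)
    bias = [m - c for m, c in zip(common_center, center)]      # S503
    compensated = [
        [x + w * b for x, b in zip(v, bias)] for v in speech_vectors
    ]                                                          # S504
    return compensated, w

# Hypothetical data: the test channel differs strongly from enrollment.
common_codebook = [[0.0, 0.0], [2.0, 2.0]]
speech = [[3.0, 0.0], [5.0, 2.0]]
silence = [[3.0, 4.0], [3.0, 4.0]]
compensated, w = compensate_verification(speech, silence, [0.0, 0.0], common_codebook)
```

With these assumed parameters, an unchanged channel yields a weight near zero, so the test vectors are left almost untouched, while a large change in the silence statistics yields a weight near one and nearly full mean compensation. This mirrors the stated goal of avoiding unconditional compensation that would help impostors.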

한편, 학습한 마이크와 인증 시험에 사용되는 마이크가 동일한 경우에는 채널 보상을 적용하지 아니한 경우나 채널 보상 알고리즘을 적용한 경우나 거의 비슷한 인식율을 보이나, 인증 시험시 다른 마이크를 사용하여 테스트한 결과, 둘 사이에는 상당한 성능의 차이를 보여 주었다.On the other hand, if the learned microphone is the same as the microphone used for the certification test, the channel recognition is not applied or the channel compensation algorithm is applied. Showed a significant performance difference between.

이러한 실험 결과를 보여 주는 것이 아래의 [표 1]이다.The results of these experiments are shown in Table 1 below.

[표 1]TABLE 1

상기 표 1에 도시되어 있듯이, 마이크 1로 학습한 모델에 대하여 마이크 4로 인증 실험을 할 경우, 본인 인증율이 20 % 에서 90 % 로 크게 향상됨을 알 수 있다.As shown in Table 1, when the authentication experiment with the microphone 4 for the model learned with the microphone 1, it can be seen that the identity authentication rate is greatly improved from 20% to 90%.

위에서 양호한 실시예에 근거하여 이 발명을 설명하였지만, 이러한 실시예는 이 발명을 제한하려는 것이 아니라 예시하려는 것이다. 이 발명이 속하는 분야의 숙련자에게는 이 발명의 기술 사상을 벗어남이 없이 위 실시예에 대한 다양한 변화나 변경 또는 조절이 가능함이 자명할 것이다. 그러므로, 이 발명의 보호 범위는 첨부된 청구 범위에 의해서 한정될 것이며, 위와 같은 변화예나 변경예 또는 조절예를 모두 포함하는 것으로 해석되어야 할 것이다.While the invention has been described above based on the preferred embodiments thereof, these embodiments are intended to illustrate rather than limit the invention. It will be apparent to those skilled in the art that various changes, modifications, or adjustments to the above embodiments can be made without departing from the spirit of the invention. Therefore, the protection scope of the present invention will be limited by the appended claims, and should be construed as including all such changes, modifications or adjustments.

As described above, according to the present invention, in the speaker verification method, the order-by-order difference between the center value of the common codebook and the center value of the training data is compensated in advance during training, and the order-by-order difference between the center value of the common codebook and the center value of the data is likewise compensated during verification. This resolves the sharp drop in the genuine-speaker recognition rate caused by channel mismatch.
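Why aligning centers counters a change of channel can be made explicit with a short derivation. It is a sketch under assumptions not stated in the text: that the channel acts as a constant additive offset $h$ on every feature vector (as is commonly assumed for cepstral features), and that compensation adds the unweighted center difference to each vector. The symbols $x_i$, $N$, $m$, $c$, $d$, and $h$ are chosen here for illustration and are not the patent's notation.

```latex
\begin{align}
c &= \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad d = m - c, \qquad \hat{x}_i = x_i + d \\
x_i' &= x_i + h \;\Rightarrow\; c' = c + h, \qquad d' = m - c' = d - h \\
\hat{x}_i' &= x_i' + d' = (x_i + h) + (d - h) = x_i + d = \hat{x}_i
\end{align}
```

Under these assumptions the compensated vectors do not depend on $h$, so data recorded through a different microphone is centered in the same place as the data the model was trained on.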

Claims

An apparatus for compensating for channel mismatch during training in a speaker training system considering user mobility, comprising:

initialization means for setting respective center values for all input vectors of the speech sections of input sampled speech data;

center value difference calculating means for obtaining the difference between the mean center value of a common codebook and the center value set by the initialization means;

compensation means for compensating for the difference between the mean center value of the common codebook and the center value for the input vectors, using the center value difference calculated by the center value difference calculating means; and

speaker registration means for performing a speaker registration process using the value compensated by the compensation means.

The apparatus of claim 1, wherein the initialization means determines the center value by [Equation 1] below.

[Equation 1]

here, Is the input vector, Is the number of all input vectors.

The apparatus of claim 1, wherein the compensation means compensates for the difference of the center value by [Equation 2] below.

[Equation 2]

here, Is the compensated value, Is the difference between the mean center of the common codebook and the center of the input vector.

A method for compensating for channel mismatch during training in a speaker training system considering user mobility, comprising:

a first step of setting respective center values for all input vectors of the speech sections of input sampled speech data;

a second step of obtaining the difference between the mean center value of a common codebook and the center value set in the first step;

a third step of compensating for the difference between the mean center value of the common codebook and the center value of the input vectors, using the center value difference calculated in the second step; and

a fourth step of performing a speaker registration process using the value compensated in the third step.

The method of claim 4, wherein, in the first step, the center value is determined by [Equation 3] below.

[Equation 3]

here, Is the input vector, Is the number of all input vectors.

The method of claim 4, wherein, in the third step, the difference of the center value is compensated by [Equation 4] below.

[Equation 4]

An apparatus for compensating for channel mismatch during verification in a speaker verification system considering user mobility, comprising:

weight determination means for obtaining the mean value of the input vectors of the silent section of input sampled speech data and determining a weight by a sigmoid function;

center value setting means for setting one center value for all the input vectors of the speech section of the input sampled speech data;

center value difference calculating means for obtaining the difference between the mean center value of a common codebook and the center value set by the center value setting means;

compensation means for compensating the input vectors using the center value difference calculated by the center value difference calculating means and the weight determined by the weight determination means; and

speaker verification means for performing a speaker verification process using the input vectors compensated by the compensation means.

The apparatus of claim 7, wherein the weight determination means determines the weight by [Equation 5] below.

[Equation 5]

here, Is background noise at the time of registration of personal speaker, N_bg ^ (p) is Background Noise and w is a weight to be obtained.

The apparatus of claim 7, wherein the center value setting means determines the center value by [Equation 6] below.

[Equation 6]

here, Is the input vector, Is the number of all input vectors.

The apparatus of claim 7, wherein the compensation means performs the compensation by [Equation 7] below.

[Equation 7]

here, Is the compensated value, Is the result value of the said center value difference calculating means.

A method for compensating for channel mismatch during verification in a speaker verification system considering user mobility, comprising:

a first step of obtaining the mean value of the input vectors of the silent section of input sampled speech data and determining a weight by a sigmoid function;

a second step of setting respective center values for all input vectors of the speech sections of the input sampled speech data;

a third step of obtaining the difference between the mean center value of a common codebook and the center value set in the second step;

a fourth step of compensating the input vectors using the center value difference calculated in the third step and the weight determined in the first step; and

a fifth step of performing a speaker verification process using the input vectors compensated in the fourth step.

The method of claim 11, wherein, in the first step, the weight is determined by [Equation 8] below.

[Equation 8]

here, Is background noise at the time of registration of personal speaker, Is the background noise and w is the weight to obtain.

The method of claim 11, wherein, in the second step, the center value is determined by [Equation 9] below.

[Equation 9]

here, Is the input vector, Is the number of all input vectors.

The method of claim 11, wherein, in the fourth step, the compensation is performed by [Equation 10] below.

[Equation 10]

A computer-readable recording medium storing a program for executing channel mismatch compensation during training in a speaker training system considering user mobility, wherein the center value is determined by [Equation 11] below and the difference of the center value is compensated by [Equation 12] below.

[Equation 11]

here, Is the input vector, Is the number of all input vectors.

[Equation 12]

A computer-readable recording medium storing a program for executing channel mismatch compensation during verification in a speaker verification system considering user mobility, the program comprising:

a step of obtaining the difference between the mean center value of the common codebook and the center value set in the second step; and

a fifth step of performing a speaker verification process using the input vector compensated in the fourth step,

wherein the weight is determined by [Equation 13] below, the center value is determined by [Equation 14] below, and the compensation is performed by [Equation 15] below.

[Equation 13]

[Equation 14]

here, Is the input vector, Is the number of all input vectors.

[Equation 15]