KR101843079B1

KR101843079B1 - Robust i-vector extractor learning method and system using speaker mutual information

Info

Publication number: KR101843079B1
Application number: KR1020160123219A
Authority: KR
Inventors: 김남수; 강우현
Original assignee: 서울대학교산학협력단
Priority date: 2016-09-26
Filing date: 2016-09-26
Publication date: 2018-05-14

Abstract

The present invention relates to a robust I-vector extractor learning method and system using speaker mutual information. According to an embodiment of the present invention, the robust I-vector extractor learning system using speaker mutual information comprises a learning unit (100), a calculation unit (200), and a speaker-information-emphasized I-vector extraction unit (300). According to the present invention, the method and system can extract a robust I-vector.

Description


The present invention relates to an I-vector extractor learning method and system, and more particularly to a robust I-vector extractor learning method and system using speaker mutual information.

Speaker recognition and authentication have mainly been applied, together with biometric information such as faces and fingerprints, as a means of identifying a person. One prior-art authentication service, used when a service user wants to access information offered by a particular provider, verifies the user's identity by applying speaker recognition and speech recognition to the resident registration number entered by the user, in order to determine whether the user is an adult. Another proposed authentication technique combines voice commands, speaker authentication, and fingerprint authentication so that a user can be recognized easily, reducing the inconvenience of entering an ID and password when running an Internet browser or computer program.

Meanwhile, speaker recognition systems, which determine the speaker who uttered an input speech signal, widely use the I-vector as a feature vector for recognizing the speaker from the input speech. The I-vector represents, as a small vector, the various kinds of variability in a Gaussian mixture model (GMM) that models the distribution of frame-level speech features, such as the frequency characteristics at a given time or Mel-frequency cepstral coefficients (MFCCs).

The I-vector is a feature vector widely used in speaker recognition. It can represent the various kinds of variability in the input speech caused by the speaker, recording conditions, noise, and so on as a low-dimensional vector, and it is widely used because of its high performance in speaker recognition. However, the technique does not take speaker information into account: its parameters are obtained using a universal background model (UBM), which is a speaker-independent Gaussian mixture model (GMM). In other words, it extracts features that contain information about many kinds of variability regardless of the speaker. As a result, after the I-vector is generated, it must undergo post-processing such as linear discriminant analysis (LDA) or within-class covariance normalization (WCCN) to extract the information needed for speaker recognition, which is inconvenient.

In addition, the conventional I-vector has the drawback that its performance drops markedly as the input speech becomes shorter, because the information contained in the speech becomes severely limited. As prior art, Korean Patent Laid-Open Publication No. 10-2006-0066483 discloses a feature vector extraction method for speech recognition, and Korean Patent Registration No. 10-1618512 discloses a speaker recognition system using a Gaussian mixture model and a method for selecting additional training utterances.

The present invention is proposed to solve the above problems of previously proposed methods. For each Gaussian component, it calculates the speaker mutual information between each speaker and each training data frame, then averages this quantity over all frames and all speakers to obtain the average speaker mutual information of that component. This value is applied as a weight by multiplying it into the block matrix of the I-vector extractor, so that Gaussian components carrying more speaker information have a stronger influence on the extracted I-vector. An object of the present invention is thus to provide a robust I-vector extractor learning method and system using speaker mutual information that can extract a robust I-vector, that is, a feature vector with higher performance.

Another object of the present invention is to provide a robust I-vector extractor learning method and system using speaker mutual information that, by applying weights to the I-vector extractor to emphasize the speaker-related information in the I-vector, can extract features that are robust to variability caused by non-speaker factors such as noise and microphone conditions, and can therefore extract speaker characteristics effectively and improve speaker recognition performance even when the input speech is short or noisy.

To achieve the above objects, a robust I-vector extractor learning method using speaker mutual information according to a feature of the present invention comprises:

(1) training, using training data composed of a plurality of speech files, a dependent Gaussian mixture model (GMM) for each of two or more different speakers, and training an I-vector extractor using the training data;

(2) calculating the average speaker mutual information of each Gaussian component using the speaker-dependent Gaussian mixture models trained in step (1); and

(3) extracting a robust I-vector in which speaker information is emphasized, by applying the average speaker mutual information of each Gaussian component calculated in step (2) as a weight to the I-vector extractor trained in step (1).

Preferably, step (1) may train a universal background model (UBM), which is a speaker-independent model, using the training data, and derive the dependent Gaussian mixture model for each specific speaker by adapting the trained speaker-independent UBM to that speaker's data.

Preferably, in step (1), the block matrices of the I-vector extractor may be derived by training the I-vector extractor.

More preferably, in step (3), the average speaker mutual information of each Gaussian component calculated in step (2) may be multiplied into the block matrix of the I-vector extractor derived by training the I-vector extractor in step (1).

Preferably, step (2) may comprise:

(2-1) calculating, using the speaker-dependent Gaussian mixture models trained in step (1), the normalized speaker mutual information (normalized pointwise mutual information, NPMI) for each of the two or more different speakers and each frame of the training data; and

(2-2) averaging the NPMI calculated in step (2-1) over all frames of the training data and all speakers to calculate the average speaker mutual information of each Gaussian component.

More preferably, step (2-1) may calculate the NPMI for each of the two or more different speakers and each frame of the training data using an equation defined in terms of the following quantities.

Here, N(x_l | θ_{s,c}) denotes the probability that speech frame x_l is generated when the c-th Gaussian component of the Gaussian mixture model (GMM) dependent on speaker s is given as the probability distribution. N(x_l | θ_{UBM,c}) denotes the corresponding probability for the c-th Gaussian component of the UBM, that is, the probability that speech frame x_l is generated when the c-th Gaussian component of the UBM is given as the probability distribution. p(s) denotes the prior probability of speaker s in the training data, that is, the proportion of the training data belonging to speaker s.
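The equation itself is not reproduced in this text. As an assumption, a form consistent with the symbol definitions above and with the usual definition of NPMI (pointwise mutual information divided by the negative log of the joint probability) would be the following. The name NPMI_c(x_l, s) is introduced here only for illustration.

```latex
\mathrm{NPMI}_c(x_l, s) =
  \frac{\log \dfrac{N(x_l \mid \theta_{s,c})}{N(x_l \mid \theta_{\mathrm{UBM},c})}}
       {-\log\bigl( N(x_l \mid \theta_{s,c})\, p(s) \bigr)}
```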

Even more preferably, step (2-2) may calculate the average speaker mutual information of each Gaussian component using an averaging equation over frames and speakers.

Here, L denotes the number of frames in the speech data from which features are to be extracted, and S denotes the number of speakers.
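The averaging equation is likewise not reproduced in this text. Assuming the per-speaker, per-frame quantity for component c is written NPMI_c(x_l, s) (a name introduced here for illustration), averaging over all L frames and all S speakers as described would read:

```latex
\mathrm{weight}_c = \frac{1}{L\,S} \sum_{s=1}^{S} \sum_{l=1}^{L} \mathrm{NPMI}_c(x_l, s)
```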

To achieve the above objects, a robust I-vector extractor learning system using speaker mutual information according to a feature of the present invention comprises:

a learning unit that trains, using training data composed of a plurality of speech files, a dependent Gaussian mixture model (GMM) for each of two or more different speakers, and trains an I-vector extractor using the training data;

a calculation unit that calculates the average speaker mutual information of each Gaussian component using the speaker-dependent Gaussian mixture models trained by the learning unit; and

a speaker-information-emphasized I-vector extraction unit that extracts a robust I-vector in which speaker information is emphasized, by applying the average speaker mutual information of each Gaussian component calculated by the calculation unit as a weight to the I-vector extractor trained by the learning unit.

Preferably, the learning unit,

More preferably, the speaker-information-emphasized I-vector extraction unit,

Preferably, the calculation unit may comprise:

an NPMI calculation unit that calculates, using the speaker-dependent Gaussian mixture models trained by the learning unit, the normalized speaker mutual information (normalized pointwise mutual information, NPMI) for each of the two or more different speakers and each frame of the training data; and

a weight calculation unit that averages the NPMI calculated by the NPMI calculation unit over all frames of the training data and all speakers to calculate the average speaker mutual information of each Gaussian component.

More preferably, the NPMI calculation unit,

Even more preferably, the weight calculation unit,

Here, L denotes the number of frames in the speech data from which features are to be extracted, and S denotes the number of speakers.

According to the robust I-vector extractor learning method and system using speaker mutual information proposed in the present invention, the speaker mutual information between each speaker and each training data frame is calculated for each Gaussian component and then averaged over all frames and all speakers to obtain the average speaker mutual information of each Gaussian component. This value is applied as a weight by multiplying it into the block matrix of the I-vector extractor, so that Gaussian components carrying more speaker information have a stronger influence on the I-vector extracted by the I-vector extractor. As a result, a robust I-vector, that is, a feature vector with higher performance, can be extracted.

Furthermore, according to the present invention, applying weights to the I-vector extractor to emphasize the speaker-related information in the I-vector makes it possible to extract features that are robust to variability caused by non-speaker factors such as noise and microphone conditions. Speaker characteristics can therefore be extracted effectively, and speaker recognition performance improved, even when the input speech is short or noisy.

FIG. 1 is a flowchart of a robust I-vector extractor learning method using speaker mutual information according to an embodiment of the present invention.
FIG. 2 is a flowchart of a robust I-vector extractor learning method using speaker mutual information according to another embodiment of the present invention.
FIG. 3 is a block diagram showing the flow of a robust I-vector extractor learning method using speaker mutual information according to an embodiment of the present invention.
FIG. 4 illustrates a robust I-vector extractor learning system using speaker mutual information according to an embodiment of the present invention.
FIG. 5 illustrates the configuration of the calculation unit (200) in a robust I-vector extractor learning system using speaker mutual information according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. In describing the preferred embodiments, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. Parts with similar functions and operations are denoted by the same reference numerals throughout the drawings.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Also, 'including' a component means that other components may be further included, rather than excluded, unless specifically stated otherwise.

The I-vector is a feature vector widely used in speaker recognition, and it can represent the various kinds of variability in the input speech caused by the speaker, recording conditions, noise, and so on as a low-dimensional vector. When the distribution of the data is modeled with a Gaussian mixture model (GMM), concatenating the mean vectors of the Gaussian components gives the GMM supervector. The I-vector is a vector that represents the variation of this GMM supervector, and the relationship between the two can be expressed as Equation 1.

$$M = m + Tw \qquad \text{(Equation 1)}$$

Here, M, m, T, and w denote the Gaussian mixture model (GMM) supervector, the universal background model (UBM) supervector, the total variability matrix, and the I-vector, respectively. The total variability matrix is also called the I-vector extractor.
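As a minimal sketch, the standard I-vector relation M = m + Tw among the four quantities defined above can be evaluated with NumPy as follows. The function name `gmm_supervector`, the symbol R for the I-vector dimension, and the toy sizes are assumptions made for illustration and do not come from the patent.

```python
import numpy as np

def gmm_supervector(m, T, w):
    """Return M = m + T w.

    m: UBM mean supervector, shape (C*F,)
    T: total variability matrix (I-vector extractor), shape (C*F, R)
    w: I-vector, shape (R,)
    """
    m = np.asarray(m, dtype=float)
    T = np.asarray(T, dtype=float)
    w = np.asarray(w, dtype=float)
    if T.shape != (m.shape[0], w.shape[0]):
        raise ValueError("T must have shape (len(m), len(w))")
    return m + T @ w

# Toy sizes (illustrative only): C = 2 components, F = 3 features, R = 2.
m = np.zeros(6)
T = np.arange(12, dtype=float).reshape(6, 2)
w = np.array([1.0, -1.0])
M = gmm_supervector(m, T, w)
```

In a real system w is not given but estimated from the frame statistics of an utterance; the snippet only illustrates the direction from w to M.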

Pointwise mutual information (PMI) denotes the mutual information between two events. The higher its value, the more information about one event reduces the uncertainty about the other. The PMI of two events x and y can be obtained with Equation 2.

$$\mathrm{PMI}(x; y) = \log \frac{p(x \mid y)}{p(x)} \qquad \text{(Equation 2)}$$

Here, p(x|y) is the probability of x given y, and p(x) is the probability of x given the universal background model (UBM), that is, the speaker-independent probability distribution of speech in general. This value ranges from -∞ to min(-log p(x), -log p(y)). To normalize it to the range from -1 to 1, Equation 3 can be used.

$$\mathrm{NPMI}(x; y) = \frac{\mathrm{PMI}(x; y)}{-\log p(x, y)} \qquad \text{(Equation 3)}$$

This is called normalized pointwise mutual information (NPMI), where p(x, y) denotes the joint probability of x and y.
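The PMI and NPMI definitions can be checked numerically on discrete events. The sketch below assumes the standard forms PMI = log(p(x|y) / p(x)) and NPMI = PMI / (-log p(x, y)), with p(x, y) = p(x|y) p(y). The function names are hypothetical, and the inputs must satisfy 0 < p(x, y) < 1.

```python
import math

def pmi(p_x_given_y, p_x):
    """Pointwise mutual information of events x and y."""
    return math.log(p_x_given_y / p_x)

def npmi(p_x_given_y, p_x, p_y):
    """PMI divided by -log p(x, y), which bounds it to [-1, 1] for discrete events."""
    p_xy = p_x_given_y * p_y  # joint probability p(x, y)
    return pmi(p_x_given_y, p_x) / (-math.log(p_xy))

# Independent events: knowing y tells nothing about x.
independent = npmi(p_x_given_y=0.3, p_x=0.3, p_y=0.5)

# Events that always occur together: p(x) = p(y) = p(x, y) = 0.25.
always_together = npmi(p_x_given_y=1.0, p_x=0.25, p_y=0.25)
```

Independent events give 0, and events that always co-occur give the upper bound 1.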

The present invention relates to an I-vector extractor learning method and system that uses NPMI, the normalized mutual information described above, to apply the average speaker mutual information as a weight to the I-vector extractor and thereby extract a robust I-vector in which speaker information is emphasized. It is described in detail below with reference to FIGS. 1 to 5.

FIG. 1 shows the flow of a robust I-vector extractor learning method using speaker mutual information according to an embodiment of the present invention. As shown in FIG. 1, the method may comprise: step S100 of training, using training data composed of a plurality of speech files, a dependent Gaussian mixture model (GMM) for each of two or more different speakers, and training an I-vector extractor using the training data; step S200 of calculating the average speaker mutual information of each Gaussian component using the speaker-dependent Gaussian mixture models trained in step S100; and step S300 of extracting a robust I-vector in which speaker information is emphasized, by applying the average speaker mutual information of each Gaussian component calculated in step S200 as a weight to the I-vector extractor trained in step S100.

In step S100, speaker-dependent Gaussian mixture models (GMMs) are trained using training data composed of a plurality of speech files, and an I-vector extractor is trained using the same training data. The I-vector extractor here is a conventional, commonly used I-vector extractor.

In step S100, a universal background model (UBM), which is a speaker-independent model, may be trained using the training data, and a dependent Gaussian mixture model (GMM) for each specific speaker may then be derived by adapting the trained speaker-independent UBM to that speaker's data.

Here, adaptation means adjusting the model parameters so that they represent a specific speaker's data, and it can be performed with a technique such as the maximum a posteriori (MAP) algorithm.
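A minimal sketch of mean-only MAP adaptation, one common way to adapt a UBM to a speaker's data, is shown below. The patent names MAP only as an example and does not specify which parameters are adapted, so the diagonal covariances, the relevance factor of 16, and all function names here are assumptions.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each frame under each diagonal Gaussian: (L, F) x (C, F) -> (L, C)."""
    diff = X[:, None, :] - means[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (log_norm[None, :] + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def map_adapt_means(X, weights, means, variances, relevance=16.0):
    """Adapt UBM means to speaker frames X (assumed mean-only relevance MAP)."""
    # Posterior probability of each component for each frame, shape (L, C).
    log_p = np.log(weights)[None, :] + log_gauss_diag(X, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)

    n = post.sum(axis=0)                       # soft frame count per component, (C,)
    data_mean = (post.T @ X) / np.maximum(n, 1e-10)[:, None]
    alpha = (n / (n + relevance))[:, None]     # data-dependent interpolation weight
    # Components with many speaker frames move toward the data; others stay at the UBM.
    return alpha * data_mean + (1.0 - alpha) * means
```

Components that see many frames from the speaker move toward that speaker's data, while components that see almost none keep their UBM means.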

Depending on the embodiment, the speaker-dependent Gaussian mixture model for each specific speaker may instead be trained from the training data separately from the training of the universal background model.

Meanwhile, in step S100, training the I-vector extractor yields its block matrices. When the dimension of the frame-level features of the training data is F and the number of components of the Gaussian mixture model is C, an M-dimensional I-vector extractor is a rectangular matrix with CF rows and M columns. When the rows of this matrix are divided into C groups of F rows each, each resulting block matrix acts as a projection matrix that maps a different Gaussian component to the I-vector.

That is, in step S100, training the I-vector extractor yields the block matrices, each of which is a projection matrix that maps a different Gaussian component to the I-vector.
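The block structure described above can be sketched as a reshape of the extractor matrix. The helper name `split_into_blocks` is hypothetical, and the sketch assumes the rows are ordered component by component, so that rows cF to (c+1)F-1 belong to component c.

```python
import numpy as np

def split_into_blocks(T, num_components):
    """Split a (C*F, M) extractor matrix into C blocks of shape (F, M)."""
    T = np.asarray(T)
    rows, ivector_dim = T.shape
    if rows % num_components != 0:
        raise ValueError("row count must be a multiple of the number of components")
    feat_dim = rows // num_components
    # Row-major reshape keeps each run of F consecutive rows together as one block.
    return T.reshape(num_components, feat_dim, ivector_dim)

# Toy example: C = 3 components, F = 2 features, M = 4 I-vector dimensions.
T = np.arange(24).reshape(6, 4)
blocks = split_into_blocks(T, num_components=3)
```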

In step S200, the average speaker mutual information of each Gaussian component is calculated using the speaker-dependent Gaussian mixture models trained in step S100. The details are described below with reference to FIG. 2.

In step S300, a robust I-vector in which speaker information is emphasized is extracted by applying weights to the I-vector extractor. More specifically, the average speaker mutual information of each Gaussian component calculated in step S200 may be multiplied into the block matrix of the I-vector extractor derived by training the I-vector extractor in step S100.

By applying the average speaker mutual information of each Gaussian component as a weight to the block matrix of the I-vector extractor in this way, a robust I-vector can be extracted in which Gaussian components carrying more speaker information are emphasized.
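The weighting step can be sketched as a row-wise scaling of the extractor matrix, where every row of component c's block is multiplied by that component's weight. The function name `apply_component_weights` and the component-by-component row ordering are assumptions; the patent states only that each weight is multiplied into the corresponding block matrix.

```python
import numpy as np

def apply_component_weights(T, weights):
    """Scale each F-row block of a (C*F, M) extractor matrix by its component weight."""
    T = np.asarray(T, dtype=float)
    weights = np.asarray(weights, dtype=float)
    num_components = weights.shape[0]
    if T.shape[0] % num_components != 0:
        raise ValueError("row count must be a multiple of the number of weights")
    feat_dim = T.shape[0] // num_components
    # Repeat each weight F times so it lines up with the rows of its block.
    row_scale = np.repeat(weights, feat_dim)
    return T * row_scale[:, None]

# Toy example: C = 3 components, F = 2 features, M = 2 I-vector dimensions.
T_weighted = apply_component_weights(np.ones((6, 2)), [1.0, 2.0, 3.0])
```

The weighted matrix would then replace the original extractor when I-vectors are computed.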

FIG. 2 shows the flow of a robust I-vector extractor learning method using speaker mutual information according to another embodiment of the present invention. As shown in FIG. 2, step S200 of the method may comprise: step S210 of calculating, using the speaker-dependent Gaussian mixture models trained in step S100, the normalized speaker mutual information (normalized pointwise mutual information, NPMI) for each of the two or more different speakers and each frame of the training data; and step S220 of averaging the NPMI calculated in step S210 over all frames of the training data to calculate the average speaker mutual information of each Gaussian component.

In step S210, the normalized speaker mutual information (NPMI) is calculated for each speaker and each frame of the training data. More specifically, the NPMI for each of the two or more different speakers and each frame of the training data may be calculated using Equation 4.

That is, in step S210, the NPMI can be calculated for each Gaussian component, for each speaker and each frame of the training data.
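Equation 4 is not reproduced in this text. The sketch below assumes it takes the form log(N(x_l | θ_{s,c}) / N(x_l | θ_{UBM,c})) divided by -log(N(x_l | θ_{s,c}) p(s)), which follows from substituting the speaker-dependent and UBM component densities into the NPMI definition. The diagonal covariances shared between the speaker GMM and the UBM, and all function names, are further assumptions.

```python
import numpy as np

def log_gauss_diag(X, means, variances):
    """Log density of each frame under each diagonal Gaussian: (L, F) x (C, F) -> (L, C)."""
    diff = X[:, None, :] - means[None, :, :]
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)
    return -0.5 * (log_norm[None, :] + np.sum(diff ** 2 / variances[None, :, :], axis=2))

def frame_npmi(X, spk_means, ubm_means, variances, p_s):
    """Assumed per-frame, per-component NPMI for one speaker, shape (L, C)."""
    log_spk = log_gauss_diag(X, spk_means, variances)   # log N(x_l | theta_{s,c})
    log_ubm = log_gauss_diag(X, ubm_means, variances)   # log N(x_l | theta_{UBM,c})
    pmi = log_spk - log_ubm
    log_joint = log_spk + np.log(p_s)                   # log of N(x_l | theta_{s,c}) p(s)
    return pmi / (-log_joint)
```

Because these are densities rather than discrete probabilities, the values are not guaranteed to stay within -1 and 1, and the denominator can approach zero when a density is large; a practical implementation would need to handle that case.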

In step S220, the weight (weight_c) to be applied to the I-vector extractor is calculated. More specifically, the weight (weight_c) may be calculated using Equation 5.

That is, in step S220, the NPMI calculated for each Gaussian component, each speaker, and each frame of the training data is averaged over all frames and all speakers to obtain the average speaker mutual information of each Gaussian component. In step S300, this value is applied as a weight by multiplying it into the block matrix of the I-vector extractor.
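The averaging in step S220 can be sketched as a mean over the speaker and frame axes. The sketch assumes the NPMI values are stored in an array of shape (S, L, C), with the same number of frames L for every speaker as in the patent's notation. The array layout and the function name `component_weights` are hypothetical.

```python
import numpy as np

def component_weights(npmi):
    """Average NPMI over all speakers and frames; returns one weight per component."""
    npmi = np.asarray(npmi, dtype=float)
    if npmi.ndim != 3:
        raise ValueError("expected an array of shape (S, L, C)")
    num_speakers, num_frames, _ = npmi.shape
    # weight_c = (1 / (L * S)) * sum over s and l of NPMI_c(x_l, s)
    return npmi.sum(axis=(0, 1)) / (num_frames * num_speakers)

# Toy values: S = 2 speakers, L = 3 frames, C = 2 components.
toy_npmi = np.arange(12, dtype=float).reshape(2, 3, 2)
toy_weights = component_weights(toy_npmi)
```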

FIG. 3 is a block diagram showing the flow of a robust I-vector extractor learning method using speaker mutual information according to an embodiment of the present invention. As shown in FIG. 3, the method concerns training the I-vector extractor that extracts the I-vector, the feature vector used to recognize a speaker in speaker recognition. First, in step S100, the training data is used to train not only a universal background model (UBM), which is a speaker-independent model, but also a dependent Gaussian mixture model (GMM) for each of the different speakers, and the same training data is used to train an I-vector extractor of the kind conventionally used in the I-vector technique.

Then, in step S200, the trained speaker-dependent Gaussian mixture models are used to obtain, for each Gaussian component, the normalized mutual information (NPMI) between each speaker and each training data frame. Averaging this NPMI over all speakers and all frames finally gives the average speaker mutual information of each Gaussian component (the weight, weight_c).

Finally, in step S300, multiplying the block matrix of the I-vector extractor by the average speaker mutual information amount of each Gaussian component produces an I-vector extractor in which speaker information is highlighted, from which a robust I-vector with highlighted speaker information can be extracted.
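The weighting of step S300 can be sketched as follows. The layout is an assumption made for illustration: the extractor matrix is taken to be stored as one block per Gaussian component, each block a list of rows, and `apply_component_weights` is a hypothetical helper name rather than anything defined in the patent.

```python
def apply_component_weights(blocks, weights):
    """Scale each per-component block of the extractor matrix by its weight.

    blocks[c] is the (assumed) block for Gaussian component c, stored as a
    list of rows; weights[c] is that component's average speaker mutual
    information amount.
    """
    if len(blocks) != len(weights):
        raise ValueError("need exactly one weight per block")
    return [
        [[w * value for value in row] for row in block]
        for block, w in zip(blocks, weights)
    ]
```

For example, with two single-row blocks `[[1.0, 2.0]]` and `[[3.0, 4.0]]` and weights `[0.5, 2.0]`, the first block is halved and the second doubled, so the component with the larger weight contributes more strongly to the extracted I-vector.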

Thus, according to the I-vector extractor learning method proposed in the present invention, the NPMI between each speaker and the learning data frames is calculated for each Gaussian component and then averaged over all frames and all speakers to obtain the average speaker mutual information amount of each Gaussian component. Applying this value as a weight, by multiplying it into the block matrix of the I-vector extractor, emphasizes the Gaussian components that carry more speaker information in the I-vector extracted through the I-vector extractor, so that a robust I-vector, a feature vector with higher performance, can be extracted.

In addition, applying the weights to the I-vector extractor highlights the speaker-related information in the I-vector, so that features robust to variability caused by non-speaker factors such as noise and microphone condition can be extracted. Speaker characteristics can therefore be extracted effectively, and speaker recognition performance improved, even when the input speech is short or the environment is noisy.

FIG. 4 is a diagram illustrating a robust I-vector extractor learning system using speaker mutual information according to an embodiment of the present invention. As shown in FIG. 4, the system may comprise a learning unit (100), a calculation unit (200), and a speaker information highlighted I-vector extracting unit (300).

The learning unit (100) may use learning data composed of a plurality of speech files to learn a Gaussian mixture model (GMM) dependent on each of two or more different speakers, and may use the same learning data to learn the I-vector extractor.

The calculation unit (200) may calculate the average speaker mutual information amount of each Gaussian component using the speaker-dependent Gaussian mixture models (GMMs) learned by the learning unit (100).

The speaker information highlighted I-vector extracting unit (300) may extract a robust I-vector in which speaker information is highlighted by applying the average speaker mutual information amount of each Gaussian component, calculated by the calculation unit (200), as a weight to the I-vector extractor learned by the learning unit (100).

That is, the learning unit (100), the calculation unit (200), and the speaker information highlighted I-vector extracting unit (300) may respectively perform steps S100, S200, and S300 of the robust I-vector extractor learning method using speaker mutual information. The details are as described above with reference to FIGS. 1 to 3 and are therefore omitted here.

FIG. 5 is a diagram illustrating the configuration of the calculation unit (200) in the robust I-vector extractor learning system using speaker mutual information according to an embodiment of the present invention. As shown in FIG. 5, the calculation unit (200) may include an NPMI calculation unit (210), which uses the speaker-dependent Gaussian mixture models learned by the learning unit (100) to calculate the normalized pointwise mutual information (NPMI) for each of the two or more different speakers and each frame of the learning data, and a weight calculation unit (230), which averages the NPMI calculated by the NPMI calculation unit (210) over all frames and all speakers of the learning data to calculate the average speaker mutual information amount of each Gaussian component.

Here, the NPMI calculation unit (210) and the weight calculation unit (230) may respectively perform steps S210 and S220 of the robust I-vector extractor learning method using speaker mutual information. The details are as described above with reference to FIG. 2 and are therefore omitted here.

The present invention described above may be variously modified or applied by those of ordinary skill in the art to which the invention pertains, and the scope of the technical idea of the present invention should be determined by the claims below.

S100: learning a Gaussian mixture model (GMM) dependent on each of two or more different speakers using learning data composed of a plurality of speech files, and learning an I-vector extractor using the learning data
S200: calculating the average speaker mutual information amount of each Gaussian component using the speaker-dependent Gaussian mixture models learned in step S100
S210: calculating the normalized pointwise mutual information (NPMI) for each frame of the learning data using the speaker-dependent Gaussian mixture models learned in step S100
S220: averaging the per-frame NPMI of the learning data calculated in step S210 over all frames of the learning data to calculate the average speaker mutual information amount of each Gaussian component
S300: extracting a robust I-vector in which speaker information is highlighted by applying the average speaker mutual information amount of each Gaussian component calculated in step S200, as a weight, to the I-vector extractor learned in step S100
100: learning unit
200: calculation unit
210: NPMI calculation unit
230: weight calculation unit
300: speaker information highlighted I-vector extracting unit

Claims

1. A robust I-vector extractor learning method using speaker mutual information, comprising:
(1) learning a Gaussian mixture model (GMM) dependent on each of two or more different speakers using learning data composed of a plurality of speech files, and learning an I-vector extractor using the learning data;
(2) calculating an average speaker mutual information amount of each Gaussian component using the speaker-dependent Gaussian mixture models learned in step (1); and
(3) extracting a robust I-vector in which speaker information is highlighted by applying the average speaker mutual information amount of each Gaussian component calculated in step (2), as a weight, to the I-vector extractor learned in step (1),
wherein, in step (1), a block matrix of the I-vector extractor is derived by learning the I-vector extractor.

2. The method of claim 1, wherein, in step (1), a universal background model (UBM), which is a speaker-independent model, is learned using the learning data, and the speaker-dependent Gaussian mixture models are derived by applying adaptation to the data of each specific speaker on the basis of the learned speaker-independent universal background model.

3. (Deleted)

4. The method of claim 1, wherein, in step (3), the average speaker mutual information amount of each Gaussian component calculated in step (2) is multiplied by the block matrix of the I-vector extractor derived by learning the I-vector extractor in step (1).

5. The method of claim 1, wherein step (2) comprises:
(2-1) calculating normalized pointwise mutual information (NPMI) for each of the two or more different speakers and each frame of the learning data using the speaker-dependent Gaussian mixture models learned in step (1); and
(2-2) averaging the NPMI calculated in step (2-1) over all frames and all speakers of the learning data to calculate the average speaker mutual information amount of each Gaussian component.

6. The method of claim 5, wherein, in step (2-1), the NPMI for each of the two or more different speakers and each frame of the learning data is calculated through the following equation.

Here, N(x_l | θ_{s,c}) denotes the probability that speech frame x_l is generated when the c-th Gaussian component of the Gaussian mixture model (GMM) dependent on speaker s is given as the probability distribution, N(x_l | θ_{UBM,c}) denotes the probability that speech frame x_l is generated when the c-th Gaussian component of the UBM is given as the probability distribution, and p(s) denotes the proportion of speaker s in the learning data, that is, the probability that speaker s is present in the learning data.

7. The method of claim 6, wherein, in step (2-2), the average speaker mutual information amount of each Gaussian component is calculated through the following equation.

Here, L denotes the number of frames in the speech data from which features are extracted, and S denotes the number of speakers.

8. A robust I-vector extractor learning system using speaker mutual information, comprising:
a learning unit (100) that learns a Gaussian mixture model (GMM) dependent on each of two or more different speakers using learning data composed of a plurality of speech files, and learns an I-vector extractor using the learning data;
a calculation unit (200) that calculates an average speaker mutual information amount of each Gaussian component using the speaker-dependent Gaussian mixture models learned by the learning unit (100); and
a speaker information highlighted I-vector extracting unit (300) that extracts a robust I-vector in which speaker information is highlighted by applying the average speaker mutual information amount of each Gaussian component calculated by the calculation unit (200), as a weight, to the I-vector extractor learned by the learning unit (100),
wherein the learning unit (100) derives a block matrix of the I-vector extractor by learning the I-vector extractor.

9. The system of claim 8, wherein the learning unit (100) learns a universal background model (UBM), which is a speaker-independent model, using the learning data, and derives the speaker-dependent Gaussian mixture models by applying adaptation to the data of each specific speaker on the basis of the learned speaker-independent universal background model.

10. (Deleted)

11. The system of claim 8, wherein the speaker information highlighted I-vector extracting unit (300) multiplies the average speaker mutual information amount of each Gaussian component calculated by the calculation unit (200) by the block matrix of the I-vector extractor derived by the learning unit (100) through learning the I-vector extractor.

12. The system of claim 8, wherein the calculation unit (200) comprises:
an NPMI calculation unit (210) that calculates normalized pointwise mutual information (NPMI) for each of the two or more different speakers and each frame of the learning data using the speaker-dependent Gaussian mixture models learned by the learning unit (100); and
a weight calculation unit (230) that averages the NPMI calculated by the NPMI calculation unit (210) over all frames and all speakers of the learning data to calculate the average speaker mutual information amount of each Gaussian component.

13. The system of claim 12, wherein the NPMI calculation unit (210) calculates the NPMI for each of the two or more different speakers and each frame of the learning data through the following equation.

14. The system of claim 13, wherein the weight calculation unit (230) calculates the average speaker mutual information amount of each Gaussian component through the following equation.