KR102055886B1

KR102055886B1 - Speaker voice feature extraction method, apparatus and recording medium therefor

Info

Publication number: KR102055886B1
Application number: KR1020180010899A
Authority: KR
Inventors: 반성민; 김남수; 강우현
Original assignee: 에스케이텔레콤 주식회사; 서울대학교산학협력단
Priority date: 2018-01-29
Filing date: 2018-01-29
Publication date: 2019-12-13
Also published as: KR20190091837A

Abstract

The present invention relates to a method and apparatus for extracting features of a speaker's voice, and to a recording medium therefor. It assumes that a random latent variable mediates between the Baum-Welch statistics extracted from the input speech and the distribution of the frame-level features contained in that speech. On this assumption, a measure of the uncertainty and reliability of the input speech signal is extracted, so that the speaker can be recognized even from short or degraded speech signals.

Description

SPEAKER VOICE FEATURE EXTRACTION METHOD, APPARATUS AND RECORDING MEDIUM THEREFOR

The present invention relates to a method and apparatus for extracting features of a speaker's voice, and to a recording medium therefor. More particularly, it relates to a method and apparatus, and a recording medium therefor, that extract a measure of the uncertainty and reliability of an input speech signal so that features suitable for recognizing the speaker can be extracted even from short or degraded speech signals.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

The i-vector is a vector that represents the variation of a GMM (Gaussian Mixture Model) supervector, which is formed by concatenating the means of the individual Gaussians when the distribution of the data is modeled with a GMM. It was proposed as a feature vector for speaker recognition. Because it can express the various kinds of variability present in the input signal as a low-dimensional vector, it is used not only for speaker recognition but also in other fields such as image and video processing. The relationship between the GMM supervector and the i-vector can be expressed as follows.

$$m(X) = u + T\,w(X)$$

Here $m(X)$, $u$, $T$, and $w(X)$ denote, respectively, the GMM supervector dependent on the input, the UBM (universal background model), the total variability matrix, and the i-vector dependent on the input $X$.
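The linear relationship between the GMM supervector and the i-vector described above can be illustrated with a minimal sketch. The function name `supervector_from_ivector`, the symbols `u`, `T`, `w`, and the toy dimensions are assumptions made for illustration; real systems use supervectors with thousands of dimensions.

```python
def supervector_from_ivector(u, T, w):
    """Return m = u + T w for a UBM supervector u, a matrix T (list of rows), and an i-vector w."""
    if len(T) != len(u) or any(len(row) != len(w) for row in T):
        raise ValueError("shape mismatch between u, T, and w")
    return [u_i + sum(t_ij * w_j for t_ij, w_j in zip(row, w))
            for u_i, row in zip(u, T)]

# Toy sizes (hypothetical): a 4-dimensional supervector and a 2-dimensional i-vector.
u = [0.0, 1.0, 2.0, 3.0]
T = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]
w = [2.0, -1.0]
m = supervector_from_ivector(u, T, w)
```

Because the mapping is affine in `w`, any speaker information that cannot be expressed as a linear offset from the UBM supervector is lost, which is the limitation the later sections address.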

The UBM is a GMM trained on many kinds of signals and represents the distribution of speech independent of any particular signal.

ALI (adversarially learned inference) is a type of generative model for data generation and consists of three networks. The first network is the decoder, which takes an arbitrary random latent variable z as input and generates data x. The second network is the encoder, which takes data x as input and estimates the corresponding latent variable. The last network, the discriminator, takes a pair (z, x) of latent variable and data from either the encoder or the decoder and determines whether the pair was produced by the encoder or by the decoder. By training the encoder and decoder adversarially against the discriminator, the encoder and the decoder learn to output latent variables and data, respectively, that appear to have been drawn from the true distributions.

Speaker recognition using the i-vector could represent speaker information only in a linear manner. Previously disclosed deep-learning-based feature extraction techniques, even those that include a random variable, constrain that variable through an explicit mathematical term to follow a distribution similar to a preset Gaussian distribution, which causes a large loss of the original information.

Korean Registered Patent No. 10-1805976

The present invention has been proposed to solve the conventional problems described above. Its object is to recognize a speaker by representing even short or degraded speech as a fixed, low-dimensional vector regardless of its length, using a discriminator that judges whether a distribution follows the actual Gaussian distribution instead of imposing a separate mathematical constraint.

In particular, an object of the present invention is to provide a speaker recognition method and apparatus, and a recording medium therefor, that assume the existence of a random latent variable mediating between the Baum-Welch statistics extracted from the input speech and the distribution of the frame-level features contained in that speech. This makes it possible to extract a measure of the uncertainty and reliability of the input signal, so that the speaker can be recognized even when a short or degraded signal is given.

However, the objects of the present invention are not limited to those stated above, and other objects not mentioned will be clearly understood from the description below.

As a means of solving the problems described above, the present invention provides a speaker voice feature extraction method that extracts features of a speaker speech signal by computing Baum-Welch statistics on the signal under a UBM (Universal Background Model).

The speaker voice feature extraction method according to the present invention includes: computing Baum-Welch statistics on a speech signal under a UBM and extracting the distribution of a random latent variable that represents the distributional characteristics of the input speech and depends on the input speech signal; calculating a first GMM supervector by maximum a posteriori (MAP) estimation based on the Baum-Welch statistics computed on the speaker speech signal under the UBM; extracting a sampled value from the prior probability distribution of the random latent variable; calculating a second GMM supervector using the value sampled from the prior probability distribution of the random latent variable; and inputting to a discriminator either the pair of the random latent variable distribution and the first GMM supervector or the pair of the sampled value and the second GMM supervector, and classifying whether the input was generated from the sampled value or from the input speech.

In addition, as a means of solving the problems described above, the present invention provides a speaker voice feature extraction apparatus comprising: an encoder that computes Baum-Welch statistics on a speaker speech signal under a UBM (Universal Background Model) and extracts the distribution of a random latent variable that represents the distributional characteristics of the input speech and depends on the input speech signal; a GMM generation unit that generates a first GMM (Gaussian Mixture Model) supervector by maximum a posteriori (MAP) estimation using the Baum-Welch statistics computed on the speaker speech signal under the UBM; a decoder that extracts a sampled value from the prior probability distribution of the random latent variable and uses it to calculate a second GMM supervector; and a discriminator that receives either the pair of the random latent variable distribution and the first GMM supervector or the pair of the sampled value and the second GMM supervector, and classifies whether the input was generated from the sampled value or from the input speech.

In addition, the speaker voice feature extraction method according to the present invention described above may be implemented as a program and recorded on a computer-readable recording medium.

According to the speaker voice feature extraction method and apparatus of the present invention, and the recording medium therefor, a random latent variable is assumed to mediate between the Baum-Welch statistics extracted from the input speech and the distribution of the frame-level features contained in that speech. Because a measure of the uncertainty and reliability of the input signal can thereby be extracted, the performance degradation caused by short or degraded signals can be reduced.

In addition, unlike the conventional i-vector extraction technique, the present invention uses a deep neural network (DNN) and imposes no separate mathematical constraint when extracting the random latent variable, so it can represent speaker information that the linear projection of the conventional i-vector may fail to capture.

Furthermore, various effects other than those described above may be disclosed directly or implicitly in the detailed description of the embodiments of the present invention below.

FIG. 1 is a block diagram of a speaker voice feature extraction apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a speaker voice feature extraction process according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating the process by which the encoder and the GMM generation unit produce the discriminator input in the speaker voice feature extraction apparatus according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating the process by which the decoder produces the discriminator input in the speaker voice feature extraction apparatus according to an embodiment of the present invention.
FIG. 5 is a diagram showing the effect of speaker voice feature extraction according to an embodiment of the present invention.

To clarify the features and advantages of the means for solving the problems of the present invention, the present invention will be described in more detail with reference to specific embodiments shown in the accompanying drawings.

In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that may obscure the gist of the present invention are omitted. Note also that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

The terms and words used in the following description and drawings should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas; it should be understood that various equivalents and modifications that could replace them may exist at the time of filing.

Terms including ordinal numbers, such as first and second, are used to describe various components; they serve only to distinguish one component from another and do not limit the components. For example, a second component may be called a first component, and similarly a first component may be called a second component, without departing from the scope of the present invention.

When a component is said to be "connected" or "coupled" to another component, this means that it may be connected or coupled logically or physically. In other words, the component may be directly connected or coupled to the other component, but other components may exist in between, and the connection or coupling may be indirect.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" in this specification specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, a speaker voice feature extraction method according to an embodiment of the present invention will be described.

First, the present invention extracts the features of a speaker's voice using an ALI model consisting of an encoder, a decoder, and a discriminator. It assumes that a random latent variable mediates between the Baum-Welch statistics extracted from the input speech and the distribution of the frame-level features contained in that speech, and it uses the discriminator so that variables sampled from the posterior distribution of the estimated latent variable take realistic values close to actual latent variables sampled from the prior distribution of the latent variable.

FIG. 1 is a block diagram of a speaker voice feature extraction apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the speaker voice feature extraction apparatus according to the present invention includes an encoder 100, a GMM generation unit 200, a decoder 300, and a discriminator 400.

To extract the features of a speaker speech signal, the apparatus configured in this way computes, through the encoder 100, Baum-Welch statistics on the speaker speech signal x under the UBM and extracts the distribution z(x) of the random latent variable, which depends on the input speech signal x.

Next, the decoder 300 extracts a GMM supervector using a value sampled at random from N(0,1), the prior probability distribution of the latent variable.

The log-likelihood of the input speech under this GMM supervector can be calculated using Equation 1.

[Equation 1]

The quantities in Equation 1 are: the parameters of the encoder; the parameters of the decoder; the random latent variable dependent on the input speech; the number of Gaussians used to represent the GMM distribution; the first-order Baum-Welch statistics of the input speech for the c-th Gaussian; the dimension of the frame-level features of the input speech; the covariance matrix of the c-th Gaussian of the UBM; the number of frames in the input speech; the probability that the l-th frame of the input speech belongs to the c-th Gaussian of the UBM; and the mean vector of the c-th Gaussian of the estimated GMM.
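To make the role of these quantities concrete, the sketch below scores a set of frames against a candidate set of GMM means, weighting each Gaussian log-density by the frame-to-component probabilities. This is an assumption about the general shape of such a log-likelihood, not a reproduction of Equation 1: it assumes diagonal covariances, and the names `diag_gauss_logpdf` and `supervector_loglik` are hypothetical.

```python
import math

def diag_gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at frame x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
                      for xi, mi, v in zip(x, mean, var))

def supervector_loglik(frames, gammas, means, variances):
    """Sum over frames l and components c of gamma_l(c) * log N(x_l | mean_c, var_c).

    Assumption: covariances are diagonal and shared with the UBM.
    """
    total = 0.0
    for x, gamma in zip(frames, gammas):
        for g, mean, var in zip(gamma, means, variances):
            total += g * diag_gauss_logpdf(x, mean, var)
    return total

# Toy data: two 2-D frames, each aligned entirely to one of two components.
frames = [[0.0, 0.0], [1.0, 1.0]]
gammas = [[1.0, 0.0], [0.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
good = supervector_loglik(frames, gammas, [[0.0, 0.0], [1.0, 1.0]], variances)
bad = supervector_loglik(frames, gammas, [[1.0, 1.0], [0.0, 0.0]], variances)
```

Means that sit on the frames aligned to them (`good`) score higher than swapped means (`bad`), which is the signal a decoder would be trained to increase.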

Equation 1 is the log-likelihood for one particular latent variable, so marginalizing over the latent variable gives the approximation in Equation 2 below.

[Equation 2]

Equation 2 involves the number of samples used for the Monte Carlo approximation and latent variables sampled at random from the prior probability distribution of the latent variable.
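A Monte Carlo marginalization of this kind is usually computed in the log domain to avoid underflow. The sketch below shows the log of the average likelihood over prior samples; whether Equation 2 averages likelihoods or log-likelihoods cannot be recovered from the text, so this form is an assumption, and `log_mean_exp` and the sample values are hypothetical.

```python
import math

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    peak = max(log_values)
    return peak + math.log(sum(math.exp(v - peak) for v in log_values) / len(log_values))

# Hypothetical per-sample log-likelihoods log p(X | z_s) for three prior samples z_s.
per_sample = [-1200.0, -1195.0, -1210.0]
marginal = log_mean_exp(per_sample)
```

Subtracting the largest value before exponentiating keeps every term in a representable range even when the raw likelihoods would underflow to zero.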

In Equation 1, the Baum-Welch statistics carry information about the distribution pattern of the frame features present in a given utterance, given the UBM, and are defined by Equations 3 and 4 below.

[Equation 3]

[Equation 4]

Equations 3 and 4 define, respectively, the zeroth-order and first-order Baum-Welch statistics for the c-th Gaussian component of the UBM. They are expressed in terms of the probability that the l-th frame feature of X is generated by the c-th Gaussian component of the UBM and the mean vector of the c-th Gaussian component of the UBM. The zeroth-order Baum-Welch statistic indicates the number of frames of X aligned to the c-th Gaussian component, and the first-order Baum-Welch statistic represents the average frame aligned to the c-th Gaussian component.
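The statistics described above can be sketched as follows, assuming a diagonal-covariance UBM. The zeroth-order statistic is the sum of the per-frame component probabilities. For the first-order statistic this sketch uses the common centered-sum convention (probability-weighted sum of frame minus UBM mean); dividing that sum by the zeroth-order statistic would give the "average frame" reading. Which normalization the patent's Equation 4 uses is not recoverable here, so treat that choice, and all function and variable names, as assumptions.

```python
import math

def component_posteriors(x, weights, means, variances):
    """Posterior probability of each UBM component for one frame (diagonal covariances)."""
    log_terms = []
    for w, mean, var in zip(weights, means, variances):
        log_pdf = -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - mi) ** 2 / v
                             for xi, mi, v in zip(x, mean, var))
        log_terms.append(math.log(w) + log_pdf)
    peak = max(log_terms)                      # subtract the max for numerical stability
    unnorm = [math.exp(t - peak) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def baum_welch_stats(frames, weights, means, variances):
    """Return (n, f): zeroth-order counts and centered first-order sums per component."""
    n = [0.0] * len(means)
    f = [[0.0] * len(mean) for mean in means]
    for x in frames:
        gamma = component_posteriors(x, weights, means, variances)
        for c, g in enumerate(gamma):
            n[c] += g
            for d, (xi, mi) in enumerate(zip(x, means[c])):
                f[c][d] += g * (xi - mi)       # assumed convention: centered sum
    return n, f

# Toy two-component UBM over 1-D frames (hypothetical values).
weights = [0.5, 0.5]
means = [[-5.0], [5.0]]
variances = [[1.0], [1.0]]
frames = [[-5.5], [-4.5], [5.0]]
n, f = baum_welch_stats(frames, weights, means, variances)
```

In this toy case two frames fall near the first component and one on the second, so `n` is close to `[2, 1]` and the centered sums are close to zero.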

Next, the GMM (Gaussian Mixture Model) generation unit 200 can extract a GMM supervector estimated by maximum a posteriori (MAP) estimation, using z(x) extracted by the encoder 100 from the input speech x.

The GMM supervector estimated by MAP can be calculated according to Equation 5.

[Equation 5]

The quantities in Equation 5 are: the input speech; the probability that the l-th frame feature of the input speech belongs to the c-th Gaussian of the UBM; the first-order Baum-Welch statistic; the zeroth-order Baum-Welch statistic; and the mean of the c-th Gaussian of the UBM.
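One update that uses exactly the quantities listed for Equation 5 is to shift each UBM mean by the first-order statistic divided by the zeroth-order statistic and then concatenate the results. The sketch below implements that update as an assumption: it presumes the first-order statistics are centered sums, it omits the relevance factor found in classical MAP adaptation because none is mentioned in the text, and the fallback for empty components is added only to keep the example safe.

```python
def adapted_supervector(n, f, ubm_means, eps=1e-8):
    """Concatenate per-component means mu_c + f_c / n_c into one supervector.

    Assumption: f holds centered first-order sums and n holds zeroth-order counts.
    """
    supervector = []
    for n_c, f_c, mu_c in zip(n, f, ubm_means):
        if n_c > eps:
            supervector.extend(mu + f_d / n_c for mu, f_d in zip(mu_c, f_c))
        else:
            # No frames aligned to this component: fall back to the UBM mean.
            supervector.extend(mu_c)
    return supervector

# Hypothetical statistics for two components with 2-D features.
n = [2.0, 0.0]
f = [[1.0, -2.0], [0.0, 0.0]]
ubm_means = [[0.0, 0.0], [3.0, 3.0]]
sv = adapted_supervector(n, f, ubm_means)
```

The result has length (number of Gaussians) x (feature dimension), which is the fixed-size representation a discriminator can consume regardless of utterance length.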

The discriminator 400 takes either of two (latent variable, GMM supervector) pairs as input and produces an output value. This output value can be computed by a deep neural network (DNN) that uses a sigmoid function.

In addition, with Equation 6 as its objective function, the discriminator 400 may output 1 when its input is the pair produced from the input speech and 0 when its input is the pair produced from the sampled value.

[Equation 6]

The terms in Equation 6 are the joint distribution of the latent variable and the data represented by the encoder, the joint distribution represented by the decoder, and the output value of the discriminator 400.
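A discriminator trained to output 1 for encoder pairs and 0 for decoder pairs is commonly fitted with a binary cross-entropy loss on sigmoid outputs. The sketch below shows that loss on raw scores; it is an illustration of the labelling scheme under that assumption, not a reconstruction of Equation 6, and the function names and score values are hypothetical.

```python
import math

def sigmoid(a):
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def discriminator_loss(encoder_scores, decoder_scores, eps=1e-12):
    """Binary cross-entropy: encoder pairs are labelled 1, decoder pairs 0."""
    loss = 0.0
    for s in encoder_scores:
        loss -= math.log(sigmoid(s) + eps)
    for s in decoder_scores:
        loss -= math.log(1.0 - sigmoid(s) + eps)
    return loss / (len(encoder_scores) + len(decoder_scores))

# Hypothetical raw scores before the sigmoid.
good = discriminator_loss([4.0, 3.0], [-4.0, -3.0])   # discriminator is mostly right
bad = discriminator_loss([-4.0, -3.0], [4.0, 3.0])    # discriminator is mostly wrong
```

The encoder and decoder would be trained against the opposite objective, so that their pairs become hard for the discriminator to tell apart.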

In this case, the discriminator 400 may use Equation 7 as an objective function so that the log-likelihood is maximized.

[Equation 7]

The terms in Equation 7 are the joint distribution of the random latent variable and the GMM distribution, the number of samples used for the Monte Carlo estimate, and a latent variable sampled from the distribution of the random latent variable given the input speech.

That is, the discriminator 400 of the present invention receives a latent variable and a speech distribution generated by the encoder 100 or the decoder 300 and determines whether they were generated from a sample drawn at random from the prior probability distribution of the latent variable or from actual speech data.

In other words, the discriminator 400 is trained to output 1 when given the latent variable and speech distribution estimated from actual speech input to the encoder 100, and to output 0 when given the speech distribution estimated from a variable sampled at random from the prior probability distribution of the latent variable and input to the decoder.

Existing VAE-based methods of extracting features for speaker recognition require a mathematical constraint, expressed as the KL divergence to the prior probability distribution of the latent variable, so that the probability distribution of the latent variable generated by the encoder resembles a Gaussian distribution. The present invention imposes no such constraint and also generates variables that express uncertainty and reliability, so it can show stable speaker recognition performance even for short or degraded speech.
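For reference, the KL-divergence constraint used in VAE-style training has a closed form when the encoder outputs a diagonal Gaussian and the prior is a standard normal. The sketch below computes that term; it illustrates the constraint that the present approach replaces with a discriminator, and the function name is hypothetical.

```python
import math

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in nats."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mean, log_var))

# The penalty is zero only when the posterior equals the prior.
at_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
shifted = kl_to_standard_normal([1.0], [0.0])
```

Because this term pulls every posterior toward the same fixed Gaussian, it can suppress utterance-specific information in the mean and variance, which is the information loss the description refers to.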

FIG. 2 is a flowchart illustrating a speaker voice feature extraction process according to an embodiment of the present invention.

Referring to FIG. 2, a speaker's speech signal is first taken as input, Baum-Welch statistics are computed on it under the UBM, and the distribution of the random latent variable, which represents the distributional characteristics of the input speech and depends on the input speech signal, is extracted (S101).

Then, a first GMM supervector is calculated by MAP estimation based on the values obtained by computing Baum-Welch statistics on the input speech signal under the UBM (S102).

Next, a sampled value is extracted from the prior probability distribution of the random latent variable (S103).

Then, a second GMM supervector is calculated from the sampled value extracted in step S103 (S104).

Next, either the pair of the random latent variable distribution extracted in step S101 and the first GMM supervector calculated in step S102, or the pair of the sampled value extracted in step S103 and the second GMM supervector calculated in step S104, is input to the discriminator 400, which classifies whether the input was generated from the sampled value or from the input speech (S105).

The discriminator 400 in the present invention is used in training so that variables sampled from the posterior distribution of the extracted random latent variable take realistic values close to actual latent variables sampled from the prior distribution of the latent variable.

FIGS. 3 and 4 are flowcharts showing in more detail the processes for extracting the inputs of the discriminator according to an embodiment of the present invention.

Referring to FIG. 3, the encoder 100 takes as input the values obtained by computing Baum-Welch statistics on the speech signal under the UBM (Universal Background Model) (S110) and obtains the mean and variance of the latent variable (S120). Here, the variance may be the log-variance of the latent variable.
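When an encoder outputs a mean and a log-variance, a latent sample is typically drawn by scaling standard normal noise, as in the sketch below. The text states only that the encoder produces the mean and (log-)variance, so the sampling step, the function name `sample_latent`, and the toy values are assumptions.

```python
import math
import random

def sample_latent(mean, log_var, rng):
    """Draw z = mean + exp(0.5 * log_var) * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]

rng = random.Random(0)                      # seeded for repeatability
mean = [0.0, 1.0, -2.0]
log_var = [0.0, -40.0, math.log(4.0)]       # variances of 1, nearly 0, and 4
z = sample_latent(mean, log_var, rng)
```

Working with the log-variance keeps the variance positive without any explicit constraint on the network output, and a very negative log-variance (second element) makes the sample collapse onto the mean.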

In addition, the GMM generation unit 200 uses the Baum-Welch statistics to extract the first GMM supervector estimated by maximum a posteriori (MAP) estimation (S130).

The first GMM supervector estimated by MAP can be calculated according to Equation 5.

[Equation 5]

The quantities in Equation 5 are: the input speech; the probability that the l-th frame feature of the input speech belongs to the c-th Gaussian of the UBM; the first-order Baum-Welch statistic; the zeroth-order Baum-Welch statistic; and the mean of the c-th Gaussian of the UBM.

The supervector is calculated in this way because the speaker speech x cannot be used directly as the input to the discriminator 400; a GMM supervector is required.

Referring to FIG. 4, the decoder draws a random sample from N(0,1), the prior probability distribution of the latent variable (S310).

The second GMM supervector is then extracted using the drawn sample (S320).
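Steps S310 and S320 can be sketched as a small feed-forward decoder that maps a prior sample to a supervector. The single ReLU hidden layer follows the experimental setup reported later in the description; the linear output layer, the random initialization, the toy layer sizes, and all names are assumptions for illustration.

```python
import random

def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]

def affine(W, b, v):
    """Compute W v + b for a matrix given as a list of rows."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) + b_i for row, b_i in zip(W, b)]

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def decode(z, params):
    """Map a latent sample z to a GMM supervector through one ReLU hidden layer."""
    hidden = relu(affine(params["W1"], params["b1"], z))
    return affine(params["W2"], params["b2"], hidden)

rng = random.Random(0)
latent_dim, hidden_dim, num_gauss, feat_dim = 4, 8, 2, 3   # toy sizes, not the reported ones
params = {
    "W1": random_matrix(hidden_dim, latent_dim, rng),
    "b1": [0.0] * hidden_dim,
    "W2": random_matrix(num_gauss * feat_dim, hidden_dim, rng),
    "b2": [0.0] * (num_gauss * feat_dim),
}
z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]       # S310: sample from N(0, 1)
supervector = decode(z, params)                            # S320: second GMM supervector
```

The output length equals the number of Gaussians times the feature dimension, so the decoder's supervector has the same shape as the one computed from real speech and both can be fed to the same discriminator.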

The pairs extracted through FIGS. 3 and 4 are used as inputs to the discriminator 400, which makes it possible to extract a measure of the uncertainty and reliability of the input signal, so that the speaker can be recognized even when a short or degraded signal is given.

FIG. 5 illustrates the effect of speaker speech feature extraction according to an embodiment of the present invention.

For performance verification, the encoder (100) and the decoder (300) were each configured with a single hidden layer of 4096 ReLU nodes. The model was trained on the TIMIT dataset, which consists of 6300 speech samples from 630 speakers, and evaluated on TIDIGITS, which consists of 326 speakers. Linear discriminant analysis (LDA) was used for feature generalization and probabilistic linear discriminant analysis (PLDA) for speaker verification. The UBM and the GMM each used 32 Gaussians, and the dimension of the latent variable was set to 200. First, the differential entropy computed from the log-variance of the extracted latent variable is shown in FIG. 5.

In the results of FIG. 5, VAE denotes a method that does not use the discriminator (400) and instead optimizes the log-likelihood together with the KL divergence between the estimated posterior and the prior of the latent variable; ALI denotes a method that does not consider the log-likelihood; and ALI/NLL denotes the method proposed in the present invention, which is trained with both the log-likelihood and the discriminator (400). As the graph in FIG. 5 shows, the differential entropy computed from the log-variance of the latent variable decreases as the speech becomes longer, and the decrease is considerably more pronounced with the proposed method. This indicates that features extracted by the proposed technique effectively carry information about the uncertainty associated with speech length.
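The differential entropy referred to here can be computed directly from the log-variance vector if the latent posterior is assumed to be a diagonal Gaussian, which is the usual choice for VAE-style encoders but is an assumption of this sketch. Under that assumption the entropy is half the sum over dimensions of log(2πe) plus the log-variance. The function name is hypothetical.

```python
import math

def differential_entropy(log_var):
    """Differential entropy (in nats) of a diagonal Gaussian.

    log_var holds the log-variance of each latent dimension.
    """
    return 0.5 * sum(math.log(2.0 * math.pi * math.e) + lv for lv in log_var)

# 200-dimensional latent variable, as in the reported experiment.
broad = differential_entropy([0.0] * 200)    # unit variance in every dimension
narrow = differential_entropy([-1.0] * 200)  # smaller variance, lower entropy
```

A smaller log-variance gives a lower entropy, which is consistent with the reported trend that longer utterances yield lower differential entropy.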

Furthermore, Table 1 below shows the speaker recognition performance obtained with the conventional i-vector feature and with the latent variable features generated by the feature extraction model proposed in the present invention.

[Table 1] Equal error rate (EER, %)

Model           i-vector only   LM      LV      LM+LV   i-vector+LM   i-vector+LV   i-vector+LM+LV
i-vector(200)   3.36            -       -       -       -             -             -
i-vector(400)   2.68            -       -       -       -             -             -
i-vector(600)   2.17            -       -       -       -             -             -
VAE             -               3.61    4.65    2.03    1.78          1.65          0.97
ALI             -               4.39    4.56    2.32    1.59          1.55          1.03
ALI/NLL         -               3.64    3.46    1.91    1.55          1.51          0.94

Table 1 above shows equal error rate (EER) results, where the EER is the error at the operating point at which the false acceptance rate equals the false rejection rate. In the table, LM, LV, LM+LV, i-vector+LM, i-vector+LV, and i-vector+LM+LV denote, respectively, the latent variable mean, the latent variable log-variance, the concatenation of the mean and the log-variance, the concatenation of the 200-dimensional i-vector with the latent variable mean, the concatenation of the i-vector with the latent variable log-variance, and the concatenation of the i-vector with both the mean and the log-variance. As the results show, using the latent variable mean and log-variance together (LM+LV) gives better speaker recognition results than an i-vector of the same dimension (1.91 to 2.32 versus 2.68 for i-vector(400)). Combining them with the i-vector improves performance further.
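For reference, the EER described above can be approximated from two lists of verification scores as sketched below. The threshold sweep over observed scores and the averaging of the two error rates at the closest crossing are common conventions assumed here; the function name and the example scores are hypothetical, and the patent does not specify its scoring tool.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = float("inf"), None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # False rejection: a genuine trial scoring below the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        # False acceptance: an impostor trial scoring at or above the threshold.
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

target = [2.0, 1.5, 0.2, 1.8]     # scores of same-speaker trials
impostor = [0.1, 0.5, -0.3, 1.6]  # scores of different-speaker trials
eer = equal_error_rate(target, impostor)
```

With these toy scores the two error rates cross at a threshold of 1.5, where one of four genuine trials is rejected and one of four impostor trials is accepted.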

Table 2 shows classification error results, that is, the error rate when the speaker with the highest PLDA score is selected as the predicted speaker. Here too, using the latent variable mean and log-variance together with the proposed method gave a lower error than an i-vector of the same dimension (6.76 for ALI/NLL versus 7.67 for i-vector(400)), and combining them with the i-vector lowered the error further.

[Table 2] Classification error (%)

Model           i-vector only   LM      LV      LM+LV   i-vector+LM   i-vector+LV   i-vector+LM+LV
i-vector(200)   12.62           -       -       -       -             -             -
i-vector(400)   7.67            -       -       -       -             -             -
i-vector(600)   5.07            -       -       -       -             -             -
VAE             -               11.89   17.78   6.94    5.36          4.99          2.75
ALI             -               16.62   17.31   8.54    5.10          4.79          2.79
ALI/NLL         -               12.56   12.38   6.76    3.97          4.18          2.49
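The classification error reported in Table 2 follows a simple rule: for each test utterance, the enrolled speaker with the highest score is taken as the prediction. A minimal sketch is given below; the score dictionaries stand in for PLDA scores, and the speaker labels and values are hypothetical.

```python
def classification_error(score_rows, true_speakers):
    """Fraction of trials whose top-scoring enrolled speaker is not the true one."""
    errors = 0
    for scores, truth in zip(score_rows, true_speakers):
        predicted = max(scores, key=scores.get)  # speaker with the highest score
        errors += predicted != truth
    return errors / len(true_speakers)

# One dictionary of per-speaker scores for each test utterance.
trials = [
    {"spk_a": 4.1, "spk_b": -0.3, "spk_c": 1.0},
    {"spk_a": 0.2, "spk_b": 2.5, "spk_c": 2.9},
    {"spk_a": -1.0, "spk_b": 0.4, "spk_c": 3.3},
]
truth = ["spk_a", "spk_b", "spk_c"]
error = classification_error(trials, truth)
```

Here the second utterance is assigned to the wrong speaker, so one of three trials is misclassified.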

The speaker speech feature extraction apparatus and its speaker speech feature extraction method according to an embodiment of the present invention have been described above.

The speaker speech feature extraction method of the present invention described above may also be provided in the form of a computer-readable medium suitable for storing computer program instructions and data.

Computer-readable media suitable for storing computer program instructions and data include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and semiconductor memory such as ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), and EEPROM (Electrically Erasable Programmable ROM). The processor and the memory may be supplemented by, or integrated into, special-purpose logic circuitry.

The computer-readable recording medium may also be distributed over network-connected computer systems so that computer-readable code is stored and executed in a distributed manner. In addition, functional programs for implementing the present invention, together with the related code and code segments, may be readily inferred or modified by programmers in the technical field to which the present invention pertains, taking into account the system environment of the computer that reads the recording medium and executes the program.

In addition, a computer program recorded on such a computer-readable recording medium includes instructions that perform the functions described above. The program may be distributed and circulated through the recording medium, and then read, installed, and executed on a particular device or computer to carry out those functions.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments, either separately or in any suitable subcombination. Moreover, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, in order to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

The present invention relates to a technique for extracting features from an input speech signal. It additionally generates variables that represent uncertainty and reliability, and configures a discriminator so that the distribution of the input speech takes more realistic values, thereby enabling stable speaker recognition performance even on short speech.

According to the present invention, the uncertainty of features extracted from short speech is estimated, and a deep learning structure is used to extract features that effectively represent speaker information that conventional linear methods could not represent. When applied to real speaker recognition applications that handle short speech, this can improve speaker recognition performance. Furthermore, because the proposed feature has the same inputs and outputs as the conventional i-vector, it can also be used for speaker adaptation in speech recognizers and synthesizers that previously used the i-vector, and for deep-learning-based speech recognizers and synthesizers it can be optimized for the task through joint training.

100: encoder    200: GMM generator
300: decoder    400: discriminator

Claims

A speaker speech feature extraction method for extracting features of a speaker speech signal by performing Baum-Welch statistics on the speaker speech signal under a UBM (Universal Background Model) condition, the method comprising:
performing Baum-Welch statistics on the speech signal under the UBM condition, and extracting a distribution of a random latent variable that depends on the input speech signal and represents a distribution characteristic of the input speech;
calculating a first GMM supervector by maximum a posteriori (MAP) estimation based on the Baum-Welch statistics computed on the speaker speech signal under the UBM condition;
extracting a sample from a prior probability distribution of the random latent variable;
calculating a second GMM supervector using the sample from the prior probability distribution of the random latent variable; and
classifying an input as generated from the input speech when the input is the pair of the distribution of the random latent variable and the first GMM supervector, and classifying the input as generated from the sample when the input is the pair of the sample from the prior probability distribution of the random latent variable and the second GMM supervector.

The method of claim 1,
wherein the log-likelihood of the input speech under the second Gaussian Mixture Model (GMM) supervector is calculated by Equation 1 below.
[Equation 1]

(where $\phi$ denotes the parameters of the encoder, $\theta$ the parameters of the decoder, $z$ the random latent variable, $C$ the number of Gaussians used to represent the GMM distribution, $\tilde{f}_c$ the first-order Baum-Welch statistic of the input speech for the $c$-th Gaussian, $D$ the dimension of the frame-level features of the input speech, $\Sigma_c$ the covariance matrix of the $c$-th Gaussian of the UBM distribution, $L$ the number of frames in the input speech, and $\hat{m}_c$ the mean vector of the $c$-th Gaussian of the estimated GMM distribution.)

The method of claim 1,
wherein calculating the first GMM supervector comprises calculating the first GMM supervector using Equation 2 below.
[Equation 2]

$$\hat{m}_c(x) = \frac{\tilde{f}_c(x)}{n_c(x)} + m_c$$

(where $x$ is the input speech, $\gamma_l(c)$ is the probability that the $l$-th frame feature $x_l$ of $x$ belongs to the $c$-th Gaussian of the UBM, $\tilde{f}_c(x) = \sum_l \gamma_l(c)\,(x_l - m_c)$ is the first-order Baum-Welch statistic, $n_c(x) = \sum_l \gamma_l(c)$ is the zeroth-order Baum-Welch statistic, and $m_c$ is the mean of the $c$-th Gaussian of the UBM.)

The method of claim 1,
wherein the classifying step uses deep neural network (DNN) learning with a sigmoid function.

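The method of claim 1,
wherein the classifying step uses Equation 3 below to output 1 when the input is the pair of the distribution of the random latent variable and the first GMM supervector,
and to output 0 when the input is the pair of the sample from the prior probability distribution of the random latent variable and the second GMM supervector.
[Equation 3]

$$V(D) = \mathbb{E}_{q_\phi(x, z)}\big[\log D(x, z)\big] + \mathbb{E}_{p_\theta(x, z)}\big[\log\big(1 - D(x, z)\big)\big]$$

(where $q_\phi(x, z)$ is the joint distribution of the latent variable and the data represented by the encoder, $p_\theta(x, z)$ is the joint distribution represented by the decoder, and $D(x, z)$ is the output value of the discriminator.)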

The method of claim 1, wherein calculating the second GMM supervector
uses Equation 4 below as an objective function in order to maximize the log-likelihood of the second GMM supervector.
[Equation 4]

(where $p_\theta$ and $q_\phi$ each denote a joint distribution of the random latent variable $z$ and the GMM distribution, $S$ is the number of samples used for the Monte Carlo estimate, and $z_s$ is a latent variable sample drawn from the distribution of the random latent variable given the input speech.)

A speaker speech feature extraction apparatus comprising:
an encoder configured to perform Baum-Welch statistics on a speaker speech signal under a UBM (Universal Background Model) condition and to extract a distribution of a random latent variable that depends on the input speech signal and represents a distribution characteristic of the input speech;
a GMM generator configured to calculate a first Gaussian Mixture Model (GMM) supervector by maximum a posteriori (MAP) estimation using the Baum-Welch statistics computed on the speaker speech signal under the UBM condition;
a decoder configured to extract a sample from a prior probability distribution of the random latent variable and to calculate a second Gaussian Mixture Model (GMM) supervector using the sample; and
a discriminator configured to classify an input as generated from the input speech when the input is the pair of the distribution of the random latent variable and the first GMM supervector, and to classify the input as generated from the sample when the input is the pair of the sample from the prior probability distribution of the random latent variable and the second GMM supervector.

The apparatus of claim 7,
wherein the discriminator calculates its output value using deep neural network (DNN) learning with a sigmoid function.

The apparatus of claim 8,
wherein the discriminator,
using Equation 3 below as an objective function,
outputs 1 when the input is the pair of the distribution of the random latent variable and the first GMM supervector, and
outputs 0 when the input is the pair of the sample from the prior probability distribution of the random latent variable and the second GMM supervector;
and wherein Equation 4 below is used as an objective function to maximize the log-likelihood of the second GMM supervector.
[Equation 3]

$$V(D) = \mathbb{E}_{q_\phi(x, z)}\big[\log D(x, z)\big] + \mathbb{E}_{p_\theta(x, z)}\big[\log\big(1 - D(x, z)\big)\big]$$

(where $q_\phi(x, z)$ is the joint distribution of the latent variable and the data represented by the encoder, $p_\theta(x, z)$ is the joint distribution represented by the decoder, and $D(x, z)$ is the output value of the discriminator.)
[Equation 4]

(where $p_\theta$ and $q_\phi$ each denote a joint distribution of the random latent variable $z$ and the GMM distribution, and $S$ is the number of samples used for the Monte Carlo estimate.)

A computer-readable recording medium storing a program for executing the speaker speech feature extraction method according to any one of claims 1 to 6.