KR102405163B1

KR102405163B1 - Apparatus and method for unsupervised pretraining of a speaker embedding extraction system using a mutual information neural estimator, computer-readable storage medium, and computer program

Info

Publication number: KR102405163B1
Application number: KR1020200091037A
Authority: KR
Inventors: 나선필; 김남수; 한민현; 김형용; 김석민; 손병찬
Original assignee: 국방과학연구소
Priority date: 2020-07-22
Filing date: 2020-07-22
Publication date: 2022-06-08
Also published as: KR20220012473A

Abstract

The speaker embedding extraction apparatus of the embodiment may include a frame-level feature vector extractor that extracts frame-level feature vectors using a first deep learning model that takes a speech feature vector as input; a sentence-level feature vector extractor that receives the frame-level feature vectors extracted by the frame-level feature vector extractor and converts them into a sentence-level feature vector; and a mutual information estimator that estimates mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input.
Because the embodiment does not require speaker label data, the speaker embedding extraction model can be trained in an unsupervised manner on unlabeled speech data, which is relatively easy to obtain.

Description

Apparatus and method for extracting speaker embedding using mutual information amount estimation, computer-readable recording medium, and computer program

The embodiment relates to a pre-training technique for a speaker embedding extraction model, and more particularly to a technique for training a speaker embedding extraction model when a large amount of unlabeled data is available.

In general, speaker recognition is a technique that analyzes the characteristics of a given speech signal to determine the identity of the speaker who uttered it. In a typical speaker recognition process, a fixed-length feature vector called a speaker embedding is extracted and then compared with previously enrolled speaker embeddings to determine the speaker's identity.
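The comparison step described above can be illustrated with a minimal sketch. The text does not say how embeddings are compared; cosine similarity, the threshold value, and the names `cosine_score`, `identify`, and `enrolled` are all assumptions chosen for illustration.

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify(test_emb, enrolled, threshold=0.5):
    """Return (best-matching enrolled speaker or None, best score).

    The threshold is an illustrative assumption, not a value from the text.
    """
    best_name, best_score = None, -1.0
    for name, emb in enrolled.items():
        score = cosine_score(test_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name if best_score >= threshold else None), best_score

# Toy enrolled embeddings (placeholders, not real speaker data).
enrolled = {
    "spk_a": np.array([1.0, 0.0, 0.0]),
    "spk_b": np.array([0.0, 1.0, 0.0]),
}
name, score = identify(np.array([0.9, 0.1, 0.0]), enrolled)
```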

Conventionally, statistics-based speaker embeddings such as the i-vector were mainly used, but with recent advances in deep learning they are being replaced by neural-network-based speaker embedding extraction techniques. However, such deep-learning-based speaker embedding extraction systems have the drawback that high performance is guaranteed only when a large amount of labeled training data is available.

Korean Patent No. 10-1843079

To solve the above problem, the embodiment aims to provide an apparatus and method for extracting speaker embeddings using mutual information estimation.

The speaker embedding extraction apparatus of the embodiment may include a frame-level feature vector extractor that extracts frame-level feature vectors using a first deep learning model that takes a speech feature vector as input; a sentence-level feature vector extractor that receives the frame-level feature vectors extracted by the frame-level feature vector extractor and converts them into a sentence-level feature vector; and a mutual information estimator that estimates mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input.

The speech feature vector may include MFCCs and a spectrogram.

The sentence-level feature vector extractor may convert the frame-level feature vectors into the sentence-level feature vector using a pooling technique.

The first deep learning model and the second deep learning model may include an FCN, a CNN, and an RNN.

The mutual information estimator may estimate the mutual information using a Global Information Maximization (GIM) technique.

The mutual information estimator may estimate the mutual information using a Local Information Maximization (LIM) technique.

The mutual information estimator may estimate the mutual information using both the GIM technique and the LIM technique.

The mutual information estimator may use any one of the Donsker-Varadhan representation (DVR), binary cross entropy (BCE), or noise contrastive estimation (NCE) as an objective function, and may be trained in the direction that maximizes the mutual information.

In addition, the embodiment provides a speaker embedding extraction method performed in a speaker embedding extraction apparatus, which may include: extracting frame-level feature vectors using a first deep learning model that takes a speech feature vector as input; receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and estimating mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input.

Because the embodiment does not require speaker label data, the speaker embedding extraction model can be trained in an unsupervised manner on unlabeled speech data, which is relatively easy to obtain.

In addition, the trained embedding extraction apparatus can later be fine-tuned using labeled speech data acquired afterwards.

FIG. 1 is a block diagram illustrating a speaker embedding extraction apparatus using mutual information estimation according to an embodiment.
FIG. 2 is a flowchart illustrating a speaker embedding extraction method using mutual information estimation according to an embodiment.

Hereinafter, the embodiment will be described in detail with reference to the drawings.

FIG. 1 is a block diagram illustrating a speaker embedding extraction apparatus using mutual information estimation according to an embodiment.

Referring to FIG. 1, the speaker embedding extraction apparatus 1000 using mutual information estimation according to the embodiment may include a frame-level feature vector extractor 100, a sentence-level feature vector extractor 200, and a mutual information estimator 300.

The frame-level feature vector extractor 100 may extract frame-level feature vectors V2 from a speech feature vector V1. The frame-level feature vector extractor 100 may extract the frame-level feature vectors V2 using a first deep learning model, and the speech feature vector V1 may be used as the input of the first deep learning model.

The speech feature vector V1 may be extracted from speech. The speech feature vector V1 represents, as a low-dimensional vector, the various kinds of variability present in speech due to the speaker, recording conditions, noise, and so on, and may include MFCCs (mel-frequency cepstral coefficients), a spectrogram, and the like. For example, MFCCs are numerical values that represent the distinctive characteristics of speech.
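As a rough illustration of how a spectrogram-type speech feature V1 could be obtained from a waveform, the sketch below computes a log-magnitude spectrogram with NumPy. The frame length, hop size, FFT size, Hann window, and the function name `log_spectrogram` are assumptions made for this example; the text does not specify any of them.

```python
import numpy as np

def log_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Frame the waveform, window each frame, and take the log-magnitude FFT.

    All parameter values are illustrative assumptions (25 ms frames and
    10 ms hops at a 16 kHz sampling rate).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spectrum + 1e-8)   # shape: (n_frames, n_fft // 2 + 1)

# One second of a 440 Hz tone as a stand-in for recorded speech.
sr = 16000
t = np.arange(sr) / sr
V1 = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
```

Each row of `V1` is the feature vector of one frame, so the number of rows grows with the length of the recording.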

The first deep learning model may include, but is not limited to, a fully connected neural network (FCN), a convolutional neural network (CNN), or a recurrent neural network (RNN). The first deep learning model may be trained by first jointly training the two models and the mutual information estimator network on unlabeled data so as to maximize the mutual information, and then fine-tuning on labeled data.

The first deep learning model may take the speech feature vector V1 as input and extract vectors that represent the characteristics of a short time span, that is, of a frame.
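A minimal sketch of a frame-level extractor, assuming the simplest of the listed model types (a fully connected network applied to each frame independently). The layer sizes, random untrained weights, and the names `init_frame_encoder` and `frame_encoder` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_frame_encoder(in_dim, hidden_dim, out_dim):
    """Randomly initialised two-layer network (untrained placeholder weights)."""
    return {
        "W1": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "W2": rng.normal(0, 0.1, (hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }

def frame_encoder(params, V1):
    """Map (n_frames, in_dim) to (n_frames, out_dim), one output per frame."""
    hidden = np.maximum(0.0, V1 @ params["W1"] + params["b1"])  # ReLU
    return hidden @ params["W2"] + params["b2"]

params = init_frame_encoder(in_dim=40, hidden_dim=64, out_dim=32)
V1 = rng.normal(size=(120, 40))      # 120 frames of 40-dim features (placeholder)
V2 = frame_encoder(params, V1)       # frame-level feature vectors
```

A CNN or RNN would additionally mix information across neighbouring frames; the per-frame network here is only the shortest way to show the input and output shapes.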

The sentence-level feature vector extractor 200 may receive the frame-level feature vectors V2 and convert them into a sentence-level feature vector V3 of fixed dimension. The sentence-level feature vector extractor 200 may perform this conversion using a pooling technique. The pooling technique may include, but is not limited to, average pooling, statistics pooling, and attention-based pooling.
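The three pooling variants named above can be sketched as follows. These are common textbook forms, not necessarily the exact formulations used in the embodiment; in particular, the single attention vector `w` and the function names are assumptions.

```python
import numpy as np

def average_pooling(V2):
    """Mean over frames: (n_frames, d) -> (d,)."""
    return V2.mean(axis=0)

def statistics_pooling(V2):
    """Concatenate per-dimension mean and standard deviation: (n_frames, d) -> (2d,)."""
    return np.concatenate([V2.mean(axis=0), V2.std(axis=0)])

def attention_pooling(V2, w):
    """Weighted mean over frames, with softmax weights from a scoring vector w."""
    scores = V2 @ w
    scores = scores - scores.max()               # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ V2

rng = np.random.default_rng(0)
short_utt = rng.normal(size=(50, 32))    # 50 frames (placeholder data)
long_utt = rng.normal(size=(300, 32))    # 300 frames (placeholder data)
w = rng.normal(size=32)                  # hypothetical attention parameters

V3_short = average_pooling(short_utt)
V3_long = average_pooling(long_utt)
```

Whatever the number of input frames, each function returns a vector of the same length, which is what makes the result usable as a fixed-dimension embedding.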

The feature vector converted to a fixed dimension may then be used as a sentence-level feature vector representing the characteristics of the input sentence, or as a speaker embedding.

The mutual information estimator 300 may estimate the mutual information using the frame-level feature vectors V2 extracted by the frame-level feature vector extractor 100 and the sentence-level feature vector V3 extracted by the sentence-level feature vector extractor 200. The mutual information estimator 300 may estimate the mutual information using a second deep learning model.
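One way to picture the second deep learning model is as a scoring network that takes a frame-level vector together with a sentence-level vector and outputs a scalar. The sketch below assumes a small fully connected network over the concatenated pair; the architecture, sizes, untrained weights, and the names `init_estimator` and `estimator_scores` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_estimator(frame_dim, sent_dim, hidden_dim):
    """Randomly initialised scoring network (untrained placeholder weights)."""
    in_dim = frame_dim + sent_dim
    return {
        "W1": rng.normal(0, 0.1, (in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0, 0.1, hidden_dim),
        "b2": 0.0,
    }

def estimator_scores(params, V2, V3):
    """Score every frame vector in V2 (n_frames, d) against one sentence vector V3."""
    tiled = np.broadcast_to(V3, (V2.shape[0], V3.shape[0]))
    pairs = np.concatenate([V2, tiled], axis=1)            # (n_frames, d + d_s)
    hidden = np.maximum(0.0, pairs @ params["W1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]            # (n_frames,)

V2 = rng.normal(size=(120, 32))     # placeholder frame-level vectors
V3 = V2.mean(axis=0)                # sentence-level vector via average pooling
est_params = init_estimator(frame_dim=32, sent_dim=32, hidden_dim=64)
scores = estimator_scores(est_params, V2, V3)
```

The scores themselves are not the mutual information; they are the quantities that an objective function turns into an estimate during training.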

The second deep learning model may include, but is not limited to, a fully connected neural network (FCN), a convolutional neural network (CNN), or a recurrent neural network (RNN). The second deep learning model may be trained by first jointly training the two models and the mutual information estimator network on unlabeled data so as to maximize the mutual information, and then fine-tuning on labeled data.

Training may proceed in the direction that maximizes the mutual information by using, as the objective function, the Donsker-Varadhan representation (DVR), which expresses a lower bound on the mutual information. Alternatively, binary cross entropy (BCE) or noise contrastive estimation (NCE), objective functions that play a role similar to that of the DVR, may be used.
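To make the objective functions concrete: the Donsker-Varadhan bound states that the mutual information is at least the mean score over matched pairs minus the log of the mean exponentiated score over mismatched pairs. The sketch below computes that bound and a BCE counterpart from two arrays of scores. How matched and mismatched pairs are sampled, and the function names, are assumptions; NCE is omitted for brevity.

```python
import numpy as np

def dv_lower_bound(pos_scores, neg_scores):
    """Donsker-Varadhan bound: E_joint[T] - log E_marginals[exp(T)].

    pos_scores: scores of matched (frame, sentence) pairs.
    neg_scores: scores of mismatched pairs. To be maximised.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    shift = neg.max()                                   # log-mean-exp, stabilised
    log_mean_exp = shift + np.log(np.mean(np.exp(neg - shift)))
    return float(pos.mean() - log_mean_exp)

def bce_objective(pos_scores, neg_scores):
    """Binary cross entropy with matched pairs as class 1, mismatched as class 0.

    Returned as a loss to be minimised.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # -log(sigmoid(x)) = log(1 + exp(-x)); -log(1 - sigmoid(x)) = log(1 + exp(x))
    return float(np.mean(np.logaddexp(0.0, -pos)) + np.mean(np.logaddexp(0.0, neg)))

bound_uninformative = dv_lower_bound([0.0, 0.0], [0.0, 0.0])
bound_separated = dv_lower_bound([2.0, 2.0], [0.0, 0.0])
```

When the scores cannot tell matched from mismatched pairs the bound is zero, and it grows as the scoring network learns to separate them.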

The mutual information estimator 300 may use a Global Information Maximization (GIM) technique, which takes all of the frame-level feature vectors as input and maximizes the overall mutual information.

Alternatively, the mutual information estimator 300 may use a Local Information Maximization (LIM) technique, which takes single frame-level feature vectors as input and maximizes the average mutual information.

Alternatively, the mutual information estimator 300 may use the GIM technique and the LIM technique at the same time.
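The difference between GIM and LIM lies in what is fed to the estimator. The sketch below is one possible reading: for GIM all frame vectors of an utterance are flattened into a single input, for LIM each frame is scored on its own. The bilinear score functions, the use of mismatched utterances as negative pairs, the Donsker-Varadhan objective, and the equal weighting of the two terms are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, dim = 20, 8

# Hypothetical stand-ins for the second deep learning model (bilinear scores).
W_local = rng.normal(0, 0.1, (dim, dim))
W_global = rng.normal(0, 0.1, (n_frames * dim, dim))

def lim_scores(V2, V3):
    """LIM-style input: one score per single frame vector, paired with V3."""
    return V2 @ W_local @ V3                          # shape (n_frames,)

def gim_score(V2, V3):
    """GIM-style input: all frame vectors at once (flattened), one score."""
    return float(V2.reshape(-1) @ W_global @ V3)

def dv_bound(pos, neg):
    """Donsker-Varadhan bound from matched (pos) and mismatched (neg) scores."""
    pos, neg = np.asarray(pos, dtype=float), np.asarray(neg, dtype=float)
    return float(pos.mean() - np.log(np.mean(np.exp(neg))))

# Two placeholder utterances; V3 is obtained here by average pooling.
utt_a, utt_b = rng.normal(size=(2, n_frames, dim))
emb_a, emb_b = utt_a.mean(axis=0), utt_b.mean(axis=0)

# Matched pairs come from the same utterance, mismatched pairs from different ones.
lim = dv_bound(
    np.concatenate([lim_scores(utt_a, emb_a), lim_scores(utt_b, emb_b)]),
    np.concatenate([lim_scores(utt_a, emb_b), lim_scores(utt_b, emb_a)]),
)
gim = dv_bound(
    [gim_score(utt_a, emb_a), gim_score(utt_b, emb_b)],
    [gim_score(utt_a, emb_b), gim_score(utt_b, emb_a)],
)
combined = gim + lim   # both techniques together, equally weighted (an assumption)
```

Flattening requires a fixed number of frames per input, which is why `n_frames` is held constant here; a real GIM estimator would need some other way to handle variable-length utterances.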

Because the embodiment does not require speaker label data, the speaker embedding extraction model can be trained in an unsupervised manner on unlabeled speech data, which is relatively easy to obtain.

FIG. 2 is a flowchart illustrating a speaker embedding extraction method using mutual information estimation according to an embodiment.

Referring to FIG. 2, the speaker embedding extraction method according to the embodiment may include: extracting frame-level feature vectors using a first deep learning model that takes a speech feature vector as input (S100); receiving the frame-level feature vectors and converting them into a sentence-level feature vector (S200); and estimating mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input (S300). Here, the speaker embedding extraction method may be performed by the speaker embedding extraction apparatus.
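The data flow through steps S100, S200, and S300 can be sketched end to end as below. Each step is reduced to a single placeholder operation with random, untrained weights; the function names, dimensions, and the specific operations (a tanh layer, average pooling, a bilinear score) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.normal(0, 0.1, (40, 16))   # hypothetical first-model weights
W_est = rng.normal(0, 0.1, (16, 16))   # hypothetical second-model weights

def s100_extract_frame_vectors(V1):
    """S100: speech features (n_frames, 40) -> frame-level vectors (n_frames, 16)."""
    return np.tanh(V1 @ W_enc)

def s200_to_sentence_vector(V2):
    """S200: frame-level vectors -> one fixed-dimension sentence-level vector."""
    return V2.mean(axis=0)

def s300_score_pairs(V2, V3):
    """S300: score each (frame vector, sentence vector) pair for the estimator."""
    return V2 @ W_est @ V3

V1 = rng.normal(size=(75, 40))          # placeholder speech feature vectors
V2 = s100_extract_frame_vectors(V1)
V3 = s200_to_sentence_vector(V2)        # usable as the speaker embedding
scores = s300_score_pairs(V2, V3)
```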

The step of extracting frame-level feature vectors using the first deep learning model that takes a speech feature vector as input (S100) may be performed by the frame-level feature vector extractor.

The speech feature vector may be extracted from speech. The speech feature vector represents, as a low-dimensional vector, the various kinds of variability present in speech due to the speaker, recording conditions, noise, and so on, and may include MFCCs, a spectrogram, and the like. For example, MFCCs are numerical values that represent the distinctive characteristics of speech.

The step of receiving the frame-level feature vectors and converting them into a sentence-level feature vector (S200) may be performed by the sentence-level feature vector extractor.

In the converting step (S200), the frame-level feature vectors may be converted into the sentence-level feature vector using a pooling technique. The pooling technique may include, but is not limited to, average pooling, statistics pooling, and attention-based pooling.

The step of estimating mutual information using the second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input (S300) may be performed by the mutual information estimator.

The step of estimating mutual information (S300) may use a Global Information Maximization (GIM) technique, which takes all of the frame-level feature vectors as input and maximizes the overall mutual information. Alternatively, the mutual information estimator may use a Local Information Maximization (LIM) technique, which takes single frame-level feature vectors as input and maximizes the average mutual information.

Alternatively, the mutual information estimator may use the GIM technique and the LIM technique at the same time.

Various embodiments of this document may be implemented as software (e.g., a program) including instructions stored in a machine-readable (e.g., computer-readable) storage medium (e.g., memory, whether internal or external). The machine is a device that can call the stored instructions from the storage medium and operate according to the called instructions, and may include an electronic device according to the disclosed embodiments. When the instructions are executed by a controller, the controller may perform the functions corresponding to the instructions directly, or by using other components under its control. The instructions may include code generated or executed by a compiler or an interpreter. The machine-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored semi-permanently and data being stored temporarily on the storage medium.

According to an embodiment, the methods according to the various embodiments disclosed in this document may be provided as part of a computer program product.

According to one embodiment, instructions may be included for causing a processor to perform a method comprising: extracting frame-level feature vectors using a first deep learning model that takes a speech feature vector as input; receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and estimating mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input.

According to one embodiment, a computer program stored on a computer-readable recording medium may include instructions for causing a processor to perform a method comprising: extracting frame-level feature vectors using a first deep learning model that takes a speech feature vector as input; receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and estimating mutual information using a second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input.

Although the above description refers to the drawings and embodiments, those skilled in the art will understand that the embodiments can be modified and changed in various ways without departing from the technical spirit of the embodiments set forth in the claims below.

100: frame-level feature vector extractor
200: sentence-level feature vector extractor
300: mutual information estimator

Claims

A speaker embedding extraction apparatus comprising:
a frame-level feature vector extractor that extracts frame-level feature vectors using a pre-trained first deep learning model that takes a speech feature vector as input;
a sentence-level feature vector extractor that receives the frame-level feature vectors extracted by the frame-level feature vector extractor and converts them into a sentence-level feature vector; and
a mutual information estimator that estimates mutual information using a pre-trained second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input,
wherein the pre-trained first deep learning model and second deep learning model are trained by unsupervised learning and are jointly trained to maximize the mutual information.

The apparatus of claim 1,
wherein the speech feature vector includes MFCCs and a spectrogram.

The apparatus of claim 1,
wherein the sentence-level feature vector extractor converts the frame-level feature vectors into the sentence-level feature vector using a pooling technique.

The apparatus of claim 1,
wherein the first deep learning model and the second deep learning model include an FCN, a CNN, and an RNN.

The apparatus of claim 1,
wherein the mutual information estimator estimates the mutual information using a Global Information Maximization (GIM) technique.

The apparatus of claim 1,
wherein the mutual information estimator estimates the mutual information using a Local Information Maximization (LIM) technique.

The apparatus of claim 1,
wherein the mutual information estimator estimates the mutual information using a GIM technique and a LIM technique.

The apparatus of claim 1,
wherein the mutual information estimator uses any one of the Donsker-Varadhan representation (DVR), binary cross entropy (BCE), or noise contrastive estimation (NCE) as an objective function and is trained in the direction that maximizes the mutual information.

A method for extracting speaker embeddings performed in a speaker embedding extraction apparatus, the method comprising:
extracting frame-level feature vectors using a pre-trained first deep learning model that takes a speech feature vector as input;
receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and
estimating mutual information using a pre-trained second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input,
wherein the pre-trained first deep learning model and second deep learning model are trained by unsupervised learning and are jointly trained to maximize the mutual information.

A computer-readable recording medium storing a computer program,
wherein the computer program includes instructions for causing a processor to perform a method comprising:
extracting frame-level feature vectors using a pre-trained first deep learning model that takes a speech feature vector as input;
receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and
estimating mutual information using a pre-trained second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input,
wherein the pre-trained first deep learning model and second deep learning model are trained by unsupervised learning and are jointly trained to maximize the mutual information.

A computer program stored on a computer-readable recording medium,
wherein the computer program includes instructions for causing a processor to perform a method comprising:
extracting frame-level feature vectors using a pre-trained first deep learning model that takes a speech feature vector as input;
receiving the frame-level feature vectors and converting them into a sentence-level feature vector; and
estimating mutual information using a pre-trained second deep learning model that takes the frame-level feature vectors and the sentence-level feature vector as input,
wherein the pre-trained first deep learning model and second deep learning model are trained by unsupervised learning and are jointly trained to maximize the mutual information.