KR102429656B1

KR102429656B1 - A speaker embedding extraction method and system for automatic speech recognition based pooling method for speaker recognition, and recording medium therefor

Info

Publication number: KR102429656B1
Application number: KR1020200130575A
Authority: KR
Inventors: 김남수; 문성환
Original assignee: 서울대학교산학협력단
Priority date: 2020-10-08
Filing date: 2020-10-08
Publication date: 2022-08-08
Also published as: KR20220047080A; WO2022075714A1

Abstract

The present invention relates to a speaker embedding extraction method and system using an automatic speech recognition (ASR)-based pooling technique for speaker recognition, and to a recording medium therefor. More specifically, the speaker embedding extraction method comprises: (1) extracting, by a frame-level feature information extraction unit, feature information of an input speech signal on a frame-by-frame basis; (2) extracting, by a frame-level character probability distribution extraction unit, character probability distribution information of the input speech on a frame-by-frame basis; (3) extracting, by a pooling processing unit, a fixed-dimensional speaker embedding vector from the feature information of step (1) and the character probability distribution information of step (2); and (4) processing, by a speaker embedding extraction unit, the fixed-dimensional speaker embedding vector extracted in step (3) to output speaker information and utterance sentence information, respectively.
In addition, the speaker embedding extraction system using an ASR-based pooling technique for speaker recognition according to a feature of the present invention comprises: a frame-level feature information extraction unit that extracts feature information of an input speech signal on a frame-by-frame basis; a frame-level character probability distribution extraction unit that operates simultaneously with the frame-level feature information extraction unit and extracts character probability distribution information of the input speech on a frame-by-frame basis; a pooling processing unit that receives the feature information from the frame-level feature information extraction unit and the character probability distribution information from the frame-level character probability distribution extraction unit and extracts a fixed-dimensional speaker embedding vector; and a speaker embedding extraction unit that processes the fixed-dimensional speaker embedding vector received from the pooling processing unit to output speaker information and utterance sentence information, respectively.
According to the speaker embedding extraction method and system using an ASR-based pooling technique for speaker recognition proposed in the present invention, and the recording medium therefor, feature information and character probability distribution information of the input speech are each extracted on a frame-by-frame basis, a fixed-dimensional speaker embedding vector is extracted from the two, and that vector is processed to output speaker information and utterance sentence information, respectively. Unlike existing techniques that consider only speaker information during speaker embedding extraction, the frame-level outputs are aggregated separately for each character. This makes it possible to compare the features of specific pronunciations when the similarity between speaker embeddings is computed at the inference stage, so that a speaker recognition system can take speaker information and utterance sentence information into account at the same time.
In addition, according to the present invention, an automatic speech recognition (ASR) model is used to estimate a frame-level probability distribution over characters from the speech, and this distribution is used in the pooling process so that utterance sentence information is extracted together with speaker information in the speaker embedding extraction step. Model structures widely used in the speaker recognition field can be applied and easily replaced with a structure suited to the purpose, so the invention can be used in various fields and applied broadly to speaker recognition and biosignal recognition tasks that extract speaker information and utterance sentence information together.

Description

A SPEAKER EMBEDDING EXTRACTION METHOD AND SYSTEM FOR AUTOMATIC SPEECH RECOGNITION BASED POOLING METHOD FOR SPEAKER RECOGNITION, AND RECORDING MEDIUM THEREFOR

The present invention relates to a speaker embedding extraction method and system using an automatic speech recognition (ASR)-based pooling technique for speaker recognition, and to a recording medium therefor. More specifically, unlike existing techniques that consider only speaker information during speaker embedding extraction, the invention uses an ASR model to estimate a frame-level probability distribution over characters from speech and uses that distribution in the pooling process, so that utterance sentence information is extracted together with speaker information in the speaker embedding extraction step.

In general, speaker recognition is the technical field of identifying which registered speaker an input voice belongs to. Speaker recognition consists of a training process, in which speaker characteristics are learned from a large amount of data; an enrollment process, in which the trained model is used to store information about specific speakers in the system in advance; and a test process, in which incoming speech is compared with the enrolled speech to decide whether both come from the same speaker.

In the field of speaker recognition, a fixed-dimensional feature vector called a speaker embedding is widely used to extract speaker information and to compare and decide on it during the training, enrollment, and test processes. This technique extracts information that represents the speaker from speech as a fixed-dimensional vector, and determines whether two utterances come from the same person by computing the similarity between embeddings during enrollment and testing.
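The similarity computation between an enrolled embedding and a test embedding can be illustrated with a short sketch. The patent text does not name a similarity measure; cosine similarity is assumed here because it is a common choice, and the function names and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(enrolled, test, threshold=0.7):
    """Accept the test utterance if it is similar enough to the enrolled one.

    The threshold is a hypothetical value; in practice it would be tuned
    on held-out data.
    """
    return cosine_similarity(enrolled, test) >= threshold
```

Because both embeddings have the same fixed dimension regardless of utterance length, the comparison reduces to a single vector operation.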

Recently, with advances in deep learning and the availability of large datasets, deep learning-based speaker embedding techniques have achieved remarkable performance gains, and various studies aimed at more effective speaker embedding extraction (deep neural network architectures, loss function modeling, pooling techniques, and so on) are under way. Among these, the pooling technique for speaker embedding extraction summarizes frame-level features into a fixed-dimensional vector, and how well it aggregates the information useful for speaker recognition directly affects performance. However, these studies share the limitation that only the speaker's identity is considered during training. When comparing two speakers, various factors such as the speaker's speaking characteristics and the content of the uttered sentence need to be considered in addition to identity, but existing speaker embedding extraction methods consider only speaker information, which limits speaker recognition performance.

In the training process of an existing speaker embedding extraction system, the input speech passes through the network, and the network is trained so that its output predicts the correct speaker. In other words, training is supervised and relies on speaker labels, and only speaker identification information is used to extract the speaker embedding. However, speech is sequential and also contains a variety of information other than speaker information (noise, silence, environment, recording device, language, and so on), so more effective speaker recognition calls for an approach that places greater emphasis on the important information.

The present invention has been proposed to solve the above problems of previously proposed methods. Its object is to provide a speaker embedding extraction method and system using an ASR-based pooling technique for speaker recognition, and a recording medium therefor, in which feature information and character probability distribution information of the input speech are each extracted on a frame-by-frame basis, a fixed-dimensional speaker embedding vector is extracted from the two, and that vector is processed to output speaker information and utterance sentence information, respectively. Unlike existing techniques that consider only speaker information during speaker embedding extraction, the frame-level outputs are aggregated separately for each character, so that the features of specific pronunciations can be compared when the similarity between speaker embeddings is computed at the inference stage. This allows a speaker recognition system to take speaker information and utterance sentence information into account at the same time.

Another object of the present invention is to provide a speaker embedding extraction method and system using an ASR-based pooling technique for speaker recognition, and a recording medium therefor, in which an automatic speech recognition (ASR) model is used to estimate a frame-level probability distribution over characters from speech, and this distribution is used in the pooling process so that utterance sentence information is extracted together with speaker information in the speaker embedding extraction step. Model structures widely used in the speaker recognition field can be applied and easily replaced with a structure suited to the purpose, so the invention can be used in various fields and applied broadly to speaker recognition and biosignal recognition tasks that extract speaker information and utterance sentence information together.

To achieve the above objects, a speaker embedding extraction method using an ASR-based pooling technique for speaker recognition according to a feature of the present invention comprises:

(1) extracting, by a frame-level feature information extraction unit, feature information of an input speech signal on a frame-by-frame basis;

(2) extracting, by a frame-level character probability distribution extraction unit, character probability distribution information of the input speech on a frame-by-frame basis;

(3) extracting, by a pooling processing unit, a fixed-dimensional speaker embedding vector from the feature information of step (1) and the character probability distribution information of step (2); and

(4) processing, by a speaker embedding extraction unit, the fixed-dimensional speaker embedding vector extracted in step (3) to output speaker information and utterance sentence information, respectively.

Preferably, in step (1),

the frame-level feature information extraction unit may be implemented as any one of a TDNN, CNN, or RNN deep learning model structure capable of extracting feature information of the input speech on a frame-by-frame basis.

Preferably, in step (2),

the frame-level character probability distribution extraction unit uses a speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and may be implemented as a TDNN-, CNN-, or RNN-based model or a deep learning-based end-to-end speech recognition model.
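As an illustration of what "frame-level character probability distribution information" looks like, the sketch below assumes that the recognizer's output layer emits one score (logit) per character for every frame and that a softmax turns each frame's scores into a distribution. The function names are hypothetical, and the ASR network that produces the logits is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one frame's character scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_posteriors(frame_logits):
    """Map a T x C list of logits to T probability distributions over C characters."""
    return [softmax(row) for row in frame_logits]
```

Each row of the result sums to one, so it can be read directly as the probability that a frame belongs to each character.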

Preferably, step (3) may comprise:

(3-1) receiving, by the pooling processing unit, the feature information of step (1) and the character probability distribution information of step (2) as simultaneous inputs, and aggregating them separately for each character;

(3-2) after step (3-1), computing a weighted sum using, as weights, the probability that the feature information of each frame belongs to a specific pronunciation; and

(3-3) after step (3-2), computing a fixed-dimensional vector by processing the utterance information of each specific word separately.
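Steps (3-1) to (3-3) can be sketched as a posterior-weighted aggregation. This is a minimal illustration under stated assumptions, not the patented implementation: the weighted sum for each character is normalized by the total weight and the per-character results are concatenated, neither of which is specified in the text. The function name and the `eps` parameter are hypothetical.

```python
def char_wise_pooling(features, posteriors, eps=1e-8):
    """Pool T frame-level features into one vector of length C * D.

    features:   T x D list, one D-dimensional feature vector per frame
    posteriors: T x C list, one distribution over C characters per frame
    """
    if not features or len(features) != len(posteriors):
        raise ValueError("features and posteriors must cover the same frames")
    num_chars = len(posteriors[0])
    dim = len(features[0])
    pooled = []
    for c in range(num_chars):
        # Weight of every frame for character c.
        weights = [p[c] for p in posteriors]
        total = sum(weights)
        for d in range(dim):
            weighted = sum(w * f[d] for w, f in zip(weights, features))
            # Normalizing by the total weight is an assumption of this sketch.
            pooled.append(weighted / (total + eps))
    return pooled
```

The output length depends only on the number of characters and the feature dimension, not on the number of frames, which is what makes the resulting vector fixed-dimensional.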

More preferably, in step (4),

the speaker embedding extraction unit processes the fixed-dimensional speaker embedding vector extracted in step (3) to output speaker information and utterance sentence information, respectively, where each output passes through two layers, one locally-connected and one fully-connected, followed by a softmax function, and is trained with a multi-class cross-entropy objective function.
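The multi-class cross-entropy objective for the two outputs can be sketched as follows. The text states only that each output is trained with multi-class cross-entropy; combining the two losses as a weighted sum, and the names `joint_loss` and `phrase_weight`, are assumptions made for illustration. The locally-connected and fully-connected layers that produce the logits are not shown.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of softmax(logits) against an integer class label."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


def joint_loss(speaker_logits, speaker_id, phrase_logits, phrase_id,
               phrase_weight=1.0):
    """Sum of the speaker loss and the (weighted) utterance-sentence loss.

    The weighted sum is an assumption of this sketch, not taken from the source.
    """
    return (cross_entropy(speaker_logits, speaker_id)
            + phrase_weight * cross_entropy(phrase_logits, phrase_id))
```

With uniform logits the loss for each head equals the logarithm of its number of classes, which is a convenient sanity check before training.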

To achieve the above objects, a speaker embedding extraction system using an ASR-based pooling technique for speaker recognition according to a feature of the present invention comprises:

a frame-level feature information extraction unit that extracts feature information of an input speech signal on a frame-by-frame basis;

a frame-level character probability distribution extraction unit that operates simultaneously with the frame-level feature information extraction unit and extracts character probability distribution information of the input speech on a frame-by-frame basis;

a pooling processing unit that receives the feature information from the frame-level feature information extraction unit and the character probability distribution information from the frame-level character probability distribution extraction unit and extracts a fixed-dimensional speaker embedding vector; and

a speaker embedding extraction unit that processes the fixed-dimensional speaker embedding vector received from the pooling processing unit to output speaker information and utterance sentence information, respectively.

Preferably, the frame-level feature information extraction unit

may be implemented as any one of a TDNN, CNN, or RNN deep learning model structure capable of extracting feature information of the input speech on a frame-by-frame basis.

Preferably, the frame-level character probability distribution extraction unit

uses a speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and may be implemented as a TDNN-, CNN-, or RNN-based model or a deep learning-based end-to-end speech recognition model.

Preferably, the pooling processing unit

receives the feature information from the frame-level feature information extraction unit and the character probability distribution information from the frame-level character probability distribution extraction unit as simultaneous inputs, aggregates them separately for each character, computes a weighted sum using, as weights, the probability that the feature information of each frame belongs to a specific pronunciation, and then computes a fixed-dimensional vector by processing the utterance information of each specific word separately.

More preferably, the speaker embedding extraction unit

processes the fixed-dimensional speaker embedding vector extracted by the pooling processing unit to output speaker information and utterance sentence information, respectively, where each output passes through two layers, one locally-connected and one fully-connected, followed by a softmax function, and is trained with a multi-class cross-entropy objective function.

According to the speaker embedding extraction method and system using an ASR-based pooling technique for speaker recognition proposed in the present invention, and the recording medium therefor, feature information and character probability distribution information of the input speech are each extracted on a frame-by-frame basis, a fixed-dimensional speaker embedding vector is extracted from the two, and that vector is processed to output speaker information and utterance sentence information, respectively. Unlike existing techniques that consider only speaker information during speaker embedding extraction, the frame-level outputs are aggregated separately for each character. This makes it possible to compare the features of specific pronunciations when the similarity between speaker embeddings is computed at the inference stage, so that a speaker recognition system can take speaker information and utterance sentence information into account at the same time.

In addition, according to the present invention, an automatic speech recognition (ASR) model is used to estimate a frame-level probability distribution over characters from the speech, and this distribution is used in the pooling process so that utterance sentence information is extracted together with speaker information in the speaker embedding extraction step. Model structures widely used in the speaker recognition field can be applied and easily replaced with a structure suited to the purpose, so the invention can be used in various fields and applied broadly to speaker recognition and biosignal recognition tasks that extract speaker information and utterance sentence information together.

FIG. 1 is a diagram showing a schematic configuration of a deep learning-based speaker embedding model.
FIG. 2 is a diagram showing the flow of a speaker embedding extraction method using an ASR-based pooling technique for speaker recognition according to an embodiment of the present invention.
FIG. 3 is a diagram showing the detailed flow of step S130 in the speaker embedding extraction method according to an embodiment of the present invention.
FIG. 4 is a functional block diagram showing the configuration of a speaker embedding extraction system using an ASR-based pooling technique for speaker recognition according to an embodiment of the present invention.
FIG. 5 is a diagram showing an example configuration of the deep learning model of the frame-level feature information extraction unit of the speaker embedding extraction system according to an embodiment of the present invention.
FIG. 6 is a diagram schematically showing an implementation of the frame-level character probability distribution extraction unit of the speaker embedding extraction system according to an embodiment of the present invention.
FIG. 7 is a diagram showing the overall algorithm structure of the speaker embedding extraction system according to an embodiment of the present invention.
FIG. 8 is a diagram showing the structure of the pooling process of the speaker embedding extraction system according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can easily practice it. In describing the preferred embodiments, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The same reference numerals are used throughout the drawings for parts having similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element interposed between them. Further, 'including' a component means that other components may also be included, not that they are excluded, unless specifically stated otherwise.

FIG. 1 is a diagram showing a schematic configuration of a deep learning-based speaker embedding model. As shown in FIG. 1, a deep learning-based speaker embedding model basically consists of a frame-level network, a pooling layer for speaker recognition, and a speaker classification network. The frame-level network uses a structure such as a long short-term memory (LSTM) model or a general deep neural network (DNN) to take the input frame-level sequence and output a sequence containing more meaningful information.

The pooling layer for speaker recognition compresses the sequence output by the frame-level network into a single vector, the speaker embedding, by means of an average or a weighted sum. The compressed vector is then fed to a speaker classification network composed of a DNN, which outputs the speaker recognition result (the speaker label). These three components are trained jointly to improve the speaker recognition result, and after training they are used to extract speaker embeddings.
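For comparison with the proposed method, the conventional pooling layer described above can be sketched in a few lines. Both functions are illustrative only; the function names are hypothetical, and how the frame weights are obtained in the weighted-sum case is outside the scope of the sketch.

```python
def mean_pooling(frames):
    """Average a T x D sequence of frame-level outputs into one D-dim vector."""
    if not frames:
        raise ValueError("at least one frame is required")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def weighted_sum_pooling(frames, weights):
    """Combine frames with one scalar weight per frame (weights given by the caller)."""
    if not frames or len(frames) != len(weights):
        raise ValueError("one weight per frame is required")
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
```

Either variant collapses the time axis into a single vector, so any information about which sound each frame carried is lost at this point.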

FIG. 2 is a diagram illustrating the flow of a speaker embedding extraction method using a speech recognizer-based pooling technique for speaker recognition according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating the detailed flow of step S130 of that method. As shown in FIGS. 2 and 3, the method may be implemented to include extracting feature information of the input speech on a frame-by-frame basis (S110), extracting character probability distribution information of the input speech on a frame-by-frame basis (S120), extracting a fixed-dimensional speaker embedding vector (S130), and processing, by the speaker embedding extraction unit, the extracted fixed-dimensional speaker embedding vector as input to output speaker information and utterance sentence information, respectively (S140).

Here, the speaker embedding extraction method according to an embodiment of the present invention may be performed by a computing device, and may be understood as the method applied by the speaker embedding extraction system 100 using a speech recognizer-based pooling technique for speaker recognition, as shown in FIGS. 4 to 8.

In step S110, the frame-level feature information extraction unit 110 extracts feature information of the input speech on a frame-by-frame basis by processing the input speech. In step S110, the frame-level feature information extraction unit 110 may be implemented as any one of the deep learning model structures capable of extracting such frame-level feature information: a TDNN (Time Delay Neural Network), a CNN (Convolutional Neural Network), or an RNN (Recurrent Neural Network).

In addition, the frame-level feature information extraction unit 110 computes frame-level feature information from the input speech feature vectors (MFCC, STFT, Mel filter bank, and the like). The speech feature vectors are obtained by applying a window of a specific length (10 ms) to the input waveform and shifting it by a fixed length, which yields a total of T frame-level speech feature vectors. Here, the speech feature vectors are defined as

[equation not reproduced in this text]

and the output features produced by the frame-level feature information extraction unit 110 are defined as

[equation not reproduced in this text]

As described above, a TDNN, CNN, RNN, or the like may be selectively used as the neural network model for processing the input speech features.
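To make the windowing step concrete, the following sketch slices a waveform into T frames with a fixed window length and shift. The 10 ms window follows the text above; the sample rate, the shift value, and the name `frame_signal` are assumptions for illustration, and a real front end would go on to compute MFCC or filter-bank features from each frame.

```python
from typing import List


def frame_signal(samples: List[float], sample_rate: int,
                 window_ms: float = 10.0,
                 shift_ms: float = 10.0) -> List[List[float]]:
    """Split a waveform into fixed-length frames.

    window_ms defaults to the 10 ms mentioned in the description;
    shift_ms is a hypothetical default because the description only
    says the window is shifted by "a fixed length".
    """
    win = int(round(sample_rate * window_ms / 1000.0))
    hop = int(round(sample_rate * shift_ms / 1000.0))
    if win <= 0 or hop <= 0:
        raise ValueError("window and shift must cover at least one sample")

    frames: List[List[float]] = []
    start = 0
    # Keep only complete frames; trailing samples shorter than a window
    # are dropped in this sketch.
    while start + win <= len(samples):
        frames.append(samples[start:start + win])
        start += hop
    return frames
```

With a 16 kHz signal of 1600 samples, a 10 ms window is 160 samples, so a 10 ms shift gives T = 10 frames and a 5 ms shift gives T = 19 overlapping frames.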

In step S120, the frame-level character probability distribution extraction unit 120 extracts character probability distribution information of the input speech on a frame-by-frame basis by processing the input speech. In step S120, the frame-level character probability distribution extraction unit 120 uses an automatic speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and may be implemented as a TDNN (Time Delay Neural Network), a CNN (Convolutional Neural Network), an RNN (Recurrent Neural Network), or a deep-learning-based end-to-end speech recognition model.

In addition, the frame-level character probability distribution extraction unit 120 extracts a character probability distribution for each frame using a pre-trained automatic speech recognizer (ASR). As described above, the ASR is trained with a CTC (Connectionist Temporal Classification)-based objective function, and in this process it classifies each frame-level input into one of K classes (characters and symbols). The frame-level character probability distribution extraction unit 120 receives the speech feature (x_t) as input and outputs a character probability distribution over the K classes. The output probability distribution may be defined as in [Equation 1] below.

[Equation 1 not reproduced in this text]

Here, [definition not reproduced in this text], and i ranges over [1, T].
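Since [Equation 1] itself is not available here, the following is only a sketch of how a per-frame distribution over K classes is commonly obtained: a softmax over the K output scores of the recognizer for each frame. Treating the distribution as a softmax output is an assumption of this example, and the names `softmax` and `frame_posteriors` are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one frame's K class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_posteriors(frame_logits: List[List[float]]) -> List[List[float]]:
    """Map T rows of K scores to T probability distributions over K classes."""
    return [softmax(row) for row in frame_logits]
```

Each of the T rows of the result is non-negative and sums to 1, so it can serve directly as the per-frame weighting used in the pooling stage.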

In step S130, the pooling processing unit 130 receives the feature information of step S110 and the character probability distribution information of step S120 as inputs and extracts a fixed-dimensional speaker embedding vector. As shown in FIG. 3, step S130 may consist of: a step (S131) in which the pooling processing unit 130 receives the feature information of step S110 and the character probability distribution information of step S120 as inputs and performs individual character-level aggregation on the simultaneously input feature information and character probability distribution information; a step (S132), following step S131, in which a weighted sum is obtained using the probability that the feature information of each frame belongs to a specific utterance; and a step (S133), following step S132, in which the utterance information of specific words is processed individually to calculate a fixed-dimensional vector.

In addition, in the pooling stage of the pooling processing unit 130, the outputs of the frame-level feature information extraction unit 110 and the frame-level character probability distribution extraction unit 120 are received as inputs and a fixed-dimensional speaker embedding is calculated. The calculation for pooling can be expressed as in [Equation 2] and [Equation 3] below.

Here, τ is a constant that prevents divergence. Using the estimated character probability distribution, the calculation is carried out separately for each character and the results are then combined into υ. υ then passes through an affine layer and a softmax, and the affine layer output is used as the speaker embedding vector.
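[Equation 2] and [Equation 3] are not reproduced in this text, so the following is only one plausible reading of the pooling stage, not the formula from the specification. It assumes that each character class k gets its own posterior-weighted average of the frame features, that τ is added to the denominator to avoid division by zero, and that the per-character vectors are concatenated to form υ. All three choices, and the name `char_weighted_pooling`, are assumptions of this sketch.

```python
from typing import List


def char_weighted_pooling(features: List[List[float]],
                          posteriors: List[List[float]],
                          tau: float = 1e-6) -> List[float]:
    """Pool T frame features (T x D) into a fixed K*D vector.

    posteriors is T x K: the probability that frame t belongs to class k.
    For each class k, the frame features are averaged with weights
    posteriors[t][k]; tau in the denominator is an assumed guard against
    a class that receives (almost) no probability mass.
    """
    if not features or len(features) != len(posteriors):
        raise ValueError("features and posteriors need one row per frame")
    num_frames = len(features)
    dim = len(features[0])
    num_classes = len(posteriors[0])

    pooled: List[float] = []
    for k in range(num_classes):
        mass = sum(posteriors[t][k] for t in range(num_frames))
        for d in range(dim):
            weighted = sum(posteriors[t][k] * features[t][d]
                           for t in range(num_frames))
            pooled.append(weighted / (mass + tau))
    return pooled
```

Under these assumptions the output length is K x D no matter how many frames the utterance has, and each block of D values summarizes how the speaker realizes one character class, which is the property the description relies on for comparing specific pronunciations.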

In step S140, the speaker embedding extraction unit 140 processes the fixed-dimensional speaker embedding vector extracted in step S130 as input and outputs speaker information and utterance sentence information, respectively. In step S140, the output speaker information and utterance sentence information pass through two layers, locally-connected and fully-connected, and a softmax function, and may be trained with a multi-class cross-entropy objective function.
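The multi-class cross-entropy objective mentioned above can be sketched as follows for the two outputs (speaker and utterance). The cross-entropy of a softmax output is standard; combining the two outputs by an unweighted sum is an assumption of this example, as are the names `cross_entropy` and `joint_loss`.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Multi-class cross-entropy of softmax(logits) against a class index.

    Computed as log-sum-exp(logits) - logits[target] for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]


def joint_loss(speaker_logits: List[float], speaker_id: int,
               utterance_logits: List[float], utterance_id: int) -> float:
    """Sum of the speaker and utterance losses (equal weighting is assumed)."""
    return (cross_entropy(speaker_logits, speaker_id)
            + cross_entropy(utterance_logits, utterance_id))
```

For uniform scores over N classes the loss is log N, which gives a quick sanity check for an untrained output layer.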

FIG. 4 is a functional block diagram illustrating the configuration of a speaker embedding extraction system using a speech recognizer-based pooling technique for speaker recognition according to an embodiment of the present invention. FIG. 5 is a diagram illustrating an example configuration of the deep learning model of the frame-level feature information extraction unit of the system. FIG. 6 is a diagram schematically illustrating the implementation of the frame-level character probability distribution extraction unit of the system. FIG. 7 is a diagram illustrating the overall algorithm structure of the system. FIG. 8 is a diagram illustrating the structure of the pooling process of the system. As shown in FIGS. 4 to 8, the speaker embedding extraction system 100 using a speech recognizer-based pooling technique for speaker recognition according to an embodiment of the present invention may include a frame-level feature information extraction unit 110, a frame-level character probability distribution extraction unit 120, a pooling processing unit 130, and a speaker embedding extraction unit 140.

The frame-level feature information extraction unit 110 extracts feature information of the input speech on a frame-by-frame basis by processing the input speech. As shown in FIG. 5, the frame-level feature information extraction unit 110 may be implemented as any one of the deep learning model structures capable of extracting such frame-level feature information: a TDNN, a CNN, or an RNN.

The frame-level character probability distribution extraction unit 120 operates simultaneously with the frame-level feature information extraction unit 110 and extracts character probability distribution information of the input speech on a frame-by-frame basis by processing the input speech. As shown in FIG. 6, the frame-level character probability distribution extraction unit 120 uses an automatic speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and may be implemented as a TDNN, a CNN, an RNN, or a deep-learning-based end-to-end speech recognition model.

The pooling processing unit 130 receives the feature information from the frame-level feature information extraction unit 110 and the character probability distribution information from the frame-level character probability distribution extraction unit 120 as inputs and extracts a fixed-dimensional speaker embedding vector. As shown in FIGS. 7 and 8, the pooling processing unit 130 receives both inputs, performs individual character-level aggregation on the simultaneously input feature information and character probability distribution information, obtains a weighted sum using the probability that the feature information of each frame belongs to a specific utterance, and then processes the utterance information of specific words individually to calculate a fixed-dimensional vector.

The speaker embedding extraction unit 140 processes the fixed-dimensional speaker embedding vector received from the pooling processing unit 130 as input and outputs speaker information and utterance sentence information, respectively. The output speaker information and utterance sentence information pass through two layers, locally-connected and fully-connected, and a softmax function, and may be trained with a multi-class cross-entropy objective function.
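The data flow between the four units can be sketched as a small composition of callables. This is a structural illustration only: the class name `SpeakerEmbeddingSystem`, the field names, and the type aliases are hypothetical, and the real units would be neural networks, not plain functions. The check that units 110 and 120 emit the same number of frames is also an assumption, made because the pooling stage pairs their outputs frame by frame.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Frames = List[List[float]]
Outputs = Tuple[List[float], List[float]]  # (speaker output, utterance output)


@dataclass
class SpeakerEmbeddingSystem:
    feature_extractor: Callable[[Frames], Frames]         # unit 110
    char_posterior_extractor: Callable[[Frames], Frames]  # unit 120
    pooling: Callable[[Frames, Frames], List[float]]      # unit 130
    embedding_head: Callable[[List[float]], Outputs]      # unit 140

    def run(self, speech_frames: Frames) -> Outputs:
        # Units 110 and 120 both process the same input speech.
        features = self.feature_extractor(speech_frames)
        posteriors = self.char_posterior_extractor(speech_frames)
        if len(features) != len(posteriors):
            raise ValueError("units 110 and 120 must emit one row per frame")
        # Unit 130 turns the two frame-level sequences into one vector.
        pooled = self.pooling(features, posteriors)
        # Unit 140 produces the speaker and utterance outputs.
        return self.embedding_head(pooled)
```

Because each unit is passed in as a callable, any one of them can be swapped for a different model structure without touching the others, which mirrors the replaceability the description claims for the architecture.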

FIG. 7 shows the overall algorithm structure of the speaker embedding extraction system using a speech recognizer-based pooling technique for speaker recognition according to an embodiment of the present invention, and FIG. 8 shows the structure of the pooling process of that system. As shown in FIG. 7, the speaker embedding extraction system 100 receives the frame-level feature information and the frame-level character probability distribution simultaneously as inputs in the pooling stage and aggregates them individually on a character-by-character basis. As shown in FIG. 8, in the overall pooling stage a weighted sum of the frame-level feature information is taken using the probability of belonging to each specific utterance (29 types); through this process the utterance information of specific words is handled individually, and a fixed-dimensional vector is calculated. After the pooling stage, the result passes through two layers, locally-connected and fully-connected, and a softmax function, and the entire model can finally be trained with a multi-class cross-entropy objective function.

As described above, the speaker embedding extraction method and system using a speech recognizer-based pooling technique for speaker recognition according to an embodiment of the present invention, and the recording medium therefor, extract feature information and character probability distribution information of the input speech on a frame-by-frame basis, receive the extracted feature information and character probability distribution information as inputs to extract a fixed-dimensional speaker embedding vector, and then process that vector as input to output speaker information and utterance sentence information, respectively. Unlike existing techniques, which consider only speaker information when extracting speaker embeddings, the frame-level outputs are aggregated with separate character-level processing. This makes it possible to compare the features of specific pronunciations when computing the similarity between speaker embeddings at the inference stage, so that a speaker recognition system can compare and analyze speaker information and sentence utterance information at the same time.

In particular, an automatic speech recognizer (ASR) model is used to estimate a frame-level probability distribution over characters from the speech, and this distribution is used in the pooling process so that utterance sentence information is extracted together with speaker information in the speaker embedding extraction stage. The model structures widely used in the existing speaker recognition field can therefore be applied as they are, and can easily be replaced with a model structure suited to the purpose, so the invention can be used in various fields and applied broadly to speaker recognition and biometric signal recognition applications that extract speaker information and utterance sentence information together.
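The description refers to computing the similarity between speaker embeddings at the inference stage without naming a specific measure. As a hedged illustration, the following sketch uses cosine similarity, a common choice in speaker verification; its use here, and the name `cosine_similarity`, are assumptions and not something the specification states.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two embeddings of equal dimension."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)
```

If the pooled vector is laid out as one block per character class, the same function could be applied to matching blocks of two embeddings to compare how two utterances realize a particular character; that per-block usage is likewise only a possibility suggested by the description.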

Those of ordinary skill in the art to which the present invention pertains may make various modifications to and applications of the invention described above, and the scope of the technical idea of the present invention should be determined by the claims below.

100: speaker embedding extraction system using a pooling technique according to an embodiment of the present invention
110: frame-level feature information extraction unit
120: frame-level character probability distribution extraction unit
130: pooling processing unit
140: speaker embedding extraction unit
S110: extracting feature information of the input speech on a frame-by-frame basis
S120: extracting character probability distribution information of the input speech on a frame-by-frame basis
S130: extracting a fixed-dimensional speaker embedding vector
S131: performing individual character-level aggregation on the simultaneously input feature information and character probability distribution information
S132: obtaining a weighted sum using the probability that the feature information of each frame belongs to a specific utterance
S133: calculating a fixed-dimensional vector by individually processing the utterance information of specific words
S140: processing, by the speaker embedding extraction unit, the extracted fixed-dimensional speaker embedding vector as input to output speaker information and utterance sentence information, respectively

Claims

A speaker embedding extraction method using a speech recognizer-based pooling technique for speaker recognition, the method comprising:
(1) extracting, by a frame-level feature information extraction unit 110, feature information of input speech on a frame-by-frame basis by processing the input speech;
(2) extracting, by a frame-level character probability distribution extraction unit 120, character probability distribution information of the input speech on a frame-by-frame basis by processing the input speech;
(3) extracting, by a pooling processing unit 130, a fixed-dimensional speaker embedding vector by receiving the feature information of step (1) and the character probability distribution information of step (2) as inputs; and
(4) processing, by a speaker embedding extraction unit 140, the fixed-dimensional speaker embedding vector extracted in step (3) as input and outputting speaker information and utterance sentence information, respectively.

The method of claim 1, wherein in step (1),
the frame-level feature information extraction unit 110 is implemented as any one of a TDNN, a CNN, and an RNN deep learning model structure capable of extracting the feature information of the input speech on a frame-by-frame basis by processing the input speech.

The method of claim 1, wherein in step (2),
the frame-level character probability distribution extraction unit 120 uses an automatic speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and is implemented as a TDNN, a CNN, an RNN, or a deep-learning-based end-to-end speech recognition model.

The method according to any one of claims 1 to 3, wherein step (3) comprises:
(3-1) receiving, by the pooling processing unit 130, the feature information of step (1) and the character probability distribution information of step (2) as inputs, and performing individual character-level aggregation on the simultaneously input feature information and character probability distribution information;
(3-2) after step (3-1), obtaining a weighted sum using the probability that the feature information of each frame belongs to a specific utterance; and
(3-3) after step (3-2), calculating a fixed-dimensional vector by individually processing the utterance information of specific words.

The method of claim 4, wherein in step (4),
the speaker embedding extraction unit 140 processes the fixed-dimensional speaker embedding vector extracted in step (3) as input and outputs speaker information and utterance sentence information, respectively, and the output speaker information and utterance sentence information pass through two layers, locally-connected and fully-connected, and a softmax function, and are trained with a multi-class cross-entropy objective function.

A speaker embedding extraction system 100 using a speech recognizer-based pooling technique for speaker recognition, the system comprising:
a frame-level feature information extraction unit 110 that extracts feature information of input speech on a frame-by-frame basis by processing the input speech;
a frame-level character probability distribution extraction unit 120 that operates simultaneously with the frame-level feature information extraction unit 110 and extracts character probability distribution information of the input speech on a frame-by-frame basis by processing the input speech;
a pooling processing unit 130 that receives the feature information from the frame-level feature information extraction unit 110 and the character probability distribution information from the frame-level character probability distribution extraction unit 120 as inputs and extracts a fixed-dimensional speaker embedding vector; and
a speaker embedding extraction unit 140 that processes the fixed-dimensional speaker embedding vector received from the pooling processing unit 130 as input and outputs speaker information and utterance sentence information, respectively.

The system of claim 6, wherein the frame-level feature information extraction unit 110
is implemented as any one of a TDNN, a CNN, and an RNN deep learning model structure capable of extracting the feature information of the input speech on a frame-by-frame basis by processing the input speech.

The system of claim 6, wherein the frame-level character probability distribution extraction unit 120
uses an automatic speech recognizer (ASR) trained with a CTC (Connectionist Temporal Classification)-based objective function, and is implemented as a TDNN, a CNN, an RNN, or a deep-learning-based end-to-end speech recognition model.

The system according to any one of claims 6 to 8, wherein the pooling processing unit 130
receives the feature information from the frame-level feature information extraction unit 110 and the character probability distribution information from the frame-level character probability distribution extraction unit 120 as inputs, performs individual character-level aggregation on the simultaneously input feature information and character probability distribution information, obtains a weighted sum using the probability that the feature information of each frame belongs to a specific utterance, and then calculates a fixed-dimensional vector by individually processing the utterance information of specific words.

The system of claim 9, wherein the speaker embedding extraction unit 140
processes the fixed-dimensional speaker embedding vector extracted by the pooling processing unit 130 as input and outputs speaker information and utterance sentence information, respectively, and the output speaker information and utterance sentence information pass through two layers, locally-connected and fully-connected, and a softmax function, and are trained with a multi-class cross-entropy objective function.

A computer-readable recording medium on which is recorded a program for executing the speaker embedding extraction method using a speech recognizer-based pooling technique for speaker recognition according to any one of claims 1 to 3.