KR101805437B1

KR101805437B1 - Speaker verification method using background speaker data and speaker verification system

Info

Publication number: KR101805437B1
Application number: KR1020160086846A
Authority: KR
Inventors: 허희수; 양일호; 윤성현; 유하진; 김명재
Original assignee: 서울시립대학교 산학협력단
Priority date: 2016-07-08
Filing date: 2016-07-08
Publication date: 2017-12-07

Abstract

A method for performing speaker authentication using background speaker data in a speaker authentication system includes the steps of: extracting a first feature vector for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data; extracting a B-vector based on the first and second feature vectors; extracting a first R-vector based on the first feature vector and the third feature vectors; extracting a second R-vector based on the second feature vector and the third feature vectors; inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and performing speaker authentication based on the DNN.

Description

The present invention relates to a speaker authentication method and a speaker authentication system using background speaker data.

Speaker authentication (speaker verification) is a technique for confirming a person's identity from their voice. Because each person's voice has distinct frequency characteristics, the technique analyzes the frequency content of an individual's speech signal and compares it with that individual's previously stored voice to decide whether the speech belongs to the claimed person. For this reason it is used in security systems and similar applications.

As prior art related to speaker authentication, Korean Unexamined Patent Publication No. 2004-0035647 discloses a method for providing a network-based user authentication service for electronic financial transactions using speaker authentication technology, and an apparatus for performing the method.

The i-vector PLDA system has been the predominant conventional speaker authentication technique. Later, DNN-based speaker authentication systems appeared, built on the deep neural network (DNN), whose high performance had been confirmed in other machine-learning research.

Such DNN-based speaker authentication systems have the advantage of higher accuracy than an i-vector PLDA system that does not use a DNN. However, a conventional DNN-based speaker authentication system must retrain the entire DNN every time a new speaker is registered, and this training consumes considerable time and resources. In addition, there has been no way to make use of large amounts of background speaker information.

The present invention aims to provide a speaker authentication method and a speaker authentication system that solve the data training problem of the deep neural network (DNN) in a conventional speaker authentication system using a B-vector, and that can make use of background speaker information. It further aims to provide a speaker authentication method and system using background speaker data that, by inputting background speaker information to the DNN, realize an improved DNN-based version of the B-vector approach. However, the technical problems addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical object, an embodiment of the present invention can provide a speaker authentication method comprising: extracting a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data; extracting a B-vector based on the first feature vector and the second feature vector; extracting a first R-vector based on the first feature vector and the plurality of third feature vectors; extracting a second R-vector based on the second feature vector and the plurality of third feature vectors; inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and performing speaker authentication based on the deep neural network.

Another embodiment of the present invention can provide a speaker authentication system comprising: a feature vector extraction unit that extracts a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data; a B-vector extraction unit that extracts a B-vector based on the first feature vector and the second feature vector; an R-vector extraction unit that extracts a first R-vector based on the first feature vector and the plurality of third feature vectors, and a second R-vector based on the second feature vector and the plurality of third feature vectors; a deep neural network input unit that inputs the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and a speaker authentication performing unit that performs speaker authentication based on the deep neural network.

Yet another embodiment of the present invention can provide a computer program comprising a sequence of instructions that, when executed by a computing device of the speaker authentication system, cause it to: extract a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data; extract a B-vector based on the first feature vector and the second feature vector; extract a first R-vector based on the first feature vector and the plurality of third feature vectors; extract a second R-vector based on the second feature vector and the plurality of third feature vectors; input the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and perform speaker authentication based on the deep neural network.

The means for solving the problems described above are merely illustrative and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description of the invention.

According to any one of the above means of the present invention, it is possible to provide a speaker authentication method and a speaker authentication system that solve the data training problem of the DNN in a conventional speaker authentication system using a B-vector, and that can make use of background speaker information. By inputting background speaker information to the DNN, it is possible to provide a speaker authentication method and system using background speaker data that offer a DNN-based speaker authentication system with an improved version of the B-vector.

FIG. 1 is an exemplary diagram for explaining a conventional speaker authentication technique.
FIG. 2 is an exemplary diagram illustrating a process of performing speaker authentication using a feature vector (i-vector) in a speaker authentication system according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a method for authenticating a speaker based on a DNN (Deep Neural Network) according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating a method for performing speaker authentication in a DNN-based speaker authentication system using a B-vector according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating background speaker data and principal component analysis according to an embodiment of the present invention.
FIG. 6 is a configuration diagram of a speaker authentication system according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram illustrating a process of performing speaker authentication using background speaker data in a speaker authentication system according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram for explaining a process of extracting an R-vector using feature vectors of background speaker data extracted by the speaker authentication system according to an embodiment of the present invention.
FIG. 9 is a flowchart of a method of performing speaker authentication using background speaker data in a speaker authentication system according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described here. In the drawings, parts unrelated to the description are omitted for clarity, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element in between. When a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary diagram for explaining a conventional speaker authentication technique.

In general, speaker recognition technology is divided into speaker identification and speaker authentication. Speaker identification searches a predefined set of candidate speakers for the candidate most similar to a newly input utterance. For example, with three predefined candidate speakers, speaker identification computes a similarity between the new utterance and each candidate's speech. If the similarity is 0.5 for the first speaker, 0.8 for the second speaker, and 0.2 for the third speaker, the new utterance is judged to be most similar to the second speaker's speech.
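The identification example above amounts to picking the candidate with the highest similarity. A minimal sketch, assuming the similarity scores are already available (the dictionary keys are hypothetical labels, and how the scores are computed is outside this sketch):

```python
# Similarity of the new utterance to each predefined candidate
# (values taken from the example in the text; labels are hypothetical).
similarities = {"speaker_1": 0.5, "speaker_2": 0.8, "speaker_3": 0.2}

# Speaker identification: choose the candidate with the highest similarity.
identified = max(similarities, key=similarities.get)
print(identified)
```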

In contrast, speaker authentication registers the speech of the target speaker in the system in advance and then checks whether an input evaluation utterance was spoken by the registered speaker.

Referring to FIG. 1, in the registration procedure (100) the speaker authentication system receives a speaker's utterance and models it to generate a user model.

In the evaluation procedure (110), the speaker authentication system receives utterances from a plurality of speakers (111, 112), computes the similarity between each input utterance and the user model, and checks whether that similarity is at or above a preset threshold, thereby determining whether the utterance belongs to the speaker registered in the system. For example, when utterances of a first speaker (111) and a second speaker (112) are input, the system may compute a similarity of 0.3 between the user model and the first speaker's (111) utterance and 0.8 for the second speaker's (112) utterance. If the preset threshold is 0.5, the system judges the second speaker (112), with similarity 0.8, to be the same speaker as the one registered as the user model, and judges the first speaker (111), with similarity 0.3, to be an impostor.
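The threshold decision in this example can be sketched as follows. This is only an illustration of the accept/reject rule with the numbers from the text; the function name `verify` and the dictionary labels are assumptions, and the similarity computation itself is not modeled.

```python
THRESHOLD = 0.5  # preset threshold from the example


def verify(similarity: float, threshold: float = THRESHOLD) -> bool:
    """Accept the utterance if its similarity to the user model reaches the threshold."""
    return similarity >= threshold


# Similarities between the user model and each evaluated speaker's utterance.
scores = {"speaker_111": 0.3, "speaker_112": 0.8}
decisions = {name: ("accepted" if verify(s) else "impostor") for name, s in scores.items()}
print(decisions)
```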

Such speaker authentication systems are now often used on the lock screens of mobile terminals. For example, after registering their voice in advance for unlocking the mobile terminal, a user can speak again to unlock the terminal and gain permission to use it. The speaker authentication system in the mobile terminal computes the similarity between the input voice and the registered voice to check whether the input voice is the user's. If the two are judged to be the same voice, the mobile terminal can show the unlocked screen on its display.

FIG. 2 is an exemplary diagram illustrating a process of performing speaker authentication using a feature vector (i-vector) in a speaker authentication system according to an embodiment of the present invention. Referring to FIG. 2, the speaker authentication system can perform speaker authentication using an utterance-level feature vector called an i-vector.

When the speaker authentication system receives a registered utterance (200) from a speaker, it can extract an i-vector from the registered utterance (200) and store it. When it receives an evaluation utterance (210) from a speaker, it can extract an i-vector from the evaluation utterance (210). For example, the system may use a total variability matrix to extract a speaker's i-vector; the total variability matrix may contain both the class-dependent and the independent subspaces of the original speaker supervector space. In general, a Gaussian mixture model (GMM) supervector is used as the original speaker supervector, and the high-dimensional input vector can be expressed in terms of a low-dimensional feature vector (i-vector) as follows.

M = m + Tw

Here, M is the input speaker vector, m is the UBM (Universal Background Model) supervector, T is the total variability matrix containing the speaker-dependent subspace, and w is the i-vector.
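The relation M = m + Tw can be illustrated numerically. The sketch below is a simplification under stated assumptions: the dimensions and random values are hypothetical, and w is recovered by ordinary least squares, whereas practical i-vector extractors typically estimate w from sufficient statistics of the utterance rather than from M directly.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 12, 3                     # hypothetical supervector and i-vector dimensions

m = rng.normal(size=D)           # UBM supervector
T = rng.normal(size=(D, R))      # total variability matrix
w_true = rng.normal(size=R)      # low-dimensional i-vector

# High-dimensional speaker vector generated by the model M = m + Tw.
M = m + T @ w_true

# Simplified recovery of w: least-squares solution of T w = M - m.
w_est, *_ = np.linalg.lstsq(T, M - m, rcond=None)
print(np.allclose(w_est, w_true))
```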

The speaker authentication system can perform speaker authentication on the extracted feature vectors using a PLDA (Probabilistic Linear Discriminant Analysis) model. For example, PLDA can compute the similarity between two utterances by calculating the distance between the i-vector of the registered utterance (200) and the i-vector of the authentication utterance (210). In doing so, the class-dependent and independent components contained in the i-vector can be analyzed. For example, the input i-vector can be expressed as a sum of several terms as follows.

w_ij = μ + Φh_i + Ψy_ij + ε_ij

Here, w_ij is the input i-vector, μ is the global mean vector of the training data set, Φ is the matrix containing the between-class subspace, h_i is the position of class i in the between-class subspace, Ψ is the matrix containing the within-class subspace, y_ij is the position within the within-class subspace for class i and speaker j, and ε_ij is the residual noise term.

FIG. 3 is an exemplary diagram for explaining a method of recognizing speech based on a DNN (Deep Neural Network) according to an embodiment of the present invention. Referring to FIG. 3, the speech recognition system may divide the input speech signal (300) into short frames, extract frequency-band features from the divided region (310), and input them to the DNN (320). The input features are transformed nonlinearly as they pass through the hidden (intermediate) layers, and the resulting information is delivered to the output layer. Each node of the output layer represents a phoneme for the frame, and the phoneme corresponding to the output node with the largest activation is chosen as the phoneme of the input frame.

For example, suppose that among the nodes of the output layer (330), the first node represents '아' (a) with activation 0.8, the second node '에' (e) with activation 0.2, the third node '이' (i) with activation 0.1, and the fourth node '하' (ha) with activation 0.3. The DNN then recognizes that '아' (a), which has the largest activation, is contained in the input frame region.
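The selection rule in this example is an argmax over the output-layer activations. A small sketch using the values from the text (the variable names are hypothetical, and the network that produces the activations is not modeled):

```python
import numpy as np

# Phoneme represented by each output node and its activation, as in the example.
phonemes = ["아", "에", "이", "하"]
activations = np.array([0.8, 0.2, 0.1, 0.3])

# The frame is assigned the phoneme whose output node has the largest activation.
recognized = phonemes[int(np.argmax(activations))]
print(recognized)
```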

FIG. 4 is an exemplary diagram illustrating a method for performing speaker authentication in a DNN-based speaker authentication system using a B-vector according to an embodiment of the present invention.

Referring to FIG. 4, a DNN-based speaker authentication system using a B-vector can extract a feature vector (i-vector, 420) from each of the speaker's registered utterance (400) and authentication utterance (410). Using the two extracted i-vectors (420), the system can extract a B-vector (430) that represents the relationship between the registered utterance (400) and the authentication utterance (410). The B-vector (430) can be generated by concatenating the results of binary operations on the two i-vectors, such as addition and subtraction.
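The B-vector construction described here can be sketched as below. This is a minimal illustration that assumes element-wise addition and subtraction are the only binary operations used; the function name `b_vector` is hypothetical, and the exact set of operations in the patented system is not specified in this passage.

```python
import numpy as np


def b_vector(enroll_ivec, test_ivec):
    """Concatenate element-wise sum and difference of two i-vectors."""
    enroll = np.asarray(enroll_ivec, dtype=float)
    test = np.asarray(test_ivec, dtype=float)
    if enroll.shape != test.shape:
        raise ValueError("both i-vectors must have the same shape")
    return np.concatenate([enroll + test, enroll - test])


b = b_vector([1.0, 2.0], [3.0, 5.0])
print(b)
```

With this choice of operations, two d-dimensional i-vectors yield a 2d-dimensional B-vector.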

Because the B-vector does not carry intuitive information that can be used directly for speaker authentication, a binary classifier such as an SVM or a DNN is used to classify whether the authentication utterance (410) was spoken by the same speaker as the registered utterance (400) or by a different speaker.

The speaker authentication system can input the extracted B-vector (430) to the DNN (440). The DNN (440) analyzes the raw information contained in the B-vector (430) and finally classifies it into one of two classes. For example, the DNN (440) can classify the input into a first class (450), indicating that the registered utterance and the authentication utterance come from the same speaker, or a second class (460), indicating that they come from different speakers.

FIG. 5 is an exemplary diagram illustrating background speaker data and principal component analysis according to an embodiment of the present invention. Referring to FIG. 5, background speaker data is used to search for speaker factors that are effective for speaker authentication.

For example, a feature vector (i-vector) extracted from a registered utterance or an authentication utterance contains not only information about the speaker but also information about the recording channel (for example, a PC microphone or a smartphone microphone) and the recording environment (for example, a noisy environment). As a result, even feature vectors (i-vectors) extracted from utterances of the same speaker may end up far apart because of these other factors.

To compensate for this drawback, the present invention uses background speaker data: by analyzing the distribution of the feature vectors (i-vectors) of the background speaker data with principal component analysis, it finds speaker factors that are effective for separating speakers, computes the distance between the feature vectors (i-vectors) of two utterances with respect to those factors, and can then readily decide from that distance whether the utterances come from the same speaker or from different speakers. For example, feature vectors of background speaker data expressed as two-dimensional data can be represented in one dimension (for example, along a straight line). This reduces the size of each feature vector, so the data can be processed quickly.
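The two-dimensional-to-one-dimensional example can be sketched with a plain principal component analysis. The data below is synthetic and hypothetical (points scattered around the line y = x), and the SVD-based PCA is a generic formulation, not necessarily the exact procedure used in the patent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D background feature vectors spread mainly along the direction (1, 1).
t = rng.normal(size=200)
X = np.column_stack([t, t]) + 0.05 * rng.normal(size=(200, 2))

# PCA: center the data, then take the first right singular vector as the principal axis.
mean = X.mean(axis=0)
Xc = X - mean
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                      # unit vector of the first principal component

# Each 2-D vector is reduced to a single coordinate along the principal axis.
projected = Xc @ pc1
print(projected.shape)
```

Distances between utterances can then be measured on the projected coordinates instead of the original vectors.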

FIG. 6 is a configuration diagram of a speaker authentication system according to an embodiment of the present invention. Referring to FIG. 6, the DNN-based speaker authentication system (1) using background speaker data may include a feature vector extraction unit (610), a B-vector extraction unit (620), an R-vector extraction unit (630), a deep neural network input unit (640), and a speaker authentication performing unit (650).

The feature vector extraction unit (610) can extract a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data.

The feature vector extraction unit (610) can select representative features from the plurality of third feature vectors extracted from the background speaker data by using the k-means algorithm.

The B-vector extraction unit (620) can extract a B-vector based on the first feature vector and the second feature vector. For example, the B-vector extraction unit (620) can extract the B-vector from first and second feature vectors that have not undergone length normalization.

The R-vector extraction unit (630) can extract a first R-vector based on the first feature vector and the plurality of third feature vectors, and a second R-vector based on the second feature vector and the plurality of third feature vectors.

The deep neural network input unit (640) can input the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN).

The speaker authentication performing unit (650) can perform speaker authentication based on the deep neural network.

Hereinafter, the process by which the speaker authentication system 1 performs speaker authentication will be described in detail with reference to FIGS. 7 and 8.

FIG. 7 is an exemplary diagram illustrating a process of performing speaker authentication using background speaker data in a speaker authentication system according to an embodiment of the present invention.

Referring to FIGS. 6 and 7, the speaker authentication system 1 may perform speaker authentication using registered speech data 710, authentication speech data 720, and background speaker data 730.

The feature vector extraction unit 610 may extract a first feature vector 711 from the registered speech data 710, a second feature vector 721 from the authentication speech data 720, and a plurality of third feature vectors 731 from the background speaker data 730. For example, the first feature vector 711, the second feature vector 721, and each of the third feature vectors 731 may be 150-dimensional.

The feature vector extraction unit 610 may select representative features by applying the k-means algorithm to the extracted third feature vectors 731. For example, it may use the k-means algorithm to select 30 representative features from the plurality of third feature vectors 731.
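The selection of representative features can be illustrated with a minimal sketch of Lloyd's k-means in plain Python. The text does not say whether the representatives are the cluster centroids or the background vectors nearest to them; this sketch assumes centroids. The function names, the iteration count, and the seed are hypothetical choices for illustration only.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_representatives(vectors, k, iterations=20, seed=0):
    """Return k centroids of `vectors` (assumed here to be the representatives)."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct input vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k),
                          key=lambda j: squared_distance(v, centroids[j]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its members.
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                dim = len(members[0])
                centroids[j] = [sum(m[d] for m in members) / len(members)
                                for d in range(dim)]
    return centroids
```

In the example given in the text, `vectors` would hold the 150-dimensional third feature vectors and `k` would be 30.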

Based on the first feature vector 711 and the second feature vector 721 extracted by the feature vector extraction unit 610, the B-vector extraction unit 620 may extract a B-vector 740 as a feature representing the relationship between the two utterances (712, 722). The B-vector 740 may be extracted based on a binary operation between the first feature vector 711 and the second feature vector 721. For example, the B-vector extraction unit 620 may perform binary operations using the sum, product, and difference of the first feature vector 711 and the second feature vector 721, and may concatenate the resulting values to generate the B-vector 740. The B-vector extraction unit 620 may extract a 600-dimensional B-vector 740 from the 150-dimensional first feature vector 711 and the 150-dimensional second feature vector 721. The details are as described above with reference to FIG. 4.
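As a rough illustration of the binary-operation-and-concatenate step, the sketch below applies element-wise sum, difference, and product and joins the results. Only these three operations are named in this passage, so the output here has three times the input dimension; the 600-dimensional figure quoted for 150-dimensional inputs would correspond to four result blocks, and this passage does not state what the fourth is. The operation list is therefore left as a parameter, and the function name and operation order are assumptions.

```python
import operator

# Element-wise binary operations named in the passage: sum, difference, product.
# The order of the blocks in the output is an assumption.
DEFAULT_OPERATIONS = (operator.add, operator.sub, operator.mul)


def extract_b_vector(first, second, operations=DEFAULT_OPERATIONS):
    """Concatenate the element-wise results of each binary operation."""
    if len(first) != len(second):
        raise ValueError("feature vectors must have the same dimension")
    b_vector = []
    for op in operations:
        # One block of len(first) values per operation.
        b_vector.extend(op(x, y) for x, y in zip(first, second))
    return b_vector
```

For two 2-dimensional inputs `[1, 2]` and `[3, 4]`, the result is the 6-dimensional vector `[4, 6, -2, -2, 3, 8]`.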

The R-vector extraction unit 630 may extract a first R-vector 750 based on the first feature vector 711 and the plurality of third feature vectors 731 extracted by the feature vector extraction unit 610 (713, 732). The R-vector extraction unit 630 may also extract a second R-vector 760 based on the second feature vector 721 and the plurality of third feature vectors 731 extracted by the feature vector extraction unit 610 (723, 733). The first R-vector 750 may be extracted based on binary operations between the first feature vector 711 and the plurality of third feature vectors 731 and on principal component analysis (PCA), and the second R-vector 760 may be extracted based on binary operations between the second feature vector 721 and the plurality of third feature vectors 731 and on principal component analysis (PCA). The details are as described above with reference to FIGS. 4 and 5.

The deep neural network input unit 640 may input the B-vector 740, the first R-vector 750, and the second R-vector 760 to a deep neural network (DNN) 770. The DNN 770 may have five hidden layers, each containing 2048 connected nodes activated by a hyperbolic tangent function, and may apply binary classification to the input B-vector 740, first R-vector 750, and second R-vector 760. The details are as described above with reference to FIG. 3.
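The network shape described above can be sketched as a plain-Python forward pass. Several points are assumptions rather than details from the text: the three inputs are simply concatenated, the binary classifier ends in a single sigmoid unit, and the weights are random and untrained. The function names are hypothetical.

```python
import math
import random


def build_network(input_dim, hidden_dim=2048, hidden_layers=5, seed=0):
    """Return (weights, biases) pairs: the hidden layers, then one output unit."""
    rng = random.Random(seed)
    sizes = [input_dim] + [hidden_dim] * hidden_layers + [1]
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)  # small random weights, untrained
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def affine(weights, biases, inputs):
    """Compute W x + b for one layer."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(layers, b_vector, r_vector_1, r_vector_2):
    """Return a score in (0, 1) for the concatenated input (assumed layout)."""
    activation = list(b_vector) + list(r_vector_1) + list(r_vector_2)
    *hidden, (out_weights, out_biases) = layers
    for weights, biases in hidden:
        # Hidden layers use the hyperbolic tangent activation.
        activation = [math.tanh(z) for z in affine(weights, biases, activation)]
    (logit,) = affine(out_weights, out_biases, activation)
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid output is an assumption
```

A small `hidden_dim` keeps this pure-Python sketch fast; the default of 2048 nodes in five hidden layers mirrors the figures in the text but would be slow without a numerical library.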

The speaker authentication unit 650 may perform speaker authentication based on the deep neural network 770. It may compare the activation value produced by the deep neural network 770 with a preset threshold. If the activation value is greater than or equal to the threshold, the result is indicated as 'Positive' (780), meaning that the authentication utterance and the registered utterance are from the same speaker; if the activation value is less than the threshold, the result is indicated as 'Negative' (790), meaning that they are from different speakers.

FIG. 8 is an exemplary diagram illustrating a process of extracting an R-vector using the feature vectors of the background speaker data extracted by the speaker authentication system according to an embodiment of the present invention.

The R-vector extraction unit 630 may extract a first R-vector 812 based on binary operations between the first feature vector 711 and the plurality of third feature vectors 731 and on principal component analysis (PCA) (713, 732). For example, to extract the plurality of B-vectors 810 corresponding to the first R-vector 812, the R-vector extraction unit 630 may perform a binary operation between the first feature vector 711 and each third feature vector 731 and concatenate the resulting values of each such operation. The number of B-vectors 810 corresponding to the first R-vector 812 equals the number of extracted third feature vectors. For example, when 30 third feature vectors 731 have been extracted, performing the binary operation between each third feature vector 731 and the first feature vector 711 yields 30 B-vectors 810.

The R-vector extraction unit 630 may reduce the dimension of each of the plurality of B-vectors 810 corresponding to the first R-vector 812 based on principal component analysis 811, and may concatenate the dimension-reduced B-vectors 810 to generate the first R-vector 812. For example, the R-vector extraction unit 630 may reduce the dimension of each of the plurality of B-vectors 810 corresponding to the 300-dimensional first R-vector 812 based on principal component analysis 811 to generate a 10-dimensional first R-vector 812. The details are as described above with reference to FIG. 3.
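The R-vector construction (one B-vector per representative background vector, PCA reduction of each, then concatenation) might look like the following sketch. It assumes the PCA projection (a mean vector and a list of component rows) has already been fitted elsewhere and is shared by all B-vectors; fitting is not shown. The helper names and the choice of element-wise sum, difference, and product are assumptions.

```python
import operator


def b_vector(x, y, operations=(operator.add, operator.sub, operator.mul)):
    """Concatenate element-wise binary-operation results of x and y."""
    result = []
    for op in operations:
        result.extend(op(a, b) for a, b in zip(x, y))
    return result


def pca_project(vector, mean, components):
    """Project a vector onto pre-fitted principal components (assumed given)."""
    centered = [v - m for v, m in zip(vector, mean)]
    return [sum(c * v for c, v in zip(component, centered))
            for component in components]


def extract_r_vector(target, representatives, mean, components):
    """Build an R-vector for `target` against the background representatives."""
    r_vector = []
    for representative in representatives:
        # One B-vector per representative background feature vector.
        pair = b_vector(target, representative)
        # Reduce its dimension, then append it to the R-vector.
        r_vector.extend(pca_project(pair, mean, components))
    return r_vector
```

The same function would serve for the second R-vector by passing the second feature vector as `target`. The output length is the number of representatives multiplied by the number of retained components.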

The R-vector extraction unit 630 may extract a second R-vector 822 based on binary operations between the second feature vector 721 and the plurality of third feature vectors 731 and on principal component analysis (PCA) (723, 733). For example, to extract the plurality of B-vectors 820 corresponding to the second R-vector 822, the R-vector extraction unit 630 may perform a binary operation between the second feature vector 721 and each third feature vector 731 and concatenate the resulting values of each such operation. The number of B-vectors 820 corresponding to the second R-vector 822 equals the number of extracted third feature vectors 731. For example, when 30 third feature vectors 731 have been extracted, performing the binary operation between each third feature vector 731 and the second feature vector 721 yields 30 B-vectors 820.

The R-vector extraction unit 630 may reduce the dimension of each of the plurality of B-vectors 820 corresponding to the second R-vector 822 based on principal component analysis 821, and may concatenate the dimension-reduced B-vectors 820 to generate the second R-vector 822. For example, the R-vector extraction unit 630 may reduce the dimension of each of the plurality of B-vectors 820 corresponding to the 300-dimensional second R-vector 822 based on principal component analysis 821 to generate a 10-dimensional second R-vector 822. The details are as described above with reference to FIG. 3.

The speaker authentication system 1 may provide a computer program including a sequence of instructions that, when executed by a computing device, causes the computing device to: extract a first feature vector (i-vector) for the registered speech data of a speaker, a second feature vector for the authentication speech data, and a plurality of third feature vectors for the background speaker data; extract a B-vector based on the first feature vector and the second feature vector; extract a first R-vector based on the first feature vector and the plurality of third feature vectors; extract a second R-vector based on the second feature vector and the plurality of third feature vectors; input the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and perform speaker authentication based on the deep neural network.

FIG. 9 is a flowchart of a method of performing speaker authentication using background speaker data in a speaker authentication system according to an embodiment of the present invention. The method according to the embodiment shown in FIG. 9 includes steps that are processed in time series by the speaker authentication system 1 according to the embodiment shown in FIG. 6. Accordingly, even where omitted below, the description of the speaker authentication system 1 according to the embodiments shown in FIGS. 6 to 9 also applies to the method of performing speaker authentication using background speaker data.

In step S910, the speaker authentication system 1 may extract a first feature vector (i-vector) for the registered speech data of a speaker, a second feature vector for the authentication speech data, and a plurality of third feature vectors for the background speaker data.

In step S920, the speaker authentication system 1 may extract a B-vector based on the first feature vector and the second feature vector. The B-vector may be extracted based on a binary operation between the first feature vector and the second feature vector. For example, step S920 may further include performing a binary operation between the first feature vector and the second feature vector, and generating the B-vector by concatenating a plurality of result values of the binary operation.

In step S930, the speaker authentication system 1 may extract a first R-vector based on the first feature vector and the plurality of third feature vectors. For example, the first R-vector may be extracted based on binary operations between the first feature vector and the plurality of third feature vectors and on principal component analysis (PCA).

In step S940, the speaker authentication system 1 may extract a second R-vector based on the second feature vector and the plurality of third feature vectors. For example, the second R-vector may be extracted based on binary operations between the second feature vector and the plurality of third feature vectors and on principal component analysis (PCA).

In step S950, the speaker authentication system 1 may input the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN).

In step S960, the speaker authentication system 1 may perform speaker authentication based on the deep neural network.
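The order of steps S910 to S960 can be summarised in a short sketch that wires the stages together. Every stage is passed in as a callable, so all parameter names here are hypothetical placeholders rather than components defined by the text; the comparison against a threshold follows the decision rule described with reference to FIG. 7.

```python
def authenticate(enroll_audio, test_audio, background_audio, *,
                 extract_feature, select_representatives,
                 extract_b_vector, extract_r_vector, dnn_score, threshold):
    """Run the S910 to S960 sequence with caller-supplied stage functions."""
    # S910: feature vectors for enrollment, test, and background data.
    first = extract_feature(enroll_audio)
    second = extract_feature(test_audio)
    third = select_representatives(
        [extract_feature(audio) for audio in background_audio])
    # S920: B-vector relating the two utterances.
    b = extract_b_vector(first, second)
    # S930 and S940: one R-vector per utterance against the background set.
    r1 = extract_r_vector(first, third)
    r2 = extract_r_vector(second, third)
    # S950: feed all three vectors to the network.
    score = dnn_score(b, r1, r2)
    # S960: accept when the score reaches the threshold.
    return "Positive" if score >= threshold else "Negative"
```

Passing the stages as arguments keeps the sketch independent of any particular feature extractor or network implementation.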

Although not shown in FIG. 9, the speaker authentication system 1 may further perform the steps of extracting a plurality of B-vectors corresponding to the first R-vector, reducing the dimension of each of those B-vectors based on principal component analysis, and generating the first R-vector by concatenating the dimension-reduced B-vectors. To extract the plurality of B-vectors corresponding to the first R-vector, the speaker authentication system 1 may further perform the steps of performing a binary operation between the first feature vector and each third feature vector, and generating the plurality of B-vectors corresponding to the first R-vector by concatenating a plurality of result values of the binary operation between the first feature vector and each third feature vector. The number of B-vectors corresponding to the first R-vector equals the number of extracted third feature vectors.

In the above description, steps S910 to S960 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may also be omitted as needed, and the order of the steps may be changed.

The speaker authentication method using background speaker data performed in the speaker authentication system described with reference to FIGS. 1 to 9 may also be implemented in the form of a computer program stored in a medium executed by a computer, or in the form of a recording medium including instructions executable by a computer. In addition, the speaker authentication method using background speaker data performed in the speaker authentication system described with reference to FIGS. 1 to 9 may be implemented in the form of a computer program stored in a medium executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. A computer-readable medium may also include computer storage media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

610: Feature vector extraction unit
620: B-vector extraction unit
630: R-vector extraction unit
640: Deep neural network input unit
650: Speaker authentication unit

Claims

A method for performing speaker authentication using background speaker data in a speaker authentication system, the method comprising:
extracting a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data;
extracting a B-vector based on the first feature vector and the second feature vector;
extracting a first R-vector based on the first feature vector and the plurality of third feature vectors;
extracting a second R-vector based on the second feature vector and the plurality of third feature vectors;
inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and
performing speaker authentication based on the deep neural network,
wherein the B-vector is extracted based on a binary operation between the first feature vector and the second feature vector.

(Deleted)

The method according to claim 1,
wherein extracting the B-vector comprises:
performing a binary operation between the first feature vector and the second feature vector; and
generating the B-vector by concatenating a plurality of result values of the binary operation.

A method for performing speaker authentication using background speaker data in a speaker authentication system, the method comprising:
extracting a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data;
extracting a B-vector based on the first feature vector and the second feature vector;
extracting a first R-vector based on the first feature vector and the plurality of third feature vectors;
extracting a second R-vector based on the second feature vector and the plurality of third feature vectors;
inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and
performing speaker authentication based on the deep neural network,
wherein the first R-vector is extracted based on a binary operation between the first feature vector and the plurality of third feature vectors and on principal component analysis (PCA), and
wherein the second R-vector is extracted based on a binary operation between the second feature vector and the plurality of third feature vectors and on principal component analysis (PCA).

The method according to claim 4,
wherein extracting the first R-vector comprises:
extracting a plurality of B-vectors corresponding to the first R-vector;
reducing the dimension of each of the plurality of B-vectors corresponding to the first R-vector based on principal component analysis; and
generating the first R-vector by concatenating the dimension-reduced plurality of B-vectors corresponding to the first R-vector.

The method according to claim 5,
wherein extracting the plurality of B-vectors corresponding to the first R-vector comprises:
performing a binary operation between the first feature vector and each third feature vector; and
generating the plurality of B-vectors corresponding to the first R-vector by concatenating a plurality of result values of the binary operation between the first feature vector and each third feature vector.

The method according to claim 6,
wherein the plurality of B-vectors corresponding to the first R-vector are generated in a number equal to the number of extracted third feature vectors.

A speaker authentication system for performing speaker authentication using background speaker data, the system comprising:
a feature vector extraction unit for extracting a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data;
a B-vector extraction unit for extracting a B-vector based on the first feature vector and the second feature vector;
an R-vector extraction unit for extracting a first R-vector based on the first feature vector and the plurality of third feature vectors, and extracting a second R-vector based on the second feature vector and the plurality of third feature vectors;
a deep neural network input unit for inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and
a speaker authentication unit for performing speaker authentication based on the deep neural network,
wherein the B-vector is extracted based on a binary operation between the first feature vector and the second feature vector.

(Deleted)

The speaker authentication system according to claim 8,
wherein the B-vector extraction unit
performs a binary operation between the first feature vector and the second feature vector, and
generates the B-vector by concatenating a plurality of result values of the binary operation.

A speaker authentication system for performing speaker authentication using background speaker data, the system comprising:
a feature vector extraction unit for extracting a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data;
a B-vector extraction unit for extracting a B-vector based on the first feature vector and the second feature vector;
an R-vector extraction unit for extracting a first R-vector based on the first feature vector and the plurality of third feature vectors, and extracting a second R-vector based on the second feature vector and the plurality of third feature vectors;
a deep neural network input unit for inputting the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and
a speaker authentication unit for performing speaker authentication based on the deep neural network,
wherein the first R-vector is extracted based on a binary operation between the first feature vector and the plurality of third feature vectors and on principal component analysis (PCA), and
wherein the second R-vector is extracted based on a binary operation between the second feature vector and the plurality of third feature vectors and on principal component analysis (PCA).

The speaker authentication system according to claim 11,
wherein the R-vector extraction unit
extracts a plurality of B-vectors corresponding to the first R-vector,
reduces the dimension of each of the plurality of B-vectors corresponding to the first R-vector based on principal component analysis, and
generates the first R-vector by concatenating the dimension-reduced plurality of B-vectors corresponding to the first R-vector.

The speaker authentication system according to claim 12,
wherein, to extract the plurality of B-vectors corresponding to the first R-vector, the R-vector extraction unit
performs a binary operation between the first feature vector and each third feature vector, and
generates the plurality of B-vectors corresponding to the first R-vector by concatenating a plurality of result values of the binary operation between the first feature vector and each third feature vector.

The speaker authentication system according to claim 13,
wherein the plurality of B-vectors corresponding to the first R-vector are generated in a number equal to the number of extracted third feature vectors.

A computer program stored in a computer-readable storage medium for performing speaker authentication using background speaker data in a speaker authentication system,
wherein the computer program, when executed by a computing device of the speaker authentication system, causes the computing device to:
extract a first feature vector (i-vector) for registered speech data of a speaker, a second feature vector for authentication speech data, and a plurality of third feature vectors for background speaker data;
extract a B-vector based on the first feature vector and the second feature vector;
extract a first R-vector based on the first feature vector and the plurality of third feature vectors;
extract a second R-vector based on the second feature vector and the plurality of third feature vectors;
input the B-vector, the first R-vector, and the second R-vector to a deep neural network (DNN); and
perform speaker authentication based on the deep neural network,
wherein the B-vector is extracted based on a binary operation between the first feature vector and the second feature vector.