KR102498268B1

KR102498268B1 - Electronic apparatus for speaker recognition and operation method thereof

Info

Publication number: KR102498268B1
Application number: KR1020220087549A
Authority: KR
Inventors: 임형우; 정대진
Original assignee: 국방과학연구소
Priority date: 2022-07-15
Filing date: 2022-07-15
Publication date: 2023-02-09

Abstract

A speaker recognition method of an electronic device according to the present disclosure includes: acquiring first information generated from a voice signal and second information generated from the voice signal; identifying first identification information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and identifying second identification information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and identifying the speaker of the voice signal based on the first identification information and the second identification information.

Description

Electronic device for speaker recognition and its operating method

The present disclosure relates to an electronic device for speaker recognition and an operating method thereof.

Speaker recognition is a promising key technology and one of the security technologies that rely on biometrics. Artificial intelligence is an information technology that complements human perception, judgment, and decision-making, and it is important for realizing future defense systems. Identification techniques that apply artificial intelligence are being developed for data that is difficult to identify in battlefield situations using human senses and cognition alone. In particular, because convolutional deep neural networks have contributed greatly to improving speaker recognition accuracy, attempts to apply deep learning to speaker recognition are continuing.

The disclosed embodiments provide an electronic device for speaker recognition and an operating method thereof. The technical problems to be solved by the embodiments are not limited to those described above, and other technical problems may be inferred from the following embodiments.

According to an embodiment, a speaker recognition method of an electronic device may be provided, the method including: obtaining first information generated from a voice signal and second information generated from the voice signal; identifying first identification information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and identifying second identification information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and identifying the speaker of the voice signal based on the first identification information and the second identification information.

In the speaker recognition method according to an embodiment, the first information may include information on a mel-spectrogram generated from the voice signal and information on any one of an X-vector, an I-vector, and a D-vector generated from the mel-spectrogram, and the second information may include information on the mel-spectrogram generated from the voice signal and information on another one of the X-vector, the I-vector, and the D-vector generated from the mel-spectrogram.

In the speaker recognition method according to an embodiment, the identifying of the speaker of the voice signal may include: generating final identification information based on the first identification information and the second identification information; and identifying one of at least one pre-registered speaker as the speaker of the voice signal by comparing identification information of the at least one pre-registered speaker with the final identification information.

In the speaker recognition method according to an embodiment, the first identification information may include a first speaker identification value for identifying the speaker of the voice signal, the second identification information may include a second speaker identification value for identifying the speaker of the voice signal, and the identifying of the speaker of the voice signal may include: determining a final speaker identification value from the first speaker identification value and the second speaker identification value; and identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value.

In the speaker recognition method according to an embodiment, the first identification information may include a first speaker identification value and a first gender value for identifying the speaker of the voice signal, the second identification information may include a second speaker identification value and a second gender value for identifying the speaker of the voice signal, and the identifying of the speaker of the voice signal may include: determining a final speaker identification value from the first speaker identification value and the second speaker identification value, and determining a final gender value from the first gender value and the second gender value; and identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value and comparing a gender value of the at least one pre-registered speaker with the final gender value.
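By way of non-limiting illustration, the comparison against pre-registered speakers might be sketched as follows. The registry contents and the names `Registered`, `REGISTRY`, and `match_speaker` are hypothetical, and the disclosure does not specify how registered values are stored; the sketch simply assumes an exact match on both the speaker identification value and the gender value (0 for male, 1 for female).

```python
from typing import List, NamedTuple, Optional


class Registered(NamedTuple):
    """One pre-registered speaker (hypothetical record layout)."""
    name: str
    speaker_id: int  # registered speaker identification value
    gender: int      # registered gender value: 0 = male, 1 = female


# Hypothetical registry used only for illustration.
REGISTRY: List[Registered] = [
    Registered("speaker_A", 0, 0),
    Registered("speaker_B", 1, 1),
]


def match_speaker(final_speaker_id: int, final_gender: int,
                  registry: List[Registered] = REGISTRY) -> Optional[str]:
    """Return the name of the registered speaker whose identification value
    and gender value both equal the final values, or None if there is none."""
    for entry in registry:
        if entry.speaker_id == final_speaker_id and entry.gender == final_gender:
            return entry.name
    return None
```

Requiring both values to agree means that a final speaker identification value whose gender value contradicts the registry yields no match rather than a possibly wrong speaker.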

In the speaker recognition method according to an embodiment, the identifying of the first identification information and the second identification information may include identifying a first reliability, a first speaker identification value, and a first gender value based on the output of the first neural network to which the first information is input, and identifying a second reliability, a second speaker identification value, and a second gender value based on the output of the second neural network to which the second information is input, and the identifying of the speaker of the voice signal may include: determining, based on the first reliability and the second reliability, a final speaker identification value from the first speaker identification value and the second speaker identification value and a final gender value from the first gender value and the second gender value; and identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value and comparing a gender value of the at least one pre-registered speaker with the final gender value.

In the speaker recognition method according to an embodiment, the first reliability may be calculated based on outputs of a softmax layer of the first neural network, and the second reliability may be calculated based on outputs of a softmax layer of the second neural network.
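By way of non-limiting illustration, one way to derive a reliability from softmax outputs is sketched below. The exact formula is not given in this passage; the sketch assumes an entropy-based measure (FIG. 4 refers to the entropy of the network outputs), normalized so that a one-hot output gives 1 and a uniform output gives 0. The function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_reliability(probs: List[float]) -> float:
    """Assumed reliability measure: 1 minus the normalized Shannon entropy.

    Returns 1.0 for a one-hot distribution and 0.0 for a uniform one.
    Requires at least two classes so that log(n) is non-zero.
    """
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(n)
```

A sharply peaked softmax output therefore yields a reliability close to 1, while an output spread evenly over the classes yields a reliability close to 0.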

In the speaker recognition method according to an embodiment, the determining may include: assigning a weight to each of the first speaker identification value and the second speaker identification value according to the ratio of the first reliability to the second reliability, and determining the final speaker identification value from the weighted first and second speaker identification values; and assigning a weight to each of the first gender value and the second gender value according to the ratio of the first reliability to the second reliability, and determining the final gender value from the weighted first and second gender values.
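By way of non-limiting illustration, reliability-ratio weighting might be sketched as follows. The sketch assumes that each network's identification value is available as a vector of per-class scores and that the final value is the class with the largest weighted score; neither assumption is stated in the disclosure, and `fuse_weighted` is a hypothetical name.

```python
from typing import List, Tuple


def fuse_weighted(scores_1: List[float], scores_2: List[float],
                  reliability_1: float, reliability_2: float
                  ) -> Tuple[int, List[float]]:
    """Weight two per-class score vectors by the ratio of their reliabilities
    and return (index of the winning class, fused score vector)."""
    total = reliability_1 + reliability_2
    if total == 0.0:
        # Assumed fallback: weight both networks equally if neither is reliable.
        w1 = w2 = 0.5
    else:
        w1 = reliability_1 / total
        w2 = reliability_2 / total
    fused = [w1 * a + w2 * b for a, b in zip(scores_1, scores_2)]
    return fused.index(max(fused)), fused
```

The same function can be applied once to the speaker scores and once to the gender scores, using the same pair of reliabilities in both calls.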

In the speaker recognition method according to an embodiment, the determining may include: determining one of the first speaker identification value and the second speaker identification value as the final speaker identification value by comparing the magnitudes of the first reliability and the second reliability; and determining one of the first gender value and the second gender value as the final gender value through the same magnitude comparison.
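By way of non-limiting illustration, selection by magnitude comparison might be sketched as follows. The function name is hypothetical, and favoring the first network when the two reliabilities are equal is an assumption; the disclosure does not state a tie-breaking rule.

```python
from typing import Tuple


def select_by_reliability(speaker_1: int, gender_1: int, reliability_1: float,
                          speaker_2: int, gender_2: int, reliability_2: float
                          ) -> Tuple[int, int]:
    """Return the (speaker identification value, gender value) pair produced
    by the network with the larger reliability."""
    if reliability_1 >= reliability_2:  # assumed tie-break: first network wins
        return speaker_1, gender_1
    return speaker_2, gender_2
```

Unlike ratio-based weighting, this variant discards the output of the less reliable network entirely.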

In the speaker recognition method according to an embodiment, the first reliability may include a 1-1 reliability for the first speaker identification value and a 1-2 reliability for the first gender value, the second reliability may include a 2-1 reliability for the second speaker identification value and a 2-2 reliability for the second gender value, and the determining may include determining the final speaker identification value from the first speaker identification value and the second speaker identification value based on the 1-1 reliability and the 2-1 reliability, and determining the final gender value from the first gender value and the second gender value based on the 1-2 reliability and the 2-2 reliability.
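By way of non-limiting illustration, the use of separate reliabilities for the speaker identification value and the gender value might be sketched as follows. The `NetworkOutput` record and the `decide` function are hypothetical, and the sketch assumes the magnitude-comparison variant for each task; a ratio-based weighting could be substituted.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NetworkOutput:
    """Outputs of one neural network (hypothetical record layout)."""
    speaker_id: int
    gender: int
    speaker_reliability: float  # reliability 1-1 (first) or 2-1 (second)
    gender_reliability: float   # reliability 1-2 (first) or 2-2 (second)


def decide(first: NetworkOutput, second: NetworkOutput) -> Tuple[int, int]:
    """Pick the speaker value and the gender value independently, each from
    the network that is more reliable for that particular value."""
    speaker = (first.speaker_id
               if first.speaker_reliability >= second.speaker_reliability
               else second.speaker_id)
    gender = (first.gender
              if first.gender_reliability >= second.gender_reliability
              else second.gender)
    return speaker, gender
```

With per-task reliabilities, the final speaker identification value and the final gender value may come from different networks, as in `decide(NetworkOutput(3, 0, 0.9, 0.4), NetworkOutput(1, 1, 0.5, 0.7))`, which takes the speaker value from the first network and the gender value from the second.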

The speaker recognition method according to an embodiment may include transmitting information on the identified speaker of the voice signal to an external device through a communication device, or outputting information on the identified speaker of the voice signal through a display.

According to an embodiment, an electronic device for speaker recognition may be provided, the electronic device including: a memory in which at least one program is stored; and a processor that, by executing the at least one program, obtains first information generated from a voice signal and second information generated from the voice signal, identifies first identification information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, identifies second identification information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input, and identifies the speaker of the voice signal based on the first identification information and the second identification information.

According to an embodiment, a non-transitory computer-readable recording medium may be provided on which a program for executing a speaker recognition method on a computer is recorded, the speaker recognition method including: obtaining first information generated from a voice signal and second information generated from the voice signal; identifying first identification information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and identifying second identification information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and identifying the speaker of the voice signal based on the first identification information and the second identification information.

According to the present disclosure, an electronic device may identify the speaker of a voice signal based on information generated from the voice signal, using a trained neural network. In addition, based on a first reliability of a first neural network and a second reliability of a second neural network, the electronic device determines a final speaker identification value from a first speaker identification value of the first neural network and a second speaker identification value of the second neural network, and determines a final gender value from a first gender value of the first neural network and a second gender value of the second neural network. The electronic device can therefore calculate a more reliable final speaker identification value and final gender value and recognize the speaker of the voice signal more accurately. The electronic device may provide improved speaker recognition performance by fusing the output of the neural network that receives information on a first feature vector with the output of the neural network that receives information on a second feature vector, taking the reliability of each output into account. In the field of defense, the electronic device may provide improved speaker recognition performance. For example, the electronic device may more accurately determine that the speaker of a voice signal is a key figure of the enemy forces.

FIG. 1 shows an electronic device according to the present disclosure.
FIG. 2 shows an embodiment of obtaining information for training a neural network.
FIG. 3 shows a convolutional neural network as an example of a neural network.
FIG. 4 shows an example of entropy for the outputs of a neural network.
FIG. 5 shows an embodiment in which an electronic device determines a final speaker identification value and a final gender value.
FIG. 6 shows an embodiment in which an electronic device operates.
FIG. 7 shows an operating method of an electronic device according to an embodiment.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The terms used in the embodiments have been selected, as far as possible, from general terms currently in wide use, in consideration of their functions in the present disclosure; however, they may vary depending on the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In certain cases, terms have been arbitrarily selected by the applicant, and in such cases their meanings will be described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meanings and the overall content of the present disclosure, not simply on their names.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "..unit" and "..module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented as hardware, software, or a combination of hardware and software.

The expression "at least one of a, b, and c" used throughout the specification may encompass 'a alone', 'b alone', 'c alone', 'a and b', 'a and c', 'b and c', or 'all of a, b, and c'.

Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present disclosure pertains can readily practice them. However, the present disclosure may be embodied in many different forms and is not limited to the embodiments described herein.

In the field of defense, it is important to accurately recognize key figures of the enemy forces and thereby maintain the advantage of friendly forces in information warfare. However, because voice data in most cases contains noise mixed with the speaker's voice, it may be difficult to recognize the speaker of a voice signal using human cognitive ability alone. Accordingly, the present disclosure seeks to improve speaker recognition performance through imitation learning by integrating the X-vector, I-vector, and D-vector. In addition, the present disclosure discloses an algorithm for finding an optimal combination of the X-vector, I-vector, and D-vector based on measuring the reliability of each.

Speaker recognition is mostly implemented using techniques based on Mel-Frequency Cepstral Coefficients (MFCC). Various feature vector extraction techniques, including the X-vector, I-vector, and D-vector, may be used to extract a person's voice characteristics. A spectrogram may also be used to visualize a person's voice. To implement speaker recognition based on imitation learning, a speaker identification value and a speaker gender value may be labeled for the spectrogram and the feature vector. The spectrogram and the information on the feature vector (for example, an X-vector, I-vector, or D-vector) may be generated from the same voice data. The deep-learning-based architecture used in the present disclosure may learn, through a backpropagation algorithm, in the direction that reduces the difference between its output value and the label data corresponding to the information generated from the voice. The trained network may receive X-vector, D-vector, or I-vector data together with information on the spectrogram, and may predict and output inferred values close to the labeled speaker and the speaker's gender. According to an embodiment, the artificial neural network may select and output one of five speaker identification values and one of two gender values.
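By way of non-limiting illustration, the quantity that backpropagation reduces might be sketched as a joint loss over the two outputs. The disclosure does not name a loss function; the sketch assumes a sum of cross-entropy terms over a five-class speaker output and a two-class gender output, and the function names are hypothetical.

```python
import math
from typing import List


def cross_entropy(probs: List[float], label: int) -> float:
    """Negative log-probability that the network assigns to the labeled class."""
    return -math.log(probs[label])


def joint_loss(speaker_probs: List[float], gender_probs: List[float],
               speaker_label: int, gender_label: int) -> float:
    """Assumed training objective: speaker loss plus gender loss.

    speaker_probs: softmax output over the five speaker identification values.
    gender_probs:  softmax output over the two gender values.
    """
    return (cross_entropy(speaker_probs, speaker_label)
            + cross_entropy(gender_probs, gender_label))
```

The loss is zero when both outputs put all probability on the labeled classes and grows as the outputs move away from the labels, which is the difference that training is described as reducing.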

FIG. 1 shows an electronic device according to the present disclosure.

The electronic device 100 includes a processor 110 and a memory 120. Only the components related to the present embodiments are shown in the electronic device 100 of FIG. 1. Accordingly, it will be apparent to those skilled in the art that the electronic device 100 may further include other general-purpose components in addition to those shown in FIG. 1.

The electronic device 100 may recognize a speaker. Specifically, the electronic device 100 may identify the speaker of a voice signal.

The processor 110 controls the overall functions of the electronic device 100. For example, the processor 110 controls the electronic device 100 as a whole by executing programs stored in the memory 120 of the electronic device 100. The processor 110 may be implemented as a central processing unit (CPU), a graphics processing unit (GPU), an application processor (AP), or the like provided in the electronic device 100, but is not limited thereto.

The memory 120 is hardware that stores various data processed in the electronic device 100, and may store data that has been processed and data that is to be processed by the electronic device 100. The memory 120 may also store applications, drivers, and the like to be run by the electronic device 100. The memory 120 may include random access memory (RAM) such as dynamic random access memory (DRAM) or static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), CD-ROM, Blu-ray or other optical disk storage, a hard disk drive (HDD), a solid state drive (SSD), or flash memory.

According to an embodiment, the processor 110 may obtain information generated from a voice signal. Specifically, the processor 110 may obtain first information generated from the voice signal and second information generated from the voice signal. Here, the voice signal may include the voice signal of the speaker to be identified, a signal in which the voice signal of the speaker to be identified is mixed with noise, the voice signals of the speaker to be identified and of other speakers, or a signal in which the voice signals of the speaker to be identified and of other speakers are mixed with noise. According to an embodiment, the first information may include information on a spectrogram generated from the voice signal and information on a first feature vector generated from the spectrogram, and the second information may include information on the spectrogram generated from the voice signal and information on a second feature vector generated from the spectrogram. The first feature vector may be any one of an X-vector, an I-vector, and a D-vector, and the second feature vector may be another one of the X-vector, the I-vector, and the D-vector. For example, the processor 110 may obtain, as the first information, information on the spectrogram generated from the voice signal and information on an X-vector generated from the spectrogram, and may obtain, as the second information, information on the spectrogram generated from the voice signal and information on an I-vector generated from the spectrogram. The spectrogram, X-vector, D-vector, and I-vector are described in detail below with reference to FIG. 2.

According to an embodiment, the processor 110 may obtain the first information generated from the voice signal and the second information generated from the voice signal from the outside through a communication device (not shown) in the electronic device 100. According to another embodiment, the processor 110 may obtain the first information and the second information from the memory 120.

FIG. 2 shows an embodiment in which a processor obtains information for training a neural network.

According to an embodiment, to improve speaker recognition performance, the processor 110 may derive a model by learning or training through a deep-learning CNN technique based on label data when an image (for example, the spectrogram of FIG. 2) and a feature vector (for example, an X-vector, a D-vector, or an I-vector) are provided as input information. In addition, when new input information that has not been learned (for example, a spectrogram and a feature vector generated from a voice signal) is provided, the processor 110 may identify the speaker of the voice signal based on the previously learned or trained model. According to an embodiment, the processor 110 may obtain information generated from a voice signal as information for learning or training a neural network. Specifically, the information generated from the voice signal may include a spectrogram generated from the voice signal and a feature vector extracted from the voice signal.

Referring to FIG. 2, the processor 110 may train a neural network using a spectrogram and a feature vector together with the label data corresponding to them. According to an embodiment, the processor 110 may use a mel-spectrogram and a feature vector together as input information of the neural network. Specifically, the processor 110 may use, as input information, the mel-spectrogram and a file storing the top two values of the feature vectors (for example, X-vector, D-vector, and I-vector) generated from the mel-spectrogram. According to an embodiment, the processor 110 may store the top two values of the feature vectors as a JSON, TXT, or YAML file. A spectrogram is a technique for visualizing the spectrum of speech as a graph; it combines the characteristics of a waveform, which shows how amplitude changes over time, with those of a spectrum, which shows how amplitude changes over frequency. According to an embodiment, the processor 110 may generate a spectrogram by performing a Short-Time Fourier Transform (STFT) on the speech signal. Because the spectrogram is a 2D image composed of RGB channels, it may be well suited to training a convolutional neural network. According to an embodiment, the processor 110 may train the neural network with a mel-spectrogram. A mel-spectrogram is a spectrogram whose frequency units have been converted to mel units by a predetermined formula. The mel-spectrogram can reflect the characteristic of the human auditory system, which is sensitive to frequency changes at low pitches and less sensitive to them at high pitches. The processor 110 may generate a mel-spectrogram by applying a mel filter to the spectrogram, and may train the neural network using the generated mel-spectrogram as input information.
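The text refers only to "a predetermined formula" for the mel conversion. As a minimal sketch, assuming the widely used formula m = 2595 * log10(1 + f / 700) (an assumption; the description does not say which formula is used), the following shows why equal steps in hertz shrink on the mel axis as frequency rises, and how filter edges spaced evenly in mels widen in hertz. The function names and the 0 to 8000 Hz range are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert hertz to mels with the common 2595*log10(1 + f/700) formula (assumed)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_band_edges(f_min, f_max, n_bands):
    """Edges (in Hz) of n_bands filters spaced evenly on the mel axis."""
    m_lo, m_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_hi - m_lo) / (n_bands + 1)
    return [mel_to_hz(m_lo + k * step) for k in range(n_bands + 2)]

# The same 100 Hz step covers more mels at low frequency than at high frequency.
low_step = hz_to_mel(200.0) - hz_to_mel(100.0)
high_step = hz_to_mel(5100.0) - hz_to_mel(5000.0)

# Ten mel-spaced bands between 0 Hz and 8000 Hz (hypothetical range).
edges = mel_band_edges(0.0, 8000.0, 10)
```

Because the bands are narrow at low frequencies and wide at high frequencies, a mel filter bank built on such edges keeps fine resolution where hearing is most sensitive.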

The processor 110 may use, as input information for training the neural network, not only the mel-spectrogram but also a feature vector extracted from the speech signal through a speaker embedding technique. According to an embodiment, the processor 110 may train the neural network to output identification information for identifying the speaker of a speech signal, based on a feature vector (for example, X-vector, I-vector, and D-vector) generated from the mel-spectrogram by a predetermined technique (for example, the X-vector, I-vector, and D-vector techniques) and on the mel-spectrogram generated from the speech signal. The processor 110 may obtain information about the mel-spectrogram and feature vector of FIG. 2 as input information for training the neural network, and may obtain, as output information for that input, the identification information (for example, a speaker identification value and a gender value) corresponding to the mel-spectrogram and feature vector. For example, the processor 110 may obtain, through an input unit of the electronic device 100, the speaker identification values (for example, 0, 1, 2, 3, and 4) and gender values (for example, 0 for male and 1 for female) of at least one pre-registered speaker. As another example, the processor 110 may obtain the mel-spectrogram, the feature vector, and the identification information from the memory 120.

According to an embodiment, the processor 110 may generate, from the mel-spectrogram, first information about any one of the X-vector, I-vector, and D-vector, and second information about another one of them. For example, the processor 110 may generate first information about the X-vector and second information about the I-vector from the mel-spectrogram. The X-vector, I-vector, and D-vector are feature vectors extracted from a speaker's speech signal using speaker embedding techniques, which may include the X-vector, D-vector, and I-vector techniques. All three can be generated from the mel-spectrogram, but their extraction methods have both similarities and differences. The X-vector and D-vector techniques are deep-learning-based speaker feature extraction techniques. Both train a speaker classification network and take one hidden layer as the embedding. However, the D-vector technique uses the last hidden layer of a fully-connected deep neural network as the feature, whereas the X-vector technique applies statistics pooling to the last hidden layer of a time-delay neural network (TDNN) and uses the result as the feature.
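To illustrate statistics pooling, the sketch below collapses a variable number of frame-level vectors into one fixed-length vector by concatenating the per-dimension mean and standard deviation over time. This is the usual form of statistics pooling in x-vector systems; the description does not say which statistics the embodiment uses, so the choice of statistics and the name `stats_pool` are assumptions.

```python
import math

def stats_pool(frames):
    """Concatenate per-dimension mean and population std over all frames."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return means + stds

# Two utterances with different numbers of frames (hypothetical values).
short = stats_pool([[1.0, 2.0], [3.0, 2.0]])
long = stats_pool([[0.0, 1.0]] * 7)
```

Both results have the same length (twice the frame dimension) regardless of how many frames went in, which is what lets a fixed-size layer follow the pooling step.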

The X-vector and I-vector techniques both represent a speech utterance in a compact form. However, the nature and performance of their extraction algorithms differ considerably. The I-vector technique represents the frames of a variable-length speech utterance as a fixed-length vector. I-vector extraction is essentially a dimensionality reduction of a Gaussian Mixture Model (GMM) supervector. The I-vector technique also extracts feature vectors in a manner similar to the eigenvoice adaptation technique or the Joint Factor Analysis (JFA) technique. However, the I-vector technique differs from the X-vector technique in that it extracts a feature vector for each sentence (or input speech sample).

The X-vector technique was developed more recently than the I-vector technique and is generally known to outperform it. In practice, extracting feature vectors with the X-vector technique can identify the speaker of a speech signal more accurately than using the I-vector technique. However, the X-vector technique may be vulnerable to disturbances (for example, car noise) and other noise in real environments. The I-vector technique, on the other hand, may recognize speakers less accurately than the X-vector technique but may be more robust to disturbances. The X-vector technique therefore cannot be said to be always superior to the I-vector technique. Because the X-vector, D-vector, and I-vector techniques extract feature vectors from a speech signal in different ways, they can complement one another. The present disclosure may improve speaker recognition performance, and thus identify the speaker accurately, by fusing the outputs of neural networks that each receive one of these feature vectors as input.

Referring again to FIG. 1, the processor 110 may identify first identification information for identifying the speaker of the speech signal based on the output of a first neural network to which the first information is input, and may identify second identification information for identifying the speaker of the speech signal based on the output of a second neural network to which the second information is input. The first neural network and the second neural network may be trained to infer identification information for identifying the speaker of a speech signal. In other words, the first neural network may be a neural network trained to infer the first identification information based on the mel-spectrogram generated from the speech signal and any one feature vector among the X-vector, D-vector, and I-vector (that is, a first feature vector), and the second neural network may be a neural network trained to infer the second identification information based on the mel-spectrogram generated from the speech signal and another one of those feature vectors (that is, a second feature vector).

According to an embodiment, the identification information may include a speaker identification value for identifying the speaker of the speech signal. Specifically, the first identification information identified based on the output of the first neural network may include a first speaker identification value, and the second identification information identified based on the output of the second neural network may include a second speaker identification value.

According to another embodiment, the identification information may include a speaker identification value for identifying the speaker of the speech signal and a gender value of the speaker. Specifically, the first identification information identified based on the output of the first neural network may include a first speaker identification value and a first gender value, and the second identification information identified based on the output of the second neural network may include a second speaker identification value and a second gender value. By considering the speaker's gender value in addition to the speaker identification value, the processor 110 may determine the speaker of the speech signal more accurately.

FIG. 3 shows a convolutional neural network as an example of a neural network.

As shown in FIG. 3, the convolutional neural network 300 may consist of convolutional layers, fully connected layers, and a softmax layer. According to an embodiment, the convolutional neural network 300 may consist of five convolutional layers, two fully connected layers, and a softmax layer. The convolutional neural network 300 may be a neural network trained on input information, namely information about a spectrogram (or mel-spectrogram) generated from a speech signal and information about any one of the X-vector, I-vector, and D-vector generated from that spectrogram (or mel-spectrogram), and on the output information for that input, namely identification information (for example, a speaker identification value or a gender value) for identifying the speaker of the speech signal. The convolutional neural network 300 may also use a Flatten function, which changes the shape of data (a tensor). For example, the Flatten function can convert 200x200x1 data into 40000x1 data.
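The 200x200x1 to 40000x1 conversion can be sketched in plain Python, with nested lists standing in for a tensor. Deep-learning frameworks provide their own flatten layers; the helper below is only an illustration, and its name is hypothetical.

```python
def flatten(tensor):
    """Recursively flatten nested lists into a single flat list."""
    if not isinstance(tensor, list):
        return [tensor]
    flat = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat

# A 200 x 200 x 1 "image" of zeros becomes a vector of 40000 values.
image = [[[0.0] for _ in range(200)] for _ in range(200)]
vector = flatten(image)
```

Flattening changes only the shape, not the values or their order, so the fully connected layers that follow see every feature-map element exactly once.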

According to an embodiment, the convolutional neural network 300, given information about the mel-spectrogram together with information about the X-vector generated from it, or information about the mel-spectrogram together with information about the I-vector generated from it, may output five candidate speaker identification values through five neurons and two candidate gender values through two neurons. According to another embodiment, information about the mel-spectrogram and the D-vector generated from it, or information about the mel-spectrogram and the I-vector generated from it, may be input to the convolutional neural network 300 to output candidate speaker identification values and candidate gender values. The processor 110 may identify, among the five candidate speaker identification values, the one with the highest softmax-layer output as the speaker identification value 310, and, among the two candidate gender values, the one with the highest softmax-layer output as the gender value 320. In other words, the processor 110 may identify the speaker identification value 310 and the gender value 320 as identification information for identifying the speaker of the speech signal.
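The selection step described above can be sketched as a softmax followed by an argmax, applied separately to the five speaker outputs and the two gender outputs. The logit values and function names are hypothetical, and treating the two groups of neurons as two separate softmax heads is an assumption about the layout.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick(logits):
    """Return (index of the highest softmax output, list of softmax outputs)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

speaker_logits = [0.2, 2.5, -1.0, 0.3, 0.1]  # five candidate speakers
gender_logits = [1.4, -0.6]                  # 0 = male, 1 = female
speaker_id, speaker_probs = pick(speaker_logits)
gender, gender_probs = pick(gender_logits)
```

The probabilities returned alongside each index are kept because the later reliability calculation operates on the full softmax output, not only on the winning index.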

In addition, the processor 110 may calculate the reliability of the first information or the second information generated from the input speech signal, based on the outputs of the softmax layer of the convolutional neural network 300. According to an embodiment, the processor 110 may calculate the reliability of the information generated from the speech signal by calculating the entropy of the softmax-layer outputs corresponding to the five neurons. In other words, the processor 110 may take the reliability of the speaker identification value 310 as the reliability of the information generated from the speech signal. According to another embodiment, the processor 110 may calculate the reliability of the information generated from the speech signal by calculating the entropy of the softmax-layer outputs corresponding to the two neurons. In other words, the processor 110 may take the reliability of the gender value 320 as the reliability of the information generated from the speech signal. According to yet another embodiment, the processor 110 may determine the reliability of the information generated from the speech signal using both the reliability of the speaker identification value 310 and the reliability of the gender value 320. For example, the processor 110 may take the average of the two reliabilities, or the lower of the two, as the reliability of the information generated from the speech signal.
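The two ways of combining the speaker and gender reliabilities named above (average or lower value) can be sketched as follows. The function name, the `mode` parameter, and the sample values are hypothetical.

```python
def combine_reliability(speaker_rel, gender_rel, mode="mean"):
    """Combine two reliabilities by averaging or by taking the lower one."""
    if mode == "mean":
        return (speaker_rel + gender_rel) / 2.0
    if mode == "min":
        return min(speaker_rel, gender_rel)
    raise ValueError("mode must be 'mean' or 'min'")

averaged = combine_reliability(2.0, 4.0, mode="mean")
conservative = combine_reliability(2.0, 4.0, mode="min")
```

Taking the lower value is the more cautious choice: one uncertain head is enough to pull the overall reliability down.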

The processor 110 may calculate a first reliability for the first information from the first neural network to which the first information is input, and a second reliability for the second information from the second neural network to which the second information is input. Specifically, the processor 110 may calculate the first reliability based on the outputs of the softmax layer of the first neural network, and the second reliability based on the outputs of the softmax layer of the second neural network.

According to an embodiment, the processor 110 may calculate the reliability according to Equations 1 and 2 below.

$$H = -\sum_{i=1}^{n} p_i \ln p_i \qquad \text{(Equation 1)}$$

$$c = \frac{1}{H + \epsilon} \qquad \text{(Equation 2)}$$

In Equations 1 and 2, $p_i$ represents the i-th of the n output values of the softmax layer of the neural network. The processor 110 may thus calculate the entropy $H$ through Equation 1 and calculate the reliability $c$ as the reciprocal of the entropy $H$ through Equation 2. Here, $\epsilon$ is a minimum value for preventing division by zero and may be set to, for example, 0.00001. In addition, $p_i$ may be calculated according to Equation 3 below.

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}} \qquad \text{(Equation 3)}$$

In Equation 3, $z_i$ represents the i-th value input to the softmax layer.
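The entropy-based reliability can be sketched as below. Two details are assumptions rather than statements from the text: the small constant is added to the entropy in the denominator (the text says only that it prevents division by zero), and the logarithm is natural, which is consistent with the entropy of 2.9957 reported for FIG. 4. Function and variable names are hypothetical.

```python
import math

EPS = 1e-5  # the example minimum value given in the text (0.00001)

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def reliability(probs, eps=EPS):
    """Reciprocal of the entropy, with eps guarding against division by zero."""
    return 1.0 / (entropy(probs) + eps)

uniform = [1.0 / 20] * 20          # every candidate equally likely
peaked = [0.96] + [0.01] * 4       # one candidate dominates
one_hot = [1.0, 0.0]               # zero entropy: only eps keeps this finite
```

A flat output yields high entropy and low reliability, a peaked output the opposite, and a one-hot output shows why the guard constant is needed.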

According to another embodiment, the processor 110 may calculate the reliability $c$ according to Equation 4.

In Equation 4, $p_{\max}$ represents the largest of the n output values of the softmax layer of the neural network. Calculating the reliability with Equation 4 may be faster than calculating it with Equations 1 and 2.
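The exact form of Equation 4 is not visible in this text; only its input, the largest softmax output, is described. A minimal sketch, assuming the reliability is simply that largest output (an assumption, not necessarily the formula used in the disclosure), shows where the speed gain would come from: a single pass to find a maximum, with no logarithms or sums of products.

```python
def max_softmax_reliability(probs):
    """Reliability taken as the largest softmax output (assumed form)."""
    return max(probs)

peaked_rel = max_softmax_reliability([0.96, 0.01, 0.01, 0.01, 0.01])
flat_rel = max_softmax_reliability([0.2, 0.2, 0.2, 0.2, 0.2])
```

Like the entropy-based measure, this value is high for a peaked output and low for a flat one, so it preserves the same ordering in these two cases.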

According to an embodiment, the processor 110 may identify a first speaker identification value and a first gender value for identifying the speaker of the speech signal based on the output of the first neural network, and a second speaker identification value and a second gender value based on the output of the second neural network. For example, the processor 110 may identify the speaker identification value $s$ and the gender value $g$ according to Equation 5 below (categorical mode).

$$s = \arg\max_{i} p_i \ \text{(over the candidate speaker identification values)}, \qquad g = \arg\max_{i} p_i \ \text{(over the candidate gender values)} \qquad \text{(Equation 5)}$$

In Equation 5, the speaker identification value $s$ is the i-th candidate speaker identification value, among those output by the neural network, whose softmax-layer output $p_i$ is highest, and the gender value $g$ is the i-th candidate gender value, among those output by the neural network, whose softmax-layer output $p_i$ is highest.

FIG. 4 shows an example of entropy for the outputs of a neural network.

The left graph 410 shows the output values of the softmax layer of a neural network according to an embodiment. Specifically, the left graph 410 shows $p_i$ for the candidate speaker identification values and indicates that the $p_i$ values are all uniform. When the probability values output by the neural network are similar to one another, the entropy is high, so the reliability, being inversely related to it, is low. Conversely, when the output probability values are concentrated on a few candidates, the entropy is low and the reliability is correspondingly high. Referring to FIG. 4, in the left graph 410 the candidate speaker identification values have similar probabilities, so the entropy is calculated as a relatively large value of 2.9957, which may mean that the reliability of the information generated from the speech signal input to the neural network is low. In general, the output for an example that the artificial neural network finds difficult (for example, the left graph 410) is not concentrated on a single probability value; several probability values are similar, so the reliability is low.
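As a check on the reported value, and assuming natural logarithms and a perfectly uniform output (the number of candidates plotted in graph 410 is not stated in the text), a uniform distribution over n classes has entropy ln n, and n = 20 reproduces the figure of 2.9957:

```latex
H = -\sum_{i=1}^{20} \frac{1}{20}\,\ln\frac{1}{20} = \ln 20 \approx 2.9957
```

This is also the largest entropy a 20-way output can have, so any less uniform output over the same candidates would score a higher reliability.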

The right graph 420 shows the output values of the softmax layer of a neural network according to another embodiment. Specifically, the right graph 420 shows $p_i$ for the candidate speaker identification values and indicates that $p_i$ is high for a specific candidate speaker identification value. Referring to FIG. 4, in the right graph 420 the probabilities are concentrated on a few candidates, so the entropy is calculated as a relatively small value of 1.0457, which may mean that the reliability of the information generated from the speech signal input to the neural network is high. The present disclosure describes a method of using multi-channel information (for example, information about the X-vector, I-vector, and D-vector generated from the mel-spectrogram) in a mutually complementary way when the speaker is difficult to recognize, by fusing the reliabilities calculated from the respective neural networks.

Referring again to FIG. 1, the processor 110 may identify the speaker of the speech signal based on the first identification information and the second identification information. Specifically, the processor 110 may generate final identification information based on the first identification information and the second identification information, and may compare the identification information of at least one pre-registered speaker with the final identification information to identify one of those speakers as the speaker of the speech signal. According to an embodiment, the processor 110 may determine a final speaker identification value from the first speaker identification value and the second speaker identification value, and may compare the speaker identification value of at least one pre-registered speaker with the final speaker identification value to identify the speaker of the speech signal. According to another embodiment, the processor 110 may determine a final speaker identification value from the first speaker identification value and the second speaker identification value, and a final gender value from the first gender value and the second gender value. The processor 110 may then compare the speaker identification value of at least one pre-registered speaker with the final speaker identification value, and the gender value of at least one pre-registered speaker with the final gender value, to identify the speaker of the speech signal. The processor 110 may transmit information about the identified speaker to an external device through a communication device, or output it through a display.

According to an embodiment, the processor 110 may use the first reliability and the second reliability, determined from the output of the first neural network and the output of the second neural network, to determine a final speaker identification value from the first speaker identification value and the second speaker identification value, and to determine a final gender value from the first gender value and the second gender value.

According to an embodiment, the first neural network may be a network in which a speech signal is represented as a mel-spectrogram and a feature vector (for example, an X-vector) and is labeled with the corresponding identification information for identifying the speaker of the speech signal. The second neural network may be a network in which a speech signal is represented as a mel-spectrogram and a feature vector (for example, an I-vector) and is labeled with the corresponding identification information. Each network may receive input information and predict identification information (for example, a speaker identification value and a gender value). As a specific embodiment, a method may be provided for fusing the first speaker identification value and first gender value from the first neural network with the second speaker identification value and second gender value from the second neural network. For example, the values may be fused by averaging the speaker identification values and the gender values produced by the first and second neural networks. According to an embodiment, when the reliabilities determined through the two neural networks differ, fusing the output values of the neural networks based on those reliabilities may be more effective than averaging. The specific embodiments below describe examples in which the output values of each neural network (that is, the speaker identification value and the gender value) are combined in a weighted sum based on each network's reliability, or in which the output value of the more reliable neural network is chosen as the final output value.

According to an embodiment, the processor 110 may assign a weight to each of the first speaker identification value and the second speaker identification value according to the ratio of the first reliability to the second reliability, and may determine the final speaker identification value from the weighted first and second speaker identification values. Likewise, the processor 110 may assign a weight to each of the first gender value and the second gender value according to the ratio of the first reliability to the second reliability, and may determine the final gender value from the weighted first and second gender values.

For example, the processor 110 may determine the final speaker identification value and the final gender value according to Equation 6 below.

[Equation 6]

In Equation 6, the two reliability terms may represent the first reliability and the second reliability, respectively. For example, the first reliability may mean the reliability of the first information generated from the voice signal, and the second reliability may mean the reliability of the second information generated from the voice signal. The remaining terms of Equation 6 may represent the first speaker identification value, the second speaker identification value, the first gender value, and the second gender value. Accordingly, the processor 110 may determine the final speaker identification value by fusing the first speaker identification value and the second speaker identification value, each weighted according to the ratio between the first reliability and the second reliability, and may determine the final gender value by fusing the first gender value and the second gender value, each weighted according to the same ratio.
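The ratio-based weighting described above can be sketched as follows. This is a minimal illustration, assuming each weight is the network's share of the summed reliabilities; the function name and the sample numbers are hypothetical and are not taken from the disclosure.

```python
def fuse_by_reliability_ratio(value_1, value_2, reliability_1, reliability_2):
    """Fuse two network outputs, weighting each by its share of total reliability.

    Assumption: weight_k = reliability_k / (reliability_1 + reliability_2).
    """
    total = reliability_1 + reliability_2
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive number")
    weight_1 = reliability_1 / total
    weight_2 = reliability_2 / total
    return weight_1 * value_1 + weight_2 * value_2


# Hypothetical outputs: network 1 says speaker 2 / gender 1,
# network 2 says speaker 3 / gender 0; network 1 is three times as reliable.
final_speaker_id = fuse_by_reliability_ratio(2.0, 3.0, 0.9, 0.3)
final_gender = fuse_by_reliability_ratio(1.0, 0.0, 0.9, 0.3)
```

With equal reliabilities this reduces to the plain average, so averaging is a special case of the weighted fusion.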

According to another embodiment, the processor 110 may determine either the first speaker identification value or the second speaker identification value as the final speaker identification value by comparing the magnitudes of the first reliability and the second reliability, and may determine either the first gender value or the second gender value as the final gender value by the same comparison.

For example, the processor 110 may determine the final speaker identification value and the final gender value according to Equation 7 below.

[Equation 7]

In Equation 7, the two reliability terms may represent the first reliability and the second reliability, respectively. For example, the first reliability may mean the reliability of the first information generated from the voice signal, and the second reliability may mean the reliability of the second information generated from the voice signal. The remaining terms of Equation 7 may represent the first speaker identification value, the second speaker identification value, the first gender value, and the second gender value. Accordingly, based on the magnitude comparison between the first reliability and the second reliability, the processor 110 may determine the speaker identification value or gender value obtained from the more reliable information generated from the voice signal as the final speaker identification value or the final gender value.
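The selection by magnitude comparison can be sketched in a few lines. This is only an illustration: the function name is hypothetical, and resolving a tie in favor of the first network is an assumption, since the text does not say how equal reliabilities are handled.

```python
def select_by_reliability(value_1, value_2, reliability_1, reliability_2):
    """Return the output of whichever network reports the higher reliability.

    Assumption: on a tie, the first network's output is kept.
    """
    return value_1 if reliability_1 >= reliability_2 else value_2


# Hypothetical outputs: the second network is more reliable, so its values win.
final_speaker_id = select_by_reliability(2, 3, 0.4, 0.7)
final_gender = select_by_reliability(1, 0, 0.4, 0.7)
```

Unlike ratio weighting, this variant always returns one of the original outputs, so the result stays an integer whenever the inputs are integers.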

According to an embodiment, the first reliability may include a 1-1 reliability for the first speaker identification value and a 1-2 reliability for the first gender value, and the second reliability may include a 2-1 reliability for the second speaker identification value and a 2-2 reliability for the second gender value. The processor 110 may determine the final speaker identification value from the first speaker identification value and the second speaker identification value based on the 1-1 reliability and the 2-1 reliability, and may determine the final gender value from the first gender value and the second gender value based on the 1-2 reliability and the 2-2 reliability. According to an embodiment, the processor 110 may assign a weight to each of the first speaker identification value and the second speaker identification value according to the ratio between the 1-1 reliability and the 2-1 reliability, and may determine the final speaker identification value from the weighted first and second speaker identification values. Further, the processor 110 may assign a weight to each of the first gender value and the second gender value according to the ratio between the 1-2 reliability and the 2-2 reliability, and may determine the final gender value from the weighted first and second gender values. According to another embodiment, the processor 110 may determine either the first speaker identification value or the second speaker identification value as the final speaker identification value by comparing the magnitudes of the 1-1 reliability and the 2-1 reliability, and may determine either the first gender value or the second gender value as the final gender value by comparing the magnitudes of the 1-2 reliability and the 2-2 reliability.
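The per-output reliabilities can be illustrated with a small sketch in which each network reports a separate reliability for its speaker identification value and for its gender value. The class, field, and function names are hypothetical, and the ratio weighting assumes each weight is the reliability's share of the pairwise sum.

```python
from dataclasses import dataclass


@dataclass
class NetworkOutput:
    speaker_id: float
    gender: float
    speaker_reliability: float  # 1-1 reliability (network 1) or 2-1 reliability (network 2)
    gender_reliability: float   # 1-2 reliability (network 1) or 2-2 reliability (network 2)


def _weighted(v1, v2, r1, r2):
    # Assumed form of ratio weighting: reliability-weighted mean of the two values.
    return (r1 * v1 + r2 * v2) / (r1 + r2)


def fuse_outputs(out_1, out_2, mode="ratio"):
    """Fuse speaker and gender values separately, each with its own reliability pair."""
    if mode == "ratio":
        speaker = _weighted(out_1.speaker_id, out_2.speaker_id,
                            out_1.speaker_reliability, out_2.speaker_reliability)
        gender = _weighted(out_1.gender, out_2.gender,
                           out_1.gender_reliability, out_2.gender_reliability)
    elif mode == "select":
        speaker = (out_1.speaker_id
                   if out_1.speaker_reliability >= out_2.speaker_reliability
                   else out_2.speaker_id)
        gender = (out_1.gender
                  if out_1.gender_reliability >= out_2.gender_reliability
                  else out_2.gender)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return speaker, gender


# Hypothetical outputs: network 1 is confident about the speaker,
# network 2 is confident about the gender.
out_1 = NetworkOutput(speaker_id=2.0, gender=1.0, speaker_reliability=0.8, gender_reliability=0.2)
out_2 = NetworkOutput(speaker_id=3.0, gender=0.0, speaker_reliability=0.2, gender_reliability=0.6)
ratio_result = fuse_outputs(out_1, out_2, mode="ratio")
select_result = fuse_outputs(out_1, out_2, mode="select")
```

Keeping the two reliability pairs separate lets the final speaker identification value follow one network while the final gender value follows the other, as in the `select` result here.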

According to an embodiment, the processor 110 may determine the final speaker identification value by fusing the first speaker identification value and the second speaker identification value, and may determine the final gender value by fusing the first gender value and the second gender value. For example, the processor 110 may determine the final speaker identification value as the average of the first and second speaker identification values, and may determine the final gender value as the average of the first and second gender values. In addition, in an environment where disturbances occur frequently, the second information including information about the I-vector may be more reliable than the first information including information about the X-vector, so the processor 110 may determine the final speaker identification value and the final gender value with greater emphasis on the second information than on the first information.

According to an embodiment, the electronic device 100 may determine the final speaker identification value and the final gender value through the pre-trained first neural network and second neural network, and may identify the speaker of the voice signal by comparing the final speaker identification value with the speaker identification value of at least one pre-registered speaker and comparing the final gender value with the gender value of the at least one pre-registered speaker. The electronic device 100 may transmit information about the speaker of the voice signal, identified through the final speaker identification value and the final gender value, to an external device, or may output it through a display. Because the electronic device 100 recognizes the speaker of the voice signal through the gender value as well as the speaker identification value, it may determine that the speaker has been recognized incorrectly when the gender value of the identified speaker differs from the gender value of the actual pre-registered speaker. For example, if a pre-registered key figure of an enemy force (for example, the enemy leader) is male but the gender value of the identified speaker corresponds to female, it may be determined that the key figure has not been recognized accurately. Accordingly, the electronic device 100 may improve speaker recognition performance by using the gender value together with the speaker identification value. In addition, by using multi-channel information (for example, information about the X-vector, the I-vector, and the D-vector), the electronic device 100 may identify the speaker more accurately than when using only information about the mel-spectrogram generated from the voice signal.
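The gender cross-check described above can be sketched as a simple consistency test. This is an illustration under stated assumptions: gender is coded 0 for male and 1 for female, a fused value is mapped to a class with a 0.5 cut-off, and the registry contents and function name are hypothetical.

```python
def gender_matches_registration(speaker_id, final_gender, registered_genders):
    """Return True when the fused gender agrees with the registered speaker's gender.

    Assumptions: 0 = male, 1 = female, and 0.5 is the decision cut-off.
    """
    predicted_gender = 1 if final_gender >= 0.5 else 0
    return registered_genders[speaker_id] == predicted_gender


# Hypothetical registry: speaker 0 is registered as male (0).
registered_genders = {0: 0, 1: 1, 2: 1}

# The networks point to speaker 0 but the fused gender value leans female,
# so the recognition result is flagged as unreliable.
recognition_is_consistent = gender_matches_registration(0, 0.9, registered_genders)
```

A `False` result corresponds to the case in the text where the device judges that the speaker was not recognized accurately.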

FIG. 5 illustrates an embodiment in which the processor determines a final speaker identification value and a final gender value.

Referring to FIG. 5, the processor 110 may identify a first speaker identification value and a first gender value based on the output of the neural network 530 to which the first information 510 generated from the voice signal is input, and may identify a 1-1 reliability for the first speaker identification value and a 1-2 reliability for the first gender value. Also, the processor 110 may identify a second speaker identification value and a second gender value based on the output of the neural network 540 to which the second information 520 generated from the voice signal is input, and may identify a 2-1 reliability for the second speaker identification value and a 2-2 reliability for the second gender value.

According to an embodiment, the processor 110 may determine the final speaker identification value from the first speaker identification value and the second speaker identification value based on the 1-1 reliability and the 2-1 reliability, and may determine the final gender value from the first gender value and the second gender value based on the 1-2 reliability and the 2-2 reliability. For example, the processor 110 may determine the final speaker identification value from the first and second speaker identification values according to the ratio between the 1-1 reliability and the 2-1 reliability or according to a magnitude comparison between them, and may determine the final gender value from the first and second gender values according to the ratio between the 1-2 reliability and the 2-2 reliability or according to a magnitude comparison between them. According to an embodiment, the speaker identification value and the gender value of a pre-registered speaker may be integers. For example, if five speakers are pre-registered, their speaker identification values may be 0, 1, 2, 3, and 4, and each speaker may have a gender value of 0 if male and 1 if female. According to an embodiment, when the final speaker identification value or the final gender value calculated according to the reliability is not an integer, the processor 110 may identify, as the speaker of the voice signal, the pre-registered speaker whose speaker identification value or gender value is closest to the calculated final speaker identification value or final gender value. For example, if the calculated final speaker identification value is 2.3 and the final gender value is 1.2, the pre-registered speaker with a speaker identification value of 2 and a gender value of 1 may be identified as the speaker of the voice signal.
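The nearest-speaker lookup in this example can be sketched as follows. The text says only that the closest registered values are chosen; combining the two distances as a sum of absolute differences is an assumption, and the registry contents and function name are hypothetical.

```python
# Hypothetical registry of (speaker identification value, gender value) pairs,
# with gender coded as 0 = male and 1 = female as in the example above.
REGISTERED_SPEAKERS = [(0, 0), (1, 1), (2, 1), (3, 0), (4, 1)]


def nearest_registered_speaker(final_speaker_id, final_gender, registered):
    """Pick the registered speaker closest to the fused (non-integer) values.

    Assumption: closeness is the sum of absolute differences of both values.
    """
    return min(
        registered,
        key=lambda s: abs(s[0] - final_speaker_id) + abs(s[1] - final_gender),
    )


# The worked example from the text: fused values 2.3 and 1.2.
matched_speaker = nearest_registered_speaker(2.3, 1.2, REGISTERED_SPEAKERS)
```

For the fused values 2.3 and 1.2 the lookup returns the speaker with identification value 2 and gender value 1, matching the example in the text.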

In FIG. 5, for convenience of description, the output of the neural network 530 to which the first information 510 is input is shown as a speaker identification value and a gender value for identifying a single speaker of the voice signal, but that output may be a plurality of values (speaker identification values and gender values) for identifying a plurality of speakers of the voice signal. Similarly, the output of the neural network 540 to which the second information 520 is input may be a plurality of values (speaker identification values and gender values) for identifying a plurality of speakers of the voice signal. According to an embodiment, when first values are output from the neural network 530 and second values are output from the neural network 540, the processor 110 may determine, from the first values and the second values, final values (speaker identification values and gender values) for identifying the plurality of speakers of the voice signal.

FIG. 6 illustrates an embodiment in which the electronic device operates.

In the data acquisition step (S610), the electronic device 100 may acquire information generated from a voice signal. For example, the electronic device 100 may acquire, from an external device, first information including information about a mel-spectrogram generated from the voice signal and information about an X-vector generated from the mel-spectrogram, and second information including information about the mel-spectrogram generated from the voice signal and information about an I-vector generated from the mel-spectrogram. In addition, the electronic device 100 may acquire identification information (a speaker identification value and a gender value) for identifying the speaker of the voice signal corresponding to each of the first and second information. According to an embodiment, the electronic device 100 may acquire the information generated from the voice signal and the identification information for identifying the speaker of the voice signal from the memory 120, where they are stored.

In the data editing step (S620), the electronic device 100 may edit the data acquired in S610 to remove portions that are not needed for training the neural network. For example, when the information generated from previously acquired voice signals, or the identification information, is duplicated because of a simple human error, the electronic device 100 may remove the duplicated data before the neural network is trained.
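The duplicate removal in this step can be sketched as follows. The sample layout (a feature list plus a speaker identification value and a gender value) and the function name are hypothetical; the sketch treats two samples as duplicates only when features and labels are all identical.

```python
def remove_duplicate_samples(samples):
    """Drop exact duplicate (features, speaker_id, gender) samples, keeping first occurrences."""
    seen = set()
    unique = []
    for features, speaker_id, gender in samples:
        # Lists are not hashable, so the features are converted to a tuple for the key.
        key = (tuple(features), speaker_id, gender)
        if key not in seen:
            seen.add(key)
            unique.append((features, speaker_id, gender))
    return unique


# Hypothetical training data in which the first sample was entered twice.
raw_samples = [
    ([0.1, 0.2, 0.3], 0, 0),
    ([0.4, 0.5, 0.6], 1, 1),
    ([0.1, 0.2, 0.3], 0, 0),
]
clean_samples = remove_duplicate_samples(raw_samples)
```

Order is preserved, so the edited data set keeps the original acquisition sequence.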

In the training step (S630), the electronic device 100 may train the neural network based on the data edited in S620. Specifically, the electronic device 100 may train the neural network using the first information and the second information as input information and the speaker identification value and the gender value as labels. For example, the electronic device 100 may train a first convolutional neural network using, as input information, the first information generated from the voice signal and including information about the X-vector, together with the speaker identification value and the gender value as labels, and may train a second convolutional neural network using, as input information, the second information generated from the voice signal and including information about the I-vector, together with the speaker identification value and the gender value as labels.

In the testing step (S640), the electronic device 100 may use the neural network trained in S630 to obtain identification information for identifying the speaker of the voice signal, and may identify the speaker of the voice signal according to that identification information. Specifically, the electronic device 100 may determine a final speaker identification value and a final gender value based on each piece of identification information, and may identify the speaker of the voice signal by comparing them with the speaker identification value and gender value of at least one pre-registered speaker. Accordingly, the electronic device 100 may accurately recognize a pre-registered enemy leader in a recording obtained from the enemy.

According to an embodiment, the electronic device 100 may provide information about the identified speaker of the voice signal. Specifically, the electronic device 100 may transmit the information about the identified speaker to an external device through a communication device (not shown), or may output it through a display (not shown). When the output information about the speaker (for example, the speaker's gender) differs from the information of the actual pre-registered speaker, it may be determined that the electronic device 100 did not recognize the speaker accurately. In that case, the electronic device 100 may retrain the neural network using information about a feature vector (for example, the D-vector) different from the previously used feature vectors (for example, the X-vector and the I-vector), information about the mel-spectrogram, and the labels corresponding to them.

FIG. 7 illustrates a method of operating an electronic device according to an embodiment.

Since each step of the operating method of FIG. 7 may be performed by the electronic device 100 of FIG. 1, descriptions of FIG. 7 that overlap with the foregoing are omitted.

In step S710, the electronic device 100 may acquire first information generated from a voice signal and second information generated from the voice signal. The first information may include information about a mel-spectrogram generated from the voice signal and information about any one of an X-vector, an I-vector, and a D-vector generated from the mel-spectrogram, and the second information may include information about the mel-spectrogram generated from the voice signal and information about another one of the X-vector, the I-vector, and the D-vector generated from the mel-spectrogram.

In step S720, the electronic device 100 may identify first identification information for identifying the speaker of the voice signal based on the output of the first neural network to which the first information is input, and may identify second identification information for identifying the speaker of the voice signal based on the output of the second neural network to which the second information is input. To identify the speaker of the voice signal, the electronic device 100 may determine a final speaker identification value from the first speaker identification value and the second speaker identification value, and may compare the final speaker identification value with the speaker identification value of at least one pre-registered speaker. According to an embodiment, the first neural network may be a neural network trained to infer the first identification information for identifying the speaker of the voice signal based on the first information generated from the voice signal, and the second neural network may be a neural network trained to infer the second identification information for identifying the speaker of the voice signal based on the second information generated from the voice signal.

In step S730, the electronic device 100 may identify the speaker of the voice signal based on the first identification information and the second identification information. The identification information may include a speaker identification value and a gender value. The electronic device 100 may determine a final speaker identification value for identifying the speaker of the voice signal from the first speaker identification value and the second speaker identification value, and may determine a final gender value for identifying the speaker of the voice signal from the first gender value and the second gender value. The electronic device 100 may compare the final speaker identification value and the final gender value with the speaker identification value and gender value of at least one pre-registered speaker, and may recognize one of the at least one pre-registered speaker as the speaker of the voice signal.

According to an embodiment, the electronic device 100 may identify a first reliability, a first speaker identification value, and a first gender value based on the output of the first neural network to which the first information is input, and may identify a second reliability, a second speaker identification value, and a second gender value based on the output of the second neural network to which the second information is input. Based on the first reliability and the second reliability, the electronic device 100 may determine a final speaker identification value for identifying the speaker of the voice signal from the first speaker identification value and the second speaker identification value, and may determine a final gender value for identifying the speaker of the voice signal from the first gender value and the second gender value. Here, the first reliability may be calculated based on the outputs of the softmax layer of the first neural network, and the second reliability may be calculated based on the outputs of the softmax layer of the second neural network.
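The passage above states only that each reliability is calculated from the softmax-layer outputs. One common choice, assumed here and not confirmed by the disclosure, is to take the largest softmax probability as the reliability of the predicted class. The function names and sample logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw network scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_and_reliability(logits):
    """Return (predicted class index, reliability).

    Assumption: reliability is the maximum softmax probability.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]


# Hypothetical speaker logits from one network for three registered speakers.
speaker_value, speaker_reliability = prediction_and_reliability([0.1, 2.0, 0.3])
```

Applying the same function to a network's gender outputs would give a separate reliability for the gender value, which fits the per-output reliabilities described in the text.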

According to an embodiment, the first reliability may include a 1-1 reliability for the first speaker identification value and a 1-2 reliability for the first gender value, and the second reliability may include a 2-1 reliability for the second speaker identification value and a 2-2 reliability for the second gender value. The electronic device 100 may determine the final speaker identification value from the first speaker identification value and the second speaker identification value based on the 1-1 reliability and the 2-1 reliability, and may determine the final gender value from the first gender value and the second gender value based on the 1-2 reliability and the 2-2 reliability. The electronic device 100 may identify the speaker of the voice signal by comparing the determined final speaker identification value and final gender value with the speaker identification value and gender value of a pre-registered speaker. In addition, the electronic device 100 may transmit information about the identified speaker of the voice signal to an external device through a communication device, or may output it through a display.

An electronic device according to the above-described embodiments may include a processor, a memory for storing and executing program data, permanent storage such as a disk drive, a communication port for communicating with an external device, and user interface devices such as a touch panel, keys, and buttons. Methods implemented as software modules or algorithms may be stored on a computer-readable recording medium as computer-readable code or program instructions executable on the processor. Here, the computer-readable recording medium includes magnetic storage media (e.g., read-only memory (ROM), random-access memory (RAM), floppy disks, hard disks) and optical reading media (e.g., CD-ROM, DVD (Digital Versatile Disc)). The computer-readable recording medium may be distributed among computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. The medium may be readable by a computer, stored in a memory, and executed by a processor.

The present embodiment may be described in terms of functional block configurations and various processing steps. These functional blocks may be implemented with any number of hardware and/or software components that perform specific functions. For example, an embodiment may employ integrated circuit components such as memory, processing, logic, and look-up tables, which may execute various functions under the control of one or more microprocessors or other control devices. Just as components may be implemented through software programming or software elements, the present embodiment may be implemented in a programming or scripting language such as C, C++, Java, or assembler, including various algorithms implemented as combinations of data structures, processes, routines, or other programming constructs. Functional aspects may be implemented as algorithms running on one or more processors. In addition, the present embodiment may employ conventional techniques for electronic environment setting, signal processing, and/or data processing. Terms such as "mechanism", "element", "means", and "configuration" may be used broadly and are not limited to mechanical and physical components. These terms may include the meaning of a series of software routines in association with a processor or the like.

The foregoing embodiments are merely examples, and other embodiments may be implemented within the scope of the claims set forth below.

Claims

A speaker recognition method of an electronic device, the method comprising:
obtaining first information generated from a voice signal and second information generated from the voice signal;
checking first confirmation information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and checking second confirmation information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and
identifying the speaker of the voice signal based on the first confirmation information and the second confirmation information,
wherein the first information includes information about a spectrogram generated from the voice signal and information about a first feature vector generated from the spectrogram by any one of an X-vector technique, a D-vector technique, and an I-vector technique,
wherein the second information includes information about the spectrogram and information about a second feature vector generated from the spectrogram by another one of the X-vector technique, the D-vector technique, and the I-vector technique,
wherein the first confirmation information includes a first speaker identification value and a first gender value for identifying the speaker of the voice signal,
wherein the second confirmation information includes a second speaker identification value and a second gender value for identifying the speaker of the voice signal, and
wherein the identifying of the speaker of the voice signal comprises:
determining a final speaker identification value from the first speaker identification value and the second speaker identification value, and determining a final gender value from the first gender value and the second gender value; and
identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value, and comparing a gender value of the at least one pre-registered speaker with the final gender value.
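
As an illustrative, non-limiting sketch of the final two steps recited above (fusing the two networks' values, then comparing against pre-registered speakers), the following Python example may help. The claim does not fix a fusion rule or a comparison rule, so the plain average in `fuse`, the tolerances `speaker_tol` and `gender_tol`, the scalar representation of the values, and the names `RegisteredSpeaker` and `identify` are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class RegisteredSpeaker:
    """Hypothetical record for one pre-registered speaker."""
    name: str
    speaker_value: float  # pre-registered speaker identification value
    gender_value: float   # pre-registered gender value


def fuse(a: float, b: float) -> float:
    """Assumed fusion rule: plain average of the two network outputs."""
    return (a + b) / 2.0


def identify(
    first: Tuple[float, float],    # (first speaker value, first gender value)
    second: Tuple[float, float],   # (second speaker value, second gender value)
    registry: Iterable[RegisteredSpeaker],
    speaker_tol: float = 0.5,      # assumed matching tolerance
    gender_tol: float = 0.5,       # assumed matching tolerance
) -> Optional[str]:
    """Return the name of the closest registered speaker, or None if no match."""
    final_speaker = fuse(first[0], second[0])
    final_gender = fuse(first[1], second[1])

    best_name: Optional[str] = None
    best_distance: Optional[float] = None
    for candidate in registry:
        speaker_distance = abs(candidate.speaker_value - final_speaker)
        gender_distance = abs(candidate.gender_value - final_gender)
        # Both the speaker value and the gender value must agree.
        if speaker_distance <= speaker_tol and gender_distance <= gender_tol:
            if best_distance is None or speaker_distance < best_distance:
                best_name = candidate.name
                best_distance = speaker_distance
    return best_name
```

Requiring agreement on both values means a candidate whose speaker identification value is close but whose gender value disagrees is rejected, which is one plausible reading of why the claim compares the gender value in addition to the speaker identification value.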

(Deleted)

The method of claim 1,
wherein the identifying of the speaker of the voice signal comprises:
generating final confirmation information based on the first confirmation information and the second confirmation information; and
identifying one of at least one pre-registered speaker as the speaker of the voice signal by comparing confirmation information of the at least one pre-registered speaker with the final confirmation information.

The method of claim 1,
wherein the first confirmation information includes a third speaker identification value for identifying the speaker of the voice signal,
wherein the second confirmation information includes a fourth speaker identification value for identifying the speaker of the voice signal, and
wherein the identifying of the speaker of the voice signal comprises:
determining a first final speaker identification value from the third speaker identification value and the fourth speaker identification value; and
identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the first final speaker identification value.

(Deleted)

The method of claim 1,
wherein the checking of the first confirmation information and the second confirmation information comprises:
determining a first reliability, a third speaker identification value, and a third gender value based on the output of the first neural network to which the first information is input, and determining a second reliability, a fourth speaker identification value, and a fourth gender value based on the output of the second neural network to which the second information is input, and
wherein the identifying of the speaker of the voice signal comprises:
determining, based on the first reliability and the second reliability, a first final speaker identification value from the third speaker identification value and the fourth speaker identification value, and a first final gender value from the third gender value and the fourth gender value; and
identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the first final speaker identification value, and comparing a gender value of the at least one pre-registered speaker with the first final gender value.

The method of claim 6,
wherein the first reliability is calculated based on outputs of a softmax layer of the first neural network, and
wherein the second reliability is calculated based on outputs of a softmax layer of the second neural network.
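
One common way to derive a reliability from softmax outputs is to take the largest class probability. The claim only states that the reliability is "calculated based on" the softmax outputs, so the maximum-probability rule and the function names below are assumptions for illustration, not the method fixed by the disclosure.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over raw network scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract peak to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]


def reliability(probabilities: Sequence[float]) -> float:
    """Assumed reliability measure: the largest softmax probability."""
    return max(probabilities)


# A peaked distribution yields a higher reliability than a flat one.
confident = reliability(softmax([4.0, 0.5, 0.1]))
uncertain = reliability(softmax([1.0, 1.0, 1.0]))
```

Under this assumption the reliability lies between 1/N (for N equally likely classes) and 1, so the values produced by the two networks are directly comparable.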

The method of claim 6,
wherein the determining comprises:
assigning a weight to each of the third speaker identification value and the fourth speaker identification value according to a ratio of the first reliability and the second reliability, and determining the first final speaker identification value from the weighted third speaker identification value and the weighted fourth speaker identification value; and
assigning a weight to each of the third gender value and the fourth gender value according to the ratio of the first reliability and the second reliability, and determining the first final gender value from the weighted third gender value and the weighted fourth gender value.
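
The reliability-weighted determination can be sketched as a normalized weighted sum. The claim says only that weights follow the ratio of the two reliabilities; treating the weights as r1/(r1+r2) and r2/(r1+r2) and summing the weighted values is an assumption, as are the function names and the scalar representation of the values.

```python
from typing import Tuple


def weighted_value(v3: float, v4: float, r1: float, r2: float) -> float:
    """Combine two values with weights proportional to the two reliabilities."""
    total = r1 + r2
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive number")
    w1 = r1 / total  # weight for the first network's value
    w2 = r2 / total  # weight for the second network's value
    return w1 * v3 + w2 * v4


def weighted_final(
    third: Tuple[float, float],   # (third speaker value, third gender value)
    fourth: Tuple[float, float],  # (fourth speaker value, fourth gender value)
    r1: float,                    # first reliability
    r2: float,                    # second reliability
) -> Tuple[float, float]:
    """Return (first final speaker value, first final gender value)."""
    speaker = weighted_value(third[0], fourth[0], r1, r2)
    gender = weighted_value(third[1], fourth[1], r1, r2)
    return speaker, gender
```

For example, with reliabilities 0.75 and 0.25 the first network's values receive three times the weight of the second network's values.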

The method of claim 6,
wherein the determining comprises:
determining, through a magnitude comparison between the first reliability and the second reliability, one of the third speaker identification value and the fourth speaker identification value as the first final speaker identification value, and one of the third gender value and the fourth gender value as the first final gender value.
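
The selection by magnitude comparison can be sketched as a simple "take the more reliable network" rule. The tie-breaking choice (preferring the first network when the reliabilities are equal) and the name `select_final` are assumptions; the claim does not say what happens on a tie.

```python
from typing import Tuple


def select_final(
    third: Tuple[float, float],   # (third speaker value, third gender value)
    fourth: Tuple[float, float],  # (fourth speaker value, fourth gender value)
    r1: float,                    # first reliability
    r2: float,                    # second reliability
) -> Tuple[float, float]:
    """Return the values of whichever network reports the larger reliability."""
    # Assumed tie-break: prefer the first network when r1 == r2.
    return third if r1 >= r2 else fourth
```

Unlike a weighted combination, this rule never produces a value that neither network output, which may be preferable when the values are discrete labels rather than continuous scores.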

The method of claim 6,
wherein the first reliability includes a 1-1 reliability for the third speaker identification value and a 1-2 reliability for the third gender value,
wherein the second reliability includes a 2-1 reliability for the fourth speaker identification value and a 2-2 reliability for the fourth gender value, and
wherein the determining comprises:
determining the first final speaker identification value from the third speaker identification value and the fourth speaker identification value based on the 1-1 reliability and the 2-1 reliability, and determining the first final gender value from the third gender value and the fourth gender value based on the 1-2 reliability and the 2-2 reliability.
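
When each network reports separate reliabilities for its speaker output and its gender output, the two final values can be determined independently. The sketch below assumes a normalized weighted sum for each; the `NetworkOutput` type, its field names, and the combination rule are hypothetical and chosen only to illustrate the pairing of reliabilities (1-1 with 2-1 for the speaker value, 1-2 with 2-2 for the gender value).

```python
from typing import NamedTuple, Tuple


class NetworkOutput(NamedTuple):
    """Hypothetical container for one network's values and reliabilities."""
    speaker_value: float
    gender_value: float
    speaker_reliability: float  # the "1-1" or "2-1" reliability
    gender_reliability: float   # the "1-2" or "2-2" reliability


def _weighted(a: float, b: float, ra: float, rb: float) -> float:
    """Assumed rule: reliability-weighted average of two values."""
    total = ra + rb
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive number")
    return (ra * a + rb * b) / total


def fuse_outputs(first: NetworkOutput, second: NetworkOutput) -> Tuple[float, float]:
    """Return (first final speaker value, first final gender value)."""
    # Speaker value uses only the speaker reliabilities (1-1 and 2-1).
    speaker = _weighted(
        first.speaker_value, second.speaker_value,
        first.speaker_reliability, second.speaker_reliability,
    )
    # Gender value uses only the gender reliabilities (1-2 and 2-2).
    gender = _weighted(
        first.gender_value, second.gender_value,
        first.gender_reliability, second.gender_reliability,
    )
    return speaker, gender
```

Keeping the reliabilities separate lets a network that is confident about gender but unsure about speaker identity dominate only the gender decision.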

The method of claim 1, further comprising:
transmitting information about the identified speaker of the voice signal to an external device through a communication device, or outputting the information about the identified speaker of the voice signal through a display.

An electronic device for speaker recognition, the electronic device comprising:
a memory in which at least one program is stored; and
a processor configured to, by executing the at least one program:
obtain first information generated from a voice signal and second information generated from the voice signal;
check first confirmation information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and check second confirmation information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and
identify the speaker of the voice signal based on the first confirmation information and the second confirmation information,
wherein the first information includes information about a spectrogram generated from the voice signal and information about a first feature vector generated from the spectrogram by any one of an X-vector technique, a D-vector technique, and an I-vector technique,
wherein the second information includes information about the spectrogram and information about a second feature vector generated from the spectrogram by another one of the X-vector technique, the D-vector technique, and the I-vector technique,
wherein the first confirmation information includes a first speaker identification value and a first gender value for identifying the speaker of the voice signal,
wherein the second confirmation information includes a second speaker identification value and a second gender value for identifying the speaker of the voice signal, and
wherein the processor is configured to:
determine a final speaker identification value from the first speaker identification value and the second speaker identification value, and determine a final gender value from the first gender value and the second gender value; and
identify the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value, and comparing a gender value of the at least one pre-registered speaker with the final gender value.

A non-transitory computer-readable recording medium having recorded thereon a program for executing a speaker recognition method on a computer,
the speaker recognition method comprising:
obtaining first information generated from a voice signal and second information generated from the voice signal;
checking first confirmation information for identifying a speaker of the voice signal based on an output of a first neural network to which the first information is input, and checking second confirmation information for identifying the speaker of the voice signal based on an output of a second neural network to which the second information is input; and
identifying the speaker of the voice signal based on the first confirmation information and the second confirmation information,
wherein the first information includes information about a spectrogram generated from the voice signal and information about a first feature vector generated from the spectrogram by any one of an X-vector technique, a D-vector technique, and an I-vector technique,
wherein the second information includes information about the spectrogram and information about a second feature vector generated from the spectrogram by another one of the X-vector technique, the D-vector technique, and the I-vector technique,
wherein the first confirmation information includes a first speaker identification value and a first gender value for identifying the speaker of the voice signal,
wherein the second confirmation information includes a second speaker identification value and a second gender value for identifying the speaker of the voice signal, and
wherein the identifying of the speaker of the voice signal comprises:
determining a final speaker identification value from the first speaker identification value and the second speaker identification value, and determining a final gender value from the first gender value and the second gender value; and
identifying the speaker of the voice signal by comparing a speaker identification value of at least one pre-registered speaker with the final speaker identification value, and comparing a gender value of the at least one pre-registered speaker with the final gender value.