KR20210015542A

KR20210015542A - Apparatus for identifying speaker based on in-depth neural network capable of enrolling unregistered speakers, method thereof and computer recordable medium storing program to perform the method

Info

Publication number: KR20210015542A
Application number: KR1020190094553A
Authority: KR
Inventors: 정지원; 유하진; 허희수; 심혜진
Original assignee: 서울시립대학교 산학협력단
Priority date: 2019-08-02
Filing date: 2019-08-02
Publication date: 2021-02-10
Also published as: KR102286775B1

Abstract

A speaker identification device based on a deep neural network capable of adding an unregistered speaker comprises: a deep neural network including an input layer, one or more hidden layers, and an output layer, each including a plurality of nodes, wherein the nodes of different layers are connected by weights; and a recognition unit that inputs speech whose speaker is unknown to the input layer of the deep neural network, identifies the speaker based on the output values of the plurality of output nodes of the output layer, and classifies the speaker as an unregistered speaker if all of the output values of the plurality of output nodes are less than a preset threshold.

Description

Apparatus for identifying speaker based on in-depth neural network capable of enrolling unregistered speakers, method thereof and computer recordable medium storing program to perform the method

The present invention relates to speaker identification technology and, more particularly, to a deep-neural-network-based speaker identification device capable of adding an unregistered speaker, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded.

A speaker identification system using a deep neural network is trained to identify a predefined set of speakers. Advances in deep neural network technology have made it possible to build speaker identification systems with excellent performance. However, because of the nature of a classification system built on a deep neural network, adding an unregistered speaker requires a complete reconfiguration of the system, which incurs a large overhead.

Korean Patent Laid-Open Publication No. 2015-0104111, published September 14, 2015 (Title: Speaker verification and identification using artificial neural network-based sub-phonetic unit discrimination)

An object of the present invention is to provide a deep-neural-network-based speaker identification device capable of adding an unregistered speaker, a method therefor, and a computer-readable recording medium on which a program for performing the method is recorded, in which, when a new speaker is to be added to a speaker identification system using a deep neural network, the new speaker is added by calculating the weights of the last layer instead of reconfiguring the entire system.

To achieve the above object, a deep-neural-network-based speaker identification device capable of adding an unregistered speaker according to a preferred embodiment of the present invention includes: a deep neural network including an input layer, one or more hidden layers, and an output layer, each including a plurality of nodes, in which the nodes of different layers are connected by weights; and a recognition unit that inputs speech whose speaker is unknown to the input layer of the deep neural network and then identifies the speaker based on the output values of the plurality of output nodes of the output layer, classifying the speaker as an unregistered speaker if all of the output values of the plurality of output nodes are less than a preset threshold.

The device further includes a learning unit that, when the speaker is classified as an unregistered speaker, adds an output node corresponding to the unregistered speaker to the output layer and calculates the weights between the plurality of hidden nodes of the last hidden layer and the output node added to the output layer, based on node values of the last hidden layer that have been stored for the unregistered speaker at least a predetermined number of times.

The activation function of the output nodes is a softmax function, and the weight between each of the plurality of hidden nodes of the last hidden layer and the added output node is the average of the stored node values of that hidden node.
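The weight calculation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `mean_hidden_vector`, `add_speaker`, and `min_count` are hypothetical, and `min_count` stands in for the "predetermined number" of stored utterances, whose actual value the text does not give.

```python
def mean_hidden_vector(stored):
    """Average the stored last-hidden-layer vectors element-wise."""
    n = len(stored)
    return [sum(column) / n for column in zip(*stored)]


def add_speaker(output_weights, stored, min_count=3):
    """Return output-layer weights with one extra row for the new speaker.

    output_weights: one row per output node, each as wide as the last hidden layer.
    stored: last-hidden-layer node values saved from the unregistered speaker.
    min_count: hypothetical stand-in for the "predetermined number" in the text.
    """
    if len(stored) < min_count:
        return output_weights  # not enough stored utterances yet
    return output_weights + [mean_hidden_vector(stored)]


weights = [[0.2, -0.1, 0.4], [0.0, 0.3, -0.2]]   # two registered speakers
stored = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [2.0, 3.0, 2.0]]
new_weights = add_speaker(weights, stored)
print(len(new_weights), new_weights[-1])
```

Only the new row is computed; the rows for the registered speakers and all earlier layers are left untouched, which is the point of avoiding a full retraining.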

The threshold is the largest of the expected values of the training data used when training on the registered speakers.

To achieve the above object, a deep-neural-network-based speaker identification method capable of adding an unregistered speaker according to a preferred embodiment of the present invention includes: inputting, by a recognition unit, speech whose speaker is unknown to a deep neural network that includes an input layer, one or more hidden layers, and an output layer, each including a plurality of nodes, in which the nodes of different layers are connected by weights; deriving, by the recognition unit, the output values of the plurality of output nodes of the output layer through a plurality of operations in which the nodes of the layers of the deep neural network apply the weights to the input; identifying, by the recognition unit, the speaker based on the output values of the plurality of output nodes of the output layer while determining whether all of the output values of the plurality of output nodes are less than a preset threshold; and classifying, by the recognition unit, the speaker as an unregistered speaker if the determination shows that all of the output values of the plurality of output nodes are less than the preset threshold.
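The decision step of this method can be sketched in Python as follows. The function name `identify`, the use of `None` to mark an unregistered speaker, and the example threshold of 0.5 are assumptions made for illustration only.

```python
UNREGISTERED = None  # hypothetical marker for "speaker not enrolled"


def identify(outputs, threshold):
    """Return the index of the identified speaker, or UNREGISTERED.

    outputs: output values of the output nodes, one per registered speaker.
    threshold: preset threshold; if every output is below it, reject.
    """
    if all(value < threshold for value in outputs):
        return UNREGISTERED
    # Otherwise the speaker is the one whose output node has the largest value.
    return max(range(len(outputs)), key=lambda i: outputs[i])


print(identify([0.1, 0.7, 0.2], threshold=0.5))
print(identify([0.3, 0.4, 0.3], threshold=0.5))
```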

The speaker identification method further includes, after the step of classifying the speaker as an unregistered speaker: adding, by a learning unit, an output node corresponding to the unregistered speaker to the output layer; and calculating, by the learning unit, the weights between the plurality of hidden nodes of the last hidden layer and the output node added to the output layer, based on node values of the plurality of hidden nodes of the last hidden layer that have been stored for the unregistered speaker at least a predetermined number of times.

The activation function of the output nodes is a softmax function, and the step of calculating the weights takes the average of the stored node values of each of the plurality of hidden nodes of the last hidden layer as the weight between that hidden node and the output node added to the output layer.

To achieve the above object, a computer-readable recording medium according to a preferred embodiment of the present invention stores a program for performing a deep-neural-network-based speaker identification method capable of adding an unregistered speaker, the method including: inputting, by a recognition unit, speech whose speaker is unknown to a deep neural network that includes an input layer, one or more hidden layers, and an output layer, each including a plurality of nodes, in which the nodes of different layers are connected by weights; deriving, by the recognition unit, the output values of the plurality of output nodes of the output layer through a plurality of operations in which the nodes of the layers of the deep neural network apply the weights to the input; identifying, by the recognition unit, the speaker based on the output values of the plurality of output nodes of the output layer while determining whether all of the output values of the plurality of output nodes are less than a preset threshold; and classifying, by the recognition unit, the speaker as an unregistered speaker if the determination shows that all of the output values of the plurality of output nodes are less than the preset threshold.

According to the present invention, a recognizable speaker can be added by calculating the weights between the last hidden layer and the output node corresponding to the added speaker, without retraining the entire deep neural network. This reduces computational complexity and lowers the load on the system.

FIG. 1 is a block diagram illustrating an apparatus for identifying a speaker based on a deep neural network to which an unregistered speaker can be added, according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the configuration of a deep neural network according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a node that performs an operation to which weights are applied, according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of training a deep neural network according to an embodiment of the present invention.
FIG. 5 is a flowchart illustrating a speaker recognition method using a deep neural network according to an embodiment of the present invention.
FIGS. 6 and 7 are flowcharts illustrating a method of adding an unregistered speaker according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a method of adding an unregistered speaker according to an embodiment of the present invention.

Before the detailed description of the present invention, note that the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical idea, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals wherever possible in the drawings. Detailed descriptions of well-known functions and configurations that could obscure the subject matter of the present invention are omitted. For the same reason, some components in the drawings are exaggerated, omitted, or shown schematically, and the size of each component does not fully reflect its actual size.

First, an apparatus for identifying a speaker based on a deep neural network to which an unregistered speaker can be added, according to an embodiment of the present invention, is described. FIG. 1 is a block diagram illustrating the apparatus. FIG. 2 is a diagram illustrating the configuration of the deep neural network according to an embodiment of the present invention. FIG. 3 is a diagram illustrating a node that performs an operation to which weights are applied, according to an embodiment of the present invention.

Referring first to FIG. 1, the apparatus 100 for identifying a speaker based on a deep neural network to which an unregistered speaker can be added according to an embodiment of the present invention (hereinafter, the identification device 100) includes a communication unit 110, an audio unit 120, an input unit 130, a display unit 140, a storage unit 150, and a control unit 160.

The communication unit 110 is a means for communicating with other devices. The communication unit 110 may collect training data from other devices over a network. The communication unit 110 may include a radio frequency (RF) transmitter (Tx) that up-converts and amplifies the frequency of a signal to be transmitted, and an RF receiver (Rx) that low-noise amplifies a received signal and down-converts its frequency. The communication unit 110 may also include a modem that modulates signals to be transmitted and demodulates received signals.

The audio unit 120 includes a speaker (SPK) for outputting audio signals such as speech and a microphone (MIKE) for collecting audio signals such as speech, according to an embodiment of the present invention. That is, under the control of the control unit 160, the audio unit 120 may output an audio signal through the speaker (SPK) or pass an audio signal input through the microphone (MIKE) to the control unit 160.

The input unit 130 may receive a user's key operation for controlling the identification device 100, generate an input signal, and pass it to the control unit 160. The input unit 130 may include various keys for controlling the identification device 100. When the display unit 140 is a touch screen, the functions of these keys may be performed on the display unit 140, and when all functions can be performed with the touch screen alone, the input unit 130 may be omitted.

The display unit 140 displays various information on a screen. In particular, the display unit 140 may output a prompt asking an unregistered speaker to enter an identifier. The display unit 140 may also visually present to the user the menus of the identification device 100, entered data, function setting information, and various other information, and it outputs screens such as the boot screen, standby screen, and menu screen of the identification device 100. The display unit 140 may be formed of a liquid crystal display (LCD), organic light emitting diodes (OLED), active matrix organic light emitting diodes (AMOLED), or the like. The display unit 140 may also be implemented as a touch screen, in which case it includes a touch sensor that detects the user's touch input. The touch sensor may be a touch-sensing sensor of the capacitive overlay, pressure, resistive overlay, or infrared beam type, or it may be a pressure sensor. Besides these sensors, any kind of sensor device capable of detecting the contact or pressure of an object may be used as the touch sensor of the present invention. The touch sensor detects the user's touch input, generates a detection signal, and transmits it to the control unit 160. In particular, when the display unit 140 is a touch screen, some or all of the functions of the input unit 130 may be performed through the display unit 140.

The storage unit 150 stores the programs and data required for the operation of the identification device 100. In particular, the storage unit 150 may store the node values of the last hidden layer for a speaker determined to be unregistered, each time that speaker's speech is input. The various data stored in the storage unit 150 may be deleted, changed, or added to according to the user's operation.

The control unit 160 may control the overall operation of the identification device 100 and the signal flow between its internal blocks, and may perform a data processing function. The control unit 160 basically controls the various functions of the identification device 100. Examples of the control unit 160 include a central processing unit (CPU) and a digital signal processor (DSP). In particular, the control unit 160 includes a deep neural network (DNN) 200, a learning unit 300, and a recognition unit 400. The operation of the control unit 160, including the deep neural network 200, the learning unit 300, and the recognition unit 400, is described in more detail below.

The deep neural network 200 according to an embodiment of the present invention is now described in more detail with reference to FIG. 2. The deep neural network 200 includes a plurality of layers (IL, HL, OL): an input layer (IL: Input Layer), at least one hidden layer (HL: Hidden Layer, HL1 to HLk), and an output layer (OL: Output Layer).

Each of the layers (IL, HL, OL) includes a plurality of nodes. For example, as illustrated, the input layer (IL) may include n nodes (i1 to in), and the output layer (OL) may include t nodes (o1 to ot). Among the hidden layers (HL), the first hidden layer (HL1) may include a nodes (h11 to h1a), and the k-th hidden layer (HLk) may include c nodes (hk1 to hkc).

Every node in every layer has an associated operation. In particular, nodes of different layers are connected by channels (shown as dotted lines) that carry weights (W). In other words, the result of the operation at one node is weighted and becomes an input to a node in the next layer. This connection relationship is described with reference to FIG. 3, which shows node D as an example of one of the nodes in the deep neural network 200. Node D applies the weights w=[w1, w2, … , wn] to the input signal x=[x1, x2, … , xn] and then applies a function F to the result. Here, the function F is called the activation function or transfer function. Even for identical inputs, the output differs depending on the weights (W).

That is, the output of each node is given by Equation 1 below.
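The equation itself is not reproduced in this text. A form consistent with the surrounding description (inputs x_i, weights w_i, activation function F, and a threshold θ below which the node is not activated) would be the following; the symbol y for the node output and the sign convention for θ are assumptions.

```latex
y = F\left( \sum_{i=1}^{n} w_i x_i - \theta \right)
```

With θ = 0 this reduces to applying F to the plain weighted sum, which is how the worked example for node D is computed.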

The variable θ, not yet explained, is a threshold or bias; in Equation 1 it keeps the node from being activated when the value of the weighted sum of the inputs is less than the threshold.

For example, assume that the layer preceding node D has three nodes. Node D then has three inputs (n=3), x1, x2, and x3, and three weights, w1, w2, and w3.

Node D receives the three inputs x1, x2, and x3 multiplied by their corresponding weights w1, w2, and w3, sums the products, and computes its output by substituting the sum into the transfer function. Specifically, assume the input is x=[x1, x2, x3] = [0.5, -0.3, 0] and the weights are w=[w1, w2, w3] = [4, 5, 2]. Assuming for convenience that the transfer function is 'sgn()', the output value is calculated as follows.

x1 × w1 = 0.5 × 4 = 2

x2 × w2 = -0.3 × 5 = -1.5

x3 × w3 = 0 × 2 = 0

2 + (-1.5) + 0 = 0.5

sgn(0.5) = 1

In this way, each node in each layer of the deep neural network 200 receives the weighted inputs from the nodes of the previous layer, sums them, applies the transfer function, and passes the result on as input to the next layer.
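The worked example for node D can be reproduced with a short Python sketch. The function names are illustrative, and the convention that sgn returns 0 for an input of exactly 0 is an assumption, since the text only evaluates sgn at 0.5.

```python
def sgn(value):
    """Sign function: 1 for positive, -1 for negative, 0 for zero (assumed convention)."""
    return (value > 0) - (value < 0)


def node_output(inputs, weights, transfer=sgn):
    """Weight each input, sum the products, and apply the transfer function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return transfer(total)


# Inputs and weights from the example: 2 + (-1.5) + 0 = 0.5, and sgn(0.5) = 1.
print(node_output([0.5, -0.3, 0], [4, 5, 2]))  # prints 1
```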

However, the output nodes of the output layer use a softmax function as their activation or transfer function. The softmax function normalizes its inputs so that every output value lies between 0 and 1 and the output values always sum to 1. The softmax function is given by Equation 2 below.

Here, t denotes an output node of the output layer.
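The normalization property described for the softmax function can be checked with a small Python sketch. It assumes the standard definition, in which the output for node t is exp(z_t) divided by the sum of exp(z_j) over all output nodes; the name `softmax` and the sample inputs are illustrative.

```python
import math


def softmax(values):
    """Map raw output-node values to values in (0, 1) that sum to 1."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


probabilities = softmax([2.0, 1.0, 0.1])
print(probabilities, sum(probabilities))
```

Subtracting the maximum before exponentiating does not change the result, because the common factor cancels between numerator and denominator.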

Returning to FIG. 2, the user's speech is input to the input layer (IL). In response, the nodes of the hidden layers (HL) perform the operation described with reference to FIG. 3 and pass the results to the output layer (OL). Each node of the output layer (OL) then outputs an output value.

In FIG. 2, the t nodes (o1 to ot) of the output layer (OL) correspond to t different speakers. That is, the first through t-th output nodes correspond to the first through t-th speakers, respectively. When the speech of one of the t speakers is input to the input layer (IL) of the deep neural network 200 and the output value of the second output node (o2) is greater than the output values of the remaining output nodes (o1, o3 to ot), the speech is attributed to the second speaker. More generally, when speech is input to the deep neural network 200 and the output value of one output node is greater than those of all the other output nodes, the speech is judged to belong to the speaker corresponding to the output node with the largest output value.
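The forward pass and the largest-output rule can be sketched together in Python. This is an illustrative toy network, not the network of the embodiment: the layer sizes, the weight values, the bias terms, and the use of tanh in the hidden layer are all assumptions.

```python
import math


def softmax(values):
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    # Each row of `weights` holds the incoming weights of one node.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def forward(inputs, hidden_layers, output_layer):
    """Propagate the input through the hidden layers, then the softmax output layer."""
    h = inputs
    for weights, biases in hidden_layers:
        h = [math.tanh(v) for v in dense(h, weights, biases)]  # tanh is an assumption
    weights, biases = output_layer
    return softmax(dense(h, weights, biases))


hidden_layers = [([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])]
output_layer = ([[-1.0, 0.0], [2.0, 0.0]], [0.0, 0.0])
outputs = forward([1.0, 0.0], hidden_layers, output_layer)

# The speaker is the one whose output node has the largest value (0-based index).
speaker = max(range(len(outputs)), key=lambda i: outputs[i])
print(speaker, outputs)
```

Here index 1 corresponds to the second output node, so the toy input is attributed to the second speaker.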

For the deep neural network 200 to perform the recognition described above, training (machine learning) is required. Training according to an embodiment of the present invention is a procedure that uses speech whose speaker is known as training data, sets an expected value, and then adjusts the weights (w) of the nodes of the deep neural network 200 according to that expected value.

This training is described in more detail with reference to FIG. 4, which is a flowchart illustrating a method of training the deep neural network according to an embodiment of the present invention. Referring to FIGS. 2 to 4, assume in this embodiment that there are two output nodes (t=2) and that a first speaker and a second speaker exist. Assume also that the first and second speakers correspond to the first output node (o1) and the second output node (o2), respectively. The first speaker's speech is used to train identification of the first speaker, and the second speaker's speech is used to train identification of the second speaker. In other words, the training data consists of speech whose speaker is known.

First, in step S110, the learning unit 300 sets an expected value according to the speaker to be trained. For example, the expected value for the training data of the first speaker, that is, for the first speaker's speech, may be set so that the output value of the first output node (o1) is 0.8 or more and the output value of the second output node (o2) is 0.2 or less (o1≥0.8, o2≤0.2). Likewise, the expected value for the training data of the second speaker, that is, for the second speaker's speech, may be set so that the output value of the first output node (o1) is 0.2 or less and the output value of the second output node (o2) is 0.8 or more (o1≤0.2, o2≥0.8).

Next, the learning unit 300 inputs the corresponding training data to the deep neural network 200. The layers and nodes of the deep neural network 200 perform the operations described with reference to FIG. 3 and output the resulting output values through the output nodes (o1, o2). These output values may differ from the expected values set earlier.

Then, in step S130, the learning unit 300 calculates the difference between the expected values and the output values, and in step S140 it adjusts the weights (W) of each node through the backpropagation algorithm so that the difference is minimized. In other words, the learning unit 300 adjusts the weights (W) of each node so that the output values match the expected values. Adjusting the weights (W) in this way is called learning. A single pass of the process described above is not sufficient; it is preferable to repeat it with a plurality of training data until the output values always satisfy the expected values.
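Steps S110 to S140 can be illustrated with a deliberately simplified sketch. It assumes a single softmax layer in place of the multi-layer network (so backpropagation reduces to one gradient step per sample), a cross-entropy loss, and a learning rate of 0.5; none of these choices, nor the names `forward`, `meets_expected`, and `train`, come from the source. Training repeats until every sample satisfies the expected values (target node ≥ 0.8, other nodes ≤ 0.2).

```python
import math

def softmax(z):
    m = max(z)                      # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(W, x):
    # W[j][i] is the weight from input i to output node j (no bias, by assumption)
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def meets_expected(o, speaker, hi=0.8, lo=0.2):
    # Expected values: the speaker's own node >= hi, every other node <= lo
    return all((v >= hi) if j == speaker else (v <= lo) for j, v in enumerate(o))

def train(samples, n_in, n_out, lr=0.5, max_epochs=10000):
    W = [[0.0] * n_in for _ in range(n_out)]
    for epoch in range(1, max_epochs + 1):
        for x, speaker in samples:
            o = forward(W, x)
            for j in range(n_out):
                # Difference between output and target (cf. S130)
                err = o[j] - (1.0 if j == speaker else 0.0)
                for i in range(n_in):
                    W[j][i] -= lr * err * x[i]   # weight adjustment (cf. S140)
        # Stop once every training sample satisfies its expected values
        if all(meets_expected(forward(W, x), s) for x, s in samples):
            return W, epoch
    return W, max_epochs

# Toy "voices": 2-dimensional feature vectors with known speaker index (0 or 1)
samples = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
W, epochs = train(samples, n_in=2, n_out=2)
```

The separate check after each epoch mirrors the statement that one pass is not enough and that training should continue until the outputs consistently meet the expected values.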

In summary, learning is the process of feeding training data into the deep neural network 200, comparing the resulting output value with the preset expected value, and adjusting the weights (w) so that the output value becomes the expected value. More specifically, referring again to FIG. 3, assume that node D is one of the output nodes. With inputs [x1, x2, x3] = [0.5, -0.3, 0], weights w = [w1, w2, w3] = [4, 5, 2], and the transfer function 'sgn()', the output of node N was 1. Assume that the expected value of node D is 0. Then learning may change w2 from 5 to 6, and the output value becomes the expected value as follows.

x1 × w1 = 0.5 × 4 = 2

x2 × w2 = -0.3 × 6 = -1.8

x3 × w3 = 0 × 2 = 0

2 + (-1.8) + 0 = 0.2

sgn(0.2) = 0
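The arithmetic above can be checked with a short sketch. This passage does not define 'sgn()', so the code assumes a step function with a cut-off of 0.35, a hypothetical value chosen only because it lies between the two weighted sums (0.5 before the update, 0.2 after) and therefore reproduces the stated outputs of 1 and 0. The names `weighted_sum` and `step` are illustrative.

```python
import math

def weighted_sum(x, w):
    # Sum of input * weight over all incoming connections
    return sum(xi * wi for xi, wi in zip(x, w))

def step(v, threshold=0.35):
    # ASSUMPTION: stand-in for the document's 'sgn()'; the cut-off is hypothetical
    return 1 if v >= threshold else 0

x = [0.5, -0.3, 0.0]
w_before = [4.0, 5.0, 2.0]   # original weights
w_after = [4.0, 6.0, 2.0]    # w2 changed from 5 to 6 by learning

s_before = weighted_sum(x, w_before)   # 2 + (-1.5) + 0
s_after = weighted_sum(x, w_after)     # 2 + (-1.8) + 0

out_before = step(s_before)
out_after = step(s_after)
```

Under this assumption the node outputs 1 with the original weights and 0 after the update, matching the expected value in the example.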

It is preferable to repeat the training described above with a plurality of different training data that share the same expected value, until the output value no longer changes and satisfies the expected value. Thus, learning, the procedure of adjusting the weights of the deep neural network so that the difference between the expected value and the output value is minimized, includes correcting the weights of the nodes sequentially in reverse order, from the output layer to the input layer, through the backpropagation algorithm.

Once training is complete, the deep neural network 200 can recognize speakers. A speaker recognition method according to an embodiment of the present invention is described next. FIG. 5 is a flowchart illustrating a speaker recognition method using a deep neural network according to an embodiment of the present invention. Referring to FIGS. 2 to 5, this embodiment assumes that there are two output nodes (t=2) and that the first speaker and the second speaker have already been trained. It is also assumed that the first speaker and the second speaker correspond to the first output node (o1) and the second output node (o2), respectively.

Referring to FIG. 5, in step S210 the recognition unit 400 inputs speech whose speaker is unknown into the deep neural network 200. The deep neural network 200 then produces output values at the output nodes (o1, o2) through a plurality of operations in which the weights of the layers are applied.

The recognition unit 400 then recognizes the speaker according to the output values in step S220. For example, assume that the output values of the first and second output nodes are 0.85 and 0.15, respectively (o1 = 0.85, o2 = 0.15). This indicates an 85% probability that the input voice is the first speaker's and a 15% probability that it is the second speaker's. Accordingly, the recognition unit 400 recognizes the input voice as that of the first speaker.
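Step S220 amounts to choosing the output node with the largest value. A minimal sketch, assuming the output values are already available as a list and using the hypothetical labels "speaker_1" and "speaker_2":

```python
def recognize(outputs, speakers):
    # Pick the output node with the highest value and return its speaker label
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    return speakers[best], outputs[best]

speaker, score = recognize([0.85, 0.15], ["speaker_1", "speaker_2"])
```

With o1 = 0.85 and o2 = 0.15, this selects the first speaker, as in the example above.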

For a speaker who has been trained, the network can recognize that a voice belongs to that speaker. If the voice does not belong to a previously trained speaker, however, the deep neural network 200 cannot recognize the speaker. In that case, the present invention can add a new speaker, and the method for doing so is described below. FIGS. 6 and 7 are flowcharts illustrating a method of adding an unregistered speaker according to an embodiment of the present invention. FIG. 8 is a diagram illustrating a method of adding an unregistered speaker according to an embodiment of the present invention.

First, referring to FIG. 6, this embodiment assumes that there are two output nodes (t=2) and that the first speaker and the second speaker have already been trained. It is also assumed that the first speaker and the second speaker correspond to the first output node (o1) and the second output node (o2), respectively.

The recognition unit 400 inputs speech whose speaker is unknown into the deep neural network 200 in step S310 and derives the output values through the deep neural network 200 in step S320. Then, in step S330, the recognition unit 400 determines whether all of the output values of the output nodes are less than a preset threshold. In an embodiment of the present invention, the threshold may be the largest of the expected values of the training data used to train the registered speakers. As described above, the expected values of the training data for the first speaker may be set so that the output value of the first output node (o1) is 0.8 or more and the output value of the second output node (o2) is 0.2 or less (o1≥0.8, o2≤0.2), and the expected values of the training data for the second speaker may be set so that the output value of the first output node (o1) is 0.2 or less and the output value of the second output node (o2) is 0.8 or more (o1≤0.2, o2≥0.8). In this case, the threshold may be 0.8.

If all of the output values of the output nodes (o1, o2) are less than the preset threshold (for example, 0.8), as with o1 = 0.7 and o2 = 0.3, the recognition unit 400 classifies the speaker as an unregistered speaker in step S340 (that is, it determines that the voice belongs to an unregistered speaker) and performs the speaker addition process. Otherwise, if not all of the output values of the output nodes are less than the preset threshold, the recognition unit 400 recognizes the speaker according to the output values of the deep neural network 200 in step S350. Step S350 is the same as step S220 described above.
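The decision in steps S330 to S350 can be sketched as follows. The function names, the speaker labels, and the use of `None` to signal an unregistered speaker are assumptions made for illustration; the threshold is derived as the largest expected value used in training, as the text describes.

```python
def threshold_from_expected(expected_values):
    # Largest expected value across all registered speakers' training targets
    return max(v for targets in expected_values for v in targets)

def classify(outputs, speakers, threshold):
    # S330/S340: every output below the threshold -> unregistered speaker
    if all(o < threshold for o in outputs):
        return None
    # S350: otherwise pick the speaker of the highest output node (same as S220)
    best = max(range(len(outputs)), key=lambda j: outputs[j])
    return speakers[best]

# Expected values from the example: (o1>=0.8, o2<=0.2) and (o1<=0.2, o2>=0.8)
threshold = threshold_from_expected([(0.8, 0.2), (0.2, 0.8)])
speakers = ["speaker_1", "speaker_2"]

unknown = classify([0.7, 0.3], speakers, threshold)    # both outputs below 0.8
known = classify([0.85, 0.15], speakers, threshold)    # o1 reaches the threshold
```

An output exactly equal to the threshold is treated as registered, because the condition in the text is "less than" the threshold.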

The speaker addition process is now described with reference to FIG. 7. If all of the output values of the output nodes (o1, o2) are less than the preset threshold (for example, o1 = 0.7, o2 = 0.3), the learning unit 300 receives the identifier of the unregistered speaker in step S410. For example, the learning unit 300 may prompt the user through the audio unit 120 and the display unit 140 to enter the speaker's name (identifier), and thereby receive the identifier of the unregistered speaker.

When the identifier of the unregistered speaker has been entered, the learning unit 300 stores, in step S420, the node values of the hidden nodes of the last hidden layer, mapped to the entered identifier. Referring to FIG. 8, assume for example that k=2, so that the last hidden layer is the second hidden layer. The learning unit 300 then stores the node value of each of the hidden nodes (hk1 ~ hkc, k=2) of the last hidden layer (the second hidden layer) in the storage unit 150, mapped to the entered identifier of the unregistered speaker.

Next, in step S430, the learning unit 300 determines whether the number of times node values of the last hidden layer have been stored for the identifier of the unregistered speaker is equal to or greater than a preset number. Step S430 checks whether enough training data has been collected for the voice of the unregistered speaker to be learned adequately.
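Steps S420 and S430 can be pictured as a small store keyed by speaker identifier. The class name `EnrollmentBuffer`, its methods, and the minimum count of 3 are hypothetical; the source only states that node values are stored per identifier and compared against a preset count.

```python
from collections import defaultdict

class EnrollmentBuffer:
    """Stores last-hidden-layer node values per speaker identifier (cf. S420)."""

    def __init__(self, min_count):
        self.min_count = min_count        # preset number of stored utterances
        self._store = defaultdict(list)

    def add(self, speaker_id, hidden_values):
        # Keep a copy of the hidden-node values, mapped to the identifier
        self._store[speaker_id].append(list(hidden_values))

    def count(self, speaker_id):
        return len(self._store.get(speaker_id, []))

    def ready(self, speaker_id):
        # cf. S430: enough stored vectors to register this speaker?
        return self.count(speaker_id) >= self.min_count

    def vectors(self, speaker_id):
        return list(self._store.get(speaker_id, []))

buf = EnrollmentBuffer(min_count=3)
buf.add("alice", [0.2, 0.9, 0.1])
buf.add("alice", [0.3, 0.8, 0.0])
ready_after_two = buf.ready("alice")
buf.add("alice", [0.1, 1.0, 0.2])
ready_after_three = buf.ready("alice")
```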

If the determination in step S430 shows that node values of the last hidden layer have been stored for the unregistered speaker at least the preset number of times, the learning unit 300 adds an output node in step S440. For example, comparing FIG. 8 with FIG. 2, the learning unit 300 adds a (t+1)-th output node (ot+1) to the output layer. The added (t+1)-th output node (ot+1) corresponds to the unregistered speaker being newly registered.

Next, in step S450, the learning unit 300 uses the node values of the hidden nodes of the last hidden layer stored in the storage unit 150 for the unregistered speaker to calculate the weights (W_add) between the hidden nodes of the last hidden layer and the output node added to the output layer.

Because the activation function of the output layer is the softmax function, the weight matrix connecting the last hidden layer and the output layer is a two-dimensional matrix of size (number of dimensions of the last hidden layer) × (number of registered speakers). To add the unregistered speaker to this weight matrix as a new speaker, a vector representing the (t+1)-th speaker, whose length equals the number of dimensions of the last hidden layer, is appended. Specifically, the vector representing the (t+1)-th speaker is the average of the previously stored node values of the hidden nodes of the last hidden layer, obtained when the (t+1)-th speaker's voice was input. As a result, the weight matrix has size (number of dimensions of the last hidden layer) × (number of registered speakers + 1), and when a new utterance is subsequently input, speaker identification can be performed with the (t+1)-th speaker included.
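The matrix extension in steps S440 and S450 can be sketched in a few lines. The sketch assumes the weight matrix is held as a list of rows (one row per hidden dimension, one column per speaker) and ignores bias terms; the function names and the toy numbers are illustrative, not taken from the source.

```python
import math

def mean_vector(vectors):
    # Element-wise average of the stored last-hidden-layer node values
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def add_speaker(W, stored_hidden):
    # W has shape (hidden_dim x n_speakers); append one column for speaker t+1
    new_col = mean_vector(stored_hidden)
    if len(new_col) != len(W):
        raise ValueError("stored vectors must match the hidden-layer dimension")
    return [row + [v] for row, v in zip(W, new_col)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def output_values(W, h):
    # Output node j receives sum_i h[i] * W[i][j], followed by softmax
    z = [sum(h[i] * W[i][j] for i in range(len(h))) for j in range(len(W[0]))]
    return softmax(z)

W = [[1.0, 0.0],
     [0.0, 1.0],
     [0.0, 0.0]]                       # 3 hidden dimensions x 2 registered speakers
stored = [[0.0, 0.0, 2.0],
          [0.0, 0.0, 4.0]]             # stored hidden values for the new speaker
W_ext = add_speaker(W, stored)         # now 3 x 3
scores = output_values(W_ext, [0.0, 0.0, 3.0])
```

Because the new column is the mean hidden vector, an utterance whose hidden activations resemble that mean produces a large inner product with the new column, so the (t+1)-th output node dominates after the softmax.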

On the other hand, if the determination in step S430 shows that node values of the last hidden layer have been stored for the unregistered speaker fewer than the preset number of times, the unregistered speaker cannot yet be registered, and the process ends.

In addition, the method according to the embodiment of the present invention described above may be implemented as a program readable by various computer means and recorded on a computer-readable recording medium. The recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not limiting. Those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made under the doctrine of equivalents without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

100: identification device 110: communication unit
120: audio unit 130: input unit
140: display unit 150: storage unit
160: control unit 200: neural network
300: learning unit 400: recognition unit

Claims

A deep neural network-based speaker identification device capable of adding an unregistered speaker, comprising:
a deep neural network including an input layer, at least one hidden layer, and an output layer, each including a plurality of nodes, wherein the nodes of different layers are connected by weights; and
a recognition unit that inputs speech of an unknown speaker to the input layer of the deep neural network and then identifies the speaker based on output values of a plurality of output nodes of the output layer, and that classifies the speaker as an unregistered speaker if all of the output values of the plurality of output nodes are less than a preset threshold.

The speaker identification device of claim 1,
further comprising a learning unit that, when the speaker is classified as the unregistered speaker, adds an output node corresponding to the unregistered speaker to the output layer, and calculates weights between a plurality of hidden nodes of the last hidden layer and the output node added to the output layer, based on node values of the last hidden layer stored for the unregistered speaker at least a preset number of times.

The speaker identification device of claim 2,
wherein the activation function of the output node is a softmax function, and
the weights between the plurality of hidden nodes of the last hidden layer and the added output node are the averages of the stored node values of the respective hidden nodes of the last hidden layer.

The speaker identification device of claim 1,
wherein the threshold is the largest of the expected values of the training data used to train the registered speakers.

A deep neural network-based speaker identification method capable of adding an unregistered speaker, comprising:
inputting, by a recognition unit, speech of an unknown speaker to a deep neural network that includes an input layer, at least one hidden layer, and an output layer, each including a plurality of nodes, wherein the nodes of different layers are connected by weights;
deriving, by the recognition unit, output values of a plurality of output nodes of the output layer through a plurality of operations in which the weights are applied by the nodes of the layers of the deep neural network in response to the input;
identifying, by the recognition unit, the speaker based on the output values of the plurality of output nodes of the output layer, while determining whether all of the output values of the plurality of output nodes are less than a preset threshold; and
classifying, by the recognition unit, the speaker as an unregistered speaker if the determination shows that all of the output values of the plurality of output nodes are less than the preset threshold.

The speaker identification method of claim 5, further comprising,
after the step of classifying the speaker as an unregistered speaker:
adding, by a learning unit, an output node corresponding to the unregistered speaker to the output layer; and
calculating, by the learning unit, weights between a plurality of hidden nodes of the last hidden layer and the output node added to the output layer, based on node values of the plurality of hidden nodes of the last hidden layer stored for the unregistered speaker at least a preset number of times.

The speaker identification method of claim 6,
wherein the activation function of the output node is a softmax function, and
the step of calculating the weights comprises
calculating the averages of the stored node values of the respective hidden nodes of the last hidden layer as the weights between the plurality of hidden nodes of the last hidden layer and the output node added to the output layer.

The speaker identification method of claim 5,
wherein the threshold is the largest of the expected values of the training data used to train the registered speakers.

A computer-readable recording medium on which is recorded a program for performing a deep neural network-based speaker identification method capable of adding an unregistered speaker, the method comprising:
inputting, by a recognition unit, speech of an unknown speaker to a deep neural network that includes an input layer, at least one hidden layer, and an output layer, each including a plurality of nodes, wherein the nodes of different layers are connected by weights;
deriving, by the recognition unit, output values of a plurality of output nodes of the output layer through a plurality of operations in which the weights are applied by the nodes of the layers of the deep neural network in response to the input;
identifying, by the recognition unit, the speaker based on the output values of the plurality of output nodes of the output layer, while determining whether all of the output values of the plurality of output nodes are less than a preset threshold; and
classifying, by the recognition unit, the speaker as an unregistered speaker if the determination shows that all of the output values of the plurality of output nodes are less than the preset threshold.

The computer-readable recording medium of claim 9, wherein the method further comprises,
after the step of classifying the speaker as an unregistered speaker:
adding, by a learning unit, an output node corresponding to the unregistered speaker to the output layer; and
calculating, by the learning unit, weights between a plurality of hidden nodes of the last hidden layer and the output node added to the output layer, based on node values of the plurality of hidden nodes of the last hidden layer stored for the unregistered speaker at least a preset number of times.

The computer-readable recording medium of claim 10,
wherein the activation function of the output node is a softmax function, and
the step of calculating the weights comprises
calculating the averages of the stored node values of the respective hidden nodes of the last hidden layer as the weights between the plurality of hidden nodes of the last hidden layer and the output node added to the output layer.

The computer-readable recording medium of claim 9,
wherein the threshold is the largest of the expected values of the training data used to train the registered speakers.