KR102275873B1

KR102275873B1 - Apparatus and method for speaker recognition

Info

Publication number: KR102275873B1
Application number: KR1020180163902A
Authority: KR
Inventors: 이상엽; 고재진
Original assignee: 한국전자기술연구원
Priority date: 2018-12-18
Filing date: 2018-12-18
Publication date: 2021-07-12
Also published as: KR20200075339A

Abstract

An embodiment of the present invention provides a speaker recognition apparatus and method comprising: a sound input module that receives sound containing a voice command and converts it into a voice signal; a preprocessing unit that receives the voice signal and generates an input feature vector of the voice signal; a computation unit that compares the input feature vector with a plurality of pre-stored reference feature vectors simultaneously and in parallel and outputs a comparison result; and a speaker recognition unit that determines, based on the comparison result received from the computation unit, which speaker input the voice command. The speaker who input the voice command can thereby be recognized quickly.

Description

Apparatus and method for speaker recognition

The present invention relates to a speaker recognition apparatus and method.

With recent advances in speech recognition technology, services have been developed and deployed in which a speaker gives a command to an electronic device by voice and the device performs the corresponding operation. However, voice commands are not recognized accurately when the voices of several speakers are input at the same time or when a speaker says a new word. In addition, a speech recognition service needs to distinguish a specific speaker in order to provide a customized service.

Meanwhile, as the field of artificial intelligence has developed and diversified, technologies such as machine learning and artificial neural networks are being applied in many areas. However, these technologies carry out complex, multi-stage decision processes and require large computing power to reach a decision quickly.

KR 10-1883301 B1

An object of an embodiment of the present invention is to provide a speaker recognition apparatus and method using a neuromorphic device, which extract an input feature vector from a voice command and compare it in parallel with a plurality of reference feature vectors stored in the neuromorphic device, so as to quickly recognize which speaker input the voice command.

Another object of an embodiment of the present invention is to provide a speaker recognition apparatus and method using a neuromorphic device, which recognize which speaker input a voice command and deliver the input voice command to a device in real time without delay. Other problems to be solved by the present invention will be apparent from this specification to those of ordinary skill in the art to which the invention pertains.

A speaker recognition apparatus according to an embodiment of the present invention may include: a sound input module that receives sound containing a voice command and converts it into a voice signal; a preprocessing unit that receives the voice signal and generates an input feature vector of the voice signal; a computation unit that compares the input feature vector with a plurality of pre-stored reference feature vectors simultaneously and in parallel and outputs a comparison result; and a speaker recognition unit that determines, based on the comparison result received from the computation unit, which speaker input the voice command.

The speaker recognition apparatus according to an embodiment of the present invention may further include a signal transmission module that, when the speaker determined to have input the voice command is judged to be a preset specific speaker, forwards the voice signal to a natural language processing module for providing a speech recognition service.

The sound input module may include a directional microphone aimed at a specific speaker, and the preprocessing unit may include a homogenization unit that removes noise from the voice signal and a feature vector generation unit that extracts the input feature vector from the voice signal.

The computation unit may include: one or more neuromorphic devices, each including a plurality of neurocells that each store a reference feature vector, which compare the input feature vector with the pre-stored reference feature vectors simultaneously and in parallel, one comparison per neurocell, and output the comparison result; a neuromorphic interface that passes the input feature vector received from the preprocessing unit to the neuromorphic device and passes the comparison result to the speaker recognition unit; and a cell switch that switches which set of neurocells, among the plurality of neurocells, processes the input feature vector.

A speaker recognition method according to an embodiment of the present invention may include: a sound input step of receiving sound containing a voice command and converting it into a voice signal; a preprocessing step of removing noise from the voice signal and generating an input feature vector of the voice signal; a feature vector comparison step of comparing the input feature vector with a plurality of pre-stored reference feature vectors simultaneously and in parallel and outputting a comparison result; and a speaker recognition step of determining, based on the comparison result, which speaker input the voice command.

The speaker recognition method according to an embodiment of the present invention may further include a signal transmission step of forwarding the voice signal to a natural language processing module for providing a speech recognition service when the speaker determined to have input the voice command is judged to be a preset specific speaker.

The feature vector comparison step may include: a neurocell switching step of switching which set of neurocells, among the plurality of neurocells included in a neuromorphic device of the neuromorphic module, processes the input feature vector; and a parallel comparison step of comparing the input feature vector with the reference feature vectors pre-stored in the respective neurocells simultaneously and in parallel, one comparison per neurocell, and outputting the comparison result.

The method may further include, before the feature vector comparison step, a learning step of mapping a feature vector output by the preprocessing step to a specific speaker and storing it in a neurocell of the neuromorphic device.

The features and advantages of the present invention will become more apparent from the following detailed description taken in conjunction with the accompanying drawings.

Before that, note that the terms and words used in this specification and the claims should not be interpreted in their ordinary, dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

According to an embodiment of the present invention, the input feature vector extracted from a voice command is compared in parallel with a plurality of reference feature vectors stored in a neuromorphic device, so the speaker who input the voice command can be recognized quickly.

In addition, according to an embodiment of the present invention, the speaker who input a voice command can be recognized and the input voice command can be delivered to a device in real time without delay. In a special environment such as a vehicle, safety can therefore be improved by operating in-vehicle devices based only on the voice commands of a designated speaker.

FIG. 1 is a block diagram of a speaker recognition apparatus using a neuromorphic device according to an embodiment of the present invention.
FIG. 2 shows a sound input module according to an embodiment of the present invention, applied to a vehicle environment by way of example.
FIG. 3 illustrates a speaker recognition scheme according to an embodiment of the present invention.
FIG. 4 is a block diagram of a neuromorphic module according to an embodiment of the present invention.
FIG. 5 illustrates the operation of a neuromorphic device according to an embodiment of the present invention.
FIG. 6 illustrates the operation of a signal transmission module according to an embodiment of the present invention.
FIG. 7 is a flowchart of the steps of a speaker recognition method using a neuromorphic device according to an embodiment of the present invention.
FIG. 8 is a graph of the speaker recognition rate of the speaker recognition apparatus and method using a neuromorphic device according to an embodiment of the present invention.
FIG. 9 is a graph of the relationship between the number of speakers and the number of neuromorphic devices in the speaker recognition apparatus and method using a neuromorphic device according to an embodiment of the present invention.

The objects, specific advantages, and novel features of an embodiment of the present invention will become more apparent from the following detailed description and preferred embodiments taken in conjunction with the accompanying drawings. In assigning reference numerals to the components in the drawings, note that the same components are given the same numerals wherever possible, even when they appear in different drawings. Terms such as "one surface", "the other surface", "first", and "second" are used only to distinguish one component from another, and the components are not limited by these terms. In describing an embodiment of the present invention, detailed descriptions of related known technologies that could unnecessarily obscure its gist are omitted. For convenience of explanation and ease of understanding, the embodiments below are described mainly in terms of a neuromorphic device; however, the present invention is not limited to this, and its technical idea may of course be implemented with other hardware and methods.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings. In this specification, a speaker is a person who speaks a command (order) by voice, and there may be several speakers at the same time. Speaker recognition means distinguishing which person is speaking a command by voice.

FIG. 1 is a block diagram of a speaker recognition apparatus using a neuromorphic device according to an embodiment of the present invention. As shown in FIG. 1, the apparatus may include: a sound input module 100 that receives sound containing a voice command and converts it into a voice signal; a preprocessing unit 210 that receives the voice signal and generates an input feature vector of the voice signal; a neuromorphic module 300 that compares the input feature vector with a plurality of pre-stored reference feature vectors simultaneously and in parallel and outputs a comparison result; and a speaker recognition unit 220 that determines, based on the comparison result received from the neuromorphic module 300, which speaker input the voice command.

The sound input module 100 includes a device that converts sound into a voice signal. The sound received by the sound input module 100 may include a speaker's voice command, another speaker's voice command, vehicle noise, and other noise. The sound input module 100 may include a directional microphone aimed at a specific speaker. A directional microphone receives sound arriving from a set direction effectively and has low sensitivity to sound arriving from other directions. When a directional microphone is used as the sound input module 100, the voice command of a speaker located in the set direction is received relatively loudly, and the voice command of a speaker located in another direction is received relatively quietly.

FIG. 2 shows the sound input module 100 according to an embodiment of the present invention, applied to a vehicle environment by way of example. As shown in FIG. 2, in a special environment such as a vehicle, the directional microphone (DM) may be placed on the dashboard or ceiling of the vehicle with its receiving direction set toward the seat of the driver (D). The driver's speech can then be captured without attaching an input microphone or the like to the driver's body. Because the receiving direction of the microphone is set toward the driver, the driver's voice command is received well and the voice command of a passenger (E) is received relatively quietly. The driver can therefore be recognized through speaker recognition, and a driver-centered speech recognition service can be provided.

Referring again to FIG. 1, the preprocessing unit 210 may include a homogenization unit 211 that removes noise from the voice signal and a feature vector generation unit 212 that extracts the input feature vector from the voice signal.

The homogenization unit 211 may receive the voice signal from the sound input module 100 and remove noise from it. Noise here means any sound other than a human voice, and it may be removed by known methods such as filtering by frequency band. Receiving the speaker's voice command through a directional microphone and then removing noise makes the voice signal homogeneous.
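The frequency-band filtering mentioned above can be sketched as follows. This is only an illustration of one known method, not the filter used in the embodiment: the function name `band_pass` and the 300 to 3400 Hz pass band are assumptions chosen for the example, and the naive DFT is used only to keep the sketch dependency-free.

```python
import cmath
import math


def band_pass(samples, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Zero every DFT bin outside [low_hz, high_hz] and transform back.

    The default pass band is an illustrative assumption, not a value
    taken from the specification. Naive O(n^2) DFT, fine for short frames.
    """
    n = len(samples)
    # Forward DFT of the real-valued frame.
    spectrum = [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    # Bin k and bin n-k carry the same physical frequency, so mirror the index.
    for k in range(n):
        freq = min(k, n - k) * sample_rate / n
        if not (low_hz <= freq <= high_hz):
            spectrum[k] = 0
    # Inverse DFT; the imaginary part is numerical residue for real input.
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]
```

Applied to a frame containing a 1 kHz tone plus a 100 Hz hum, the sketch keeps the tone and drops the hum, which is the kind of homogenized signal the feature vector generation unit is meant to receive.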

The feature vector generation unit 212 may generate a feature vector of the voice signal. The voice signal input to the feature vector generation unit 212 is preferably a homogenized voice signal from which noise has been removed. A feature vector is a set of features that distinguishes voice signals from one another so that the speaker can be recognized. A feature vector generated by the feature vector generation unit 212 may be passed to the neuromorphic module 300 either as an input feature vector, on which speaker recognition is to be performed, or as a reference feature vector, which serves as the reference for speaker recognition.

While the speaker's voice is being learned, the feature vector generation unit 212 may convert voice signals, subdivided as finely as possible according to variations in the speaker's voice, into feature vectors and pass them to the neuromorphic module 300 as reference feature vectors. The process of generating a feature vector from a voice signal may follow the equations below.

vi: speaker voice signal; wi: matrix of sound band and sound width; x: input voice signal

Vw: axis vector of the speaker range; Vi: range of the speaker's voice signal; μ: mean of the speaker's voice signal

According to Equations 3 and 4 below, V_B and V_W are compared continuously, and the speaker can be determined from the input voice signal at which the value of A(W) is smallest.

V_B: distance from the mean of the speaker's voice signal; μ: mean of the speaker's voice signal

V_B: distance from the mean of the speaker's voice signal; Vw: axis vector of the speaker range

FIG. 3 illustrates a speaker recognition scheme according to an embodiment of the present invention.

When the feature vectors generated by the feature vector generation unit 212 are arranged by sound band and sound width, they separate by speaker, as shown in FIG. 3. For example, the feature vectors in FIG. 3 fall into a first speaker, a second speaker, and a third speaker. Because sound band and sound width differ from speaker to speaker, defining a range for each speaker yields different axis vectors (Vw1, Vw2, Vw3) and different feature vector means (U1, U2, U3). When a directional microphone is used as the sound input module 100, sound arriving from a specific direction is weighted according to the characteristics of the microphone. This weighting may be reflected in the feature vector in a way that separates it from the other feature vectors. The feature vectors of the speakers can therefore be sufficiently far apart, as in FIG. 3.

For example, the input feature vector of some voice command may fall at an arbitrary position A in FIG. 3. Comparing position A with the feature vectors shows that the second speaker's feature vectors (Vw2) and their mean (U2) are the closest, so it can be determined that the second speaker input the voice command. That is, the distance (V_B2) between position A and the mean (U2) of the second speaker's feature vectors (Vw2) is shorter than the distances (V_B1, V_B3) between position A and the means (U1, U3) of the other speakers' feature vectors, so the second speaker is determined to have input the voice command.
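The decision described for position A can be sketched as a nearest-mean lookup. This is a minimal illustration under stated assumptions: the two-dimensional coordinates, the speaker labels, the use of Euclidean distance, and the function name `nearest_speaker` are all hypothetical and are not taken from FIG. 3.

```python
import math


def nearest_speaker(point, speaker_means):
    """Return (speaker, distance) for the mean closest to `point`."""
    distances = {
        name: math.dist(point, mean) for name, mean in speaker_means.items()
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


# Hypothetical means U1, U2, U3 on (sound band, sound width) axes.
means = {"speaker1": (1.0, 4.0), "speaker2": (4.0, 4.5), "speaker3": (2.5, 1.0)}
position_a = (3.5, 4.0)  # input feature vector, playing the role of position A
speaker, v_b = nearest_speaker(position_a, means)
print(speaker, round(v_b, 3))
```

With these made-up coordinates the second speaker's mean is the closest, mirroring the comparison of V_B1, V_B2, and V_B3 in the text.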

FIG. 4 is a block diagram of the neuromorphic module 300 according to an embodiment of the present invention, and FIG. 5 illustrates the operation of the neuromorphic device 330 according to an embodiment of the present invention.

As shown in FIGS. 4 and 5, the neuromorphic module 300 may include: one or more neuromorphic devices 330, each including a plurality of neurocells 331 that each store a reference feature vector, which compare the input feature vector with the pre-stored reference feature vectors simultaneously and in parallel, one comparison per neurocell 331, and output the comparison result; a neuromorphic interface 310 that passes the input feature vector received from the preprocessing unit 210 to the neuromorphic device 330 and passes the comparison result to the speaker recognition unit 220; and a cell switch 320 that switches which set of neurocells 331, among the plurality of neurocells 331, processes the input feature vector.

The neuromorphic module 300 may include one or more neuromorphic devices 330. A neuromorphic device 330 includes a plurality of neurocells 331, and each neurocell 331 stores one reference feature vector. The neuromorphic device 330 compares the received input feature vector in parallel with the reference feature vector stored in each of the neurocells 331 and outputs the result.

A conventional deep learning algorithm built on a neural network structure passes the result computed in one neuron (a neurocell 331 of the neuromorphic device 330) through a synapse as the input of another neuron (another neurocell 331 of the neuromorphic device 330) and thus performs multi-stage computation, which makes it slow. In contrast, in an embodiment of the present invention each of the neurocells 331 of the neuromorphic device 330 compares the input feature vector with its reference feature vector simultaneously and in parallel, and the value output by a neurocell 331 is output directly as the result rather than being fed into another neurocell 331. An embodiment of the present invention can therefore output the feature vector comparison result much faster than a conventional deep learning algorithm that performs multi-stage computation, which enables rapid speaker recognition.
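The single-stage comparison can be modeled in software as follows. This is a sketch only: `NeuroCell` and `compare_all` are hypothetical names, the Manhattan distance is an assumed metric (the specification does not name one here), and Python threads merely stand in for hardware cells that operate at the same time.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class NeuroCell:
    cell_id: int
    reference: tuple  # the one reference feature vector stored in this cell

    def compare(self, input_vector):
        # Manhattan distance; the choice of metric is an assumption.
        distance = sum(abs(a - b) for a, b in zip(self.reference, input_vector))
        return self.cell_id, distance


def compare_all(cells, input_vector):
    """Every cell receives the same input; no cell output feeds another cell."""
    with ThreadPoolExecutor(max_workers=max(1, len(cells))) as pool:
        # pool.map returns results in the same order as `cells`.
        return list(pool.map(lambda cell: cell.compare(input_vector), cells))
```

The point of the sketch is the data flow: the result is one list of per-cell distances produced in a single pass, with no layer-to-layer propagation.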

The neuromorphic interface 310 supports data exchange between the preprocessing unit 210 and the neuromorphic device 330. It passes the feature vector output by the preprocessing unit 210 to the neuromorphic device 330, receives the feature vector comparison result from the neuromorphic device 330, and outputs it. The neuromorphic interface 310 uses a high-speed data transfer protocol to minimize delay.

The cell switch 320 may be switched so that the input feature vector received from the preprocessing unit 210 is delivered to a specific neuromorphic device 330 among the plurality of neuromorphic devices 330, or to specific neurocells 331 among the plurality of neurocells 331 in a neuromorphic device 330. The neurocells 331 are divided into sets, and the cell switch 320 may switch to the set needed for computation according to the number of input feature vectors. The more varied the voice signals stored in the neurocells 331, the clearer the distinction between voice signals becomes, and power consumption can be reduced by switching which neurocells 331 perform the comparison according to the feature vector.
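The switching behavior can be sketched as a router that activates one set of cells at a time. The class `CellSwitch`, the named sets, and the comparison counter used as a rough stand-in for power consumption are all hypothetical; the specification does not describe how sets are keyed or selected beyond the text above.

```python
class CellSwitch:
    """Routes an input vector to one set of cells; the other sets stay idle."""

    def __init__(self, cell_sets):
        # cell_sets: {set_name: [(cell_id, reference_vector), ...]}
        self.cell_sets = cell_sets
        self.active = None
        self.comparisons = 0  # crude proxy for energy spent on comparisons

    def switch_to(self, set_name):
        if set_name not in self.cell_sets:
            raise KeyError(set_name)
        self.active = set_name

    def compare(self, input_vector):
        if self.active is None:
            raise RuntimeError("no cell set selected")
        results = []
        for cell_id, reference in self.cell_sets[self.active]:
            # Manhattan distance, an assumed metric for illustration.
            distance = sum(abs(a - b) for a, b in zip(reference, input_vector))
            results.append((cell_id, distance))
            self.comparisons += 1
        return results
```

Only the cells in the active set perform a comparison, so the counter grows with the size of the selected set rather than with the total number of cells.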

The speaker recognition unit 220 may determine which speaker input the voice command based on the comparison result received from the neuromorphic module 300. The comparison result output by the neuromorphic module 300 may include information about the reference feature vector closest to the input feature vector, and the speaker recognition unit 220 may determine the speaker who input the voice command by checking which speaker that reference feature vector is mapped to. When the speaker who input the voice command is a predetermined specific speaker, the speaker recognition unit 220 may decide to output the voice signal to the outside.
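The lookup from the closest reference feature vector to a speaker can be sketched as below. The function name `identify_speaker`, the list-of-pairs format of the comparison result, and the dictionary that maps cell identifiers to speakers are assumptions made for the example.

```python
def identify_speaker(comparison, cell_to_speaker):
    """Pick the cell with the smallest distance and return its mapped speaker.

    comparison: list of (cell_id, distance) pairs from the comparison stage.
    cell_to_speaker: mapping built when reference vectors were stored
    (a hypothetical structure for this sketch).
    """
    if not comparison:
        return None
    best_cell, _ = min(comparison, key=lambda pair: pair[1])
    return cell_to_speaker.get(best_cell)
```

Several cells may map to the same speaker, which matches the idea of storing many finely subdivided reference feature vectors per speaker.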

The preprocessing unit 210 and the speaker recognition unit 220 according to an embodiment of the present invention may be implemented as computer program code executable on a processor 200. The processor 200 may be a central processing unit (CPU) of a computer, a mobile application processor (AP) used in mobile devices such as smartphones, or a processor used in an in-vehicle audio, video, and navigation (AVN) system.

Unlike the preprocessor 210 and the speaker recognition unit 220, which run on a general-purpose processor 200, the neuromorphic module 300 according to an embodiment of the present invention uses a neuromorphic element composed of neurocells 331 and synapses connecting the neurocells 331. When the operation of comparing feature vectors with one another is run on a general-purpose processor 200, comparing all of the feature vectors takes a long time. In an embodiment of the present invention, therefore, the feature vectors are generated on the general-purpose processor 200, while their comparison is performed by the neuromorphic element 330, which compares a plurality of feature vectors simultaneously in parallel, so that the processing time can be greatly reduced.

FIG. 6 is a diagram illustrating a signal transmission module 400 according to an embodiment of the present invention.

As shown in FIG. 6, the speaker recognition apparatus using a neuromorphic element according to an embodiment of the present invention may further include a signal transmission module 400 that, when the speaker determined to have input the voice command is judged to be a preset specific speaker, transmits the voice signal to a natural language processing module 500 for providing a voice recognition service. The natural language processing module 500 receives the voice signal and recognizes the voice command through natural language processing, thereby enabling a service that uses voice recognition.

The signal transmission module 400 may be switched, according to a switching control signal received from the speaker recognition unit 220, to pass on either the voice signal received from the speaker recognition unit 220 or the voice signal received from the sound input module 100. The signal transmission module 400 may be implemented as a physical switch capable of high-speed switching. In the normal state, the signal transmission module 400 may be connected to a first contact P1 so as to transmit the voice signal received from the sound input module 100 to the natural language processing module 500. When it is determined that the specific speaker has input a voice command, the signal transmission module 400 may be connected to a second contact P2 so as to transmit the voice signal received from the speaker recognition unit 220 to the natural language processing module 500.
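The two-contact behavior can be sketched in software as follows. The embodiment describes a physical switch, so this is only a behavioral model; the names `Contact`, `SignalTransmissionModule`, `apply_control`, and `route` are hypothetical.

```python
from enum import Enum


class Contact(Enum):
    P1 = "sound_input_module"        # normal state
    P2 = "speaker_recognition_unit"  # specific speaker detected


class SignalTransmissionModule:
    """Behavioral model of the signal transmission module 400."""

    def __init__(self):
        # The normal state connects the first contact P1.
        self.contact = Contact.P1

    def apply_control(self, specific_speaker_detected):
        # Stand-in for the switching control signal from the speaker recognition unit 220.
        self.contact = Contact.P2 if specific_speaker_detected else Contact.P1

    def route(self, from_sound_input, from_speaker_recognition):
        """Return the signal that is passed on to the natural language processing module 500."""
        if self.contact is Contact.P1:
            return from_sound_input
        return from_speaker_recognition


module = SignalTransmissionModule()
print(module.route("raw signal", "verified signal"))
module.apply_control(True)
print(module.route("raw signal", "verified signal"))
```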

Because the speaker recognition apparatus using a neuromorphic element according to an embodiment of the present invention further includes the signal transmission module 400, it can be used in a voice recognition system that supports both a function of executing only the voice commands of a predetermined specific speaker and a function of also executing the voice commands of general speakers. For example, in a special environment such as a vehicle, the voice commands of the driver and of passengers need to be handled differently, and the apparatus can be used in a voice recognition system restricted so that certain electronic devices in the vehicle operate only on the driver's voice commands.

The natural language processing module 500 may receive a voice signal corresponding to a voice command input by the specific speaker and provide a predetermined service through natural language processing, and may also receive a voice signal corresponding to a voice command input by a general speaker and provide a predetermined service. The natural language processing module 500 may be integrated with the speaker recognition apparatus using a neuromorphic element according to an embodiment of the present invention, or may be configured as a physically separate device. For example, the speaker recognition apparatus may be included in a navigation device that provides a service using voice recognition, or may be configured separately from the navigation device and used by being connected to it.

FIG. 7 is a flowchart illustrating each step of a speaker recognition method using a neuromorphic element according to an embodiment of the present invention.

As shown in FIG. 7, the speaker recognition method using a neuromorphic element according to an embodiment of the present invention may include a sound input step (S10) of receiving a sound including a voice command and converting it into a voice signal, a preprocessing step (S20) of removing noise from the voice signal and generating an input-feature vector of the voice signal, a feature vector comparison step (S30) in which the neuromorphic module 300 compares the input-feature vector with a plurality of pre-stored reference-feature vectors simultaneously in parallel and outputs a comparison result, and a speaker recognition step (S40) of determining, based on the comparison result, who the speaker that input the voice command is.

The sound input step (S10) is a process in which the sound input module 100 receives sound, converts it into a voice signal, and outputs the voice signal. The sound input module 100 may convert sound received through a directional microphone into a voice signal and output it to the preprocessor 210.

The preprocessing step (S20) may include a homogenization step (S21) of removing noise from the voice signal generated in the sound input step (S10) and a feature vector generation step (S22) of generating a feature vector of the voice signal. The homogenization step (S21) may be performed by a homogenizer 211 implemented on the processor 200. Performing the sound input step (S10) with a directional microphone together with the homogenization step (S21) secures the homogeneity of the voice signal. The homogenization step (S21) is preferably performed before the feature vector generation step (S22).
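The ordering of the two sub-steps can be illustrated with a small sketch. The specification does not state which noise-removal or feature-extraction algorithm is used, so everything here is an assumption chosen for brevity: `homogenize` removes the DC offset, gates low-level samples, and normalizes the peak, and `feature_vector` uses the mean absolute amplitude of equal slices as a stand-in feature.

```python
def homogenize(samples, noise_floor=0.05):
    """S21 sketch: remove DC offset, gate low-level noise, normalize the peak."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    centered = [s - mean for s in samples]
    # Samples below the assumed noise floor are treated as noise and zeroed.
    gated = [s if abs(s) >= noise_floor else 0.0 for s in centered]
    peak = max(abs(s) for s in gated)
    if peak == 0.0:
        return gated
    return [s / peak for s in gated]


def feature_vector(samples, frames=4):
    """S22 sketch: mean absolute amplitude of each of `frames` equal slices.

    Samples left over after the last full slice are ignored.
    """
    size = max(1, len(samples) // frames)
    return [
        sum(abs(s) for s in samples[i * size:(i + 1) * size]) / size
        for i in range(frames)
    ]


# Homogenization runs first, then feature extraction, as the text recommends.
signal = [0.0, 0.5, -0.5, 0.0, 1.0, -1.0, 0.02, -0.02]
print(feature_vector(homogenize(signal)))
```

Running homogenization first means that recordings made at different levels yield comparable feature vectors, which is the point of securing homogeneity before step S22.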

The feature vector generation step (S22) is a process of generating a feature vector representing the features by which voice signals can be distinguished from one another. It may be performed by the feature vector generator 212 implemented on the processor 200. The feature vector generated in the feature vector generation step (S22) may be used as a reference-feature vector stored in the neuromorphic element 330 in the learning step (S90), or as the input-feature vector in the feature vector comparison step (S30).

The feature vector comparison step (S30) may include a neurocell switching step (S31) of switching to a set of some neurocells 331, among the plurality of neurocells 331 included in the neuromorphic element 330 of the neuromorphic module 300, that will process the input-feature vector, and a parallel comparison step (S32) of comparing the input-feature vector with the plurality of reference-feature vectors pre-stored in the respective neurocells 331, simultaneously in parallel across the neurocells 331, and outputting a comparison result.

The neurocell switching step (S31) switches the neurocells 331 to match the set of neurocells 331 that need to perform the comparison operation, according to the number of input-feature vectors. In the neurocell switching step (S31), the cell switch 320 operates to connect, among the plurality of neurocells 331 included in the neuromorphic element 330, the neurocells 331 that will receive the input-feature vector and compare it with their reference-feature vectors.

The parallel comparison step (S32) is a process of comparing the reference-feature vectors with the input-feature vector and outputting, as the comparison result, whether they are similar. It is performed in the neuromorphic element 330 of the neuromorphic module 300. In each of the neurocells 331 connected for comparison in the neurocell switching step (S31), the input-feature vector is compared with the stored reference-feature vector simultaneously in parallel. The parallel comparison step (S32) does not perform a multi-stage determination in which the output of one neurocell 331 is passed to the input of another neurocell 331. Instead, the comparison result of each neurocell 331 is output directly, so the comparison result is obtained quickly.
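The single-stage structure of this step can be sketched as below. The specification does not name the distance measure or the similarity criterion, so the L1 distance and the `threshold` parameter are assumptions. The thread pool only mimics the independence of the per-neurocell comparisons and does not reproduce the hardware parallelism of a neuromorphic element.

```python
from concurrent.futures import ThreadPoolExecutor


def l1_distance(a, b):
    """Assumed similarity measure: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def parallel_compare(input_vector, references, threshold):
    """Each reference plays the role of one neurocell; all are compared independently."""
    with ThreadPoolExecutor() as pool:
        distances = list(pool.map(lambda ref: l1_distance(input_vector, ref), references))
    # Single-stage output: no cell feeds another cell; results are reported directly.
    return [
        {"cell": i, "distance": d, "similar": d <= threshold}
        for i, d in enumerate(distances)
    ]


for row in parallel_compare([1.0, 2.0], [[1.0, 2.0], [3.0, 5.0]], threshold=1.0):
    print(row)
```

Because no comparison depends on the output of another, the work per neurocell stays the same as the number of stored reference-feature vectors grows, which is the property the text relies on for speed.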

The speaker recognition step (S40) is a process of determining, based on the comparison result, who the speaker that input the voice command is. It may be performed by the speaker recognition unit 220 implemented on the processor 200. In the speaker recognition step (S40), whether the speaker who input the voice command is a preset specific speaker can be determined by checking the comparison result output from the neuromorphic module 300 and which speaker the particular reference-feature vector is associated with. If it is determined in the speaker recognition step (S40) that the preset specific speaker has input the voice command, the signal transmission module 400 may be controlled to output the voice signal to the natural language processing module 500.

The speaker recognition method using a neuromorphic element according to an embodiment of the present invention may further include a signal transmission step (S50) of transmitting the voice signal to the natural language processing module 500 for providing a voice recognition service when the speaker determined to have input the voice command is judged to be a preset specific speaker.

The signal transmission step (S50) is a process in which the signal transmission module 400 is switched so that, when it is determined that the specific speaker has input the voice command, the voice signal of the specific speaker is transmitted to the device that performs the voice recognition service. The signal transmission step (S50) may be performed under the control of the speaker recognition unit 220. In the normal state, the signal transmission step (S50) is not performed and the voice signal output from the sound input module 100 is transmitted to the natural language processing module 500. The method may be configured so that the signal transmission step (S50) is performed when it is determined that the specific speaker has input a voice command.

When the signal transmission step (S50) is performed, the natural language processing module 500, having received the voice signal determined to be that of the specific speaker, may perform a voice recognition service providing step (S60) of recognizing the voice command through natural language processing and providing a predetermined service.

The speaker recognition method using a neuromorphic element according to an embodiment of the present invention may further include, before the feature vector comparison step (S30), a learning step (S90) of mapping the input-feature vector output from the preprocessing step (S20) to a specific speaker and storing it in a neurocell 331 of the neuromorphic element. The learning step (S90) may be performed when the speaker recognition apparatus is in a learning mode, and the feature vector comparison step (S30) may be performed when it is in a classification mode.

The learning step (S90) is a process of mapping feature vectors to a specific speaker and storing them, so that the voice commands of the specific speaker can be distinguished from those of other speakers. According to an embodiment of the present invention, the learning step (S90) may be performed for a predetermined period or a predetermined number of times for speaker recognition. In the learning step (S90), the feature vector generated in the preprocessing step (S20) is stored in the neuromorphic element 330 of the neuromorphic module 300 as a reference-feature vector. Once the learning step (S90) has been performed for the predetermined period or number of times and enough reference-feature vectors have been stored to distinguish the specific speaker, speaker recognition can be performed by setting the operation mode to the classification mode.
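The relationship between the learning mode and the classification mode can be sketched as a small state machine. This is an illustrative model only: the class `SpeakerMemory`, the parameter `required_samples` (standing in for the "predetermined number of times"), the automatic mode change, and the L1 nearest-match rule are all assumptions.

```python
class SpeakerMemory:
    """Hypothetical stand-in for the neurocells of the neuromorphic element 330."""

    def __init__(self, required_samples=3):
        self.mode = "learning"
        self.required_samples = required_samples  # assumed "predetermined number of times"
        self.entries = []  # pairs of (reference-feature vector, speaker label)

    def learn(self, vector, speaker):
        """S90 sketch: store a feature vector mapped to a speaker."""
        if self.mode != "learning":
            raise RuntimeError("learning step (S90) runs only in learning mode")
        self.entries.append((list(vector), speaker))
        # Switch modes once enough reference-feature vectors are stored.
        if len(self.entries) >= self.required_samples:
            self.mode = "classification"

    def classify(self, vector):
        """S30/S40 sketch: return the speaker of the nearest stored vector."""
        if self.mode != "classification":
            raise RuntimeError("comparison step (S30) runs only in classification mode")

        def distance(entry):
            return sum(abs(a - b) for a, b in zip(entry[0], vector))

        return min(self.entries, key=distance)[1]


memory = SpeakerMemory(required_samples=2)
memory.learn([0.1, 0.9], "driver")
memory.learn([0.8, 0.2], "passenger")
print(memory.mode, memory.classify([0.2, 0.8]))
```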

The speaker recognition apparatus and method using a neuromorphic element according to an embodiment of the present invention described above use the plurality of neurocells 331 included in the neuromorphic element 330 not in multiple stages but in parallel in a single stage, so that feature vectors can be compared quickly. In addition, because a reference-feature vector is stored in each of the plurality of neurocells 331, a wide variety of reference-feature vectors can be stored, so that feature vectors corresponding to the various voice commands input by the specific speaker to be identified can be stored and evaluated.

In addition, so that it can be applied even to low-specification systems, an embodiment of the present invention processes the feature vector comparison not on the processor 200 (a CPU or the like) but in a separate neuromorphic element 330, thereby minimizing the load on the processor 200, and may use a high-speed neuromorphic interface 310 to receive the comparison result quickly. Furthermore, because an embodiment of the present invention does not rely on a deep learning technique that works through a server, as conventional approaches do, data communication is unnecessary and there is no time delay caused by interworking with a server.

FIG. 8 is a graph showing the speaker recognition rate of the speaker recognition apparatus and method using a neuromorphic element according to an embodiment of the present invention. Results obtained with the conventional method are shown as a dotted line, and results obtained with an embodiment of the present invention are shown as a solid line.

Referring to FIG. 8, when the speaker recognition apparatus and method using a neuromorphic element according to an embodiment of the present invention are applied to a single speaker, the recognition rate is 95% or higher. In contrast, when speaker recognition is performed with an existing algorithm that uses only the processor 200, without the neuromorphic element 330, the recognition rate is below 70%. This result can be interpreted as an improvement in recognition rate that arises because, in an embodiment of the present invention, a reference-feature vector is stored in each of the plurality of neurocells 331 of the neuromorphic element 330, so that a large number of finely subdivided feature vectors can be used in the parallel comparison. Accordingly, in an embodiment of the present invention, the neuromorphic module 300 may include one or more neuromorphic elements 330, and when there are a plurality of neuromorphic elements 330 they are arranged in parallel at the same stage, so that a very large number of feature vectors can be evaluated simultaneously in parallel and the recognition rate can be improved.

FIG. 9 is a graph showing the relationship between the number of speakers and the number of neurocells 331 in the speaker recognition apparatus and method using a neuromorphic element according to an embodiment of the present invention.

FIG. 9 shows the number of neurocells 331 required to obtain a recognition rate of 95% or more. It can be seen that the required number of neurocells 331 increases as the number of speakers to be identified by the speaker recognition apparatus and method using a neuromorphic element according to an embodiment of the present invention increases. This is because, as the number of speakers grows, the amplitudes of the various voices that can occur in the speakers' speech must be compared, so more neurocells 331 are needed.

Although the present invention has been described in detail through specific embodiments, these are intended to describe the present invention concretely. The present invention is not limited to them, and it is evident that modifications and improvements can be made by a person of ordinary skill in the art within the technical spirit of the present invention.

All simple modifications and changes of the present invention fall within its scope, and the specific scope of protection of the present invention will be made clear by the appended claims.

100: sound input module
200: processor
210: preprocessor
211: homogenizer
212: feature vector generator
220: speaker recognition unit
300: neuromorphic module
310: neuromorphic interface
320: cell switch
330: neuromorphic element
331: neurocell
400: signal transmission module
500: natural language processing module

Claims

A speaker recognition apparatus comprising:
a sound input module that receives a sound including a voice command and converts it into a voice signal;
a preprocessor that receives the voice signal and generates an input-feature vector based on a sound width and bandwidth of the voice signal;
a calculation unit that, for speaker recognition, compares the input-feature vector simultaneously in parallel with the reference-feature vectors of a plurality of registered speakers, each stored in one of a plurality of storage areas, and directly outputs a comparison result including information on the reference-feature vector closest to the input-feature vector; and
a speaker recognition unit that determines who the speaker that input the voice command is by checking to whom the closest reference-feature vector in the comparison result received from the calculation unit is mapped.

The apparatus according to claim 1,
further comprising a signal transmission module that transmits the voice signal to a natural language processing module for providing a voice recognition service when the speaker determined to have input the voice command is judged to be a preset specific speaker.

The apparatus according to claim 1,
wherein the sound input module includes a directional microphone disposed toward a specific speaker, and
the preprocessor includes:
a homogenizer that removes noise from the voice signal; and
a feature vector generator that extracts the input-feature vector from the voice signal.

The apparatus according to claim 1,
wherein the calculation unit includes:
one or more neuromorphic elements each including a plurality of neurocells that each store a reference-feature vector, the neuromorphic elements comparing the input-feature vector with the plurality of pre-stored reference-feature vectors simultaneously in parallel across the plurality of neurocells and outputting the comparison result;
a neuromorphic interface that transmits the input-feature vector received from the preprocessor to the neuromorphic element and transmits the comparison result to the speaker recognition unit; and
a cell switch that switches to a set of some neurocells, among the plurality of neurocells, that will process the input-feature vector.

A speaker recognition method comprising:
a sound input step of receiving a sound including a voice command and converting it into a voice signal;
a preprocessing step of removing noise from the voice signal and generating an input-feature vector based on a sound width and bandwidth of the voice signal;
a feature vector comparison step of, for speaker recognition, comparing the input-feature vector simultaneously in parallel with the reference-feature vectors of a plurality of registered speakers, each stored in one of a plurality of storage areas, and directly outputting a comparison result including information on the reference-feature vector closest to the input-feature vector; and
a speaker recognition step of determining who the speaker that input the voice command is by checking to whom the closest reference-feature vector in the comparison result is mapped.

The method according to claim 5,
further comprising a signal transmission step of transmitting the voice signal to a natural language processing module for providing a voice recognition service when the speaker determined to have input the voice command is judged to be a preset specific speaker.

The method according to claim 5,
wherein the feature vector comparison step includes:
a neurocell switching step of switching to a set of some neurocells, among a plurality of neurocells included in a neuromorphic element of a neuromorphic module, that will process the input-feature vector; and
a parallel comparison step of comparing the input-feature vector with the plurality of reference-feature vectors pre-stored in the respective neurocells, simultaneously in parallel across the plurality of neurocells, and outputting the comparison result.

The method according to claim 5,
further comprising, before the feature vector comparison step, a learning step of mapping the input-feature vector output from the preprocessing step to a specific speaker and storing it in a neurocell of a neuromorphic element.