KR100776803B1

KR100776803B1 - Speaker Recognition Device and Method of Intelligent Robot Using Multi-channel Fuzzy Fusion

Info

Publication number: KR100776803B1
Application number: KR1020060093539A
Authority: KR
Inventors: 곽근창; 김혜진; 배경숙; 지수영
Original assignee: 한국전자통신연구원
Priority date: 2006-09-26
Filing date: 2006-09-26
Publication date: 2007-11-19
Anticipated expiration: 2026-09-26

Abstract

The present invention relates to an apparatus and method for speaker recognition in an intelligent robot through multi-channel fuzzy fusion. The apparatus comprises: a multi-channel microphone that acquires and registers a speaker's voice online through a plurality of channels; a voice data acquisition unit that detects a start point and an end point in each registered per-channel voice to delimit sentences and removes noise from the speech in the delimited sentences; a speaker model generator that builds a speaker model based on the features of each noise-removed voice data and converts the output of the built speaker model into a log-likelihood value; a fuzzy processor that calculates a fused fuzzy value from the per-channel log-likelihood values; and a speaker recognizer that recognizes as the speaker the candidate with the maximum fused value based on the calculated fuzzy values. As a result, speech uttered from any direction can be acquired more accurately, and speaker recognition performance can be improved in noisy environments and at long range.

Description

Apparatus and Method for recognizing speaker using fuzzy fusion based multichannel in intelligence robot

FIG. 1 is a block diagram of a speaker recognition apparatus of an intelligent robot using multi-channel fuzzy fusion according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart of a speaker recognition method of an intelligent robot using multi-channel fuzzy fusion according to a preferred embodiment of the present invention;

FIG. 3 is a flowchart showing the speaker voice data registration step of FIG. 2 in more detail according to an embodiment of the present invention;

FIG. 4 is a flowchart showing the speaker model building step of FIG. 2 in more detail according to an embodiment of the present invention; and

FIG. 5 is a flowchart showing the fuzzy fusion and speaker recognition step of FIG. 2 in more detail according to an embodiment of the present invention.

The present invention relates to an apparatus and method for speaker recognition in an intelligent robot, and more particularly, to an apparatus and method that capture a talker's voice through microphones with clearer sensitivity and thereby determine more accurately who the speaker is.

Recently, robots that assist users with their intended tasks have been developed to make daily life more convenient. In particular, intelligent robots are being developed that make intelligent decisions through interaction with the user and act on the results.

Implementing such intelligent robot technology requires techniques by which the robot recognizes the user's commands or actions. Among these, face recognition and speaker (talker) recognition are emerging as core user recognition technologies.

To date, unlike face recognition, speaker recognition in robot environments has not been studied or developed extensively. Only in some fields is the speaker's voice captured by a single-channel microphone and the speaker recognized on that basis. However, this approach suffers from poor sensitivity and accuracy for speaker recognition.

Such speaker recognition methods are generally used in the security and e-commerce fields. Commonly used methods include text-dependent, text-prompted, and text-independent speaker recognition. Of these, a robot environment requires text-independent speaker recognition, so that the speaker can be recognized whatever sentence is uttered.

In addition, speaker recognition in a robot environment must work when a talker utters a conversation or command from any direction around the robot. The robot must also capture speech uttered at long range as well as at close range and determine who the talker is. Furthermore, an intelligent robot needs techniques that identify the noise factors present in various environments so as to prevent malfunction.

Thus, a robot equipped with a conventional single-channel microphone suffers degraded speaker recognition performance in noisy environments, across all directions, and at long range.

A first object of the present invention, devised to solve the above problems, is to provide an apparatus and method for speaker recognition in an intelligent robot that, for interaction between user and robot, can recognize more accurately who the speaker is in an arbitrary noisy environment, regardless of the sentence the talker utters.

A second object of the present invention is to provide an apparatus and method for speaker recognition in an intelligent robot that can capture speech uttered from any direction around the robot and recognize the speaker more accurately.

A third object of the present invention is to provide an apparatus and method for speaker recognition in an intelligent robot that can capture the voice of a talker speaking at a distance from the robot and recognize the speaker more accurately.

To achieve the above objects, a speaker recognition apparatus of an intelligent robot according to an embodiment of the present invention includes: a multi-channel microphone that acquires and registers a speaker's voice online through a plurality of channels; a voice data acquisition unit that detects a start point and an end point in each registered per-channel voice to delimit sentences and removes noise from the speech in the delimited sentences; a speaker model generator that builds a speaker model based on the features of each noise-removed voice data and converts the output of the built speaker model into a log-likelihood value; a fuzzy processor that calculates a fused fuzzy value from the per-channel log-likelihood values; and a speaker recognizer that recognizes as the speaker the candidate with the maximum fused value based on the calculated fuzzy values.

The voice data acquisition unit detects the start point and the end point of each channel's voice using an endpoint detection algorithm. The voice data acquisition unit uses a Wiener filter to remove noise from the speech of the sentences delimited by the start point and the end point.

The speaker model generator includes: a feature extractor that extracts feature information from the noise-removed voice of each channel; a speaker model builder that builds a speaker model for the voice of each channel based on the feature information; and a log-likelihood converter that generates the log-likelihood value corresponding to the voice for each speaker model. The feature extractor extracts the feature information of each channel's voice using Mel-Frequency Cepstral Coefficients (MFCC). The speaker model builder builds the speaker model for each channel's voice using a Gaussian Mixture Model (GMM).

The fuzzy processor includes: a fuzzy membership calculator that calculates a fuzzy membership degree for each of the per-channel log-likelihood values; and a fuzzy fusion unit that fuses the calculated fuzzy membership values. The fuzzy membership calculator calculates the fuzzy membership degree using a sigmoid membership function. The fuzzy fusion unit fuses the fuzzy membership values using a fuzzy integral.

Meanwhile, to achieve the above objects, a speaker recognition method of an intelligent robot according to an embodiment of the present invention includes: acquiring and registering a speaker's voice online through a multi-channel microphone; building a speaker model for the registered voice of each channel; and fusing the fuzzy membership degrees of the log-likelihood values corresponding to the speaker model of each channel and recognizing the maximum of the fused values as the speaker.

Preferably, the speaker voice data registration step includes: acquiring the speaker's voice output online from the multi-channel microphone; detecting a start point and an end point in the voice of each channel; and removing noise components from the spoken sentences delimited by the start point and the end point.

The speaker model building step includes: extracting feature values from the registered voice of each channel; building a speaker model for each channel's voice based on the feature values; and generating the log-likelihood value corresponding to the voice for the built speaker model.

The speaker recognition step includes: calculating a fuzzy membership degree based on the log-likelihood value of each channel; fusing the calculated fuzzy membership values; and recognizing the maximum of the fuzzy-fused values as the speaker.

According to the present invention, online speaker registration and recognition of speech from every direction around the robot are implemented through a multi-channel microphone to support voice-based human-robot interaction, so that speech uttered from any direction can be acquired more accurately. In addition, by recognizing the speaker through multi-channel online speaker registration, recognition, and fuzzy fusion, the present invention minimizes the degradation of speaker recognition performance in noisy environments and at long range.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals throughout the drawings wherever possible. Detailed descriptions of well-known functions and configurations that could unnecessarily obscure the subject matter of the present invention are omitted.

To capture a talker's voice with clearer sensitivity and determine more accurately who the speaker is, the present invention proposes a speaker recognition method for an intelligent robot that captures the talker's voice through a multi-channel microphone and fuzzy-fuses the captured multi-channel speech. To enable recognition of speech from every direction around the robot, several microphones are attached around the robot. To enable more accurate speaker recognition in noisy environments and at long range, the invention discloses a technique that fuses fuzzy membership values through a fuzzy integral over the voice data captured from each channel of the multi-channel microphone.

FIG. 1 is a block diagram of a speaker recognition apparatus of an intelligent robot using multi-channel fuzzy fusion according to a preferred embodiment of the present invention.

As shown, a plurality of microphones 200 are arranged on the sides of the intelligent robot 100 so that speech uttered by the speaker 10 can be captured from all directions. In this embodiment, four microphones 220, 240, 260, and 280 are arranged on the sides of the intelligent robot 100. Thus, whatever direction the speaker 10 speaks from, the speech can be captured more readily through the microphones 220, 240, 260, and 280, which are spaced at predetermined intervals and face different directions. The multi-channel microphone 200 thereby acquires the voice of the speaker 10 online through test speech.

When four microphones 220, 240, 260, and 280 are attached to the robot as in this embodiment, endpoint detection units 320, 340, 360, and 380 are provided in correspondence with them. The endpoint detection unit 300, provided in correspondence with the multi-channel microphone 200, detects the start point and the end point of each voice acquired and output by the multi-channel microphone 200 and removes noise from each of the delimited sentences.

In this embodiment, speaker model generators 420, 440, 460, and 480 are provided in correspondence with the endpoint detection units 300 to build a speaker model for the voice of each channel.

Each of the speaker model generators 420, 440, 460, and 480 includes a feature extractor 422, 442, 462, 482, a speaker model builder 424, 444, 464, 484, and a log-likelihood converter 426, 446, 466, 486.

The feature extractors 422, 442, 462, and 482 each extract feature information from the voice data of the corresponding channel after the endpoint detection unit 300 has removed its noise. The speaker model builders 424, 444, 464, and 484 build a speaker model for the voice of each channel. The log-likelihood converters 426, 446, 466, and 486 convert the likelihood of the voice under each speaker model into a log-likelihood value.

As shown in the drawing, fuzzy membership calculators 520, 540, 560, and 580 are provided in correspondence with the units 420, 440, 460, and 480 of the speaker model generator 400. The fuzzy membership calculators 520, 540, 560, and 580 calculate fuzzy membership degrees based on the log-likelihood values converted by the speaker model generators 420, 440, 460, and 480, respectively. The fuzzy fusion unit 600, connected to the outputs of the fuzzy membership calculators 520, 540, 560, and 580, fuzzy-fuses the values they calculate from the log-likelihood values for each channel's model speech.

The speaker recognizer 700, connected to the output of the fuzzy fusion unit 600, then recognizes as the speaker the user who uttered the voice with the maximum fuzzy fusion value.

FIG. 2 is a flowchart of a speaker recognition method of an intelligent robot using multi-channel fuzzy fusion according to a preferred embodiment of the present invention.

First, the endpoint detection unit 300 has the speaker read several sentences aloud, acquires the speech uttered online by the speaker through the multi-channel microphone 200, and registers it online as speaker voice data (S100).

Once the speaker voice data has been registered online, the speaker model generator 400 builds a speaker model for the voice data of each channel (S200).

Once the speaker models are built, the fuzzy membership calculator 500 and the fuzzy fusion unit 600 calculate the log-likelihood value corresponding to each channel's speaker model and fuse the calculated fuzzy membership values (S300). The speaker recognizer 700 then recognizes as the speaker the user who uttered the voice corresponding to the maximum of the fuzzy-fused values.

FIG. 3 is a flowchart showing the speaker voice data registration step (S100) of FIG. 2 in more detail according to an embodiment of the present invention.

First, the endpoint detection unit 300 acquires the speaker's voice output online from the multi-channel microphone 200 as the user reads several sentences (S120). That is, whereas conventionally each user's voice is registered offline in advance and each speaker's model is then built, speaker registration in the present invention is performed through each channel of the multi-channel microphone 200, and each speaker's model is built online.

The endpoint detection unit 300 then detects the start point and the end point of the voice registered for each channel using an endpoint detection algorithm (S140). It also removes the noise components from the spoken sentences delimited by the start point and the end point using a Wiener filter (S160).
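The text names an endpoint detection algorithm without specifying one. As a minimal sketch, assuming a simple short-time-energy rule, start and end points might be located as follows. The function names, the 160-sample frame length, and the 0.01 energy threshold are illustrative assumptions, not values from the source.

```python
import math


def frame_energies(samples, frame_len=160):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices spanning the frames whose energy
    exceeds `threshold`, or None if no frame does."""
    energies = frame_energies(samples, frame_len)
    active = [i for i, e in enumerate(energies) if e > threshold]
    if not active:
        return None
    return active[0] * frame_len, (active[-1] + 1) * frame_len


# Usage: 0.1 s of silence, 0.2 s of a 200 Hz tone, 0.1 s of silence at 8 kHz.
rate = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / rate) for n in range(1600)]
signal = [0.0] * 800 + tone + [0.0] * 800
print(detect_endpoints(signal))  # (800, 2400)
```

In practice the threshold would be tuned to the microphone and the noise floor of each channel. The Wiener filtering of step S160 is not shown here.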

FIG. 4 is a flowchart showing the speaker model building step (S200) of FIG. 2 in more detail according to an embodiment of the present invention.

First, the feature extractors 422, 442, 462, and 482 extract feature values from the voice data registered for each channel using Mel-Frequency Cepstral Coefficients (MFCC), a spectrum-based feature that reflects auditory characteristics (S220).
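The source does not give the MFCC configuration. As a partial sketch of the auditory warping that MFCC relies on, the snippet below converts between hertz and mel with one commonly used formula and places filter center frequencies evenly on the mel scale. The formula choice, the function names, and the filter count are assumptions, and the full MFCC chain (windowing, FFT, filterbank energies, log, DCT) is not shown.

```python
import math


def hz_to_mel(f_hz):
    """Convert frequency in Hz to mel (a commonly used analytic form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Center frequencies (Hz) of n_filters spaced evenly on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (k + 1)) for k in range(n_filters)]


# Usage: centers for a hypothetical 5-filter bank covering 0 to 4000 Hz.
centers = mel_center_frequencies(5, 0.0, 4000.0)
print([round(c) for c in centers])
```

The centers are dense at low frequencies and sparse at high frequencies, which is the auditory characteristic the text refers to.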

The speaker model builders 424, 444, 464, and 484 build a speaker model for the voice of each channel using a Gaussian Mixture Model (GMM) for each speaker (S240). The GMM can be expressed as Equation 1 below.

Here, w_i is the mixture weight and b_i is the probability value obtained through the Gaussian mixture model.
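Equation 1 itself is not reproduced in this text. Assuming it follows the standard GMM density, which is consistent with the definitions of w_i and b_i given here, it would read as below. The symbols x (a D-dimensional feature vector), λ (the model), μ_i, and Σ_i are conventional notation assumed for illustration.

```latex
p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, b_i(\mathbf{x}),
\qquad \sum_{i=1}^{M} w_i = 1,
\qquad
b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)
```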

Here, the density is a weighted linear combination of M Gaussian component densities, each parameterized by a mean vector and a covariance matrix. Given speech registered online from an arbitrary speaker, the speaker model builders 424, 444, 464, and 484 estimate the parameters of the Gaussian mixture model. This embodiment uses maximum likelihood estimation for this purpose.

Meanwhile, the log-likelihood converters 426, 446, 466, and 486 convert the likelihood of the voice under each speaker model into a log value (S260). For an utterance consisting of T frames, the likelihood of the Gaussian mixture model can be expressed as Equation 2 below.
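Equation 2 is not reproduced in this text. Under the usual assumption that the T frames are treated as independent, the likelihood of an utterance X = {x_1, ..., x_T} and its logarithm would take the following form. The notation is assumed for illustration.

```latex
p(X \mid \lambda) = \prod_{t=1}^{T} p(\mathbf{x}_t \mid \lambda),
\qquad
\log p(X \mid \lambda) = \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda)
```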

Here, the parameters of the speaker model consist of the weights, means, and covariances of the components i = 1, 2, ..., M. In this embodiment, the maximum likelihood parameter estimates can be obtained using the Expectation-Maximization (EM) algorithm. For convenience, the likelihood value of Equation 2 is converted into a log value.
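As a minimal sketch of the log-likelihood conversion, the following code scores a sequence of feature frames against a GMM. It assumes diagonal covariances and strictly positive weights, neither of which the source states. The function names are hypothetical, and EM training of the parameters is not shown.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log(sum_i w_i * b_i(x_t))."""
    total = 0.0
    for x in frames:
        terms = [
            math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)
        ]
        peak = max(terms)  # log-sum-exp keeps the sum numerically stable
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


# Usage: one-dimensional frames near 0 fit model A (mean 0) better than B (mean 5).
frames = [[0.1], [-0.2], [0.05]]
model_a = ([1.0], [[0.0]], [[1.0]])
model_b = ([1.0], [[5.0]], [[1.0]])
print(gmm_log_likelihood(frames, *model_a) > gmm_log_likelihood(frames, *model_b))  # True
```

Working in the log domain avoids the underflow that a product of T small probabilities would otherwise cause.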

FIG. 5 is a flowchart showing the fuzzy fusion and speaker recognition step (S300) of FIG. 2 in more detail according to an embodiment of the present invention.

First, the fuzzy membership calculators 520, 540, 560, and 580 calculate the fuzzy membership degree based on the log-likelihood value converted for each channel (S320). They can use a sigmoid membership function to calculate the fuzzy membership value as in Equation 3 below.

Here, a denotes the slope and c denotes the center. These values (a, c) can be obtained from the statistics of the feature values obtained from the training voice data registered online.
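Equation 3 is not reproduced in this text. Assuming it is the usual sigmoid, 1 / (1 + exp(-a (x - c))), the membership calculation might be sketched as follows. The way a and c are derived here (center at the mean of the enrollment scores, slope as the reciprocal of their standard deviation) is purely an illustrative assumption; the source only says they come from statistics of the training data.

```python
import math


def sigmoid_membership(x, a, c):
    """Map a log-likelihood x to a membership degree in [0, 1]."""
    z = a * (x - c)
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow for very negative z
    return ez / (1.0 + ez)


def fit_sigmoid_params(scores):
    """Hypothetical rule: c = mean of scores, a = 1 / standard deviation."""
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / len(scores)
    std = math.sqrt(var)
    a = 1.0 / std if std > 0 else 1.0
    return a, mean


# Usage: enrollment log-likelihoods from one channel, then a new score.
a, c = fit_sigmoid_params([-10.0, -20.0, -30.0])
print(sigmoid_membership(-12.0, a, c))
```

Mapping every channel's score into [0, 1] puts the channels on a common scale before they are fused.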

Meanwhile, the fuzzy fusion unit 600 fuses the fuzzy membership values calculated by the fuzzy membership calculators 520, 540, 560, and 580 (S340). The fuzzy fusion unit 600 can fuse these values using a fuzzy integral. The fuzzy integral used in this embodiment is developed as follows.

A set function g: P(S) -> [0,1] is called a fuzzy measure if it satisfies Equation 4 below.
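Equation 4 is not reproduced in this text. Assuming it states the standard axioms of a fuzzy measure on a finite set S (boundary conditions and monotonicity), it would read:

```latex
g(\emptyset) = 0, \qquad g(S) = 1, \qquad
A \subseteq B \subseteq S \;\Rightarrow\; g(A) \le g(B)
```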

From this definition, the fuzzy measure satisfies the property of Equation 5 below for arbitrary sets.

Because of the boundary condition g(S) = 1, the remaining parameter of the measure is determined by solving the polynomial of Equation 6 below.
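Equations 5 and 6 are not reproduced in this text. Assuming they take the standard Sugeno λ-measure form, in which g(A ∪ B) = g(A) + g(B) + λ g(A) g(B) for disjoint A and B with λ > -1, the boundary condition g(S) = 1 leads to the polynomial λ + 1 = Π (1 + λ g_i), where g_i is the density (importance) assigned to channel i. The sketch below solves that polynomial by bisection. The function name and the example densities are hypothetical.

```python
def solve_lambda(densities):
    """Non-zero root lam > -1 of prod(1 + lam * g_i) = 1 + lam.

    Returns 0.0 when the densities already sum to 1 (additive case)."""
    if len(densities) < 2:
        raise ValueError("need at least two densities")
    total = sum(densities)
    if abs(total - 1.0) < 1e-9:
        return 0.0

    def f(lam):
        prod = 1.0
        for g in densities:
            prod *= 1.0 + lam * g
        return prod - (1.0 + lam)

    if total > 1.0:
        lo, hi, sign_left = -1.0, -1e-9, 1.0   # root lies in (-1, 0)
    else:
        lo, hi, sign_left = 1e-9, 1.0, -1.0    # root lies in (0, inf)
        while f(hi) < 0.0:
            hi *= 2.0
            if hi > 1e12:
                raise ValueError("no root found; check the densities")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) * sign_left > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Usage: two channels, each with density 0.3.
print(solve_lambda([0.3, 0.3]))
```

For two equal densities of 0.3 the polynomial reduces to 0.09 λ = 0.4, so the root is 40/9. Densities that sum to more than 1 give a negative λ, and densities that sum to less than 1 give a positive one.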

Finally, the fuzzy integral can be calculated by Equation 7 below.

Here, the values of the function being integrated are arranged in sorted order, and the values of the fuzzy measure are determined recursively by Equation 8 below.
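Equations 7 and 8 are not reproduced in this text, and the source does not say which fuzzy integral is used. The sketch below therefore shows both common choices, the Sugeno integral and the Choquet integral, under the assumption that the membership values h are sorted in descending order and that the measure of each nested set follows the standard recursion g(A_i) = g_i + g(A_{i-1}) + λ g_i g(A_{i-1}). All function names are hypothetical, and λ is taken as given.

```python
def nested_measures(order, densities, lam):
    """g(A_1), g(A_2), ... for sources taken in `order`, with g(A_0) = 0."""
    values, g_prev = [], 0.0
    for i in order:
        g_prev = densities[i] + g_prev + lam * densities[i] * g_prev
        values.append(g_prev)
    return values


def sugeno_integral(h, densities, lam):
    """max over i of min(h(y_i), g(A_i)), with h sorted in descending order."""
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    g = nested_measures(order, densities, lam)
    return max(min(h[i], g_k) for i, g_k in zip(order, g))


def choquet_integral(h, densities, lam):
    """sum over i of (h(y_i) - h(y_{i+1})) * g(A_i), with h(y_{n+1}) = 0."""
    order = sorted(range(len(h)), key=lambda i: h[i], reverse=True)
    g = nested_measures(order, densities, lam)
    sorted_h = [h[i] for i in order] + [0.0]
    return sum((sorted_h[k] - sorted_h[k + 1]) * g[k] for k in range(len(order)))


# Usage: two channels with memberships 0.9 and 0.2, densities 0.3 each.
h = [0.9, 0.2]
densities = [0.3, 0.3]
lam = 40.0 / 9.0  # root of (1 + 0.3 * lam) ** 2 = 1 + lam
print(sugeno_integral(h, densities, lam), choquet_integral(h, densities, lam))
```

With these inputs the Sugeno integral is approximately 0.3 and the Choquet integral approximately 0.41. In a recognizer of the kind described, one fused value would be computed per enrolled speaker from that speaker's per-channel membership values.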

Accordingly, the speaker recognizer 700 recognizes the speaker by finding the maximum among the N multi-channel fuzzy fusion values obtained by Equation 8 as the result of the fuzzy fusion (S360).

Specific preferred embodiments of the present invention have been illustrated and described above. However, the present invention is not limited to these embodiments, and anyone with ordinary skill in the art to which the invention pertains can make various modifications and equivalent implementations without departing from the gist of the invention as claimed. The true technical scope of protection of the present invention should therefore be determined only by the appended claims.

According to the present invention as described above, online speaker registration and recognition of speech from every direction around the robot are implemented through a multi-channel microphone to support voice-based human-robot interaction, so that speech uttered from any direction can be acquired more accurately.

In addition, by recognizing the speaker through multi-channel online speaker registration, recognition, and fuzzy fusion, the present invention minimizes the degradation of speaker recognition performance in noisy environments and at long range.

Claims

A speaker recognition apparatus of an intelligent robot, comprising:

a multi-channel microphone for acquiring and registering a speaker's voice online through a plurality of channels;

a voice data acquisition unit for detecting a start point and an end point in each registered per-channel voice to delimit sentences and removing noise from the speech in the delimited sentences;

a speaker model generator for building a speaker model based on the features of each noise-removed voice data and converting the output of the built speaker model into a log-likelihood value;

a fuzzy processor for calculating a fused fuzzy value from the per-channel log-likelihood values; and

a speaker recognizer for recognizing, as the speaker, the maximum of the fused values based on the calculated fuzzy values.

The apparatus of claim 1,

wherein the voice data acquisition unit detects the start point and the end point of each channel's voice using an endpoint detection algorithm.

The apparatus of claim 1 or 2,

wherein the voice data acquisition unit removes noise from the speech of the sentences delimited by the start point and the end point using a Wiener filter.

The apparatus of claim 1,

wherein the speaker model generator comprises:

a feature extractor for extracting feature information from the noise-removed voice of each channel;

a speaker model builder for building a speaker model for the voice of each channel based on the feature information; and

a log-likelihood converter for generating the log-likelihood value corresponding to the voice for each speaker model.

The apparatus of claim 4,

wherein the feature extractor extracts the feature information of each channel's voice using Mel-Frequency Cepstral Coefficients (MFCC).

The apparatus of claim 4,

wherein the speaker model builder builds the speaker model for each channel's voice using a Gaussian Mixture Model (GMM).

The apparatus of claim 1,

wherein the fuzzy processor comprises:

a fuzzy membership calculator for calculating a fuzzy membership degree for each of the per-channel log-likelihood values; and

a fuzzy fusion unit for fusing the calculated fuzzy membership values.

The apparatus of claim 7,

wherein the fuzzy membership calculator calculates the fuzzy membership degree using a sigmoid membership function.

The apparatus of claim 7,

wherein the fuzzy fusion unit fuses the fuzzy membership values using a fuzzy integral.

A speaker recognition method of an intelligent robot, comprising:

acquiring and registering a speaker's voice online through a multi-channel microphone;

building a speaker model for the registered voice of each channel; and

fusing the fuzzy membership degrees of the log-likelihood values corresponding to the speaker model of each channel and recognizing the maximum of the fused values as the speaker.

The method of claim 10,

wherein the speaker voice data registration step comprises:

acquiring the speaker's voice output online from the multi-channel microphone;

detecting a start point and an end point in the voice of each channel; and

removing noise components from the spoken sentences delimited by the start point and the end point.

The method of claim 10 or 11,

wherein the speaker model building step comprises:

extracting feature values from the registered voice of each channel;

building a speaker model for each channel's voice based on the feature values; and

generating the log-likelihood value corresponding to the voice for the built speaker model.

The method of claim 12,

wherein the speaker recognition step comprises:

calculating a fuzzy membership degree based on the log-likelihood value of each channel;

fusing the calculated fuzzy membership values; and

recognizing the maximum of the fuzzy-fused values as the speaker.