KR20080023033A - Speaker recognition method and system using wireless microphone in robot service system

Info

Publication number: KR20080023033A
Application number: KR1020060087009A
Authority: KR
Inventors: 배경숙; 김혜진; 곽근창; 지수영
Original assignee: 한국전자통신연구원
Priority date: 2006-09-08
Filing date: 2006-09-08
Publication date: 2008-03-12

Abstract

A speaker recognition method and apparatus using a wireless microphone in an intelligent robot service system are provided. A speaker carrying a wireless microphone transmitter can replace the VAD (Voice Activity Detection) function simply by switching the wireless microphone on or off, so the recognition system operates only when needed and the speaker can talk freely without regard to distance from the robot or speaking position. The method comprises the steps of: registering speakers by using a wireless microphone (201); receiving valid voice data through a wireless microphone receiver from a speaker who speaks through a wireless microphone transmitter (202, 203); extracting a feature from the received valid voice data (203); generating at least one speaker model by using the extracted feature (204); and recognizing the speaker by measuring the similarity between the extracted feature and the generated speaker model (205-207).

Description

SPEAKER RECOGNITION METHOD AND SYSTEM USING WIRELESS MICROPHONE IN ROBOT SERVICE SYSTEM

FIG. 1 is a block diagram showing the structure of an intelligent robot service system using a wireless microphone according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the structure of a speaker recognition apparatus in an intelligent robot service system according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating a method for speaker recognition in an intelligent robot service system according to an embodiment of the present invention.

The present invention relates to an intelligent robot service system using a wireless microphone, and more particularly to a method and apparatus by which an intelligent robot fitted with a wireless microphone receiver receives the voice of a speaker carrying a wireless microphone transmitter and performs speaker recognition in order to provide customized services to the user.

Recently, intelligent robot technology has been developed in various fields as an intelligent service technology for providing customized services to users. Hearing, one of the most important senses of such an intelligent robot, has the advantage that it can detect events that are out of sight or far away. The robot's hearing can therefore be applied to speaker recognition, a technology expected to play an important role in providing a variety of robot services.

If speaker recognition processes the sound arriving at the microphone at every moment, it places a heavy load on the robot, so the system must determine whether the incoming sound is valid and perform recognition only when it is. A voice activity detection (hereinafter, VAD) function is therefore essential. In addition, speaker recognition and speech recognition must be robust to the noise present in various environments.

Existing speaker recognition methods have mainly performed VAD using a wired microphone. In a robot environment, however, an ordinary home contains many kinds of noise, so if VAD runs continuously on a wired microphone it reacts even to noises such as an object falling and produces recognition results at times the user does not want.

Moreover, recognition must run at all times, even when the user does not want it, and the user is inconvenienced by having to cooperate by adjusting voice volume or speaking position according to the distance between user and robot, the speaking posture, and the position of the microphone.

Furthermore, conventional speaker recognition has been treated mainly from a security standpoint, so a constrained environment and cooperative users could be assumed. In a robot environment, however, speaker recognition must cope with diverse noise and uncooperative users, so an intelligent service robot needs a speaker recognition system that is robust to noise and requires minimal user cooperation.

Accordingly, an object of the present invention is to provide a method and apparatus that, in an intelligent robot service system, use a wireless microphone attached to a mobile robot to perform speaker recognition (speech recognition) that is robust to ambient noise, acquiring valid speech at the time the user wants.

To achieve this object, a speaker recognition method using a wireless microphone in an intelligent robot service system of a service robot environment comprises: registering each speaker using the wireless microphone; receiving valid voice data through a wireless microphone receiver from a speaker, among the registered speakers, who has spoken using a wireless microphone transmitter; extracting a feature from the received valid voice data; generating at least one speaker model using the extracted feature; and recognizing the speaker by measuring the similarity between the extracted feature and the generated speaker model.

To achieve this object, an apparatus for recognizing a speaker using a wireless microphone in an intelligent robot service system of a service robot environment comprises: a wireless microphone receiver that receives valid voice data, sent through a wireless microphone transmitter, from a speaker who has spoken; a feature extraction unit that extracts a feature from the received valid voice data; a speaker model generation unit that generates at least one speaker model using the extracted feature; and a speaker recognition unit that recognizes the speaker by measuring the similarity between the extracted feature and the generated speaker model.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings. Note that the same components are given the same reference numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention.

The intelligent robot service system according to an embodiment of the present invention employs an intelligent robot system that uses a microphone. The structure of this intelligent robot service system is described with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the structure of an intelligent robot service system using a wireless microphone according to an embodiment of the present invention, and FIG. 2 is a block diagram showing the structure of a speaker recognition apparatus in the intelligent robot service system according to an embodiment of the present invention.

Referring to FIG. 1, the intelligent robot service system consists of a wireless microphone transmitter 120 on the transmitting side and an intelligent robot 110 fitted with a wireless microphone receiver 111 on the receiving side.

The intelligent robot 110 is an apparatus that performs speaker recognition (hereinafter, the speaker recognition apparatus): when the speaker on the transmitting side sends voice data using his or her wireless microphone transmitter 120, the attached wireless microphone receiver 111 receives the voice data and speaker recognition is performed on it. The detailed structure of the speaker recognition apparatus is described with reference to the accompanying drawings.

The wireless microphone transmitter 120 is switched ON by the speaker (user) only when speaker recognition is desired; it then picks up the valid speech and transmits the valid speech data to the wireless microphone receiver 111.

Referring to FIG. 2, the speaker recognition apparatus includes a voice input unit 112, a feature extraction unit 113, a speaker model generation unit 114, a speaker recognition unit 115, and a storage unit 116, with the microphone receiver 111 attached externally. The speaker recognition apparatus is driven when the microphone receiver 111 receives voice data from the microphone transmitter 120. As a result, the speaker recognition apparatus runs only when the speaker (user) wants it to.

The voice input unit 112 takes the voice data (input data) received by the wireless microphone receiver 111 and passes it to the feature extraction unit 113.

The feature extraction unit 113 divides the voice data passed from the voice input unit 112 into frames and extracts features by computing the mel cepstrum coefficients of each frame.
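The frame-splitting step can be sketched in Python as follows. This is only an illustration: the patent does not give a frame length or overlap, so `frame_len` and `hop_len` are assumed values (25 ms and 10 ms at an assumed 16 kHz sampling rate), the function name is hypothetical, and the mel cepstrum computation that would follow on each frame is not shown.

```python
def split_into_frames(samples, frame_len=400, hop_len=160):
    """Split a sequence of samples into overlapping fixed-length frames.

    frame_len and hop_len are illustrative defaults, not values from the
    patent. Trailing samples that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop_len <= 0:
        raise ValueError("frame_len and hop_len must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop_len
    return frames
```

Each returned frame would then be passed to a mel cepstrum routine to obtain one feature vector per frame.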

The speaker model generation unit 114 collects, for each speaker, the mel cepstrum coefficients obtained by the feature extraction unit 113 and, using these coefficients as the extracted features, generates a Gaussian mixture model (speaker model), thereby building a speaker recognizer. Here, the cepstrum is a technique that compensates for the fact that DFT or FFT results are not clear-cut with respect to magnitude. The cepstrum is therefore widely used for analyzing fine signals, as in earthquake analysis, vibration analysis, and speech recognition.

Using the speaker recognizer built by the speaker model generation unit 114, the speaker recognition unit 115 recognizes the speaker by measuring the distance between the extracted features and each speaker model: when one of the registered speakers speaks, it finds the speaker model with the maximum a posteriori probability.

The storage unit 116 stores speaker registration information, received voice data, information on the generated speaker models, and the results of speaker recognition.

A method for speaker recognition in a speaker recognition apparatus having this structure is now described.

First, the speaker recognition apparatus performs online registration of each speaker and stores the registered information in advance. Then, when any of the registered speakers transmits voice data through the wireless microphone transmitter, the speaker recognition apparatus receives the voice data through the wireless microphone receiver 111, feeds it as input data through the voice input unit 112 to the internal units, and performs speaker recognition with those units. This process is described in more detail with reference to the accompanying drawings.

FIG. 3 is a flowchart illustrating a method for speaker recognition in an intelligent robot service system according to an embodiment of the present invention.

Referring to FIG. 3, in step 201 the speaker recognition apparatus 110 checks whether voice data has been received. If voice data has been received, in step 202 the speaker recognition apparatus 110, through the feature extraction unit 113, divides the delivered voice data into frames and extracts features by computing the mel cepstrum coefficients of each frame.

Next, in step 203 the speaker recognition apparatus 110 collects the obtained mel cepstrum coefficients, that is, the features, for each speaker, and in step 204 it generates a speaker model for each speaker using a Gaussian mixture model, thereby building a speaker recognizer. Building this speaker recognizer requires the mixture density of a speaker for a D-dimensional feature vector, which can be expressed as Equation 1 below.

In Equation 1, w_i denotes the mixture weight and b_i denotes the probability obtained through the Gaussian mixture model, which can be expressed as Equation 2 below.

The density in Equation 1 is a weighted linear combination of M Gaussian component densities, each parameterized by a mean vector and a covariance matrix.
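The images for Equations 1 and 2 are not reproduced in this text. Going by the definitions above (a weighted sum of M Gaussian densities over a D-dimensional feature vector), they presumably take the standard Gaussian mixture form shown here. The symbols \mu_i and \Sigma_i for the mean vector and covariance matrix of component i, and the constraint that the weights sum to one, are assumed from the usual formulation and are not quoted from the source.

```latex
% Equation 1 (assumed standard form): mixture density of a speaker model \lambda
p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, b_i(\mathbf{x}),
\qquad \sum_{i=1}^{M} w_i = 1

% Equation 2 (assumed standard form): D-dimensional Gaussian component density
b_i(\mathbf{x}) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}}
\exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \mu_i)^{\top} \Sigma_i^{-1} (\mathbf{x} - \mu_i) \right)
```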

Then, in step 204, the speaker recognition apparatus 110 generates the speaker models and builds the speaker recognizer. A speaker model can be generated by estimating the parameters of the Gaussian mixture model from speech given by a speaker. A well-known way to do this is maximum likelihood estimation. Parameter estimation for the Gaussian mixture model by maximum likelihood estimation is described below.

For one utterance consisting of T frames, the likelihood of the Gaussian mixture model can be expressed as Equation 3 below.
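The image for Equation 3 is not reproduced in this text. Assuming the T frames are treated as independent, the utterance likelihood is the product of the per-frame mixture densities, p(X|λ) = ∏ p(x_t|λ) for t = 1, ..., T, which in practice is evaluated as a sum of logarithms. The Python sketch below illustrates that computation. It additionally assumes diagonal covariance matrices and strictly positive weights, neither of which the patent specifies, and all function names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of a D-dimensional Gaussian density with diagonal covariance."""
    total = 0.0
    for xd, md, vd in zip(x, mean, var):
        total += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return total

def log_mixture_density(x, weights, means, variances):
    """log p(x | lambda) = log sum_i w_i * b_i(x), computed via log-sum-exp."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def utterance_log_likelihood(frames, weights, means, variances):
    """Sum of per-frame log densities over the T frames of one utterance."""
    return sum(log_mixture_density(x, weights, means, variances) for x in frames)
```

Working in the log domain avoids the numerical underflow that a direct product over many frames would cause.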

In Equation 3, the parameters of the speaker model consist of the weights, means, and covariances, λ = {w_i, μ_i, Σ_i}, i = 1, 2, ..., M. The maximum likelihood parameter estimates can be obtained with the well-known Expectation-Maximization (EM) algorithm.
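The EM iteration can be illustrated with a deliberately small Python sketch. To stay short it fits a one-dimensional mixture rather than D-dimensional cepstral vectors, takes the initial means from the caller, and applies a variance floor; these simplifications, the function name, and the default values are assumptions for illustration and do not come from the patent.

```python
import math

def em_fit_gmm_1d(data, means, n_iter=50, var_floor=1e-6):
    """Fit a 1-D Gaussian mixture by EM, starting from the given means."""
    m = len(means)
    n = len(data)
    weights = [1.0 / m] * m
    means = list(means)
    overall_mean = sum(data) / n
    overall_var = max(sum((x - overall_mean) ** 2 for x in data) / n, var_floor)
    variances = [overall_var] * m
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            dens = [w * math.exp(-0.5 * (x - mu) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
                    for w, mu, v in zip(weights, means, variances)]
            total = sum(dens)
            if total == 0.0:
                # Sample is far from every component: share it equally.
                resp.append([1.0 / m] * m)
            else:
                resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for i in range(m):
            n_i = sum(r[i] for r in resp)
            if n_i == 0.0:
                continue  # leave an empty component unchanged
            weights[i] = n_i / n
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var_i = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var_i, var_floor)
    return weights, means, variances
```

For example, `em_fit_gmm_1d([0.0, 0.1, -0.1, 5.0, 5.1, 4.9], [0.0, 5.0])` separates the two clusters, with the estimated means settling near 0 and 5.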

The maximum likelihood parameters estimated in this way are used to generate a GMM (Gaussian Mixture Model) based speaker model for each speaker, and a speaker recognizer is built from the generated speaker models. The speaker recognition unit 115 then uses this recognizer. The estimated maximum likelihood parameters and information on the generated speaker models are stored and managed in the storage unit 116.

Then, when any of the registered speakers speaks, the speaker recognition apparatus 110, through the speaker recognition unit 115, recognizes the speaker by finding the speaker model with the maximum a posteriori probability (hereinafter, MAP). Here, the MAP method finds the desired signal by maximizing the posterior probability and thus the degree of signal similarity; it enlarges the LLR (Log Likelihood Ratio) and restores the value to 1 if the LLR is greater than 0 and to 0 if it is less than 0.

In other words, in step 205 the speaker recognition apparatus 110 measures the similarity between the features extracted in step 203 and each generated speaker model. The similarity is obtained by measuring the distance between the extracted features and each speaker. Then, in step 206, the speaker recognition apparatus 110 checks, from the similarity measured between the extracted features and the current speaker model, whether that model is the one with the MAP; if it is not, the apparatus returns to step 205 to measure the similarity with the next speaker model. If it is, in step 207 the speaker recognition apparatus 110 recognizes the speaker identified by the MAP speaker model. Speaker recognition by finding the speaker model with the MAP can be expressed as Equation 4 below.

Equation 4 finds, among S speakers (S = {1, 2, 3, ..., S}), the speaker whose model maximizes the posterior probability, where λ is the speaker model, x is the input speech, and P is the probability.
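The image for Equation 4 is not reproduced in this text. From the description, it presumably selects the speaker index k that maximizes P(λ_k | x) over the S registered speakers. By Bayes' rule this is equivalent to maximizing log p(x | λ_k) + log P(λ_k), since p(x) is the same for every speaker. The Python sketch below illustrates that selection. The function name, the dictionary inputs, and the default of equal priors are assumptions made for illustration.

```python
import math

def identify_speaker(log_likelihoods, priors=None):
    """Return the speaker whose model has the highest posterior score.

    log_likelihoods maps speaker id -> log p(X | lambda_k).
    priors maps speaker id -> P(lambda_k); if omitted, all enrolled speakers
    are treated as equally likely, so the ranking follows the likelihoods.
    """
    if not log_likelihoods:
        raise ValueError("no enrolled speaker models")
    best_id, best_score = None, -math.inf
    for speaker_id, log_lik in log_likelihoods.items():
        log_prior = math.log(priors[speaker_id]) if priors else 0.0
        score = log_lik + log_prior  # log posterior up to the shared term log p(X)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id
```

The loop mirrors steps 205 to 207: each speaker model is scored in turn and the one with the highest score is reported as the recognized speaker.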

The speaker registration and recognition apparatus for speaker recognition according to the embodiment described above should run only when the user wants it to.

Therefore, when speaker recognition is not wanted, the speaker (user) switches his or her wireless microphone transmitter OFF, completely blocking the input of valid speech. When speaker recognition is wanted, the speaker switches the wireless microphone transmitter ON so that valid speech is picked up and transmitted wirelessly, via the wireless microphone receiver, to the speaker recognition apparatus (intelligent robot). The speaker recognition apparatus is thus driven according to whether valid speech is being input, and by performing text-independent speaker recognition in the robot environment it can provide customized services to the user.
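The ON/OFF behavior that stands in for VAD can be sketched as a small gate in front of the recognizer. The Python class below is a hypothetical illustration: the class name, the method names, and the idea of buffering audio until the transmitter is switched OFF are assumptions, and the patent does not describe the software interface between the receiver and the recognizer.

```python
class GatedRecognizer:
    """Runs a recognizer only on audio that arrives while the transmitter is ON."""

    def __init__(self, recognize):
        self._recognize = recognize  # callable: list of samples -> result
        self._buffer = []
        self._active = False

    def on_transmitter_on(self):
        """User switched the microphone ON: start collecting an utterance."""
        self._buffer = []
        self._active = True

    def on_audio(self, samples):
        """Audio from the receiver; ignored unless the transmitter is ON."""
        if self._active:
            self._buffer.extend(samples)

    def on_transmitter_off(self):
        """User switched the microphone OFF: recognize what was collected."""
        self._active = False
        if not self._buffer:
            return None
        utterance, self._buffer = self._buffer, []
        return self._recognize(utterance)
```

Because the user decides when an utterance starts and ends, no energy-based or model-based voice activity detector is needed in this arrangement.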

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. The scope of the present invention should therefore not be limited to the described embodiments, but should be defined by the claims below and their equivalents.

As described above, the present invention uses a wireless microphone in an intelligent robot service system, so that a speaker (user) carrying the wireless microphone transmitter can easily replace the VAD function simply by switching the microphone ON or OFF. The speaker recognition system can therefore be operated only when the user wants, and because the approach is robust to noise and minimizes the need for user cooperation, the user can speak freely without having to consider the distance from the robot or the speaking posture.

Claims

A speaker recognition method using a wireless microphone in an intelligent robot service system of a service robot environment, the method comprising:

registering each speaker using the wireless microphone;

receiving valid voice data through a wireless microphone receiver from a speaker, among the registered speakers, who has spoken using a wireless microphone transmitter;

extracting a feature from the received valid voice data;

generating at least one speaker model using the extracted feature; and

recognizing the speaker by measuring a similarity between the extracted feature and the generated speaker model.

The method of claim 1, further comprising:

turning on the wireless microphone transmitter, by the speaker, only at a time when speaker recognition is desired;

turning off the wireless microphone transmitter at a time when speaker recognition is not desired, thereby cutting off voice input from the speaker; and

driving the constructed speaker recognizer according to the ON/OFF state of the wireless microphone transmitter.

The method of claim 1, wherein the extracting of the feature comprises:

dividing the received valid voice data into frames;

obtaining a mel cepstrum coefficient corresponding to each divided frame; and

using the obtained mel cepstrum coefficients as the extracted feature.

The method of claim 1, wherein the generating of the at least one speaker model comprises:

Collecting the extracted features for each speaker;

Generating a speaker model for each speaker through a Gaussian mixture model;

And building a speaker recognizer using the generated speaker model.

The method of claim 4, wherein the speaker recognizer is constructed through the mixture density of the speaker as shown in Equation 5 below, where w_i denotes a mixture weight and b_i denotes a probability obtained through the Gaussian mixture model.

The method of claim 1, wherein the recognizing of the speaker comprises:

measuring the similarity according to the distance between the extracted feature and the generated speaker model using a maximum likelihood estimation method;

finding a speaker model having a maximum posterior probability according to the measurement result; and

recognizing, as the speaker, the speaker of the speaker model having the maximum posterior probability.

The method of claim 6, wherein the speaker model having the maximum posterior probability is found by Equation 6 below, where λ is the speaker model, x is the input voice, and P is the probability.

An apparatus for recognizing a speaker using a wireless microphone in an intelligent robot service system of a service robot environment, the apparatus comprising:

a wireless microphone receiver for receiving valid voice data, sent through a wireless microphone transmitter, from a speaker who has spoken;

a feature extraction unit for extracting a feature from the received valid voice data;

a speaker model generation unit for generating at least one speaker model using the extracted feature; and

a speaker recognition unit for recognizing the speaker by measuring the similarity between the extracted feature and the generated speaker model.

The apparatus of claim 8, further comprising:

a voice input unit for taking the valid voice data received by the wireless microphone receiver and passing it to the feature extraction unit; and

a storage unit for storing information related to the speaker recognition.

The apparatus of claim 8, wherein the wireless microphone transmitter is located on the speaker's side, is turned on only at a time when speaker recognition is desired so as to receive the valid voice of the speaker, and is turned off at a time when speaker recognition is not desired so that voice input from the speaker is blocked.

The apparatus of claim 8, wherein the feature extraction unit divides the received valid voice data into frames and extracts the feature by obtaining a mel cepstrum coefficient corresponding to each divided frame.

The apparatus of claim 8, wherein the speaker model generation unit collects the extracted features for each speaker, generates a speaker model for each speaker using a Gaussian mixture model, and builds a speaker recognizer using the generated speaker models.

The apparatus of claim 8, wherein the constructed speaker recognizer is included in the speaker recognition unit and is driven according to the ON/OFF state of the wireless microphone transmitter.

The apparatus of claim 13, wherein the speaker recognizer is constructed through the mixture density of the speaker as shown in Equation 7 below, where w_i denotes a mixture weight and b_i denotes a probability obtained through the Gaussian mixture model.

The apparatus of claim 8, wherein the speaker recognition unit measures the similarity according to the distance between the extracted feature and the generated speaker model using a maximum likelihood estimation method, finds a speaker model having a maximum posterior probability according to the measurement result, and recognizes, as the speaker, the speaker of the speaker model having the maximum posterior probability.

The apparatus of claim 15, wherein the speaker model having the maximum posterior probability is found by Equation 8 below, where λ is the speaker model, x is the input voice, and P is the probability.