KR101253610B1

KR101253610B1 - Apparatus for localization using user speech and method thereof

Info

Publication number: KR101253610B1
Application number: KR1020090091867A
Authority: KR
Inventors: 이성주; 이윤근; 강병옥; 강점자; 박기영; 박전규; 왕지현; 전형배; 정의석; 정호영; 정훈; 김종진; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2009-09-28
Filing date: 2009-09-28
Publication date: 2013-04-11
Also published as: KR20110034360A

Abstract

The present invention relates to an apparatus for localization using a user's speech and a method thereof. The apparatus includes: a stereo Wiener filter unit that removes diffuse noise from each sound source signal separated by a sound source separation unit, which separates an input two-channel sound source signal into individual sound sources, and that filters the signals to emphasize residual signal components for sound source localization; a speech recognition unit that recognizes the user's speech and measures the confidence of the speech recognition result; a channel selection unit that selects a target channel based on the speech recognition result and its confidence; and a sound source localization unit that tracks the sound source position by analyzing the signals of the target channel and the interference channel. By organically integrating blind source separation, stereo Wiener filtering, speech recognition and utterance verification, and sound source localization, the invention enables user speech localization that is more accurate and more robust to the surrounding environment.

Description

Apparatus for localization using user speech and method thereof

The present invention relates to an apparatus for localization using user speech and a method thereof, and more particularly to an apparatus and method that overcome the degradation of speech recognition and speaker localization performance caused by additive noise and reverberation in the home, thereby enabling more accurate sound source localization using the user's speech.

The present invention is derived from research conducted as part of the IT Growth Engine Technology Development Program of the Ministry of Knowledge Economy [Project management number: 2006-S-036-04, Project title: Development of large-capacity interactive distributed-processing speech interface technology for new growth engine industries].

Because various kinds of additive noise sources exist in the domestic environment in which we live our daily lives, the speech signal of a speaker in the home is often corrupted by these noise sources and loses its original characteristics.

Additive noise sources commonly found in the home include multimedia electronic devices such as televisions, audio devices such as radios, game consoles, and domestic appliances such as vacuum cleaners, plumbing fixtures, refrigerators, and air conditioners.

Tracking the position of a speaker from a speech signal corrupted by such additive noise sources is very difficult with current digital signal processing technology.

In addition, when speech is uttered indoors in a home, reverberation components arise in addition to the original speech signal as the speech is reflected off walls, furniture, and the like.

The characteristics of the reverberation depend on the size of the room, the wall materials, the arrangement and materials of the furniture, the position of the speaker, and the temperature and humidity, so they do not remain constant even within the same room.

Because reverberation creates virtual speech signal components in addition to the speaker's original speech signal, it is known, together with additive noise, as a major cause of performance degradation in automatic speech recognition systems and speaker localization systems in the home.

Removing reverberation components from a speech signal contaminated by additive noise is difficult even with current technology, so the performance of conventional sound source localization algorithms such as the Generalized Cross-Correlation (GCC) method and the Phase Transform (PHAT) method degrades significantly under these conditions.

Recently, sound source separation algorithms have been applied to make user sound source localization robust to additive noise and reverberation. In this approach, the surrounding environment is linearly approximated by a signal mixing filter in order to separate the sound sources.

Each sound source is then separated using the inverse of the mixing filter thus obtained, the main speaker's speech is detected, and the output channel containing the main speaker's speech signal is selected.

Next, the selected target channel signal is restored to a multi-channel input signal using the signal mixing filter that was used for source separation. Finally, a conventional sound source localization algorithm such as the GCC or PHAT method is applied to find the position of the target speaker.

With such an algorithm, finding the target speaker's output channel and restoring the multi-channel signal by applying the signal mixing filter to the target speaker's speech signal require a large amount of computation.
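The GCC and PHAT methods named above as conventional localization algorithms are commonly combined into GCC-PHAT time-delay estimation between two microphone signals. The following is a minimal sketch of that textbook technique, not the implementation of this patent; the function names `dft` and `gcc_phat_delay` and the parameter `max_lag` are illustrative assumptions, and a naive DFT is used only to keep the example free of external libraries.

```python
import cmath


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform, adequate for short frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat_delay(sig, ref, max_lag):
    """Return the lag (in samples) by which `sig` trails `ref`."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts like linear
    spec_sig = dft(list(sig) + [0.0] * (n - len(sig)))
    spec_ref = dft(list(ref) + [0.0] * (n - len(ref)))
    whitened = []
    for a, b in zip(spec_sig, spec_ref):
        cross = a * b.conjugate()
        mag = abs(cross)
        # PHAT weighting: keep only the phase of the cross-spectrum.
        whitened.append(cross / mag if mag > 1e-12 else 0j)
    cc = [v.real for v in dft(whitened, inverse=True)]
    # Negative lags live at the end of the circular correlation buffer.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n])
```

A positive return value means the first argument arrives later than the second; converting the lag to an angle additionally requires the sampling rate and the microphone spacing, which are not specified in this section.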

To solve the above problems, an object of the present invention is to provide an apparatus for localization using a user's speech, and a method thereof, that organically integrate a two-channel microphone, blind source separation (BSS), automatic speech recognition (ASR) based on hidden Markov models (HMMs), utterance verification (UV), and sound source localization (SSL) so as to overcome the degradation of speech recognition performance caused by additive noise and reverberation in the home.

Another object of the present invention is to provide an apparatus for localization using a user's speech, and a method thereof, that resolve the degradation of speech recognition performance caused by additive noise and reverberation and enable more accurate sound source localization using the user's speech.

To achieve the above objects, an apparatus for localization using a user's speech according to the present invention includes: a sound source separation unit that separates an input two-channel signal into individual sound sources; a stereo Wiener filter unit that removes diffuse noise from each sound source signal separated by the sound source separation unit and filters it to emphasize residual signal components; a speech recognition unit that recognizes the user's speech from each sound source signal and measures the confidence of the speech recognition result; a channel selection unit that selects a target channel based on the speech recognition result from the speech recognition unit and its confidence; and a sound source localization unit that tracks the sound source position by analyzing the signal of the target channel selected by the channel selection unit and the signal of the interference channel.

The sound source separation unit separates the two-channel signal into individual sound sources using a blind source separation technique.

The stereo Wiener filter unit filters each sound source signal using a stereo Wiener filter technique.

The stereo Wiener filter unit includes: a voice activity detection unit that detects voice activity in the input sound source signal based on frame energy; a Wiener filter coefficient estimation unit that estimates Wiener filter coefficients based on the signal from the voice activity detection unit; a Wiener filter unit that performs stereo Wiener filtering on the input channel signals using the estimated Wiener filter coefficients; and a signal reconstruction unit that reconstructs the filtered signal of each channel.
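As an illustration of frame-energy-based voice activity detection, the following minimal sketch flags frames whose mean energy exceeds a fixed threshold. The frame length, hop size, fixed threshold, and the function names are assumptions made for the example; the text here does not specify how the threshold is set.

```python
def frame_energies(signal, frame_len, hop):
    """Mean squared amplitude of each analysis frame."""
    return [
        sum(x * x for x in signal[start:start + frame_len]) / frame_len
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


def detect_voice_activity(signal, frame_len, hop, threshold):
    """Flag frames whose energy exceeds a fixed threshold as speech."""
    return [energy > threshold for energy in frame_energies(signal, frame_len, hop)]
```

In a Wiener filtering pipeline, frames flagged as non-speech are typically the ones used to update the noise estimate.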

The apparatus further includes a speech endpoint detection unit that detects the endpoint of the speech signal in the signal of each channel.

The speech recognition unit includes: a feature extraction unit that extracts speech features for speech recognition from the sound source signal of each channel; a speech recognition decoder that recognizes the user's speech in the signal of each channel based on the speech features extracted by the feature extraction unit; and an utterance verification unit that verifies the speech recognition result from the speech recognition decoder by measuring its recognition confidence.

When the user's speech is detected in only one of the channels' sound source signals and the confidence of the speech recognition result is higher than a threshold, the channel selection unit selects that channel as the target channel.

When the user's speech is detected in the sound source signals of both channels, the channel selection unit selects as the target channel the channel whose speech recognition confidence is higher than the threshold and is the higher of the two.

The apparatus further includes a voiced frame detection unit that detects voiced frames in the signal of the target channel selected by the channel selection unit.

The voiced frame detection unit detects voiced frames using at least one of the following voicing features: a modified time-frequency feature, the high-to-low frequency band energy ratio, the zero-crossing rate, the level-crossing rate, the normalized autocorrelation maximum, the voicing probability, the peak-to-valley ratio of the autocorrelation function, and the minimum of the average magnitude difference function (AMDF).
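Two of the voicing features listed above, the zero-crossing rate and the normalized autocorrelation maximum, can be sketched with their common textbook definitions as follows. The function names and the lag search range (`min_lag`, `max_lag`) are assumptions for illustration; the exact definitions used in the patent are not given in this section.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Largest autocorrelation value in the lag range, divided by frame energy."""
    energy = sum(x * x for x in frame)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        best = max(best, r / energy)
    return best
```

Voiced speech is roughly periodic, so it tends to show a low zero-crossing rate and a strong autocorrelation peak at the pitch period, whereas unvoiced or noisy frames tend to show the opposite.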

The voiced frame detection unit extracts a frame from the input target channel signal and estimates its energy, computes the energy in a specific frequency band based on a power estimate for the extracted frame, and obtains the modified time-frequency feature from the estimated energy and the computed band energy.

The voiced frame detection unit computes a voicing feature ratio by comparing the voicing features with thresholds, and classifies the frame as a voiced frame if the computed voicing feature ratio is greater than a predefined threshold.
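One possible reading of this decision rule is a vote across features: each voicing feature is compared with its own threshold, and the voicing feature ratio is the fraction of features that indicate voicing. The sketch below follows that interpretation, which is an assumption rather than a definition stated here; the feature names, the direction flags in `voiced_if_above`, and all threshold values are hypothetical.

```python
def voicing_feature_ratio(features, thresholds, voiced_if_above):
    """Fraction of features whose threshold comparison indicates voicing."""
    passed = 0
    for name, value in features.items():
        if voiced_if_above[name]:
            passed += value > thresholds[name]
        else:
            passed += value < thresholds[name]
    return passed / len(features)


def is_voiced_frame(features, thresholds, voiced_if_above, ratio_threshold):
    """Classify the frame as voiced when the ratio exceeds the given threshold."""
    ratio = voicing_feature_ratio(features, thresholds, voiced_if_above)
    return ratio > ratio_threshold
```

For instance, a low zero-crossing rate and a high autocorrelation peak would each count as one vote for voicing, so the direction of each comparison has to be supplied per feature.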

The apparatus further includes a band-pass filter unit that filters the signal of the target channel selected by the channel selection unit and the signal of the interference channel not selected as the target channel so as to emphasize the speech frequency band.

Meanwhile, a method for localization using a user's speech to achieve the above objects includes: separating an input two-channel signal into individual sound sources; filtering each sound source signal separated in the separating step; detecting the endpoint of speech in each sound source signal and recognizing speech using the endpoint-detected signal; selecting a target channel based on the speech recognition result of the recognizing step and its confidence; and tracking the sound source position by analyzing the voiced frames detected in the target channel signal and the speech frequency band of the target channel and the interference channel.

The filtering step further includes: detecting voice activity in each input sound source signal based on frame energy; estimating Wiener filter coefficients based on the voice activity detection result and a power spectral density (PSD) estimate; performing stereo Wiener filtering on each input sound source signal using the estimated Wiener filter coefficients; and reconstructing each stereo-Wiener-filtered sound source signal.

The method further includes detecting the endpoint of the speech signal in the sound source signal of each channel.

The speech recognizing step includes: extracting speech features for speech recognition from the sound source signal of each channel; recognizing the user's speech in the sound source signal of each channel based on the extracted speech features; and verifying the speech recognition result by measuring its recognition confidence.

In the target channel selecting step, when the user's speech is detected in only one of the channels' sound source signals and the confidence of the speech recognition result is higher than a threshold, that channel is selected as the target channel; when the user's speech is detected in the sound source signals of both channels, the channel whose speech recognition confidence is higher than the threshold and is the higher of the two is selected as the target channel.

The method further includes detecting voiced frames in the signal of the target channel selected in the target channel selecting step, wherein the voiced frame detecting step computes a voicing feature ratio by comparing the voicing features with thresholds and classifies the frame as a voiced frame if the computed voicing feature ratio is greater than a predefined threshold.

The method further includes filtering the signal of the target channel selected in the target channel selecting step and the signal of the interference channel not selected as the target channel so as to emphasize the speech frequency band.

According to the present invention, blind source separation, stereo Wiener filtering, speech recognition and utterance verification, and a sound source localization technique that emphasizes human speech components are organically integrated. This resolves the degradation of speech recognition and of speaker localization algorithms caused by interference noise, diffuse noise, and reverberation, and thereby enables user speech localization that is more accurate and more robust to the surrounding environment.

Hereinafter, specific embodiments of the present invention will be described with reference to the accompanying drawings.

The apparatus for localization using a user's speech and the method thereof according to the present invention organically integrate a two-channel microphone, blind source separation (BSS), automatic speech recognition (ASR) based on hidden Markov models (HMMs), utterance verification (UV), and sound source localization (SSL) in order to overcome the degradation of speaker localization performance caused by additive noise and reverberation in the home.

First, blind source separation (BSS) is a multi-channel input, multi-channel output (MIMO) technique for sound source separation and sound quality enhancement. It is used to separate the user's speech signal from additive noise and thereby improve the quality of the user's speech signal.

Various blind source separation algorithms have been developed, mainly as solutions to the cocktail party problem.

Here, the cocktail party problem refers to the task of focusing on the voice of a particular speaker when several speakers are talking at the same time, as at a cocktail party.

The principle of blind source separation is to estimate environmental parameters from the multi-channel input signals and to filter the inputs with the corresponding inverse filter, thereby restoring the original signal of each sound source.
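The inverse-filtering step alone can be illustrated for the simplest case of an instantaneous (non-convolutive) 2x2 mixture. The sketch below assumes the mixing matrix is already known, whereas an actual blind method has to estimate it from the observed signals, and the real acoustic environment is convolutive; the function names are illustrative.

```python
def invert_2x2(m):
    """Inverse of a 2x2 mixing matrix; this plays the role of the inverse filter."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_2x2(m, ch1, ch2):
    """Apply a 2x2 matrix sample by sample to a pair of channels."""
    out1 = [m[0][0] * x + m[0][1] * y for x, y in zip(ch1, ch2)]
    out2 = [m[1][0] * x + m[1][1] * y for x, y in zip(ch1, ch2)]
    return out1, out2
```

Mixing two sources with `apply_2x2(A, s1, s2)` and then applying `invert_2x2(A)` to the result recovers the sources up to rounding error. With an estimated rather than exact inverse, some of each source leaks into the other output.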

In addition, each output channel still contains signal components of the interfering sound sources, referred to as the residual signal.

Accordingly, the present invention tracks the user's position using the user speech signal obtained by real-time blind source separation (RT BSS) and the residual signal components of that user speech signal.

The blind source separation applied in the apparatus for localization using a user's speech according to the present invention uses a two-channel time-domain independent component analysis (TD-ICA) algorithm, and a stereo Wiener filter is applied to further improve the quality of the separated user speech signal and to emphasize the user speech components remaining in the interference channel.

By using the blind source separation technique described above as a front-end processor, the apparatus for localization using a user's speech according to the present invention can resolve the degradation of speaker localization algorithms caused by additive noise and reverberation.

The apparatus for localization using a user's speech according to the present invention is described in detail in the following embodiments with reference to the drawings.

FIG. 1 is a block diagram referred to in describing the configuration of the apparatus for localization using a user's speech according to the present invention.

As shown in FIG. 1, the apparatus for localization using a user's speech according to the present invention includes a two-channel microphone, a sound source separation unit (10), stereo Wiener filter units (20, 30), endpoint detection units (40, 50), speech recognition units (60, 70), a channel selection unit (80), a channel buffering unit (90), a voiced frame detection unit (100), band-pass filter units (110, 120), and a sound source localization unit (130).

First, the sound source separation unit (10) separates the two-channel signal input from the two-channel microphone, in which several sound sources are mixed, into individual sound sources. The sound source separation unit (10) performs this separation using blind source separation based on a time-domain independent component analysis (TD-ICA) algorithm. The sound sources separated by the sound source separation unit (10) are passed to the stereo Wiener filter units (20, 30).

The stereo Wiener filter units (20, 30) filter each sound source signal separated by the sound source separation unit (10). The stereo Wiener filter units (20, 30) comprise a first filter unit (20) and a second filter unit (30).

The stereo Wiener filter units (20, 30) filter the input sound source signals using a stereo Wiener filter. In doing so, they remove the diffuse noise that remains after source separation and emphasize the residual signal components that remain in the other channel.

For example, the first filter unit (20) estimates Wiener filter coefficients using channel 1 as the reference signal and applies them identically to the channel 1 and channel 2 outputs of the sound source separation unit (10), thereby emphasizing the residual signal component of channel 1 that remains in channel 2.

Conversely, the second filter unit (30) estimates Wiener filter coefficients using channel 2 as the reference signal and applies them identically to the channel 1 and channel 2 outputs of the sound source separation unit (10), thereby emphasizing the residual signal component of channel 2 that remains in channel 1.
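The idea of estimating the filter from one reference channel and applying the same coefficients to both channels can be sketched with a generic frequency-domain Wiener gain. The gain formula used here, 1 - P_noise / P_ref with a lower floor, is the common textbook form and is an assumption; the estimator actually used by the filter units is not given in this passage, and the function names are illustrative.

```python
def wiener_gains(reference_psd, noise_psd, floor=0.0):
    """Per-bin Wiener gain estimated from the reference channel only."""
    gains = []
    for p_ref, p_noise in zip(reference_psd, noise_psd):
        gain = 1.0 - p_noise / p_ref if p_ref > 0 else 0.0
        gains.append(max(gain, floor))
    return gains


def apply_gains(spectrum, gains):
    """Scale one channel's spectrum (real or complex bins) by the shared gains."""
    return [g * s for g, s in zip(gains, spectrum)]
```

Calling `apply_gains` with the same `gains` on the spectra of both channels keeps the bins where the reference source is strong and attenuates the rest, which is one way to understand how the residual of the reference source in the other channel becomes emphasized.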

By removing from each separated sound source signal the diffuse noise that degrades automatic speech recognition, and by emphasizing the residual signal components that remain in each sound source signal, the stereo Wiener filter units (20, 30) allow the speaker's speech to be recognized more accurately. The detailed operation of the stereo Wiener filter units (20, 30) is described with reference to FIG. 2.

The endpoint detection units (40, 50) comprise a first endpoint detection unit (40), which detects the endpoint in the channel 1 signal and buffers it, and a second endpoint detection unit (50), which detects the endpoint in the channel 2 signal and buffers it.

The first endpoint detection unit (40) and the second endpoint detection unit (50) each detect the endpoint of speech using the channel signals input from the stereo Wiener filter units (20, 30), and buffer the endpoint-detected channel signal together with the corresponding signal of the other channel.

For example, the first endpoint detection unit (40) detects the endpoint of speech using the channel 1 signal and buffers the endpoint-detected channel 1 signal together with the corresponding channel 2 signal.

Likewise, the second endpoint detection unit (50) detects the endpoint of speech using the channel 2 signal and buffers the endpoint-detected channel 2 signal together with the corresponding channel 1 signal. The signals buffered by the first endpoint detection unit (40) and the second endpoint detection unit (50) are later used to track the user's position.

The speech recognition units (60, 70) use automatic speech recognition to recognize speech in the signal of each channel whose endpoint has been detected by the endpoint detection units (40, 50). The speech recognition units (60, 70) comprise a first speech recognition unit (60), which recognizes speech in the channel 1 signal, and a second speech recognition unit (70), which recognizes speech in the channel 2 signal.

Because the signals input to the speech recognition units (60, 70) have already been enhanced by blind source separation and Wiener filtering, speech is easier to recognize in them.

The channel selection unit (80) selects, from the two channel signals input from the first speech recognition unit (60) and the second speech recognition unit (70), the channel that contains the user's speech.

이때, 채널 선택부(80)는 제1 음성 인식부(60)와 제2 음성 인식부(70)의 음성인식 결과와 음성인식 신뢰도 측정값을 기반으로 사용자의 음성이 포함된 타겟(target) 채널을 선택한다. At this time, the channel selector 80 includes a target channel including the user's voice based on the voice recognition result and the voice recognition reliability measurement values of the first and second voice recognition units 60 and 70. Select.

만일, 하나의 채널에서만 음성이 검출되고 자동음성인식 결과 신뢰도 값이 임계값 보다 높은 경우, 채널 선택부(80)는 해당 채널을 타겟 채널로 선택한다.If a voice is detected in only one channel and the result of the automatic voice recognition is higher than the threshold value, the channel selector 80 selects the channel as the target channel.

한편, 채널 선택부(80)는 두 채널에서 동시에 음성이 검출된 경우에는 자동음성인식 결과 신뢰도가 임계값 보다 높고, 두 채널 중 상대적으로 신뢰도가 높은 채널을 타겟 채널로 선택한다.On the other hand, when voice is detected simultaneously in two channels, the channel selector 80 selects a channel having a higher reliability than a threshold value as a result of automatic voice recognition and having a relatively high reliability among the two channels as a target channel.

채널 선택부(80)에 대한 구체적인 동작은 도 3 및 도 4의 설명을 참조한다.Detailed operations of the channel selector 80 will be described with reference to FIGS. 3 and 4.

Once the channel selector 80 has selected the target channel, the channel buffering unit 90 buffers the output signal of the stereo Wiener filter for the selected target channel.

If channel 1 is selected as the target channel, the signal filtered by the first filter unit 20 is buffered; if channel 2 is selected as the target channel, the signal filtered by the second filter unit 30 is buffered.

The voiced frame detector 100 detects voiced frames from the signal buffered by the channel buffering unit 90, that is, from the target channel signal.

For the detailed operation of the voiced frame detector 100, refer to the description of FIG. 5.

The band-pass filter units 110 and 120 filter each channel signal using the signals buffered by the channel buffering unit 90, emphasizing the speech frequency band contained in each channel signal.

The band-pass filter units 110 and 120 comprise a first band-pass filter unit 110, which filters the target channel signal, and a second band-pass filter unit 120, which filters the interference channel signal. Of the two channels, the channel other than the target channel containing the user's speech is treated as the interference channel.

Here, the first band-pass filter unit 110 emphasizes the speech frequency band (2 to 4 kHz) of the target channel, and the second band-pass filter unit 120 emphasizes the speech frequency band (2 to 4 kHz) of the interference channel.

The expression applied by the band-pass filter units 110 and 120 to emphasize the speech frequency band of each channel is given by [Equation 1].

The band-pass filter units 110 and 120 pass the signals in which the speech frequency band of each channel has been emphasized to the sound source position tracking unit 130.

Because the sound source position tracking unit 130 receives from the band-pass filter units 110 and 120 signals in which the speech frequency band of each channel is emphasized, it can track the sound source position more accurately.

The sound source position tracking unit 130 applies the phase-transform (PHAT) algorithm to track the position of the user as a sound source.

For the phase-transform (PHAT) algorithm, see A. Brutti, M. Omologo, and P. Svaizer, "Comparison between different sound source localization techniques based on a real data collection," in Proc. Joint Workshop on Hands-Free Speech Communication and Microphone Arrays, pp. 69-72, May 2008.
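As a non-limiting illustration, a PHAT-weighted generalized cross-correlation between the target channel and the interference channel may be sketched in Python as follows. This is a generic GCC-PHAT delay estimator and not necessarily the exact procedure of the invention or of the cited reference. The function names, the naive DFT (used only to keep the sketch self-contained), and the circular correlation without zero-padding are assumptions made for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT matching dft() above."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gcc_phat_delay(target, interference, max_lag):
    """Return the lag (in samples) at which the PHAT-weighted circular
    cross-correlation peaks. A positive result means `target` lags
    `interference` by that many samples."""
    n = len(target)
    x1, x2 = dft(target), dft(interference)
    # Cross-power spectrum of the two channels.
    cross = [a * b.conjugate() for a, b in zip(x1, x2)]
    # PHAT weighting: discard the magnitude and keep only the phase.
    phat = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    cc = [v.real for v in idft(phat)]
    # Search only physically plausible lags; negative lags wrap around.
    return max(range(-max_lag, max_lag + 1), key=lambda lag: cc[lag % n])
```

Given a microphone spacing and the speed of sound (neither of which is specified in this passage), the estimated delay could then be converted to a direction of arrival. A practical implementation would use an FFT and zero-padding to avoid circular wrap-around.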

As described above, each stereo Wiener filter takes the signal of its assigned channel as the reference signal and derives a Wiener-filter-based transfer function from it. Applying this Wiener filter transfer function identically to both channel signals not only improves the sound quality of the reference signal in the reference channel but also emphasizes the residual components of the reference signal that remain in the interference channel. The two-channel signal obtained in this way is used for speaker localization once the target channel has been selected.

FIG. 2 is a block diagram referred to in describing the configuration of the stereo Wiener filter unit according to the present invention; specifically, it shows the configuration of the first filter unit.

As shown in FIG. 2, the first filter unit 20 includes a signal framing unit 21, a voice activity detector 22, a Fourier transform unit 23, a PSD spectrum estimator 24, a Wiener filter coefficient estimator 25, a Wiener filter unit 26, and a signal reconstruction unit 27. The first filter unit 20 receives not only the channel 1 signal separated by the sound source separation unit 10 but also the channel 2 signal.

The signal framing unit 21 includes a first signal framing unit 21a, which performs time-domain signal framing on the channel 1 signal, and a second signal framing unit 21b, which performs time-domain signal framing on the channel 2 signal.

The first signal framing unit 21a passes the channel 1 signal to the voice activity detector 22 and the Fourier transform unit 23. The second signal framing unit 21b passes the channel 2 signal to the Wiener filter unit (the second Wiener filter unit 26b).

The voice activity detector 22 performs frame-energy-based voice activity detection (VAD) on the channel 1 signal.
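As a non-limiting illustration, frame-energy-based voice activity detection may be sketched in Python as follows. The function names, the fixed energy threshold, and the framing parameters are assumptions made for illustration; the specification does not state how the threshold is chosen.

```python
def split_frames(signal, frame_len, hop):
    """Time-domain framing: cut the signal into frames of frame_len samples."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(frames, threshold):
    """Mark a frame as speech (True) when its energy exceeds the threshold."""
    return [frame_energy(f) > threshold for f in frames]
```

For example, `energy_vad(split_frames(signal, 256, 128), threshold)` yields one speech/non-speech decision per frame, which a coefficient estimator could use to decide when to update its noise estimate.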

The Fourier transform unit 23 performs a fast Fourier transform (FFT) on the channel 1 signal.

The PSD spectrum estimator 24 estimates the power spectral density (PSD) spectrum from the Fourier-transformed channel 1 signal.

The Wiener filter coefficient estimator 25 estimates the Wiener filter coefficients for channel 1 based on the signals input from the voice activity detector 22 and the PSD spectrum estimator 24, and passes the estimated coefficients to the Wiener filter unit 26.

The Wiener filter unit 26 applies Wiener filtering to the input signals. It includes a first Wiener filter unit 26a, which filters the channel 1 signal using the Wiener filter coefficients estimated by the Wiener filter coefficient estimator 25, and a second Wiener filter unit 26b, which filters the channel 2 signal using the same estimated coefficients.

The signal reconstruction unit 27 performs time-domain reconstruction of the de-noised signal filtered by the Wiener filter unit 26. It includes a first signal reconstruction unit 27a, which reconstructs the signal filtered by the first Wiener filter unit 26a, and a second signal reconstruction unit 27b, which reconstructs the signal filtered by the second Wiener filter unit 26b.
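As a non-limiting illustration, the idea of estimating a Wiener gain from the reference channel and applying the same gain to both channels may be sketched in Python as follows. The gain rule (one minus the ratio of noise PSD to reference PSD, with a spectral floor), the function names, and the assumption that a noise PSD estimate is already available (for example, averaged over frames that the VAD marks as non-speech) are illustrative choices and are not taken from the specification.

```python
def wiener_gain(ref_psd, noise_psd, floor=0.05):
    """Per-bin Wiener-style gain estimated from the reference channel only."""
    gains = []
    for p, n in zip(ref_psd, noise_psd):
        g = 1.0 - n / p if p > 0.0 else floor
        gains.append(max(g, floor))  # the floor limits over-suppression
    return gains


def apply_stereo_wiener(ref_spec, other_spec, noise_psd, floor=0.05):
    """Apply the gain derived from the reference channel to BOTH spectra.

    ref_spec / other_spec: complex FFT bins of one frame of each channel.
    noise_psd: assumed noise power estimate per bin for the reference channel.
    """
    ref_psd = [abs(x) ** 2 for x in ref_spec]
    gains = wiener_gain(ref_psd, noise_psd, floor)
    ref_out = [g * x for g, x in zip(gains, ref_spec)]
    other_out = [g * x for g, x in zip(gains, other_spec)]
    return ref_out, other_out
```

Because the same gain is applied to both channels, bins dominated by the reference source are preserved in the other channel as well, which is the behaviour the description attributes to the stereo Wiener filter.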

Although FIG. 2 specifically shows the configuration of the first filter unit 20, the second filter unit 30 operates in the same manner, except that the roles of channel 1 and channel 2 are interchanged.

FIG. 3 is a block diagram referred to in describing the configuration of the speech recognition unit according to the present invention. In particular, FIG. 3 shows the relationship among the speech recognition unit, the endpoint detector, and the channel selector.

As described above, a target channel selection algorithm is used to detect, from the two channel output signals, the target channel in which the user's speech signal is present.

As shown in FIG. 3, the speech recognition unit according to the present invention includes a feature extractor, a speech recognition decoder, and an utterance verification unit.

The feature extractor includes a first feature extractor 61, which receives and processes the channel 1 signal, and a second feature extractor 71, which receives and processes the channel 2 signal.

The first feature extractor 61 receives its signal from the first speech endpoint detector 40, which detects the endpoint from the channel 1 signal. The second feature extractor 71 receives its signal from the second speech endpoint detector 50, which detects the endpoint from the channel 2 signal.

The speech recognition decoder includes a first speech recognition decoder 65, which receives and processes the channel 1 signal, and a second speech recognition decoder 75, which receives and processes the channel 2 signal.

The utterance verification unit includes a first utterance verification unit 69, which receives and processes the channel 1 signal, and a second utterance verification unit 79, which receives and processes the channel 2 signal.

First, the first speech endpoint detector 40 detects the speech endpoint from the channel 1 signal, buffers the signal, and passes the buffered signal to the first feature extractor 61.

The first feature extractor 61 extracts speech features for speech recognition from the channel 1 signal and passes the extracted feature information to the first speech recognition decoder 65.

The first speech recognition decoder 65 recognizes the user's speech in the channel 1 signal based on the speech features extracted by the first feature extractor 61.

The first utterance verification unit 69 then verifies the speech signal by evaluating the confidence of the recognition result from the first speech recognition decoder 65, and passes the verification result for the channel 1 speech signal to the channel selector 80.

Likewise, the second speech endpoint detector 50 detects the speech endpoint from the channel 2 signal, buffers the signal, and passes the buffered signal to the second feature extractor 71.

The second feature extractor 71 extracts speech features for speech recognition from the channel 2 signal and passes the extracted feature information to the second speech recognition decoder 75.

The second speech recognition decoder 75 recognizes the user's speech in the channel 2 signal based on the speech features extracted by the second feature extractor 71.

The second utterance verification unit 79 verifies the speech signal by evaluating the confidence of the recognition result from the second speech recognition decoder 75, and passes the verification result for the channel 2 speech signal to the channel selector 80.

The channel selector 80 selects the target channel based on the speech recognition confidences for channel 1 and channel 2 received from the first utterance verification unit 69 and the second utterance verification unit 79.

For the operation by which the channel selector 80 selects the target channel from channel 1 and channel 2, refer to FIG. 4.

FIG. 4 shows program source code used by the channel selector 80 to select the target channel.

Referring to FIG. 4, the channel selector 80 determines whether an endpoint has been detected in channel 1 and in channel 2. If endpoints are detected in both channels, it compares the confidences of channel 1 and channel 2 (CV_ch1, CV_ch2) with the threshold (Th).

If both CV_ch1 and CV_ch2 are greater than the threshold (Th), the channel selector 80 compares CV_ch1 with CV_ch2 and selects channel 1 as the target channel if CV_ch1 is greater, or channel 2 if CV_ch2 is greater.

If CV_ch1 is greater than the threshold (Th) and CV_ch2 is smaller than the threshold (Th), the channel selector 80 selects channel 1 as the target channel; in the opposite case, it selects channel 2.

If an endpoint is detected only in channel 1, the channel selector 80 selects channel 1 as the target channel provided that CV_ch1 is greater than the threshold (Th).

If an endpoint is detected only in channel 2, the channel selector 80 selects channel 2 as the target channel provided that CV_ch2 is greater than the threshold (Th).
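The selection rules described above may be sketched in Python as follows. This is an illustrative reconstruction from the text, not the program source of FIG. 4. The function name, the return value `None` when no channel qualifies, and the handling of an exact tie between the two confidences (channel 1 is chosen) are assumptions that the description does not specify.

```python
def select_target_channel(epd_ch1, epd_ch2, cv_ch1, cv_ch2, th):
    """Return 1 or 2 for the selected target channel, or None if none qualifies.

    epd_ch1 / epd_ch2: True if an endpoint was detected in that channel.
    cv_ch1 / cv_ch2:   recognition confidence of each channel.
    th:                confidence threshold (Th).
    """
    if epd_ch1 and epd_ch2:
        if cv_ch1 > th and cv_ch2 > th:
            # Both channels pass the threshold: take the more confident one.
            return 1 if cv_ch1 >= cv_ch2 else 2  # tie goes to channel 1 (assumed)
        if cv_ch1 > th:
            return 1
        if cv_ch2 > th:
            return 2
        return None
    if epd_ch1:
        return 1 if cv_ch1 > th else None
    if epd_ch2:
        return 2 if cv_ch2 > th else None
    return None
```

For instance, with endpoints in both channels, confidences 0.6 and 0.8, and a threshold of 0.5, the sketch selects channel 2; the remaining channel would then be treated as the interference channel.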

FIG. 5 is a block diagram referred to in describing the configuration of the voiced frame detector 100 according to the present invention.

The voiced frame detector 100 detects voiced frame intervals in the input signal from the channel buffering unit 90, using the following eight voiced-sound features.

1. Modified time-frequency (TF) feature based on noise-robust band energy.

2. High-to-low frequency band energy ratio (HLFBER).

3. Zero crossing rate (ZCR).

4. Level crossing rate (LCR).

5. Normalized autocorrelation maximum value (NAMV).

6. Voicing probability based on the YIN algorithm.

7. Peak-to-valley ratio (PVR) of the autocorrelation function.

8. Minimum value of the average magnitude difference function (AMDF).

Of these, FIG. 5 shows the configuration used to obtain the first voiced-sound feature, the modified time-frequency feature based on noise-robust band energy.

As shown in FIG. 5, the voiced frame detector 100 according to the present invention includes a frame blocking unit 101, an energy estimator 102, a Fourier transform unit 103, an FFT power estimator 104, an energy calculation unit 105, an adjustment unit 106, and a quantization unit 107.

First, the frame blocking unit 101 blocks the input signal from the channel buffering unit 90 into frames and passes them to the energy estimator 102 and the Fourier transform unit 103.

The energy estimator 102 estimates the energy of each time-domain frame produced by the frame blocking unit 101. This energy is later used, together with the energy calculated by the energy calculation unit 105, to obtain the modified time-frequency feature based on noise-robust band energy.

The Fourier transform unit 103 performs a fast Fourier transform (FFT) on each frame produced by the frame blocking unit 101 and passes the result to the FFT power estimator 104, which estimates the power of the transformed frame.

The energy calculation unit 105 then calculates the energy in a specific frequency band, namely 250 to 3600 Hz, based on the power estimate from the FFT power estimator 104.
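As a non-limiting illustration, the band-limited energy calculation may be sketched in Python as follows. The function name and the representation of the FFT power estimate as a list of per-bin powers for bins 0 through n_fft/2 are assumptions made for illustration; the sampling rate is likewise not stated in this passage and is left as a parameter.

```python
def band_energy(power_spectrum, sample_rate, n_fft, f_lo=250.0, f_hi=3600.0):
    """Sum the power of all FFT bins whose center frequency lies in [f_lo, f_hi].

    power_spectrum: per-bin power |X(k)|^2 for k = 0 .. n_fft // 2 (assumed layout).
    """
    bin_hz = sample_rate / n_fft  # frequency spacing between adjacent bins
    return sum(p for k, p in enumerate(power_spectrum)
               if f_lo <= k * bin_hz <= f_hi)
```

Restricting the sum to 250 to 3600 Hz keeps the band where speech energy is concentrated and ignores low-frequency hum and high-frequency noise, which is presumably why the feature is described as noise-robust.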

The adjustment unit 106 adjusts the energy estimated by the energy estimator 102 and the energy calculated by the energy calculation unit 105.

The quantization unit 107 performs quantization using the energy estimated by the energy estimator 102 and the energy calculated by the energy calculation unit 105, thereby obtaining the modified time-frequency feature based on noise-robust band energy.

The voiced frame detector 100 calculates the second voiced-sound feature, the high-to-low frequency band energy ratio (HLFBER), using [Equation 2].

As expressed in [Equation 2], the HLFBER is obtained by dividing the high-frequency band energy, highbandE, by the low-frequency band energy, lowbandE. Here, highbandE is the energy in the high-frequency band of 4 kHz to 8 kHz, and lowbandE is the energy in the low-frequency band of 0 kHz to 4 kHz.
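Based on this description, [Equation 2] can be written as below. The ratio itself follows directly from the text. Expressing the band energies as sums of squared FFT magnitudes |X(k)|^2 over bins with center frequency f_k, and assigning the 4 kHz boundary bin to the high band, are assumptions made for illustration.

```latex
\mathrm{HLFBER} = \frac{\mathrm{highbandE}}{\mathrm{lowbandE}},
\qquad
\mathrm{highbandE} = \sum_{k:\; 4\,\mathrm{kHz} \le f_k \le 8\,\mathrm{kHz}} |X(k)|^2,
\qquad
\mathrm{lowbandE} = \sum_{k:\; 0 \le f_k < 4\,\mathrm{kHz}} |X(k)|^2
```

Voiced speech concentrates its energy at low frequencies, so a voiced frame would be expected to give a small HLFBER, while unvoiced fricatives and broadband noise would give a larger one.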

The third through eighth voiced-sound features can be obtained using commonly known methods, and a detailed description of them is omitted.
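For orientation only, three of these commonly known features (ZCR, NAMV, and the AMDF minimum) may be sketched in Python as follows. These are textbook formulations; the function names, the lag search range, and the normalization choices are assumptions and may differ from the variants actually used.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def normalized_autocorr_max(frame, min_lag, max_lag):
    """Largest autocorrelation in [min_lag, max_lag], normalized by frame energy.

    Requires max_lag < len(frame).
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
        best = max(best, r / energy)
    return best


def amdf_min(frame, min_lag, max_lag):
    """Smallest average magnitude difference over the lag range."""
    values = []
    for lag in range(min_lag, max_lag + 1):
        n = len(frame) - lag
        values.append(sum(abs(frame[t] - frame[t + lag]) for t in range(n)) / n)
    return min(values)
```

A periodic (voiced) frame typically shows a low ZCR, a NAMV close to 1, and an AMDF minimum close to 0 at the pitch lag.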

The voiced frame detector 100 compares the eight voiced-sound features obtained above with thresholds to calculate the voicing feature ratio, using [Equation 3].

Here, the Voicing Counter is the number of features whose value, when compared with its predefined per-feature threshold, indicates a voiced sound.

If the voicing feature ratio calculated by [Equation 3] is higher than a predefined decision threshold, the voiced frame detector 100 determines that the frame is a voiced frame.

The predefined per-feature thresholds and the decision threshold can be determined a priori.
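As a non-limiting illustration, the voiced-frame decision may be sketched in Python as follows. The form of [Equation 3] is not reproduced in this text, so the sketch assumes that the voicing feature ratio is the Voicing Counter divided by the number of features. The function names, the feature names, and the per-feature rule format (a threshold plus a flag saying whether values above it indicate voicing) are likewise assumptions.

```python
def voicing_feature_ratio(features, rules):
    """Assumed ratio: (number of features voting 'voiced') / (number of features).

    features: dict mapping feature name -> measured value for one frame.
    rules:    dict mapping feature name -> (threshold, voiced_if_above).
    """
    counter = 0  # the "Voicing Counter"
    for name, value in features.items():
        threshold, voiced_if_above = rules[name]
        if (value > threshold) == voiced_if_above:
            counter += 1
    return counter / len(features)


def is_voiced_frame(features, rules, ratio_threshold):
    """Declare the frame voiced when the ratio exceeds the decision threshold."""
    return voicing_feature_ratio(features, rules) > ratio_threshold
```

The direction flag is needed because some features (such as NAMV) indicate voicing when they are high, whereas others (such as ZCR or the AMDF minimum) indicate voicing when they are low.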

The voiced frame detector 100 passes the voiced frames detected by the method described above to the sound source position tracking unit 130.

The operation of the present invention configured as described above is as follows.

FIG. 6 is a flowchart showing the operation flow of the location tracking method using a user's voice according to the present invention.

As shown in FIG. 6, when a two-channel signal in which several sound sources are mixed is input through a two-channel microphone (S600), the sound source separation unit 10 performs blind source separation on the input two-channel signal (S610), separating it into individual sound sources.

When the source separation of step S610 is complete, the stereo Wiener filter unit filters each of the separated source signals using a stereo Wiener filter (S620).

The first endpoint detector 40 and the second endpoint detector 50 then detect the speech endpoint in each channel signal and buffer the signals (S630). A channel in which no endpoint is detected in step S630 is treated as an interference channel.

Speech recognition is then performed using the signal of each channel in which an endpoint was detected in step S630 (S640). The channel selector 80 compares the recognition results of step S640 across the channels (S650) and selects the target channel (S660), that is, the channel whose signal contains the user's speech according to the recognition results.

The channel selector 80 excludes from target channel selection any channel in which no endpoint was detected in step S630.

The output signal of the stereo Wiener filter for the target channel selected in step S660 is buffered (S670), and the voiced frame detector 100 detects voiced frames from the signal buffered in step S670 (S680).

The band-pass filter units also filter the target channel signal selected in step S660 and the interference channel signal, passing the speech frequency band of each (S690).

Finally, the sound source position tracking unit 130 tracks the sound source position from the results of steps S680 and S690 (S700) and outputs the tracking result (S710).

The apparatus and method for tracking a sound source position using a user's voice according to the present invention separate the signals by channel using blind source separation, and use a stereo Wiener filter both to further improve the sound quality of the target channel signal and to emphasize the components of the user's speech signal present in the interference channel, which makes it easier to track the sound source position.

The apparatus and method for location tracking using a user's voice according to the present invention are not limited to the configurations and methods of the embodiments described above; all or some of the embodiments may be selectively combined so that various modifications can be made.

FIG. 1 is a block diagram showing the configuration of a location tracking apparatus using a user's voice according to the present invention.

FIG. 2 is a block diagram showing the detailed configuration of the stereo Wiener filter unit according to the present invention.

FIG. 3 is a block diagram showing the configuration of the speech recognition unit according to the present invention.

FIG. 4 is an exemplary view referred to in describing the operation of the channel selector according to the present invention.

FIG. 5 is a block diagram showing the detailed configuration of the voiced frame detector according to the present invention.

FIG. 6 is a flowchart showing the operation flow of the location tracking method using a user's voice according to the present invention.

Claims

A location tracking apparatus using a user's voice, comprising:

a sound source separation unit that separates an input two-channel signal into individual sound sources;

a stereo Wiener filter unit that removes diffuse noise from each of the sound source signals separated by the sound source separation unit and filters them so as to emphasize residual signal components;

a speech recognition unit that recognizes the user's speech from each of the sound source signals and measures the confidence of the speech recognition result;

a channel selector that selects a target channel based on the speech recognition result from the speech recognition unit and the confidence of the speech recognition result; and

a sound source position tracking unit that analyzes the signal of the target channel selected by the channel selector and the signal of an interference channel to track the sound source position.

The apparatus according to claim 1, wherein the sound source separation unit separates the two-channel signals by sound source using a blind source separation technique.

The apparatus according to claim 1, wherein the stereo Wiener filter unit filters each of the sound source signals using a stereo Wiener filter technique.

The apparatus according to claim 1, wherein the stereo Wiener filter unit comprises:

a voice activity detector for detecting frame-energy-based voice activity in the input sound source signals;

a Wiener filter coefficient estimator for estimating Wiener filter coefficients based on the signal input from the voice activity detector;

a Wiener filter unit for performing stereo Wiener filtering on the input channel signals using the estimated Wiener filter coefficients; and

a signal recovery unit for restoring the filtered signal of each channel.
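The first two elements of this claim (frame-energy voice activity detection and Wiener coefficient estimation) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the energy-ratio decision rule, the gain floor, and the spectral-subtraction style gain formula are all assumptions made for illustration, and the sketch treats one channel at a time because this section does not specify how the two channels are coupled in the stereo filter.

```python
import cmath
import math

def split_frames(x, frame_len):
    """Cut a list of samples into non-overlapping frames of frame_len."""
    return [x[i:i + frame_len]
            for i in range(0, len(x) - frame_len + 1, frame_len)]

def energy_vad(frames, ratio=4.0):
    """Flag a frame as speech when its mean energy exceeds `ratio` times
    that of the quietest frame (hypothetical rule and default value)."""
    energies = [sum(s * s for s in f) / len(f) for f in frames]
    floor = min(energies) + 1e-12
    return [e > ratio * floor for e in energies]

def power_spectrum(frame):
    """Power spectrum via a naive DFT; adequate for short demo frames."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum

def wiener_gains(frames, is_speech, gain_floor=0.05):
    """Per-frame, per-bin gains G = max(1 - N/X, gain_floor), where X is the
    frame's power spectrum and N is the noise power spectrum averaged over
    the frames that the VAD marked as non-speech (assumed estimator)."""
    spectra = [power_spectrum(f) for f in frames]
    noise = [s for s, speech in zip(spectra, is_speech) if not speech]
    if not noise:
        # No noise-only frame available: leave every bin untouched.
        return [[1.0] * len(s) for s in spectra]
    noise_psd = [sum(col) / len(noise) for col in zip(*noise)]
    return [[max(1.0 - n / (p + 1e-12), gain_floor)
             for p, n in zip(s, noise_psd)]
            for s in spectra]
```

Applying the gains to the complex spectrum and resynthesizing the waveform (the role of the signal recovery unit) is omitted here to keep the sketch short.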

The apparatus according to claim 1, further comprising a voice end point detector for detecting the end points of the voice signal from the signals of the respective channels.

The apparatus according to claim 1, wherein the speech recognition unit comprises:

a feature extractor for extracting speech features for speech recognition from the sound source signal of each channel;

a speech recognition decoder for recognizing the user's voice in the signal of each channel based on the speech features extracted by the feature extractor; and

an utterance verification unit for measuring the reliability of the speech recognition result from the speech recognition decoder and verifying that result.

The apparatus according to claim 1, wherein, when the user's voice is detected in only one of the sound source signals of the respective channels and the reliability of the speech recognition result is higher than a threshold value, the channel selector selects the corresponding channel as the target channel.

The apparatus according to claim 1, wherein, when the user's voice is detected in the sound source signal of each channel and the reliability of the speech recognition result is higher than a threshold value, the channel selector selects, of the two channels, the channel whose speech recognition result has the higher reliability as the target channel.
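The two channel-selection cases claimed above (voice detected in one channel only, or in both) can be sketched as a single function. The name `select_target_channel`, its arguments, and the handling of cases the claims do not cover (no channel qualifies, only one of two detected channels clears the threshold, or a tie) are assumptions for illustration.

```python
from typing import Optional, Sequence

def select_target_channel(detected: Sequence[bool],
                          reliability: Sequence[float],
                          threshold: float) -> Optional[int]:
    """Return the index of the target channel, or None if no channel qualifies.

    detected[ch]    -- whether the user's voice was recognized in channel ch
    reliability[ch] -- confidence score of that recognition result
    """
    # A channel is a candidate only if the voice was detected in it and its
    # recognition reliability is above the threshold.
    candidates = [ch for ch in range(len(detected))
                  if detected[ch] and reliability[ch] > threshold]
    if not candidates:
        return None  # assumption: the claims do not say what happens here
    # One candidate: that channel. Two candidates: the more reliable one
    # (on a tie, max() keeps the lower index, which is an arbitrary choice).
    return max(candidates, key=lambda ch: reliability[ch])
```

For example, `select_target_channel([True, True], [0.6, 0.8], 0.5)` picks channel 1, and the remaining channel would then be treated as the interference channel.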

The apparatus according to claim 1, further comprising a voiced sound frame detector for detecting a voiced sound frame from the signal of the target channel selected by the channel selector.

The apparatus according to claim 9, wherein the voiced sound frame detector detects the voiced sound frame using at least one voiced sound feature among a modified time-frequency feature, a high-to-low frequency band energy ratio, a zero crossing rate, a level crossing rate, a normalized autocorrelation maximum value, a voicing probability, a peak-to-valley ratio of the autocorrelation function, and an AMDF (average magnitude difference function) minimum value.
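Three of the listed features have widely used textbook definitions and can be computed directly from a frame of samples. The sketch below uses those standard definitions; the function names and the lag-range parameters are assumptions, and the patent's exact formulas may differ.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ.
    Voiced speech tends to give low values, noise-like frames high values."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def normalized_autocorr_max(frame, min_lag, max_lag):
    """Largest autocorrelation value over the pitch-lag range, divided by
    the frame energy. Periodic (voiced) frames give values near 1."""
    energy = sum(s * s for s in frame)
    if energy == 0:
        return 0.0
    return max(
        sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag)) / energy
        for lag in range(min_lag, max_lag + 1))

def amdf_min(frame, min_lag, max_lag):
    """Smallest average magnitude difference over the pitch-lag range.
    Periodic (voiced) frames give values near 0 at the pitch period."""
    return min(
        sum(abs(frame[t] - frame[t + lag]) for t in range(len(frame) - lag))
        / (len(frame) - lag)
        for lag in range(min_lag, max_lag + 1))
```

At an 8 kHz sampling rate, a lag range of 20 to 100 samples corresponds to pitch frequencies of roughly 80 to 400 Hz, which is a typical (assumed) search range for speech.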

The apparatus according to claim 10, wherein the voiced sound frame detector extracts a frame from the input signal of the target channel and computes its energy, computes the energy in a specific frequency region based on the result of power estimation for the extracted frame, and obtains the modified time-frequency feature from these values.

The apparatus according to claim 10, wherein the voiced sound frame detector calculates a voiced feature ratio by comparing the voiced sound features with threshold values and, when the calculated voiced feature ratio is larger than a predetermined threshold value, determines the corresponding frame to be a voiced sound frame.
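One possible reading of this claim is a voting scheme: each feature is compared with its own threshold, the voiced feature ratio is the fraction of features that indicate voicing, and the frame is declared voiced when that fraction exceeds a second threshold. The sketch below follows that reading only; the function name, the rule format, and all threshold values are hypothetical.

```python
def is_voiced_frame(features, rules, ratio_threshold=0.5):
    """Decide whether a frame is voiced by majority-style voting.

    features -- dict mapping feature name to its measured value
    rules    -- dict mapping feature name to (threshold, voiced_if_above);
                voiced_if_above is False for features such as the zero
                crossing rate, where small values indicate voicing
    Returns (decision, ratio), where ratio is the fraction of features
    that voted "voiced".
    """
    votes = [(features[name] > threshold) == voiced_if_above
             for name, (threshold, voiced_if_above) in rules.items()]
    ratio = sum(votes) / len(votes)
    return ratio > ratio_threshold, ratio
```

With rules such as `{"zcr": (0.2, False), "autocorr": (0.5, True), "amdf": (0.1, False)}`, a frame with a low zero crossing rate, a high autocorrelation peak, and a small AMDF minimum receives three of three votes and is classified as voiced.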

The apparatus according to claim 1, further comprising a band-pass filter for filtering the voice frequency band of the signal of the target channel selected by the channel selector and of the signal of the interference channel not selected as the target channel.
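A band-pass stage of this kind can be sketched with a single second-order (biquad) section. The coefficient formulas are the standard band-pass form with unity peak gain from the widely used Audio EQ Cookbook; the filter order, the center frequency, and the Q value are assumptions, since this section does not state the band edges or the filter design.

```python
import math

def bandpass_biquad(x, fs, f0, q):
    """Filter the sample list x with a second-order band-pass section.

    fs -- sampling rate in Hz
    f0 -- center frequency in Hz (gain is 1 at f0)
    q  -- quality factor; smaller values give a wider pass band
    """
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    # Normalized coefficients (divided by a0).
    b0, b1, b2 = alpha / a0, 0.0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0

    y = []
    x1 = x2 = y1 = y2 = 0.0  # previous inputs and outputs
    for s in x:
        out = b0 * s + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, s
        y2, y1 = y1, out
        y.append(out)
    return y
```

With `fs=16000`, `f0=1000`, and `q=0.5` (hypothetical values), a 1 kHz tone passes essentially unchanged while a 50 Hz hum is reduced to roughly a tenth of its amplitude. The same filter would be applied to both the target channel and the interference channel before localization.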

A location tracking method using a user's voice, comprising:

separating input two-channel signals by sound source;

filtering each sound source signal separated in the separating step;

detecting an end point of the voice from each of the sound source signals and recognizing the voice using the signal from which the end point has been detected;

selecting a target channel based on the speech recognition result of the recognizing step and the reliability of the speech recognition result; and

tracking the location of the sound source by analyzing the voiced sound frame detected from the signal of the target channel and the voice frequency band of the target channel and the interference channel.

The method according to claim 14, wherein the filtering step comprises:

detecting frame-energy-based voice activity for each input sound source signal;

estimating Wiener filter coefficients based on the voice activity detection result and a power spectral density (PSD) estimation result;

performing stereo Wiener filtering on each of the input sound source signals using the estimated Wiener filter coefficients; and

restoring each of the stereo Wiener filtered sound source signals.

The method according to claim 14, further comprising detecting the end points of the voice signals from the sound source signals of the respective channels.

The method according to claim 14, wherein the recognizing of the voice comprises:

extracting speech features for speech recognition from the sound source signal of each channel;

recognizing the user's voice in the sound source signal of each channel based on the extracted speech features; and

verifying the speech recognition result by measuring the reliability of the speech recognition result.

The method according to claim 14, wherein the selecting of the target channel comprises:

when the user's voice is detected in only one of the sound source signals of the respective channels and the reliability of the speech recognition result is higher than a threshold value, selecting the corresponding channel as the target channel; and

when the user's voice is detected in the sound source signal of each channel and the reliability of the speech recognition result is higher than the threshold value, selecting, of the two channels, the channel whose speech recognition result has the higher reliability as the target channel.

The method according to claim 14, further comprising detecting a voiced sound frame from the signal of the target channel selected in the selecting of the target channel,

wherein the detecting of the voiced sound frame comprises comparing voiced sound features with threshold values to calculate a voiced feature ratio and, if the calculated voiced feature ratio is greater than a predefined threshold value, determining the frame to be a voiced sound frame.

The method according to claim 14, further comprising filtering to emphasize the voice frequency band of the signal of the target channel selected in the selecting of the target channel and of the signal of the interference channel not selected as the target channel.