KR20110127783A

KR20110127783A - Apparatus for separating voice and method for separating voice of single channel using the same

Info

Publication number: KR20110127783A
Application number: KR1020100047209A
Authority: KR
Inventors: 권오욱; 이윤경; 이인성
Original assignee: 충북대학교 산학협력단
Priority date: 2010-05-20
Filing date: 2010-05-20
Publication date: 2011-11-28
Also published as: KR101096091B1

Abstract

PURPOSE: A speech separation apparatus and a single-channel speech separation method using the same are provided to improve the accuracy of speech separation by considering both the magnitude and the phase information of a speech signal. CONSTITUTION: A speech signal separation unit (110) separates phase information and magnitude information from the mixed speech signal input from a microphone. A magnitude information probability unit (130) defines a soft mask filter and calculates the discrete Fourier transform (DFT) of the separated speech signal. A phase information probability unit (140) calculates a phase probability. A speech signal extraction unit (150) estimates the desired speech signal.

Description

Apparatus for Separating Voice and Method for Separating Voice of Single Channel Using the Same

The present invention relates to a speech separation apparatus, and more particularly to a speech separation apparatus that separates a desired speech signal from a mixed speech signal by considering both phase and magnitude information in a single channel, and to a single-channel speech separation method using the same.

With recent advances in digital signal processing, the related market is growing, and voice services based on voice communication and speech recognition systems, such as voice dialing and voice-operated locks, are becoming commonplace.

The environments in which speech recognition systems are actually used contain various noise sources.

Techniques that reduce the influence of noise are therefore essential in speech information processing, both to enhance the speech signal and to apply it effectively in a system.

Speech separation, one approach to noise removal, includes single-channel speech separation, which uses a speech signal input from one microphone; two-channel speech separation, which uses speech signals obtained from two microphones; and multi-channel speech separation, which uses three or more microphones.

In a conventional speech processing system, the speech signal is obtained using a single microphone.

Conventional single-channel speech separation techniques include computational auditory scene analysis (CASA) and the soft mask.

CASA separates a speech signal by using the characteristics of human hearing to find, in the input mixed speech signal, the acoustic elements that originate from the same sound source.

The soft mask is a speech separation method based on statistical modeling.

Because the CASA-based method uses a binary mask as the source separation mask, part of the speech signal is lost during separation; the method also requires phonetic knowledge and heuristics.

The soft-mask-based method uses a probabilistic soft mask as the source separation mask, but because the separation relies on statistical modeling, adjacent portions of the same speech signal can be assigned to different sources, producing discontinuities.

In addition, the soft mask ignores the phase component, which also carries information about the speech signal, and separates the speech signal using only the magnitude component.

To solve these problems, an object of the present invention is to separate the mixed speech signal input to a speech separation apparatus using magnitude information and phase information separately, and then to combine the separated information to extract the desired speech signal from the mixed speech signal.

Another object of the present invention is to compensate for the discontinuous separation that occurs during speech separation by applying a smoothing filter in the speech separation apparatus.

To achieve these objects, a single-channel speech separation method according to an aspect of the present invention comprises: separating the magnitude and phase of a mixed speech signal input from a microphone in order to separate a desired target speech signal from the mixed speech signal; calculating, using a magnitude model based on statistical modeling of magnitude information, a magnitude probability that the magnitude of the mixed speech signal is the magnitude of the target speech signal; calculating, using a phase model based on statistical modeling of phase information, a phase probability that the phase of the mixed speech signal is the phase of the target speech signal; and estimating the target speech signal by multiplying the magnitude and phase of the mixed speech signal by the magnitude probability and the phase probability, respectively, as weights.

A speech separation apparatus according to an aspect of the present invention comprises: a speech signal separation unit that separates the magnitude and phase of a mixed speech signal input from a microphone in order to separate a desired target speech signal from the mixed speech signal; a magnitude information probability unit that calculates, using a magnitude model based on statistical modeling of magnitude information, a magnitude probability that the magnitude of the mixed speech signal is the magnitude of the target speech signal; a phase information probability unit that calculates, using a phase model based on statistical modeling of phase information, a phase probability that the phase of the mixed speech signal is the phase of the target speech signal; and a speech signal extraction unit that estimates the target speech signal by multiplying the magnitude and phase of the mixed speech signal by the magnitude probability and the phase probability, respectively, as weights.

With the above configuration, the present invention increases the accuracy of speech separation by considering both the magnitude and the phase information of the speech signal during separation.

The present invention uses a smoothing filter to reflect the continuous nature of speech, thereby compensating for the discontinuities introduced by the statistical model.

The present invention improves speech separation performance in the speech processing field by using phase information together with magnitude information.

The present invention can be applied to address and place-name recognition for telematics services in a home network environment, and to the development of noise removal modules for the voice interfaces of intelligent robots.

FIG. 1 is a block diagram schematically illustrating the internal configuration of a speech separation apparatus that separates a desired speech signal using the magnitude and phase of speech in a single channel, according to an embodiment of the present invention.
FIG. 2 is a schematic diagram illustrating a single-channel speech separation method in the speech separation apparatus according to an embodiment of the present invention.
FIG. 3 is a diagram showing an example of the log spectrum and phase components for part of one frame of the input speech signals x(t), y(t) and the mixed speech signal z(t), according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can readily practice the invention. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted for clarity, and like reference numerals designate like parts throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

An embodiment of the present invention separates a speech signal by considering phase and magnitude information in a single channel.

FIG. 1 is a block diagram schematically illustrating the internal configuration of a speech separation apparatus that separates a desired speech signal using the magnitude and phase of speech in a single channel, according to an embodiment of the present invention.

The speech separation apparatus 100 according to the present embodiment includes a speech signal separation unit 110, a smoothing filter unit 120, a magnitude information probability unit 130, a phase information probability unit 140, and a speech signal extraction unit 150.

The speech signal separation unit 110 separates magnitude information and phase information from the mixed speech signal input from the microphone.
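The separation of magnitude and phase information can be sketched as follows. This is a minimal illustration, not the implementation of the unit 110: it assumes that the magnitude information is the natural-log magnitude of a frame's DFT and the phase information is the DFT argument. The function names `dft` and `split_magnitude_phase` and the `floor` parameter are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def split_magnitude_phase(frame, floor=1e-12):
    """Return (log-magnitude spectrum, phase spectrum) of one frame.

    The floor avoids log(0) for empty bins; it is an assumption of this sketch.
    """
    spectrum = dft(frame)
    log_magnitude = [math.log(max(abs(c), floor)) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return log_magnitude, phase
```

A unit impulse, for example, has a flat spectrum, so every log-magnitude value and every phase value is zero.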

In the present invention, the magnitude model used to separate the desired speech signal from the mixed speech signal on the basis of magnitude information is computed with the soft mask, one of the conventional speech separation methods based on statistical modeling.

The soft mask calculates the probability that the input mixed signal is the desired signal, and then multiplies the mixed signal by the calculated probability to estimate the desired speech signal.

When the input speech signals of the speakers obtained through a single microphone are x(t) and y(t), respectively, the mixed speech signal z(t) is the sum of the two input speech signals, as in [Equation 1] below.

Here, x(t) and y(t) are assumed to be mutually independent signals. If the log spectra of x(t) and y(t) are x(w) and y(w), the log spectrum z(w) of the mixed speech signal is calculated as x(w) + y(w) and is redefined as in [Equation 2] below.

In general, the log spectrum of the mixed speech signal is very close to whichever of the two input log spectra has the larger value.

Therefore, the log spectrum of the mixed speech signal can be approximated with the log-max approximation, as in [Equation 3] below.
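The quality of the log-max approximation can be checked numerically. The sketch below assumes, for illustration only, that the mixed log spectrum equals log(exp(x) + exp(y)); this is a common modeling choice and is not necessarily the form of [Equation 2]. All function names are hypothetical.

```python
import math


def mixed_log_spectrum(x, y):
    """log(exp(x) + exp(y)), computed in a numerically stable way."""
    return max(x, y) + math.log1p(math.exp(-abs(x - y)))


def log_max(x, y):
    """Log-max approximation: keep the larger of the two log spectra."""
    return max(x, y)


def approximation_error(x, y):
    """Gap between the assumed exact value and the log-max approximation."""
    return mixed_log_spectrum(x, y) - log_max(x, y)
```

Under this assumption the error is largest (log 2) when the two log spectra are equal and decays exponentially as they move apart, which is why the mixture resembles the dominant source.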

According to the log-max approximation defined in [Equation 3], the probability that the log spectrum vector of the mixed speech signal is that of the desired speech signal x equals the probability that the log spectrum value of x is greater than that of y. That is, the probability that the d-th log spectrum value z_d of z is x_d is calculated as the probability that x_d is greater than y_d, as in [Equation 4] below.

The probability value calculated using [Equation 4] is applied as a weight to the log spectrum of the mixed signal to extract the log spectrum of the desired signal.
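Read directly from the prose above, [Equation 1], [Equation 3], [Equation 4], and the weighting step can plausibly be written as follows. The original equation images are not reproduced in this text, so the notation (the symbol p for probability and the hat on the estimate) is an assumption.

```latex
\begin{align}
z(t) &= x(t) + y(t) \tag{1} \\
z &\approx \max(x, y) \tag{3} \\
p(z_d = x_d) &= p(x_d > y_d) \tag{4} \\
\hat{x}_d &= p(z_d = x_d)\, z_d \notag
\end{align}
```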

Because the soft mask estimate rests on the assumption that the signals x and y of the two speakers are independent, each log spectrum value z of the mixed speech signal is assigned to either x or y.

As a result, the separation results for immediately adjacent time-frequency bands can be assigned to different speech signals, which produces discontinuities.

To compensate for this, a smoothing filter is applied to the speech signal before the probability is calculated, which increases the continuity between adjacent time-frequency bands and takes the characteristics of speech into account.

When calculating the probability that the log spectrum value z_d of the mixed speech signal is x_d, the smoothing filter unit 120 smooths the current frame (the d-th frame) using neighboring log spectrum values in the time-frequency domain.

The smoothing mask is a uniform mask filter (Δ = 4) whose filter width is uniform in the time-frequency domain; the filter width is given by an expression that is not reproduced in this text.

The smoothing filter unit 120 calculates the smoothed log spectrum vector using [Equation 5] below.

If the log spectrum vector smoothed in the time-frequency domain is z', the k-th log spectrum value of z' is given by [Equation 5] below.

The log spectra x' and y', obtained by applying smoothing in the time-frequency domain to the log spectra of the two speech signals x and y, are calculated through the same process as [Equation 5] above.
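The smoothing described above can be sketched as a uniform (box) average over neighboring time-frequency cells. This is only an illustration under stated assumptions: the exact form of [Equation 5] and the filter width are not reproduced in this text, so the `half_width` parameter, the truncation of the window at the edges, and the function name are all hypothetical.

```python
def smooth_log_spectrogram(spec, half_width=1):
    """Uniformly average each cell of spec[frame][bin] over its neighbors.

    The window spans (2 * half_width + 1) frames and bins and is truncated
    at the edges of the spectrogram (an assumption of this sketch).
    """
    n_frames = len(spec)
    n_bins = len(spec[0]) if spec else 0
    smoothed = []
    for t in range(n_frames):
        row = []
        for k in range(n_bins):
            total, count = 0.0, 0
            for tt in range(max(0, t - half_width), min(n_frames, t + half_width + 1)):
                for kk in range(max(0, k - half_width), min(n_bins, k + half_width + 1)):
                    total += spec[tt][kk]
                    count += 1
            row.append(total / count)
        smoothed.append(row)
    return smoothed
```

The same function would be applied to the log spectra of x, y, and the mixture z before any probability is computed, so that isolated outliers in one time-frequency cell are pulled toward their neighbors.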

Given the smoothed log spectrum vectors, the magnitude information probability unit 130 defines the smoothed soft mask filter as in [Equation 6] below.

The probability that the mixed speech signal is the desired speech signal is calculated using [Equation 4] above and Bayes' theorem.

Here, m_x and m_y are the indices of the weighted sums over the M_x and M_y Gaussians in the Gaussian mixture models of x and y, respectively.

Parameters such as the means and variances are calculated from training speech data. Using the Gaussian distribution defined by a given mean and variance, the probability density function given the d-th mean value of the mean vector and the d-th variance value of the variance vector of the m_x-th Gaussian of x' is as in [Equation 7] below, and the cumulative probability density given the d-th mean value of the mean vector and the d-th variance value of the variance vector of the m_y-th Gaussian of y' is as in [Equation 8] below.
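The roles of the probability density of x' and the cumulative density of y' can be illustrated with a single Gaussian per speaker. This is a simplification: the description uses Gaussian mixture models with M_x and M_y components, and [Equation 6] to [Equation 8] are not reproduced in this text. The normalization below is the usual soft-mask posterior under the log-max assumption and is an assumption of this sketch; all names are hypothetical.

```python
import math


def gaussian_pdf(z, mean, var):
    """Density of a univariate Gaussian at z."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gaussian_cdf(z, mean, var):
    """Cumulative distribution of a univariate Gaussian at z."""
    return 0.5 * (1.0 + math.erf((z - mean) / math.sqrt(2.0 * var)))


def soft_mask_probability(z, mean_x, var_x, mean_y, var_y):
    """Probability that the observed value z came from x (x equals z, y below z).

    Note: both products can underflow to zero when z is far from both models.
    """
    x_dominant = gaussian_pdf(z, mean_x, var_x) * gaussian_cdf(z, mean_y, var_y)
    y_dominant = gaussian_pdf(z, mean_y, var_y) * gaussian_cdf(z, mean_x, var_x)
    return x_dominant / (x_dominant + y_dominant)
```

With identical models for x and y the probability is 0.5; when the observed value sits at the mean of x and far above the mean of y, it approaches 1.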

Taking the probability values calculated using [Equation 6] above, the d-th probability value of the probability that the mixed speech signal is the desired signal is defined as in [Equation 9] below.

The d-th log spectrum value of the log spectrum vector estimated for speaker S_x from the mixed speech signal is calculated as the product of the probability value and the d-th log spectrum value z_d of the log spectrum vector of the mixed speech signal, as in [Equation 10] below.

The magnitude information probability unit 130 calculates the discrete Fourier transform (DFT) of the separated speech signal using the estimated log spectrum vector and the phase spectrum vector of the mixed speech signal.

Given the phase component of the mixed speech signal z(t), this DFT is calculated as in [Equation 11] below.
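Rebuilding a complex DFT from an estimated log spectrum and the mixture phase can be sketched as follows. The sketch assumes that the log spectrum is the natural logarithm of the DFT magnitude; the exact form of [Equation 11] is not reproduced in this text, and the function name is hypothetical.

```python
import cmath
import math


def dft_from_log_spectrum(log_spectrum, phase):
    """Combine per-bin log magnitudes and phases into complex DFT values."""
    return [cmath.rect(math.exp(ls), ph) for ls, ph in zip(log_spectrum, phase)]
```

For instance, a bin with log magnitude log 2 and phase π/2 becomes approximately 2j.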

In the present invention, the probability that the phase information vector w of the mixed speech signal z(t) is the phase information of the desired speech signal is calculated, and the mixed signal is multiplied by this probability to estimate the desired speech signal.

In the present invention, with w denoting the phase spectrum vector of the mixed speech signal, the phase of the mixed speech signal is assumed to be similar to the phase of one of the two speakers' speech signals x(t) and y(t), as defined in [Equation 12] below.

The phase information probability unit 140 calculates the probability that the phase spectrum value w_d of the mixed speech signal is the phase u_d of the desired speech signal x, using the differences between phase spectrum values (defined in [Equation 13] below).

The probability that the phase spectrum of the mixed speech signal is the phase of the desired speech signal equals the probability that the similarity between w_d and the phase spectrum u_d of x is higher than the similarity between w_d and the phase spectrum v_d of y.
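The event whose probability is being computed can be illustrated with a direct comparison of phase differences. This is a sketch under stated assumptions: similarity is taken to be a small absolute phase difference, and differences are wrapped to the interval [0, π], which the description does not state. It shows only the comparison itself, not the probability model of [Equation 13]; the function names are hypothetical.

```python
import math


def wrapped_phase_difference(a, b):
    """Absolute difference between two phases, wrapped to [0, pi]."""
    d = (a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def mixture_phase_matches_x(w_d, u_d, v_d):
    """True when the mixture phase w_d is closer to u_d (x) than to v_d (y)."""
    return wrapped_phase_difference(w_d, u_d) < wrapped_phase_difference(w_d, v_d)
```

Wrapping matters near ±π: phases of 3.0 and -3.0 radians differ by about 0.28 radians, not 6.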

Expanding [Equation 13] using Bayes' theorem and then splitting the range to remove the absolute value operator gives [Equation 14] below.

From [Equation 14], the probability that w_d is the phase spectrum u_d of the desired signal is as in [Equation 15] below.

Taking the probability, obtained with the phase model, that the phase spectrum of the mixed speech signal is the phase spectrum of speaker S_x, its d-th probability value and the d-th values of the log spectrum vector and the phase spectrum vector estimated from the phase information of the mixed speech signal are given by [Equation 16], [Equation 17], and [Equation 18] below, respectively.

The phase information probability unit 140 calculates the discrete Fourier transform of the speech signal estimated using the log spectrum vector and the phase spectrum vector, as in [Equation 19] below.

To take both the magnitude and the phase information of the speech signal into account when separating the desired speech signal from the mixed speech signal, the speech signal extraction unit 150 multiplies the magnitude information and the phase information of the mixed speech signal by the speech information calculated with the magnitude model and the phase model, as weights, and thereby estimates the desired speech signal.

Given the discrete Fourier transform of the speech signal extracted with the magnitude model and the discrete Fourier transform of the speech signal obtained with the phase model, the speech signal separated in consideration of both the magnitude and the phase information is defined as in [Equation 20] below.

The speech signal extraction unit 150 inverse-transforms the separated speech signal and then reconstructs the speech signal using the overlap-add method to obtain the desired speech signal.
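The inverse transform followed by overlap-add can be sketched as follows. This is a generic illustration, not the implementation of the unit 150: the frame length, the hop size, and any analysis or synthesis window are not given in the description, so `hop` and the function names are hypothetical and window normalization is omitted.

```python
import cmath


def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part of each time-domain sample."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
        for t in range(n)
    ]


def overlap_add(frames, hop):
    """Sum equal-length time-domain frames placed hop samples apart."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        start = i * hop
        for j, sample in enumerate(frame):
            out[start + j] += sample
    return out
```

In a complete system each separated frame spectrum would pass through `inverse_dft` and the resulting frames through `overlap_add`, with the hop size matching the one used during analysis.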

The single-channel speech separation method in the speech separation apparatus 100 is described in detail with reference to FIG. 2.

FIG. 2 is a schematic diagram illustrating the single-channel speech separation method in the speech separation apparatus 100 according to an embodiment of the present invention.

FIG. 2 shows a speech separation method that estimates the desired speech signal by combining the speech signals estimated from the magnitude information and from the phase information of the speech signal.

To estimate the desired speech signal from the mixed speech signal using the magnitude model, the magnitude information probability unit 130 calculates the probability that the magnitude of the mixed speech signal is the magnitude of the desired signal. The method of calculating this magnitude probability was described above with reference to FIG. 1.

Here, the magnitude model applies a probabilistic mask using the soft mask, based on statistical modeling.

To estimate the desired speech signal from the mixed speech signal using the phase model, the phase information probability unit 140 calculates the probability that the phase of the mixed speech signal is the phase of the desired speech signal. The method of calculating this phase probability was described above with reference to FIG. 1.

Here, the phase model applies the soft mask used in the magnitude model to the phase component, so that the mixed speech signal takes a phase similar to that of the signal with the largest magnitude component.

The speech signal extraction unit 150 extracts the desired speech signal by multiplying the magnitude and phase of the mixed speech signal by the probability calculated in the magnitude information probability unit 130 and the probability calculated in the phase information probability unit 140, respectively, as weights, and combining the results.

An example of the log spectrum and phase components for part of one frame of the input speech signals x(t), y(t) and the mixed speech signal z(t) is described with reference to FIG. 3.

FIG. 3 is a diagram showing an example of the log spectrum and phase components for part of one frame of the input speech signals x(t), y(t) and the mixed speech signal z(t), according to an embodiment of the present invention.

As shown in FIG. 3, in interval x, where the log spectrum of x(t) is larger than the log spectrum of y(t), the phase spectrum of the mixed speech signal resembles the phase spectrum of x(t).

In other words, just as the log spectrum, the magnitude component of the mixed speech signal, is close to the larger of the magnitude components of the two input speech signals, the phase component of the mixed speech signal is very close to the phase of the signal with the larger magnitude component.

In interval y, where the log spectrum of x(t) is smaller than the log spectrum of y(t), the phase spectrum of the mixed speech signal resembles the phase spectrum of y(t).
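The observation from FIG. 3 follows from adding two complex DFT values: when one component is much larger, it dominates the argument of the sum. The following sketch, with hypothetical names and made-up example values, illustrates this for a single frequency bin.

```python
import cmath


def mixture_phase(mag_x, phase_x, mag_y, phase_y):
    """Phase of the sum of two complex components given in polar form."""
    return cmath.phase(cmath.rect(mag_x, phase_x) + cmath.rect(mag_y, phase_y))
```

With magnitudes 10 and 1, the phase of the sum can deviate from the phase of the larger component by at most arcsin(1/10), about 0.1 radians, whatever the phase of the smaller component.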

The embodiments of the present invention described above are not implemented only through an apparatus and/or a method; they may also be implemented through a program that realizes the functions corresponding to the configuration of the embodiments, or through a recording medium on which such a program is recorded. Such implementations can readily be made by those skilled in the art from the description of the embodiments above.

Although embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

1. A single-channel speech separation method comprising:
separating the magnitude and phase of a mixed speech signal input from a microphone in order to separate a desired target speech signal from the mixed speech signal;
calculating, using a magnitude model based on statistical modeling of magnitude information, a magnitude probability that the magnitude of the mixed speech signal is the magnitude of the target speech signal;
calculating, using a phase model based on statistical modeling of phase information, a phase probability that the phase of the mixed speech signal is the phase of the target speech signal; and
estimating the target speech signal by multiplying the magnitude and phase of the mixed speech signal by the magnitude probability and the phase probability, respectively, as weights.

2. The method of claim 1, wherein calculating the magnitude probability further comprises applying smoothing to the current frame using the log spectrum values of neighboring frames and calculating the magnitude probability based on the smoothed log spectrum values.

3. The method of claim 1, wherein calculating the magnitude probability comprises:
calculating a second log spectrum vector of the target speech signal by multiplying a first log spectrum vector of the mixed speech signal by the magnitude probability; and
calculating a first discrete Fourier transform of the speech signal extracted with the magnitude model, using the second log spectrum vector and the phase spectrum vector of the mixed speech signal.

4. The method of claim 3, wherein calculating the phase probability comprises:
calculating a third log spectrum vector of the target speech signal by multiplying the first log spectrum vector of the mixed speech signal by the phase probability;
calculating a phase spectrum vector of the target speech signal by multiplying the phase spectrum vector of the mixed speech signal by the phase probability; and
calculating a second discrete Fourier transform of the speech signal extracted with the phase model, using the third log spectrum vector and the phase spectrum vector of the target speech signal.

5. The method of claim 4, wherein estimating the target speech signal comprises estimating the target speech signal by combining the magnitude probability, the phase probability, the first discrete Fourier transform, and the second discrete Fourier transform.

6. The method of claim 1, wherein the magnitude probability is calculated using the probability that a log spectrum value of the target speech signal is greater than a log spectrum value of a speech signal other than the target speech signal.

7. The method of claim 1, wherein the phase probability is the probability that the similarity between the phase spectrum vector of the mixed speech signal and the phase spectrum vector of the target speech signal is higher than the similarity between the phase spectrum vector of the mixed speech signal and the phase spectrum vector of a speech signal other than the target speech signal.

8. A speech separation apparatus comprising:
a speech signal separation unit that separates the magnitude and phase of a mixed speech signal input from a microphone in order to separate a desired target speech signal from the mixed speech signal;
a magnitude information probability unit that calculates, using a magnitude model based on statistical modeling of magnitude information, a magnitude probability that the magnitude of the mixed speech signal is the magnitude of the target speech signal;
a phase information probability unit that calculates, using a phase model based on statistical modeling of phase information, a phase probability that the phase of the mixed speech signal is the phase of the target speech signal; and
a speech signal extraction unit that estimates the target speech signal by multiplying the magnitude and phase of the mixed speech signal by the magnitude probability and the phase probability, respectively, as weights.

9. The apparatus of claim 8, further comprising a smoothing filter unit that applies smoothing using the log spectrum vectors of neighboring frames and transmits the smoothed log spectrum vector as an input signal to the magnitude information probability unit for calculating the magnitude probability.

10. The apparatus of claim 8, wherein the magnitude information probability unit calculates the magnitude probability using the probability that a log spectrum value of the target speech signal is greater than a log spectrum value of an unwanted speech signal in the mixed speech signal.

11. The apparatus of claim 8, wherein the magnitude model applies a probabilistic mask using a soft mask based on statistical modeling, and the phase model applies the soft mask used in the magnitude model to the phase component, so that the mixed speech signal takes a phase similar to that of the signal with the largest magnitude component.