KR102496767B1 - System for spectral subtraction and online beamforming based on BLSTM

Info

Publication number: KR102496767B1
Application number: KR1020210159916A
Authority: KR
Inventors: 권오욱; 윤성욱
Original assignee: 충북대학교 산학협력단
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2023-02-06

Abstract

The present invention relates to a spectral subtraction online beamforming system comprising: a beamforming vector estimator that takes an observation signal as input and estimates a speech enhancement beamforming vector used for speech enhancement and a noise enhancement beamforming vector used for noise estimation; and a spectral subtraction unit for subtracting the spectrum of the noise enhancement beamforming vector. By proposing an online beamforming update algorithm that uses BLSTM mask estimates, the invention can output an adapted beamforming vector that reflects the speech, the noise, and the speaker position as they change from moment to moment.

Description

BLSTM based spectral subtraction online beamforming system {System for spectral subtraction and online beamforming based on BLSTM}

The present invention relates to a technique for combining Bidirectional Long Short-Term Memory (BLSTM)-based online beamforming with spectral subtraction.

Multi-channel speech enhancement that exploits spectral and spatial information has proven to be an effective way to improve the performance of Automatic Speech Recognition (ASR). Among traditional multi-channel speech enhancement methods, Multichannel Nonnegative Matrix Factorization (MNMF), the Multichannel Wiener Filter (MWF), and beamforming have been used as the main techniques for improving ASR performance.

Among these, beamforming is the most important technique for improving performance, and the Minimum-Variance Distortionless Response (MVDR), the Generalized Eigen Value (GEV), and the Generalized Sidelobe Canceller (GSC) beamformers have been actively studied.

Recently, Deep Neural Networks (DNNs) have brought notable performance gains in ASR, and Time-Frequency (T-F) mask estimation has been proposed as the latest beamforming technique based on deep learning. In many studies, deep learning-based mask estimation beamforming has been applied successfully and has surpassed the performance of traditional beamforming. As deep learning was successfully applied to beamforming, reasonably stable performance was also obtained in real environments.

Beamforming with multi-channel microphones is effective for noise removal and speech enhancement, but existing beamforming algorithms have mainly been studied on utterances prepared in advance, in which speech and noise overlap completely. This is not suitable for real environments. In actual use, noise is always present, whereas segments containing the user's speech are sparse, so a continuous stream of noise and noisy speech must be processed. This calls for an online beamforming algorithm that adapts to an input that changes over time. Because input frames that contain only noise degrade the performance of an online beamforming algorithm, a countermeasure is also needed.

Korean Registered Patent No. 10-2236471

The present invention has been devised to solve the above problems. Its object is to propose a system that uses Bidirectional Long Short-Term Memory (BLSTM)-based mask estimates to output a beamforming vector that adapts to an input changing over time, and that estimates a beamforming vector which converges quickly from the moment speech enters the input.

Another object of the present invention is to provide a method of combining spectral subtraction with beamforming to improve overall performance in low Signal-to-Noise Ratio (SNR) environments.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve these objects, the present invention relates to a spectral subtraction online beamforming system that includes a beamforming vector estimator, which takes an observation signal as input and estimates a speech enhancement beamforming vector used for speech enhancement and a noise enhancement beamforming vector used for noise estimation, and a spectral subtraction unit for subtracting the spectrum of the noise enhancement beamforming vector.

The beamforming vector estimator may include a deep learning-based mask estimator that performs deep learning-based mask estimation.

The deep learning-based mask estimator may be implemented as a Bidirectional Long Short-Term Memory (BLSTM)-based mask estimator.

According to the present invention, by proposing an online beamforming update algorithm that uses BLSTM mask estimates, an adapted beamforming vector can be output that reflects the speech, the noise, and the speaker position as they change from moment to moment.

In addition, in an environment where speech is sparse and noise is present, using a ring buffer ensures stable beamforming vector calculation while preserving real-time operation.

In addition, splitting the block batch for processing speeds up the convergence of the Power Spectral Density (PSD) matrix, and combining spectral subtraction with Generalized Eigen Value (GEV) beamforming improves performance at low SNR.

FIG. 1 is an overall block diagram of a BLSTM-based spectral subtraction online beamforming system according to an embodiment of the present invention.
FIG. 2 is a block diagram of a beamforming vector estimator according to an embodiment of the present invention.
FIG. 3 illustrates a neural network for mask estimation according to an embodiment of the present invention.
FIG. 4 illustrates a block-wise PSD matrix update algorithm according to an embodiment of the present invention.
FIG. 5 is a parallel processing block diagram of a beamforming vector estimator for fast convergence of the PSD matrix according to an embodiment of the present invention.

Since the present invention can be modified in various ways and can have several embodiments, specific embodiments are illustrated in the drawings and described in detail. This is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

The terms used in this application are used only to describe specific embodiments and are not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" are intended to specify that the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification exist, and should be understood not to preclude the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions of them are omitted. In describing the present invention, a detailed description of related known technology is omitted where it could unnecessarily obscure the gist of the present invention.

The present invention relates to a spectral subtraction online beamforming system.

FIG. 1 is an overall block diagram of a BLSTM-based spectral subtraction online beamforming system according to an embodiment of the present invention.

Referring to FIG. 1, the spectral subtraction online beamforming system of the present invention includes an STFT (Short-Time Fourier Transform) unit 110, a spectral subtraction unit 120, a ring buffer 130, a beamforming vector estimator 140, a first multiplier 150, and a second multiplier 160.

The STFT unit 110 performs a short-time Fourier transform on the input observation signal. In an embodiment of the present invention, the observation signal may be a multi-channel speech signal.

The spectral subtraction unit 120 subtracts the spectrum of the estimated noise spectrum signal.

The spectral subtraction unit 120 of the present invention outputs the enhanced signal spectrum, that is, the signal from which the spectrum has been subtracted.

When the length of the ring buffer 130 is K, the enhanced signal spectrum is input to the ring buffer 130, which outputs the enhanced signal in block units.

The beamforming vector estimator 140 takes the observation signal as input and estimates the speech enhancement beamforming vector W_{GEV_x}, used for speech enhancement, and the noise enhancement beamforming vector W_{GEV_N}, used for noise estimation.

The spectral subtraction unit 120 subtracts the spectrum of the noise enhancement beamforming vector.

The first multiplier 150 multiplies the observation signal spectrum by the noise enhancement beamforming vector.

The second multiplier 160 multiplies the speech enhancement beamforming vector by the enhanced signal spectrum.

The signal X(t,f) output from the second multiplier 160 passes through an ISTFT unit, which performs an inverse short-time Fourier transform, and becomes the single-channel beamformed signal x(t).

In FIG. 1, with the observation signal y(t) as input, the speech enhancement beamforming vector W_{GEV_x} and the noise enhancement beamforming vector W_{GEV_N} are estimated simultaneously; W_{GEV_x} is used for speech enhancement and W_{GEV_N} for noise estimation. These signals then pass through the spectral subtraction unit 120 and the beamforming vector estimator 140 to become the single-channel beamformed signal x(t).

The noise enhancement beamforming vector W_{GEV_N} is multiplied by the multi-channel observation signal Y(t,f) and used for noise estimation.
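The two multipliers of FIG. 1 both apply a per-frequency weight vector to a multi-channel spectrum. The sketch below illustrates that operation only; the function name `apply_beamformer`, the array layout (frames, bins, channels), and the use of the conjugate inner product are assumptions made for illustration, not details stated in the patent.

```python
import numpy as np

def apply_beamformer(w, spec):
    """Apply per-frequency weights w (F, C) to a spectrum spec (T, F, C) -> (T, F)."""
    # Computes w(f)^H y(t, f) for every frame t and frequency bin f.
    return np.einsum("fc,tfc->tf", np.conj(w), spec)

T, F, C = 4, 3, 2
rng = np.random.default_rng(0)
Y = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))

# Hypothetical weights that simply average the channels.
w = np.full((F, C), 1.0 / C, dtype=complex)
out = apply_beamformer(w, Y)  # single-channel spectrum, shape (T, F)
```

In the system described above, the same operation would be used once with W_{GEV_N} on the observed spectrum and once with W_{GEV_x} on the enhanced spectrum.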

FIG. 2 is a block diagram of a beamforming vector estimator according to an embodiment of the present invention.

Referring to FIG. 2, the beamforming vector estimator 140 may include a deep learning-based mask estimator that performs deep learning-based mask estimation.

The deep learning-based mask estimator may be implemented as a Bidirectional Long Short-Term Memory (BLSTM)-based mask estimator.

In FIG. 2, y_block is expressed as follows.

(Equation 1)

When spectral subtraction is not applied, Equation 1 is the input of the beamforming vector estimator 140.

In Equation 1, y_block denotes one block-wise input batch, and L is the number of frames contained in one block.

With y_block as input, the speech enhancement beamforming vector W_{GEV_x} and the noise enhancement beamforming vector W_{GEV_N} are estimated simultaneously.

FIG. 2 shows the BLSTM-based mask estimator. y_block is transformed by the STFT (Short-Time Fourier Transform), and its magnitude is then taken and used for BLSTM-based mask estimation. The block-wise mask estimate M^l_v(t,f) is calculated as in Equation 2 below.

(Equation 2)

Here, t is the frame index, f is the frequency bin index, l is the block index, and v denotes the noise (N) or speech (x) class.

FIG. 3 illustrates a neural network for mask estimation according to an embodiment of the present invention.

Referring to FIG. 3, the neural network for mask estimation consists of four layers.

In the embodiment of FIG. 3, the noisy speech stream is sampled at 16 kHz, and the STFT is then performed with a frame size of 1,024 and a frame shift of 256. The 513 spectral magnitudes obtained from the STFT result are used as the input to the mask estimation network.
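The 513 magnitudes follow from the frame size: a real FFT of 1,024 samples yields 1,024/2 + 1 = 513 non-redundant bins. A minimal sketch of this front end is shown below; the Hann window and the helper name `stft` are assumptions, since the patent does not state which window is used.

```python
import numpy as np

def stft(signal, frame_size=1024, frame_shift=256):
    """Return the complex STFT with shape (num_frames, frame_size // 2 + 1)."""
    window = np.hanning(frame_size)  # window choice is an assumption
    num_frames = 1 + (len(signal) - frame_size) // frame_shift
    frames = np.stack([
        signal[i * frame_shift : i * frame_shift + frame_size] * window
        for i in range(num_frames)
    ])
    return np.fft.rfft(frames, axis=1)

fs = 16000
x = np.random.default_rng(0).standard_normal(fs)  # one second of synthetic noise
spec = stft(x)
magnitude = np.abs(spec)  # 513 magnitudes per frame, the network input
```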

The first layer is a BLSTM layer with 256 output units and uses tanh as its activation function. The BLSTM has 1,024 memory cells.

The second and third layers are feed-forward (FF) layers that use the Rectified Linear Unit (ReLU) as the activation function. Each has 513 units, chosen to match the 513-point magnitude spectrum of the input.

The fourth layer has 1,026 units divided into two parts: units 1 to 513 estimate M_x(t,f), and units 514 to 1,026 estimate M_N(t,f). A sigmoid activation is used so that the estimates lie between 0 and 1, and no constraint is imposed that the speech and noise mask estimates sum to 1 in each time-frequency bin.
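The split of the fourth layer can be illustrated with a few lines of NumPy. This sketch covers only the output layer (not the BLSTM or feed-forward layers), and the function name `split_masks` is hypothetical.

```python
import numpy as np

def split_masks(logits):
    """Split (T, 1026) output-layer pre-activations into speech and noise masks."""
    masks = 1.0 / (1.0 + np.exp(-logits))  # element-wise sigmoid, values in (0, 1)
    # Units 1..513 -> speech mask M_x, units 514..1026 -> noise mask M_N.
    return masks[:, :513], masks[:, 513:]

logits = np.full((5, 1026), 2.0)  # synthetic pre-activations for 5 frames
m_x, m_n = split_masks(logits)
```

Because each unit has its own sigmoid, the two masks are independent: in this synthetic example both are about 0.88 in every bin, so their sum exceeds 1, which the description explicitly allows.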

FIG. 4 illustrates a block-wise PSD matrix update algorithm according to an embodiment of the present invention.

Referring to FIG. 4, the mask estimate M^l_v(t,f) of FIG. 2, estimated in the previous block, and Y(t,f), obtained by applying the STFT to the input frames and taking the magnitude, are used as inputs to compute the Power Spectral Density (PSD) matrix φ^l_v(f) estimated in the l-th block, as in Equation 3 below.

(Equation 3)

Here, H is the Hermitian operator. The PSD matrix has dimension F×C×C and represents the power distribution of speech and noise across the multi-channel microphones.
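A common way to compute a mask-weighted PSD matrix of shape F×C×C is sketched below, taking F as the number of frequency bins and C as the number of channels. The exact form of Equation 3 is not reproduced in this text, so the normalization by the mask sum, the use of the complex spectrum in the outer product, and the name `masked_psd` are assumptions.

```python
import numpy as np

def masked_psd(mask, Y):
    """mask: (T, F) in [0, 1]; Y: (T, F, C) complex. Returns a PSD of shape (F, C, C)."""
    # sum over frames of M(t, f) * Y(t, f) Y(t, f)^H
    psd = np.einsum("tf,tfc,tfd->fcd", mask, Y, np.conj(Y))
    # Normalizing by the accumulated mask weight is an assumption.
    norm = np.maximum(mask.sum(axis=0), 1e-10)
    return psd / norm[:, None, None]

rng = np.random.default_rng(0)
T, F, C = 8, 5, 3
Y = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))
mask = rng.uniform(size=(T, F))
phi = masked_psd(mask, Y)
```

Each per-frequency matrix is Hermitian because the mask is real-valued, which is what the later eigenvalue step relies on.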

The estimated φ^l_v(f) is combined in a weighted sum with the cumulatively estimated φ^{l-1}_v(f) using the weight α^l_v(f), as in Equation 4 below, which yields the PSD φ^l_v(f) cumulatively estimated up to the l-th block.

(Equation 4)
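One plausible form of such a weighted sum is a per-frequency convex combination of the previous cumulative estimate and the new block estimate. The sketch below assumes that form; how the patent actually derives the weight α is not shown in this text, so `alpha` is simply passed in, and the name `update_psd` is hypothetical.

```python
import numpy as np

def update_psd(phi_prev, phi_block, alpha):
    """phi_prev, phi_block: (F, C, C); alpha: (F,) weights in [0, 1]."""
    a = alpha[:, None, None]
    # Assumed form: alpha weights the past, (1 - alpha) the current block.
    return a * phi_prev + (1.0 - a) * phi_block

F, C = 2, 2
phi_prev = np.stack([np.eye(C)] * F)
phi_block = np.stack([3.0 * np.eye(C)] * F)
phi_new = update_psd(phi_prev, phi_block, alpha=np.array([0.5, 1.0]))
```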

The resulting PSD matrix φ^l_v is fed into a ring buffer of length k. Once the ring buffer is full, the final PSD matrix is calculated as in Equation 5 below.

(Equation 5)

Here, β_i is the weight of the PSD matrix at the i-th position of the ring buffer. The weights are set progressively smaller from position k-1, which holds the current block, down to position 0, so that the information of the current block is reflected as much as possible and the PSD matrix information of past blocks is forgotten.
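The ring-buffer combination can be sketched with a fixed-length deque. The linearly increasing weights normalized to sum to 1 are an assumption chosen for illustration; the patent only states that older positions receive smaller weights. The name `push_and_combine` is hypothetical.

```python
from collections import deque
import numpy as np

k = 4
beta = np.arange(1, k + 1, dtype=float)
beta /= beta.sum()  # [0.1, 0.2, 0.3, 0.4], ordered oldest -> newest

ring = deque(maxlen=k)  # position 0 is the oldest block, position k-1 the newest

def push_and_combine(psd):
    """Store one block PSD; return the weighted sum once the buffer is full."""
    ring.append(psd)
    if len(ring) < k:
        return None
    return sum(b * p for b, p in zip(beta, ring))

# Five synthetic block PSDs: 1*I, 2*I, ..., 5*I
outputs = [push_and_combine(i * np.eye(2)) for i in range(1, 6)]
```

The first three calls return `None` because the buffer is not yet full; after that each new block displaces the oldest one and the combination is recomputed.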

With block-wise updates, the shorter the block, the smaller the delay between the time the beamforming vector is computed and the current input frame, in proportion to the block size. However, if not enough frames can be gathered for the beamforming vector calculation, the PSD matrix may be computed from too few samples in some frequency bins, which can produce inaccurate beamforming values.

The present invention uses a ring buffer to reduce this risk and to perform beamforming that reflects, as far as possible, the noise and speech information temporally close to the current input frame at the time the beamforming vector is applied.

The online beamforming algorithm proposed in the present invention uses GEV (Generalized Eigen Value) beamforming, whose beamforming vector is obtained by maximizing the SNR in each frequency bin, expressed as a Rayleigh quotient, as in Equation 6 below.

(Equation 6)

The solution of this optimization is given by Equation 7 below.

(Equation 7)
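For reference, the GEV criterion is commonly written as the Rayleigh quotient below, whose maximizer satisfies a generalized eigenvalue problem. This is the standard textbook form and is offered as an assumption about what Equations 6 and 7 express; the patent's own notation is not reproduced in this text.

```latex
W_{GEV}(f) = \arg\max_{W} \frac{W^{H}\,\phi_{XX}(f)\,W}{W^{H}\,\phi_{NN}(f)\,W},
\qquad
\phi_{XX}(f)\,W = \lambda\,\phi_{NN}(f)\,W
```

Under this form, W_{GEV}(f) is the eigenvector associated with the largest generalized eigenvalue λ.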

Using the speech PSD matrix φ_XX and the noise PSD matrix φ_NN in Equation 6, a beamforming vector for estimating the noise can also be obtained.
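A minimal NumPy sketch of a GEV beamformer for one frequency bin follows. It takes the principal eigenvector of φ_NN⁻¹ φ_XX, which is the usual way to maximize the ratio of speech power to noise power. Obtaining the noise-estimating vector by swapping the roles of the two matrices is one plausible reading of the sentence above, not something the text confirms; the function names are hypothetical.

```python
import numpy as np

def gev_beamformer(phi_num, phi_den):
    """Principal generalized eigenvector maximizing (w^H phi_num w) / (w^H phi_den w)."""
    # Solve phi_den^{-1} phi_num w = lambda w and keep the largest eigenvalue.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(phi_den, phi_num))
    return eigvecs[:, np.argmax(eigvals.real)]

def rayleigh(w, phi_num, phi_den):
    """Ratio of the two quadratic forms for a given weight vector."""
    num = np.real(np.conj(w) @ phi_num @ w)
    den = np.real(np.conj(w) @ phi_den @ w)
    return num / den

phi_xx = np.diag([4.0, 1.0]).astype(complex)  # synthetic speech PSD
phi_nn = np.eye(2, dtype=complex)             # synthetic noise PSD

w_speech = gev_beamformer(phi_xx, phi_nn)
w_noise = gev_beamformer(phi_nn, phi_xx)      # roles swapped: an assumption
```

In this toy case the speech vector selects the first channel, where the speech-to-noise ratio is 4, and the noise vector selects the second channel, where the noise-to-speech ratio is 1.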

FIG. 5 is a parallel processing block diagram of a beamforming vector estimator for fast convergence of the PSD matrix according to an embodiment of the present invention.

Referring to FIG. 5, when the PSD matrix is updated block by block, the block batch is split in half and processed in parallel so that the PSD matrix converges quickly. Each of the split batches is used to compute the noise and speech PSD matrices, and the PSD matrices computed from the two batches are averaged to estimate the final φ_XX and φ_NN. The GEV beamforming vector is then computed by the same procedure as before. The block is split because fast convergence of the PSD matrix is important for performance in an online beamforming task that processes continuous speech: splitting the block, updating a PSD matrix for each part, and averaging the results can be expected to converge faster than updating the PSD matrix once with the whole block.
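The split-and-average step can be sketched as follows. The helper `masked_psd` is a hypothetical mask-weighted PSD estimator with an assumed normalization, and `split_block_psd` is likewise an illustrative name; the sketch runs the two halves sequentially, whereas FIG. 5 describes them as processed in parallel.

```python
import numpy as np

def masked_psd(mask, Y):
    """Mask-weighted PSD of shape (F, C, C); the normalization is an assumption."""
    psd = np.einsum("tf,tfc,tfd->fcd", mask, Y, np.conj(Y))
    return psd / np.maximum(mask.sum(axis=0), 1e-10)[:, None, None]

def split_block_psd(mask, Y):
    """Split the block in half along time, estimate a PSD per half, then average."""
    half = Y.shape[0] // 2
    halves = [(mask[:half], Y[:half]), (mask[half:], Y[half:])]
    return np.mean([masked_psd(m, y) for m, y in halves], axis=0)

rng = np.random.default_rng(1)
T, F, C = 8, 4, 2
Y = rng.standard_normal((T, F, C)) + 1j * rng.standard_normal((T, F, C))
mask = np.ones((T, F))
phi_split = split_block_psd(mask, Y)
```

With a uniform mask and equal halves, as in this example, the average coincides with the full-block estimate under the assumed normalization; the two differ once the mask weights vary between the halves.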

(Equation 8)

As in Equation 8, in each channel the estimated noise spectral magnitude is subtracted from the observed signal spectral magnitude to estimate the enhanced signal spectral magnitude.

The subtraction weight η(t,f) is defined as in Equation 9 according to the observed signal spectral magnitude and the estimated noise spectral magnitude.

(Equation 9)

(Equation 10)

As in Equation 10, λ_N(f) is defined using the noise mask estimates, and of the L frames used in the block-wise beamforming vector update, only the L/2 frames temporally closest to the present are used. The reason is to perform the subtraction while taking into account, as far as possible, the per-frequency-bin noise distribution contained in the current frames. If all L frames were used, the similarity to the noise distribution at the time of application would drop, leading to performance degradation.

Phase information is needed to reconstruct the estimated enhanced signal spectrum. Because phase information over a short interval is relatively unimportant, the phase of the observed signal spectrum is used for reconstruction, as in Equation 11 below.

(Equation 11)
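The magnitude subtraction and phase reuse can be sketched as below. The weight `eta` is passed in as given because the definitions in Equations 9 and 10 are not reproduced in this text; flooring negative magnitudes at zero and the name `spectral_subtract` are assumptions.

```python
import numpy as np

def spectral_subtract(Y, noise_mag, eta):
    """Y: (T, F, C) complex; noise_mag, eta: (T, F). Returns the enhanced spectrum."""
    # Subtract the weighted noise magnitude from every channel's magnitude.
    mag = np.abs(Y) - (eta * noise_mag)[..., None]
    mag = np.maximum(mag, 0.0)  # flooring at zero is an assumption
    # Reuse the phase of the observed spectrum for reconstruction.
    return mag * np.exp(1j * np.angle(Y))

Y = np.array([[[3.0 + 4.0j]]])  # one frame, one bin, one channel; |Y| = 5
Y_enh = spectral_subtract(Y, noise_mag=np.array([[1.0]]), eta=np.array([[2.0]]))
```

Here the magnitude drops from 5 to 3 while the phase is unchanged, so the result is 1.8 + 2.4j.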

The enhanced signal spectrum estimated by spectral subtraction in this way enters a ring buffer of length L, and once the ring buffer is full, the block-wise enhanced signal of Equation 12 is fed to the beamforming vector estimator.

(Equation 12)
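A frame-level ring buffer that emits a block once it is full can be sketched with the standard library. Whether the patent advances the block by one frame at a time (as here) or by a whole block is not stated in this text, so the sliding behavior and the class name `FrameRingBuffer` are assumptions.

```python
from collections import deque

class FrameRingBuffer:
    """Keeps the most recent `length` frames and returns them as a block when full."""

    def __init__(self, length):
        self.frames = deque(maxlen=length)

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        if len(self.frames) == self.frames.maxlen:
            return list(self.frames)
        return None

buf = FrameRingBuffer(length=3)
blocks = [buf.push(t) for t in range(5)]  # integers stand in for spectrum frames
```

Until three frames have arrived nothing is emitted, which matches the description that the estimator only receives a block after the buffer fills.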

As described above, in online beamforming it is important to output an adapted beamforming vector that reflects the speech, the noise, and the speaker position as they change from moment to moment. To this end, the present invention proposes an online beamforming update algorithm that uses BLSTM mask estimates.

In addition, considering actual use, noise is always present and speech is sparse, so quickly computing a stable beamforming vector from the moment speech arrives is important for performance. However, given the mathematical formulation of beamforming, it is difficult to compute a stable beamforming vector unless the observation signal is sufficiently long. The present invention therefore uses a ring buffer to ensure stable beamforming vector calculation while preserving real-time operation.

Additionally, the block batch is split for processing so that the PSD matrix converges quickly, and finally spectral subtraction is combined with GEV beamforming to improve performance at low SNR.

The present invention has been described above using several preferred embodiments, but these embodiments are illustrative and not limiting. Those of ordinary skill in the art to which the present invention pertains will understand that various changes and modifications can be made without departing from the spirit of the present invention and the scope of rights set forth in the appended claims.

110 STFT    120 Spectral subtraction unit
130 Ring buffer    140 Beamforming vector estimator
150 First multiplier    160 Second multiplier

Claims

an STFT unit for performing a short-time Fourier transform on an input observation signal;
a beamforming vector estimator for taking the observation signal as input and estimating a speech enhancement beamforming vector W_{GEV_x} used for speech enhancement and a noise enhancement beamforming vector W_{GEV_N} used for noise estimation;
a first multiplier for multiplying the signal Y(t,f) output from the STFT unit by the noise enhancement beamforming vector W_{GEV_N};
a spectral subtraction unit for subtracting the spectrum of the noise enhancement beamforming vector output from the first multiplier and outputting the enhanced signal spectrum, which is the signal from which that spectrum has been subtracted;
a ring buffer for receiving the enhanced signal spectrum output from the spectral subtraction unit, outputting an enhanced signal in block units according to the buffer length, and passing it to the beamforming vector estimator;
a second multiplier for multiplying the speech enhancement beamforming vector W_{GEV_x} by the enhanced signal spectrum and outputting a signal X(t,f); and
an ISTFT unit for performing an inverse short-time Fourier transform on the signal X(t,f) output from the second multiplier and outputting a single-channel beamformed signal x(t)
including,
The beamforming vector estimator includes a deep learning-based mask estimator performing deep learning-based mask estimation,
The deep learning-based mask estimator is a BLSTM (Bidirectional Long Short-Term Memory)-based mask estimator,
The signal y_block input to the BLSTM-based mask estimator is given by Equation 1 below,

(Equation 1)
where y_block denotes an input batch of one block and L is the number of frames included in one block,
The mask value estimated in units of blocks is calculated as in Equation 2 below,

(Equation 2)
Here, t is a frame index, f is a frequency bin index, l is a block index, v is a noise (N) or voice (x) class,
With the mask estimate M^l_v(t,f) estimated in the previous block and Y(t,f), obtained by applying the STFT to the input frame and taking the magnitude, as inputs, the Power Spectral Density (PSD) matrix φ^l_v(f) estimated in the l-th block is calculated as in Equation 3 below,

(Equation 3)
Here, H is the Hermitian operator, and the PSD matrix has dimension F×C×C and represents the power distribution of speech and noise across the multi-channel microphones,
The estimated φ^l_v(f) is combined in a weighted sum with the cumulatively estimated φ^{l-1}_v(f) using the weight α^l_v(f) as in Equation 4 below, and the PSD φ^l_v(f) cumulatively estimated up to the l-th block is obtained,

(Equation 4)
The obtained PSD matrix φ^l_v is input to a ring buffer of length k, and when the ring buffer is full, it is calculated as in Equation 5 below,

(Equation 5)
Here, β_i is the PSD matrix weight of the i-th position of the ring buffer,
GEV (Generalized Eigen Value) beamforming is used, and the beamforming vector is obtained by maximizing the SNR for each frequency bin, expressed as a Rayleigh quotient, as in Equation 6 below,

(Equation 6)
The solution of this optimization can be expressed as Equation 7 below,

(Equation 7)
The spectral subtraction online beamforming system, characterized in that a beamforming vector for estimating noise is obtained using the speech PSD matrix φ_XX and the noise PSD matrix φ_NN in Equations 6 and 7.

delete