KR101811524B1

KR101811524B1 - Dual-Microphone Voice Activity Detection Based on Deep Neural Network and Method thereof

Info

Publication number: KR101811524B1
Application number: KR1020160060214A
Authority: KR
Inventors: 황승현; 장준혁
Original assignee: 한양대학교 산학협력단
Priority date: 2016-05-17
Filing date: 2016-05-17
Publication date: 2018-01-25
Also published as: KR20170129477A

Abstract

A two-channel microphone-based voice activity detection apparatus and method using a deep neural network are presented. The voice activity detection method includes: in a classification stage, extracting feature vectors from an input signal, which is a speech signal contaminated by a noisy environment; and, in the classification stage, passing the feature vectors through a pre-trained deep neural network to determine a speech presence probability and classifying the input signal as a speech segment or a non-speech segment. The input signal is received from a plurality of microphones and may include relative spatial information between the input signals.

Description

TECHNICAL FIELD

The following embodiments relate to a two-channel microphone-based voice activity detection apparatus and method using a deep neural network.

Voice activity detection (VAD) is a technique that classifies an input signal into speech-present and speech-absent segments, and it is an essential component of speech communication systems such as speech recognition and speech enhancement.

Multi-channel voice activity detectors have been widely studied because they can exploit relative spatial information between the input signals. However, the relative spatial information of the input signals is nonlinearly distributed, and shallow machine learning techniques with no hidden layer or only one, such as discriminative weight training or support vector machines, are limited in how well they can model it.

A conventional voice activity detector based on discriminative weight training applies linear weights to the nonlinear distribution of diverse spatial information and therefore cannot model it effectively; modeling speech variation under various noise environments in this way leads to degraded performance.

Korean Patent Laid-Open Publication No. 10-2008-0099575 relates to a voice activity detection method using such a support vector machine. It describes a technique that improves the performance of existing voice activity detection methods based on a statistical model of speech by using the per-frequency likelihood ratios of those methods as the feature vectors of a support vector machine (SVM).

The embodiments describe a two-channel microphone-based voice activity detection apparatus and method using a deep neural network. More specifically, they provide a technique that models the nonlinear distribution of various two-channel microphone-based feature vectors extracted from speech signals with a deep neural network and, on that basis, detects speech by applying a threshold to the optimized speech presence probability computed from the feature vectors of the input signal.

The embodiments aim to provide a two-channel microphone-based voice activity detection apparatus and method using a deep neural network that compute spatial information based on a power level difference ratio-based voice activity detector, coherence, and a phase vector; feed it to a deep neural network to model the nonlinear distribution of the spatial information; and then apply the trained deep neural network to spatial information extracted from an input signal to derive an optimal speech presence probability and thereby detect speech-present segments.

A two-channel microphone-based voice activity detection method using a deep neural network according to one embodiment includes: in a classification stage, extracting feature vectors from an input signal that is a speech signal contaminated by a noisy environment; and, in the classification stage, passing the feature vectors through a pre-trained deep neural network to determine a speech presence probability and classifying the input signal as a speech segment or a non-speech segment. The input signal is received from a plurality of microphones and includes relative spatial information between the input signals.

The method may further include training the deep neural network (DNN). Training the deep neural network may include: in a training stage, receiving a speech signal contaminated by an ambient noise environment, applying a discrete Fourier transform (DFT), and extracting feature vectors; and training the deep neural network through a pre-training process and a fine-tuning process using the feature vectors extracted in each noise environment.

The feature vector may be at least one of a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.

Extracting the feature vectors from the input signal may include: computing a long-term power level difference (LT-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal; and computing the long-term power level difference ratio (LT-PLDR) from the LT-PLD.

Extracting the feature vectors from the input signal may include: computing a short-term power level difference (ST-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal; and computing the short-term power level difference ratio (ST-PLDR) from the ST-PLD.

Extracting the feature vectors from the input signal may include: expressing the input signal received through the two microphones in vector form based on its discrete Fourier transform and eigendecomposing the correlation matrix; and normalizing the resulting eigenvector matrix to compute the phase vector.

Extracting the feature vectors from the input signal may include obtaining the coherence function from the power spectral densities of the input signals received by the two microphones, their cross power spectral density, and the cross spectral density of the noise signal based on the long-term power level difference ratio.

In classifying the input signal as a speech segment or a non-speech segment, the feature vectors may be input to the trained deep neural network, re-expressed through multiple hidden layers as more discriminative feature vectors, and finally output as the speech presence probability, by which the input signal is classified as a speech segment or a non-speech segment.

If the speech presence probability is greater than a preset threshold, the input signal may be judged to be speech; if it is smaller than the preset threshold, the input signal may be judged to be non-speech.

A two-channel microphone-based voice activity detection apparatus using a deep neural network according to another embodiment includes: an input unit that receives an input signal that is a speech signal contaminated by a noisy environment; a feature vector extraction unit that extracts feature vectors from the input signal; a deep neural network application unit that passes the feature vectors through a pre-trained deep neural network; and a speech presence probability determination unit that determines the speech presence probability of the feature vectors and classifies the input signal as a speech segment or a non-speech segment. The input signal is received from a plurality of microphones and includes relative spatial information between the input signals.

The apparatus may further include a training unit that trains the deep neural network (DNN). The training unit may include: a training-side input unit that receives, in the training stage, a speech signal contaminated by an ambient noise environment; a discrete Fourier transform unit that applies a discrete Fourier transform (DFT) to the contaminated speech signal; a training-side feature vector extraction unit that extracts feature vectors after the discrete Fourier transform; and a pre-training unit and a fine-tuning unit that train the deep neural network through a pre-training process and a fine-tuning process using the feature vectors extracted in each noise environment.

The feature vector extraction unit may compute a long-term power level difference (LT-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal, and may compute the long-term power level difference ratio (LT-PLDR) from the LT-PLD.

The feature vector extraction unit may compute a short-term power level difference (ST-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal, and may compute the short-term power level difference ratio (ST-PLDR) from the ST-PLD.

In the speech presence probability determination unit, the feature vectors may be input to the trained deep neural network, re-expressed through multiple hidden layers as more discriminative feature vectors, and finally output as the speech presence probability, by which the input signal is classified as a speech segment or a non-speech segment.

According to the embodiments, the nonlinear distribution of various two-channel microphone-based feature vectors extracted from speech signals is modeled with a deep neural network, and speech is detected by applying a threshold to the optimized speech presence probability computed from the feature vectors of the input signal. This makes it possible to provide a two-channel microphone-based voice activity detection apparatus and method using a deep neural network that detect speech with excellent performance even in adverse noise environments.

According to the embodiments, it is possible to provide a two-channel microphone-based voice activity detection apparatus and method using a deep neural network that compute spatial information based on a power level difference ratio-based voice activity detector, coherence, and a phase vector; feed it to a deep neural network to model the nonlinear distribution of the spatial information; and then apply the trained deep neural network to spatial information extracted from an input signal to derive an optimal speech presence probability and thereby detect speech-present segments.

FIG. 1 is a block diagram showing the configuration of a voice activity detection apparatus for performing a voice activity detection method according to one embodiment.
FIG. 2 is a conceptual diagram of a two-channel microphone-based voice activity detection apparatus using a deep neural network according to one embodiment.
FIG. 3 is a flowchart of a two-channel microphone-based voice activity detection method using a deep neural network according to one embodiment.
FIG. 4 compares the ROC curves of a conventional voice activity detection apparatus and the voice activity detection apparatus according to one embodiment at a noise phase of 0 degrees.
FIG. 5 compares the ROC curves of a conventional voice activity detection apparatus and the voice activity detection apparatus according to one embodiment at a noise phase of 90 degrees.
FIG. 6 compares the ROC curves of a conventional voice activity detection apparatus and the voice activity detection apparatus according to one embodiment at a noise phase of 180 degrees.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments are characterized by modeling the nonlinear distribution of the relative spatial information obtained from the input signals with a deep neural network, a deep-structured machine learning technique, to estimate the speech presence probability, and by detecting speech by applying a threshold to the computed speech presence probability.

Voice activity detection (VAD) is a technique that classifies an input signal into speech-present and speech-absent segments, and it is an essential component of speech communication systems such as speech recognition, speech enhancement, and speech coders. For example, halting the speech recognizer during speech-absent segments reduces recognition errors and lowers system power consumption. In a speech coder, the bit rate can be varied between speech-absent and speech-present segments so that the speech signal is transmitted more efficiently.

FIG. 1 is a block diagram showing the configuration of a voice activity detection apparatus for performing a voice activity detection method according to one embodiment.

Referring to FIG. 1, a voice activity detection apparatus for performing the two-channel microphone-based voice activity detection method using a deep neural network may include a training unit 110 and a classification unit 120. The training unit 110 may include a training-side input unit 111, a discrete Fourier transform unit 112, a training-side feature vector extraction unit 113, a pre-training unit 114, and a fine-tuning unit 115. The classification unit 120 may include an input unit 121, a feature vector extraction unit 122, a deep neural network application unit 123, and a speech presence probability determination unit 124. Depending on the embodiment, the training unit 110 and the classification unit 120 may further include a memory.

The feature vector extraction unit 122 and the deep neural network application unit 123 of the classification unit 120 receive the feature vector weights optimized through the training process and apply them to the feature vectors to compute the speech detection probability. They may include a computing unit with a given processing speed, for example a central processing unit (CPU) or a graphics processing unit (GPU). The classification unit 120 may further include a memory for storing data required by a given process.

The speech presence probability determination unit 124 detects speech-present segments from the optimal speech presence probability and may include a computing unit with a given processing speed.

The input unit 121 delivers given input data and may include an input means that converts sound into an electrical signal, such as a microphone. For example, the input unit 121 may be provided with a contaminated speech signal (that is, a speech signal contaminated by ambient noise). The input unit 121 may consist of two microphones, forming a two-channel microphone.

Each component of the voice activity detection apparatus is described in more detail below using one embodiment.

A two-channel microphone-based voice activity detection apparatus using a deep neural network according to one embodiment may include a training unit 110 and a classification unit 120.

First, the training unit 110 trains the deep neural network (DNN) and may include a training-side input unit 111, a discrete Fourier transform unit 112, a training-side feature vector extraction unit 113, a pre-training unit 114, and a fine-tuning unit 115.

The training-side input unit 111 may receive, in the training stage, a speech signal contaminated by an ambient noise environment.

The discrete Fourier transform unit 112 may receive the contaminated speech signal and apply a discrete Fourier transform (DFT) to it.
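
As an illustration of the transform step, the following minimal sketch frames a two-channel signal and applies a DFT to each frame. The function name `stft_two_channel`, the frame length of 512 samples, the hop of 256 samples, and the Hann window are assumptions made for this example; the text above does not specify them.

```python
import numpy as np

def stft_two_channel(signal, frame_len=512, hop=256):
    """Frame-wise DFT of a two-channel signal.

    signal: real array of shape (2, num_samples), one row per microphone.
    Returns a complex array of shape (2, num_frames, frame_len // 2 + 1).
    frame_len, hop, and the Hann window are illustrative assumptions.
    """
    signal = np.asarray(signal, dtype=float)
    num_frames = 1 + (signal.shape[1] - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice overlapping frames for both channels at once.
    frames = np.stack(
        [signal[:, i * hop : i * hop + frame_len] for i in range(num_frames)],
        axis=1,
    )
    # Real-input FFT keeps only the non-negative frequency bins.
    return np.fft.rfft(frames * window, axis=-1)
```

The resulting per-frame, per-bin spectra of the two channels are the kind of quantity from which the feature vectors described below could be computed.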

The training-side feature vector extraction unit 113 may extract feature vectors after the discrete Fourier transform.

The pre-training unit 114 and the fine-tuning unit 115 may train the deep neural network through a pre-training process and a fine-tuning process using the feature vectors extracted in each noise environment.

Next, the classification unit 120 receives the input signal in the classification stage and classifies it as a speech segment or a non-speech segment. It may include an input unit 121, a feature vector extraction unit 122, a deep neural network application unit 123, and a speech presence probability determination unit 124.

The input unit 121 may receive, in the classification stage, an input signal that is a speech signal contaminated by a noisy environment.

The feature vector extraction unit 122 may extract feature vectors from the input signal in the classification stage. The input signal is received from a plurality of microphones and may include relative spatial information between the input signals.

The feature vector may be at least one of a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.

The feature vector extraction unit 122 may compute a long-term power level difference (LT-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal, and may compute the long-term power level difference ratio (LT-PLDR) from the LT-PLD.

The feature vector extraction unit 122 may compute a short-term power level difference (ST-PLD) by applying recursive averaging to the power level difference (PLD) between the two microphones that receive the input signal, and may compute the short-term power level difference ratio (ST-PLDR) from the ST-PLD.
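
The recursive averaging of the PLD can be sketched as first-order smoothing applied with two different smoothing factors. In this sketch the PLD is assumed to be the difference of the two microphones' frame powers, and the names `smoothed_pld`, `ALPHA_LONG`, and `ALPHA_SHORT` and their values are hypothetical; the ratio step (LT-PLDR, ST-PLDR) is left out because its formula is not given in this passage.

```python
import numpy as np

def smoothed_pld(power1, power2, alpha):
    """Recursively averaged power level difference between two microphones.

    power1, power2: per-frame powers of microphone 1 and 2 (frames first).
    alpha: smoothing factor in [0, 1); larger values average over a longer span.
    The PLD is taken here as power1 - power2, which is an assumption.
    """
    pld = np.asarray(power1, dtype=float) - np.asarray(power2, dtype=float)
    smoothed = np.empty_like(pld)
    state = pld[0]  # start the recursion from the first frame
    for t, value in enumerate(pld):
        state = alpha * state + (1.0 - alpha) * value
        smoothed[t] = state
    return smoothed

# Hypothetical smoothing factors: slow tracking for LT-PLD, fast for ST-PLD.
ALPHA_LONG = 0.98
ALPHA_SHORT = 0.6
```

With these values, a sudden jump in the PLD shows up almost immediately in the short-term average and only gradually in the long-term average, which is the contrast the two features are meant to capture.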

The feature vector extraction unit 122 may also express the input signal received through the two microphones in vector form based on its discrete Fourier transform, eigendecompose the correlation matrix, and normalize the resulting eigenvector matrix to compute the phase vector.
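
One possible reading of this step, for a single frequency bin, is sketched below. The sample correlation matrix is estimated by averaging over frames, and the normalization divides the principal eigenvector by its first component; both choices, and the name `phase_vector`, are assumptions for illustration rather than the procedure fixed by the embodiment.

```python
import numpy as np

def phase_vector(frames):
    """Normalized principal eigenvector of the two-channel correlation matrix.

    frames: complex array of shape (num_frames, 2) holding the DFT
    coefficients of both microphones at one frequency bin.
    """
    frames = np.asarray(frames, dtype=complex)
    # Sample correlation matrix R = E[x x^H], a 2 x 2 Hermitian matrix.
    corr = frames.T @ frames.conj() / frames.shape[0]
    # eigh returns eigenvalues in ascending order; the last column
    # is the eigenvector of the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(corr)
    principal = eigvecs[:, -1]
    # Assumed normalization: reference the phase to microphone 1.
    return principal / principal[0]
```

For a single source that reaches the second microphone with a pure phase shift, the second entry of the returned vector carries that inter-channel phase.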

The feature vector extraction unit 122 may further obtain the coherence function from the power spectral densities of the input signals received by the two microphones, their cross power spectral density, and the cross spectral density of the noise signal based on the long-term power level difference ratio.
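
A rough sketch of a coherence computation along these lines follows. The spectral densities are estimated by averaging over frames, and the noise cross spectral density is simply subtracted from the cross power spectral density before normalization. That subtraction, the function name `coherence`, and the `noise_cross_psd` parameter are assumptions; with `noise_cross_psd=0` the function reduces to the ordinary magnitude coherence.

```python
import numpy as np

def coherence(spec1, spec2, noise_cross_psd=0.0, eps=1e-12):
    """Magnitude coherence between two channels, per frequency bin.

    spec1, spec2: complex spectra of shape (num_frames, num_bins).
    noise_cross_psd: assumed estimate of the noise cross spectral density,
    subtracted from the cross power spectral density (hypothetical form).
    """
    spec1 = np.asarray(spec1, dtype=complex)
    spec2 = np.asarray(spec2, dtype=complex)
    psd1 = np.mean(np.abs(spec1) ** 2, axis=0)       # power spectral density, mic 1
    psd2 = np.mean(np.abs(spec2) ** 2, axis=0)       # power spectral density, mic 2
    cross = np.mean(spec1 * np.conj(spec2), axis=0)  # cross power spectral density
    return np.abs(cross - noise_cross_psd) / np.sqrt(psd1 * psd2 + eps)
```

A directional source observed at both microphones yields values near 1, while uncorrelated noise on the two channels yields values near 0.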

The deep neural network application unit 123 may pass the feature vectors through the pre-trained deep neural network in the classification stage.

The speech presence probability determination unit 124 may determine the speech presence probability of the feature vectors in the classification stage and classify the input signal as a speech segment or a non-speech segment.

In the speech presence probability determination unit 124, the feature vectors are input to the trained deep neural network, re-expressed through multiple hidden layers as more discriminative feature vectors, and finally output as the speech presence probability, by which the input signal may be classified as a speech segment or a non-speech segment.
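
The forward pass through the trained network can be pictured with the small sketch below. The sigmoid activations, the layer sizes, and the random weights are placeholders chosen for illustration; in the embodiment the weights would come from the pre-training and fine-tuning processes, and the actual architecture is not specified here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def speech_presence_probability(features, weights, biases):
    """Pass feature vectors through hidden layers to a single sigmoid output.

    features: array of shape (num_frames, feature_dim).
    weights, biases: per-layer parameters; the last layer has one output unit.
    """
    activation = np.asarray(features, dtype=float)
    for w, b in zip(weights, biases):
        # Each layer re-expresses the previous layer's output.
        activation = sigmoid(activation @ w + b)
    return activation[..., 0]

# Hypothetical network: 8 input features, two hidden layers of 16 units.
rng = np.random.default_rng(0)
layer_sizes = [8, 16, 16, 1]
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n) for n in layer_sizes[1:]]
probs = speech_presence_probability(rng.standard_normal((5, 8)), weights, biases)
```

Each entry of `probs` is a value between 0 and 1 for one frame, which plays the role of the speech presence probability.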

If the speech presence probability is greater than a preset threshold, the input signal is judged to be speech; if it is smaller than the preset threshold, the input signal may be judged to be non-speech.
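
The threshold decision can be written in a few lines. The function name `classify_frames` and the default threshold of 0.5 are assumptions; the text does not say how a probability exactly equal to the threshold is handled, and this sketch treats it as non-speech.

```python
def classify_frames(probabilities, threshold=0.5):
    """Return True for frames judged to be speech, False otherwise.

    A frame is speech only when its probability is strictly greater than
    the threshold; the tie case is an assumption of this sketch.
    """
    return [p > threshold for p in probabilities]
```

Sweeping the threshold between 0 and 1 trades detections against false alarms, which is how an ROC curve for such a detector would be traced.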

The two-channel microphone-based voice activity detection method using a deep neural network according to the embodiments models the nonlinear distribution of various two-channel microphone-based feature vectors extracted from speech signals with a deep neural network and detects speech by applying a threshold to the optimized speech presence probability computed from the feature vectors of the input signal, so that speech can be detected with excellent performance even in adverse noise environments.

FIG. 2 is a conceptual diagram of a two-channel microphone-based voice activity detection apparatus using a deep neural network according to one embodiment.

To effectively characterize short-time variations of speech, the two-channel microphone-based voice activity detection apparatus using a deep neural network according to one embodiment derives an optimized speech presence probability from a deep neural network modeled with the power level difference ratio, coherence, and phase vector of the input signal as feature vectors, and can thereby provide a voice activity detection apparatus and method with excellent performance in various noise environments. The deep neural network is used to better model the nonlinear distribution of the feature vectors.

As shown in FIG. 2, the two-channel microphone-based voice activity detection apparatus using a deep neural network according to the embodiment may include a training unit 210 and a classification unit 220.

The training unit 210 trains the deep neural network (DNN). It may receive a speech signal contaminated by an ambient noise environment, apply a discrete Fourier transform (DFT), extract feature vectors, and train the deep neural network through a pre-training process and a fine-tuning process using the feature vectors extracted in each noise environment.

The training unit 210 may include a discrete Fourier transform unit 211, training-side feature vector extraction units 212, 213, and 214, a pre-training unit 215, and a fine-tuning unit 216. Depending on the embodiment, it may further include a training-side input unit.

The classification unit 220 may extract feature vectors from an input signal that is a speech signal contaminated by a noisy environment, pass the feature vectors through the pre-trained deep neural network to determine the speech presence probability, and classify the input signal as a speech segment or a non-speech segment.

The classification unit 220 may include input units 221 and 222, feature vector extraction units 223, 224, and 225, a deep neural network application unit 226, and a speech presence probability determination unit 227.

According to the embodiments, a voice detector based on the power level difference ratio is combined with logic that computes spatial information based on the coherence and the phase vector. This spatial information is applied to a deep neural network to model its nonlinear distribution characteristics, and the modeled deep neural network is then applied to the spatial information extracted from an input signal to derive an optimal speech presence probability, from which the speech-present segments are detected.

Below, the two-channel microphone-based voice detection method using a deep neural network according to an embodiment is described in more detail with reference to the two-channel microphone-based voice detection apparatus using a deep neural network according to an embodiment.

FIG. 3 is a flowchart illustrating a two-channel microphone-based voice detection method using a deep neural network according to an embodiment.

Referring to FIG. 3, the two-channel microphone-based voice detection method using a deep neural network according to an embodiment may include, in a classification stage, extracting basis vectors from an input signal that is a speech signal contaminated by a noise environment (330), and, in the classification stage, passing the basis vectors through a pre-trained deep neural network to determine a speech presence probability and classifying the input signal as a speech segment or a non-speech segment (340). The input signal is input from a plurality of microphones and may include relative spatial information between the input signals.

The two-channel microphone-based voice detection method using a deep neural network according to an embodiment may further include training the deep neural network.

More specifically, training the deep neural network (DNN) may include, in a training stage, receiving a speech signal contaminated by an ambient noise environment, applying a discrete Fourier transform (DFT), and extracting basis vectors (310), and training the deep neural network on the basis vectors extracted in each noise environment through a pre-training process and a fine-tuning process (320).

Here, the basis vector may be at least one of a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector. Accordingly, the plurality of basis vectors may be formed from a combination of these basis vectors.

Thus, according to the embodiments, the nonlinear distribution characteristics of the various two-channel microphone-based basis vectors extracted from the speech signal are modeled by a deep neural network, and speech is detected by applying a threshold to the optimized speech presence probability computed from the basis vectors of the input signal. This enables high-performance voice detection even in adverse noise environments.

Each step of the two-channel microphone-based voice detection method using a deep neural network according to an embodiment is described in detail below. The method can be explained more specifically using the two-channel microphone-based voice detection apparatus using a deep neural network according to the embodiment described with reference to FIG. 2.

The two-channel microphone-based voice detection apparatus using a deep neural network according to an embodiment may include a learning unit 210 and a classification unit 220. Here, the learning unit 210 may include a discrete Fourier transform unit 211, basis vector extraction units 212, 213, and 214 of the learning unit, a pre-training unit 215, and a fine-tuning unit 216. Depending on the embodiment, it may further include an input unit of the learning unit. The classification unit 220 may include input units 221 and 222, basis vector extraction units 223, 224, and 225, a deep neural network application unit 226, and a speech presence probability determination unit 227.

In step 310, to train the deep neural network (DNN), the discrete Fourier transform unit 211 of the voice detection apparatus receives, in the training stage, a speech signal contaminated by an ambient noise environment and applies a discrete Fourier transform (DFT). The basis vector extraction units 212, 213, and 214 of the learning unit then extract the basis vectors.

In step 320, the pre-training unit 215 and the fine-tuning unit 216 of the voice detection apparatus train the deep neural network on the basis vectors extracted in each noise environment through a pre-training process and a fine-tuning process.

For example, the pre-training process may use the contrastive divergence (CD) algorithm, and the fine-tuning process may use the back-propagation algorithm.
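The pre-training step can be illustrated with a minimal sketch of a single contrastive divergence (CD-1) update. It assumes each layer is pre-trained as a Bernoulli restricted Boltzmann machine with inputs scaled to the range 0 to 1, which the text does not state; the function names, learning rate, and layer sizes are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def hidden_probs(weights, hid_bias, v):
    """Activation probability of each hidden unit given a visible vector."""
    return [sigmoid(hid_bias[j] + sum(v[i] * weights[i][j] for i in range(len(v))))
            for j in range(len(hid_bias))]


def visible_probs(weights, vis_bias, h):
    """Activation probability of each visible unit given a hidden vector."""
    return [sigmoid(vis_bias[i] + sum(h[j] * weights[i][j] for j in range(len(h))))
            for i in range(len(vis_bias))]


def cd1_update(weights, vis_bias, hid_bias, v0, lr=0.1, rng=random):
    """Apply one CD-1 update in place; weights[i][j] links visible i to hidden j."""
    h0 = hidden_probs(weights, hid_bias, v0)                    # positive phase
    h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]  # sample hidden states
    v1 = visible_probs(weights, vis_bias, h0_sample)            # one Gibbs step down
    h1 = hidden_probs(weights, hid_bias, v1)                    # and back up
    for i in range(len(vis_bias)):
        for j in range(len(hid_bias)):
            weights[i][j] += lr * (v0[i] * h0[j] - v1[i] * h1[j])
        vis_bias[i] += lr * (v0[i] - v1[i])
    for j in range(len(hid_bias)):
        hid_bias[j] += lr * (h0[j] - h1[j])


def reconstruct(weights, vis_bias, hid_bias, v):
    """Deterministic (mean-field) reconstruction used to inspect what was learned."""
    return visible_probs(weights, vis_bias, hidden_probs(weights, hid_bias, v))
```

In a stacked setting, the hidden probabilities of one trained layer would serve as the visible input of the next, after which back-propagation fine-tunes all layers jointly against the speech and non-speech labels.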

In step 330, the input units 221 and 222 of the voice detection apparatus receive an input signal, which is a speech signal contaminated by a noise environment, and the basis vector extraction units 223, 224, and 225 extract basis vectors from that input signal in the classification stage. Here, the input signal is input from a plurality of microphones and may include relative spatial information between the input signals.

In step 340, the deep neural network application unit 226 of the voice detection apparatus passes the basis vectors through the pre-trained deep neural network in the classification stage, and the speech presence probability determination unit 227 determines the speech presence probability and classifies the input signal as a speech segment or a non-speech segment.

In the embodiments below, to effectively characterize short-time variations of speech, the power level difference ratio, the coherence, and the phase vector of the input signal are used as basis vectors, and a deep neural network is used to better model the nonlinear distribution characteristics of these basis vectors. By deriving an optimized speech presence probability from the modeled deep neural network, a voice detection apparatus and a voice detection method that perform well in a variety of noise environments can be provided.

In other words, a voice detector based on the power level difference ratio is combined with logic that computes spatial information based on the coherence and the phase vector. This spatial information is applied to a deep neural network to model its nonlinear distribution characteristics, and the modeled deep neural network is then applied to the spatial information extracted from an input signal to derive an optimal speech presence probability, from which the speech-present segments are detected.

When an input signal in which a noise signal is added to the original speech signal in the time domain is transformed to the frequency domain by a discrete Fourier transform (DFT), it can be expressed as in Equation 1 below. That is, the speech input signal contaminated by noise can be assumed to be formed by adding the clean original speech signal and the noise signal.

[Equation 1]

Here, the left-hand side of Equation 1 is the discrete Fourier transform coefficient vector of the noisy input signal, and the two terms on the right-hand side are the discrete Fourier transform coefficient vectors of the original speech signal and of the noise signal, respectively. Each quantity is indexed by the microphone index, the frequency component, and the frame index.
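The additive model holds in the frequency domain because the DFT is linear. The following sketch checks this numerically for one frame of one microphone; the toy sample values and the names `Y`, `S`, and `N` are assumptions chosen for illustration and are not taken from the equation itself.

```python
import cmath


def dft(frame):
    """Naive O(N^2) discrete Fourier transform of one frame of samples."""
    n = len(frame)
    return [sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(frame))
            for k in range(n)]


speech = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0]    # toy clean-speech frame
noise = [0.1, -0.2, 0.05, 0.0, -0.1, 0.2, -0.05, 0.0]  # toy noise frame
noisy = [s + d for s, d in zip(speech, noise)]          # time-domain mixture

Y, S, N = dft(noisy), dft(speech), dft(noise)

# Largest deviation between Y(k) and S(k) + N(k) over all frequency bins.
max_err = max(abs(y - (s + d)) for y, s, d in zip(Y, S, N))
```

The deviation `max_err` is at the level of floating-point rounding, so each frequency bin of the noisy frame can be treated as the sum of a speech coefficient and a noise coefficient.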

In addition, letting the hypotheses H₀ and H₁ denote the absence and the presence of speech, respectively, each frequency channel can be expressed as in Equation 2 below.

[Equation 2]
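A plausible form of the two hypotheses, written under the additive model described above, is sketched below. The symbols are assumptions introduced for illustration: $Y_i$, $S_i$, and $N_i$ stand for the DFT coefficients of the noisy input, the clean speech, and the noise at microphone $i$, with $k$ the frequency component and $l$ the frame index.

```latex
\begin{aligned}
H_0 &: \quad Y_i(k,l) = N_i(k,l) \\
H_1 &: \quad Y_i(k,l) = S_i(k,l) + N_i(k,l)
\end{aligned}
```

Under $H_0$ the observation contains noise only, and under $H_1$ it contains speech plus noise, which is the standard formulation for statistical voice activity detection.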

Here, assuming that the speech signal and the noise signal are independent, the power spectral densities of the two microphones can be expressed as in Equation 3 below.

The power level difference ratio basis vector is described in detail below.

Among the basis vector extraction units, the power level difference ratio (PLDR) extraction unit 223 extracts basis vectors from the input signal by applying a recursive averaging technique to the power level difference (PLD) between the two microphones that receive the input signal, thereby obtaining the long-term power level difference (LT-PLD), and then calculating the long-term power level difference ratio (LT-PLDR) from the LT-PLD.

[Equation 3]

From Equation 3, the power level difference (PLD) between the two microphones can be expressed as in Equation 4 below.

[Equation 4]

Applying a recursive averaging technique to the power level difference above gives the long-term power level difference (LT-PLD), as in Equation 5 below.

[Equation 5]

Here, the smoothing parameter of the recursive average may be set to 0.9, for example. From the LT-PLD, the long-term power level difference ratio can be calculated as in Equation 6 below.
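The recursive averaging step can be sketched as follows. This is an illustration under two assumptions that the text does not confirm: the PLD is taken as the per-bin difference between the power spectral densities of microphone 1 and microphone 2, and the recursion is a first-order average with the smoothing parameter 0.9 mentioned above. The function names and the toy PSD values are hypothetical.

```python
def power_level_difference(psd_mic1, psd_mic2):
    """Per-bin PLD between two microphones (assumed: PSD of mic 1 minus PSD of mic 2)."""
    return [p1 - p2 for p1, p2 in zip(psd_mic1, psd_mic2)]


def long_term_pld(previous_lt_pld, pld, alpha=0.9):
    """First-order recursive average: alpha * previous + (1 - alpha) * current."""
    return [alpha * prev + (1.0 - alpha) * cur
            for prev, cur in zip(previous_lt_pld, pld)]


# Toy PSD frames for two frequency bins; mic 1 is assumed to be closer to the talker.
frames_mic1 = [[4.0, 2.0], [4.0, 2.0], [1.0, 1.0]]
frames_mic2 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

lt_pld = [0.0, 0.0]
history = []
for psd1, psd2 in zip(frames_mic1, frames_mic2):
    lt_pld = long_term_pld(lt_pld, power_level_difference(psd1, psd2))
    history.append(lt_pld)
```

With a smoothing parameter close to 1 the average reacts slowly, so the long-term value keeps a memory of the speech frames even after the instantaneous difference drops back to zero in the third frame.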

[Equation 6]

Here, the noise power in Equation 6 is estimated by minima controlled recursive averaging (MCRA) and can be calculated as in Equation 7.

[Equation 7]

Here, the weighting parameter can be expressed as in Equation 8 below.

[Equation 8]

Here, the constant in Equation 8 is set to 0.95, for example, and the speech presence probability of each subband can be expressed as in Equation 9 below.

[Equation 9]

Here, the constant in Equation 9 may be 0.2, for example, and the remaining term can be expressed as in Equation 10 below.

[Equation 10]

Here, the threshold δ is 1.5, and the quantity compared against it can be expressed as in Equation 11 below.

[Equation 11]

Here, the minimum term in Equation 11 is the local minimum of the power level difference over consecutive windows.
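The noise tracking described by Equations 7 to 11 can be illustrated with a sketch that follows the commonly published MCRA recursion. It is an assumption that the equations here take the same form; the class name, the window length, the initialisation from the first frame, and the restriction to non-negative power values are hypothetical choices, while the constants 0.95, 0.2, and 1.5 are the example values given in the text.

```python
from collections import deque


class McraBin:
    """MCRA-style noise tracker for a single frequency bin."""

    def __init__(self, alpha_d=0.95, alpha_p=0.2, delta=1.5, window=8):
        self.alpha_d = alpha_d              # noise smoothing constant (0.95 in the text)
        self.alpha_p = alpha_p              # probability smoothing constant (0.2 in the text)
        self.delta = delta                  # ratio threshold (1.5 in the text)
        self.recent = deque(maxlen=window)  # window length is a hypothetical choice
        self.presence_prob = 0.0
        self.noise = None

    def update(self, power):
        """Consume one frame of (non-negative) power and return the noise estimate."""
        self.recent.append(power)
        local_min = min(self.recent)        # local minimum over the recent window
        if self.noise is None:
            self.noise = power              # initialise from the first frame
        ratio = power / local_min if local_min > 0.0 else float("inf")
        indicator = 1.0 if ratio > self.delta else 0.0
        # Smoothed per-bin speech presence probability.
        self.presence_prob = (self.alpha_p * self.presence_prob
                              + (1.0 - self.alpha_p) * indicator)
        # Time-varying weight: close to 1 when speech is likely, so noise is frozen.
        weight = self.alpha_d + (1.0 - self.alpha_d) * self.presence_prob
        self.noise = weight * self.noise + (1.0 - weight) * power
        return self.noise


tracker = McraBin()
for _ in range(5):
    quiet_estimate = tracker.update(1.0)    # stationary noise floor
burst_estimate = tracker.update(10.0)       # sudden speech-like burst
```

During the burst the ratio to the local minimum exceeds the threshold, the presence probability jumps to 0.8, and the noise estimate moves only from 1.0 to about 1.09, which is the behaviour that lets the tracker ignore speech while still following slow noise changes.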

The short-term power level difference ratio is described in detail below.

In addition, the power level difference ratio (PLDR) extraction unit 223 may extract basis vectors from the input signal by applying a recursive averaging technique to the power level difference (PLD) between the two microphones that receive the input signal, thereby obtaining the short-term power level difference (ST-PLD), and then calculating the short-term power level difference ratio (ST-PLDR) from the ST-PLD.

The short-term power level difference can be calculated as in Equation 12 below.

[Equation 12]

Here, the smoothing parameter is 0.3, and the short-term power level difference ratio can be expressed as in Equation 13 below.

[Equation 13]

Here, the quantities used in Equation 13 are defined in turn by Equations 14, 15, and 16 below.

[Equation 14]

[Equation 15]

[Equation 16]

The phase vector basis vector is described in detail below.

Among the basis vector extraction units, the phase vector extraction unit 224 extracts basis vectors from the input signal by expressing the input signals received through the two microphones in vector form based on their discrete Fourier transform coefficients, eigen-decomposing the correlation matrix, and normalizing the resulting eigenvector matrix to calculate the phase vector.

Equation 1 described above can be written in vector form as in Equation 17 below.

[Equation 17]

The correlation matrix of the above expression can be calculated using eigen-decomposition as in Equation 18 below.

[Equation 18]

Here, the two factors of the decomposition are the unitary eigenvector matrix and the diagonal eigenvalue matrix, respectively. The principal eigenvector matrix, which has the largest eigenvalue, can be expressed as in Equation 19 below.

[Equation 19]

Normalizing by the first component of the matrix gives Equation 20 below.

[Equation 20]

From the above, the phase vector can be calculated as in Equation 21 below.

[Equation 21]
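The eigen-decomposition and normalization steps can be sketched for a single frequency bin as follows. The sketch assumes that the correlation matrix is estimated by averaging the outer product of the two-microphone DFT vector over a few frames, and it uses the closed-form eigen-decomposition of a 2×2 Hermitian matrix; the function names and the toy frames are hypothetical.

```python
import cmath
import math


def principal_eigenvector_2x2(a, b, d):
    """Principal eigenvector of the Hermitian matrix [[a, b], [conj(b), d]]."""
    mean = 0.5 * (a + d)
    radius = math.sqrt((0.5 * (a - d)) ** 2 + abs(b) ** 2)
    largest = mean + radius                  # largest eigenvalue
    if abs(b) > 0.0:
        return (largest - d, b.conjugate())
    return (1.0, 0.0) if a >= d else (0.0, 1.0)


def phase_vector(frames):
    """Phases of the principal eigenvector after normalising by its first entry.

    frames is a list of (y1, y2) complex DFT coefficients of the two microphones.
    """
    n = len(frames)
    # Sample correlation matrix R = E[y y^H], estimated by a plain frame average.
    a = sum(abs(y1) ** 2 for y1, _ in frames) / n
    b = sum(y1 * y2.conjugate() for y1, y2 in frames) / n
    d = sum(abs(y2) ** 2 for _, y2 in frames) / n
    v1, v2 = principal_eigenvector_2x2(a, b, d)
    if v1 == 0:
        raise ValueError("first component is zero; cannot normalise")
    normalised = (1.0, v2 / v1)
    return [cmath.phase(c) for c in normalised]


# Toy example: mic 2 sees the same source attenuated by 0.8 and delayed by 0.5 rad.
transfer = 0.8 * cmath.exp(-0.5j)
frames = [(1 + 0j, transfer), (2j, 2j * transfer)]
phases = phase_vector(frames)
```

After normalization the first entry always has zero phase, so the informative part of the phase vector is the inter-microphone phase difference carried by the second entry, here about -0.5 rad.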

The coherence basis vector is described in detail below.

Among the basis vector extraction units, the coherence extraction unit 225 extracts basis vectors from the input signal by obtaining a coherence function that reflects the power spectral densities of the signals received by the two microphones, their cross power spectral density, and the cross spectral density of the noise signal based on the long-term power level difference ratio.

The coherence function can be calculated from Equation 2 as follows.

[Equation 22]

Here, the two auto-spectral terms are the power spectral densities of the signals received by the respective microphones, and the cross term is the cross power spectral density between the two microphones.

The remaining term is the cross power spectral density of the noise signal and can be expressed as in Equation 23 below.

[Equation 23]
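A coherence computation of this kind can be sketched as below. The exact form of Equation 22 is not available here, so the sketch assumes the common definition (cross power spectral density divided by the geometric mean of the two auto power spectral densities) and assumes that the noise cross power spectral density is simply subtracted from the numerator. The function names and the `noise_cross_psd` parameter are hypothetical.

```python
import math


def estimate_psds(frames):
    """Average auto and cross PSDs over frames of (y1, y2) DFT coefficients."""
    n = len(frames)
    psd_11 = sum(abs(y1) ** 2 for y1, _ in frames) / n
    psd_22 = sum(abs(y2) ** 2 for _, y2 in frames) / n
    cross_12 = sum(y1 * y2.conjugate() for y1, y2 in frames) / n
    return psd_11, psd_22, cross_12


def coherence(psd_11, psd_22, cross_12, noise_cross_psd=0j):
    """Complex coherence with an assumed subtraction of the noise cross PSD."""
    denom = math.sqrt(psd_11 * psd_22)
    if denom == 0.0:
        return 0j
    return (cross_12 - noise_cross_psd) / denom


# Fully coherent pair: mic 2 is a scaled copy of mic 1 in every frame.
coherent = [(1 + 0j, 0.5 + 0j), (0 + 2j, 0 + 1j)]
# Incoherent pair: the relative sign flips between frames, so the cross term cancels.
incoherent = [(1 + 0j, 1 + 0j), (1 + 0j, -1 + 0j)]

coh_high = abs(coherence(*estimate_psds(coherent)))
coh_low = abs(coherence(*estimate_psds(incoherent)))
```

A directional source such as a nearby talker yields a magnitude near 1, while diffuse or uncorrelated noise yields a magnitude near 0, which is why coherence is useful as spatial information for voice detection.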

The voice detection apparatus determines that the input signal corresponds to a speech signal (H₁) when the derived value is greater than the threshold, and that it corresponds to a non-speech signal (H₀) when the derived value is smaller than the threshold.

A method of estimating the speech presence probability using the deep neural network is described in more detail below.

The basis vectors for voice detection may consist of the long-term power level difference ratio (LT-PLDR), the short-term power level difference ratio (ST-PLDR), the coherence function, and the phase vector. The statistical-model-parameter basis vectors obtained from the statistical model are input to the trained deep neural network and re-expressed through multiple hidden layers as more discriminative basis vectors, and the presence (speech segment) or absence (non-speech segment) of speech is finally expressed as a probability, as in the following equation.

[Equation 24]

Here, Z(n) denotes the basis vector obtained at the n-th frame, and W_i and b_i denote the weight matrix and the bias vector of the i-th hidden layer, respectively. The activation function may be a sigmoid function. Because the voice detection apparatus considers two cases, the presence and the absence of speech, the output layer of the deep neural network consists of two nodes, with target values of [1 0] for speech presence and [0 1] for speech absence. From these values, the presence or absence of speech is finally determined according to the following equation.

[Equation 25]
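The forward pass and the final decision can be sketched as follows. The layer sizes, the hand-set toy weights, and the decision rule (thresholding the first output node at 0.5) are assumptions made for illustration; the text only states that sigmoid activations are used and that the output layer has two nodes with targets [1 0] and [0 1].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(inputs, weights, bias):
    """One fully connected layer with a sigmoid activation."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)]


def forward(z, layers):
    """Propagate the basis vector Z(n) through every (W_i, b_i) pair in order."""
    activation = z
    for weights, bias in layers:
        activation = dense(activation, weights, bias)
    return activation  # [speech node, non-speech node]


def is_speech(output, threshold=0.5):
    """Assumed decision rule: speech if the first output node exceeds the threshold."""
    return output[0] > threshold


# Toy network: 4 inputs (e.g. LT-PLDR, ST-PLDR, coherence, phase), 2 hidden units, 2 outputs.
layers = [
    ([[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]], [0.0, 0.0]),
    ([[4.0, -4.0], [-4.0, 4.0]], [0.0, 0.0]),
]

speech_like = forward([1.0, 1.0, 1.0, 1.0], layers)
noise_like = forward([-1.0, -1.0, -1.0, -1.0], layers)
```

In a trained network the weights would come from the pre-training and fine-tuning stages rather than being set by hand, and the threshold could be tuned to trade detection probability against false alarms.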

To verify the performance of the voice detection method according to this embodiment, experiments were conducted in various noise environments. For training and testing, speech from four male and four female speakers was recorded at distances of 1 m, 3 m, and 5 m, with the phase between the speech signal and the noise signal at 0°, 90°, and 180°.

Table 1 below shows the performance of the existing single-basis-vector methods and the proposed voice activity detection technique when the phase between the speech signal and the noise signal is 0°.

Here, P_sh is the probability of correctly detecting segments in which speech is present, and P_nh is the probability of correctly detecting segments in which speech is absent; higher values indicate better performance.
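The two metrics can be computed from frame-level labels and decisions as in the following sketch. The function name and the toy labels are hypothetical and do not reproduce any result from the tables.

```python
def hit_rates(labels, decisions):
    """Return (P_sh, P_nh): hit rates on speech frames and on non-speech frames.

    labels and decisions are sequences of 1 (speech) and 0 (non-speech).
    """
    speech_total = sum(1 for y in labels if y)
    nonspeech_total = len(labels) - speech_total
    speech_hits = sum(1 for y, d in zip(labels, decisions) if y and d)
    nonspeech_hits = sum(1 for y, d in zip(labels, decisions) if not y and not d)
    p_sh = speech_hits / speech_total if speech_total else float("nan")
    p_nh = nonspeech_hits / nonspeech_total if nonspeech_total else float("nan")
    return p_sh, p_nh


labels = [1, 1, 1, 0, 0]      # hand-labelled ground truth per frame
decisions = [1, 1, 0, 0, 1]   # detector output per frame
p_sh, p_nh = hit_rates(labels, decisions)
```

Here two of the three speech frames and one of the two non-speech frames are classified correctly, so `p_sh` is 2/3 and `p_nh` is 1/2.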

In Table 1, the best-performing technique is shown in bold. In every noise condition, the proposed technique is more accurate than the existing voice detection techniques that use a single basis vector, and it performs particularly well in the office and factory noise environments.

Tables 2 and 3 below show the performance of the existing single-basis-vector methods and the proposed voice activity detection technique when the phase between the speech signal and the noise signal is 90° and 180°, respectively.

Referring to Tables 2 and 3, as in Table 1, the voice detection method according to this embodiment is more accurate than the existing voice detection method using a single basis vector in every noise condition. Its detection performance is particularly good in the babble and office noise environments.

FIGS. 4 to 6 show ROC curves for the existing power level difference ratio voice detection apparatus and the proposed voice detection apparatus.

FIG. 4 compares the ROC curves of the existing voice detection apparatus and the voice detection apparatus according to an embodiment at a noise phase of 0°, FIG. 5 compares them at a noise phase of 90°, and FIG. 6 compares them at a noise phase of 180°. Here, the existing voice detection apparatus may be a voice detector using a conventional discriminative weight training technique.

Referring to FIGS. 4 to 6, each graph plots the speech detection probability (actual speech detected as speech) on the y-axis against the false-alarm probability (speech-absent segments detected as speech) on the x-axis; the larger the area under the curve, the better the performance. The panels show (a) babble noise, (b) office noise, (c) white noise, and (d) factory noise. The voice detection apparatus according to this embodiment shows superior performance in every noise condition.
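An ROC curve of this kind is obtained by sweeping the decision threshold over the speech presence probability. The sketch below computes the curve points and the area under the curve with the trapezoidal rule; the function names and the toy scores are hypothetical, and the input is assumed to contain at least one speech frame and one non-speech frame.

```python
def roc_points(scores, labels):
    """Return (false-alarm rate, detection rate) pairs, one per distinct threshold."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        detected = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        false_alarms = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        points.append((false_alarms / negatives, detected / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.2]   # toy speech presence probabilities per frame
perfect = area_under_curve(roc_points(scores, [1, 1, 0, 0]))
mixed = area_under_curve(roc_points(scores, [1, 0, 1, 0]))
```

A detector that ranks every speech frame above every non-speech frame reaches an area of 1.0, while the second labelling, in which one non-speech frame outranks a speech frame, gives 0.75.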

According to the embodiments described above, the relative spatial information between the input signals is computed by a voice detector based on the power level difference ratio together with logic that computes spatial information based on the coherence and the phase vector. This spatial information is applied to a deep neural network to model its nonlinear distribution characteristics, and the modeled deep neural network is then applied to the spatial information extracted from an input signal to derive an optimal speech presence probability. In this way, a two-channel microphone-based voice detection apparatus and method using a deep neural network that detect speech-present segments can be provided.

This voice detection technique can be applied to the voice detection module of a speech recognizer as part of end point detection (EPD), halting the recognizer during speech-absent segments, which improves recognition performance and reduces system power consumption. It can also be applied to the voice detection module of a speech coder to manage the bit rate efficiently, so that the limited communication bandwidth of the system is used effectively.

It can also be applied to mobile phone handsets, wireless carriers, voice call services such as KakaoTalk, and speech recognition services such as Google Voice and Siri, as well as to speech signal processing fields such as speech enhancement, speech recognition, and speech coding, in which different algorithms can be applied depending on the presence or absence of speech, to obtain better performance.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that execute on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single unit, but a person of ordinary skill in the art will recognize that it may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

소프트웨어는 컴퓨터 프로그램(computer program), 코드(code), 명령(instruction), 또는 이들 중 하나 이상의 조합을 포함할 수 있으며, 원하는 대로 동작하도록 처리 장치를 구성하거나 독립적으로 또는 결합적으로(collectively) 처리 장치를 명령할 수 있다. 소프트웨어 및/또는 데이터는, 처리 장치에 의하여 해석되거나 처리 장치에 명령 또는 데이터를 제공하기 위하여, 어떤 유형의 기계, 구성요소(component), 물리적 장치, 가상 장치(virtual equipment), 컴퓨터 저장 매체 또는 장치, 또는 전송되는 신호 파(signal wave)에 영구적으로, 또는 일시적으로 구체화(embody)될 수 있다. 소프트웨어는 네트워크로 연결된 컴퓨터 시스템 상에 분산되어서, 분산된 방법으로 저장되거나 실행될 수도 있다. 소프트웨어 및 데이터는 하나 이상의 컴퓨터 판독 가능 기록 매체에 저장될 수 있다.The software may include a computer program, code, instructions, or a combination of one or more of the foregoing, and may be configured to configure the processing device to operate as desired or to process it collectively or collectively Device can be commanded. The software and / or data may be in the form of any type of machine, component, physical device, virtual equipment, computer storage media, or device , Or may be permanently or temporarily embodied in a transmitted signal wave. The software may be distributed over a networked computer system and stored or executed in a distributed manner. The software and data may be stored on one or more computer readable recording media.

실시예에 따른 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다. 상기 컴퓨터 판독 가능 매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 상기 매체에 기록되는 프로그램 명령은 실시예를 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 당업자에게 공지되어 사용 가능한 것일 수도 있다. 컴퓨터 판독 가능 기록 매체의 예에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기 매체(magnetic media), CD-ROM, DVD와 같은 광기록 매체(optical media), 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media), 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. 프로그램 명령의 예에는 컴파일러에 의해 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드를 포함한다. 상기된 하드웨어 장치는 실시예의 동작을 수행하기 위해 하나 이상의 소프트웨어 모듈로서 작동하도록 구성될 수 있으며, 그 역도 마찬가지이다.The method according to an embodiment may be implemented in the form of a program command that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions to be recorded on the medium may be those specially designed and configured for the embodiments or may be available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magnetic media such as floppy disks; Magneto-optical media, and hardware devices specifically configured to store and execute program instructions such as ROM, RAM, flash memory, and the like. Examples of program instructions include machine language code such as those produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited number of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A two-channel microphone-based speech detection method using a deep neural network (DNN), the method comprising:
extracting basic vectors, in a classification stage, from an input signal that is a speech signal contaminated by a noise environment; and
in the classification stage, passing the basic vectors through a pre-trained deep neural network to determine a speech presence probability, and classifying the input signal as a speech section or a non-speech section,
wherein the input signal is input from a plurality of microphones and includes relative spatial information between the input signals,
wherein the extracting of the basic vectors from the input signal comprises:
eigen-decomposing a correlation matrix obtained by expressing the input signals received through two microphones in vector form on a discrete Fourier transform basis; and
calculating a phase vector by normalizing the eigenvector matrix obtained by the eigen-decomposition,
and wherein the basic vectors are
a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.
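
As an illustration of the phase-vector step recited above, the following Python sketch eigen-decomposes a per-bin 2x2 correlation matrix and normalizes the principal eigenvector. It is only a sketch under assumptions that the claim does not state: the function name `phase_vector`, the averaging of the correlation matrix over frames, the use of the principal eigenvector alone, and normalization by its first element are all hypothetical choices.

```python
import numpy as np

def phase_vector(X1, X2):
    """Per-bin inter-microphone phase from the principal eigenvector.

    X1, X2: complex DFT frames of the two microphones, shape (frames, bins).
    Returns one phase value (radians) per frequency bin.
    """
    n_frames, n_bins = X1.shape
    phases = np.empty(n_bins)
    for k in range(n_bins):
        x = np.stack([X1[:, k], X2[:, k]])   # 2 x frames DFT-domain vector
        R = x @ x.conj().T / n_frames        # 2x2 Hermitian correlation matrix
        _, V = np.linalg.eigh(R)             # eigenvalues in ascending order
        v = V[:, -1]                         # principal eigenvector
        v = v / v[0]                         # assumed normalization: first entry = 1
        phases[k] = np.angle(v[1])           # phase of microphone 2 relative to 1
    return phases
```

Normalizing by the first element removes the arbitrary complex scale that an eigen-solver leaves on each eigenvector, so the remaining angle depends only on the phase difference between the two channels.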

The method according to claim 1,
further comprising training the deep neural network (DNN),
wherein the training of the deep neural network comprises:
extracting basic vectors, in a training stage, after performing a discrete Fourier transform (DFT) on a speech signal contaminated by an ambient noise environment; and
in the training stage, training the deep neural network through a pre-training process and a fine-tuning process using the basic vectors extracted in each noise environment.

(Deleted)

A two-channel microphone-based speech detection method using a deep neural network (DNN), the method comprising:
extracting basic vectors, in a classification stage, from an input signal that is a speech signal contaminated by a noise environment; and
in the classification stage, passing the basic vectors through a pre-trained deep neural network to determine a speech presence probability, and classifying the input signal as a speech section or a non-speech section,
wherein the input signal is input from a plurality of microphones and includes relative spatial information between the input signals,
wherein the extracting of the basic vectors from the input signal comprises:
estimating a long-term power level difference (LT-PLD) by applying a recursive averaging technique to a power level difference (PLD) between the two microphones through which the input signal is input; and
calculating the long-term power level difference ratio (LT-PLDR) from the long-term power level difference (LT-PLD),
wherein the extracting of the basic vectors from the input signal further comprises
obtaining a coherence function by reflecting the power spectral density and the cross power spectral density of the two microphone inputs, and the cross spectral density of the noise signal based on the long-term power level difference ratio,
and wherein the basic vectors are
a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.
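
The coherence feature recited above can be illustrated with recursively averaged spectral densities. The Python sketch below computes a plain magnitude-squared coherence between the two channels. It is a simplification, not the claimed computation: the noise cross-spectral-density term based on the LT-PLDR is omitted, and the function name `coherence`, the smoothing factor `alpha`, and the regularizer `eps` are assumptions.

```python
import numpy as np

def coherence(X1, X2, alpha=0.9, eps=1e-12):
    """Magnitude-squared coherence per frame and frequency bin.

    X1, X2: complex DFT frames of the two microphones, shape (frames, bins).
    alpha:  recursive-averaging factor (assumed value).
    """
    p11 = np.abs(X1[0]) ** 2          # auto-PSD estimate, microphone 1
    p22 = np.abs(X2[0]) ** 2          # auto-PSD estimate, microphone 2
    p12 = X1[0] * np.conj(X2[0])      # cross-PSD estimate
    msc = np.empty(X1.shape, dtype=float)
    for t in range(X1.shape[0]):
        # First-order recursive averaging of each spectral density
        p11 = alpha * p11 + (1 - alpha) * np.abs(X1[t]) ** 2
        p22 = alpha * p22 + (1 - alpha) * np.abs(X2[t]) ** 2
        p12 = alpha * p12 + (1 - alpha) * X1[t] * np.conj(X2[t])
        msc[t] = np.abs(p12) ** 2 / (p11 * p22 + eps)
    return msc
```

A directional source that reaches both microphones yields values near 1, while uncorrelated noise drives the value toward 0, which is why coherence carries spatial information that a single-channel feature cannot.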

The method according to claim 1 or 4,
wherein the extracting of the basic vectors from the input signal comprises:
estimating a short-term power level difference (ST-PLD) by applying a recursive averaging technique to a power level difference (PLD) between the two microphones through which the input signal is input; and
calculating the short-term power level difference ratio (ST-PLDR) from the short-term power level difference (ST-PLD).
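
The recursive averaging recited for the LT-PLD and ST-PLD can be sketched as a first-order smoother applied to the per-bin power difference. This is a sketch under assumptions: the claims do not give the ratio formula, so the ratio is taken here as the smoothed difference divided by the smoothed total power, and the name `smoothed_pldr` and the smoothing factors 0.98 and 0.5 are hypothetical.

```python
import numpy as np

def smoothed_pldr(P1, P2, alpha, eps=1e-12):
    """Recursively averaged power level difference ratio.

    P1, P2: power spectra of the two microphones, shape (frames, bins).
    alpha:  smoothing factor; close to 1 gives a long-term estimate,
            a smaller value gives a short-term estimate (assumed usage).
    """
    pld = np.zeros(P1.shape[1])       # smoothed power level difference
    total = np.zeros(P1.shape[1])     # smoothed total power (assumed normalizer)
    out = np.empty(P1.shape, dtype=float)
    for t in range(P1.shape[0]):
        pld = alpha * pld + (1 - alpha) * (P1[t] - P2[t])
        total = alpha * total + (1 - alpha) * (P1[t] + P2[t])
        out[t] = pld / (total + eps)
    return out

# Hypothetical usage: the same routine with two smoothing factors.
P1 = np.full((10, 3), 4.0)
P2 = np.full((10, 3), 1.0)
lt_pldr = smoothed_pldr(P1, P2, alpha=0.98)   # long-term ratio
st_pldr = smoothed_pldr(P1, P2, alpha=0.5)    # short-term ratio
```

With this normalization the ratio stays between -1 and 1 for non-negative powers, so the long-term and short-term features share a common scale before they are fed to the network.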

(Deleted)

The method according to claim 1 or 4,
wherein the classifying of the input signal as a speech section or a non-speech section comprises
inputting the basic vectors to the trained deep neural network, re-representing them as basic vectors having discriminative power through a plurality of hidden layers, and finally classifying the input signal as the speech section or the non-speech section according to the speech presence probability.

9. The method of claim 8,
wherein the input signal is determined to be the speech section when the speech presence probability is greater than a preset threshold, and is determined to be the non-speech section when the speech presence probability is lower than the preset threshold.
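
The classification recited in the two claims above (hidden layers that re-represent the basic vectors, followed by a threshold on the speech presence probability) can be sketched as a small feed-forward pass. The sketch assumes sigmoid activations and a single sigmoid output unit, neither of which is stated in the claims; the function names, layer sizes, random weights, and the threshold 0.5 are hypothetical, and the weights would in practice come from the pre-training and fine-tuning stage.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def speech_presence_probability(features, weights, biases):
    """Forward pass: features (frames, dims) -> probability per frame."""
    h = features
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)        # each layer re-represents the features
    return h[:, 0]                    # assumed single-unit output layer

def classify(prob, threshold=0.5):
    """True marks a speech frame, False a non-speech frame."""
    return prob > threshold

# Hypothetical usage with untrained random weights (illustration only).
rng = np.random.default_rng(0)
feats = rng.standard_normal((5, 4))                   # 5 frames, 4 features
Ws = [rng.standard_normal((4, 8)), rng.standard_normal((8, 1))]
bs = [np.zeros(8), np.zeros(1)]
prob = speech_presence_probability(feats, Ws, bs)
decision = classify(prob, threshold=0.5)
```

Keeping the threshold as a separate argument lets the operating point be tuned without retraining the network.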

A two-channel microphone-based speech detection apparatus using a deep neural network (DNN), the apparatus comprising:
an input unit that receives an input signal that is a speech signal contaminated by a noise environment;
a basic vector extraction unit that extracts basic vectors from the input signal;
a deep neural network application unit that passes the basic vectors through a pre-trained deep neural network; and
a speech presence probability determination unit that determines a speech presence probability of the basic vectors and classifies the input signal as a speech section or a non-speech section,
wherein the input signal is input from a plurality of microphones and includes relative spatial information between the input signals,
wherein the basic vector extraction unit
eigen-decomposes a correlation matrix obtained by expressing the input signals received through two microphones in vector form on a discrete Fourier transform basis, and calculates a phase vector by normalizing the eigenvector matrix obtained by the eigen-decomposition,
and wherein the basic vectors are
a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.

11. The apparatus of claim 10,
further comprising a training unit that trains the deep neural network (DNN),
wherein the training unit comprises:
an input unit of the training unit that receives, in a training stage, a speech signal contaminated by an ambient noise environment;
a discrete Fourier transform unit that performs a discrete Fourier transform (DFT) on the received contaminated speech signal;
a basic vector extraction unit of the training unit that extracts basic vectors after the discrete Fourier transform; and
a pre-training unit and a fine-tuning unit that train the deep neural network through a pre-training process and a fine-tuning process using the basic vectors extracted in each noise environment.

(Deleted)

A two-channel microphone-based speech detection apparatus using a deep neural network (DNN), the apparatus comprising:
an input unit that receives an input signal that is a speech signal contaminated by a noise environment;
a basic vector extraction unit that extracts basic vectors from the input signal;
a deep neural network application unit that passes the basic vectors through a pre-trained deep neural network; and
a speech presence probability determination unit that determines a speech presence probability of the basic vectors and classifies the input signal as a speech section or a non-speech section,
wherein the input signal is input from a plurality of microphones and includes relative spatial information between the input signals,
wherein the basic vector extraction unit
estimates a long-term power level difference (LT-PLD) by applying a recursive averaging technique to a power level difference (PLD) between the two microphones through which the input signal is input, calculates a long-term power level difference ratio (LT-PLDR) from the long-term power level difference (LT-PLD), and obtains a coherence function by reflecting the power spectral density and the cross power spectral density of the two microphone inputs, and the cross spectral density of the noise signal based on the long-term power level difference ratio,
and wherein the basic vectors are
a long-term power level difference ratio (LT-PLDR), a short-term power level difference ratio (ST-PLDR), a coherence function, and a phase vector.

(Deleted)

14. The apparatus according to claim 10 or 13,
wherein the speech presence probability determination unit
inputs the basic vectors to the trained deep neural network, re-represents them as basic vectors having discriminative power through a plurality of hidden layers, and finally classifies the input signal as the speech section or the non-speech section according to the speech presence probability.