KR20180046062A

KR20180046062A - Method for speech endpoint detection using normalization and apparatus thereof

Info

Publication number: KR20180046062A
Application number: KR1020160140843A
Authority: KR
Inventors: 반성민
Original assignee: 에스케이텔레콤 주식회사
Priority date: 2016-10-27
Filing date: 2016-10-27
Publication date: 2018-05-08
Also published as: KR101893789B1

Abstract

The present invention relates to a method for determining a voice section in an input voice signal. More specifically, it relates to a method for determining a voice section using normalization, and a device therefor, which detect a voice endpoint by using a deep neural network (DNN) and, before determining the voice section from that endpoint, perform normalization in window units that take the input layer of the DNN into account, thereby enabling faster processing. In addition, the normalization is performed after a masking process, so that the voice section is detected more accurately even in various noise environments. The method comprises the steps of: extracting voice feature vectors from the input voice signal; performing normalization on a series of the voice feature vectors according to the input layer of the DNN; and setting the normalized voice feature vectors as the input of the DNN to determine the voice section in the voice signal.

Description

Method for determining a speech interval using normalization, and speech interval determination apparatus therefor

The present invention relates to a method for determining a speech interval in an input speech signal and, more particularly, to a method for determining a speech interval using normalization that supports real-time processing while remaining more robust in various noise environments, and to a speech interval determination apparatus therefor.

The contents described in this section merely provide background information on the present embodiment and do not constitute prior art.

To improve speech recognition performance, the interval of speech that arrives together with noise must be obtained accurately. Accordingly, techniques that detect the speech endpoints of a speech signal, from the start point where speech begins to the end point where it ends, and thereby accurately distinguish speech intervals from non-speech intervals, are becoming increasingly important.

In particular, a feature normalization process is sometimes performed to detect the speech interval more accurately across various noise environments.

Feature normalization methods broadly include Global Mean and Variance Normalization (GMVN), which normalizes the input speech signal by applying a mean and variance calculated in advance from training data, and Utterance Mean and Variance Normalization (UMVN), which calculates the mean and variance from the input speech signal itself and normalizes with them.

Because GMVN applies a precomputed mean and variance, it can run in real time. However, its detection performance degrades when the noise environment of the speech signal differs from that of the training data.

UMVN calculates the mean and variance for each input speech signal, so it can perform better than GMVN in various noise environments. However, the mean and variance can only be calculated once the entire speech signal has been input, so real-time processing is not possible.
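The contrast between the two conventional schemes can be sketched as follows. This is an illustrative example only, not code from the patent: it treats a single scalar feature per frame, and the function names and the "training" statistics are hypothetical.

```python
import math

def mean_std(values):
    """Population mean and standard deviation of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def gmvn(frames, global_mean, global_std):
    """GMVN: apply statistics computed in advance from training data."""
    return [(x - global_mean) / global_std for x in frames]

def umvn(frames):
    """UMVN: statistics come from the whole utterance, so every frame
    must have arrived before the first one can be normalized."""
    mean, std = mean_std(frames)
    std = max(std, 1e-8)  # floor avoids division by zero
    return [(x - mean) / std for x in frames]

# One feature dimension (e.g. a log energy) over a short utterance.
utterance = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
train_mean, train_std = 0.0, 1.0   # hypothetical training statistics

online = gmvn(utterance, train_mean, train_std)  # usable frame by frame
offline = umvn(utterance)                        # needs the full utterance
```

GMVN can normalize each frame as it arrives but depends on how well the training statistics match the current noise; UMVN adapts to the utterance but cannot start until the utterance is complete.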

Therefore, a feature normalization technique is needed that supports real-time processing and is also more robust to various noise environments.

Korean Patent Laid-Open Publication No. 10-1999-001828, published January 15, 1999 (Title: Apparatus and method for extracting speech features by dynamic range normalization of the spectrum)

The present invention has been proposed to solve the above problems of the related art, and an object of the present invention is to provide a method for determining a speech interval using normalization, and a speech interval determination apparatus therefor, that support real-time processing when determining a speech interval in a speech signal while remaining more robust in various noise environments.

In particular, an object of the present invention is to provide a method for determining a speech interval using normalization, and a speech interval determination apparatus therefor, in which, before speech endpoints are detected with a deep neural network (DNN) and used to determine the speech interval, normalization is performed in window units that take the input layer of the DNN into account, enabling faster processing and more accurate detection of the speech interval even in various noise environments.

However, the objects of the present invention are not limited to those described above, and other objects not mentioned will be clearly understood from the following description.

To achieve the above objects, a method for determining a speech interval using normalization according to an embodiment of the present invention is performed by a speech interval determination apparatus that divides a speech signal into frames of a predetermined length and detects a speech interval by determining whether each frame is speech, and may comprise the steps of: extracting speech feature vectors from an input speech signal; performing normalization on a series of the speech feature vectors according to an input layer of a deep neural network (DNN) model; and setting the normalized speech feature vectors as an input of the DNN model to determine the speech interval in the speech signal.

The method may further comprise, before the step of extracting the speech feature vectors: estimating a partial interval of the input speech signal as a noise interval; and masking by adding a noise signal corresponding to the estimated noise interval to the entire speech signal. In this case, the step of extracting the speech feature vectors may extract them from the masked speech signal.

The noise signal may be a training noise signal that was applied when constructing the training data for speech interval determination.

In this case, the masking step may adjust the energy level of the training noise signal to match the energy level of the noise signal before masking.

The step of performing normalization may comprise: setting a splice window of a certain range around an arbitrary frame of the speech signal; calculating a mean and a standard deviation for the splice window using the speech feature vectors extracted for each frequency band in the splice window; and normalizing the splice window using the calculated mean and standard deviation.

The splice window may be set according to the size of the input layer of the DNN model.
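The window-level normalization steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the mean and standard deviation are taken per frequency band across the frames inside the window, truncates the window at the utterance edges, and uses hypothetical names (`normalize_splice_window`, `context`).

```python
import math

def normalize_splice_window(features, center, context):
    """Normalize the frames in [center - context, center + context].

    features: list of frames, each a list of per-band values.
    Returns the window with each band normalized by the mean and
    standard deviation of that band inside the window only.
    """
    lo = max(0, center - context)
    hi = min(len(features), center + context + 1)
    window = features[lo:hi]
    n_bands = len(window[0])
    normalized = [[0.0] * n_bands for _ in window]
    for b in range(n_bands):
        column = [frame[b] for frame in window]
        mean = sum(column) / len(column)
        var = sum((v - mean) ** 2 for v in column) / len(column)
        std = max(math.sqrt(var), 1e-8)  # floor avoids division by zero
        for t, v in enumerate(column):
            normalized[t][b] = (v - mean) / std
    return normalized

# Five frames, two bands; window of 3 frames centred on frame 2.
feats = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [5.0, 50.0]]
win = normalize_splice_window(feats, center=2, context=1)
```

Because the statistics depend only on the frames inside the window, normalization can begin as soon as `context` future frames are available, rather than waiting for the end of the utterance.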

The present invention may further provide a computer-readable recording medium on which a program for executing the above method is recorded.

To achieve the above objects, a speech interval determination apparatus according to an embodiment of the present invention, which divides a speech signal into frames of a predetermined length and detects a speech interval by determining whether each frame is speech, may comprise: a feature vector extraction unit that extracts speech feature vectors from an input speech signal; a normalization unit that performs normalization on a series of the speech feature vectors extracted by the feature vector extraction unit according to an input layer of a deep neural network (DNN) model; and a speech interval determination unit that sets the normalized speech feature vectors as an input of the DNN model to determine the speech interval in the speech signal.

The apparatus may further comprise a masking unit that estimates a partial interval of the input speech signal as a noise interval and performs masking by adding a noise signal corresponding to the estimated noise interval to the entire speech signal.

The method for determining a speech interval using normalization and the apparatus therefor according to the present invention are more robust in various noise environments than conventional normalization methods and can run in real time.

Accordingly, the present invention can distinguish speech intervals from non-speech intervals in a speech signal more accurately, and can detect the speech interval and perform speech recognition more quickly and accurately in various noise environments.

As a result, the present invention can improve efficiency in speech signal processing fields such as speech recognition, speaker recognition, and sound quality enhancement.

In addition, various effects other than those described above may be disclosed directly or implicitly in the detailed description of the embodiments of the present invention given below.

FIG. 1 is a block diagram illustrating a speech recognition system according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating a speech interval according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating a speech interval determination apparatus according to an embodiment of the present invention.
FIGS. 4 to 6 are exemplary diagrams illustrating a method for determining a speech interval using normalization according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method for determining a speech interval using normalization according to an embodiment of the present invention.

To make the features and advantages of the present invention's solution clearer, the present invention is described in more detail with reference to specific embodiments illustrated in the accompanying drawings.

In the following description and the accompanying drawings, detailed descriptions of well-known functions or configurations that may obscure the subject matter of the present invention are omitted. Note also that the same components are denoted by the same reference numerals wherever possible throughout the drawings.

The terms and words used in the following description and drawings should not be construed as limited to their ordinary or dictionary meanings. Based on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way, they should be interpreted with meanings and concepts consistent with the technical idea of the present invention. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so it should be understood that various equivalents and modifications that could replace them may exist at the time of filing.

Terms including ordinal numbers, such as first and second, are used to describe various components and serve only to distinguish one component from another; they do not limit the components. For example, without departing from the scope of the present invention, a second component may be referred to as a first component, and similarly a first component may be referred to as a second component.

When a component is described as being "connected" or "coupled" to another component, this means it may be connected or coupled logically or physically. In other words, the component may be directly connected or coupled to the other component, but other components may exist in between, and the connection or coupling may be indirect.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" in this specification specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, a method for determining a speech interval using normalization and a speech interval determination apparatus therefor according to an embodiment of the present invention are described.

FIG. 1 is a block diagram illustrating a speech recognition system according to an embodiment of the present invention, and FIG. 2 is an exemplary diagram illustrating a speech interval according to an embodiment of the present invention.

Referring to FIG. 1, a speech recognition system 500 to which an embodiment of the present invention is applied may include a speech interval determination apparatus 100, a feature extraction apparatus 200, and a speech recognition apparatus 300. These may exist as modules within a single standalone speech recognition system 500, or as separate, independent devices that are physically or logically connected through a separate communication channel. The speech signal input to the speech recognition system 500 may be an analog speech signal, in which case the system may further include an analog-to-digital (A/D) converter that converts it into a digital speech signal. Depending on the implementation, the speech interval determination apparatus 100 may perform this conversion.

Turning to the individual components of the speech recognition system 500, the speech interval determination apparatus 100 detects the endpoints of an input speech signal and determines the speech interval. For example, as shown in FIG. 2, the speech interval determination apparatus 100 may detect the endpoint at which speech starts and the endpoint at which speech ends, determine the span between them to be the speech interval (B), and determine the intervals outside the speech interval (B) to be noise intervals (A, C).
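As a simple illustration of how the two endpoints delimit the speech interval (B), the sketch below assumes that a per-frame speech/non-speech decision is already available and that the signal contains a single speech interval. The function name and the boolean-label representation are assumptions made for this example.

```python
def find_speech_interval(is_speech):
    """Return (start, end) frame indices of the speech interval,
    or None when no frame is labelled as speech."""
    speech_frames = [i for i, flag in enumerate(is_speech) if flag]
    if not speech_frames:
        return None
    return speech_frames[0], speech_frames[-1]

# Noise (A), speech (B), noise (C), as in the FIG. 2 description.
labels = [False, False, True, True, True, False, False]
interval = find_speech_interval(labels)  # (2, 4)
```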

In particular, the speech interval determination apparatus 100 according to an embodiment of the present invention performs a masking process and a normalization process so that it is robust to various noise environments and can also determine the speech interval quickly.

In other words, to distinguish noise from the speech interval more clearly, the speech interval determination apparatus 100 estimates an initial partial interval of the input speech signal to be noise and performs a masking process that adds the speech features corresponding to the estimated noise to every frame of the speech signal.

Next, the speech interval determination apparatus 100 extracts speech feature vectors frame by frame from the masked speech signal. The speech feature vector may be, for example, energy information or pitch information of the speech signal. When energy information is used, the speech feature vector may be extracted by performing a Fourier transform on the speech signal of a frame to obtain spectral information, generating filter bank energies from the spectrum, and applying a log function to the filter bank energies to obtain log filter bank energies.

The speech interval determination apparatus 100 then determines the speech interval not from a single speech feature vector alone, but by applying a sequence of speech feature vectors from a certain interval as the input values of the input layer of a neural network model.

The more specific configuration and operation of the speech interval determination apparatus 100 are described later.

Meanwhile, the feature extraction apparatus 200 extracts feature vectors for speech recognition from the speech interval detected by the speech interval determination apparatus 100. As described above, the feature vectors for speech recognition may use various speech features in addition to the energy information of the speech signal, and various feature extraction algorithms may be applied to extract them. For example, the feature extraction apparatus 200 may extract speech feature parameters using feature extraction algorithms such as Log Mel-Filterbank Energy (LMFE), Mel-Frequency Cepstral Coefficients (MFCC), Linear Prediction Cepstral Coefficients (LPCC), Perceptual Linear Prediction Cepstral Coefficients (PLPCC), Ensemble Interval Histogram (EIH), and Short-time Modified Coherence (SMC).

The feature extraction apparatus 200 then passes the extracted feature vectors to the speech recognition apparatus 300.

The speech recognition apparatus 300 recognizes the feature vectors extracted by the feature extraction apparatus 200 using a pre-built acoustic model 400 and outputs the resulting speech recognition result. The speech recognition apparatus 300 may perform speech recognition by applying various speech recognition algorithms such as the Hidden Markov Model (HMM), Dynamic Time Warping (DTW), and neural networks.

The acoustic model 400 is built by statistically modeling phonemes. Although only the acoustic model 400 is shown in the drawing, the similarity may be calculated by also taking into account various other models, such as a language model that helps produce grammatically appropriate recognition results and a pronunciation dictionary that helps produce results conforming to standard pronunciation, and the final speech recognition result may be derived accordingly.

In addition, the results output by the feature extraction apparatus 200 and by the speech recognition apparatus 300 may be continuously reflected in the acoustic model 400 to build up the acoustic model 400.

The speech interval determination apparatus 100 described above is now described in more detail.

FIG. 3 is a block diagram illustrating a speech interval determination apparatus according to an embodiment of the present invention, and FIGS. 4 to 6 are exemplary diagrams illustrating a method for determining a speech interval using normalization according to an embodiment of the present invention.

Referring first to FIG. 3, the speech interval determination apparatus 100 may include a masking processing unit 10, a feature vector extraction unit 20, a normalization unit 30, and a speech interval determination unit 40. The speech interval determination apparatus 100 may further include a frame forming unit (not shown) that splits the input speech signal into frames of a fixed length. The frame forming unit (not shown) may produce a frame of 20 to 30 msec in length every 10 msec.
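The framing described for the frame forming unit can be sketched as below. The 25 msec frame length (within the stated 20 to 30 msec range), the 16 kHz sampling rate, and the function name are assumptions chosen for illustration; trailing samples that do not fill a whole frame are simply dropped in this sketch.

```python
def split_into_frames(samples, sample_rate, frame_ms=25, shift_ms=10):
    """Cut a signal into overlapping frames: a new frame starts every
    shift_ms milliseconds and lasts frame_ms milliseconds."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

signal = [0.0] * 16000   # one second of silence at an assumed 16 kHz
frames = split_into_frames(signal, sample_rate=16000)
```

With these values each frame holds 400 samples and consecutive frames overlap by 240 samples.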

The masking processing unit 10 of the speech interval determination apparatus 100 performs preprocessing so that noise can be detected more clearly in the input speech signal. As shown in FIG. 4, it estimates an initial partial interval of the input speech signal to be a noise interval (A), and then adds a noise signal corresponding to the estimated noise interval (A) to the entire speech signal. The masking processing unit 10 estimates the noise interval from the initial part of the speech signal, on the assumption that this initial part is noise, so that the energy added to the speech signal is at a lower level than the speech features of the speech interval, such as the speech energy, and the speech signal is therefore not distorted.

As shown in FIG. 4, the masking processing unit 10 adds the noise signal corresponding to the noise interval (A) to the frames of the entire speech signal. This makes the magnitude of the speech signal in the noise interval (A) more distinct, so that speech intervals and non-speech intervals can be told apart more clearly.

The masking processing unit 10 may add, as the noise signal, a signal at the average energy level of the estimated noise interval, or it may add the noise signal that was applied when building the training DB 50 containing the training data for speech interval determination. Through this masking process, the speech interval can be determined more clearly even in a noise interval that was not considered when the training data was built.

The masking performed by the masking processing unit 10 can be computed by the following equation.

[Equation image not reproduced]

The equation is defined in terms of the input speech signal and the noise signal used to mask it. The root mean square (RMS) energy of each of the two signals over the samples of the initial interval is used to scale the noise signal to the initial noise level of the input speech signal, after which masking is performed.
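Since the equation itself is not reproduced here, the following is only a sketch of the procedure as described in the prose: scale the masking noise so that its RMS over the initial samples matches that of the input signal, then add it to the whole signal. The names `rms`, `mask_signal`, and `n_init` are hypothetical, and the noise is assumed to be at least as long as the speech signal.

```python
import math

def rms(samples):
    """Root mean square of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mask_signal(speech, noise, n_init):
    """Add `noise` to `speech` after scaling it so that its RMS over
    the first n_init samples equals that of `speech`."""
    speech_level = rms(speech[:n_init])
    noise_level = rms(noise[:n_init])
    gain = speech_level / noise_level if noise_level > 0 else 0.0
    return [s + gain * n for s, n in zip(speech, noise)]

# Toy signals: quiet lead-in followed by louder "speech".
speech = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5]
masked = mask_signal(speech, noise, n_init=4)
```

Because the gain is derived from the quiet initial interval, the added noise stays well below the level of the later speech samples, which is consistent with the stated aim of not distorting the speech.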

The feature vector extraction unit 20 may extract speech feature vectors from the masked signal produced by the masking processing unit 10.

In particular, the feature vector extraction unit 20 may extract the vector of Log Mel-Filterbank Energy (LMFE) values as the speech feature vector. To do so, it performs a Fourier transform on the speech signal frame by frame to extract spectral information and calculates the filter bank energies of the spectrum. That is, the frequency band is divided into several filter banks, and a log function is applied to the energy in each bank to obtain the log filter bank energies. In this way, the feature vector extraction unit 20 can calculate a speech feature vector of about 40 dimensions for each frame.
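The Fourier transform, filter bank, and log steps can be sketched as follows. This is an illustrative example under several assumptions that are not in the source: a naive DFT without windowing, triangular filters evenly spaced on a commonly used mel scale formula, and a toy configuration of 8 bands rather than the roughly 40 mentioned above. All function names are hypothetical.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0 .. len(frame) // 2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def log_mel_energies(frame, sample_rate, n_bands):
    """Log energy in n_bands triangular filters spaced evenly in mel."""
    spec = power_spectrum(frame)
    nyquist = sample_rate / 2.0
    top = hz_to_mel(nyquist)
    # n_bands + 2 filter edges, evenly spaced in mel, as fractional bin indices
    edges = [mel_to_hz(top * i / (n_bands + 1)) * (len(spec) - 1) / nyquist
             for i in range(n_bands + 2)]
    feats = []
    for b in range(n_bands):
        left, mid, right = edges[b], edges[b + 1], edges[b + 2]
        energy = 0.0
        for k, p in enumerate(spec):
            if left < k <= mid:        # rising slope of the triangle
                energy += p * (k - left) / (mid - left)
            elif mid < k < right:      # falling slope of the triangle
                energy += p * (right - k) / (right - mid)
        feats.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return feats

# A 1 kHz tone sampled at 8 kHz, one 64-sample frame.
frame = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(64)]
vector = log_mel_energies(frame, sample_rate=8000, n_bands=8)
```

The returned list plays the role of one frame's speech feature vector; a practical implementation would use an FFT and a window function instead of the naive DFT shown here.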

The speech interval determination unit 40 could determine the speech interval directly from the speech feature vectors extracted by the feature vector extraction unit 20. However, to determine the speech interval more accurately while allowing for real-time processing and varied noise environments, the present invention has the normalization unit 30 perform a normalization process that converts the speech feature vectors to values between 0 and 1.

That is, the normalization unit 30 normalizes the speech feature vectors extracted by the feature vector extraction unit 20 to values between 0 and 1, and it does so in consideration of the input layer of the neural network learning model.

More specifically, the speech interval determination unit 40, described later, determines the speech interval using a neural network learning model, in particular a deep neural network model. As shown in FIG. 5, the deep neural network model may consist of an input layer, hidden layers, and an output layer, and it derives the output value of the output layer from the input values fed to the input layer. Rather than using only the single speech feature vector extracted from one frame as the input, the speech interval determination unit 40 uses a sequence of normalized speech feature vectors formed by concatenating several adjacent speech feature vectors.

Here, the series of speech feature vectors is determined by the size of the input layer of the deep neural network model, and the normalization unit 30 handles that input layer size as a splice window 600. That is, as shown in FIG. 6, the normalization unit 30 normalizes the speech feature vectors in units of the splice window. For example, it computes the mean and variance over the frames within a certain range (m-5 to m+5) around the current frame (m) and uses them to normalize the features within that range.

The normalization process according to an embodiment of the present invention can be defined by the following equation.

Here, the two statistics in the equation are the mean and the standard deviation, estimated over the splice window, of the speech feature vector for a given frequency band of a given frame in a given input speech signal. The equation also uses the total number of frames in the splice window, which is expressed in terms of the size of the splice window.
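One form of the splice-window normalization that is consistent with this description is sketched below. Every symbol is an assumption introduced for illustration: x_t(k) is the feature of frequency band k in frame t, m is the current frame, W is the half-width of the splice window (W = 5 in the m-5 to m+5 example, giving T = 11 frames), and the utterance index is omitted. Whether the original divides by T or by T - 1 in the standard deviation is not recoverable from the text.

```latex
\mu_m(k) = \frac{1}{T} \sum_{t=m-W}^{m+W} x_t(k), \qquad
\sigma_m(k) = \sqrt{\frac{1}{T} \sum_{t=m-W}^{m+W} \bigl(x_t(k) - \mu_m(k)\bigr)^2}, \qquad
\hat{x}_t(k) = \frac{x_t(k) - \mu_m(k)}{\sigma_m(k)}, \qquad T = 2W + 1
```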

In this way, the normalization unit 30 normalizes each frequency band of the speech feature vector of every frame that belongs to the splice window of a given frame of a given input speech signal, using the speech feature vectors within that splice window; the result is the normalized speech feature vector.

The sequence of normalized speech feature vectors within the splice window of a given frame is formed from the normalized vectors of the frames in that window, with the utterance index omitted. Each element of the sequence is the normalized multi-dimensional speech feature vector of one frame. The normalized speech feature vector sequence is used as the input value of the input layer of the deep neural network.
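The per-window normalization and concatenation can be sketched as below. The function name, the `half_width` parameter, the edge handling (repeating the first or last frame), and the small `eps` guard are assumptions. The sketch uses mean and standard deviation as the description states, which gives each band zero mean and unit variance within the window; how that relates to the "between 0 and 1" wording elsewhere in the text is not settled here.

```python
import numpy as np

def normalize_splice_window(feats, m, half_width=5, eps=1e-8):
    """Return the normalized, concatenated feature vectors of the splice
    window centred on frame `m`. `feats` has shape (n_frames, n_bands)."""
    feats = np.asarray(feats, dtype=np.float64)
    n_frames = feats.shape[0]
    # Assumption: at the utterance edges, repeat the first or last frame.
    idx = np.clip(np.arange(m - half_width, m + half_width + 1), 0, n_frames - 1)
    window = feats[idx]                  # (2 * half_width + 1, n_bands)
    mean = window.mean(axis=0)           # per-band mean over the window
    std = window.std(axis=0)             # per-band standard deviation
    normalized = (window - mean) / (std + eps)
    return normalized.reshape(-1)        # e.g. 11 frames x 40 bands = 440 values
```

Because the statistics depend only on the frames inside the window, the vector for frame m can be produced as soon as frame m + half_width has arrived, without waiting for the end of the utterance.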

As described above, the present invention normalizes in units of a splice window chosen in view of the input of the neural network model used for speech interval determination, which makes real-time processing more feasible. Because the mean and variance over the whole window are taken into account, performance is also more robust to environmental distortion.

The speech interval determination unit 40 applies the speech feature vectors delivered by the normalization unit 30 to a neural network model, in particular a deep neural network (DNN) model, to determine whether speech is present. Here, the deep neural network model is a network with a multilayer perceptron structure that has an input layer, an output layer, and a plurality of hidden layers between them. Each layer may consist of a plurality of nodes corresponding to artificial neurons, and the connections between nodes of different layers are determined by training. In particular, the output value of a node is the output of its activation function, whose input may be the weighted sum of all nodes connected to that node.

The speech interval determination unit 40 uses this deep neural network model to determine whether speech is present. In particular, as shown in FIG. 5, it applies the sequence of speech feature vectors normalized by the normalization unit 30 as the input value of the input layer, arranges for the output layer to output the speech detection result (0 or 1) for the m-th frame, and then applies the neural network algorithm to detect whether speech is present.
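The forward pass of such a multilayer perceptron can be sketched as follows. The layer sizes, ReLU hidden activations, sigmoid output, 0.5 threshold, and random untrained weights are all assumptions for illustration; the description only states that each node outputs its activation function applied to a weighted sum and that the output is 0 or 1 for frame m.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_mlp(layer_sizes, seed=0):
    """Create (weights, bias) pairs with small random weights (untrained)."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(0.0, 0.1, (n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def speech_probability(params, x):
    """Forward pass: every node outputs activation(weighted sum of inputs)."""
    h = np.asarray(x, dtype=np.float64)
    for weights, bias in params[:-1]:
        h = np.maximum(0.0, h @ weights + bias)   # ReLU hidden layer (assumption)
    weights, bias = params[-1]
    return sigmoid(h @ weights + bias)[0]         # single output node

def is_speech(params, x, threshold=0.5):
    """Return 1 if frame m is judged to be speech, otherwise 0."""
    return int(speech_probability(params, x) >= threshold)
```

For example, `init_mlp([440, 64, 64, 1])` matches an input of 11 spliced frames of 40 values each; in practice the weights would come from training on labelled data such as that held in the learning DB 50.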

As described above, the present invention determines the splice window in view of the size of the input layer of the neural network model, computes the mean and variance of the speech feature vectors per frequency band for each splice window, and uses them to normalize in units of the splice window, so that speech can be detected more quickly. In addition, by performing masking before the speech feature vectors are extracted, the noise section can be extracted more clearly across varied noise environments.

The method of determining a speech interval using normalization according to an embodiment of the present invention is described in more detail below.

FIG. 7 is a flowchart illustrating a method of determining a speech interval using normalization according to an embodiment of the present invention.

Referring to FIG. 7, when a speech signal is input (S101), the speech interval determination apparatus 100 estimates an initial part of the input speech signal as a noise section (S103) and adds the noise signal of the estimated noise section to the entire speech signal. The initial part of the speech signal is treated as the noise section, on the assumption that it contains only noise, so that the energy added to the speech signal is small compared with the speech features of the speech section, such as its speech energy, and the speech signal is therefore not distorted.

The speech interval determination apparatus 100 then performs a masking process that adds the noise signal corresponding to the noise section to the frames of the entire speech signal (S105). This makes the level of the speech signal in the noise section more distinct, so that speech sections and non-speech sections can be distinguished more clearly.

At this time, the speech interval determination apparatus 100 may add, as the noise signal, a signal whose level corresponds to the average energy level of the estimated noise section, or it may add the noise signal that was applied when the learning DB 50 was built. With this masking process, speech intervals can be determined more clearly even in noise conditions that were not considered when the learning DB was built.

Next, the speech interval determination apparatus 100 extracts speech feature vectors from the masked speech signal (S107). For example, it may extract the log mel-filterbank energy (LMFE) values of each frame as the speech feature vector. To do so, it applies a Fourier transform to the speech signal frame by frame to obtain spectral information, and then computes the filterbank energies of the spectrum. That is, the frequency band is divided into several filter banks, and a logarithm is applied to the energy in each bank to obtain the log filterbank energy for each frequency band. In this way, the feature vector extraction unit 20 can produce a speech feature vector of about 40 dimensions for each frame.

The speech interval determination apparatus 100 then normalizes the feature vectors in units of the splice window (S109).

For example, the speech interval determination apparatus 100 computes the mean and variance over the frames within a certain range (m-5 to m+5) around the current frame (m) and uses them to normalize within that splice window. The mean and variance are computed from the log filterbank energy values, that is, the speech feature vectors computed for each frequency band of each frame, and they are used to perform a normalization to 0 and 1 in units of the splice window. The splice window, which is the unit of normalization, reflects the size of the input layer of the neural network model used for speech interval determination. Normalizing in units of this splice window makes real-time processing more feasible, and taking the mean and variance over the whole window into account gives better robustness to environmental distortion.

The speech interval determination apparatus 100 then sets the normalized speech feature vector sequence as the input value of the input layer of the deep neural network, sets the speech detection result (0 or 1) for the m-th frame as the output value of the output layer, and applies the neural network algorithm to detect, frame by frame, whether speech is present.
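Once each frame has a 0 or 1 decision, the speech intervals are the runs of consecutive speech frames. The text does not spell out this grouping step, so the helper below is a hypothetical sketch; it reports intervals as frame indices with an exclusive end.

```python
def frames_to_intervals(decisions):
    """Group consecutive speech frames (value 1) into
    (start_frame, end_frame_exclusive) pairs."""
    intervals = []
    start = None
    for i, d in enumerate(decisions):
        if d == 1 and start is None:
            start = i                        # a speech run begins
        elif d != 1 and start is not None:
            intervals.append((start, i))     # the run ended at frame i - 1
            start = None
    if start is not None:                    # run reaches the last frame
        intervals.append((start, len(decisions)))
    return intervals
```

For the decisions `[0, 1, 1, 0, 0, 1]` this returns `[(1, 3), (5, 6)]`; multiplying the indices by the frame hop converts them to time.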

The method of determining a speech interval using normalization according to an embodiment of the present invention has been described above.

The method of determining a speech interval using normalization described above may also be provided in the form of a computer-readable medium suitable for storing computer program instructions and data.

In particular, the computer program of the present invention, in a speech interval determination apparatus that divides a speech signal into frames of a predetermined length and detects a speech interval by determining whether each frame is speech, may execute the steps of: extracting a speech feature vector from an input speech signal; performing normalization on a series of the speech feature vectors according to the input layer of a deep neural network (DNN) model; and determining the speech interval in the speech signal by setting the normalized speech feature vectors as the input of the deep neural network model.

Computer-readable media suitable for storing such computer program instructions and data include, for example, magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and semiconductor memory such as ROM (Read Only Memory), RAM (Random Access Memory), flash memory, EPROM (Erasable Programmable ROM), and EEPROM (Electrically Erasable Programmable ROM). The processor and memory may be supplemented by, or incorporated in, special-purpose logic circuitry.

The computer-readable recording medium may also be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs for implementing the present invention, together with the related code and code segments, may be readily inferred or modified by programmers in the technical field to which the present invention belongs, taking into account the system environment of the computer that reads the recording medium and executes the program.

In addition, a computer program recorded on such a computer-readable recording medium includes instructions that perform the functions described above. It can be distributed through the recording medium, then read, installed, and executed on a specific device or computer to carry out those functions.

Although this specification contains many specific implementation details, these should not be construed as limiting the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of a particular invention. Certain features described in this specification in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments, either separately or in any suitable subcombination. Furthermore, although features may be described as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be removed from that combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Likewise, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

According to the present invention, when a deep neural network model is used to determine a speech interval, normalization is performed in consideration of the size of the input layer of that model, so the speech interval can be determined more quickly and accurately. Because masking is performed before normalization, speech interval determination is also more robust across varied noise environments. The present invention can therefore improve efficiency in speech signal processing fields such as speech recognition, speaker recognition, and sound quality enhancement, and it has ample industrial applicability.

10: Masking processing unit
20: Feature vector extraction unit
30: Normalization unit
40: Speech interval determination unit
50: Learning DB
100: Speech interval determination apparatus
200: Feature extraction device
300: Speech recognition device
400: Acoustic model
500: Speech recognition system

Claims

1. A method for determining a speech interval using normalization, performed by a speech interval determination apparatus that divides a speech signal into frames of a predetermined length and detects a speech interval by determining whether each frame is speech, the method comprising:
extracting a speech feature vector from an input speech signal;
performing normalization on a series of the speech feature vectors according to an input layer of a deep neural network (DNN) model; and
determining a speech interval in the speech signal by setting the normalized speech feature vectors as an input to the deep neural network model.

2. The method of claim 1,
further comprising, prior to extracting the speech feature vector:
estimating a part of the input speech signal as a noise section; and
masking by adding a noise signal corresponding to the estimated noise section to the entire speech signal,
wherein the extracting of the speech feature vector
comprises extracting the speech feature vector from the masked speech signal.

3. The method of claim 2,
wherein the noise signal
is a training noise signal applied when the training data for the speech interval determination was constructed.

4. The method of claim 3,
wherein the masking
is performed after the energy level of the training noise signal is adjusted to match the energy level of the noise signal.

5. The method of claim 1,
wherein the performing of the normalization comprises:
setting a splice window of a certain range around an arbitrary frame in the speech signal;
calculating a mean and a standard deviation for the splice window by using the speech feature vectors extracted for each frequency band within the splice window; and
performing normalization on the splice window using the calculated mean and standard deviation.

6. The method of claim 5,
wherein the splice window is set according to the size of the input layer of the deep neural network model.

7. A computer-readable recording medium on which is recorded a program for executing the method of determining a speech interval using normalization according to any one of claims 1 to 6.

8. A speech interval determination apparatus that divides a speech signal into frames of a predetermined length and detects a speech interval by determining whether each frame is speech, the apparatus comprising:
a feature vector extraction unit that extracts a speech feature vector from an input speech signal;
a normalization unit that performs normalization on a series of the speech feature vectors extracted by the feature vector extraction unit according to an input layer of a deep neural network (DNN) model; and
a speech interval determination unit that determines the speech interval in the speech signal by setting the normalized speech feature vectors as an input to the deep neural network model.

9. The apparatus of claim 8,
further comprising a masking processing unit that estimates a part of the input speech signal as a noise section and adds a noise signal corresponding to the estimated noise section to the entire speech signal,