KR101704925B1

KR101704925B1 - Voice Activity Detection based on Deep Neural Network Using EVS Codec Parameter and Voice Activity Detection Method thereof

Info

Publication number: KR101704925B1
Application number: KR1020150147647A
Authority: KR
Inventors: 양준영; 장준혁
Original assignee: 한양대학교 산학협력단
Priority date: 2015-10-22
Filing date: 2015-10-22
Publication date: 2017-02-09

Abstract

Disclosed are a deep neural network-based voice detection apparatus using EVS codec parameters and a method therefor. The method includes the steps of: extracting, from a training voice signal, at least one feature vector used in the voice detection algorithm of the Enhanced Voice Services (EVS) codec; applying the extracted feature vector to a deep belief network (DBN), which is one of the deep neural network (DNN) models, and obtaining through training the weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN; calculating a voice presence probability using the obtained weight matrices and bias vectors; and determining whether voice is present by comparing the calculated voice presence probability with an adjustable threshold. Voice detection performance is improved by using decision logic newly modeled through the DNN.

Description

Voice activity detection apparatus based on a deep neural network using EVS codec parameters, and voice activity detection method thereof

The following embodiments relate to a deep neural network-based voice detection apparatus using EVS codec parameters and a method thereof. More particularly, they relate to a deep neural network-based voice detection apparatus and method that use EVS codec parameters and improve voice detection performance by means of a deep neural network.

A voice detection apparatus (voice detector) is a building block that plays an important role in various speech signal processing fields such as speech recognition and speech coding. In particular, a speech codec operating in variable bit rate mode uses a voice detector to perform discontinuous transmission (DTX) and comfort noise generation (CNG) for noise intervals, which enables efficient bandwidth use in voice communication. Voice detection algorithms in speech codecs typically compute quantities such as the zero-crossing rate, band energy, linear prediction coefficients, frame energy, and signal-to-noise ratio (SNR) and compare them with thresholds. In addition, as machine learning techniques advance, they can be applied to voice detection.
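To make the conventional approach concrete, here is a minimal sketch of a threshold-based detector built from two of the quantities named above, the zero-crossing rate and the frame energy. The function names, the default thresholds, and the decision rule are illustrative assumptions and are not taken from the EVS codec or from this patent.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def frame_energy_db(frame, floor=1e-12):
    """Mean-square energy of the frame in dB, floored to avoid log(0)."""
    mean_square = sum(x * x for x in frame) / max(len(frame), 1)
    return 10.0 * math.log10(max(mean_square, floor))


def heuristic_vad(frame, energy_threshold_db=-40.0, zcr_threshold=0.5):
    """Toy rule (assumed): speech if the frame is loud enough and its
    zero-crossing rate is below a fixed limit."""
    return (
        frame_energy_db(frame) > energy_threshold_db
        and zero_crossing_rate(frame) < zcr_threshold
    )
```

Each parameter is tested against its own threshold here, so any dependence between parameters is ignored; that limitation is what the learned decision logic described in this document is meant to address.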

The voice detection algorithm of the conventional EVS codec determines whether voice is present according to heuristic decision logic, using parameters extracted from the voice signal. That is, most of the decision process consists of comparing the extracted parameter values with conditionally determined thresholds.

However, because the voice detection algorithm of the conventional EVS codec decides on voice presence with heuristic decision logic applied to the extracted parameters, it can hardly be said to take sufficient account of the dependencies or correlations among the parameters that characterize the voice signal. The algorithm is good at computing parameters that represent the characteristics of the voice signal well, but the decision logic that uses those values to determine voice presence is inadequate.

Korean Patent Laid-Open Publication No. 10-2013-0012073 relates to user-specific noise suppression for voice quality improvement and describes a technique for user-specific noise suppression.

The embodiments describe a deep neural network-based voice detection apparatus using EVS codec parameters and a method thereof and, more specifically, provide a technique that improves voice detection performance by using decision logic newly modeled through a deep neural network (DNN).

The embodiments aim to provide a deep neural network-based voice detection apparatus and method that use the various parameters computed in the voice detection algorithm of the EVS codec but calculate the voice presence probability with decision logic based on a deep belief network (DBN), one of the DNN models, thereby detecting voice effectively.

A deep neural network-based voice detection method using EVS codec parameters according to one embodiment includes: extracting, from a training voice signal, at least one feature vector used in the voice detection algorithm of the EVS codec (Codec for Enhanced Voice Services); applying the extracted feature vector to a deep belief network (DBN), which is one of the deep neural network (DNN) models, and obtaining through training the weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN; calculating a voice presence probability using the obtained weight matrices and bias vectors; and comparing the calculated voice presence probability with an adjustable threshold to determine whether voice is present.

Here, the voice detection algorithm of the EVS codec includes a first signal activity detection (SAD) module, a second SAD module, and a third SAD module. The first and second SAD modules operate on the results of a frequency-domain analysis based on critical bands, and the third SAD module may operate on the results of a filter bank analysis.

The feature vector may include at least one of the following: total frame energy, SNR outlier, average SNR, smoothed average SNR, modified segmental SNR (MSSNR), and relaxed modified segmental SNR (RlxMSSNR), which are computed in the first and second SAD modules; and frame energy, frequency-domain SNR, total SNR of all sub-bands, average SNR of all sub-bands, spectral centroids, and time-domain stabilities, which are computed in the third SAD module.
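One way to picture the feature vector is as a fixed-order record of the twelve listed parameters. The sketch below is only an illustration: the class name, the field names, and the assumption that each parameter is a single scalar per frame are hypothetical (the text lists spectral centroids and time-domain stabilities in the plural, so the real vector may well be longer).

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class EvsVadFeatures:
    """Hypothetical per-frame container for the listed EVS SAD parameters."""

    # Computed in the first and second SAD modules
    total_frame_energy: float
    snr_outlier: float
    average_snr: float
    smoothed_average_snr: float
    mssnr: float        # modified segmental SNR
    rlx_mssnr: float    # relaxed modified segmental SNR
    # Computed in the third SAD module
    frame_energy: float
    frequency_domain_snr: float
    total_subband_snr: float
    average_subband_snr: float
    spectral_centroid: float       # assumed scalar for this sketch
    time_domain_stability: float   # assumed scalar for this sketch

    def to_vector(self):
        """Return the parameters as a flat list in declaration order."""
        return [float(v) for v in astuple(self)]
```

Keeping the order fixed matters because the network's first weight matrix is tied to the position of each parameter in the input vector.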

Obtaining the weight matrices and bias vectors may include: a pre-training step of initializing the weight matrices and bias vectors between the layers of the DBN; and, after pre-training, a fine-tuning step of training the network to minimize the classification error by comparing the output of the DBN with ground-truth labels that mark, frame by frame, whether voice is present in the voice signal.
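The fine-tuning step can be written as an optimization problem. The passage above does not name the loss function or the optimizer, so the following is a sketch under assumptions: a cross-entropy loss over frames and plain gradient descent, with $\theta$ standing for all weight matrices and bias vectors, $y_{t,k}$ for the one-hot label of frame $t$, $p_{t,k}$ for the network's output probability, and $\eta$ for a learning rate. DBNs are commonly pre-trained layer by layer as a stack of restricted Boltzmann machines, but the procedure is not specified here either.

```latex
% Assumed fine-tuning objective: cross-entropy between frame labels and DBN outputs
E(\theta) = -\sum_{t=1}^{T} \sum_{k \in \{0,1\}} y_{t,k} \,\log p_{t,k}(\theta),
\qquad
\theta \leftarrow \theta - \eta \,\nabla_{\theta} E(\theta)
```

Under this reading, pre-training supplies the starting value of $\theta$ and fine-tuning applies the update repeatedly over the labeled training frames.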

Obtaining the weight matrices and bias vectors may include: applying the extracted feature vector as the input vector of the DBN so that it enters the input layer; passing the feature vector that has gone through the input layer sequentially from the first hidden layer to the last of a plurality of hidden layers; and passing the feature vector that has reached the last hidden layer on to the output layer.
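The layer-by-layer propagation described above can be sketched as a chain of affine maps. The sigmoid activation on the hidden layers is an assumption (the passage does not state which hidden activation is used), and the function names and the list-of-pairs layout for the layers are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def affine(weights, bias, x):
    """Return W x + b for a weight matrix given as a list of rows."""
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(weights, bias)
    ]


def forward(layers, x):
    """Propagate x through the layers.

    layers is a list of (W, b) pairs; every pair except the last is a
    hidden layer with an assumed sigmoid activation. The last pair is the
    output layer, whose pre-activation values are returned.
    """
    *hidden, (w_out, b_out) = layers
    for w, b in hidden:
        x = [sigmoid(z) for z in affine(w, b, x)]
    return affine(w_out, b_out, x)
```

The output-layer values returned here are not yet probabilities; they still need to pass through the output activation function.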

Obtaining the weight matrices and bias vectors may further include passing the feature vector through an activation function after it reaches the output layer. The activation function at the output layer is a softmax function, which maps the values held by the output-layer nodes to probability values for voice presence.
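As a small illustration of the softmax mapping, the sketch below turns output-node values into probabilities that sum to one. The two-node layout and the convention that index 1 is the "voice present" node are assumptions made for the example.

```python
import math


def softmax(logits):
    """Map raw output-node values to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def voice_presence_probability(logits, speech_index=1):
    """Probability assigned to the (assumed) 'voice present' output node."""
    return softmax(logits)[speech_index]
```

For example, equal node values give a probability of 0.5 for each class, and raising the speech node relative to the noise node pushes the voice presence probability toward 1.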

Determining whether voice is present may include classifying the frame as a voice-present interval when the probability value output by the activation function is greater than or equal to the threshold, and classifying the frame as a noise interval without voice when the probability value is less than the threshold.
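The decision rule is a single comparison per frame. In the sketch below the function names and the default threshold of 0.5 are illustrative assumptions; the text only says that the threshold is adjustable.

```python
def is_speech(probability, threshold=0.5):
    """Voice-present if the probability is at or above the threshold."""
    return probability >= threshold


def label_frames(probabilities, threshold=0.5):
    """Return 1 for voice-present frames and 0 for noise-only frames."""
    return [1 if is_speech(p, threshold) else 0 for p in probabilities]
```

A frame whose probability equals the threshold is counted as voice, matching the "greater than or equal to" wording above.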

A deep neural network-based voice detection apparatus using EVS codec parameters according to another embodiment includes: a feature vector extraction unit that extracts, from a training voice signal, at least one feature vector used in the voice detection algorithm of the EVS codec (Codec for Enhanced Voice Services); a DBN training unit that applies the extracted feature vector to a deep belief network (DBN), which is one of the deep neural network (DNN) models, obtains through training the weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN, and calculates a voice presence probability using the obtained weight matrices and bias vectors; and a voice presence determination unit that compares the calculated voice presence probability with an adjustable threshold to determine whether voice is present.

The DBN training unit may include: a pre-training unit that initializes the weight matrices and bias vectors between the layers of the DBN; and a fine-tuning unit that, after pre-training, trains the network to minimize the classification error by comparing the output of the DBN with ground-truth labels that mark, frame by frame, whether voice is present in the voice signal.

In the DBN training unit, the extracted feature vector is applied as the input vector of the DBN, enters the input layer, is passed sequentially from the first hidden layer to the last of a plurality of hidden layers, and may then be passed on to the output layer.

After reaching the output layer, the feature vector passes through an activation function. The activation function at the output layer is a softmax function, which maps the values held by the output-layer nodes to probability values for voice presence.

The voice presence determination unit may classify the frame as a voice-present interval when the probability value output by the activation function is greater than or equal to the threshold, and as a noise interval without voice when the probability value is less than the threshold.

According to the embodiments, a deep neural network-based voice detection apparatus using EVS codec parameters and a method thereof can be provided that improve voice detection performance by using decision logic newly modeled through a deep neural network (DNN).

According to the embodiments, a deep neural network-based voice detection apparatus and method can be provided that use the various parameters computed in the voice detection algorithm of the EVS codec but calculate the voice presence probability with decision logic based on a deep belief network (DBN), one of the DNN models, and thereby exhibit excellent voice detection performance.

FIG. 1 is a block diagram showing the configuration of a voice detection apparatus for performing a voice detection method according to one embodiment.
FIG. 2 is a conceptual diagram of a deep neural network-based voice detection method using EVS codec parameters according to one embodiment.
FIG. 3 is a flowchart of a deep neural network-based voice detection method using EVS codec parameters according to one embodiment.
FIG. 4 is a diagram for explaining the pre-training process according to one embodiment.
FIG. 5 is a diagram for explaining the fine-tuning process according to one embodiment.
FIGS. 6 to 11 are ROC curves comparing the voice detection methods in each noise environment.
FIG. 12 is a diagram comparing the voice detection method according to one embodiment with an existing voice detection method.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided so that the invention is explained more completely to a person of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments relate to a method of improving the voice detection performance of the existing EVS codec (Codec for Enhanced Voice Services) by using a deep neural network (DNN). The various parameters computed in the voice detection algorithm of the EVS codec are used as they are, but the voice presence probability is calculated with new DNN-based decision logic in place of the codec's own voice detection decision logic, so that voice is detected more effectively.

In the embodiments, the parameters used in the first signal activity detection (SAD) module, the second SAD module, and the third SAD module, which make up the voice detection algorithm of the EVS codec, are selectively extracted from a training voice signal. The feature vector built from those parameters is then applied to a deep belief network (DBN), one of the DNN models, and suitable weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN are found through training. The trained model takes as input a feature vector composed of EVS codec parameters extracted from a voice signal and outputs the voice presence probability calculated from it; whether voice is present is determined by comparing that probability with an adjustable threshold.

Here, a DNN performs hierarchical feature extraction on the given data and thereby extracts higher-level feature vectors from it. The DBN is one of the deep neural network models that is less susceptible to problems such as overfitting that can occur during training.

FIG. 1 is a block diagram showing the configuration of a voice detection apparatus for performing a voice detection method according to one embodiment.

Referring to FIG. 1, a voice detection apparatus for performing the deep neural network-based voice detection method using EVS codec parameters may include a voice detection control unit 100. Depending on the embodiment, the voice detection control unit 100 may further include a memory, and it may be electrically connected to an input unit 200.

The voice detection control unit 100 is the part that performs the voice detection method using a deep neural network optimized through training, and it may include a computation unit with a given processing speed, for example a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The voice detection control unit 100 may also include a memory for storing the data required by a given process.

The input unit 200 is the part that sends input data to the voice detection control unit 100 and may include an input means that converts sound into an electrical signal, such as a microphone. For example, a contaminated voice signal supplied to the input unit 200, that is, a voice signal corrupted by ambient noise, may be provided to the voice detection control unit 100.

Depending on the embodiment, the voice detection control unit 100 of the voice detection apparatus may include a feature vector extraction unit 110, a DBN training unit 120, and a voice presence determination unit 130. Each component is described in more detail below.
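The three units can be pictured as a simple pipeline inside the control unit. The sketch below is a structural illustration only: the class name, the constructor arguments, and the use of plain callables to stand in for units 110 and 120 are assumptions, not part of the disclosed apparatus.

```python
class VoiceDetectionController:
    """Hypothetical stand-in for control unit 100.

    extract              plays the role of feature vector extraction unit 110
    presence_probability plays the role of the trained network in unit 120
    threshold comparison plays the role of voice presence determination unit 130
    """

    def __init__(self, extract, presence_probability, threshold=0.5):
        self.extract = extract
        self.presence_probability = presence_probability
        self.threshold = threshold

    def detect(self, frame):
        """Return True if the frame is judged to contain voice."""
        features = self.extract(frame)
        probability = self.presence_probability(features)
        return probability >= self.threshold
```

Any feature extractor and any trained model can be plugged in, as long as the model returns a probability between 0 and 1 for each feature vector.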

First, the feature vector extraction unit 110 may extract, from the training voice signal supplied by the input unit 200, at least one feature vector used in the voice detection algorithm of the EVS codec (Codec for Enhanced Voice Services).

Here, the voice detection algorithm of the EVS codec may include a first signal activity detection (SAD) module, a second SAD module, and a third SAD module. The first and second SAD modules operate on the results of a frequency-domain analysis based on critical bands, and the third SAD module may operate on the results of a filter bank analysis.

The feature vector extraction unit 110 may randomly extract feature vectors through the voice detection algorithm of the EVS codec. The feature vector may include at least one of the following: total frame energy, SNR outlier, average SNR, smoothed average SNR, modified segmental SNR (MSSNR), and relaxed modified segmental SNR (RlxMSSNR), which are computed in the first and second SAD modules; and frame energy, frequency-domain SNR, total SNR of all sub-bands, average SNR of all sub-bands, spectral centroids, and time-domain stabilities, which are computed in the third SAD module.

The DBN training unit 120 may apply the feature vector extracted by the feature vector extraction unit 110 to a deep belief network (DBN), one of the deep neural network (DNN) models, and obtain through training the weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN. The DBN training unit 120 may also calculate the voice presence probability using the obtained weight matrices and bias vectors.

The DBN training unit 120 may include a pre-training unit and a fine-tuning unit.

The pre-training unit may initialize the weight matrices and bias vectors between the layers of the DBN.

After pre-training, the fine-tuning unit may train the network to minimize the classification error by comparing the output of the DBN with ground-truth labels that mark, frame by frame, whether voice is present in the voice signal.

In the DBN training unit 120, the extracted feature vector is applied as the input vector of the DBN, enters the input layer, is passed sequentially from the first hidden layer to the last of a plurality of hidden layers, and may then be passed on to the output layer.

Here, the feature vector passes through an activation function after reaching the output layer. The activation function at the output layer is a softmax function, which maps the values held by the output-layer nodes to probability values for voice presence.

The voice presence determination unit 130 may compare the voice presence probability calculated by the DBN training unit 120 with an adjustable threshold to determine whether voice is present.

For example, the voice presence determination unit 130 may classify the frame as a voice-present interval when the probability value output by the activation function is greater than or equal to the threshold, and as a noise interval without voice when the probability value is less than the threshold.
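Because the threshold is adjustable, sweeping it trades detection of voice frames against false alarms on noise frames, which is presumably how ROC curves such as those in FIGS. 6 to 11 are obtained. The sketch below shows one way to compute such operating points from per-frame probabilities and ground-truth labels; the function name and the input layout are assumptions.

```python
def roc_points(probabilities, labels, thresholds):
    """Return one (false_alarm_rate, detection_rate) pair per threshold.

    labels uses 1 for voice-present frames and 0 for noise-only frames.
    """
    n_speech = sum(1 for y in labels if y == 1)
    n_noise = len(labels) - n_speech
    points = []
    for th in thresholds:
        decisions = [p >= th for p in probabilities]
        hits = sum(1 for d, y in zip(decisions, labels) if d and y == 1)
        false_alarms = sum(1 for d, y in zip(decisions, labels) if d and y == 0)
        points.append((
            false_alarms / n_noise if n_noise else 0.0,
            hits / n_speech if n_speech else 0.0,
        ))
    return points
```

A low threshold marks nearly every frame as voice (high detection rate, high false-alarm rate), while a high threshold does the opposite; a better detector keeps the detection rate high at a low false-alarm rate.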

Thus, according to one embodiment, voice can be detected effectively by using the various parameters computed in the voice detection algorithm of the EVS codec while calculating the voice presence probability with decision logic based on a deep belief network (DBN), one of the deep neural network (DNN) models.

FIG. 2 is a conceptual diagram of a deep neural network-based voice detection method using EVS codec parameters according to one embodiment.

Referring to FIG. 2, in the deep neural network-based voice detection method using EVS codec parameters according to one embodiment, feature vectors used in the voice detection algorithm of the EVS codec (Codec for Enhanced Voice Services) may be selected and extracted (240) from an input voice signal (210).

Here, the voice detection algorithm of the EVS codec consists of a first SAD module, a second SAD module, and a third SAD module. The first and second SAD modules operate (220) on the results of a frequency-domain analysis based on critical bands, and the third SAD module may operate (230) on the results of a filter bank analysis.

Using the extracted feature vectors, a pre-training process (250) is carried out that initializes the weight matrices and bias vectors between the layers of the DBN. After pre-training, a fine-tuning process (260) trains the network to minimize the classification error by comparing the output of the DBN with ground-truth labels that mark, frame by frame, whether voice is present in the voice signal. The result is a DBN-based voice activity detector (VAD) (270).

The deep neural network-based voice detection method using EVS codec parameters is described in more detail below by way of one embodiment.

도 3은 일 실시예에 따른 EVS 코덱 파라미터를 이용한 심화 신경망 기반의 음성 검출 방법을 나타내는 흐름도이다. FIG. 3 is a flowchart illustrating a voice recognition method based on an enhanced neural network using an EVS codec parameter according to an exemplary embodiment.

도 3을 참조하면, 일 실시예에 따른 EVS 코덱 파라미터를 이용한 심화 신경망 기반의 음성 검출 방법은, 학습용 음성 신호로부터 EVS 코덱(Codec for Enhanced Voice Services)의 음성 검출 알고리즘에서 사용된 특징 벡터를 적어도 하나 이상 추출하는 단계, 추출된 특징 벡터를 심화 신경망(Deep Neural Network; DNN)의 모델 중 하나인 깊은 신뢰망(Deep Belief Network; DBN)에 적용하여 학습을 통해 깊은 신뢰망(DBN)의 입력 레이어, 은닉 레이어, 그리고 출력 레이어에서 가해지는 가중치 행렬 및 바이어스 벡터를 구하는 단계, 구해진 가중치 행렬 및 바이어스 벡터를 이용하여 음성 존재 확률을 계산하는 단계, 및 계산된 음성 존재 확률을 조절 가능한 임계값(threshold)과 비교하여 음성의 존재 유무를 판단하는 단계를 포함하여 이루어질 수 있다. Referring to FIG. 3, a deep-speech-based speech detection method using an EVS codec parameter according to an exemplary embodiment includes at least one feature vector used in a speech detection algorithm of an EVS codec (Enhanced Voice Services) The extracted feature vector is applied to Deep Belief Network (DBN) which is one of the Deep Neural Network (DNN) models, and the input layer of Deep Trust Network (DBN) A hidden matrix and a bias vector applied to the output layer, calculating a probability of presence of speech using the obtained weight matrix and a bias vector, and calculating a threshold value And determining whether the voice exists or not.

Here, obtaining the weight matrices and bias vectors may include: applying the extracted feature vector as the input vector of the DBN so that it enters the input layer; passing the feature vector that has passed through the input layer sequentially from the first hidden layer to the last hidden layer among a plurality of hidden layers; and passing the feature vector that has reached the last hidden layer on to the output layer.

Obtaining the weight matrices and bias vectors may further include passing the feature vector through an activation function after it has been delivered to the output layer. The activation function at the output layer is a softmax function, which maps the values held by the output-layer nodes to probability values for voice presence.

Accordingly, the embodiments can provide a deep neural network based voice detection method using EVS codec parameters that achieves excellent voice detection performance. The method uses the various parameters calculated in the voice detection algorithm of the EVS codec, but computes the voice presence probability with decision logic based on a deep belief network (DBN), one of the deep neural network (DNN) models.

Hereinafter, each step of the deep neural network based voice detection method using EVS codec parameters according to an embodiment is described in detail.

In step 310, the feature vector extraction unit 110 of the voice detection apparatus may randomly select and extract at least one feature vector used in the voice detection algorithm of the EVS (Enhanced Voice Services) codec from the training speech signal.

In this case, some of the parameters calculated in the voice detection algorithm of the EVS codec can be selected for training the deep neural network. For example, a total of 17 parameters may be selected for the feature vector: from the first SAD module and the second SAD module, the total frame energy, SNR outlier, average SNR, smoothed average SNR, modified segmental SNR (MSSNR), and relaxed modified segmental SNR (RlxMSSNR); and from the third SAD module, the frame energy, frequency-domain SNR, total SNR of all sub-bands, average SNR of all sub-bands, spectral centroids, and time-domain stabilities.

In step 320, the deep belief network (DBN) learning unit 120 of the voice detection apparatus may apply the extracted feature vector to a deep belief network (DBN), one of the deep neural network (DNN) models, and obtain through training the weight matrices and bias vectors applied at the input layer, hidden layers, and output layer of the DBN.

In obtaining the weight matrices and bias vectors, the pre-training unit of the voice detection apparatus may perform a pre-training process that initializes the weight matrices and bias vectors between the layers of the DBN.

After pre-training, the fine-tuning unit of the voice detection apparatus may perform a fine-tuning process that trains the network to minimize the classification error by comparing the output of the DBN with ground-truth labels that mark, frame by frame, whether speech is present in the speech signal.

Here, when x denotes the feature vector composed of the EVS codec parameters extracted from the speech signal prepared for training, the feature vector is applied as the input vector of the DBN and passed from the first hidden layer to the last hidden layer, being activated at each layer through the weights, the bias values, and a nonlinear activation function.

The activation performed through the activation function can be expressed by an equation (the equation image is not reproduced in this text).

In that equation, the symbols denote the weight matrix, the bias vector, and the activation vector, respectively, and the activation function is the sigmoid function.
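Since the equation images are unavailable here, the layer-wise activation and the sigmoid function can be written in conventional notation as follows. This is a reconstruction under assumed symbols: the layer index l, the weight matrix W, the bias vector b, the activation vector a, and the function name f are illustrative and not necessarily the symbols used in the original drawings.

```latex
\mathbf{a}^{(0)} = \mathbf{x}, \qquad
\mathbf{a}^{(l)} = f\!\left(\mathbf{W}^{(l)}\,\mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad
f(z) = \frac{1}{1 + e^{-z}}
```

The sigmoid is applied element by element, so every component of each activation vector lies strictly between 0 and 1.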

Each time the feature vector applied in vector form at the input layer passes through a hidden layer, a feature extraction process yields a new feature vector that takes full account of the dependencies and correlations among the parameters making up that feature vector.

That is, compared with the feature vector applied as input to a hidden layer, the feature vector that comes out of the hidden layer can represent the input data more efficiently and at a higher level. The input feature vector that has reached the last hidden layer is passed once more to the output layer and then through the activation function, which outputs a value corresponding to the voice presence probability.
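The propagation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names, the toy layer sizes (3 inputs instead of 17 parameters), and all numeric weights are assumptions chosen only to make the example runnable.

```python
import math

def sigmoid(z):
    """Logistic sigmoid applied to a scalar."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, a):
    """Compute W a + b for a row-major matrix W (a list of rows)."""
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(x, hidden_layers, output_layer):
    """Pass x through sigmoid hidden layers, then a softmax output layer.

    hidden_layers is a list of (W, b) pairs; output_layer is one (W, b) pair.
    Returns [P(voice absent), P(voice present)].
    """
    a = x
    for W, b in hidden_layers:
        a = [sigmoid(z) for z in affine(W, b, a)]
    W_out, b_out = output_layer
    return softmax(affine(W_out, b_out, a))

# Toy network with made-up weights: 3 inputs -> 2 hidden units -> 2 output nodes.
hidden = [([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])]
output = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
probs = forward([1.0, 2.0, 3.0], hidden, output)
```

The second entry of `probs` plays the role of the voice presence probability that is later compared with the threshold.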

The activation function applied at the output layer is the softmax function (the equation image is not reproduced in this text), which maps the values held by the output-layer nodes to probability values.

In step 330, the deep belief network (DBN) learning unit 120 of the voice detection apparatus may calculate the voice presence probability using the obtained weight matrices and bias vectors.

Here, the index symbol refers to a node of the output layer, and the two values that have passed through the activation function denote the voice absence probability and the voice presence probability, respectively.
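In conventional notation, the softmax output over the two output nodes can be written as below. The symbols are assumptions made for this reconstruction: z_k is the pre-activation value of output node k, and p_0 and p_1 are the voice absence and voice presence probabilities.

```latex
p_k = \frac{\exp(z_k)}{\sum_{j \in \{0,1\}} \exp(z_j)}, \qquad k \in \{0, 1\}, \qquad
p_0 = P(H_0 \mid \mathbf{x}), \quad p_1 = P(H_1 \mid \mathbf{x})
```

By construction p_0 + p_1 = 1, so thresholding p_1 alone is enough to reach a decision.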

In step 340, the voice presence determination unit 130 of the voice detection apparatus may compare the calculated voice presence probability with an adjustable threshold to determine whether voice is present.

The final voice detection result can be determined by comparing the voice presence probability with an adjustable threshold (the decision rule is given as an equation image that is not reproduced in this text).

Here, the threshold symbol denotes the adjustable threshold, and H1 and H0 are the voice presence and voice absence hypotheses, respectively; voice is assumed to be present when the activation value exceeds the threshold.
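A conventional way to write this decision rule is shown below. The symbols are assumed for this reconstruction: p_1 is the voice presence probability at the output layer and η is the adjustable threshold.

```latex
p_1 \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \eta
```

Raising η makes the detector more conservative (fewer false alarms, more missed speech frames), and lowering it has the opposite effect.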

In other words, the voice presence determination unit 130 of the voice detection apparatus determines a frame to be a voice presence interval if the probability value that has passed through the activation function is greater than or equal to the threshold, and determines the frame to be a noise interval in which no voice is present if that probability value is smaller than the threshold.
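The frame-level decision just described can be illustrated as follows. The function name `detect_speech`, the default threshold of 0.5, and the sample probabilities are assumptions for illustration; the text only states that the threshold is adjustable.

```python
def detect_speech(speech_probs, threshold=0.5):
    """Return 1 for frames whose voice presence probability is >= threshold, else 0."""
    return [1 if p >= threshold else 0 for p in speech_probs]

# Made-up per-frame voice presence probabilities.
frame_probs = [0.10, 0.50, 0.93, 0.49]
decisions = detect_speech(frame_probs, threshold=0.5)
```

A frame whose probability equals the threshold is treated as speech, matching the "greater than or equal to" wording above.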

FIG. 4 is a diagram for explaining the pre-training process according to an embodiment.

Referring to FIG. 4, which illustrates the pre-training process in the deep belief network (DBN) according to an embodiment, the pre-training process may be performed to set initial values for the weights and biases between the layers.

In other words, the feature vector is applied to a deep belief network (DBN), one of the deep neural network (DNN) models, and suitable weight matrices and bias vectors applied at the input layer 410, the hidden layers 420 and 430, and the output layer 440 of the DBN are found through training.

Pre-training can be performed by applying a restricted Boltzmann machine (RBM) graph model to each pair of adjacent layers, and no target is required. This pre-training is not applied to the weights between the output layer 440 and the last hidden layer 430.
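To make the unsupervised nature of this step concrete, here is a minimal sketch of one way an RBM between two adjacent layers could be trained. The text does not name the training rule; one-step contrastive divergence (CD-1) with binary units is assumed here because it is a common choice, and the inputs are assumed to be scaled to the range 0 to 1. All names and sizes are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(W, c, v):
    """P(h_j = 1 | v); W[j][i] links hidden unit j to visible unit i."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + cj) for row, cj in zip(W, c)]

def visible_probs(W, b, h):
    """P(v_i = 1 | h) for each visible unit."""
    return [sigmoid(sum(W[j][i] * h[j] for j in range(len(h))) + b[i])
            for i in range(len(b))]

def cd1_update(W, b, c, v0, lr, rng):
    """One CD-1 step on a single training vector v0 (assumed rule); updates in place."""
    h0 = hidden_probs(W, c, v0)
    h0_sample = [1.0 if rng.random() < p else 0.0 for p in h0]
    v1 = visible_probs(W, b, h0_sample)  # reconstruction of the input
    h1 = hidden_probs(W, c, v1)
    for j in range(len(c)):
        for i in range(len(b)):
            W[j][i] += lr * (h0[j] * v0[i] - h1[j] * v1[i])
        c[j] += lr * (h0[j] - h1[j])
    for i in range(len(b)):
        b[i] += lr * (v0[i] - v1[i])

rng = random.Random(0)
n_vis, n_hid = 4, 3
W = [[rng.gauss(0.0, 0.01) for _ in range(n_vis)] for _ in range(n_hid)]
b = [0.0] * n_vis  # visible biases
c = [0.0] * n_hid  # hidden biases
for _ in range(100):
    cd1_update(W, b, c, [1.0, 0.0, 1.0, 0.0], lr=0.005, rng=rng)
```

Note that no label appears anywhere in the update, which matches the statement that no target is required. In a stacked setting, the hidden probabilities of one trained RBM would serve as the visible input of the next.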

FIG. 5 is a diagram for explaining the fine-tuning process according to an embodiment.

Referring to FIG. 5, which illustrates the fine-tuning process in the deep belief network (DBN) according to an embodiment, the error between the output value and the target (the desired correct value) can be expressed using a cost function, and the weight and bias values can be adjusted in the direction that minimizes it. The backpropagation algorithm can be used for this purpose.
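The following sketch shows one backpropagation update for a network with a single sigmoid hidden layer and a softmax output. Several choices are assumptions, since the text does not specify them: the cost function is taken to be cross-entropy, plain gradient descent with a fixed step is used instead of the line-search update mentioned elsewhere in the description, and the layer sizes and weights are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def forward(params, x):
    """Return hidden activations h and output probabilities p."""
    W1, b1, W2, b2 = params
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    p = softmax([sum(w * hj for w, hj in zip(row, h)) + b for row, b in zip(W2, b2)])
    return h, p

def cross_entropy(p, label):
    """Assumed cost: negative log-probability of the correct class."""
    return -math.log(p[label])

def backprop_step(params, x, label, lr):
    """One gradient-descent update on a single labelled frame (in place)."""
    W1, b1, W2, b2 = params
    h, p = forward(params, x)
    # Output error for softmax with cross-entropy: p - one_hot(label).
    d_out = [pk - (1.0 if k == label else 0.0) for k, pk in enumerate(p)]
    # Error propagated back through the sigmoid hidden layer.
    d_hid = [h[j] * (1.0 - h[j]) * sum(d_out[k] * W2[k][j] for k in range(len(d_out)))
             for j in range(len(h))]
    for k in range(len(W2)):
        for j in range(len(h)):
            W2[k][j] -= lr * d_out[k] * h[j]
        b2[k] -= lr * d_out[k]
    for j in range(len(W1)):
        for i in range(len(x)):
            W1[j][i] -= lr * d_hid[j] * x[i]
        b1[j] -= lr * d_hid[j]

params = ([[0.1, -0.1], [0.05, 0.2]], [0.0, 0.0],
          [[0.1, -0.2], [-0.1, 0.2]], [0.0, 0.0])
x, label = [0.8, 0.3], 1  # label 1 = voice present
loss_before = cross_entropy(forward(params, x)[1], label)
for _ in range(50):
    backprop_step(params, x, label, lr=0.1)
loss_after = cross_entropy(forward(params, x)[1], label)
```

In a DBN, `W1` and `b1` would start from the pre-trained RBM values rather than from hand-picked numbers, and the update would be averaged over many labelled frames.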

Below, the performance of the voice detection method according to an embodiment is evaluated by comparing it with existing techniques.

To evaluate its performance, the voice detection method according to an embodiment can be compared with the voice detection method of the existing EVS codec using receiver operating characteristic (ROC) curves and the detection error probability as measures.

To train the deep belief network (DBN) for voice detection and to test the result, speech signals in units of 400 seconds can be mixed with noise from various environments. A 400-second clean speech signal is mixed with six types of noise (airport, babble, car, exhibition, restaurant, and subway) at signal-to-noise ratios of 0 dB, 5 dB, 10 dB, and 15 dB. Half of each resulting noisy speech signal can be used for training and the other half for testing.
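Mixing at a target signal-to-noise ratio amounts to scaling the noise before adding it. The sketch below assumes that SNR is the ratio of average powers over the whole signal; the text does not state how the level was measured, and the synthetic sine waves stand in for real speech and noise recordings.

```python
import math

def power(signal):
    """Average power of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so that speech power over noise power equals snr_db, then add."""
    gain = math.sqrt(power(speech) / (power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)], gain

speech = [math.sin(0.05 * t) for t in range(1000)]       # stand-in for clean speech
noise = [math.sin(1.7 * t + 0.3) for t in range(1000)]   # stand-in for recorded noise
noisy, gain = mix_at_snr(speech, noise, snr_db=5.0)
achieved_snr_db = 10.0 * math.log10(power(speech) / power([gain * n for n in noise]))
```

Repeating the call for each noise type and each of the four SNR values would yield the 24 noisy conditions described above.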

The training speech signals are all combined into a single file, the parameters used in the EVS codec voice detection method are extracted from it, and the deep belief network (DBN) can be trained using feature vectors composed of those parameters.

Training of the deep belief network (DBN) can proceed as follows.

First, pre-training, which initializes the weight matrices and bias vectors between the layers of the neural network model to suitable values, can be repeated 100 times. In the pre-training process, the learning rate is set to 0.005, the weight value that restrains excessive updates of the weight matrices is set to 0.0002, and the momentum value that accelerates learning is set to 0.9 for iterations before the sixth and 0.5 for the sixth and later iterations.
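The way these three hyperparameters typically interact in a single weight update is sketched below. The update form (momentum with an L2 penalty on the weight) is a common convention assumed here, not something the text spells out; likewise, treating iterations as 1-based and `grad` as the gradient of a loss to be minimized are assumptions.

```python
def momentum_for(iteration):
    """Momentum schedule as stated in the text; iteration is assumed to be 1-based."""
    return 0.9 if iteration < 6 else 0.5

def update_weight(w, velocity, grad, iteration, lr=0.005, weight_cost=0.0002):
    """One momentum update with an assumed L2 penalty; returns (new_w, new_velocity)."""
    velocity = momentum_for(iteration) * velocity - lr * (grad + weight_cost * w)
    return w + velocity, velocity

# With a zero gradient, only the penalty term shrinks the weight slightly.
w, velocity = update_weight(1.0, 0.0, 0.0, iteration=1)
```

The penalty term pulls large weights toward zero, which is one way to read "restraining excessive updates of the weight matrices".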

After pre-training, a fine-tuning process can be performed that trains the model to minimize the classification error by comparing the output of the deep neural network model with ground-truth labels marking, frame by frame, whether speech is actually present in the training speech signal. In the fine-tuning process, the weight matrices can be updated over a total of 230 iterations using a line-search algorithm.

To test the completed voice detection method, the receiver operating characteristic (ROC) curve and the detection error probability can be examined.

Table 1 shows the voice detection error probability (%) of the EVS voice detection method and of the voice detection method according to the present embodiment for test input signals with SNR = 5 dB, 10 dB, and 15 dB.

Referring to Table 1, the voice detection method according to the present embodiment shows a lower detection error probability than the voice detection method of the existing EVS codec.

FIGS. 6 to 11 compare the voice detection methods by means of receiver operating characteristic (ROC) curves.

That is, the results show, respectively, the ROC curves obtained when a test speech signal with a signal-to-noise ratio of 5 dB is applied as input, and the detection error probabilities obtained when test speech signals with signal-to-noise ratios of 5 dB, 10 dB, and 15 dB are applied as input. The detection error probabilities were calculated using the optimal threshold.
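A threshold sweep of the kind that produces an ROC curve and an optimal threshold can be sketched as follows. The definitions are assumptions, since the text does not give formulas: the false-alarm rate is the fraction of noise frames marked as speech, the detection rate is the fraction of speech frames marked as speech, and the error rate is the fraction of all frames misclassified. The probabilities and labels are made up.

```python
def roc_point(speech_probs, labels, threshold):
    """Return (false-alarm rate, detection rate, error rate) at one threshold."""
    decisions = [1 if p >= threshold else 0 for p in speech_probs]
    n_speech = sum(labels)
    n_noise = len(labels) - n_speech
    hits = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 1)
    false_alarms = sum(1 for d, y in zip(decisions, labels) if d == 1 and y == 0)
    errors = sum(1 for d, y in zip(decisions, labels) if d != y)
    return false_alarms / n_noise, hits / n_speech, errors / len(labels)

def sweep(speech_probs, labels, thresholds):
    """One row (threshold, false-alarm rate, detection rate, error rate) per threshold."""
    return [(t,) + roc_point(speech_probs, labels, t) for t in thresholds]

probs = [0.9, 0.8, 0.55, 0.4, 0.2, 0.1]   # hypothetical detector outputs
labels = [1, 1, 1, 0, 0, 0]               # 1 = speech frame, 0 = noise frame
curve = sweep(probs, labels, [0.25, 0.5, 0.75])
best = min(curve, key=lambda row: row[3])  # threshold with the lowest error rate
```

Plotting the detection rate against the false-alarm rate for all rows of `curve` gives the ROC curve, and `best` identifies the threshold that would be used when reporting the error probability.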

FIG. 6 shows the ROC curves measured in the airport noise environment, FIG. 7 in the babble noise environment, FIG. 8 in the car noise environment, FIG. 9 in the exhibition noise environment, FIG. 10 in the restaurant noise environment, and FIG. 11 in the subway noise environment.

As shown in FIGS. 6 to 11, the voice detection method according to the present embodiment shows a higher voice detection probability than the voice detection method of the existing EVS codec.

FIG. 12 is a diagram comparing the voice detection method according to an embodiment with the existing voice detection method.

As shown in FIG. 12, the voice detection results of the existing EVS voice detection method and of the voice detection method according to the present embodiment, for an input signal contaminated with babble noise at SNR = 5 dB, can be displayed together with the labels.

The comparison confirms that the voice detection method according to the present embodiment captures the intervals in which voice is present more accurately than the voice detection method of the existing EVS codec.

As described above, according to the embodiments, when a deep neural network is used to model the decision logic, a voice detection algorithm can be designed that outperforms the prior art with the same parameters as the prior art, or even with fewer. This is because a deep neural network can model very elaborate decision logic by taking full account of the dependencies and correlations among the elements of a parameter set that characterizes the speech signal well.

In this way, to improve the performance of the voice detection algorithm of the EVS codec, the embodiments use the parameters of the EVS codec voice detection module but replace the voice detection decision logic of the EVS codec with new decision logic designed using a deep neural network (DNN). The parameters extracted from the speech signal together form a feature vector, which is used to train the deep neural network (DNN). The proposed method can provide better voice detection performance than the voice detection algorithm of the existing EVS codec. The embodiments can be applied to mobile phones and to other electronic devices that support voice communication, for example when such devices are exposed to a noisy environment.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but a person of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may command the processing device independently or collectively. The software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the following claims.

Claims

A deep neural network based voice detection method using EVS codec parameters, the method comprising:
extracting, from a training speech signal, at least one feature vector used in a voice detection algorithm of an EVS (Enhanced Voice Services) codec;
applying the extracted feature vector to a deep belief network (DBN), which is one model of a deep neural network (DNN), and obtaining through training a weight matrix and a bias vector applied at an input layer, a hidden layer, and an output layer of the DBN;
calculating a voice presence probability using the obtained weight matrix and bias vector; and
comparing the calculated voice presence probability with an adjustable threshold to determine whether voice is present,
wherein the voice detection algorithm of the EVS codec comprises a first SAD (Signal Activity Detection) module, a second SAD module, and a third SAD module, the first SAD module and the second SAD module operate based on critical bands, the third SAD module operates based on a filterbank analysis result, and at least one of the parameters calculated in the voice detection algorithm of the EVS codec is used for training the DBN,
wherein obtaining the weight matrix and the bias vector comprises:
applying the extracted feature vector as an input vector of the DBN so that it is input to the input layer;
passing the feature vector that has passed through the input layer sequentially from a first hidden layer to a last hidden layer among a plurality of hidden layers; and
passing the feature vector that has reached the last hidden layer to the output layer, and
wherein pre-training is performed by applying a restricted Boltzmann machine (RBM) graph model to two adjacent layers, the pre-training is not applied to the weight values between the last hidden layer and the output layer, and weight values and bias values are adjusted using a cost function.

delete

The method according to claim 1,
wherein the feature vector is formed from the following parameters: a total frame energy, an SNR outlier, an average SNR, a smoothed average SNR, a modified segmental SNR (MSSNR), and a relaxed modified segmental SNR (RlxMSSNR) calculated in the first SAD module and the second SAD module; and a frame energy, a frequency-domain SNR, a total SNR of all sub-bands, an average SNR of all sub-bands, spectral centroids, and time-domain stabilities calculated in the third SAD module.

The method according to claim 1,
wherein obtaining the weight matrix and the bias vector comprises:
a pre-training step of initializing the weight matrix and the bias vector between the layers of the DBN; and
a fine-tuning step, performed after the pre-training, of training the DBN to minimize a classification error by comparing an output value that has passed through the DBN with a ground-truth label that marks, frame by frame, whether voice is present in the speech signal.

delete

The method according to claim 1,
wherein obtaining the weight matrix and the bias vector further comprises passing the feature vector through an activation function after it has been delivered to the output layer, and
wherein the activation function at the output layer is a softmax function that maps the values held by the nodes of the output layer to probability values for the voice presence probability.

The method according to claim 6,
wherein determining whether voice is present comprises determining a frame to be a voice presence interval when the probability value that has passed through the activation function is greater than or equal to the threshold, and determining the frame to be a noise interval in which no voice is present when the probability value that has passed through the activation function is smaller than the threshold.

A deep neural network based voice detection apparatus using EVS codec parameters, the apparatus comprising:
a feature vector extraction unit configured to extract, from a training speech signal, at least one feature vector used in a voice detection algorithm of an EVS (Enhanced Voice Services) codec;
a deep belief network learning unit configured to apply the extracted feature vector to a deep belief network (DBN), which is one model of a deep neural network (DNN), to obtain through training a weight matrix and a bias vector applied at an input layer, a hidden layer, and an output layer of the DBN, and to calculate a voice presence probability using the obtained weight matrix and bias vector; and
a voice presence determination unit configured to compare the calculated voice presence probability with an adjustable threshold to determine whether voice is present,
wherein the voice detection algorithm of the EVS codec comprises a first SAD (Signal Activity Detection) module, a second SAD module, and a third SAD module, the first SAD module and the second SAD module operate based on critical bands, the third SAD module operates based on a filterbank analysis result, and at least one of the parameters calculated in the voice detection algorithm of the EVS codec is used for training the DBN, and
wherein, in the deep belief network learning unit, the extracted feature vector is applied as an input vector of the DBN and input to the input layer, is passed sequentially from a first hidden layer to a last hidden layer among a plurality of hidden layers, and is then passed to the output layer; pre-training is performed by applying a restricted Boltzmann machine (RBM) graph model to two adjacent layers, the pre-training not being applied to the weight values between the last hidden layer and the output layer; and weight values and bias values are adjusted using a cost function that expresses the error between the output value and a target correct value.

delete

The apparatus according to claim 8,
wherein the feature vector is formed from the following parameters: a total frame energy, an SNR outlier, an average SNR, a smoothed average SNR, a modified segmental SNR (MSSNR), and a relaxed modified segmental SNR (RlxMSSNR) calculated in the first SAD module and the second SAD module; and a frame energy, a frequency-domain SNR, a total SNR of all sub-bands, an average SNR of all sub-bands, spectral centroids, and time-domain stabilities calculated in the third SAD module.

The apparatus according to claim 8,
wherein the deep belief network learning unit comprises:
a pre-training unit configured to initialize the weight matrix and the bias vector between the layers of the DBN; and
a fine-tuning unit configured, after the pre-training, to train the DBN to minimize a classification error by comparing an output value that has passed through the DBN with a ground-truth label that marks, frame by frame, whether voice is present in the speech signal.

delete

The apparatus according to claim 8,
wherein the feature vector is passed through an activation function after it has been delivered to the output layer, and the activation function at the output layer is a softmax function that maps the values held by the nodes of the output layer to probability values for the voice presence probability.

The apparatus according to claim 13,
wherein the voice presence determination unit determines a frame to be a voice presence interval when the probability value that has passed through the activation function is greater than or equal to the threshold, and determines the frame to be a noise interval in which no voice is present when the probability value that has passed through the activation function is smaller than the threshold.