KR101893684B1

KR101893684B1 - Method and Apparatus for deep learning based algorithm for speech intelligibility prediction of vocoders

Info

Publication number: KR101893684B1
Application number: KR1020170024613A
Authority: KR
Inventors: 김남수; 배수현; 최인규
Original assignee: 국방과학연구소
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2018-08-30

Abstract

The present invention relates to a technique for evaluating the intelligibility of speech that has passed through a vocoder and, more specifically, to a deep learning-based method and apparatus that can determine, for speech sent through various kinds of vocoders, the difference in intelligibility between the original speech before the vocoder and the vocoder-passed speech after it. The method comprises the following steps. (a) An evaluation module receives an arbitrary original speech signal and the vocoder-passed speech generated by a vocoder. (b) The evaluation module divides the original speech and the vocoder-passed speech into time frames and extracts speech features from each frame for use as feature vectors. (c) The evaluation module applies the feature vectors to a deep neural network (DNN) regression model to calculate a speech intelligibility difference for each frame. (d) The evaluation module sums the per-frame speech intelligibility differences to calculate an intelligibility difference score for the entire original speech.

Description

Deep learning-based method and apparatus for evaluating the intelligibility of vocoder-passed speech

The present invention relates to a technique for evaluating the intelligibility of speech that has passed through a vocoder and, more particularly, to a deep learning-based method and apparatus for evaluating vocoder-passed speech intelligibility that can determine, for speech sent through various kinds of vocoders, the difference in intelligibility between the original speech before the vocoder and the vocoder-passed speech after it.

Because the basic purpose of a vocoder is to convey speech information through compression over a limited channel, the speech intelligibility of a vocoder is a very important factor. STOI (short-time objective intelligibility measure) exists as an algorithm for judging speech intelligibility, but because it is centered on judging the intelligibility of general speech, it is not specialized for vocoder-passed speech.

In addition, because STOI judges speech intelligibility from the linear correlation between the original speech and the processed speech, it cannot properly model the complex nonlinear relationship between actual speech and intelligibility.

Deep learning, which has been actively studied in recent years, models complex nonlinear relationships between an input and a desired output and is effective for classification and regression problems. A deep learning-based algorithm aimed at judging the intelligibility of vocoder-passed speech is therefore needed.

1. Korean Patent Laid-Open Publication No. 10-2016-0000680 (Title: Mobile phone intelligibility enhancement apparatus for a wideband vocoder and voice output apparatus using the same)
2. Korean Patent Laid-Open Publication No. 1020010073378 (Title: Method for improving the sound quality of a speech vocoder through formant post-filtering using multi-order LPC coefficients)

1. Cho, Yong-Deok, "A Study on Voice Activity Detection, Noise Reduction, and Speech Coding for Voice Communication," doctoral dissertation, University of Surrey, UK, 2007.

The present invention has been proposed to solve the problems of the background art described above. Its object is to provide a deep learning-based method and apparatus for evaluating vocoder-passed speech intelligibility that can judge the speech intelligibility of a vocoder by calculating the difference in intelligibility between speech before it passes through the vocoder and speech after it has passed through the vocoder.

To achieve this object, the present invention provides a deep learning-based vocoder-passed speech intelligibility evaluation method that can judge the speech intelligibility of a vocoder by calculating the difference in intelligibility between speech before it passes through the vocoder and speech after it has passed through the vocoder.

The deep learning-based vocoder-passed speech intelligibility evaluation method comprises:

(a) receiving, by an evaluation module, an arbitrary original speech signal and vocoder-passed speech generated by a vocoder;

(b) dividing, by the evaluation module, the arbitrary original speech and the vocoder-passed speech into time frames, extracting speech features from each frame, and using them as feature vectors;

(c) applying, by the evaluation module, the feature vectors to a DNN (Deep Neural Network) regression model to calculate a speech intelligibility difference for each frame; and

(d) summing, by the evaluation module, the per-frame speech intelligibility differences to calculate an intelligibility difference score for the entire arbitrary original speech.

Here, step (b) may comprise: dividing the arbitrary original speech and the vocoder-passed speech into time frames and extracting speech feature vectors from each frame; and combining the speech feature vectors into a single vector for use as the feature vector of each frame.

In addition, step (c) may determine the speech intelligibility difference by modeling the linear correlation between the feature vector of the arbitrary original speech and the feature vector of the vocoder-passed speech, together with the complex nonlinear relationship, using the DNN regression model.

In addition, the DNN regression model is generated by a DNN training procedure so that, when a feature vector is input, it outputs the speech intelligibility difference for that input. The DNN training procedure may comprise: preparing training data for DNN training; extracting feature vectors from the training data; training with the feature vectors as input and a preset target MOS (mean opinion score) score as the target; adjusting the weight values in the direction that reduces the difference between the target MOS score and the output score of the DNN; and finally adjusting the DNN so that its output equals the target MOS.

In addition, step (d) may comprise: calculating a speech presence probability for each frame by applying voice activity detection (VAD) to the arbitrary original speech; summing the per-frame speech intelligibility difference scores over the entire speech, each weighted in proportion to its speech presence probability; and dividing by the sum of the per-frame weight values to scale the result to a preset range, and then outputting the final speech intelligibility difference score.

In addition, the speech feature vectors may include a spectro-temporal (time-frequency) feature, pitch, and linear prediction coefficients (LPC).

In addition, the training data may consist of original speech data obtained by adding noise to speech data to reflect real environments, vocoder-passed speech data generated by passing the original speech data through several kinds of vocoders, and target MOS (mean opinion score) scores, each being an averaged speech intelligibility difference score obtained by evaluating the intelligibility difference between the original speech and the vocoder-passed speech.
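As an illustration of how noisy "original speech" for the training set might be prepared, the following Python sketch mixes a noise recording into clean speech at a chosen level. It assumes that the 0 to 20 dB figure given later in the detailed description refers to the signal-to-noise ratio, which the patent does not state explicitly. The function name `mix_at_snr` and the synthetic signals are hypothetical.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add it."""
    noise = np.resize(noise, speech.shape)  # repeat or truncate the noise to match length
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise to the power implied by the target SNR.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)  # stand-in for a clean utterance
noise = rng.standard_normal(8000)                             # white noise
noisy = mix_at_snr(speech, noise, snr_db=10.0)                # assumed SNR within 0 to 20 dB
```

The resulting `noisy` signal would then be passed through each vocoder under test to obtain the paired vocoder-passed speech.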

Another embodiment of the present invention may provide a deep learning-based vocoder-passed speech intelligibility evaluation apparatus comprising: a vocoder that generates vocoder-passed speech from an arbitrary original speech signal; and an evaluation module that receives the arbitrary original speech and the vocoder-passed speech, divides them into time frames, extracts speech features from each frame for use as feature vectors, applies the feature vectors to a DNN (Deep Neural Network) regression model to calculate a speech intelligibility difference for each frame, and sums the per-frame speech intelligibility differences to calculate an intelligibility difference score for the entire arbitrary original speech.

According to the present invention, the change in intelligibility that speech undergoes when it passes through a vocoder can be measured, and this information can be used to judge the performance of the vocoder quantitatively.

As another effect of the present invention, the use of deep learning makes it possible to model in detail the nonlinear relationship between the speech signal and speech intelligibility, which general speech intelligibility evaluation algorithms do not take into account, and thereby to judge speech intelligibility more accurately.

FIG. 1 is a block diagram of deep learning-based vocoder-passed speech intelligibility evaluation according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram illustrating a DNN (Deep Neural Network) regression model in a deep learning-based vocoder-passed speech intelligibility evaluation algorithm according to an embodiment of the present invention.
FIG. 3 is a flowchart showing the DNN regression model training process in a deep learning-based vocoder-passed speech intelligibility evaluation algorithm according to an embodiment of the present invention.

The present invention may be modified in various ways and may have several embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes falling within its spirit and technical scope.

In describing the drawings, like reference numerals are used for like elements.

Terms such as "first" and "second" may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another.

For example, without departing from the scope of the present invention, a first element may be termed a second element, and similarly a second element may be termed a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs.

Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in this application, should not be interpreted in an idealized or overly formal sense.

Hereinafter, a deep learning-based vocoder-passed speech intelligibility evaluation method and apparatus according to an embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a deep learning-based vocoder-passed speech intelligibility evaluation apparatus (100) according to an embodiment of the present invention. Referring to FIG. 1, the apparatus (100) may comprise a vocoder (120) that generates vocoder-passed speech (30) from original speech (10), and an evaluation module (140) that extracts feature vectors from the original speech (10) and the vocoder-passed speech (30), models them with a DNN regression model to calculate the speech intelligibility difference for each frame, and converts the result into a speech intelligibility difference score (50) for output.

The evaluation module (140) may comprise a feature vector extraction unit (141) that extracts feature vectors containing speech intelligibility information from the input acoustic signals (that is, the original speech and the vocoder-passed speech), a DNN (Deep Neural Network) regression model unit (143) that receives the feature vectors from the feature vector extraction unit (141) and calculates the speech intelligibility difference for each frame, and a post-processing unit (145) that combines the per-frame speech intelligibility differences calculated by the DNN regression model unit (143) and outputs the speech intelligibility difference for the entire input speech.

The feature vector extraction unit (141) extracts feature vectors containing intelligibility information from the input speech, the original speech (10). It divides the input original speech signal and the vocoder-passed speech signal generated by the vocoder (120) into time frames, extracts from each frame speech feature vectors such as a spectro-temporal (time-frequency) feature, pitch, and linear prediction coefficients (LPC), and combines the extracted feature vectors into a single vector so that it can be used as the feature vector of that frame.
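The framing and per-frame feature concatenation described above can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the 8 kHz sampling rate, the 25 ms frame with a 10 ms hop, the log-magnitude spectrum used as a stand-in for the spectro-temporal feature, the 60 to 400 Hz pitch search range, and the LPC order of 10 are all choices made here, and every function name is hypothetical.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def frame_features(frame, sr, lpc_order=10):
    """Concatenate a log spectrum, a pitch estimate, and LPC coefficients for one frame."""
    windowed = frame * np.hanning(len(frame))
    log_spec = np.log(np.abs(np.fft.rfft(windowed)) + 1e-8)
    # Autocorrelation at non-negative lags.
    ac = np.correlate(windowed, windowed, mode="full")[len(frame) - 1:]
    # Pitch: strongest autocorrelation peak within an assumed 60-400 Hz range.
    lo, hi = sr // 400, sr // 60
    pitch_hz = sr / (lo + np.argmax(ac[lo:hi]))
    # LPC by the autocorrelation method: solve the (regularized) Toeplitz normal equations.
    lags = np.abs(np.arange(lpc_order)[:, None] - np.arange(lpc_order)[None, :])
    R = ac[lags] + 1e-6 * np.eye(lpc_order)
    lpc = np.linalg.solve(R, ac[1:lpc_order + 1])
    return np.concatenate([log_spec, [pitch_hz], lpc])

def extract_features(original, vocoded, sr=8000, frame_ms=25, hop_ms=10):
    """Return one combined feature vector per frame for the speech pair."""
    frame_len, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    pairs = zip(frame_signal(original, frame_len, hop),
                frame_signal(vocoded, frame_len, hop))
    return np.array([np.concatenate([frame_features(fo, sr), frame_features(fv, sr)])
                     for fo, fv in pairs])

sr = 8000
t = np.arange(sr) / sr
rng = np.random.default_rng(0)
original = np.sin(2 * np.pi * 200 * t) + 0.01 * rng.standard_normal(sr)
vocoded = np.round(original * 8) / 8  # crude quantization as a stand-in for a real vocoder
features = extract_features(original, vocoded, sr)
```

Each row of `features` holds the original-speech features followed by the vocoder-passed features for the same frame, which is one plausible reading of "combined into a single vector".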

The DNN regression model unit (143) receives the speech feature vectors from the feature vector extraction unit (141) and determines the speech intelligibility difference for each frame. In the present invention, a DNN regression model is used to calculate the per-frame speech intelligibility difference; the concept of the DNN regression model is described with reference to FIG. 2.

Referring again to FIG. 1, the post-processing unit (145) combines the per-frame speech intelligibility differences calculated by the DNN regression model unit (143) and outputs the speech intelligibility difference for the entire speech. The post-processing unit (145) first applies VAD (voice activity detection) to the original speech signal to calculate a speech presence probability for each frame.

The per-frame speech intelligibility differences obtained from the DNN regression model are then weighted in proportion to the calculated speech presence probabilities and summed over the entire speech. This is because a more accurate result is obtained when the intelligibility evaluation concentrates on the portions where speech is actually present. The summed speech intelligibility difference score (50) is divided by the sum of the per-frame weight values to scale it to the correct range and is then output as the final result.
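The post-processing step amounts to a weighted mean of the per-frame scores, with the speech presence probabilities as weights. A minimal Python sketch follows; it assumes the per-frame VAD probabilities are already available from some detector (the patent does not name one), and the function name `aggregate_score` is hypothetical.

```python
import numpy as np

def aggregate_score(frame_diffs, speech_prob):
    """Weight per-frame intelligibility differences by VAD probability and normalize."""
    frame_diffs = np.asarray(frame_diffs, dtype=float)
    weights = np.asarray(speech_prob, dtype=float)
    total = weights.sum()
    if total == 0.0:
        raise ValueError("no speech detected: all VAD probabilities are zero")
    # Dividing by the sum of the weights keeps the result in the range of the frame scores.
    return float(np.dot(weights, frame_diffs) / total)

# The first frame is silence (probability 0), so it does not affect the result.
score = aggregate_score([1.0, 3.0, 2.0], [0.0, 1.0, 1.0])
print(score)  # 2.5
```

Because the weights are normalized, the final score stays within the range of the per-frame DNN outputs regardless of how long the utterance is.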

In other words, the algorithm takes the original speech and the vocoder-passed speech as input and outputs the intelligibility difference between the two.

FIG. 2 is a conceptual diagram illustrating a DNN (Deep Neural Network) regression model in a deep learning-based vocoder-passed speech intelligibility evaluation algorithm according to an embodiment of the present invention. Referring to FIG. 2, the DNN regression model used in an embodiment of the present invention is a model trained to output, when a feature vector is input, the speech intelligibility difference for that input.

Training here means the process of finding the values of the nodes (211, 221, 231) and weights (lines) of the neural network shown in FIG. 2 that produce results close to the correct answer. The network consists of input (210), hidden (220), and output (230) stages. The hidden stage (220) models the relationship between the input (210) and the output (230) as a nonlinear function and consists of one or more layers. Training a DNN requires well-constructed training data. The DNN training process, including the collection of training data, is as follows.

① The following are prepared as training data: original speech, obtained by mixing speech data of various sentences uttered by male and female speakers of various ages, from their teens to their sixties, with various noises such as white noise, babble, and car noise at 0 to 20 dB; vocoder speech, obtained by passing the original speech through various vocoders such as MELP (Mixed Excitation Linear Prediction) and AMR (Adaptive Multi-Rate); and MOS (mean opinion score) scores, obtained by having several test subjects rate the speech intelligibility difference between the two and averaging the ratings.

② For the speech feature vector, speech features such as the time-frequency feature, pitch, and linear prediction coefficients (LPC) of the original speech and the vocoder-passed speech are extracted and used.

③ A DNN (Deep Neural Network) is trained with the extracted feature vectors as input and the MOS score, the speech intelligibility difference score judged by humans, as the target. So that the target MOS score is output for a given input, the weight values are adjusted in the direction that reduces the difference between the target MOS score and the output score of the DNN, until the output of the DNN is as close as possible to the MOS.

④ The trained DNN is stored so that it can be used in the subsequent speech intelligibility measurement process, that is, in the deep learning-based vocoder-passed speech intelligibility evaluation apparatus (100) shown in FIG. 1.

When test data is input to the DNN regression model trained in this way, the intelligibility difference between the original speech and the vocoder-passed speech for that input can be output. The DNN regression model shown in FIG. 2 is a general DNN regression model according to one embodiment; it has one hidden layer, but three or more hidden layers may be used depending on the embodiment.
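The input, hidden, and output structure can be illustrated with a small forward pass in Python with NumPy. This is a sketch only: the layer widths, the ReLU activation, the weight initialization, and the 224-dimensional input are assumptions made for illustration, and `init_mlp` and `forward` are hypothetical names.

```python
import numpy as np

def init_mlp(layer_sizes, seed=0):
    """Create (weight, bias) pairs for consecutive layers of the given sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(params, x):
    """Map a batch of frame feature vectors to one regression output per frame."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)  # hidden layer with ReLU (assumed activation)
    W, b = params[-1]
    return (h @ W + b).squeeze(-1)      # linear output layer: one score per frame

# Three hidden layers of 64 units each; the hidden depth is set by the list length.
params = init_mlp([224, 64, 64, 64, 1])
frame_scores = forward(params, np.zeros((5, 224)))
```

Changing the list passed to `init_mlp` switches between the single hidden layer drawn in FIG. 2 and a deeper variant without touching `forward`.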

FIG. 3 is a flowchart showing the DNN regression model training process in a deep learning-based vocoder-passed speech intelligibility evaluation algorithm according to an embodiment of the present invention. Referring to FIG. 3, training data is generated by passing the first to n-th training speech data (310-1 to 310-n) through various vocoders to generate vocoder-passed speech (step S310). The training data may of course also include the target MOS (mean opinion score) score, which is the averaged speech intelligibility difference score obtained by evaluating the intelligibility difference between the original speech and the vocoder-passed speech.

Feature vectors are then extracted from the training data (step S320).

DNN training is then carried out with the speech intelligibility scores obtained through the MOS test as the target (steps S330 and S340).
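The weight adjustment "in the direction that reduces the difference" between the DNN output and the target MOS can be sketched as gradient descent on a mean squared error. The patent does not name a loss function or an optimizer, so the MSE loss, the plain full-batch gradient descent, the single tanh hidden layer, and the synthetic data below are all assumptions; `train_regressor` is a hypothetical name.

```python
import numpy as np

def train_regressor(X, y, hidden=16, lr=0.02, epochs=500, seed=0):
    """Fit a one-hidden-layer regressor to targets y by gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    W1 = rng.standard_normal((X.shape[1], hidden)) * np.sqrt(1.0 / X.shape[1])
    b1 = np.zeros(hidden)
    W2 = rng.standard_normal(hidden) * np.sqrt(1.0 / hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        h = np.tanh(X @ W1 + b1)            # hidden activations
        pred = h @ W2 + b2                  # predicted score per example
        err = pred - y                      # difference from the target score
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the MSE gradient through the output and hidden layers.
        g_pred = 2.0 * err / len(y)
        g_W2 = h.T @ g_pred
        g_b2 = g_pred.sum()
        g_h = np.outer(g_pred, W2) * (1.0 - h ** 2)
        g_W1 = X.T @ g_h
        g_b1 = g_h.sum(axis=0)
        # Move every weight against its gradient so the error shrinks.
        W1 -= lr * g_W1
        b1 -= lr * g_b1
        W2 -= lr * g_W2
        b2 -= lr * g_b2
    return (W1, b1, W2, b2), losses

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 8))   # stand-in feature vectors
y = 3.0 + np.tanh(X[:, 0])          # stand-in target MOS scores
params, losses = train_regressor(X, y)
```

In practice the trained parameters would be saved after this loop and loaded by the evaluation module at measurement time.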

10: Original speech
30: Vocoder speech
50: Speech intelligibility difference score
100: Deep learning-based vocoder-passed speech intelligibility evaluation apparatus
120: Vocoder
140: Evaluation module
141: Feature vector extraction unit
143: DNN (Deep Neural Network) regression model unit
145: Post-processing unit

Claims

A deep learning-based vocoder-passed speech intelligibility evaluation method comprising:
(a) receiving, by an evaluation module, an arbitrary original speech signal and vocoder-passed speech generated by a vocoder;
(b) dividing, by the evaluation module, the arbitrary original speech and the vocoder-passed speech into time frames, extracting speech features from each frame, and using them as feature vectors;
(c) applying, by the evaluation module, the feature vectors to a DNN (Deep Neural Network) regression model to calculate a speech intelligibility difference for each frame; and
(d) summing, by the evaluation module, the per-frame speech intelligibility differences to calculate an intelligibility difference score for the entire arbitrary original speech,
wherein step (d) comprises:
calculating a speech presence probability for each frame by applying voice activity detection (VAD) to the arbitrary original speech;
summing the per-frame speech intelligibility difference scores over the entire speech, each weighted in proportion to the speech presence probability; and
dividing by the sum of the per-frame weight values to scale the result to a preset range, and then outputting a final speech intelligibility difference score.

The method according to claim 1,
wherein step (b) comprises:
dividing the arbitrary original speech and the vocoder-passed speech into time frames and extracting speech feature vectors from each frame; and
combining the speech feature vectors into a single vector and using it as the feature vector of each frame.

The method according to claim 1,
wherein step (c) determines the speech intelligibility difference by modeling the linear correlation between the feature vector of the arbitrary original speech and the feature vector of the vocoder-passed speech, together with the complex nonlinear relationship, using the DNN regression model.

The method according to claim 1,
wherein the DNN regression model is generated by a DNN training procedure so that, when a feature vector is input, it outputs the speech intelligibility difference for that input, the DNN training procedure comprising:
preparing training data for DNN training;
extracting feature vectors from the training data;
training with the feature vectors as input and a preset target MOS (mean opinion score) score as the target;
adjusting the weight values in the direction that reduces the difference between the target MOS score and the output score of the DNN (Deep Neural Network); and
finally adjusting the DNN so that its output equals the target MOS.

(Deleted)

The method according to claim 2,
wherein the speech feature vectors include a spectro-temporal feature, pitch, and linear prediction coefficients (LPC).

The method according to claim 4,
wherein the training data consists of original speech data obtained by adding noise to speech data to reflect real environments, vocoder-passed speech data generated by passing the original speech data through several kinds of vocoders, and target MOS (mean opinion score) scores, each being an averaged speech intelligibility difference score obtained by evaluating the intelligibility difference between the original speech and the vocoder-passed speech.

A deep learning-based vocoder-passed speech intelligibility evaluation apparatus comprising:
a vocoder that generates vocoder-passed speech from an arbitrary original speech signal; and
an evaluation module that receives the arbitrary original speech and the vocoder-passed speech, divides them into time frames, extracts speech features from each frame for use as feature vectors, applies the feature vectors to a DNN (Deep Neural Network) regression model to calculate a speech intelligibility difference for each frame, and sums the per-frame speech intelligibility differences to calculate an intelligibility difference score for the entire arbitrary original speech,
wherein the evaluation module
calculates a speech presence probability for each frame by applying voice activity detection (VAD) to the arbitrary original speech, sums the per-frame speech intelligibility difference scores over the entire speech, each weighted in proportion to the speech presence probability, divides the result by the sum of the per-frame weight values to scale it to a preset range, and outputs a final speech intelligibility difference score.