KR102042344B1

KR102042344B1 - Apparatus for judging the similarity between voices and the method for judging the similarity between voices

Info

Publication number: KR102042344B1
Application number: KR1020180049336A
Authority: KR
Inventors: 구도영
Original assignee: (주)투미유
Priority date: 2018-04-27
Filing date: 2018-04-27
Publication date: 2019-11-27
Also published as: KR20190125064A

Abstract

The present invention relates to a voice similarity determination apparatus and a voice similarity determination method that determine the similarity between a test voice and a reference voice extracted from a video file or the like, and provide that similarity as a score.

Description

Apparatus for judging the similarity between voices and the method for judging the similarity between voices

The present invention relates to a voice similarity determination apparatus and a voice similarity determination method. More specifically, it determines the similarity between a reference speech signal and a test speech signal, converts it into a score, and provides the score to the learner, so that the learning result is reflected objectively and the learner is motivated to improve learning efficiency.

Shadowing is a learning technique in which the learner repeats what is heard, either almost simultaneously or right after hearing it. Because the learner first hears the voice of the material and then hears his or her own voice repeating it, this two-step process demands more effort than simply listening, and it is known to help the learner retain information longer. In other words, when shadowing, the learner can concentrate more and process more information than in simple listening, the burden of speaking is reduced because the learner follows the speaker's words, and the learner can pay attention to the language itself, all of which increases learning efficiency. For this reason, many experts recommend shadowing, in which comprehension and speech take place at the same time, as a method for learning foreign languages and the like.

FIG. 1 is a view showing a smartphone.

A smartphone is a terminal device (10) with functions similar to those of a personal computer (PC). The term refers to a mobile phone that combines conventional mobile phone functions with personal information management and wireless Internet service functions. Users can install and use various applications on such smartphones, and smartphones have recently come into wide use for learning foreign languages and the like. That is, many applications are available for practicing speaking, listening, writing, and so on for language learning, and applications with a recording function for shadowing practice are also available.

However, such existing applications provide only a simple recording function, or only a pass/non-pass type of evaluation, and therefore cannot give accurate feedback on shadowing-based learning.

The present invention has been devised to solve the above problem. It aims to provide a voice similarity determination method and a voice similarity determination apparatus that give the user an objective learning result by scoring the similarity between a reference speech signal and a test speech signal uttered by the user.

It also aims to provide a voice similarity determination method and a voice similarity determination apparatus with which similarity can be judged in terms of intonation as well as pronunciation.

To achieve the above object, the present invention provides a voice similarity determination method comprising: calculating a first speech length from a reference speech signal; extracting first MFCCs (mel-frequency cepstral coefficients) from the reference speech signal; calculating a second speech length from a test speech signal; extracting second MFCCs from the test speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW (dynamic time warping) scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio.

In addition, the present invention provides a voice similarity determination method comprising: receiving a first speech length and first MFCCs of a reference speech signal from an external server; calculating a second speech length from a test speech signal; extracting second MFCCs from the test speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio.

In addition, the present invention provides a voice similarity determination method comprising: receiving a second speech length and second MFCCs of a test speech signal from a terminal device; calculating a first speech length from a reference speech signal; extracting first MFCCs from the reference speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio.

In addition, the present invention provides a voice similarity determination method comprising: receiving a second speech length and second MFCCs of a test speech signal from a terminal device; receiving a first speech length and first MFCCs from a reference speech database; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW scores between the first MFCCs and the second MFCCs; converting the DTW scores into a mapping score through mapping; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio.

In addition, the present invention provides a voice similarity determination method in which the weight function has a weight value of 1 when the utterance length ratio is 1, and a weight value of 0 when the utterance length ratio is 0.5 or less or 1.5 or more.

In addition, the present invention provides a voice similarity determination method in which the weight function has a slope of 2 in the interval where the utterance length ratio is between 0.5 and 1, and a slope of -2 in the interval where the utterance length ratio is between 1 and 1.5.
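
The two conditions above (value 1 at a ratio of 1, value 0 at or beyond 0.5 and 1.5, slopes of 2 and -2 in between) describe a triangular function. The following is a minimal sketch of that function and of the intonation (accent) score built from it; the function names and the zero-length guard are assumptions made for illustration, not part of the patent text.

```python
def length_weight(ratio: float) -> float:
    """Triangular weight: 1 at ratio 1, 0 at or beyond 0.5 and 1.5."""
    if ratio <= 0.5 or ratio >= 1.5:
        return 0.0
    if ratio <= 1.0:
        return 2.0 * (ratio - 0.5)    # slope +2 on [0.5, 1]
    return 1.0 - 2.0 * (ratio - 1.0)  # slope -2 on [1, 1.5]


def intonation_score(mapping_score: float, ref_length: float, test_length: float) -> float:
    """Mapping score multiplied by the weighted utterance length ratio."""
    if test_length <= 0:
        # Assumption: an empty test utterance is treated as an error.
        raise ValueError("test speech length must be positive")
    ratio = ref_length / test_length  # first length divided by second length
    return mapping_score * length_weight(ratio)
```

With this shape, a test utterance that is half as long or twice as long as the reference (ratios of 2 and 0.5) receives an intonation score of 0 regardless of the mapping score.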

In addition, the present invention provides a voice similarity determination method further comprising, after the step of calculating the intonation score by multiplying the pronunciation score by the weighted utterance length ratio: calculating a weighted pronunciation score by multiplying the mapping score by a pronunciation weight; calculating a weighted intonation score by multiplying the intonation score by an intonation weight; and calculating a final similarity score by adding the weighted pronunciation score and the weighted intonation score.

In addition, the present invention provides a voice similarity determination method in which the pronunciation weight is 0.52 and the intonation weight is 0.48.
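
As a small worked sketch of the weighted sum, assuming both scores share the same scale (the scale is not stated in the text) and using hypothetical names:

```python
PRONUNCIATION_WEIGHT = 0.52  # value stated in the text
INTONATION_WEIGHT = 0.48     # value stated in the text


def final_similarity(pronunciation_score: float, intonation_score: float) -> float:
    """Weighted pronunciation score plus weighted intonation (accent) score."""
    weighted_pronunciation = PRONUNCIATION_WEIGHT * pronunciation_score
    weighted_intonation = INTONATION_WEIGHT * intonation_score
    return weighted_pronunciation + weighted_intonation
```

For example, a pronunciation score of 80 and an intonation score of 40 give 0.52 × 80 + 0.48 × 40 = 60.8. Because the two weights sum to 1, the final score stays within the range of the input scores.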

In addition, the present invention provides a voice similarity determination method in which, in the step of calculating DTW scores between the first MFCCs and the second MFCCs, the DTW scores are obtained by summing the values along a path in a DTW distance matrix constructed from the first MFCCs and the second MFCCs.
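
The path sum over a DTW distance matrix can be sketched as below. This is an illustrative implementation under assumptions the text does not state: Euclidean distance between MFCC frames, the standard three-way step pattern (vertical, horizontal, diagonal), and a hypothetical function name `dtw`.

```python
import math


def dtw(ref_frames, test_frames):
    """Return (path_cost, path_length) of a minimum-cost warping path.

    Each argument is a non-empty sequence of equal-dimension feature
    vectors (for example, one MFCC vector per frame).
    """
    n, m = len(ref_frames), len(test_frames)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    # Distance matrix: Euclidean distance between frame i and frame j.
    dist = [[math.dist(ref_frames[i], test_frames[j]) for j in range(m)]
            for i in range(n)]
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]   # accumulated cost along the best path
    steps = [[0] * m for _ in range(n)]    # number of cells on that path
    cost[0][0], steps[0][0] = dist[0][0], 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)  # ties prefer the shorter path
            cost[i][j] = dist[i][j] + best_cost
            steps[i][j] = best_steps + 1
    return cost[-1][-1], steps[-1][-1]
```

Returning the path length alongside the path cost makes it possible to divide one by the other later, so that long and short utterances yield comparable scores.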

In addition, the present invention provides a voice similarity determination method in which, in the step of mapping the DTW scores to convert them into a mapping score, the DTW scores are converted into the mapping score using an equation that takes the minimum of the DTW scores as the maximum score and the maximum of the DTW scores as the minimum score; the DTW score is obtained by summing the values along the optimal path in the DTW distance matrix and normalizing the sum by the length of the optimal path; and the DTW maximum score and the DTW minimum score are preset.
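
The mapping equation itself does not appear in the text, so the sketch below assumes the simplest form consistent with the description: a linear interpolation in which the preset DTW minimum maps to the top score and the preset DTW maximum maps to the bottom score. The 0 to 100 range, the clamping of out-of-range values, and the function names are all assumptions.

```python
def normalized_dtw_score(path_cost: float, path_length: int) -> float:
    """Sum of values on the optimal path divided by the path length."""
    return path_cost / path_length


def mapping_score(dtw_score: float, dtw_min: float, dtw_max: float,
                  score_min: float = 0.0, score_max: float = 100.0) -> float:
    """Map a DTW score to a mapping score (assumed linear, inverted)."""
    if dtw_max <= dtw_min:
        raise ValueError("dtw_max must be greater than dtw_min")
    # Clamp so that scores outside the preset bounds stay in range.
    clamped = min(max(dtw_score, dtw_min), dtw_max)
    fraction = (dtw_max - clamped) / (dtw_max - dtw_min)  # 1 at dtw_min, 0 at dtw_max
    return score_min + fraction * (score_max - score_min)
```

With preset bounds of 10 and 50, a normalized DTW score of 30 maps to 50 under this assumption. A nonlinear mapping would satisfy the description equally well.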

In addition, the present invention provides a voice similarity determination method further comprising, before the step of calculating the first speech length from the reference speech signal, removing noise and/or background sound from the reference speech signal and/or the test speech signal.

In addition, the present invention provides a voice similarity determination method further comprising, before the step of calculating the second speech length from the test speech signal, removing noise and/or background sound from the test speech signal.

In addition, the present invention provides a voice similarity determination method further comprising, before the step of calculating the first speech length from the reference speech signal, removing noise and/or background sound from the reference speech signal.

In addition, the present invention provides a voice similarity determination method in which the noise and/or background sound is removed by a Wiener filter.
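
The text does not say which form of Wiener filter is used. As an illustration only, the sketch below implements a simple local (time-domain, adaptive) Wiener filter that shrinks each sample toward its local mean according to the local variance. Speech systems often use a frequency-domain Wiener filter instead; the window size, the noise estimate (mean of the local variances), and the function name are assumptions.

```python
def wiener_denoise(signal, window=5, noise_power=None):
    """Local adaptive Wiener filter over a 1-D list of samples."""
    n = len(signal)
    half = window // 2
    means, variances = [], []
    for i in range(n):
        # The window is truncated at the edges of the signal.
        segment = signal[max(0, i - half): i + half + 1]
        mean = sum(segment) / len(segment)
        means.append(mean)
        variances.append(sum((x - mean) ** 2 for x in segment) / len(segment))
    if noise_power is None:
        # Assumption: estimate noise power as the average local variance.
        noise_power = sum(variances) / n if n else 0.0
    output = []
    for x, mean, var in zip(signal, means, variances):
        if var <= noise_power:
            output.append(mean)  # region dominated by noise: keep only the mean
        else:
            gain = 1.0 - noise_power / var
            output.append(mean + gain * (x - mean))
    return output
```

Regions whose local variance is well above the noise estimate pass through almost unchanged, while flat, noisy regions are smoothed toward their local mean.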

In addition, the present invention provides a voice similarity determination method in which the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a DFT (discrete Fourier transform), a mel-scale filter bank, and a DCT (discrete cosine transform) to the input signal, and are extracted by applying CMN (cepstral mean normalization), for energy normalization, to the coefficients up to the 13th order.
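
The processing chain named above can be sketched end to end in plain Python. Everything not listed in the text is an assumption chosen for illustration: the pre-emphasis coefficient 0.97, 25 ms frames with a 10 ms hop at 16 kHz, 26 mel filters, a naive DFT in place of an FFT, and keeping coefficients 0 through 12 as the 13 coefficients.

```python
import cmath
import math


def pre_emphasis(signal, alpha=0.97):
    # y[t] = x[t] - alpha * x[t-1] boosts high frequencies.
    return [signal[0]] + [signal[t] - alpha * signal[t - 1] for t in range(1, len(signal))]


def hamming(n):
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def power_spectrum(frame):
    # Naive O(N^2) DFT kept for readability; a real system would use an FFT.
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2 / n
            for k in range(n // 2 + 1)]


def mel_filterbank(n_filters, n_fft, sample_rate):
    def to_mel(hz):
        return 2595.0 * math.log10(1.0 + hz / 700.0)

    def to_hz(mel):
        return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

    top = to_mel(sample_rate / 2.0)
    edges = [to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * hz / sample_rate) for hz in edges]
    bank = []
    for f in range(1, n_filters + 1):
        left, center, right = bins[f - 1], bins[f], bins[f + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):      # rising edge of the triangle
            row[k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank


def dct2(values, n_out):
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(n_out)]


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160, n_filters=26, n_coeffs=13):
    """Return one list of n_coeffs mean-normalized coefficients per frame."""
    if len(signal) < frame_len:
        return []
    emphasized = pre_emphasis(signal)
    window = hamming(frame_len)
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    frames = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + i] * window[i] for i in range(frame_len)]
        spectrum = power_spectrum(frame)
        energies = [sum(w * p for w, p in zip(row, spectrum)) for row in bank]
        frames.append(dct2([math.log(e + 1e-10) for e in energies], n_coeffs))
    # CMN: subtract each coefficient's mean over all frames.
    means = [sum(f[c] for f in frames) / len(frames) for c in range(n_coeffs)]
    return [[f[c] - means[c] for c in range(n_coeffs)] for f in frames]
```

The output is a list of 13-dimensional vectors, one per frame, which is the kind of sequence a DTW comparison would take as input. Very short frames leave some low-frequency mel filters empty, so the defaults use a 400-sample frame.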

In addition, the present invention provides a voice similarity determination method comprising: calculating a second speech length from a test speech signal; extracting second MFCCs from the second speech interval signal; and transmitting the second speech length and the second MFCCs to an external server.

In addition, the present invention provides a voice similarity determination apparatus comprising: a first speech interval detection module that calculates a first speech length from a reference speech signal; a second speech interval detection module that calculates a second speech length from a test speech signal; an utterance length ratio calculation module that receives the first speech length from the first speech interval detection module and the second speech length from the second speech interval detection module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a first MFCC extraction module that extracts first MFCCs from the reference speech signal; a second MFCC extraction module that extracts second MFCCs from the test speech signal; a DTW module that receives the first MFCCs from the first MFCC extraction module and the second MFCCs from the second MFCC extraction module, and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained through mapping by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through mapping.

In addition, the present invention provides a voice similarity determination apparatus comprising: a storage module that receives a first speech length and first MFCCs of a reference speech signal from an external server and stores them; a second speech interval detection module that calculates a second speech length from a test speech signal; an utterance length ratio calculation module that receives the second speech length from the second speech interval detection module and the first speech length from the storage module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a second MFCC extraction module that extracts second MFCCs from the test speech signal; a DTW module that receives the first MFCCs from the storage module and the second MFCCs from the second MFCC extraction module, and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained through mapping by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through mapping.

In addition, the present invention provides a voice similarity determination apparatus comprising: a storage module that receives a second speech length and second MFCCs of a test speech from a terminal device and stores them; a first speech interval detection module that calculates a first speech length from a reference speech signal; an utterance length ratio calculation module that receives the first speech length from the first speech interval detection module and the second speech length from the storage module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a first MFCC extraction module that extracts first MFCCs from the reference speech signal; a DTW module that receives the second MFCCs from the storage module and the first MFCCs from the first MFCC extraction module, and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained through mapping by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through mapping.

In addition, the present invention provides a voice similarity determination apparatus comprising: a storage module that receives a second speech length and second MFCCs of a test speech from a terminal device and stores them; a reference speech database that stores a first speech length and first MFCCs of a reference speech; an utterance length ratio calculation module that receives the first speech length from the reference speech database and the second speech length from the storage module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a DTW module that receives the first MFCCs from the reference speech database and the second MFCCs from the storage module, and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained through mapping by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through mapping.

In addition, the present invention provides a voice similarity determination apparatus in which the weight function has a weight value of 1 when the utterance length ratio is 1, and a weight value of 0 when the utterance length ratio is 0.5 or less or 1.5 or more.

In addition, the present invention provides a voice similarity determination apparatus in which the weight function has a slope of 2 in the interval where the utterance length ratio is between 0.5 and 1, and a slope of -2 in the interval where the utterance length ratio is between 1 and 1.5.

In addition, the present invention provides a voice similarity determination apparatus further comprising a final similarity score calculation module that calculates a weighted pronunciation score by multiplying the pronunciation score received from the pronunciation score calculation module by a pronunciation weight, calculates a weighted intonation score by multiplying the intonation score received from the intonation score calculation module by an intonation weight, and then calculates a final similarity score by adding the weighted pronunciation score and the weighted intonation score.

In addition, the present invention provides a voice similarity determination apparatus in which the pronunciation weight is 0.52 and the intonation weight is 0.48.

In addition, the present invention provides a voice similarity determination apparatus in which the DTW module calculates the DTW scores by summing the values along a path in a DTW distance matrix constructed from the first MFCCs and the second MFCCs.

In addition, the present invention provides a voice similarity determination apparatus in which the mapping score used by the intonation score calculation module and/or the pronunciation score calculation module is obtained by converting the DTW scores received from the DTW module using an equation that takes the minimum of the DTW scores as the maximum score and the maximum of the DTW scores as the minimum score; the DTW score is obtained by summing the values along the optimal path in the DTW distance matrix and normalizing the sum by the length of the optimal path; and the DTW maximum score and the DTW minimum score are preset.

In addition, the present invention provides a voice similarity determination apparatus further comprising: a first filter module that removes noise and/or background sound from the reference speech signal and passes the result to the first speech interval detection module; and a second filter module that removes noise and/or background sound from the test speech signal and passes the result to the second speech interval detection module.

In addition, the present invention provides a voice similarity determination apparatus further comprising a second filter module that removes noise and/or background sound from the test speech signal and passes the result to the second speech interval detection module.

In addition, the present invention provides a voice similarity determination apparatus further comprising a second filter module that removes noise and/or background sound from the reference speech signal and passes the result to the second speech interval detection module.

In addition, the present invention provides a voice similarity determination apparatus in which the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a DFT (discrete Fourier transform), a mel-scale filter bank, and a DCT (discrete cosine transform) to the input signal, and are extracted by applying CMN (cepstral mean normalization), for energy normalization, to the coefficients up to the 13th order.

In addition, the present invention provides a voice similarity determination method comprising: calculating a first speech length from a reference speech signal; extracting first MFCCs (mel-frequency cepstral coefficients) from the reference speech signal; calculating a second speech length from a test speech signal; extracting second MFCCs from the test speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW (dynamic time warping) scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio, wherein the first speech length and/or the second speech length is detected using one or more of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC (linear predictive coding) coefficient, prediction error, differential energy, and SNR (signal-to-noise ratio).
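
Two of the listed features, energy and zero-crossing rate, are enough for a simple illustration of how a speech length could be detected. The sketch below is not the patent's detector: the non-overlapping frames, both thresholds, the rule that speech needs high energy and a low zero-crossing rate, and the function name are assumptions.

```python
def speech_length(signal, sample_rate, frame_len=400,
                  energy_threshold=0.01, zcr_threshold=0.25):
    """Seconds between the first and last frame judged to contain speech."""
    speech_frames = []
    starts = range(0, len(signal) - frame_len + 1, frame_len)
    for index, start in enumerate(starts):
        frame = signal[start:start + frame_len]
        # Short-time energy: mean squared amplitude of the frame.
        energy = sum(x * x for x in frame) / frame_len
        # Zero-crossing rate: fraction of adjacent pairs that change sign.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        zcr = crossings / (frame_len - 1)
        if energy > energy_threshold and zcr < zcr_threshold:
            speech_frames.append(index)
    if not speech_frames:
        return 0.0
    first, last = speech_frames[0], speech_frames[-1]
    return (last - first + 1) * frame_len / sample_rate
```

Running the same detector on the reference and the test signal yields the two lengths whose quotient is the utterance length ratio. Measuring from the first to the last speech frame means pauses inside the utterance count toward its length.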

In addition, the present invention provides a voice similarity determination method comprising: receiving a first speech length and first MFCCs of a reference speech signal from an external server; calculating a second speech length from a test speech signal; extracting second MFCCs from the test speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio, wherein the first speech length and/or the second speech length is detected using one or more of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR.

In addition, the present invention provides a voice similarity determination method comprising: receiving a second speech length and second MFCCs of a test speech signal from a terminal device; calculating a first speech length from a reference speech signal; extracting first MFCCs from the reference speech signal; calculating an utterance length ratio by dividing the first speech length by the second speech length; calculating DTW scores between the first MFCCs and the second MFCCs; mapping the DTW scores to convert them into a mapping score; calculating a weighted utterance length ratio by applying a weight function to the utterance length ratio; and calculating an intonation score by multiplying the mapping score by the weighted utterance length ratio, wherein the first speech length and/or the second speech length is detected using one or more of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR.

The present invention also provides a voice similarity determination apparatus comprising: a first speech section detection module that calculates a first speech length from a reference voice signal; a second speech section detection module that calculates a second speech length from a test voice signal; an utterance length ratio calculation module that receives the first speech length from the first speech section detection module and the second speech length from the second speech section detection module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a first MFCC extraction module that extracts first MFCCs from the reference voice signal; a second MFCC extraction module that extracts second MFCCs from the test voice signal; a DTW module that receives the first MFCCs from the first MFCC extraction module and the second MFCCs from the second MFCC extraction module and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained by mapping the DTW scores by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping, wherein the first speech section detection module calculates the first speech length using at least one of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR (signal-to-noise ratio), and the second speech section detection module calculates the second speech length using at least one of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR (signal-to-noise ratio).
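As a rough illustration of how a speech section detection module might derive a speech length from two of the listed cues (energy and zero-crossing rate), consider the sketch below. The frame size, the thresholds, and the rule that a frame counts as speech when its energy is high and its zero-crossing rate is low are all assumptions for illustration; the text only states that one or more of the listed features is used.

```python
def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into fixed-length frames taken every `hop` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def speech_length(samples, sample_rate, frame_len=160, hop=160,
                  energy_thresh=1e-3, zcr_thresh=0.5):
    """Estimate the speech duration in seconds.

    A frame is counted as speech when its energy exceeds `energy_thresh`
    and its zero-crossing rate stays below `zcr_thresh` (both thresholds
    are hypothetical values chosen for this sketch).
    """
    voiced = 0
    for frame in frame_signal(samples, frame_len, hop):
        if (short_time_energy(frame) > energy_thresh
                and zero_crossing_rate(frame) < zcr_thresh):
            voiced += 1
    return voiced * hop / sample_rate
```

Running the same function on the reference signal and on the test signal gives the first and second speech lengths whose quotient is the utterance length ratio.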

The present invention also provides a voice similarity determination apparatus comprising: a storage module that receives a first speech length and first MFCCs of a reference voice signal from an external server and stores them; a second speech section detection module that calculates a second speech length from a test voice signal; an utterance length ratio calculation module that receives the second speech length from the second speech section detection module and the first speech length from the storage module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a second MFCC extraction module that extracts second MFCCs from the test voice signal; a DTW module that receives the first MFCCs from the storage module and the second MFCCs from the second MFCC extraction module and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained by mapping the DTW scores by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping, wherein the second speech section detection module calculates the second speech length using at least one of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR.

The present invention also provides a voice similarity determination apparatus comprising: a storage module that receives a second speech length and second MFCCs of a test voice from a terminal device and stores them; a first speech section detection module that calculates a first speech length from a reference voice signal; an utterance length ratio calculation module that receives the first speech length from the first speech section detection module and the second speech length from the storage module, and calculates an utterance length ratio by dividing the first speech length by the second speech length; a first MFCC extraction module that extracts first MFCCs from the reference voice signal; a DTW module that receives the second MFCCs from the storage module and the first MFCCs from the first MFCC extraction module and calculates DTW scores; an intonation score calculation module that calculates a weighted utterance length ratio by applying a weight function to the utterance length ratio received from the utterance length ratio calculation module, receives the DTW scores from the DTW module, and calculates an intonation score by multiplying the mapping score obtained by mapping the DTW scores by the weighted utterance length ratio; and a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping, wherein the first speech section detection module calculates the first speech length using at least one of energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order LPC coefficient, prediction error, differential energy, and SNR (signal-to-noise ratio).

The voice similarity determination method and the voice similarity determination apparatus according to the present invention provide the following effects.

First, the similarity between the reference voice and the test voice that the learner produced by listening and repeating is scored and fed back, which improves the learner's learning efficiency. It also gives the learner motivation to study.

Second, because a score is calculated and provided for intonation as well as pronunciation, the learner can compare the test voice they uttered with the reference voice from several angles. That is, a speaker's pronunciation characteristics and intonation characteristics are highly correlated, so information on intonation characteristics can also be provided by using the MFCCs, which represent pronunciation features, together with the utterance length ratio.

Third, the technique for determining the similarity between a reference voice and a test voice according to the present invention can later be extended to the field of speech recognition.

Fourth, in embodiments where the work is divided between the external server and the terminal device, the load that arises during preprocessing of the voice signals and during similarity determination can be distributed and reduced. In addition, in the example where the first speech length and the first MFCCs of the reference voice signal are stored in a database in advance, the similarity between the reference voice signal and the test voice signal can be calculated, scored, and provided faster and more efficiently.

Fifth, the voice similarity determination apparatus and method according to the present invention process the received voice signal through a mel-scale filter bank so that it reflects human auditory characteristics, which enables a more efficient similarity analysis.

Sixth, because the voice similarity determination apparatus according to the present invention provides scored evaluations of pronunciation or intonation, the scores can serve as objective evidence of whether the user's pronunciation is improving. In addition, by accumulating and storing this score data, the apparatus can analyze and report where the strengths and weaknesses in the user's speech lie.

FIG. 1 is a view showing a smartphone.
FIG. 2 is a block diagram showing the internal configuration of the voice similarity determination apparatus according to the present invention.
FIG. 3 is a view showing a situation in which the voice similarity determination apparatus according to the present invention may be provided in modified form.
FIG. 4 is a view showing a modified embodiment of the voice similarity determination apparatus according to the present invention.
FIG. 5 is a view showing another modified embodiment of the voice similarity determination apparatus according to the present invention.
FIG. 6 is a view showing yet another modified embodiment of the voice similarity determination apparatus according to the present invention.
FIG. 7 is a block diagram showing an example of the filter module of the voice similarity determination apparatus according to the present invention.
FIG. 8 is a view showing speech waveforms and spectrograms before and after Wiener filtering.
FIG. 9 is a view showing an example of the speech section detection module of the voice similarity determination apparatus according to the present invention.
FIG. 10 is a view showing another example of the speech section detection module of the voice similarity determination apparatus according to the present invention.
FIG. 11 is a block diagram showing the internal configuration of the MFCC extraction module of the voice similarity determination apparatus according to the present invention.
FIG. 12 is a view showing the process of extracting features from a voice signal as the frame moves.
FIG. 13 is a view showing the shape of a Hamming window.
FIG. 14 is a view showing the triangular filter bank provided by the mel-scale filter bank section.
FIG. 15 is a view showing the process by which DTW scores are calculated by the DTW module of the voice similarity determination apparatus of the present invention.
FIG. 16 is a view showing the process by which a DTW score is converted into a mapping score by the pronunciation score calculation module and/or the intonation score calculation module of the voice similarity determination apparatus of the present invention.
FIG. 17 is a view showing the weight function used in the intonation score calculation module of the voice similarity determination apparatus of the present invention.
FIG. 18 is a flowchart showing the processing procedure of the voice similarity determination method according to the present invention.
FIG. 19 is a flowchart of a modified embodiment of the voice similarity determination method of the present invention.
FIG. 20 is a flowchart of another embodiment of the voice similarity determination method of the present invention.
FIG. 21 is a flowchart of yet another embodiment of the voice similarity determination method of the present invention.
FIG. 22 is a flowchart showing the process of transmitting the second speech length and the second MFCCs from the terminal device of the voice similarity determination apparatus of the present invention to the external server.
FIG. 23 is a table summarizing the correlation between the scores calculated by the voice similarity determination apparatus and method according to the present invention and the scores given by expert raters.
FIG. 24 is a table summarizing the correlation between the scores obtained by feeding the pronunciation score and intonation score calculated by the present invention into a linear regression model and the scores given by expert raters.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. Because the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to the specific forms disclosed; it should be understood to cover all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

In the description of the present invention, terms such as "first" and "second" are used only to distinguish one component from another and are not used to limit meaning in any way. A singular expression includes the plural unless the context clearly indicates otherwise, and terms such as "include" or "have" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification; they should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

FIG. 2 is a block diagram showing the internal configuration of the voice similarity determination apparatus according to the present invention.

The voice similarity determination apparatus according to the present invention may broadly comprise a part that processes a reference voice signal contained in a video or the like, a part that processes a test voice signal, which is the voice uttered by the user, and a part that determines and scores the similarity between the two voice signals. More specifically, the part that processes the reference voice signal may include a first filter module 111, a first speech section detection module 121, and a first MFCC (mel-frequency cepstral coefficients) extraction module 131, and the part that processes the test voice signal may include a second filter module 112, a second speech section detection module 122, and a second MFCC extraction module 132. Here, the first filter module 111 and the second filter module 112, the first speech section detection module 121 and the second speech section detection module 122, and the first MFCC extraction module 131 and the second MFCC extraction module 132 have the same internal configuration and role as each other. Because they receive and process different signals, however, the modules that process the reference voice signal are labeled "first" and the modules that process the test voice signal are labeled "second" for ease of distinction. The part that determines and scores the similarity between the voice signals may include a DTW (dynamic time warping) module 150, an utterance length ratio calculation module 140, a pronunciation score calculation module 160, an intonation score calculation module 170, and a final similarity score calculation module 180.

More specifically, noise and/or background sound in the reference voice signal may be removed as the signal passes through the first filter module 111, and the first speech section detection module 121 then distinguishes which parts of the reference voice signal are speech sections and which are silent sections. The first MFCC extraction module 131 then extracts the first MFCCs, which represent the characteristics of the reference voice signal. The test voice signal goes through the same process: noise and/or background sound is removed, the speech and silent sections are recognized, and the second MFCCs are extracted, by the second filter module 112, the second speech section detection module 122, and the second MFCC extraction module 132, respectively. The detailed operation of each module is described further below with reference to the drawings.

The similarity between the reference voice signal and the test voice signal may be determined through the following process.

First, the first speech length calculated by the first speech section detection module 121 and the second speech length calculated by the second speech section detection module 122 are passed to the utterance length ratio calculation module 140, which calculates the utterance length ratio by dividing the first speech length by the second speech length. The DTW module 150 receives the first MFCCs extracted by the first MFCC extraction module 131 and the second MFCCs extracted by the second MFCC extraction module 132 and calculates DTW scores. These DTW scores are passed to the pronunciation score calculation module 160, where they are converted into a pronunciation score through a mapping process. The intonation score calculation module 170 applies a weight function to the utterance length ratio received from the utterance length ratio calculation module 140 to obtain a weighted utterance length ratio, and then multiplies the mapping score, obtained by mapping the DTW scores, by this weighted utterance length ratio to calculate the intonation score. The pronunciation score and the intonation score calculated in this way are passed to the final similarity score calculation module 180, where weights are applied to convert them into a final similarity score. The specific operation of each module is described in detail below with reference to the drawings.
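The DTW step and the final weighted combination described above can be sketched as follows. This is a textbook dynamic time warping recurrence, not necessarily the exact scoring used by the DTW module 150: the Euclidean frame distance, the normalization by the combined sequence length, and the parameter `pron_weight` are assumptions introduced for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two MFCC vectors of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_score(ref_mfccs, test_mfccs):
    """Length-normalized DTW alignment cost between two MFCC sequences.

    cost[i][j] holds the cheapest cumulative cost of aligning the first
    i reference frames with the first j test frames.
    """
    n, m = len(ref_mfccs), len(test_mfccs)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(ref_mfccs[i - 1], test_mfccs[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # reference frame repeated
                                 cost[i][j - 1],      # test frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m] / (n + m)


def final_similarity(pronunciation_score, intonation_score, pron_weight=0.5):
    """Weighted combination of the two scores (the weight is hypothetical)."""
    return pron_weight * pronunciation_score + (1.0 - pron_weight) * intonation_score
```

Because DTW lets one sequence stretch against the other, a test utterance that holds a sound longer than the reference can still align at zero cost; the difference in duration is captured separately by the utterance length ratio.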

With this configuration, the voice similarity determination apparatus according to the present invention scores and provides the similarity between the reference voice and the test voice, and can thereby give the learner useful feedback.

FIG. 3 is a view showing a situation in which the voice similarity determination apparatus according to the present invention may be provided in modified form.

As described above, the voice similarity determination apparatus according to the present invention may receive both the reference voice signal and the test voice signal in a single integrated device and determine and score the similarity between them. Alternatively, as shown in the drawing, the similarity between the two voice signals may be determined through cooperation among a terminal device 10 such as a smartphone, an external server 20, and a reference voice database 190. That is, the work may be divided so that heavy-load tasks, such as preprocessing the reference voice signal contained in video files (movies, advertisements, dramas, and the like) and determining similarity, are performed on the external server 20, while only receiving the test voice signal and presenting the result of the similarity determination are handled by a terminal device 10 such as a smartphone. Embodiments in which the voice similarity determination apparatus is divided into two parts in this way are described in more detail below with reference to the drawings.

FIG. 4 is a view showing a modified embodiment of the voice similarity determination apparatus according to the present invention.

The voice similarity determination apparatus according to the present invention may be configured as a single device as in FIG. 2, but its functions may also be divided so that the reference voice signal is processed on the external server 20, while processing of the test voice signal and determination of the similarity between the reference and test voice signals take place on the terminal device 10. That is, all preprocessing of the reference voice signal is performed on the external server 20 side, and the terminal device 10 receives only the first speech length and the first MFCCs of the reference voice signal, then calculates and scores the similarity with the test voice signal.

Through this division of work, the voice similarity determination apparatus according to the present invention can distribute and reduce the load that arises during preprocessing of the voice signals and during similarity determination.

FIG. 5 is a view showing another modified embodiment of the voice similarity determination apparatus according to the present invention.

Although the embodiment described above distributes the load, the load on the terminal device 10 remains considerably higher than the load on the external server 20. An external server 20 generally exceeds a terminal device 10 such as a smartphone in computing power and capacity, so it may be preferable to reverse the roles. That is, it may be preferable for the terminal device 10 to perform only the preprocessing of the test voice signal and to transmit only the resulting second speech length and second MFCCs to the external server 20, so that the external server 20 determines the similarity between the reference voice signal and the test voice signal. With this modified embodiment, the voice similarity determination apparatus according to the present invention can substantially reduce the amount of work that each of the many terminal devices 10 has to perform.

FIG. 6 is a view showing yet another modified embodiment of the voice similarity determination apparatus according to the present invention.

In the two modified embodiments described above, the preprocessing is performed on demand each time: noise and/or background sound is removed from the reference voice signal and/or the test voice signal, the speech and silent sections are recognized, and the speech length and MFCCs are extracted. For the reference voice signal, however, this preprocessing can be performed in advance, and the extracted first speech length and first MFCCs can be stored in a database. That is, when the second speech length and the second MFCCs of the test voice signal are received from the terminal device 10, the first speech length and the first MFCCs are retrieved from the reference voice database 190, which stores the values extracted from reference voice signals contained in movies, advertisements, dramas, and the like, and are passed to the utterance length ratio calculation module 140, the DTW module 150, and so on. The similarity between the reference voice signal and the test voice signal can thus be calculated, scored, and provided more quickly.
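The idea of preprocessing reference clips once and looking the results up later can be sketched with a small in-memory store. The class and function names below (`ReferenceFeatures`, `ReferenceVoiceDatabase`, `clip_id`) are hypothetical and exist only for this illustration; the text does not specify how the reference voice database 190 is keyed or implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceFeatures:
    """Precomputed features of one reference clip."""
    speech_length: float  # first speech length, in seconds
    mfccs: tuple          # first MFCCs, one tuple of coefficients per frame


class ReferenceVoiceDatabase:
    """Minimal in-memory stand-in for a reference voice database."""

    def __init__(self):
        self._store = {}

    def put(self, clip_id: str, features: ReferenceFeatures) -> None:
        """Store features that were extracted ahead of time."""
        self._store[clip_id] = features

    def get(self, clip_id: str) -> ReferenceFeatures:
        """Return stored features; raises KeyError for an unknown clip."""
        return self._store[clip_id]


def utterance_length_ratio(db: ReferenceVoiceDatabase, clip_id: str,
                           second_speech_length: float) -> float:
    """Divide the stored first speech length by the received second speech length."""
    reference = db.get(clip_id)
    return reference.speech_length / second_speech_length
```

In this arrangement only the test-side features need to be computed at request time; the reference side is a lookup.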

FIG. 7 is a block diagram showing an example of the filter module of the voice similarity determination apparatus according to the present invention.

As described above, the first filter module (111), which removes noise and/or background sound from the reference voice signal, and the second filter module (112), which removes noise and/or background sound from the test voice signal, carry the modifiers 'first' and 'second' only to distinguish them. Their roles and configurations may be identical, so they are described here collectively as the filter module. The same applies below to the first voice section detection module (121) and the second voice section detection module (122), and to the first MFCC extraction module (131) and the second MFCC extraction module (132).

The filter module may include a noise spectrum estimation section (110-2), a fast Fourier transform (FFT) section (110-1), a Wiener filter section (110-3), and an inverse fast Fourier transform (IFFT) section (110-4), and with this configuration can remove noise and/or background sound contained in a voice signal. More specifically, the noise spectrum estimation section (110-2) estimates the noise spectrum in sections that contain no voice and passes it to the Wiener filter section (110-3). The Wiener filter section (110-3) then removes noise and/or background sound from the voice signal by minimizing the difference between the Fourier-transformed voice signal delivered by the FFT section (110-1) and the filtered signal. The voice signal denoised by the Wiener filter section (110-3) is inverse-transformed by the IFFT section (110-4) and may be passed to the voice section detection module.

This process can be summarized in equations as follows.

A voice signal is a mixture of noise and a clean signal and can be written as

$$y[n] = s[n] + z[n]$$

where y[n] is the observed voice signal, s[n] is the clean signal, and z[n] is the noise signal.

The output obtained when this voice signal passes through the filter, with impulse response h[m], can be expressed as

$$\hat{s}[n] = \sum_{m} h[m]\, y[n-m]$$

Here, the Wiener filter section (110-3) adjusts the filter so that the mean square error (MSE) between the estimated signal $\hat{s}[n]$ and the clean signal s[n] is minimized, which may be done according to the following expression.

$$\varepsilon = E\left\{ \left( s[n] - \hat{s}[n] \right)^{2} \right\}$$

From the above expression, the Wiener-Hopf equation is derived as follows, where R denotes a correlation function,

$$R_{sy}[l] = \sum_{m} h[m]\, R_{yy}[l-m]$$

and taking its Fourier transform yields the following, where S denotes a power spectral density.

$$S_{sy}(f) = H(f)\, S_{yy}(f), \qquad H(f) = \frac{S_{sy}(f)}{S_{yy}(f)}$$

Here, if the clean signal s[n] and the noise signal z[n] are statistically independent, then

$$S_{sy}(f) = S_{ss}(f)$$

and

$$S_{yy}(f) = S_{ss}(f) + S_{zz}(f)$$

so that the following expression is finally obtained.

$$H(f) = \frac{S_{ss}(f)}{S_{ss}(f) + S_{zz}(f)}$$
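The filtering step can be illustrated with a minimal sketch. It assumes that per-bin power spectra of the noisy signal and of the estimated noise are already available, and that the clean-signal power is approximated as the noisy power minus the noise power; that approximation, the function name `wiener_gain`, and the `floor` parameter are illustrative assumptions, not details taken from the description. The gain applied to each bin is the standard Wiener ratio of clean power to clean-plus-noise power.

```python
def wiener_gain(noisy_power, noise_power, floor=0.0):
    """Per-bin Wiener gain S_ss / (S_ss + S_zz).

    S_ss is approximated as max(S_yy - S_zz, 0), so the denominator
    S_ss + S_zz is taken to be the noisy power S_yy (an assumption).
    """
    gains = []
    for p_yy, p_zz in zip(noisy_power, noise_power):
        if p_yy <= 0.0:
            # No energy in this bin: nothing to keep.
            gains.append(floor)
            continue
        p_ss = max(p_yy - p_zz, 0.0)  # clean-signal power estimate, clipped at zero
        gains.append(max(p_ss / p_yy, floor))
    return gains


# Hypothetical three-bin example: only the first bin is dominated by speech.
noisy_power = [4.0, 1.0, 0.5]
noise_power = [1.0, 1.0, 1.0]
gains = wiener_gain(noisy_power, noise_power)
```

Bins whose power does not exceed the noise estimate are attenuated to the floor, while bins dominated by speech are passed almost unchanged.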

FIG. 8 shows voice waveforms and spectrograms before and after Wiener filtering.

In FIG. 8, (a) shows the waveform and spectrogram of the voice signal before it passes through the Wiener filter section (110-3), and (b) shows the waveform and spectrogram of the voice signal after it has passed through the Wiener filter section (110-3). The figure confirms that components such as noise contained in the voice signal are effectively removed.

By removing noise and/or background sound from the reference voice signal and the test voice signal through this process, the voice similarity determination apparatus according to the present invention makes the subsequent detection of voice sections and identification of features in both signals easier.

FIG. 9 shows an example of the voice section detection module of the voice similarity determination apparatus according to the present invention.

As a preprocessing step for reducing the DTW alignment error rate and computing a reliable similarity, the voice similarity determination apparatus of the present invention may detect the voice sections in a voice signal. This may be done by the first voice section detection module (121) and the second voice section detection module (122), respectively.

A voice section may be detected using one or more of the following: energy, zero-crossing rate, autocorrelation coefficient, increase/decrease coefficient, first-order linear predictive coding (LPC) coefficient, prediction error, differential energy, and signal-to-noise ratio (SNR). To this end, the voice section detection module may include a frame-unit processing section (120-1), a third Hamming window section (120-2), a frame-unit energy calculation section (120-3), a zero-crossing rate calculation section (120-4), and a voice section determination section (120-5). More specifically, the voice section detection module processes the voice signal, from which the filter module has removed noise and/or background sound, frame by frame in the frame-unit processing section (120-1). It then passes the frames to an energy calculation branch, consisting of the third Hamming window section (120-2) and the frame-unit energy calculation section (120-3), and to a zero-crossing rate calculation branch, consisting of the zero-crossing rate calculation section (120-4). The third voice section determination section (120-5) detects the energy and zero-crossing sections of the received voice signal and thereby distinguishes voice sections from silent sections in the voice signal.
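The two branches described above can be sketched as follows. The decision rule (a frame counts as voice when its energy is above one threshold and its zero-crossing rate is below another) and the threshold values are hypothetical choices made for illustration; the description does not specify how the determination section combines the two measurements. Frames are assumed to contain at least two samples.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0.0) != (b >= 0.0)
    )
    return crossings / (len(frame) - 1)


def is_voice_frame(frame, energy_threshold=0.01, zcr_threshold=0.5):
    """Hypothetical rule: enough energy and not too many sign changes."""
    return (
        frame_energy(frame) > energy_threshold
        and zero_crossing_rate(frame) < zcr_threshold
    )


# Illustrative frames: a slowly varying loud frame, near-silence, and a
# rapidly alternating frame that resembles broadband noise.
voiced = [0.5, 0.6, 0.5, 0.4, -0.4, -0.5, -0.6, -0.5]
silence = [0.001, -0.001] * 4
noise_like = [0.3, -0.3] * 4
labels = [is_voice_frame(f) for f in (voiced, silence, noise_like)]
```

A practical detector would typically adapt the thresholds to the measured background level and smooth the per-frame decisions over time.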

In this way, the voice similarity determination apparatus according to the present invention can reduce the alignment error rate in the subsequent voice similarity determination.

FIG. 10 shows another example of the voice section detection module of the voice similarity determination apparatus according to the present invention.

Alternatively, the voice section detection module of the voice similarity determination apparatus according to the present invention may include a filter bank section (120-6), a band energy estimation section (120-7), a voice section detection section (120-8), a background noise adaptation section (120-9), a band SNR summation section (120-10), a threshold adaptation section (120-11), and a fourth voice section determination section (120-12). This configuration implements the standard voice activity detection (VAD) of the European Telecommunications Standards Institute (ETSI). It divides the input voice signal into nine sub-bands and then determines the voice sections by calculating the sub-band energy using the SNR estimate for each sub-band.

FIG. 11 is a block diagram showing the internal configuration of the MFCC extraction module of the voice similarity determination apparatus according to the present invention.

The voice similarity determination apparatus according to the present invention extracts MFCCs from each of the reference voice signal and the test voice signal and determines the similarity between them using DTW. To this end, the MFCC extraction module may include a pre-emphasis section (130-1), a fourth Hamming window section (130-2), a discrete Fourier transform (DFT) section (130-3), a mel-scale filter bank section (130-4), a log section (130-5), and a discrete cosine transform (DCT) section (130-6).

The role of each section of the MFCC extraction module is briefly reviewed below.

The pre-emphasis section (130-1) of the MFCC extraction module may increase the energy of the high-frequency region to improve the accuracy of phoneme detection. The fourth Hamming window section (130-2) brings the signal values at the boundaries of each frame extracted from the voice close to zero and discretizes the frame. The DFT section (130-3) may extract spectrum information from the windowed frame delivered by the fourth Hamming window section (130-2); that is, it extracts the energy in each frequency band of the Hamming-windowed frame. Next, the energy of each frequency band extracted by the DFT section (130-3) is filtered through the mel-scale filter bank section (130-4), which is modeled on the human ear, and the log section (130-5), and the DCT section (130-6) then derives from this result the coefficient values that provide useful features for finding phonemes. In particular, the present invention uses 13th-order MFCCs and, assuming that the mean of each order of these MFCCs is the same over the entire utterance, applies cepstral mean normalization (CMN) for energy normalization. The formula for this is

$$\hat{c}_i[j] = c_i[j] - \bar{c}_i$$

where c_i[j] is the i-th MFCC of frame j, i is the MFCC order, $\bar{c}_i$ is the mean value of the i-th MFCC order, and j is the frame number.
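Cepstral mean normalization as described, subtracting from each coefficient the mean of that coefficient order over all frames, can be sketched as below. The data layout (a list of frames, each a list of coefficients) and the function name are assumptions made for the example, and the input is assumed to contain at least one frame.

```python
def cepstral_mean_normalize(mfccs):
    """Subtract the per-order mean over all frames from every frame.

    mfccs: list of frames, each a list of MFCC values of equal length.
    """
    num_frames = len(mfccs)
    order = len(mfccs[0])
    # Mean of coefficient i across all frames j.
    means = [sum(frame[i] for frame in mfccs) / num_frames for i in range(order)]
    return [[frame[i] - means[i] for i in range(order)] for frame in mfccs]


# Two frames with two coefficients each (real input would have 13 per frame).
frames = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(frames)
```

After normalization each coefficient order has zero mean across the utterance, which removes constant offsets such as those introduced by a recording channel.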

Through this MFCC extraction module, the voice similarity determination apparatus according to the present invention can effectively extract the features of the reference voice signal and the test voice signal.

FIG. 12 shows the process of extracting features from a voice signal as the frame moves along it.

Because speech is a non-stationary signal, features are not extracted over the entire uttered voice signal. Instead, the voice signal is assumed to be stationary within a small window, as shown in the figure, and the features are extracted there. The voice signal extracted through the window is defined as a frame.

FIG. 13 shows the shape of the Hamming window.

Each frame extracted from the voice is multiplied by a Hamming window, which brings the values of the voice signal at the window boundaries close to zero. This operation may be performed by the fourth Hamming window section (130-2) of the MFCC extraction module. The Hamming window can be expressed as follows.
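A minimal sketch of framing and windowing follows. It assumes the widely used Hamming definition w[n] = 0.54 - 0.46 cos(2πn / (N - 1)) for 0 ≤ n ≤ N - 1; the expression used in the description is not reproduced in this text, so these coefficients are an assumption. The frame and hop lengths are hypothetical parameters.

```python
import math


def hamming_window(length):
    """Hamming window w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1)), n = 0..N-1."""
    if length == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
        for n in range(length)
    ]


def split_into_frames(signal, frame_length, hop_length):
    """Slide a window over the signal and collect the complete frames."""
    last_start = len(signal) - frame_length
    return [
        signal[start:start + frame_length]
        for start in range(0, last_start + 1, hop_length)
    ]


def windowed_frames(signal, frame_length, hop_length):
    """Multiply every frame sample by the matching window coefficient."""
    window = hamming_window(frame_length)
    return [
        [x * w for x, w in zip(frame, window)]
        for frame in split_into_frames(signal, frame_length, hop_length)
    ]


window = hamming_window(5)
frames = windowed_frames([1.0] * 10, frame_length=4, hop_length=2)
```

The window equals 0.08 at both ends and 1.0 at its center, so samples near the frame boundaries are strongly attenuated without being forced exactly to zero.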

FIG. 14 shows the triangular filter bank provided by the mel-scale filter bank section.

The mel-scale filter bank section (130-4) of the MFCC extraction module has a filter bank whose filters, below 1 kHz, have uniformly spaced center frequencies and a bandwidth of 100 Hz, and whose center frequencies and bandwidths, above 1 kHz, increase sharply on a logarithmic scale. In the frequency domain, the mel-scale frequency may be defined as follows.
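The mel mapping can be sketched as below, assuming the commonly used definition mel(f) = 2595 log10(1 + f / 700); the description's own expression is not reproduced in this text, so the constants are an assumption. The helper `mel_center_frequencies`, which spaces filter centers uniformly on the mel axis, is likewise illustrative.

```python
import math


def hz_to_mel(frequency_hz):
    """Common mel-scale definition (assumed): 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + frequency_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters, low_hz, high_hz):
    """Center frequencies in Hz, equally spaced on the mel axis."""
    low_mel = hz_to_mel(low_hz)
    high_mel = hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (m + 1)) for m in range(num_filters)]


centers = mel_center_frequencies(5, 0.0, 4000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Equal steps in mel correspond to progressively wider steps in Hz, which reproduces the denser filter spacing at low frequencies that the paragraph describes.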

Filter banks come in various forms, such as triangular, rectangular, trapezoidal, and Gaussian. The mel-scale filter bank section (130-4) used in the present invention uses triangular weights as given below and is implemented by summing the values obtained after the weights are applied.

In the above expression, $H_m[k]$ denotes the m-th triangular filter of the bank and f[m] denotes its center frequency. Using this mel-scale filter bank section (130-4), the voice similarity determination apparatus according to the present invention can extract features from the received voice signal in a manner similar to the characteristics of the human ear.
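One way to realize such triangular weighting is sketched below. It assumes a unit-peak triangle that rises linearly from the previous center frequency to the filter's own center and falls linearly to the next center; whether the description's filters are normalized to unit peak or to unit area is not recoverable from this text, so that choice, along with the function names and bin-index arguments, is an assumption.

```python
def triangular_weight(k, left, center, right):
    """Weight of spectrum bin k for a triangle with the given edge bins."""
    if k <= left or k >= right:
        return 0.0
    if k <= center:
        return (k - left) / (center - left)   # rising edge
    return (right - k) / (right - center)     # falling edge


def filter_bank_energy(power_spectrum, left, center, right):
    """Sum of the power-spectrum bins after triangular weighting."""
    return sum(
        triangular_weight(k, left, center, right) * power
        for k, power in enumerate(power_spectrum)
    )


# Flat five-bin spectrum and a triangle spanning bins 0..4, centered on bin 2.
energy = filter_bank_energy([1.0] * 5, left=0, center=2, right=4)
```

Each filter therefore produces one number per frame, and the log and DCT stages operate on the resulting vector of band energies.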

FIG. 15 shows the process by which the DTW module of the voice similarity determination apparatus of the present invention calculates the DTW scores.

The DTW module (150) of the voice similarity determination apparatus according to the present invention receives the first MFCCs and the second MFCCs from the first MFCC extraction module (131) and the second MFCC extraction module (132), respectively, and can calculate the DTW scores. This may be done by finding the length of the optimal path that minimizes the warping cost. More specifically, DTW is an algorithm that finds the path minimizing the distance between two data sequences of different lengths, and K, the length of the DTW path, is bounded by an inequality such as the following.

Among the many warping paths, DTW(R, T), which minimizes the warping cost, is obtained as follows.

In addition, the expression for applying dynamic programming to find the optimal path is as follows.
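The expressions referred to in the preceding paragraphs are not reproduced in this text. In a common textbook formulation, given here as an assumption rather than as the description's exact notation, R has I frames, T has J frames, a warping path is $W = w_1, \dots, w_K$, and $d(r_i, t_j)$ is the local distance between two frames. The path length is then bounded by

$$\max(I, J) \le K \le I + J - 1$$

the length-normalized minimum warping cost is

$$\mathrm{DTW}(R, T) = \min_{W} \frac{1}{K} \sum_{k=1}^{K} d(w_k)$$

and the cumulative distance is filled in by dynamic programming as

$$\gamma(i, j) = d(r_i, t_j) + \min\{\gamma(i-1, j-1),\ \gamma(i-1, j),\ \gamma(i, j-1)\}$$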

Using the DTW scores obtained in this way, the optimal path is found in the DTW distance matrix. More specifically, the DTW scores along the optimal path are summed and then normalized by the length of the optimal path to calculate the final DTW score.
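The following sketch shows one way to compute such a length-normalized DTW score. It assumes Euclidean distance between MFCC frames and the three standard step directions (diagonal, vertical, horizontal); neither choice is stated in the description. The path length is tracked alongside the cumulative cost so that the total can be divided by it, and both inputs are assumed to be non-empty.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_score(reference, test):
    """Cumulative cost of the best warping path divided by its length."""
    n, m = len(reference), len(test)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    steps = [[0] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = euclidean(reference[i - 1], test[j - 1])
            # Cheapest predecessor; ties go to the shorter path.
            best_cost, best_steps = min(
                (cost[i - 1][j - 1], steps[i - 1][j - 1]),
                (cost[i - 1][j], steps[i - 1][j]),
                (cost[i][j - 1], steps[i][j - 1]),
            )
            cost[i][j] = local + best_cost
            steps[i][j] = best_steps + 1
    return cost[n][m] / steps[n][m]


# Same contour spoken at different speeds versus a constant offset.
stretched = dtw_score([[0.0], [1.0], [2.0]], [[0.0], [1.0], [1.0], [2.0]])
shifted = dtw_score([[0.0], [0.0]], [[1.0], [1.0]])
```

Because the repeated frame in the second sequence can be absorbed by the warping path, the stretched pair scores as identical, while a constant offset in the features is penalized in full.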

Through this DTW module (150), the voice similarity determination apparatus according to the present invention can effectively assess the similarity even between a reference voice signal and a test voice signal of different lengths.

FIG. 16 shows the process by which the pronunciation score calculation module and/or the intonation score calculation module of the voice similarity determination apparatus of the present invention converts the DTW score into a mapping score.

A smaller DTW score means a higher similarity between the reference voice signal and the test voice signal, so the DTW score needs to be converted to a 100-point scale that is familiar to users. This can be done with the following expression, which takes the minimum of the DTW scores as the maximum score and the maximum of the DTW scores as the minimum score.

Here, X denotes the DTW score, and the DTW maximum score and DTW minimum score are preset values that may be obtained from the entire data set used in implementing the present invention.

Among the values produced by the above expression, a converted score greater than 100 is counted as 100 and a score less than 0 is output as 0. The converted value may be used directly as the pronunciation score.
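The conversion can be sketched as a linear map followed by clipping. The linear form 100 × (DTW_max - X) / (DTW_max - DTW_min) is an assumption that matches the stated endpoints (the minimum DTW score maps to the maximum score and the maximum DTW score to the minimum score); the description's own expression is not reproduced in this text. The preset bounds used in the example are hypothetical.

```python
def map_dtw_to_score(dtw_score, dtw_min, dtw_max):
    """Map a DTW score onto a 0..100 scale where smaller distance is better."""
    raw = 100.0 * (dtw_max - dtw_score) / (dtw_max - dtw_min)
    # Scores above 100 are reported as 100 and scores below 0 as 0.
    return max(0.0, min(100.0, raw))


# Hypothetical preset bounds taken from some development data set.
DTW_MIN, DTW_MAX = 2.0, 10.0
scores = [map_dtw_to_score(x, DTW_MIN, DTW_MAX) for x in (1.0, 2.0, 6.0, 10.0, 12.0)]
```

Clipping matters because a new utterance can fall outside the range observed when the preset bounds were chosen.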

Through this pronunciation score calculation module (160), the voice similarity determination apparatus according to the present invention can objectively evaluate how similar the pronunciation of the test voice signal uttered by the user is to that of the reference voice signal and provide the result to the user. A simple pass/fail evaluation cannot tell users whether their pronunciation is improving. By contrast, the voice similarity determination apparatus according to the present invention provides the pronunciation evaluation as a score, which can serve as objective evidence of whether the user's pronunciation is improving.

FIG. 17 shows the weight function used in the intonation score calculation module of the voice similarity determination apparatus of the present invention.

The intonation score calculation module (170) of the voice similarity determination apparatus according to the present invention applies a weight function to the speech length ratio provided by the speech length ratio calculation module (140) to obtain a weighted speech length ratio, and multiplies this by the pronunciation score to calculate the intonation score. FIG. 17 shows an example of the weight function used in this process. More specifically, the function in the figure has a weight of 1 when the speech length ratio is 1 and a weight of 0 when the speech length ratio is 0.5 or less or 1.5 or more. Its slope is 2 where the speech length ratio is between 0.5 and 1, and -2 where the speech length ratio is between 1 and 1.5. This is only one example, however, and various weight functions may be applied as needed.
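The example weight function and the resulting intonation score can be sketched as follows. The closed form max(0, 1 - 2|r - 1|) is simply a compact way of writing the piecewise-linear shape described above; the function names are illustrative, and the test voice length is assumed to be non-zero.

```python
def length_ratio_weight(ratio):
    """Triangular weight: 1 at ratio 1, 0 at or beyond 0.5 and 1.5."""
    return max(0.0, 1.0 - 2.0 * abs(ratio - 1.0))


def intonation_score(pronunciation_score, reference_length, test_length):
    """Pronunciation score scaled by the weighted speech length ratio."""
    ratio = reference_length / test_length  # first length / second length
    return length_ratio_weight(ratio) * pronunciation_score


# A 3-second reference imitated in 4 seconds gives a ratio of 0.75.
example = intonation_score(80.0, reference_length=3.0, test_length=4.0)
```

An utterance that takes twice as long as the reference, or half as long, receives an intonation score of zero regardless of how well the individual sounds match.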

Through this intonation score calculation module (170), the voice similarity determination apparatus according to the present invention can score and provide not only the user's pronunciation characteristics but also the user's intonation characteristics.

FIG. 18 is a flowchart showing the processing steps of the voice similarity determination method according to the present invention.

First, in the voice similarity determination method according to the present invention, noise and/or background sound is removed from the reference voice signal and/or the test voice signal (S18-1). The first voice length is then calculated from the denoised reference voice signal (S18-2), and the first MFCCs are extracted (S18-3). A similar process is performed on the test voice signal: the second voice length is calculated (S18-4), and the second MFCCs are extracted (S18-5).

After the first voice length and first MFCCs have been extracted from the reference voice signal and the second voice length and second MFCCs from the test voice signal, the speech length ratio is calculated by dividing the first voice length by the second voice length (S18-6). The DTW scores between the first MFCCs and the second MFCCs are calculated (S18-7), and the calculated DTW scores are mapped and converted into a mapping score (S18-8). Next, the weight function is applied to the speech length ratio to obtain the weighted speech length ratio (S18-9), and the mapping score is multiplied by this weighted speech length ratio to calculate the intonation score (S18-10).

Once the intonation score and the pronunciation score are available, the two values are combined to produce the final similarity score, but they are not simply added arithmetically. The pronunciation score is calculated from the mapping score (S18-11) and multiplied by a pronunciation weight to yield a weighted pronunciation score (S18-12). Likewise, the intonation score is multiplied by an intonation weight to yield a weighted intonation score (S18-13). The weighted pronunciation score and the weighted intonation score are then added, and the result is provided as the final similarity score (S18-14).

By providing the final similarity score calculated through this process, the voice similarity determination method of the present invention can give the learner or user appropriate feedback and thereby improve learning efficiency.

FIG. 19 is a flowchart of a modified embodiment of the voice similarity determination method of the present invention.

As described above, the voice similarity determination apparatus of the present invention may be configured with its roles divided between the external server (20) and the terminal device (10). FIG. 19 shows the case in which the terminal device, having received the first voice length and first MFCCs of the reference voice signal from the external server (20), uses them to determine the similarity between the reference voice signal and the test voice signal.

The voice similarity determination method according to this modified embodiment may begin by receiving the first voice length and first MFCCs of the reference voice signal from the external server (20) or the like (S19-1). In parallel, the noise and/or background sound of the test voice signal is removed (S19-2), the second voice length is calculated from the test voice signal (S19-3), and the second MFCCs are extracted (S19-4); these steps may be performed at different times or simultaneously.

Once the second voice length and second MFCCs of the test voice signal have been extracted, the process described earlier may be performed in much the same way within the terminal device (10). More specifically, the speech length ratio is calculated by dividing the first voice length by the second voice length (S19-5), the DTW scores between the first MFCCs and the second MFCCs are calculated (S19-6), and these DTW scores are mapped and converted into a mapping score (S19-7). The weight function is then applied to the speech length ratio to obtain the weighted speech length ratio (S19-8), the weighted speech length ratio is applied to the mapping score to calculate the intonation score (S19-9), and the pronunciation score is calculated from the mapping score (S19-10).

After the pronunciation score and the intonation score have been calculated, a weight is applied to each to obtain the weighted pronunciation score (S19-11) and the weighted intonation score (S19-12), and the final similarity score is produced by summing the weighted pronunciation score and the weighted intonation score (S19-13). Here, the pronunciation weight may be 0.52 and the intonation weight may be 0.48.
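The final combination can be sketched as a weighted sum using the example weights stated above. The function name and the decision to expose the weights as parameters are illustrative choices.

```python
PRONUNCIATION_WEIGHT = 0.52  # example value given in the description
INTONATION_WEIGHT = 0.48     # example value given in the description


def final_similarity_score(
    pronunciation_score,
    intonation_score,
    pronunciation_weight=PRONUNCIATION_WEIGHT,
    intonation_weight=INTONATION_WEIGHT,
):
    """Sum of the weighted pronunciation and weighted intonation scores."""
    weighted_pronunciation = pronunciation_weight * pronunciation_score
    weighted_intonation = intonation_weight * intonation_score
    return weighted_pronunciation + weighted_intonation


# A pronunciation score of 80 with an intonation score of 40.
final = final_similarity_score(80.0, 40.0)
```

Because the two weights sum to 1, the final score stays on the same 0 to 100 scale as its inputs.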

Through this division of work, the voice similarity determination method according to the present invention can distribute and reduce the load arising from the preprocessing of the voice signals and the similarity determination.

FIG. 20 is a flowchart of another embodiment of the voice similarity determination method of the present invention.

In contrast to FIG. 19, FIG. 20 is a flowchart of the process in which the external server (20) receives the second voice length and second MFCCs transmitted by the terminal device (10) and determines the similarity between the reference voice signal and the test voice signal.

To this end, the voice similarity determination method according to this other embodiment receives the second voice length and second MFCCs of the test voice signal from the terminal device (10) (S20-1), removes noise and/or background sound from the reference voice signal (S20-2), and then calculates the first voice length from the reference voice signal (S20-3). The first MFCCs are also extracted (S20-4). Once the first voice length has been calculated and the first MFCCs extracted, the speech length ratio is calculated by dividing the first voice length by the second voice length (S20-5), the DTW scores between the first MFCCs and the second MFCCs are calculated (S20-6), and they are converted into a mapping score (S20-7). After that, the weighted speech length ratio is calculated (S20-8), the weighted speech length ratio is applied to the mapping score to calculate the intonation score (S20-9), and the pronunciation score is calculated from the mapping score (S20-10).

Finally, a weighted pronunciation score may be calculated by applying a pronunciation weight of 0.52 to the pronunciation score (S20-11), a weighted intonation score may be calculated by applying an intonation weight of 0.48 to the intonation score (S20-12), and the two values may be added to yield the final similarity score (S20-13).
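The weighted sum in steps S20-11 to S20-13 can be sketched as follows. This is a minimal illustration only: the weights 0.52 and 0.48 come from the text, while the function name and the assumption that both input scores share the same scale are hypothetical.

```python
PRONUNCIATION_WEIGHT = 0.52  # pronunciation weight stated in the text
INTONATION_WEIGHT = 0.48     # intonation weight stated in the text


def final_similarity_score(pronunciation_score: float, intonation_score: float) -> float:
    """Combine the two partial scores into the final similarity score.

    Assumes both scores are on the same scale (for example 0 to 100);
    the scale itself is not fixed by this sketch.
    """
    weighted_pronunciation = PRONUNCIATION_WEIGHT * pronunciation_score  # S20-11
    weighted_intonation = INTONATION_WEIGHT * intonation_score           # S20-12
    return weighted_pronunciation + weighted_intonation                  # S20-13
```

Because the two weights sum to 1, the final score stays on the same scale as the pronunciation and intonation scores.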

In this embodiment, the complex and computation-heavy steps are performed on the external server 20, which reduces the workload on the terminal device 10.

FIG. 21 is a flowchart showing yet another embodiment of the voice similarity determination method of the present invention.

Whereas the test voice signal must be processed each time the user speaks, the reference voice signal is a voice signal contained in a video or the like, so the first voice length and the first MFCCs can be extracted ahead of time through preprocessing such as the removal of noise and/or background sound. These data can also be stored in a database in advance and retrieved whenever they are needed for the similarity determination. FIG. 21 shows such an embodiment: the voice similarity determination method according to yet another embodiment may receive the second voice length and the second MFCCs from the terminal device 10 (S21-1), retrieve the first voice length and the first MFCCs from the reference voice database (S21-2), and calculate the speech length ratio by dividing the first voice length by the second voice length (S21-3). Then the DTW scores calculated between the first MFCCs and the second MFCCs (S21-4) are converted into a mapping score (S21-5), and the weighted speech length ratio, calculated by applying the weight function to the speech length ratio (S21-6), is applied to the mapping score to calculate the intonation score (S21-7). The pronunciation score is then calculated using the mapping score (S21-8). Once the pronunciation score and the intonation score have been obtained, weights are applied to calculate the weighted pronunciation score (S21-9) and the weighted intonation score (S21-10), and the two values are summed to yield the final similarity score (S21-11).

According to the voice similarity determination method of this embodiment, the first voice length and the first MFCCs of the reference voice signal are stored in a database in advance, so the similarity between the two signals can be calculated, scored, and provided more quickly and efficiently.

FIG. 22 is a flowchart showing the process in which the terminal device of the voice similarity determination apparatus of the present invention transmits the second voice length and the second MFCCs to the external server.

The voice similarity determination apparatus according to the present invention may comprise an external server 20 and a terminal device 10 that divide the work between them. In this case, the terminal device 10 may calculate the second voice length from the test voice signal spoken by the user (S22-1) and then extract the second MFCCs (S22-2). The extracted second MFCCs and the second voice length may then be transmitted to the external server 20 (S22-3). On receiving them, the external server 20 may use the second MFCCs and the second voice length to score the similarity between the test voice signal and the reference voice signal and transmit the result back to the terminal device 10.
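As an illustration of step S22-3, the data sent from the terminal device to the external server could be packaged as below. The text does not specify a wire format, so the JSON encoding, the class name `VoiceFeatures`, its field names, and the use of seconds as the unit of length are all assumptions made for this sketch.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class VoiceFeatures:
    """Hypothetical container for what the terminal device transmits."""
    second_voice_length: float        # length of the detected speech (assumed: seconds)
    second_mfccs: List[List[float]]   # one row of coefficients per frame


def encode_features(features: VoiceFeatures) -> str:
    """Serialize the features on the terminal device side (S22-3)."""
    return json.dumps(asdict(features))


def decode_features(payload: str) -> VoiceFeatures:
    """Rebuild the features on the external server side."""
    return VoiceFeatures(**json.loads(payload))
```

Sending only the voice length and the MFCCs, rather than the raw audio, keeps the transmitted payload small, which matches the division of work described above.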

FIG. 23 is a table summarizing the correlation between the scores calculated by the voice similarity determination apparatus and method according to the present invention and the scores given by expert graders.

First, the specifications of the reference voices and test voices used in the performance experiment are as follows.

The performance experiment was conducted using wave (wav) files comprising 28 reference voice signals and 100 test voice signals. The reference voice signals were 1- to 4-second excerpts of actors' lines taken from movies and dramas containing noise and background music. The test voice signals were recordings of real users following the reference voice signals, either imitating them closely, deliberately pronouncing them differently, or speaking non-fluently.

As the figure shows, the correlation coefficient between the scores calculated by the development module of the voice similarity determination apparatus and method according to the present invention and the mean of the graders' scores is 0.3823. This differs little from 0.3795, the correlation coefficient among the three graders who correlate most highly with the present invention, which indicates that the results calculated according to the present invention provide a reasonable standard.

FIG. 24 is a table summarizing the correlation between the scores obtained by feeding the pronunciation score and intonation score calculated by the present invention into a linear regression model and the scores given by expert graders.

The target of the linear regression model was the mean of the graders' scores. The full set of 100 data points was divided into 10 groups of 10, with 9 groups used for training and 1 group used for testing. The correlation coefficient was then computed between the converted scores predicted by the linear regression model for the 100 data points and the mean scores of the expert graders. This process was repeated 100 times, and the mean of the 100 resulting correlation coefficients was used as the correlation coefficient between the development module according to the present invention and the graders. The 10 groups were formed randomly on each repetition. The original publication summarizes this procedure in a figure.
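The evaluation procedure described above can be sketched with NumPy as follows. This is a rough illustration under stated assumptions: each of the 10 groups is assumed to serve as the test group once per repetition so that all 100 data points receive a prediction, the regression is fitted by ordinary least squares with an intercept, and the function and parameter names are hypothetical.

```python
import numpy as np


def repeated_kfold_correlation(features, grader_mean, n_groups=10, n_repeats=100, seed=0):
    """Mean Pearson correlation between cross-validated predictions and grader means.

    features    : array of shape (n_samples, 2), pronunciation and intonation scores
    grader_mean : array of shape (n_samples,), mean score given by the graders
    """
    features = np.asarray(features, dtype=float)
    grader_mean = np.asarray(grader_mean, dtype=float)
    n = len(grader_mean)
    design = np.column_stack([features, np.ones(n)])  # append an intercept column
    rng = np.random.default_rng(seed)
    correlations = []
    for _ in range(n_repeats):
        order = rng.permutation(n)              # random grouping on every repetition
        predictions = np.empty(n)
        for test_idx in np.array_split(order, n_groups):
            train_mask = np.ones(n, dtype=bool)
            train_mask[test_idx] = False        # the other 9 groups are used for training
            coef, *_ = np.linalg.lstsq(
                design[train_mask], grader_mean[train_mask], rcond=None
            )
            predictions[test_idx] = design[test_idx] @ coef
        # correlation over all predicted data points for this repetition
        correlations.append(np.corrcoef(predictions, grader_mean)[0, 1])
    return float(np.mean(correlations))
```

Averaging over many random groupings reduces the dependence of the reported correlation on any single split of the 100 data points.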

According to the values summarized in the figure, the correlation coefficient between the similarity scores obtained by applying the development module according to the present invention to the linear regression model and the mean of the graders' scores is 0.3313, and the correlation coefficient among the three graders who correlate most highly with the development module is 0.3457. The results obtained by applying the voice similarity determination apparatus and method of the present invention directly therefore correlate more highly with the graders, and are more reliable, than the results obtained by training a linear regression model.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that it may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

10: terminal device
20: external server
110-1: FFT section
110-2: noise spectrum estimation section
110-3: Wiener filter section
110-4: IFFT section
111: first filter module
112: second filter module
120-1: frame-unit processing section
120-2: third Hamming window section
120-3: per-frame energy calculation section
120-4: zero-crossing rate calculation section
120-5: third voice interval determination section
120-6: filter bank section
120-7: band energy estimation section
120-8: voice interval detection section
120-9: background noise adaptation section
120-10: band SNR summation section
120-11: threshold adaptation section
120-12: fourth voice interval determination section
121: first voice interval detection module
122: second voice interval detection module
130-1: pre-emphasis section
130-2: fourth Hamming window section
130-3: DFT section
130-4: mel-scale filter bank section
130-5: log section
130-6: DCT section
131: first MFCC extraction module
132: second MFCC extraction module
140: speech length ratio calculation module
150: DTW module
160: pronunciation score calculation module
170: intonation score calculation module
180: final similarity score calculation module
190: reference voice database

Claims

A voice similarity determination method comprising:
calculating a first voice length from a reference voice signal;
extracting first mel-frequency cepstral coefficients (MFCCs) from the reference voice signal;
calculating a second voice length from a test voice signal;
extracting second MFCCs from the test voice signal;
calculating a speech length ratio by dividing the first voice length by the second voice length;
calculating dynamic time warping (DTW) scores between the first MFCCs and the second MFCCs;
converting the DTW scores into a mapping score through mapping;
calculating a weighted speech length ratio by applying a weight function to the speech length ratio; and
calculating an intonation score by multiplying the mapping score by the weighted speech length ratio,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.
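The extraction chain recited above (pre-emphasis, Hamming window, DFT, mel-scale filter bank, logarithm, DCT, then CMN) can be sketched with NumPy as follows. This is a minimal sketch, not the patented implementation: the 16 kHz sample rate, 25 ms frames with a 10 ms hop, 512-point DFT, 26 filters, and pre-emphasis factor 0.97 are assumptions, as is the reading of "13th order" as 13 coefficients in which coefficient 0 serves as an energy-like term.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):       # rising edge
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):      # falling edge
            bank[i - 1, k] = (right - k) / (right - center)
    return bank


def mfcc(signal, sample_rate=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_coeffs=13, pre_emphasis=0.97):
    """Return CMN-normalized MFCCs of shape (n_frames, n_coeffs).

    The signal must contain at least frame_len samples.
    """
    signal = np.asarray(signal, dtype=float)
    # 1. Pre-emphasis: boost high frequencies
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])
    # 2. Split into overlapping frames and apply a Hamming window
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = emphasized[idx] * np.hamming(frame_len)
    # 3. DFT power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    # 4. Mel-scale filter bank energies, 5. logarithm
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(np.maximum(energies, 1e-10))
    # 6. DCT-II over the filter bank axis, keeping the first n_coeffs coefficients
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.arange(n_coeffs)[:, None] * (2 * n + 1) / (2 * n_filters))
    cepstra = log_energies @ basis.T
    # 7. Cepstral mean normalization: subtract the per-coefficient mean over time
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```

Subtracting the per-coefficient mean removes stationary channel effects, which is useful here because the reference voice and the test voice are recorded under different conditions.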

A voice similarity determination method comprising:
receiving a first voice length and first MFCCs of a reference voice signal from an external server;
calculating a second voice length from a test voice signal;
extracting second MFCCs from the test voice signal;
calculating a speech length ratio by dividing the first voice length by the second voice length;
calculating DTW scores between the first MFCCs and the second MFCCs;
converting the DTW scores into a mapping score through mapping;
calculating a weighted speech length ratio by applying a weight function to the speech length ratio; and
calculating an intonation score by multiplying the mapping score by the weighted speech length ratio,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

A voice similarity determination method comprising:
receiving a second voice length and second MFCCs of a test voice signal from a terminal device;
calculating a first voice length from a reference voice signal;
extracting first MFCCs from the reference voice signal;
calculating a speech length ratio by dividing the first voice length by the second voice length;
calculating DTW scores between the first MFCCs and the second MFCCs;
converting the DTW scores into a mapping score through mapping;
calculating a weighted speech length ratio by applying a weight function to the speech length ratio; and
calculating an intonation score by multiplying the mapping score by the weighted speech length ratio,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

A voice similarity determination method comprising:
receiving a second voice length and second MFCCs of a test voice signal from a terminal device;
receiving a first voice length and first MFCCs from a reference voice database;
calculating a speech length ratio by dividing the first voice length by the second voice length;
calculating DTW scores between the first MFCCs and the second MFCCs;
converting the DTW scores into a mapping score through mapping;
calculating a weighted speech length ratio by applying a weight function to the speech length ratio; and
calculating an intonation score by multiplying the mapping score by the weighted speech length ratio,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

The method according to any one of claims 1 to 4,
wherein the weight function
has a weight value of 1 when the speech length ratio is 1,
and has a weight value of 0 when the speech length ratio is 0.5 or less or 1.5 or more.

The method of claim 5,
wherein the weight function
has a slope of 2 in the interval where the speech length ratio is from 0.5 to 1,
and has a slope of -2 in the interval where the speech length ratio is from 1 to 1.5.
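Taken together, the two weight function claims above describe a triangular function that peaks at a speech length ratio of 1. A minimal sketch follows; the breakpoints and slopes are taken from the claims, while the function names are hypothetical.

```python
def speech_length_weight(ratio: float) -> float:
    """Triangular weight: 1 at ratio 1, 0 at ratio <= 0.5 or >= 1.5."""
    if ratio <= 0.5 or ratio >= 1.5:
        return 0.0
    if ratio <= 1.0:
        return 2.0 * (ratio - 0.5)        # slope +2 on the interval 0.5 to 1
    return 1.0 - 2.0 * (ratio - 1.0)      # slope -2 on the interval 1 to 1.5


def intonation_score(mapping_score: float,
                     first_voice_length: float,
                     second_voice_length: float) -> float:
    """Multiply the mapping score by the weighted speech length ratio."""
    ratio = first_voice_length / second_voice_length  # reference length / test length
    return mapping_score * speech_length_weight(ratio)
```

For example, a test utterance that takes the same time as the reference keeps its full mapping score, while one that is twice as long (ratio 0.5) receives an intonation score of 0.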

delete

The method according to any one of claims 1 to 4,
wherein, in the step of calculating the DTW scores between the first MFCCs and the second MFCCs,
the DTW scores
are obtained by summing the values on a path in a DTW distance matrix created using the first MFCCs and the second MFCCs.
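The path sum described above can be sketched in plain Python as follows. This is an illustrative sketch under assumptions: the frame-to-frame distance is taken to be Euclidean, the allowed steps are the three classic DTW moves (down, right, diagonal), and the function computes one score for one pair of MFCC sequences; none of these choices is fixed by the claim, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple


def dtw_path_sum(first: Sequence[Sequence[float]],
                 second: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (sum of distances on the minimum-cost path, number of cells on that path)."""
    n, m = len(first), len(second)
    # DTW distance matrix: distance between every pair of frames
    dist = [[math.dist(a, b) for b in second] for a in first]
    cost = [[math.inf] * m for _ in range(n)]   # accumulated cost up to each cell
    steps = [[0] * m for _ in range(n)]         # path length up to each cell
    cost[0][0], steps[0][0] = dist[0][0], 1
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0:
                candidates.append((cost[i - 1][j], steps[i - 1][j]))
            if j > 0:
                candidates.append((cost[i][j - 1], steps[i][j - 1]))
            if i > 0 and j > 0:
                candidates.append((cost[i - 1][j - 1], steps[i - 1][j - 1]))
            best_cost, best_steps = min(candidates)  # cheapest predecessor
            cost[i][j] = best_cost + dist[i][j]
            steps[i][j] = best_steps + 1
    return cost[n - 1][m - 1], steps[n - 1][m - 1]
```

Dividing the returned sum by the returned path length gives a length-normalized score, so that long and short utterances become comparable.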

The method according to any one of claims 1 to 4,
wherein, in the step of converting the DTW scores into a mapping score,
the DTW scores are converted into the mapping score using an equation under which the minimum DTW score maps to the maximum score and the maximum DTW score maps to the minimum score,
each DTW score is obtained by summing the values on the optimal path in the DTW distance matrix and normalizing the sum by the length of the optimal path,
and the maximum DTW score and the minimum DTW score are predetermined.
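One way to realize a mapping with the stated end points is linear interpolation between the predetermined DTW bounds, as sketched below. The equation used in the original is not reproduced in this text, so the linear form, the clipping of out-of-range values, the 0 to 100 score range, and all names are assumptions of this sketch.

```python
def dtw_to_mapping_score(dtw_score: float,
                         dtw_min: float,
                         dtw_max: float,
                         score_min: float = 0.0,
                         score_max: float = 100.0) -> float:
    """Map a DTW score to a mapping score; a smaller DTW distance gives a higher score.

    dtw_min and dtw_max are the predetermined DTW bounds (hypothetical values
    must be supplied by the caller); dtw_min maps to score_max and dtw_max
    maps to score_min.
    """
    clipped = min(max(dtw_score, dtw_min), dtw_max)      # keep the value within the bounds
    fraction = (dtw_max - clipped) / (dtw_max - dtw_min)  # 1 at dtw_min, 0 at dtw_max
    return score_min + fraction * (score_max - score_min)
```

With hypothetical bounds of 10 and 50, a DTW score of 30 lands halfway between the minimum and maximum mapping scores.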

The method of claim 1,
further comprising, before the step of calculating the first voice length from the reference voice signal,
removing noise and/or background sound from the reference voice signal and/or the test voice signal.

The method of claim 2,
further comprising, before the step of calculating the second voice length from the test voice signal,
removing noise and/or background sound from the test voice signal.

The method of claim 3,
further comprising, before the step of calculating the first voice length from the reference voice signal,
removing noise and/or background sound from the reference voice signal.

The method according to any one of claims 11 to 13,
wherein the noise and/or background sound is removed by a Wiener filter.

delete

A voice similarity determination apparatus comprising:
a first voice interval detection module that calculates a first voice length from a reference voice signal;
a second voice interval detection module that calculates a second voice length from a test voice signal;
a speech length ratio calculation module that receives the first voice length from the first voice interval detection module and the second voice length from the second voice interval detection module, and calculates a speech length ratio by dividing the first voice length by the second voice length;
a first MFCC extraction module that extracts first MFCCs from the reference voice signal;
a second MFCC extraction module that extracts second MFCCs from the test voice signal;
a DTW module that receives the first MFCCs from the first MFCC extraction module and the second MFCCs from the second MFCC extraction module, and calculates DTW scores;
an intonation score calculation module that calculates a weighted speech length ratio by applying a weight function to the speech length ratio received from the speech length ratio calculation module, and calculates an intonation score by multiplying the weighted speech length ratio by a mapping score obtained by mapping the DTW scores received from the DTW module; and
a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

A voice similarity determination apparatus comprising:
a storage module that receives, from an external server, and stores a first voice length and first MFCCs of a reference voice signal;
a second voice interval detection module that calculates a second voice length from a test voice signal;
a speech length ratio calculation module that receives the second voice length from the second voice interval detection module and the first voice length from the storage module, and calculates a speech length ratio by dividing the first voice length by the second voice length;
a second MFCC extraction module that extracts second MFCCs from the test voice signal;
a DTW module that receives the first MFCCs from the storage module and the second MFCCs from the second MFCC extraction module, and calculates DTW scores;
an intonation score calculation module that calculates a weighted speech length ratio by applying a weight function to the speech length ratio received from the speech length ratio calculation module, and calculates an intonation score by multiplying the weighted speech length ratio by a mapping score obtained by mapping the DTW scores received from the DTW module; and
a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

A voice similarity determination apparatus comprising:
a storage module that receives, from a terminal device, and stores a second voice length and second MFCCs of a test voice;
a first voice interval detection module that calculates a first voice length from a reference voice signal;
a speech length ratio calculation module that receives the first voice length from the first voice interval detection module and the second voice length from the storage module, and calculates a speech length ratio by dividing the first voice length by the second voice length;
a first MFCC extraction module that extracts first MFCCs from the reference voice signal;
a DTW module that receives the second MFCCs from the storage module and the first MFCCs from the first MFCC extraction module, and calculates DTW scores;
an intonation score calculation module that calculates a weighted speech length ratio by applying a weight function to the speech length ratio received from the speech length ratio calculation module, and calculates an intonation score by multiplying the weighted speech length ratio by a mapping score obtained by mapping the DTW scores received from the DTW module; and
a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

A voice similarity determination apparatus comprising:
a storage module that receives, from a terminal device, and stores a second voice length and second MFCCs of a test voice;
a reference voice database that stores a first voice length and first MFCCs of a reference voice;
a speech length ratio calculation module that receives the first voice length from the reference voice database and the second voice length from the storage module, and calculates a speech length ratio by dividing the first voice length by the second voice length;
a DTW module that receives the first MFCCs from the reference voice database and the second MFCCs from the storage module, and calculates DTW scores;
an intonation score calculation module that calculates a weighted speech length ratio by applying a weight function to the speech length ratio received from the speech length ratio calculation module, and calculates an intonation score by multiplying the weighted speech length ratio by a mapping score obtained by mapping the DTW scores received from the DTW module; and
a pronunciation score calculation module that receives the DTW scores from the DTW module and calculates a pronunciation score using the mapping score obtained through the mapping,
wherein the first MFCCs and/or the second MFCCs are obtained by sequentially applying pre-emphasis, a Hamming window, a discrete Fourier transform (DFT), a mel-scale filter bank, and a discrete cosine transform (DCT),
and are extracted by applying cepstral mean normalization (CMN) to the coefficients up to the 13th order, including the energy term.

The apparatus according to any one of claims 17 to 20,
wherein the weight function
has a weight value of 1 when the speech length ratio is 1,
and has a weight value of 0 when the speech length ratio is 0.5 or less or 1.5 or more.

The apparatus of claim 21,
wherein the weight function
has a slope of 2 in the interval where the speech length ratio is from 0.5 to 1,
and has a slope of -2 in the interval where the speech length ratio is from 1 to 1.5.

The apparatus according to any one of claims 17 to 20,
further comprising a final similarity score calculation module that calculates a weighted pronunciation score by multiplying the pronunciation score received from the pronunciation score calculation module by a pronunciation weight, calculates a weighted intonation score by multiplying the intonation score received from the intonation score calculation module by an intonation weight, and calculates a final similarity score by adding the weighted pronunciation score and the weighted intonation score.

The apparatus of claim 23,
wherein the pronunciation weight is 0.52,
and the intonation weight is 0.48.

The apparatus according to any one of claims 17 to 20,
wherein the DTW module
calculates the DTW scores by summing the values on a path in a DTW distance matrix created using the first MFCCs and the second MFCCs.

The apparatus according to any one of claims 17 to 20,
wherein the mapping score used by the intonation score calculation module and/or the pronunciation score calculation module
is obtained by converting the DTW scores received from the DTW module using an equation under which the minimum DTW score maps to the maximum score and the maximum DTW score maps to the minimum score,
each DTW score is obtained by summing the values on the optimal path in the DTW distance matrix and normalizing the sum by the length of the optimal path,
and the maximum DTW score and the minimum DTW score are predetermined.

The apparatus of claim 17, further comprising:
a first filter module that removes noise and/or background sound from the reference voice signal and delivers the result to the first voice interval detection module; and
a second filter module that removes noise and/or background sound from the test voice signal and delivers the result to the second voice interval detection module.

The apparatus of claim 18,
further comprising a second filter module that removes noise and/or background sound from the test voice signal and delivers the result to the second voice interval detection module.

The apparatus of claim 19,
further comprising a second filter module that removes noise and/or background sound from the reference voice signal and delivers the result to the second voice interval detection module.

delete