KR20080018658A

KR20080018658A - Pronunciation comparation system for user select section

Info

Publication number: KR20080018658A
Application number: KR1020060081124A
Authority: KR
Inventors: 강사돈
Original assignee: 주식회사 예람
Priority date: 2006-08-25
Filing date: 2006-08-25
Publication date: 2008-02-28

Abstract

A pronunciation comparison system for a user-selected section is provided to improve listening comprehension of multimedia content and to let a user hear his or her own pronunciation clearly. A storage unit (100) stores the pronunciation of the multimedia content. A section setting unit (105) allows a user to select a section of the multimedia content. A noise canceling unit (130) cancels noise in the user's voice or the multimedia content. An output unit (110) provides the pronunciation of the section of the multimedia content selected by the user. An input unit (120) receives the user's pronunciation of a sentence and transmits it to the noise canceling unit (130). A pre-processing unit (140) segments the user's speech and passes it, together with the corresponding multimedia content pronunciation routed through the noise canceling unit (130), to a vocal-sound characteristic parameter extracting unit (160). The vocal-sound characteristic parameter extracting unit (160) extracts characteristic parameters from the user's voice and from the pronunciation of the corresponding multimedia content. A component characteristic parameter extracting unit (170) extracts component characteristic parameters by analyzing the pitch, energy, and formant components of the user's voice and of the pronunciation of the corresponding multimedia content. A comparing/analyzing unit (195) compares the vocal-sound characteristic parameters and the component characteristic parameters of the user's voice and the corresponding multimedia content pronunciation, analysis unit by analysis unit, and provides the results to the user.

Description

Speech comparison system for user selection section {PRONUNCIATION COMPARATION SYSTEM FOR USER SELECT SECTION}

FIG. 1 is a block diagram of one embodiment of the voice comparison system for a user-selected section according to the present invention.

FIG. 2 is a screen layout of one embodiment that the voice comparison system for a user-selected section according to the present invention provides to the user.

FIG. 3 is a flowchart of user section setting according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

100: storage means    105: section setting means

110: output means    120: input means

130: noise removing means    140: preprocessing means

150: database    160: vocal-sound characteristic parameter extraction means

170: component characteristic parameter extraction means    175: pitch component analysis unit

180: energy component extraction unit    185: formant component extraction unit

195: comparative analysis means

The present invention relates to a voice comparison system for a user-selected section. More particularly, for a section of multimedia content selected by the user, the system removes noise from both the user's pronunciation and the pronunciation of the multimedia content and sends the result to the output means, so that listening ability improves through a clean, good-quality sound source. It removes the noise contained in the user's voice so that the user can clearly recognize his or her own voice. It also compares and analyzes the two pronunciations on the basis of characteristic parameters, including vocal-sound characteristic parameters and component characteristic parameters, and presents the comparison results and a pronunciation correction method to the user, enabling the user to acquire a language with accurate pronunciation quickly and easily.

Conventional English conversation learning materials provide only a listening function using printed text and tape, so the user cannot check whether his or her pronunciation is accurate or how much it differs from the multimedia content.

In addition, conventional language teaching methods deliver material one-way, so their learning effect is low.

Accordingly, the present invention has been devised to solve the above conventional problems.

An object of the present invention is to remove noise from the user's pronunciation and from the pronunciation of the multimedia content selected by the user before output, thereby improving listening comprehension of the multimedia content and allowing the user to hear his or her own pronunciation clearly.

Another object of the present invention is to compare and analyze the pronunciations on the basis of vocal-sound characteristic parameters and component characteristic parameters, and to present the comparison results and a pronunciation correction method to the user.

To achieve the above objects, the voice comparison system for a user-selected section according to the present invention is

a voice comparison system comprising:

storage means (100) for storing sentences to be learned and the multimedia content pronunciation of those sentences;

section setting means (105) for allowing a user to select a section by marking off a part of the multimedia content from the storage means;

noise removing means (130) for removing noise from the user's voice or the multimedia content, treating unvoiced and voiced sound separately;

output means (110) for providing the user, as the user selects a section, with the multimedia content pronunciation of the section to be learned;

input means (120) for receiving from the user a pronunciation of the sentence provided by the output means and transmitting it to the noise removing means;

preprocessing means (140) for detecting the start point and end point of each sentence in the voice input through the input means, cutting each detected sentence into predetermined analysis units and transmitting them to the vocal-sound characteristic parameter extraction means, and for passing the multimedia content pronunciation of the sentence spoken by the user from the storage means, via the noise removing means, to the vocal-sound characteristic parameter extraction means;

vocal-sound characteristic parameter extraction means (160) for extracting characteristic parameters by classifying the cut user voice and the corresponding multimedia content pronunciation into silence, voiced sound, and unvoiced sound;

component characteristic parameter extraction means (170) for extracting component characteristic parameters by analyzing the pitch, energy, and formant components of the cut user voice and the corresponding multimedia content pronunciation; and

comparative analysis means (195) for comparing and analyzing, for each predetermined analysis unit, the vocal-sound characteristic parameters and component characteristic parameters of the user's voice and the corresponding multimedia content pronunciation, and providing the comparison results to the user.

The predetermined analysis unit of the preprocessing means is a word, a word phrase (eojeol), or a sentence, according to the user's selection.

The noise removing means uses a method that classifies voiced sound, unvoiced sound, and silence by taking the wavelet-transformed speech signal as a parameter in the spectral domain.

The preprocessing means, taking human auditory characteristics into account, splits the frequency band of the speech into three bands and then sets a fine-grained energy threshold for each band to search for the end point of the speech.

Hereinafter, a preferred embodiment of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of one embodiment of the voice comparison system for a user-selected section according to the present invention.

As shown in FIG. 1, the voice comparison system for a user-selected section according to the present invention is

a voice comparison system comprising:

noise removing means (130) for removing noise from the user's voice or the multimedia content, treating unvoiced and voiced sound separately;

output means (110) for providing the user, as the user selects a section, with the multimedia content pronunciation of the section to be learned; and

comparative analysis means (195) for comparing and analyzing, for each predetermined analysis unit, the vocal-sound characteristic parameters and component characteristic parameters of the user's voice and the corresponding multimedia content pronunciation, and providing the comparison results to the user.

The storage means (100) stores the sentences to be learned and the multimedia content pronunciation of those sentences.

The section setting means (105) allows the user to select a section by marking off a part of the multimedia content from the storage means (100).

For example, the section setting means allows the section to be set in units of time, words, or sentences.

FIG. 3 is a flowchart of the section setting means.

Before setting a section, the user enters the section setting parameter in advance (S310).

The section setting parameter may specify a duration (1 second, 2 seconds, 5 seconds, 10 seconds, and so on), a number of words (1 word, 2 words, 3 words, and so on), or a number of sentences (1 sentence, 2 sentences, and so on).

If a duration is specified, the user selects the section start point, and the point reached after that duration from the start point is automatically designated as the section end point.

If a number of words is specified, the point reached after the specified number of words from the start point is designated as the end point.

If a number of sentences is specified, the point reached after the specified number of sentences from the start point is designated as the end point.

Once a section has been selected through these steps, the start and end points are extracted as described above (S320), and the text of the multimedia content in the selected section is then displayed on the screen (S330).

Section selection may include a step of determining whether waveform comparison is possible (S340).

That is, when the multimedia content is not human speech but merely sound or a sound effect, its waveform cannot be compared with a human voice, so such a section is not selected.
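The three section-setting modes can be sketched in Python as follows. This is only an illustration: the `Token` record, the assumption that word-level timestamps and sentence-final flags are available for the content, and the function name `section_bounds` are all hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """One word of the content transcript (hypothetical structure)."""
    text: str
    start: float          # start time in seconds
    end: float            # end time in seconds
    ends_sentence: bool = False


def section_bounds(tokens: List[Token], start_time: float,
                   mode: str, amount: float) -> Tuple[float, float]:
    """Return (start, end) in seconds for a section beginning at start_time.

    mode is "time" (amount = seconds), "words" (amount = word count)
    or "sentences" (amount = sentence count).
    """
    if mode == "time":
        return start_time, start_time + amount

    # First word that is still playing at, or begins after, the start point.
    first = next((i for i, t in enumerate(tokens) if t.end > start_time), None)
    if first is None:
        raise ValueError("start point lies after the last word")

    if mode == "words":
        last = min(first + int(amount), len(tokens)) - 1
        return tokens[first].start, tokens[last].end

    if mode == "sentences":
        remaining = int(amount)
        for token in tokens[first:]:
            if token.ends_sentence:
                remaining -= 1
                if remaining == 0:
                    return tokens[first].start, token.end
        # Fewer sentences left than requested: run to the end of the content.
        return tokens[first].start, tokens[-1].end

    raise ValueError(f"unknown mode: {mode}")
```

With a transcript such as "How are you? Fine.", asking for two words from the beginning would yield the span covering "How are", while asking for two sentences would run to the end of "Fine."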

The output means (110) provides the user, according to the user's selection, with the sentence to be learned and the multimedia content pronunciation of that sentence, and plays back the user's input voice on request.

The section setting means (105) provides the sentences stored in the database (150) to the user through the output means (110).

The input means (120) for the user's voice (for example, a microphone) receives the user's voice through the microphone when the user presses the record button, so that the user, while viewing the English conversation text shown at the top of the screen (210) of FIG. 2 through the output means (110), can compare his or her own voice with the voice of the multimedia content.

The noise removing means (130) removes noise from the user's voice or the multimedia content, treating unvoiced and voiced sound separately. It processes the signal before it is sent to the output means, so that the user hears noise-free speech, which improves listening comprehension.

In the present invention, the noise removing means performs noise processing using voiced/unvoiced separation. Voicing is an important feature of speech: rather than applying the same noise processing technique to the voiced and unvoiced parts, each part is processed according to its own properties. Voiced/unvoiced separation can be obtained from the zero-crossing rate and energy, and a modified speech/noise dominance decision method based on the voiced/unvoiced separation information may also be used. The present invention, however, uses a technique that classifies voiced sound, unvoiced sound, and silence using, as a parameter, the change in the spectrum of the wavelet-transformed speech signal.

The method used yielded good results in a performance evaluation on speech sentences corrupted by white noise and aircraft noise.

The segmental signal-to-noise ratio was computed for sentences corrupted at various input signal-to-noise ratios (SNR). These techniques are described in more detail in "Noise Processing Using Voiced/Unvoiced Separation" (Journal of the Acoustical Society of Korea, vol. 21, 2002, 유창동) and "Voiced/Unvoiced/Silence Classification of Speech Signals Using the Wavelet Transform" (Acoustical Society of Korea, 1998, 손영호).
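For readers unfamiliar with the segmental signal-to-noise ratio mentioned above, the following Python sketch shows the usual textbook definition: the SNR is computed frame by frame in decibels and then averaged. The frame length and the clamping range of -10 dB to 35 dB are common conventions assumed here, not values taken from the specification or the cited papers.

```python
import math
from typing import Sequence


def segmental_snr(clean: Sequence[float], processed: Sequence[float],
                  frame_len: int = 256,
                  lo_db: float = -10.0, hi_db: float = 35.0) -> float:
    """Average per-frame SNR in dB between a clean and a processed signal."""
    eps = 1e-12  # avoids division by zero and log of zero
    frame_snrs = []
    usable = min(len(clean), len(processed))
    for i in range(0, usable - frame_len + 1, frame_len):
        s = clean[i:i + frame_len]
        y = processed[i:i + frame_len]
        signal_energy = sum(x * x for x in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(s, y))
        snr = 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))
        # Clamp so that silent or perfectly matched frames do not dominate.
        frame_snrs.append(max(lo_db, min(hi_db, snr)))
    if not frame_snrs:
        raise ValueError("signals are shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)
```

A higher value means the processed signal stays closer to the clean reference across frames.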

The preprocessing means (140) detects the start point and end point of each sentence in the voice input through the input means, cuts each detected sentence into predetermined analysis units, and transmits them to the vocal-sound characteristic parameter extraction means. It also passes the multimedia content pronunciation of the user-selected section, corresponding to the sentence spoken by the user, from the storage means, via the noise removing means, to the vocal-sound characteristic parameter extraction means.

The preprocessing means (140) compares the user's voice, input and recorded through the user voice input means (120), with the voice of the multimedia content stored in advance in the database (150), and also receives the user's voice without distortion through gain adjustment.

The predetermined analysis unit of the preprocessing means is a word, a word phrase (eojeol), or a sentence, according to the user's selection.

The preprocessing means (140) also cuts the speech into uniform pieces in order to analyze its features. The start/end point detection applied in this process is needed to find, within the signal received through the user voice input means, only the interval in which the user is speaking. With this technique the user simply speaks whenever desired, without any further operation, and the speech interval is automatically extracted and played back.

One of the problems that must be solved first to put the present invention into practical use is start/end point detection in a noisy environment.

In a noise-free environment, the conventional energy parameter alone can detect the start/end interval fairly reliably, but in real noisy environments such as urban noise it mostly gives poor results.

To remove background noise in an urban environment, spectral subtraction based on the Bark scale could be used; this is a preprocessing technique that removes from the input speech only the magnitude component of the speech spectrum corrupted by ambient noise. In the present invention, however, the signal has already passed through the noise removing means (130) as described above, so this step operates on a signal from which noise has been removed.

The preprocessing means (140) of the present invention adopts a method that, taking human auditory characteristics into account, splits the frequency band of the speech into three bands and then sets a fine-grained energy threshold for each band to search for the end point of the speech. To verify the adopted method, start/end point detection was performed using a database recorded in noisy environments such as real offices and subway stations. Compared with the conventional method using energy and zero-crossing rate, the error rate fell by 46% on average, and compared with using band energy alone, it fell by 17% on average, confirming the effectiveness of the proposed method.

The preprocessing means (140) of the present invention is described in more detail in "An Endpoint Detection Algorithm for Noisy Speech Using Band Energy" (Acoustical Society of Korea, 2002, 박기상; 석수영; 정호열; 정현열).
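The band-energy idea can be illustrated with a short Python sketch. The band edges, thresholds, frame length, and the rule that a frame counts as speech when any band exceeds its threshold are all assumptions made for illustration; the settings of the cited algorithm are not given in this text. A naive DFT is used so that the example needs only the standard library.

```python
import cmath
import math
from typing import List, Optional, Sequence, Tuple


def band_energies(frame: Sequence[float], sample_rate: float,
                  edges: Sequence[float]) -> List[float]:
    """Energy of one frame in each band [edges[b], edges[b+1]) in Hz."""
    n = len(frame)
    energies = [0.0] * (len(edges) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for b in range(len(edges) - 1):
            if edges[b] <= freq < edges[b + 1]:
                energies[b] += abs(coeff) ** 2 / n
                break
    return energies


def detect_endpoints(signal: Sequence[float], sample_rate: float,
                     frame_len: int = 64,
                     edges: Sequence[float] = (0.0, 1000.0, 2500.0, 4001.0),
                     thresholds: Sequence[float] = (1.0, 1.0, 1.0),
                     ) -> Optional[Tuple[int, int]]:
    """Return (start_sample, end_sample) of the speech interval, or None.

    The default edges are assumed values suited to 8 kHz audio.
    """
    active = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        energies = band_energies(signal[i:i + frame_len], sample_rate, edges)
        # Assumed decision rule: speech if any band exceeds its own threshold.
        active.append(any(e > th for e, th in zip(energies, thresholds)))
    if True not in active:
        return None
    first = active.index(True)
    last = len(active) - 1 - active[::-1].index(True)
    return first * frame_len, (last + 1) * frame_len
```

Giving each band its own threshold lets a band with little background noise use a low threshold while a noisier band uses a higher one, which is the motivation the text gives for splitting the spectrum.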

In addition, the preprocessing means (140) performs pattern matching using DTW (Dynamic Time Warping): before comparing the pronunciation similarity between the user and the multimedia content, it aligns the lengths and patterns of the two utterances.

For example, if the multimedia content pronounces a sentence in 5 seconds but the user speaks too fast and pronounces the same sentence in 3 seconds, comparing the sentences directly would give a poor result no matter how accurate the pronunciation is. The lengths of the two sentences are therefore aligned with DTW before comparison.

The DTW-based analysis also passes the characteristic parameters obtained by the vocal-sound characteristic parameter extraction means and the component characteristic parameter extraction means through the DTW algorithm to align their lengths and patterns, and measures the similarity between the two sentences.
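A minimal Python sketch of the classic DTW recurrence is given below, applied to one-dimensional feature sequences such as a per-frame pitch track. The absolute difference as local cost and the unconstrained step pattern are assumptions; the specification does not state which DTW variant is used.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Accumulated cost of the best time alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]
```

A sequence and a time-stretched copy of it align with zero cost, which mirrors the 5-second versus 3-second example: speaking faster or slower does not by itself lower the similarity.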

Pattern matching may use an algorithm that detects speech intervals and pitch for continuous speech recognition.

This method takes continuous speech as input, distinguishes consonants from vowels frame by frame, and detects pitch in the voiced segments so identified. By taking speech in a real noisy environment and using a suitable threshold energy, robust extraction of speech intervals in noise was possible. Within the extracted intervals, an algorithm using the frame-level zero-crossing rate and short-time energy detects the pitch of voiced sound while distinguishing consonants from vowels. This improved method is presented in "A Study on Speech Interval and Pitch Detection for Continuous Speech Recognition" (Journal of Korea Multimedia Society, 2005, 김태석; 장종칠).

The vocal-sound characteristic parameter extraction means (160) extracts characteristic parameters by classifying the cut user voice and the corresponding multimedia content pronunciation into silence, voiced sound, and unvoiced sound.

In general, a speech signal is classified by its waveform into three types: voiced sound, whose waveform is quasi-periodic; unvoiced sound, which is noise-like and has no periodicity; and silence, which corresponds to background noise.

Conventional voiced/unvoiced/silence classification methods have widely used pitch information, energy, and zero-crossing rate as classification parameters.
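As a rough illustration of the conventional parameters just mentioned (not the wavelet-based classifier), a frame can be labeled from its energy and zero-crossing rate alone. The two thresholds below are arbitrary assumed values that would need tuning on real recordings.

```python
from typing import Sequence


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for x, y in zip(frame, frame[1:])
                    if (x >= 0) != (y >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame: Sequence[float],
                   silence_energy: float = 1e-4,
                   zcr_voiced_max: float = 0.25) -> str:
    """Label a frame as "silence", "voiced" or "unvoiced"."""
    energy = sum(x * x for x in frame) / len(frame)
    if energy < silence_energy:
        return "silence"            # too quiet to be speech
    if zero_crossing_rate(frame) <= zcr_voiced_max:
        return "voiced"             # slow, quasi-periodic waveform
    return "unvoiced"               # noise-like, many sign changes
```

Voiced sounds are quasi-periodic and change sign rarely, while noise-like unvoiced sounds change sign often, which is why the zero-crossing rate separates them once silence has been ruled out by energy.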

The present invention uses a voiced/unvoiced/silence classification algorithm that takes as its parameter the change in the spectrum of the wavelet-transformed speech signal.

This technique is described in "Voiced/Unvoiced/Silence Classification of Speech Signals Using the Wavelet Transform" (Acoustical Society of Korea, 1998, 손영호).

The component characteristic parameter extraction means (170) extracts the component characteristic parameters of the speech in order to compare the user's input voice (pronunciation) with the pronunciation of the multimedia content. The component characteristic parameters include pitch, energy, and formant components.

First, the pitch component analysis unit (175) analyzes the user's voice (pronunciation) and presents a comparison of the pitch height, length, stress, and intonation of the spoken sentence, so that the user can identify problems in his or her pronunciation. Pitch varies gradually within a speech interval according to the height and speed of the sound. Among the many algorithms for estimating the pitch parameter, the present invention uses the AMDF method and a modified AMDF method that includes median smoothing.
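A minimal Python sketch of AMDF (average magnitude difference function) pitch estimation followed by median smoothing of the pitch track is shown below. The search range of 60 to 400 Hz and the smoothing window of three frames are assumed values, and the exact modification used in the invention is not specified, so this shows only the basic technique.

```python
import statistics
from typing import List, Sequence


def amdf_pitch(frame: Sequence[float], sample_rate: float,
               f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Estimate the pitch in Hz of one voiced frame with AMDF."""
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(frame) - 1)
    if lag_min < 1 or lag_min > lag_max:
        raise ValueError("frame too short for the requested pitch range")
    best_lag, best_val = lag_min, float("inf")
    for lag in range(lag_min, lag_max + 1):
        n = len(frame) - lag
        # Mean absolute difference between the frame and its shifted copy;
        # it dips toward zero when the lag equals the pitch period.
        val = sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n
        if val < best_val:
            best_lag, best_val = lag, val
    return sample_rate / best_lag


def median_smooth(track: Sequence[float], width: int = 3) -> List[float]:
    """Replace each value by the median of its neighborhood."""
    half = width // 2
    return [statistics.median(track[max(0, i - half):i + half + 1])
            for i in range(len(track))]
```

Median smoothing removes isolated outliers such as octave errors from the frame-by-frame pitch track without blurring genuine pitch movement as much as averaging would.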

The energy component extraction unit (180) computes the energy of the speech signal. The energy shows which part of the spoken sentence the user emphasized and supplements the results of the pitch analysis. The energy component is extracted by computing the mean square of each analysis frame.
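The mean-square computation can be written in a few lines of Python. Non-overlapping frames and the helper `most_emphasized_frame` are assumptions added for illustration.

```python
from typing import List, Sequence


def frame_energies(signal: Sequence[float], frame_len: int) -> List[float]:
    """Mean square of each non-overlapping analysis frame."""
    return [sum(x * x for x in signal[i:i + frame_len]) / frame_len
            for i in range(0, len(signal) - frame_len + 1, frame_len)]


def most_emphasized_frame(signal: Sequence[float], frame_len: int) -> int:
    """Index of the frame with the highest energy (hypothetical helper)."""
    energies = frame_energies(signal, frame_len)
    return max(range(len(energies)), key=energies.__getitem__)
```

Comparing the position of the energy maximum in the user's utterance with that in the reference gives a simple indication of whether the same part of the sentence was stressed.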

The formant component extraction unit (185) extracts the formant components, which are the component characteristic parameters used to compare the pronunciation of words. For short words it is very difficult to measure similarity from pitch information, so formant information, which represents the frequency characteristics of the speech, is used. Formants are obtained frame by frame by peak picking.
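Peak picking can be sketched in Python as a search for local maxima in a frame's spectral envelope. How the envelope is obtained (for example by linear prediction or cepstral smoothing) is not stated in the text, so the function below simply assumes it receives envelope magnitudes sampled evenly from 0 Hz to half the sample rate.

```python
from typing import List, Sequence


def pick_formants(envelope: Sequence[float], sample_rate: float,
                  max_formants: int = 3) -> List[float]:
    """Frequencies in Hz of the first local maxima of a spectral envelope.

    envelope[k] is assumed to be the magnitude at k * bin_hz, with the
    last bin at the Nyquist frequency.
    """
    bin_hz = (sample_rate / 2) / (len(envelope) - 1)
    peaks = [k * bin_hz
             for k in range(1, len(envelope) - 1)
             if envelope[k] > envelope[k - 1]
             and envelope[k] >= envelope[k + 1]]
    return peaks[:max_formants]
```

The lowest peaks returned would correspond to the first formants of the frame, which can then be compared between the user's word and the reference word.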

Speech uttered by humans carries not only information about meaning but also characteristics specific to the speaker's gender.

That is, speech can be divided into female speech, with strong high-frequency content, and male speech. Nevertheless, conventional HMM-based speech recognition systems ignore this characteristic and use a single HMM.

Experiments showed that male and female formant frequencies differ by 100∼30Hz, so a method that can distinguish male from female speech in light of this characteristic should be adopted.

In addition, after training a GMM separately on male and female speech, recognition is performed with the male HMM for male speech and the female HMM for female speech, according to the formant characteristics of the input speech. This improved results over the conventional recognition method by 5.2% for male speech and 4.4% for female speech. Formant component extraction is described in detail in "Design and Implementation of a Speech Recognition Preprocessing System Using Formant Frequency" (Korea Information Science Society, conference proceedings, 1999, 김태욱 and three others).

The comparative analysis means (195) compares the speech characteristic parameters obtained through the above speech signal processing and shows the user, audibly and visually, how similar the user's input voice (pronunciation) is to the voice (pronunciation) of the multimedia content.

Here, the characteristic parameters of the two utterances, such as voiced sound, unvoiced sound, pitch, energy, and formants, are displayed in an easily readable form on the user screen (250), and the parts whose pronunciation differs most are pointed out.
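One way to point out the parts that differ most is to score each analysis unit by the relative difference of its parameters and sort the units by that score. The sketch below is purely illustrative: the per-unit dictionaries, the parameter names, and the equal weighting of parameters are assumptions, and the scoring rule of the actual system is not described.

```python
from typing import Dict, List

Params = Dict[str, float]


def rank_units_by_difference(user: Dict[str, Params],
                             reference: Dict[str, Params]) -> List[str]:
    """Return analysis units ordered from most to least different.

    Both inputs map a unit (for example a word) to its characteristic
    parameters and are assumed to be time-aligned already.
    """
    scores = {}
    for unit, ref_params in reference.items():
        diffs = [abs(user[unit][name] - ref) / (abs(ref) or 1.0)
                 for name, ref in ref_params.items()]
        scores[unit] = sum(diffs) / len(diffs)
    return sorted(scores, key=scores.get, reverse=True)
```

The first entries of the returned list are the units a display could highlight for the learner.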

In addition, the dialog box (280) at the bottom of the screen describes the problems in the user's pronunciation in text form so that they can be corrected.

Based on the evaluation results presented by the comparative analysis means (195), the user identifies the problems in his or her pronunciation and corrects it by practicing repeatedly with the suggested correction method.

The database (150) provides the reference against which the accuracy of the user's pronunciation is compared: it stores the multimedia content pronunciation in advance, and also stores the sentences provided to the user as learning material and the user's voice input through the user voice input means.

도 2 는 본 발명에 따른 사용자 선택구간에 대한 음성비교 시스템이 사용자에게 제공하는 일실시예 화면 구성으로 이에 대해 상세히 설명하자면,2 is an exemplary screen configuration provided to a user by a voice comparison system for a user selection section according to the present invention.

The screen includes a part 210 that presents an English sentence for the user to pronounce; a function key 220 for listening to the user's pronunciation and to the pronunciation of the multimedia content stored in the database; a function key 230 for inputting the user's voice; a function key 240 for visually displaying the pronunciation accuracy measured by comparing voice characteristic parameters of the two pronunciations, such as pitch, energy, and formants; a part 250 that presents to the user the accuracy measurement result invoked by function key 240; a function key 260 for adjusting the pronunciation speed of the multimedia content; a function key 270 for presenting a pronunciation correction method as text; a part 280 in which the correction method invoked by function key 270 is presented; and a function key 290 for selecting the unit of comparison as a word, a phrase (eojeol), or a whole sentence.

The voice comparison system for a user selection section described above may be implemented as a standalone system, or built as a server that provides a language learning service to users over a communication network such as the Internet.

Those skilled in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

As described above, the voice comparison system for a user selection section of the present invention provides the following effects.

By applying voice signal processing technology and the multimedia functions of a computer, the system presents the user's own pronunciation and the pronunciation of the multimedia content audio-visually, and also presents clear pronunciation from which noise has been removed, thereby improving the user's listening ability.

The system also presents audio-visually the result of comparing the user's own pronunciation with the pronunciation of the multimedia content, lets the user correct his or her pronunciation according to that result, and maximizes the learning effect through repeated practice.

Furthermore, linking the system with an online learning network enables interactive language learning between instructors and students in real time. This improves the effectiveness and usability of the user's English learning, minimizes the enormous expenditure invested in English conversation education, and copes with the severe shortage of English conversation instructors, thereby broadening the reach of education.

Claims

A voice comparison system for a user selection section, comprising:

storage means for storing a sentence to be learned and the pronunciation of the multimedia content for that sentence;

section setting means for allowing a user to select a section by marking off a part of the multimedia content from the storage means;

noise removing means for dividing the noise of the user or of the multimedia content into unvoiced sound and voiced sound;

output means for providing to the user, when a section is selected by the user, the pronunciation of the multimedia content for the selected section to be learned;

input means for receiving from the user the pronunciation of the sentence provided by the output means and transmitting it to the noise removing means;

preprocessing means for detecting, sentence by sentence, the start point and end point of the voice input through the input means, cutting each detected sentence into a predetermined analysis unit and transmitting it to the vocal sound characteristic parameter extracting means, and transferring the pronunciation of the multimedia content from the storage means to the vocal sound characteristic parameter extracting means via the noise removing means;

vocal sound characteristic parameter extracting means for extracting characteristic parameters by classifying the segmented user's voice and the pronunciation of the corresponding multimedia content into silence, voiced sound, and unvoiced sound;

component characteristic parameter extracting means for extracting component characteristic parameters obtained by analyzing the pitch, energy, and formant components of the segmented user's voice and of the pronunciation of the corresponding multimedia content; and

comparative analysis means for comparing and analyzing, for each predetermined analysis unit, the vocal sound characteristic parameters and the component characteristic parameters of the user's voice and of the pronunciation of the corresponding multimedia content, and providing the comparative analysis result to the user.

The voice comparison system for a user selection section of claim 1, wherein the predetermined analysis unit of the preprocessing means is a word, a phrase (eojeol), or a sentence, according to the user's selection.

The voice comparison system for a user selection section of claim 1, wherein the noise removing means classifies voiced sound, unvoiced sound, and silence by using a wavelet-transformed signal as a parameter on the spectrum.

The voice comparison system for a user selection section of claim 1, wherein the preprocessing means divides the frequency band of the voice into three bands in consideration of human hearing characteristics and then sets a fine energy threshold for each band to search for the end point of the voice.

The voice comparison system for a user selection section of claim 1, wherein the section setting means allows the section to be set in units of time, words, or sentences.
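
For illustration only, the end-point search recited above (three frequency bands, one energy threshold per band) might be sketched as follows. This is a sketch under stated assumptions, not the claimed implementation: the band edges, thresholds, 8 kHz sampling rate, frame length, and function names are all hypothetical placeholders, and a naive DFT stands in for whatever band-splitting method the system actually uses.

```python
import cmath
import math

def band_energies(frame, sample_rate, edges_hz):
    """Energy of one frame in each band, using a naive DFT (adequate for short frames)."""
    n = len(frame)
    energies = [0.0] * (len(edges_hz) - 1)
    for k in range(n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for b in range(len(energies)):
            if edges_hz[b] <= freq < edges_hz[b + 1]:
                energies[b] += abs(coeff) ** 2
    return energies

def find_end_point(samples, sample_rate=8000, frame_len=80,
                   edges_hz=(0, 1000, 2500, 4001),   # hypothetical band edges
                   thresholds=(1.0, 1.0, 1.0)):      # hypothetical per-band thresholds
    """Sample index just past the last frame in which any band exceeds its threshold, or None."""
    last = None
    starts = range(0, len(samples) - frame_len + 1, frame_len)
    for i, start in enumerate(starts):
        e = band_energies(samples[start:start + frame_len], sample_rate, edges_hz)
        if any(x > th for x, th in zip(e, thresholds)):
            last = i
    return None if last is None else (last + 1) * frame_len
```

Using a separate threshold for each band lets a weak high-frequency ending (for example, a trailing fricative) be detected without lowering the threshold for low-frequency background noise.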