KR20200018859A

KR20200018859A - Web service system for speech feedback

Info

Publication number: KR20200018859A
Application number: KR1020180094195A
Authority: KR
Inventors: 안성태; 여지훈; 정우성; 이철민; 임정욱; 임창준
Original assignee: 한국과학기술원
Priority date: 2018-08-13
Filing date: 2018-08-13
Publication date: 2020-02-21

Abstract

An object of the present invention is to provide a web service that gives speech feedback by detecting and extracting a user's speaking habits, namely filler sounds. To this end, a web system for speech feedback according to the present invention includes: a server that communicates with a Google API server to receive the information of a voice file converted into text, concurrently runs a Python library to analyze the voice data, and transfers the analyzed voice data to a client device; a display that shows a web page configured so that a user can upload a voice file and check the information of the analyzed voice file; and a client that transfers the voice file to the server and passes the voice file received from the server to the display so that it is shown on the web page.

Description

Web service system for speech feedback {WEB SERVICE SYSTEM FOR SPEECH FEEDBACK}

The present invention relates to a web service for speech feedback, and more particularly to a web service that analyzes a user's speech and provides feedback in order to improve the user's speaking ability.

The importance of speech has recently come to the fore. Speaking is an ability that everyone needs and is regarded in particular as one of the qualities that help in performing a job. However, speaking skill cannot be improved dramatically in a short period and requires steady practice, and public education does not provide systematic training for it.

Most commercial speech academies record the speaker and then coach using the recorded video. The academies are concentrated in large cities, and the monthly fees are a considerable burden. There are also mobile applications that analyze speech, as shown in FIG. 1, but they only give feedback on instantaneous loudness or speaking rate and provide no more useful information. Other applications, not shown, likewise have no function for analyzing the voice itself and merely make recording or practice more convenient. In other words, services such as academies can give accurate and specific feedback but consume a great deal of time and money, whereas applications make feedback easy to obtain but are very limited in function.

As the survey results shown in FIGS. 2 to 5 indicate, many users need their speech corrected, and most of them want to do so in a more effective and convenient way. However, most users have never encountered a service that corrects speech automatically, and none of them can easily get help with speech correction.

An object of the present invention is to provide a web service system that provides speech feedback by detecting and extracting a user's speaking habits, namely filler sounds.

To achieve this object, the web service system for speech feedback of the present invention includes: a server that communicates with a Google API server to receive voice file information converted into text, concurrently runs a Python library to analyze the voice data, and delivers the analyzed voice data to a client device; a display for displaying a web page configured so that a user can upload a voice file and check the analyzed voice file information; and a client that delivers a voice file to the server, or delivers the voice file received from the server to the display so that it is shown on the web page.

In particular, the Python library includes librosa, and the analysis of the voice data includes extraction of frequency-related feature values and extraction of amplitude-related feature values. The frequency-related feature values are extracted using librosa's STFT (Short Time Fourier Transform), and the amplitude-related feature values are extracted using librosa's RMSE.

The web page comprises an analysis start button, a script, legend buttons, recommended videos, and the like. The script shows the result of converting the voice file into text, and the legend buttons mark sections such as abrupt changes in tone, high-decibel sections, and fast-paced sections on the script.

By detecting and extracting a user's speaking habits, namely filler sounds, and providing speech feedback, the present invention makes speech practice available without restrictions of place or time.

FIG. 1 is a diagram showing a conventional mobile application that analyzes speech.
FIGS. 2 to 5 are graphs showing the results of a survey indicating the need for speech correction.
FIG. 6 is a diagram showing the result of converting a voice file into a script using the Google API.
FIGS. 7 to 10 are diagrams showing the results of using the STFT (Short Time Fourier Transform), one of the tools provided by librosa, to extract frequency-related feature values.
FIGS. 11 and 12 are diagrams showing the results of using RMSE, one of the tools provided by librosa, to extract amplitude-related feature values.
FIG. 13 is a diagram showing the web system structure of the present invention.
FIG. 14 is a diagram showing a web page according to an embodiment of the present invention.

The foregoing objects, features, and advantages will become more apparent from the following embodiments described in conjunction with the accompanying drawings.

The specific structural and functional descriptions are given only to illustrate embodiments according to the concept of the present invention. Embodiments according to the concept of the present invention may be implemented in various forms and should not be construed as limited to the embodiments described in this specification.

Because embodiments according to the concept of the present invention can be modified in various ways and can take various forms, specific embodiments are illustrated in the drawings and described in detail in this specification. This is not intended to limit the embodiments to a particular disclosed form; they should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

The terminology used in this specification is used only to describe particular embodiments and is not intended to limit the present invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. Like reference numerals in the drawings denote like elements.

To give feedback on a user's speech, it is important to analyze key information in the user's voice data, such as "filler sounds". A filler sound is an utterance that has no meaning of its own but is produced out of habit. Korean examples include '음' ("um"), '어' ("uh"), '그' ("that"), and '사실' ("actually").

Machine learning can be used to develop a system for detecting filler sounds. Possible approaches include a method that uses AWS and a method that adapts a model for classifying musical instruments (e.g., Original Machine Learning Model) to classify whether given voice data is a filler sound or not. A model built on AWS from voice features extracted with librosa may have the drawback that its test results are not accurate.

Meanwhile, to show which characteristic the user exhibited at which point, a reference text must exist. Converting speech into text is actively studied in the field of ASR (Automatic Speech Recognition). Converting speech into text requires an understanding of the voice signal, and because the output must be in the language people actually use, it also requires an understanding of the language. It further requires corpus data of the language people actually use, and because this data is enormous, speech-to-text conversion is generally difficult for anyone other than a large company or research institute. Therefore, the present invention can apply the Google speech API, a text conversion service provided by Google.

When the user uploads voice data to Google Cloud, the service analyzes the file and returns the information converted into text as a JSON file. As shown in FIG. 6, this JSON file can contain the text of the speech together with the times at which each word starts and ends. The script and the speaking rate can be analyzed on this basis.
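
As a rough illustration of how word timings could feed a speaking-rate analysis, the following Python sketch works on a simplified transcript. The field names `word`, `start`, and `end`, the sample data, and the seconds-per-character threshold are all assumptions made for this example; the actual JSON layout is the one shown in FIG. 6, and the description does not specify how the rate is computed.

```python
import json

# Hypothetical, simplified transcript. Field names and values are illustrative
# only and do not reproduce the JSON layout of FIG. 6.
SAMPLE_JSON = """
[
  {"word": "today",   "start": 0.0, "end": 0.8},
  {"word": "um",      "start": 1.0, "end": 1.6},
  {"word": "we",      "start": 1.7, "end": 2.1},
  {"word": "present", "start": 2.2, "end": 3.0},
  {"word": "results", "start": 3.0, "end": 4.0}
]
"""


def words_per_second(words):
    """Average rate from the start of the first word to the end of the last."""
    if not words:
        return 0.0
    duration = words[-1]["end"] - words[0]["start"]
    return len(words) / duration if duration > 0 else 0.0


def fast_words(words, max_seconds_per_char):
    """Words whose duration per character is below the (assumed) threshold."""
    fast = []
    for w in words:
        per_char = (w["end"] - w["start"]) / max(len(w["word"]), 1)
        if per_char < max_seconds_per_char:
            fast.append(w["word"])
    return fast


words = json.loads(SAMPLE_JSON)
rate = words_per_second(words)
quick = fast_words(words, max_seconds_per_char=0.15)
```

Normalizing by character count is only one possible choice. Syllable counts or a per-window word rate would fit the same word-timing data.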

In the present invention, librosa, an open-source Python library, can be used to extract various feature values from the voice. Audio files used on electronic devices generally have a sampling rate of 44100 Hz, and librosa reads such a digital signal resampled to 22050 Hz. The meaningful frequency band of the human voice is generally 0 to 3000 Hz, and because librosa's sampling rate is far higher than the frequencies a human voice can produce, librosa can capture the characteristics of the human voice adequately.

FIGS. 7 to 10 show the results of using the STFT (Short Time Fourier Transform), one of the tools provided by librosa, to extract frequency-related feature values.

The STFT groups the voice data into units of a fixed size and performs a Fourier transform on each unit. If voice data read at a sampling rate of 22050 Hz is grouped into units of 512 samples, a Fourier transform is performed roughly every 23 ms (512 / 22050 s ≈ 23.2 ms). Because 23 ms is short enough to follow how the frequency content of a voice changes, the STFT can be used to compute the change in the frequency components of the voice over time.
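
The frame arithmetic and the idea of a per-frame Fourier transform can be sketched in plain Python. This is a minimal illustration, not librosa's implementation: it uses non-overlapping 512-sample frames, a naive DFT, and no window function, and the test tone is an assumption chosen to fall exactly on a DFT bin.

```python
import cmath
import math

SAMPLE_RATE_HZ = 22050  # rate at which the description says the audio is read
FRAME_SAMPLES = 512     # samples grouped into one unit, per the description


def frame_duration_ms(frame_samples, sample_rate_hz):
    """Length of one analysis unit in milliseconds."""
    return 1000.0 * frame_samples / sample_rate_hz


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency bins of a naive DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def dominant_frequencies(signal, frame_samples, sample_rate_hz):
    """Strongest frequency (Hz) in each non-overlapping frame."""
    freqs = []
    for start in range(0, len(signal) - frame_samples + 1, frame_samples):
        mags = dft_magnitudes(signal[start:start + frame_samples])
        peak_bin = max(range(len(mags)), key=mags.__getitem__)
        freqs.append(peak_bin * sample_rate_hz / frame_samples)
    return freqs


# Assumed test tone: about 430.7 Hz, aligned to DFT bin 10.
tone_hz = 10 * SAMPLE_RATE_HZ / FRAME_SAMPLES
signal = [math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE_HZ)
          for i in range(2 * FRAME_SAMPLES)]
track = dominant_frequencies(signal, FRAME_SAMPLES, SAMPLE_RATE_HZ)
```

With 512-sample frames at 22050 Hz, each bin is about 43 Hz wide, which bounds how finely pitch can be resolved in this simplified form.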

FIG. 7 shows the result of STFT analysis of a recording in which the user voices the notes 'do re mi fa sol la si' one at a time in order. The X axis represents time, the Y axis frequency, and the brightness the signal strength, so three-dimensional data about the voice can be obtained. In the spectrum of FIG. 7, sections with bright bands are sections that contain voice, and sections without them contain no voice. It can also be seen that as the pitch of the voice rises, the bright bands move up into higher frequency regions (the positive Y direction).

The spectrum of FIG. 7 shows whether voice is present and the strength of each frequency component of the voice over time. In the present invention, the data obtained from this spectrum can be used to extract feature values that characterize the presentation in ways useful to the user. Although the meaningful frequency band of the human voice is about 0 to 3000 Hz, there is not just a single dominant frequency at any given time. As the spectrum of FIG. 7 shows, the human voice appears as a combination of several frequencies. Another characteristic of the human voice is that a fundamental frequency exists, which is the lowest of the frequency components at a given time, and the remaining frequencies appear at integer multiples of the fundamental frequency.

Referring to FIG. 8, the portion marked in red is the fundamental frequency, and bright bands can be seen to repeat above it. The bright bands lie at integer multiples of the fundamental frequency. Because the frequency components of the human voice appear at integer multiples of the fundamental frequency, only the fundamental frequency needs to be extracted to capture changes in the pitch of the voice. The data extracted in this way is two-dimensional: fundamental frequency against time.

FIG. 9 shows the spectrum of FIG. 8 with noise removed, which is done in order to extract feature values from that spectrum. For this purpose, recording in a quiet environment can be assumed. In a quiet environment the human voice is the dominant signal (the voice is louder than the noise), so noise can be removed by ignoring sound below a certain level. As shown in FIG. 9, only the fundamental frequency can then be extracted from the spectrum in which everything except the signal corresponding to the human voice has been removed. The fundamental frequency corresponds to the lowest of the repeating bright bands, and because even this single band spans several frequencies, it is preferable to use the middle value of the band. Noise introduced in this process can in turn be handled with a filter that uses the mean and standard deviation to ignore values that lie too far from the mean relative to the standard deviation.
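
The three steps described above (ignoring low-level sound, taking the middle of the lowest bright band, and discarding values far from the mean) can be sketched as follows. This is a minimal sketch on a toy magnitude matrix. The threshold, the factor `k`, the bin width of 100 Hz, and the function names are assumptions for illustration; the description does not give concrete values.

```python
from statistics import mean, pstdev


def lowest_band_center(frame_mags, threshold):
    """Middle bin index of the lowest run of bins above threshold, or None."""
    start = None
    for k, m in enumerate(frame_mags):
        if m > threshold:
            if start is None:
                start = k
        elif start is not None:
            return (start + k - 1) // 2
    return None if start is None else (start + len(frame_mags) - 1) // 2


def fundamental_track(spectrogram, threshold, hz_per_bin):
    """Per-frame fundamental estimate in Hz; 0.0 marks frames with no voice."""
    track = []
    for frame in spectrogram:
        center = lowest_band_center(frame, threshold)
        track.append(0.0 if center is None else center * hz_per_bin)
    return track


def suppress_outliers(track, k=2.0):
    """Zero out voiced values more than k standard deviations from the mean."""
    voiced = [f for f in track if f > 0]
    if len(voiced) < 2:
        return list(track)
    mu, sigma = mean(voiced), pstdev(voiced)
    return [f if f == 0 or abs(f - mu) <= k * sigma else 0.0 for f in track]


# Toy spectrogram: rows are time frames, columns are frequency bins.
toy = [
    [0.1, 0.2, 5.0, 6.0, 5.0, 0.1, 4.0, 0.2],  # lowest band spans bins 2..4
    [0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1],  # silence
    [0.1, 0.1, 0.2, 5.0, 6.0, 5.0, 0.1, 0.1],  # lowest band spans bins 3..5
]
f0 = fundamental_track(toy, threshold=1.0, hz_per_bin=100.0)
cleaned = suppress_outliers([100.0, 102.0, 98.0, 101.0, 99.0, 0.0, 400.0])
```

Representing unvoiced frames as 0.0 mirrors the convention of FIG. 10, where sections without voice have a Y value of 0.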

FIG. 10 shows the signal obtained by extracting the fundamental frequency from the spectrum of FIG. 9. From FIG. 10 it is possible to determine whether voice is present: sections with voice appear as sections where the Y value is not 0, and sections without voice appear as sections where the Y value is 0. FIG. 10 also shows the fundamental frequency rising step by step.

From the two-dimensional data of fundamental frequency against time obtained in this way, selecting only the sections that contain voice yields information about the frequency of the speaker's voice. The mean of the data gives the average frequency of the voice, which is a measure of how high the user's vocal range is. This value can be used to estimate the user's gender and how high-pitched the voice is. The standard deviation indicates how much the frequency of the voice varies. Taking the standard deviation over the full length of the recording shows how much pitch variation the user applied over the whole presentation. This value can serve as a measure of how monotonous the speaker sounded during the presentation.

Cutting the recording into sections of about 20 s shows whether the presentation was monotonous in each 20 s section. The section over which the standard deviation is taken can also be made very narrow. For example, FIG. 10 contains seven word units in total, and taking the standard deviation for each word unit shows at which moments an abrupt change in frequency occurred. In other words, different feature values can be extracted depending on the length of the section over which a single standard deviation is computed. For example, the mean frequency over the whole recording indicates the average pitch of the voice and can distinguish gender. The standard deviation of the frequency over the whole recording indicates the variation in pitch over the entire presentation. The standard deviation of the frequency per section (e.g., 20 s) indicates the variation in pitch within each section. The standard deviation of the frequency per phrase indicates momentary changes in pitch.
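
The effect of the section length on these statistics can be sketched as below. This is an illustrative sketch only: it assumes a per-frame fundamental-frequency track in which 0 marks unvoiced frames (as in FIG. 10), and the function names and toy values are hypothetical.

```python
from statistics import mean, pstdev


def voiced_stats(values):
    """(mean, standard deviation) over voiced frames; (0.0, 0.0) if none."""
    voiced = [v for v in values if v > 0]
    if not voiced:
        return (0.0, 0.0)
    return (mean(voiced), pstdev(voiced))


def windowed_stats(track, frames_per_window):
    """Statistics for consecutive, non-overlapping windows of the track."""
    return [voiced_stats(track[i:i + frames_per_window])
            for i in range(0, len(track), frames_per_window)]


def frames_in(seconds, hop_samples=512, sample_rate_hz=22050):
    """Number of 512-sample frames that cover the given duration."""
    return max(1, round(seconds * sample_rate_hz / hop_samples))


# Toy track in Hz, two windows of four frames each.
track = [200.0, 200.0, 0.0, 200.0, 150.0, 250.0, 0.0, 200.0]
overall = voiced_stats(track)
per_window = windowed_stats(track, frames_per_window=4)
frames_per_20s = frames_in(20)
```

In the toy data both windows share the same mean pitch, but only the second shows variation, which is the kind of difference a per-section standard deviation is meant to expose.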

To extract values related to the amplitude of the voice, RMSE, one of librosa's tools, can be used. RMSE groups the digital signal read at a sampling rate of 22050 Hz into units of a fixed size and computes the root mean square (RMS) value of each unit. As with the STFT, the data is grouped into units of 512 samples, so the change in the RMS value can be computed about every 23 ms. This makes it possible to compute how the loudness of the voice changes over time.

FIG. 11 is an RMSE energy graph. The X axis represents time and the Y axis intensity. The RMSE graph provides two-dimensional data about the voice: the presence or absence of voice can be determined from the parts of the graph that rise and the parts that do not, and the loudness of the voice can be calculated from the Y values of the raised parts. To extract feature values from FIG. 11, recording in a quiet environment is assumed and noise is removed by ignoring sound below a certain level. FIG. 12 is a graph showing the main features of the sound data obtained in this way. Referring to FIG. 12, sections with voice appear as sections where the Y value is not 0, and sections without voice appear as sections where the Y value is 0. The graph also shows how the amplitude of the voice changed over time. The mean of the data in FIG. 12 gives the average loudness of the voice, which shows how loud the user's voice was during the presentation.

As with frequency, different feature values can be extracted depending on the section over which the standard deviation is computed. These values can serve as measures of whether the voice was loud enough, whether the presentation was monotonous, and whether there were abrupt changes in loudness. For example, the mean amplitude over the whole recording indicates the average loudness of the voice, and the standard deviation of the amplitude over the whole recording indicates the variation in loudness over the entire presentation. The standard deviation of the amplitude per section (20 s) indicates the variation in loudness within each section, and the standard deviation of the amplitude per phrase indicates momentary changes in loudness.
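
The frame-wise RMS calculation and the noise floor described above can be sketched in plain Python. This is a minimal sketch with non-overlapping 512-sample frames; the floor value and the toy signal are assumptions. In librosa itself the frame-wise RMS feature is, to the best of our knowledge, exposed as `librosa.feature.rms` in current releases (named `rmse` in older ones), and it uses overlapping frames by default.

```python
import math


def frame_rms(samples, frame_samples=512):
    """Root mean square of each non-overlapping frame."""
    out = []
    for start in range(0, len(samples) - frame_samples + 1, frame_samples):
        frame = samples[start:start + frame_samples]
        out.append(math.sqrt(sum(x * x for x in frame) / frame_samples))
    return out


def gate(values, floor):
    """Set values below the (assumed) noise floor to 0.0, as in FIG. 12."""
    return [v if v >= floor else 0.0 for v in values]


# Toy signal: one quiet frame (background noise) and one loud frame (voice).
quiet = [0.01] * 512
loud = [0.5 if i % 2 == 0 else -0.5 for i in range(512)]
energy = frame_rms(quiet + loud)
gated = gate(energy, floor=0.05)
```

The gated series can be passed to the same mean and standard-deviation analysis used for the fundamental-frequency track, since both use 0 to mark sections without voice.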

A web system and a web page that give feedback on a user's speech were implemented using the methods shown in FIGS. 7 to 12, and they are described in detail below with reference to FIGS. 13 and 14.

FIG. 13 is a diagram showing the web system structure of the present invention.

As shown in FIG. 13, the web system of the present invention is divided broadly into a server and a client. The server can be implemented with, for example, node.js, and the client with react.js. The server communicates with the Google API server to receive information and runs the Python library to analyze the voice data. When the server then delivers the analyzed voice data information to the client, the client can show the feedback on the voice data on the web screen.

FIG. 14 is a diagram showing a web page according to an embodiment of the present invention.

As shown in FIG. 14, when the user uploads previously recorded voice data to the web page and presses the 'Start analysis' button, the data analyzed by the system is displayed on the web screen. The web page can display the script of the voice data, a legend, recommended videos, and the like. The script is the part that shows the result of the conversion to text by the Google API, and its purpose is to show visually where in the user's speech each characteristic appears. The legend consists of buttons that mark the characteristic parts of the speech on the script: 'abrupt change in tone', 'high-decibel section', and 'fast-paced section'. Clicking a button marks the corresponding parts of the script in the color of that button. FIG. 14 shows an example after the red 'fast-paced section' button has been clicked, in which the words recognized as belonging to a fast-paced section are shown in red. Clicking another legend button marks the words that correspond to that legend feature. Only one legend can be displayed at a time.
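
The legend behaviour, in which one selected feature at a time marks words in the script, can be illustrated with a small Python sketch. It is purely illustrative: the feature keys, the word indices, and the bracket markers standing in for button colors are all hypothetical, and the embodiment renders the page with a react.js client rather than with Python.

```python
def highlight(words, flags, feature, open_mark="[", close_mark="]"):
    """Return the script with the words flagged for one feature marked."""
    selected = flags.get(feature, set())
    return " ".join(
        f"{open_mark}{w}{close_mark}" if i in selected else w
        for i, w in enumerate(words)
    )


# Hypothetical analysis result: word indices flagged per legend feature.
script_words = ["today", "we", "present", "results"]
flags = {
    "fast_section": {2, 3},
    "loud_section": {0},
    "tone_change": set(),
}
fast_view = highlight(script_words, flags, "fast_section")
loud_view = highlight(script_words, flags, "loud_section")
```

Keeping a separate set of word indices per feature makes it straightforward to show exactly one legend at a time, as the description requires.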

For the recommended videos, links to a variety of videos suited to gender, speaking rate, pitch, and so on are stored. When the script is analyzed, the system determines which characteristics the recorded file has, selects a suitable video link, and displays the linked video on the screen.

When the Google API applied for the speech feedback of the present invention produces a speech recognition result, it estimates the accuracy of that result itself and provides it as a value called confidence. This confidence value can therefore be used to check whether the script was produced well.

As described above, the web system and web page for speech feedback of the present invention require no separate production cost and incur only server maintenance costs and Google API usage fees, which is an advantage in terms of cost. Because it is provided as a web service, it also has the advantage over speech academies of being inexpensive and free of constraints of time and place. Revenue can be generated through a pricing structure such as a premium model, in which the most basic functions are provided free of charge and some revenue comes from advertising, while continued use and additional functions (for example, storage or more specialized analysis) require payment.

Preferred embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to them, and a person of ordinary skill in the art to which the present invention pertains can make substitutions and/or modifications based on this description without departing from the technical idea of the present invention. Therefore, the scope of the present invention should not be limited to the embodiments described above but should be determined by the claims below and their equivalents.

Claims

A web service system for speech feedback, comprising:
a server that communicates with a Google API server to receive voice file information converted into text, concurrently runs a Python library to analyze voice data, and delivers the analyzed voice data to a client device;
a display for displaying a web page configured so that a user can upload a voice file and check the analyzed voice file information; and
a client that delivers a voice file to the server, or delivers the voice file received from the server to the display so that it is shown on the web page.

The web service system for speech feedback of claim 1,
wherein the Python library includes librosa.

The web service system for speech feedback of claim 1,
wherein the analysis of the voice data includes extraction of frequency-related feature values and extraction of amplitude-related feature values.

The web service system for speech feedback of claim 3,
wherein the extraction of frequency-related feature values uses the STFT (Short Time Fourier Transform) of librosa, and the extraction of amplitude-related feature values uses the RMSE of librosa.

The web service system for speech feedback of claim 1,
wherein the web page comprises an analysis start button, a script, a legend button, a recommended video, and the like.

The web service system for speech feedback of claim 5,
wherein the script shows the result of converting the voice file into text.

The web service system for speech feedback of claim 5,
wherein the legend button is a button for marking an abrupt change in tone, a high-decibel section, a fast-paced section, and the like on the script.