KR20200069264A

KR20200069264A - System for outputing User-Customizable voice and Driving Method thereof

Info

Publication number: KR20200069264A
Application number: KR1020200035167A
Authority: KR
Inventors: 최현희
Original assignee: 최현희
Priority date: 2020-03-23
Filing date: 2020-03-23
Publication date: 2020-06-16

Abstract

The present invention relates to a voice output system that allows selection of a user-customized voice, and to a method for driving the same. The system includes a cloud server and a smart speaker device. The cloud server includes a voice learning unit, which collects and stores user voice data from a user terminal running a dedicated application for the user-customized voice output service and, when the amount of voice data stored for a specific user is equal to or greater than a reference value, analyzes that data to learn the characteristics of the specific user's voice, and a voice style file generation unit, which generates a voice style file defining the voice characteristics of the specific user based on what the voice learning unit has learned. The smart speaker device includes a text-to-speech (TTS) module unit, which converts text data received from the user terminal on which the dedicated application is installed into voice data, and a voice output unit, which applies the voice data converted by the TTS module unit to the voice style file of the specific user received from the cloud server and outputs it in the voice of the specific user. With this configuration, the voice of an ordinary person chosen by the user can be learned, and text data can be output in the voice of the person the user wants.

Description

System for outputting user-customizable voice and driving method thereof

The present invention relates to a voice output device and, more particularly, to a voice output system that allows selection of a user-customized voice and a driving method thereof.

With the advance of artificial intelligence, speech recognition technology is being actively developed to provide greater convenience. Speech recognition technology is a technology in which a computer converts acoustic signals obtained through a sound sensor, such as a microphone, into words or sentences.

For example, in the neural network based approach, one of the voice generation methods widely used in voice conversion, individual models are built from features that represent the spectral shape, and a further model that maps the output values of those models onto each other is then built to perform the conversion.

Here, a restricted Boltzmann machine or the like may be used for the models that reflect the features of each spectral shape. An artificial neural network, a Bernoulli bidirectional associative memory, or the like may be used for the model that maps the output values of those models.

Because existing neural network based voice generation methods learn the data through a variety of models, they have the advantage of reflecting the nonlinear characteristics of the data well. However, they are limited in that a model must be trained through 1:1 mapping. That is, converting one speaker's voice into the voices of several speakers requires a large number of models.

Speech analysis in particular is an essential and very important step for speech recognition and speech synthesis. Speech parameters are divided into those extracted in the time domain and those extracted in the frequency domain. In the time domain, energy, zero-crossing rate, pitch period, pitch frequency, linear prediction coefficients (LPC), LPC cepstrum coefficients, and Mel-frequency cepstral coefficients (MFCC) are used; in the frequency domain, the spectrum envelope and formants are used.
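As an illustration of two of the time-domain parameters named above, the following is a minimal sketch that computes short-time energy and zero-crossing rate per frame. The function names, the frame length, and the test tone are assumptions made for this example; the patent does not specify any particular implementation.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a sample list into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def short_time_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

# Assumed test input: a 200 Hz sine tone sampled at 8 kHz (0.1 s long).
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]

frames = frame_signal(tone, frame_len=400, hop=200)
features = [(short_time_energy(f), zero_crossing_rate(f)) for f in frames]
```

For a unit-amplitude sine the energy of each frame is close to 0.5, and a 200 Hz tone at 8 kHz crosses zero roughly once every 20 samples, so the zero-crossing rate is close to 0.05.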

Meanwhile, technology that outputs text messages as speech on smart devices such as smartphones is already widespread. However, such technology only outputs the artificial-intelligence voice designated in the terminal, that is, a limited set of voices.

KR 10-2001-0026402 A; KR 10-2004-0013071 A; KR 10-2014-0015228 A

The present invention was derived against this background. Its object is to provide a voice output system that allows selection of a user-customized voice, which learns the voice of an ordinary person chosen by the user and outputs text data in the voice of that person, and a driving method thereof.

A further object is thereby to provide a voice output system and driving method that can offer a friendlier and more engaging TTS (Text to Speech) service.

To achieve the above object, the present invention includes the following configuration.

That is, a voice output system allowing selection of a user-customized voice according to an embodiment of the present invention includes a cloud server and a smart speaker device. The cloud server includes a voice learning unit that collects and stores user voice data from a user terminal running a dedicated application for the user-customized voice output service and, when the amount of voice data stored for a specific user is equal to or greater than a reference value, analyzes the voice data of the specific user to learn its characteristics, and a voice style file generation unit that generates a voice style file defining the voice characteristics of the specific user based on what the voice learning unit has learned. The smart speaker device includes a TTS (Text to Speech) module unit that converts text data received from the user terminal on which the dedicated application is installed into voice data, and a voice output unit that applies the voice data converted by the TTS module unit to the voice style file of the specific user received from the cloud server and outputs it in the voice of the specific user.

In one aspect of the present invention, when the voice style file generation unit generates the voice style file of the specific user, the dedicated application for the user-customized voice output service generates and displays a voice selection button for the specific user.

According to another aspect, the smart speaker device further includes a short-range wireless communication unit that performs short-range wireless communication with the user terminal, and the voice output unit receives text data from the user terminal through the short-range wireless communication unit.

Meanwhile, a method of driving the voice output system allowing selection of a user-customized voice includes: a step in which the cloud server collects and stores user voice data from a user terminal running the dedicated application for the user-customized voice output service and, when the amount of voice data stored for a specific user is equal to or greater than a reference value, analyzes the voice data of the specific user to learn its characteristics; a step in which the cloud server generates, based on what was learned, a voice style file defining the voice characteristics of the specific user; a step in which the smart speaker device converts text data received from the user terminal on which the dedicated application is installed into voice data; and a step in which the smart speaker device applies the converted voice data to the voice style file of the specific user received from the cloud server and outputs it in the voice of the specific user.

According to the present invention, by providing a voice output system that learns the voice of an ordinary person chosen by the user and outputs text data in the voice of that person, together with a driving method thereof, a friendlier TTS (Text to Speech) service can be provided in the voice the user wants, and the user's interest can be increased.

Accordingly, books can be read aloud in the voice of a parent, teacher, or other caregiver, which eases infants' and toddlers' sense of unfamiliarity, increases their sense of closeness and security, and markedly increases their immersion in the story.

In addition, a smart speaker device can be provided that does not merely output video or sound but outputs a variety of book content or current-affairs content in the voice selected by the user.

FIG. 1 is an exemplary view showing the configuration of a voice output system allowing selection of a user-customized voice according to an aspect of the present invention.
FIG. 2 is a block diagram showing the configuration of a cloud server according to an embodiment of the present invention.
FIG. 3 is a block diagram showing the configuration of a smart speaker device according to an embodiment of the present invention.
FIG. 4 is an exemplary view of a voice collection screen in the dedicated application for the user-customized voice output service according to an embodiment of the present invention.
FIG. 5 is an exemplary view of a script transmission screen of the dedicated application for the user-customized voice output service according to an embodiment of the present invention.
FIG. 6 is a flowchart showing a method of driving a voice output system allowing selection of a user-customized voice according to an embodiment of the present invention.

It should be noted that the technical terms used in the present invention are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in the present invention, the technical terms used herein should be interpreted as having the meaning generally understood by a person of ordinary skill in the technical field to which the present invention belongs, and should be interpreted neither in an excessively broad sense nor in an excessively narrow sense.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is an exemplary view showing the configuration of a voice output system allowing selection of a user-customized voice according to an aspect of the present invention.

As shown in FIG. 1, the voice output system according to an embodiment of the present invention includes a user terminal 10, a cloud server 20, and a smart speaker device 30.

The dedicated application for the user-customized voice output service is installed and run on the user terminal 10.

In one embodiment, the user terminal 10 may be a personal computer carried by the user, such as a notebook, desktop, or laptop, or any kind of handheld wireless communication device, such as a PCS (Personal Communication System), GSM (Global System for Mobile communications), PDC (Personal Digital Cellular), PHS (Personal Handyphone System), PDA (Personal Digital Assistant), IMT (International Mobile Telecommunication)-2000, CDMA (Code Division Multiple Access)-2000, W-CDMA (Wideband Code Division Multiple Access), or WiBro (Wireless Broadband Internet) terminal, a smartphone, a smart pad, or a tablet PC. However, the present invention is not limited to these.

FIG. 2 is a block diagram showing the configuration of a cloud server according to an embodiment of the present invention.

As can be seen from FIG. 2, the cloud server 20 includes a voice learning unit 220, a voice style file generation unit 230, a voice data storage unit 210, and a communication unit 240.

The cloud server 20 collects user voice data from the user terminal 10 through the communication unit 240 and stores it in the voice data storage unit 210.

The voice data storage unit 210 may include volatile or nonvolatile memory. The voice data storage unit 210 may store, for example, commands or data related to at least one other component of the cloud server 20. According to one embodiment, the voice data storage unit 210 may also store software or programs.

Here, a program may include, for example, a kernel, middleware, an application programming interface (API), or the dedicated application program for the user-customized voice output service.

In one embodiment, the voice data storage unit 210 may be implemented as a big data store that holds voice data.

The voice learning unit 220 analyzes the voice data of a specific user when the amount of that user's voice data stored in the voice data storage unit 210 is equal to or greater than a reference value.

Here, the reference value for the amount of stored voice data is an amount sufficient to identify the voice of the specific user. It may be counted by the play time (duration) of the stored voice data, or by the storage size of the voice data. Once voice data of at least a certain duration, or of at least a certain size, has been stored for one person's voice in the voice data storage unit 210, that person's voice data can be analyzed.
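The reference-value check described above can be sketched as follows. The record type `VoiceClip`, the function `ready_for_learning`, and both threshold values are hypothetical; the patent states only that the amount may be counted by play time or by storage size, not what the limits are.

```python
from dataclasses import dataclass

@dataclass
class VoiceClip:
    user_id: str         # whose voice the clip contains
    duration_sec: float  # play time of the clip
    size_bytes: int      # storage size of the clip

# Hypothetical thresholds; the source gives no concrete values.
MIN_TOTAL_SECONDS = 600.0
MIN_TOTAL_BYTES = 50 * 1024 * 1024

def ready_for_learning(clips, user_id,
                       min_seconds=MIN_TOTAL_SECONDS,
                       min_bytes=MIN_TOTAL_BYTES):
    """Return True once enough voice data is stored for user_id.

    Either criterion (total play time or total storage size) is enough.
    """
    own = [c for c in clips if c.user_id == user_id]
    total_seconds = sum(c.duration_sec for c in own)
    total_bytes = sum(c.size_bytes for c in own)
    return total_seconds >= min_seconds or total_bytes >= min_bytes

stored = [
    VoiceClip("mom", 400.0, 6_000_000),
    VoiceClip("mom", 400.0, 6_000_000),
    VoiceClip("dad", 30.0, 500_000),
]
```

With these sample clips, "mom" passes the play-time criterion (800 seconds in total) while "dad" passes neither, so only the first voice would be handed to the voice learning unit.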

The voice learning unit then analyzes the voice data of the specific user and learns its characteristics. In one embodiment, the voice learning unit 220 may be implemented as an application program capable of analyzing the wavelength, frequency, timbre, and the like of the user's voice data. Alternatively, the voice learning unit 220 may include a sensing device or the like, and is to be interpreted as covering any means that extracts and distinguishes the wavelength, frequency, timbre, and the like of an input voice.

To obtain the user's voice data, the voice learning unit 220 may also collect the user's voice data while the user is making a voice call on the user terminal 10. This collection preferably takes place only when the user has permitted in-call voice data collection through the dedicated application for the user-customized voice output service.

In one aspect, the voice learning unit 220 performs deep learning based voice learning. Deep learning is defined as a set of machine learning algorithms that attempt high-level abstraction (summarizing the key content or function in large volumes of data or complex material) through a combination of several nonlinear transformation techniques.

Broadly speaking, it is a field of machine learning that teaches computers the human way of thinking. Accordingly, the voice characteristics of a specific user can be defined from a smaller amount of voice data than before.

The voice learning unit 220 may learn the voices of ordinary people, such as the owner of the user terminal 10 and members of the owner's family, and may also learn the voices of public figures such as entertainers, announcers, politicians, and professors.

The voice style file generation unit 230 generates a voice style file defining the voice characteristics of the specific user based on what the voice learning unit 220 has learned. The voice style file contains the information needed to reconstruct the user's voice from it.

In one embodiment, the voice style file defining the voice characteristics of the specific user may be style.xml. XML (Extensible Markup Language) is a general-purpose markup language developed by the W3C and recommended for creating other special-purpose markup languages. XML is a simplified subset of SGML and can be used to describe many different kinds of data. It is mainly intended to make it easy for different kinds of systems, especially those connected to the Internet, to exchange data.
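The patent names style.xml but does not disclose its schema. The following is a minimal sketch, using Python's standard `xml.etree.ElementTree` module, of how such a file might be written and read back. The element name `voiceStyle`, the `param` entries, and the feature names are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def build_style_xml(user_id, features):
    """Serialize learned voice features into an XML string.

    The schema (voiceStyle/param) is hypothetical, not taken from the patent.
    """
    root = ET.Element("voiceStyle", attrib={"userId": user_id})
    for name, value in features.items():
        param = ET.SubElement(root, "param", attrib={"name": name})
        param.text = str(value)
    return ET.tostring(root, encoding="unicode")

def parse_style_xml(xml_text):
    """Read the user id and numeric features back from the XML string."""
    root = ET.fromstring(xml_text)
    features = {p.get("name"): float(p.text) for p in root.findall("param")}
    return root.get("userId"), features

# Hypothetical feature values for one learned voice.
style_xml = build_style_xml("mom", {"pitch_hz": 210.0, "speaking_rate": 1.05})
user_id, features = parse_style_xml(style_xml)
```

A text-based format like this keeps the file easy to transfer from the cloud server to the smart speaker device and to inspect by hand, which matches the data-exchange role the description assigns to XML.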

In one aspect of the present invention, when the voice style file generation unit 230 generates the voice style file of a specific user, the dedicated application for the user-customized voice output service generates and displays a voice selection button for that user.

That is, each voice for which the cloud server 20 has generated a voice style file is offered to the user as a selectable item in the dedicated application.

In addition, according to a further aspect of the present invention, the voice learning unit 220 may additionally reflect the user's emotional state information and intonation information when learning the characteristics of the user's voice data. In this case, the voice style file includes the user's emotional state information and intonation information.

For example, even a single person's voice can express different emotional states by adjusting its speed or loudness. For a happy or excited state, the volume may be set louder, the speed faster, or the voice tone higher; for a calm or sad state, the volume may be set lower, the speed slower, and the voice tone lower. However, the invention is not limited to this, and various elements capable of expressing emotion may be varied and applied.
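The emotion-dependent adjustment described above can be sketched as a table of scale factors applied to a speaker's baseline settings. The preset names and every numeric factor below are assumptions chosen only to follow the stated direction (louder, faster, higher for happy or excited; quieter, slower, lower for calm or sad).

```python
# Hypothetical scale factors per emotion; the patent gives no numbers.
EMOTION_PRESETS = {
    "happy":   {"volume": 1.2, "rate": 1.15, "pitch": 1.10},
    "excited": {"volume": 1.3, "rate": 1.25, "pitch": 1.15},
    "calm":    {"volume": 0.9, "rate": 0.90, "pitch": 0.95},
    "sad":     {"volume": 0.8, "rate": 0.85, "pitch": 0.90},
}
NEUTRAL = {"volume": 1.0, "rate": 1.0, "pitch": 1.0}

def apply_emotion(base, emotion):
    """Scale a speaker's baseline volume, rate, and pitch for an emotion.

    Unknown emotions fall back to the neutral (unchanged) settings.
    """
    scale = EMOTION_PRESETS.get(emotion, NEUTRAL)
    return {key: base[key] * scale[key] for key in NEUTRAL}

# Assumed baseline for one voice: relative volume, relative rate, pitch in Hz.
baseline = {"volume": 0.8, "rate": 1.0, "pitch": 200.0}
happy = apply_emotion(baseline, "happy")
sad = apply_emotion(baseline, "sad")
```

Keeping the presets as data rather than code makes it straightforward to store them alongside the other parameters in the voice style file.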

The communication unit 240 is to be interpreted as covering any technical configuration that transmits data to and receives data from the user terminal 10 and the smart speaker device 30. In one embodiment, the communication unit 240 receives voice data from the user terminal 10 and, through the dedicated application for the user-customized voice output service, provides information on the voices for which a voice style file has been generated.

The communication unit 240 also provides the smart speaker device 30 with the voice style file of the specific user requested by the user terminal 10.

FIG. 3 is a block diagram showing the configuration of a smart speaker device according to an embodiment of the present invention.

As can be seen from FIG. 3, the smart speaker device 30 includes a TTS module unit 310, a voice output unit 320, and a communication unit 330 that includes a short-range wireless communication unit 335.

In one embodiment, the functions of the smart speaker device 30 may be mounted on an existing Bluetooth speaker, or the device may be implemented as a Linux OS based Linux PC. However, it is not limited to these.

The TTS module unit 310 receives text data, through the dedicated application for the user-customized voice output service, from the user terminal 10 on which that application is installed. It then converts the received text data into voice data.

TTS technology builds a pronunciation database covering every phoneme of a language and concatenates entries from it to generate continuous speech, adjusting the loudness, duration, pitch, and the like of the speech to synthesize a natural-sounding voice. Natural language processing technology may be included for this purpose.
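The concatenation idea in the paragraph above can be illustrated with a toy sketch. The unit database, the phoneme labels, and the crude sample-repetition method of lengthening are all assumptions for this example; a real TTS engine uses recorded or modelled units and far more careful prosody control.

```python
# Hypothetical unit database: phoneme label -> short waveform (sample list).
UNIT_DB = {
    "a": [0.0, 0.6, 0.0, -0.6],
    "n": [0.0, 0.3, 0.0, -0.3],
}

def stretch(unit, factor):
    """Lengthen a unit by repeating each sample `factor` times."""
    return [sample for sample in unit for _ in range(factor)]

def synthesize(phonemes, gain=1.0, length_factor=1):
    """Concatenate database units, scaling loudness and duration."""
    output = []
    for label in phonemes:
        unit = stretch(UNIT_DB[label], length_factor)
        output.extend(gain * sample for sample in unit)
    return output

# Join "n" and "a", at half loudness and double duration.
speech = synthesize(["n", "a"], gain=0.5, length_factor=2)
```

Here `gain` stands in for the loudness adjustment and `length_factor` for the duration adjustment mentioned in the text; pitch adjustment is omitted to keep the sketch short.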

The voice output unit 320 applies the voice data converted by the TTS module unit 310 to the voice style file of the specific user received from the cloud server 20 and outputs it in the voice of the specific user.

The voice output unit 320 can output a voice signal to the outside. For example, the voice output unit 320 converts the voice (or voice signal) data produced by the TTS module unit 310 into audible sound and outputs it.

The communication unit 330 is to be interpreted as covering all wired and wireless communication modules capable of transmitting data to and receiving data from the cloud server 20 and the user terminal 10.

In one embodiment, the communication unit 330 receives the voice style file from the cloud server 20 over Wi-Fi, that is, by wireless Internet. However, it is not limited to this.

According to one aspect, the communication unit 330 of the smart speaker device 30 according to an embodiment of the present invention includes a short-range wireless communication unit 335 and performs short-range wireless communication with the user terminal 10. In one embodiment, the short-range wireless communication unit 335 may be a Bluetooth module.

Bluetooth is a representative short-range communication method that enables voice and data communication between terminals at low cost and low power. In general, Bluetooth communication takes place when a master device and slave devices form a piconet.

The master device is the device that initiates Bluetooth communication by transmitting a Bluetooth signal, and a slave device is a device that receives the Bluetooth signal transmitted by the master device and communicates with the master device.

In one embodiment, the user terminal 10 may operate as the master device and the smart speaker device 30 as the slave device.

To discover Bluetooth devices located nearby, a Bluetooth device sets a frequency hopping sequence and broadcasts an inquiry signal. Bluetooth devices performing an inquiry scan then transmit their Bluetooth device address (BD_ADDR) and clock information to the Bluetooth device broadcasting the inquiry signal.

However, the short-range wireless communication unit 335 is not limited to a Bluetooth module and is to be interpreted as covering various modifications.

The voice output unit 320 of the smart speaker device 30 according to an embodiment of the present invention receives text data from the user terminal 10 through the short-range wireless communication unit 335.

The smart speaker device 30 can also receive control signals for its overall operation through the dedicated application for the user-customized voice output service on the user terminal 10, for example a power ON/OFF signal or a volume control signal.

In addition, it can receive from the user terminal 10 voice selection information, a script in the form of text data, and optionally emotion information for the voice.
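The three items the smart speaker device receives (voice selection, script, optional emotion) can be sketched as one message. The patent does not define a wire format; the `SpeakRequest` type, its field names, and the choice of UTF-8 encoded JSON are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class SpeakRequest:
    voice_id: str                  # which learned voice to use
    script: str                    # text to be read aloud
    emotion: Optional[str] = None  # optional emotion hint

def encode_request(request):
    """Serialize a request to UTF-8 JSON bytes for transmission."""
    return json.dumps(asdict(request), ensure_ascii=False).encode("utf-8")

def decode_request(payload):
    """Rebuild the request on the receiving side."""
    return SpeakRequest(**json.loads(payload.decode("utf-8")))

# Hypothetical request sent by the application on the user terminal.
outgoing = SpeakRequest(voice_id="voice1",
                        script="Once upon a time...",
                        emotion="calm")
received = decode_request(encode_request(outgoing))
```

Using `ensure_ascii=False` keeps non-ASCII scripts, such as Korean text, readable in the payload instead of escaping every character.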

FIG. 4 is an exemplary view of a voice collection screen in the dedicated application for the user-customized voice output service according to an embodiment of the present invention.

As shown in FIG. 4, the user's voice data can be collected by having the user record a sample script. The amount of script or the recording time is preferably sufficient to generate the user's voice style file.

Registration is not limited to one voice per user terminal 10. By receiving user identification information before the voice collection screen of FIG. 4, the voices of several people can be collected with a single user terminal 10.

도 5 는 본 발명의 일 실시예에 따른 사용자 맞춤형 음성 출력 서비스 전용 어플리케이션의 스크립트 전송화면의 예시도이다. 5 is an exemplary view of a script transmission screen of an application for a user-specific voice output service according to an embodiment of the present invention.

As shown in FIG. 5, a script can be chosen either by selecting a file stored in the user terminal 10 or by entering the text directly.

The user may also select a voice through the dedicated application for the user-customized voice output service according to an embodiment. Voice 1, Voice 2, Voice 3, and Voice 4 may each be the voice of a different person, such as mom, dad, grandmother, and grandfather.

In one embodiment, the dedicated application for the user-customized voice output service further provides a list of categories from which a voice can be selected.

For example, the voices of well-known people such as entertainers, announcers, politicians, and professors may be offered as a selection list, and the voices of ordinary people, such as the user who owns the user terminal 10 or his or her family and friends, may also be offered as a selection list.

Here, the dedicated application for the user-customized voice output service generates a voice selection button for each voice for which the voice style file generation unit 230 of the cloud server 20 has generated a voice style file.
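The button-generation rule can be illustrated with a minimal sketch. It assumes a hypothetical in-memory list of known voices and a set of voice identifiers for which the cloud server has already produced a style file; the field names (`id`, `name`, `category`) and the function `selectable_voices` are illustrative and do not appear in the original disclosure.

```python
from typing import Dict, Iterable, List, Set


def selectable_voices(
    voices: Iterable[Dict[str, str]], generated_ids: Set[str]
) -> Dict[str, List[str]]:
    """Group, by category, the voices that already have a style file.

    Only voices whose id is in `generated_ids` get a selection button,
    mirroring the rule that a button exists only once a style file exists.
    """
    buttons: Dict[str, List[str]] = {}
    for voice in voices:
        if voice["id"] in generated_ids:
            buttons.setdefault(voice["category"], []).append(voice["name"])
    return buttons


# Hypothetical catalogue: three voices, only two with a generated style file.
catalogue = [
    {"id": "v1", "name": "Mom", "category": "family"},
    {"id": "v2", "name": "Dad", "category": "family"},
    {"id": "v3", "name": "Announcer a", "category": "celebrity"},
]
print(selectable_voices(catalogue, {"v1", "v3"}))
```

In this sketch "Dad" has no button because no style file has been generated for that voice yet.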

The user terminal 10 transmits the selected text-form script file and the voice identification information to the smart speaker device 30 by short-range wireless communication.
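As a rough illustration of what such a transmission could carry, the sketch below bundles the voice identifier, the script text, and optional emotion information into one message. The disclosure does not specify a wire format; the JSON encoding, the class name `PlaybackRequest`, and its field names are assumptions made only for this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class PlaybackRequest:
    """Hypothetical message sent from the user terminal to the speaker."""

    voice_id: str                  # voice identification information
    script_text: str               # script in text form
    emotion: Optional[str] = None  # optional emotion information

    def to_bytes(self) -> bytes:
        # Serialize as UTF-8 JSON so non-ASCII scripts survive transport.
        return json.dumps(asdict(self), ensure_ascii=False).encode("utf-8")

    @classmethod
    def from_bytes(cls, data: bytes) -> "PlaybackRequest":
        return cls(**json.loads(data.decode("utf-8")))


request = PlaybackRequest(voice_id="mom", script_text="Once upon a time", emotion="calm")
received = PlaybackRequest.from_bytes(request.to_bytes())
```

The round trip through `to_bytes` and `from_bytes` stands in for the short-range wireless link, which is outside the scope of this sketch.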

When a start signal is input by selecting the 'Send' button in the dedicated application for the user-customized voice output service on the user terminal 10, the smart speaker device 30 receives the text-form script from the user terminal 10 by short-range wireless communication.

The received text-form script is then applied to the voice style file of the specific user chosen by the user and, using the TTS function, read aloud in that specific user's voice.

As one example, when a child goes to bed, the user may select 'Mom's voice' through the dedicated application for the user-customized voice output service on the user terminal 10 and select 'calm' as the emotion information. The user may additionally select a text-form storybook file that the child likes.

The smart speaker device 30 then receives the previously generated voice style file for 'Mom's voice' from the cloud server 20 and receives the storybook file from the user terminal 10 by short-range wireless communication. Using the TTS function, the smart speaker device 30 converts the received text-form storybook file into voice data, applies it to the voice style file for the mother's voice, and outputs it in the mother's voice.

Even when the user selects the same mother's voice, applying emotion information such as 'calm' makes it possible to output the voice at a slower speed, at a lower volume, or with a greater share of low tones. The smart speaker device can thus give the child the voice of a parent or primary caregiver at any time, which can increase intimacy.

As another example, in the morning the user may select 'Announcer a's voice' through the dedicated application for the user-customized voice output service on the user terminal 10 and select 'lively' as the emotion information. The user may additionally select that day's newspaper file.

The smart speaker device 30 then receives the voice style file for 'Announcer a's voice' from the cloud server 20 and receives that day's newspaper file from the user terminal 10 by wireless communication. Using the TTS function, the smart speaker device 30 converts the received text-form newspaper file into voice data and outputs it in the voice of announcer a.

Even when the user selects the same voice of announcer a, applying emotion information such as 'lively' makes it possible to output the voice at a somewhat faster speed, at a higher volume, or with a greater share of high tones. This provides a smart speaker device that delivers current affairs in various text forms in a familiar voice of the user's choosing.
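The two emotion examples ('calm' and 'lively') can be summarized as a lookup from an emotion label to output adjustments. The sketch below is only an illustration: the class `ProsodyAdjustment`, the preset table, and every numeric value are assumptions, since the disclosure states the direction of each adjustment (slower or faster, quieter or louder, more low or high tones) but not any magnitudes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProsodyAdjustment:
    rate: float          # speed multiplier, 1.0 = unchanged
    volume: float        # volume multiplier, 1.0 = unchanged
    low_gain_db: float   # extra emphasis on low tones
    high_gain_db: float  # extra emphasis on high tones


NEUTRAL = ProsodyAdjustment(rate=1.0, volume=1.0, low_gain_db=0.0, high_gain_db=0.0)

# Hypothetical presets; the numbers are illustrative, not from the disclosure.
EMOTION_PRESETS = {
    "calm": ProsodyAdjustment(rate=0.85, volume=0.7, low_gain_db=3.0, high_gain_db=0.0),
    "lively": ProsodyAdjustment(rate=1.1, volume=1.2, low_gain_db=0.0, high_gain_db=3.0),
}


def adjustment_for(emotion: Optional[str]) -> ProsodyAdjustment:
    """Return the preset for an emotion label, or no change if unknown."""
    if emotion is None:
        return NEUTRAL
    return EMOTION_PRESETS.get(emotion, NEUTRAL)
```

Falling back to a neutral adjustment for a missing or unknown label keeps the same voice usable when no emotion information is sent.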

As yet another example, when the owner of the user terminal 10 is a student, the student may select 'Idol b's voice' through the dedicated application for the user-customized voice output service on the user terminal 10 and select, as the script, textbook summary notes covering the exam material.

The smart speaker device 30 then receives the previously generated voice style file for 'Idol b's voice' from the cloud server 20 and receives the text file of the textbook summary from the user terminal 10 by wireless communication. The text file of the textbook summary may have been written by the user.

Using the TTS function, the smart speaker device 30 converts the received text file of the textbook summary into voice data and outputs it in the voice of idol b. This provides a smart speaker device that can both stimulate interest in learning and produce an effective review effect through repeated learning.

To use storage space efficiently, the smart speaker device 30 is preferably implemented to delete the script received from the user terminal 10 after outputting it in the specific user's voice.

FIG. 6 is a flowchart illustrating a method of driving a voice output system capable of selecting a user-customized voice according to an embodiment of the present invention.

First, the cloud server collects and stores user voice data from a user terminal running the dedicated application for the user-customized voice output service (S600).

If the amount of stored voice data for a specific user is greater than or equal to a reference value (S610), the cloud server analyzes that user's voice data and learns its characteristics (S620).

Here, the reference value for the amount of stored voice data is an amount sufficient to characterize the specific user's voice. It may be counted as the duration of the stored voice data or as its storage size.
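The check in S610 can be sketched as follows, under the assumption that the server tracks both the total recorded duration and the total stored size for each user. The names `VoiceDataStats` and `reached_reference`, and the idea of configuring either threshold independently, are hypothetical; the disclosure gives no concrete reference values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceDataStats:
    total_seconds: float  # accumulated recording time for one user
    total_bytes: int      # accumulated storage size for one user


def reached_reference(
    stats: VoiceDataStats,
    min_seconds: Optional[float] = None,
    min_bytes: Optional[int] = None,
) -> bool:
    """Return True when the stored data meets every configured threshold.

    The reference value may be expressed as a duration, as a storage
    size, or both; at least one of them must be given.
    """
    if min_seconds is None and min_bytes is None:
        raise ValueError("configure min_seconds, min_bytes, or both")
    if min_seconds is not None and stats.total_seconds < min_seconds:
        return False
    if min_bytes is not None and stats.total_bytes < min_bytes:
        return False
    return True
```

The comparison is "greater than or equal to", matching the wording of S610, so data exactly at the reference value triggers learning.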

In one embodiment, learning the characteristics of a specific user's voice data may be implemented as an application program capable of analyzing the wavelength, frequency, timbre, and the like of the user's voice data. Alternatively, it is to be interpreted as covering any means, including one with a sensing device, that extracts and distinguishes the wavelength, frequency, timbre, and the like of an input voice.
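As a very small illustration of extracting two of the quantities named above, the sketch below estimates the dominant frequency of a sampled signal by counting upward zero crossings and derives the corresponding wavelength in air. This is a simplified stand-in chosen for brevity, not the analysis method of the disclosure; the function names and the speed-of-sound constant are assumptions, and timbre analysis is not covered.

```python
import math
from typing import Sequence

SPEED_OF_SOUND_M_S = 343.0  # approximate value for air at room temperature


def estimate_frequency_hz(samples: Sequence[float], sample_rate: int) -> float:
    """Estimate the dominant frequency by counting upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    duration_s = len(samples) / sample_rate
    return crossings / duration_s


def wavelength_m(frequency_hz: float) -> float:
    """Wavelength in air for a given frequency (speed / frequency)."""
    return SPEED_OF_SOUND_M_S / frequency_hz


# Synthetic one-second 200 Hz tone standing in for recorded voice data.
rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / rate) for n in range(rate)]
frequency = estimate_frequency_hz(tone, rate)
```

Zero-crossing counting works for a clean single tone like this one; real speech would need a more robust pitch estimator.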

To secure the user's voice data, the user's voice data may also be collected while the user makes a voice call on the user terminal. This is preferably done only when the user has permitted voice data collection during calls through the dedicated application for the user-customized voice output service.

In one aspect, learning the characteristics of the voice data is performed as deep learning-based voice learning.

Deep learning is defined as a set of machine learning algorithms that attempt a high level of abstraction (summarizing the key content or functions in large amounts of data or complex material) through a combination of several nonlinear transformation techniques.

The cloud server then generates a voice style file defining the specific user's voice characteristics based on what was learned (S630).

In one embodiment, the voice style file defining the specific user's voice characteristics may be style.xml.

XML (Extensible Markup Language) is a general-purpose markup language developed by the W3C and recommended for creating other special-purpose markup languages. XML is a simplified subset of SGML and can be used to describe many different kinds of data. It is mainly intended to let different kinds of systems, especially systems connected to the Internet, exchange data easily.
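The disclosure names style.xml but does not define its schema. Purely as an illustration, the sketch below writes and reads a voice style document with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`voiceStyle`, `features`, `feature`, `emotions`, `emotion`) and the sample feature are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Optional


def build_style_xml(
    voice_id: str, features: Dict[str, float], emotions: Iterable[str]
) -> str:
    """Serialize learned voice features into a hypothetical style.xml layout."""
    root = ET.Element("voiceStyle", {"id": voice_id})
    feature_list = ET.SubElement(root, "features")
    for name, value in features.items():
        ET.SubElement(feature_list, "feature", {"name": name, "value": str(value)})
    emotion_list = ET.SubElement(root, "emotions")
    for name in emotions:
        ET.SubElement(emotion_list, "emotion", {"name": name})
    return ET.tostring(root, encoding="unicode")


def parse_voice_id(xml_text: str) -> Optional[str]:
    """Read back the voice identifier from a style document."""
    return ET.fromstring(xml_text).get("id")


style_xml = build_style_xml("mom", {"mean_pitch_hz": 210.0}, ["calm", "lively"])
```

A text format such as XML lets the cloud server and the smart speaker device exchange the style definition without sharing an implementation.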

Thereafter, the smart speaker device converts the text-form data received from the user terminal on which the dedicated application for the user-customized voice output service is installed into voice-form data (S640).

The smart speaker device receives the text-form data from the user terminal by short-range wireless communication.

Through the dedicated application for the user-customized voice output service, the user may choose a text-form script either by selecting a file stored in the user terminal or by entering text directly to create a text file.

The smart speaker device then applies the converted voice-form data to the specific user's voice style file received from the cloud server and outputs it in the specific user's voice (S650).

When the cloud server generates a voice style file for a specific user, the dedicated application for the user-customized voice output service generates and displays a voice selection button for that specific user.

According to an additional aspect of the present invention, the step of learning the characteristics of the voice data further reflects the user's emotional state information and intonation information when learning the characteristics of the user's voice data, and the voice style file includes the user's emotional state information and intonation information.

The smart speaker device may apply the text-form script received from the user terminal to the voice style file of the specific user chosen by the user and, using the TTS function, read it aloud in that specific user's voice.

The voice style file of the specific user chosen by the user is downloaded from the cloud server.

Thereafter, to use storage space efficiently, the smart speaker device is preferably implemented to delete the received script after outputting it in the specific user's voice.
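The speaker-side steps (receive the script, download the style file, convert text to voice data in S640, apply the style and output in S650, then delete the script) can be sketched as a small class. The TTS conversion, style application, and download are injected as plain callables because the disclosure does not specify them; the class `SmartSpeaker`, its method names, and the stand-in lambdas are all hypothetical.

```python
from typing import Callable, Dict, List


class SmartSpeaker:
    """Minimal model of the speaker-side flow with injected placeholders."""

    def __init__(
        self,
        download_style: Callable[[str], str],
        synthesize: Callable[[str], bytes],
        apply_style: Callable[[bytes, str], bytes],
    ) -> None:
        self.download_style = download_style  # fetches a style file by voice id
        self.synthesize = synthesize          # text -> voice data (S640)
        self.apply_style = apply_style        # voice data + style -> output (S650)
        self.scripts: Dict[str, str] = {}     # scripts received from the terminal
        self.played: List[bytes] = []         # stand-in for the audio output

    def receive_script(self, script_id: str, text: str) -> None:
        self.scripts[script_id] = text

    def play(self, script_id: str, voice_id: str) -> bytes:
        text = self.scripts[script_id]
        style = self.download_style(voice_id)
        speech = self.synthesize(text)
        styled = self.apply_style(speech, style)
        self.played.append(styled)
        # Delete the script after output to free storage space.
        del self.scripts[script_id]
        return styled


# Stand-in callables; a real device would call a cloud API and a TTS engine.
speaker = SmartSpeaker(
    download_style=lambda voice_id: f"style:{voice_id}",
    synthesize=lambda text: text.encode("utf-8"),
    apply_style=lambda audio, style: style.encode("utf-8") + b"|" + audio,
)
speaker.receive_script("s1", "Once upon a time")
output = speaker.play("s1", "mom")
```

Injecting the three operations keeps the ordering of the steps visible without committing to any particular TTS engine or download mechanism.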

10: user terminal 20: cloud server
30: smart speaker device 210: voice data storage unit
220: voice learning unit 230: voice style file generation unit
240: communication unit 310: TTS module unit
320: voice output unit 330: communication unit
335: short-range wireless communication unit

Claims

A voice output system capable of selecting a user-customized voice, comprising: a cloud server including a voice learning unit that collects and stores user voice data from a user terminal running a dedicated application for a user-customized voice output service and, if the amount of stored voice data of a specific user is greater than or equal to a reference value, analyzes the voice data of the specific user to learn the characteristics of the voice data of the specific user, and a voice style file generation unit that generates a voice style file defining the voice characteristics of the specific user based on the content learned by the voice learning unit; and
a smart speaker device including a text-to-speech (TTS) module unit that converts text-form data received from a user terminal on which the dedicated application for the user-customized voice output service is installed into voice-form data, and a voice output unit that applies the voice-form data converted by the TTS module unit to the voice style file of the specific user received from the cloud server and outputs it in the voice of the specific user,
wherein the dedicated application for the user-customized voice output service generates and outputs a voice selection button for the specific user when the voice style file generation unit generates the voice style file of the specific user,
wherein the dedicated application for the user-customized voice output service further provides a list of categories from which a voice can be selected,
wherein the smart speaker device receives, from the cloud server, the voice style file of the specific user selected with the voice selection button generated for that user in the dedicated application for the user-customized voice output service,
wherein the voice learning unit further reflects the user's emotional state information and intonation information in learning the characteristics of the user's voice data, and the voice style file includes the user's emotional state information and intonation information,
wherein the smart speaker device further includes a short-range wireless communication unit that performs short-range wireless communication with the user terminal,
wherein the voice output unit receives text-form data from the user terminal through the short-range wireless communication unit,
wherein the smart speaker device receives a control signal for the overall operation of the smart speaker device through the dedicated application for the user-customized voice output service, and additionally receives at least one of voice selection information, a script in the form of text data, and emotion information for the voice,
wherein the voice learning unit performs deep learning-based voice learning, and
wherein the voice learning unit further collects user voice data from the user terminal while the user makes a voice call on the user terminal.

A method of driving a voice output system capable of selecting a user-customized voice, comprising: collecting and storing, by a cloud server, user voice data from a user terminal running a dedicated application for a user-customized voice output service and, if the amount of stored voice data of a specific user is greater than or equal to a reference value, analyzing the voice data of the specific user to learn the characteristics of the voice data of the specific user;
generating, by the cloud server, a voice style file defining the voice characteristics of the specific user based on the learned content;
converting, by a smart speaker device, text-form data received from a user terminal on which the dedicated application for the user-customized voice output service is installed into voice-form data; and
applying, by the smart speaker device, the converted voice-form data to the voice style file of the specific user received from the cloud server and outputting it in the voice of the specific user,
wherein the dedicated application for the user-customized voice output service generates and outputs a voice selection button for the specific user when the cloud server generates the voice style file of the specific user,
wherein the dedicated application for the user-customized voice output service further provides a list of categories from which a voice can be selected,
wherein the smart speaker device receives, from the cloud server, the voice style file of the specific user selected with the voice selection button generated for that user in the dedicated application for the user-customized voice output service,
wherein the step of learning the characteristics of the voice data further reflects the user's emotional state information and intonation information in learning the characteristics of the user's voice data, and the voice style file includes the user's emotional state information and intonation information,
wherein the smart speaker device further receives text-form data from the user terminal by short-range wireless communication,
wherein the step of outputting in the voice further includes receiving the text-form data from the user terminal by the short-range wireless communication,
wherein the smart speaker device receives a control signal for the overall operation of the smart speaker device through the dedicated application for the user-customized voice output service, and additionally receives at least one of voice selection information, a script in the form of text data, and emotion information for the voice,
wherein the step of learning the characteristics of the voice data performs deep learning-based voice learning, and
wherein the step of learning the characteristics of the voice data further includes collecting user voice data from the user terminal while the user makes a voice call on the user terminal.