KR20210112825A

KR20210112825A - System and method for providing text-to-speech service based on celebrities' voice

Info

Publication number: KR20210112825A
Application number: KR1020200028347A
Authority: KR
Inventors: 정여름
Original assignee: 정여름
Priority date: 2020-03-06
Filing date: 2020-03-06
Publication date: 2021-09-15

Abstract

A method of providing a voice synthesis service according to an embodiment of the present disclosure includes the steps of: allowing a voice analysis part to analyze voice data of a specific person and build a voice DB based on the analysis result; allowing a communication part to receive, from a user terminal, content generated for the specific person selected by a user; allowing the communication part to receive a text-to-speech request including information on the selected specific person and the generated content; allowing a voice synthesis part, in response to the text-to-speech request, to extract voice data of the selected specific person from the voice DB, convert text information included in the content into voice synthesis data based on the extracted voice data, and provide the voice synthesis data to the user terminal; and allowing the communication part to share content including the voice synthesis data.

Description

SYSTEM AND METHOD FOR PROVIDING TEXT-TO-SPEECH SERVICE BASED ON CELEBRITIES' VOICE

The present disclosure relates to a method and system for providing a voice synthesis service based on the voice of a celebrity, and more particularly, to a method and system for providing a voice synthesis service that can convert text information included in content about a specific person preferred by a user, such as a celebrity, into voice information using the voice of that specific person.

Speech synthesis technology, generally called text-to-speech (TTS), is used to reproduce the required speech without recording an actual voice in advance in applications that need a human voice, such as public announcements, navigation, and artificial intelligence assistants. Typical speech synthesis methods include concatenative TTS, in which speech is cut in advance into very short units such as phonemes and stored, and the phonemes constituting the sentence to be synthesized are combined to synthesize speech, and parametric TTS, in which speech features are expressed as parameters and the parameters representing the speech features of the sentence to be synthesized are converted into speech corresponding to the sentence using a vocoder.

Meanwhile, with the recent increase in demand for electronic products that use voice data, such as artificial intelligence speakers, the demand for using voice-related electronic products through the voice of a specific person has also increased. To meet this demand, personalized text-to-speech (P-TTS), which uses the voice of a desired specific person, such as a popular entertainer or character, for speech synthesis, is being developed. Personalized speech synthesis synthesizes the voice of a specific person from collected voice data using deep learning (artificial neural network) speech synthesis technology, and can provide a lively and realistic voice synthesis service from the voice data of the desired specific person.

The embodiments disclosed herein provide a method and system for providing a voice synthesis service that can deliver a lively and realistic voice synthesis service by converting text information included in content shared among users who favor a specific person, such as a celebrity, into voice synthesis data using that person's voice data and playing it back.

A method of providing a voice synthesis service according to an embodiment of the present disclosure includes: analyzing, by a voice analysis unit, voice data of a specific person and building a voice data DB based on the analysis result; receiving, by a communication unit, content generated for a specific person selected by a user from a user terminal; receiving, by the communication unit, a text-to-speech request including information on the selected specific person and the generated content; extracting, by a voice synthesis unit, in response to the text-to-speech request, voice data of the selected specific person from the voice data DB, converting text information included in the content into voice synthesis data based on the extracted voice data, and providing the voice synthesis data to the user terminal; and sharing, by the communication unit, content including the voice synthesis data.
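The sequence of steps in this method can be illustrated with a minimal Python sketch. Everything named here (`TTSRequest`, `VoiceSynthesisServer`, the in-memory dictionary standing in for the voice data DB, and the placeholder audio string) is a hypothetical illustration; the disclosure does not specify data structures or a synthesis model.

```python
from dataclasses import dataclass, field


@dataclass
class TTSRequest:
    """Text-to-speech request: the selected person plus the content text."""
    person_id: str
    text: str


@dataclass
class VoiceSynthesisServer:
    # Hypothetical in-memory stand-in for the voice data DB:
    # person_id -> analysis result (here just a small dict).
    voice_db: dict = field(default_factory=dict)
    shared_posts: list = field(default_factory=list)

    def build_voice_db(self, person_id: str, samples: list) -> None:
        """Analyze voice data and store the result (placeholder analysis)."""
        self.voice_db[person_id] = {"num_samples": len(samples)}

    def synthesize(self, request: TTSRequest) -> dict:
        """Look up the selected person's voice and convert the text."""
        if request.person_id not in self.voice_db:
            raise KeyError(f"no voice data for {request.person_id!r}")
        profile = self.voice_db[request.person_id]
        # Placeholder for a real TTS model call; returns a tag, not audio.
        audio = f"<audio voice={request.person_id} chars={len(request.text)}>"
        return {"person_id": request.person_id, "text": request.text,
                "audio": audio, "profile": profile}

    def share(self, synthesized: dict) -> int:
        """Publish content that includes the synthesized audio."""
        self.shared_posts.append(synthesized)
        return len(self.shared_posts)
```

A caller would first call `build_voice_db`, then pass a `TTSRequest` to `synthesize`, and finally hand the result to `share`; a request for a person without stored voice data raises `KeyError`.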

According to an embodiment, the method of providing a voice synthesis service further includes: calculating, by a fee management unit, based on the text information, the amount of voice synthesis data converted by the voice synthesis unit or the playback time of the voice synthesis data; and determining, by the fee management unit, a fee based on at least one of the calculated amount of voice synthesis data and the playback time of the voice synthesis data.
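One way such a fee calculation could look is sketched below in Python. The character count as the measure of data amount, the speaking rate of 5 characters per second, and the two rate parameters are all assumptions made for illustration; the disclosure does not fix units or prices.

```python
def estimate_playback_seconds(text: str, chars_per_second: float = 5.0) -> float:
    """Estimate playback time from text length (assumed constant speaking rate)."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return len(text) / chars_per_second


def determine_fee(text: str,
                  rate_per_char: float = 0.0,
                  rate_per_second: float = 0.0,
                  chars_per_second: float = 5.0) -> float:
    """Fee based on data amount, playback time, or both.

    Setting one of the two rates to zero charges on the other basis only.
    """
    amount = len(text)  # assumed proxy for the amount of synthesized data
    seconds = estimate_playback_seconds(text, chars_per_second)
    return amount * rate_per_char + seconds * rate_per_second
```

Keeping both rates as parameters mirrors the "at least one of" wording: an operator could charge per character, per second of audio, or a combination.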

According to an embodiment, the method of providing a voice synthesis service further includes determining, by a rank management unit, a voice synthesis ranking of a plurality of specific persons based on at least one of the determined fee and the amount of voice synthesis data.
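A ranking of this kind can be sketched as a simple aggregation in Python. The record layout (`person_id`, `fee`, `chars`) is hypothetical, and summing per person is one plausible reading of "based on" the fee or data amount rather than the method stated in the disclosure.

```python
from collections import defaultdict


def rank_persons(usage_records: list, key: str = "fee") -> list:
    """Return person IDs ordered by total fee (or total data amount), highest first.

    Each record is a dict with 'person_id' and the numeric field named by `key`.
    Ties are broken alphabetically so the result is deterministic.
    """
    totals = defaultdict(float)
    for record in usage_records:
        totals[record["person_id"]] += record[key]
    return sorted(totals, key=lambda person: (-totals[person], person))
```

Calling `rank_persons(records, key="chars")` ranks by the amount of synthesized text instead of by the fee collected.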

According to an embodiment, the method of providing a voice synthesis service further includes: determining, by the voice analysis unit, the vocal range of a specific person's voice based on the voice data of the specific person; and transmitting, by the communication unit, to the user terminal, information on at least one specific person, among a plurality of specific persons, whose vocal range is similar to the determined vocal range.
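The disclosure does not say how vocal ranges are measured or compared. As a minimal sketch under assumed definitions, the Python below takes the range as the minimum and maximum of voiced pitch values in Hz and scores similarity as interval overlap divided by interval union; the function names and the 0.5 threshold are hypothetical.

```python
def voice_range(f0_hz: list) -> tuple:
    """Vocal range as (lowest, highest) pitch, ignoring unvoiced frames (f0 <= 0)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return (min(voiced), max(voiced))


def range_similarity(a: tuple, b: tuple) -> float:
    """Overlap of two ranges divided by their combined span (0.0 to 1.0)."""
    overlap = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return overlap / union if union > 0 else 1.0


def similar_persons(target_id: str, ranges: dict, threshold: float = 0.5) -> list:
    """IDs of other persons whose range similarity to the target meets the threshold."""
    target = ranges[target_id]
    return sorted(
        person for person, rng in ranges.items()
        if person != target_id and range_similarity(target, rng) >= threshold
    )
```

The result of `similar_persons` corresponds to the list of persons whose information would be sent to the user terminal.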

According to an embodiment, the method of providing a voice synthesis service further includes: generating, by a crowdfunding unit, crowdfunding information for securing voice data of a specific person not included in the voice data DB; providing the crowdfunding information to user terminals so that it is shared; collecting, from the user terminals, investment information corresponding to the crowdfunding information; and, when the crowdfunding succeeds, providing a reward to the user terminals based on the investment information.
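The crowdfunding flow can be sketched as a small Python class. The `Campaign` type, the funding goal, and in particular the proportional split of a reward pool are assumptions for illustration; the disclosure says only that rewards are based on the investment information.

```python
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """Hypothetical campaign to secure voice data for a person not yet in the DB."""
    person_id: str
    goal: int
    pledges: dict = field(default_factory=dict)  # user_id -> total amount pledged

    def pledge(self, user_id: str, amount: int) -> None:
        """Collect investment information from a user terminal."""
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.pledges[user_id] = self.pledges.get(user_id, 0) + amount

    def is_successful(self) -> bool:
        return sum(self.pledges.values()) >= self.goal

    def rewards(self, reward_pool: int) -> dict:
        """Split the pool in proportion to each pledge; empty if the goal was missed."""
        if not self.is_successful():
            return {}
        total = sum(self.pledges.values())
        return {user: reward_pool * amount // total
                for user, amount in self.pledges.items()}
```

Integer division is used so that rewards are whole units; any remainder left in the pool is not distributed in this sketch.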

According to an embodiment of the present disclosure, a system for providing a voice synthesis service includes: a voice analysis unit configured to analyze voice data of a specific person and build a voice data DB based on the analysis result; a communication unit configured to receive, from a user terminal, content generated for a specific person selected by a user, and to receive a text-to-speech request including information on the selected specific person and the generated content; and a voice synthesis unit configured to, in response to the text-to-speech request, extract voice data of the selected specific person from the voice data DB, convert text information included in the content into voice synthesis data, and provide the converted voice synthesis data to the user terminal, wherein the communication unit is further configured to share content including the voice synthesis data.

According to an embodiment, the system for providing a voice synthesis service further includes a fee management unit configured to calculate, based on the text information, the amount of voice synthesis data converted by the voice synthesis unit and the voice playback time, and to determine a fee based on at least one of the calculated amount of voice synthesis data and the voice playback time.

According to an embodiment, the system for providing a voice synthesis service further includes a rank management unit configured to determine a voice synthesis ranking of a plurality of specific persons based on at least one of the determined fee and the amount of voice synthesis data.

According to an embodiment, the voice analysis unit is further configured to determine the vocal range of a specific person's voice based on the voice data of the specific person, and the communication unit is further configured to transmit, to the user terminal, information on at least one specific person, among a plurality of specific persons, whose vocal range is similar to the determined vocal range.

According to an embodiment, the system for providing a voice synthesis service further includes a crowdfunding unit configured to generate crowdfunding information for securing voice data of a specific person not included in the voice data DB, provide the crowdfunding information to user terminals so that it is shared, collect investment information corresponding to the crowdfunding information from the user terminals, and, when the crowdfunding succeeds, provide a reward to the user terminals based on the investment information.

According to various embodiments of the present disclosure, text information included in content shared among users who favor a specific person, such as a celebrity, is converted into voice synthesis data using that person's voice data. Content that users themselves create about the specific person can thus be delivered as audio in that person's voice, providing a more lively and realistic voice synthesis service.

According to various embodiments of the present disclosure, because content that other users have shared about a specific person, such as a celebrity, is provided in that person's voice, a user does not need to keep investing time to obtain information about the specific person and can intuitively understand that information through voice.

In addition, according to various embodiments of the present disclosure, users may be charged based on the voice synthesis data converted from the text of content shared about a specific person, such as a celebrity, and the popularity or ranking of the specific persons may be determined based on the charging results. This provides a new way of ranking celebrities according to the usage of the voice synthesis service, which can further invigorate fan communities in which users share content about celebrities.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description of the claims.

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals denote like elements, but are not limited thereto.
FIG. 1 is a schematic diagram illustrating a system for providing a voice synthesis service using a voice synthesis service providing server according to an embodiment of the present disclosure.
FIG. 2 is a block diagram illustrating an internal configuration of a voice synthesis service providing server according to an embodiment of the present disclosure.
FIG. 3 is a structural diagram illustrating an artificial neural network model according to an embodiment of the present disclosure.
FIG. 4 is a diagram illustrating an example of generating a text-to-speech request according to an embodiment of the present disclosure.
FIG. 5 is a diagram illustrating an example of a screen for fee payment according to an embodiment of the present disclosure.
FIG. 6 is a diagram illustrating an example of displaying a voice synthesis ranking on a screen of a user terminal according to an embodiment of the present disclosure.
FIG. 7 is a diagram illustrating an example of displaying a crowdfunding service on a screen according to an embodiment of the present disclosure.
FIG. 8 is a flowchart illustrating a method of providing a voice synthesis service according to an embodiment of the present disclosure.

Hereinafter, specific details for carrying out the present disclosure will be described in detail with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations will be omitted where they might unnecessarily obscure the subject matter of the present disclosure.

In the accompanying drawings, identical or corresponding components are assigned the same reference numerals. In the description of the embodiments below, redundant description of identical or corresponding components may be omitted. Even where the description of a component is omitted, however, it is not intended that the component be excluded from any embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become apparent with reference to the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms; the embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to those of ordinary skill in the art to which the present disclosure pertains.

The terms used in this specification will be described briefly, and the disclosed embodiments will then be described in detail. The terms used in this specification have been selected, as far as possible, from general terms currently in wide use while considering their functions in the present disclosure, but they may vary depending on the intention of those skilled in the relevant field, precedent, the emergence of new technology, and the like. In certain cases, a term has been arbitrarily selected by the applicant, and in such cases its meaning will be described in detail in the corresponding part of the description. Therefore, the terms used in the present disclosure should be defined based on their meaning and the overall content of the present disclosure, rather than on their names alone.

In this specification, singular expressions include plural expressions unless the context clearly specifies that they are singular. Likewise, plural expressions include singular expressions unless the context clearly specifies that they are plural. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In addition, the term 'module' or 'unit' used in the specification means a software or hardware component, and a 'module' or 'unit' performs certain roles. However, a 'module' or 'unit' is not limited to software or hardware. A 'module' or 'unit' may be configured to reside on an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a 'module' or 'unit' may include at least one of components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The functionality provided within the components and 'modules' or 'units' may be combined into a smaller number of components and 'modules' or 'units', or further separated into additional components and 'modules' or 'units'.

According to an embodiment of the present disclosure, a 'module' or 'unit' may be implemented with a processor and a memory. 'Processor' should be interpreted broadly to include a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine, and the like. In some circumstances, 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), or the like. 'Processor' may also refer to a combination of processing devices, such as a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such configuration. 'Memory' should likewise be interpreted broadly to include any electronic component capable of storing electronic information. 'Memory' may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, and the like. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated in a processor is in electronic communication with the processor.

In the present disclosure, a 'system' may include at least one of a server (or server device) and a user terminal connected to the server through a network, but is not limited thereto. For example, the system may include one or more user terminals, one or more server devices, or a combination thereof.

In the present disclosure, 'voice data' may refer to data representing speech uttered by a person whose voice a user prefers, such as a specific person or a celebrity (for example, speech waveform signal data, speech features extracted from a speech waveform signal, etc.). In addition, in the present disclosure, a 'celebrity' may refer to an entertainer (for example, a person working in the entertainment field, such as a singer or actor), scholar, politician, or the like who is favored by or widely known to the general public.

FIG. 1 is a schematic diagram illustrating the configuration of a system that provides a voice synthesis service using a voice synthesis service providing server 130 according to an embodiment of the present disclosure. In the illustrated system, the voice synthesis service providing server 130 may provide a voice synthesis service to a plurality of user terminals 112, 114, and 116 through a network 120. According to an embodiment, the voice synthesis service providing server 130 may include one or more server devices and/or databases, or one or more distributed computing devices and/or distributed databases based on a cloud computing service, capable of storing, providing, and executing computer-executable programs (for example, downloadable applications) and data related to the voice synthesis service. The voice synthesis service provided by the voice synthesis service providing server 130 may be provided to users through a dedicated application, a web browser, or the like installed on the plurality of user terminals 112, 114, and 116. Here, a user may be a fan of a celebrity in the entertainment field, such as a singer or actor, or of a celebrity in another field, such as a politician or scholar, but is not limited thereto.

The plurality of user terminals 112, 114, and 116 may communicate with the voice synthesis service providing server 130 through the network 120. The network 120 may be configured to enable communication between the plurality of user terminals 112, 114, and 116 and the voice synthesis service providing server 130. Depending on the installation environment, the network 120 may be configured as a wired network such as Ethernet, a wired home network (Power Line Communication), a telephone line communication device, and RS-serial communication; a wireless network such as a mobile communication network, WLAN (Wireless LAN), Wi-Fi, Bluetooth, and ZigBee; or a combination thereof.

Although a mobile phone terminal 112, a tablet terminal 114, and a PC terminal 116 are illustrated in FIG. 1 as examples of user terminals, the user terminal is not limited thereto and may be any computing device that is capable of wired and/or wireless communication and has a user interface capable of receiving, from a user, selection information for a specific person such as a celebrity, content information related to the celebrity, and a text-to-speech (or voice synthesis) request. For example, the user terminal may include a smartphone, a mobile phone, a navigation device, a computer, a laptop, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a tablet PC, a game console, a wearable device, an IoT (Internet of Things) device, a VR (virtual reality) device, an AR (augmented reality) device, and the like. In addition, although FIG. 1 shows three user terminals 112, 114, and 116 communicating with the voice synthesis service providing server 130 through the network 120, the configuration is not limited thereto, and a different number of user terminals may be configured to communicate with the voice synthesis service providing server 130 through the network 120.

The communication method between the user terminals 112, 114, and 116 and the voice synthesis service providing server 130 is not limited, and may include not only communication methods that use communication networks the network 120 may include (for example, a mobile communication network, wired Internet, wireless Internet, a broadcasting network, a satellite network, etc.) but also short-range wireless communication between user terminals. For example, the network 120 may include any one or more of networks such as a PAN (personal area network), a LAN (local area network), a CAN (campus area network), a MAN (metropolitan area network), a WAN (wide area network), a BBN (broadband network), and the Internet. In addition, the network 120 may include any one or more network topologies, including a bus network, a star network, a ring network, a mesh network, a star-bus network, and a tree or hierarchical network, but is not limited thereto.

The voice synthesis service providing server 130 may receive voice data of a plurality of specific persons from an external device (for example, a voice data server that manages and stores voice data of specific persons such as celebrities) through the network 120, and store the received voice data in a database. In the present disclosure, voice data may include speech uttered by a specific person, or voice characteristic data and voice parameter data (for example, pitch, duration, timbre, intensity, etc.) extracted from that speech.

In addition, the voice synthesis service providing server 130 may receive text-to-speech requests from the plurality of user terminals 112, 114, and 116 through the network 120 and store the received information in the database. Here, a text-to-speech request may include information on a specific person, such as a celebrity, selected by the user, and content generated about that person (e.g., messages, news, or multimedia content such as images or videos). The content generated about the specific person selected by the user may be at least one of a plurality of contents generated by users (e.g., fans of the celebrity).

In one embodiment, the voice synthesis service providing server 130 may be configured to receive and share, via a social network, content generated about a specific person, such as a celebrity, selected by the user. For example, user A may use a user terminal to upload a plurality of contents about a specific celebrity (e.g., articles about the celebrity, posts about the celebrity's schedule, fan messages, etc.) to a bulletin board for that celebrity (e.g., a fan page) provided by a social network service.

The voice synthesis service providing server 130 may receive the content about the celebrity uploaded through the social network and share the received content through, for example, a bulletin board or fan page for the specific celebrity provided by the voice synthesis service. Alternatively, user A may upload a plurality of contents about a specific celebrity to a bulletin board provided by the voice synthesis service (e.g., a bulletin board for the specific celebrity) and share them with other users.

User B may select at least one of the contents about a specific celebrity shared through the voice synthesis service providing server 130 and use the voice synthesis service to convert the text of the selected content into the voice of the specific person. Alternatively, user B may create content about a celebrity, share it through a bulletin board provided by a social network service and/or the voice synthesis service, select the shared content, and use the voice synthesis service to convert the text of the selected content into the voice of the specific person.

The text-to-speech request described above is not limited to the information on the specific person selected by the user and the content about that person. It may further include various information related to the request, such as user information (that is, information about the fan of the specific person who requested the conversion) and the text information to be converted into speech.

The voice synthesis service providing server 130 may generate voice synthesis data based on the text information included in the content generated about the celebrity selected by the user and provide it to the user terminal, and may determine a fee for the generated voice synthesis data and provide the fee to the user terminal. Additionally, the voice synthesis service providing server 130 may provide a wake-up call and/or birthday message service using the voice synthesis data of the specific person selected by the user.

In this way, the voice synthesis service providing server 130 converts text information included in content shared among fans of a specific celebrity into voice synthesis data using that celebrity's voice data. Content about the celebrity that fans created themselves is thereby played back in the celebrity's own voice, which makes for a more lively and realistic voice synthesis service.

In addition, because content that other users have shared about a specific celebrity is played back in the voice of that celebrity (or specific person), the user does not need to keep investing time to obtain information about the celebrity and can obtain it intuitively by listening.

The process by which the voice synthesis service providing server 130 provides the voice synthesis service in response to a received text-to-speech request is described in detail below with reference to FIGS. 2 to 8.

FIG. 2 is a block diagram illustrating the internal configuration of a voice synthesis service providing server 200 according to an embodiment of the present disclosure. As shown in FIG. 2, the voice synthesis service providing server 200 may include a communication unit 210, a processor 230, and a database 220. The processor 230 may include a voice analysis unit 232, a voice synthesis unit 234, a fee management unit 236, a rank management unit 238, and a crowdfunding unit 240. The database 220 may include a user DB 222 and a voice data DB 224. Although FIG. 2 shows the processor 230 as including a plurality of sub-modules and the database 220 as including a plurality of sub-DBs, the configuration is not limited thereto and may instead consist of an integrated processor 230 and an integrated database 220.

The communication unit 210 may be configured to enable the voice synthesis service providing server 200 to transmit and receive arbitrary information or data to and from a user terminal and/or an external device (e.g., a voice data server) through a network. The processor 230 may be configured to receive and process arbitrary information through the communication unit 210. The database 220 may be configured to store information and data received from a user terminal and/or an external device, and to store information generated by the processor 230.

In one embodiment, the communication unit 210 may receive voice data of a specific person, such as a celebrity, from a voice data server and transmit it to the processor 230. Here, the voice data may include various information related to the celebrity's voice received from the voice data server. For example, the voice data may include a celebrity name ("Kim xx"), a field ("actor"), voice data containing the celebrity's voice characteristics, and the like. The voice data may also include voice spectrum data representing information related to sequential prosody features, the speaker's vocal characteristics, and the like.

In addition, the communication unit 210 may receive a text-to-speech request from a user terminal using the voice synthesis service (e.g., the user terminal of a fan of a celebrity) and transmit it to the processor 230. Here, the text-to-speech request may include information on a specific person, such as a celebrity, selected by the user, and content generated about that person. As described above with reference to FIG. 1, the content generated about the specific person selected by the user may be at least one of a plurality of contents created by users (e.g., fans of the celebrity), but is not limited thereto.

The text-to-speech request is not limited to information on a specific person, such as a celebrity, and content generated about that person. According to one embodiment, the text-to-speech request may further include user information. Here, the user information may include various information related to the user, such as the user's personal information, gender, region of residence or origin, and nationality.
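As an illustration only, the parts of a text-to-speech request described above could be represented as follows. The class names, field names, and the `validate` helper are hypothetical and are not defined in the disclosure; the sketch assumes that the specific-person information and the content text are mandatory while the user information is optional.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class UserInfo:
    # Hypothetical fields mirroring the user information listed above.
    name: str
    gender: Optional[str] = None
    region: Optional[str] = None        # region of residence or origin
    nationality: Optional[str] = None


@dataclass
class TextToSpeechRequest:
    person_id: str                      # specific person (e.g., celebrity) selected by the user
    content_text: str                   # text taken from the selected content
    user: Optional[UserInfo] = None     # optional user information


def validate(request: TextToSpeechRequest) -> bool:
    """Return True when both mandatory parts of the request are non-empty."""
    return bool(request.person_id.strip()) and bool(request.content_text.strip())


req = TextToSpeechRequest("kim_xx", "Happy birthday!", UserInfo("user_b", region="Seoul"))
```

A server-side handler could call `validate(req)` before passing the request on to the voice synthesis unit.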

The processor 230 may store the information and/or data received from the communication unit 210 in the database 220. According to one embodiment, the received information and/or data may be converted into a standardized notation before being stored in the database 220 so that it can be managed efficiently.

The database 220 may include a user DB 222 in which user-related information is classified and stored for each user. According to one embodiment, the user DB 222 may be built as an arbitrary data structure in which the user name, gender, region of residence, nationality, the text-to-speech requests made by the user, and the like are stored in association with one another.

The database 220 may further include a voice data DB 224 in which information on specific persons, such as celebrities, and their voice data are classified and stored for each specific person. According to one embodiment, the voice data DB 224 may be built as an arbitrary data structure in which the specific person's name, field of activity (e.g., singer, actor, politician, scholar, etc.), stage name, voice data, voice analysis results, and the like are stored in association with one another.

Although FIG. 2 shows the database 220 as included in the voice synthesis service providing server 130, the configuration is not limited thereto, and the database may be stored on a separate server (e.g., a cloud server) that can be communicated with or accessed through wired or wireless communication.

The voice analysis unit 232 of the processor 230 may be configured to analyze the received voice data of a specific person. According to one embodiment, the voice analysis unit 232 may extract sequential prosody features and/or the speaker's vocal features based on the voice data of the specific person received from an external device. Here, the received voice data may include voice spectrum data representing information related to the sequential prosody features, for example, melody and the voice characteristics of a specific speaker. The sequential prosody features may include prosody information for each time unit according to a predetermined time unit. The speaker's vocal features are not limited to those needed to imitate the speaker's voice and may include at least one of the various elements that make up the vocalization, such as style, prosody, emotion, timbre, and pitch.

The voice analysis unit 232 may build the voice data DB 224 based on the analysis result. According to one embodiment, the voice analysis unit 232 may build and manage the voice data DB 224 by storing the sequential prosody features and the speaker's vocal features in association with the specific person's name, field of activity, and the like.
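The association of analysis results with a specific person's name and field of activity can be sketched as a simple in-memory store. This is a minimal illustration under stated assumptions: the `VoiceRecord` and `VoiceDataDB` names, the use of plain float lists for features, and keying records by name are all hypothetical choices, not the disclosed implementation.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VoiceRecord:
    name: str                       # specific person's name
    field_of_activity: str          # e.g., "singer", "actor"
    prosody_features: List[float]   # placeholder for sequential prosody features
    vocal_features: List[float]     # placeholder for the speaker's vocal features


class VoiceDataDB:
    """Hypothetical in-memory stand-in for the voice data DB 224."""

    def __init__(self) -> None:
        self._records: Dict[str, VoiceRecord] = {}

    def upsert(self, record: VoiceRecord) -> None:
        # Store (or overwrite) the record so all fields stay associated by name.
        self._records[record.name] = record

    def get(self, name: str) -> Optional[VoiceRecord]:
        return self._records.get(name)


db = VoiceDataDB()
db.upsert(VoiceRecord("Kim xx", "actor", [0.1, 0.4, 0.2], [0.7, 0.3]))
```

A lookup such as `db.get("Kim xx")` returns the stored record, and `None` signals a person who is not yet in the DB.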

Additionally or alternatively, the voice analysis unit 232 may be configured to analyze the voice data of a celebrity received from an external device using known voice data analysis techniques.

The voice analysis unit 232 may be configured to recommend another specific person (e.g., another celebrity) whose voice characteristics are similar to those of the specific person selected by the user. According to one embodiment, the voice analysis unit 232 may determine the vocal range of a specific person's voice based on that person's voice data. For example, the voice analysis unit 232 may receive the voice data of the specific person selected by the user from the voice data DB 224, identify the frequency of that voice data, and use the identified frequency to determine the vocal range of the selected person's voice. The determined vocal range may be stored in association with the specific person's voice data in the voice data DB 224.

By providing the user terminal with at least one specific person whose vocal range is the same as or similar to the determined vocal range, the voice analysis unit 232 may recommend another specific person (or celebrity) whose voice characteristics are similar to those of the person the user prefers. For example, when the vocal range of the specific person selected by the user is determined to be 90 to 120 Hz, the voice analysis unit 232 may recommend, as a person with a similar voice, at least one of the other specific persons in the voice data DB 224 whose vocal range was also determined to be 90 to 120 Hz.
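The range-based recommendation described above can be sketched as follows. The disclosure does not specify how frequencies are identified or how "similar" ranges are compared, so this sketch assumes that per-frame fundamental-frequency estimates are already available (with 0 marking unvoiced frames) and uses an interval-overlap ratio with an arbitrary threshold; all function names and the threshold value are hypothetical.

```python
from typing import Dict, List, Tuple

Range = Tuple[float, float]  # (lowest Hz, highest Hz)


def vocal_range(f0_hz: List[float]) -> Range:
    """Derive a vocal range from per-frame frequency estimates (0 = unvoiced)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced frames")
    return (min(voiced), max(voiced))


def overlap_ratio(a: Range, b: Range) -> float:
    """Intersection length divided by the length of the span covering both ranges."""
    inter = min(a[1], b[1]) - max(a[0], b[0])
    union = max(a[1], b[1]) - min(a[0], b[0])
    if union <= 0:
        return 1.0 if a == b else 0.0
    return max(0.0, inter) / union


def recommend_similar(selected: str, ranges: Dict[str, Range],
                      threshold: float = 0.8) -> List[str]:
    """Return other persons whose range overlaps the selected one enough, best first."""
    target = ranges[selected]
    scored = [(overlap_ratio(target, r), name)
              for name, r in ranges.items() if name != selected]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


ranges = {"A": (90.0, 120.0), "B": (90.0, 120.0), "C": (95.0, 118.0), "D": (180.0, 250.0)}
similar = recommend_similar("A", ranges)
```

With the sample data, person B shares the 90 to 120 Hz range of person A and is recommended, while C falls just under the default threshold and D does not overlap at all. Lowering `threshold` widens what counts as a similar range.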

The voice synthesis unit 234 may be configured to generate voice synthesis data based on the text information included in the content. The voice synthesis unit 234 may convert the text information into voice synthesis data using any of various speech synthesis methods, including articulatory synthesis, formant synthesis, concatenative synthesis, statistical parametric speech synthesis, and deep learning (or artificial neural network) based speech synthesis.

According to one embodiment, the voice synthesis unit 234 may analyze prosody information from the text information. Based on the prosody information, the voice synthesis unit 234 may then generate voice synthesis data by synthesizing voice data, in the form of phoneme-level speech waveforms, extracted from the sound source DB 224.

According to another embodiment, the voice synthesis unit 234 may generate voice synthesis data by inputting the text information, the sequential prosody features, and the speaker's vocal features into an artificial neural network based text-to-speech (TTS) model. The voice synthesis data may include speech data for the text in which the sequential prosody features and the speaker's vocal features are reflected.

The voice synthesis unit 234 may provide the generated voice synthesis data to the user terminal.

The fee management unit 236 may be configured to determine a fee for the voice synthesis data generated by the voice synthesis unit 234. According to one embodiment, the fee management unit 236 may calculate the amount and the playback time of the voice synthesis data generated by the voice synthesis unit 234, and determine the fee based on at least one of the two. For example, the fee management unit 236 may calculate the fee in proportion to the amount of voice synthesis data converted from the text included in the content selected by the user. Additionally or alternatively, the fee management unit 236 may calculate the fee in proportion to the calculated playback time.

According to another embodiment, the fee management unit 236 may calculate the fee by applying a weight to each of the calculated amount of voice synthesis data and the playback time. Additionally, the fee management unit 236 may determine the final fee by also applying a weight to each of the royalty owed to the specific person for the voice data that person provides and other costs related to the specific person (e.g., an entertainer's appearance fee).
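The weighted fee calculation described above can be illustrated with a simple linear combination. The disclosure does not give a formula, units, or weight values, so the function below, its parameter names, and every default weight are assumptions made purely for illustration.

```python
def synthesis_fee(data_kb: float, duration_sec: float,
                  w_data: float = 1.0, w_time: float = 10.0,
                  royalty: float = 0.0, other_costs: float = 0.0,
                  w_royalty: float = 1.0, w_other: float = 1.0) -> float:
    """Weighted sum of data amount, playback time, royalty, and other costs.

    All weights are hypothetical; setting a weight to 0 removes that term,
    which covers the "at least one of" variants described in the text.
    """
    if data_kb < 0 or duration_sec < 0 or royalty < 0 or other_costs < 0:
        raise ValueError("inputs must be non-negative")
    return (w_data * data_kb
            + w_time * duration_sec
            + w_royalty * royalty
            + w_other * other_costs)


# 200 KB of synthesized audio, 30 seconds long, no royalty terms.
base_fee = synthesis_fee(200.0, 30.0)
# Same audio with a royalty of 100 weighted at 0.5.
fee_with_royalty = synthesis_fee(200.0, 30.0, royalty=100.0, w_royalty=0.5)
```

Setting `w_time=0.0` yields a fee proportional to the data amount only, and setting `w_data=0.0` yields one proportional to the playback time only.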

When the user pays the fee determined by the fee management unit 236 through the process described above, a portion of the paid amount may be paid to the corresponding celebrity.

The rank management unit 238 may be configured to determine the voice synthesis rank of a specific person. According to one embodiment, the rank management unit 238 may determine the voice synthesis rank based on at least one of the fee determined by the fee management unit 236 and the amount of voice synthesis data. For example, the rank management unit 238 may assign a higher voice synthesis rank as the cumulative sum of the fees determined by the fee management unit 236 and paid by users increases. Additionally or alternatively, the rank management unit 238 may assign a higher voice synthesis rank as the cumulative amount of voice synthesis data generated from text-to-speech requests increases.

According to another embodiment, the rank management unit 238 may determine the rank of each specific person by applying a weight to each of the fees accumulated for that person and the cumulative amount of that person's voice synthesis data.
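A weighted ranking of this kind can be sketched as a sort over per-person scores. The score formula, the weight values, and the alphabetical tie-break below are assumptions for illustration; the disclosure only states that weights are applied to the two cumulative quantities.

```python
from typing import Dict, List, Tuple


def rank_persons(totals: Dict[str, Tuple[float, float]],
                 w_fee: float = 1.0, w_amount: float = 1.0) -> List[str]:
    """Order persons by a weighted sum of cumulative fee and cumulative data amount.

    totals maps a person's name to (cumulative fee, cumulative synthesis data amount).
    Ties are broken alphabetically so the result is deterministic (an assumption).
    """
    scores = {name: w_fee * fee + w_amount * amount
              for name, (fee, amount) in totals.items()}
    return sorted(scores, key=lambda name: (-scores[name], name))


totals = {"A": (1000.0, 50.0), "B": (800.0, 400.0), "C": (1000.0, 50.0)}
ranking = rank_persons(totals)
```

With equal weights, B ranks first because its larger data amount outweighs its lower fee total. Setting `w_amount=0.0` ranks purely by cumulative fee.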

The crowdfunding unit 240 may be configured to provide a crowdfunding service for the voice data of a specific person not included in the voice data DB 224. According to one embodiment, when there is a specific person who is not included in the voice data DB 224, the crowdfunding unit 240 may generate funding information related to that celebrity's voice data and share the generated funding information by providing it to user terminals. After the funding information has been provided, the crowdfunding unit 240 may collect investment information corresponding to the funding information from the user terminals and, if the crowdfunding succeeds, provide rewards to the user terminals based on the investment information. According to one embodiment, the crowdfunding unit 240 may determine each reward in proportion to the amount invested and provide the determined reward to the user terminals that participated in the crowdfunding. The process by which the crowdfunding unit 240 provides the crowdfunding service is described in detail below with reference to FIG. 7.
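The proportional reward rule described above can be sketched as follows. The disclosure does not define what "success" means or what form a reward takes, so this sketch assumes that a campaign succeeds when total investment reaches a goal and that a fixed pool of reward units is split by integer division; the function and parameter names are hypothetical.

```python
from typing import Dict


def distribute_rewards(investments: Dict[str, int], goal: int,
                       reward_pool: int) -> Dict[str, int]:
    """Split reward_pool among investors in proportion to the amount each invested.

    Returns an empty mapping when the campaign did not reach its goal.
    Integer division is used, so a few units of the pool may remain undistributed.
    """
    total = sum(investments.values())
    if total == 0 or total < goal:
        return {}
    return {user: reward_pool * amount // total
            for user, amount in investments.items()}


rewards = distribute_rewards({"u1": 300, "u2": 700}, goal=1000, reward_pool=100)
```

Here the campaign reaches its goal, so `u1` receives 30 units and `u2` receives 70. The same call with `goal=2000` returns an empty mapping because the funding fell short.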

According to one embodiment, the processor 230 may further include a statistics unit (not shown). The statistics unit may be configured to generate statistical information based on text-to-speech requests. Here, a text-to-speech request may include, for example, user information such as the user's name, gender, region of residence or origin, and nationality, information on the specific person selected by the user, and content generated about that person.

According to one embodiment, the statistics unit may generate statistical information based on the user information. For example, the statistics unit may generate statistical information based on the region of residence or origin of the users who selected the same specific person as the user in their text-to-speech requests, and output it to the user terminal as a pie chart so that the user can grasp it intuitively.
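The regional statistics described above amount to counting requests per region for one specific person and converting the counts into shares, which is the data a pie chart would display. The sketch below assumes requests are available as `(person_id, region)` pairs; that representation and the function name are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def region_shares(requests: Iterable[Tuple[str, str]],
                  person_id: str) -> Dict[str, float]:
    """Fraction of requests per region among users who selected person_id."""
    counts = Counter(region for pid, region in requests if pid == person_id)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {region: n / total for region, n in counts.items()}


sample_requests = [
    ("kim_xx", "Seoul"), ("kim_xx", "Busan"), ("kim_xx", "Seoul"),
    ("lee_yy", "Seoul"), ("kim_xx", "Seoul"),
]
shares = region_shares(sample_requests, "kim_xx")
```

The resulting mapping can be handed to any charting component on the user terminal to draw the pie chart.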

According to one embodiment, the statistics unit may generate statistical information based on text-to-speech requests. For example, the statistics unit may generate statistical information about the celebrities selected by other users and provide it to the user terminal. Here, the other users are users who selected the same celebrity as the user in their text-to-speech requests.

FIG. 3 is a structural diagram illustrating an artificial neural network model 300 according to an embodiment of the present disclosure. According to one embodiment, the voice synthesis service providing system may further include a learning unit. The learning unit may be configured to generate a recommendation model through a machine learning algorithm based on a plurality of training data extracted from the voice data of specific persons, such as celebrities, and to train the generated recommendation model. Here, the voice data of a specific person may further include that person's vocal range as determined by the voice analysis unit 232. The vocal range of a specific person's voice may be determined by the voice analysis unit identifying the frequency of the person's voice based on voice information received from an external device (e.g., a voice data server) and using the identified frequency.

According to one embodiment, the voice analysis unit may be configured to input the voice data of the specific person associated with the content selected by the user into the trained recommendation model and to output information on another specific person whose voice characteristics are similar to those of the person selected by the user.

The training and inference processes of the artificial neural network executed by the learning unit and/or the voice analysis unit are described in more detail below.

For example, the recommendation model may be an artificial neural network model 300 generated and trained through a machine learning algorithm. In machine learning, the artificial neural network model 300 is a statistical learning algorithm implemented on the basis of the structure of a biological neural network, or a structure that executes such an algorithm. According to one embodiment, the artificial neural network model 300 may represent a machine learning model in which nodes, which are artificial neurons that form a network through synaptic connections as in a biological neural network, acquire problem-solving ability by repeatedly adjusting the synaptic weights so that the error between the correct output for a given input and the inferred output is reduced. For example, the artificial neural network model 300 may include any probabilistic model, neural network model, or the like used in artificial intelligence learning methods such as machine learning and deep learning.
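The repeated weight adjustment that reduces the error between the correct and the inferred output is commonly realized by gradient descent. The disclosure does not name a specific update rule, so the following is only one conventional formulation; the symbols $E$ (error), $\eta$ (learning rate), $w_{ij}$ (weight between nodes $i$ and $j$), $y_k$ (correct output), and $\hat{y}_k$ (inferred output) are introduced here for illustration.

```latex
E = \frac{1}{2}\sum_{k}\left(y_k - \hat{y}_k\right)^2,
\qquad
w_{ij} \leftarrow w_{ij} - \eta\,\frac{\partial E}{\partial w_{ij}}
```

Each update moves a weight a small step in the direction that lowers $E$, and repeating the update over the training data drives the inferred output toward the correct one.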

The artificial neural network model 300 is implemented as a multilayer perceptron (MLP) composed of multiple layers of nodes and the connections between them. The artificial neural network model 300 according to this embodiment may be implemented using any one of various artificial neural network structures, including the MLP. As shown in FIG. 3, the artificial neural network model 300 consists of an input layer 320 that receives an input signal or data 310 from the outside, an output layer 340 that outputs an output signal or data 350 corresponding to the input data, and n hidden layers 330_1 to 330_n (where n is a positive integer) located between the input layer 320 and the output layer 340, which receive signals from the input layer 320, extract features, and pass them to the output layer 340. Here, the output layer 340 receives signals from the hidden layers 330_1 to 330_n and outputs them to the outside.
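The layer structure described above (input layer, n hidden layers, output layer) can be illustrated with a minimal forward pass in plain Python. The disclosure does not specify activation functions or layer sizes, so the ReLU activation on the hidden layers, the linear output layer, and the example weights below are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
Layer = Tuple[Matrix, Vector]  # (weights with one row per output node, biases)


def dense(x: Vector, weights: Matrix, biases: Vector) -> Vector:
    """Compute weights @ x + biases for one fully connected layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(v: Vector) -> Vector:
    return [max(0.0, a) for a in v]


def mlp_forward(x: Vector, layers: List[Layer]) -> Vector:
    """Pass x through every layer; ReLU after hidden layers, linear output layer."""
    h = x
    for i, (weights, biases) in enumerate(layers):
        h = dense(h, weights, biases)
        if i < len(layers) - 1:   # every layer except the last is a hidden layer
            h = relu(h)
    return h


# One hidden layer with two nodes, followed by a single-node output layer.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 2.0]], [0.5]),
]
output = mlp_forward([2.0, 1.0], layers)
```

Adding more `(weights, biases)` pairs to `layers` corresponds to increasing n, the number of hidden layers.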

Methods for training the artificial neural network model 300 include supervised learning, in which the model learns to become optimized for solving a problem from the input of a teacher signal (the correct answer), and unsupervised learning, which does not require a teacher signal. To recommend another specific person with similar characteristics using the voice data of the specific person selected by the user, the learning unit may use supervised learning to analyze a plurality of training data extracted from the voice data of the specific person, and may train the artificial neural network model 300 so that it produces the information on other specific persons that corresponds to those training data. The artificial neural network model 300 trained in this way may be provided to, or accessed by, the voice analysis unit, and may output information on another specific person with similar voice characteristics in response to a text-to-speech request received from a user terminal.

According to an embodiment, as shown in FIG. 3, the input variable of the artificial neural network model 300, which can extract other specific information with similar voice characteristics, may be the voice data of the specific person selected by the user. For example, the input variable fed to the input layer 320 of the artificial neural network model 300 may be an input vector 310 in which the voice data (e.g., the specific person's name, field of activity, vocal range, etc.) is composed as a single vector data element. In response to the input derived from the voice data, the output variable produced by the output layer 340 of the artificial neural network model 300 may be a vector 350 representing information on another specific person with similar voice characteristics. In addition, the output layer 340 of the artificial neural network model 300 may be configured to output a vector indicating the reliability or accuracy of the output information on the other specific person with similar voice characteristics. For example, such a vector indicating the reliability or accuracy of the output may be interpreted or displayed as a score. In the present disclosure, the output variables of the artificial neural network model 300 are not limited to the types described above and may be expressed in various forms related to the similar specific person information and the reliability of that information. Here, the weight values between the nodes included in the input layer 320, the hidden layers 330_1 to 330_n, and the output layer 340 may be adjusted so that the correct output corresponding to a specific input can be extracted.
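One common way to read an output vector as scores, shown here only as a hedged sketch, is to normalize the raw output values with a softmax and sort the candidates by the result. The disclosure does not say how its reliability vector is computed; the softmax, the candidate names, and the function names below are assumptions for illustration.

```python
import math


def softmax(values):
    """Convert raw output values into scores that are positive and sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def rank_candidates(names, raw_outputs):
    """Pair each candidate with its score and sort from highest to lowest score."""
    scores = softmax(raw_outputs)
    return sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates and raw output-layer values.
ranked = rank_candidates(["person_a", "person_b", "person_c"], [1.2, 0.3, 2.5])
top_name, top_score = ranked[0]
```

Here `top_name` is the candidate the model considers most similar, and `top_score` can be displayed as the confidence score mentioned above.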

In this way, a plurality of input variables and the corresponding plurality of output variables are matched to the input layer 320 and the output layer 340 of the artificial neural network model 300, respectively, and the synapse values between the nodes included in the input layer 320, the hidden layers 330_1 to 330_n, and the output layer 340 are adjusted, so that the model can be trained to extract the correct output corresponding to a specific input. Through this learning process, the features hidden in the input variables of the artificial neural network model 300 can be identified, and the synapse values (or weights) between the nodes of the artificial neural network model 300 can be adjusted so that the error between the output variable calculated from the input variable and the target output is reduced. Using the artificial neural network model 300 trained in this way, similar specific person information may be output in response to input specific person information.
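The weight adjustment described above can be sketched with the simplest possible case: a single linear node trained by gradient descent on the squared error between its output and the target output. This is an illustrative assumption, not the training procedure of the disclosure; the learning rate, epoch count, toy data, and function names are all hypothetical.

```python
def predict(weights, bias, features):
    """Output of one linear node for the given input features."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def mean_squared_error(samples, weights, bias):
    """Average squared difference between computed outputs and target outputs."""
    return sum((predict(weights, bias, f) - t) ** 2 for f, t in samples) / len(samples)


def train(samples, learning_rate=0.1, epochs=500):
    """Adjust weights step by step so the output error shrinks."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in samples:
            error = predict(weights, bias, features) - target
            # Move each weight against the gradient of the squared error.
            weights = [w - learning_rate * error * x for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


# Toy supervised data: target = 2*x1 - x2 + 0.5 (the "teacher signal").
samples = [([0.0, 0.0], 0.5), ([1.0, 0.0], 2.5), ([0.0, 1.0], -0.5), ([1.0, 1.0], 1.5)]
initial_error = mean_squared_error(samples, [0.0, 0.0], 0.0)
weights, bias = train(samples)
final_error = mean_squared_error(samples, weights, bias)
```

A full multilayer network applies the same idea to every layer through backpropagation; the single node is used here only to keep the example short.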

FIG. 4 is a diagram illustrating an example of generating a text-to-speech request according to an embodiment of the present disclosure. The user may select at least one item of shared content. According to an embodiment, the content may be shared with users through a bulletin board provided by the voice synthesis service.

For example, the user may enter the search keyword "Kim XX" in a user search interface (not shown) to access a bulletin board related to that specific person (e.g., a celebrity such as a singer or an actor). As shown in FIG. 4, content shared about the actor "Kim XX" may be displayed on the screen and provided to the user. When the user selects one of the displayed content items through a touch input or the like, the details of that content may be displayed on the screen. The user may select the text-to-speech button 420 through a touch input or the like to generate a text-to-speech request for the content, and may then be taken to a payment screen (see FIG. 5).

Meanwhile, the user may select the view-other-posts button 410 through a touch input or the like to stop displaying the current content on the screen or to display content other than the selected content.

According to another embodiment, the user may create and share content about a desired specific person through a user interface that the voice synthesis service displays on the user terminal, and may also select the content that the user created.

FIG. 5 is a diagram illustrating an example of a screen for paying a fee according to an embodiment of the present disclosure. The user terminal may display information on the fee determined from the text information included in the content selected by the user. For example, information on the voice synthesis data generated by the voice synthesis unit, the playback time of the generated voice synthesis data, the amount of data, and the fee may be displayed. As shown in FIG. 5, the information on the voice synthesis data generated by the voice synthesis unit may be displayed as the specific person's name and field of activity, "Actor Kim XX"; the playback time as "50 seconds"; the amount of data as "800 KB"; and the fee as "1,000 won". The user may select the cancel button 510 through a touch input or the like to cancel the payment, or select the payment button 520 to complete the payment.

When the payment is completed, the voice synthesis data generated by the voice synthesis unit may be provided to the user. According to an embodiment, the voice synthesis data may be played automatically through the speaker of the user terminal so that the text included in the content is narrated in the voice of the specific person.

According to an embodiment, when the payment is completed, information (not shown) on another specific person with characteristics similar to those of the specific person selected by the user may be displayed on the user terminal, thereby recommending a similar specific person to the user. Alternatively or additionally, statistical information (not shown) on the specific persons used by other users who selected the same specific person as the user may be displayed.

FIG. 6 is a diagram illustrating an example of displaying voice synthesis rankings on the screen of a user terminal according to an embodiment of the present disclosure. The voice synthesis ranking determined for each specific person, such as a celebrity, may be displayed on the screen of the user terminal. According to an embodiment, the determined voice synthesis rankings may be provided in the form of a chart.

For example, the voice synthesis rankings may be sorted in ascending order and displayed on the screen of the user terminal. As shown, the determined rank of each specific person may be displayed together with that person's profile picture, name, and field of activity. The change in rank, up or down, may also be displayed in the area where the rank is shown. In addition, the determined voice synthesis rankings may be displayed on the screen of the user terminal divided by the specific person's field of activity, for example, actor, model, or idol. The rankings may additionally be displayed divided by the specific person's gender.

Meanwhile, information on specific persons whose voice synthesis ranking is rising in real time may also be provided to the user. According to an embodiment, the rank management unit may display on the screen, and thereby provide to the user, information on the specific person whose voice synthesis ranking has risen the most during a predetermined time range.
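The ranking chart and the "biggest riser" display can be sketched as follows. The sketch assumes that rankings are derived from the total amount of voice synthesis data per person, which is one of the criteria named in the claims; the tie-breaking rule, the sample figures, and the function names are hypothetical.

```python
def rank_by_amount(amounts):
    """Return {name: rank}, where rank 1 is the person with the largest amount.

    Ties are broken alphabetically by name (an assumption for determinism).
    """
    ordered = sorted(amounts, key=lambda name: (-amounts[name], name))
    return {name: position for position, name in enumerate(ordered, start=1)}


def biggest_riser(previous_ranks, current_ranks):
    """Return (name, places_gained) for the largest rank improvement, or None."""
    gains = {
        name: previous_ranks[name] - rank
        for name, rank in current_ranks.items()
        if name in previous_ranks
    }
    if not gains:
        return None
    best = max(gains, key=lambda name: gains[name])
    return (best, gains[best]) if gains[best] > 0 else None


# Hypothetical totals at the start and end of the predetermined time range.
previous = rank_by_amount({"person_a": 300, "person_b": 500, "person_c": 100})
current = rank_by_amount({"person_a": 350, "person_b": 520, "person_c": 900})

chart = sorted(current.items(), key=lambda item: item[1])  # ascending by rank
riser = biggest_riser(previous, current)
```

The per-person difference `previous_ranks[name] - rank` is also the value that could be shown as the rank change next to each chart entry.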

FIG. 7 is a diagram illustrating an example of displaying a crowdfunding service on a screen according to an embodiment of the present disclosure. According to an embodiment, when text-to-speech conversion is requested for a specific person whose voice data is not included in the voice data DB, the user may be taken to a crowdfunding screen. As shown, the crowdfunding screen may display information on ongoing funding campaigns. Here, the funding information may include the name of the specific person who is the funding target, the funding progress, the amount currently remaining until the funding succeeds, the time remaining until the funding closes, and information on the rewards provided to users who invest in the crowdfunding campaign. The user may select the join-crowdfunding button 710 through a touch input or the like to invest in the campaign.

The user may select the crowdfunding request button 720 through a touch input or the like to request crowdfunding for a specific person whose voice data is not included in the voice data DB. When the user selects the crowdfunding request button 720, the screen of the user terminal may move to a screen for creating a crowdfunding request, and an interface (not shown) for creating the request may be displayed. The user may enter funding information through the interface (not shown) to create a crowdfunding campaign for the specific person. In this case, the crowdfunding target amount and the crowdfunding deadline may be determined automatically by the crowdfunding unit. The created crowdfunding campaign may be provided to and shared with other user terminals so that the target amount can be reached.

FIG. 8 is a flowchart illustrating a method 800 for providing a voice synthesis service according to an embodiment of the present disclosure. The method 800 may be performed by the voice synthesis service providing server 130, 200 of FIGS. 1 and 2. As shown, the method 800 may begin with a step (S810) in which the voice analysis unit analyzes the voice data of a specific person and builds a voice data DB based on the analysis result.

According to an embodiment, the voice analysis unit may extract sequential prosody features and/or the speaker's vocal features from voice data of a specific person received from an external device. The voice analysis unit may build the voice data DB based on the analysis result (e.g., the sequential prosody features, the speaker's vocal features, etc.). Here, the received voice data may include voice spectrum data representing information related to the sequential prosody features and may include, for example, melody and the voice features of a specific speaker. The sequential prosody features may include prosody information for each time unit according to a predetermined time unit. The speaker's vocal features may include not only what is needed to imitate the speaker's voice but also at least one of the various elements that can make up the vocalization, such as style, prosody, emotion, timbre, and pitch.

According to an embodiment, the voice analysis unit may determine the vocal range of the specific person's voice based on the received voice data of that person, and may provide the user terminal with information on at least one specific person whose vocal range is the same as or similar to the determined vocal range.
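A simple way to compare vocal ranges, given here only as a sketch, is to store each range as a pair of lowest and highest frequencies and measure how much two ranges overlap. The disclosure does not define "similar"; the overlap ratio, the 0.7 threshold, the frequency values, and the function names below are assumptions.

```python
def range_overlap_ratio(range_a, range_b):
    """Overlap of two (low_hz, high_hz) ranges divided by their combined span."""
    overlap = max(0.0, min(range_a[1], range_b[1]) - max(range_a[0], range_b[0]))
    span = max(range_a[1], range_b[1]) - min(range_a[0], range_b[0])
    return overlap / span if span > 0 else 0.0


def similar_range_people(selected, voice_ranges, threshold=0.7):
    """List other people whose range overlaps the selected person's by >= threshold."""
    target = voice_ranges[selected]
    matches = [
        (name, range_overlap_ratio(target, candidate))
        for name, candidate in voice_ranges.items()
        if name != selected
    ]
    matches = [m for m in matches if m[1] >= threshold]
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical vocal ranges in Hz, as might be stored in the voice data DB.
voice_ranges = {
    "person_a": (100.0, 300.0),
    "person_b": (110.0, 310.0),
    "person_c": (200.0, 500.0),
}
recommended = similar_range_people("person_a", voice_ranges)
```

With these values only `person_b` passes the threshold, so that person would be the one recommended to the user terminal.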

Next, the communication unit may receive, from the user terminal, content generated for the specific person selected by the user (S820). In an embodiment, the communication unit may receive the content generated for the specific person selected by the user over a network. For example, the communication unit may receive content generated for the specific person that has been uploaded through a social network.

Next, the communication unit may receive, from the user, a text-to-speech request including information on the selected specific person and the generated content (S830). Here, the text-to-speech request may include information on the specific person selected by the user and the content generated for that person.

When text-to-speech conversion is requested for a specific person whose voice data is not included in the voice data DB, the crowdfunding unit may generate funding information related to the voice data of that person, provide the generated funding information to user terminals for sharing, and collect investment information corresponding to the funding information from the user terminals. When the funding target amount is reached and the crowdfunding succeeds, the crowdfunding unit may provide rewards to the user terminals based on the investment information.
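The branch between synthesis and crowdfunding, and the reward step that follows a successful campaign, can be sketched as below. Everything specific in the sketch is a hypothetical choice: the default target amount, the reward rule of one credit per 10 won invested, and the names `Campaign` and `route_request` do not come from the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class Campaign:
    """Funding information for one person missing from the voice data DB."""
    person: str
    target_amount: int
    pledges: dict = field(default_factory=dict)  # user_id -> total amount invested

    def pledge(self, user_id, amount):
        """Record investment information received from a user terminal."""
        if amount <= 0:
            raise ValueError("pledge amount must be positive")
        self.pledges[user_id] = self.pledges.get(user_id, 0) + amount

    @property
    def raised(self):
        return sum(self.pledges.values())

    @property
    def succeeded(self):
        return self.raised >= self.target_amount

    def rewards(self, won_per_credit=10):
        """Rewards per investor, granted only if the target was reached."""
        if not self.succeeded:
            return {}
        return {user: amount // won_per_credit for user, amount in self.pledges.items()}


def route_request(person, voice_db, campaigns, default_target=100_000):
    """Decide whether a text-to-speech request can be synthesized right away."""
    if person in voice_db:
        return "synthesize"
    if person not in campaigns:
        campaigns[person] = Campaign(person, default_target)
    return "crowdfunding"


voice_db = {"person_a"}
campaigns = {}
first = route_request("person_a", voice_db, campaigns)
second = route_request("person_z", voice_db, campaigns)

campaign = campaigns["person_z"]
campaign.pledge("user_1", 60_000)
campaign.pledge("user_2", 50_000)
granted = campaign.rewards()
```

Integer division is used for the reward so that the result does not depend on floating-point rounding.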

Then, in response to the text-to-speech request, the voice synthesis unit may extract the voice data of the selected specific person from the voice data DB, convert the text information included in the content into voice synthesis data based on the extracted voice data, and provide the voice synthesis data to the user terminal (S840). According to an embodiment, the voice synthesis unit may generate the voice synthesis data by inputting the text information, the sequential prosody features, and the speaker's vocal features into an artificial neural network text-to-speech synthesis model. The voice synthesis data may include voice data for the text in which the sequential prosody features and the speaker's vocal features are reflected.

The fee for the converted voice synthesis data may be determined by the fee management unit. According to an embodiment, the fee management unit may calculate, based on the text information, the amount of voice synthesis data to be converted by the voice synthesis unit or the playback time of the voice synthesis data, and may determine the fee based on at least one of the calculated amount of voice synthesis data and the calculated playback time.
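The fee determination can be sketched as a two-step estimate: text length to playback time and data amount, then playback time to fee. The disclosure gives no formula, so the speaking rate, the bytes per second, the per-second price, and the minimum fee below are assumptions, chosen only so that the example reproduces the figures shown in FIG. 5 (50 seconds, 800 KB, 1,000 won).

```python
import math


def estimate_synthesis(text, chars_per_second=5.0, bytes_per_second=16_000):
    """Estimate playback time (s) and data amount (bytes) from the text alone.

    Whitespace is ignored when counting characters. Both rates are assumptions.
    """
    spoken_chars = len("".join(text.split()))
    duration_seconds = spoken_chars / chars_per_second
    size_bytes = math.ceil(duration_seconds * bytes_per_second)
    return duration_seconds, size_bytes


def determine_fee(duration_seconds, won_per_second=20, minimum_fee=100):
    """Charge per started second of playback, with a floor (hypothetical pricing)."""
    fee = math.ceil(duration_seconds) * won_per_second
    return max(fee, minimum_fee)


# A 250-character text at the assumed rates.
text = "a" * 250
duration_seconds, size_bytes = estimate_synthesis(text)
fee = determine_fee(duration_seconds)
```

A fee based on the amount of data instead of the playback time would follow the same pattern, with `size_bytes` as the input to the pricing function.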

Finally, the communication unit may share content including the voice synthesis data (S850). According to an embodiment, the communication unit may share the content including the voice synthesis data through a social network, for example, through a bulletin board or fan page for a specific celebrity provided by the voice synthesis service.

The method for providing a voice synthesis service described above may also be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which data readable by a computer system is stored. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. In addition, the computer-readable recording medium may be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the embodiments described above can be readily inferred by programmers in the technical field to which the present invention pertains.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and the design requirements imposed on the overall system. Those skilled in the art may implement the described functionality in various ways for each particular application, but such implementations should not be interpreted as departing from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in the present disclosure, a computer, or a combination thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in the present disclosure.

When implemented in software, the techniques may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a computer. By way of non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.

For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include CD, laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in a user terminal.

Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in connection with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of the present disclosure may be implemented in or across a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those of ordinary skill in the art to which the invention pertains. Such modifications and changes should be considered to fall within the scope of the claims appended hereto.

112, 114, 116: user terminal
120: network
130, 200: voice synthesis service providing server
210: communication unit
220: database
222: user DB
224: voice data DB
230: processor
232: voice analysis unit
234: voice synthesis unit
236: fee management unit
238: rank management unit
240: crowdfunding unit
300: artificial neural network model
310: input signal or data
320: input layer
330_1 to 330_n: hidden layers
340: output layer
350: output signal or data

Claims

1. A method for providing a voice synthesis service, the method comprising:
analyzing, by a voice analysis unit, voice data of a specific person, and building a voice data DB based on the analysis result;
receiving, by a communication unit, from a user terminal, content generated for a specific person selected by a user;
receiving, by the communication unit, a text-to-speech request including information on the selected specific person and the generated content;
extracting, by a voice synthesis unit, voice data of the selected specific person from the voice data DB in response to the text-to-speech request, converting text information included in the content into voice synthesis data based on the extracted voice data, and providing the voice synthesis data to the user terminal; and
sharing, by the communication unit, content including the voice synthesis data.

2. The method of claim 1, further comprising:
calculating, by a fee management unit, an amount of the speech synthesis data converted by the speech synthesis unit based on the text information, or a reproduction time of the speech synthesis data; and
determining, by the fee management unit, a fee based on at least one of the calculated amount of the speech synthesis data and the reproduction time of the speech synthesis data.

3. The method of claim 2,
further comprising determining, by a ranking management unit, speech synthesis rankings of a plurality of specific persons based on at least one of the determined fee and the amount of the speech synthesis data.

4. The method of claim 1, further comprising:
determining, by the voice analysis unit, a sound range of the specific person's voice based on the voice data of the specific person; and
transmitting, by the communication unit, to the user terminal, information on at least one specific person, among a plurality of specific persons, having a sound range similar to the determined sound range.

5. The method of claim 1, further comprising:
generating, by a crowdfunding unit, crowdfunding information for securing voice data of a specific person not included in the voice data DB;
providing the crowdfunding information to the user terminal for sharing;
collecting, from the user terminal, investment information corresponding to the crowdfunding information; and
providing, when the crowdfunding succeeds, a reward to the user terminal based on the investment information.

6. A speech synthesis service providing system, comprising:
a voice analysis unit configured to analyze voice data of a specific person and build a voice data DB based on the analysis result;
a communication unit configured to receive, from a user terminal, content generated for a specific person selected by a user, and to receive a text-to-speech conversion request including information on the selected specific person and the generated content; and
a speech synthesis unit configured to extract voice data of the selected specific person from the voice data DB in response to the text-to-speech conversion request, convert text information included in the content into speech synthesis data, and provide the converted speech synthesis data to the user terminal,
wherein the communication unit is further configured to share content including the speech synthesis data.

7. The system of claim 6,
further comprising a fee management unit configured to calculate, based on the text information, an amount of the speech synthesis data converted by the speech synthesis unit and a speech reproduction time, and to determine a fee based on at least one of the calculated amount of the speech synthesis data and the speech reproduction time.

8. The system of claim 7,
further comprising a ranking management unit configured to determine speech synthesis rankings of a plurality of specific persons based on at least one of the determined fee and the amount of the speech synthesis data.

9. The system of claim 6,
wherein the voice analysis unit is further configured to determine a sound range of the specific person's voice based on the voice data of the specific person, and
wherein the communication unit is further configured to transmit, to the user terminal, information on at least one specific person, among a plurality of specific persons, having a sound range similar to the determined sound range.

10. The system of claim 6, further comprising a crowdfunding unit configured to:
generate crowdfunding information for securing voice data of a specific person not included in the voice data DB;
provide the crowdfunding information to the user terminal for sharing;
collect, from the user terminal, investment information corresponding to the crowdfunding information; and
provide, when the crowdfunding succeeds, a reward to the user terminal based on the investment information.
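
For illustration only, the fee determination and ranking recited in the claims for the fee management unit and the ranking management unit could be sketched as follows. This is a minimal, non-limiting sketch and not the claimed implementation: the names `SynthesisJob`, `determine_fee`, and `rank_persons`, the two rate constants, and the use of character count as the "amount" of speech synthesis data are all assumptions that do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical rates; the disclosure does not specify any pricing.
FEE_PER_CHAR = 10      # assumed fee per character of converted text
FEE_PER_SECOND = 50    # assumed fee per second of synthesized audio


@dataclass
class SynthesisJob:
    person: str              # specific person whose voice was used
    text: str                # text information converted into speech
    playback_seconds: float  # reproduction time of the synthesized data


def determine_fee(job: SynthesisJob, by: str = "amount") -> float:
    """Determine a fee from the amount of data or from the playback time."""
    if by == "amount":
        # Assumption: "amount" is approximated by the character count.
        return len(job.text) * FEE_PER_CHAR
    if by == "time":
        return job.playback_seconds * FEE_PER_SECOND
    raise ValueError(f"unknown fee basis: {by}")


def rank_persons(jobs: List[SynthesisJob]) -> List[Tuple[str, float]]:
    """Rank specific persons by their total amount-based fee, highest first."""
    totals: Dict[str, float] = {}
    for job in jobs:
        totals[job.person] = totals.get(job.person, 0) + determine_fee(job)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


jobs = [
    SynthesisJob("A", "hello", 2.0),
    SynthesisJob("B", "hi", 1.0),
    SynthesisJob("B", "good morning", 4.0),
]
ranking = rank_persons(jobs)  # person "B" ranks first with a total of 140
```

Ranking here uses the accumulated fee only; ranking by the amount of speech synthesis data instead would replace `determine_fee(job)` with the chosen amount measure.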