KR20220050342A

KR20220050342A - Apparatus, terminal and method for providing speech synthesizer service

Info

Publication number: KR20220050342A
Application number: KR1020200133907A
Authority: KR
Inventors: 정우임
Original assignee: (주)디테일컴
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2022-04-25
Also published as: KR102574311B1

Abstract

Disclosed are a device, terminal and method for providing a voice synthesis service. According to one aspect of the present invention, a voice synthesis service providing device comprises: a voice database; a voice feature extraction unit configured to receive user voice sample information for a suggested word, extract voice feature information of the user from the voice sample information, and store the voice feature information in the voice database; and a voice synthesis service unit configured to extract, when receiving a voice synthesis request signal including text from the user, voice feature information of the user from the voice database and apply the voice feature information and the text to machine learning to generate voice data for the text. According to the present invention, the voice synthesis service providing device is capable of converting input text into a user's voice.

Description

Apparatus, terminal and method for providing speech synthesis service {APPARATUS, TERMINAL AND METHOD FOR PROVIDING SPEECH SYNTHESIZER SERVICE}

The present invention relates to an apparatus, a terminal, and a method for providing a voice synthesis service, and more particularly to an apparatus, a terminal, and a method for providing a voice synthesis service that can convert input text into a user's voice and provide it.

With recent advances in speech synthesis technology, it has come to be widely used in fields such as voice guidance and education. Speech synthesis is a technology that generates sounds similar to human speech, and is commonly known as a text-to-speech (TTS) system.

TTS (Text To Speech) is a technology that converts characters or symbols into speech and plays it back. TTS builds a pronunciation database for phonemes and concatenates its entries to generate continuous speech; the key is to synthesize natural-sounding speech by adjusting the loudness, duration, and pitch of the voice.

That is, TTS is a text-to-speech conversion device that converts character strings (sentences) into speech, and it is broadly divided into three stages: language processing, prosody generation, and waveform synthesis. When text is input, the language processing stage analyzes the grammatical structure of the input document, prosody resembling a person reading aloud is generated from the analyzed document structure, and a synthesized sound is produced by assembling basic units from the stored voice DB according to the generated prosody.

TTS places no restriction on its target vocabulary and converts information in ordinary text form into speech. When such a system is implemented, phonetics, speech analysis, speech synthesis, and speech recognition technologies are therefore combined so that more natural and varied speech is output.

However, when a terminal providing such conventional TTS reads out a text message or the like, it always uses the same preset voice regardless of who the other party is, and therefore cannot satisfy the diverse needs of users.

Background art for the present invention is disclosed in Korean Patent Application Laid-Open No. 2011-0032256 (published March 30, 2011; title of invention: TTS announcement apparatus and method of in-store sound broadcasting system).

The present invention has been devised to address the above problems, and an object of the present invention is to provide an apparatus, a terminal, and a method for providing a voice synthesis service that converts input text into a user's voice and provides it.

The problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood by those skilled in the art from the following description.

An apparatus for providing a voice synthesis service according to one aspect of the present invention includes: a voice database; a voice feature extraction unit that receives a user's voice sample information for a suggested word, extracts the user's voice feature information from the voice sample information, and stores the voice feature information in the voice database; and a voice synthesis service unit that, upon receiving from the user a voice synthesis request signal including text, extracts the user's voice feature information from the voice database and applies the voice feature information and the text to machine learning to generate voice data for the text.

In the present invention, the voice synthesis service unit may provide a design template for editing a video input by the user or a pre-stored video, and receive editing information for the video through the design template. Upon receiving a video dubbing request including text and dubber information, it may extract the dubber's voice feature information from the voice database, apply the voice feature information and the text to machine learning to generate voice data for the text, and apply the voice data to the video.

In the present invention, upon receiving from the user an e-book dubbing request including an e-book and reader information, the voice synthesis service unit may extract the reader's voice feature information from the voice database and apply the voice feature information and the e-book to machine learning to generate voice data for the e-book.

A method for providing a voice synthesis service according to another aspect of the present invention includes: a step in which a voice feature extraction unit receives a user's voice sample information for a suggested word, extracts the user's voice feature information from the voice sample information, and stores the voice feature information in a voice database; and a step in which a voice synthesis service unit, upon receiving from the user a voice synthesis request signal including text, extracts the user's voice feature information from the voice database and applies the voice feature information and the text to machine learning to generate voice data for the text.

The method may further include: a step in which the voice synthesis service unit, upon receiving a video editing command from the user, provides a design template for editing a video input by the user or a pre-stored video, receives editing information for the video through the design template, and edits the video; and a step in which the voice synthesis service unit, upon receiving a video dubbing request including text and dubber information, extracts the dubber's voice feature information from the voice database, applies the voice feature information and the text to machine learning to generate voice data for the text, and applies the voice data to the video.

The method may further include a step in which the voice synthesis service unit, upon receiving from the user an e-book dubbing request including an e-book and reader information, extracts the reader's voice feature information from the voice database and applies the voice feature information and the e-book to machine learning to generate voice data for the e-book.

A terminal according to yet another aspect of the present invention includes: a storage unit; a voice recording unit that detects the connection of a voice call, converts speech during the call into voice data, and generates a recording file corresponding to the voice data; and a voice feature extraction unit that extracts the user's voice feature information from the recording files generated over a preset period and stores the voice feature information in the storage unit. The voice recording unit may convert the voice data corresponding to a recording file into text data and store it in the storage unit together with the recording file.

In the present invention, the voice recording unit may filter out the other party's voice from the call content and generate a recording file of the user's voice.

The terminal may further include a voice synthesis service unit that, upon receiving a command to convert a voice signal provided by the terminal into the user's voice, extracts the user's voice feature information from the storage unit and applies the voice feature information and the voice signal to machine learning to generate voice data for the voice signal.

In the present invention, upon receiving from the user a voice synthesis request signal including text, the voice synthesis service unit may extract the user's voice feature information from the storage unit and apply the voice feature information and the text to machine learning to generate voice data for the text.

An apparatus, a terminal, and a method for providing a voice synthesis service according to an embodiment of the present invention can convert input text into the user's voice and provide it.

An apparatus, a terminal, and a method for providing a voice synthesis service according to another embodiment of the present invention can dub the user's own voice onto videos such as YouTube videos, SNS promotional videos, and corporate promotional videos.

An apparatus, a terminal, and a method for providing a voice synthesis service according to yet another embodiment of the present invention allow a user to listen to an e-book in the voice of a person of the user's choosing.

An apparatus, a terminal, and a method for providing a voice synthesis service according to yet another embodiment of the present invention can convey to acquaintances the voice a deceased person had while alive, and can deliver notices about family events and the like by voice.

The effects of the present invention are not limited to those mentioned above, and may include various other effects within a range obvious to those skilled in the art from the description below.

FIG. 1 is a diagram for explaining a system for providing a voice synthesis service according to an embodiment of the present invention.
FIG. 2 is a block diagram schematically showing the configuration of an apparatus for providing a voice synthesis service according to an embodiment of the present invention.
FIG. 3 is a block diagram for explaining the voice synthesis service unit shown in FIG. 2.
FIG. 4 is a diagram for explaining a method of extracting voice feature information from a user's voice sample information according to an embodiment of the present invention.
FIG. 5 is a flowchart for explaining a method for providing a voice synthesis service according to an embodiment of the present invention.
FIG. 6 is a block diagram schematically showing the configuration of a terminal providing a voice synthesis service according to another embodiment of the present invention.

Hereinafter, an apparatus, a terminal, and a method for providing a voice synthesis service according to an embodiment of the present invention will be described with reference to the accompanying drawings. In the drawings, the thickness of lines and the size of components may be exaggerated for clarity and convenience of explanation. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or custom of a user or operator. Definitions of these terms should therefore be based on the content of this specification as a whole.

The implementations described herein may be embodied as, for example, a method or process, an apparatus, a software program, a data stream, or a signal. Even if a feature is discussed only in the context of a single form of implementation (e.g., only as a method), it may also be implemented in other forms (e.g., an apparatus or a program). An apparatus may be implemented in suitable hardware, software, firmware, and the like. A method may be implemented in an apparatus such as a processor, which refers generally to processing devices including, for example, computers, microprocessors, integrated circuits, and programmable logic devices. Processors also include communication devices that facilitate the communication of information between end users, such as computers, cell phones, portable/personal digital assistants ("PDAs"), and other devices.

FIG. 1 is a diagram for explaining a system for providing a voice synthesis service according to an embodiment of the present invention.

Referring to FIG. 1, a system for providing a voice synthesis service according to an embodiment of the present invention may include a user terminal 100 and a service providing apparatus 200, which may be connected through a communication network.

The user terminal 100 may receive voice sample information for a suggested word from the user and transmit the received voice sample information to the service providing apparatus 200. Here, the user reads aloud the suggested word provided by the service providing apparatus 200, and the user terminal 100 transmits the voice sample information produced by the user's utterance of the suggested word to the service providing apparatus 200. The user can record the voice for the suggested word through the microphone of the user's own terminal rather than in a restricted space (e.g., a recording room or studio).

The user terminal 100 may be implemented in a smartphone, a computer, a mobile phone, or the like. The text-to-speech synthesis terminal, that is, the user terminal 100, may include a communication unit and communicate with an external device (e.g., a server device).

The user terminal 100 may be an electronic device that can use the voice synthesis service program (application or applet) provided by the service providing apparatus 200 and that can be applied to various wired and wireless environments. For example, the user terminal 100 includes a PDA (Personal Digital Assistant), a smartphone, a cellular phone, a PCS (Personal Communication Service) phone, a GSM (Global System for Mobile) phone, a W-CDMA (Wideband CDMA) phone, a CDMA-2000 phone, an MBS (Mobile Broadband System) phone, and the like.

The service providing apparatus 200 may extract the user's voice feature information from the voice sample information received from the user terminal 100 and store the extracted voice feature information in the voice database 320. In doing so, the service providing apparatus 200 can implement the user's own voice as TTS.

When the service providing apparatus 200 receives from the user a voice synthesis request signal including text, it may extract the user's voice feature information from the voice database 320 and apply the extracted voice feature information and the text to machine learning to generate voice data for the text.

The service providing apparatus 200 may be implemented as a computer capable of accessing a remote server or terminal through a communication network. Here, the computer may include, for example, a navigation device, or a notebook, desktop, or laptop computer equipped with a web browser.

Although this embodiment describes the service providing apparatus 200 as converting text into voice data, when the voice synthesis service program (application or applet) is installed on the user terminal 100, the user terminal 100 may also be used to convert text into voice data.

Accordingly, the device that converts text into voice data is hereinafter referred to as the voice synthesis service providing apparatus 300.

FIG. 2 is a block diagram schematically showing the configuration of an apparatus for providing a voice synthesis service according to an embodiment of the present invention, and FIG. 3 is a block diagram for explaining the voice synthesis service unit shown in FIG. 2.

Referring to FIG. 2, the voice synthesis service providing apparatus 300 according to an embodiment of the present invention may include a communication unit 310, a voice database 320, a voice feature extraction unit 330, a voice synthesis service unit 340, and a control unit 350.

The communication unit 310 is a component for communicating with external devices through a communication network, and may transmit and receive various information such as voice sample information and voice synthesis request signals. The communication unit 310 may be implemented in various forms, such as a short-range communication module, a wireless communication module, a mobile communication module, or a wired communication module.

The voice database 320 may store the user's voice feature information extracted by the voice feature extraction unit 330. The voice database 320 may store user identification information, voice feature information, and the like, and the user identification information may include a user ID/password, a phone number, and the like. The user's voice feature information stored in the voice database 320 (for example, an embedding vector representing the user's voice features) may be provided to the voice synthesis service unit 340 at synthesis time.
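The storage role described above can be pictured with a minimal in-memory sketch. Everything in it is an assumption made for illustration: the names `VoiceRecord` and `VoiceDatabase`, the choice of the user ID as the lookup key, and the three-element toy embedding. The source does not say how the database is organized, and the password field mentioned in the text is deliberately left out of the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VoiceRecord:
    """One entry of the hypothetical voice database."""
    user_id: str            # part of the user identification information
    phone_number: str       # also part of the identification information
    embedding: List[float]  # voice feature information as an embedding vector


@dataclass
class VoiceDatabase:
    """In-memory stand-in for the voice database 320 (illustrative only)."""
    _records: Dict[str, VoiceRecord] = field(default_factory=dict)

    def store(self, record: VoiceRecord) -> None:
        # Written by the feature extraction side once a sample is processed.
        self._records[record.user_id] = record

    def lookup(self, user_id: str) -> Optional[List[float]]:
        # Read by the synthesis side when a request names a speaker.
        record = self._records.get(user_id)
        return None if record is None else record.embedding


db = VoiceDatabase()
db.store(VoiceRecord("user-1", "010-0000-0000", [0.12, -0.40, 0.88]))
```

Returning `None` for an unknown speaker lets the caller decide whether to reject the request or fall back to a default voice; the source does not describe either behavior.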

The voice database 320 may store suggested words (text) and the user's voice corresponding to each suggested word. A suggested word may be written in at least one language and may include at least one of a word, a phrase, and a sentence that a person can understand. The voices stored in the voice database 320 may include voice data of a plurality of speakers reading the suggested words. The suggested words and voice data may be stored in the voice database 320 in advance or received from the communication unit 310.

The voice feature extraction unit 330 may receive the voice sample information produced by the user's utterance of a suggested word, extract the user's voice feature information from the voice sample information, and store the voice feature information in the voice database 320. Here, the voice sample information may include voice spectrum data representing information related to the user's voice features. The voice feature information may include formant information, frequency (Log f0) information, LPC (Linear Predictive Coefficient) information, spectral envelope information, energy information, speech rate (Pitch Period) information, log spectrum information, and the like.
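Two of the listed features, energy and log f0, are simple enough to sketch in plain Python. The autocorrelation pitch estimator below is a swapped-in textbook technique, not something the source specifies, and the function names, the 60 to 400 Hz search range, and the synthetic 200 Hz test tone are all assumptions chosen for illustration.

```python
import math
from typing import List


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_f0(frame: List[float], sample_rate: int,
                f_min: float = 60.0, f_max: float = 400.0) -> float:
    """Return sample_rate / lag for the lag with the largest autocorrelation."""
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Unnormalized autocorrelation of the frame with itself at this lag.
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


SAMPLE_RATE = 8000
# Synthetic stand-in for a voiced frame: 50 ms of a 200 Hz sine wave.
frame = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(400)]

f0 = estimate_f0(frame, SAMPLE_RATE)
log_f0 = math.log(f0)
energy = frame_energy(frame)
```

Real speech is noisier than a pure tone, so a practical extractor would normally add windowing, voicing detection, and smoothing across frames.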

The voice feature extraction unit 330 may extract voice feature information using a speech processing method such as the mel-frequency cepstrum (MFC). The voice feature extraction unit 330 may also extract voice feature information by inputting the voice sample information into a trained voice feature extraction model (for example, an artificial neural network).
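Mel-frequency processing starts from the mel scale, which spaces frequencies the way pitch is perceived. The sketch below shows only that first step: the widely used conversion m = 2595 log10(1 + f/700) and the center frequencies of a filter bank spaced evenly in mel. The filter count and the 0 to 4000 Hz range are assumptions; the source gives no parameters, and a full cepstrum computation would also need an FFT, log filter-bank energies, and a DCT.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Common mel-scale formula: m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters: int, f_low: float, f_high: float) -> List[float]:
    """Center frequencies (Hz) of n_filters filters spaced evenly on the mel scale."""
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (n_filters + 1)
    return [mel_to_hz(mel_low + step * (k + 1)) for k in range(n_filters)]


centers = mel_center_frequencies(n_filters=10, f_low=0.0, f_high=4000.0)
```

Because the spacing is uniform in mel rather than in hertz, neighboring filters sit close together at low frequencies and spread apart at high frequencies.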

The voice feature extraction unit 330 may represent the voice feature information as an embedding vector.

Before extracting the voice feature information from the voice sample information, the voice feature extraction unit 330 may remove noise components from the voice sample information.
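The source does not say how noise is removed. As one hedged illustration, the sketch below applies a simple energy-based noise gate: it assumes the first few frames of the recording contain only background noise, estimates a noise floor from them, and silences every frame whose energy stays below a multiple of that floor. The function names, frame length, and margin are hypothetical values picked for the example.

```python
from typing import List


def frame_energies(samples: List[float], frame_len: int) -> List[float]:
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def noise_gate(samples: List[float], frame_len: int = 4,
               noise_frames: int = 2, margin: float = 4.0) -> List[float]:
    """Zero every frame whose energy is below margin times the noise floor."""
    energies = frame_energies(samples, frame_len)
    # Assumption: the leading frames hold background noise only.
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    cleaned = list(samples)
    for k, energy in enumerate(energies):
        if energy < margin * noise_floor:
            start = k * frame_len
            cleaned[start:start + frame_len] = [0.0] * frame_len
    return cleaned


quiet = [0.01, -0.01, 0.01, -0.01]
loud = [0.5, -0.5, 0.5, -0.5]
noisy = quiet * 2 + loud + quiet
cleaned = noise_gate(noisy)
```

A gate only mutes low-energy stretches; it does not clean noise that overlaps speech, for which spectral methods are the usual choice.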

The voice feature extraction unit 330 may store the extracted voice feature information of the user in the voice database 320. Accordingly, when speech is synthesized for input text, the voice feature information of a speaker stored in advance in the voice database 320 may be selected or designated, and the selected or designated speaker's voice feature information may be used for the synthesis.

Upon receiving from the user a voice synthesis request signal including text, the voice synthesis service unit 340 may extract the user's voice feature information from the voice database 320 and apply the voice feature information and the text to machine learning to generate voice data for the text. Here, the machine learning may include a speech synthesis model (e.g., pre-net, CBHG module, DNN, CNN+DNN). The speech synthesis model may convert the input text into a feature vector of voice data. The voice data may be voice data in which the text is uttered according to a speaker's voice feature information. In other words, the speech synthesis model may output voice data in which the input text is uttered according to a specific speaker's voice feature information, or a feature vector of that voice data.

The speech synthesis model may be an encoder-decoder based neural network. The neural network may include a plurality of layers, each of which may include a plurality of neurons. Neurons in neighboring layers may be connected by synapses. Weights may be assigned to the synapses through training, and the model's parameters may include these weights.

As such, the speech synthesis model may be a text-to-feature-vector or text-to-speech model built as a sequence-to-sequence model. The voice data, or the feature vector of the voice data, determined by the speech synthesis model may be used together with the text as training data for a speech recognition model. In this way, training data with diverse utterance characteristics can be obtained easily.

Upon receiving a voice synthesis request signal, the voice synthesis service unit 340 may analyze the signal to extract the speaker and the text. The voice synthesis service unit 340 may embed the extracted text to form the sentence input. Here, embedding is the process of converting discrete symbol data, such as sentences and speakers, into continuous feature vectors with varied characteristics. Specifically, the voice synthesis service unit 340 decomposes the text into Hangul jamo to generate jamo-level input and assigns a number to each jamo; it builds an index table in which each jamo corresponds one-to-one to a number and numbers the jamo according to that table. In this way the voice synthesis service unit 340 can convert the text into numeric data. The voice synthesis service unit 340 may one-hot encode the text mapped to numeric data to generate a sequence of one-hot vectors, and multiply that sequence by a sentence embedding matrix to convert it into text feature vectors.
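The jamo decomposition, index table, one-hot encoding, and embedding multiplication described above can be traced end to end on a two-syllable example. The decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables (base U+AC00, 21 x 28 = 588 syllables per initial consonant). The rest is assumed for illustration: the names, the use of compatibility jamo so that an initial and a final consonant of the same shape share one index, the embedding width of 4, and the placeholder matrix values, which a trained model would learn instead.

```python
from typing import Dict, List

CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")          # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")      # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # none + 27 finals


def decompose(text: str) -> List[str]:
    """Split precomposed Hangul syllables into jamo; pass other characters through."""
    out: List[str] = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # 19 * 21 * 28 precomposed syllables
            out.append(CHOSEONG[code // 588])
            out.append(JUNGSEONG[(code % 588) // 28])
            if code % 28:
                out.append(JONGSEONG[code % 28])
        else:
            out.append(ch)
    return out


# Index table: each jamo (plus the space character) maps one-to-one to a number.
# Characters outside this inventory are not handled by the sketch.
INVENTORY = sorted(set(CHOSEONG + JUNGSEONG + JONGSEONG[1:])) + [" "]
INDEX: Dict[str, int] = {symbol: i for i, symbol in enumerate(INVENTORY)}


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(rows: List[List[float]], matrix: List[List[float]]) -> List[List[float]]:
    """Multiply a (T x V) one-hot sequence by a (V x D) embedding matrix."""
    return [
        [sum(row[v] * matrix[v][d] for v in range(len(matrix)))
         for d in range(len(matrix[0]))]
        for row in rows
    ]


EMBED_DIM = 4
# Placeholder values only; training would determine the real entries.
EMBEDDING = [[(v * EMBED_DIM + d) / 100.0 for d in range(EMBED_DIM)]
             for v in range(len(INVENTORY))]

jamo = decompose("한글")
ids = [INDEX[j] for j in jamo]
vectors = embed([one_hot(i, len(INVENTORY)) for i in ids], EMBEDDING)
```

Multiplying a one-hot row by the matrix simply selects the matrix row for that jamo, which is why practical implementations replace the multiplication with a table lookup.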

Thereafter, the voice synthesis service unit 340 may extract the voice feature information corresponding to the speaker from the voice database 320 and apply the extracted voice feature information and the text to the speech synthesis model to convert the text into voice data. That is, the voice synthesis service unit 340 may input the text feature vectors and an embedding vector representing the voice feature information into the speech synthesis model to convert the text into voice data.
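
The description does not specify how the text feature vectors and the speaker embedding vector are combined at the input of the speech synthesis model. One common arrangement, shown here purely as an assumed sketch, is to append the same speaker embedding to every text feature vector; the function name and the numeric values are hypothetical.

```python
def condition_on_speaker(text_features, speaker_embedding):
    """Append the same speaker embedding to every text feature vector.

    This concatenation is an assumption for illustration; the source only states
    that both inputs are fed to the speech synthesis model.
    """
    return [list(vector) + list(speaker_embedding) for vector in text_features]


# Hypothetical values: three text feature vectors and one speaker embedding.
text_features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
speaker_embedding = [0.9, -0.7, 0.05]

model_input = condition_on_speaker(text_features, speaker_embedding)
```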

As shown in FIG. 3, the voice synthesis service unit 340 may include an image template service unit 352, a video editing service unit 354, a voice message service unit 356, and a dubbing service unit 358.

The image template service unit 352 may provide image templates by theme, receive editing information for the image template selected by the user, and store the edited image template. Using the themed image templates provided through the image template service unit 352, the user can set up and edit the design of, for example, an online shopping mall.

The video editing service unit 354 may provide a design template for editing a video input by the user or a previously stored video, and receive editing information for the video through the design template. Upon receiving a video dubbing request that includes text and a dubber, it may extract the dubber's voice feature information from the voice database 320, apply the extracted voice feature information and the text to machine learning to generate voice data for the text, and apply the generated voice data to the video.

That is, when a video is input by the user, the video editing service unit 354 may provide a design template for editing that video and store the video edited through the design template. In doing so, the video editing service unit 354 may provide an editing tool that, working from short clips of about 30 seconds and using a video template DB (not shown), supports simple modifications such as adding footage, inserting effects, and adjusting positions, as well as edits such as text correction and audio mixing.

The video editing service unit 354 may support the voice of the user or of a voice-actor character, and may dub a video in either voice.

Through the video editing service unit 354, a user can easily edit his or her own voice when producing a promotional YouTube video. When producing a promotional video for a social networking service (SNS), the user can likewise insert and revise narration easily. The video editing service unit 354 also makes it possible to explain a company's products efficiently for promotional purposes, to deliver the voice of a deceased person, as it sounded in life, to acquaintances, and to deliver announcements about family events and the like by voice.

When a text message is input by the user, the voice message service unit 356 may extract the user's voice feature information from the voice database 320 and apply it to the text message to deliver the text message as speech.

When a more emotional notification is called for than an ordinary exchange of text messages provides, the user can use the voice message service unit 356 to combine his or her synthesized voice with an image, video, music, or the like and send the result. For example, a wedding invitation can be sent to acquaintances with the voice of the bride or groom added through the voice message service unit 356.

Upon receiving from the user an e-book dubbing request that includes an e-book and a reader, the dubbing service unit 358 may extract the reader's voice feature information from the voice database 320 and apply the extracted voice feature information and the e-book to machine learning to generate voice data for the e-book. Through the dubbing service unit 358, the user can have an e-book read aloud in the voice of a person of the user's choosing.

For example, when the text of a children's book is input, the dubbing service unit 358 may synthesize and provide the content of the book in the voice of the father or mother.

The voice synthesis service providing apparatus 300 according to an embodiment of the present invention may further include an input/output unit (I/O device; not shown). The input/output unit may receive input directly from the user, and may output at least one of voice, video, or text to the user.

Meanwhile, the voice feature extraction unit 330 and the voice synthesis service unit 340 may each be implemented by a processor or the like required to execute a program on a computing device. They may be implemented as physically independent components, or as functionally distinct parts within a single processor.

The control unit 350 controls the operation of the various components of the voice synthesis service providing apparatus 300, including the communication unit 310, the voice database 320, the voice feature extraction unit 330, and the voice synthesis service unit 340. It may include at least one processing device, which may be a general-purpose central processing unit (CPU), a programmable device (CPLD, FPGA) configured for a specific purpose, an application-specific integrated circuit (ASIC), or a microcontroller chip.

FIG. 4 is a diagram illustrating a method of extracting voice feature information from a user's voice sample information according to an embodiment of the present invention.

Referring to FIG. 4, when the user terminal 100 receives a voice recording command through the voice synthesis service application and transmits the command to the service providing apparatus (S410), the service providing apparatus provides a suggested word (S420).

The user terminal 100 receives, through a microphone, voice sample information produced by the user uttering the suggested word (S430), and transmits the received voice sample information to the service providing apparatus (S440).

The service providing apparatus extracts voice feature information from the user's voice sample information (S450) and stores it in the voice database 320 (S460). Here, the voice feature information may include formant information, frequency (log f0) information, linear predictive coefficient (LPC) information, spectral envelope information, energy information, speech rate (pitch period) information, log spectrum information, and the like.
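
As a hedged illustration of two of the listed features, the sketch below computes frame energy and a log f0 estimate from a sampled signal. The autocorrelation-based pitch estimator, the frame size, the 60 to 400 Hz search range, and all function names are assumptions chosen for this example; the description does not state which estimation method is used, and the remaining features (formants, LPC, spectral envelope, log spectrum) are not covered here.

```python
import math


def split_frames(signal, size, hop):
    """Cut the signal into overlapping frames of `size` samples."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def estimate_f0(frame, sample_rate, fmin=60.0, fmax=400.0):
    """Pick the lag with the largest autocorrelation inside the assumed pitch range."""
    lag_min = max(1, int(sample_rate / fmax))
    lag_max = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_corr = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0  # 0.0 means "no pitch found"


def extract_features(signal, sample_rate, size=400, hop=200):
    """Average energy and log f0 over all frames (log f0 over pitched frames only)."""
    frames = split_frames(signal, size, hop)
    if not frames:
        raise ValueError("signal is shorter than one frame")
    energies = [frame_energy(f) for f in frames]
    pitches = [estimate_f0(f, sample_rate) for f in frames]
    log_f0s = [math.log(p) for p in pitches if p > 0.0]
    return {
        "energy": sum(energies) / len(energies),
        "log_f0": sum(log_f0s) / len(log_f0s) if log_f0s else None,
    }


# Synthetic stand-in for a voice sample: a 200 Hz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / SAMPLE_RATE) for n in range(800)]
features = extract_features(tone, SAMPLE_RATE)
```

A feature dictionary of this kind could then be stored under the user's identifier in the voice database 320; real speech would additionally need voiced/unvoiced detection and windowing, which this sketch omits.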

Although this embodiment describes the service providing apparatus as extracting the voice feature information from the voice sample information, the user terminal 100 may perform the extraction instead.

FIG. 5 is a flowchart illustrating a method of providing a voice synthesis service according to an embodiment of the present invention.

Referring to FIG. 5, when the voice synthesis service providing apparatus 300 receives a voice synthesis request signal including text (S410), it extracts the corresponding user's voice feature information from the voice database 320 (S420). Here, the voice synthesis request signal may include text, user identification information, speaker information, and the like. Accordingly, the voice synthesis service providing apparatus 300 may analyze the request signal to extract the speaker information and then extract the voice feature information corresponding to that speaker information from the voice database 320.
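
The request analysis and database lookup of steps S410 and S420 might be sketched as below. The field names, the dictionary-shaped request payload, and the in-memory dictionary standing in for the voice database 320 are all hypothetical; the description states only that the request may carry text, user identification information, and speaker information.

```python
from dataclasses import dataclass


@dataclass
class SynthesisRequest:
    """Hypothetical shape of a voice synthesis request signal."""
    text: str
    user_id: str
    speaker_id: str


def parse_request(payload):
    """Analyze a raw request (assumed to be a dict) and pull out its fields."""
    return SynthesisRequest(
        text=payload["text"],
        user_id=payload["user_id"],
        # Assumption: when no speaker is named, the requesting user is the speaker.
        speaker_id=payload.get("speaker_id", payload["user_id"]),
    )


def lookup_voice_features(request, voice_db):
    """Fetch the stored voice feature information for the requested speaker."""
    try:
        return voice_db[request.speaker_id]
    except KeyError:
        raise LookupError(
            f"no voice features registered for speaker {request.speaker_id!r}"
        ) from None


# In-memory stand-in for the voice database 320 (values are made up).
VOICE_DB = {"user-001": [0.12, -0.40, 0.88]}

request = parse_request({"text": "안녕하세요", "user_id": "user-001"})
speaker_features = lookup_voice_features(request, VOICE_DB)
```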

Once step S420 has been performed, the voice synthesis service providing apparatus 300 applies the extracted voice feature information and the text to machine learning to generate voice data for the text (S430, S440). In doing so, the voice synthesis service providing apparatus 300 may decompose the text into Hangul jamo to generate jamo-level input, build an index table that maps each jamo one-to-one to a number, and number each decomposed jamo according to that table, thereby converting the text into numeric data. The voice synthesis service providing apparatus 300 may then one-hot encode the numerically mapped text to generate a sequence of one-hot vectors, and multiply that sequence by a sentence embedding matrix to convert it into text feature vectors. Thereafter, the voice synthesis service providing apparatus 300 may extract the voice feature information corresponding to the speaker from the voice database 320 and input the text feature vectors and an embedding vector representing the voice feature information into the speech synthesis model to convert the text into voice data.

When the user selects a video editing command, the voice synthesis service providing apparatus 300 may provide a design template for editing a video input by the user or a previously stored video, and edit the video according to editing information received through the design template. Further, upon receiving a video dubbing request including text and a dubber, the voice synthesis service providing apparatus 300 may extract the dubber's voice feature information from the voice database 320, apply the extracted voice feature information and the text to machine learning to generate voice data for the text, and apply the generated voice data to the video.

In addition, upon receiving from the user an e-book dubbing request including an e-book and a reader, the voice synthesis service providing apparatus 300 may extract the reader's voice feature information from the voice database 320 and apply the extracted voice feature information and the e-book to machine learning to generate voice data for the e-book.

FIG. 6 is a block diagram schematically showing the configuration of a terminal that provides a voice synthesis service according to another embodiment of the present invention.

Referring to FIG. 6, a terminal 600 that provides a voice synthesis service according to another embodiment of the present invention may include a storage unit 610, a voice recording unit 620, a voice feature extraction unit 630, a voice synthesis service unit 640, and a control unit 650.

The voice recording unit 620 may detect that a voice call has been connected, convert the voice during the call into voice data, and generate a recording file corresponding to the voice data. In doing so, it may filter the call partner's voice out of the call so that the recording file contains only the user's voice. To make this filtering possible, the user's voice may be registered in advance. The voice recording unit 620 may also analyze the frequency characteristics of the user's voice and the call partner's voice and filter out the call partner's voice by replacing it with silence. In addition, because the user and the call partner may speak at the same time, the call partner's voice may resemble the user's, and noise may make the speech unclear, the voice recording unit 620 may treat any audio that falls outside the frequency range of the user's voice as one of these cases and filter it out.
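
A minimal sketch of the frequency-range filtering described above follows. It assumes that the call audio has already been split into frames and that a fundamental-frequency estimate is available for each frame; the function name, the per-frame estimates, and the example range are hypothetical, and how the user's registered range is obtained is left open by the description.

```python
def silence_outside_range(frames, frame_f0, user_range):
    """Replace frames whose pitch lies outside the user's range with silence.

    frames     : list of frames, each a list of samples
    frame_f0   : one pitch estimate in Hz per frame (assumed to be given)
    user_range : (low_hz, high_hz) taken from the user's pre-registered voice
    """
    low, high = user_range
    filtered = []
    for frame, f0 in zip(frames, frame_f0):
        if low <= f0 <= high:
            filtered.append(list(frame))          # keep the user's speech
        else:
            filtered.append([0.0] * len(frame))   # treat as callee, overlap, or noise
    return filtered


# Hypothetical call audio: three short frames with their pitch estimates.
call_frames = [[0.2, -0.1], [0.5, 0.4], [0.3, 0.3]]
call_f0 = [120.0, 230.0, 0.0]  # 0.0 marks a frame with no clear pitch
kept = silence_outside_range(call_frames, call_f0, user_range=(90.0, 160.0))
```

Silencing rather than dropping the rejected frames keeps the recording aligned in time with the original call, which matches the description of processing the call partner's voice as silence.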

The voice recording unit 620 does not record in the call-waiting state before a voice call is connected. It detects the connection of the voice call and converts the voices of the user and the call partner during the call into voice data, for example by converting the analog voice signal into a digital signal.

The voice recording unit 620 may temporarily generate a recording file corresponding to the voice data, producing the recording file automatically from the converted voice data. For example, in the call-waiting state before the recording file is generated, an advance notice that a recording file will be generated may be output, and once the voice call is connected and the recording file is being generated, a notice to that effect may be shown on part of the display of the terminal 600.

When the voice call ends, the voice recording unit 620 may ask the user of the terminal 600 whether to save the recording file, and may save the recording file according to the user's response.
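
The recording behavior described above (no recording while waiting, temporary recording once the call is connected, and a save prompt when the call ends) can be pictured as a small state machine. The class, method, and state names below are hypothetical, and the `confirm_save` callback stands in for whatever dialog the terminal 600 would actually present.

```python
from enum import Enum


class CallState(Enum):
    WAITING = "waiting"      # before the call is connected: nothing is recorded
    RECORDING = "recording"  # call connected: audio goes to a temporary buffer
    ENDED = "ended"


class CallRecorder:
    def __init__(self, confirm_save):
        self.confirm_save = confirm_save  # callable returning True to keep the file
        self.state = CallState.WAITING
        self.buffer = []        # temporary recording
        self.saved_files = []   # recordings the user chose to keep

    def on_call_connected(self):
        self.state = CallState.RECORDING
        self.buffer = []

    def on_audio(self, samples):
        # Audio arriving outside an active call is ignored.
        if self.state is CallState.RECORDING:
            self.buffer.extend(samples)

    def on_call_ended(self):
        """Ask the user whether to keep the temporary recording."""
        if self.state is not CallState.RECORDING:
            return False
        self.state = CallState.ENDED
        keep = bool(self.confirm_save())
        if keep:
            self.saved_files.append(list(self.buffer))
        self.buffer = []
        return keep


recorder = CallRecorder(confirm_save=lambda: True)
recorder.on_audio([0.1])          # ignored: the call is not connected yet
recorder.on_call_connected()
recorder.on_audio([0.2, 0.3])
saved = recorder.on_call_ended()
```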

The voice recording unit 620 may convert the voice data corresponding to the recording file into text data and store the text data in the terminal 600 together with the recording file. To do so, the voice recording unit 620 may apply a speech-to-text (STT) function to the recording file.

The voice feature extraction unit 630 may extract the user's voice feature information from the recording files generated over a preset period and store it in the terminal 600. To do so, it may use a speech processing method such as the mel-frequency cepstrum (MFC). It may also extract the voice feature information by inputting the voice sample information into a trained voice feature extraction model (for example, an artificial neural network). The voice feature extraction unit 630 may represent the voice feature information as an embedding vector. Because the voice feature extraction unit 630 performs the same operations as the voice feature extraction unit 330 of FIG. 2, a detailed description is omitted.

Upon receiving a command to convert a voice signal provided by the terminal 600 into the user's voice, the voice synthesis service unit 640 may extract the user's voice feature information from the terminal 600 and apply the voice feature information and the voice signal to machine learning to generate voice data for the voice signal.

For example, the voice signals for various notifications provided by the terminal 600, such as message notifications and incoming-call notifications, may be rendered in the user's voice.

In addition, upon receiving a voice synthesis request signal including text from the user, the voice synthesis service unit 640 may extract the user's voice feature information from the terminal 600 and apply the voice feature information and the text to machine learning to generate voice data for the text. That is, the voice synthesis service unit 640 may input the text feature vectors and an embedding vector representing the voice feature information into the speech synthesis model to convert the text into voice data.

The terminal 600 according to an embodiment of the present invention may further include an input/output unit (I/O device; not shown). The input/output unit may receive input directly from the user, and may output at least one of voice, video, or text to the user.

Meanwhile, the voice recording unit 620, the voice feature extraction unit 630, and the voice synthesis service unit 640 may each be implemented by a processor or the like required to execute a program on a computing device. They may be implemented as physically independent components, or as functionally distinct parts within a single processor.

The control unit 650 controls the operation of the various components of the terminal 600, including the storage unit 610, the voice recording unit 620, the voice feature extraction unit 630, and the voice synthesis service unit 640. It may include at least one processing device, which may be a general-purpose central processing unit (CPU), a programmable device (CPLD, FPGA) configured for a specific purpose, an application-specific integrated circuit (ASIC), or a microcontroller chip.

The terminal 600 configured as described above may be an electronic device that can run a voice recording and voice synthesis service program (an application or applet) and can be used in various wired and wireless environments. For example, the terminal 600 may be a personal digital assistant (PDA), a smartphone, a cellular phone, a Personal Communication Service (PCS) phone, a Global System for Mobile (GSM) phone, a Wideband CDMA (W-CDMA) phone, a CDMA-2000 phone, a Mobile Broadband System (MBS) phone, or the like.

As described above, the apparatus, terminal, and method for providing a voice synthesis service according to an embodiment of the present invention can convert input text into the user's voice and provide the result.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely illustrative, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible.

Accordingly, the true technical scope of protection of the present invention should be determined by the following claims.

100: user terminal
200: service providing apparatus
300: voice synthesis service providing apparatus
310: communication unit
320: voice database
330, 630: voice feature extraction unit
340, 640: voice synthesis service unit
350, 650: control unit
600: terminal
610: storage unit
620: voice recording unit

Claims

1. A voice synthesis service providing apparatus comprising:
a voice database;
a voice feature extraction unit configured to receive a user's voice sample information for a suggested word, extract the user's voice feature information from the voice sample information, and store the voice feature information in the voice database; and
a voice synthesis service unit configured to, upon receiving a voice synthesis request signal including text from the user, extract the user's voice feature information from the voice database and apply the voice feature information and the text to machine learning to generate voice data for the text.

2. The apparatus of claim 1,
wherein the voice synthesis service unit provides a design template for editing a video input by the user or a previously stored video, receives editing information for the video through the design template, and, upon receiving a video dubbing request including text and dubber information, extracts the dubber's voice feature information from the voice database, applies the voice feature information and the text to machine learning to generate voice data for the text, and applies the voice data to the video.

3. The apparatus of claim 1,
wherein the voice synthesis service unit, upon receiving from the user an e-book dubbing request including an e-book and reader information, extracts the reader's voice feature information from the voice database and applies the voice feature information and the e-book to machine learning to generate voice data for the e-book.

4. A method of providing a voice synthesis service, the method comprising:
receiving, by a voice feature extraction unit, a user's voice sample information for a suggested word, extracting the user's voice feature information from the voice sample information, and storing the voice feature information in a voice database; and
when a voice synthesis service unit receives a voice synthesis request signal including text from the user, extracting the user's voice feature information from the voice database and applying the voice feature information and the text to machine learning to generate voice data for the text.

5. The method of claim 4, further comprising:
when the voice synthesis service unit receives a video editing command from the user, providing a design template for editing a video input by the user or a previously stored video, and receiving editing information for the video through the design template to edit the video; and
when the voice synthesis service unit receives a video dubbing request including text and dubber information, extracting the dubber's voice feature information from the voice database, applying the voice feature information and the text to machine learning to generate voice data for the text, and applying the voice data to the video.

6. The method of claim 4, further comprising:
when the voice synthesis service unit receives from the user an e-book dubbing request including an e-book and reader information, extracting the reader's voice feature information from the voice database and applying the voice feature information and the e-book to machine learning to generate voice data for the e-book.

7. A terminal comprising:
a storage unit;
a voice recording unit configured to detect the connection of a voice call, convert voice during the call into voice data, and generate a recording file corresponding to the voice data; and
a voice feature extraction unit configured to extract a user's voice feature information from the recording files generated over a preset period and store the voice feature information in the storage unit,
wherein the voice recording unit converts the voice data corresponding to the recording file into text data and stores the text data in the storage unit together with the recording file.

8. The terminal of claim 7,
wherein the voice recording unit filters the call partner's voice out of the call and generates the recording file for the user's voice.

9. The terminal of claim 7,
further comprising a voice synthesis service unit configured to, upon receiving a command to convert a voice signal provided by the terminal into the user's voice, extract the user's voice feature information from the storage unit and apply the voice feature information and the voice signal to machine learning to generate voice data for the voice signal.

10. The terminal of claim 9,
wherein the voice synthesis service unit, upon receiving a voice synthesis request signal including text from the user, extracts the user's voice feature information from the storage unit and applies the voice feature information and the text to machine learning to generate voice data for the text.