KR20210109222A

KR20210109222A - Device, method and computer program for synthesizing voice

Info

Publication number: KR20210109222A
Application number: KR1020200024197A
Authority: KR
Inventors: 차재욱
Original assignee: 주식회사 케이티
Priority date: 2020-02-27
Filing date: 2020-02-27
Publication date: 2021-09-06

Abstract

A device for synthesizing voice based on a tree-structured corpus database may include: a collection part for collecting example text data and example recording data corresponding to the example text data; a derivation part for deriving an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase, based on the collected example text data and example recording data; a database generation part that generates the tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and a synthesis part for synthesizing synthesized voice data corresponding to a synthesis request based on the tree-structured corpus database.

Description

DEVICE, METHOD AND COMPUTER PROGRAM FOR SYNTHESIZING VOICE

The present invention relates to an apparatus, a method, and a computer program for synthesizing speech.

Corpus-based speech synthesis produces high-quality synthesized speech by selecting the units needed for synthesis from a database in which speech is stored as synthesis units and concatenating them appropriately. Because the search for the required units takes little time, this approach generates synthesized speech quickly, and because no separate GPU module has to be installed, it is inexpensive to deploy. However, corpus-based speech synthesis requires a large volume of speech data, and because the quality of the synthesized speech is determined by the data quality of the speech corpus (that is, the labeling accuracy of the recorded data), tuning the synthesized speech takes a long time.

Deep learning-based speech synthesis, which has been under active development recently, reads text aloud in a specific person's voice and provides more natural, high-quality synthesized speech. However, its heavy computational load makes a GPU essential, which raises hardware cost. In addition, it requires a relatively large amount of computation not only during training but also during synthesis, so a GPU is essential for synthesis as well, and synthesized speech is generated slowly.

Korean Registered Patent Publication No. 10-0769033 (registered October 16, 2007)

The present invention is intended to solve the problems of the prior art described above by generating a tree-structured corpus database based on an example sentence derived from collected example text data and example recording data, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase. The present invention is further intended to synthesize synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for achieving the above technical problem, an apparatus for synthesizing speech based on a tree-structured corpus database according to a first aspect of the present invention may include: a collection unit that collects example text data and example recording data corresponding to the example text data; a derivation unit that derives, based on the collected example text data and example recording data, an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase; a database generation unit that generates a tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and a synthesis unit that synthesizes synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

A method for synthesizing speech based on a tree-structured corpus database by means of a speech synthesis apparatus according to a second aspect of the present invention may include: collecting example text data and example recording data corresponding to the example text data; deriving, based on the collected example text data and example recording data, an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase; generating a tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and synthesizing synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

A computer program stored in a medium according to a third aspect of the present invention includes a sequence of instructions for synthesizing speech based on a tree-structured corpus database. When executed by a computing device, the sequence of instructions may cause the computing device to: collect example text data and example recording data corresponding to the example text data; derive, based on the collected example text data and example recording data, an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase; generate a tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and synthesize synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

The means for solving the problem described above are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may exist in the drawings and the detailed description.

According to any one of the above means for solving the problem, the present invention can generate a tree-structured corpus database based on an example sentence derived from collected example text data and example recording data, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase. The present invention can also synthesize synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

As a result, the present invention searches the tree-structured corpus database for an example sentence or example phrase corresponding to the synthesis request and generates the synthesized speech data from the retrieved example sentence and example phrase, which increases synthesis speed.

In addition, because the present invention generates synthesized speech data using the contiguous phoneme sequence information associated with the retrieved example sentence (or example phrase), no additional smoothing step is needed, which further increases synthesis speed. Because contiguous phoneme sequence information is used, the synthesized speech also sounds natural and is free of distortion and noise.

FIG. 1 is a configuration diagram of a speech synthesis system according to an embodiment of the present invention.
FIG. 2 is a block diagram of the speech synthesis apparatus shown in FIG. 1 according to an embodiment of the present invention.
FIG. 3 is a diagram for explaining a method of labeling an example sentence through phoneme analysis and speech recognition according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining an existing corpus database.
FIGS. 5A to 5C are diagrams for explaining a tree-structured corpus database according to an embodiment of the present invention.
FIG. 6 is a flowchart illustrating a method of generating a tree-structured corpus database according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating a method of generating a tree-structured corpus database according to another embodiment of the present invention.
FIGS. 8A to 8C are diagrams for explaining a method of synthesizing speech using a tree-structured corpus database according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a method of synthesizing speech using a tree-structured corpus database according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily implement them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts of the drawings that are irrelevant to the description are omitted, and similar parts are given similar reference numerals throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, specific details for carrying out the present invention are described with reference to the accompanying configuration diagrams and process flowcharts.

FIG. 1 is a configuration diagram of a speech synthesis system according to an embodiment of the present invention.

Referring to FIG. 1, the speech synthesis system may include a speech synthesis apparatus 100 and a user terminal 110. However, the speech synthesis system of FIG. 1 is only one embodiment of the present invention, so the present invention should not be construed as limited by FIG. 1, and the system may be configured differently from FIG. 1 according to various embodiments of the present invention.

In general, the components of the speech synthesis system of FIG. 1 are connected through a network (not shown). A network is a connection structure that allows nodes such as terminals and servers to exchange information with one another, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The speech synthesis apparatus 100 may collect example text data and example recording data corresponding to the example text data.

Based on the collected example text data and example recording data, the speech synthesis apparatus 100 may derive an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase.

The speech synthesis apparatus 100 may generate a tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information.

The user terminal 110 may transmit to the speech synthesis apparatus 100 a synthesis request that includes a request phrase for speech synthesis and an emotion tag corresponding to the request phrase.

The speech synthesis apparatus 100 may synthesize synthesized speech data corresponding to the synthesis request of the user terminal 110 using the tree-structured corpus database, and may provide the synthesized speech data to the user terminal 110.

The operation of each component of the speech synthesis system of FIG. 1 is described in more detail below.

FIG. 2 is a block diagram of the speech synthesis apparatus 100 shown in FIG. 1 according to an embodiment of the present invention.

Referring to FIG. 2, the speech synthesis apparatus 100 may include a collection unit 200, a derivation unit 210, a database generation unit 220, and a synthesis unit 230. However, the speech synthesis apparatus 100 shown in FIG. 2 is only one implementation of the present invention, and various modifications are possible based on the components shown in FIG. 2.

The collection unit 200 may collect a plurality of example data items, each including example text data and example recording data corresponding to the example text data.

The collection unit 200 may also collect emotion tag information for each example sentence of the example text data included in the plurality of example data items.

A preprocessing unit (not shown) may preprocess the example recording data corresponding to the collected example text data to fit a preset format. For example, the preprocessing unit may adjust the length of the silence between the example sentences that make up the example recording data. The preprocessing unit may also adjust the volume level of the example recording data to a preset volume level. In addition, the preprocessing unit may compare the collected example text data with the corresponding example recording data and correct example recording data that contains pronunciation errors made during recording.
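The volume-level adjustment described above can be sketched as follows. This is a minimal illustration only: the text does not say how the volume level is measured, so the use of an RMS level, the function names `rms` and `normalize_volume`, and the `target_rms` value are all assumptions.

```python
import math


def rms(samples):
    """Root-mean-square level of a mono signal given as floats in [-1, 1]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def normalize_volume(samples, target_rms=0.1):
    """Scale the signal so its RMS level matches target_rms (an assumed preset).

    Samples are clipped to [-1, 1] after scaling.
    """
    level = rms(samples)
    if level == 0.0:
        return list(samples)  # pure silence: nothing to scale
    gain = target_rms / level
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

Applying the same target level to every recording keeps units taken from different recordings at comparable loudness when they are later concatenated.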

Based on the collected example text data and example recording data, the derivation unit 210 may derive an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information (triphone information) included in the example phrase.

The derivation unit 210 may split an example sentence of the collected example text data into at least one example phrase by analyzing its morphemes and dependency structure. The derivation unit 210 may also perform voice activity detection (VAD) on the example sentences of the collected example recording data and designate sections with long silence as split positions for example phrases.

The derivation unit 210 may split the example sentence into example phrases based on both the result of the morpheme and dependency structure analysis and the split positions obtained by VAD.

The database generation unit 220 may generate a tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information.

The existing corpus database and the tree-structured corpus database of the present invention are compared with reference to FIGS. 4 to 5C.

Referring to FIG. 4, the existing corpus database consists only of a phoneme sequence table containing phoneme sequence information for the phoneme sequences that make up each example sentence (for example, the start and end positions of each phoneme, the phoneme interval and intensity, the standard deviation of the phoneme interval, and the standard deviation of the phoneme intensity).

Referring to FIG. 5A, the tree-structured corpus database organizes, in tree form, information on the example sentences and example phrases associated with the phoneme sequences. It consists of an example sentence table, an example phrase table, and a phoneme sequence table associated with each phoneme sequence. Because the tree-structured corpus database contains information on example sentences and example phrases, an example sentence or example phrase needed for speech synthesis can be found more quickly, and the information stored for each retrieved example sentence and example phrase can be used to provide information on the associated phoneme sequences.

Referring to FIG. 5B, the database generation unit 220 may generate the example sentence table based on the example sentence, the emotion tag information corresponding to the example sentence, and management information on the at least one example phrase included in the example sentence. The example sentence table may include the example sentence, its sentence number, sentence text length, sentence time information, and sentence type (declarative, interrogative, and so on), the main vowel length and variance information of the example sentence, management information on the example phrases included in the example sentence (for example, the number of example phrases and their order numbers), and the emotion tag information corresponding to the example sentence.

The database generation unit 220 may generate the example phrase table based on the example phrase, the order information of the example phrase, and management information on the at least one piece of phoneme sequence information included in the example phrase. The example phrase table may include the example phrase, its order number, phrase text length, phrase time, and phrase position, the main vowel length and variance information of the example phrase, the number of phoneme sequences included in the example phrase, and management information on the phoneme sequence information (for example, the order numbers of the phoneme sequences and their start and end positions).

The database generation unit 220 may generate the phoneme sequence table based on the phoneme sequence information and the order information of the phoneme sequence information. The phoneme sequence table may include the order information of the phoneme sequence information together with phoneme sequence information comprising the phoneme sequence (triphone), the order number of the phoneme sequence, the phoneme interval (the start and end positions of the phoneme, or the phoneme length), the phoneme intensity, the standard deviation of the phoneme interval, and the standard deviation of the phoneme intensity.

The database generation unit 220 may generate the tree-structured corpus database by mapping the example sentence table, the example phrase table, and the phoneme sequence table to one another.

The database generation unit 220 may map the example sentence table to the example phrase table based on the management information included in the example sentence table and the order information included in the example phrase table.

The database generation unit 220 may map the example phrase table to the phoneme sequence table based on the management information included in the example phrase table and the order information included in the phoneme sequence table.
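The three mapped tables can be pictured with a small in-memory sketch. This is an illustration under stated assumptions, not the storage layout of the invention: the class names (`SentenceRow`, `PhraseRow`, `PhonemeSequenceRow`, `TreeCorpus`), the reduced set of fields, and the dictionary used as a phrase index are all hypothetical, and nested Python lists stand in for the mapping that the text performs through management information and order numbers.

```python
from dataclasses import dataclass, field


@dataclass
class PhonemeSequenceRow:
    order: int      # order number within the phrase
    triphone: str   # phoneme sequence, e.g. "ㅌㅡㄹ"
    start: float    # start position (seconds, assumed unit)
    end: float      # end position (seconds, assumed unit)


@dataclass
class PhraseRow:
    order: int      # order number within the sentence
    text: str
    phoneme_rows: list = field(default_factory=list)


@dataclass
class SentenceRow:
    number: int
    text: str
    emotion_tag: str
    phrase_rows: list = field(default_factory=list)


class TreeCorpus:
    """Sentence -> phrase -> phoneme sequence tree with a phrase lookup index."""

    def __init__(self):
        self.sentences = {}
        self._phrase_index = {}  # (phrase text, emotion tag) -> [PhraseRow, ...]

    def add_sentence(self, sentence):
        self.sentences[sentence.number] = sentence
        for phrase in sentence.phrase_rows:
            key = (phrase.text, sentence.emotion_tag)
            self._phrase_index.setdefault(key, []).append(phrase)

    def find_phrase(self, text, emotion_tag):
        """Return every stored phrase matching the text and emotion tag."""
        return self._phrase_index.get((text, emotion_tag), [])
```

A lookup by phrase text returns the phrase together with its contiguous phoneme sequences in one step, whereas a flat phoneme sequence table would have to be searched unit by unit.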

The reason for organizing the corpus database as a tree is described with reference to FIG. 5C.

Referring to FIG. 5C, the derivation unit 210 may split the example sentence 'USB 음악이 듣고 싶으면, USB 음악 틀어줘라고 말해보세요' ("If you want to listen to USB music, say 'play USB music'") into a first example phrase, 'USB 음악이 듣고 싶으면' ("If you want to listen to USB music"), and a second example phrase, 'USB 음악 틀어줘라고 말해보세요' ("say 'play USB music'"). In practice, the example text data given to the voice actor who records the example recording data is marked with pause positions. Long or short pauses may be assigned, through analysis of the morphemes and dependency structure of the example text data, at points where the meaning divides (for example, adnominal phrases in a sentence, phrases where the meaning changes (such as before a negation), and individual items of information (such as numbers)). Phoneme sequence information is not joined across a long pause, whereas across a short pause the phonemes remain connected.

The derivation unit 210 may split the example phrases at the long pauses in the example recording data. Specifically, the derivation unit 210 may classify the portions of the example recording data at or above a certain volume as VAD sections and the portions at or below a certain volume as silent sections, and may designate as split points of the example phrases those silent sections whose length exceeds the average by at least a certain ratio (that is, the long pauses).
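The split-point selection described above can be sketched as follows. The sketch assumes the recording has already been reduced to one volume value per frame; the function name `find_split_points`, the default `ratio`, the choice to ignore leading and trailing silence, and the use of the silence midpoint as the split position are assumptions made for illustration.

```python
def find_split_points(frame_levels, volume_threshold, ratio=1.5):
    """Return the middle frame index of each 'long' interior silence.

    frame_levels: per-frame volume values (assumed precomputed).
    A frame below volume_threshold counts as silence; a silence is 'long'
    when its length exceeds the mean silence length times ratio.
    """
    runs = []      # half-open (start, end) frame ranges of silence
    start = None
    for i, level in enumerate(frame_levels):
        if level < volume_threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(frame_levels)))

    # Ignore silence at the very beginning or end of the recording.
    interior = [(s, e) for s, e in runs if s > 0 and e < len(frame_levels)]
    if not interior:
        return []
    mean_len = sum(e - s for s, e in interior) / len(interior)
    return [(s + e) // 2 for s, e in interior if (e - s) > mean_len * ratio]
```

With short pauses of 2 frames and one pause of 10 frames, only the 10-frame pause exceeds 1.5 times the mean length, so it alone becomes a phrase boundary.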

A calculation unit (not shown) may calculate the minimum cost of an example sentence during training or speech synthesis. The calculation unit may also calculate the cost of each phrase that makes up the example sentence. The cost may be calculated using [Equation 1], and the calculated cost may be used to generate the synthesized speech. As a result, performing speech synthesis with contiguous speech sequence information can produce natural-sounding synthesized speech.

[Equation 1]

Cost of sentence = w1*TC + w2*JC

Here, TC (Target Cost) is calculated with a designated formula such as MSE (Mean Squared Error), and JC (Join Cost) becomes higher the more speech sequence information is taken from the same phrase, with a higher weight assigned in particular when the speech sequence information is contiguous.
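[Equation 1] can be written out as a short sketch. The text names MSE as one possible formula for TC but does not define the feature vectors, the computation of JC, or the values of w1 and w2, so the function names, the feature lists, and the default weights below are assumptions; JC is simply taken as an input.

```python
def mean_squared_error(target, candidate):
    """MSE between two equal-length feature lists (one option for TC)."""
    if not target or len(target) != len(candidate):
        raise ValueError("feature lists must be non-empty and of equal length")
    return sum((t - c) ** 2 for t, c in zip(target, candidate)) / len(target)


def sentence_cost(tc, jc, w1=1.0, w2=1.0):
    """Cost of sentence = w1*TC + w2*JC, as in [Equation 1]."""
    return w1 * tc + w2 * jc


# Example: target features vs. candidate features, with an assumed join cost.
tc = mean_squared_error([1.0, 2.0], [1.0, 4.0])    # (0 + 4) / 2
cost = sentence_cost(tc, jc=0.5, w1=0.5, w2=2.0)   # 0.5*2.0 + 2.0*0.5
```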

For each of the separated first and second example phrases, the derivation unit 210 may derive the phoneme sequence information (for example, triphones) that makes up the example phrase. For example, for '틀어줘' ("play"), first phoneme sequence information consisting of 'ㅌㅡㄹ', second phoneme sequence information consisting of 'ㅡㄹㅓ', third phoneme sequence information consisting of 'ㄹㅓㅈ', and fourth phoneme sequence information consisting of 'ㅓㅈㅝ' may be derived. In addition, the approximate positions of the phoneme sequence information within an example phrase may be derived through deep learning that links a speech recognition system with a phoneme analysis system.
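The '틀어줘' example can be reproduced with a short sketch that decomposes Hangul syllables into jamo using the standard Unicode arithmetic and then slides a window of three over the result. The function names are hypothetical, and dropping the silent initial 'ㅇ' is a simplification chosen to match the example; pronunciation rules that a real phoneme analysis would apply are not modeled.

```python
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"       # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def to_jamo(text):
    """Decompose precomposed Hangul syllables into a flat list of jamo."""
    jamo = []
    for ch in text:
        offset = ord(ch) - 0xAC00
        if not 0 <= offset < 11172:
            continue  # skip anything that is not a precomposed Hangul syllable
        initial, rest = divmod(offset, 21 * 28)
        medial, final = divmod(rest, 28)
        if CHOSEONG[initial] != "ㅇ":  # simplification: drop the silent initial ㅇ
            jamo.append(CHOSEONG[initial])
        jamo.append(JUNGSEONG[medial])
        if final:
            jamo.append(JONGSEONG[final])
    return jamo


def triphones(text):
    """Return every run of three consecutive jamo as one triphone string."""
    jamo = to_jamo(text)
    return ["".join(jamo[i:i + 3]) for i in range(len(jamo) - 2)]


print(triphones("틀어줘"))  # ['ㅌㅡㄹ', 'ㅡㄹㅓ', 'ㄹㅓㅈ', 'ㅓㅈㅝ']
```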

Hereinafter, a method of constructing a tree-structured corpus database will be described.

The database generation unit 220 may divide the example text data and example recording data into first example data and second example data, and may synthesize any one example sentence included in the second example data (as synthesized speech data) using the tree-structured corpus database generated from the first example data. For example, the database generation unit 220 may classify a first ratio (e.g., 50%) of the plurality of example data, which include the collected example text data and example recording data, as the first example data, and classify the remaining example data, also amounting to the first ratio, as the second example data.
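The two-way division can be sketched as follows. The function name `split_examples`, the seeded shuffle, and the rounding of the cut point are assumptions for illustration; the description only states that a first ratio (e.g., 50%) goes to the first example data and the remainder to the second.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_examples(examples: Sequence[T], first_ratio: float = 0.5,
                   seed: int = 0) -> Tuple[List[T], List[T]]:
    """Split examples into (first example data, second example data)."""
    if not 0.0 < first_ratio < 1.0:
        raise ValueError("first_ratio must be strictly between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # shuffling is an assumption, not from the source
    cut = round(len(shuffled) * first_ratio)
    return shuffled[:cut], shuffled[cut:]
```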

The database generation unit 220 may compare the synthesized example sentence with the example recording data included in the second example data and optimize the tree-structured corpus database so as to reduce the difference.

In another embodiment, the database generation unit 220 may classify a second ratio (e.g., 40%) of the plurality of example data, which include the collected example text data and example recording data, as the first example data, classify example data of the first ratio as the second example data, and classify a third ratio (e.g., 20%) of the example data as third example data. Here, the first example data and the second example data may be used in the training process of speech synthesis, and the third example data may be used in the final test of speech synthesis.

The database generation unit 220 may perform phoneme analysis and speech recognition on the example recording data included in the first example data to select initial positions of the phonemes constituting the example recording data. In doing so, the database generation unit 220 may work in conjunction with a speech recognition system and a phoneme analysis system to select those initial positions.

Here, the speech recognition system provides phoneme and time information for a recognized sentence, but because it inherently uses units that differ from the phoneme units used by the speech synthesis apparatus 100, it cannot accurately provide the pronunciation positions of the phonemes that the speech synthesis apparatus 100 requires. The phoneme analysis system identifies as a vowel any frequency section in which the analysis result shows a pattern similar to a specific vowel, and sets the section over which that frequency component is maintained as the vowel section. However, although the phoneme analysis system provides vowel section information, it has difficulty identifying consonants.

For this reason, both the speech recognition system and the phoneme analysis system need to be used to determine the initial positions of phonemes. Referring briefly to FIG. 3, the database generation unit 220 may receive, through the speech recognition system, phoneme and time information for example recording data 30 containing 'USB 음악이 듣고 싶으면, USB 음악 틀어줘라고 말씀해보세요' ("If you want to listen to USB music, say 'Play USB music'"), and may receive, through the phoneme analysis system, vowel occurrence time information for the example recording data 30. The database generation unit 220 may then select the initial positions 301 of the phonemes of the example recording data 30 using both the phoneme and time information and the vowel occurrence time information.
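The description does not state how the two information sources are combined. One possible reading, shown here purely as an assumption, is to keep the speech recognition timings for consonants and to snap each recognized vowel to the vowel section from the phoneme analysis that overlaps it most. All names (`overlap`, `initial_positions`) and the tuple layout are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[float, float]            # (start, end) in seconds
Timed = Tuple[str, float, float]          # (phoneme, start, end)


def overlap(a: Interval, b: Interval) -> float:
    """Length of the overlap between two intervals (0.0 if they are disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def initial_positions(asr: List[Timed], vowel_intervals: List[Interval],
                      vowels: Set[str]) -> List[Timed]:
    """Assumed merge rule: vowels take the best-overlapping vowel section;
    consonants keep the speech recognition timing."""
    result = []
    for phoneme, start, end in asr:
        if phoneme in vowels and vowel_intervals:
            best = max(vowel_intervals, key=lambda iv: overlap((start, end), iv))
            if overlap((start, end), best) > 0.0:
                start, end = best
        result.append((phoneme, start, end))
    return result
```

The resulting positions are only initial estimates; the description refines them later by comparing synthesized speech with held-out recordings.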

The database generation unit 220 may extract prosody information from the example recording data included in the first example data, based on the initial positions of its phonemes. Specifically, the database generation unit 220 may extract, from that example recording data, the length of each phoneme section containing an initial phoneme position, together with the pitch and stress of the phoneme section.

The database generation unit 220 may generate a tree-structured first corpus database based on the example sentences derived from the example text data and example recording data included in the first example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The database generation unit 220 may further base the construction of the tree-structured first corpus database on the initial phoneme positions, the length of each phoneme section, and the pitch and stress of each phoneme section derived from the example recording data included in the first example data.
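The sentence, phrase, and phoneme sequence hierarchy described above could be represented as nested records. The following is a sketch only: the class and field names, the units, and the placement of the emotion tag on the sentence are assumptions, and the two dictionaries stand in for the example sentence table and example phrase table mentioned elsewhere in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhonemeSequence:
    phonemes: str      # e.g. a tri-phone such as 'ㅌㅡㄹ'
    start: float       # initial position in the recording (assumed: seconds)
    length: float      # length of the phoneme section
    pitch: float
    stress: float


@dataclass
class ExamplePhrase:
    text: str
    sequences: List[PhonemeSequence] = field(default_factory=list)


@dataclass
class ExampleSentence:
    text: str
    emotion_tag: str
    phrases: List[ExamplePhrase] = field(default_factory=list)


@dataclass
class CorpusDatabase:
    sentences: Dict[str, ExampleSentence] = field(default_factory=dict)
    phrases: Dict[str, ExamplePhrase] = field(default_factory=dict)

    def add(self, sentence: ExampleSentence) -> None:
        """Register a sentence and index each of its phrases by text."""
        self.sentences[sentence.text] = sentence
        for phrase in sentence.phrases:
            self.phrases[phrase.text] = phrase
```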

The database generation unit 220 may use the tree-structured corpus database generated from the first example data to generate first synthesized speech data corresponding to any one example sentence included in the second example data.

The database generation unit 220 may compare the first synthesized speech data with the example recording data included in the second example data and optimize the tree-structured first corpus database so as to reduce the difference. That is, the database generation unit 220 may optimize the first corpus database by optimizing the length (the start and end positions of a phoneme) of at least one phoneme section constituting the first corpus database.
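The description does not name an optimization procedure. As one hedged illustration, phoneme boundaries could be adjusted by a greedy local search that nudges each boundary by a small step and keeps any change that lowers a difference score. The function name, the step size, and the `difference` callback (which would compare synthesized speech against the held-out recording) are all assumptions, and the sketch does not enforce boundary ordering.

```python
from typing import Callable, List


def optimize_boundaries(boundaries: List[float],
                        difference: Callable[[List[float]], float],
                        step: float = 0.005,
                        max_rounds: int = 100) -> List[float]:
    """Greedy local search: shift one boundary at a time while the difference shrinks."""
    best = list(boundaries)
    best_diff = difference(best)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            for delta in (-step, step):
                candidate = list(best)
                candidate[i] += delta
                d = difference(candidate)
                if d < best_diff:
                    best, best_diff, improved = candidate, d, True
        if not improved:
            break               # no single shift helps any more
    return best
```

In practice the `difference` callback would be the costly part, since each evaluation implies re-synthesizing and comparing audio.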

The database generation unit 220 may perform phoneme analysis and speech recognition on the example recording data included in the second example data to select initial positions of the phonemes constituting the example recording data.

The database generation unit 220 may extract prosody information from the example recording data included in the second example data, based on the initial positions of its phonemes. Specifically, the database generation unit 220 may extract, from that example recording data, the length of each phoneme section containing an initial phoneme position, together with the pitch and stress of the phoneme section.

The database generation unit 220 may generate a tree-structured second corpus database based on the example sentences derived from the example text data and example recording data included in the second example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The database generation unit 220 may further base the construction of the tree-structured second corpus database on the initial phoneme positions, the length of each phoneme section, and the pitch and stress of each phoneme section derived from the example recording data included in the second example data.

The database generation unit 220 may use the tree-structured second corpus database generated from the second example data to generate second synthesized speech data corresponding to any one example sentence included in the first example data.

The database generation unit 220 may compare the second synthesized speech data with the example recording data included in the first example data and optimize the tree-structured second corpus database so as to reduce the difference.

The database generation unit 220 may merge the optimized first corpus database and second corpus database to generate a tree-structured integrated corpus database.

The database generation unit 220 may use the integrated corpus database to generate third synthesized speech data corresponding to any one example sentence included in the third example data.

The database generation unit 220 may compare the third synthesized speech data with the example recording data included in the third example data and optimize the tree-structured integrated corpus database so as to reduce the difference.

When a synthesis request is received from the user terminal 110, the synthesis unit 230 may synthesize synthesized speech data corresponding to the synthesis request based on the tree-structured corpus database. Here, the synthesis request may include a request phrase and an emotion tag corresponding to the request phrase. The synthesis request may further include information about the user terminal 110 (e.g., an authentication value and speed of the user terminal 110) and the pitch and volume that the user terminal 110 desires for the synthesized speech.

For example, suppose the synthesis unit 230 receives from the user terminal 110 a synthesis request that includes the request phrase 'USB 음악을 듣고 싶으면, USB 음악을 틀어줘라고 말씀해보세요' ("If you want to listen to USB music, say 'Play USB music'") and an emotion tag corresponding to the speaking style desired for that request phrase (e.g., an emotion tag for a 'friendly conversational style' or for a 'calm news-reading style'). In that case, the synthesis unit 230 may extract from the tree-structured corpus database an example sentence based on the request phrase and emotion tag included in the synthesis request, and synthesize the synthesized speech data using that example sentence.

The synthesis unit 230 may extract an example sentence from the tree-structured corpus database based on the request phrase and the emotion tag, and synthesize the synthesized speech data based on the extracted example sentence. Referring to FIG. 8A, for the request phrase 'USB 음악을 듣고 싶으면, USB 음악을 틀어줘라고 말씀해보세요', the synthesis unit 230 may extract, from the example sentence table of the tree-structured corpus database, an example sentence that carries the emotion tag of the request phrase and consists of a first example phrase 801 containing 'USB 음악을 듣고 싶으면' ("If you want to listen to USB music") and a second example phrase 803 containing 'USB 음악을 틀어줘라고 말씀해보세요' ("say 'Play USB music'"). It may then generate the synthesized speech data using the phoneme sequence information (TP) associated with the extracted example sentence.

Conventionally, the minimum of the target cost and join cost is obtained for each phoneme sequence derived from linguistic and prosodic analysis of the request phrase, and synthesis then goes through a smoothing process, so generating the synthesized speech inevitably takes a long time. In contrast, the present invention does not need to calculate the target cost and join cost for the request phrase: it searches the tree-structured corpus database for an example sentence (or example phrase) identical to the emotion tag and request phrase received from the user terminal, and generates the synthesized speech data using the continuous phoneme sequence information associated with the retrieved example sentence (or example phrase), which shortens the generation time. Furthermore, because the present invention requires no additional smoothing, the synthesis speed can be increased, and because continuous phoneme sequence information is used, the synthesized speech sounds natural and is free of distortion and noise.

When an example sentence corresponding to the request phrase cannot be extracted, the synthesis unit 230 may extract example phrases from the tree-structured corpus database based on at least one constituent request phrase contained in the request phrase, and synthesize the synthesized speech data based on the extracted example phrases.
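The sentence-first, phrase-second lookup can be sketched with two dictionaries keyed by text and emotion tag. The table layout, the `find_units` name, the comma-joined sentence key, and the use of lists of identifiers as stored phoneme sequence information are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Key = Tuple[str, str]              # (text, emotion tag)
Table = Dict[Key, List[str]]       # stored phoneme sequence identifiers (hypothetical)


def find_units(phrases: Sequence[str], emotion_tag: str,
               sentence_table: Table,
               phrase_table: Table) -> List[Optional[List[str]]]:
    """Try the whole sentence first; otherwise look up each constituent phrase.

    A None entry marks a phrase with no exact match in the phrase table."""
    sentence = ", ".join(phrases)
    hit = sentence_table.get((sentence, emotion_tag))
    if hit is not None:
        return [hit]
    return [phrase_table.get((p, emotion_tag)) for p in phrases]
```

Because both lookups are direct key accesses, no per-unit cost search is needed when an exact match exists, which is the speed advantage the description attributes to this approach.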

When an example phrase identical to a constituent request phrase cannot be extracted, the synthesis unit 230 may determine whether that constituent request phrase can be derived from a combination of at least two example phrases included in the tree-structured corpus database. The synthesis unit 230 may extract from the tree-structured corpus database the two example phrases determined to allow such derivation, and synthesize the synthesized speech data based on the extracted example phrases.

Referring to FIGS. 8B and 8C together, when no example sentence corresponding to the request phrase 'USB 음악 듣기를 원하시면, USB 음악을 틀어줘라고 말씀해보세요' ("If you would like to listen to USB music, say 'Play USB music'") exists in the example sentence table of the tree-structured corpus database, the synthesis unit 230 may search the example phrase table of the database for example phrases corresponding to the first request phrase, 'USB 음악 듣기를 원하시면', and the second request phrase, 'USB 음악을 틀어줘라고 말씀해보세요'. If the example phrase table contains an example phrase 803 identical to the second request phrase, the synthesized speech data for the second request phrase can be synthesized using the phoneme sequence information associated with that example phrase 803. No separate step of finding candidate sequence information is then needed, so the synthesis speed can be increased.

The synthesis unit 230 may extract from the example phrase table example phrases that partially match the first request phrase 'USB 음악 듣기를 원하시면': one containing 'USB 음악이 듣고 싶으면' ("if you feel like listening to USB music") and one containing '지니 음악 듣기를 원하면' ("if you want to listen to Genie music"). It may then extract from these example phrases the parts identical to parts of the first request phrase ('USB 음악' and '음악 듣기를 원하면') and synthesize the synthesized speech data for the first request phrase using the phoneme sequence information associated with the extracted parts.
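The partial-match step can be illustrated as covering a request phrase with the longest contiguous runs found in stored example phrases. This is a sketch under assumptions: the greedy, word-level matching and the names `longest_match` and `cover` are not from the description, which does not specify the matching granularity or search order.

```python
from typing import List, Optional, Sequence, Tuple


def longest_match(request: Sequence[str], start: int,
                  examples: Sequence[Sequence[str]]) -> Tuple[int, Optional[int]]:
    """Longest run request[start:start+n] found contiguously in any example.

    Returns (n, index of the example), or (0, None) if nothing matches."""
    best_len, best_idx = 0, None
    for idx, example in enumerate(examples):
        for pos in range(len(example)):
            n = 0
            while (start + n < len(request) and pos + n < len(example)
                   and request[start + n] == example[pos + n]):
                n += 1
            if n > best_len:
                best_len, best_idx = n, idx
    return best_len, best_idx


def cover(request: Sequence[str],
          examples: Sequence[Sequence[str]]) -> Optional[List[Tuple[int, List[str]]]]:
    """Greedily cover the request with pieces of example phrases, or return None."""
    pieces: List[Tuple[int, List[str]]] = []
    i = 0
    while i < len(request):
        n, idx = longest_match(request, i, examples)
        if n == 0 or idx is None:
            return None                    # some token appears in no example phrase
        pieces.append((idx, list(request[i:i + n])))
        i += n
    return pieces
```

Preferring the longest run keeps as much continuous phoneme sequence information as possible from a single recording, in line with the description's emphasis on continuity.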

Meanwhile, those skilled in the art will readily understand that the collection unit 200, the derivation unit 210, the database generation unit 220, and the synthesis unit 230 may each be implemented separately, or that one or more of them may be implemented in an integrated form.

FIG. 6 is a flowchart illustrating a method of generating a tree-structured corpus database according to an embodiment of the present invention.

Referring to FIG. 6, in step S601, the speech synthesis apparatus 100 may perform phoneme analysis and speech recognition on the example recording data included in the first example data to select initial positions of the phonemes constituting the example recording data.

In step S603, the speech synthesis apparatus 100 may extract prosody information from the example recording data included in the first example data, based on the initial positions of its phonemes.

In step S605, the speech synthesis apparatus 100 may generate a tree-structured first corpus database based on the example sentences derived from the example text data and example recording data included in the first example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The speech synthesis apparatus 100 may further base the generation of the tree-structured first corpus database on the extracted prosody information.

In step S607, the speech synthesis apparatus 100 may use the tree-structured first corpus database generated from the first example data to generate first synthesized speech data corresponding to an example sentence included in the second example data.

In step S609, the speech synthesis apparatus 100 may compare the first synthesized speech data with the example recording data included in the second example data and optimize the first corpus database so as to reduce the difference.

In step S611, the speech synthesis apparatus 100 may perform phoneme analysis and speech recognition on the example recording data included in the second example data to select initial positions of the phonemes constituting the example recording data.

In step S613, the speech synthesis apparatus 100 may extract prosody information from the example recording data included in the second example data, based on the initial positions of its phonemes.

In step S615, the speech synthesis apparatus 100 may generate a tree-structured second corpus database based on the example sentences derived from the example text data and example recording data included in the second example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The speech synthesis apparatus 100 may further base the generation of the tree-structured second corpus database on the extracted prosody information.

In step S617, the speech synthesis apparatus 100 may use the tree-structured second corpus database generated from the second example data to generate second synthesized speech data corresponding to an example sentence included in the first example data.

In step S619, the speech synthesis apparatus 100 may compare the second synthesized speech data with the example recording data included in the first example data and optimize the second corpus database so as to reduce the difference.

In step S621, the speech synthesis apparatus 100 may generate an integrated corpus database by merging the first corpus database and the second corpus database.

In step S623, the speech synthesis apparatus 100 may use the integrated corpus database to generate third synthesized speech data corresponding to an example sentence included in the third example data.

In step S625, the speech synthesis apparatus 100 may compare the third synthesized speech data with the example recording data included in the third example data and optimize the integrated corpus database so as to reduce the difference.

In the above description, steps S601 to S625 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
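The overall flow of steps S601 to S625 resembles a two-fold cross-check followed by a final test on held-out data. The following is a sketch of that control flow only; the `build`, `optimize`, and `merge` callables are hypothetical placeholders for the operations the steps describe, and their signatures are assumptions.

```python
from typing import Callable, Sequence, TypeVar

Example = TypeVar("Example")
Database = TypeVar("Database")


def build_integrated_database(
    first: Sequence[Example],
    second: Sequence[Example],
    third: Sequence[Example],
    build: Callable[[Sequence[Example]], Database],
    optimize: Callable[[Database, Sequence[Example]], Database],
    merge: Callable[[Database, Database], Database],
) -> Database:
    """Build two databases, tune each against the other split, merge, then tune once more."""
    db1 = optimize(build(first), second)    # S601 to S609: built from first, checked on second
    db2 = optimize(build(second), first)    # S611 to S619: built from second, checked on first
    merged = merge(db1, db2)                # S621: integrated corpus database
    return optimize(merged, third)          # S623 to S625: checked on third example data
```

Each `optimize` call is assumed to synthesize sentences from the held-out split, compare them with the corresponding recordings, and return a database adjusted to reduce the difference.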

FIG. 7 is a flowchart illustrating a method of generating a tree-structured corpus database according to another embodiment of the present invention.

Referring to FIG. 7, in step S701, the speech synthesis apparatus 100 may perform phoneme analysis and speech recognition on the example recording data included in the first example data to select initial positions of the phonemes constituting the example recording data.

In step S703, the speech synthesis apparatus 100 may extract prosody information from the example recording data included in the first example data, based on the initial positions of its phonemes.

In step S705, the speech synthesis apparatus 100 may generate a tree-structured first corpus database based on the example sentences derived from the example text data and example recording data included in the first example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The speech synthesis apparatus 100 may further base the generation of the tree-structured first corpus database on the extracted prosody information.

In step S707, the speech synthesis apparatus 100 may use the tree-structured first corpus database generated from the first example data to generate first synthesized speech data corresponding to an example sentence included in the second example data.

In step S709, the speech synthesis apparatus 100 may compare the first synthesized speech data with the example recording data included in the second example data and optimize the first corpus database so as to reduce the difference.

In step S711, the speech synthesis apparatus 100 may perform phoneme analysis and speech recognition on the example recording data included in the second example data to select initial positions of the phonemes constituting the example recording data.

In step S713, the speech synthesis apparatus 100 may extract prosody information from the example recording data included in the second example data, based on the initial positions of its phonemes.

In step S715, the speech synthesis apparatus 100 may generate a tree-structured second corpus database based on the example sentences derived from the example text data and example recording data included in the second example data, the example phrases contained in the example sentences, and the phoneme sequence information contained in the example phrases. The speech synthesis apparatus 100 may further base the generation of the tree-structured second corpus database on the extracted prosody information.

In step S717, the speech synthesis apparatus 100 may use the tree-structured second corpus database generated from the second example data to generate second synthesized speech data corresponding to an example sentence included in the first example data.

In step S719, the speech synthesis apparatus 100 may compare the second synthesized speech data with the example recording data included in the first example data and optimize the second corpus database so as to reduce the difference.

In step S721, the speech synthesis apparatus 100 may generate an integrated corpus database by merging the first corpus database and the second corpus database.

In step S723, the speech synthesis apparatus 100 may use the integrated corpus database to generate third synthesized speech data corresponding to a deep learning example sentence included in deep learning example data generated by a deep learning model.

In step S725, the speech synthesis apparatus 100 may compare the third synthesized speech data with the deep learning example speech data included in the deep learning example data and optimize the integrated corpus database so as to reduce the difference.

FIG. 9 is a flowchart illustrating a method of synthesizing speech using a tree-structured corpus database according to an embodiment of the present invention.

도 9를 참조하면, 단계 S901에서 음성 합성 장치(100)는 예제 텍스트 데이터 및 예제 텍스트 데이터에 대응하는 예제 녹음 데이터를 수집할 수 있다. Referring to FIG. 9 , in step S901 , the speech synthesis apparatus 100 may collect example text data and example recording data corresponding to the example text data.

단계 S903에서 음성 합성 장치(100)는 수집된 예제 텍스트 데이터 및 예제 녹음 데이터에 기초하여 예제 문장, 예제 문장이 포함하는 적어도 하나의 예제 구 및 예제 구가 포함하는 적어도 하나의 음소 시퀀스 정보를 도출할 수 있다. In step S903, the speech synthesis apparatus 100 may derive an example sentence, at least one example phrase included in the example sentence, and at least one phoneme sequence information included in the example phrase based on the collected example text data and example recording data. can

단계 S905에서 음성 합성 장치(100)는 도출된 예제 문장, 예제 구 및 음소 시퀀스 정보에 기초하여 트리 구조의 말뭉치 데이터베이스를 생성할 수 있다. In operation S905, the speech synthesis apparatus 100 may generate a tree-structured corpus database based on the derived example sentences, example phrases, and phoneme sequence information.
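One way to picture the tree structure generated in step S905 is as three linked tables: sentences, phrases, and phoneme sequences. The following Python sketch is illustrative only. The class names, field names, and phoneme symbols are hypothetical, and the "management information" and "order information" mentioned in the claims are modeled here simply as lists of child identifiers and integer positions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class PhonemeSeqRow:
    seq_id: int
    order: int                      # position within the parent phrase
    phonemes: Tuple[str, ...]


@dataclass
class PhraseRow:
    phrase_id: int
    order: int                      # position within the parent sentence
    text: str
    seq_ids: List[int] = field(default_factory=list)     # links to phoneme rows


@dataclass
class SentenceRow:
    sentence_id: int
    text: str
    emotion: str                    # emotion tag for the sentence
    phrase_ids: List[int] = field(default_factory=list)  # links to phrase rows


@dataclass
class CorpusDB:
    sentences: Dict[int, SentenceRow] = field(default_factory=dict)
    phrases: Dict[int, PhraseRow] = field(default_factory=dict)
    phoneme_seqs: Dict[int, PhonemeSeqRow] = field(default_factory=dict)

    def add_sentence(
        self,
        text: str,
        emotion: str,
        phrases: Sequence[Tuple[str, Sequence[Sequence[str]]]],
    ) -> SentenceRow:
        """Insert a sentence with its phrases and phoneme sequences."""
        sentence = SentenceRow(len(self.sentences), text, emotion)
        self.sentences[sentence.sentence_id] = sentence
        for p_order, (p_text, seqs) in enumerate(phrases):
            phrase = PhraseRow(len(self.phrases), p_order, p_text)
            self.phrases[phrase.phrase_id] = phrase
            sentence.phrase_ids.append(phrase.phrase_id)
            for q_order, phonemes in enumerate(seqs):
                row = PhonemeSeqRow(len(self.phoneme_seqs), q_order, tuple(phonemes))
                self.phoneme_seqs[row.seq_id] = row
                phrase.seq_ids.append(row.seq_id)
        return sentence

    def phonemes_of(self, sentence_id: int) -> List[str]:
        """Walk sentence -> phrases -> phoneme sequences in stored order."""
        out: List[str] = []
        sentence = self.sentences[sentence_id]
        for pid in sorted(sentence.phrase_ids, key=lambda i: self.phrases[i].order):
            phrase = self.phrases[pid]
            for qid in sorted(phrase.seq_ids, key=lambda i: self.phoneme_seqs[i].order):
                out.extend(self.phoneme_seqs[qid].phonemes)
        return out


db = CorpusDB()
row = db.add_sentence(
    "good morning",
    "neutral",
    [("good", [["g", "uh", "d"]]), ("morning", [["m", "ao", "r"], ["n", "ih", "ng"]])],
)
flat = db.phonemes_of(row.sentence_id)
```

Storing only child identifiers in each parent row keeps the three tables independent, so a phrase or phoneme sequence can be looked up directly without traversing from the sentence level.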

In step S907, the speech synthesis apparatus 100 may synthesize synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database. Here, the synthesis request may include a request phrase and an emotion tag corresponding to the request phrase.
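The lookup in step S907 can be sketched as follows, under several assumptions that are not stated in the description: the database is flattened into two hypothetical dictionaries keyed by (text, emotion tag), speech data is a list of float samples, and a request is split into phrases on whitespace. The fallback from sentence-level to phrase-level lookup follows the behavior recited in the dependent claims. The function name `synthesize` is illustrative.

```python
from typing import Dict, List, Optional, Tuple

Waveform = List[float]               # assumption: samples as plain floats
Index = Dict[Tuple[str, str], Waveform]  # (text, emotion tag) -> recording


def synthesize(
    request_text: str,
    emotion: str,
    sentence_index: Index,
    phrase_index: Index,
) -> Optional[Waveform]:
    """Return speech for the request, or None if it cannot be assembled."""
    # 1) Try to find a whole example sentence matching text and emotion tag.
    hit = sentence_index.get((request_text, emotion))
    if hit is not None:
        return list(hit)

    # 2) Fall back to concatenating example phrases (whitespace split is an
    #    assumption; a real system would use its own phrase segmentation).
    out: Waveform = []
    for phrase in request_text.split():
        unit = phrase_index.get((phrase, emotion))
        if unit is None:
            return None
        out.extend(unit)
    return out


sentence_index: Index = {("good morning", "happy"): [0.1, 0.2]}
phrase_index: Index = {
    ("good", "happy"): [0.3],
    ("night", "happy"): [0.4, 0.5],
}

whole = synthesize("good morning", "happy", sentence_index, phrase_index)
joined = synthesize("good night", "happy", sentence_index, phrase_index)
missing = synthesize("good evening", "happy", sentence_index, phrase_index)
```

The first request is served from the sentence level, the second is assembled from two phrase-level units, and the third cannot be assembled because no unit exists for "evening" under the requested emotion tag.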

In the above description, steps S901 to S907 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.

An embodiment of the present invention may also be implemented in the form of a recording medium containing computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include any computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the following claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: speech synthesis apparatus
110: user terminal
200: collection unit
210: derivation unit
220: database generation unit
230: synthesis unit

Claims

1. An apparatus for synthesizing speech based on a tree-structured corpus database, the apparatus comprising:
a collection unit for collecting example text data and example recording data corresponding to the example text data;
a derivation unit for deriving an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase, based on the collected example text data and example recording data;
a database generation unit for generating the tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and
a synthesis unit for synthesizing synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

2. The apparatus of claim 1,
wherein the database generation unit generates an example sentence table based on the example sentence, emotion tag information corresponding to the example sentence, and management information on at least one example phrase included in the example sentence.

3. The apparatus of claim 2,
wherein the database generation unit generates an example phrase table based on the example phrase, order information of the example phrase, and management information on at least one piece of phoneme sequence information included in the example phrase,
and maps the example sentence table and the example phrase table to each other based on the management information included in the example sentence table and the order information included in the example phrase table.

4. The apparatus of claim 3,
wherein the database generation unit generates a phoneme sequence table based on the phoneme sequence information and order information of the phoneme sequence information,
and maps the example phrase table and the phoneme sequence table to each other based on the management information included in the example phrase table and the order information included in the phoneme sequence table.

5. The apparatus of claim 1,
wherein the database generation unit:
divides the example text data and the example recording data into first example data and second example data;
synthesizes any one example sentence included in the second example data using a tree-structured corpus database generated based on the first example data; and
compares the synthesized example sentence with the example recording data included in the second example data and optimizes the tree-structured corpus database so as to reduce the difference between them.

6. The apparatus of claim 1,
wherein the synthesis request includes a request phrase and an emotion tag corresponding to the request phrase,
and wherein the synthesis unit extracts an example sentence from the tree-structured corpus database based on the request phrase and the emotion tag, and synthesizes the synthesized speech data based on the extracted example sentence.

7. The apparatus of claim 6,
wherein, when an example sentence corresponding to the request phrase cannot be extracted, the synthesis unit extracts an example phrase from the tree-structured corpus database based on at least one request phrase included in the request phrase, and synthesizes the synthesized speech data based on the extracted example phrase.

8. The apparatus of claim 7,
wherein the synthesis unit determines whether at least one request phrase included in the request phrase can be derived from a combination of at least two example phrases included in the tree-structured corpus database, extracts the two example phrases determined to be derivable from the tree-structured corpus database, and synthesizes the synthesized speech data based on the extracted example phrases.

9. A method for synthesizing speech based on a tree-structured corpus database by a speech synthesis apparatus, the method comprising:
collecting example text data and example recording data corresponding to the example text data;
deriving an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase, based on the collected example text data and example recording data;
generating the tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and
synthesizing synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.

10. The method of claim 9,
wherein generating the corpus database comprises
generating an example sentence table based on the example sentence, emotion tag information corresponding to the example sentence, and management information on at least one example phrase included in the example sentence.

11. The method of claim 10,
wherein generating the corpus database comprises:
generating an example phrase table based on the example phrase, order information of the example phrase, and management information on at least one piece of phoneme sequence information included in the example phrase; and
mapping the example sentence table and the example phrase table to each other based on the management information included in the example sentence table and the order information included in the example phrase table.

12. The method of claim 11,
wherein generating the corpus database comprises:
generating a phoneme sequence table based on the phoneme sequence information and order information of the phoneme sequence information; and
mapping the example phrase table and the phoneme sequence table to each other based on the management information included in the example phrase table and the order information included in the phoneme sequence table.

13. The method of claim 9,
wherein generating the corpus database comprises:
dividing the example text data and the example recording data into first example data and second example data;
synthesizing any one example sentence included in the second example data using a tree-structured corpus database generated based on the first example data; and
comparing the synthesized example sentence with the example recording data included in the second example data and optimizing the tree-structured corpus database so as to reduce the difference between them.

14. The method of claim 9,
wherein the synthesis request includes a request phrase and an emotion tag corresponding to the request phrase,
and wherein synthesizing the synthesized speech data comprises
extracting an example sentence from the tree-structured corpus database based on the request phrase and the emotion tag, and synthesizing the synthesized speech data based on the extracted example sentence.

15. The method of claim 14,
wherein synthesizing the synthesized speech data comprises,
when an example sentence corresponding to the request phrase cannot be extracted, extracting an example phrase from the tree-structured corpus database based on at least one request phrase included in the request phrase, and synthesizing the synthesized speech data based on the extracted example phrase.

16. The method of claim 15,
wherein synthesizing the synthesized speech data comprises:
determining whether at least one request phrase included in the request phrase can be derived from a combination of at least two example phrases included in the tree-structured corpus database; and
extracting the two example phrases determined to be derivable from the tree-structured corpus database, and synthesizing the synthesized speech data based on the extracted example phrases.

17. A computer program stored in a medium and including a sequence of instructions for synthesizing speech based on a tree-structured corpus database,
wherein the computer program, when executed by a computing device, includes a sequence of instructions for:
collecting example text data and example recording data corresponding to the example text data;
deriving an example sentence, at least one example phrase included in the example sentence, and at least one piece of phoneme sequence information included in the example phrase, based on the collected example text data and example recording data;
generating the tree-structured corpus database based on the derived example sentence, example phrase, and phoneme sequence information; and
synthesizing synthesized speech data corresponding to a synthesis request based on the tree-structured corpus database.