KR20220071523A

KR20220071523A - A method and a TTS system for segmenting a sequence of characters

Info

Publication number: KR20220071523A
Application number: KR1020200158770A
Authority: KR
Inventors: 강진범; 주동원; 이승재; 남용욱
Original assignee: 주식회사 자이냅스
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2022-05-31

Abstract

An objective of the present invention is to provide an artificial intelligence-based speech synthesis technology capable of producing natural speech that sounds as if a real speaker were talking. According to an embodiment of the present invention, provided is a method for segmenting a sequence of characters composed in a specific natural language, which comprises: generating a first group including a plurality of subsequences by segmenting the sequence based on at least one punctuation mark included in the sequence; generating a third subsequence by merging a first subsequence included in the first group with a second subsequence adjacent to the first subsequence when the first subsequence is shorter than a first threshold length; generating a second group by updating the first group based on the third subsequence; generating a plurality of fifth subsequences by segmenting a fourth subsequence included in the second group according to a predetermined criterion when the fourth subsequence is longer than a second threshold length; and generating a third group by updating the second group based on the plurality of fifth subsequences.

Description

A method and a TTS system for segmenting a sequence of characters

The present disclosure relates to a method for segmenting a sequence of characters and to a speech synthesis system.

With the recent development of artificial intelligence technology, interfaces that use voice signals are becoming common. Accordingly, speech synthesis technology, which allows synthesized speech to be uttered according to a given situation, is being actively researched.

Combined with artificial intelligence-based speech recognition technology, speech synthesis technology is applied in many fields, such as virtual assistants, audiobooks, automatic interpretation and translation, and virtual voice actors.

Conventional speech synthesis methods include concatenative synthesis (unit selection synthesis, USS) and statistical parametric synthesis (HMM-based speech synthesis, HTS). The USS method cuts speech data into phoneme units and stores them, and at synthesis time finds and concatenates the speech units suitable for the utterance. The HTS method extracts parameters corresponding to speech features to build a statistical model and reconstructs text into speech based on the statistical model. However, these conventional speech synthesis methods have many limitations in synthesizing natural speech that reflects a speaker's speaking style or emotional expression.

Accordingly, speech synthesis methods that synthesize speech from text based on an artificial neural network have recently been attracting attention.

An object is to provide a method for segmenting a sequence of characters and a speech synthesis system. Another object is to provide an artificial intelligence-based speech synthesis technology capable of producing natural speech that sounds as if a real speaker were talking. A further object is to provide a highly efficient artificial intelligence-based speech synthesis technology that uses a small amount of training data.

The technical problems to be solved are not limited to those described above, and other technical problems may be inferred.

According to one aspect, a method of segmenting a sequence of characters composed in a specific natural language includes: generating a first group including a plurality of subsequences by segmenting the sequence based on at least one punctuation mark included in the sequence; when the length of a first subsequence included in the first group is shorter than a first threshold length, generating a third subsequence by merging the first subsequence with a second subsequence adjacent to the first subsequence; generating a second group by updating the first group based on the third subsequence; when the length of a fourth subsequence included in the second group is longer than a second threshold length, generating a plurality of fifth subsequences by segmenting the fourth subsequence according to a predetermined criterion; and generating a third group by updating the second group based on the plurality of fifth subsequences.
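
The steps above can be sketched roughly as follows. This is only an illustrative sketch, not the claimed implementation: the threshold values `MIN_LEN` and `MAX_LEN`, the set of punctuation marks, the choice to merge a short subsequence with the following neighbour, and the use of "cut at the last space that fits" as the predetermined criterion are all assumptions, since the source does not specify them.

```python
import re

# Hypothetical thresholds; the source gives no concrete values.
MIN_LEN = 10  # first threshold length
MAX_LEN = 40  # second threshold length (longer than MIN_LEN)


def split_on_punctuation(sequence):
    """Form the first group by splitting after punctuation marks."""
    parts = re.split(r"(?<=[.,!?;:])\s+", sequence.strip())
    return [p for p in parts if p]


def merge_short(group, min_len):
    """Form the second group by merging too-short subsequences with a neighbour."""
    merged = []
    for part in group:
        if merged and len(merged[-1]) < min_len:
            merged[-1] = merged[-1] + " " + part  # merge with the next neighbour
        else:
            merged.append(part)
    # A short trailing piece has no following neighbour, so merge it backwards.
    if len(merged) > 1 and len(merged[-1]) < min_len:
        tail = merged.pop()
        merged[-1] = merged[-1] + " " + tail
    return merged


def split_long(part, max_len):
    """Split an over-long subsequence (assumed criterion: last space that fits)."""
    pieces = []
    while len(part) > max_len:
        cut = part.rfind(" ", 0, max_len + 1)
        if cut <= 0:
            cut = max_len  # no usable space, fall back to a hard cut
        pieces.append(part[:cut].strip())
        part = part[cut:].strip()
    if part:
        pieces.append(part)
    return pieces


def segment(sequence, min_len=MIN_LEN, max_len=MAX_LEN):
    """Return the third group for the given character sequence."""
    first_group = split_on_punctuation(sequence)
    second_group = merge_short(first_group, min_len)
    third_group = []
    for part in second_group:
        third_group.extend(split_long(part, max_len))
    return third_group
```

For example, `segment("Hi. How are you today? I am fine.")` merges the short "Hi." into its neighbour and leaves the remaining sentence as its own subsequence.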

The method described above further includes transmitting the plurality of subsequences included in the third group to a synthesizer.

The method described above further includes merging a predetermined text at the end of each of the plurality of subsequences included in the third group, wherein the transmitting includes transmitting the subsequences with the predetermined text merged to the synthesizer.

In the method described above, the first threshold length is shorter than the second threshold length.

The method described above further includes, when the sequence contains at least one run of two or more spaces, modifying the sequence by replacing the run of two or more spaces with a single space.
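
A minimal sketch of this space normalization, assuming the sequence is a Python string and that only the plain space character is targeted (the function name `collapse_spaces` is hypothetical):

```python
import re


def collapse_spaces(sequence):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", sequence)
```

Running this before the punctuation-based split keeps stray whitespace from inflating the subsequence lengths that are compared against the thresholds.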

According to another aspect, a computer-readable recording medium includes a program for executing the method described above on a computer.

The speech synthesis system may segment a sequence of characters composed in a specific natural language into subsequences. The speech synthesis system may also merge a predetermined text at the end of a subsequence. Accordingly, the speech synthesis system may operate on text of an optimal length and may thereby generate an optimal spectrogram.

FIG. 1 is a diagram schematically illustrating the operation of a speech synthesis system.
FIG. 2 is a diagram illustrating an embodiment of a speech synthesis system.
FIG. 3 is a diagram illustrating an embodiment in which a mel spectrogram is output through a synthesizer.
FIG. 4 is a flowchart illustrating an example of segmenting a sequence into subsequences.
FIG. 5 is a diagram for explaining an example in which the speech synthesis system segments a sequence.
FIG. 6 is a diagram for explaining an example in which the speech synthesis system compares the length of a subsequence with the first threshold length and merges adjacent subsequences.
FIG. 7 is a diagram for explaining an example in which the speech synthesis system compares the length of a subsequence with the second threshold length and segments the subsequence.
FIG. 8 is a flowchart for explaining an example in which the speech synthesis system performs predetermined processing on a subsequence and transmits it to the synthesizer.
FIG. 9 is a diagram for explaining an example in which the speech synthesis system merges a predetermined text at the end of a subsequence.

The terms used in the present embodiments have been selected, as far as possible, from general terms that are currently in wide use, in consideration of their functions in the present embodiments; however, they may vary depending on the intention of those skilled in the art, precedent, the emergence of new technology, and the like. In certain cases, some terms have been arbitrarily selected by the applicant, and in such cases their meaning will be described in detail in the relevant part. Therefore, the terms used in the present embodiments should be defined based on their meaning and on the overall content of the present embodiments, rather than on the simple name of the term.

Since the present embodiments may be modified in various ways and may take various forms, some embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the present embodiments to a specific disclosed form, and they should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present embodiments. The terms used herein are used only to describe the embodiments and are not intended to limit them.

Unless otherwise defined, the terms used in the present embodiments have the same meanings as commonly understood by those of ordinary skill in the art to which the present embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art and, unless explicitly defined in the present embodiments, should not be interpreted in an idealized or excessively formal sense.

The following detailed description of the present invention refers to the accompanying drawings, which show by way of illustration specific embodiments in which the present invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the present invention. It should be understood that the various embodiments of the present invention differ from one another but need not be mutually exclusive. For example, specific shapes, structures, and characteristics described herein may be changed from one embodiment to another and implemented without departing from the spirit and scope of the present invention. It should also be understood that the position or arrangement of individual components within each embodiment may be changed without departing from the spirit and scope of the present invention. Accordingly, the following detailed description is not to be taken in a limiting sense, and the scope of the present invention should be taken as encompassing the scope of the claims and all equivalents thereto. In the drawings, like reference numerals refer to the same or similar elements throughout the various aspects.

Meanwhile, technical features that are described individually within one drawing in this specification may be implemented individually or simultaneously.

In this specification, a “unit” may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

Hereinafter, various embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily practice the present invention.

FIG. 1 is a diagram schematically illustrating the operation of a speech synthesis system.

A speech synthesis device is a device that artificially converts text into human speech.

For example, the speech synthesis system 100 of FIG. 1 may be a speech synthesis system based on an artificial neural network. An artificial neural network refers to any model in which artificial neurons, forming a network through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

The speech synthesis system 100 may be implemented in various types of devices, such as a personal computer (PC), a server device, a mobile device, and an embedded device. As specific examples, it may correspond to a smartphone, a tablet device, an augmented reality (AR) device, an Internet of Things (IoT) device, an autonomous vehicle, robotics, a medical device, an e-book terminal, or a navigation device that performs speech synthesis using an artificial neural network, but is not limited thereto.

Furthermore, the speech synthesis system 100 may correspond to a dedicated hardware accelerator mounted on such a device. Alternatively, the speech synthesis system 100 may be a hardware accelerator such as a neural processing unit (NPU), a tensor processing unit (TPU), or a Neural Engine, which are dedicated modules for running an artificial neural network, but is not limited thereto.

Referring to FIG. 1, the speech synthesis system 100 may receive a text input and specific speaker information. For example, as shown in FIG. 1, the speech synthesis system 100 may receive “Have a good day!” as the text input and “speaker 1” as the speaker information input.

“Speaker 1” may correspond to a voice signal or voice sample representing preset speech characteristics of speaker 1. For example, the speaker information may be received from an external device through a communication unit included in the speech synthesis system 100. Alternatively, the speaker information may be input by a user through a user interface of the speech synthesis system 100, or may be selected from among various pieces of speaker information stored in advance in a database of the speech synthesis system 100, but is not limited thereto.

The speech synthesis system 100 may output speech based on the text input and the specific speaker information received as inputs. For example, the speech synthesis system 100 may receive “Have a good day!” and “speaker 1” as inputs and output speech for “Have a good day!” that reflects the speech characteristics of speaker 1. The speech characteristics of speaker 1 may include at least one of various factors such as speaker 1's voice, prosody, pitch, and emotion. That is, the output speech may sound as if speaker 1 were naturally pronouncing “Have a good day!”. Specific operations of the speech synthesis system 100 are described below with reference to FIGS. 2 to 4.

FIG. 2 is a diagram illustrating an embodiment of a speech synthesis system. The speech synthesis system 200 of FIG. 2 may be the same as the speech synthesis system 100 of FIG. 1.

Referring to FIG. 2, the speech synthesis system 200 may include a speaker encoder 210, a synthesizer 220, and a vocoder 230. Only the components related to one embodiment are shown in the speech synthesis system 200 of FIG. 2. Accordingly, it is apparent to those skilled in the art that the speech synthesis system 200 may further include other general-purpose components in addition to those shown in FIG. 2.

The speech synthesis system 200 of FIG. 2 may receive speaker information and text as inputs and output speech.

For example, the speaker encoder 210 of the speech synthesis system 200 may receive speaker information as an input and generate a speaker embedding vector. The speaker information may correspond to a voice signal or voice sample of a speaker. The speaker encoder 210 may receive the speaker's voice signal or voice sample, extract the speaker's speech characteristics, and represent them as an embedding vector.

The speaker's speech characteristics may include at least one of various factors such as speech rate, pause intervals, pitch, timbre, prosody, intonation, or emotion. In other words, the speaker encoder 210 may represent discrete data values included in the speaker information as a vector of continuous numbers. For example, the speaker encoder 210 may generate the speaker embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a deep neural network (DNN), a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), and a bidirectional recurrent deep neural network (BRDNN).

For example, the synthesizer 220 of the speech synthesis system 200 may receive text and an embedding vector representing the speaker's speech characteristics as inputs and output speech data.

For example, the synthesizer 220 may include a text encoder (not shown) and a decoder (not shown). It is apparent to those skilled in the art that the synthesizer 220 may further include other general-purpose components in addition to these.

The embedding vector representing the speaker's speech characteristics may be generated by the speaker encoder 210 as described above, and the text encoder (not shown) or the decoder (not shown) of the synthesizer 220 may receive this embedding vector from the speaker encoder 210.

The text encoder (not shown) of the synthesizer 220 may receive text as an input and generate a text embedding vector. The text may include a sequence of characters in a specific natural language. For example, the sequence of characters may include alphabetic characters, numbers, punctuation marks, or other special characters.

The text encoder (not shown) may separate the input text into jamo (consonant and vowel letter) units, character units, or phoneme units, and may input the separated text into an artificial neural network model. For example, the text encoder (not shown) may generate the text embedding vector based on at least one, or a combination of two or more, of various artificial neural network models such as a pre-net, a CBHG module, a DNN, a CNN, an RNN, an LSTM, and a BRDNN.

Alternatively, the text encoder (not shown) may separate the input text into a plurality of short texts and generate a plurality of text embedding vectors, one for each of the short texts.

The decoder (not shown) of the synthesizer 220 may receive the speaker embedding vector and the text embedding vector from the speaker encoder 210 as inputs. Alternatively, the decoder (not shown) of the synthesizer 220 may receive the speaker embedding vector from the speaker encoder 210 and the text embedding vector from the text encoder (not shown) as inputs.

The decoder (not shown) may input the speaker embedding vector and the text embedding vector into an artificial neural network model to generate speech data corresponding to the input text. That is, the decoder (not shown) may generate speech data for the input text that reflects the speaker's speech characteristics. For example, the speech data may correspond to a spectrogram or a mel spectrogram corresponding to the input text, but is not limited thereto. In other words, the spectrogram or mel spectrogram corresponds to a verbal utterance of the sequence of characters composed in a specific natural language.

A spectrogram is a graph that visualizes the spectrum of a voice signal. The x-axis of the spectrogram represents time and the y-axis represents frequency, and the value at each frequency for each time may be expressed as a color according to its magnitude. The spectrogram may be the result of performing a short-time Fourier transform (STFT) on a continuously given voice signal.

The STFT is a method of dividing a voice signal into sections of a fixed length and applying a Fourier transform to each section. Because the result of performing the STFT on a voice signal is complex-valued, the absolute value of each complex value may be taken, which discards the phase information and yields a spectrogram containing only magnitude information.
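
As a rough illustration of the STFT and magnitude steps just described, the following standard-library sketch frames a signal, applies a naive discrete Fourier transform to each frame, and keeps only the absolute values. The function name and the `frame_len` and `hop` values are assumptions chosen for illustration; practical implementations typically also apply a window function and use an FFT, which are omitted here for brevity.

```python
import cmath
import math


def stft_magnitude(signal, frame_len=8, hop=4):
    """Return one magnitude spectrum per fixed-length frame of the signal."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequency bins only
            x_k = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            spectrum.append(abs(x_k))  # magnitude kept, phase discarded
        frames.append(spectrum)
    return frames
```

Each inner list is one column of the spectrogram: time advances across frames, and frequency increases with the bin index `k`.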

A mel spectrogram is a spectrogram whose frequency spacing has been rescaled to the mel scale. Human hearing is more sensitive in low-frequency bands than in high-frequency bands, and the mel scale expresses the relationship between physical frequency and the frequency actually perceived by a person, reflecting this characteristic. A mel spectrogram may be generated by applying a filter bank based on the mel scale to a spectrogram.
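
To make the mel scale concrete, here is a small sketch using one widely used conversion formula, m = 2595 log10(1 + f / 700). The source does not say which mel formula or how many filters are used, so the formula, the function names, and the parameters below are assumptions for illustration only.

```python
import math


def hz_to_mel(f_hz):
    """Convert a frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies in Hz of filters spaced evenly on the mel scale."""
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_filters + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_filters)]
```

Because the filters are evenly spaced in mels, their centres are packed closely at low frequencies and spread out at high frequencies, which mirrors the greater low-frequency sensitivity described above.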

Although not shown in FIG. 2, the synthesizer 220 may further include an attention module for generating an attention alignment. The attention module learns which of the outputs of all time steps of the text encoder (not shown) is most relevant to the output of the decoder (not shown) at a specific time step. Using the attention module, a higher-quality spectrogram or mel spectrogram may be output.

FIG. 3 is a diagram illustrating an embodiment in which a mel spectrogram is output through a synthesizer. The synthesizer 300 of FIG. 3 may be the same as the synthesizer 220 of FIG. 2.

Referring to FIG. 3, the synthesizer 300 may receive a list including input texts and the speaker embedding vectors corresponding to them. For example, the synthesizer 300 may receive as input a list 310 including the input text 'first sentence' and its corresponding speaker embedding vector embed_voice1, the input text 'second sentence' and its corresponding speaker embedding vector embed_voice2, and the input text 'third sentence' and its corresponding speaker embedding vector embed_voice3.

The synthesizer 300 may generate as many mel spectrograms 320 as there are input texts in the received list 310. Referring to FIG. 3, it can be seen that mel spectrograms corresponding to the input texts 'first sentence', 'second sentence', and 'third sentence' have been generated.

Alternatively, the synthesizer 300 may generate, together, as many mel spectrograms 320 and attention alignments as there are input texts. Although not shown in FIG. 3, attention alignments corresponding to the input texts 'first sentence', 'second sentence', and 'third sentence' may additionally be generated, for example. Alternatively, the synthesizer 300 may generate a plurality of mel spectrograms and a plurality of attention alignments for each of the input texts.

Referring back to FIG. 2, the vocoder 230 of the speech synthesis system 200 may convert the speech data output from the synthesizer 220 into actual speech. As described above, the output speech data may be a spectrogram or a mel spectrogram.

For example, the vocoder 230 may convert the speech data output from the synthesizer 220 into an actual voice signal using an inverse short-time Fourier transform (ISTFT). However, because a spectrogram or mel spectrogram does not contain phase information, the actual voice signal cannot be perfectly restored with the ISTFT alone.

Accordingly, the vocoder 230 may convert the speech data output from the synthesizer 220 into an actual voice signal using, for example, the Griffin-Lim algorithm. The Griffin-Lim algorithm estimates phase information from the magnitude information of a spectrogram or mel spectrogram.

Alternatively, the vocoder 230 may convert the speech data output from the synthesizer 220 into an actual speech signal using, for example, a neural vocoder.

A neural vocoder is an artificial neural network model that takes a spectrogram or Mel spectrogram as input and generates a speech signal. It can learn the relationship between spectrograms (or Mel spectrograms) and speech signals from a large amount of data, and can thereby generate high-quality speech signals.

The neural vocoder may be a vocoder based on an artificial neural network model such as WaveNet, Parallel WaveNet, WaveRNN, WaveGlow, or MelGAN, but is not limited thereto.

For example, the WaveNet vocoder is an autoregressive model that consists of multiple dilated causal convolution layers and exploits the sequential characteristics of speech samples. The WaveRNN vocoder is an autoregressive model in which WaveNet's stack of dilated causal convolution layers is replaced with a Gated Recurrent Unit (GRU).

For example, the WaveGlow vocoder may be trained, using an invertible transform function, to map a speech dataset (x) to a simple distribution such as a Gaussian distribution. After training, the WaveGlow vocoder can output a speech signal from a sample of the Gaussian distribution by applying the inverse of the transform function.
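The invertible-transform idea can be written using the change-of-variables formula common to flow-based models. The notation below (the transform f_θ, the latent variable z) is introduced here for illustration and does not appear in the source; conditioning on the Mel spectrogram is omitted for brevity.

```latex
% Training: map speech x to a latent z that should follow a standard Gaussian
z = f_\theta(x), \qquad z \sim \mathcal{N}(0, I)

% Log-likelihood maximized during training (change of variables)
\log p_\theta(x) = \log \mathcal{N}\!\left(f_\theta(x);\, 0, I\right)
                 + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|

% Synthesis: sample z from the Gaussian and apply the inverse transform
\hat{x} = f_\theta^{-1}(z)
```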

As described above with reference to FIGS. 2 and 3, the synthesizer 220, 300 generates a spectrogram (or Mel spectrogram) using text and a speaker embedding vector. Here, the text may be a sequence of characters in a specific natural language.

The synthesizer 220, 300 includes an encoder neural network and an attention-based decoder recurrent neural network. The encoder neural network processes the sequence corresponding to the text to generate an encoded representation of each character in the sequence. For each decoder input in the sequence supplied from the encoder neural network, the attention-based decoder neural network processes the decoder input and the encoded representations to generate a single frame of the spectrogram.

Meanwhile, when the sequence is too long or too short, the synthesizer 220, 300 may fail to generate a high-quality spectrogram (or Mel spectrogram). More specifically, the attention-based decoder neural network included in the synthesizer 220, 300 may fail to generate a high-quality spectrogram (or Mel spectrogram) for such a sequence.

Accordingly, the speech synthesis system 100, 200 according to an embodiment divides the sequence input to the synthesizer 220, 300 into a plurality of subsequences. Each of the resulting subsequences has a length suited to the synthesizer 220, 300 generating a high-quality spectrogram (or Mel spectrogram).

Hereinafter, examples in which the speech synthesis system 100, 200 divides a sequence of characters in a specific natural language into a plurality of subsequences are described with reference to FIGS. 4 to 9. For example, the module that divides the sequence into subsequences may be the speaker encoder 210, the synthesizer 220, 300, or a separate module included in the speech synthesis system 100, 200.

In the following, the terms spectrogram and Mel spectrogram are used interchangeably. That is, wherever a spectrogram is mentioned below, a Mel spectrogram may be substituted, and vice versa.

FIG. 4 is a flowchart illustrating an example of dividing a sequence into subsequences.

Referring to FIG. 4, the method of dividing a sequence into subsequences consists of steps processed in time series by the speech synthesis system 100, 200 shown in FIGS. 1 and 2. Therefore, even where omitted below, the description given above of the speech synthesis system 100, 200 shown in FIGS. 1 and 2 also applies to the method of FIG. 4.

In step 410, the speech synthesis system 100, 200 generates a first group including a plurality of subsequences by dividing the sequence based on at least one punctuation mark included in the sequence.

For example, when the sequence contains any of a set of predetermined punctuation marks, the speech synthesis system 100, 200 may divide the sequence at that punctuation mark. Here, the predetermined punctuation marks may include at least one of ',', '.', '?', '!', ';', '-', and '^'.
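Step 410 can be sketched as follows. This is a minimal illustration, not the implementation described in the source: the function name `split_on_punctuation`, the choice to keep each punctuation mark with the text that precedes it, and the treatment of consecutive marks as a single boundary are all assumptions.

```python
import re

# Punctuation marks listed in the description; a real system may use a subset.
PUNCTUATION = ",.?!;-^"

def split_on_punctuation(sequence, marks=PUNCTUATION):
    """Return the first group: pieces of `sequence` that each end at a run of
    punctuation marks (assumption: marks stay attached to the preceding text)."""
    cls = re.escape(marks)
    # Either text followed by one or more marks, or trailing text without a mark.
    pattern = re.compile(r"[^%s]*[%s]+|[^%s]+$" % (cls, cls, cls))
    pieces = (m.group().strip() for m in pattern.finditer(sequence))
    return [p for p in pieces if p]

first_group = split_on_punctuation("How are you? I am fine.")
```

With the example input, `first_group` holds two subsequences, one ending in '?' and one ending in '.'.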

Hereinafter, an example in which the speech synthesis system 100, 200 divides a sequence based on predetermined punctuation marks is described with reference to FIG. 5.

FIG. 5 is a diagram for explaining an example in which the speech synthesis system divides a sequence.

FIG. 5 shows an example of a sequence 510 of characters in a specific natural language. The sequence 510 contains two kinds of punctuation marks 521, 522.

The speech synthesis system 100, 200 identifies the characters and punctuation marks included in the sequence 510. It then checks whether the punctuation marks 521, 522 included in the sequence 510 are among the predetermined punctuation marks.

If the punctuation marks 521, 522 are predetermined punctuation marks, the speech synthesis system 100, 200 divides the sequence 510 into subsequences 511, 512. For example, when the punctuation marks '?' and '.' are predetermined punctuation marks, the speech synthesis system 100, 200 divides the sequence 510 at those marks to generate the subsequences 511, 512.

The speech synthesis system 100, 200 generates a first group including the subsequences 511, 512. For example, for the sequence 510 shown in FIG. 5, the first group includes a total of two subsequences 511, 512.

Referring back to FIG. 4, in step 420, when the length of a first subsequence included in the first group is shorter than a first threshold length, the speech synthesis system 100, 200 generates a third subsequence by merging the first subsequence with a second subsequence adjacent to the first subsequence.

The speech synthesis system 100, 200 compares the length of each subsequence in the first group with the first threshold length, and merges any subsequence shorter than the first threshold length with an adjacent subsequence. For example, the first threshold length may be predetermined, and may be adjusted according to the specifications of the speech synthesis system 100, 200.

Hereinafter, an example in which the speech synthesis system 100, 200 compares the length of a subsequence in the first group with the first threshold length and merges adjacent subsequences is described with reference to FIG. 6.

FIG. 6 is a diagram for explaining an example in which the speech synthesis system compares the length of a subsequence with the first threshold length and merges adjacent subsequences.

FIG. 6 shows subsequences 611, 612 included in the first group. The subsequences 611, 612 are assumed to be adjacent to each other; that is, the subsequence 612 is assumed to immediately follow the subsequence 611 in the sequence.

The speech synthesis system 100, 200 compares the length of the subsequence 611 with the first threshold length. If the subsequence 611 is shorter than the first threshold length, the speech synthesis system 100, 200 generates a subsequence 620 by merging the subsequence 611 and the subsequence 612. Here, merging means appending the subsequence 612 to the end of the subsequence 611.

If the subsequence 611 is shorter than the first threshold length, the synthesizer 220, 300 may fail to generate an optimal spectrogram. By merging the subsequence 611 and the subsequence 612, the speech synthesis system 100, 200 can therefore improve the quality of the spectrogram generated by the synthesizer 220, 300.

Referring back to FIG. 4, in step 430, the speech synthesis system 100, 200 generates a second group by updating the first group based on the third subsequence.

As described above with reference to step 420, the speech synthesis system 100, 200 may selectively merge at least some of the subsequences included in the first group. That is, when every subsequence in the first group is longer than the first threshold length, no subsequences are merged.

Accordingly, the second group may include a subsequence formed by merging some of the subsequences of the first group, or the subsequences of the first group may be carried over to the second group unchanged.
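Steps 420 and 430 can be sketched as a single pass over the first group. The details below are assumptions made for illustration: length is measured in characters, merged pieces are joined with one space, and a short subsequence at the very end (which has no following neighbor) is attached to the preceding one. The name `merge_short` is hypothetical.

```python
def merge_short(first_group, first_threshold):
    """Return the second group: each subsequence shorter than `first_threshold`
    is merged with the subsequence that follows it."""
    second_group = []
    pending = ""  # short text waiting to be merged with its successor
    for sub in first_group:
        candidate = (pending + " " + sub) if pending else sub
        if len(candidate) < first_threshold:
            pending = candidate  # still too short: carry it forward
        else:
            second_group.append(candidate)
            pending = ""
    if pending:
        # Assumption: a trailing short piece is attached to the previous one.
        if second_group:
            second_group[-1] = second_group[-1] + " " + pending
        else:
            second_group.append(pending)
    return second_group

second_group = merge_short(
    ["Yes.", "I will be there at noon.", "See you then, my friend."], 10
)
```

When no subsequence is shorter than the threshold, the function returns the first group unchanged, matching the case described above in which nothing is merged.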

In step 440, when the length of a fourth subsequence included in the second group is longer than a second threshold length, the speech synthesis system 100, 200 generates a plurality of fifth subsequences by dividing the fourth subsequence according to a predetermined criterion.

The speech synthesis system 100, 200 compares the length of each subsequence in the second group with the second threshold length, and divides any subsequence longer than the second threshold length. For example, the second threshold length may be predetermined, and may be adjusted according to the specifications of the speech synthesis system 100, 200. The second threshold length may also be set longer than the first threshold length.

Hereinafter, an example in which the speech synthesis system 100, 200 compares the length of a subsequence in the second group with the second threshold length and divides the subsequence is described with reference to FIG. 7.

FIG. 7 is a diagram for explaining an example in which the speech synthesis system compares the length of a subsequence with the second threshold length and divides the subsequence.

FIG. 7 shows a subsequence 710 included in the second group. The speech synthesis system 100, 200 compares the length of the subsequence 710 with the second threshold length. If the subsequence 710 is longer than the second threshold length, the speech synthesis system 100, 200 divides it to generate a plurality of subsequences 721, 722. Although FIG. 7 shows two subsequences 721, 722, the number is not limited thereto; the subsequence 710 may be divided into three or more subsequences.

If the subsequence 710 is longer than the second threshold length, the synthesizer 220, 300 may fail to generate an optimal spectrogram. By dividing the subsequence 710, the speech synthesis system 100, 200 can therefore improve the quality of the spectrogram generated by the synthesizer 220, 300.

The subsequence 710 may be divided according to various criteria. As one example, it may be divided at a point where a speaker uttering the subsequence 710 would naturally take a breath. As another example, it may be divided at a space contained in the subsequence 710. However, the criteria for dividing the subsequence 710 are not limited to these examples.

Referring back to FIG. 4, in step 450, the speech synthesis system 100, 200 generates a third group by updating the second group based on the plurality of fifth subsequences.

As described above with reference to step 440, the speech synthesis system 100, 200 may selectively divide at least some of the subsequences included in the second group. That is, when every subsequence in the second group is shorter than the second threshold length, no subsequences are divided.

Accordingly, the third group may include subsequences obtained by dividing some of the subsequences of the second group, or the subsequences of the second group may be carried over to the third group unchanged.
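Steps 440 and 450 can be sketched using the space-based criterion mentioned above. This is only one possible reading: the greedy packing strategy, the character-based length, and the names `split_long` and `build_third_group` are assumptions, and a breath-point criterion would need a different rule.

```python
def split_long(subsequence, second_threshold):
    """Divide `subsequence` at spaces so that each piece is at most
    `second_threshold` characters where possible (greedy packing)."""
    if len(subsequence) <= second_threshold:
        return [subsequence]
    pieces, current = [], ""
    for word in subsequence.split(" "):
        candidate = (current + " " + word) if current else word
        if current and len(candidate) > second_threshold:
            pieces.append(current)  # close the current piece at this space
            current = word
        else:
            current = candidate     # a single over-long word is kept whole
    if current:
        pieces.append(current)
    return pieces

def build_third_group(second_group, second_threshold):
    """Update the second group by replacing each long subsequence with its pieces."""
    third_group = []
    for sub in second_group:
        third_group.extend(split_long(sub, second_threshold))
    return third_group

third_group = build_third_group(
    ["the quick brown fox jumps over the lazy dog", "short one"], 20
)
```

Greedy packing can leave a short final piece (here a single word), so a practical system might rebalance the pieces or re-apply the first threshold afterwards; the source does not say how this case is handled.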

Although not shown in FIG. 4, when the sequence contains a run of two or more spaces, the speech synthesis system 100, 200 may modify the sequence by replacing that run with a single space. Specifically, before step 410 is performed, the speech synthesis system 100, 200 may check the placement of each space in the sequence and, where two or more consecutive spaces occur, replace them with a single space.
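The space normalization performed before step 410 can be sketched in one line. The function name `collapse_spaces` is hypothetical, and only the ordinary space character is handled; whether tabs or other whitespace should be treated the same way is not stated in the source.

```python
import re

def collapse_spaces(sequence):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", sequence)

normalized = collapse_spaces("How  are   you today?")
```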

Meanwhile, before transmitting the subsequences included in the third group to the synthesizer 220, 300, the speech synthesis system 100, 200 may perform predetermined processing on them. An example of this processing is described with reference to FIGS. 8 and 9.

FIG. 8 is a flowchart for explaining an example in which the speech synthesis system performs predetermined processing on a subsequence and transmits it to the synthesizer.

In step 810, the speech synthesis system 100, 200 appends a predetermined text to the end of each of the plurality of subsequences included in the third group.

A synthesizer 220, 300 based on location-sensitive attention may generate a better spectrogram when some additional text is present at the end of the subsequence. Accordingly, the speech synthesis system 100, 200 may, as needed, append a predetermined text to the end of a subsequence included in the third group.

Hereinafter, an example in which the speech synthesis system 100, 200 appends a predetermined text to the end of a subsequence is described with reference to FIG. 9.

FIG. 9 is a diagram for explaining an example in which the speech synthesis system appends a predetermined text to the end of a subsequence.

Referring to FIG. 9, the speech synthesis system 100, 200 may divide a sequence 910 to generate subsequences 911, 912. Here, the subsequences 911, 912 are assumed to be subsequences included in the third group.

The speech synthesis system 100, 200 appends a predetermined text 930 to the end of each of the subsequences 911, 912. Although FIG. 9 shows the predetermined text 930 as '.가나다라마.' (romanized 'ga-na-da-ra-ma'), it is not limited thereto. That is, the predetermined text 930 may be set in various ways so that the synthesizer 220, 300 generates a high-quality spectrogram.

Referring back to FIG. 8, in step 820, the speech synthesis system 100, 200 transmits the subsequences to which the predetermined text has been appended to the synthesizer 220, 300.

When the predetermined text 930 has been appended to the end of the subsequences 911, 912, the speech synthesis system 100, 200 also transmits information about the predetermined text 930 to the synthesizer 220, 300. The synthesizer 220, 300 can therefore ultimately generate a spectrogram from which the portion corresponding to the predetermined text 930 is excluded.
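Steps 810 and 820 can be sketched as building, for each subsequence, a small record that carries both the padded text and the padding itself. The source does not specify what form the "information about the predetermined text" takes, so the `SynthesizerInput` record and the `pad_for_synthesis` function below are hypothetical; only the example padding string is taken from FIG. 9.

```python
from dataclasses import dataclass

@dataclass
class SynthesizerInput:
    text: str     # subsequence with the predetermined text appended
    padding: str  # the appended text, so its frames can later be excluded

def pad_for_synthesis(third_group, padding=".가나다라마."):
    """Append `padding` to every subsequence and keep a record of what was added."""
    return [SynthesizerInput(text=sub + padding, padding=padding) for sub in third_group]

inputs = pad_for_synthesis(["Hello there", "How are you"])
```

Carrying the padding alongside the text is one way to let a downstream component identify and drop the frames generated for the appended text; how the synthesizer actually does this is not described in the source.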

The foregoing description is provided for illustration, and those of ordinary skill in the art to which this specification pertains will understand that it can readily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present embodiments is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within that scope.

200: speech synthesis system
210: speaker encoder
220: synthesizer
230: vocoder

Claims

1. A method for segmenting a sequence of characters composed of a specific natural language, the method comprising:
generating a first group including a plurality of subsequences by dividing the sequence based on at least one punctuation mark included in the sequence;
generating a third subsequence by merging a first subsequence included in the first group and a second subsequence adjacent to the first subsequence when the length of the first subsequence is shorter than a first threshold length;
generating a second group by updating the first group based on the third subsequence;
generating a plurality of fifth subsequences by dividing a fourth subsequence included in the second group according to a predetermined criterion when the length of the fourth subsequence is longer than a second threshold length; and
generating a third group by updating the second group based on the plurality of fifth subsequences.

2. The method of claim 1, further comprising:
transmitting the plurality of subsequences included in the third group to a synthesizer.

3. The method of claim 2, further comprising:
merging a predetermined text at the end of each of the plurality of subsequences included in the third group,
wherein the transmitting comprises transmitting the subsequences with which the predetermined text has been merged to the synthesizer.

4. The method of claim 1,
wherein the first threshold length is shorter than the second threshold length.

5. The method of claim 1, further comprising:
when the sequence includes at least one run of two or more spaces, modifying the sequence by changing the run of two or more spaces to a single space.