KR102087301B1

KR102087301B1 - Sentence selection device for speech synthesis training to build a speech synthesizer based on the voice of a personal speaker and operating method thereof

Info

Publication number: KR102087301B1
Application number: KR1020180110687A
Authority: KR
Inventors: 김회린; 서영주; 정성희; 최연주
Original assignee: 주식회사 한글과컴퓨터; 한국과학기술원
Priority date: 2018-09-17
Filing date: 2018-09-17
Publication date: 2020-03-10

Abstract

Disclosed are a sentence selection device for speech synthesis training, used to build a speech synthesizer based on the voice of an individual speaker, and an operating method thereof. The device provides a technique for selecting the minimum set of training sentences that can cover various allophones when an individual user wants to build a speech synthesizer based on his or her own voice, so that the user can easily build the synthesizer using utterance data generated from only the selected training sentences.

Description

SENTENCE SELECTION DEVICE FOR SPEECH SYNTHESIS TRAINING TO BUILD A SPEECH SYNTHESIZER BASED ON THE VOICE OF A PERSONAL SPEAKER AND OPERATING METHOD THEREOF

The present invention relates to a technology that helps an individual easily build a speech synthesizer based on his or her own voice.

Recently, as text-to-speech (TTS) technology for converting text into speech has advanced, a variety of services using this technology have been released.

In particular, because text-to-speech technology can convert text into speech and output it, it is highly valuable as an assistive tool for the visually impaired.

In general, speech synthesis techniques fall into three groups: parametric speech synthesis, which models speech as parameters from speech data and then synthesizes speech corresponding to the desired text from that acoustic model; unit selection speech synthesis, which collects a large number of speech waveforms into a corpus, then selects small units (speech waveform pieces) from the corpus and concatenates them so as to produce the most natural waveform for the text to be synthesized; and hybrid techniques that combine the two.

With either approach, for the synthesized waveform to be natural, clear, high-quality speech, the speech corpus should be as large as possible so that it adequately includes the allophones arising from the context-dependent phonological phenomena of human speech. However, building a large speech corpus is costly and time-consuming, so an efficient corpus construction technique is needed that keeps the corpus as small as possible while still effectively covering context-dependent phonemes, that is, allophones.

In particular, when an individual user wants to build a speech synthesizer that synthesizes speech in his or her own voice, utterance data for a large number of sentences must be secured to improve the synthesizer's performance, yet an individual user is inevitably limited in how much utterance data he or she can provide.

In this regard, as described above, securing utterance data that includes the various allophones arising from the characteristics of sentences is important for improving synthesizer performance. If, when an individual user wants to build a speech synthesizer based on his or her own voice, a minimum set of training sentences covering various allophones could be selected and utterance data generated from those selected sentences, a high-quality speech synthesizer could be built without securing utterance data for a large number of sentences.

The present invention aims to provide a technique for selecting the minimum set of training sentences that can cover various allophones when an individual user wants to build a speech synthesizer based on his or her own voice, thereby helping the user easily build the synthesizer using utterance data generated from only the selected training sentences.

According to an embodiment of the present invention, a sentence selection device for speech synthesis training, for building a speech synthesizer based on the voice of an individual speaker, includes: a sentence storage unit in which a plurality of predetermined sentences are stored; a voice data storage unit in which voice data for each of the plurality of sentences, uttered and recorded in the voice of a first speaker, is stored; a TTS speech data generation unit that generates TTS speech data for each of the plurality of sentences by performing text-to-speech (TTS) conversion on each of the plurality of sentences using a speech synthesizer built on the voice of the first speaker; an MCD calculation unit that calculates the mel-cepstral distance (MCD) between the voice data and the TTS speech data for each of the plurality of sentences; a phoneme table generation unit that generates a phoneme table recording the plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's appearance frequency in the plurality of sentences; a phoneme extraction unit that extracts, from the phoneme table, a first phoneme having the maximum appearance frequency; a candidate sentence selection unit that, when the first phoneme is extracted, selects at least one first sentence containing the first phoneme from among the plurality of sentences; a training sentence selection unit that, referring to the MCD calculated for each of the plurality of sentences, selects from the at least one first sentence a second sentence having the maximum MCD between its voice data and its TTS speech data; and a training sentence storage processing unit that determines the second sentence to be a training sentence for building a speech synthesizer based on the voice of an individual speaker and stores the second sentence in a training sentence storage unit.

In addition, according to an embodiment of the present invention, an operating method of a sentence selection device for speech synthesis training, for building a speech synthesizer based on the voice of an individual speaker, includes: a sentence maintaining step of maintaining a sentence storage unit in which a plurality of predetermined sentences are stored; a voice data maintaining step of maintaining a voice data storage unit in which voice data for each of the plurality of sentences, uttered and recorded in the voice of a first speaker, is stored; a TTS speech data generating step of generating TTS speech data for each of the plurality of sentences by performing text-to-speech conversion on each of the plurality of sentences using a speech synthesizer built on the voice of the first speaker; an MCD calculating step of calculating the MCD between the voice data and the TTS speech data for each of the plurality of sentences; a phoneme table generating step of generating a phoneme table recording the plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's appearance frequency in the plurality of sentences; a phoneme extracting step of extracting, from the phoneme table, a first phoneme having the maximum appearance frequency; a candidate sentence selecting step of selecting, when the first phoneme is extracted, at least one first sentence containing the first phoneme from among the plurality of sentences; a training sentence selecting step of selecting from the at least one first sentence, with reference to the MCD calculated for each of the plurality of sentences, a second sentence having the maximum MCD between its voice data and its TTS speech data; and a training sentence storage processing step of determining the second sentence to be a training sentence for building a speech synthesizer based on the voice of an individual speaker and storing the second sentence in a training sentence storage unit.

By providing a technique for selecting the minimum set of training sentences that can cover various allophones when an individual user wants to build a speech synthesizer based on his or her own voice, the present invention can help the user easily build the synthesizer using utterance data generated from only the selected training sentences.

FIG. 1 is a diagram illustrating the structure of a sentence selection device for speech synthesis training, for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an operating method of a sentence selection device for speech synthesis training, for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. This description is not intended to limit the invention to specific embodiments and should be understood to cover all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the invention. In describing the drawings, similar reference numerals are used for similar components, and unless otherwise defined, all terms used in this specification, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a diagram illustrating the structure of a sentence selection device for speech synthesis training, for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention.

Referring to FIG. 1, the sentence selection device for speech synthesis training 110 according to the present invention includes a sentence storage unit 111, a voice data storage unit 112, a TTS speech data generation unit 113, an MCD calculation unit 114, a phoneme table generation unit 115, a phoneme extraction unit 116, a candidate sentence selection unit 117, a training sentence selection unit 118, a training sentence storage processing unit 119, and a training sentence storage unit 120.

First, the present invention assumes that a speech synthesizer already built on the voice of a specific speaker, the first speaker, exists.

A plurality of predetermined sentences are stored in the sentence storage unit 111. These sentences may be the general sentences typically used to collect a speaker's utterance data when building an ordinary speech synthesizer.

The voice data storage unit 112 stores voice data for each of the plurality of sentences, uttered and recorded in the voice of the first speaker.

That is, the voice data storage unit 112 holds, for each of the plurality of sentences, the recording made by the first speaker uttering that sentence.

The TTS speech data generation unit 113 generates TTS speech data for each of the plurality of sentences by performing text-to-speech (TTS) conversion on each sentence using the speech synthesizer built on the voice of the first speaker.

Here, the speech synthesizer built on the voice of the first speaker may be mounted inside the sentence selection device 110 as a software or hardware module, or it may be a device separate from the sentence selection device 110.

The MCD calculation unit 114 calculates the mel-cepstral distance (MCD) between the voice data and the TTS speech data for each of the plurality of sentences.

Here, the MCD is a measure of the difference between two speech signals: the larger the MCD, the larger the difference between them.

A large difference between the actual voice data for a given sentence and the TTS speech data generated by speech synthesis means that the synthesized speech does not properly reflect the characteristics of the actual speech, which in turn suggests that the sentence contains diverse context-dependent allophones. Accordingly, a large MCD between the actual voice data for a sentence and the TTS speech data synthesized for that sentence can be interpreted as indicating that the sentence contains diverse context-dependent allophones.
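As a rough illustration of "a larger MCD means a larger difference", the following minimal Python sketch averages a per-frame Euclidean distance between two sequences of mel-cepstral vectors. It assumes the two utterances are already time-aligned and have the same number of frames, and it omits any scaling constant; the function names and toy values are hypothetical and are not taken from the patent.

```python
import math
from typing import Sequence


def frame_distance(mc_x: Sequence[float], mc_y: Sequence[float]) -> float:
    """Euclidean distance between the mel-cepstral vectors of one frame."""
    if len(mc_x) != len(mc_y):
        raise ValueError("both frames need the same number of coefficients")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mc_x, mc_y)))


def mel_cepstral_distance(
    x_frames: Sequence[Sequence[float]],
    y_frames: Sequence[Sequence[float]],
) -> float:
    """Average the per-frame distances over N time-aligned frames."""
    if len(x_frames) != len(y_frames) or not x_frames:
        raise ValueError("expected two non-empty, equally long frame sequences")
    n_frames = len(x_frames)
    total = sum(frame_distance(x, y) for x, y in zip(x_frames, y_frames))
    return total / n_frames


# Toy data (assumed): two frames with two coefficients each.
recorded = [[1.0, 0.5], [0.8, 0.4]]
close_tts = [[1.0, 0.5], [0.8, 0.1]]  # synthesized speech close to the recording
far_tts = [[0.0, 0.5], [0.8, 2.4]]    # synthesized speech far from the recording

mcd_close = mel_cepstral_distance(recorded, close_tts)
mcd_far = mel_cepstral_distance(recorded, far_tts)
print(mcd_close < mcd_far)
```

In a real pipeline the frames would come from a mel-cepstral analysis of the recorded and synthesized audio, and utterances of different length would need alignment first; both steps are outside this sketch.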

The MCD may be calculated according to Equation 1 below.

In Equation 1, N denotes the number of frames of the speech, and MCD(n) may be expressed as in Equation 2 below.

In Equation 2, MC_X(n,k) may be expressed as in Equation 3 below, and MC_Y(n,k) as in Equation 4 below.

X_{n,i} and Y_{n,i} may be expressed as in Equations 5 and 6 below, respectively.

In Equations 5 and 6, m denotes the frequency index of the discrete Fourier transform, X(n,m) is the m-th Fourier transform value of the n-th frame of speech X, Y(n,m) is the m-th Fourier transform value of the n-th frame of speech Y, and w_i(m) is the i-th critical band filter at the m-th frequency.

In Equation 2, K denotes the total order of the mel-cepstral analysis, and in Equations 3 and 4, I denotes the total number of critical bands. K and I typically take natural-number values of 40 or less.
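The equation images for Equations 1 to 6 are not reproduced in this text. As a hedged sketch only, one common formulation of a mel-cepstral distance that is consistent with the symbol definitions given here (frame average over N, analysis order K, I critical bands, band filters w_i(m)) is shown below. The scaling constants, the logarithm base, the use of magnitude rather than power spectra, and the summation limits are assumptions and may differ from the patent's actual equations.

```latex
\begin{align}
\mathrm{MCD} &= \frac{1}{N}\sum_{n=1}^{N}\mathrm{MCD}(n) \tag{1}\\
\mathrm{MCD}(n) &= \sqrt{\sum_{k=1}^{K}\bigl(MC_X(n,k)-MC_Y(n,k)\bigr)^{2}} \tag{2}\\
MC_X(n,k) &= \sum_{i=1}^{I} X_{n,i}\,\cos\!\Bigl(k\,\bigl(i-\tfrac{1}{2}\bigr)\frac{\pi}{I}\Bigr) \tag{3}\\
MC_Y(n,k) &= \sum_{i=1}^{I} Y_{n,i}\,\cos\!\Bigl(k\,\bigl(i-\tfrac{1}{2}\bigr)\frac{\pi}{I}\Bigr) \tag{4}\\
X_{n,i} &= \log_{10}\sum_{m}\lvert X(n,m)\rvert\,w_i(m) \tag{5}\\
Y_{n,i} &= \log_{10}\sum_{m}\lvert Y(n,m)\rvert\,w_i(m) \tag{6}
\end{align}
```

Read this way, Equations 5 and 6 give log band energies per critical band, Equations 3 and 4 apply a cosine transform to obtain mel-cepstral coefficients, Equation 2 measures the per-frame distance, and Equation 1 averages it over all frames.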

The phoneme table generation unit 115 generates a phoneme table that records the plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's appearance frequency in the plurality of sentences.

According to an embodiment of the present invention, the phoneme table generation unit 115 may split each of the plurality of sentences into phoneme units to identify the distinct phonemes occurring in the sentences, count the appearance frequency of each phoneme across the plurality of sentences, and generate the phoneme table by recording each phoneme together with its counted appearance frequency.

For example, suppose the phonemes 'ㄱ', 'ㄴ', and 'ㄷ' are extracted from the plurality of sentences, with 'ㄱ' appearing 10 times in total, 'ㄴ' appearing 5 times, and 'ㄷ' appearing 7 times. The phoneme table generation unit 115 may then generate the phoneme table shown in Table 1 below.

Table 1
Phoneme | Appearance frequency
ㄱ | 10
ㄴ | 5
ㄷ | 7
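The table construction described above can be sketched in a few lines of Python. This is a minimal illustration that assumes each sentence has already been split into a list of phoneme symbols; how the splitting is done is not specified here, and the function name and toy corpus are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_phoneme_table(sentences: Iterable[List[str]]) -> Dict[str, int]:
    """Count how often each distinct phoneme occurs across all sentences."""
    table: Counter = Counter()
    for phonemes in sentences:
        table.update(phonemes)  # adds one count per occurrence
    return dict(table)


# Toy corpus (assumed): every sentence is already a list of phonemes.
corpus = [
    ["ㄱ", "ㄴ", "ㄱ"],
    ["ㄷ", "ㄱ"],
    ["ㄴ", "ㄷ", "ㄷ"],
]
phoneme_table = build_phoneme_table(corpus)
print(phoneme_table)
```

The result maps each phoneme to its total count over the whole corpus, which is the same shape as Table 1.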

The phoneme extraction unit 116 extracts, from the phoneme table, the first phoneme having the maximum appearance frequency.

When the first phoneme is extracted, the candidate sentence selection unit 117 selects, from among the plurality of sentences, at least one first sentence containing the first phoneme.

For example, if the first phoneme with the maximum appearance frequency in the phoneme table is 'ㄱ', the candidate sentence selection unit 117 may select, as the at least one first sentence, the sentences containing the phoneme 'ㄱ' from among the plurality of sentences stored in the sentence storage unit 111.

The training sentence selection unit 118 refers to the MCD calculated for each of the plurality of sentences and selects, from the at least one first sentence, the second sentence whose MCD between voice data and TTS speech data is the largest.

As described above, the larger the MCD between two speech signals, the larger the difference between them. The second sentence selected by the training sentence selection unit 118 is therefore the sentence, among the at least one first sentence, with the largest difference between its voice data and its TTS speech data.
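A single pass through these three units (phoneme extraction, candidate selection, training sentence selection) might look like the sketch below. The sentence identifiers, the dictionary layout, and the MCD values are assumptions made for illustration; ties in frequency or MCD are resolved by dictionary order here, which the text does not specify.

```python
from typing import Dict, List


def select_training_sentence(
    sentences: Dict[str, List[str]],
    phoneme_table: Dict[str, int],
    mcd: Dict[str, float],
) -> str:
    """Pick the highest-MCD sentence among those containing the top phoneme."""
    # First phoneme: the table entry with the largest appearance frequency.
    first_phoneme = max(phoneme_table, key=phoneme_table.get)
    # First sentences: every sentence that contains the first phoneme.
    candidates = [sid for sid, ph in sentences.items() if first_phoneme in ph]
    # Second sentence: the candidate with the largest recorded-vs-TTS MCD.
    return max(candidates, key=lambda sid: mcd[sid])


# Toy inputs (assumed); the table values follow the Table 1 example.
toy_sentences = {"s1": ["ㄱ", "ㄴ"], "s2": ["ㄱ", "ㄷ"], "s3": ["ㄴ", "ㄷ"]}
toy_table = {"ㄱ": 10, "ㄴ": 5, "ㄷ": 7}
toy_mcd = {"s1": 4.2, "s2": 6.1, "s3": 9.0}

chosen = select_training_sentence(toy_sentences, toy_table, toy_mcd)
print(chosen)
```

Note that `s3` has the largest MCD overall but is not eligible, because it does not contain the most frequent phoneme 'ㄱ'; the frequency filter is applied before the MCD ranking.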

When the second sentence has been selected in this way, the training sentence storage processing unit 119 determines the second sentence to be a training sentence for building a speech synthesizer based on the voice of an individual speaker and stores it in the training sentence storage unit 120.

According to an embodiment of the present invention, the sentence selection device 110 may further include a phoneme table update unit 121.

When the second sentence is stored in the training sentence storage unit 120, the phoneme table update unit 121 updates the phoneme table by deleting from it every phoneme extracted from the second sentence.

For example, if the phonemes extracted from the second sentence are 'ㄱ', 'ㄹ', and 'ㅏ', the phoneme table update unit 121 may update the phoneme table by deleting the phonemes 'ㄱ', 'ㄹ', and 'ㅏ' from it.

According to an embodiment of the present invention, the sentence selection device 110 may further include a repetition control unit 122.

Once the second sentence is stored in the training sentence storage unit 120 and the phoneme table is updated, the repetition control unit 122 controls the phoneme extraction unit 116, the candidate sentence selection unit 117, the training sentence selection unit 118, the training sentence storage processing unit 119, and the phoneme table update unit 121 so that their sequential operation is repeated until the number of training sentences stored in the training sentence storage unit 120 reaches a preset target number.

For example, if the preset target number is 50, then once the second sentence is stored in the training sentence storage unit 120 and the phoneme table is updated, the repetition control unit 122 may control the phoneme extraction unit 116, the candidate sentence selection unit 117, the training sentence selection unit 118, the training sentence storage processing unit 119, and the phoneme table update unit 121 so that their sequential operation is repeated until 50 training sentences are stored in the training sentence storage unit 120.

That is, after the second sentence is stored in the training sentence storage unit 120 and the phoneme table is updated, under the control of the repetition control unit 122, the phoneme extraction unit 116 may again extract the first phoneme having the maximum appearance frequency from the updated phoneme table; the candidate sentence selection unit 117 may select at least one first sentence containing that first phoneme from among the plurality of sentences; the training sentence selection unit 118 may select, from the at least one first sentence, the second sentence with the maximum MCD between voice data and TTS speech data; the training sentence storage processing unit 119 may store that second sentence in the training sentence storage unit 120; and the phoneme table update unit 121 may update the phoneme table by deleting from it every phoneme extracted from that second sentence.
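The repeated cycle can be sketched as a greedy loop in Python. This is an illustrative sketch under stated assumptions, not the patented implementation: sentences are pre-split into phonemes, the identifiers and MCD values are made up, and the loop stops early if the phoneme table becomes empty before the target count is reached, a case the text does not address.

```python
from collections import Counter
from typing import Dict, List


def select_training_set(
    sentences: Dict[str, List[str]],
    mcd: Dict[str, float],
    target_count: int,
) -> List[str]:
    """Repeat extract, select, store, and update until the target is reached."""
    # Phoneme table: appearance frequency of every phoneme in the corpus.
    table = Counter(p for phonemes in sentences.values() for p in phonemes)
    selected: List[str] = []
    # Assumption: stop early when no phonemes remain in the table.
    while len(selected) < target_count and table:
        first_phoneme = max(table, key=table.get)
        candidates = [sid for sid, ph in sentences.items() if first_phoneme in ph]
        second = max(candidates, key=lambda sid: mcd[sid])
        selected.append(second)  # store as a training sentence
        # Update: delete every phoneme of the stored sentence from the table.
        for phoneme in set(sentences[second]):
            table.pop(phoneme, None)
    return selected


# Toy inputs (assumed).
toy_sentences = {
    "s1": ["ㄱ", "ㄹ", "ㅏ"],
    "s2": ["ㄱ", "ㄴ"],
    "s3": ["ㄴ", "ㄷ", "ㄷ"],
    "s4": ["ㄷ", "ㅁ"],
}
toy_mcd = {"s1": 5.0, "s2": 3.0, "s3": 4.0, "s4": 6.0}

training_set = select_training_set(toy_sentences, toy_mcd, target_count=3)
print(training_set)
```

Because every phoneme of a stored sentence is removed from the table, the next most frequent phoneme can never belong to an already stored sentence, so the loop does not pick the same sentence twice.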

When 50 training sentences have been stored in the training sentence storage unit 120 under the control of the repetition control unit 122, the user can build a speech synthesizer based on his or her own voice by uttering only those 50 training sentences to create voice data and then training a personal speech synthesizer on that voice data.

Each of the 50 training sentences is obtained by first selecting, from the plurality of sentences, the sentences that contain the phoneme with the highest appearance frequency, and then choosing among them the sentence with the largest difference between the actual voice data and the TTS speech data synthesized by the speech synthesizer, that is, the sentence containing the most context-dependent allophones. Selecting the 50 training sentences therefore means that the base data needed to obtain voice data with diverse allophones, which matters most in building a speech synthesizer, has been secured.

Accordingly, even if the user utters only the 50 training sentences to generate voice data, that voice data contains diverse allophones, so training a speech synthesizer on it makes it possible to build a high-performance speech synthesizer without generating voice data for a large number of sentences.

According to an embodiment of the present invention, the sentence selection device 110 may further include an importance score storage unit 123 and an importance score display unit 124.

When the preset target number of training sentences has been stored in the training sentence storage unit 120, the importance score storage unit 123 assigns a sequence number to each stored training sentence in the reverse of the order in which the sentences were stored, calculates an importance score for each training sentence by multiplying its assigned sequence number by a preset reference score, and additionally stores the importance score of each training sentence in the training sentence storage unit 120.

For example, if 50 training sentences have been stored in the training sentence storage unit 120 and the preset reference score is 10 points, the importance score storage unit 123 may assign a sequence number to each of the 50 training sentences in the reverse of their storage order in the training sentence storage unit 120, multiply each assigned sequence number by the reference score of 10 points to calculate each sentence's importance score, and additionally store each importance score in the training sentence storage unit 120.

In this case, the sentence stored first in the training sentence storage unit 120 among the 50 training sentences may be assigned the sequence number 50, so an importance score of 500 points may be calculated for it.
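The scoring rule amounts to "reverse rank times reference score". A minimal Python sketch follows; the function name, the sentence identifiers, and the default of 10 points are illustrative choices that mirror the example above.

```python
from typing import Dict, List


def importance_scores(stored: List[str], base_score: int = 10) -> Dict[str, int]:
    """Assign reverse-order sequence numbers and multiply by the base score."""
    total = len(stored)
    # The sentence stored first gets sequence number `total`; the last gets 1.
    return {sid: (total - index) * base_score for index, sid in enumerate(stored)}


# 50 placeholder identifiers, listed in the order they were stored.
stored_sentences = [f"sentence_{i}" for i in range(1, 51)]
scores = importance_scores(stored_sentences)
print(scores["sentence_1"], scores["sentence_50"])
```

With 50 sentences and a reference score of 10 points, the sentence stored first scores 500 points and the sentence stored last scores 10 points, so earlier selections rank as more important.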

When the user issues an importance check command for a first training sentence among the preset target number of training sentences, the importance score display unit 124 retrieves the importance score for the first training sentence stored in the training sentence storage unit 120 and displays it on the screen.

Through this, the user can visually check the importance of each training sentence stored in the training sentence storage unit 120 when building voice data for speech synthesis, and can take greater care when uttering training sentences of high importance.

FIG. 2 is a flowchart illustrating a method of operating a sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention.

In step S210, a sentence storage unit in which a plurality of predetermined sentences are stored is maintained.

In step S220, a voice data storage unit is maintained, in which voice data for each of the plurality of sentences, uttered and recorded in the voice of a first speaker, is stored.

In step S230, TTS utterance data is generated for each of the plurality of sentences by performing text-to-speech conversion on each sentence using a speech synthesizer built on the voice of the first speaker.

In step S240, the MCD between the voice data and the TTS utterance data is calculated for each of the plurality of sentences.
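As a non-limiting illustration, one way to compute an MCD for a single sentence is sketched below. The sketch assumes the commonly used per-frame formula (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. That formula is an assumption made for this example and is not taken from this section, so any definition of the MCD given elsewhere in this document takes precedence. The sketch further assumes that the two mel-cepstral sequences have already been time-aligned and that the energy coefficient has already been excluded. The function name `mel_cepstral_distance` is hypothetical.

```python
import math


def mel_cepstral_distance(ref_frames, syn_frames):
    """Mean per-frame MCD (in dB) between two time-aligned sequences.

    Each argument is a list of frames, and each frame is a list of
    mel-cepstral coefficients of equal length.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("sequences must be non-empty and time-aligned")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        # Squared Euclidean distance between the two coefficient vectors.
        squared = sum((r - s) ** 2 for r, s in zip(ref, syn))
        total += scale * math.sqrt(2.0 * squared)
    return total / len(ref_frames)


# Identical recorded and synthesized frames give a distance of zero.
recorded = [[0.5, -0.1], [0.4, 0.2]]
print(mel_cepstral_distance(recorded, recorded))  # 0.0
```

Under this definition, a larger value indicates that the synthesized utterance deviates more from the recorded utterance of the same sentence.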

In step S250, a phoneme table is generated that records the plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's frequency of occurrence in the plurality of sentences.

Here, according to an embodiment of the present invention, step S250 may divide each of the plurality of sentences into phoneme units to identify the distinct phonemes in the plurality of sentences, count how often each phoneme occurs in the plurality of sentences, and generate the phoneme table by recording each phoneme together with its counted frequency of occurrence.
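As a non-limiting illustration, the phoneme table generation of step S250 may be sketched as follows. The sketch assumes that each sentence has already been divided into phoneme units by some upstream process, which is not shown. The function name `build_phoneme_table`, the sentence identifiers, and the romanized phoneme symbols are hypothetical.

```python
from collections import Counter


def build_phoneme_table(phoneme_sentences):
    """Map each distinct phoneme to its occurrence count over all sentences.

    phoneme_sentences maps a sentence identifier to that sentence's
    list of phonemes.
    """
    table = Counter()
    for phonemes in phoneme_sentences.values():
        table.update(phonemes)
    return table


# Hypothetical two-sentence corpus, already divided into phoneme units.
corpus = {
    "s1": ["k", "a", "m", "a"],
    "s2": ["n", "a", "m", "u"],
}
table = build_phoneme_table(corpus)
print(table["a"])  # 3
print(table["m"])  # 2
```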

In step S260, a first phoneme having the highest frequency of occurrence is extracted from the phoneme table.

In step S270, once the first phoneme has been extracted, at least one first sentence containing the first phoneme is selected from among the plurality of sentences.

In step S280, referring to the MCD calculated for each of the plurality of sentences, a second sentence having the largest MCD between its voice data and TTS utterance data is selected from among the at least one first sentence.

In step S290, the second sentence is determined to be a training sentence for building a speech synthesizer based on the voice of an individual speaker, and is stored in a training sentence storage unit.

Here, according to an embodiment of the present invention, the method of operating the sentence selection device for speech synthesis training may further include a phoneme table update step of updating the phoneme table, once the second sentence has been stored in the training sentence storage unit, by deleting from the phoneme table all phonemes extracted from the second sentence.

Here, according to an embodiment of the present invention, the method of operating the sentence selection device for speech synthesis training may further include a repetition control step of, once the second sentence has been stored in the training sentence storage unit and the phoneme table has been updated, controlling steps S260, S270, S280, and S290 to be performed repeatedly in sequence until the number of training sentences stored in the training sentence storage unit reaches a preset target number.
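As a non-limiting illustration, the repeated sequence of steps S260 to S290 together with the phoneme table update may be sketched as follows. The sketch assumes that each sentence has already been divided into phonemes and that the per-sentence MCD values have already been calculated. The function name `select_training_sentences`, the sample data, and the early exit when the phoneme table becomes empty before the target number is reached are assumptions made for this example. Ties in frequency of occurrence are resolved here by table order, which is likewise an assumption.

```python
from collections import Counter


def select_training_sentences(phoneme_sentences, mcd, target_count):
    """Select sentence identifiers in the order in which they are stored."""
    # Phoneme table: distinct phoneme -> occurrence count over all sentences.
    table = Counter()
    for phonemes in phoneme_sentences.values():
        table.update(phonemes)

    selected = []
    while len(selected) < target_count and table:
        # S260: phoneme with the highest frequency of occurrence.
        top = max(table, key=table.get)
        # S270: candidate sentences containing that phoneme.
        candidates = [sid for sid, ph in phoneme_sentences.items() if top in ph]
        # S280: the candidate with the largest MCD.
        best = max(candidates, key=lambda sid: mcd[sid])
        # S290: store the chosen sentence as a training sentence.
        selected.append(best)
        # Table update: delete every phoneme found in the chosen sentence.
        for phoneme in set(phoneme_sentences[best]):
            table.pop(phoneme, None)
    return selected


# Hypothetical corpus and MCD values.
corpus = {
    "s1": ["k", "a", "m"],
    "s2": ["n", "a", "m", "u"],
    "s3": ["s", "o", "r", "i"],
    "s4": ["s", "o"],
}
mcd_values = {"s1": 5.2, "s2": 6.8, "s3": 4.1, "s4": 7.0}
print(select_training_sentences(corpus, mcd_values, target_count=2))
# ['s2', 's4']
```

Because every phoneme of a chosen sentence is deleted from the table, the phoneme extracted in the next iteration cannot occur in any previously chosen sentence, so the same sentence is not selected twice in this sketch.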

Here, according to an embodiment of the present invention, the method of operating the sentence selection device for speech synthesis training may further include: an importance score storage step of, once the preset target number of training sentences has been stored in the training sentence storage unit, assigning a sequence number to each of the stored training sentences in the reverse of the order in which they were stored, calculating an importance score for each training sentence by multiplying its assigned sequence number by a preset reference score, and additionally storing each importance score in the training sentence storage unit; and an importance score display step of, when the user issues an importance check command for a first training sentence among the preset target number of training sentences, retrieving the importance score for the first training sentence stored in the training sentence storage unit and displaying it on the screen.

The method of operating a sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention, has been described above with reference to FIG. 2. Because this method may correspond to the configuration and operation of the sentence selection device 110 for speech synthesis training described with reference to FIG. 1, a more detailed description is omitted.

The method of operating a sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention, may be implemented as a computer program stored in a storage medium for execution in combination with a computer.

In addition, the method of operating a sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, according to an embodiment of the present invention, may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

As described above, the present invention has been described with reference to specific matters such as concrete components, and to limited embodiments and drawings. These are provided only to aid a more general understanding of the present invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the present invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be confined to the described embodiments, and not only the claims set out below but also everything equal or equivalent to those claims falls within the scope of the spirit of the present invention.

110: sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker
111: sentence storage unit 112: voice data storage unit
113: TTS utterance data generation unit 114: MCD calculation unit
115: phoneme table generation unit 116: phoneme extraction unit
117: candidate sentence selection unit 118: training sentence selection unit
119: training sentence storage processing unit 120: training sentence storage unit
121: phoneme table update unit 122: repetition control unit
123: importance score storage unit 124: importance score display unit

Claims

A sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, the device comprising:
a sentence storage unit storing a plurality of predetermined sentences;
a voice data storage unit storing voice data for each of the plurality of sentences, uttered and recorded in the voice of a first speaker;
a TTS utterance data generation unit that generates TTS utterance data for each of the plurality of sentences by performing text-to-speech (TTS) conversion on each of the plurality of sentences using a speech synthesizer built on the voice of the first speaker;
an MCD calculation unit that calculates a mel-cepstral distance (MCD) between the voice data and the TTS utterance data for each of the plurality of sentences;
a phoneme table generation unit that generates a phoneme table recording a plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's frequency of occurrence in the plurality of sentences;
a phoneme extraction unit that extracts, from the phoneme table, a first phoneme having the highest frequency of occurrence;
a candidate sentence selection unit that, once the first phoneme has been extracted, selects at least one first sentence containing the first phoneme from among the plurality of sentences;
a training sentence selection unit that, referring to the MCD calculated for each of the plurality of sentences, selects from among the at least one first sentence a second sentence having the largest MCD between its voice data and TTS utterance data; and
a training sentence storage processing unit that determines the second sentence to be a training sentence for building a speech synthesizer based on the voice of an individual speaker and stores the second sentence in a training sentence storage unit.

The device of claim 1,
wherein the phoneme table generation unit divides each of the plurality of sentences into phoneme units to identify the distinct phonemes in the plurality of sentences, counts how often each phoneme occurs in the plurality of sentences, and generates the phoneme table by recording each phoneme together with its counted frequency of occurrence.

The device of claim 1,
further comprising a phoneme table update unit that, once the second sentence has been stored in the training sentence storage unit, updates the phoneme table by deleting from the phoneme table all phonemes extracted from the second sentence.

The device of claim 3,
further comprising a repetition control unit that, once the second sentence has been stored in the training sentence storage unit and the phoneme table has been updated, controls the phoneme extraction unit, the candidate sentence selection unit, the training sentence selection unit, the training sentence storage processing unit, and the phoneme table update unit to operate repeatedly in sequence until the number of training sentences stored in the training sentence storage unit reaches a preset target number.

The device of claim 4, further comprising:
an importance score storage unit that, once the preset target number of training sentences has been stored in the training sentence storage unit, assigns a sequence number to each of the stored training sentences in the reverse of the order in which they were stored, calculates an importance score for each training sentence by multiplying its assigned sequence number by a preset reference score, and additionally stores each importance score in the training sentence storage unit; and
an importance score display unit that, when a user issues an importance check command for a first training sentence among the preset target number of training sentences, retrieves the importance score for the first training sentence stored in the training sentence storage unit and displays it on a screen.

A method of operating a sentence selection device for speech synthesis training for building a speech synthesizer based on the voice of an individual speaker, the method comprising:
a sentence storage unit maintaining step of maintaining a sentence storage unit in which a plurality of predetermined sentences are stored;
a voice data storage unit maintaining step of maintaining a voice data storage unit in which voice data for each of the plurality of sentences, uttered and recorded in the voice of a first speaker, is stored;
a TTS utterance data generation step of generating TTS utterance data for each of the plurality of sentences by performing text-to-speech (TTS) conversion on each of the plurality of sentences using a speech synthesizer built on the voice of the first speaker;
an MCD calculation step of calculating a mel-cepstral distance (MCD) between the voice data and the TTS utterance data for each of the plurality of sentences;
a phoneme table generation step of generating a phoneme table recording a plurality of distinct phonemes extracted from the plurality of sentences together with each phoneme's frequency of occurrence in the plurality of sentences;
a phoneme extraction step of extracting, from the phoneme table, a first phoneme having the highest frequency of occurrence;
a candidate sentence selection step of, once the first phoneme has been extracted, selecting at least one first sentence containing the first phoneme from among the plurality of sentences;
a training sentence selection step of, referring to the MCD calculated for each of the plurality of sentences, selecting from among the at least one first sentence a second sentence having the largest MCD between its voice data and TTS utterance data; and
a training sentence storage processing step of determining the second sentence to be a training sentence for building a speech synthesizer based on the voice of an individual speaker and storing the second sentence in a training sentence storage unit.

The method of claim 6,
wherein the phoneme table generation step divides each of the plurality of sentences into phoneme units to identify the distinct phonemes in the plurality of sentences, counts how often each phoneme occurs in the plurality of sentences, and generates the phoneme table by recording each phoneme together with its counted frequency of occurrence.

The method of claim 6,
further comprising a phoneme table update step of, once the second sentence has been stored in the training sentence storage unit, updating the phoneme table by deleting from the phoneme table all phonemes extracted from the second sentence.

The method of claim 8,
further comprising a repetition control step of, once the second sentence has been stored in the training sentence storage unit and the phoneme table has been updated, controlling the phoneme extraction step, the candidate sentence selection step, the training sentence selection step, the training sentence storage processing step, and the phoneme table update step to be performed repeatedly in sequence until the number of training sentences stored in the training sentence storage unit reaches a preset target number.

The method of claim 9, further comprising:
an importance score storage step of, once the preset target number of training sentences has been stored in the training sentence storage unit, assigning a sequence number to each of the stored training sentences in the reverse of the order in which they were stored, calculating an importance score for each training sentence by multiplying its assigned sequence number by a preset reference score, and additionally storing each importance score in the training sentence storage unit; and
an importance score display step of, when a user issues an importance check command for a first training sentence among the preset target number of training sentences, retrieving the importance score for the first training sentence stored in the training sentence storage unit and displaying it on a screen.

A computer readable recording medium having recorded thereon a program for performing the method of any one of claims 6 to 10.

A computer program stored in a storage medium for executing the method of any one of claims 6 to 10 in combination with a computer.