KR20060035998A

KR20060035998A - Method for converting timbre of speech using phoneme codebook mapping

Info

Publication number: KR20060035998A
Application number: KR1020040085098A
Authority: KR
Inventors: 김동관
Original assignee: 삼성전자주식회사
Priority date: 2004-10-23
Filing date: 2004-10-23
Publication date: 2006-04-27
Also published as: KR100624440B1

Abstract

The present invention relates to a method of generating speech whose timbre has been converted by phoneme-specific codebook mapping. According to the invention, a method of generating synthesized speech in which a speech frame of an original speaker, extracted frame by frame, is timbre-converted into a speech frame of a target speaker comprises: (a) generating, for each phoneme type, a codebook mapping table in which a block group containing a plurality of field blocks, each consisting of an index field for a target-speaker code vector and a mapping count field, is accessed by the index of a code vector in the original speaker's codebook; (b) determining candidate fundamental frequencies of the speech frame from the peak values of a normalized autocorrelation function of the speech frame, and determining the fundamental frequency of each speech frame by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies; (c) determining the phoneme type from the original speaker's speech frame based on the fundamental frequency; (d) converting the original speaker's speech frame into LSP coefficients; (e) searching the original speaker's codebook for the phoneme type determined in step (c) and determining the index of the code vector most similar to the LSP coefficients; (f) accessing the codebook mapping table with the code vector index determined in step (e) and converting to a code vector of the target speaker; and (g) generating timbre-converted synthesized speech from the target-speaker code vector obtained in step (f).

According to the present invention, when the method is used in a speech synthesizer to produce synthesized speech in a variety of timbres, sentences can be synthesized in timbres that differ in age, gender, and so on, according to the purpose of the synthesized speech.

Description

Method for converting timbre of speech using phoneme codebook mapping

FIG. 1 shows the structure of the codebook mapping table used by the present invention.

FIG. 2 is a diagram illustrating the process by which the codebook mapping table and the conversion function are generated according to the present invention.

FIG. 3 is a flowchart of the timbre conversion process using phoneme-specific codebook mapping according to the present invention.

FIG. 4 is a flowchart describing in detail the step of determining the fundamental frequency of a speech frame.

The present invention relates to a speech synthesis method, and more particularly to a method of generating speech whose timbre has been converted by phoneme-specific codebook mapping.

Speech synthesis systems have recently improved greatly in performance and are increasingly applied to generating synthesized speech for many kinds of documents, for example in e-mail readers, synthesized-speech weather services, and the reading of Internet web documents. In general, the quality of synthesized speech is evaluated on two measures: naturalness and intelligibility. To date, however, the naturalness of synthesized speech remains unsatisfactory.

There are two main ways to improve the naturalness of synthesized speech: one is to imitate the pronunciation characteristics of a particular speaker, and the other is to synthesize the speaker's emotion. Accordingly, the present invention relates to a method of altering the pronunciation characteristics of an original speaker so that one person's voice sounds like another person's.

Conventional timbre conversion methods based on vector quantization (VQ) codebook mapping used a single codebook common to all phonemes. In such methods, however, the common codebook cannot reflect the way a speaker's timbre differs from phoneme to phoneme, so timbre conversion performance cannot be guaranteed.

The present invention was devised to solve the above problem. Its first object is to provide a method of generating timbre-converted speech by phoneme-specific codebook mapping, in which the use of phoneme-specific code vectors allows fine timbre modification for each phoneme.

A second object of the present invention is to provide a method of generating the codebook mapping table used to generate the timbre-converted speech.

To achieve the first object, the method according to the present invention of generating synthesized speech in which a speech frame of an original speaker, extracted frame by frame, is timbre-converted into a speech frame of a target speaker comprises: (a) generating, for each phoneme type, a codebook mapping table in which a block group containing a plurality of field blocks, each consisting of an index field for a target-speaker code vector and a mapping count field, is accessed by the index of a code vector in the original speaker's codebook; (b) determining candidate fundamental frequencies of the speech frame from the peak values of a normalized autocorrelation function of the speech frame, and determining the fundamental frequency of each speech frame by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies; (c) determining the phoneme type from the original speaker's speech frame based on the fundamental frequency; (d) converting the original speaker's speech frame into LSP coefficients; (e) searching the original speaker's codebook for the phoneme type determined in step (c) and determining the index of the code vector most similar to the LSP coefficients; (f) accessing the codebook mapping table with the code vector index determined in step (e) and converting to a code vector of the target speaker; and (g) generating timbre-converted synthesized speech from the target-speaker code vector obtained in step (f).

To achieve the second object, the method of generating a codebook mapping table according to the present invention comprises: (a) initializing a codebook mapping table structured so that a block group containing a plurality of field blocks, each consisting of an index field and a mapping count field, is accessed by the index of a code vector in the original speaker's codebook; (b) determining candidate fundamental frequencies of each speech frame from the peak values of a normalized autocorrelation function of each speech frame of an original speaker and a target speaker pronouncing the same phoneme, and determining the fundamental frequency of each speech frame of the original speaker and the target speaker for that phoneme by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies; (c) performing linear predictive analysis on the fundamental frequency of the original speaker and on that of the target speaker, and converting the results into first LSP coefficients and second LSP coefficients, respectively; (d) finding, in the original speaker's codebook, the first code vector most similar to the first LSP coefficients and determining its index, and finding, in the target speaker's codebook, the second code vector most similar to the second LSP coefficients and determining its index; (e) incrementing by one the value of the mapping count field that corresponds to the index of the second code vector, within the block group of the codebook mapping table that corresponds to the index of the first code vector; and (f) repeating steps (b) to (e) a predetermined number of times.

The present invention consists of a step of generating a codebook mapping table between the original speaker and the target speaker through a mapping stage, and a step of converting the timbre of the original speaker's speech using that table.

The codebook mapping table is generated off-line, separately from the timbre conversion process. The actual timbre conversion, in contrast, is performed on-line using the generated table.

The process of generating the codebook mapping table is described with reference to FIGS. 1 and 2.

(1) First, a codebook mapping table such as that shown in FIG. 1 is initialized. A codebook mapping table is provided for each phoneme type and is structured so that a block group (16), containing j field blocks (14) each consisting of an index field (10) for a target-speaker code vector and its mapping count field (14), is accessed by the index (0, …, N-1) of a code vector in the original speaker's codebook.

(2) As shown in FIG. 2, the fundamental frequency is determined for each speech frame of the original speaker and the target speaker pronouncing the same phoneme (A). The fundamental frequencies of the original speaker and the target speaker for that phoneme are each subjected to linear predictive analysis and then converted into line spectral pair (LSP) coefficients, which have the same form as the code vectors.

(3) Next, the original speaker's codebook (20) for phoneme A is searched to determine the first code vector, and its index, that is most similar to the LSP coefficients extracted from the original speaker's phoneme A. At the same time, the target speaker's codebook (22) for phoneme A is searched to determine the second code vector, and its index, that is most similar to the LSP coefficients extracted from the target speaker's same phoneme A.

(4) Next, the block group corresponding to the index of the first code vector is located in the codebook mapping table. If that block group already contains an index field holding the index of the second code vector, the value of the corresponding mapping count field is incremented by one. If it does not, the index of the second code vector is written into an unassigned index field of that block group, and the corresponding mapping count field is set to 1.

(5) LSP coefficients are obtained for a large number of samples of each phoneme, and steps (2) through (4) are repeated until the codebook mapping table is filled.

(6) In each block group, only a predetermined number (at least three) of field blocks are kept, chosen in descending order of the value in the mapping count field; the other field blocks are deleted.
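Steps (3) to (6) above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of squared Euclidean distance as the similarity measure, and the dictionary layout standing in for block groups and field blocks are all assumptions made here.

```python
from collections import defaultdict

def nearest_index(codebook, vector):
    """Index of the code vector closest to `vector` (squared Euclidean distance, an assumption)."""
    def dist2(cv):
        return sum((a - b) ** 2 for a, b in zip(cv, vector))
    return min(range(len(codebook)), key=lambda k: dist2(codebook[k]))

def build_mapping_table(src_codebook, tgt_codebook, paired_lsps, keep=3):
    """Count source/target code vector co-occurrences for one phoneme, then prune.

    paired_lsps: iterable of (source_lsp, target_lsp) pairs for the same phoneme.
    Returns {source_index: [(target_index, count), ...]} in descending count order.
    """
    counts = defaultdict(dict)  # one "block group" per source index
    for src_lsp, tgt_lsp in paired_lsps:
        i = nearest_index(src_codebook, src_lsp)  # step (3), original speaker
        t = nearest_index(tgt_codebook, tgt_lsp)  # step (3), target speaker
        counts[i][t] = counts[i].get(t, 0) + 1    # step (4): add or increment
    table = {}
    for i, group in counts.items():               # step (6): keep the top `keep` blocks
        ranked = sorted(group.items(), key=lambda kv: (-kv[1], kv[0]))
        table[i] = ranked[:keep]
    return table
```

One table of this kind would be built per phoneme type, each from its own pair of phoneme-specific codebooks.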

Using the codebook mapping table generated in this way, the following conversion function (24) can be obtained.

(Here, i is the index of an original-speaker code vector; cvAs(i) is the i-th code vector of the original speaker; j is the number of field blocks; cvAt(i_0), …, cvAt(i_{j-1}) are the target-speaker code vectors corresponding to cvAs(i); i_0, …, i_{j-1} are the indices of those target-speaker code vectors; Freqcv(i_0), …, Freqcv(i_{j-1}) are the mapping counts of the target-speaker code vectors corresponding to cvAs(i); and Ft(i) is the sum of Freqcv(i_0), …, Freqcv(i_{j-1}).)
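The equation for the conversion function is not reproduced in this text. One form consistent with the symbol definitions above is a weighted average of the mapped target-speaker code vectors, with weights given by the mapping counts. This is offered as an assumption, not as the patent's exact formula, and the left-hand symbol is a hypothetical name introduced here:

```latex
\widehat{cv}_{At}(i) \;=\; \sum_{k=0}^{j-1} \frac{\mathrm{Freqcv}(i_k)}{F_t(i)}\; cv_{At}(i_k),
\qquad
F_t(i) \;=\; \sum_{k=0}^{j-1} \mathrm{Freqcv}(i_k)
```

Under this reading the weights sum to one, so the converted vector lies within the convex hull of the mapped target code vectors.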

The process of converting the timbre of the original speaker's speech using the codebook mapping table is described below with reference to FIG. 3.

The input speech is extracted in frames of 20 to 30 ms (step 300). Candidate fundamental frequencies of the speech frame are determined from the peak values of a normalized autocorrelation function of the speech frame, and the fundamental frequency of each speech frame is determined by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies (step 305).

Next, the phoneme type of the frame is determined based on the fundamental frequency, and LPC analysis, excitation calculation, and LSP conversion are performed (steps 310 to 340).

The original speaker's codebook for the determined phoneme type is searched, and the index of the code vector most similar to the LSP coefficients obtained in step 340 is determined (step 350).

The codebook mapping table is accessed with the code vector index determined in step 350, and the conversion function (24) described above yields the target speaker's code vector (step 360).
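Step 360 might look like the following sketch. It assumes that the conversion function is a mapping-count-weighted average of the target code vectors in the block group, which the symbol definitions suggest but this text does not show explicitly. The function name and the `(target_index, count)` list layout are hypothetical.

```python
def convert_code_vector(block_group, tgt_codebook):
    """Weighted average of the target code vectors in one block group.

    block_group: list of (target_index, mapping_count) pairs for one source index.
    tgt_codebook: list of target-speaker code vectors (lists of floats).
    """
    total = sum(count for _, count in block_group)  # plays the role of Ft(i)
    if total == 0:
        raise ValueError("block group has no mappings")
    dim = len(tgt_codebook[block_group[0][0]])
    out = [0.0] * dim
    for idx, count in block_group:
        weight = count / total  # assumed weighting: mapping count over total
        for d in range(dim):
            out[d] += weight * tgt_codebook[idx][d]
    return out
```

For example, a block group `[(1, 3), (2, 1)]` weights target code vector 1 by 0.75 and code vector 2 by 0.25.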

The target-speaker LSP code vector obtained in step 360 is converted back to LPC coefficients, and the excitation is also modified (step 370). Timbre-converted synthesized speech is then generated from the modified LPC coefficients and excitation (step 380).

FIG. 4 is a flowchart describing the fundamental frequency determination step (305) in more detail.

A frame of the speech signal is multiplied by a predetermined window signal, and the normalized autocorrelation function of the windowed signal is computed (step 410). Candidate fundamental frequencies are determined from the normalized autocorrelation function of the windowed signal (step 420); specifically, they are determined from the peaks of that function that exceed a predetermined first threshold (TH1). The periods of the determined candidate fundamental frequencies, and the period evaluation values (pr) that represent the periodicity of those periods, are then interpolated (step 430). The fundamental frequency is derived from the candidate fundamental frequencies estimated from the peaks of the normalized autocorrelation function of the windowed signal.
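Steps 410 and 420 can be illustrated with the sketch below. The text only says "a predetermined window signal", so the Hann window is an assumption, as are the threshold value `th1`, the search range `f_min` to `f_max`, and the function names. The interpolation of step 430 is omitted.

```python
import math

def normalized_autocorrelation(frame):
    """Return r[lag] = R(lag) / R(0) for a Hann-windowed frame (window choice is an assumption)."""
    n = len(frame)
    if n < 2:
        return [0.0] * n
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)))
                for k, s in enumerate(frame)]
    r0 = sum(x * x for x in windowed)
    if r0 == 0.0:
        return [0.0] * n
    return [sum(windowed[k] * windowed[k + lag] for k in range(n - lag)) / r0
            for lag in range(n)]

def candidate_f0s(frame, sample_rate, th1=0.3, f_min=60.0, f_max=400.0):
    """Return (f0_hz, peak_value) for local autocorrelation peaks above th1."""
    r = normalized_autocorrelation(frame)
    lo = max(1, int(sample_rate / f_max))           # shortest lag searched
    hi = min(len(r) - 2, int(sample_rate / f_min))  # longest lag searched
    out = []
    for lag in range(lo, hi + 1):
        is_peak = r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]
        if is_peak and r[lag] > th1:
            out.append((sample_rate / lag, r[lag]))
    return out
```

Because the window tapers the frame, peaks at multiples of the true period are progressively attenuated, so the peak value already carries some information about which candidate is the most plausible.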

Based on the interpolated period evaluation values (pr), the candidate fundamental frequencies whose interpolated period evaluation value is equal to or greater than a second threshold (TH2) are selected (hereinafter called anchor fundamental frequencies), and Gaussian distributions are generated for the anchor fundamental frequencies (step 440). Among the generated Gaussian distributions, those lying within a distance of a third threshold (TH3) or less are merged (clustered) into merged Gaussian distributions, and at least one merged Gaussian distribution whose likelihood exceeds a fourth threshold (TH4) is selected (step 450).
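The text does not specify how the Gaussian distributions are parameterized, how the distance between them is measured, or how the likelihood is computed. The sketch below therefore substitutes a simple greedy one-dimensional clustering for steps 440 and 450: anchors within `th3` Hz of a running cluster mean are merged, and the share of anchors in a cluster is used as a stand-in for its likelihood. All names and threshold values are hypothetical.

```python
def cluster_anchors(candidates, th2=0.7, th3=15.0, th4=0.2):
    """Greedy 1-D clustering of anchor F0s (a stand-in for merging Gaussians).

    candidates: list of (f0_hz, periodicity) pairs pooled over frames.
    Returns (mean_hz, std_hz, weight) for each cluster whose weight exceeds th4.
    """
    # Anchors: candidates whose periodicity value is at least th2.
    anchors = sorted(f0 for f0, pr in candidates if pr >= th2)
    if not anchors:
        return []
    clusters = [[anchors[0]]]
    for f0 in anchors[1:]:
        mean = sum(clusters[-1]) / len(clusters[-1])
        if f0 - mean <= th3:          # close enough: merge into current cluster
            clusters[-1].append(f0)
        else:                         # too far: start a new cluster
            clusters.append([f0])
    merged = []
    for members in clusters:
        mean = sum(members) / len(members)
        var = sum((m - mean) ** 2 for m in members) / len(members)
        weight = len(members) / len(anchors)  # assumed proxy for likelihood
        if weight > th4:
            merged.append((mean, var ** 0.5, weight))
    return merged
```

The effect is that isolated anchors, for example an octave error seen in only a few frames, are discarded before tracking.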

Based on the candidate fundamental frequencies determined from the peaks of the normalized autocorrelation function of the windowed signal and on the selected merged Gaussian distributions, dynamic programming is executed over the candidate fundamental frequencies of each frame of the speech signal (step 460). While dynamic programming runs, a distance value is stored for each candidate fundamental frequency of each frame; dynamic programming is run up to the last frame (N), and the candidate with the largest distance value is traced as the fundamental frequency of the last frame. The fundamental frequency of each frame is then determined from the candidate fundamental frequencies on the path with the largest distance value.
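The tracking in step 460 can be sketched as a Viterbi-style search that keeps, for every candidate, the best accumulated value and a back-pointer, then backtracks from the best candidate in the last frame. The scoring used here (a local score per candidate minus a penalty on the log-frequency jump between frames) is an assumption, since the text does not define the distance value, and the Gaussian term is left out for brevity.

```python
import math

def track_f0(frames, jump_penalty=2.0):
    """Pick one F0 per frame by maximizing an accumulated score.

    frames: list of non-empty lists of (f0_hz, local_score) candidates.
    Returns the list of chosen F0 values, one per frame.
    """
    if not frames:
        return []
    # best[t][k] = (accumulated score, back-pointer) for candidate k of frame t
    best = [[(score, -1) for _, score in frames[0]]]
    for t in range(1, len(frames)):
        row = []
        for f0, score in frames[t]:
            options = [
                (best[t - 1][p][0] - jump_penalty * abs(math.log(f0 / prev_f0)), p)
                for p, (prev_f0, _) in enumerate(frames[t - 1])
            ]
            acc, back = max(options)
            row.append((acc + score, back))
        best.append(row)
    # Backtrack from the highest-scoring candidate in the last frame.
    k = max(range(len(best[-1])), key=lambda c: best[-1][c][0])
    path = []
    for t in range(len(frames) - 1, -1, -1):
        path.append(frames[t][k][0])
        k = best[t][k][1]
    return path[::-1]
```

With a transition penalty of this kind, a candidate with a slightly higher local score is still rejected when it would require an octave jump from its neighbours.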

According to the present invention, first, when the method is used in a speech synthesizer to produce synthesized speech in a variety of timbres, sentences can be synthesized in timbres that differ in age, gender, and so on, according to the purpose of the synthesized speech. For example, the voice of a young woman can be used to emphasize information delivery, and the voice of a child can be used to emphasize friendliness.

Second, the invention can be used in broadcast media to reproduce the voice timbre of celebrities who are no longer living.

Third, in multimedia chatting programs and the like, various voice timbres can be used in place of text to satisfy users' needs.

Fourth, the invention can be applied to speech aids for people with impaired speech organs.

Claims

A method of generating synthesized speech in which a speech frame of an original speaker, extracted frame by frame, is timbre-converted into a speech frame of a target speaker, the method comprising:

(a) generating, for each phoneme type, a codebook mapping table in which a block group containing a plurality of field blocks, each consisting of an index field for a target-speaker code vector and a mapping count field, is accessed by the index of a code vector in the original speaker's codebook;

(b) determining candidate fundamental frequencies of the speech frame from the peak values of a normalized autocorrelation function of the speech frame, and determining the fundamental frequency of each speech frame by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies;

(c) determining the phoneme type from the original speaker's speech frame based on the fundamental frequency;

(d) converting the original speaker's speech frame into LSP coefficients;

(e) searching the original speaker's codebook for the phoneme type determined in step (c) and determining the index of the code vector most similar to the LSP coefficients;

(f) accessing the codebook mapping table with the code vector index determined in step (e) and converting to a code vector of the target speaker; and

(g) generating timbre-converted synthesized speech from the target-speaker code vector obtained in step (f).

The method of claim 1, wherein step (b) comprises:

(b1) multiplying the frame of the speech signal by a window signal W(t), computing a normalized autocorrelation function of the windowed signal, and determining candidate fundamental frequencies from the peak values of that function;

(b2) interpolating the periods of the determined candidate fundamental frequencies and the period evaluation values representing the periodicity of those periods;

(b3) generating Gaussian distributions for the candidate fundamental frequencies of each frame whose interpolated period evaluation value is equal to or greater than a first threshold TH1;

(b4) generating merged Gaussian distributions by merging those Gaussian distributions that lie within a distance of a second threshold TH2 or less, and selecting, from the merged Gaussian distributions, at least one whose likelihood exceeds a third threshold TH3; and

(b5) determining the fundamental frequency of each frame by executing dynamic programming on the frames based on the candidate fundamental frequencies of each frame and the selected merged Gaussian distributions.

The timbre conversion method using phoneme-specific codebook mapping, wherein the target-speaker code vector obtained in step (f) is given by

Equation

where i is the index of an original-speaker code vector; cvAs(i) is the i-th code vector of the original speaker; j is the number of field blocks; cvAt(i_0), …, cvAt(i_{j-1}) are the target-speaker code vectors corresponding to cvAs(i); i_0, …, i_{j-1} are the indices of those target-speaker code vectors; Freqcv(i_0), …, Freqcv(i_{j-1}) are the mapping counts of the target-speaker code vectors corresponding to cvAs(i); and Ft(i) is the sum of Freqcv(i_0), …, Freqcv(i_{j-1}).

A method of generating a codebook mapping table for converting a speech frame of an original speaker, extracted frame by frame, into a speech frame of a target speaker, the method comprising:

(a) initializing a codebook mapping table structured so that a block group containing a plurality of field blocks, each consisting of an index field and a mapping count field, is accessed by the index of a code vector in the original speaker's codebook;

(b) determining candidate fundamental frequencies of each speech frame from the peak values of a normalized autocorrelation function of each speech frame of the original speaker and the target speaker pronouncing the same phoneme, and determining the fundamental frequency of each speech frame of the original speaker and the target speaker for that phoneme by executing dynamic programming on the speech frames according to the candidate fundamental frequencies and merged Gaussian distributions generated from the candidate fundamental frequencies;

(c) performing linear predictive analysis on the fundamental frequency of the original speaker and on that of the target speaker, and converting the results into first LSP coefficients and second LSP coefficients, respectively;

(d) finding, in the original speaker's codebook, the first code vector most similar to the first LSP coefficients and determining its index, and finding, in the target speaker's codebook, the second code vector most similar to the second LSP coefficients and determining its index;

(e) incrementing by one the value of the mapping count field that corresponds to the index of the second code vector, within the block group of the codebook mapping table that corresponds to the index of the first code vector; and

(f) repeating steps (b) to (e) a predetermined number of times.

The method of claim 4, wherein the determining of the fundamental frequency in step (b) comprises:

(b1) multiplying the speech frame by a window signal W(t), computing a normalized autocorrelation function Ro(i) of the windowed signal Sw(t), and determining candidate fundamental frequencies for the speech frame from the peak values of that function;

(b3) generating Gaussian distributions for the candidate fundamental frequencies whose interpolated period evaluation value is equal to or greater than a first threshold TH1;

(b5) determining the fundamental frequency of each frame by executing dynamic programming on the frames based on the candidate fundamental frequencies of each frame and the selected merged Gaussian distributions.

The method of claim 4, wherein step (e) comprises:

(e1) locating, in the codebook mapping table, the block group corresponding to the index of the first code vector; and

(e2) if the block group located in step (e1) contains an index field holding the index of the second code vector, incrementing the value of the corresponding mapping count field by one, and otherwise writing the index of the second code vector into an unassigned index field of that block group and setting the corresponding mapping count field to 1.

The method of claim 4, further comprising:

(g) keeping, in each block group, only a predetermined number of field blocks chosen in descending order of the value in the mapping count field, and deleting the other field blocks.