TW200935399A - Chinese-speech phonologic transformation system and method thereof - Google Patents

Chinese-speech phonologic transformation system and method thereof Download PDF

Info

Publication number
TW200935399A
TW200935399A TW97103905A
Authority
TW
Taiwan
Prior art keywords
phonetic
speech
pitch
conversion
chinese
Prior art date
Application number
TW97103905A
Other languages
Chinese (zh)
Other versions
TWI350521B (en)
Inventor
Chung-Hsien Wu
Mai-Chun Lin
Chi-Chun Hsia
Original Assignee
Univ Nat Cheng Kung
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Cheng Kung filed Critical Univ Nat Cheng Kung
Priority to TW97103905A priority Critical patent/TW200935399A/en
Publication of TW200935399A publication Critical patent/TW200935399A/en
Application granted granted Critical
Publication of TWI350521B publication Critical patent/TWI350521B/zh

Links

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese-speech phonologic transformation system, which includes the following: a phonologic analysis unit used to receive a source speech and corresponding source characters to generate a synthetic speech, wherein the phonologic analysis unit comprises a hierarchy deconstructing module used to define the source characters by plural hierarchies, each hierarchy having one pitch model, and to generate phonologic parameters of the pitch model according to the source speech; a function selection module determining individual phonologic transformation function according to the pitch model and the corresponding phonologic parameters; a phonologic transformation module carrying out the phonologic transformation functions to respectively transform individual phonologic parameter into individual synthetic phonologic parameters; and a synthesis module executing corresponding pitch models according to individual synthetic phonologic parameter to generate the synthetic speech. The invention also discloses a Chinese-speech phonologic transformation method.
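The structure recited in the abstract (per-level pitch models, per-level conversion functions, synthesis from the converted parameters) can be sketched in code. This is a hypothetical illustration only — the class names, the selection callback, and the toy conversion below are not from the patent:

```python
# Hypothetical sketch of the claimed data flow; names and the toy
# conversion function are invented for illustration.

class PitchLevel:
    def __init__(self, name, params):
        self.name = name      # hierarchy name, e.g. "sentence" or "word"
        self.params = params  # prosodic parameters of this level's pitch model

def transform(levels, pick_function):
    """Select a conversion function per level and convert its parameters."""
    converted = {}
    for level in levels:
        f = pick_function(level.name)      # function-selection step
        converted[level.name] = [f(p) for p in level.params]
    return converted

levels = [PitchLevel("sentence", [200.0, -0.05]),
          PitchLevel("word", [0.3, -0.1, 0.02])]
converted = transform(levels, lambda name: (lambda p: p + 10.0))
print(sorted(converted))   # ['sentence', 'word']
```

The synthesis step of the claim would then execute each level's pitch model on its converted parameters; it is omitted here.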

Description

200935399

IX. Description of the Invention:

[Technical Field]

The present invention relates to a Chinese-speech phonologic transformation system and method, and more particularly to a Chinese-speech phonologic transformation system and method for computer-synthesized speech.

[Prior Art]

Speech technology is a key to the next generation of computing. As sensor networks, control systems, and pattern recognition mature, systems such as smart homes and medical services have entered the application-development stage. All such services, however, ultimately come back to the user, so human-computer interaction has become the key to practical deployment, and speech is the most direct interface between user and computer.

Speech technology comprises two major areas: speech recognition and speech synthesis. In speech recognition, a computer analyzes speech produced by a user and identifies the commands it contains; in speech synthesis, a computer digitally produces output speech that approximates a human voice. Unit-selection-and-concatenation synthesis systems can already produce quite natural computer speech, including rich emotional speech with the varied emotional expression of different speakers.

In speech synthesis, voice conversion (VC) across emotional expressions has mostly been used as a tool for speaker conversion. Most techniques require recording so-called parallel corpora to establish the correspondence between a source speaker and a target speaker, after which a conversion model is trained and applied. For the prosodic (pitch) part of the conversion, the main approaches are linear mapping, vector quantization, the Gaussian Mixture Model (GMM), and the Classification and Regression Tree (CART).

Linear mapping uses the pitch means and variances of the source and target speech in the training corpus to perform a linear estimate. It is the most simplified approach, so the converted prosody cannot reflect how stress and pitch movement should vary with position in a sentence. For example, the stress on the same word differs at the beginning, middle, or end of a sentence, even under the same emotional expression.

In vector quantization, the speech feature vectors of the source and target speech in the training corpus are aligned, and a codebook is built in which each codeword corresponds to a target speech feature vector. At conversion time, the codeword corresponding to each source feature vector is found and replaced one by one with the corresponding target feature vector, which is then converted back to a speech signal for output. The method is simple, but the quantization yields discontinuous conversion results, so both the conversion effect and the resulting speech quality are limited.

The GMM approach assumes that the feature vectors of the source and target speech follow a joint distribution.
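As a minimal sketch of the linear-matching baseline described above — a global shift-and-scale of the source pitch so that its mean and variance match the target statistics — assuming the four statistics have already been estimated from the training corpus (all numbers below are illustrative):

```python
def linear_pitch_map(src_f0, src_mean, src_std, tgt_mean, tgt_std):
    """Global linear (mean/variance) pitch mapping; one value per frame."""
    scale = tgt_std / src_std
    return [tgt_mean + (f - src_mean) * scale for f in src_f0]

# toy source contour in Hz with assumed corpus statistics
out = linear_pitch_map([180.0, 200.0, 220.0],
                       src_mean=200.0, src_std=20.0,
                       tgt_mean=260.0, tgt_std=30.0)
print(out)   # [230.0, 260.0, 290.0]
```

The same transform is applied at every position in the sentence, which is exactly why, as the text notes, position-dependent stress and pitch movement cannot be reproduced by this baseline.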

From this joint distribution a conversion function is derived. The approach was first applied to spectral conversion and only later to prosody. In the GMM approach, spectral and prosodic parameters are combined into a single feature vector before GMM training, which accounts for the correlated variation between spectrum and pitch; however, frame-alignment problems introduce discontinuities that degrade speech quality, and the joint modeling constrains both spectrum and prosody. Some researchers have instead analyzed the prosodic part separately and trained a GMM on prosodic units alone; this does resolve the discontinuity, but it is still limited by the definition of the conversion model, so the converted prosody tends to be averaged out.

The CART approach, unlike the GMM, does not consider acoustic information alone: it incorporates linguistic knowledge, taking the relation between text and prosody as its criterion, in the hope of building a more robust conversion model. Studies have shown, however, that CART training requires an even larger training corpus before the resulting conversion model becomes accurate. Corpus collection has always been critical to this technology, and the need for large speech corpora is the bottleneck for practical applications of speech technology.

[Summary of the Invention]

An object of the present invention is to provide a Chinese-speech phonologic transformation system and method that achieve accurate prosody conversion while reducing the corpus requirement.

To achieve the above object, the present invention provides a Chinese-speech phonologic transformation system having a phonologic analysis unit for receiving a source speech and corresponding source text and generating a synthetic speech. The phonologic analysis unit comprises: a hierarchy decomposition module that defines the source text by a plurality of hierarchies, each hierarchy having a pitch model, and generates the prosodic parameters of each pitch model from the source speech; a function selection module that determines a respective prosody conversion function according to each pitch model and its prosodic parameters; a prosody conversion module that executes the conversion functions to transform each prosodic parameter into a corresponding synthetic prosodic parameter; and a synthesis module that executes the corresponding pitch models according to the synthetic prosodic parameters to generate the synthetic speech.

To achieve the above object, the present invention further provides a Chinese-speech phonologic transformation method for receiving a source speech and corresponding source text and generating a synthetic speech, comprising: generating a plurality of pitch models from the source text; generating the prosodic parameters of each pitch model from the source speech; determining a respective prosody conversion function according to the pitch models and their prosodic parameters; executing the conversion functions to transform each prosodic parameter into a corresponding synthetic prosodic parameter; and executing the corresponding pitch models according to the synthetic prosodic parameters to generate the synthetic speech.

In summary, the Chinese-speech phonologic transformation system and method of the present invention can be applied to all kinds of one-way and two-way human-machine communication systems. The hierarchical prosody model not only matches the prosodic structure of human articulation but also reduces the variability of the prosodic parameters; combined with the regression-based clustering algorithm, it effectively reduces the prediction error of parameter conversion, making prosody conversion more accurate. Applied to multi-speaker or emotionally rich computer speech synthesis, it reduces the demand for speech corpora.

The foregoing objects and features of the present invention are described in detail with reference to the accompanying drawings; the description and examples are illustrative only and do not limit or narrow the invention.

[Embodiments]

Although the present invention is fully described with reference to drawings of its preferred embodiments, it should be understood that those skilled in the art may modify the invention described herein while still obtaining its effects. The following description is therefore a broad disclosure to those skilled in the art, and its content does not limit the invention.

The present invention is a Chinese-speech phonologic transformation system mainly comprising a hierarchical prosody analysis structure, a regression-based clustering algorithm, and a CART-based mechanism for selecting prosody conversion functions. The system takes as input an arbitrary Chinese speech sentence together with its text content and syllable breakpoints, and the hierarchical prosody analysis decomposes the prosody into prosodic parameters at the sentence, word, and sub-syllable levels. The conversion functions are trained by the regression-based clustering algorithm, which jointly considers the prediction error of the prosodic parameters and the similarity of linguistic features. Based on the CART, the conversion functions are then selected using the prosodic parameters of the source speech. The prosodic parameters of each level are converted by the selected functions and merged to synthesize the converted speech.

Regarding prosody in general, the factor that most directly affects the rhythm of a whole sentence is pitch. Its variation comes from the composition of words and sentences, the sentence type, and physiological constraints, and may also come from phonetic variation such as individual speaker habits, so these various factors interact with one another. The prosody conversion functions of the present invention are estimated from parallel training corpora by the regression-based clustering algorithm; in an embodiment of the invention, spectral conversion functions are also established to handle the conversion of the spectrum. An input sentence, with its text content and syllable break points, is first decomposed and quantized by the hierarchical prosody model to obtain prosodic parameters, and the best conversion functions are then selected through the CART for conversion. The development platform of the present invention is a personal computer running the Microsoft Windows operating system, with Microsoft Visual C++ as the development tool.

The speech database used in the following embodiments is an emotional parallel corpus: each sentence is recorded by the speaker in both a neutral and a target-emotion version. In an embodiment of the present invention, the database contains 120 happy sentences, 110 sad sentences, and 115 angry sentences.
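The parallel corpus just described can be represented as paired recordings of the same sentence. The sentence identifiers and file names below are hypothetical; only the per-emotion counts come from the embodiment:

```python
# Hypothetical layout of the emotional parallel corpus: each entry pairs a
# neutral recording with the target-emotion recording of the same sentence.
corpus = {
    "happy": [("sent_%03d" % i, "neutral_%03d.wav" % i, "happy_%03d.wav" % i)
              for i in range(120)],
    "sad":   [("sent_%03d" % i, "neutral_%03d.wav" % i, "sad_%03d.wav" % i)
              for i in range(110)],
    "angry": [("sent_%03d" % i, "neutral_%03d.wav" % i, "angry_%03d.wav" % i)
              for i in range(115)],
}
sizes = {emotion: len(pairs) for emotion, pairs in corpus.items()}
print(sizes)   # {'happy': 120, 'sad': 110, 'angry': 115}
```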

These were recorded by a female announcer at a sampling rate of 22.05 kHz with 16-bit resolution.

Referring to the first figure, a system architecture diagram of a Chinese-speech phonologic transformation system 10 according to an embodiment of the present invention: the system 10 receives a source speech and its corresponding source text, analyzes the source speech according to the linguistic features of the source text, and re-synthesizes the source speech into synthetic speech with a different emotional expression. The system 10 comprises a speech analysis unit 20, a prosody analysis unit 30, a spectrum analysis unit 40, and a speech synthesis unit 50.

The speech analysis unit 20 receives the source speech, extracts its prosodic (pitch) part and its spectral part, and provides them to the prosody analysis unit 30 and the spectrum analysis unit 40, respectively. The unit 20 uses the STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum) algorithm, which includes the computation of pitch and spectrum and the synthesis of the speech signal from them. During spectral analysis the unit adapts through the pitch, and for the prosodic (pitch) analysis it uses a Minimum Perturbation Operator to extract the fundamental period in the time domain.

The prosody analysis unit 30 receives the prosodic part of the source speech and the source text corresponding to the source speech, and converts the prosodic part according to the source text to produce a new prosodic part. The unit 30 further comprises a hierarchy decomposition module (Hierarchical Decomposition) 31, a function selection module 32, a prosody conversion module 33, and a prosody conversion function library 34.

The hierarchy decomposition module 31 decomposes and quantizes the prosodic part — in particular the pitch — into pitch models and prosodic parameters at different levels according to the hierarchy of the sentence, which is obtained by text analysis of the received source text (whose content corresponds to the source speech); it then provides the pitch models and prosodic parameters to the function selection module 32. The quantization of the prosodic parameters approximates the pitch trajectory of the prosodic part with the curve of a pitch model, computing the model coefficients from the continuous pitch-trajectory values; through the hierarchical decomposition, the prosodic part (pitch trajectory) of the source speech is quantized into prosodic parameters at the sentence, word, and sub-syllable levels.

The function selection module 32 uses a classification and regression tree (CART) model to select suitable prosody conversion functions according to the prosodic parameters produced by the module 31 and the linguistic features obtained from the received source text. The prosody conversion module 33 substitutes the prosodic parameters produced by the module 31 into the sentence-, word-, and sub-syllable-level conversion functions selected by the module 32 to obtain the converted synthetic prosodic parameters. The prosody conversion function library 34 stores the conversion functions required by the module 33.

The spectrum analysis unit 40 receives the spectral part of the source speech and converts it into a new spectral part. The unit 40 further comprises a spectrum conversion module 41 and a spectrum conversion function library 42. The module 41 extracts the spectral parameters from the received spectral part and substitutes them into a spectrum conversion function to obtain the converted synthetic spectral parameters; the library 42 stores the conversion function required by the module 41.

The spectrum conversion module 41 uses a spectrum conversion function based on a Gaussian Mixture Model (GMM), of the standard GMM regression form:

F(x) = E[y | x] = Σ_{i=1}^{M} p(c_i | x) · ( E[y | c_i] + Σ_i^{yx} (Σ_i^{xx})^{-1} (x − E[x | c_i]) )

where x is the input spectral feature vector, M is the number of Gaussian components of the conversion function, p(c_i | x) is the posterior probability of the i-th component given x, E[y | c_i] and E[x | c_i] are the expected values of the target and source spectral parameters of component i, Σ_i^{xx} is the covariance matrix of the source spectral parameters, and Σ_i^{yx} is the cross-covariance matrix between the target and source spectral parameters of component i.

Referring to the second figure, a system architecture diagram of the Chinese-speech phonologic transformation system 10 in training mode: the system analyzes the training speech and the target speech to produce the prosody conversion functions and the spectrum conversion function. In training mode, the prosody analysis unit 30 of the system 10 further comprises a function training module (Regression-based Clustering) 35, a function classification module (Supervised CART) 36, and a question module (Question Set) 37. The function training module 35 uses the regression-based clustering algorithm to group the training data and train the prosody conversion functions. The function classification module 36 performs CART training on the training data labeled with class information, finding for each node of the CART the best question — the one that splits the data most correctly, so that each child node has minimum impurity, i.e., contains mostly data of a single class. The question module 37 stores the question set of the CART, for example, "Is the word containing the current sub-syllable a two-character word?"; each question splits the data into a group that satisfies it and a group that does not.
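The GMM conversion function above can be sketched in one dimension, where the covariance terms reduce to scalars. The component parameters below are illustrative values, not trained ones:

```python
import math

def gmm_convert(x, comps):
    """1-D GMM regression: comps = [(weight, mu_x, mu_y, var_xx, cov_yx)]."""
    # component likelihoods w_i * N(x; mu_x_i, var_xx_i)
    dens = [w * math.exp(-(x - mx) ** 2 / (2.0 * vxx)) / math.sqrt(2.0 * math.pi * vxx)
            for (w, mx, my, vxx, cyx) in comps]
    total = sum(dens)
    post = [d / total for d in dens]            # posteriors p(c_i | x)
    # E[y|x] = sum_i p(c_i|x) * (mu_y_i + cov_yx_i / var_xx_i * (x - mu_x_i))
    return sum(p * (my + (cyx / vxx) * (x - mx))
               for p, (w, mx, my, vxx, cyx) in zip(post, comps))

comps = [(0.5, 0.0, 1.0, 1.0, 0.5),
         (0.5, 4.0, 3.0, 1.0, 0.5)]
y = gmm_convert(0.0, comps)
print(y)   # close to 1.0: both components predict y = 1 at x = 0
```

Because the posteriors vary smoothly with x, the mapping is continuous — the property that distinguishes the GMM approach from codebook-based vector quantization.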

In training mode, the spectrum analysis unit 40 of the system 10 further comprises a spectrum training module (Expectation-Maximization (EM) Training) 43, which uses the EM algorithm to estimate the parameters of the statistical model.

As described above, the Chinese-speech phonologic transformation system 10 of an embodiment of the present invention first trains the prosody conversion functions from the received training speech and target speech, classifies them according to the CART, and stores them in the prosody conversion function library 34. When the system 10 receives a source speech and its corresponding source text, it generates the multi-level pitch models and their corresponding prosodic parameters from the source speech, selects suitable prosody conversion functions through the CART, converts the prosodic parameters into synthetic prosodic parameters, and re-synthesizes the converted speech from them.

Because prosody has a hierarchical character in how humans articulate and construct utterances, the present invention represents it in a hierarchical manner: the prosodic units of the higher levels implicitly carry the global trend of the sentence intonation, while the lower levels carry ever finer local pitch variation, so different levels correspond to different prosodic parameters.

Referring to the third figure, a schematic diagram of the prosodic hierarchy in an embodiment of the present invention: the hierarchy decomposition module 31 decomposes the prosodic part — the original pitch — into three levels, namely the sentence level, the word level, and the sub-syllable level, and processes the prosodic parameters of each level separately. Working level by level, the module 31 first separates the global effect of the sentence intonation from the more local effects, and then analyzes the finer variation further down. Moreover, each level may use a different pitch-curve approximation model, so the module 31 can flexibly adjust the granularity of conversion at each level. The pitch contour handled at level i can be expressed as:

f_i(t) = Σ_{j=0}^{ND_i} g_{i,j}(t) · u(t − t_{i,j}) · u(t_{i,j+1} − t)
where f_i(t) is the prosodic (pitch) value at time t for level i, ND_i is the number of nodes at level i, g_{i,j}(t) is the pitch model of the j-th node of that level, and u(·) is a step function: u(t − t_{i,j}) is 0 when t < t_{i,j}, and u(t_{i,j+1} − t) is 0 when t > t_{i,j+1}. Under this windowing, only the pitch values of one interval — the j-th node of level i — are taken at a time. The prosodic parameters corresponding to the pitch model of each level are obtained by analysis; the resulting estimate is then compared with the original actual values to compute the error, i.e., the residual. Finally, the residual is passed to the next level for finer analysis, and is computed as:

f_{i+1}(t) = f_i(t) − g_{i,j}(t)

The prosody conversion functions take sentences, words, and sub-syllables as the units of their respective levels, so the original pitch curve must be quantized before conversion. In an embodiment of the present invention, the pitch models quantize the original pitch curve with Legendre polynomials. A Legendre polynomial expansion approximates and quantizes a curve through mutually orthogonal polynomial bases: according to the chosen order, the input curve is quantized into the coefficients of the bases, and the higher the order, the more accurate the approximation. Approximating the input curve with Legendre polynomials also removes some of the broken, discontinuous pitch-curve errors introduced by pitch analysis.

Referring to the fourth figure, a schematic diagram of the Legendre polynomials used in an embodiment of the present invention: when the order is 4, the estimate takes the form

F(n/N) ≈ Σ_{i=0}^{3} a_i · φ_i(n/N), 0 ≤ n ≤ N

where F is the original input curve, N is the total number of points of the curve, n is the point index, a_i is the coefficient of the i-th basis, and φ_i is the i-th orthogonal polynomial basis (φ_0 constant, φ_1 linear, φ_2 quadratic, φ_3 cubic, mutually orthogonal over the sample points). Each coefficient is obtained by orthogonal projection of the input curve onto the corresponding basis:

a_i = ( Σ_{n=0}^{N} F(n/N) · φ_i(n/N) ) / ( Σ_{n=0}^{N} φ_i(n/N)² )

The sentence level is the top level of the hierarchical prosody model, so it considers the global effects on prosodic variation. Because of human physiological constraints on articulation, the pitch of an utterance tends to decline as the sentence proceeds, so the pitch extracted at the sentence level must account for this intonation factor, i.e., the degree of pitch declination. A linear regression is used to estimate it, and the pitch model of the sentence level is:

f_0(t) = a + b·t

where a represents the average pitch of the sentence and b the degree of declination. The error between the estimate and the actual values (the residual) is then computed and passed down for analysis at the next level. After the corresponding a and b parameters are obtained for the source speech and the target speech respectively, they form the prosodic parameters of the sentence level, and the sentence-level conversion function is trained.

At the word level, the short pauses between words cause the pitch to restart at word boundaries, so the prosodic factor considered is the pitch movement within each word unit — a variation finer than at the sentence level. The pitch model of the word level is therefore built with third-order Legendre polynomial coefficients. In the training mode of the second figure, the corresponding third-order Legendre prosodic parameters are computed for the source speech and the target speech respectively, giving prosodic parameter vectors of dimension 3, with which the function training module 35 trains the word-level conversion functions.

The sub-syllable level is the lowest prosodic unit; its main prosodic variation — the tone movement — is also the part that most directly reflects the speaker's emotion. Chinese is a tonal language: each character corresponds to one syllable, and each syllable is composed of an initial (consonant), a final, and a tone. If each tonal syllable were used as a conversion unit, there would be about 1300 distinct syllables, so a large corpus would have to be collected to train a conversion function for each. Each tonal syllable can instead be defined in the following format:

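The sentence-level declination model and the residual handed down to the word level can be sketched as follows (a minimal illustration; the function and variable names are invented for this example):

```python
import numpy as np

def sentence_level(f0):
    """Fit the sentence-level model F0(t) = a + b*t by simple linear
    regression; 'a' tracks average pitch, 'b' the declination slope.
    The residual is what the lower (word) level goes on to model."""
    t = np.arange(len(f0), dtype=float)
    b, a = np.polyfit(t, f0, 1)          # slope, then intercept
    residual = f0 - (a + b * t)          # passed down to the next level
    return a, b, residual

# declining contour typical of a declarative sentence, with noise
t = np.arange(100)
f0 = 220.0 - 0.3 * t + np.random.RandomState(0).randn(100)
a, b, res = sentence_level(f0)
print(round(b, 2))  # negative slope: pitch declination
```

Because the regression includes an intercept, the residual it hands down is zero-mean, so the lower levels only model what the global trend could not explain.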
Syllable = (C) + (V)V(V, N)

where C denotes the initial (a consonant) and the remainder denotes the final (rhyme): an optional medial, a nucleus vowel, and an optional vowel or nasal ending; the final can be further subdivided into a head and a body. The tone information is described with high (H), low (L), and mid (M) components. Among the four lexical tones, the high tone 1, the rising tone 2, the low tone 3, and the falling tone 4, each tone can be represented as a pair of these components, namely tone 1 (HH), tone 2 (LH), tone 3 (LL), and tone 4 (HL), with the neutral tone written as (MM). Composing syllables from the sub-syllable components H, L, and M not only greatly reduces the amount of corpus required (H, L, and M suffice to describe every tonal syllable), it also allows each syllable to be estimated more accurately.

At the sub-syllable level, therefore, each syllable is cut into the sub-syllables corresponding to its tone, so that every tone is described simply as a combination of these components. The pitch model of each sub-syllable is then built with a 4th-order Legendre polynomial, in the same expansion form as above.

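The tone-to-sub-syllable decomposition described above can be sketched as a lookup table (the mapping follows the text; the helper names are invented for this illustration):

```python
# Sub-syllable tone decomposition: each Mandarin tone is described by a
# pair drawn from {H, L, M}, so the ~1300 tonal syllables reduce to a
# handful of sub-syllable units.
TONE_TO_SUBSYLLABLES = {
    1: ("H", "H"),  # tone 1: high level
    2: ("L", "H"),  # tone 2: rising
    3: ("L", "L"),  # tone 3: low
    4: ("H", "L"),  # tone 4: falling
    5: ("M", "M"),  # neutral tone
}

def split_syllable(syllable, tone):
    """Cut one tonal syllable into its two tone sub-syllables."""
    first, second = TONE_TO_SUBSYLLABLES[tone]
    return [(syllable, first), (syllable, second)]

print(split_syllable("ma", 2))  # [('ma', 'L'), ('ma', 'H')]
```

Each (syllable, component) pair is the unit whose pitch shape is then modeled by the 4th-order Legendre expansion.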
Similarly, in the training mode of the second figure, the 4th-order Legendre polynomial phonologic parameters are computed for the source speech and the target speech respectively, yielding phonologic parameter vectors of dimension 4, on which the function training module 35 trains the corresponding phonologic transformation functions.

The fifth figure gives an example of hierarchical prosodic analysis. As shown, the sentence-level phonologic parameters indicate that the pitch of the sentence rises gradually, the word-level phonologic parameters represent the pitch movement within each word, and the sub-syllable-level phonologic parameters record the variation of the four Chinese tones.

The sixth figure compares the standard deviations of hierarchical and non-hierarchical phonologic parameters. The parameters obtained through hierarchical analysis indeed have smaller standard deviations, showing that the differences among the parameters shrink. The hierarchical analysis therefore separates the global variation progressively from the upper level toward the lower levels, and the standard deviation becomes gradually smaller toward the lowest level.

In the part where the function training module 35 trains the phonologic transformation functions, the amount of data available at the sentence level and the word level is small and the associated linguistic information is also limited, so the transformation functions of these two levels are built with the conventional Gaussian-mixture-model-based (GMM-based) conversion approach.

In the GMM-based conversion approach, the aligned phonologic parameter vectors of the source speech and the target speech are assumed to follow a joint normal distribution. Let x = {x1, x2, ..., xn} be the phonologic parameter sequence of the prosodic part of the source speech, y = {y1, y2, ..., yn} be that of the target speech, and [x^T, y^T]^T be an aligned pair of phonologic parameter vectors, where the vector dimension of both x and y is d. Under the basic assumption, [x^T, y^T]^T follows a joint normal distribution, whose joint probability density function is:

f(z) = f(x, y) = N(z; E[z], Σ), Σ = [ Σxx Σxy ; Σyx Σyy ]

where z = [x^T, y^T]^T, E[z] = [E[x], E[y]]^T is the mean vector of z, and Σ is the covariance matrix of z.
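The joint-normal assumption amounts to estimating a mean vector and a block covariance from aligned source/target parameter vectors. A minimal sketch (illustrative names, not the patent's implementation):

```python
import numpy as np

def joint_gaussian_blocks(X, Y):
    """Estimate E[x], E[y] and the covariance blocks of z = [x; y]
    from aligned source/target phonologic parameter vectors (rows)."""
    Z = np.hstack([X, Y])                      # n x 2d joint samples
    mu = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False)
    d = X.shape[1]
    blocks = {"xx": Sigma[:d, :d], "xy": Sigma[:d, d:],
              "yx": Sigma[d:, :d], "yy": Sigma[d:, d:]}
    return mu[:d], mu[d:], blocks

rng = np.random.RandomState(1)
X = rng.randn(200, 3)
Y = 0.8 * X + 0.1 * rng.randn(200, 3)          # correlated target
mx, my, S = joint_gaussian_blocks(X, Y)
print(S["yx"].shape)
```

The cross-covariance block Σyx is what carries the source-to-target coupling that the conversion function exploits below.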
Let the phonologic transformation function be ŷ = F(x). The goal is to find a transformation function giving the smallest mean square error between the phonologic parameter vector sequence of the target speech and the converted parameter vector sequence, that is, to find F such that ε_mse = E[ ||y − ŷ||^2 ] is minimized. By the minimum mean square error (MMSE) criterion, the error is smallest when ŷ = E[y|x]; in other words, the MMSE transformation function is F(x) = E[y|x], where the conditional probability density of y given x is:
f(y|x) = f(x, y) / f(x) = (2π)^(−d/2) |Σyy − Σyx Σxx^(−1) Σxy|^(−1/2) exp(−Q/2)

Q = ( y − (E[y] + Σyx Σxx^(−1)(x − E[x])) )^T ( Σyy − Σyx Σxx^(−1) Σxy )^(−1) ( y − (E[y] + Σyx Σxx^(−1)(x − E[x])) )

The transformation function finally obtained is therefore:

ŷ = F(x) = E[y] + Σyx Σxx^(−1) (x − E[x])

In practice the phonologic parameters do not follow a single plain normal distribution, and a more refined model should be used to describe them. Accordingly, following the definition of the Gaussian mixture model, the transformation function obtained under the conditional normal distribution of each Gaussian component is weighted by its mixture weight and summed, yielding a finer prediction vector. The final phonologic transformation function of the Gaussian mixture model is:

F(x) = E[y|x] = Σ_{i=1}^{M} p(i|x) · ( E[y|i] + Σyx^(i) (Σxx^(i))^(−1) ( x − E[x|i] ) )

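The GMM conversion function F(x) = Σ_i p(i|x)(E[y|i] + Σyx^(i)(Σxx^(i))^(−1)(x − E[x|i])) can be sketched as follows. This is a toy two-mixture example with invented numbers, not the trained model from the patent:

```python
import numpy as np

def gauss_pdf(x, mu, cov):
    """Multivariate normal density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / \
        np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))

def gmm_convert(x, weights, mu_x, mu_y, S_xx, S_yx):
    """MMSE conversion E[y|x] under a joint GMM: the per-component
    linear predictions are blended by the posteriors p(i|x)."""
    lik = np.array([w * gauss_pdf(x, mx, sxx)
                    for w, mx, sxx in zip(weights, mu_x, S_xx)])
    post = lik / lik.sum()                      # p(i | x)
    y = np.zeros_like(mu_y[0])
    for p, mx, my, sxx, syx in zip(post, mu_x, mu_y, S_xx, S_yx):
        y += p * (my + syx @ np.linalg.solve(sxx, x - mx))
    return y

# toy 2-mixture model in d = 2 (all numbers illustrative)
w = [0.5, 0.5]
mu_x = [np.zeros(2), np.ones(2) * 3]
mu_y = [np.ones(2), np.ones(2) * 4]
S_xx = [np.eye(2), np.eye(2)]
S_yx = [0.5 * np.eye(2), 0.5 * np.eye(2)]
y_hat = gmm_convert(np.zeros(2), w, mu_x, mu_y, S_xx, S_yx)
print(y_hat)
```

With x at the first component's mean, the posterior concentrates on that component and the output lands essentially at its target mean, as expected from the formula.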
p(i|x) = α_i N(x; μ_i, Σ_i) / Σ_{j=1}^{M} α_j N(x; μ_j, Σ_j)

where p(i|x) denotes the posterior probability that x belongs to the i-th mixture.

At the sub-syllable level, in contrast, the amount of data is the largest and more linguistic information is available. Therefore, to reduce the error substantially, the phonologic transformation functions of this level are built by regression-based clustering. Since every regression line produced by the clustering represents one transformation function, at conversion time a source value arrives without a corresponding target value, so an additional mechanism is needed to pick out the matching transformation function. In the training mode shown in the second figure, linguistic information is introduced into the regression clustering: the acoustic similarity and the linguistic similarity together serve as the clustering criterion, and a classification and regression tree is used to train the function selection model.

Let X = {x1, x2, ..., xn} be the phonologic parameter vector sequence obtained from the source speech and Y = {y1, y2, ..., yn} the target phonologic parameter vector sequence, with the vector dimensions of both x and y equal to d. The purpose of linear regression is to find the functional relation between Y and X satisfying:

y = f(x) = β0 + β1 x

The parameters are estimated by the least squares method:

E = Σ_i [ y_i − (β0 + β1 x_i) ]^2
∂E/∂β0 = −2 Σ_i [ y_i − (β0 + β1 x_i) ] = 0
∂E/∂β1 = −2 Σ_i [ y_i − (β0 + β1 x_i) ] x_i = 0

Finally, linear regression parameters are obtained for each of the d dimensions.

Next, regression clustering is performed over the data items. The clustering must consider the combined result of the acoustic (phonetic) similarity and the linguistic similarity. The acoustic similarity is computed as follows. According to the basic assumptions of the linear regression model, a regression analysis must satisfy four conditions: 1. conditional normal distribution; 2. homoscedasticity; 3. linearity; and 4. independence. By the first assumption, once the phonologic parameter vector of a source speech is given, the target vector carries an error term, i.e. y = β1 x + β0 + ε, and the distribution of the error term is assumed normal, so the distribution of the target phonologic parameter vector is also normal. When the similarity of the phonologic parameter vectors is computed, the error is therefore converted into a probability through the conditional normal distribution:

p_Acoustic(y | x, C) = N( y ; β1^(C) x + β0^(C), Σ^(C) )

The seventh figure shows an embodiment of a linguistic feature vector a = [a1, a2, ..., aL]. For each cluster, the probability of the linguistic feature value of every dimension is computed as:

P(a_l | C) = Count(a_l, C) / Count(C)

where Count(a_l, C) denotes the number of occurrences of the l-th dimension's linguistic feature in cluster C, and Count(C) denotes the number of data points in cluster C. Multiplying the probabilities of all dimensions then gives the linguistic similarity:

P_Linguistic(a | C) = Π_{l=1}^{L} P(a_l | C)

Finally, the similarities of the two feature vectors are combined to obtain the clustering criterion:

Sim(C, s) = p_Acoustic(y | x, C)^α · P_Linguistic(a | C)^(1−α)

where s is a data point and α is a weight given according to experimental results. The invention proposes a regression clustering algorithm that performs the clustering of the phonologic transformation functions with the acoustic similarity p_Acoustic(·,·) together with the linguistic similarity P_Linguistic(·,·). The clustering is executed as follows: 1. treat all the data as one cluster, compute the cluster's corresponding regression parameters, and partition the interval of three standard deviations into equal segments to initialize the clusters; 2. compute the similarity of every data point to each cluster and assign the point to the most similar cluster; 3. re-estimate the regression parameters of each cluster; 4. repeat steps 2 and 3 until the parameters no longer change between two successive iterations. The eighth figure is a schematic diagram of the regression clustering algorithm of an embodiment, combining the acoustic-characteristic similarity and the linguistic-characteristic similarity. After clustering, each regression line represents one phonologic transformation function. Because an incoming source value at conversion time has no corresponding target value, the matching transformation function must be predicted from other information; such prediction problems are commonly handled with a tree-structured architecture that produces the corresponding rules.

The classification and regression tree (CART), often adopted for prosody prediction in the speech field, goes a step further and also takes the linguistic information into account. In addition, the structure of the model built by the CART roughly indicates the importance of each attribute, and the rules the tree produces are easy to understand. At the stage where the phonologic transformation functions are built, the correct transformation function is known for every source item and its corresponding target, so under these circumstances the CART can be used to build the correspondence rule of every transformation function, namely a function selection model that predicts the transformation function from the linguistic feature parameters and the phonologic parameters. When the CART function selection model is built, the linguistic information introduced here uses the same linguistic feature parameters as the regression clustering, so the weight of the regression similarity also influences the CART training. The ninth figure is a schematic diagram of a classification and regression tree of an embodiment of the invention.
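The combined clustering criterion Sim(C, s) can be sketched as follows. This is a minimal illustration with diagonal variances and invented cluster statistics, not the patent's trained clusters:

```python
import numpy as np

def acoustic_sim(y, x, beta1, beta0, var):
    """p_Acoustic(y | x, C): Gaussian likelihood of y around the
    cluster's regression prediction beta1*x + beta0 (diagonal variance)."""
    pred = beta1 * x + beta0
    return np.prod(np.exp(-0.5 * (y - pred) ** 2 / var) /
                   np.sqrt(2 * np.pi * var))

def linguistic_sim(a, counts, total):
    """P_Linguistic(a | C) = prod_l Count(a_l, C) / Count(C)."""
    return np.prod([counts[l].get(v, 0) / total for l, v in enumerate(a)])

def combined_sim(y, x, a, cluster, alpha=0.5):
    """Sim(C, s) = p_Acoustic^alpha * P_Linguistic^(1 - alpha)."""
    pa = acoustic_sim(y, x, cluster["b1"], cluster["b0"], cluster["var"])
    pl = linguistic_sim(a, cluster["counts"], cluster["total"])
    return pa ** alpha * pl ** (1 - alpha)

# one invented cluster: unit-slope regression line, 10 member points
cluster = {"b1": 1.0, "b0": 0.0, "var": np.array([1.0]),
           "counts": [{3: 8}, {1: 6}], "total": 10}
s = combined_sim(np.array([0.1]), np.array([0.0]), [3, 1], cluster)
print(s)
```

In the clustering loop, each data point would be scored this way against every cluster and assigned to the one with the highest Sim value.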
In the ninth figure, "Tone = 3" indicates whether the tone of the current sub-syllable is tone 3; "Is next Tone = 1" indicates whether the next tone is tone 1; "Is Word len = 2" indicates whether the word containing the current sub-syllable is two characters long; "Is right P.M. = '?'" indicates whether the punctuation mark to the right of the current syllable is a "?"; and "Is Initial = 6" indicates whether the Initial part of the syllable containing the current sub-syllable belongs to the 6th class. In the model built by the classification and regression tree, every leaf node corresponds to one phonologic transformation function, and the set of splits taken from the root node down to a leaf node is the corresponding rule of that transformation function.

The steps of the classification and regression tree algorithm include: 1. take the whole data set as the training sample set; 2. split the training data according to the splitting criterion, producing the first split point; 3. use the test sample data to verify whether this split is the best split, that is, search the test samples for counterexamples to the decision; 4. if counterexamples exist, move them into the training sample set and repeat the procedure; otherwise end the training and return the tree. Splitting stops when: 1. the number of data items in a node is one, or fewer than a fixed amount; 2. every data item in the node has the same predicted value; or 3. further branching improves the homogeneity only marginally. The splitting criterion differs from algorithm to algorithm; the aim here is to make every node as pure as possible, that is, the data points inside each node should belong to the same transformation function, so the splitting principle adopted is C4.5, an improved version of the ID3 algorithm. The splitting criterion is based on entropy, namely the gain ratio, computed as follows:

Gain Ratio = ( Entropy_before − Weighted Entropy_after ) / Split Gains

where the entropy is computed as

Entropy = − Σ_{i=1}^{n} p_i · log(p_i)

the weight is given by (number of data items in the child node / number of data items in the parent node), and

Split Gains = − Σ_j p_j · log(p_j), with p_j = (data number in node j) / (data number in the parent node).

Therefore, the larger the gain ratio, the purer the nodes after the split, so at every split the candidate with the largest gain ratio is chosen. The tenth figure shows characteristic statistics of the speech database.

After the models are built, in the conversion stage the phonologic parameters and linguistic parameters of an incoming source speech are run through the prediction of the classification and regression tree, which selects the corresponding phonologic transformation function for that source speech. The mean square error (MSE) between the predicted phonologic parameters and the target parameters is measured as the objective evaluation criterion:

MSE = (1/M) Σ_{m=0}^{M−1} ( y_m − ŷ_m )^2

where M is the number of phonologic parameter vectors and y_m and ŷ_m denote the target and the predicted phonologic parameter vectors respectively; the smaller the mean square error, the better the conversion effect.
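The gain-ratio split score can be sketched as follows (a minimal C4.5-style computation; base-2 logarithms are assumed, which the source does not specify):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log(c / n, 2) for c in Counter(labels).values())

def gain_ratio(parent, children):
    """C4.5-style split score: information gain over split information.
    parent: label list of the node; children: label lists after the split."""
    n = len(parent)
    weighted = sum(len(ch) / n * entropy(ch) for ch in children)
    gain = entropy(parent) - weighted
    split_info = -sum(len(ch) / n * math.log(len(ch) / n, 2)
                      for ch in children if ch)
    return gain / split_info if split_info else 0.0

# a split that perfectly separates two conversion functions
parent = ["f1"] * 4 + ["f2"] * 4
print(gain_ratio(parent, [["f1"] * 4, ["f2"] * 4]))  # 1.0
```

A perfect split scores 1.0 here; a split that leaves everything in one child contributes no information and scores 0.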
As shown in the eleventh figure, the mean square error tends to decrease as the weight α increases, up to about α = 0.6; beyond that point the improvement is less evident, and the effect is least pronounced for the angry emotion. The prediction error is largest for sadness, while happiness shows a smaller error. Referring to the twelfth figure, the accuracy of transformation-function selection rises as the weight α increases, with the angry emotion giving the worst selection accuracy. To compare the influence of the number of clusters on the conversion method, the thirteenth figure shows that for the regression-clustering-based conversion the prediction error decreases as the number of clusters increases, so the regression-based method effectively lowers the prediction error of the phonologic parameters. The fourteenth figure compares the prediction errors of the regression-clustering and Gaussian-mixture-model approaches at conversion time: given the same source parameter values, the regression clustering algorithm proposed by the invention further reduces the conversion error. Referring to the fifteenth figure, under all three emotions the proposed method performs better and has a lower prediction error.

The invention is chiefly applicable in combination with all kinds of single- or two-way human-machine communication systems. The hierarchical prosodic model not only matches the prosodic structure of human articulation but also reduces the degree of variation of the phonologic parameters; together with the regression clustering algorithm, which lowers the prediction error of parameter conversion, it makes the prosody conversion more accurate. Applied to multi-speaker or emotionally rich computer speech synthesis systems, it can reduce the demand for speech corpora.
Combined further with various electronic education, information exchange, or information delivery devices (for example electronic storybooks, airline reservation systems, and information query systems), valuable information products can be created.

Having described the preferred embodiments of the invention in detail, those skilled in the art will clearly understand that various changes and modifications can be made without departing from the scope and spirit of the following claims, and the invention is not limited to the embodiments set forth in the specification.

【Brief Description of the Drawings】

The first figure is a system architecture diagram of a Chinese speech phonologic transformation system according to an embodiment of the invention; the second figure is a system architecture diagram of the Chinese speech phonologic transformation system of an embodiment as used for training the transformation functions; the third figure is a schematic diagram of the hierarchy of the prosodic part according to an embodiment; the fourth figure is a schematic diagram of the Legendre polynomials according to an embodiment; the fifth figure is a schematic diagram of hierarchical prosodic analysis according to an embodiment; the sixth figure is a comparison diagram of the standard deviations of hierarchical and non-hierarchical phonologic parameters according to an embodiment; the seventh figure is a schematic diagram of a linguistic feature vector according to an embodiment; the eighth figure is a schematic diagram of the regression clustering algorithm according to an embodiment; the ninth figure is a schematic diagram of the classification and regression tree according to an embodiment; the tenth figure is a chart of characteristic statistics of the speech database according to an embodiment; the eleventh figure is a schematic diagram of the influence of the weight α on the prediction error of the phonologic parameters according to an embodiment; the twelfth figure is a schematic diagram of the influence of the weight α on the selection accuracy of the phonologic transformation functions according to an embodiment; the thirteenth figure is a schematic diagram of the influence of the number of clusters on the prediction error of the phonologic parameters according to an embodiment; the fourteenth figure is a comparison diagram of the prediction errors of the regression-clustering and Gaussian-mixture-model algorithms when used for conversion according to an embodiment; and the fifteenth figure is a comparison diagram of the regression-clustering and Gaussian-mixture-model algorithms according to an embodiment.
Main component symbol description: 10 Chinese speech phonologic transformation system; 20 speech analysis unit; 30 phonologic analysis unit; 31 hierarchy deconstructing module; 32 function selection module; 33 phonologic transformation module; 34 phonologic transformation function library; 35 function training module; 36 function classification module; 37 question module; 40 timbre analysis unit; 41 spectrum conversion module; 42 spectrum conversion function library; 43 spectrum training module; 50 speech synthesis unit.



X. Claims:

1. A Chinese speech phonologic transformation system, comprising a phonologic analysis unit for receiving a source speech and corresponding source text to generate a synthesized speech, the phonologic analysis unit comprising: a hierarchy deconstructing module that defines the source text in a plurality of levels, each level having a pitch model, and that generates the phonologic parameters of each pitch model according to the source speech; a function selection module that determines individual phonologic transformation functions according to the pitch models and the corresponding phonologic parameters; a phonologic transformation module that executes the phonologic transformation functions to convert each phonologic parameter into a respective synthesized phonologic parameter; and a synthesis module that executes the corresponding pitch models according to the synthesized phonologic parameters to generate the synthesized speech.

2. The Chinese speech phonologic transformation system of claim 1, wherein the levels comprise a sentence level, a word level, and a sub-syllable level.

3. The Chinese speech phonologic transformation system of claim 1, wherein the hierarchy deconstructing module builds the pitch models of the levels using Legendre polynomials.
4. The Chinese speech phonologic transformation system of claim 1, wherein the hierarchy deconstructing module builds the pitch models of the levels using different pitch-curve approximation models respectively.
5. The Chinese speech phonologic transformation system of claim 2, wherein in the hierarchy deconstructing module the pitch model of the sentence level estimates the degree of pitch declination using simple linear regression.

6. The Chinese speech phonologic transformation system of claim 2, wherein in the hierarchy deconstructing module the phonologic parameters produced by the pitch model of the sentence level are subtracted from the actual values of the sentence-level pitch model to produce a residual, and the residual is used to produce the phonologic parameters of the word level.

7. The Chinese speech phonologic transformation system of claim 2, wherein in the hierarchy deconstructing module the phonologic parameters produced by the pitch model of the word level are subtracted from the actual values of the word-level pitch model to produce a residual, and the residual is used to produce the phonologic parameters of the sub-syllable level.

8. The Chinese speech phonologic transformation system of claim 2, wherein the sub-syllable level uses combinations of the parts H, L, and M to represent a syllable unit.

9. The Chinese speech phonologic transformation system of claim 1, wherein the function selection module determines the phonologic transformation functions using a classification and regression tree (CART).

10. The Chinese speech phonologic transformation system of claim 1, further comprising a function training module for generating the phonologic transformation functions.
11. The Chinese speech phonologic transformation system of claim 10, further receiving a target speech, wherein the function training module compares the phonologic parameters produced from the source speech and the target speech and generates the phonologic transformation functions.

12. The Chinese speech phonologic transformation system of claim 9, further comprising a question module for storing the questions of the classification and regression tree.

13. The Chinese speech phonologic transformation system of claim 1, wherein the phonologic analysis unit further comprises a phonologic transformation function library for storing the phonologic transformation functions.

14. The Chinese speech phonologic transformation system of claim 2, wherein the phonologic transformation functions of the sentence level and the word level are built with a Gaussian-mixture-model-based algorithm.

15. The Chinese speech phonologic transformation system of claim 2, wherein the phonologic transformation functions of the sub-syllable level are built by regression clustering.

16. The Chinese speech phonologic transformation system of claim 1, further comprising a speech analysis unit for deconstructing the source speech into a prosodic part and a spectral part and providing the prosodic part to the phonologic analysis unit.

17. The Chinese speech phonologic transformation system of claim 16, further comprising a timbre analysis unit, the timbre analysis unit comprising a spectrum conversion module and a spectrum conversion function library storing a plurality of spectrum conversion functions, the spectrum conversion module selecting one of the spectrum conversion functions to convert the spectral part.
Converting the respective phonological parameters into their respective synthesized phonological parameters to execute the corresponding pitch model 'to generate the combined i9'm range and the 18th item, the text phonetic phonological conversion method, wherein two; == Genus A plurality of class' class system and such points 20.ίΓίί profit range Chinese phonology speech conversion method, Paragraph 19, of which 2 class system for the sentence strata, sectors and sub-syllable word class. The Chinese phonetic rhyme conversion method of the 18th item, wherein the 曰咼 model is established by using the Lei Jiande polynomial. 31 200935399 22. For the Chinese phonetic phonetic conversion method of claim 20, the pitch level, word class and sub-syllable level pitch models are respectively established using different pitch curve approximation models. 23. For the Chinese phonetic phonetic conversion method of claim 20, the pitch level model of the sentence class uses simple linear regression to estimate the pitch level. 〇24·If the Chinese phonetic phonetic conversion method of the second paragraph of the patent application scope, the pitch parameter generated by the pitch level model of the sentence is used to subtract the actual value of the pitch model of the sentence level layer to generate A residual value is used, and the residual value is used to generate a phoneme parameter of the word hierarchy. 25. ^Declaring the Chinese phonetic phonetic conversion method of the 20th patent range, wherein the phoneme parameters generated by the pitch model of the word hierarchy are used to subtract the actual value of the pitch model of the word hierarchy to generate a residual value. And using the residual value to generate a phonological parameter of the sub-syllable level. 26.2 Qing Chinese Patent Range No. 
2 Chinese speech phonetic rhyme conversion method, in which = one 阶层 阶层 hierarchy, using H, L, M parts to combine to represent a syllable 7G 〇❹27 ΐΐί range of the 18th Chinese speech The rhyme conversion method, in which 28 commutative functions are categorized—classification and regression tree decisions. Step: The Chinese speech phonetic rhyme conversion method of the 18th patent range, the entry-to-house target speech, and the step-by-step comparison of the target speech and the source speech four-digit number, (10) the scalar tone-return function. The 18th Chinese speech phonetic rhyme conversion method, in which the mode is built. The rhyme conversion function is based on the Gaussian mixture model algorithm. 3〇.^Ϊ The patent range of the 18th Chinese phonetic phonetic conversion method, in which the human phonetic rhyme conversion function is established by regression grouping. 32 200935399 31. The Chinese phonetic phonetic conversion method of claim 18, further comprising disassembling the source speech into a phoneme portion and a spectrum portion, and generating a phoneme parameter of the pitch model according to the phoneme portion. . 32. The Chinese phonetic phoneme conversion method of claim 31, further comprising selecting a spectral conversion function to convert the portion of the spectrum into a synthesized spectral parameter. 3333
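Claims 21–25 describe a layered pitch decomposition: a Legendre-polynomial curve is fitted at the sentence level (a simple linear declination), and each lower level models the residual left by the level above. The following is a minimal illustrative sketch of that residual cascade, not the patented implementation; the polynomial orders and the two-level split are assumptions chosen only to show the mechanism.

```python
import numpy as np
from numpy.polynomial import legendre

def fit_legendre(contour, order):
    """Fit a Legendre-polynomial approximation to a pitch contour.

    Returns the approximated contour and its coefficients."""
    x = np.linspace(-1.0, 1.0, len(contour))
    coeffs = legendre.legfit(x, contour, order)
    return legendre.legval(x, coeffs), coeffs

def decompose_pitch(f0):
    """Split an F0 contour into sentence, word and sub-syllable layers."""
    # Sentence level: order-1 fit, i.e. a simple linear declination line.
    sentence, _ = fit_legendre(f0, 1)
    residual1 = f0 - sentence            # residual handed to the word level
    # Word level: a low-order curve fitted to the sentence-level residual
    # (order 3 is an illustrative choice).
    word, _ = fit_legendre(residual1, 3)
    residual2 = residual1 - word         # residual handed to the sub-syllable level
    return sentence, word, residual2
```

By construction the three layers sum back to the original contour (`sentence + word + residual2 == f0`), which is the property that lets each level be converted independently and then recombined.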
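Claims 14 and 29 attribute the sentence- and word-level conversion functions to a Gaussian mixture model algorithm. A standard GMM-based mapping predicts a target parameter as a posterior-weighted sum of per-component linear regressions; the one-dimensional sketch below shows that form under assumed (untrained) component parameters, and is not the trained function of the patent.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, var_x, cov_xy):
    """Map a source phonologic parameter x to the target space.

    weights, mu_x, mu_y, var_x, cov_xy are per-component arrays of the
    joint source/target GMM (illustrative values, not trained ones)."""
    # Posterior probability of each mixture component given x.
    lik = weights * np.exp(-0.5 * (x - mu_x) ** 2 / var_x) / np.sqrt(2.0 * np.pi * var_x)
    post = lik / lik.sum()
    # Per-component linear regression toward the target space.
    pred = mu_y + cov_xy / var_x * (x - mu_x)
    # Blend the component predictions by their posteriors.
    return float(np.dot(post, pred))
```

With a single component the mapping reduces to plain linear regression, which is consistent with the claims reserving a regression method for the sub-syllable level while using the full mixture at the sentence and word levels.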
TW97103905A 2008-02-01 2008-02-01 Chinese-speech phonologic transformation system and method thereof TW200935399A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW97103905A TW200935399A (en) 2008-02-01 2008-02-01 Chinese-speech phonologic transformation system and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW97103905A TW200935399A (en) 2008-02-01 2008-02-01 Chinese-speech phonologic transformation system and method thereof

Publications (2)

Publication Number Publication Date
TW200935399A true TW200935399A (en) 2009-08-16
TWI350521B TWI350521B (en) 2011-10-11

Family

ID=44866577

Family Applications (1)

Application Number Title Priority Date Filing Date
TW97103905A TW200935399A (en) 2008-02-01 2008-02-01 Chinese-speech phonologic transformation system and method thereof

Country Status (1)

Country Link
TW (1) TW200935399A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8706493B2 (en) 2010-12-22 2014-04-22 Industrial Technology Research Institute Controllable prosody re-estimation system and method and computer program product thereof

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI573129B (en) * 2013-02-05 2017-03-01 國立交通大學 Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing
TWI746138B (en) * 2020-08-31 2021-11-11 國立中正大學 System for clarifying a dysarthria voice and method thereof


Also Published As

Publication number Publication date
TWI350521B (en) 2011-10-11

Similar Documents

Publication Publication Date Title
Oord et al. Parallel wavenet: Fast high-fidelity speech synthesis
Hopkins et al. Automatically generating rhythmic verse with neural networks
CN110060690B (en) Many-to-many speaker conversion method based on STARGAN and ResNet
Liu et al. Recent progress in the CUHK dysarthric speech recognition system
JP3412496B2 (en) Speaker adaptation device and speech recognition device
CN112687259B (en) Speech synthesis method, device and readable storage medium
CN115641543B (en) Multi-modal depression emotion recognition method and device
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN110060657B (en) SN-based many-to-many speaker conversion method
Hashimoto et al. Trajectory training considering global variance for speech synthesis based on neural networks
JP2020034883A (en) Voice synthesizer and program
US20220157329A1 (en) Method of converting voice feature of voice
KR102272554B1 (en) Method and system of text to multiple speech
Rhyu et al. Translating melody to chord: Structured and flexible harmonization of melody with transformer
KR20190135853A (en) Method and system of text to multiple speech
Agarla et al. Semi-supervised cross-lingual speech emotion recognition
Wu et al. Speech synthesis with face embeddings
JP7469698B2 (en) Audio signal conversion model learning device, audio signal conversion device, audio signal conversion model learning method and program
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion
Kuan et al. Towards General-Purpose Text-Instruction-Guided Voice Conversion
TW200935399A (en) Chinese-speech phonologic transformation system and method thereof
Zhao et al. Research on voice cloning with a few samples
Mei et al. A particular character speech synthesis system based on deep learning
Chen et al. Speaker-independent emotional voice conversion via disentangled representations
JP6864322B2 (en) Voice processing device, voice processing program and voice processing method

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees