TWI467566B - Polyglot speech synthesis method - Google Patents

Polyglot speech synthesis method Download PDF

Info

Publication number
TWI467566B
Authority
TW
Taiwan
Prior art keywords
language
corpus
pronunciation
synthesis method
speech synthesis
Prior art date
Application number
TW100141766A
Other languages
Chinese (zh)
Other versions
TW201322250A (en)
Inventor
Chung Hsien Wu
Yi Chin Huang
Kuan Te Li
Original Assignee
Univ Nat Cheng Kung
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Cheng Kung filed Critical Univ Nat Cheng Kung
Priority to TW100141766A priority Critical patent/TWI467566B/en
Publication of TW201322250A publication Critical patent/TW201322250A/en
Application granted granted Critical
Publication of TWI467566B publication Critical patent/TWI467566B/en

Links

Description

多語言語音合成方法Multilingual speech synthesis method

本發明係關於一種語音合成方法,特別關於一種多語言語音合成方法。The present invention relates to a speech synthesis method, and more particularly to a multi-language speech synthesis method.

由於語音是人類溝通最直接的媒介，因此以語音來作為人機互動的媒介十分重要。其中已有許多基於語音合成技術的產品應運而生。例如：手機的聲控撥號、文字轉語音系統(Text-to-speech synthesis,TTS system)、即時語音導航系統等許多已經實際應用的商品。Since speech is the most direct medium of human communication, it is very important to use speech as the medium for human-computer interaction. Many products based on speech synthesis technology have emerged, such as voice-activated dialing on mobile phones, text-to-speech synthesis (TTS) systems, real-time voice navigation systems, and many other products already in practical use.

另外，在現今社會中，語者在對話時，常常會使用不同的語言進行溝通。此種現象最常出現於某個特定的字詞，或是短語用另一個語言能夠通順的表達其意涵。例如：幫我取消掉明天的會面→幫我cancel掉明天的會面；這門課不被down我就all pass 了。all pass是英文，對話為了簡略表達經常用這個詞，夾雜在中文句中，句子從中文轉變英文，這稱作code-switching問題。In addition, in today's society, speakers often switch between different languages during conversation. This phenomenon occurs most often when a particular word or phrase expresses its meaning more naturally in another language. For example: 幫我取消掉明天的會面 → 幫我cancel掉明天的會面 (help me cancel tomorrow's meeting); 這門課不被down我就all pass了 (if this course does not fail me, I will have passed everything). Here "all pass" is English; such words are often used in conversation for brevity and embedded in a Chinese sentence, so the sentence switches from Chinese to English. This is known as the code-switching problem.

然而，傳統TTS只能合成單一語言，已經無法滿足使用者需求。對於多語合成系統，一般都使用精通多語的語者錄語料，但這並不如單一語言來得容易蒐集。另一方面，若是結合多種不同語者之語言語料，卻面臨語者特性不連貫的情形，為了處理不連貫問題，又必須做聲音轉換。However, a traditional TTS system can only synthesize a single language and can no longer satisfy user needs. For a multilingual synthesis system, recordings from a speaker fluent in multiple languages are generally used, but such a corpus is not as easy to collect as a single-language one. On the other hand, combining corpora from several different speakers leads to inconsistent speaker characteristics, and voice conversion must then be performed to deal with the inconsistency.

一般而言，在語音合成技術中，是利用模型調適(HMM-based model adaptation)與聲音轉換(voice conversion)。然而，在模型調適中，會遇到跨語言音素不完全相同的問題。而在聲音轉換中，會遇到跨語言音素無法收集到平行語料之問題。In general, speech synthesis technology utilizes HMM-based model adaptation and voice conversion. However, model adaptation encounters the problem that phonemes are not identical across languages, and voice conversion encounters the problem that parallel corpora cannot be collected for cross-language phonemes.

因此，如何提供一種多語言語音合成方法，能夠解決上述問題，進而提升語音合成效能，實為當前重要課題之一。Therefore, how to provide a multilingual speech synthesis method that solves the above problems and improves speech synthesis performance is one of the important current topics.

有鑑於上述課題,本發明之目的為提供一種能夠解決習知問題,進而提升語音合成效能之多語言語音合成方法。In view of the above problems, an object of the present invention is to provide a multi-lingual speech synthesis method capable of solving conventional problems and improving speech synthesis performance.

為達上述目的，依據本發明之一種多語言語音合成方法包含：選取一第一語言為主要語言，並收集該第一語言之一第一語料；選取一第二語言為次要語言，並收集該第二語言之一第二語料；利用第一語料與第二語料將第一語言之複數第一發音單元以及第二語言之複數第二發音單元進行分類，該等第二發音單元包含複數特殊發音單元；以及決定與該等特殊發音單元所對應之第一發音單元。In order to achieve the above object, a multilingual speech synthesis method according to the present invention comprises: selecting a first language as a primary language and collecting a first corpus of the first language; selecting a second language as a secondary language and collecting a second corpus of the second language; classifying, by the first corpus and the second corpus, a plurality of first pronunciation units of the first language and a plurality of second pronunciation units of the second language, the second pronunciation units including a plurality of special pronunciation units; and determining the first pronunciation units corresponding to the special pronunciation units.

在一實施例中，第一語料多於第二語料。In an embodiment, the first corpus is larger than the second corpus.

在一實施例中,多語言語音合成方法更包含:利用第一語料訓練第一語言之一第一語音模型;及利用第二語料訓練第二語言之一第二語音模型。In an embodiment, the multi-lingual speech synthesis method further comprises: training the first speech model of the first language with the first corpus; and training the second speech model of the second language with the second corpus.

在一實施例中,多語言語音合成方法更包含:藉由國際音標(IPA)來分類該等第一發音單元與該等第二發音單元。In an embodiment, the multi-lingual speech synthesis method further comprises: classifying the first pronunciation units and the second pronunciation units by an International Phonetic Alphabet (IPA).

在一實施例中，在決定與該等特殊發音單元所對應之第一發音單元之前，多語言語音合成方法更包含：將第一語料細分成多個第一音框；及將第二語料細分成多個第二音框，使得該等特殊發音單元對應該等第一發音單元係藉由時間序列上之音框對應來進行。In an embodiment, before determining the first pronunciation units corresponding to the special pronunciation units, the multilingual speech synthesis method further comprises: subdividing the first corpus into a plurality of first frames; and subdividing the second corpus into a plurality of second frames, such that the correspondence of the special pronunciation units to the first pronunciation units is performed by frame alignment along the time sequence.

在一實施例中，第一語言之一音素(phone)係分成多個該等第一音框。第二語言之一音素係分成多個該等第二音框。In an embodiment, a phone of the first language is divided into a plurality of the first frames, and a phone of the second language is divided into a plurality of the second frames.

在一實施例中,決定與該等特殊發音單元所對應之第一發音單元之步驟係藉由發音屬性(Articulatory attribute)或聽覺參數(Auditory feature)來進行。In one embodiment, the step of determining the first pronunciation unit corresponding to the particular pronunciation unit is performed by an Articulatory attribute or an Auditory feature.

在一實施例中，將該等特殊發音單元對應該等第一發音單元之後，多語言語音合成方法更包含：藉由對應該等特殊發音單元之該等第一發音單元來訓練第二語言之該等特殊發音單元之語音模型。In an embodiment, after the special pronunciation units are mapped to the first pronunciation units, the multilingual speech synthesis method further comprises: training the speech models of the special pronunciation units of the second language by using the first pronunciation units corresponding to the special pronunciation units.

在一實施例中,多語言語音合成方法更包含:將第二語言之該等特殊發音單元之語音模型加入第一語言之語音模型。In an embodiment, the multi-lingual speech synthesis method further comprises: adding a speech model of the special pronunciation units of the second language to the speech model of the first language.

承上所述，本發明之多語言語音合成方法可解決跨語言音素不完全相同的問題，並可解決無法收集到平行語料之問題，因而產生出任一語者之多語言語音合成。As described above, the multilingual speech synthesis method of the present invention can solve the problem that phonemes are not identical across languages and the problem that parallel corpora cannot be collected, thereby producing multilingual speech synthesis for any speaker.

以下將參照相關圖式，說明依本發明較佳實施例之一種多語言語音合成方法，其中相同的元件將以相同的參照符號加以說明。DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS: A multilingual speech synthesis method according to a preferred embodiment of the present invention will be described below with reference to the related drawings, wherein the same elements are denoted by the same reference numerals.

圖1為本發明較佳實施例之一種多語言語音合成方法的步驟流程圖，其中包含步驟S01：選取一第一語言為主要語言，並收集第一語言之一第一語料；步驟S02：選取一第二語言為次要語言，並收集第二語言之一第二語料；步驟S03：利用第一語料與第二語料將第一語言之複數第一發音單元以及第二語言之複數第二發音單元進行分類，該等第二發音單元包含複數特殊發音(language-specific phone)單元；以及步驟S04：決定與該等特殊發音單元所對應之第一發音單元。以下詳細說明本實施例之多語言語音合成方法，其中第一語言以中文，第二語言以英文為例，但這不用以限制本發明。另外，需注意者，上述步驟S01~S04並不代表其絕對順序，例如可先進行步驟S02再進行步驟S01。FIG. 1 is a flow chart of the steps of a multilingual speech synthesis method according to a preferred embodiment of the present invention, comprising step S01: selecting a first language as the primary language and collecting a first corpus of the first language; step S02: selecting a second language as the secondary language and collecting a second corpus of the second language; step S03: using the first corpus and the second corpus to classify a plurality of first pronunciation units of the first language and a plurality of second pronunciation units of the second language, the second pronunciation units including a plurality of language-specific phone units; and step S04: determining the first pronunciation units corresponding to the special pronunciation units. The multilingual speech synthesis method of this embodiment is described in detail below, taking Chinese as the first language and English as the second language, although this is not intended to limit the present invention. It should also be noted that steps S01 to S04 do not imply an absolute order; for example, step S02 may be performed before step S01.

首先，選取一第一語言為主要語言，並收集第一語言之一第一語料。於此，第一語料係以出自同一語者為例，例如以台灣總統為例。First, a first language is selected as the primary language, and a first corpus of the first language is collected. Here, the first corpus is, for example, from a single speaker, such as the President of Taiwan.

再者，選取一第二語言為次要語言，並收集第二語言之一第二語料。於此，第二語料係以出自同一語者為例，例如以美國總統為例。本實施例之目的即是要產生出以台灣總統的口音說出英語之語音。另外，在本實施例中，作為主要語言之第一語料多於作為次要語言之第二語料。Then, a second language is selected as the secondary language, and a second corpus of the second language is collected. Here, the second corpus is, for example, from a single speaker, such as the President of the United States. The purpose of this embodiment is to produce speech that speaks English with the accent of the President of Taiwan. In addition, in this embodiment, the first corpus of the primary language is larger than the second corpus of the secondary language.

另外，在步驟S01與S02之後，多語言語音合成方法可更包含：利用第一語料訓練第一語言之一第一語音模型；以及利用第二語料訓練第二語言之一第二語音模型。在此態樣中，對中英文進行不同語者之語料庫設計與建構，且利用訓練語料來進行所有的音素聲學模型的訓練，即包含第一語音模型(中文)以及第二語音模型(英文)之訓練。模型訓練包含語音訊號參數化、取得頻譜及音高的參數分析。在上述訓練中，本實施例係使用STRAIGHT分析及合成演算法，此方法可以得到精確的基頻參數以及倒頻譜參數(Cepstral feature)。In addition, after steps S01 and S02, the multilingual speech synthesis method may further include: training a first speech model of the first language using the first corpus; and training a second speech model of the second language using the second corpus. In this aspect, corpora of different speakers are designed and constructed for Chinese and English, and the training corpora are used to train the acoustic models of all phonemes, i.e., the first speech model (Chinese) and the second speech model (English). Model training includes parameterization of the speech signal and analysis of the spectral and pitch parameters. In the above training, this embodiment uses the STRAIGHT analysis and synthesis algorithm, which yields accurate fundamental frequency parameters and cepstral features.
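The patent names STRAIGHT for parameter extraction. As a rough, hedged illustration of this analysis step, the sketch below uses the open-source WORLD vocoder (the `pyworld` package) as a stand-in for STRAIGHT and `pysptk` for the cepstral conversion; the package choice, frame period, analysis order, and warping coefficient are all assumptions for illustration, not part of the patent.

```python
# Hypothetical sketch: extract per-frame F0 and cepstral features for acoustic-model training.
# STRAIGHT is replaced here by the WORLD vocoder (pyworld); pysptk converts the
# spectral envelope into mel-generalized cepstral coefficients (MGC-like features).
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

def extract_features(wav_path, order=24, alpha=0.42, frame_period=5.0):
    x, fs = sf.read(wav_path)                 # assumes a mono waveform
    x = x.astype(np.float64)
    f0, t = pw.harvest(x, fs, frame_period=frame_period)   # fundamental frequency per frame
    sp = pw.cheaptrick(x, f0, t, fs)          # smoothed spectral envelope per frame
    mgc = pysptk.sp2mc(sp, order=order, alpha=alpha)        # envelope -> mel-cepstrum
    return f0, mgc                            # one row per 5 ms analysis frame
```

In practice the resulting F0 and cepstral trajectories would then feed the HMM training that the paragraph above describes.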

接下來進行步驟S03。Next, step S03 is performed.

中文和英文在語言學上的基本發音單元有所差異，前者分在漢藏語系，後者則是印歐語系。故如圖2所示，中文與英文之間共同音只有24個(不考慮中文聲調)，另外有16個中文才有而英文沒有的音素，英文同樣也有16個中文沒有的音素，對於這種因為語言不同而找不到對應的音素，一般稱之為未知音(unseen phone)。未知音係構成特殊發音單元(language-specific phone)的基本音素。Chinese and English differ in their basic linguistic pronunciation units: the former belongs to the Sino-Tibetan language family and the latter to the Indo-European family. As shown in FIG. 2, only 24 phonemes are shared between Chinese and English (not counting Chinese tones); in addition, 16 phonemes exist only in Chinese and not in English, and likewise 16 phonemes exist only in English and not in Chinese. A phoneme for which no counterpart can be found because the languages differ is generally called an unseen phone. The unseen phones constitute the basic phonemes of the language-specific phones.

為了處理未知音的問題，本實施例利用音素(第一發音單元與第二發音單元各可由至少一音素組成)的發音屬性對中英文的音素進行分類。發音屬性(Articulatory Attribute)可包含音素的發音位置(Place Of Articulation)以及發音方式(Manner Of Articulation)，其特性就在於發音位置及發音方式不會因為語言的不同而改變，亦不受語者特性所影響，為一種強健的(robust)特徵參數。語言學者根據每個音素的發音屬性制訂了國際音標(International Phonetic Alphabet,IPA)。本實施例根據傳統語言學家所訂定之規則來建立中英文發音屬性母音表以及子音表，如圖3和圖4所示。於此說明圖4中英文所對應之中文：Bilabial(雙唇音)、Labio-dental(唇齒音)、Dental(齒音)、Alveolar(齒齦音)、Post-Alveolar(後齒齦音)、Palatal(硬顎音)、Velar(軟顎音)、Glottal(喉音)、Plosive(爆音)、Implosive(內爆音)、Fricative(擦音)、Nasal(鼻音)、Trill(顫音)、Lateral(邊音)、Approximant(近音)。而分類的方式採用決策樹的分類方式，決策樹所使用的問題集同樣參考國際音標的訂定方式，具有同樣屬性的音素歸為同類，其中有些問題是針對中英文而定，例如兒化母音，就將英文的/r/和中文的/ㄦ/放在一起，其他像中介音/w/、/y/，則是和有中介音的中文音素例如/ㄓㄨ+*/(*指的是任一種音素，+代表的意思為後面一個音，在此例子中表示"ㄓㄨ"後面接任何一個音都符合這類的分類條件)、/ㄐㄧ+*/分在一起。以下為分類決策樹問題及設計原則：To deal with the problem of unseen phones, this embodiment classifies the Chinese and English phonemes by the articulatory attributes of the phones (the first pronunciation units and the second pronunciation units may each consist of at least one phone). Articulatory attributes can include the place of articulation and the manner of articulation of a phone; their key property is that place and manner of articulation do not change from language to language and are not affected by speaker characteristics, making them robust feature parameters. Linguists formulated the International Phonetic Alphabet (IPA) according to the articulatory attributes of each phone. This embodiment builds Chinese/English articulatory-attribute vowel and consonant tables according to the rules defined by traditional linguists, as shown in FIG. 3 and FIG. 4. The Chinese terms corresponding to the English terms in FIG. 4 are: Bilabial (雙唇音), Labio-dental (唇齒音), Dental (齒音), Alveolar (齒齦音), Post-Alveolar (後齒齦音), Palatal (硬顎音), Velar (軟顎音), Glottal (喉音), Plosive (爆音), Implosive (內爆音), Fricative (擦音), Nasal (鼻音), Trill (顫音), Lateral (邊音), Approximant (近音). Classification is performed with a decision tree whose question set likewise follows the way the IPA is defined, so that phones with the same attributes are grouped into the same class. Some questions are specific to Chinese and English; for example, for the retroflex (erhua) vowel, the English /r/ and the Chinese /ㄦ/ are grouped together, and glides such as /w/ and /y/ are grouped with Chinese phones that contain a medial, such as /ㄓㄨ+*/ (* denotes any phone and + denotes the following sound; in this example, any sound following "ㄓㄨ" satisfies this classification condition) and /ㄐㄧ+*/. The classification decision-tree questions and design principles are as follows:

母音相關(Vowel related)問題：其中包含/a/、/e/、/i/、/o/、/u/等單母音相關問題；母音位置問題，如：前、中、後等；雙母音相關問題；兒化母音等問題。Vowel-related questions: questions about single vowels such as /a/, /e/, /i/, /o/, /u/; vowel position questions such as front, central, and back; diphthong-related questions; and retroflex (erhua) vowel questions.

子音相關(Consonant related)問題：其中有發音位置問題，例如Velar、Coronal…等；發音方式問題：像是擦破音(Plosive)、鼻音(Nasal)…等等。Consonant-related questions: place-of-articulation questions, such as Velar, Coronal, etc.; and manner-of-articulation questions, such as Plosive, Nasal, and so on.

而在決策樹分裂時，除了考慮最短描述距離(minimum description length,MDL)之外，也必須注意到分裂之後至少含有中英文音素最少各一個在決策樹節點內，如圖5所示。於此說明圖5中英文所對應之中文：L_vowel(左邊所接音素是否為母音)、C_Labial(目前的音素是否為唇音)、L_fricative(左邊所接音素是否為擦音)、L_O_vowel(左邊所接音素是否為/O/類母音)。When splitting the decision tree, in addition to considering the minimum description length (MDL), care must be taken that each decision-tree node after the split still contains at least one Chinese phoneme and at least one English phoneme, as shown in FIG. 5. The Chinese corresponding to the English in FIG. 5 is: L_vowel (whether the phoneme attached on the left is a vowel), C_Labial (whether the current phoneme is a labial), L_fricative (whether the phoneme attached on the left is a fricative), L_O_vowel (whether the phoneme attached on the left is an /O/-type vowel).
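The bilingual node constraint described above can be illustrated with a small, hypothetical sketch: a candidate split is accepted only if both child nodes still contain at least one Mandarin and one English phone, and among the valid splits the one with the best description-length gain is kept. The data structures and the `mdl_gain` callback are illustrative assumptions; the patent itself only specifies the MDL criterion over IPA-derived questions.

```python
# Hypothetical sketch of the bilingual split constraint for the phone decision tree.
from dataclasses import dataclass

@dataclass
class Phone:
    symbol: str        # e.g. "/r/" or "/ㄦ/"
    language: str      # "zh" or "en"
    attributes: set    # articulatory-attribute tags, e.g. {"vowel", "retroflex"}

def split_is_valid(node_phones, question):
    """A split is kept only if BOTH children contain phones from BOTH languages."""
    yes = [p for p in node_phones if question in p.attributes]
    no = [p for p in node_phones if question not in p.attributes]
    def bilingual(group):
        return {"zh", "en"} <= {p.language for p in group}
    return bool(yes) and bool(no) and bilingual(yes) and bilingual(no)

def best_split(node_phones, questions, mdl_gain):
    """Pick the question with the best MDL gain among the valid bilingual splits.

    mdl_gain(node_phones, question) is an assumed callback implementing the
    minimum-description-length criterion; its exact form is not given in the patent.
    """
    valid = [q for q in questions if split_is_valid(node_phones, q)]
    if not valid:
        return None                      # node becomes a leaf
    return max(valid, key=lambda q: mdl_gain(node_phones, q))
```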

為解決未知音與特殊發音單元的對應問題，進行步驟S04：決定與該等特殊發音單元所對應之第一發音單元。To solve the correspondence problem of the unseen phones and the special pronunciation units, step S04 is performed: determining the first pronunciation units corresponding to the special pronunciation units.

本實施例之一特點為上述對應係利用單位更小的「音框」(frame)來進行。因此，多語言語音合成方法可更包含：將第一語料細分成多個第一音框；及將第二語料細分成多個第二音框。甚至可將第一語言之一音素(phone)分成多個第一音框，將第二語言之一音素分成多個第二音框。一個音框可例如為5毫秒(ms)。此外，決定與該等特殊發音單元所對應之第一發音單元之步驟係藉由發音屬性或聽覺參數(Auditory feature)來進行。以下說明發音屬性與聽覺參數。One feature of this embodiment is that the above correspondence is carried out using smaller units, namely frames. Therefore, the multilingual speech synthesis method may further include: subdividing the first corpus into a plurality of first frames; and subdividing the second corpus into a plurality of second frames. A phone of the first language may even be divided into a plurality of first frames, and a phone of the second language into a plurality of second frames. A frame may, for example, be 5 milliseconds (ms) long. In addition, the step of determining the first pronunciation units corresponding to the special pronunciation units is performed by means of articulatory attributes or auditory features, which are explained below.
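As a small illustration of the 5 ms subdivision mentioned above, the sketch below cuts a waveform into fixed-length frames; the sampling rate and the choice of non-overlapping frames are assumptions made only for illustration.

```python
import numpy as np

def split_into_frames(x, fs, frame_ms=5.0):
    """Cut a waveform into consecutive, non-overlapping 5 ms frames."""
    frame_len = int(fs * frame_ms / 1000.0)        # e.g. 80 samples at 16 kHz
    n_frames = len(x) // frame_len
    return np.reshape(x[:n_frames * frame_len], (n_frames, frame_len))
```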

發音屬性是多語言語音處理非常重要的一種特徵參數(feature)，可以提供必要的細微資訊，比其他特徵更好地處理發音變異，是較強健的語音參數，不容易發生因為語者不同或是語言不同參數變化過大的情形。在本實施例中，所有的語料依IPA的發音屬性的分類方式定義出總共22種不同的發音事件偵測器，如圖6所示，於此說明圖6中英文所對應之中文：P(Vowel |x(t))(輸入訊號x(t)為母音之似然率(likelihood))、P(Fricative |x(t))(輸入訊號x(t)為擦音之似然率)、P(Nasal |x(t))(輸入訊號x(t)為鼻音之似然率)。所使用到的偵測器則會利用不同的聲學上語音的特性的混合來建立發音事件偵測器的模型，如梅爾倒頻譜系數(MFCC)、過零率(zero crossing rate)、音高(pitch)、能量(energy)…等等。這些事件偵測器藉由語料可訓練出一套分類器，例如用類神經網路或是支持向量機(Support vector machine,SVM)等方法訓練分類器，最後輸出的結果則是一組22維的向量，每一維分別代表此聲音對於各發音屬性事件(Articulatory attribute vector，縮寫為AA vector)的機率值。Articulatory attributes are a very important type of feature for multilingual speech processing: they provide the necessary fine-grained information and handle pronunciation variation better than other features. They are robust speech parameters and are not prone to excessive variation caused by different speakers or different languages. In this embodiment, a total of 22 different articulatory event detectors are defined for all corpora according to the IPA classification of articulatory attributes, as shown in FIG. 6. The Chinese corresponding to the English in FIG. 6 is: P(Vowel | x(t)) (the likelihood that the input signal x(t) is a vowel), P(Fricative | x(t)) (the likelihood that the input signal x(t) is a fricative), P(Nasal | x(t)) (the likelihood that the input signal x(t) is a nasal). The detectors are modeled using a mixture of different acoustic speech characteristics, such as Mel-frequency cepstral coefficients (MFCC), zero-crossing rate, pitch, energy, and so on. With the corpus, these event detectors can be trained as a set of classifiers, for example using neural networks or support vector machines (SVM); the final output is a 22-dimensional vector in which each dimension represents the probability of the corresponding articulatory attribute event for this sound (the articulatory attribute vector, abbreviated AA vector).
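A minimal sketch of the detector bank described above follows. It assumes the frame-level acoustic features (MFCC, zero-crossing rate, pitch, energy) are already stacked in a matrix and uses scikit-learn's probability-calibrated SVM as a stand-in for whichever classifier the patent's authors used; the attribute names shown and all hyper-parameters are placeholders.

```python
# Hypothetical sketch: one binary detector per articulatory-attribute event.
# Stacking the detectors' posteriors yields the 22-dimensional AA vector per frame.
import numpy as np
from sklearn.svm import SVC

ATTRIBUTES = ["vowel", "fricative", "nasal"]  # ... the patent defines 22 such events

def train_detectors(frame_features, frame_labels):
    """frame_labels[attr] is a 0/1 array saying whether each frame carries that attribute."""
    detectors = {}
    for attr in ATTRIBUTES:
        clf = SVC(probability=True)            # probability=True enables posterior estimates
        clf.fit(frame_features, frame_labels[attr])
        detectors[attr] = clf
    return detectors

def aa_vectors(detectors, frame_features):
    """Return an (n_frames x n_attributes) matrix of attribute posteriors."""
    cols = [detectors[a].predict_proba(frame_features)[:, 1] for a in ATTRIBUTES]
    return np.stack(cols, axis=1)
```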

聽覺參數是針對人類聽感而設計，具有以下幾種特性：非線性的感知量測量度，如巴克量度、梅爾量度，這是由於人耳在頻域上的感知，並非全頻域都有相同的敏感度。Auditory parameters are designed according to human auditory perception and have the following characteristics: nonlinear perceptual scales, such as the Bark scale and the mel scale, which reflect the fact that the human ear does not perceive all regions of the frequency domain with the same sensitivity.

頻譜振幅壓縮(Spectral amplitude compression)為對數壓縮,因為在聽覺上人耳對音強(Intensity)的感知並非呈現線性關係,而是較接近對數曲線之呈現。Spectral amplitude compression is logarithmic compression because the human ear's perception of intensity is not linear in the sense of hearing, but rather closer to the logarithmic curve.

等響度(Equal loudness)曲線，響度為衡量聲音大小的單位，以1kHz的單頻聲為基準，不同頻率下聽覺的響度和1kHz時的響度一樣時對應的SPL(Sound Pressure Level)連成一曲線，即為等響度曲線。Equal-loudness curve: loudness is the unit for measuring perceived sound magnitude. Taking a 1 kHz pure tone as the reference, the SPL (Sound Pressure Level) values at which tones of different frequencies are heard as loud as the 1 kHz tone are connected into a curve, which is the equal-loudness curve.

遮蔽效應，發生於當某一頻率有一特定音強存在時，令一個不同頻率的聲音需要加強音強才能被人耳接收。主要可分為兩種，一種是頻率遮蔽(Frequency masking)，低頻聲音傾向遮蔽掉高頻聲音；另一種則為時間遮蔽(Temporal Masking)。Masking effect: it occurs when a tone of a certain intensity at one frequency causes a sound at a different frequency to require a higher intensity before it can be perceived by the human ear. It mainly falls into two types: frequency masking, in which low-frequency sounds tend to mask high-frequency sounds, and temporal masking.

請參照圖7所示，本實施例採用Lyon's auditory model的參數，其擷取步驟如下：首先會對訊號做預強調(pre-emphasis)，接著通過86個濾波器F1~F86組合成的級聯濾波器(cascade filter)，接著通過一系列的半波整流器(half wave rectification,HWR)，這部分在模擬內毛細胞(inner hair cell)的單向性運動，整流器具有將輸入波形的負半周消除的功能，使得能量減半。每個半波整流器的輸出再經由四個自動增益控制(automatic gain control,AGC)的級聯，AGC會隨著時間和鄰近整流器的輸出而改變數值，最後輸出的結果可表現出聲波進入人耳後，經由不同部位的神經放電頻率(neural firing rate)，並可由聽覺生理內耳聽覺模型(Cochleagram)來表現。Referring to FIG. 7, this embodiment adopts the parameters of Lyon's auditory model, which are extracted as follows. The signal is first pre-emphasized and then passed through a cascade filter composed of 86 filters F1~F86, followed by a series of half-wave rectifiers (half-wave rectification, HWR); this part simulates the unidirectional motion of the inner hair cells, and the rectifier removes the negative half-cycle of the input waveform, halving the energy. The output of each half-wave rectifier then passes through a cascade of four automatic gain control (AGC) stages, whose values change over time and with the outputs of neighboring rectifiers. The final output represents the neural firing rate at different locations after the sound wave enters the human ear, and can be represented by a cochleagram, a model of the auditory physiology of the inner ear.
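Full implementations of Lyon's cochlear model exist in auditory toolboxes, but as a rough illustration of the processing chain described above (pre-emphasis, a filter cascade, half-wave rectification, chained automatic gain control), here is a heavily simplified sketch. The filter design, the number of stages, and the AGC time constant are all assumptions; this is not Lyon's model itself.

```python
# Heavily simplified sketch of the pre-emphasis -> filter cascade -> HWR -> AGC chain.
import numpy as np
from scipy.signal import lfilter

def pre_emphasis(x, coeff=0.97):
    return np.append(x[0], x[1:] - coeff * x[:-1])

def cascade_outputs(x, filters):
    """filters: an assumed list of (b, a) coefficient pairs forming the cascade.
    Each stage filters the output of the previous one; every tap is kept as a channel."""
    outputs, signal = [], x
    for b, a in filters:
        signal = lfilter(b, a, signal)
        outputs.append(signal)
    return outputs

def half_wave_rectify(channels):
    return [np.maximum(ch, 0.0) for ch in channels]   # keep only the positive half-cycles

def simple_agc(channels, alpha=0.999, eps=1e-8):
    """One crude AGC stage per channel (Lyon's model chains four such stages)."""
    out = []
    for ch in channels:
        envelope = lfilter([1 - alpha], [1, -alpha], np.abs(ch))  # slow amplitude envelope
        out.append(ch / (envelope + eps))
    return out
```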

本實施例所提出的音框對應(frame alignment)方法之目標是利用第一語言之第一語料挑選出跟第二語言之第二語料在發音屬性以及聽覺特性最相近的音框序列(frame sequence)，其步驟如下：The goal of the frame alignment method proposed in this embodiment is to select, from the first corpus of the first language, the frame sequence whose articulatory attributes and auditory characteristics are closest to those of the second corpus of the second language. The steps are as follows:

1. 在已建立好的決策樹找尋第二語料每個音素標記所分類到的群(cluster)(請參照圖4),從群中找尋其對應的第一語言之音素。1. In the established decision tree, find the cluster to which each phoneme tag is classified in the second corpus (refer to Figure 4), and find the corresponding phoneme in the first language from the group.

2. 計算第一音框與第二音框的挑選成本(substitution cost),挑出前n個候選者(candidate)。2. Calculate the substitution cost of the first and second frames, and pick out the first n candidates (candidate).

3. 計算各個候選者之間的串接成本(concatenation cost)。3. Calculate the concatenation cost between each candidate.

4. 利用動態規劃在候選者群中找出結合concatenation cost以及substitution cost最小的最佳路徑所對應的第一語料之部分，並將其視為平行語料。4. Use dynamic programming to find, within the candidate set, the portion of the first corpus corresponding to the optimal path that minimizes the combined concatenation cost and substitution cost, and regard it as the parallel corpus.

5. 利用平行語料來進行音高(Pitch)的高斯混合模型(Gaussian mixture model,GMM)轉換。5. Use the parallel corpus to perform Gaussian mixture model (GMM) conversion of the pitch.

以下為各個音框之參數定義，即第一語料與第二語料之各音框各自的倒頻譜參數及發音屬性(articulatory attribute,AA)，其中倒頻譜參數包含了梅爾一般化係數(Mel generalized Coefficient,MGC)及聽覺參數(auditory feature)。(這些音框層級特徵向量的組成，於下列定義之後附一示意。)The parameter definitions of the individual frames are given below, namely the cepstral parameters and articulatory attributes (AA) of each frame of the first corpus and the second corpus; the cepstral parameters include the Mel Generalized Coefficients (MGC) and the auditory features. (An illustrative sketch of how these frame-level feature vectors are assembled follows the definitions below.)

其中第一音框特徵參數如下表示：The feature parameters of the first frames are expressed as follows:

而第二音框特徵參數如下：The feature parameters of the second frames are expressed as follows:

特徵參數各自定義如下：The feature parameters are each defined as follows:

SFp_n：第一語料中第n個音框的倒頻譜參數。SFp_n: the cepstral parameters of the n-th frame in the first corpus.

SFs_n：第二語料中第n個音框的倒頻譜參數。SFs_n: the cepstral parameters of the n-th frame in the second corpus.

AFp_n：第一語料中第n個音框的發音參數(articulatory feature)。AFp_n: the articulatory feature of the n-th frame in the first corpus.

AFs_n：第二語料中第n個音框的articulatory feature。AFs_n: the articulatory feature of the n-th frame in the second corpus.

AFp_n^-：第一語料中第n-1個音框的articulatory feature。AFp_n^-: the articulatory feature of the (n-1)-th frame in the first corpus.

AFs_n^-：第二語料中第n-1個音框的articulatory feature。AFs_n^-: the articulatory feature of the (n-1)-th frame in the second corpus.

AFp_n^+：第一語料中第n+1個音框的articulatory feature。AFp_n^+: the articulatory feature of the (n+1)-th frame in the first corpus.

AFs_n^+：第二語料中第n+1個音框的articulatory feature。AFs_n^+: the articulatory feature of the (n+1)-th frame in the second corpus.
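Since the formula images for the frame feature vectors did not survive extraction, the sketch below shows one plausible way of assembling, for each frame, the cepstral part together with the articulatory-attribute vectors of the frame and of its left and right neighbours, following the definitions above. The exact composition of the vectors in the patent may differ.

```python
import numpy as np

def frame_feature(SF, AF, n):
    """Stack the cepstral features of frame n with the AA vectors of frames n-1, n, n+1.

    SF: (n_frames x cepstral_dim) matrix of MGC/auditory features.
    AF: (n_frames x 22) matrix of articulatory-attribute posteriors.
    The composition is an assumption reconstructed from the surrounding definitions.
    """
    prev_aa = AF[max(n - 1, 0)]
    next_aa = AF[min(n + 1, len(AF) - 1)]
    return np.concatenate([SF[n], prev_aa, AF[n], next_aa])
```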

以下說明挑選成本(Substitution cost)。Substitution cost計算方法採用歐式距離，如式(1)：The substitution cost is explained below. The substitution cost is computed using the Euclidean distance, as in Equation (1):

其中i、j表示第一音框與第二音框之特徵參數各自的音框索引值，dim則為語音參數之維度。在實作時會保留前n個距離最近的音框，這是由於每一個與第二語料距離最相近的candidate，它們在原始語料中前後所接的資訊(contextual information)之聲音特性不同。故為了將前後音的特性考慮進入concatenation cost中，而保留前n個最近的音框。此外，由於訓練語料音檔的音框繁多，一個母音就包含數萬個音框，故可利用最近鄰居(Nearest Neighbor)演算法預先找出各音框的前n個candidate，以降低即時運算上耗費之計算時間。where i and j denote the frame index values of the feature parameters of the first frames and the second frames, respectively, and dim is the dimension of the speech parameters. In practice, the n closest frames are retained, because each candidate that is closest to a second-corpus frame carries different acoustic characteristics of the contextual information that precedes and follows it in the original corpus. Therefore, in order to take the characteristics of the preceding and following sounds into account in the concatenation cost, the n nearest frames are kept. In addition, since the training corpus contains a huge number of frames (a single vowel alone may contain tens of thousands of frames), a nearest-neighbor algorithm can be used to find the top n candidates of each frame in advance, reducing the computation time required at run time.
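A sketch of the substitution cost and of the nearest-neighbour pre-selection described above follows, assuming frame feature matrices built as in the earlier sketches. scikit-learn's NearestNeighbors is used as the pre-selection index; that library choice is an implementation assumption rather than something prescribed by the patent.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def substitution_cost(feat_second, feat_first):
    """Euclidean distance between a second-language frame and a first-language frame."""
    return float(np.linalg.norm(feat_second - feat_first))

def preselect_candidates(second_frames, first_frames, n_candidates=10):
    """For every second-language frame, keep the n closest first-language frames."""
    index = NearestNeighbors(n_neighbors=n_candidates).fit(first_frames)
    costs, candidate_ids = index.kneighbors(second_frames)
    return candidate_ids, costs          # both shaped (n_second_frames, n_candidates)
```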

以下說明串接成本(Concatenation cost)。Concatenation cost計算數學式如下式(2)：The concatenation cost is explained below. The concatenation cost is computed as in Equation (2):

在本實施例中，同時考慮了candidate間往前接以及往後接兩層的前後文脈關係。首先是後接關係，也就是在時間點i的candidate本身的articulatory feature(AFp_i)以及前一個時間點i-1的candidate在原音檔後面接的音框之articulatory feature(AFp_{i-1}^+)。另一層關係為前接關係，時間點i的candidate在原始音檔前接之音框articulatory feature(AFp_i^-)以及i-1時間點的candidate的articulatory feature(AFp_{i-1})，舉例來說：若時間點i及i-1的第一語料之candidate是原本接在一起的，那相減後得到最短歐式距離的值為0。In this embodiment, two layers of contextual relations between candidates, forward and backward, are considered at the same time. The first is the backward (following) relation, namely the articulatory feature of the candidate itself at time i (AFp_i) and the articulatory feature of the frame that follows the candidate of the previous time i-1 in the original audio (AFp_{i-1}^+). The other layer is the forward (preceding) relation, namely the articulatory feature of the frame that precedes the candidate at time i in the original audio (AFp_i^-) and the articulatory feature of the candidate at time i-1 (AFp_{i-1}). For example, if the first-corpus candidates at times i and i-1 were originally adjacent in the corpus, the subtraction yields the shortest Euclidean distance, namely 0.

本實施例之音框對應的計算示意圖如圖8所示。其演算法數學公式如式(3)：A schematic diagram of the frame alignment computation of this embodiment is shown in FIG. 8. The mathematical formulation of the algorithm is given in Equation (3):

其中where

而Sub(Us_i, Up_i)為前述之substitution cost，而Con(Up_{i-1}, Up_i)則為concatenation cost。Us_i為第二語料在時間點i的基本合成單元(unit)，而Up_i則為第一語料時間點i的基本合成單元。最後可利用維特比(Viterbi)演算法進行動態規劃，以求得cost最小的音框序列，並將其視為第一語料所產生與第二語言平行之語料。where Sub(Us_i, Up_i) is the aforementioned substitution cost and Con(Up_{i-1}, Up_i) is the concatenation cost. Us_i is the basic synthesis unit of the second corpus at time i, and Up_i is the basic synthesis unit of the first corpus at time i. Finally, the Viterbi algorithm can be used for dynamic programming to obtain the frame sequence with the minimum cost, which is regarded as the corpus, produced from the first corpus, that is parallel to the second language.
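A sketch of the dynamic-programming search over the pre-selected candidates follows: the accumulated cost combines the substitution cost of each candidate with the concatenation cost between consecutive candidates, and the minimum-cost path is recovered by backtracking, in the spirit of Equation (3). The equal weighting of the two costs and the `concat_cost` callback are assumptions.

```python
import numpy as np

def viterbi_frame_selection(sub_costs, candidate_ids, concat_cost):
    """sub_costs[t, k]: substitution cost of candidate k at time t.
    candidate_ids[t, k]: index of that candidate frame in the first corpus.
    concat_cost(i, j): assumed callback giving the concatenation cost between
    first-corpus frames i and j. Returns the minimum-cost frame sequence."""
    T, K = sub_costs.shape
    acc = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    acc[0] = sub_costs[0]
    for t in range(1, T):
        for k in range(K):
            trans = [acc[t - 1, j] + concat_cost(candidate_ids[t - 1, j],
                                                 candidate_ids[t, k]) for j in range(K)]
            back[t, k] = int(np.argmin(trans))
            acc[t, k] = sub_costs[t, k] + min(trans)
    # backtrack the best path
    path = [int(np.argmin(acc[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    path.reverse()
    return [int(candidate_ids[t, k]) for t, k in enumerate(path)]
```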

以下說明Pitch的GMM轉換。The following describes the GMM conversion of Pitch.

由於校準後的音框序列，其音韻並非第二語言實際上的音高曲線，經過校準後各音框f0並未改變，由提出方法得到的音框序列的基頻曲線(f0 contour)和實際的第二語言尚有差距，故需經過一次轉換。由於先前挑選candidate時已有進行f0的篩選，故能夠直接進行joint density的GMM轉換。定義x為第一語言之語者之f0，y為第二語言之語者之f0，z=[x^T y^T]^T用來估測GMM參數(α,μ,Σ)，則轉換函式為Since the prosody of the aligned frame sequence is not the actual pitch contour of the second language (the f0 of each frame is unchanged by the alignment), the fundamental frequency contour (f0 contour) of the frame sequence obtained by the proposed method still differs from that of the actual second language, so one conversion step is required. Because f0 was already screened when the candidates were selected, a joint-density GMM conversion can be performed directly. Let x be the f0 of the first-language speaker and y the f0 of the second-language speaker; z = [x^T y^T]^T is used to estimate the GMM parameters (α, μ, Σ), and the conversion function is

Q為GMM的總個數Q is the total number of Gaussian mixture components.
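The image of the conversion function did not survive extraction; the sketch below implements the standard joint-density GMM regression, which is presumably what the text refers to: given a GMM trained on joint vectors z = [x, y], the converted value is the posterior-weighted sum of the per-component conditional means, F(x) = Σ_q P(q|x) [μ_q^y + (σ_q^yx / σ_q^xx)(x − μ_q^x)] for scalar f0. scikit-learn's GaussianMixture is used here only as a convenient estimator of (α, μ, Σ); the component count is an assumption.

```python
# Hypothetical sketch of joint-density GMM pitch conversion for scalar (log-)f0.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_joint_gmm(x, y, n_components=4):
    """x, y: aligned f0 sequences of the first-language and second-language speakers."""
    z = np.stack([x, y], axis=1)                    # joint vectors z = [x, y]
    return GaussianMixture(n_components=n_components, covariance_type="full").fit(z)

def convert_pitch(gmm, x):
    x = np.asarray(x, dtype=float)
    mu_x, mu_y = gmm.means_[:, 0], gmm.means_[:, 1]
    s_xx = gmm.covariances_[:, 0, 0]
    s_yx = gmm.covariances_[:, 1, 0]
    diff = x[:, None] - mu_x[None, :]
    # P(q | x): posterior of each component given only the source dimension
    log_px = -0.5 * (diff ** 2 / s_xx + np.log(2 * np.pi * s_xx))
    log_post = np.log(gmm.weights_)[None, :] + log_px
    post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    cond_mean = mu_y[None, :] + (s_yx / s_xx)[None, :] * diff
    return (post * cond_mean).sum(axis=1)           # converted f0, one value per frame
```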

圖9為本實施例藉由上述音框對應方法所得到之結果的示意圖，其顯示第二語料之多個第二音框，其對應的第一語料之第一音框係分散在第一語料的多個段落中，並以剖面線表示之。FIG. 9 is a schematic diagram of the result obtained by the above frame alignment method in this embodiment; it shows a plurality of second frames of the second corpus, whose corresponding first frames of the first corpus are scattered across several segments of the first corpus and are indicated by hatching.

在選取到該等特殊發音單元(如音框)所對應之該等第一發音單元(如音框)之後，多語言語音合成方法更包含：藉由對應該等特殊發音單元之該等第一發音單元來訓練第二語言之該等特殊發音單元之語音模型，並將第二語言之該等特殊發音單元之語音模型加入第一語言之語音模型。After the first pronunciation units (e.g., frames) corresponding to the special pronunciation units (e.g., frames) have been selected, the multilingual speech synthesis method further comprises: training the speech models of the special pronunciation units of the second language by using the first pronunciation units corresponding to the special pronunciation units, and adding the speech models of the special pronunciation units of the second language into the speech model of the first language.

需注意者，除了特殊發音單元是利用上述音框對應方法之外，一般非特殊發音單元亦可使用同樣的音框對應方法來找到對應的音框，並可加入語音模型的訓練。It should be noted that, besides the special pronunciation units that use the above frame alignment method, ordinary (non-special) pronunciation units can also use the same frame alignment method to find their corresponding frames, which can likewise be included in the training of the speech models.

總括來說，本發明較佳實施例係以「發音屬性」及/或「聽覺參數」為基礎，提出了整合各個語言之語料庫之架構，來實現通用語音合成。其作法為利用一個大量收集之特定語者語料作為主要語言語者(Primary Language Speaker)，其他語言之語料庫則是利用跨語言之發音屬性進行與主要語言語者進行分類與對應。首先收集大量某主要語言與次要語言的語料，並利用IPA定義其語言中所有的發音單元(可包含至少一音素(phone))。之後將次要語言中特殊的發音單元(Language-Specific phone)之語音段對主要語言之語音段進行分類以及對應，其中分類以及對應的依據是考慮其發音屬性(Articulatory attribute)的前後文資訊(contextual information)以及結合倒頻譜參數(Cepstral feature)與聽覺參數(Auditory feature)作為挑選對應音框的依據，針對於兩個語言之語音段進行音框之對應。最後便能利用對應好的主要語言之語音段來進行次要語言之特殊發音單元語音模型的訓練，如此將次要語言特殊發音單元語音模型加入主要語言中，即可用以實現多語言語音合成器的實現。In summary, the preferred embodiment of the present invention, based on "articulatory attributes" and/or "auditory parameters", proposes an architecture that integrates the corpora of multiple languages to realize universal speech synthesis. The approach is to use a large collection of speech from a particular speaker as the primary language speaker, while the corpora of other languages are classified and mapped to the primary language speaker using cross-language articulatory attributes. First, a large amount of corpus data for a primary language and a secondary language is collected, and the IPA is used to define all of the pronunciation units (each of which may include at least one phone) in each language. Then the speech segments of the language-specific phones of the secondary language are classified and mapped to the speech segments of the primary language, where the classification and mapping are based on the contextual information of the articulatory attributes, combined with the cepstral features and the auditory features, as the basis for selecting corresponding frames; frame alignment is performed between the speech segments of the two languages. Finally, the aligned speech segments of the primary language can be used to train the speech models of the language-specific phones of the secondary language, and adding these speech models to the primary language realizes a multilingual speech synthesizer.

綜上所述，本發明之多語言語音合成方法可解決跨語言音素不完全相同的問題，並可解決無法收集到平行語料之問題，因而產生出任一語者之多語言語音合成。In summary, the multilingual speech synthesis method of the present invention can solve the problem that phonemes are not identical across languages and the problem that parallel corpora cannot be collected, thereby producing multilingual speech synthesis for any speaker.

以上所述僅為舉例性,而非為限制性者。任何未脫離本發明之精神與範疇,而對其進行之等效修改或變更,均應包含於後附之申請專利範圍中。The above is intended to be illustrative only and not limiting. Any equivalent modifications or alterations to the spirit and scope of the invention are intended to be included in the scope of the appended claims.

S01~S04‧‧‧多語言語音合成方法的步驟 Steps S01~S04 of the multilingual speech synthesis method

圖1為本發明較佳實施例之一種多語言語音合成方法的步驟流程圖；FIG. 1 is a flow chart showing the steps of a multilingual speech synthesis method according to a preferred embodiment of the present invention;

圖2為中文和英文在語言學上的基本發音單元的示意圖;Figure 2 is a schematic diagram of the basic pronunciation units of linguistics in Chinese and English;

圖3為中英文母音發音位置圖;Figure 3 is a view showing the position of the vowel sound in Chinese and English;

圖4為中英文子音發音位置與方式表;Figure 4 is a table showing the position and mode of the pronunciation of Chinese and English consonants;

圖5為決策樹分類示意圖;Figure 5 is a schematic diagram of decision tree classification;

圖6為發音屬性偵測器示意圖;Figure 6 is a schematic diagram of a pronunciation attribute detector;

圖7為Lyon’s聽覺參數流程示意圖;Figure 7 is a schematic flow chart of Lyon's auditory parameters;

圖8為本發明之一實施例之音框對應的計算示意圖；以及FIG. 8 is a schematic diagram of the frame alignment computation according to an embodiment of the present invention; and

圖9為本實施例藉由音框對應方法所得到之結果的示意圖。FIG. 9 is a schematic diagram of the result obtained by the frame alignment method in this embodiment.

S01~S04‧‧‧多語言語音合成方法的步驟 Steps S01~S04 of the multilingual speech synthesis method

Claims (10)

1. 一種多語言語音合成方法，包含：選取一第一語言為主要語言，並收集該第一語言之一第一語料；選取一第二語言為次要語言，並收集該第二語言之一第二語料；利用該第一語料與該第二語料將該第一語言之複數第一發音單元以及該第二語言之複數第二發音單元進行分類，該等第二發音單元包含複數特殊發音單元，該等特殊發音單元係包含該第一語言與該第二語言之間的未知音；以及決定與該等特殊發音單元所對應之該等第一發音單元。A multilingual speech synthesis method, comprising: selecting a first language as a primary language and collecting a first corpus of the first language; selecting a second language as a secondary language and collecting a second corpus of the second language; classifying, by the first corpus and the second corpus, a plurality of first pronunciation units of the first language and a plurality of second pronunciation units of the second language, the second pronunciation units including a plurality of special pronunciation units, the special pronunciation units including unseen phones between the first language and the second language; and determining the first pronunciation units corresponding to the special pronunciation units.

2. 如申請專利範圍第1項所述之多語言語音合成方法，其中該第一語料多於該第二語料。The multilingual speech synthesis method of claim 1, wherein the first corpus is larger than the second corpus.

3. 如申請專利範圍第1項所述之多語言語音合成方法，更包含：利用該第一語料訓練該第一語言之一第一語音模型；以及利用該第二語料訓練該第二語言之一第二語音模型。The multilingual speech synthesis method of claim 1, further comprising: training a first speech model of the first language using the first corpus; and training a second speech model of the second language using the second corpus.

4. 如申請專利範圍第1項所述之多語言語音合成方法，更包含：藉由國際音標來分類該等第一發音單元與該等第二發音單元。The multilingual speech synthesis method of claim 1, further comprising: classifying the first pronunciation units and the second pronunciation units by the International Phonetic Alphabet.

5. 如申請專利範圍第1項所述之多語言語音合成方法，其中決定與該等特殊發音單元所對應之第一發音單元之前，更包含：將該第一語料細分成多個第一音框；以及將該第二語料細分成多個第二音框，使得該等特殊發音單元對應該等第一發音單元係藉由該等音框對應來進行。The multilingual speech synthesis method of claim 1, wherein, before determining the first pronunciation units corresponding to the special pronunciation units, the method further comprises: subdividing the first corpus into a plurality of first frames; and subdividing the second corpus into a plurality of second frames, such that the correspondence of the special pronunciation units to the first pronunciation units is performed by frame alignment.

6. 如申請專利範圍第5項所述之多語言語音合成方法，其中該第一語言之一音素係分成多個該等第一音框。The multilingual speech synthesis method of claim 5, wherein a phone of the first language is divided into a plurality of the first frames.

7. 如申請專利範圍第5項所述之多語言語音合成方法，其中該第二語言之一音素係分成多個該等第二音框。The multilingual speech synthesis method of claim 5, wherein a phone of the second language is divided into a plurality of the second frames.

8. 如申請專利範圍第1項所述之多語言語音合成方法，其中決定與該等特殊發音單元所對應之該等第一發音單元之步驟係藉由發音屬性或聽覺參數來進行。The multilingual speech synthesis method of claim 1, wherein the step of determining the first pronunciation units corresponding to the special pronunciation units is performed by articulatory attributes or auditory features.

9. 如申請專利範圍第1項所述之多語言語音合成方法，其中將該等特殊發音單元對應該等第一發音單元之後，更包含：藉由對應該等特殊發音單元之該等第一發音單元來訓練該第二語言之該等特殊發音單元之語音模型。The multilingual speech synthesis method of claim 1, wherein, after the special pronunciation units are mapped to the first pronunciation units, the method further comprises: training speech models of the special pronunciation units of the second language using the first pronunciation units corresponding to the special pronunciation units.

10. 如申請專利範圍第9項所述之多語言語音合成方法，更包含：將該第二語言之該等特殊發音單元之該語音模型加入該第一語言之語音模型。The multilingual speech synthesis method of claim 9, further comprising: adding the speech models of the special pronunciation units of the second language into the speech model of the first language.
TW100141766A 2011-11-16 2011-11-16 Polyglot speech synthesis method TWI467566B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW100141766A TWI467566B (en) 2011-11-16 2011-11-16 Polyglot speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW100141766A TWI467566B (en) 2011-11-16 2011-11-16 Polyglot speech synthesis method

Publications (2)

Publication Number Publication Date
TW201322250A TW201322250A (en) 2013-06-01
TWI467566B true TWI467566B (en) 2015-01-01

Family

ID=49032444

Family Applications (1)

Application Number Title Priority Date Filing Date
TW100141766A TWI467566B (en) 2011-11-16 2011-11-16 Polyglot speech synthesis method

Country Status (1)

Country Link
TW (1) TWI467566B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI689865B (en) * 2017-04-28 2020-04-01 塞席爾商元鼎音訊股份有限公司 Smart voice system, method of adjusting output voice and computre readable memory medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9865251B2 (en) 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
TWI605350B (en) * 2015-07-21 2017-11-11 華碩電腦股份有限公司 Text-to-speech method and multiplingual speech synthesizer using the method
TWI697891B (en) * 2018-11-23 2020-07-01 聆感智能科技有限公司 In-ear voice device
CN112634866A (en) * 2020-12-24 2021-04-09 北京猎户星空科技有限公司 Speech synthesis model training and speech synthesis method, apparatus, device and medium


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5978764A (en) * 1995-03-07 1999-11-02 British Telecommunications Public Limited Company Speech synthesis
US20040111266A1 (en) * 1998-11-13 2004-06-10 Geert Coorman Speech synthesis using concatenation of speech waveforms


Also Published As

Publication number Publication date
TW201322250A (en) 2013-06-01

Similar Documents

Publication Publication Date Title
Trivedi et al. Speech to text and text to speech recognition systems-Areview
Gutkin et al. TTS for low resource languages: A Bangla synthesizer
TWI467566B (en) Polyglot speech synthesis method
Le et al. First steps in fast acoustic modeling for a new target language: application to Vietnamese
CN106653002A (en) Literal live broadcasting method and platform
Pravena et al. Development of simulated emotion speech database for excitation source analysis
Aryal et al. Articulatory-based conversion of foreign accents with deep neural networks
Přibil et al. GMM-based speaker gender and age classification after voice conversion
JP2007155833A (en) Acoustic model development system and computer program
Mullah et al. Development of an HMM-based speech synthesis system for Indian English language
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Gutkin et al. Building statistical parametric multi-speaker synthesis for bangladeshi bangla
Luo et al. Emotional Voice Conversion Using Neural Networks with Different Temporal Scales of F0 based on Wavelet Transform.
Kumar et al. Automatic spontaneous speech recognition for Punjabi language interview speech corpus
Raso et al. Modeling the prosodic forms of Discourse Markers
Kathania et al. Spectral modification for recognition of children’s speech under mismatched conditions
Janyoi et al. An Isarn dialect HMM-based text-to-speech system
Chen et al. Cross-lingual frame selection method for polyglot speech synthesis
Sulír et al. Development of the Slovak HMM-Based TTS System and Evaluation of Voices in Respect to the Used Vocoding Techniques.
Mortensen et al. Tusom2021: A phonetically transcribed speech dataset from an endangered language for universal phone recognition experiments
Raitio Voice source modelling techniques for statistical parametric speech synthesis
Bhat et al. Automatic assessment of articulation errors in Hindi speech at phone level
Zang et al. Foreign Accent Conversion using Concentrated Attention
Dalva Automatic speech recognition system for Turkish spoken language
Muljono et al. An evaluation of sentence selection methods on the different phone-sized units for constructing Indonesian speech corpus

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees