TW202016921A - Method for speech synthesis and system thereof - Google Patents

Method for speech synthesis and system thereof

Info

Publication number
TW202016921A
Authority
TW
Taiwan
Prior art keywords
chinese
information
speech
language
tone
Prior art date
Application number
TW107137546A
Other languages
Chinese (zh)
Other versions
TWI703556B (en)
Inventor
王文俊
陳保清
潘振銘
江振宇
張文陽
李武豪
林衍廷
林彥廷
江仁杰
Original Assignee
中華電信股份有限公司
Priority date
Filing date
Publication date
Application filed by 中華電信股份有限公司 filed Critical 中華電信股份有限公司
Priority to TW107137546A
Publication of TW202016921A
Application granted
Publication of TWI703556B


Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

A method for speech synthesis first performs text analysis on a mixed Chinese-foreign sentence to generate linguistic information, including the Chinese part-of-speech tags and the foreign-language part-of-speech tags decided by the Chinese syntactic structure of the sentence, and performs a tone prediction process to generate tone information for the foreign-language words. Breakpoint prediction is then performed on the linguistic information to generate breakpoint information for the mixed sentence. Finally, the linguistic, tone, and breakpoint information are integrated to generate synthesized speech for the mixed Chinese-foreign sentence. A system for speech synthesis is also disclosed.

Description

Method for speech synthesis and system thereof

The present invention relates to speech output, and more particularly to a method and system for synthesizing speech from sentences that mix Chinese with a foreign language.

In Chinese-speaking communities whose native language is Mandarin, mixed Mandarin-English text-to-speech (TTS) systems are used relatively often. Mixed Mandarin-English TTS here means text that is primarily Chinese with embedded English words or phrases; the embedded English may be proper nouns, abbreviations, translations of awkward Chinese terms, or idioms, so that an English word replaces the corresponding Chinese one. Everyday examples include 「Facebook是著名的網路社群」 ("Facebook is a well-known online community"), 「物聯網IoT是這幾年Internet上熱門的搜尋關鍵字」 ("IoT has been a popular search keyword on the Internet in recent years"), 「此應用之implement難度相當高」 ("the implementation of this application is quite difficult"), 「好professor教出好student」 ("a good professor produces good students"), and 「我倆的想法不match，所以我不follow他的作法」 ("our ideas do not match, so I do not follow his approach").

As with a conventional monolingual TTS system, building a mixed-language TTS system requires recording a speech database from a specific speaker and establishing related procedures for text analysis, prosody generation, and synthesis; in the mixed-language setting, however, each of these procedures requires special consideration.

Regarding the recording of the speech database: because mixed Mandarin-English material is predominantly Chinese with only a little English, the ratio of Chinese to English is extremely lopsided, and it is difficult for such data to provide both the synthesis units and the rich prosodic variation needed for two languages. The present invention therefore has the same speaker record the required Chinese and English speech data separately.

Moreover, foreign-language (second-language, here English) speech recorded by a native Mandarin speaker is strongly influenced by the first language (Chinese); this influence covers acoustic, prosodic, and syntactic characteristics.

The system of the present invention is therefore built around the following observations:

First, acoustically, English sounds are readily replaced by similar Mandarin sounds.

Second, prosodically, Mandarin tone information tends to replace English stress information, the tone borrowing phenomenon, and prosodic breaks are mostly produced according to native-language habits.

Third, for syntactic information such as part of speech (POS), the POS of an embedded English word or phrase should be decided by the Chinese syntactic structure of the whole sentence, rather than by the original POS of the English word or phrase alone.

Accordingly, the present invention provides a speech synthesis method comprising: performing text analysis on a mixed Chinese-foreign sentence to produce linguistic information that includes the Chinese POS tags and the foreign-language POS tags decided by the Chinese syntactic structure of the whole sentence, and producing tone information for the foreign-language words through a tone prediction procedure; producing breakpoint information for the mixed sentence from the linguistic information through a breakpoint prediction procedure; and integrating the linguistic, tone, and breakpoint information in a speech synthesis procedure to produce synthesized speech for the mixed Chinese-foreign sentence.
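
The claimed pipeline can be sketched end to end as follows. This is an illustrative toy, not the patent's implementation: all function names and the heuristics inside them are invented (the patent uses CRF/DNN models and an HTS synthesizer for these stages).

```python
def text_analysis(sentence):
    """Toy text analysis: whitespace-tokenize a mixed sentence and mark
    ASCII tokens as embedded foreign (English) words."""
    return [{"word": w, "is_foreign": w.isascii()} for w in sentence.split()]

def tone_prediction(info):
    """Assign a (hypothetical) borrowed Mandarin tone to foreign words."""
    return [dict(t, tone=4 if t["is_foreign"] else None) for t in info]

def breakpoint_prediction(info):
    """Predict a break type after each token; only the sentence-final
    major break (B4) is modeled in this toy version."""
    breaks = ["B0"] * len(info)
    if breaks:
        breaks[-1] = "B4"
    return breaks

def synthesize(info, tones, breaks):
    """Stand-in for HTS synthesis: return the integrated label sequence
    that a real synthesizer would turn into speech."""
    return list(zip((t["word"] for t in info),
                    (t["tone"] for t in tones), breaks))

info = text_analysis("好 professor 教出 好 student")
labels = synthesize(info, tone_prediction(info), breakpoint_prediction(info))
```

The point of the sketch is the data flow: linguistic information feeds both tone and breakpoint prediction, and all three streams are merged before synthesis.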

In the above method, the text analysis comprises conventional text analysis and foreign-language POS tagging. Conventional text analysis tags the Chinese POS with known methods. The model needed for foreign-language POS tagging is built by feeding mixed Chinese-foreign text data through conventional text analysis to produce input information, and then training a foreign-language POS tagging model on that input information.

In the above method, the model needed for tone prediction is built as follows: mixed Chinese-foreign text data is passed through the text analysis to produce input information; the tones that annotators assign to the foreign-language words in that data by silent reading serve as target information; and a tone prediction model is trained by combining the input and target information.

In the above method, the model needed for breakpoint prediction is built by producing the linguistic information from Chinese text data, taking the breakpoint labels of the Chinese speech database corresponding to that text as target information, and training a breakpoint prediction model on the combined linguistic and target information.

In the above method, the model needed for the speech synthesis procedure is an HTS model built by jointly training on the Chinese and foreign-language speech data together with the corresponding linguistic, tone, and breakpoint information. The tone information of the foreign-language speech database must be produced by tone labeling with a tone recognition model, which is itself built by tone recognition training on the Chinese speech database and its corresponding Chinese text data.

The present invention further provides a speech synthesis system comprising: a text analysis module that performs text analysis on mixed Chinese-foreign sentences to produce linguistic information including the Chinese POS tags and the foreign-language POS tags decided by the Chinese syntactic structure of the whole sentence; a tone prediction module that produces tone information for the foreign-language words of the mixed sentence; a breakpoint prediction module that uses the linguistic information to produce breakpoint information for the mixed sentence; and an HTS module that integrates the linguistic, tone, and breakpoint information to produce synthesized speech for the mixed sentence.

In summary, the speech synthesis method and system of the present invention use text analysis to produce linguistic information including Chinese and foreign-language POS tags, produce tone information for the foreign language through tone prediction, produce breakpoint information for the mixed sentence from the linguistic information through breakpoint prediction, and integrate the linguistic, tone, and breakpoint information to produce synthesized speech for the mixed Chinese-foreign sentence.

101‧‧‧mixed Chinese-foreign text data
102‧‧‧Chinese text data
103‧‧‧foreign-language text data
104‧‧‧Chinese speech database
105‧‧‧foreign-language speech database
111‧‧‧text analysis
112‧‧‧foreign-language POS tagging training
121‧‧‧foreign-language tone annotation (annotator silent-reading method)
122‧‧‧tone prediction training
131‧‧‧text analysis processing
132‧‧‧Chinese breakpoint prediction training
141‧‧‧tone recognition training
151‧‧‧tone labeling processing
152‧‧‧breakpoint labeling processing
153‧‧‧integrated training
161‧‧‧foreign-language POS tagging model
162‧‧‧tone prediction model
163‧‧‧breakpoint prediction model
164‧‧‧tone recognition model
165‧‧‧HTS model
200‧‧‧input sentence signal
201‧‧‧text analysis
202‧‧‧tone prediction
203‧‧‧breakpoint prediction
204‧‧‧speech synthesis label generator
205‧‧‧speech synthesis
206‧‧‧output synthesized speech

Fig. 1 is a schematic diagram of the training stage of the speech synthesis method of the present invention.

Fig. 2 is a schematic diagram of the synthesis stage of the speech synthesis method of the present invention.

Fig. 3 shows the relationship between the prosodic hierarchy and the break types of the breakpoint prediction model of the speech synthesis method of the present invention.

The following describes embodiments of the present invention through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the disclosure herein.

Note that the structures, proportions, and sizes shown in the drawings of this specification serve only to accompany the disclosed content for the understanding of those skilled in the art, and are not intended to limit the conditions under which the invention may be practiced; they therefore carry no substantive technical meaning. Any modification of structure, change of proportion, or adjustment of size that does not affect the effects and objectives achievable by the invention still falls within the scope of the disclosed technical content. Likewise, terms such as "front", "back", and "a" are used only for clarity of description, not to limit the practicable scope of the invention; changes or adjustments of such relative relationships, without substantive change to the technical content, are also regarded as within the practicable scope of the invention.

Some interesting phenomena can be observed in mixed Chinese-foreign sentences. For example, mixed Mandarin-English sentences follow Chinese syntactic structure, so when deciding the POS of an embedded English word or phrase one should not be restricted to its original possible POS tags; this also corrects common misuses of English words. In the earlier example 「此應用之implement難度相當高」, "implement" should be tagged as a noun or adjective according to the Chinese syntactic structure, not as a verb.

Furthermore, regarding the interaction at Mandarin-English word junctures between Mandarin, which emphasizes tone, and English, which emphasizes stress: because native Mandarin speakers' English pronunciation is influenced by their first language, tone information tends to replace the stress information of English words or phrases. In the examples 「好professor教出好student」 and 「我倆的想法不match，所以我不follow他的作法」, the two instances of 「好」 and the two of 「不」 undergo different tone sandhi depending on the English word that follows: 「好」 surfaces as second tone before "professor" but third tone before "student", while 「不」 surfaces as second tone before "match" but fourth tone before "follow". This shows that English words can be assigned tone information, which in turn triggers sandhi in neighboring Chinese words.
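
These surface tones follow from two standard Mandarin sandhi rules once the English syllables carry borrowed tones. The sketch below encodes the rules; the borrowed tones assumed for the English syllables ("pro-" third tone, "stu-" first tone, "match" fourth tone, "follow" second tone) are illustrative assumptions, not data from the patent.

```python
def third_tone_sandhi(tone, next_tone):
    """A third tone surfaces as second tone before another third tone."""
    return 2 if tone == 3 and next_tone == 3 else tone

def bu_sandhi(next_tone):
    """不 (citation tone 4) surfaces as second tone before a fourth tone."""
    return 2 if next_tone == 4 else 4

# 好 (tone 3) before "pro-" (borrowed tone 3) -> 2; before "stu-" (tone 1) -> 3
# 不 before "match" (borrowed tone 4) -> 2; before "follow" (tone 2) -> 4
```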

The present invention addresses the requirements of the speech database together with the part-of-speech, tone, and prosodic breakpoint information.

Regarding the speech database: the basic requirements for a synthesis database are sufficient synthesis units and rich prosodic variation. Although using mixed Mandarin-English recordings would avoid a mismatch between training and testing, the lopsided ratio of Chinese to English makes those two requirements hard to satisfy, and native Mandarin speakers pronounce English with large and inconsistent variability, a phenomenon that is even more pronounced in mixed sentences. The present invention therefore has the same speaker record Chinese and English speech data separately.

Regarding POS information: because users misuse or habitually simplify terms, the English words or phrases embedded in mixed sentences are error-prone. The approach of the present invention is to decide the POS of the embedded English word or phrase from the Chinese syntactic structure of the whole sentence. This replaces the conventional TTS practice of either assigning embedded foreign words a special "foreign" class or judging only from the possible POS tags listed for the word or phrase in an English dictionary.

Regarding tone information: since native Mandarin speakers' English pronunciation is influenced by their first language, producing the tone borrowing phenomenon, the present invention replaces the stress information of English words or phrases with tone information. For mixed Mandarin-English TTS, this adjustment concerns two procedures. The first is text analysis, which must produce corresponding tone information for the English words or phrases in the input sentence. The second is the English speech database used for prosody analysis and synthesis-model training, which must provide tone information so that consistent features can be used together with the Chinese speech database in the mixed TTS processing.

Regarding prosodic breakpoint information: conventional TTS methods analyze only the text, combine it with word-segmentation and POS information, build the syntactic structure of base phrases, and decide prosodic boundaries from it. Because syntactic boundaries do not necessarily coincide with prosodic boundaries, the method of the present invention replaces the conventional syntactic boundaries represented by base phrases with the statistical distribution of prosodic breaks in the speech data; that is, a breakpoint prediction model is built from the prosodic break variation exhibited by the training speech data. Since the recording speaker's native language is Mandarin, the recorded English speech is easily influenced by the first language: the prosodic variation in the Chinese speech data is comparatively stable, whereas that in the English speech data is large and inconsistent. The present invention therefore builds the breakpoint prediction model from the Chinese speech data.

Figs. 1 and 2 illustrate the mixed Chinese-foreign speech synthesis method of the present invention, which comprises a training stage and a synthesis stage. In this embodiment the foreign language is English, but the invention is not limited to English.

The method can also be implemented in hardware, software, and/or firmware as a mixed Chinese-foreign speech synthesis system: a foreign-language POS tagging module implements the foreign-language POS tagging model 161 of Fig. 1, a tone prediction module implements the tone prediction model 162, a breakpoint prediction module implements the breakpoint prediction model 163, a tone recognition module implements the tone recognition model 164, an HTS module implements the HTS model 165, and a text analysis module implements the text analysis 111 of Fig. 1. The invention is not limited to this arrangement.

As shown in Fig. 1, the training stage comprises: building the foreign-language POS tagging model 161 for the English words or phrases in mixed sentences; building the tone prediction model 162 for those English words or phrases; building the breakpoint prediction model 163 from the Chinese speech database 104; building the tone recognition model 164 that provides tone information for the foreign-language (e.g., English) speech database 105; and integrating the Chinese and English speech data with their linguistic, tone, and breakpoint information to build a speech synthesis model 165 based on the hidden Markov model (HMM-based speech synthesis system, H-Triple-S, i.e., HTS). The data needed to build these models comprise the mixed Chinese-foreign text data 101, the Chinese speech database 104 and its corresponding Chinese text data 102, and the foreign-language speech database 105 and its corresponding foreign-language text data 103. Building these models can be regarded as the core of the training stage.

The foreign-language POS tagging model 161 is built for the English words or phrases in mixed sentences. It takes as input the linguistic information produced from the mixed Chinese-foreign text data 101 by the text analysis 111, and is trained with conditional random field (CRF) foreign-language POS tagging training 112 for the embedded English words or phrases. The text analysis 111 is a conventional text-processing procedure comprising word segmentation, basic POS tagging, normalization, and tone sandhi processing. The model 161 serves the POS adjustment procedure that the present invention adds after conventional text analysis and is used by the text analysis 201 of the synthesis stage shown in Fig. 2; this adjustment can be regarded as post-processing of conventional text analysis and improves the POS tagging accuracy for the English words or phrases in mixed sentences.

In this embodiment, the CRF-based model 161 is trained on the mixed Chinese-foreign text data 101. Its input features include the possible POS tags, prefix information, and suffix information of the English word to be processed, together with the word and POS information of its neighboring words; the output label is the POS of the English word as decided by the Chinese syntactic structure of the sentence.
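
A feature extractor for this CRF step might look like the sketch below. The field names and context window are invented for illustration; a real system would feed such per-token feature dicts to a CRF toolkit.

```python
def pos_features(tokens, i):
    """Features for the embedded English token tokens[i].

    Each token dict may carry 'word', 'pos_candidates' (possible POS
    tags from a dictionary), and 'pos' (POS of Chinese neighbours)."""
    w = tokens[i]["word"]
    feats = {
        "word": w.lower(),
        "possible_pos": "|".join(tokens[i].get("pos_candidates", [])),
        "prefix2": w[:2],          # prefix information
        "suffix2": w[-2:],         # suffix information
        "suffix3": w[-3:],
    }
    if i > 0:                      # left neighbour: word and POS
        feats["prev_word"] = tokens[i - 1]["word"]
        feats["prev_pos"] = tokens[i - 1].get("pos", "")
    if i + 1 < len(tokens):        # right neighbour: word and POS
        feats["next_word"] = tokens[i + 1]["word"]
        feats["next_pos"] = tokens[i + 1].get("pos", "")
    return feats

# 「此應用之implement難度相當高」, simplified; the neighbour POS tags
# shown here are placeholders.
feats = pos_features(
    [{"word": "此應用之", "pos": "DET"},
     {"word": "implement", "pos_candidates": ["V", "N"]},
     {"word": "難度", "pos": "N"}], 1)
```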

The tone prediction model 162 is built for the English words or phrases in mixed sentences. Its required information is the linguistic information (input) produced from the mixed Chinese-foreign text data 101 by the text analysis 111, and the tone information (target) of the English words or phrases in that data; the two are combined and CRF-based tone prediction training 122 is applied to build the model. Specifically, the foreign-language tone annotation 121 of the embedded English words or phrases is produced by having annotators read silently; the annotators are native Mandarin speakers fluent in English. The finished model is used in the tone prediction 202 of the synthesis stage of Fig. 2; notably, tone prediction 202 operates on mixed Mandarin-English text, not purely English text.

In this embodiment, the CRF-based tone prediction model 162 is trained on the mixed Chinese-foreign text data 101 with input features including: the word; the syllable; the position of the syllable within the word; the POS; the punctuation following the current syllable; the original tone; the combination of the original tones of the previous and current syllables; the combination of the previous, current, and next syllables; and the combination of the within-word positions of the previous, current, and next syllables.

In the original-tone information, for a Chinese word the original tone is the tone of each syllable, whereas for an English word it is the word's original stress information.

The output labels required for the training data of tone prediction training 122 are the manually annotated tones of the syllables of the English words to be processed; CRF training then completes the model 162.
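
The syllable-level feature list above can be encoded as follows. All field names are invented for this sketch, and the stress-derived "original tones" of the English syllables are placeholders.

```python
def tone_features(syllables, i):
    """Feature dict for syllable i of an embedded English word."""
    cur = syllables[i]
    prev = syllables[i - 1] if i > 0 else None
    nxt = syllables[i + 1] if i + 1 < len(syllables) else None
    return {
        "word": cur["word"],
        "syl": cur["syl"],
        "pos_in_word": cur["pos_in_word"],   # position of syllable in word
        "pos_tag": cur["pos_tag"],           # POS of the word
        "punct": cur["punct"],               # punctuation after the syllable
        "orig_tone": cur["orig_tone"],       # lexical tone / original stress
        # combination features across the syllable window
        "prev_cur_tone": ((prev["orig_tone"] if prev else "BOS"),
                          cur["orig_tone"]),
        "syl_trigram": ((prev["syl"] if prev else "BOS"), cur["syl"],
                        (nxt["syl"] if nxt else "EOS")),
        "pos_in_word_trigram": ((prev["pos_in_word"] if prev else "BOS"),
                                cur["pos_in_word"],
                                (nxt["pos_in_word"] if nxt else "EOS")),
    }

# "professor" split into three syllables with placeholder stress marks.
syls = [{"word": "professor", "syl": s, "pos_in_word": k, "pos_tag": "N",
         "punct": "", "orig_tone": t}
        for k, (s, t) in enumerate([("pro", "S0"), ("fes", "S1"), ("sor", "S0")])]
f = tone_features(syls, 1)
```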

The breakpoint prediction model 163 is built as follows. Because the speaker who provides the training speech is a native Mandarin speaker, the recorded foreign-language (English) speech is easily influenced by the first language, while the prosodic variation of the Chinese speech data is relatively stable; and because the mixed sentences of the present invention are primarily Chinese with English as a supplement, the model 163 is built from independent Chinese speech data.

As shown in Fig. 1, the input to the breakpoint prediction model 163 is the linguistic information produced from the Chinese text data 102 by text analysis processing 131, and the breakpoint labels produced from the corresponding Chinese speech database 104 by statistical breakpoint labeling 152 serve as targets; the two are combined and deep neural network (DNN) Chinese breakpoint prediction training 132 is applied. Specifically, breakpoint labeling 152 computes statistics of duration, pitch, energy, and pauses over the Chinese speech database 104 to derive appropriate thresholds, and then uses those thresholds to annotate prosodic breaks in the speech data. The finished model 163 is used in the breakpoint prediction 203 of the synthesis stage shown in Fig. 2.

In this embodiment, the training data used in the Chinese breakpoint prediction training 132 of the breakpoint prediction model 163 are the Chinese speech database 104 and its corresponding Chinese text data 102, and the breakpoint labeling process 152 annotates breakpoints in the Chinese speech database 104 as follows:

First, as shown in FIG. 3, a four-level prosodic hierarchy is defined together with seven associated break types B0, B1, B2-1, B2-2, B2-3, B3, B4. The first level is the syllable (SYL), with corresponding breaks B0 and B1; the second level is the prosodic word (PW), with corresponding breaks B2-1, B2-2 and B2-3; the third level is the prosodic phrase (PPh), with corresponding break B3; and the fourth level is the breath group (BG) or prosodic group (PG), with corresponding break B4.

Second, statistics of various prosodic parameters are computed over the Chinese speech database 104 to derive appropriate thresholds, which are then used to annotate the locations of the seven break types B0, B1, B2-1, B2-2, B2-3, B3, B4 in the speech data. For example, the prosodic parameters include syllable pause duration, syllable energy low point, normalized fundamental-frequency (F0) jump, normalized syllable lengthening factor, and the F0 gap between syllables.

Third, the relationship between the prosodic parameters and the break labels is as follows. Syllable pause duration is the most important prosodic parameter for break labeling: most word boundaries at punctuation marks have longer pauses, as with breaks B3 and B4, and break B4 has a longer pause than B3, so pause duration can be used to distinguish B3 from B4. Syllable positions that are not word boundaries usually have short or no pauses, as with breaks B0 and B1; these two types can be further distinguished by the inter-syllable F0 gap and the syllable energy low point. Word boundaries without punctuation exhibit one of three characteristics, namely a moderate or longer syllable pause, an F0 jump, or syllable lengthening, and are classified as B2-2, B2-1 and B2-3 respectively.
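The tendencies above can be summarized as a decision procedure. The toy classifier below follows that structure; the numeric thresholds, argument names, and tie-breaking order are illustrative assumptions, whereas the patent derives its thresholds statistically from the Chinese speech corpus.

```python
def classify_break(pause_ms, f0_gap, energy_low, f0_jump, lengthening,
                   at_word_boundary, at_punctuation,
                   th_b4=500.0, th_mid=80.0):
    """Toy rule-based break classifier following the tendencies in the text.

    All thresholds (th_b4, th_mid) are assumed values for illustration.
    """
    if at_punctuation:
        # B3/B4 both occur at punctuation; B4 has the longer pause
        return "B4" if pause_ms >= th_b4 else "B3"
    if not at_word_boundary:
        # B0 vs B1: distinguished by inter-syllable F0 gap and energy low point
        return "B1" if (f0_gap or energy_low) else "B0"
    # Word boundary without punctuation: pause -> B2-2, F0 jump -> B2-1,
    # syllable lengthening -> B2-3
    if pause_ms >= th_mid:
        return "B2-2"
    if f0_jump:
        return "B2-1"
    if lengthening:
        return "B2-3"
    return "B2-2"
```

A real labeler would combine the cues with corpus-derived thresholds rather than a fixed priority order.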

Furthermore, the labels of the breaks B0, B1, B2-1, B2-2, B2-3, B3, B4 serve as the output target values for building the breakpoint prediction model 163. The input features of the breakpoint prediction model 163 include: phoneme category; whether the current break is at a word boundary; punctuation category; length of the current sentence in syllables; length of the previous sentence in syllables; length of the next sentence in syllables; distance in syllables to the previous punctuation mark; distance in syllables to the next punctuation mark; parts of speech of the current word and the p preceding words; number of syllables in the current word and the p preceding words; parts of speech of the q words following the current word; and number of syllables in the q words following the current word.

Moreover, p and q can be adjusted according to the complexity requirements of the system, and the breakpoint prediction model 163 is then completed using a DNN architecture.
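The patent does not specify the DNN's depth or layer widths. As a minimal sketch of such a classifier, the forward pass below maps an encoded feature vector to a distribution over the seven break types; the hidden-layer sizes, ReLU activations, and random initialization are assumptions made for illustration (a real model would be trained on the labeled corpus).

```python
import numpy as np

BREAK_CLASSES = ["B0", "B1", "B2-1", "B2-2", "B2-3", "B3", "B4"]

def init_dnn(n_features, hidden=(64, 64), n_classes=len(BREAK_CLASSES), seed=0):
    """Initialize a small feed-forward network (layer sizes are assumptions)."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden, n_classes]
    return [(rng.standard_normal((i, o)) * 0.1, np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def predict_break(params, features):
    """Forward pass: ReLU hidden layers, softmax output over break classes."""
    h = np.asarray(features, dtype=float)
    for k, (w, b) in enumerate(params):
        h = h @ w + b
        if k < len(params) - 1:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    e = np.exp(h - h.max())
    probs = e / e.sum()
    return BREAK_CLASSES[int(probs.argmax())], probs

# Untrained example: 20-dimensional encoded feature vector
params = init_dnn(n_features=20)
label, probs = predict_break(params, np.zeros(20))
```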

The tone recognition model 164 provides tone information for English speech data. So that the prosodic analysis of the Chinese and foreign-language speech data and the integrated training 153 of the HTS model 165 can use a consistent feature set, tone information is used in place of stress information when extracting prosodic features from the foreign-language speech data.

As shown in FIG. 1, the tone recognition model 164 is built by applying DNN-based tone recognition training 141 to the Chinese speech database 104 and its corresponding Chinese text data 102, and is used in the subsequent tone labeling process 151 to provide tone information for English speech data. Note in particular that the tone labeling process 151 operates on all-English data, so the model it requires differs from the tone prediction model 162, whose target is Chinese-English mixed data.

In this embodiment, the training data used to build the tone recognition model 164 are the Chinese speech database 104 and its corresponding Chinese text data 102. The input parameters include the orthogonal-expansion coefficients of each syllable's log-F0 contour, syllable duration, syllable energy, and the pause duration between syllables. The context window covered by these input parameters can be adjusted according to the complexity requirements of the system; typically the current syllable plus one syllable on each side, three syllables in total, is chosen. The tone recognition model 164 is then completed using a DNN architecture.
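One common realization of "orthogonal coefficients of a syllable's log-F0 contour" in Mandarin prosody modeling is a least-squares expansion in Legendre polynomials; the patent does not name the basis, so the choice of a Legendre basis and the expansion order below are assumptions.

```python
import numpy as np

def logf0_orthogonal_coeffs(f0_hz, order=3):
    """Expand a syllable's log-F0 contour in Legendre polynomials on [-1, 1].

    The Legendre basis is an assumption: the patent only says 'orthogonal
    coefficients' without naming the basis. Returns order+1 coefficients.
    """
    y = np.log(np.asarray(f0_hz, dtype=float))
    x = np.linspace(-1.0, 1.0, len(y))
    # Least-squares fit of a Legendre series, coefficients c0..c_order
    return np.polynomial.legendre.legfit(x, y, deg=order)

# A flat 200 Hz contour: only the 0th-order (mean) coefficient is non-zero
coeffs = logf0_orthogonal_coeffs([200.0] * 10, order=3)
```

The low-order coefficients roughly capture the mean, slope and curvature of the contour, which is why they pair naturally with tone classification.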

The HTS model 165 is built by integrated training 153 over the Chinese and English speech data together with their linguistic, tone and break information. The data comprise the independent Chinese speech database 104 and the foreign-language speech database 105, and the HTS label data required for training are produced by integrating the linguistic information of the two speech databases, the output of the breakpoint labeling process 152, and the tone information, for use in the speech synthesis 205 of the synthesis processing operation of FIG. 2. During training of the HTS model 165, the required annotation files and question set therefore have the following characteristics: first, the speech data of the two languages share a consistent annotation format; second, the question set used for model training can share prosodic information such as tones and breaks.

In this embodiment, the HTS model 165 is built with the HTS-2.3 toolkit. The training speech data comprise separate Chinese and English speech data, together with HTS labels generated by integrating the linguistic information from text analysis, the break information from breakpoint prediction, and the tone information for embedded English words or phrases. The HTS labels include: phoneme category (considering the current phoneme and two on each side, five phonemes in total); tone category (considering the current syllable and one on each side, three syllables in total); vowel category of the current syllable; break categories before and after the current prosodic unit; phoneme position within the current syllable; syllable position within the current PW, PPh and BG/PG; PW position within the current PPh and BG/PG; and PPh position within the current BG/PG.
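The exact full-context label format is defined by the training question set and is not reproduced in the patent. The sketch below shows only the general shape of a label assembled from the fields listed above; the delimiters, field order and function signature are illustrative assumptions.

```python
def make_hts_label(phonemes, tones, vowel, breaks, positions):
    """Assemble one HTS full-context label line from the listed fields.

    Field order and delimiters are assumptions for illustration; real HTS
    labels must match the format expected by the question set.
    """
    p2, p1, p0, n1, n2 = phonemes   # current phoneme with two of context each side
    t1, t0, t2 = tones              # current tone with one syllable of context each side
    b_prev, b_next = breaks         # break categories before/after the prosodic unit
    return (f"{p2}^{p1}-{p0}+{n1}={n2}"
            f"/T:{t1}_{t0}_{t2}"
            f"/V:{vowel}"
            f"/B:{b_prev}-{b_next}"
            f"/P:{'_'.join(str(v) for v in positions)}")

label = make_hts_label(
    phonemes=("sil", "n", "i", "h", "au"),
    tones=(0, 3, 3),
    vowel="i",
    breaks=("B3", "B1"),
    positions=(1, 2, 1, 1),  # phoneme/syllable/PW/PPh positions (assumed)
)
```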

The present invention also discloses a synthesis processing operation, as shown in FIG. 2, which comprises: passing the Chinese-English mixed input sentence signal 200 through text analysis 201 to produce the corresponding linguistic information, the text analysis including analysis of the Chinese together with the part-of-speech tagging built for the English words or phrases in the sentence signal; generating the corresponding tone information for the English words or phrases in the sentence signal via the tone prediction 202 procedure; passing the linguistic information through the break prediction 203 procedure to produce the corresponding break information; integrating the linguistic information, tone information and break information and producing the corresponding HTS labels via the HMM-based speech synthesis label generator (HTS label generator) 204; and using the HTS labels, through HMM-based speech synthesis 205, to produce and output the synthesized speech 206.
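The data flow of FIG. 2 can be sketched as a short pipeline. The five stage functions below are placeholders for the trained models 161 to 165; their names, signatures and the stub implementations are assumptions made purely to show how the stages chain together.

```python
def synthesize(sentence, models):
    """Sketch of the FIG. 2 synthesis pipeline (stage names are assumed)."""
    linguistic = models["text_analysis"](sentence)                     # 201
    tones = models["tone_prediction"](sentence)                        # 202
    breaks = models["break_prediction"](linguistic)                    # 203
    labels = models["hts_label_generator"](linguistic, tones, breaks)  # 204
    return models["speech_synthesis"](labels)                         # 205

# Wiring with trivial stand-in stages to demonstrate the data flow only
stub = {
    "text_analysis": lambda s: {"words": s.split(), "pos": ["X"] * len(s.split())},
    "tone_prediction": lambda s: [1] * len(s.split()),
    "break_prediction": lambda lg: ["B2-2"] * len(lg["words"]),
    "hts_label_generator": lambda lg, t, b: list(zip(lg["words"], t, b)),
    "speech_synthesis": lambda labels: f"<waveform:{len(labels)} units>",
}
out = synthesize("today we have a meeting", stub)
```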

In this embodiment, the synthesis processing operation is linked to the training processing operation: the four procedures of text analysis 201, tone prediction 202, breakpoint prediction 203 and speech synthesis 205 are all carried out with the models built by the training processing operation.

In the text analysis 201 procedure, the part-of-speech tagging uses the foreign-language part-of-speech tagging model 161 built with the CRF technique of the training processing operation, using the same input template, and determines the part of speech of each embedded English word or phrase according to the Chinese syntactic structure of the sentence.

In the tone prediction 202 procedure, the tone prediction model 162 built with the CRF technique of the training processing operation is used, with the same input template, to produce the corresponding tone information.

In the breakpoint prediction 203 procedure, the breakpoint prediction model 163 built with the DNN technique of the training processing operation is used, with the same input features, to produce the corresponding break information.

In the speech synthesis label generator 204, the linguistic information, tone information and break information are integrated and the corresponding HTS labels are produced, their format following the HTS model 165 of the training processing operation.

In the speech synthesis 205 procedure, the synthesized speech is produced from the HTS labels by means of the HTS model 165 of the training processing operation.

In summary, in the speech synthesis method and system of the present invention, text analysis produces linguistic information including Chinese and foreign-language parts of speech; a tone prediction procedure produces the tone information of the foreign language; a break prediction procedure produces the break information of the Chinese/foreign-language mixed sentence from the linguistic information; and the linguistic information, tone information and break information are integrated to produce the synthesized speech of the Chinese/foreign-language mixed sentence.

The above embodiments are intended to illustrate the principles and effects of the present invention, not to limit it. Anyone skilled in the art may modify the above embodiments without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall therefore be as set out in the claims below.

200‧‧‧input sentence signal
201‧‧‧text analysis
202‧‧‧tone prediction
203‧‧‧breakpoint prediction
204‧‧‧speech synthesis label generator
205‧‧‧speech synthesis
206‧‧‧output synthesized speech

Claims (8)

1. A speech synthesis method, comprising: performing text analysis on a Chinese/foreign-language mixed sentence to produce linguistic information including Chinese parts of speech and foreign-language parts of speech, and producing tone information of the foreign language through a tone prediction procedure; producing break information of the Chinese/foreign-language mixed sentence from the linguistic information through a break prediction procedure; and integrating the linguistic information, the tone information and the break information through a speech synthesis procedure to produce a synthesized speech of the Chinese/foreign-language mixed sentence.

2. The speech synthesis method of claim 1, wherein the text analysis includes conventional text analysis and foreign-language part-of-speech tagging.

3. The speech synthesis method of claim 2, wherein the model required by the foreign-language part-of-speech tagging is built by producing input information from the Chinese/foreign-language mixed text data through conventional text analysis, and then training on the input information with foreign-language part-of-speech tagging training.

4. The speech synthesis method of claim 1, wherein the model required by the tone prediction procedure is built by producing input information from the Chinese/foreign-language mixed text data through the text analysis, taking as target information the tones produced by annotators silently reading the foreign-language portions of the mixed text data, and combining the input information and the target information in a tone prediction training procedure.

5. The speech synthesis method of claim 1, wherein the model required by the break prediction procedure is built by producing the linguistic information from Chinese text data, taking as target information the break labels of a Chinese speech database corresponding to the Chinese text data, and combining the linguistic information and the target information in a Chinese break prediction training procedure.

6. The speech synthesis method of claim 1, wherein the model required by the speech synthesis procedure is built by integrated training over Chinese speech data and foreign-language speech data together with the linguistic, tone and break information corresponding to the Chinese and the foreign language.

7. The speech synthesis method of claim 6, wherein the tone information of the foreign-language speech database is produced through a tone labeling process using a tone recognition model, the tone recognition model being built by a tone recognition training procedure on a Chinese speech database and its corresponding Chinese text data.

8. A speech synthesis system, comprising: a text analysis module that performs text analysis on a Chinese/foreign-language mixed sentence to produce linguistic information including Chinese parts of speech and foreign-language parts of speech; a tone prediction module that produces tone information for the foreign language of the mixed sentence; a break prediction module that produces break information of the mixed sentence from the linguistic information; and an HTS module that integrates the linguistic information, the tone information and the break information to produce a synthesized speech of the mixed sentence.
TW107137546A 2018-10-24 2018-10-24 Method for speech synthesis and system thereof TWI703556B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW107137546A TWI703556B (en) 2018-10-24 2018-10-24 Method for speech synthesis and system thereof


Publications (2)

Publication Number Publication Date
TW202016921A true TW202016921A (en) 2020-05-01
TWI703556B TWI703556B (en) 2020-09-01

Family

ID=71895591



Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000045289A1 (en) 1999-01-29 2000-08-03 Sony Electronics, Inc. A method and apparatus for example-based spoken language translation with examples having grades of specificity
US8244534B2 (en) * 2007-08-20 2012-08-14 Microsoft Corporation HMM-based bilingual (Mandarin-English) TTS techniques
CN102214166B (en) * 2010-04-06 2013-02-20 三星电子(中国)研发中心 Machine translation system and machine translation method based on syntactic analysis and hierarchical model
KR101394839B1 (en) * 2013-05-28 2014-05-13 정진철 A study machine of standard hangul syllable and the study method


Also Published As

Publication number Publication date
TWI703556B (en) 2020-09-01
