TW201705019A - Text-to-speech method and multi-lingual speech synthesizer using the method - Google Patents


Info

Publication number
TW201705019A
Authority
TW
Taiwan
Prior art keywords
language
phoneme
model database
pronunciation
tag string
Prior art date
Application number
TW104137212A
Other languages
Chinese (zh)
Other versions
TWI605350B (en)
Inventor
劉訓甫
潘迪 阿布舍克
許晋誠
Original Assignee
華碩電腦股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 華碩電腦股份有限公司
Priority to US14/956,405 (US9865251B2)
Publication of TW201705019A
Application granted
Publication of TWI605350B


Abstract

A text-to-speech method and a multi-lingual speech synthesizer using the method are disclosed. The multi-lingual speech synthesizer, and the method executed by a processor, are used to process a multi-lingual text message containing a mixture of a first language and a second language into a multi-lingual voice message. The multi-lingual speech synthesizer comprises a storage device configured to store a first language model database and a second language model database, a broadcasting device configured to broadcast the multi-lingual voice message, and a processor, connected to the storage device and the broadcasting device, configured to execute the method disclosed herein.

Description

Text-to-speech method and multi-lingual speech synthesis device

The present disclosure relates to a text-to-speech method, and more particularly to a text-to-speech method and a synthesis device for processing a multi-lingual text message into multi-lingual voice audio.

With the development of the global market, expressions that mix several languages are common in everyday speech and writing. In particular, proper nouns in professional fields, foreign personal names, foreign place names, and vocabulary unique to a foreign country often cannot be adequately expressed with translated terms.

A typical text-to-speech (TTS) method can usually handle only a single language: according to the text content, it looks up the corresponding voice data in the database of that language and synthesizes the pronunciation audio for that text. However, when another language appears in the text message, a traditional text-to-speech method has difficulty, because no matching voice data can be found in the database.

The present invention provides a text-to-speech method, executed by a processor, for converting a multi-lingual text message having a first language and a second language into multi-lingual voice audio, in cooperation with a first language model database having a plurality of first-language phoneme labels and same-language connection tone information of the first language, and a second language model database having a plurality of second-language phoneme labels and same-language connection tone information of the second language. The text-to-speech method includes: dividing the multi-lingual text message into at least one first-language paragraph and at least one second-language paragraph; converting the first-language paragraph into at least one first-language phoneme label and converting the at least one second-language paragraph into at least one second-language phoneme label; searching the first language model database with the at least one first-language phoneme label to obtain at least one first-language phoneme label string, and searching the second language model database with the at least one second-language phoneme label to obtain at least one second-language phoneme label string; combining the at least one first-language phoneme label string and the at least one second-language phoneme label string into a multi-lingual phoneme label string according to the text order of the multi-lingual text message; generating cross-language connection tone data at the junction of every two adjacent phoneme label strings, wherein every two adjacent phoneme label strings include one first-language phoneme label string of the at least one first-language phoneme label string and one second-language phoneme label string of the at least one second-language phoneme label string; merging the multi-lingual phoneme label string, the same-language connection tone data of the first language at the junction between every two adjacent phoneme labels in the at least one first-language phoneme label string, the same-language connection tone data of the second language at the junction between every two adjacent phoneme labels in the at least one second-language phoneme label string, and the cross-language connection tone data to generate the multi-lingual voice audio; and outputting the multi-lingual voice audio.

The present invention further provides a multi-lingual speech synthesis device for processing a multi-lingual text message having a first language and a second language into multi-lingual voice audio. The synthesis device includes: a storage module storing a first language model database having a plurality of first-language phoneme labels and same-language connection tone information of the first language, and a second language model database having a plurality of second-language phoneme labels and same-language connection tone information of the second language; a broadcasting module for playing the multi-lingual voice audio; and a processor, connected to the storage module and the broadcasting module, configured to: divide the multi-lingual text message into at least one first-language paragraph and at least one second-language paragraph; convert the first-language paragraph into at least one first-language phoneme label and convert the at least one second-language paragraph into at least one second-language phoneme label; search the first language model database with the at least one first-language phoneme label to obtain at least one first-language phoneme label string, and search the second language model database with the at least one second-language phoneme label to obtain at least one second-language phoneme label string; combine the at least one first-language phoneme label string and the at least one second-language phoneme label string into a multi-lingual phoneme label string according to the text order of the multi-lingual text message; generate cross-language connection tone data at the junction of every two adjacent phoneme label strings, wherein every two adjacent phoneme label strings include one first-language phoneme label string of the at least one first-language phoneme label string and one second-language phoneme label string of the at least one second-language phoneme label string; merge the multi-lingual phoneme label string, the same-language connection tone data of the first language at the junction between every two adjacent phoneme labels in the at least one first-language phoneme label string, the same-language connection tone data of the second language at the junction between every two adjacent phoneme labels in the at least one second-language phoneme label string, and the cross-language connection tone data to generate the multi-lingual voice audio; and output the multi-lingual voice audio to the broadcasting module.

To make the above and other objects, features, advantages, and embodiments of the present disclosure more comprehensible, the reference numerals used in the accompanying drawings are described as follows:

100‧‧‧Text-to-speech system

120‧‧‧Storage module

140‧‧‧Broadcasting module

160‧‧‧Processor

180‧‧‧Sound-receiving module

200‧‧‧Text-to-speech method

LMD1~LMD2‧‧‧Language model databases

PU1~PU3‧‧‧Pronunciation units

AU1a~AU3c‧‧‧Candidate audio information

L1, L2‧‧‧Connection paths

Pavg1, Pavg2‧‧‧Reference average frequencies

PAU, PAU1~PAU820‧‧‧Pitch frequency data

PCAND‧‧‧Candidate frequency data

ML‧‧‧Training speech

SAM, SAM1, SAM2‧‧‧Sample recordings

LAN1, LAN2‧‧‧Languages

P1, P2‧‧‧Pitches

T1, T2‧‧‧Tempos

F1, F2‧‧‧Timbres

S210~S270, S241~S244, S246~S247‧‧‧Steps

S251~S252, S310~S330‧‧‧Steps

To make the above and other objects, features, advantages, and embodiments of the present disclosure more comprehensible, the accompanying drawings are described as follows: FIG. 1 is a functional block diagram of a multi-lingual speech synthesis device according to an embodiment of the present disclosure; FIG. 2 is a flowchart of a text-to-speech method according to an embodiment of the present disclosure; FIG. 3 and FIG. 4 are flowcharts of step S240 according to an embodiment of the present disclosure; FIG. 5 is a flowchart of step S250 according to an embodiment of the present disclosure; FIG. 6A and FIG. 6B illustrate a method of computing candidate audio information according to an embodiment of the present disclosure; FIG. 7 is a schematic diagram of determining the connection paths of the pronunciation units according to an embodiment of the present disclosure; FIG. 8 is a flowchart of the training method of the training program of the text-to-speech method according to an embodiment of the present disclosure; and FIG. 9A to FIG. 9C are schematic diagrams of the mixed-language training speech ML, the sample recordings SAM, and the analyzed pitch, tempo, and timbre of the mixed language according to an embodiment of the present disclosure.

The embodiments are described in detail below with reference to the accompanying drawings, but the embodiments provided are not intended to limit the scope of the present disclosure, and the description of structural operation is not intended to limit its order of execution. Any structure in which elements are recombined to produce a device with equivalent functions is within the scope of the present disclosure. In addition, the drawings are for illustrative purposes only and are not drawn to scale.

The terms "first", "second", and so on used herein do not denote any particular order or sequence, nor are they intended to limit the present disclosure; they are used only to distinguish elements or operations described with the same technical terms. Furthermore, the words "comprise", "include", "have", "contain", and the like used herein are open-ended terms, meaning including but not limited to.

Please refer to FIG. 1, which is a functional block diagram of a multi-lingual speech synthesis device according to an embodiment of the present disclosure. As shown in FIG. 1, a multi-lingual speech synthesis device 100 includes a storage module 120, a broadcasting module 140, and a processor 160.

The multi-lingual speech synthesis device 100 processes/converts a text message into the corresponding multi-lingual voice audio, and the broadcasting module 140 outputs this multi-lingual voice audio. In one embodiment, the multi-lingual speech synthesis device 100 can process a text message that contains multiple languages at the same time.

In one embodiment, the storage module 120 stores a plurality of language model databases, such as LMD1 and LMD2, each corresponding to a single language (for example, Chinese, English, Japanese, German, French, Spanish, or any other language in common use). Each language model database contains a plurality of phoneme labels of a single language and the same-language connection tone information between those phoneme labels. In one embodiment, the storage module 120 stores one Chinese language model database LMD1 and another English language model database LMD2 as an illustrative example. However, the types of languages are not limited thereto. In one embodiment, no mixed multi-lingual model database covering both Chinese and English is required.

A phoneme label denotes the smallest unit of sound that distinguishes one pronunciation from another. In one embodiment, a word may consist of one to several syllables, and a syllable may consist of one to several phonemes. In one embodiment, taking Chinese as an example, each Chinese character contains exactly one syllable, and this syllable usually consists of one to three phonemes (each phoneme is similar to a Zhuyin symbol). In one embodiment, for English, each English word contains at least one syllable, and each syllable contains one or more phonemes (each phoneme is similar to an English phonetic symbol). In one embodiment, to achieve a proper pronunciation effect, each language model database stores not only the pronunciation of each phoneme itself but also connection tone information. The connection tone information is the tone used to connect the preceding and following phonemes (or words) when adjacent phonemes are pronounced continuously.

A phoneme label is a representative symbol used for convenient processing by the system. In practice, the language model databases LMD1~LMD2 further store the audio information, such as the pitch, tempo, and timbre, that each phoneme label needs for pronunciation synthesis. In one embodiment, for example, pitch includes, but is not limited to, the frequency of the utterance; tempo includes, but is not limited to, the speed, spacing, and rhythm of the utterance; and timbre includes, but is not limited to, the quality of the sound, the mouth shape, the place of articulation, and so on.
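
As a rough illustration of the kind of record such a database might hold, the sketch below (in Python) models one phoneme-label entry with its pitch, tempo, and timbre information, plus a table of same-language connection tones. All field and variable names are assumptions for illustration; the patent does not prescribe a storage layout.

```python
from dataclasses import dataclass

# Hypothetical layout of one phoneme-label record in a language model
# database such as LMD1 or LMD2 (field names are assumptions).
@dataclass
class PhonemeEntry:
    label: str           # e.g. "M04" for a Mandarin phoneme
    pitch_hz: float      # pitch: fundamental frequency of the utterance
    duration_ms: float   # tempo: duration of the phoneme
    mfcc: list[float]    # timbre: e.g. mel-frequency cepstral coefficients

# Same-language connection tone information between adjacent phoneme
# labels, keyed by the label pair, e.g. ("M04", "M29") -> L[M04, M29].
connection_tones: dict[tuple[str, str], list[float]] = {}
```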

Refer to FIG. 2, which is a flowchart of a text-to-speech method according to an embodiment of the present disclosure. The multi-lingual text-to-speech method 200 processes/converts a text message that simultaneously contains several different languages into multi-lingual voice audio. In one embodiment, the processor 160 executes the multi-lingual text-to-speech method; the processor 160 can be, for example, but is not limited to, a central processing unit (CPU), a system on chip (SoC), an application processor, an audio processor, a digital signal processor (DSP), or a special-purpose processing chip or controller.

In one embodiment, the multi-lingual text message can be, but is not limited to, a paragraph in a document, an instruction entered by the user, text circled on a web page, or text from various other input sources. In one embodiment, the first language model database has a plurality of first-language phoneme labels and same-language connection tone information of the first language, and the second language model database has a plurality of second-language phoneme labels and same-language connection tone information of the second language.

As shown in FIG. 2, the multi-lingual text-to-speech method 200 includes the following steps. In step S210, the multi-lingual text message is divided into at least one first-language paragraph and at least one second-language paragraph. In one embodiment, the processor 160 divides the multi-lingual text message into several language paragraphs according to the different languages. In one embodiment, the text message 「放個Jason Mraz來聽」 ("play some Jason Mraz") is divided into three language paragraphs, namely 「放個」 (a Chinese paragraph), 「Jason Mraz」 (an English paragraph), and 「來聽」 (a Chinese paragraph).
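
A minimal sketch of this segmentation step follows, assuming a simple Unicode-range test to tell Chinese characters from Latin ones; the function names and the test itself are illustrative assumptions, not the patent's own method.

```python
# A minimal sketch of step S210: splitting a mixed Chinese/English text
# message into single-language paragraphs.
def is_chinese(ch: str) -> bool:
    return "\u4e00" <= ch <= "\u9fff"

def split_language_paragraphs(text: str) -> list[tuple[str, str]]:
    """Return (language, paragraph) pairs in the original text order."""
    paragraphs: list[tuple[str, str]] = []
    for ch in text:
        if ch.isspace() and paragraphs and paragraphs[-1][0] == "EN":
            # Keep spaces inside an English run ("Jason Mraz").
            paragraphs[-1] = ("EN", paragraphs[-1][1] + ch)
            continue
        lang = "ZH" if is_chinese(ch) else "EN"
        if paragraphs and paragraphs[-1][0] == lang:
            paragraphs[-1] = (lang, paragraphs[-1][1] + ch)
        else:
            paragraphs.append((lang, ch))
    return paragraphs

print(split_language_paragraphs("放個Jason Mraz來聽"))
# [('ZH', '放個'), ('EN', 'Jason Mraz'), ('ZH', '來聽')]
```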

In step S220, the at least one first-language paragraph is converted into at least one first-language phoneme label, and the at least one second-language paragraph is converted into at least one second-language phoneme label. In one embodiment, each phoneme label can include, but is not limited to, audio information such as the pitch, tempo, and timbre of the phoneme.

In step S230, the first language model database is searched with the at least one first-language phoneme label to obtain at least one first-language phoneme label string, and the second language model database is searched with the at least one second-language phoneme label to obtain at least one second-language phoneme label string.

In one embodiment, M denotes a Mandarin phoneme, and the number identifies a particular phoneme in Chinese. In one embodiment, the Chinese character 「放」 corresponds to the two phoneme labels [M04] and [M29], and the Chinese character 「個」 corresponds to the two phoneme labels [M09] and [M25]. Therefore, the phoneme label string converted from the Chinese language paragraph 「放個」 is [M04 M29 M09 M25]; likewise, the language paragraph 「來聽」 is converted into another corresponding phoneme label string, [M08 M29 M41 M44]. On the other hand, the English language paragraph 「Jason Mraz」 is converted, according to the English language model database LMD2, into the corresponding phoneme label string [E19 E13 E37 E01 E40].

In step S240, the at least one first-language phoneme label string and the at least one second-language phoneme label string are combined into a multi-lingual phoneme label string according to the text order of the multi-lingual text message.

In other words, the processor 160 arranges the phoneme label strings of the different language paragraphs according to the order of the original multi-lingual text message and combines the arranged phoneme label strings into the multi-lingual phoneme label string. In this example, the three phoneme label strings converted from the text message 「放個Jason Mraz來聽」, that is, [M04 M29 M09 M25], [E19 E13 E37 E01 E40], and [M08 M29 M41 M44], are combined, according to the order of the original multi-lingual text message, into the multi-lingual phoneme label string [M04 M29 M09 M25 E19 E13 E37 E01 E40 M08 M29 M41 M44].
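
The following sketch illustrates steps S220-S240 on this example, assuming hypothetical lookup tables that map each Chinese character or English word to its phoneme labels; the tables only cover this one sentence.

```python
# A sketch of steps S220-S240: convert each language paragraph to phoneme
# labels, then combine the label strings in the original text order.
ZH_PHONEMES = {"放": ["M04", "M29"], "個": ["M09", "M25"],
               "來": ["M08", "M29"], "聽": ["M41", "M44"]}
EN_PHONEMES = {"Jason": ["E19", "E13"], "Mraz": ["E37", "E01", "E40"]}

def to_phoneme_string(lang: str, paragraph: str) -> list[str]:
    labels: list[str] = []
    if lang == "ZH":
        for ch in paragraph:
            labels += ZH_PHONEMES[ch]
    else:
        for word in paragraph.split():
            labels += EN_PHONEMES[word]
    return labels

segments = [("ZH", "放個"), ("EN", "Jason Mraz"), ("ZH", "來聽")]
multilingual: list[str] = []
for lang, paragraph in segments:    # text order is preserved (step S240)
    multilingual += to_phoneme_string(lang, paragraph)
print(multilingual)
# ['M04', 'M29', 'M09', 'M25', 'E19', 'E13', 'E37', 'E01', 'E40',
#  'M08', 'M29', 'M41', 'M44']
```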

In step S250, the processor 160 generates cross-language connection tone data at the junction of every two adjacent phoneme label strings, wherein every two adjacent phoneme label strings include one first-language phoneme label string of the at least one first-language phoneme label string and one second-language phoneme label string of the at least one second-language phoneme label string. In one embodiment, the processor 160 searches the language model databases LMD1 and LMD2 to obtain the cross-language connection tone data of every two adjacent phoneme label strings. An embodiment is described below.

In step S260, the processor 160 merges the multi-lingual phoneme label string, the same-language connection tone data of the first language at the junction between every two adjacent phoneme labels in the at least one first-language phoneme label string, the same-language connection tone data of the second language at the junction between every two adjacent phoneme labels in the at least one second-language phoneme label string, and the cross-language connection tone data, to generate the multi-lingual voice audio. In step S270, the multi-lingual voice audio is output.

To achieve a better pronunciation effect, in one embodiment, step S240 of the text-to-speech method in FIG. 2 further includes steps S241-S244, as shown in FIG. 3.

As shown in FIG. 3, in step S241, the processor 160 divides the combined multi-lingual phoneme label string into a plurality of first pronunciation units. Each pronunciation unit belongs to a single language and contains consecutive phoneme labels of the corresponding one of the at least one first-language phoneme label string and the at least one second-language phoneme label string.

Next, step S242 is performed for each first pronunciation unit. In step S242, the processor 160 determines whether the number of candidates corresponding to the first pronunciation unit, in the one of the first language model database and the second language model database that corresponds to the first pronunciation unit, is greater than or equal to a predetermined number corresponding to that first pronunciation unit. When the number of candidates of every first pronunciation unit in its corresponding language model database is greater than or equal to the corresponding predetermined number, the processor 160 performs step S243 to calculate the join cost value of each candidate path, where each candidate path passes through one candidate of each first pronunciation unit. In step S244, the processor 160 determines the connection path between every two adjacent first pronunciation units according to the join cost value of each candidate path.

In one embodiment, in step S244, the processor 160 further determines the connection path between a candidate selected in the former of two adjacent first pronunciation units and a candidate selected in the latter of the two adjacent first pronunciation units, where both selected candidates lie on one of the candidate paths with the lowest join cost value.

However, after step S242, when the number of candidates of any one or more first pronunciation units in the corresponding one of the first language model database and the second language model database is smaller than the corresponding predetermined number, sub-steps S246 and S247 of an embodiment of the present invention, shown in FIG. 4, are performed (marked as A in FIG. 3).

In step S246 in FIG. 4, the processor 160 further divides the one or more first pronunciation units into a plurality of second pronunciation units, where the length of each second pronunciation unit is smaller than the length of the corresponding first pronunciation unit. In step S247, for each second pronunciation unit, the processor 160 further determines whether the number of candidates corresponding to the second pronunciation unit, in the one of the first language model database and the second language model database that corresponds to the pronunciation unit, is greater than or equal to a predetermined number corresponding to that second pronunciation unit.

In other words, in step S242, if the number of candidates of any one or more first pronunciation units (or second pronunciation units, and so on) in the corresponding one of the first language model database and the second language model database is determined to be smaller than the corresponding predetermined number, sub-steps S246 and S247 are repeated until the number of candidates is determined to be greater than or equal to the corresponding predetermined number; then, in step S243, the join cost value of each candidate path is calculated.

In one embodiment, the multi-lingual text message 「我們下個星期一起去Boston University參加畢業典禮」 ("we will go to Boston University together next week to attend the commencement") is divided into several first pronunciation units, for example 「我們」, 「下個星期」, 「一起」, 「去」, 「Boston University」, and 「參加畢業典禮」. The processor 160 determines whether the number of candidates of each of these first pronunciation units in the corresponding one of the first language model database and the second language model database is greater than or equal to a predetermined number corresponding to that first pronunciation unit.

In one embodiment, assume the predetermined number of candidates for the first pronunciation unit 「參加畢業典禮」 is ten. If there are only five candidates for the first pronunciation unit 「參加畢業典禮」 in the first language model database LMD1, the number of candidates in the first language model database LMD1 is smaller than the corresponding predetermined number. Then, the first pronunciation unit 「參加畢業典禮」 is divided into second pronunciation units shorter than the first pronunciation unit 「參加畢業典禮」, as shown in step S246 in FIG. 4.

In one embodiment, the predetermined number of each second pronunciation unit is the same as the predetermined number of the corresponding first pronunciation unit. In another embodiment, the predetermined number of each second pronunciation unit is set differently from the predetermined number of the corresponding first pronunciation unit. In this embodiment, the first pronunciation unit 「參加畢業典禮」 is divided into two second pronunciation units, 「參加」 and 「畢業典禮」; searching the first language model database LMD1 with the two terms 「參加」 and 「畢業典禮」 yields 280 and 56 candidates, respectively. For example, in this embodiment, the predetermined number of candidates for each of the second pronunciation units 「參加」 and 「畢業典禮」 is ten. This means that the number of candidates corresponding to each of the second pronunciation units 「參加」 and 「畢業典禮」 is greater than the corresponding predetermined number. Then, step S243 is performed. To obtain a better pronunciation effect, a first pronunciation unit is further divided into shorter second pronunciation units until a sufficient number of candidates can be found in the corresponding language database.
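
A sketch of this splitting loop (steps S242, S246, and S247) follows; count_candidates and split are hypothetical stand-ins for the database lookup and for the rule that divides a unit into shorter ones.

```python
# Recursively split a pronunciation unit until its language model database
# holds at least the predetermined number of candidates. Assumes split()
# always returns strictly shorter units, so the recursion terminates.
def select_units(unit: str, count_candidates, split,
                 minimum: int = 10) -> list[str]:
    if count_candidates(unit) >= minimum:
        return [unit]
    units: list[str] = []
    for sub in split(unit):   # e.g. "參加畢業典禮" -> ["參加", "畢業典禮"]
        units += select_units(sub, count_candidates, split, minimum)
    return units
```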

As shown in FIG. 5, in one embodiment, step S250, which generates cross-language connection tone data at the junction between every two adjacent phoneme label strings, further includes sub-steps. The connection relationships between the phoneme labels of pronunciation units of the same language are stored in the language model databases LMD1 and LMD2, respectively. Taking the multi-lingual phoneme label string [M04 M29 M09 M25 E19 E13 E37 E01 E40 M08 M29 M41 M44] of the text message 「放個Jason Mraz來聽」 as an example again, the same-language connection tone information [M04 M29] used for concatenation is stored in the Chinese model database LMD1 and denoted L[M04, M29]; the same-language connection tone information of [M29 M09] is denoted L[M29, M09], and so on. The same-language connection tone information of any two adjacent Chinese phoneme labels is stored in the language model database LMD1. In one embodiment, the same-language connection tone information [E19 E13] of adjacent phoneme labels is pre-stored in the English model database LMD2, and so on.

Since each of the language model databases LMD1 and LMD2 stores data of only one language, a traditional text-to-speech method cannot find the cross-language connection tone data of the multi-lingual phoneme label string [M04 M29 M09 M25 E19 E13 E37 E01 E40 M08 M29 M41 M44] that spans the two languages (for example, the cross-language connection tone data of [M25 E19] and of [E40 M08]).

The connection tone data between phoneme labels gives the speech fluency, consistency, and continuity. Therefore, in one embodiment, according to step S250, the processor 160 generates cross-language connection tone data at the junction of any two phoneme labels of two different languages, as described in detail below.

FIG. 5 is a flowchart of a method of generating cross-language connection tone data at the junction between the first language and the second language. In one embodiment, as shown in FIG. 5, step S250 further includes sub-steps S251-S252.

In step S251 of FIG. 5, the processor replaces the first phoneme label of the at least one second-language phoneme label string with the first-language phoneme label whose pronunciation approximates that first phoneme label.

In one embodiment, in the multi-lingual text message 「放個Jason Mraz來聽」, the first junction between the first language and the second language is between 「個」 and 「Jason」. In this embodiment, Chinese is the first language, English is the second language, and the Chinese character 「個」 (corresponding to the phoneme labels [M09 M25]) appears before the English word 「Jason」 (corresponding to the phoneme labels [E19 E13]). That is, in this embodiment, the first junction, between the last phoneme label of the first-language paragraph and the first phoneme label of the second-language paragraph, is between the phoneme labels [M25] and [E19].

According to step S251, the first phoneme label [E19] of the second-language (English, in this embodiment) paragraph is replaced with the phoneme label of the first language (Chinese, in this embodiment) that has an approximate pronunciation. In one embodiment, the English phoneme 「Ja」 (corresponding to the phoneme label [E19]) is replaced with the Chinese phoneme 「ㄐ」 (pronounced "Ji", corresponding to the phoneme label [M12]). In this embodiment, the phoneme label [E19] of the English phoneme 「Ja」 is replaced with the phoneme label [M12] of the Chinese phoneme 「ㄐ」.

Further, in the same example text (「放個Jason Mraz來聽」), the second cross-language junction is between 「Mraz」 (corresponding to the phoneme labels [E37 E01 E40]) and 「來」 (corresponding to the phoneme labels [M08 M29]). That is, the second junction is between the last phoneme label of the second-language paragraph and the first phoneme label of the first-language paragraph. In this embodiment, the second junction lies between the phoneme labels [E40] and [M08]. Then, the phoneme label [M08] of the Chinese phoneme 「來」 is replaced with the phoneme label [E21] of the English phoneme 「le」 (which approximates the phoneme label [M08] of the Chinese phoneme 「來」).

Next, in step S252, the processor 160 searches the first language model database LMD1 with the corresponding substituted first-language phoneme label to obtain, from the first language model database LMD1, the corresponding same-language connection tone information between the last phoneme label of the at least one first-language phoneme label string and that substituted phoneme label. This corresponding same-language connection tone information in the first language model database LMD1 is used as the cross-language connection tone data at the junction between the first-language phoneme label string of the at least one first-language phoneme label string and the second-language phoneme label string of the at least one second-language phoneme label string.

In the previous embodiment, for the first junction, according to the last phoneme label of the first language at the first junction and the substituted phoneme label, [M25 M12], the same-language connection tone information L[M25 M12] of the first junction is found in the first language model database LMD1. Then, the same-language connection tone information L[M25 M12] is taken as the cross-language connection tone data at the first junction. For the second junction, according to the last phoneme label of the second language at the second junction and the most approximate substituted phoneme label, [E40 E21], the same-language connection tone information L[E40 E21] can likewise be found in the second language model database LMD2. Then, the same-language connection tone information L[E40 E21] is taken as the cross-language connection tone data at the second junction.
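
A compact sketch of steps S251-S252 for a Chinese-to-English junction follows; the nearest-phoneme mapping table and the function names are assumptions for illustration.

```python
# Derive cross-language connection tone data by substituting the first
# phoneme of the English string with its most similar-sounding Mandarin
# phoneme, then reusing the same-language connection tone stored in LMD1.
NEAREST_MANDARIN = {"E19": "M12"}   # "Ja" sounds close to Mandarin "ㄐ"

def cross_language_tone(last_zh_label: str, first_en_label: str,
                        lmd1_tones: dict):
    substitute = NEAREST_MANDARIN[first_en_label]    # e.g. E19 -> M12
    return lmd1_tones[(last_zh_label, substitute)]   # e.g. L[M25 M12]
```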

An embodiment is presented below with FIG. 6A and FIG. 6B to illustrate how to determine whether the number of candidates of the target audio information is sufficient.

As shown in FIG. 6A, assuming the currently selected pronunciation unit is 「參加畢業典禮」, the pitch, tempo, or timbre corresponding to each character of 「參加畢業典禮」 is found in the first language model database LMD1. Pitch includes the frequency of the utterance; tempo includes the duration, speed, spacing, and rhythm of the utterance; timbre includes the quality of the sound, the mouth shape, the place of articulation, and so on. The examples of FIG. 6A and FIG. 6B schematically use pitch as the basis of comparison.

In this embodiment, the pitch and the tempo (for example, the duration) of a pronunciation unit can each be represented by a one-dimensional Gaussian model of its distribution curve. For example, the one-dimensional Gaussian model of the pitch is the statistical distribution of the pronunciation unit over frequencies (in units such as hertz, Hz), and the one-dimensional Gaussian model of the duration is the statistical distribution of the pronunciation unit over time lengths (in units such as milliseconds, ms).
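
As a minimal numeric sketch, a one-dimensional Gaussian for the pitch of a pronunciation unit can be fitted from observed frequencies; the sample values here are invented for illustration.

```python
import numpy as np

# Fit a 1-D Gaussian (mean and standard deviation) to observed pitch
# frequencies of one pronunciation unit, in Hz.
pitch_samples_hz = np.array([98.0, 101.5, 100.2, 99.1, 102.3])
mu, sigma = pitch_samples_hz.mean(), pitch_samples_hz.std()

def gaussian_pdf(x: float) -> float:
    """Probability density of the fitted pitch distribution at x Hz."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
```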

In this embodiment, the mouth shape representing the timbre can be modeled with several multi-dimensional Gaussian mixture models. In one embodiment, speaker adaptation can be used to build such multi-dimensional Gaussian mixture models to record the mouth shape representing the timbre. Using speaker adaptation builds a more reliable mouth-shape model for the input text message. The speaker adaptation technique is carried out as follows: a universal model of all phonemes in a language is first built from a large corpus of that language spoken by many different speakers; after the universal model of all phonemes of the language is built, mouth-shape parameters are extracted from the speech segments of that language taken from the originally recorded mixed-language audio files; then the existing universal model of each phoneme is moved toward the extracted mouth-shape parameter samples, and the moved model is the adapted model. The detailed steps and principles of speaker adaptation are explained in Reynolds, Douglas A., "Speaker Verification Using Adapted Gaussian Mixture Models", Digital Signal Processing, 2000. In practice, speaker adaptation is only one way of building the mouth-shape model in this disclosure, and the present disclosure is not limited thereto.

In this example, the reference average frequencies Pavg1 of the pitches of the individual characters of 「參加畢業典禮」 are first found in the language model database LMD1. In this example, the average frequencies of the six Chinese characters of 「參加畢業典禮」 are, in order, 100 Hz, 140 Hz, 305 Hz, 203 Hz, 150 Hz, and 143 Hz. This set of reference average frequencies Pavg1 serves as the target audio information, that is, the criterion for the subsequent selection.

Then all 168 sets of pitch frequency data PAU matching the pronunciation unit 「參加畢業典禮」 are found in the language model database LMD1, shown as PAU1~PAU168 in FIG. 6A. In one embodiment, the allowed difference between the pitch frequency data and the target audio (that is, the reference average frequencies Pavg1) is set within a predetermined range, such as within 20% of the reference average frequencies Pavg1. In this embodiment, the predetermined ranges for the target audio information of the six Chinese characters are 100 Hz ± 20%, 140 Hz ± 20%, 305 Hz ± 20%, 203 Hz ± 20%, 150 Hz ± 20%, and 143 Hz ± 20%. A set in which the audio information of all six characters lies within the predetermined ranges is a candidate (PCAND). For example, in the first set of pitch frequency data PAU1, the frequencies of the six Chinese characters are, in order, 175 Hz, 179 Hz, 275 Hz, 300 Hz, 120 Hz, and 150 Hz, which fall outside the predetermined range of 20% of the reference average frequencies Pavg1. In this example, only two of the 168 sets, the pitch frequency data PAU63 and PAU103, are candidate frequency data PCAND whose differences lie within the predetermined range. However, assuming the predetermined number for the first pronunciation unit is 10, the number of candidates (namely 2 sets, PAU63 and PAU103) is smaller than the predetermined number (namely 10). Therefore, the first pronunciation unit needs to be divided into several second pronunciation units shorter than the first pronunciation unit to obtain more candidates.
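
A sketch of this candidate filter follows, using the ±20% rule described above; the second data set is an invented in-range example.

```python
# Keep only the pitch frequency data sets whose every per-character
# frequency lies within 20% of the corresponding reference average Pavg1.
def within_range(candidate, reference, tolerance=0.20):
    return all(abs(c - r) <= tolerance * r
               for c, r in zip(candidate, reference))

pavg1 = [100, 140, 305, 203, 150, 143]   # reference averages (Hz)
all_paus = [
    [175, 179, 275, 300, 120, 150],      # PAU1: outside the ±20% ranges
    [98, 150, 290, 210, 155, 140],       # an invented in-range data set
]
candidates = [pau for pau in all_paus if within_range(pau, pavg1)]
print(candidates)   # only the second set survives
```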

Next, the first pronunciation unit 「參加畢業典禮」 is divided into several second pronunciation units, 「參加」 and 「畢業典禮」. One of the second pronunciation units, 「畢業典禮」, is taken as an example for further explanation. In FIG. 6B, in an embodiment, the average pitch frequencies Pavg2 of the second pronunciation unit 「畢業典禮」 are found in the first language model database LMD1. In one embodiment, the average frequencies of the second pronunciation unit 「畢業典禮」 are, in order, 305 Hz, 203 Hz, 150 Hz, and 143 Hz. This set of reference average frequencies Pavg2 serves as the target audio information, that is, the criterion for the subsequent selection.

Then all pitch frequency data PAU matching the second pronunciation unit 「畢業典禮」 are found in the first language model database LMD1, comprising 820 sets of pitch frequency data PAU1~PAU820. In one embodiment, in the first set of pitch frequency data PAU1, the frequencies of the four Chinese characters are, in order, 275 Hz, 300 Hz, 120 Hz, and 150 Hz. Next, the sets whose difference from the target audio (that is, the reference average frequencies Pavg2) lies within the predetermined range (that is, within 20% of the reference average frequencies Pavg2) are selected from the pitch frequency data sets PAU1~PAU820. In this example, there are 340 sets of candidate frequency data PCAND whose pitch frequency data differences lie within the predetermined range. The number of candidates of the target audio information is now sufficient, so the length of the second pronunciation unit is suitable. Therefore, the second pronunciation unit does not need to be divided into shorter pronunciation units. The predetermined range is not limited to 20% and can be adjusted to other reasonable ranges around the reference average frequencies.

The embodiments of FIG. 6A and FIG. 6B above only schematically illustrate selecting candidate audio information by pitch frequency data. In another embodiment, the candidate audio information is selected according to a weighted combination of pitch, tempo, and timbre.

For example, the target audio information AUavg is expressed as: AUavg = αPavg + βTavg + γFavg

Here, Pavg is the average frequency of the pitch, Tavg is the average duration of the tempo, and Favg is the average mouth shape of the timbre. In one embodiment, the mouth shape can be represented by a multi-dimensional matrix. In one embodiment, the mouth shape can be represented by mel-frequency cepstral coefficients (MFCC); this is well known to those skilled in the art, is not the main subject of this disclosure, and is not described further here. α, β, and γ are the respective weights of Pavg, Tavg, and Favg; α, β, and γ are all greater than 0 and sum to 1. In one embodiment, candidate audio information is selected by comparing the target audio information AUavg with the weighted pitch, tempo, and timbre of the individual audio data in the language model database LMD1.
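
One plausible reading of this weighted selection is a weighted distance between a candidate's audio features and the target AUavg, sketched below. Treating pitch, tempo, and timbre as simple numeric features is a simplification (in practice the timbre term is a multi-dimensional MFCC matrix), and all names and values are assumptions.

```python
import numpy as np

alpha, beta, gamma = 0.5, 0.3, 0.2   # weights: all > 0, summing to 1

def weighted_distance(candidate: dict, target: dict) -> float:
    dp = abs(candidate["pitch"] - target["pitch"])   # pitch difference (Hz)
    dt = abs(candidate["tempo"] - target["tempo"])   # tempo difference (s)
    df = np.linalg.norm(np.array(candidate["mfcc"]) -
                        np.array(target["mfcc"]))    # timbre difference
    return alpha * dp + beta * dt + gamma * df

target = {"pitch": 140.0, "tempo": 0.32, "mfcc": [12.1, -3.4, 5.0]}
cand = {"pitch": 150.0, "tempo": 0.30, "mfcc": [11.8, -3.0, 4.6]}
print(weighted_distance(cand, target))   # lower is a better match
```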

FIG. 7 is a schematic diagram of an operation example of determining the connection paths of the pronunciation units according to an embodiment of the present invention.

As shown in FIG. 7, in one embodiment, the text message is finally divided into a pronunciation unit PU1 (for example, a Chinese word), a pronunciation unit PU2 (for example, an English phrase), and a pronunciation unit PU3 (for example, an English phrase). In this embodiment, when the language model databases LMD1~LMD2 are searched, four different pieces of candidate audio information AU1a~AU1d are found for the pronunciation unit PU1, two different pieces of candidate audio information AU2a~AU2b are found for the pronunciation unit PU2, and three different pieces of candidate audio information AU3a~AU3c are found for the pronunciation unit PU3.

Moreover, the connection paths L1 between the candidate audio information AU1a~AU1d and the candidate audio information AU2a~AU2b, and the connection paths L2 between the candidate audio information AU2a~AU2b and the candidate audio information AU3a~AU3c, are obtained from the language model databases LMD1~LMD2.

Each candidate path carries a fluency cost, and each connection path also carries a fluency cost. Step S254 finds, among the combinations of the connection paths L1 and the connection paths L2, the connection path with the lowest fluency cost, so that the total fluency cost of the three pronunciation units PU1~PU3 and the two connection paths L1 and L2 is minimized; in this way, the pronunciation along the selected connection path has the highest overall fluency.

The formula for the minimum of the total fluency cost is as follows:

Cost = min Σᵢ [ α·C_Target(AUᵢ) + β·C_Spectrum(AUᵢ₋₁, AUᵢ) + γ·C_Pitch(AUᵢ₋₁, AUᵢ) + δ·C_Duration(AUᵢ₋₁, AUᵢ) + ε·C_Intensity(AUᵢ₋₁, AUᵢ) ]

where AUᵢ ranges over the candidate audio information of each pronunciation unit and AUᵢ₋₁ over the candidate audio information of the adjacent preceding pronunciation unit. The total fluency cost equals the weighted sum of the target cost value C_Target of each candidate audio information of each pronunciation unit itself, the spectrum cost value C_Spectrum of connecting the candidate audio information of two adjacent pronunciation units, the pitch cost value C_Pitch of connecting the candidate audio information of two adjacent pronunciation units, the tempo cost value C_Duration of connecting the candidate audio information of two adjacent pronunciation units, and the intensity cost value C_Intensity of connecting the candidate audio information of two adjacent pronunciation units; in the formula, α, β, γ, δ, and ε represent the respective weights of the target, spectrum, pitch, tempo, and intensity cost values. The fluency costs obtained from the weighted sums of the different path combinations along L1 and L2 are compared, and the path with the lowest weighted sum serves as the seed audio for the final synthesis.

Through the above weighted calculation, the total fluency cost of every path can be obtained, and the path with the lowest overall fluency cost can be found. In one embodiment, the path from candidate audio information AU1c to AU2b to AU3a has the lowest total fluency cost, so the candidate audio information AU1c, AU2b, and AU3a on this path are selected as the seed audio for final synthesis in the text-to-speech method.
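
The search over path combinations can be organized as a Viterbi-style dynamic program; the sketch below is one way to do it, with target_cost and join_cost standing in for the weighted cost terms of the formula above (the demo cost functions are arbitrary placeholders).

```python
# Find the candidate sequence with the lowest accumulated fluency cost.
def best_path(units, target_cost, join_cost):
    # best maps each candidate of the current unit to (accumulated cost, path).
    best = {c: (target_cost(c), [c]) for c in units[0]}
    for unit in units[1:]:
        new_best = {}
        for cur in unit:
            prev, (cost, path) = min(
                best.items(),
                key=lambda item: item[1][0] + join_cost(item[0], cur))
            new_best[cur] = (cost + join_cost(prev, cur) + target_cost(cur),
                             path + [cur])
        best = new_best
    return min(best.values(), key=lambda v: v[0])

units = [["AU1a", "AU1b", "AU1c", "AU1d"], ["AU2a", "AU2b"],
         ["AU3a", "AU3b", "AU3c"]]
cost, path = best_path(units, target_cost=lambda c: 0.1,
                       join_cost=lambda a, b: 1.0)   # placeholder costs
print(cost, path)
```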

Next, step S260 in FIG. 2 is performed: the processor 160 concatenates the audio information of each pronunciation unit (for example, the audio information AU1c, AU2b, and AU3a) to generate the multi-lingual voice audio. The multi-lingual voice audio can be played through the broadcasting module 140, as shown in step S270 in FIG. 2, achieving the sound output of the text-to-speech method 200 of the present disclosure. In this embodiment, the broadcasting module 140 can be, but is not limited to, a speaker and/or a telephone receiver.

In addition, in the above embodiments, the language model databases LMD1~LMD2 are pre-built through a training program. In one embodiment, besides the above pronunciation procedure, the text-to-speech method 200 proposed in the present disclosure also includes a training program for building/training the appropriate language model databases LMD1~LMD2.

如第1圖所示，多語言語音合成裝置100更包含一收音模組180。在本實施例中，收音模組180可以內建於文字轉語音系統100中，或者獨立外設於多語言語音合成裝置100。在一個實施例中，收音模組180可為，但不限於，麥克風或錄音單元。 As shown in FIG. 1, the multi-lingual speech synthesis device 100 further includes a sound collection module 180. In this embodiment, the sound collection module 180 can be built into the text-to-speech system 100 or externally connected to the multi-lingual speech synthesis device 100 as an independent peripheral. In one embodiment, the sound collection module 180 can be, but is not limited to, a microphone or a recording unit.

在一個實施例中,收音模組180用以取樣至少一訓練語音以進行語言模型資料庫LMD1~LMD2的訓練程序。將訓練產生之語言模型資料庫LMD1~LMD2提供給文字轉語音系統100使用。 In one embodiment, the radio module 180 is configured to sample at least one training speech for the training program of the language model database LMD1~LMD2. The language model database LMD1~LMD2 generated by the training is provided to the text-to-speech system 100 for use.

第8圖繪示根據本揭示文件之一實施例文字轉語音方法200有關訓練程序的訓練方法的流程圖。參閱第8和9A-9C圖，在第8圖中的文字轉語音方法200的訓練程序中，首先執行步驟S310，利用收音模組180接收至少一個單一種語言的訓練語音。第9A圖至第9C圖繪示混合語言的訓練語音ML、取樣錄音SAM以及分析到的混合語言的語調(pitch)、節律(tempo)與音色(timbre)資訊的示意圖。語調(pitch)包含，但不限於，發聲的頻率高低；節律包含，但不限於，發聲的音長(duration)、速度、間隔與韻律；音色包含，但不限於，發聲的品質、口形(如MFCC)以及發聲部位等。 FIG. 8 is a flow chart of the training method of the training program of the text-to-speech method 200 according to an embodiment of the present disclosure. Referring to FIG. 8 and FIGS. 9A-9C, in the training program of the text-to-speech method 200 in FIG. 8, step S310 is first performed, in which the sound collection module 180 receives training speech of at least one single language. FIGS. 9A to 9C are diagrams showing the mixed-language training speech ML, the sampled recording SAM, and the analyzed pitch, tempo, and timbre information of the mixed languages. Pitch includes, but is not limited to, the fundamental frequency of the utterance; tempo includes, but is not limited to, the duration, speed, interval, and rhythm of the utterance; timbre includes, but is not limited to, the quality of the utterance, the mouth shape (e.g., MFCC), and the articulation position.

於一實施例中，如第9A圖所示，這一段訓練語音ML的取樣錄音SAM取自以中文為母語人士，並且這位以中文為母語人士能順暢地使用中英文兩種語言。這樣便能取得中英文混雜的語音，並且中英文之間的銜接是順暢的。同理，若以英文為母語人士進行錄音，亦需能順暢地使用中英文兩種語言。 In one embodiment, as shown in FIG. 9A, the sampled recording SAM of the training speech ML is taken from a native Chinese speaker who can use both Chinese and English fluently. In this way, speech mixing Chinese and English can be obtained, and the transitions between Chinese and English are smooth. Similarly, if the recording is made by a native English speaker, that speaker must also be able to use both Chinese and English fluently.

於另一實施例中，該訓練語音包含純中文的第一取樣錄音以及純英文的第二取樣錄音，分別由中文為母語人士以及英文為母語人士錄製。接著，執行步驟S320，分析訓練語音的樣本中的兩種不同種語言各自的語調(pitch)、節律(tempo)或音色(timbre)。如第9B圖所示，首先將第9A圖中混合語言的訓練語音ML分為第一語言LAN1的取樣錄音SAM1與第二語言LAN2的取樣錄音SAM2。接著，如第9C圖所示，對第一語言L1的取樣錄音SAM1與第二語言L2的取樣錄音SAM2分別分析語調(pitch)、節律(tempo)或音色(timbre)，得到如頻率、音長、口形等聲頻資訊。之後，得到取樣錄音SAM1的語調P1、節律T1與音色F1及取樣錄音SAM2的語調P2、節律T2與音色F2。 In another embodiment, the training speech includes a first sampled recording in pure Chinese and a second sampled recording in pure English, recorded by a native Chinese speaker and a native English speaker, respectively. Next, step S320 is performed to analyze the pitch, tempo, or timbre of each of the two different languages in the samples of the training speech. As shown in FIG. 9B, the mixed-language training speech ML in FIG. 9A is first divided into the sampled recording SAM1 of the first language LAN1 and the sampled recording SAM2 of the second language LAN2. Next, as shown in FIG. 9C, the sampled recording SAM1 of the first language L1 and the sampled recording SAM2 of the second language L2 are analyzed for pitch, tempo, or timbre, respectively, to obtain audio information such as frequency, duration, and mouth shape. Thereafter, the pitch P1, tempo T1, and timbre F1 of the sampled recording SAM1, and the pitch P2, tempo T2, and timbre F2 of the sampled recording SAM2 are obtained.
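
A minimal sketch of the per-language analysis in step S320, assuming the `librosa` library is used for feature extraction (the library choice, sample rate, and parameter values are illustrative assumptions, not part of the patent):

```python
import librosa

def analyze_recording(wav_path):
    """Extract per-frame pitch (F0), rough per-segment durations, and per-frame
    MFCC 'mouth shape' features from one single-language recording (SAM1/SAM2)."""
    y, sr = librosa.load(wav_path, sr=16000)
    # pitch: fundamental frequency per frame, in Hz (unvoiced frames dropped)
    f0, voiced, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C7"), sr=sr)
    pitch = f0[voiced]
    # tempo proxy: durations (ms) of energy-based speech segments
    intervals = librosa.effects.split(y, top_db=30)  # (start, end) in samples
    durations_ms = (intervals[:, 1] - intervals[:, 0]) / sr * 1000.0
    # timbre: 13-dimensional MFCC vector per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T
    return pitch, durations_ms, mfcc
```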

其中,語調P1與語調P2分別為取樣錄音SAM1與取樣錄音SAM2中所有發音單元的頻率分佈,橫軸為不同的頻段(其單位為赫茲Hz),縱軸為取樣點統計個數。節律T1與節律T2分別為取樣錄音SAM1與取樣錄音SAM2中所有發音單元的音長分佈,橫軸為不同的時間長度(其單位為毫秒ms),縱軸為取樣點統計個數。單個取樣點為取樣錄音SAM1或取樣錄音SAM2中每一個音素的單一個音訊框(frame)。 Among them, the tone P1 and the tone P2 are the frequency distributions of all the pronunciation units in the sample recording SAM1 and the sample recording SAM2, the horizontal axis is a different frequency band (the unit is Hertz Hz), and the vertical axis is the number of sampling points. The rhythm T1 and the rhythm T2 are the sound length distributions of all the pronunciation units in the sample recording SAM1 and the sample recording SAM2, respectively, the horizontal axis is a different time length (the unit is millisecond ms), and the vertical axis is the number of sampling points. A single sampling point is a single audio frame of each of the sampled recording SAM1 or sampled recording SAM2.
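
Under the same assumptions, the distributions P1/P2 and T1/T2 amount to histograms over per-frame pitch values and per-unit durations; for example (the bin widths are assumed values):

```python
import numpy as np

# pitch (Hz per voiced frame) and durations_ms from the analysis sketch above
pitch_counts, pitch_edges = np.histogram(pitch, bins=np.arange(50, 501, 10))
dur_counts, dur_edges = np.histogram(durations_ms, bins=np.arange(0, 1001, 20))
```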

於此實施例中,音色F1與音色F2分別為取樣錄音SAM1與取樣錄音SAM2中所有發音單元的口形,如第9C圖所示,分別採用多個多維度的高斯混和模型表示。 In this embodiment, the timbre F1 and the timbre F2 are the mouth shapes of all the utterance units in the sampled recording SAM1 and the sampled recording SAM2, respectively, as shown in FIG. 9C, and are respectively represented by a plurality of multi-dimensional Gaussian mixture models.
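
A sketch of this multi-dimensional Gaussian mixture representation of timbre, assuming scikit-learn and the per-frame MFCC matrix from the analysis sketch above (the component count and covariance type are assumptions):

```python
from sklearn.mixture import GaussianMixture

# mfcc: (n_frames, 13) MFCC matrix for one language's sampled recording
gmm_timbre = GaussianMixture(n_components=8, covariance_type="diag",
                             random_state=0).fit(mfcc)
# gmm_timbre.means_ and gmm_timbre.covariances_ summarize the 'mouth shape'
# statistics stored as timbre F1 (or F2) in the language model database
```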

不同語言L1與L2的取樣錄音SAM1與取樣錄音SAM2各自得到的語調P1、節律T1與音色F1以及語調P2、節律T2與音色F2將分別儲存到相對應的語言模型資料庫LMD1~LMD2。 The pitch P1, tempo T1, and timbre F1 obtained from the sampled recording SAM1 of language L1, and the pitch P2, tempo T2, and timbre F2 obtained from the sampled recording SAM2 of language L2, are stored into the corresponding language model databases LMD1 and LMD2, respectively.

接著，執行步驟S330，將語調、節律和音色落在預定範圍內的訓練語音進行儲存。將訓練語音的語調、節律或音色與一基準範圍作比較。在一個實例中，基準範圍可以是過去錄音所得聲音的中位範圍（middle range），例如落在語調、節律或音色之平均值上下兩個標準差以內即屬於基準範圍。該步驟包含將訓練語音的樣本中語調、節律或音色落在該基準範圍外的部份排除。如此一來，便可以將極端的語調、節律或音色排除，或是將取樣錄音中明顯不一致（例如中文為母語人士以及英文為母語人士兩者語調差異過大的部份）的內容加以濾除，藉以確保資料庫中兩種語言之間語調、節律或音色的一致性。 Next, step S330 is performed to store the training speech whose pitch, tempo, and timbre all fall within a predetermined range. The pitch, tempo, or timbre of the training speech is compared with a reference range. In one example, the reference range may be the middle range of previously recorded speech, for example, within two standard deviations above or below the mean of the pitch, tempo, or timbre. This step includes excluding the portions of the training speech samples whose pitch, tempo, or timbre falls outside the reference range. In this way, extreme pitch, tempo, or timbre can be excluded, and clearly inconsistent content in the sampled recordings (for example, portions where the pitch of the native Chinese speaker and that of the native English speaker differ too much) can be filtered out, thereby ensuring the consistency of pitch, tempo, or timbre between the two languages in the databases.

也就是說，若是新錄得的訓練語音的語調、節律與音色其中一者明顯偏離過去錄音所得資料的統計分佈模型的中間值時（例如語調、節律與音色落在原統計分佈模型的兩倍標準差之外，或是落在一定累積分佈如10%~90%的範圍之外），便將新錄得的訓練語音濾除，以避免變異過大的語調、節律與音色（例如發聲者突然採用較尖銳或較激昂的發音方式）影響語言模型資料庫中候選聲頻資訊的一致性。將訓練語音依照不同種語言分別儲存入語言模型資料庫LMD1或LMD2中。 That is, if any one of the pitch, tempo, or timbre of newly recorded training speech deviates significantly from the central value of the statistical distribution model of previously recorded data (for example, falling outside two standard deviations of the original statistical distribution model, or outside a cumulative-distribution band such as 10% to 90%), the newly recorded training speech is filtered out, so that overly variable pitch, tempo, or timbre (for example, a speaker suddenly adopting a sharper or more excited style of pronunciation) does not affect the consistency of the candidate audio information in the language model databases. The training speech is stored into the language model database LMD1 or LMD2 according to its language.
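
A sketch of this screening rule under the two-standard-deviation criterion, with the 10%~90% cumulative-distribution band noted as the alternative; the statistics chosen, the thresholds, and the `db` store are illustrative assumptions:

```python
import numpy as np

def within_reference_range(new_values, history, n_std=2.0):
    """True if the new recording's mean statistic lies within n_std standard
    deviations of the distribution built from previously stored recordings.
    (A 10%-90% band, np.percentile(history, [10, 90]), could be used instead.)"""
    mu, sigma = np.mean(history), np.std(history)
    return abs(np.mean(new_values) - mu) <= n_std * sigma

def maybe_store_training_speech(pitch, durations_ms, mfcc, db):
    """Store the utterance only when its pitch and tempo statistics both fall
    inside the reference ranges; db is a hypothetical per-language store."""
    if (within_reference_range(pitch, db["pitch_history"]) and
            within_reference_range(durations_ms, db["tempo_history"])):
        db["utterances"].append((pitch, durations_ms, mfcc))
        db["pitch_history"] = np.concatenate([db["pitch_history"], pitch])
        db["tempo_history"] = np.concatenate([db["tempo_history"], durations_ms])
        return True
    return False
```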

如以上實施例所描述的，多語言文字訊息被轉換成多語言語音訊息，從而改善了語音的流暢性、一致性與連續性。 As described in the above embodiments, multi-language text messages are converted into multi-language voice messages, thereby improving the fluency, consistency, and continuity of the speech.

雖然本案已以實施方式揭露如上，然其並非用以限定本案，任何本領域具通常知識者，在不脫離本案之精神和範圍內，當可作各種之更動與潤飾，因此本案之保護範圍當視後附之申請專利範圍所界定者為準。 Although the present disclosure has been described in the embodiments above, they are not intended to limit the present disclosure. Anyone with ordinary knowledge in the art may make various changes and modifications without departing from the spirit and scope of the present disclosure; therefore, the scope of protection of the present disclosure shall be as defined by the appended claims.

200‧‧‧文字轉語音方法 200‧‧‧Text-to-speech method

S210~S270‧‧‧步驟 S210~S270‧‧‧Steps

Claims (16)

一種文字轉語音方法,藉由一處理器來執行,用以將具有一第一語言及一第二語言的多語言文字訊息轉換為一多語言語音音訊,並且搭配一具有多數個第一語言音位標籤和第一語言的同語言連接音調資訊的第一語言模型資料庫,以及一具有多數個第二語言音位標籤和第二語言的同語言連接音調資訊的第二語言模型資料庫,該文字轉語音方法包含:將該多語言文字訊息區分為至少一個第一語言段落及至少一個第二語言段落;轉換該第一語言段落為至少一個第一語言音位標籤及轉換至少一個第二語言段落為至少一個第二語言音位標籤;利用該至少一個第一語言音位標籤查找該第一語言模型資料庫以獲得至少一個第一語言音位標籤串列,以及利用該至少一個第二語言音位標籤查找該第二語言模型資料庫以獲得至少一個第二語言音位標籤串列;依據該多語言文字訊息的文字順序,將該至少一個第一語言音位標籤串列與該至少一個第二語言音位標籤串列組合為一多語言音位標籤串列;在每二個相鄰的音位標籤串列的交界處產生一跨語言連接音調資料,其中該每二個相鄰的音位標籤串列包含該至少一個第一語言音位標籤串列中的一個第一語言音位標籤串列及該至少一個第二語言音位標籤串列中的一個第二語言音位標籤串列; 合併該多語言音位標籤串列、在該至少一個第一語言音位標籤串列中的每兩個相鄰的音位標籤之間的一交界處的該第一語言的同語言連接音調資料、在該至少一個第二語言音位標籤串列中的每兩個相鄰的音位標籤之間的一交界處的該第二語言的同語言連接音調資料以及該跨語言連接音調資料以產生該多語言語音音訊;以及輸出該多語言語音音訊。 A text-to-speech method is implemented by a processor for converting a multi-language text message having a first language and a second language into a multi-language voice message, and matching a plurality of first language tones a first language model database in which the bit label and the first language are connected to the same language, and a second language model database having a plurality of second language phoneme labels and a second language connected tone information. The text-to-speech method includes: dividing the multi-language text message into at least one first language paragraph and at least one second language paragraph; converting the first language paragraph into at least one first language phoneme label and converting at least one second language The paragraph is at least one second language phoneme tag; the first language model database is searched using the at least one first language phoneme tag to obtain at least one first language phoneme tag string, and the at least one second language is utilized The phoneme tag looks up the second language model database to obtain at least one second language phoneme tag string; according to the multilingual a textual order of the text message, combining the at least one first language phoneme tag string and the at least one second language phoneme tag string into a multi-lingual phoneme tag string; in each two adjacent phonemes A cross-language connection tone data is generated at a boundary of the tag string, wherein each of the two adjacent phoneme tag strings includes a first language phoneme tag string in the at least one first language phoneme tag string And a second language phoneme tag string in the at least one second language phoneme tag string; Merging the multilingual phoneme tag string, the first language consonant connection tone data at a boundary between each two adjacent phoneme tags in the at least one first language phoneme tag string And a language-connected tone material of the second language at a boundary between each two adjacent phoneme tags in the at least one second language phoneme tag string and the cross-language connection tone data to generate The multi-lingual voice audio; and outputting the multi-language voice audio. 
如請求項1所述之文字轉語音方法,其中當該至少一個第一語言音位標籤串列中的該第一語言音位標籤串列在該至少一個第二語言音位標籤串列中的該第二語言音位標籤串列的前面,產生該跨語言連接音調資料的步驟包含:將該至少一個第二語言音位標籤串列的首位音位標籤替代為與該至少一個第二語言音位標籤串列的該首位音位標籤具有近似發音的相應的該第一語言音位標籤中的一音位標籤;以及利用相應的該第一語言音位標籤中的該音位標籤查找該第一語言模型資料庫以獲得在該至少一個第一語言音位標籤串列的一末位音位標籤和相應的該第一語言音位標籤中的該音位標籤之間的該第一語言模型資料庫中一相應的同語言連接音調資訊,其中該第一語言模型資料庫中該相應的同語言連接音調資訊用作該至少一個第一語言音位標籤串列中的該第一語言音位標籤串列和該至少一個第二語言音位標籤串列中的該第二語言音位標籤串列之間的交界 處的該跨語言連接音調資料。 The text-to-speech method of claim 1, wherein the first language phoneme tag string in the at least one first language phoneme tag string is listed in the at least one second language phoneme tag string The step of generating the cross-language connection tone data in front of the second language phoneme tag string includes: replacing the first phoneme tag of the at least one second language phoneme tag string with the at least one second language tone The first phoneme tag of the bit tag string has a corresponding phoneme tag of the corresponding first language phoneme tag; and the first phoneme tag of the first language phoneme tag is used to find the a language model database for obtaining the first language model between a last phoneme label of the at least one first language phoneme tag string and the corresponding phoneme tag of the corresponding first language phoneme tag Corresponding linguistic connection tone information in the database, wherein the corresponding linguistic connection tone information in the first language model database is used as the first language in the at least one first language phoneme tag string a boundary between the phoneme tag string and the second language phoneme tag string in the at least one second language phoneme tag string This cross-language connection tone material. 如請求項1所述之文字轉語音方法,其中該第一語言模型資料庫和該第二語言模型資料庫各自還包含由連續的音位標籤形成的片語(phrase)、詞(word)、字(character)、音節(syllable)或音位(phonemes)中的一個或其組合的聲頻資訊,並且該些連續的音位標籤所形成的片語、詞、字、音節或音位中的一個或其組合是單個發音單元。 The text-to-speech method of claim 1, wherein the first language model database and the second language model database each further comprise a phrase, a word formed by consecutive phoneme labels, Audio information of one or a combination of characters, syllables, or phonemes, and one of a phrase, word, word, syllable, or phoneme formed by the consecutive phoneme tags Or a combination thereof is a single pronunciation unit. 如請求項3所述之文字轉語音方法,在依據該多語言文字訊息的文字順序,將該至少一個第一語言音位標籤串列與該至少一個第二語言音位標籤串列組合為一多語言音位標籤串列的步驟之後,該文字轉語音方法還包含以下步驟:將該多語言音位標籤串列分成多數個第一發音單元,每一個第一發音單元同屬單一種語言,並且每一個第一發音單元包含該至少一個第一語言音位標籤串列和該至少一個第二語言音位標籤串列中相應的一個語言音位標籤串列的連續的音位標籤;對於每一個第一發音單元,確定在該第一語言模型資料庫和該第二語言模型資料庫中一個相應的語言模型資料庫中,其中一個第一發音單元所對應的候選數目是否大於或等於該第一發音單元所對應的一預定數目;當該第一語言模型資料庫和該第二語言模型資料庫其 中一個相應的語言模型資料庫中每一個該第一發音單元的候選數目均大於等於相應的該預定數目,計算每一個該候選路徑的加入代價值,其中每一個該候選路徑經過每一個該第一發音單元的其中一個候選;以及依據每一個候選路徑的加入代價值確定每兩個相鄰的該第一發音單元之間的連接路徑。 The text-to-speech method according to claim 3, wherein the at least one first language phoneme tag string and the at least one second language phoneme tag string are combined into one according to a text order of the multi-language text message. 
After the step of serializing the multi-lingual phoneme label, the text-to-speech method further comprises the steps of: dividing the multi-language phoneme tag string into a plurality of first pronunciation units, each of the first pronunciation units being in a single language, And each of the first pronunciation units includes a continuous phoneme label of the at least one first language phoneme tag string and the corresponding one of the at least one second language phoneme tag string; a first pronunciation unit determining whether a candidate number corresponding to one of the first pronunciation units is greater than or equal to the first language model database and a corresponding language model database in the second language model database a predetermined number corresponding to a pronunciation unit; when the first language model database and the second language model database a candidate number of each of the first pronunciation units in a corresponding language model database is greater than or equal to the corresponding predetermined number, and a value of each of the candidate paths is calculated, wherein each of the candidate paths passes each of the One of the candidates of a pronunciation unit; and determining a connection path between each two adjacent first sounding units according to the joining value of each candidate path. 如請求項4所述之文字轉語音方法,其中,確定每兩個相鄰的該第一發音單元之間的該連接路徑的步驟包括:確定在相鄰兩個第一發音單元中前一個第一發音單元中所選的一個候選和相鄰兩個第一發音單元中後一個第一發音單元中所選的一個候選之間的該連接路徑;其中相鄰兩個第一發音單元中前一個第一發音單元中所選的該候選和相鄰兩個第一發音單元中後一個第一發音單元中所選的該候選都位於其中一個最低加入代價值的該候選路徑。 The text-to-speech method of claim 4, wherein the determining the connection path between each two adjacent first sounding units comprises: determining a previous one of the two adjacent first sounding units a connection path between a candidate selected in a pronunciation unit and a selected one of the next first pronunciation units of the adjacent two first pronunciation units; wherein the previous one of the two adjacent first pronunciation units The candidate selected in the first pronunciation unit and the candidate selected in the latter first pronunciation unit of the two adjacent first pronunciation units are located in one of the candidate paths with the lowest added value. 如請求項4所述之文字轉語音方法,當該第一語言模型資料庫和該第二語言模型資料庫其中一個相應的語言模型資料庫中任何一個或多個第一發音單元的該候選數目小於相應的該預定數目時,該文字轉語音方法還包含:將該一個或多個第一發音單元中的每一個該發音單元分成多數個第二發音單元,其中任一個該第二發音單元的 長度小於相應的該第一發音單元的長度;對於每個該第二發音單元,確定該第一語言模型資料庫和該第二語言模型資料庫其中一個相應的該語言模型資料庫中,其中一個該第二發音單元所對應的該候選數目是否大於等於該第二發音單元所對應的一預定數目。 The text-to-speech method as claimed in claim 4, wherein the candidate number of any one or more first pronunciation units in the first language model database and the corresponding language model database of the second language model database When the predetermined number is less than the corresponding number, the text-to-speech method further comprises: dividing each of the one or more first sounding units into a plurality of second sounding units, wherein any one of the second sounding units The length is smaller than the length of the corresponding first pronunciation unit; for each of the second pronunciation units, one of the first language model database and the second language model database is determined, one of the corresponding language model databases Whether the number of candidates corresponding to the second pronunciation unit is greater than or equal to a predetermined number corresponding to the second pronunciation unit. 
如請求項4所述之文字轉語音方法,其中每個候選路徑的加入代價值是每個第一發音單元中的每個候選聲頻資訊的目標代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的聲譜代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的音調代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的節律代價值、以及連接每兩個相鄰第一發音單元中的候選聲頻資訊的強度代價值的加權和。 The text-to-speech method of claim 4, wherein the added value of each candidate path is a target value of each candidate audio information in each first pronunciation unit, and each two adjacent first pronunciation units are connected The spectral value of the candidate audio information, the pitch value of the candidate audio information in each of the two adjacent first sounding units, the rhythmic value of the candidate audio information in each of the two adjacent first sounding units, And a weighted sum of strength values connecting the candidate audio information in each of the two adjacent first pronunciation units. 如請求項1所述之文字轉語音方法,其中該第一語言模型資料庫和該第二語言模型資料庫均由訓練程式預先建立,其中該訓練程式包括:接收至少一個單一種語言的訓練語音;分析該訓練語音的語調、節律和音色;以及儲存該語調、該節律和該音色均在一相應預定範圍內的該訓練語音。 The text-to-speech method of claim 1, wherein the first language model database and the second language model database are pre-established by a training program, wherein the training program comprises: receiving training speech in at least one single language And analyzing the intonation, rhythm, and timbre of the training speech; and storing the training speech, the rhythm, and the timbre being within a respective predetermined range. 一種多語言語音合成裝置,用以將具有第一語言和第二語言的多語言文字訊息處理為多語言語音音 訊,該合成裝置包括:儲存模組,儲存具有多數個第一語言音位標籤和第一語言的同語言連接音調資訊的第一語言模型資料庫以及具有多數個第二語言音位標籤和第二語言的同語言連接音調資訊的第二語言模型資料庫;播音模組,播放該多語言語音音訊;處理器,與該儲存模組和該播音模組連接,用以:將該多語言文字訊息區分為至少一個第一語言段落及至少一個第二語言段落;轉換該第一語言段落為至少一個第一語言音位標籤及轉換至少一個第二語言段落為至少一個第二語言音位標籤;利用該至少一個第一語言音位標籤查找該第一語言模型資料庫以獲得至少一個第一語言音位標籤串列,以及利用該至少一個第二語言音位標籤查找該第二語言模型資料庫以獲得至少一個第二語言音位標籤串列;依據該多語言文字訊息的文字順序,將該至少一個第一語言音位標籤串列與該至少一個第二語言音位標籤串列組合為一多語言音位標籤串列;在每二個相鄰的音位標籤串列的交界處產生一跨語言連接音調資料,其中每二個相鄰的音位標籤串列包含該至少一個第一語言音位標籤串列中的一個第一語言音位標籤串列及該至少一個第二語言音位標籤串列中的一個第二語言音位標籤串列;合併該多語言音位標籤串列、在該至少一個第一語言 音位標籤串列中的每兩個相鄰的音位標籤之間的一交界處的該第一語言的同語言連接音調資料、在該至少一個第二語言音位標籤串列中的每兩個相鄰的音位標籤之間的一交界處的該第二語言的同語言連接音調資料以及該跨語言連接音調資料以產生該多語言語音音訊;以及將該多語言語音音訊輸出至該播音模組。 A multi-language speech synthesis device for processing multi-language text messages having a first language and a second language into multi-lingual speech sounds The synthesizing device includes: a storage module, a first language model database storing a plurality of first language phoneme tags and a first language connected tone information, and a plurality of second language phonemes tags and a second language model database for connecting the tone information in the same language; a broadcast module for playing the multi-language voice audio; and a processor coupled to the storage module and the broadcast module for: the multi-language text The message is divided into at least one first language paragraph and at least one second language paragraph; converting the first language paragraph into at least one first language phoneme label and converting the at least one second language paragraph into at least one second language phoneme label; Locating the first language model database with the at least one first language phoneme tag to obtain at least one first language phoneme tag string, and using the at least one second language phoneme tag to search the second language model database Obtaining at least one second language phoneme tag string; according to the text order of the multi-language text message, the at least one first The phoneme tag string and the at least one second language phoneme tag string are combined into a multi-lingual phoneme tag string; a 
cross-language connection tone is generated at the intersection of each two adjacent phoneme tag strings Data, wherein each two adjacent phoneme tag strings comprise a first language phoneme tag string in the at least one first language phoneme tag string and the at least one second language phoneme tag string a second language phoneme tag string; merging the multi-lingual phoneme tag string in the at least one first language a first language contemporaneous connection tone material at a junction between every two adjacent phoneme tags in the phoneme tag string, every two in the at least one second language phoneme tag string a second language of the same language at a junction between adjacent phoneme tags and the interlingual connection tone data to generate the multilingual voice audio; and outputting the multilingual voice audio to the broadcast Module. 如請求項9所述之多語言語音合成裝置,其中當至少一個第一語言音位標籤串列中的該第一語言音位標籤串列在至少一個第二語言音位標籤串列中的該第二語言音位標籤串列前面時,該處理器產生跨語言連接音調資料更用以:將該至少一個第二語言音位標籤串列的首位音位標籤替代為與該至少一個第二語言音位標籤串列的首位音位標籤具有近似發音的相應的該第一語言音位標籤中的一音位標籤;以及利用相應的該第一語言音位標籤中的該音位標籤查找該第一語言模型資料庫以獲得在至少一個第一語言音位標籤串列的一末位音位標籤和相應的該第一語言音位標籤中的該音位標籤之間的該第一語言模型資料庫中一相應的同語言連接音調資訊,其中該第一語言模型資料庫中該相應的同語言連接音調資訊用作至少一個第一語言音位標籤串列中的該第一語言音位標籤串列和至少一個第二語言音位標籤串列中的該第二語言音位標籤串列之間的交界處的該跨語言連接音調資料。 The multilingual speech synthesis apparatus of claim 9, wherein the first language phoneme tag string in the at least one first language phoneme tag string is listed in the at least one second language phoneme tag string When the second language phoneme label is in front of the string, the processor generates the cross-language connection tone data to: replace the first phoneme tag of the at least one second language phoneme tag string with the at least one second language The first phoneme tag of the phoneme tag string has a corresponding one of the first language phoneme tags of the approximated pronunciation; and the first phonetic tag of the first language phoneme tag is used to find the first phoneme tag a language model database for obtaining the first language model data between a last phoneme label of the at least one first language phoneme tag string and the corresponding phoneme tag of the corresponding first language phoneme tag a corresponding language connection tone information in the library, wherein the corresponding language connection tone information in the first language model database is used as the first language tone in the at least one first language phoneme tag string The cross-language connection tone material at a boundary between the bit tag string and the second language phoneme tag string in the at least one second language phoneme tag string. 如請求項9所述之多語言語音合成裝置,其中該第一語言模型資料庫和該第二語言模型資料庫各自還包含由連續的音位標籤形成的片語(phrase)、詞(word)、字(character)、音節(syllable)或音位(phonemes)中的一個或其組合的聲頻資訊,並且該些連續的音位標籤所形成的片語、詞、字、音節或音位中的一個或其組合是單個發音單元。 The multilingual speech synthesis device of claim 9, wherein the first language model database and the second language model database each further comprise a phrase, a word formed by consecutive phoneme labels. Audio information of one or a combination of characters, syllables, or phonemes, and the words, words, words, syllables, or phonemes formed by the consecutive phoneme tags One or a combination thereof is a single pronunciation unit. 
如請求項11所述之多語言語音合成裝置,其中在依據該多語言文字訊息的文字順序,將該至少一個第一語言音位標籤串列與該至少一個第二語言音位標籤串列組合為一多語言音位標籤串列的步驟之後,該處理器更用以:將該多語言音位標籤串列分成多數個第一發音單元,每一個第一發音單元同屬單一種語言,並且包含該至少一個第一語言音位標籤串列和該至少一個第二語言音位標籤串列中相應的一個語言音位標籤串列的連續的音位標籤;對於每一個第一發音單元,確定在該第一語言模型資料庫和第二語言模型資料庫其中一個相應的語言模型資料庫中,其中一個第一發音單元所對應的候選數目是否大於或等於該第一發音單元所對應的一預定數目;當該第一語言模型資料庫和該第二語言模型資料庫其中一個相應的語言模型資料庫中每一個該第一發音單元的候選數目均大於等於相應的該預定數目,計算每一個該候 選路徑的加入代價值,其中每一個該候選路徑經過每一個該第一發音單元的其中一個候選;以及依據每一個候選路徑的加入代價值確定每兩個相鄰的該第一發音單元之間的連接路徑。 The multilingual speech synthesis apparatus of claim 11, wherein the at least one first language phoneme tag string is combined with the at least one second language phoneme tag string in accordance with a textual order of the multilingual text message After the step of stringing a multi-lingual phoneme label, the processor is further configured to: divide the multi-language phoneme tag string into a plurality of first pronunciation units, each of the first pronunciation units belong to a single language, and a continuous phoneme tag comprising the at least one first language phoneme tag string and the corresponding one of the at least one second language phoneme tag string; for each first pronunciation unit, determining In the corresponding language model database of the first language model database and the second language model database, whether the number of candidates corresponding to one of the first pronunciation units is greater than or equal to a predetermined one of the first pronunciation unit a number; each of the first pronunciation model in the first language model database and the corresponding language model database in the second language model database Greater than equal to the number of candidate corresponding to the predetermined number, calculating each of the candidate Selecting a value of the join value, wherein each of the candidate paths passes through one of each of the first pronunciation units; and determining, between each two adjacent first pronunciation units, based on the join value of each candidate path Connection path. 如請求項12所述之多語言語音合成裝置,當確定每兩個相鄰的該第一發音單元之間的該連接路徑時,該處理器更用以:確定在相鄰兩個第一發音單元中前一個第一發音單元中所選的一個候選和相鄰兩個第一發音單元中後一個第一發音單元中所選的一個候選之間的該連接路徑;其中相鄰兩個第一發音單元中前一個第一發音單元中所選的該候選和相鄰兩個第一發音單元中後一個第一發音單元中所選的該候選都位於其中一個最低加入代價值的該候選路徑。 The multi-lingual speech synthesis device of claim 12, when determining the connection path between each two adjacent first pronunciation units, the processor is further configured to: determine two adjacent first pronunciations a connection path between the selected one of the previous first pronunciation units in the unit and the selected one of the next one of the two first first pronunciation units; wherein the two adjacent first The candidate selected in the previous first pronunciation unit of the pronunciation unit and the candidate selected in the latter first pronunciation unit of the two adjacent first pronunciation units are located in the candidate path of the lowest value of the added value. 
如請求項12所述之多語言語音合成裝置,當該第一語言模型資料庫和該第二語言模型資料庫其中一個相應的語言模型資料庫中任何一個或多個第一發音單元的該候選數目小於相應的該預定數目時,該處理器更用以:將該一個或多個第一發音單元中的每一個該發音單元分成多數個第二發音單元,其中任一個該第二發音單元的長度小於相應的該第一發音單元的長度;對於每個該第二發音單元,確定該第一語言模型資料 庫和該第二語言模型資料庫其中一個相應的該語言模型資料庫中,其中一個該第二發音單元所對應的該候選數目是否大於等於該第二發音單元所對應的一預定數目。 The multilingual speech synthesis apparatus of claim 12, wherein the candidate of any one or more of the first pronunciation units in the first language model database and the corresponding language model database of the second language model database When the number is less than the corresponding predetermined number, the processor is further configured to: divide each of the one or more first sounding units into a plurality of second sounding units, wherein any one of the second sounding units The length is less than the length of the corresponding first pronunciation unit; for each of the second pronunciation units, determining the first language model data Whether the number of candidates corresponding to one of the second pronunciation units is greater than or equal to a predetermined number corresponding to the second pronunciation unit in the corresponding language model database of the library and the second language model database. 如請求項12所述之多語言語音合成裝置,其中每個候選路徑的加入代價值是每個第一發音單元中的每個候選聲頻資訊的目標代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的聲譜代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的音調代價值、連接每兩個相鄰第一發音單元中的候選聲頻資訊的節律代價值、以及連接每兩個相鄰第一發音單元中的候選聲頻資訊的強度代價值的加權和。 The multilingual speech synthesis device of claim 12, wherein the join value of each candidate path is a target value of each candidate audio information in each first pronunciation unit, and each two adjacent first pronunciations are connected The spectral value of the candidate audio information in the unit, the pitch value of the candidate audio information in each of the two adjacent first sounding units, and the rhythmic value of the candidate audio information in each of the two adjacent first sounding units And a weighted sum of the strength values of the candidate audio information in each of the two adjacent first sounding units. 如請求項9所述之多語言語音合成裝置,其中該第一語言模型資料庫和該第二語言模型資料庫均由訓練程式預先建立,其中該訓練程式包括:接收至少一個單一種語言的訓練語音;分析該訓練語音的語調、節律和音色;以及儲存該語調、該節律和該音色均在一相應預定範圍內的該訓練語音。 The multilingual speech synthesis device of claim 9, wherein the first language model database and the second language model database are both pre-established by a training program, wherein the training program comprises: receiving training in at least one single language Speech; analyzing the intonation, rhythm, and timbre of the training speech; and storing the speech, the rhythm, and the timbre within the respective predetermined range of the training speech.
TW104137212A 2015-07-21 2015-11-11 Text-to-speech method and multi-lingual speech synthesizer using the method TWI605350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/956,405 US9865251B2 (en) 2015-07-21 2015-12-02 Text-to-speech method and multi-lingual speech synthesizer using the method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW104123585 2015-07-21

Publications (2)

Publication Number Publication Date
TW201705019A true TW201705019A (en) 2017-02-01
TWI605350B TWI605350B (en) 2017-11-11

Family

ID=58609313

Family Applications (1)

Application Number Title Priority Date Filing Date
TW104137212A TWI605350B (en) 2015-07-21 2015-11-11 Text-to-speech method and multi-lingual speech synthesizer using the method

Country Status (1)

Country Link
TW (1) TWI605350B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109086026B (en) * 2018-07-17 2020-07-03 阿里巴巴集团控股有限公司 Broadcast voice determination method, device and equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI467566B (en) * 2011-11-16 2015-01-01 Univ Nat Cheng Kung Polyglot speech synthesis method
TWI471854B (en) * 2012-10-19 2015-02-01 Ind Tech Res Inst Guided speaker adaptive speech synthesis system and method and computer program product

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020200178A1 (en) * 2019-04-03 2020-10-08 北京京东尚科信息技术有限公司 Speech synthesis method and apparatus, and computer-readable storage medium
US11881205B2 (en) 2019-04-03 2024-01-23 Beijing Jingdong Shangke Information Technology Co, Ltd. Speech synthesis method, device and computer readable storage medium
CN113345408A (en) * 2021-06-02 2021-09-03 云知声智能科技股份有限公司 Chinese and English voice mixed synthesis method and device, electronic equipment and storage medium
CN113920979A (en) * 2021-11-11 2022-01-11 腾讯科技(深圳)有限公司 Voice data acquisition method, device, equipment and computer readable storage medium
CN113920979B (en) * 2021-11-11 2023-06-02 腾讯科技(深圳)有限公司 Voice data acquisition method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
TWI605350B (en) 2017-11-11

Similar Documents

Publication Publication Date Title
He et al. Open-source multi-speaker speech corpora for building Gujarati, Kannada, Malayalam, Marathi, Tamil and Telugu speech synthesis systems
US9865251B2 (en) Text-to-speech method and multi-lingual speech synthesizer using the method
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
US8036894B2 (en) Multi-unit approach to text-to-speech synthesis
WO2018121757A1 (en) Method and system for speech broadcast of text
Chu et al. Locating boundaries for prosodic constituents in unrestricted Mandarin texts
US20080177543A1 (en) Stochastic Syllable Accent Recognition
US20160365087A1 (en) High end speech synthesis
CN103632663B (en) A kind of method of Mongol phonetic synthesis front-end processing based on HMM
TWI605350B (en) Text-to-speech method and multiplingual speech synthesizer using the method
Pravena et al. Development of simulated emotion speech database for excitation source analysis
TWI467566B (en) Polyglot speech synthesis method
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
Lorenzo-Trueba et al. Simple4all proposals for the albayzin evaluations in speech synthesis
JP3270356B2 (en) Utterance document creation device, utterance document creation method, and computer-readable recording medium storing a program for causing a computer to execute the utterance document creation procedure
US20140074478A1 (en) System and method for digitally replicating speech
Hu et al. Discourse prosody and its application to speech synthesis
Liu et al. A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin
Li et al. HMM-based speech synthesis with a flexible Mandarin stress adaptation model
Cooper et al. Characteristics of text-to-speech and other corpora
JP6849977B2 (en) Synchronous information generator and method for text display and voice recognition device and method
TWI703556B (en) Method for speech synthesis and system thereof
Boroș et al. Rss-tobi-a prosodically enhanced romanian speech corpus
Pravena et al. Significance of natural elicitation in developing simulated full blown speech emotion databases
Sulír et al. Development of the Slovak HMM-based tts system and evaluation of voices in respect to the used vocoding techniques