TWI635483B

TWI635483B - Method and system for generating prosody by using linguistic features inspired by punctuation

Info

Publication number: TWI635483B
Application number: TW106124240A
Authority: TW
Inventors: 王文俊; 陳保清; 江振宇; 洪宇平
Original assignee: 中華電信股份有限公司
Priority date: 2017-07-20
Filing date: 2017-07-20
Publication date: 2018-09-11
Also published as: TW201909165A

Abstract

The present invention discloses a method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation. The system includes a major punctuation prediction unit and a quoted-phrase prediction unit. Major punctuation marks are closely related to prosodic breaks and are therefore important for building the prosodic structure of synthesized speech, while the distinctive prosodic variation that quoted phrases exhibit in speech plays a key role in how the meaning of synthesized speech is understood. These two kinds of information can therefore improve the prosody generation process of a text-to-speech system and, in turn, the quality of the synthesized speech. The invention uses conditional random field (CRF) techniques to predict major punctuation marks and quoted phrases for the input text and to produce corresponding confidence values. These two confidence values are then combined with basic linguistic features as the input features of prosody generation, which predicts four prosodic parameters: volume, duration, pitch, and pause duration.

Description

Method and system for generating prosody by using linguistic features inspired by punctuation

The present invention relates to a method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation, and in particular to one that uses conditional random field (CRF) techniques to predict major punctuation marks and quoted phrases for the input text, produces corresponding confidence values, and combines these two confidence values with basic linguistic features as the input features of prosody generation.

Punctuation marks fall roughly into marking symbols and pausing symbols. In a text-to-speech (TTS) system they serve two purposes: they express the grammatical or prosodic structure of a sentence, and quotation marks indicate word strings that carry special meaning or emphasis. In ordinary writing most text does contain punctuation, but two problems remain. First, punctuation is often insufficient. Second, punctuation is inconsistent: different writers punctuate the same text differently, and even a single writer finds it hard to stay consistent and occasionally makes mistakes.

In a Mandarin text-to-speech system, prosody generation converts linguistic features into prosodic features. Its performance depends on two factors: the capability of the prosody prediction model and the linguistic features used. By level, the linguistic features fall into six categories: (1) pronunciation information, (2) text information, (3) word segmentation information, (4) part-of-speech information, (5) base-phrase information, and (6) syntactic structure information. The first four categories are the basic linguistic features used by most Mandarin TTS systems. Categories 1 and 2 are the essential inputs to prosody generation, and categories 3 and 4 result from word segmentation and part-of-speech tagging of the input text. Categories 5 and 6 are high-level linguistic features. The parsing models needed to produce them depend on experts with linguistic knowledge annotating large amounts of text in advance. Because this annotation is extremely time-consuming and annotators are not consistent with one another, training data is hard to obtain, which in turn limits the performance of the parsing models.

In addition, conventional TTS systems make little use of the information provided by quotation marks. The phrases or short sentences inside quotation marks usually carry the important meaning and affect the intelligibility of synthesized speech to some degree, so they merit further study as a way to enrich the linguistic features used for prosody prediction.

In view of the shortcomings of the conventional approaches described above, the inventors sought to improve on them and, after years of dedicated research, completed the present method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation.

The object of the present invention is to provide machine-extracted linguistic features for use in a Mandarin text-to-speech system. The first feature is the major punctuation confidence, which measures how likely a major punctuation mark is to be inserted between two words. The second feature is the quoted-phrase confidence, which measures how likely quotation marks are to appear before and after a word string, making that word string a unit with special meaning or emphasis.

To achieve this object, the present invention proposes a statistical feature called the major punctuation confidence. The idea is to replace the base-phrase and syntactic-structure features with graded pause-prediction confidence values. The major punctuation marks defined by the invention are the period (。), exclamation mark (！), question mark (？), semicolon (；), colon (：), and comma (，). In general, a higher pause-prediction confidence value indicates a major pause break, usually at the end of a phrase or short sentence and with a longer pause; a lower value indicates a minor pause break, usually at a word boundary and with a shorter pause. The major punctuation confidence value is produced by a major punctuation generation model based on conditional random field (CRF) techniques. The large amount of text containing major punctuation marks that is needed to build this model requires no further manual annotation. Applying the model yields, for each boundary between words, a predicted likelihood that a major punctuation mark occurs there; the higher the value, the more likely a pause is to be inserted at that position.

Using a large amount of text that contains the punctuation marks of interest, machine learning gives the text analysis stage of the text-to-speech system a consistent punctuation prediction capability and outputs the corresponding confidence values, providing richer information for the subsequent prosody prediction.

The CRF-based major punctuation generation model addresses a sequence labeling problem. To build the model, a tag set is first defined, and each word of the input text is labeled with a tag from this set. The search for the best tag sequence considers the different fields formed by different combinations of input features and uses Viterbi search to find the best path, producing a confidence value for each tag related to the major punctuation marks.
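The decoding step described above can be illustrated with a minimal sketch. It is not the patent's implementation: the two-tag set, the log-scores in `emit` and `trans`, and the function names are assumptions made for illustration. `viterbi` finds the best tag path, and `marginals` uses the forward-backward algorithm to give a per-position probability for each tag, which is one common way to derive a confidence value from a linear-chain CRF.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def viterbi(emit, trans):
    """Best tag path. emit[t][k]: log-score of tag k at word t;
    trans[j][k]: log-score of moving from tag j to tag k."""
    n_tags = len(emit[0])
    score = list(emit[0])
    back = []
    for t in range(1, len(emit)):
        prev, score, ptr = score, [], []
        for k in range(n_tags):
            best_j = max(range(n_tags), key=lambda j: prev[j] + trans[j][k])
            ptr.append(best_j)
            score.append(prev[best_j] + trans[best_j][k] + emit[t][k])
        back.append(ptr)
    path = [max(range(n_tags), key=lambda k: score[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


def marginals(emit, trans):
    """Per-word tag probabilities via forward-backward."""
    n, n_tags = len(emit), len(emit[0])
    alpha = [list(emit[0])]
    for t in range(1, n):
        alpha.append([
            emit[t][k] + logsumexp([alpha[-1][j] + trans[j][k] for j in range(n_tags)])
            for k in range(n_tags)
        ])
    beta = [[0.0] * n_tags for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            logsumexp([trans[j][k] + emit[t + 1][k] + beta[t + 1][k] for k in range(n_tags)])
            for j in range(n_tags)
        ]
    log_z = logsumexp(alpha[-1])
    return [[math.exp(alpha[t][k] + beta[t][k] - log_z) for k in range(n_tags)]
            for t in range(n)]


# Hypothetical scores for three words; tag 0 = "no mark follows", tag 1 = "mark follows".
emit = [[0.0, -1.0], [-2.0, 0.0], [0.0, -0.5]]
trans = [[0.0, 0.0], [0.0, 0.0]]
best_path = viterbi(emit, trans)
confidence = [row[1] for row in marginals(emit, trans)]
```

Here `confidence[t]` plays the role of a punctuation confidence for the boundary after word `t`; in a trained model the scores would come from learned feature weights rather than hand-written numbers.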

Besides indicating pauses, punctuation in a text-to-speech system also has a marking function, in particular marking text strings that carry special meaning or emphasis. Statistics gathered from a large amount of text on the types of quotation marks and the functions of quoted phrases show the following. First, quotation-related punctuation marks generally fall into 10 classes: 「()」, 「{}」, 「〔〕」, 「「」」, 「『』」, 「〈〉」, 「【】」, 「《》」, 「“”」, and 「〝〞」.

The first class, 「()」, is mostly used for enumeration and is not a target symbol of the present invention.

The second class, 「{}」, is mostly used to indicate book or article titles and is a target symbol of the present invention.

The third class, 「〔〕」, is mostly used for annotations or supplementary notes and is not a target symbol of the present invention.

The fourth class, 「「」」, and the fifth class, 「『』」, mark, in decreasing order of the number of words in the quoted phrase, short sentences, proper nouns, idioms, and buzzwords. These two classes are the main sampling source for the target symbols of the present invention and occur most frequently.

The sixth to tenth classes, 「〈〉」, 「【】」, 「《》」, 「“”」, and 「〝〞」, mainly mark proper nouns, idioms, and buzzwords and less often mark short sentences; all of them are target symbols of the present invention.

Because the phrases or short sentences inside quotation marks usually carry the important meaning, the corresponding speech shows clear changes: there is a distinct pause and a pitch reset at the boundaries of a quoted phrase, while inside the quoted phrase there is usually no pause and the volume increases. So that the prosody prediction model of the text-to-speech system can learn these changes effectively, the present invention builds a quoted-phrase confidence prediction model that gives, for each boundary between words, a predicted likelihood that a quotation mark occurs there, to assist prosody generation. This model is likewise built with CRF techniques, and its training data is a large amount of text containing quotation marks.

The two confidence values are then combined with the basic linguistic features for prosody generation, which predicts the prosodic parameters of volume, pitch, duration, and pause duration.

A system that uses linguistic features inspired by punctuation marks for Mandarin prosody generation comprises: a word segmentation and part-of-speech tagging unit, which processes the input text, segments it into words, tags each word with its part of speech, and marks the constituents of each word, to provide the information needed by the subsequent units; a major punctuation prediction unit, which operates in parallel with the quoted-phrase prediction unit on the information provided by the word segmentation and part-of-speech tagging unit to produce the major punctuation confidence; a quoted-phrase prediction unit, which operates in parallel with the major punctuation prediction unit on the information provided by the word segmentation and part-of-speech tagging unit to produce the quoted-phrase confidence; a context analysis unit, which receives the information provided by the word segmentation and part-of-speech tagging unit to produce the basic linguistic features; a feature vector integration unit, which integrates the major punctuation confidence, the quoted-phrase confidence, and the basic linguistic features; and a prosody generation unit, which is based on a multi-layer perceptron (MLP), receives the integrated information from the feature vector integration unit, and produces prosodic parameters for each syllable. Building the context analysis unit and the prosody generation unit requires speech data for tuning the relevant parameters.

The input to the major punctuation prediction unit consists of words (WORD) and parts of speech (POS), whereas the inputs to the quoted-phrase prediction unit and to the context analysis unit consist of words, parts of speech, and punctuation marks (PM).

The speech data used to build the context analysis unit and the prosody generation unit comprises a training set, a development set, and a testing set. The training set is used to train the conventional MLP parameters, including the connection weights between nodes of different layers. The development set is used to determine the best number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation unit. The testing set is used to evaluate how well the prosody generation unit predicts the four prosodic parameters of pitch, duration, volume, and pause duration.
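The role of the development set can be sketched as a small grid search. This is only an illustration under stated assumptions: `train_fn` and `error_fn` are hypothetical callables standing in for MLP training and for a prediction-error measure, and the candidate values are invented; the patent does not specify the search procedure.

```python
from itertools import product


def select_hyperparameters(train_fn, error_fn, train_set, dev_set,
                           context_sizes, hidden_sizes):
    """Train one model per (context size, hidden size) pair on the training
    set and keep the pair with the lowest error on the development set."""
    best = None
    for context, hidden in product(context_sizes, hidden_sizes):
        model = train_fn(train_set, context, hidden)
        err = error_fn(model, dev_set)
        if best is None or err < best[0]:
            best = (err, context, hidden, model)
    return best


# Toy stand-ins (assumptions): the "model" is just its settings, and the
# "error" is smallest at 2 neighboring words and 8 hidden nodes.
def toy_train(train_set, context, hidden):
    return (context, hidden)


def toy_error(model, dev_set):
    context, hidden = model
    return abs(context - 2) + abs(hidden - 8)


result = select_hyperparameters(toy_train, toy_error, [], [],
                                context_sizes=[1, 2, 3], hidden_sizes=[4, 8, 16])
```

The testing set is deliberately absent from this loop: it is held back so that the final evaluation is made on data that influenced neither the weights nor the chosen settings.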

A method that uses linguistic features inspired by punctuation marks for Mandarin prosody generation comprises: step 1, segmenting the input text into words and tagging the parts of speech; step 2, building the major punctuation prediction model and the quoted-phrase prediction model in parallel; step 3, building a context analysis module that considers neighboring words in the context to produce the basic linguistic features; and step 4, building an MLP-based prosody generation module.

Steps 3 and 4 must use speech data to tune the relevant parameters.

Step 1 produces the word segmentation, the part-of-speech tags, and the information on the constituents of each word.

The punctuation prediction model of step 2 uses either an 11-tag scheme or a 2-tag scheme. Its input features are the word, the word length, and the part of speech. Probability fields are built with the CRF++ model from different combinations of these input features and are used to find the best tag sequence, and CRF is combined with a finite-state-machine rule method to predict the major punctuation marks and produce the corresponding confidence values. Word length is the number of syllables in the word.
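As an illustration of how the three input features might be laid out for CRF++, the sketch below writes one token per line with whitespace-separated columns, the label in the last column, and an empty line after each sentence, which is the data layout CRF++ documents. The column order, the part-of-speech labels, the tag values, and the use of the character count as the syllable count (one syllable per Chinese character) are assumptions, not details from the patent.

```python
def to_crfpp_rows(sentence):
    """Format one sentence for a CRF++-style data file.

    sentence: list of (word, pos, tag) tuples.
    Columns (assumed order): word, word length, part of speech, tag.
    """
    lines = []
    for word, pos, tag in sentence:
        # Assumption: every character in the word is one syllable.
        lines.append(f"{word}\t{len(word)}\t{pos}\t{tag}")
    # CRF++ separates sentences with an empty line.
    return "\n".join(lines) + "\n\n"


# Placeholder POS labels and tags, for illustration only.
example = to_crfpp_rows([("今天", "Nd", "N"), ("天氣", "Na", "Y")])
```

Which feature combinations the model actually uses would be declared separately in a CRF++ template file; that file is outside the scope of this sketch.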

The tags of the 11-tag scheme are: B1 (first), B2 (second), B3 (third), B4 (fourth), I (the middle word when the number of words is odd and less than nine), M (the middle words when the number of words is nine or more), E4 (fourth from last), E3 (third from last), E2 (second from last), E1 (last), and S (single word). Alternatively, a 2-tag scheme can be used, in which two tags indicate whether or not a major punctuation mark follows each word in the sentence.
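The two tag schemes can be sketched as follows. This is one reading of the description, not the patent's code: it assumes the 11 tags describe a word's position within a stretch of words delimited by major punctuation marks, and the 2-tag names `PM` and `NPM` are invented for illustration.

```python
MAJOR_PM = set("。！？；：，")  # the six major punctuation marks listed in the text


def tag_segment_11(n):
    """11-tag labels for a segment of n words (assumed interpretation)."""
    if n < 1:
        raise ValueError("segment must contain at least one word")
    if n == 1:
        return ["S"]
    half = min(n // 2, 4)                      # at most B1..B4 and E4..E1
    head = [f"B{i}" for i in range(1, half + 1)]
    tail = [f"E{i}" for i in range(half, 0, -1)]
    n_mid = n - 2 * half
    # Nine or more words: all middle words are M; otherwise an odd count
    # leaves exactly one middle word, tagged I.
    mid = ["M"] * n_mid if n >= 9 else ["I"] * n_mid
    return head + mid + tail


def tag_two(tokens):
    """2-tag labels: does a major punctuation mark follow each word?"""
    tagged = []
    for i, tok in enumerate(tokens):
        if tok in MAJOR_PM:
            continue
        follows = i + 1 < len(tokens) and tokens[i + 1] in MAJOR_PM
        tagged.append((tok, "PM" if follows else "NPM"))
    return tagged


five = tag_segment_11(5)
ten = tag_segment_11(10)
pairs = tag_two(["今天", "天氣", "好", "，", "我們", "出門", "。"])
```

Under this reading a five-word segment is tagged B1 B2 I E2 E1, and a ten-word segment gets two M tags between B4 and E4.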

The quoted-phrase prediction model of step 2 can use either an 8-tag scheme or a 19-tag scheme to predict the tags related to quoted phrases. Its input features are the word, the word length, the part of speech, and the punctuation marks other than quotation marks. The CRF++ model is used to find the best tag sequence, and the CRF method is combined with a finite-state-machine rule method to predict the quoted phrases and produce the corresponding confidence values.

The tags of the 8-tag scheme are: B (first), B2 (second), B3 (third), I (the middle word when the number of words is three), M (the middle words when the number of words is greater than three), E (last), S (single word), and O (a word outside any quoted phrase).

The 19-tag scheme consists of the first 7 tags of the 8-tag scheme, which describe the structure inside a quoted phrase, plus a new set of 12 tags that describe the structure of the non-quoted word strings in the sentence. The new tags have the form Xy. X belongs to the set {P, M, F}, denoting respectively before the quoted phrases, between quoted phrases, and after the quoted phrases. y belongs to the set {b, m, e, s}, denoting respectively the initial word, a middle word, the final word, and a single word.
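The 12 outer tags of the 19-tag scheme can be sketched as below. The function name, the boolean-list input, and the use of `None` as a placeholder for words inside a quoted phrase (which would receive the 7 inner tags) are assumptions; the sketch also assumes the sentence contains at least one quoted word, since the description does not say how a sentence without quotation marks is tagged.

```python
from itertools import groupby


def outer_tags(in_quote):
    """Assign Xy tags to words outside quoted phrases.

    in_quote: list of bools, one per word, True if the word is quoted.
    Returns one entry per word; quoted words are left as None.
    """
    if not any(in_quote):
        raise ValueError("expects at least one quoted word")
    first = in_quote.index(True)                          # first quoted word
    last = len(in_quote) - 1 - in_quote[::-1].index(True)  # last quoted word
    tags, pos = [], 0
    for quoted, run in groupby(in_quote):
        length = len(list(run))
        if quoted:
            tags.extend([None] * length)
        else:
            # P: before the first quoted phrase, F: after the last, M: between.
            region = "P" if pos < first else ("F" if pos > last else "M")
            if length == 1:
                tags.append(region + "s")
            else:
                tags.extend([region + "b"] + [region + "m"] * (length - 2) + [region + "e"])
        pos += length
    return tags


demo = outer_tags([False, False, True, True, False, True, False, False, False])
```

For the demo input, the two leading words become Pb and Pe, the single word between the two quoted phrases becomes Ms, and the three trailing words become Fb, Fm, and Fe.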

The speech data used in steps 3 and 4 comprises a training set, a development set, and a testing set. The training set is used to train the conventional MLP parameters, including the connection weights between nodes of different layers. The development set is used to determine the best number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation unit. The testing set is used to evaluate how well the prosody generation unit predicts the four prosodic parameters of pitch, duration, volume, and pause duration.

Of the above steps, steps 1 and 2 need only text data, whereas steps 3 and 4 additionally require speech data.

In summary, the input text first undergoes word segmentation and part-of-speech tagging, and the resulting information is fed in parallel to three processes: the CRF-based major punctuation prediction unit, the CRF-based quoted-phrase prediction unit, and the context analysis unit. The context analysis unit considers neighboring words in the context to produce the basic linguistic features, namely pronunciation information, text information, word segmentation information, and part-of-speech information. The major punctuation prediction unit and the quoted-phrase prediction unit produce the major punctuation confidence and the quoted-phrase confidence, respectively. The basic linguistic features, the major punctuation confidence, and the quoted-phrase confidence are combined into an integrated feature vector, which is fed to the prosody generation unit. This unit contains four independent neural networks, each with an MLP-based three-layer architecture consisting of an input layer, a hidden layer, and an output layer. The input layer takes the integrated feature vector, and the output layer produces the syllable-level prosodic parameters as follows: the first network outputs the fourth-order orthogonal transform coefficients that describe the pitch contour, the second outputs the duration, the third outputs the volume, and the fourth outputs the pause duration between syllables.
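The data flow into the prosody generation unit can be sketched as follows. The sketch only illustrates the shape of the computation: the weights are random and untrained, the tanh hidden activation, the bias terms, the feature values, and the assumption that the pitch network emits four coefficients are all choices made for this example rather than details given in the patent.

```python
import math
import random

# Assumed output sizes: four pitch-contour coefficients, one value each otherwise.
OUTPUT_SIZES = {"pitch": 4, "duration": 1, "volume": 1, "pause": 1}


def integrate(basic_features, pm_confidence, quote_confidence):
    """Concatenate basic features with the two confidence values."""
    return list(basic_features) + [pm_confidence, quote_confidence]


def make_mlp(n_in, n_hidden, n_out, rng):
    """Random weights for a three-layer MLP; the extra column is a bias."""
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return w1, w2


def forward(mlp, x):
    """Input layer -> tanh hidden layer -> linear output layer."""
    w1, w2 = mlp
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w1]
    return [sum(w * v for w, v in zip(row, hidden + [1.0])) for row in w2]


def build_prosody_models(n_in, n_hidden, seed=0):
    """Four independent networks, one per prosodic parameter."""
    rng = random.Random(seed)
    return {name: make_mlp(n_in, n_hidden, n_out, rng)
            for name, n_out in OUTPUT_SIZES.items()}


def predict_prosody(models, features):
    return {name: forward(mlp, features) for name, mlp in models.items()}


# Hypothetical per-syllable features and confidence values.
features = integrate([0.2, 0.5, 1.0], pm_confidence=0.8, quote_confidence=0.1)
models = build_prosody_models(n_in=len(features), n_hidden=6)
prosody = predict_prosody(models, features)
```

Keeping the four networks separate, as the text describes, means each prosodic parameter has its own hidden layer, so the hidden-layer size can be chosen per parameter on the development set.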

Compared with conventional techniques, the method and system provided by the present invention, which use linguistic features inspired by punctuation marks for Mandarin prosody generation, have the following advantages:

1. The first class of objects handled by the present invention is the major punctuation marks, which are related to the pause breaks that express prosodic structure. Conventional TTS systems mostly either use only relatively shallow linguistic features or need a syntax parser to produce high-level linguistic features; the present invention instead uses major punctuation marks to represent pause breaks in place of the high-level linguistic features.

2. The second class of objects handled by the present invention is quotation marks. Conventional TTS systems make little use of the information these marks provide to assist prosody prediction and speech synthesis. The present invention uses this information to enrich the input linguistic features for prosody prediction and further improve the intelligibility of synthesized speech.

3. The present invention uses conditional random field (CRF) techniques to build the prediction system for these two classes of punctuation marks. For long sentences that have no major punctuation marks, or sentences that lack quotation marks, machine learning can predict how likely these two classes of punctuation marks are to occur.

110‧‧‧Word segmentation and part-of-speech tagging unit

120‧‧‧Major punctuation prediction unit

130‧‧‧Quoted-phrase prediction unit

140‧‧‧Context analysis unit

150‧‧‧Feature vector integration unit

160‧‧‧Prosody generation unit

S210~S240‧‧‧Process steps

The technical content, objects, and effects of the present invention can be further understood from the detailed description and the accompanying drawings, in which: FIG. 1 is an architecture diagram of the method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation; FIG. 2 is a flowchart of the method and system; and FIG. 3 is a schematic diagram of the multi-layer perceptron used for prosody generation in the method and system.

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here serve only to explain the invention and are not intended to limit it.

The present invention is further described below with reference to the drawings:

Referring to FIG. 1, the architecture of the method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation comprises: a word segmentation and part-of-speech tagging unit 110, which processes the input text, segments it into words, tags each word with its part of speech, and marks the constituents of each word, to provide the information needed by the subsequent units; a major punctuation prediction unit 120, which operates in parallel with the quoted-phrase prediction unit 130 on the information provided by the word segmentation and part-of-speech tagging unit to produce the major punctuation confidence; a quoted-phrase prediction unit 130, which operates in parallel with the major punctuation prediction unit 120 on the information provided by the word segmentation and part-of-speech tagging unit to produce the quoted-phrase confidence; a context analysis unit 140, which receives the information provided by the word segmentation and part-of-speech tagging unit to produce the basic linguistic features; a feature vector integration unit 150, which integrates the major punctuation confidence, the quoted-phrase confidence, and the basic linguistic features; and a prosody generation unit 160, which receives the integrated information from the feature vector integration unit and produces prosodic parameters for each syllable. Building the context analysis unit and the prosody generation unit requires speech data for tuning the relevant parameters.

The input to the major punctuation prediction unit 120 consists of words and parts of speech, whereas the inputs to the quoted-phrase prediction unit 130 and to the context analysis unit 140 consist of words, parts of speech, and punctuation marks.

The speech data used to build the context analysis unit and the prosody generation unit comprises a training set, a development set, and a testing set. The training set is used to train the conventional MLP parameters, including the connection weights between nodes of different layers. The development set is used to determine the best number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation unit. The testing set is used to evaluate how well the prosody generation unit predicts the four prosodic parameters of pitch, duration, volume, and pause duration.

Referring to FIG. 2, the flowchart of the method and system that use linguistic features inspired by punctuation marks for Mandarin prosody generation comprises: step 1 (S210), segmenting the input text into words and tagging the parts of speech; step 2, in parallel, building the major punctuation prediction model (S221) and building the quoted-phrase prediction model (S222); step 3 (S230), building a context analysis module that considers neighboring words in the context to produce the basic linguistic features; and step 4 (S240), building an MLP-based prosody generation module.

Step 3 (S230) and step 4 (S240) must use speech data to tune the relevant parameters.

Step 1 (S210) produces the word segmentation, the part-of-speech tags, and the information on the constituents of each word.

In Step 2, S221 builds the major punctuation prediction model using either the 11-tag scheme or the 2-tag scheme. The input features are the word, the word length, and the part of speech, where the word length is the number of syllables in the word. For each combination of input features, a CRF++ model establishes the corresponding probability field, from which the best tag sequence is obtained. The CRF is combined with finite-state-machine rules to predict the major punctuation marks and to produce the corresponding confidence values.

The tags of the 11-tag scheme are: B1 (first), B2 (second), B3 (third), B4 (fourth), I (middle word when the word count is odd and less than nine), M (middle words when the word count is nine or more), E4 (fourth from last), E3 (third from last), E2 (second from last), E1 (last), and S (single word). Alternatively, the 2-tag scheme may be used, in which one of two tags is assigned after each word in the sentence to indicate whether a major punctuation mark is present.

In Step 2, S222 builds the quoted-phrase prediction model, which may use either the 8-tag scheme or the 19-tag scheme to predict the tags related to quoted phrases. The input features are the word, the word length, the part of speech, and punctuation marks other than quotation marks. A CRF++ model is used to obtain the best tag sequence, and the CRF is combined with finite-state-machine rules to predict quoted phrases and to produce the corresponding confidence values.

The speech data used in Step 3 (S230) and Step 4 (S240) comprise a training set, a development set, and a testing set. The training set is used to train the parameters of the conventional MLP, including the connection weights between nodes in different layers. The development set is used to determine the optimal number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation processing unit. The testing set is used to evaluate how well the prosody generation processing unit predicts the four prosodic parameters: pitch, duration, volume, and pause length.
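The role of the development set can be pictured as a small grid search over the two settings named above. The following is only a sketch: `select_on_dev` and the `dev_error` callback are hypothetical names that do not appear in the source, and the toy error function stands in for actually training an MLP on the training set and scoring it on the development set.

```python
from itertools import product

def select_on_dev(window_sizes, hidden_sizes, dev_error):
    """Pick the (neighboring-word count, hidden-node count) pair with the
    lowest development-set error.

    dev_error(window, hidden) is a hypothetical callback: it is assumed to
    train on the training set and return the error on the development set.
    """
    candidates = product(window_sizes, hidden_sizes)
    return min(candidates, key=lambda pair: dev_error(*pair))

# Toy stand-in for a real train-and-evaluate routine (illustration only).
def toy_dev_error(window, hidden):
    return (window - 2) ** 2 + abs(hidden - 16) / 16

best_window, best_hidden = select_on_dev([1, 2, 3], [8, 16, 32], toy_dev_error)
```

The testing set would then be used once, with the selected settings, to report the final prediction error for each prosodic parameter.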

In summary, the input text first undergoes word segmentation and part-of-speech tagging. The resulting information is fed simultaneously to three processes: the CRF-based major punctuation prediction unit, the CRF-based quoted-phrase prediction unit, and the context analysis unit. The context analysis unit considers the neighboring words in context to generate the basic linguistic features, namely pronunciation, character, word-segmentation, and part-of-speech information. The major punctuation prediction unit and the quoted-phrase prediction unit produce the major punctuation confidence and the quoted-phrase confidence, respectively. The integrated feature vector combines the basic linguistic features, the major punctuation confidence, and the quoted-phrase confidence, and is input to the prosody generation processing unit. This unit contains four independent neural networks, each with an MLP-based three-layer architecture (see also FIG. 3) consisting of an input layer, a hidden layer, and an output layer. The input layer receives the integrated feature vector, and the output layer produces syllable-level prosodic parameters as follows: the first network outputs the fourth-order orthogonal transform coefficients that describe the pitch contour, the second outputs the duration, the third outputs the volume, and the fourth outputs the pause length between syllables.

The method and system for Mandarin prosody generation using linguistic features inspired by punctuation are divided into four parts. The first part is conventional text analysis, comprising word segmentation and part-of-speech tagging. The second part is the CRF-based major punctuation prediction unit. The third part is the CRF-based quoted-phrase prediction unit. The fourth part comprises the context analysis unit, which considers neighboring words in context to generate the basic linguistic features, and the prosody generation processing unit built with an MLP.

An embodiment is described below:

In the first part, the input text is processed by the word segmentation and part-of-speech tagging unit. This unit segments the text into words, tags the parts of speech, and marks the constituents of each word, providing the information required by the subsequent processing modules.

In the second part, the major punctuation prediction unit can be built with the simplest design, the 2-tag scheme, in which one of two tags is assigned after each word in the sentence to indicate whether a major punctuation mark is present. Experiments show that this approach only helps to split longer sentences. For the model to predict major punctuation marks more accurately, a more complex tag set that expresses the overall structure of the sentence must be used.

In view of this, the invention uses the 11-tag scheme, whose tags are: B1 (first), B2 (second), B3 (third), B4 (fourth), I (middle word when the word count is odd and less than nine), M (middle words when the word count is nine or more), E4 (fourth from last), E3 (third from last), E2 (second from last), E1 (last), and S (single word). Sentences are tagged as follows:

If the length is one unit, the tag is: S.

If the length is two units, the tags are: B1, E1.

If the length is three units, the tags are: B1, I, E1.

If the length is four units, the tags are: B1, B2, E2, E1.

If the length is five units, the tags are: B1, B2, I, E2, E1.

If the length is six units, the tags are: B1, B2, B3, E3, E2, E1.

If the length is seven units, the tags are: B1, B2, B3, I, E3, E2, E1.

If the length is eight units, the tags are: B1, B2, B3, B4, E4, E3, E2, E1.

If the length is nine units or more, the tags are: B1, B2, B3, B4, M…M, E4, E3, E2, E1.
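The tagging rules above can be written as one small function. This is a minimal sketch rather than the patent's implementation: the function name `eleven_tag_labels` is hypothetical, and it is assumed that the "length" counts the words in one segment delimited by major punctuation marks.

```python
def eleven_tag_labels(n):
    """Return the 11-tag label sequence for a segment of n words.

    Assumption: n is the number of words in one segment delimited by
    major punctuation marks.
    """
    if n < 1:
        raise ValueError("segment length must be at least 1")
    if n == 1:
        return ["S"]
    if n >= 9:
        # Four fixed tags at each end, with M filling the middle.
        return ["B1", "B2", "B3", "B4"] + ["M"] * (n - 8) + ["E4", "E3", "E2", "E1"]
    half = n // 2
    begin = [f"B{i}" for i in range(1, half + 1)]
    end = [f"E{i}" for i in range(half, 0, -1)]
    middle = ["I"] if n % 2 else []  # odd lengths below nine get a single I
    return begin + middle + end
```

For lengths below nine the tags are symmetric around the middle, so each word's tag encodes its distance from the nearer segment boundary.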

Table 1 below shows a training sentence and its labels for the major punctuation prediction unit, where (a) is the text of the training sentence, (b) is the labeling obtained with the 2-tag scheme, whose two tags are y₀ and y₁, and (c) is the labeling obtained with the 11-tag scheme. The input features used to build the major punctuation prediction unit are the word, the word length (the number of syllables in the word), and the part of speech. For each combination of input features, a CRF++ model establishes the corresponding probability field, from which the best tag sequence is obtained.
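CRF++ reads training data as one token per line with whitespace-separated feature columns, the answer tag in the last column, and a blank line between sentences. The sketch below formats the three input features and a tag in that layout. The helper name `crfpp_rows`, the example sentence, and the part-of-speech labels are all hypothetical, and the word length is approximated by the character count on the assumption that each Chinese character is one syllable.

```python
def crfpp_rows(tagged_words, tags):
    """Format one sentence as CRF++ training rows: word, length, POS, tag.

    tagged_words: list of (word, part_of_speech) pairs.
    tags: list of answer tags, one per word (e.g. from the 11-tag scheme).
    Assumption: word length in syllables equals the number of characters.
    """
    if len(tagged_words) != len(tags):
        raise ValueError("each word needs exactly one tag")
    rows = [
        f"{word}\t{len(word)}\t{pos}\t{tag}"
        for (word, pos), tag in zip(tagged_words, tags)
    ]
    # A blank line terminates the sentence.
    return "\n".join(rows) + "\n\n"

# Hypothetical four-word segment with made-up part-of-speech labels.
sentence = [("今天", "Nd"), ("天氣", "Na"), ("很", "Dfa"), ("好", "VH")]
block = crfpp_rows(sentence, ["B1", "B2", "E2", "E1"])
```

Different feature combinations, as mentioned in the text, would be selected through the CRF++ feature template rather than by changing these columns.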

In the third part, the quoted-phrase prediction unit can be built in two ways. The first is the 8-tag scheme, with the tag set {B, B2, B3, M, I, E, S, O}. This scheme focuses on predicting the structure inside a quoted phrase, and every word outside a quoted phrase is given the tag O. The remaining seven tags are: B (first), B2 (second), B3 (third), I (middle word when the word count is three), M (middle words when the word count is greater than three), E (last), and S (single word). The words inside quotation marks are tagged as follows:

If the length is two units, the tags are: B, E.

If the length is three units, the tags are: B, I, E.

If the length is four units, the tags are: B, B2, M, E.

If the length is five units, the tags are: B, B2, M, M, E.

If the length is six units or more, the tags are: B, B2, B3, M…M, E.
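The 8-tag rules can be sketched in the same way. The function names are hypothetical, the corner quotation marks 「 and 」 are an assumed choice of delimiters, and the handling of empty or unclosed quotations is an assumption that the source does not address.

```python
def quoted_phrase_labels(n):
    """Return the tags for the n words inside one quoted phrase."""
    if n < 1:
        raise ValueError("phrase length must be at least 1")
    if n == 1:
        return ["S"]
    if n == 2:
        return ["B", "E"]
    if n == 3:
        return ["B", "I", "E"]
    if n <= 5:
        return ["B", "B2"] + ["M"] * (n - 3) + ["E"]
    return ["B", "B2", "B3"] + ["M"] * (n - 4) + ["E"]

def eight_tag_sentence(tokens, open_q="「", close_q="」"):
    """Label a tokenized sentence; quotation marks themselves are dropped."""
    labeled = []
    inside = None  # words collected for the quoted phrase being read
    for tok in tokens:
        if tok == open_q:
            inside = []
        elif tok == close_q and inside is not None:
            if inside:  # assumption: an empty quotation yields no tags
                labeled.extend(zip(inside, quoted_phrase_labels(len(inside))))
            inside = None
        elif inside is not None:
            inside.append(tok)
        else:
            labeled.append((tok, "O"))
    if inside:  # assumption: words after an unclosed quote count as outside
        labeled.extend((tok, "O") for tok in inside)
    return labeled
```

Removing the quotation marks from the token sequence mirrors the stated input features, which include punctuation marks other than quotation marks.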

The second way is the 19-tag scheme. Unlike the first scheme, it predicts both the structure inside quoted phrases and the structure of the non-quoted phrases within the sentence. It can be regarded as an improved version of the first scheme, with additional tags that represent the position in the sentence of each word of a non-quoted phrase. Its tags therefore comprise the first seven tags of the first scheme plus a new set of non-quoted-phrase tags. The non-quoted-phrase tags take the form Xy, where X belongs to the set {P, M, F}, denoting before the quoted phrase, between quoted phrases, and after the quoted phrase, respectively, and y belongs to the set {b, m, e, s}, denoting the starting word, a middle word, the ending word, and a single word, respectively.
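The count of 19 follows from 7 structural tags plus 3 × 4 non-quoted-phrase tags, which the sketch below enumerates. The constant and function names are hypothetical, and labeling each run of non-quoted words by position (b, m…m, e, or s) is an assumed reading of the Xy format.

```python
from itertools import product

# The seven structural tags reused from the 8-tag scheme ("O" is dropped).
QUOTED_TAGS = ["B", "B2", "B3", "I", "M", "E", "S"]
REGIONS = ["P", "M", "F"]         # before, between, after quoted phrases
POSITIONS = ["b", "m", "e", "s"]  # start, middle, end, single word

NON_QUOTED_TAGS = [region + pos for region, pos in product(REGIONS, POSITIONS)]
TAGS_19 = QUOTED_TAGS + NON_QUOTED_TAGS

def non_quoted_labels(region, n):
    """Return the Xy tags for a run of n non-quoted words in one region."""
    if region not in REGIONS:
        raise ValueError("region must be one of P, M, F")
    if n < 1:
        raise ValueError("run length must be at least 1")
    if n == 1:
        return [region + "s"]
    return [region + "b"] + [region + "m"] * (n - 2) + [region + "e"]
```

The structural tag M and the region prefix M do not collide, because every non-quoted tag carries a lowercase position suffix.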

Table 2 below shows a training sentence and its labels for the quoted-phrase prediction unit, where (a) is the text of the training sentence, (b) is the labeling obtained with the 8-tag scheme, and (c) is the labeling obtained with the 19-tag scheme. The input features used by the quoted-phrase prediction unit are the word, the word length, the part of speech, and punctuation marks other than quotation marks, and a CRF++ model is used to obtain the best tag sequence.

Finally, the fourth part comprises the context analysis unit, which considers neighboring words in context to generate the basic linguistic features and integrates the related linguistic features, and the prosody generation processing unit built with an MLP. Whereas the major punctuation prediction unit and the quoted-phrase prediction unit are built from text data, the context analysis unit and the prosody generation processing unit of the fourth part require data that include speech. The corpus is divided into a training set, a development set, and a testing set. The training set is used to train the parameters of the conventional MLP, namely the connection weights w_ji and w_kj between nodes in different layers shown in FIG. 3. The development set is used to determine the optimal number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation processing unit. The testing set is used to evaluate how well the prosody generation processing unit predicts the four prosodic parameters: pitch, duration, volume, and pause length. The four independent neural networks follow the MLP architecture of FIG. 3. The first MLP outputs the fourth-order orthogonal transform coefficients that describe the pitch contour, so its number of output-layer nodes K is 4. The second to fourth MLPs output the duration, the volume, and the pause length between syllables, respectively, and each has K equal to 1.
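The shapes described above can be illustrated with a tiny forward pass. This is a sketch under stated assumptions, not the patent's network: the tanh hidden activation, the linear output layer, the omission of bias terms, the hidden size of 8, the random untrained weights, and the three-element base feature vector are all placeholders. Only the weight names w_ji and w_kj, the three-layer layout, the four networks, and the output sizes K come from the text.

```python
import math
import random

def make_mlp(n_in, n_hidden, n_out, rng):
    """Create untrained weights: w_ji (input to hidden), w_kj (hidden to output)."""
    w_ji = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_hidden)]
    w_kj = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)] for _ in range(n_out)]
    return w_ji, w_kj

def forward(mlp, x):
    """One forward pass (assumed tanh hidden layer, linear output, no biases)."""
    w_ji, w_kj = mlp
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w_ji]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_kj]

def integrate(base_features, punct_conf, quote_conf):
    """Append the two confidence values to the basic linguistic features."""
    return list(base_features) + [punct_conf, quote_conf]

# Output-layer sizes K for the four independent networks, as stated in the text.
OUTPUT_SIZES = {"pitch": 4, "duration": 1, "volume": 1, "pause": 1}

rng = random.Random(0)
x = integrate([0.0, 1.0, 0.5], punct_conf=0.9, quote_conf=0.1)
networks = {name: make_mlp(len(x), 8, k, rng) for name, k in OUTPUT_SIZES.items()}
prosody = {name: forward(net, x) for name, net in networks.items()}
```

Each syllable would get its own integrated feature vector, and the four networks would be trained separately on the training set.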

The above describes how the system of the invention is built: major punctuation prediction and quoted-phrase prediction are added to the text analysis, which contributes two corresponding confidence values as input features of the prosody generation processing unit. These additions can be incorporated into a conventional Mandarin text-to-speech system in two flexible ways. In the first, the confidence values added by the invention assist the prosody generation process in predicting the prosodic parameters, and the original punctuation of the input text is retained. In the second, an appropriate threshold is set, major punctuation marks and quoted phrases are generated wherever the confidence is high, and these replace the corresponding punctuation of the original input text. Beyond the Mandarin text-to-speech field discussed here, the second approach can also be applied to speech-to-text processing, to add appropriate punctuation to transcribed speech, and to word-processing systems, to assist in correcting punctuation.
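The threshold-based approach can be sketched as follows. The function name, the threshold of 0.8, and the use of a single comma as the inserted mark are hypothetical; the source specifies neither a threshold value nor how a particular punctuation mark is chosen.

```python
def insert_punctuation(words, confidences, threshold=0.8, mark="，"):
    """Insert a punctuation mark after each word whose confidence is high.

    words: segmented words of the input text, without major punctuation.
    confidences: one major punctuation confidence per word, in [0, 1].
    threshold and mark are illustrative placeholders.
    """
    if len(words) != len(confidences):
        raise ValueError("each word needs exactly one confidence value")
    pieces = []
    for word, conf in zip(words, confidences):
        pieces.append(word)
        if conf >= threshold:
            pieces.append(mark)
    return "".join(pieces)

text = insert_punctuation(
    ["今天", "天氣", "好", "我們", "出門"],
    [0.1, 0.2, 0.9, 0.1, 0.95],
)
```

Raising the threshold trades missed punctuation for fewer spurious insertions, so in practice it would be tuned on held-out text.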

The above detailed description specifically describes one feasible embodiment of the invention. The embodiment is not intended to limit the patent scope of the invention, and any equivalent implementation or modification that does not depart from the technical spirit of the invention shall fall within the patent scope of this case.

In summary, this case is innovative in its technical concept and provides the effects described above, which conventional methods do not achieve. It fully satisfies the statutory requirements of novelty and inventive step for an invention patent. The application is therefore filed in accordance with the law, and the Office is respectfully requested to approve this invention patent application in order to encourage invention.

Claims

A system for Mandarin prosody generation using linguistic features inspired by punctuation, comprising: a word segmentation and part-of-speech tagging unit, which processes the input text by segmenting it into words, tagging the parts of speech, and marking the constituents of each word, so as to provide the information required by the subsequent processing units; a major punctuation prediction unit, which processes the information provided by the word segmentation and part-of-speech tagging unit to generate a major punctuation confidence, wherein the major punctuation confidence measures the likelihood that a major punctuation mark is inserted between adjacent words; a quoted-phrase prediction unit, which, concurrently with the major punctuation prediction unit, processes the information provided by the word segmentation and part-of-speech tagging unit to generate a quoted-phrase confidence, wherein the quoted-phrase confidence measures the likelihood that a word string is enclosed in quotation marks and thereby becomes a unit with a special meaning or an emphatic tone; a context analysis unit, which receives and processes the information provided by the word segmentation and part-of-speech tagging unit to generate basic linguistic features; an integrated feature vector unit, which integrates the major punctuation confidence, the quoted-phrase confidence, and the basic linguistic features; and a prosody generation processing unit, which receives the integrated information from the integrated feature vector unit and generates prosodic parameters in units of syllables; wherein the context analysis unit and the prosody generation processing unit must be built using speech data to adjust their parameters.

The system for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 1, wherein the input information of the major punctuation prediction unit comprises words and parts of speech.

The system for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 1, wherein the input information of the quoted-phrase prediction unit comprises words, parts of speech, and punctuation marks.

The system for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 1, wherein the input information of the context analysis unit comprises words, parts of speech, and punctuation marks.

The system for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 1, wherein the speech data used by the context analysis unit and the prosody generation processing unit comprise: a training set, used to train the parameters of the conventional MLP, including the connection weights between nodes in different layers; a development set, used to determine the optimal number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation processing unit; and a testing set, used to evaluate the performance of the prosody generation processing unit in predicting the four prosodic parameters of pitch, duration, volume, and pause length.

A method for Mandarin prosody generation using linguistic features inspired by punctuation, comprising: Step 1, segmenting the input text into words and tagging the parts of speech; Step 2, building in parallel a major punctuation prediction model and a quoted-phrase prediction model; Step 3, building a context analysis module that considers neighboring words in context to generate basic linguistic features; and Step 4, building an MLP-based prosody generation processing module; wherein the major punctuation prediction model uses the 11-tag scheme or the 2-tag scheme with input features comprising the word, the word length, and the part of speech, uses a CRF++ model to establish the corresponding probability field for each combination of input features, obtains the best tag sequence from the probability field, and combines the CRF with finite-state-machine rules to predict major punctuation marks and produce the corresponding confidence values, the word length being the number of syllables in the word; wherein the quoted-phrase prediction model uses the 8-tag scheme or the 19-tag scheme to predict the tags related to quoted phrases, with input features comprising the word, the word length, the part of speech, and punctuation marks other than quotation marks, uses a CRF++ model to obtain the best tag sequence, and combines the CRF with finite-state-machine rules to predict quoted phrases and produce the corresponding confidence values, the word length being the number of syllables in the word; and wherein Step 3 and Step 4 must use speech data to adjust their parameters.

The method for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 6, wherein Step 1 produces the word segmentation, the part-of-speech tags, and information on the constituents of each word.

The method for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 6, wherein the speech data used in Step 3 and Step 4 comprise: a training set, used to train the parameters of the conventional MLP, including the connection weights between nodes in different layers; a development set, used to determine the optimal number of neighboring words for the context analysis unit and the number of nodes in the hidden layer of the prosody generation processing unit; and a testing set, used to evaluate the performance of the prosody generation processing unit in predicting the four prosodic parameters of pitch, duration, volume, and pause length.

The method for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 6, wherein the tags of the 11-tag scheme are: B1 (first), B2 (second), B3 (third), B4 (fourth), I (middle word when the word count is odd and less than nine), M (middle words when the word count is nine or more), E4 (fourth from last), E3 (third from last), E2 (second from last), E1 (last), and S (single word); and wherein the 2-tag scheme may alternatively be used, in which one of two tags is assigned after each word in the sentence to indicate whether a major punctuation mark is present.

The method for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 6, wherein the tags of the 8-tag scheme are: B (first), B2 (second), B3 (third), I (middle word when the word count is three), M (middle words when the word count is greater than three), E (last), S (single word), and O (words outside quoted phrases).

The method for Mandarin prosody generation using linguistic features inspired by punctuation as described in claim 6, wherein the tags of the 19-tag scheme comprise the first seven tags of the 8-tag scheme, which describe the structure inside quoted phrases, and a new set of 12 tags, which describe the structure of non-quoted phrases within the sentence; the latter take the form Xy, where X belongs to the set {P, M, F}, denoting before the quoted phrase, between quoted phrases, and after the quoted phrase, respectively, and y belongs to the set {b, m, e, s}, denoting the starting word, a middle word, the ending word, and a single word, respectively.