CN1604185B - Voice synthesizing system and method by utilizing length variable sub-words - Google Patents

Voice synthesizing system and method by utilizing length variable sub-words


Publication number: CN1604185B
Application number: CN 03164848
Other languages: Chinese (zh)
Other versions: CN1604185A (en)
Application filed by: 摩托罗拉公司 (Motorola)
Priority to: CN 03164848
Publication of CN1604185A
Application granted; publication of CN1604185B



A system and method for synthesizing speech from input text. The method comprises the following steps: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms corresponding to the input text string; retrieving from the sound inventory phoneme string waveforms corresponding to the input text string; retrieving from the sound inventory single phoneme waveforms corresponding to the input text string; and concatenating the waveforms to generate synthesized speech corresponding to the input text string.


Speech synthesis system and method using variable-length sub-words

Technical Field

[0001] The present invention relates generally to a method and system for speech synthesis using a relatively small sound inventory. The invention is particularly suitable for, but not limited to, speech synthesis on handheld devices such as mobile phones and personal digital assistants.

Background

[0002] Well-known sophisticated speech synthesis techniques use a concatenative approach. This approach uses actual recordings of spoken utterances stored in an utterance database. Portions of the utterances are recombined, or concatenated, to generate a variety of spoken phrases. The recombined portions may include complete words, word segments, or even smaller segments of single syllables. When larger word segments are concatenated, the resulting synthesized speech sounds more natural. However, using larger word segments requires a large amount of memory to hold the sound data if the sound database is to support synthesis of a reasonably large vocabulary.

[0003] The size of such a sound database can be reduced by storing only smaller segments, such as diphones or single phones; however, the quality of the resulting synthesized speech then usually declines. This is because it is difficult to produce the correct pitch and transition durations between very short speech segments, and therefore difficult to produce natural-sounding speech. Sophisticated techniques exist that analyze small phoneme-chain units such as CV and VCV (where C denotes a consonant and V a vowel). However, the algorithms that implement these techniques are very complex and processor intensive.

[0004] Other methods for reducing the size of the sound database associated with a speech synthesis system include a technique known as formant synthesis. With formant synthesis, the human voice is simulated using only filtered electronic excitation signals, so no sound database is needed. However, the resulting synthesized speech usually sounds highly unnatural and "robotic".

[0005] The popularity of handheld electronic devices such as mobile phones and personal digital assistants (PDAs) has increased the demand for high-quality speech synthesizers. The convenience of such handheld devices increases greatly when they have a built-in speech synthesizer. For example, e-mail and text messages, such as SMS messages, can be synthesized into speech for the user of a mobile phone to listen to. However, the memory and processing resources of such handheld electronic devices are usually very limited. Speech synthesizers built into such devices must therefore use compact and efficient sound databases.

[0006] There is therefore a need for an improved method and system for speech synthesis that uses a compact sound database while still providing natural-sounding speech.

[0007] Summary of the Invention

[0008] According to one aspect, the present invention is a method of speech synthesis comprising: receiving an input text string; comparing the input text string with an indexed sound inventory; retrieving from the sound inventory complete sub-word waveforms that match the input text string; retrieving from the sound inventory phoneme string waveforms that match the input text string; retrieving from the sound inventory single phoneme waveforms that match the input text string; and concatenating the waveforms to produce synthesized speech that corresponds to the input text string.

[0009] The invention preferably includes the step of generating the sound inventory by performing a statistical analysis of a large text corpus to determine common words, and dividing the common words into positional syllables.

[0010] The step of generating the sound inventory may further include a step of classifying the positional syllables, and a step of discarding those syllables that have low clarity.

[0011] The step of generating the sound inventory may further include the steps of calculating the frequency of CV-type sub-words in the large text corpus, and selecting the most common of those sub-words in the large text corpus.

[0012] The step of concatenating the waveforms may include hard concatenation (concatenation that requires almost no signal processing) of the sub-word waveforms, or may include modified concatenation of the phoneme string waveforms and the single phoneme waveforms.

[0013] The modified concatenation preferably includes changing the duration of the concatenated waveform.

[0014] According to another aspect, the present invention is a system for synthesizing speech from input text, comprising a sound inventory containing sub-word waveforms. A multi-level sound unit selector is coupled to the sound inventory, and a multi-layer synthesizer is coupled to the sound unit selector. One level of the sound unit selector is selected according to whether segments of the input text match sub-word waveforms in the sound inventory.

[0015] The multi-layer synthesizer preferably includes a first layer for performing hard concatenation and a second layer for performing modified concatenation.

[0016] The sound inventory may include CV-type sub-word waveforms, and the CV-type sub-word waveforms may be indexed by an annotation file.

[0017] The multi-level sound unit selector preferably includes a first level, which can be coupled to the first layer of the multi-layer synthesizer to perform hard concatenation, and second and third levels, which can be coupled to the second layer of the multi-layer synthesizer to perform modified concatenation.

[0018] In this specification, including the claims, the terms "comprises", "comprising", and similar terms are intended to denote a non-exclusive inclusion, so that a method or apparatus comprising a list of elements does not include those elements only, but may also include other elements not listed.


[0019] In order that the invention may be readily understood and put into practical effect, preferred embodiments will now be described with reference to the accompanying drawings, in which like reference numerals refer to like elements, and in which:

[0020] FIG. 1 is a schematic diagram of the functional components of a speech synthesis system according to the invention;

[0021] FIG. 2 is a flow diagram of a method for generating a sound inventory according to the invention; and

[0022] FIG. 3 is a flow diagram of a speech synthesis method according to the invention.

[0023] Detailed Description of the Preferred Embodiments

[0024] Referring to FIG. 1, there is shown a schematic diagram of the functional components of a system 100 for speech synthesis according to the invention. A sound inventory 110 includes a plurality of sub-word components 120, such as onsets, consonant endings, and CV-type sub-words. The sub-word components 120 are classified using an index 130.

[0025] The sound inventory 110 interfaces with a multi-level unit selector 140. The unit selector 140 decides which of three levels will be used to synthesize the words input to the system 100. The first level of the unit selector 140 is selected when a segment of the input text string can be divided into sub-words whose corresponding waveforms are all contained in the sound inventory 110. The second level is selected when the sub-words needed to synthesize the segment are not included in the sound inventory 110, but phoneme strings in the sound inventory 110 can be used to synthesize the segment. Finally, the third level is selected when the segment can be synthesized only from single phonemes included in the sound inventory 110.

[0026] The unit selector 140 interfaces with a two-layer synthesizer 150, which synthesizes the speech output by the system 100. A first layer 160 performs hard concatenation synthesis on sub-words from the first level of the unit selector 140. A second layer 170 of the synthesizer 150 performs modified concatenation synthesis on speech components received from the second or third level of the unit selector 140. Hard concatenation and modified concatenation are described in detail later in this specification. The dashed arrows in FIG. 1 indicate that speech components received from the second or third level of the unit selector 140 may also be concatenated using hard concatenation.

[0027] Referring to FIG. 2, there is shown a flow diagram of a method 200 for generating the sound inventory 110. In step 205, a statistical analysis is performed on a large text corpus. The analysis includes determining the words that make up a significant majority of the words in any given sample of input text. Most Western languages, such as English, have more than 150,000 words, comprising at least 41,000 positional syllables. Then, in step 210, the common words from step 205 are divided into positional syllables. A positional syllable is defined as a syllable with a word position tag, as follows:

[0028] Ws: the syllable in a monosyllabic word;

[0029] Wo: a syllable in a polysyllabic word, excluding the last syllable of the word; and

[0030] Wf: the last syllable in a polysyllabic word.
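By way of illustration only, the position tags defined above might be assigned as in the following Python sketch. The function name `tag_positional_syllables` and the representation of a word as a list of syllable strings are assumptions made for this example and are not part of the specification.

```python
from typing import List, Tuple


def tag_positional_syllables(syllables: List[str]) -> List[Tuple[str, str]]:
    """Attach a word position tag (Ws, Wo or Wf) to each syllable of one word.

    Hypothetical helper: the specification defines the tags but does not
    say how a word is split into syllables, so that split is taken as input.
    """
    if not syllables:
        return []
    if len(syllables) == 1:
        # Ws: the only syllable of a monosyllabic word
        return [(syllables[0], "Ws")]
    # Wo: every syllable of a polysyllabic word except the last one
    tagged = [(s, "Wo") for s in syllables[:-1]]
    # Wf: the last syllable of a polysyllabic word
    tagged.append((syllables[-1], "Wf"))
    return tagged


print(tag_positional_syllables(["bae", "tax", "riy"]))
print(tag_positional_syllables(["low"]))
```

The syllable strings used here are illustrative labels only; any syllabification scheme could supply the input list.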

[0031] The method 200 then continues to step 215, where the phonemes in each syllable are classified. Phonemes can be broadly divided into four classes: consonants, semivowels, vowels, and voiced codas. Clarity differs between the classes. Thus, in step 220, phonemes with low clarity can be discarded. The definition of a speech unit according to the invention is therefore syllable-based, and the length of a speech unit varies from one syllable to four or more syllables. This means that the following combinations can be omitted from the sound inventory 110: consonant to consonant, vowel to consonant, semivowel to consonant, and nasal coda to consonant. However, the following combinations must be considered when concatenating speech units: consonant to vowel, semivowel to vowel, and vowel to semivowel. Consonant-string endings can be shared by different words. As a result, the more than 41,000 positional syllables mentioned above are reduced to only 16,000 CV-type sub-words. Table 1 below gives an example of how the above sub-word units are used to describe the syllable transitions in, for example, "Battery level is low":

[0032] Table 1

[0033] Syllable transitions in "Battery level is low"

[0034]

Word      CV-like unit
Battery   b, ae(Wo)+tax(Wo)+riy(Wf)
Level     l, eh(Wo)+vaxl(Wf)
Is        , Ih(Ws)+s
Low       l, ow(Ws)

[0035] The method 200 then continues to step 225, where the frequency of the CV-type sub-words is calculated from the word frequencies and unit frequencies in a dictionary (which, according to a preferred embodiment of the invention, includes more than 190,000 entries). Statistical analysis of English text shows that about 6,900 words cover about 90% of input text, and about 4,100 words cover about 85% of input text. The number of occurrences of each sub-word is defined as follows:

[0036] $n_i = n_{1i} + n_{2i}$

[0037] where $n_i$ is the number of occurrences of the i-th sub-word, $n_{1i}$ is the number of occurrences of words containing the i-th sub-word, and $n_{2i}$ is the number of times the i-th sub-word occurs in the dictionary. From $n_i$, $i = 1, 2, \ldots, N$ (where N is the number of sub-words in the dictionary), the frequency of each sub-word can be calculated.
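As a rough illustration of the count defined above, the following Python sketch computes n_i as the sum of n_1i and n_2i for each sub-word. It assumes that n_1i is obtained from word counts in a text corpus and that the dictionary maps each word to its list of sub-words; the names `subword_counts`, `relative_frequencies`, `corpus_word_counts`, and `lexicon` are hypothetical. Normalizing by the total count to obtain a frequency is likewise an assumption, since the specification does not state how the frequency is derived from n_i.

```python
from collections import Counter
from typing import Dict, List


def subword_counts(corpus_word_counts: Dict[str, int],
                   lexicon: Dict[str, List[str]]) -> Counter:
    """Return n_i = n_1i + n_2i for every sub-word i found in the lexicon."""
    n1 = Counter()  # n_1i: occurrences of words that contain sub-word i
    n2 = Counter()  # n_2i: occurrences of sub-word i in the dictionary itself
    for word, subwords in lexicon.items():
        n2.update(subwords)
        for sw in set(subwords):
            n1[sw] += corpus_word_counts.get(word, 0)
    return n1 + n2


def relative_frequencies(counts: Counter) -> Dict[str, float]:
    """Normalize the counts so that they sum to 1 (an assumed definition)."""
    total = sum(counts.values())
    return {sw: n / total for sw, n in counts.items()}


# Toy data, invented for illustration only.
lexicon = {"low": ["l", "ow"], "level": ["l", "eh", "vaxl"]}
corpus_word_counts = {"low": 3, "level": 2}
counts = subword_counts(corpus_word_counts, lexicon)
freqs = relative_frequencies(counts)
print(counts["l"], counts["ow"])
```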

[0038] Finally, in step 230, the most common sub-words, which will cover most of the expected input text, are selected. When applied to English, the results of the above calculation show that 20% of the sub-words cover more than 85% of English text. Accordingly, about 2,400 sub-words are selected to form the speech unit inventory. The speech waveform associated with each sub-word is extracted from a speech corpus to form the sound inventory 110. The method 200 thus greatly reduces redundancy in the sound inventory 110.
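The selection in step 230 might be sketched as follows, assuming that sub-words are taken in descending order of count until a target share of all occurrences is covered. The stopping rule, the function name `select_inventory`, and the toy counts are assumptions for illustration; the specification states only that the most common sub-words are selected.

```python
from typing import Dict, List


def select_inventory(counts: Dict[str, int], target_coverage: float) -> List[str]:
    """Pick the most frequent sub-words until target_coverage is reached.

    Coverage is measured as the share of all sub-word occurrences that the
    selected sub-words account for (an assumed measure).
    """
    total = sum(counts.values())
    selected: List[str] = []
    covered = 0
    # Sort by descending count; ties are broken alphabetically for determinism.
    for unit, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if total == 0 or covered / total >= target_coverage:
            break
        selected.append(unit)
        covered += n
    return selected


# Toy counts, invented for illustration only.
toy_counts = {"a": 50, "b": 30, "c": 15, "d": 5}
print(select_inventory(toy_counts, 0.85))
```

A larger coverage target yields a larger inventory, which is the trade-off between memory use and the share of text that can be handled at the first selector level.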

[0039] The speech waveform associated with each sub-word in the sound inventory 110 is indexed by the index 130. The index 130 may comprise a simple annotation file kept together with the recorded speech waveforms. The index 130 is thus used to identify the phoneme strings and single phonemes contained in the sub-word waveforms.
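One possible shape for such an annotation is sketched below in Python. The specification does not describe the annotation file format, so the `Annotation` record, its fields, the sample offsets, and the `find_span` helper are all hypothetical. The sketch shows how phoneme boundaries stored alongside a sub-word waveform would let a phoneme string or a single phoneme be located inside it.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Annotation:
    """Hypothetical annotation record for one recorded sub-word waveform."""
    subword: str
    # One (phoneme label, start sample, end sample) entry per phoneme, in order.
    phonemes: List[Tuple[str, int, int]]


def find_span(ann: Annotation, target: Sequence[str]) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample offsets of a phoneme string inside the
    sub-word waveform, or None if the string does not occur in it."""
    if not target:
        return None
    labels = [p[0] for p in ann.phonemes]
    k = len(target)
    for i in range(len(labels) - k + 1):
        if labels[i:i + k] == list(target):
            return ann.phonemes[i][1], ann.phonemes[i + k - 1][2]
    return None


# Sample offsets below are invented for illustration only.
ann = Annotation("vaxl", [("v", 0, 400), ("ax", 400, 1300), ("l", 1300, 2000)])
print(find_span(ann, ["ax", "l"]))
print(find_span(ann, ["ax"]))
```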

[0040] Referring to FIG. 3, there is shown a flow diagram of a speech synthesis method 300 according to the invention. The method 300 is invoked at a start step 305, for example when the user of a handheld device receives a text message and wants it synthesized into speech. In step 310, the speech synthesis system 100 receives an input text string, for example the text message just mentioned. In step 315, the input text string is preprocessed. The preprocessing divides the input text string into sub-word segments that include position information associated with each segment. Then, in step 320, a segment of the input text string is compared with the sound inventory 110. In step 325, it is determined whether a complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string. If so, the method 300 proceeds to step 330, where the matching sub-word waveform is retrieved from the sound inventory 110. Next, in step 360, the sub-word waveforms are concatenated. Steps 330 and 360 relate to the first level of the unit selector 140, and the sub-words are concatenated by hard concatenation in the first layer 160 of the two-layer synthesizer 150. Hard concatenation is described in detail below. Next, in step 335, it is determined whether further segments of the input text string remain to be compared with the sound inventory 110. If so, the method 300 returns to step 320, where the next segment of the input text string is compared with the sound inventory 110; otherwise, the method 300 ends at step 340.

[0041] If it is determined in step 325 that no complete sub-word waveform in the sound inventory 110 matches the current segment of the input text string, the method 300 proceeds to step 345 to determine whether the sound inventory 110 contains multiple phoneme string waveforms that match the current segment. If so, the method 300 proceeds to step 350, where the matching phoneme string waveforms are retrieved from the sound inventory 110. Next, in step 365, the phoneme string waveforms are concatenated. Steps 350 and 365 relate to the second level of the unit selector 140, and the phoneme strings are concatenated by modified concatenation, performed by the second layer 170 of the synthesizer 150. Modified concatenation is also described in detail below. The method 300 then returns to step 335 to determine whether further segments of the input text string remain to be compared with the sound inventory 110.

[0042] If it is determined in step 345 that no phoneme string waveforms in the sound inventory 110 match the current segment of the input text string, the method 300 proceeds to step 355, where single phoneme waveforms are retrieved from the sound inventory 110. Then, in step 365, the single phoneme waveforms are concatenated so as to best correspond to the current segment of the input text string. Here, steps 355 and 365 relate to the third level of the unit selector 140, and the single phonemes are again concatenated by modified concatenation, performed by the second layer 170 of the synthesizer 150. The method 300 then returns to step 335 to determine whether further segments of the input text string remain to be compared with the sound inventory 110. When all segments of the input text string have been compared with the indexed sound inventory 110, the method 300 ends at step 340.
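The three-level fallback of steps 325, 345, and 355 could be sketched in Python as follows. This is a minimal illustration under stated assumptions: a segment is represented as a tuple of phoneme labels, the inventory is represented as three sets of indexed units, and a segment is split into phoneme strings by greedy longest match. None of these representations, nor the names `select_units` and `split_into_strings`, come from the specification.

```python
from typing import List, Optional, Set, Tuple

Phonemes = Tuple[str, ...]


def split_into_strings(segment: Phonemes,
                       strings: Set[Phonemes]) -> Optional[List[Phonemes]]:
    """Greedy longest-match split of a segment into indexed phoneme strings
    of two or more phonemes. Returns None if the greedy split fails; a
    backtracking search could succeed in some cases where this one does not."""
    parts: List[Phonemes] = []
    i = 0
    while i < len(segment):
        for j in range(len(segment), i + 1, -1):  # longest candidate first
            if segment[i:j] in strings:
                parts.append(segment[i:j])
                i = j
                break
        else:
            return None
    return parts


def select_units(segment: Phonemes, subwords: Set[Phonemes],
                 strings: Set[Phonemes],
                 phonemes: Set[str]) -> Tuple[int, List[Phonemes]]:
    """Return (selector level, units to retrieve) for one text segment."""
    if segment in subwords:                 # level 1: complete sub-word (step 330)
        return 1, [segment]
    parts = split_into_strings(segment, strings)
    if parts is not None:                   # level 2: phoneme strings (step 350)
        return 2, parts
    missing = [p for p in segment if p not in phonemes]
    if missing:
        raise KeyError(f"phonemes not in inventory: {missing}")
    return 3, [(p,) for p in segment]       # level 3: single phonemes (step 355)


# Toy inventory, invented for illustration only.
subwords = {("b", "ae"), ("l", "ow")}
strings = {("t", "ax"), ("r", "iy")}
phonemes = {"b", "ae", "t", "ax", "r", "iy", "l", "ow", "s"}
for seg in [("l", "ow"), ("t", "ax", "r", "iy"), ("ax", "s")]:
    print(select_units(seg, subwords, strings, phonemes))
```

In this sketch a result at level 1 would be passed to hard concatenation, and a result at level 2 or 3 to modified concatenation.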

[0043] Thus, according to the method 300, waveforms from the sound inventory 110 are concatenated on the basis of a "best fit" analysis of the segments of the input text string. Hard concatenation, performed by the first layer of the two-layer synthesizer 150, means that multiple waveforms from the sound inventory 110 are simply spliced together without modification. When the concatenated waveforms are large enough that the total duration of the concatenated waveform is very close to the duration of the corresponding input text string segment as naturally spoken, this process results in natural-sounding speech.

[0044] On the other hand, modified concatenation is used when hard concatenation cannot produce natural-sounding speech. The second layer 170 of the synthesizer 150 performs modified concatenation, in which the duration of the concatenated waveform is adjusted to obtain more natural-sounding speech.
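The difference between the two layers can be illustrated with the following Python sketch, in which waveforms are plain lists of samples. Hard concatenation is a simple splice. For modified concatenation, the specification says only that the duration is changed and names no algorithm, so the linear-interpolation resampling used here is a stand-in chosen for brevity; it also shifts pitch, which a practical synthesizer would normally avoid. The function names and the factor of 0.75 are illustrative assumptions.

```python
from typing import List, Sequence


def hard_concatenate(waveforms: Sequence[Sequence[float]]) -> List[float]:
    """Splice the waveforms together without any modification."""
    out: List[float] = []
    for w in waveforms:
        out.extend(w)
    return out


def scale_duration(waveform: Sequence[float], factor: float) -> List[float]:
    """Change the length of a waveform to about factor * original length
    by linear interpolation (a crude stand-in for duration modification)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    n = len(waveform)
    if n == 0:
        return []
    m = max(1, round(n * factor))
    if m == 1 or n == 1:
        return [waveform[0]] * m
    out: List[float] = []
    for k in range(m):
        pos = k * (n - 1) / (m - 1)      # position in the original waveform
        i = int(pos)
        frac = pos - i
        nxt = waveform[min(i + 1, n - 1)]
        out.append(waveform[i] * (1 - frac) + nxt * frac)
    return out


def modified_concatenate(waveforms: Sequence[Sequence[float]],
                         factor: float) -> List[float]:
    """Splice the waveforms, then adjust the duration of the result."""
    return scale_duration(hard_concatenate(waveforms), factor)


a, b = [0.0] * 4, [1.0] * 4
print(len(hard_concatenate([a, b])), len(modified_concatenate([a, b], 0.75)))
```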

[0045] Modified concatenation can be better understood with reference to Table 2 below.

[0046] Table 2

[Table 2: see original document, page 8]

[0048] Table 2 gives examples of ten different cases, in which sub-word components 120 of the sound inventory 110 are divided into left and right text. The rightmost column of Table 2 describes the type of concatenation required when concatenating the sub-word components 120 to produce natural-sounding synthesized speech. For example, case 2 in Table 2 shows that when modified concatenation is used to concatenate two vowel waveforms from the sound inventory 110, the duration of the concatenated waveform must be reduced by 25% to obtain natural-sounding speech.

[0049] Alternatively, case 9 in Table 2 shows that when concatenating two waveforms consisting of a vowel and a consonant, the duration of the concatenated waveform need not be modified. The first layer 160 of the synthesizer 150 will therefore perform this hard concatenation.

[0050] The present invention is thus an improved method and system for speech synthesis that uses a relatively small sound inventory 110. A properly constructed sound inventory 110 yields an indexed set of waveforms that can synthesize about 85% of input text strings by hard concatenation. The remaining 15% of input text strings can be synthesized using the modified concatenation technique described. The sound inventory 110 is therefore highly compact and has minimally redundant waveforms, making it particularly suitable for handheld devices with limited memory. Moreover, the reduced size of the sound inventory 110 makes the retrieval algorithm of the invention more efficient and faster.

[0051] The above detailed description provides a preferred embodiment only, and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiment enables those skilled in the art to implement the preferred exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements and steps without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims (10)

  1. A method of speech synthesis in a handheld device, comprising: receiving an input text string; comparing the input text string with an indexed sound inventory, the indexed sound inventory containing CV-type sub-word waveforms, indexed phoneme string waveforms contained in the CV-type sub-word waveforms, and indexed single phoneme waveforms contained in the CV-type sub-word waveforms; searching the sound inventory for complete CV-type sub-word waveforms corresponding to the input text string; if no complete CV-type sub-word waveform corresponding to the input text string is retrieved, searching the sound inventory for indexed phoneme string waveforms corresponding to the input text string; if no indexed phoneme string waveform corresponding to the input text string is retrieved, searching the sound inventory for indexed single phoneme waveforms corresponding to the input text string; and concatenating the retrieved waveforms to provide synthesized speech corresponding to the input text string.
  2. The method of claim 1, further comprising the step of generating the sound inventory by: performing a statistical analysis of a large text corpus to determine common words; and dividing the common words into positional syllables.
  3. The method of claim 2, wherein the step of generating the sound inventory further comprises: classifying the phonemes of each of the positional syllables; and discarding the phonemes of consonant-to-consonant, vowel-to-consonant, semivowel-to-consonant, and nasal-coda-to-consonant combinations within the positional syllables, to form CV-type sub-words.
  4. The method of claim 3, wherein the step of generating the sound inventory further comprises: calculating the frequency of the CV-type sub-words in the large text corpus; selecting the most common CV-type sub-words in the large text corpus; and extracting from the large text corpus a sound inventory containing the most common CV-type sub-words.
  5. The method of claim 1, wherein the step of concatenating the retrieved waveforms comprises hard concatenation of the retrieved CV-type sub-word waveforms.
  6. The method of claim 1, wherein the step of concatenating the retrieved waveforms comprises modified concatenation of the retrieved indexed phoneme string waveforms and modified concatenation of the retrieved indexed single phoneme waveforms.
  7. The method of claim 6, wherein the modified concatenation comprises changing the duration of the concatenated waveform.
  8. A system for synthesizing speech from an input text string, comprising: a sound inventory containing CV-type sub-word waveforms, indexed phoneme string waveforms contained in the CV-type sub-word waveforms, and indexed single phoneme waveforms contained in the CV-type sub-word waveforms; a multi-level sound unit selector, connected to the sound inventory, for selecting from the sound inventory waveforms corresponding to the input text string, comprising a first level for selecting the CV-type sub-word waveforms, a second level for selecting the indexed phoneme string waveforms, and a third level for selecting the indexed single phoneme waveforms; and a multi-layer synthesizer, connected to the multi-level sound unit selector, for concatenating the selected waveforms to provide synthesized speech for the input text string.
  9. The system of claim 8, wherein the multi-layer synthesizer comprises a first layer for performing hard concatenation on the selected CV-type sub-word waveforms and a second layer for performing modified concatenation on the selected phoneme string waveforms and the selected single phoneme waveforms respectively.
  10. The system of claim 8, wherein the indexed phoneme string waveforms and single phoneme waveforms contained in the CV-type sub-word waveforms are indexed using an associated annotation file.
CN 03164848 2003-09-29 2003-09-29 Voice synthesizing system and method by utilizing length variable sub-words CN1604185B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 03164848 CN1604185B (en) 2003-09-29 2003-09-29 Voice synthesizing system and method by utilizing length variable sub-words


Publications (2)

Publication Number   Publication Date
CN1604185A           2005-04-06
CN1604185B           2010-05-26



Family Applications (1)

Application Number Title Priority Date Filing Date
CN 03164848 CN1604185B (en) 2003-09-29 2003-09-29 Voice synthesizing system and method by utilizing length variable sub-words

Country Status (1)

Country Link
CN (1) CN1604185B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102334119B (en) * 2009-02-26 2014-05-21 国立大学法人丰桥技术科学大学 Speech search device and speech search method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5682501A (en) 1994-06-22 1997-10-28 International Business Machines Corporation Speech synthesis system
US6064960A (en) 1997-12-18 2000-05-16 Apple Computer, Inc. Method and apparatus for improved duration modeling of phonemes
US20020184030A1 (en) 2001-06-04 2002-12-05 Hewlett Packard Company Speech synthesis apparatus and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
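Jack C. Richards et al., translated by Guan Yanhong. Longman Dictionary of Language Teaching and Applied Linguistics, 1. Foreign Language Teaching and Research Press, 2000, 460-461.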

Also Published As

Publication number Publication date
CN1604185A (en) 2005-04-06

Similar Documents

Publication Publication Date Title
Potamianos et al. Robust recognition of children's speech
Pitrelli et al. The IBM expressive text-to-speech synthesis system for American English
CA2545873C (en) Text-to-speech method and system, computer program product therefor
EP1184839B1 (en) Grapheme-phoneme conversion
Allen Synthesis of speech from unrestricted text
US8073693B2 (en) System and method for pronunciation modeling
US8219398B2 (en) Computerized speech synthesizer for synthesizing speech from text
US7596499B2 (en) Multilingual text-to-speech system with limited resources
JP3408477B2 (en) Formant-based speech synthesizer semitone clause linked performing crossfade independently in the filter parameters and the source region
TWI413105B (en) Multi-lingual text-to-speech synthesis system and method
TWI281146B (en) Apparatus and method for synthesized audible response to an utterance in speaker-independent voice recognition
US5911129A (en) Audio font used for capture and rendering
US7315811B2 (en) System and method for accented modification of a language model
EP1704558B1 (en) Corpus-based speech synthesis based on segment recombination
EP2815397B1 (en) System and method for generating name pronunciations
US7269557B1 (en) Coarticulated concatenated speech
US20070192105A1 (en) Multi-unit approach to text-to-speech synthesis
KR101120710B1 (en) Front-end architecture for a multilingual text-to-speech system
US20020099547A1 (en) Method and apparatus for speech synthesis without prosody modification
US7869999B2 (en) Systems and methods for selecting from multiple phonectic transcriptions for text-to-speech synthesis
Dutoit High-quality text-to-speech synthesis: An overview
EP1071074A2 (en) Speech synthesis employing prosody templates
US7716052B2 (en) Method, apparatus and computer program providing a multi-speaker database for concatenative text-to-speech synthesis
JPWO2010018796A1 (en) Exception word dictionary creation device, exception word dictionary creation method and program, and speech recognition device and speech recognition method
US20090006097A1 (en) Pronunciation correction of text-to-speech systems between different spoken languages

Legal Events

Date Code Title Description
C06 Publication
C10 Request of examination as to substance
C14 Granted
C41 Transfer of the right of patent application or the patent right
ASS Succession or assignment of patent right



Effective date: 20100908

COR Bibliographic change or correction in the description