CN104217713A - Tibetan-Chinese speech synthesis method and device - Google Patents


Info

Publication number
CN104217713A
CN104217713A (application CN201410341827.9A)
Authority
CN
China
Prior art keywords: chinese, model, speaker, language, speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410341827.9A
Other languages
Chinese (zh)
Inventor
杨鸿武
王海燕
徐世鹏
裴东
甘振业
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwest Normal University filed Critical Northwest Normal University
Priority to CN201410341827.9A priority Critical patent/CN104217713A/en
Publication of CN104217713A publication Critical patent/CN104217713A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a Tibetan-Chinese speech synthesis method and device that synthesize input Chinese or Tibetan sentences from a pre-built Chinese-Tibetan mixed corpus, so that both Chinese and Tibetan speech can be synthesized by a single system. Compared with a traditional HMM-based (hidden Markov model based) speech synthesis system, a speaker adaptive training process is added in the training phase to obtain a Chinese-Tibetan mixed-speech average voice model; this process reduces the influence of speaker differences in the speech corpus and improves the quality of the synthesized speech. On the basis of the average voice model, Tibetan or Chinese speech of good naturalness and fluency can be synthesized from only a small amount of Tibetan or Chinese corpus data through a speaker adaptive transformation algorithm. The research is significant for promoting communication with ethnic minorities and the development of minority-language speech technology.

Description

Mandarin-Tibetan bilingual speech synthesis method and device
Technical field
The present invention relates to the field of multilingual speech synthesis and, more specifically, provides a method and device for cross-language Mandarin-Tibetan bilingual speech synthesis.
Background technology
In recent years, multilingual speech synthesis has become a research hotspot in the field of human-machine spoken interaction. With this technology, spoken human-machine interaction in different languages can be realized within a single system, which is of great practical value for countries where several languages are spoken. China has numerous minority languages and dialects, which makes research on this technology especially significant. In the Tibetan areas of China, for example, Standard Chinese (Putonghua), Tibetan and local dialects are all spoken; a single speech system able to synthesize multiple languages across language boundaries would be of great value for promoting communication with ethnic minorities and advancing the development of minority-language speech technology.
The main techniques studied in multilingual speech synthesis, both in China and abroad, are unit-selection concatenative synthesis and statistical parametric synthesis. Concatenative synthesis analyzes the input text to obtain the basic unit sequence, selects suitable units from a pre-recorded and annotated speech corpus, adjusts them slightly, and concatenates them into the final synthesized speech. Because the units of the synthesized speech are extracted directly from the corpus, the voice quality of the original speaker is preserved. However, a concatenative system generally requires a large-scale corpus: preparing and annotating the material is laborious and time-consuming, the synthesis quality depends heavily on the corpus, and the approach is sensitive to recording conditions, so robustness is low. Statistical parametric synthesis instead decomposes the input speech signal into parameters and builds statistical models; the trained models predict the speech parameters of the text to be synthesized, and these parameters are passed to a parametric synthesizer to produce the output speech. This method needs little data to build a system, requires little manual intervention, and produces smooth, fluent speech with high robustness; however, the voice quality of the synthesized speech is lower and its expressiveness is comparatively flat.
Because HMM-based statistical parametric synthesis can synthesize the voices of different speakers through speaker adaptive transformation, it has become a research hotspot in cross-language multilingual speech synthesis. HMM-based multilingual systems realize multilingual synthesis with mixed-language modeling, phoneme mapping, or state mapping. Existing research, however, mostly targets languages that have large corpora and relatively mature synthesis technology, and little work addresses dialects, native minority languages, and languages whose speech resources are hard to obtain. In current research at home and abroad, no multilingual speech synthesis system for Mandarin/minority-language or Mandarin/dialect has yet been realized. Current multilingual synthesis research focuses on mainstream languages and mostly adopts phoneme mapping or state mapping, but both methods require large amounts of bilingual speech data. For a resource-poor language such as Tibetan, no large-scale Mandarin-Tibetan bilingual speech corpus exists, so these methods are difficult to apply to Mandarin-Tibetan multilingual synthesis.
Summary of the invention
The present invention addresses the problems identified in the background art — that existing multilingual speech synthesis systems neglect dialects, native minority languages and languages with scarce speech resources, such as Tibetan, and cannot realize Mandarin-Tibetan multilingual speech synthesis — by providing a Mandarin-Tibetan bilingual speech synthesis method and device.
To solve the above technical problems, the technical solution adopted by the present invention is a Mandarin-Tibetan bilingual speech synthesis method comprising the steps:
A. Taking the International Phonetic Alphabet (IPA) as reference, obtain the IPA transcription of the input Tibetan syllables and compare it with the IPA of Chinese pinyin: identical parts are labeled directly with SAMPA-SC, while differing parts are labeled with unused keyboard symbols; the SAMPA-T transcription algorithm then completes the automatic SAMPA-T annotation of the Tibetan text corpus;
B. According to the similarity between Tibetan and Mandarin, design a phonetic labeling system and question set common to Chinese and Tibetan on the basis of the Mandarin phonetic labeling system;
C. Using the Mandarin-Tibetan speech data of multiple speakers, train a mixed-language average voice model by HMM-based speaker adaptive training;
D. Using a small amount of speech from one speaker of the target language (Tibetan or Chinese) to be synthesized, obtain a speaker-adapted model by speaker adaptive transformation, and modify and update the adapted model;
E. Input the text to be synthesized, generate the speech parameters, and synthesize Tibetan or Chinese speech.
Further, the SAMPA-T transcription algorithm in step A comprises the following steps:
First read in the Tibetan sentence text; then segment sentences and syllables according to the shad (sentence-final stroke) and the tsheg (syllable delimiter) to obtain the syllable list; for each syllable, locate the root letter, decompose the letter stack to separate the initial and the final, and finally obtain the SAMPA-T string of the syllable by looking it up in the initial SAMPA-T table and the final SAMPA-T table; the letter-stack decomposition is performed according to a letter-stack decomposition table.
Further, designing the common Chinese-Tibetan phonetic labeling system and question set in step B comprises the steps:
First, the Tibetan initials and finals consistent with Mandarin are labeled with Chinese pinyin, and those inconsistent with Mandarin are labeled with Tibetan pinyin;
Then, all initials and finals of Mandarin and Tibetan, plus silence and pause, are chosen as the synthesis units of the context-dependent MSD-HSMM, and a context annotation format is designed to label the context-dependent features of each synthesis unit at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer and sentence layer;
Finally, a common Mandarin-Tibetan question set is designed on the basis of a Mandarin context-dependent question set, extended with questions on the synthesis units peculiar to Tibetan so as to reflect Tibetan's special pronunciations; the question set contains more than 3000 context-dependent questions covering all features of the context-dependent labels.
Further, obtaining the mixed-language average voice model by speaker adaptive training in step C comprises the steps:
a. Perform speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus and extract the acoustic parameters:
(1) extract the mel-cepstral coefficients, logarithmic fundamental frequency (F0) and aperiodicity indices;
(2) compute their first-order and second-order differences;
b. Train the HMMs of the acoustic parameters in combination with the context attribute set:
(1) train the HMMs of the spectrum and F0 parameters;
(2) train the multi-space distribution hidden semi-Markov models (MSD-HSMM) of the state duration parameters;
c. Using the small single-speaker Chinese corpus and the single-speaker Tibetan corpus, perform speaker adaptive training to obtain the average voice model:
(1) with the constrained maximum likelihood linear regression (CMLLR) algorithm, represent the difference between each training speaker's speech data and the average voice by linear regression functions;
(2) normalize the differences between training speakers with a set of linear regression equations of the state output distributions and state duration distributions;
(3) train the Mandarin-Tibetan mixed-language average voice model, thereby obtaining the context-dependent MSD-HSMMs;
d. Using the single-speaker adaptation data of Chinese and Tibetan, perform speaker adaptive transformation:
(1) with the CMLLR algorithm, compute the mean vectors and covariance matrices of the speaker's state output and state duration probability distributions;
(2) transform the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese to be synthesized with a set of transformation matrices of the state output and state duration distributions;
(3) perform maximum likelihood estimation of the normalized and transformed spectrum, F0 and duration parameters;
e. Modify and update the adapted model:
(1) with the maximum a posteriori (MAP) algorithm, compute the MAP estimation parameters of the state output and duration distributions of the average voice model;
(2) compute the mean vectors of the state output and state duration after adaptive transformation;
(3) compute the weighted-average MAP estimates of the adapted mean vectors;
f. Input the text to be synthesized, perform text analysis on it, and obtain the HMM sequence of the sentence;
g. Perform parameter prediction on the sentence HMM, generate the speech parameters, and obtain the synthesized speech through the parametric synthesizer; the formulas are as follows:
μ_i^(s) = W^(s) ξ_i = A^(s) o_i + b^(s)
m_i^(s) = X^(s) ψ_i = α^(s) d_i + β^(s)
where μ_i^(s) is the state output mean vector of training speaker s and m_i^(s) is its state duration mean; W^(s) = [A^(s), b^(s)] and X^(s) = [α^(s), β^(s)] are the transformation matrices of the state output distribution and state duration distribution differences between training speaker s and the average voice model; o_i and d_i are the mean observation vector and the mean duration, with ξ_i = [o_iᵀ, 1]ᵀ and ψ_i = [d_i, 1]ᵀ.
Further, obtaining the speaker-adapted model by speaker adaptive transformation using a small amount of speech from one speaker of the target language (Tibetan or Chinese) in step D, and modifying and updating the adapted model, comprises the steps:
First, after speaker adaptive training, use the HSMM-based CMLLR adaptation algorithm to compute the mean vectors and covariance matrices of the state output and duration probability distributions of the voice transformation. In state i, the transformation equations of the feature vector o and the state duration d are:
b_i(o) = N(o; Aμ_i − b, AΣ_i Aᵀ) = |A⁻¹| N(Wξ; μ_i, Σ_i)
p_i(d) = N(d; αm_i − β, α²σ_i²) = |α⁻¹| N(Xψ; m_i, σ_i²)
where ξ = [oᵀ, 1]ᵀ and ψ = [d, 1]ᵀ; μ_i is the mean of the state output distribution, m_i the mean of the duration distribution, Σ_i the diagonal covariance matrix and σ_i² the variance; W = [A⁻¹, A⁻¹b] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α⁻¹, α⁻¹β] is the transformation matrix of the state duration probability density distribution;
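The first transformation equation states that evaluating an observation under the transformed model is equivalent, up to the Jacobian factor |A⁻¹|, to evaluating a transformed observation under the original model. A minimal one-dimensional Python sketch (all model and transform numbers are hypothetical) can confirm this identity numerically:

```python
import math

def gauss(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical scalar model and transform parameters (illustration only).
mu_i, var_i = 0.7, 0.25   # average-voice state output mean and variance
a, b = 1.3, -0.2          # scalar case of the CMLLR transform (A, b)
o = 0.5                   # one observation

# Left-hand side: the observation under the transformed model,
# N(o; a*mu - b, a^2 * var).
lhs = gauss(o, a * mu_i - b, a * var_i * a)

# Right-hand side: transform the observation instead,
# |a^-1| * N((o + b) / a; mu, var).
rhs = (1 / abs(a)) * gauss((o + b) / a, mu_i, var_i)

assert abs(lhs - rhs) < 1e-12
```

The same equivalence is what lets CMLLR be trained efficiently: instead of re-estimating every Gaussian, a single affine transform of the features is estimated.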
Then, with the HSMM-based adaptive transformation algorithm, the spectrum, F0 and duration parameters of the speech data can be normalized and transformed. For adaptation data O of length T, the maximum likelihood estimate of the transforms Λ = (W, X) is
Λ̃ = (W̃, X̃) = argmax_Λ P(O | λ, Λ)
where λ is the parameter set of the HSMM;
Finally, the adapted model of the voice is modified and updated with the maximum a posteriori (MAP) algorithm. For a given HSMM λ with forward probability α_t(i) and backward probability β_t(i), the generation probability of the continuous observation sequence o_{t−d+1} … o_t in state i is:
κ_t^d(i) = (1 / P(O|λ)) · Σ_{j=1, j≠i}^N α_{t−d}(j) p_i(d) [ Π_{s=t−d+1}^t b_i(o_s) ] β_t(i)
The MAP estimation is as follows:
μ̂_i = (ω μ̄_i + Σ_t Σ_d κ_t^d(i) ō_{t,d}) / (ω + Σ_t Σ_d κ_t^d(i))
m̂_i = (τ m̄_i + Σ_t Σ_d κ_t^d(i) d) / (τ + Σ_t Σ_d κ_t^d(i))
where μ̄_i and m̄_i are the mean vectors after linear regression transformation, ω and τ are respectively the MAP estimation parameters of the state output and duration distributions, μ̂_i and m̂_i are the weighted-average MAP estimates of the adapted mean vectors, and ō_{t,d} denotes the mean of the observations o_{t−d+1} … o_t.
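The MAP update is, in essence, a weighted interpolation between the transformed prior mean and the occupancy-weighted statistics of the adaptation data. A minimal Python sketch (the function name and all numbers are hypothetical illustrations, not part of the invention):

```python
def map_update(prior_mean, prior_weight, stats):
    """MAP-style update: interpolate a transformed prior mean with
    occupancy-weighted adaptation statistics given as (kappa, value) pairs."""
    num = prior_weight * prior_mean + sum(k * v for k, v in stats)
    den = prior_weight + sum(k for k, _ in stats)
    return num / den

# Hypothetical numbers: transformed mean 1.0 with MAP weight omega = 4,
# plus adaptation statistics whose occupancies sum to 4.
mu_bar, omega = 1.0, 4.0
stats = [(1.0, 2.0), (3.0, 2.0)]  # (kappa_t^d(i), segment mean) pairs

mu_hat = map_update(mu_bar, omega, stats)
# num = 4*1 + 1*2 + 3*2 = 12 ; den = 4 + 4 = 8  ->  1.5
assert abs(mu_hat - 1.5) < 1e-12
```

With little adaptation data the estimate stays close to the prior mean, and with abundant data it moves toward the data statistics, which is why the MAP step reduces model bias for small target-language corpora.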
Further, inputting the text to be synthesized, generating the speech parameters, and synthesizing Tibetan or Chinese speech in step E comprises the steps:
First, a text analysis tool converts the given text into a pronunciation label sequence containing context descriptors; the decision trees obtained in training predict the context-dependent HMM of each pronunciation, and these are concatenated into the HMM of the sentence;
Then, the parameter generation algorithm generates the spectrum, duration and F0 parameter sequences from the sentence HMM;
Finally, a Mel log spectrum approximation (MLSA) filter is used as the parametric synthesizer to synthesize the speech.
Further, a Mandarin-Tibetan bilingual speech synthesis device comprises: an HMM model training unit for building the HMMs of the speech data; a speaker adaptation unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adapted model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
Further, the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the F0, spectrum and duration parameters; and a target HMM model determination subunit, which, combining the context label information of the corpus, trains the statistical acoustic models and determines the F0, spectrum and duration parameters according to the context attribute set; the speech analysis subunit is connected with the target HMM model determination subunit.
Further, the speaker adaptation unit comprises a speaker training subunit, an average voice model determination subunit, a speaker adaptive transformation subunit and an adapted model determination subunit connected in sequence; the target HMM model determination subunit is connected with the speaker training subunit.
The speaker training subunit normalizes the differences in state output and state duration distributions between the training speakers and the average voice model;
The average voice model determination subunit determines the Mandarin-Tibetan mixed-speech average voice model with the constrained maximum likelihood linear regression algorithm;
The speaker adaptive transformation subunit uses the adaptation data to compute the mean vectors and covariance matrices of the speaker's state output and duration probability distributions and transforms them into the target speaker model;
The adapted model determination subunit builds the adapted MSD-HSMM of the target speaker.
Further, the speech synthesis unit comprises an adapted model modification subunit and a synthesis subunit connected in sequence; the adapted model determination subunit is connected with the adapted model modification subunit.
The adapted model modification subunit modifies and updates the adapted model of the voice with the MAP algorithm, reducing model bias and improving synthesis quality;
The synthesis subunit uses the modified adapted model to predict the speech parameters of the input text and finally synthesizes the Chinese or Tibetan speech through the vocoder.
The advantages and positive effects of the present invention are: the Mandarin-Tibetan bilingual speech synthesis method and device exploit the similarity in pronunciation between Chinese and Tibetan and use the HMM-based adaptive training and adaptive transformation algorithms to synthesize both Chinese and Tibetan speech of good naturalness and fluency with a single system and device. Compared with a traditional HMM-based speech synthesis system, the present system adds a speaker adaptive training process in the training stage to obtain the Mandarin-Tibetan mixed-speech average voice model; this process reduces the influence of speaker differences in the corpus and improves the quality of the synthesized speech. On the basis of the average voice model, Tibetan or Chinese speech of good naturalness and fluency can be synthesized through the speaker adaptive transformation algorithm with only a small amount of Tibetan or Chinese data. This research is of great significance for promoting communication with ethnic minorities and the development of minority-language speech technology.
Brief description of the drawings
Fig. 1 is the flow block diagram of the Mandarin-Tibetan bilingual speech synthesis method;
Fig. 2 is the flow chart of the conversion of Tibetan text to SAMPA-T;
Fig. 3 is the flow block diagram of Mandarin-Tibetan bilingual speaker-adaptive speech synthesis;
Fig. 4 is the structural diagram of the Mandarin-Tibetan bilingual speech synthesis device;
Fig. 5 is the flow chart of model training;
Fig. 6 is the flow chart of speech synthesis.
Embodiment
The present invention proposes a Mandarin-Tibetan bilingual speech synthesis method. It proposes a transcription algorithm for the Tibetan machine-readable phonetic alphabet SAMPA-T and realizes the automatic SAMPA-T annotation of a Tibetan text corpus; according to the similarity between Tibetan and Mandarin, it designs a common phonetic labeling system, annotation format and question set for Mandarin and Tibetan; and, using the corpora of multiple Mandarin and Tibetan speakers, it finally synthesizes Chinese or Tibetan speech through HMM-based speaker adaptive training and the speaker adaptive transformation algorithm. The flow block diagram of the Mandarin-Tibetan bilingual speech synthesis method of the present invention is shown in Figure 1; the concrete steps are:
(1) Design the SAMPA-T labeling scheme of the Tibetan Lhasa dialect and complete the automatic SAMPA-T annotation of the Tibetan text corpus with the SAMPA-T transcription algorithm.
The machine-readable phonetic alphabet SAMPA (Speech Assessment Methods Phonetic Alphabet) is a computer-readable phonetic system that can represent all symbols of the IPA with ASCII characters. At present SAMPA is widely used for the major languages of Europe, and SAMPA schemes have also been proposed for East Asian languages such as Japanese, as well as for Mandarin Chinese, Cantonese, and Taiwanese Mandarin.
Because Tibetan and Chinese both belong to the Sino-Tibetan family, the present invention designs a computer-readable phonetic labeling system SAMPA-T (Tibetan) for Tibetan on the basis of the machine-readable phonetic scheme of Standard Chinese, gives the design scheme for the Tibetan Lhasa dialect, and realizes the transcription from Tibetan to SAMPA-T.
By contrasting the IPA of Chinese and Tibetan, it is found that parts of the two IPA inventories coincide. Therefore, taking the IPA as reference, the IPA of the input Tibetan syllables is obtained and compared with the IPA of Chinese pinyin; the identical part is labeled directly with SAMPA-SC, and the differing part is labeled, following the principle of simplification, with unused keyboard symbols.
Tibetan is an alphabetic script spelled from Tibetan letters, and its basic unit is the syllable. The traditional Tibetan grammar classifies the letters of a syllable by their structural position into prefix, root letter, superscript, subscript, suffix and post-suffix, with the root letter as the core of the whole word. The Tibetan initial is formed by the letter stack of prefix + superscript + root letter + subscript, and the Tibetan final by vowel + suffix + post-suffix.
The transcription of Tibetan text to SAMPA-T mainly involves sentence segmentation, syllable segmentation, root-letter location, separation and transcription of initials and finals, and SAMPA-T string combination. The location of the root letter and the SAMPA-T conversion of initials and finals form the core module: root-letter location and vowel identification are realized mainly through dictionary statistics and lookup, and the conversion of the letter stack is realized through lookup in the SAMPA-T transcription support tables of initials and finals. The Tibetan text is first read in, then segmented into sentences and syllables according to the shad and the tsheg to obtain the syllable list. For each syllable, the root letter is located, the letter stack is decomposed to separate the initial and the final, and the SAMPA-T of the syllable is obtained by looking up the SAMPA-T tables of initials and finals. The flow chart of the conversion of Tibetan text to SAMPA-T is shown in Figure 2.
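The segmentation-and-lookup pipeline above can be sketched in a few lines of Python. This is a heavily simplified illustration, not the patented algorithm: the table entries are tiny hypothetical fragments of the real initial/final SAMPA-T tables, and taking the first letter as the root stands in for the real letter-stack decomposition:

```python
# -*- coding: utf-8 -*-

SHAD = "\u0F0D"   # ། sentence-final stroke
TSHEG = "\u0F0B"  # ་ syllable delimiter

# Hypothetical fragments of the initial/final SAMPA-T lookup tables;
# the real tables cover all 36 initials and 45 finals of Lhasa Tibetan.
INITIAL_TABLE = {"\u0F56": "p", "\u0F58": "m", "\u0F63": "l"}  # བ མ ལ
FINAL_TABLE = {"": "a", "\u0F7C": "o", "\u0F74": "u"}          # inherent a, ོ ུ

def split_syllables(sentence):
    """Split one Tibetan sentence into syllables at the tsheg marks."""
    return [s for s in sentence.split(TSHEG) if s]

def syllable_to_sampa_t(syllable):
    """Transcribe one syllable: the first letter stands in for the root
    letter and the rest for the final (a simplification of real
    letter-stack decomposition)."""
    initial, final = syllable[:1], syllable[1:]
    return INITIAL_TABLE.get(initial, "?") + FINAL_TABLE.get(final, "?")

def transcribe(text):
    """Tibetan text -> list of SAMPA-T strings, one per sentence."""
    result = []
    for sentence in (s for s in text.split(SHAD) if s.strip()):
        result.append(" ".join(syllable_to_sampa_t(s)
                               for s in split_syllables(sentence)))
    return result

print(transcribe("\u0F56\u0F7C\u0F0B\u0F58\u0F74"))  # -> ['po mu']
```

A production version would replace the one-letter root heuristic with the dictionary-driven root-letter location and letter-stack decomposition table described above.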
(2) According to the similarity between Tibetan and Mandarin, design the common Chinese-Tibetan phonetic labeling system and question set on the basis of the Mandarin phonetic labeling system.
Tibetan and Chinese both belong to the Sino-Tibetan family, and their pronunciations share many commonalities alongside differences. Both Standard Chinese and the Tibetan Lhasa dialect are syllabic languages in which each syllable consists of an initial and a final. Mandarin has 22 initials and 39 finals, the Tibetan Lhasa dialect has 36 initials and 45 finals, and the two languages share 20 initials and 13 finals. In the present invention, the Tibetan initials and finals consistent with Mandarin are first labeled with Chinese pinyin, and those inconsistent with Mandarin are labeled with Tibetan pinyin.
Then, all initials and finals of Mandarin and Tibetan, plus silence and pause, are chosen as the synthesis units of the context-dependent MSD-HSMM, and a context annotation format is designed to label the context-dependent features of each synthesis unit at the initial/final, syllable, word, prosodic-word, phrase and sentence layers.
Finally, a common Mandarin-Tibetan question set is designed on the basis of a Mandarin context-dependent question set, extended with questions on the synthesis units peculiar to Tibetan so as to reflect Tibetan's special pronunciations. The question set contains more than 3000 context-dependent questions covering all features of the context-dependent labels.
The present system annotates the Tibetan text corpus with hierarchical labeling; the labeled content includes the syllable layer, boundary information and the SAMPA-T transcription result. The labeling is done with the internationally used phonetics software Praat, and further label information can be added as needed. After labeling, a script writes the label information into a .TextGrid file containing the four labeled tiers, mainly pronunciation and syllable boundary information. The question set contains classifications of basic features, such as the initial, the type of the final, or whether the syllable is among the first two of a prosodic phrase; these classifications are normally sets built from a group of basic items of contextual information. By further subdividing the question set, context classifications more complex than the basic features can be obtained. In the HTS system, the designed questions are all placed in a .hed file, with one question per line; the questions are yes/no questions, and each question begins with the QS command.
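For illustration, QS entries in an HTS .hed file pair a question name with patterns matched against the context-dependent label strings. The names and patterns below are hypothetical examples of the format, not the actual questions of this invention:

```
QS "C-Initial_b"  {*-b+*}
QS "C-Vowel"      {*-a+*,*-i+*,*-u+*}
QS "R-Silence"    {*+sil=*}
```

Each question answers true when the current label matches one of the patterns, which is how the decision-tree clustering splits the context-dependent models.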
(3) Using the Mandarin-Tibetan speech data of multiple speakers, train the mixed-language average voice model by HMM-based speaker adaptive training.
Compared with the traditional HMM-based synthesis method, the present invention adds a speaker adaptive training process in the training stage to obtain the Mandarin-Tibetan mixed-speech average voice model. This process reduces the influence of speaker differences in the corpus and improves the quality of the synthesized speech. On the basis of the average voice model, Tibetan or Chinese speech of good naturalness and fluency can be synthesized through the speaker adaptive transformation algorithm with only a small amount of Tibetan or Chinese data. The flow block diagram of Mandarin-Tibetan bilingual speaker-adaptive speech synthesis is shown in Figure 3:
1> Perform speech analysis on the multi-speaker Chinese corpus and the single-speaker Tibetan corpus and extract the acoustic parameters:
(1) extract the mel-cepstral coefficients, logarithmic F0 and aperiodicity indices;
(2) compute their first-order and second-order differences.
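As a sketch of step (2), the first- and second-order differences (delta and delta-delta features) can be computed per parameter track. A minimal version using a simple two-point slope is shown below; real systems typically use a regression window over several frames, and the track values here are hypothetical:

```python
def deltas(frames):
    """First-order differences of a per-frame parameter sequence using the
    two-point central slope, with edge frames repeated for padding."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [(padded[i + 2] - padded[i]) / 2 for i in range(len(frames))]

# A hypothetical 1-dimensional parameter track (e.g. one mel-cepstral bin).
track = [0.0, 1.0, 4.0, 9.0]
d1 = deltas(track)  # first-order (delta)
d2 = deltas(d1)     # second-order (delta-delta)

print(d1)  # [0.5, 2.0, 4.0, 2.5]
```

The static, delta and delta-delta streams are then stacked per frame to form the observation vectors that the HMMs model.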
2> Train the HMMs of the acoustic parameters in combination with the context attribute set:
(1) train the HMMs of the spectrum and F0 parameters;
(2) train the multi-space distribution hidden semi-Markov models (MSD-HSMM) of the state duration parameters.
3> Using the small single-speaker Chinese corpus and the single-speaker Tibetan corpus, perform speaker adaptive training to obtain the average voice model:
(1) with the constrained maximum likelihood linear regression (CMLLR) algorithm, represent the difference between each training speaker's speech data and the average voice by linear regression functions;
(2) normalize the differences between training speakers with a set of linear regression equations of the state output and state duration distributions;
(3) train the Mandarin-Tibetan mixed-language average voice model, thereby obtaining the context-dependent MSD-HSMMs.
4> Using the single-speaker adaptation data of Chinese and Tibetan, perform speaker adaptive transformation:
(1) with the CMLLR algorithm, compute the mean vectors and covariance matrices of the speaker's state output and state duration probability distributions;
(2) transform the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese to be synthesized with a set of transformation matrices of the state output and state duration distributions;
(3) perform maximum likelihood estimation of the normalized and transformed spectrum, F0 and duration parameters.
5> Modify and update the adapted model:
(1) with the maximum a posteriori (MAP) algorithm, compute the MAP estimation parameters of the state output and duration distributions of the average voice model;
(2) compute the mean vectors of the state output and state duration after adaptive transformation;
(3) compute the weighted-average MAP estimates of the adapted mean vectors.
6> Input the text to be synthesized, perform text analysis on it, and obtain the HMM model of the sentence.
7> Perform parameter prediction on the sentence HMM, generate the speech parameters, and obtain the synthesized speech through the parametric synthesizer.
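Step 1>(2) above, appending first- and second-order differences to the static features, can be sketched as follows. The window coefficients here are the common central-difference defaults, since the patent does not give its regression windows:

```python
def delta_features(x):
    """Append first- and second-order differences (delta, delta-delta)
    to a 1-D parameter trajectory, with edge frames replicated.
    These are the usual windows in HMM-based synthesis, not
    necessarily the patent's exact ones."""
    n = len(x)
    feats = []
    for t in range(n):
        prev = x[max(t - 1, 0)]
        nxt = x[min(t + 1, n - 1)]
        delta = 0.5 * (nxt - prev)          # delta_t ~ (x[t+1] - x[t-1]) / 2
        delta2 = nxt - 2.0 * x[t] + prev    # delta2_t ~ x[t+1] - 2x[t] + x[t-1]
        feats.append([x[t], delta, delta2])
    return feats

feats = delta_features([1.0, 2.0, 4.0, 7.0])
```

Each frame's observation vector then stacks the static value with its two dynamic features, as used by the HMM training in step 2>.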
Fig. 3 shows the Tibetan-Chinese bilingual speech synthesis scheme. Using the mixed corpus of Mandarin and Tibetan, constrained maximum likelihood linear regression (CMLLR) training yields the average voice model of the Chinese-Tibetan mixed language, giving a context-dependent multi-space probability distribution hidden semi-Markov model (MSD-HSMM). In speaker-adaptive training, the difference between each training speaker's speech data and the average voice is represented by linear regression functions of the mean vectors of the state output distributions and state duration distributions, so that the differences among training speakers can be normalized with a set of linear regression equations for the state output and state duration distributions:

μ_i^(s) = A o_i + b = W ξ_i

m_i^(s) = α d_i + β = X ψ_i

where μ_i^(s) is the state output mean vector of training speaker s and m_i^(s) is its state duration mean vector; W = [A, b] and X = [α, β] are respectively the transformation matrices of the differences in state output distribution and state duration distribution between training speaker s and the average voice model; o_i and d_i are the average observation vector and average duration vector, with ξ_i = [o_i^T, 1]^T and ψ_i = [d_i, 1]^T.
(4) Use a small amount of a speaker's corpus in the target language (Tibetan or Chinese) to be synthesized, obtain the speaker-adapted model through speaker-adaptive transformation, and revise and update the adaptive model.
After speaker-adaptive training, the HSMM-based CMLLR adaptation algorithm is used to compute the mean vectors and covariance matrices of the state output and state duration probability distributions for the voice transformation. In state i, the transformation equations of the feature vector o and the state duration d are:

b_i(o) = N(o; Aμ_i − b, AΣ_iA^T) = |A^(−1)| N(Wξ; μ_i, Σ_i)

p_i(d) = N(d; αm_i − β, α²σ_i²) = |α^(−1)| N(αψ; m_i, σ_i²)

where ξ = [o^T, 1]^T, ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, and σ_i² is the variance. W = [A^(−1), b^(−1)] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α^(−1), β^(−1)] is the transformation matrix of the state duration probability density distribution.
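The state output transformation above maps each Gaussian of the average voice model with a pair (A, b), transforming the mean to Aμ − b and the covariance to AΣA^T. A minimal pure-Python sketch of that mapping with toy 2-dimensional values (the actual system applies it per MSD-HSMM state and stream):

```python
def cmllr_transform_state(mean, diag_cov, A, b):
    """Transform one state-output Gaussian N(mean, diag_cov) with a
    CMLLR transform (A, b), yielding N(A*mean - b, A*Sigma*A^T).
    Illustrative 2x2 arithmetic only, not the full HSMM estimation."""
    dim = len(mean)
    new_mean = [sum(A[i][j] * mean[j] for j in range(dim)) - b[i]
                for i in range(dim)]
    # A*Sigma*A^T with Sigma diagonal: entry (i,k) = sum_j A[i][j]*sigma_j*A[k][j]
    new_cov = [[sum(A[i][j] * diag_cov[j] * A[k][j] for j in range(dim))
                for k in range(dim)] for i in range(dim)]
    return new_mean, new_cov

mean, cov = [1.0, 2.0], [0.5, 0.25]   # toy average-voice state parameters
A = [[2.0, 0.0], [0.0, 1.0]]          # toy rotation/scaling part
b = [0.5, -1.0]                       # toy bias part
m2, S2 = cmllr_transform_state(mean, cov, A, b)
```

Note that after the transform the covariance is generally full even when the source covariance is diagonal.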
With the HSMM-based adaptive transformation algorithm, the spectrum, fundamental-frequency and duration parameters of the speech data can be normalized and transformed. For adaptation data O of length T, maximum likelihood estimation of the transforms Λ = (W, X) is carried out:

Λ̂ = (Ŵ, X̂) = argmax_Λ P(O | λ, Λ)

where λ is the parameter set of the HSMM.
Finally, the maximum a posteriori (MAP) algorithm is adopted to revise and update the adapted model of the speech. For a given HSMM λ, let its forward and backward probabilities be α_t(i) and β_t(i); the generation probability κ_t^d(i) of the continuous observation sequence o_{t−d+1} … o_t in state i is then:

κ_t^d(i) = (1 / P(O|λ)) Σ_{j=1, j≠i}^{N} α_{t−d}(j) p_i(d) ∏_{s=t−d+1}^{t} b_i(o_s) β_t(i)

The MAP estimates are:

μ̂_i = (ω μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} κ_t^d(i) Σ_{s=t−d+1}^{t} o_s) / (ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d κ_t^d(i))

m̂_i = (τ m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} d κ_t^d(i)) / (τ + Σ_{t=1}^{T} Σ_{d=1}^{t} κ_t^d(i))

where μ̄_i and m̄_i are the mean vectors after the linear regression transformation, ω and τ are respectively the MAP estimation parameters of the state output and duration distributions, and μ̂_i and m̂_i are the weighted-average MAP estimates of the adapted mean vectors.
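The MAP update above is an occupancy-weighted interpolation between the linearly transformed prior mean and the adaptation data. A scalar sketch of this idea (the real update is vector-valued and sums the occupancies κ_t^d(i) over durations d as in the formula):

```python
def map_update(prior_mean, omega, observations, occupancies):
    """Scalar illustration of the MAP mean update: a weighted average of
    the transformed prior mean (weight omega) and the occupancy-weighted
    adaptation observations. Values below are invented for illustration."""
    num = omega * prior_mean + sum(k * o for k, o in zip(occupancies, observations))
    den = omega + sum(occupancies)
    return num / den

mu_bar = 0.0                    # mean after the linear-regression transform
obs = [1.0, 1.0, 1.0, 1.0]      # adaptation observations assigned to state i
occ = [1.0, 1.0, 1.0, 1.0]      # state occupancy probabilities
mu_hat = map_update(mu_bar, 4.0, obs, occ)
```

When ω → 0 the estimate falls back to the maximum-likelihood sample mean of the adaptation data; a large ω keeps the estimate close to the prior, which is why MAP is robust with very little adaptation data.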
The training stage mainly comprises preprocessing and HMM training. In preprocessing, the speech data in the corpus is first analyzed and the corresponding speech parameters (fundamental frequency and spectrum parameters) are extracted. According to the extracted parameters, the HMM observation vector is divided into a spectrum part and a fundamental-frequency part: the spectrum part is modeled with continuous-probability-distribution HMMs, while the fundamental-frequency part is modeled with multi-space probability distribution HMMs (MSD-HMM); at the same time, the system uses a Gaussian or gamma distribution to build the state duration model that describes the temporal structure of speech. In addition, the HMM synthesis system describes context with linguistic and prosodic features. Before model training, the context attribute set and the question set for decision-tree clustering are designed: according to prior knowledge, context attributes that influence the acoustic parameters (spectrum, fundamental frequency and duration) are selected, and the corresponding question set is designed for context-dependent model clustering.
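The multi-space distribution used for the fundamental-frequency part can be illustrated with two spaces: a one-dimensional Gaussian for voiced frames and a zero-dimensional point mass for unvoiced frames. A toy sketch (all parameter values are invented) of how one MSD state scores voiced and unvoiced observations:

```python
from statistics import NormalDist

def msd_f0_likelihood(obs, w_voiced, mu, sigma):
    """Observation likelihood of one MSD-HMM state for F0, with two
    spaces: a Gaussian over log-F0 for voiced frames (prior weight
    w_voiced) and a discrete unvoiced space (weight 1 - w_voiced).
    `obs` is None for an unvoiced frame, else a log-F0 value."""
    if obs is None:                        # unvoiced: zero-dimensional space
        return 1.0 - w_voiced
    return w_voiced * NormalDist(mu, sigma).pdf(obs)

p_uv = msd_f0_likelihood(None, w_voiced=0.8, mu=5.0, sigma=0.1)
p_v = msd_f0_likelihood(5.0, w_voiced=0.8, mu=5.0, sigma=0.1)
```

This is what lets a single stream model the mixed continuous/discrete nature of F0 without interpolating through unvoiced regions.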
During model training, the HMMs of the acoustic parameter vector sequences are trained with the EM algorithm under the ML criterion. Finally, context decision trees cluster the spectrum parameter models, fundamental-frequency parameter models and duration models separately, producing the prediction models used in synthesis. The flow of the whole model training is shown in Fig. 5.
(5) Input the text to be synthesized, generate the speech parameters, and synthesize the Tibetan or Chinese speech.
First, a text analysis tool converts the given text into a pronunciation annotation sequence containing context descriptors; the decision trees obtained in training predict the context-dependent HMM of each pronunciation unit, and these are concatenated into the HMM of the whole sentence. Then the parameter generation algorithm generates the spectrum, duration and fundamental-frequency parameter sequences from the sentence HMM. Finally, a Mel log spectrum approximation (MLSA) filter serves as the parametric synthesizer to synthesize the speech. The flow of the whole synthesis is shown in Fig. 6.
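The parameter generation step chooses the static trajectory that best fits the predicted static and dynamic means by solving a linear system. The sketch below is a deliberately simplified one-dimensional version with first differences only; the full algorithm also uses delta-delta windows, per-stream variances and MSD voicing decisions:

```python
def mlpg_1d(mu_static, var_static, mu_delta, var_delta):
    """Simplified ML parameter generation: find the trajectory c that
    minimizes the variance-weighted error of the static means and of the
    first differences c[t] - c[t-1] against the delta means. A sketch of
    the idea only, not the patent's exact generation algorithm."""
    T = len(mu_static)
    M = [[0.0] * T for _ in range(T)]   # normal equations  M c = r
    r = [0.0] * T
    for t in range(T):
        M[t][t] += 1.0 / var_static[t]
        r[t] += mu_static[t] / var_static[t]
    for t in range(1, T):
        w = 1.0 / var_delta[t]
        M[t][t] += w; M[t - 1][t - 1] += w
        M[t][t - 1] -= w; M[t - 1][t] -= w
        r[t] += mu_delta[t] * w
        r[t - 1] -= mu_delta[t] * w
    # Gaussian elimination (no pivoting; M is positive definite here)
    for i in range(T):
        for j in range(i + 1, T):
            f = M[j][i] / M[i][i]
            for k in range(i, T):
                M[j][k] -= f * M[i][k]
            r[j] -= f * r[i]
    c = [0.0] * T
    for i in reversed(range(T)):
        c[i] = (r[i] - sum(M[i][k] * c[k] for k in range(i + 1, T))) / M[i][i]
    return c

# Weak static means ("stay near 0") plus tight deltas ("rise by 1 per frame")
# yield a smooth rising trajectory near [-1, 0, 1]:
traj = mlpg_1d([0.0, 0.0, 0.0], [100.0] * 3, [0.0, 1.0, 1.0], [0.01] * 3)
```

The dynamic-feature constraints are what make the generated trajectory smooth instead of a piecewise-constant sequence of state means.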
Corresponding to the above method, the present invention also provides a Tibetan-Chinese bilingual speech synthesis device, which uses the pre-established Chinese-Tibetan bilingual speech corpus to synthesize the input Chinese or Tibetan sentences to be synthesized. In implementation, the functions of the device may be realized by software, hardware, or a combination of both. The internal structure of the device of the present invention is shown schematically in Fig. 4.
The internal structure of the device comprises an HMM model training unit, a speaker adaptation unit and a speech synthesis unit.
1> HMM model training unit, for building the HMM models of the speech data:
(1) a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the fundamental frequency, spectrum and duration parameters;
(2) a target HMM model determination subunit, which combines the context annotation information in the corpus with the context attribute set to train the statistical models of the acoustic parameters and determine the fundamental frequency, spectrum and duration parameters.
2> Speaker adaptation unit, for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model:
(1) a speaker training subunit, which normalizes the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
(2) an average voice model determination subunit, which adopts the maximum likelihood linear regression algorithm to determine the average voice model of the Chinese-Tibetan bilingual mixed speech;
(3) a speaker-adaptive transformation subunit, which uses the adaptation data to compute the mean vectors and covariance matrices of the speaker's state output and duration probability distributions and transforms them into the target speaker model;
(4) an adaptive model determination subunit, which builds the MSD-HSMM adaptive model of the target speaker.
3> Speech synthesis unit, for synthesizing the Tibetan or Chinese speech to be synthesized:
(1) an adaptive model correction subunit, which uses the MAP algorithm to revise and update the adaptive model of the speech, reducing model bias and improving synthesis quality;
(2) a synthesis subunit, which uses the revised adaptive model to predict the speech parameters of the input text and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
The procedures described above may be completed by hardware under the control of program instructions; the program may be stored in a readable storage medium and, when executed, performs the corresponding steps of the above method.
To demonstrate the superiority of the method of the present invention over other methods, the quality of the synthesized Tibetan and Chinese speech was assessed: three different MSD-HSMMs were trained, and comparing the quality of the speech synthesized under these three models illustrates the advantages of the disclosed method.
The Mandarin speech of 7 female speakers from the corpus (169 utterances per speaker) and 800 utterances recorded from 1 female Tibetan speaker were chosen as training data. The Tibetan sentences were selected from recent Tibetan newspapers. All recordings were saved in Microsoft WAV format (mono, 16-bit quantization, 16 kHz sampling).
In the experiments, 100 sentences were randomly selected from the 800 Tibetan sentences as test sentences. From the remaining 700 Tibetan sentences, 10, 100 and 700 sentences were randomly chosen to establish 3 Tibetan training sets. These training sets, together with the training utterances of the 7 female Mandarin speakers, were used to train the average voice models of the Chinese-Tibetan mixed language. The 3 Tibetan training sets and the corpus of the first female Mandarin speaker were also used to obtain the speaker-dependent Tibetan and Mandarin acoustic models.
1) SD model: speaker-dependent models of the Tibetan Lhasa dialect, trained with the 3 Tibetan training sets (10/100/700 Tibetan sentences) respectively.
2) SI model: a speaker-independent model trained only with the training utterances of the 7 female Mandarin speakers.
3) SAT model: first, the 3 Tibetan training sets and all Mandarin training utterances of the 7 Mandarin speakers were used to train 3 Chinese-Tibetan bilingual mixed-language average voice models; then the 3 Tibetan training sets and the training utterances of the first Mandarin speaker were used to obtain the speaker-adapted models.
In the evaluation, the Tibetan Lhasa-dialect test sentences synthesized by the SD and SAT models were played in random order to 8 Tibetan Lhasa-dialect evaluators, 120 test speech files in total (20 Tibetan test sentences × 3 Tibetan training sets × 2 models). The evaluators were asked to listen carefully to the 120 sentences and then score the speech quality of each sentence on a 5-point scale. After the MOS evaluation, the evaluators were also asked to give an overall intelligibility description of the Tibetan speech synthesized with each Tibetan Lhasa-dialect training set. The Chinese MOS evaluation adopted the same method: 54 synthesized Mandarin utterances (18 Mandarin test sentences × 3 Tibetan training sets) were played in random order to Mandarin evaluators, who scored the speech quality of each Mandarin sentence on a 5-point scale.
The MOS results for the speech synthesized with the different Tibetan training sets show the average MOS scores and their 95% confidence intervals. For the synthesized Tibetan speech, the SAT model outperforms the SD model under every Tibetan training set. With 10 Tibetan training utterances, the MOS score of the SD model's synthesis is only 1.99, while that of the SAT model is relatively higher at 2.4. The evaluators found the Tibetan synthesized by the SD model difficult to understand, and that of the SAT model easy to understand. With 100 Tibetan training sentences, the MOS scores and intelligibility of both models improved, but the SAT model remained clearly better than the SD model. With 700 training sentences, the MOS scores of the two models were essentially identical, and the evaluators found the synthesized speech easy to understand. Therefore, with a small corpus the quality of the speech synthesized by the SAT model is better than that of the SD model, and as the Tibetan corpus grows, the quality of the Tibetan speech synthesized by the two models converges. The disclosed method is thus well suited to synthesizing high-quality speech when corpus data is scarce.
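Reporting a mean MOS with a 95% interval, as in this evaluation, amounts to the following computation. The normal-approximation interval and the rating values here are illustrative, not the patent's data:

```python
from statistics import mean, stdev
from math import sqrt

def mos_with_ci(scores, z=1.96):
    """Mean opinion score with an approximate 95% confidence interval
    (normal approximation; the patent reports mean MOS and 95%
    intervals but does not state which interval formula it uses)."""
    m = mean(scores)
    half = z * stdev(scores) / sqrt(len(scores))
    return m, (m - half, m + half)

scores = [2, 3, 2, 3, 2, 2, 3, 3]    # hypothetical 5-point ratings
m, (lo, hi) = mos_with_ci(scores)
```

With the small listener panels typical of such tests, a t-distribution quantile instead of z = 1.96 would give slightly wider, more conservative intervals.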
For the synthesized Mandarin speech, under every Tibetan training set, mixing Tibetan sentences into the corpus had almost no effect on the Mandarin synthesis results: the MOS scores of the synthesized Mandarin were all around 4.0, a good result.
The DMOS method was adopted to evaluate speaker similarity. All test sentences and their original recordings took part in the DMOS evaluation: 140 synthesized Tibetan speech files in total (20 Tibetan sentences × 3 Tibetan training sets × 2 models, plus 20 Tibetan sentences synthesized by the SI model). Each synthesized Tibetan sentence and its original recording form one group of speech files. The 140 groups of test files were played to the Tibetan Lhasa-dialect evaluators in random order: first the original Tibetan recording, then the synthesized Tibetan speech. The evaluators were asked to compare the two files carefully and assess the degree of similarity between the synthesized speech and the original speech. A 5-point scale was used: 5 means the synthesized speech is essentially identical to the original, and 1 means the two differ greatly.
The Chinese DMOS evaluation adopted the same method: 54 groups of Mandarin speech (18 Mandarin sentences × 3 Tibetan training sets) were played to the Mandarin evaluators in random order, and the similarity of each group was scored on a 5-point scale.
The results show the average DMOS scores and their 95% confidence intervals. For the synthesized Tibetan speech, the DMOS score of the SI model is 2.41, better than the SD model trained with 10 Tibetan sentences and close to the SAT model trained with 10 Tibetan sentences. In the subjective assessment of the SI model's synthesis, the evaluators felt that the Tibetan speech it produced resembled a non-Tibetan speaker speaking Tibetan. This is because Mandarin and Tibetan not only share 33 synthesis units but also have the same syllable structure and prosodic structure, so a Mandarin-only model can already synthesize speech resembling Tibetan. As more Tibetan training sentences are added, the DMOS scores of the Tibetan speech synthesized by the SAT model surpass those of the SD model. When the number of Tibetan training sentences reaches 700, the DMOS scores of the SD and SAT models are very close. This shows that when few Tibetan sentences are available, the Tibetan Lhasa-dialect speech synthesized by the disclosed method is better than that synthesized by the SD-model method.
The embodiments of the invention have been described in detail above, but the content described is only a preferred embodiment of the invention and cannot be regarded as limiting its scope of practice. All equivalent changes and improvements made within the scope of the invention shall still fall within the coverage of this patent.

Claims (10)

1. A Tibetan-Chinese bilingual speech synthesis method, characterized by comprising the steps of:
A. taking the International Phonetic Alphabet (IPA) as reference, obtaining the IPA transcription of the input Tibetan pinyin and comparing it with the IPA transcription of Chinese pinyin: identical parts are labeled directly with SAMPA-SC, while different parts are labeled with unused keyboard symbols; a letter-to-sound conversion algorithm oriented to SAMPA-T then completes the automatic SAMPA-T annotation of the Tibetan text corpus;
B. according to the similarity between Tibetan and Mandarin, designing a phonetic annotation system and question set common to Chinese and Tibetan on the basis of the Mandarin phonetic annotation system;
C. using the multi-speaker Chinese-Tibetan speech data and, on the basis of HMM models, obtaining the average voice model of the mixed language through speaker-adaptive training;
D. using a small amount of a speaker's corpus in the target language (Tibetan or Chinese) to be synthesized, obtaining the speaker-adapted model through speaker-adaptive transformation, and revising and updating the adaptive model;
E. inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Chinese speech.
2. The Tibetan-Chinese bilingual speech synthesis method according to claim 1, characterized in that the SAMPA-T letter-to-sound conversion algorithm in step A comprises the steps of:
first reading in the Tibetan sentence text, then segmenting sentences and syllables according to the single shad (sentence delimiter) and the tsheg (syllable delimiter) to obtain the Tibetan syllable sequence; for each syllable, locating the root letter, decomposing the syllable around the root letter, and separating the initial and the final; and finally obtaining the SAMPA-T string of the syllable by looking up the initial SAMPA-T table and the final SAMPA-T table, the root-letter decomposition being carried out according to a root-letter decomposition table.
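The claimed segmentation can be illustrated with Unicode Tibetan text, where the tsheg (U+0F0B) delimits syllables and the shad (U+0F0D) ends a sentence. The SAMPA-T lookup tables below are hypothetical one-entry stand-ins for the patent's full initial and final tables:

```python
TSHEG, SHAD = "\u0f0b", "\u0f0d"   # Tibetan syllable and sentence delimiters

# Hypothetical lookup tables: the real SAMPA-T tables cover every
# Tibetan initial and final; only toy entries are shown here.
INITIAL_SAMPA_T = {"bkr": "t`"}
FINAL_SAMPA_T = {"a": "a"}

def syllables(sentence):
    """Split a Tibetan sentence into syllables on the tsheg after
    stripping the closing shad, as in the claimed segmentation step."""
    return [s for s in sentence.rstrip(SHAD).split(TSHEG) if s]

def syllable_to_sampa_t(initial, final):
    """Look up the SAMPA-T strings of an already-separated initial and
    final and concatenate them; the root-letter decomposition that
    produces (initial, final) is table-driven and omitted here."""
    return INITIAL_SAMPA_T[initial] + FINAL_SAMPA_T[final]

parts = syllables("\u0f40" + TSHEG + "\u0f41" + SHAD)  # two syllables + shad
```

A production implementation would also handle multiple shad variants and syllables whose root-letter stack carries prefixes, superscripts and subscripts.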
3. The Tibetan-Chinese bilingual speech synthesis method according to claim 1, characterized in that designing the phonetic annotation system and question set common to Chinese and Tibetan in step B comprises the steps of:
first, labeling the Tibetan initials and finals consistent with Mandarin phonemes with Chinese pinyin, and labeling those inconsistent with Mandarin with Tibetan pinyin;
then, choosing all the initials and finals of Mandarin and Tibetan, plus silence and pause, as the synthesis units of the context-dependent MSD-HSMM, and designing the context annotation format to label the context-dependent features of each synthesis unit at the initial/final layer, syllable layer, word layer, prosodic-word layer, phrase layer and sentence layer;
finally, designing the Chinese-Tibetan bilingual common question set on the basis of a Mandarin context-dependent question set, extended with questions about the synthesis units peculiar to Tibetan so as to reflect the special pronunciations of Tibetan; the question set contains more than 3000 context-dependent questions, covering all the features of the context-dependent annotation.
4. The Tibetan-Chinese bilingual speech synthesis method according to claim 1, characterized in that obtaining the average voice model of the mixed language through speaker-adaptive training in step C comprises the steps of:
A. performing speech analysis on the multi-speaker Chinese corpus data and the single-speaker Tibetan corpus data and extracting the acoustic parameters:
(1) extracting the Mel cepstral coefficients, logarithmic fundamental frequency and aperiodicity indices,
(2) computing their first-order and second-order differences;
B. combining the context attribute set and carrying out HMM training to obtain the statistical models of the acoustic parameters:
(1) training the HMM models of the spectrum and fundamental-frequency parameters,
(2) training the multi-space probability distribution hidden semi-Markov model (MSD-HSMM) of the state duration parameters;
C. using a small single-speaker Chinese speech corpus and a single-speaker Tibetan speech corpus to perform speaker-adaptive training and obtain the average voice model:
(1) adopting the constrained maximum likelihood linear regression (CMLLR) algorithm to represent the difference between each training speaker's speech data and the average voice by linear regression functions,
(2) normalizing the differences among training speakers with a set of linear regression equations for the state output and state duration distributions,
(3) training the average voice model of the Chinese-Tibetan bilingual mixed language, thereby obtaining the context-dependent MSD-HSMM;
D. using the single-speaker adaptation data of Chinese and Tibetan to perform speaker-adaptive transformation:
(1) adopting the CMLLR algorithm to compute the mean vectors and covariance matrices of the speaker's state output and state duration probability distributions,
(2) transforming the mean vectors and covariance matrices of the average voice model into the target speaker model of the Tibetan or Chinese to be synthesized with a set of transformation matrices for the state output and state duration distributions,
(3) performing maximum likelihood estimation on the normalized and transformed spectrum, fundamental-frequency and duration parameters;
E. revising and updating the adaptive model:
(1) adopting the maximum a posteriori (MAP) algorithm to compute the MAP estimation parameters of the state output and duration distributions of the average voice model,
(2) computing the mean vectors of the state output and state duration after the adaptive transformation,
(3) computing the weighted-average MAP estimates of the adapted mean vectors;
F. inputting the text to be synthesized, performing text analysis on it, and obtaining the HMM model of the sentence;
G. performing parameter prediction on the sentence HMM, generating the speech parameters, and obtaining the synthesized speech through the parametric synthesizer, the formulas being as follows:

μ_i^(s) = A o_i + b = W ξ_i

m_i^(s) = α d_i + β = X ψ_i

where μ_i^(s) is the state output mean vector of training speaker s and m_i^(s) is its state duration mean vector; W = [A, b] and X = [α, β] are respectively the transformation matrices of the differences in state output distribution and state duration distribution between training speaker s and the average voice model; and o_i and d_i are the average observation vector and average duration vector, with ξ_i = [o_i^T, 1]^T and ψ_i = [d_i, 1]^T.
5. The Tibetan-Chinese bilingual speech synthesis method according to claim 1, characterized in that using a small amount of a speaker's corpus in the target language (Tibetan or Chinese) to be synthesized, obtaining the speaker-adapted model through speaker-adaptive transformation, and revising and updating the adaptive model, described in step D, comprises the steps of:
first, after speaker-adaptive training, computing with the HSMM-based CMLLR adaptation algorithm the mean vectors and covariance matrices of the state output and state duration probability distributions for the voice transformation; in state i, the transformation equations of the feature vector o and the state duration d are:

b_i(o) = N(o; Aμ_i − b, AΣ_iA^T) = |A^(−1)| N(Wξ; μ_i, Σ_i)

p_i(d) = N(d; αm_i − β, α²σ_i²) = |α^(−1)| N(αψ; m_i, σ_i²)

where ξ = [o^T, 1]^T, ψ = [d, 1]^T, μ_i is the mean of the state output distribution, m_i is the mean of the duration distribution, Σ_i is the diagonal covariance matrix, and σ_i² is the variance; W = [A^(−1), b^(−1)] is the linear transformation matrix of the target speaker's state output probability density distribution, and X = [α^(−1), β^(−1)] is the transformation matrix of the state duration probability density distribution;
then, normalizing and transforming the spectrum, fundamental-frequency and duration parameters of the speech data with the HSMM-based adaptive transformation algorithm; for adaptation data O of length T, carrying out maximum likelihood estimation of the transforms Λ = (W, X):

Λ̂ = (Ŵ, X̂) = argmax_Λ P(O | λ, Λ)

where λ is the parameter set of the HSMM;
finally, adopting the maximum a posteriori (MAP) algorithm to revise and update the adapted model of the speech: for a given HSMM λ, letting the forward and backward probabilities be α_t(i) and β_t(i), the generation probability κ_t^d(i) of the continuous observation sequence o_{t−d+1} … o_t in state i is:

κ_t^d(i) = (1 / P(O|λ)) Σ_{j=1, j≠i}^{N} α_{t−d}(j) p_i(d) ∏_{s=t−d+1}^{t} b_i(o_s) β_t(i)

and the MAP estimates are:

μ̂_i = (ω μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} κ_t^d(i) Σ_{s=t−d+1}^{t} o_s) / (ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d κ_t^d(i))

m̂_i = (τ m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} d κ_t^d(i)) / (τ + Σ_{t=1}^{T} Σ_{d=1}^{t} κ_t^d(i))

where μ̄_i and m̄_i are the mean vectors after the linear regression transformation, ω and τ are respectively the MAP estimation parameters of the state output and duration distributions, and μ̂_i and m̂_i are the weighted-average MAP estimates of the adapted mean vectors.
6. The Tibetan-Chinese bilingual speech synthesis method according to claim 1, characterized in that inputting the text to be synthesized, generating the speech parameters, and synthesizing the Tibetan or Chinese speech in step E comprises the steps of:
first, converting the given text with a text analysis tool into a pronunciation annotation sequence containing context descriptors, predicting the context-dependent HMM of each pronunciation unit with the decision trees obtained in training, and concatenating them into the HMM of the whole sentence;
then, generating the spectrum, duration and fundamental-frequency parameter sequences from the sentence HMM with the parameter generation algorithm;
finally, synthesizing the speech with a Mel log spectrum approximation (MLSA) filter as the parametric synthesizer.
7. A Tibetan-Chinese bilingual speech synthesis device, characterized by comprising: an HMM model training unit for building the HMM models of the speech data; a speaker adaptation unit for normalizing and transforming the characteristic parameters of the training speakers to obtain the adaptive model; and a speech synthesis unit for synthesizing the Tibetan or Chinese speech to be synthesized.
8. The Tibetan-Chinese bilingual speech synthesis device according to claim 7, characterized in that the HMM model training unit comprises: a speech analysis subunit, which extracts the acoustic parameters of the speech data in the corpus, mainly the fundamental frequency, spectrum and duration parameters; and a target HMM model determination subunit, which combines the context annotation information in the corpus with the context attribute set to train the statistical models of the acoustic parameters and determine the fundamental frequency, spectrum and duration parameters; the speech analysis subunit being connected with the target HMM model determination subunit.
9. The Tibetan-Chinese bilingual speech synthesis device according to claim 8, characterized in that the speaker adaptation unit comprises, connected in sequence, a speaker training subunit, an average voice model determination subunit, a speaker-adaptive transformation subunit and an adaptive model determination subunit, the target HMM model determination subunit being connected with the speaker training subunit, wherein:
the speaker training subunit normalizes the differences in state output distribution and state duration distribution between the training speakers and the average voice model;
the average voice model determination subunit adopts the maximum likelihood linear regression algorithm to determine the average voice model of the Chinese-Tibetan bilingual mixed speech;
the speaker-adaptive transformation subunit uses the adaptation data to compute the mean vectors and covariance matrices of the speaker's state output and duration probability distributions and transforms them into the target speaker model;
the adaptive model determination subunit builds the MSD-HSMM adaptive model of the target speaker.
10. The Tibetan-Chinese bilingual speech synthesis device according to claim 9, characterized in that the speech synthesis unit comprises an adaptive model correction subunit and a synthesis subunit connected in sequence, the adaptive model determination subunit being connected with the adaptive model correction subunit, wherein:
the adaptive model correction subunit uses the MAP algorithm to revise and update the adaptive model of the speech, reducing model bias and improving synthesis quality;
the synthesis subunit uses the revised adaptive model to predict the speech parameters of the input text and finally synthesizes the Chinese or Tibetan speech through the speech synthesizer.
CN201410341827.9A 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device Pending CN104217713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410341827.9A CN104217713A (en) 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device


Publications (1)

Publication Number Publication Date
CN104217713A true CN104217713A (en) 2014-12-17

Family

ID=52099126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410341827.9A Pending CN104217713A (en) 2014-07-15 2014-07-15 Tibetan-Chinese speech synthesis method and device


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6035271A (en) * 1995-03-15 2000-03-07 International Business Machines Corporation Statistical methods and apparatus for pitch extraction in speech recognition, synthesis and regeneration
CN101290766A (en) * 2007-04-20 2008-10-22 西北民族大学 Syllable splitting method of Tibetan language of Anduo
US20090055162A1 (en) * 2007-08-20 2009-02-26 Microsoft Corporation Hmm-based bilingual (mandarin-english) tts techniques
CN202615783U (en) * 2012-05-23 2012-12-19 西北师范大学 Mel cepstrum analysis synthesizer based on FPGA
CN203276836U (en) * 2013-06-08 2013-11-06 西北民族大学 Novel Tibetan language identification apparatus
CN103440236A (en) * 2013-09-16 2013-12-11 中央民族大学 United labeling method for syntax of Tibet language and semantic roles


Non-Patent Citations (14)

* Cited by examiner, † Cited by third party
Title
C.J. LEGGETTER AND P.C. WOODLAND: "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models", 《COMPUTER SPEECH & LANGUAGE》 *
HONGWU YANG ET AL: "Using speaker adaptive training to realize Mandarin-Tibetan cross-lingual speech synthesis", 《MULTIMEDIA TOOLS AND APPLICATIONS》 *
J. YAMAGISHI ET AL: "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm", 《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *
SIOHAN O ET AL: "Structural maximum a posteriori linear regression for fast HMM adaptation", 《COMPUTER SPEECH & LANGUAGE》 *
YAMAGISHI J ET AL: "Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training", 《IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS》 *
YAMAGISHI J ET AL: "Analysis of speaker adaptation algorithms for HMM-based speech synthesis and a constrained SMAPLR adaptation algorithm", 《IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING》 *
YAMAGISHI J ET AL: "Average-voice-based speech synthesis", 《TOKYO INSTITUTE OF TECHNOLOGY》 *
YU HONGZHI, ZHANG JINXI ET AL: "Research on Tibetan Language Synthesis System Front-end Text Processing Technology Based on HMM", 《APPLIED MECHANICS AND MATERIALS》 *
刘博: "Research on statistical parametric speech synthesis of the Lhasa dialect of Tibetan", 《China Master's Theses Full-text Database, Information Science and Technology》 *
吴义坚: "Research on HMM-based speech synthesis technology", 《China Doctoral Dissertations Full-text Database, Information Science and Technology》 *
宋文龙: "Research on statistical parametric speech synthesis based on speaker adaptive training", 《Wanfang dissertation database》 *
杜嘉: "Application of HMM in parametric speech synthesis systems", 《China Master's Theses Full-text Database, Information Science and Technology》 *
王海燕 et al.: "Mandarin-Tibetan bilingual speech synthesis based on speaker adaptive training", 《Proceedings of the 12th National Conference on Man-Machine Speech Communication (NCMMSC2013)》 *
赵欢欢 et al.: "Speaker adaptation for speech synthesis based on maximum a posteriori probability", 《Journal of Data Acquisition and Processing》 *

Cited By (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538025A (en) * 2014-12-23 2015-04-22 西北师范大学 Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN104882141A (en) * 2015-03-03 2015-09-02 盐城工学院 Serial port voice control projection system based on time delay neural network and hidden Markov model
CN106297764A (en) * 2015-05-27 2017-01-04 科大讯飞股份有限公司 A kind of multilingual mixed Chinese language treatment method and system
CN106294311B (en) * 2015-06-12 2019-03-19 科大讯飞股份有限公司 A kind of Tibetan language tone prediction technique and system
CN106294311A (en) * 2015-06-12 2017-01-04 科大讯飞股份有限公司 A kind of Tibetan language tone Forecasting Methodology and system
CN105390133A (en) * 2015-10-09 2016-03-09 西北师范大学 Tibetan TTVS system realization method
CN105654939A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Voice synthesis method based on voice vector textual characteristics
CN105654939B (en) * 2016-01-04 2019-09-13 极限元(杭州)智能科技股份有限公司 A kind of phoneme synthesizing method based on sound vector text feature
CN106128450A (en) * 2016-08-31 2016-11-16 西北师范大学 The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN107886938B (en) * 2016-09-29 2020-11-17 中国科学院深圳先进技术研究院 Virtual reality guidance hypnosis voice processing method and device
CN107886938A (en) * 2016-09-29 2018-04-06 中国科学院深圳先进技术研究院 Virtual reality guides hypnosis method of speech processing and device
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN107103900A (en) * 2017-06-06 2017-08-29 西北师范大学 A kind of across language emotional speech synthesizing method and system
CN108573694A (en) * 2018-02-01 2018-09-25 北京百度网讯科技有限公司 Language material expansion and speech synthesis system construction method based on artificial intelligence and device
CN110232909A (en) * 2018-03-02 2019-09-13 北京搜狗科技发展有限公司 A kind of audio-frequency processing method, device, equipment and readable storage medium storing program for executing
CN108492821B (en) * 2018-03-27 2021-10-22 华南理工大学 Method for weakening influence of speaker in voice recognition
CN108492821A (en) * 2018-03-27 2018-09-04 华南理工大学 A kind of method that speaker influences in decrease speech recognition
CN109036370A (en) * 2018-06-06 2018-12-18 安徽继远软件有限公司 A kind of speaker's voice adaptive training method
CN109003601A (en) * 2018-08-31 2018-12-14 北京工商大学 A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN109949796A (en) * 2019-02-28 2019-06-28 天津大学 A kind of end-to-end framework Lhasa dialect phonetic recognition methods based on Tibetan language component
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure
CN110349567A (en) * 2019-08-12 2019-10-18 腾讯科技(深圳)有限公司 The recognition methods and device of voice signal, storage medium and electronic device
CN110349567B (en) * 2019-08-12 2022-09-13 腾讯科技(深圳)有限公司 Speech signal recognition method and device, storage medium and electronic device
CN111326138A (en) * 2020-02-24 2020-06-23 北京达佳互联信息技术有限公司 Voice generation method and device
CN111833845A (en) * 2020-07-31 2020-10-27 平安科技(深圳)有限公司 Multi-language speech recognition model training method, device, equipment and storage medium
CN111833845B (en) * 2020-07-31 2023-11-24 平安科技(深圳)有限公司 Multilingual speech recognition model training method, device, equipment and storage medium
CN111986646A (en) * 2020-08-17 2020-11-24 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN112116903A (en) * 2020-08-17 2020-12-22 北京大米科技有限公司 Method and device for generating speech synthesis model, storage medium and electronic equipment
CN111986646B (en) * 2020-08-17 2023-12-15 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN115547292A (en) * 2022-11-28 2022-12-30 成都启英泰伦科技有限公司 Acoustic model training method for speech synthesis
CN115547292B (en) * 2022-11-28 2023-02-28 成都启英泰伦科技有限公司 Acoustic model training method for speech synthesis
CN117275458A (en) * 2023-11-20 2023-12-22 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium
CN117275458B (en) * 2023-11-20 2024-03-05 深圳市加推科技有限公司 Speech generation method, device and equipment for intelligent customer service and storage medium

Similar Documents

Publication Publication Date Title
CN104217713A (en) Tibetan-Chinese speech synthesis method and device
CN101178896B (en) Unit selection voice synthetic method based on acoustics statistical model
CN107103900A (en) A kind of across language emotional speech synthesizing method and system
US7574360B2 (en) Unit selection module and method of chinese text-to-speech synthesis
Patil et al. A syllable-based framework for unit selection synthesis in 13 Indian languages
CN103680498A (en) Speech recognition method and speech recognition equipment
CN101950560A (en) Continuous voice tone identification method
CN103632663B (en) A kind of method of Mongol phonetic synthesis front-end processing based on HMM
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
Toman et al. Unsupervised and phonologically controlled interpolation of Austrian German language varieties for speech synthesis
Waseem et al. Speech synthesis system for indian accent using festvox
Anumanchipalli et al. A statistical phrase/accent model for intonation modeling
Liu et al. A maximum entropy based hierarchical model for automatic prosodic boundary labeling in mandarin
Ling et al. Minimum unit selection error training for HMM-based unit selection speech synthesis system
Sakti et al. Development of HMM-based Indonesian speech synthesis
Ni et al. Automatic prosodic events detection by using syllable-based acoustic, lexical and syntactic features
Iriondo et al. Objective and subjective evaluation of an expressive speech corpus
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Pitrelli et al. Expressive speech synthesis using American English ToBI: questions and contrastive emphasis
Sreejith et al. Automatic prosodic labeling and broad class Phonetic Engine for Malayalam
Yang et al. Enriching Mandarin speech recognition by incorporating a hierarchical prosody model
Chen et al. Automatic phrase boundary labeling of speech synthesis database using context-dependent HMMs and n-gram prior distributions.
Adeeba et al. Comparison of Urdu text to speech synthesis using unit selection and HMM based techniques
Yang et al. Automatic phrase boundary labeling for Mandarin TTS corpus using context-dependent HMM
Nair et al. Indian text to speech systems: A short survey

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20141217
