CN101004909A - Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm

Publication number
CN101004909A
Authority
CN
China
Prior art keywords
primitive
candidate
prosodic features
cost
rhythm
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2006101511527A
Other languages
Chinese (zh)
Inventor
张鹏
王丽红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Heilongjiang University
Original Assignee
Heilongjiang University
Application filed by Heilongjiang University
Priority to CNA2006101511527A
Publication of CN101004909A
Legal status: Pending

Abstract

A method for selecting Chinese speech synthesis primitives based on prosodic features comprises: inputting a target primitive sequence; passing the primitive sequence to a matching step; passing the candidate primitive sequence whose prosodic features match to a splicing step; judging the candidates by splicing cost and passing the selected candidate primitive sequence to an optimization screening step; and outputting the optimally screened primitive sequence to the speech synthesis program.

Description

Method for selecting Chinese speech synthesis primitives based on prosodic features
(1) Technical field
The present invention relates to the field of speech processing technology, and specifically to a method for selecting Chinese speech synthesis primitives based on prosodic features.
(2) Background art
Primitive selection in existing speech synthesis technology generally uses either a statistical method or a rule-based method. Statistical primitive selection suits large-scale primitive selection, while rule-based primitive selection suits certain special cases. Practice shows that the prosodic model of Chinese and its prosodic control are very complex, and that under existing conditions a primitive selection method based on a single approach cannot satisfy the requirements that speech synthesis places on prosodic features and their control. Because Chinese prosody is highly variable, even the same sentence can yield different prosodic combinations in different contexts. A better method is therefore needed to adapt to this variability.
(3) Summary of the invention
The object of the present invention is to provide a method for selecting speech synthesis primitives based on prosodic features that combines statistical primitive selection with prosodic-feature primitive selection, so that the synthesized speech satisfies the target prosodic features to the greatest possible extent.
The object of the invention is achieved by the following technical solution, which comprises the following computer-implementable steps:
Inputting a target primitive sequence;
A primitive determination step: the input target primitive sequence is sent to a statistical module, which uses a statistical method to select from the sound bank, by index, the candidate primitive sequence corresponding to the target primitive sequence, and outputs the candidate primitive sequence to the matching step;
A matching step: the candidate primitive sequence from the primitive determination step is sent to the prosodic matching cost primitive selection module, which selects from the sound bank, according to the prosodic matching cost, the candidate primitives whose prosodic features match each individual target primitive, and outputs the candidate primitive sequence to the splicing step;
A splicing step: the candidate primitive sequence from the matching step is sent to the prosodic splicing cost primitive selection module, which judges from the splicing cost whether adjacent candidate primitives meet the requirements on prosodic features, selects the candidate primitives that satisfy the splicing cost requirement, and outputs the candidate primitive sequence to the optimization screening step;
An optimization screening step: the candidate primitive sequence from the splicing step is sent to the primitive selection module for prosodic matching cost and prosodic splicing cost, which finds the candidate primitives on the path that minimizes the combined prosodic matching cost and prosodic splicing cost, and outputs the optimally screened primitive sequence to the speech synthesis program.
The present invention also has the following features:
1. The processing procedure of the statistical module is:
First set the conditional probability threshold, then set the maximum number of candidate primitives M;
For the input target primitive sequence and its corresponding pinyin code sequence, compute, one by one from beginning to end, the conditional probability in the sound bank of each target primitive and its corresponding pinyin code primitive;
If the conditional probability is greater than or equal to the preset threshold, select the candidate primitive from the sound bank and keep it in the buffer; otherwise continue searching the sound bank for candidate primitives, then compute the conditional probability for the next target primitive;
Stop searching once M primitives have been found; otherwise search the whole sound bank;
Proceed in the same way until every input target primitive has been searched, then output the candidate primitive sequence kept in the buffer and exit the module;
2. The processing procedure of the prosodic matching cost primitive selection module is:
First, set the maximum number M of candidate primitives per target primitive and load the value of p for the p-dimensional prosodic feature parameter vector of a target primitive;
For the input pinyin code sequence annotated with prosodic feature parameters, compute, one by one from beginning to end, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each target primitive and that of each primitive in the sound bank; once the whole sound bank has been searched, keep in the buffer the M candidate primitives with the smallest weighted sums;
Proceed in the same way until the whole input text has been searched, then output the candidate primitives kept in the buffer and exit the module;
3. The processing procedure of the prosodic splicing cost primitive selection module is:
First, set the maximum number of candidate primitives M, then set the dimension p of the p-dimensional prosodic feature parameter vector of the candidate primitives;
Following the order of the input pinyin code sequence, compute, one by one from beginning to end, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each candidate primitive and that of its preceding candidate primitive, and keep in the buffer the primitive with the smallest weighted sum;
Proceed in the same way until the whole input text sequence has been searched, then output the primitives kept in the buffer and exit the module;
4. The processing procedure of the primitive selection module for prosodic matching cost and prosodic splicing cost is:
First set the maximum number of candidate primitives M, then set the dimension p of the p-dimensional prosodic feature parameter vector of the candidate primitives;
Following the order of the input pinyin code sequence, compute, one by one from beginning to end, the weighted sum of the matching cost and the splicing cost of the p-dimensional prosodic feature parameter vector of each candidate primitive, and keep it in the buffer;
Proceed in the same way until the whole input text has been searched; from the candidate primitives that satisfy the conditions, the candidate primitives on the path that minimizes the weighted sum are output as the final optimally screened primitives;
5. The prosodic matching cost is the weighted sum of the differences between the prosodic feature vector of the current candidate primitive in the preset sound bank and the prosodic feature vector of its target primitive, where the weights satisfy the normalization condition;
6. The prosodic splicing cost between any two candidate primitives is the weighted sum of the differences between the prosodic feature vector of the current candidate primitive and that of its adjacent preceding candidate primitive, where the weights satisfy the normalization condition;
7. The path that minimizes the weighted sum of the prosodic matching cost and the prosodic splicing cost is found with the Viterbi search algorithm;
8. The optimization screening uses the Viterbi search algorithm.
The advantages of the present invention are:
(1) It proposes a method for selecting Chinese speech synthesis primitives based on prosodic features, providing the speech synthesis system with better candidate primitives;
(2) Using prosodic features as a key criterion for primitive selection meets the prosodic requirements of natural speech and at the same time reduces the difficulty of prosodic control in the later stages of speech synthesis;
(3) Combining statistical primitive selection with prosodic-feature primitive selection better satisfies the requirements of complex and variable prosody.
Chinese differs from Western languages in many respects, including syntactic structure, grammatical rules, acoustic characteristics, and prosodic features. First, in Chinese each character corresponds to one syllable, that is, characters are monosyllabic. Second, Chinese is a tonal language: tone distinguishes meaning, and each character has a fixed tone (fundamental frequency contour). In connected speech, however, the tones of neighbouring characters influence each other and may change, sometimes losing their original tone pattern altogether; this is the coarticulation (tone sandhi) phenomenon. In addition, there are brief pauses within a continuous utterance. Every speaker has a basic frequency when speaking, called the fundamental frequency, which reflects the pitch of the speaker's voice; speakers also differ in loudness, and so on. In Chinese text-to-speech (TTS) systems, the prediction, analysis, and control of prosodic information such as fundamental frequency, duration, and amplitude is called prosodic control.
In view of this, the inventors started from the phonetic characteristics of Chinese and studied its tones and their characteristics and its intonation and intonation patterns. By studying Chinese prosodic features and prosodic analysis methods for Chinese speech synthesis, they revealed the intrinsic relationship among the prosodic features, prosodic rules, and prosodic models of Chinese and, starting from prosodic features, constructed a method for selecting speech synthesis primitives based on prosodic features. This enriches and improves primitive selection for speech synthesis while greatly improving the naturalness of synthesized Chinese speech. The primitive selection method of the invention is reasonable and practical in structure; each of its steps and modules is a computer program process; the method is general and portable, applies to a wide range of settings, and can be widely used in prosody-based speech synthesis as a primitive selection method for a new generation of speech synthesis technology.
(4) Description of the drawings
Fig. 1 is a block diagram of the prosodic-feature-based method for selecting speech synthesis primitives;
Fig. 2 is a flowchart of primitive selection by the statistical method;
Fig. 3 is a flowchart of primitive selection by the matching cost method;
Fig. 4 is a flowchart of primitive selection by the splicing cost method;
Fig. 5 is a schematic diagram of the paths in the matching cost and splicing cost primitive selection method;
Fig. 6 is a block diagram of the computer hardware system of an embodiment of the invention.
(5) Embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment:
In this embodiment the input target sequence is the pinyin code string annotated with prosodic information that the text analysis module produces, i.e. the prosodic annotation sequence. For this input pinyin code string, the invention uses the prosodic-feature-based primitive selection method to screen suitable speech primitives from the sound bank and passes them to the speech synthesis program for further synthesis processing.
This embodiment comprises the following computer-implementable steps:
Inputting a target primitive sequence;
A primitive determination step: the input target primitive sequence is sent to a statistical module, which uses a statistical method to select from the sound bank, by index, the candidate primitive sequence corresponding to the target primitive sequence, and outputs the candidate primitive sequence to the matching step;
A matching step: the candidate primitive sequence from the primitive determination step is sent to the prosodic matching cost primitive selection module, which selects from the sound bank, according to the prosodic matching cost, the candidate primitives whose prosodic features match each individual target primitive, and outputs the candidate primitive sequence to the splicing step;
A splicing step: the candidate primitive sequence from the matching step is sent to the prosodic splicing cost primitive selection module, which judges from the splicing cost whether adjacent candidate primitives meet the requirements on prosodic features, selects the candidate primitives that satisfy the splicing cost requirement, and outputs the candidate primitive sequence to the optimization screening step;
An optimization screening step: the candidate primitive sequence from the splicing step is sent to the primitive selection module for prosodic matching cost and prosodic splicing cost, which finds the candidate primitives on the path that minimizes the combined prosodic matching cost and prosodic splicing cost, and outputs the optimally screened primitive sequence to the speech synthesis program.
Each step and module in the invention and in this embodiment is a computer program process.
1. Primitive selection based on the statistical method
From the Bayesian point of view, natural speech is regarded as a random sequence, and each speech primitive in the speech stream is a random variable with a certain distribution. In a TTS system based on a large sound bank, the main idea of speech synthesis is, given a text sequence $W=(w_1, w_2, \ldots, w_j, \ldots, w_n)$, to find the speech primitive sequence $V=(v_1, v_2, \ldots, v_j, \ldots, v_n)$ with maximum probability, so as to achieve high naturalness and expressiveness. For a target primitive $w_j$ in the text sequence $W$, the sound bank contains several candidate speech primitives, denoted $(v_{j,1}, v_{j,2}, \ldots, v_{j,k}, \ldots, v_{j,m})$, where $V_{j,c}$ is the selected candidate primitive, $v_{j,k}$ is the $k$-th candidate speech primitive corresponding to $w_j$, and $m$ is the number of its candidate primitives. Primitive selection based on the statistical method can then be described as:
\[ V_{j,c} = \arg\max_{1 \le k \le m} P(v_{j,k} \mid w_j) = \arg\max_{1 \le k \le m} P(w_j \mid v_{j,k})\, P(v_{j,k}) \tag{1} \]
From the formula above, primitive selection based on the statistical method involves two tasks. The first is to compute $P(w_j \mid v_{j,k})$: following the statistical approach, build a sound bank containing a large amount of data, annotate it, and build the sound selection model. The second is to build the prosodic model, i.e. to compute $P(v_{j,k})$.
The input to the statistical primitive selection module is the input text and its pinyin code sequence. The probability threshold is set in order to select suitable primitives from the sound bank. Its value is determined by either a maximum-probability or a minimum-probability generation criterion. Under the maximum-probability criterion the conditional probability threshold is relatively large: a primitive in the sound bank qualifies as a candidate only if its conditional probability is greater than or equal to this threshold, so relatively few candidate primitives are selected for optimization screening. Under the minimum-probability criterion the threshold is relatively small: any primitive in the sound bank whose conditional probability is greater than or equal to this threshold qualifies, so more candidate primitives are selected for optimization screening. A candidate primitive is one that has passed this screening but must still take part in optimization screening. Optimization screening selects from the candidate primitives the single primitive that satisfies the prosodic feature requirements. The value of the probability threshold can be tuned according to the actual results of optimization screening.
The maximum number of candidate primitives is set so that, when there are many candidates, the number selected does not exceed this maximum; limiting the number of candidates speeds up optimization screening. For example, only the first M primitives that satisfy the conditional probability threshold take part in the subsequent optimization screening. The value of M can be chosen according to the computation speed required for optimization screening, or left unrestricted.
The processing procedure of the statistical primitive selection module is as follows. First set the conditional probability threshold, then set the maximum number of candidate primitives M. Next, for the input text sequence and its corresponding pinyin code sequence, compute the conditional probability in the sound bank of each target primitive, one by one from the beginning of the sentence to the end; if the conditional probability meets the threshold, keep the primitive in the buffer, otherwise continue searching the sound bank. Stop searching once M primitives have been found; otherwise search the whole sound bank. In this way each target primitive can select several primitives from the sound bank, at most M. Proceed in the same way until the whole input text sequence has been searched. The output of the module is a sequence of primitive groups screened by the statistical method.
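The threshold-and-cap procedure above can be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function name `select_by_probability`, the representation of the sound bank as a plain list, and the caller-supplied `cond_prob` function are all assumptions introduced here.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

Unit = TypeVar("Unit")      # a stored primitive in the sound bank (representation assumed)
Target = TypeVar("Target")  # a target primitive from the input sequence


def select_by_probability(
    targets: Sequence[Target],
    bank: Sequence[Unit],
    cond_prob: Callable[[Unit, Target], float],  # hypothetical probability model
    threshold: float,
    max_candidates: Optional[int] = None,        # M; None means no limit on M
) -> List[List[Unit]]:
    """For each target, keep bank units whose conditional probability meets
    the threshold, stopping early once M units have been found."""
    groups: List[List[Unit]] = []
    for target in targets:                  # from sentence start to sentence end
        kept: List[Unit] = []               # plays the role of the buffer
        for unit in bank:                   # search the sound bank in order
            if cond_prob(unit, target) >= threshold:
                kept.append(unit)
                if max_candidates is not None and len(kept) >= max_candidates:
                    break                   # M primitives found: stop searching
        groups.append(kept)
    return groups
```

Raising `threshold` corresponds to the stricter maximum-probability criterion (fewer candidates); lowering it corresponds to the minimum-probability criterion (more candidates).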
Optimization screening can be implemented directly with a search algorithm such as Viterbi, or with the prosodic matching cost and prosodic splicing cost methods described below.
2. Primitive selection based on the prosodic matching cost
Because of coarticulation in speech, adjacent syllables and words in Chinese influence each other in their prosodic and acoustic characteristics. The matching cost of a primitive reflects how well the prosodic features of the current candidate primitive match those of the target primitive. For a known text sequence $W=(w_1, w_2, \ldots, w_j, \ldots, w_n)$, the corresponding speech sequence to be selected is $V=(v_1, v_2, \ldots, v_j, \ldots, v_n)$, where $j$ is the index of the speech primitive and $n$ is the number of target primitives in the text sequence. For a target primitive $w_j$ in $W$, the sound bank contains several candidate speech primitives, denoted $(v_{j,1}, v_{j,2}, \ldots, v_{j,k}, \ldots, v_{j,m})$, where $v_{j,k}$ is the $k$-th candidate speech primitive corresponding to $w_j$ and $m$ is the number of its candidate primitives. Each candidate speech primitive in the sound bank can be described by a $p$-dimensional prosodic feature parameter vector $\vec{D}_k = (d_k^1, \ldots, d_k^i, \ldots, d_k^p)$, where $p$ is the number of prosodic feature parameters and $k$ is the index of the current candidate primitive, so that $d_k^i$ is the $i$-th prosodic feature parameter of the $k$-th candidate primitive. For a target primitive to be synthesized, text analysis and prosodic modeling yield its target prosodic feature vector $\vec{D} = (d^1, \ldots, d^i, \ldots, d^p)$. We define the prosodic matching cost as the weighted sum of the differences between the prosodic feature vector of the current candidate primitive and that of its target primitive, that is:
\[ D_u = \arg\min D_u(\vec{D}, \vec{D}_k) = \arg\min \sum_{i=1}^{p} \omega_u^i \left| d^i - d_k^i \right| \tag{2} \]
where $\omega_u^i$ is the prosodic matching weight coefficient, which satisfies the normalization condition $\sum_{i=1}^{p} \omega_u^i = 1$ with $\omega_u^i \ge 0$. During primitive selection it therefore suffices to select from the sound bank, according to the target prosodic feature vector $\vec{D}$, a suitable sample whose prosodic feature vector $\vec{D}_k$ is closest to the target prosodic feature vector $\vec{D}$.
Primitive selection must therefore solve two key problems: (1) selecting the speech prosodic feature parameters; (2) setting or training the weight vector according to the relationships among the prosodic parameters and the influence of each parameter on the overall prosodic features.
The processing procedure of the prosodic matching cost primitive selection module is as follows. First, set the maximum number M of candidate primitives per target primitive, then load the value of p for the p-dimensional prosodic feature parameter vector of a target primitive. Next, for the input pinyin code sequence annotated with prosodic feature parameters, compute, one by one from beginning to end, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each target primitive and that of each speech primitive in the sound bank; once the whole sound bank has been searched, keep in the buffer the M candidate primitives with the smallest weighted sums. Proceed in the same way until the whole input text sequence has been searched.
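As a rough illustration of the matching cost and the keep-the-best-M step, the following Python sketch computes the weighted sum of absolute differences for one target primitive and returns the M cheapest candidates. The function names, the tuple representation of prosodic feature vectors, and the tolerance used for the normalization check are assumptions made for this example only.

```python
import heapq
import math
from typing import List, Sequence

Vector = Sequence[float]  # a p-dimensional prosodic feature vector (assumed layout)


def check_normalized(weights: Vector) -> None:
    """Raise ValueError unless the weights are non-negative and sum to 1."""
    if any(w < 0 for w in weights) or not math.isclose(sum(weights), 1.0, abs_tol=1e-9):
        raise ValueError("weights must be non-negative and sum to 1")


def matching_cost(target: Vector, candidate: Vector, weights: Vector) -> float:
    """Weighted sum of absolute differences between target and candidate features."""
    return sum(w * abs(t - c) for w, t, c in zip(weights, target, candidate))


def top_m_by_matching_cost(
    target: Vector, candidates: Sequence[Vector], weights: Vector, m: int
) -> List[Vector]:
    """Keep the m candidates with the smallest matching cost for one target."""
    check_normalized(weights)
    return heapq.nsmallest(m, candidates, key=lambda c: matching_cost(target, c, weights))
```

Calling `top_m_by_matching_cost` once per target primitive, in sentence order, yields one group of at most M candidates per position.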
3. Primitive selection based on the prosodic splicing cost
In general, directly splicing together the best primitives individually screened by the matching cost does not necessarily give the best final synthesized sentence. The matching cost considers only how well a single candidate primitive matches its target primitive on prosodic features; it ignores the splicing match between syllables, which affects the naturalness of the synthesized sentence even more noticeably. In other words, an individually optimal primitive is not necessarily globally optimal.
The splicing cost reflects how well prosodic continuity is preserved between consecutive speech primitives. For synthesized speech, each candidate primitive can be described by a $p$-dimensional prosodic feature parameter vector $\vec{D}_j = (d_j^1, \ldots, d_j^i, \ldots, d_j^p)$, and the prosodic feature parameter vector of its preceding candidate primitive is $\vec{D}_{j-1} = (d_{j-1}^1, \ldots, d_{j-1}^i, \ldots, d_{j-1}^p)$. To make the spectral transition between speech synthesis primitives more natural, we define the prosodic splicing cost between any two candidate primitives as the weighted sum of the differences between the prosodic feature vector of the current candidate primitive and that of its adjacent preceding candidate primitive, that is:
\[ D_c = \arg\min D_c(\vec{D}_j, \vec{D}_{j-1}) = \arg\min \sum_{i=1}^{p} \omega_c^i \left| d_j^i - d_{j-1}^i \right| \tag{3} \]
where $\omega_c^i$ is the prosodic splicing weight coefficient, which satisfies the normalization condition $\sum_{i=1}^{p} \omega_c^i = 1$ with $\omega_c^i \ge 0$. Thus, provided the matching distance between candidate primitive and target primitive is held fixed, it suffices to minimize the splicing distance between the prosodic features of the current candidate primitive and the prosodic feature parameters of its adjacent preceding candidate primitive.
The processing procedure of the prosodic splicing cost primitive selection module is as follows. First, set the maximum number of candidate primitives M, then set the dimension p of the p-dimensional prosodic feature parameter vector of the candidate primitives. Next, following the order of the input pinyin code sequence, compute, one by one from beginning to end, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each candidate primitive and that of its preceding candidate primitive, and keep in the buffer the candidate primitive with the smallest weighted sum. Proceed in the same way until the whole input text sequence has been searched.
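Read literally, the splicing module walks the sequence left to right and keeps, at each position, the candidate closest to its predecessor. The Python sketch below follows that reading under stated assumptions: the text does not say how the first position is decided, so this sketch simply takes the first candidate of the first group, and the names `splicing_cost` and `greedy_splice` are invented for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]  # a p-dimensional prosodic feature vector (assumed layout)


def splicing_cost(current: Vector, previous: Vector, weights: Vector) -> float:
    """Weighted sum of absolute differences between two adjacent candidates."""
    return sum(w * abs(a - b) for w, a, b in zip(weights, current, previous))


def greedy_splice(groups: Sequence[Sequence[Vector]], weights: Vector) -> List[Vector]:
    """Pick one candidate per position, each minimizing the splicing cost to the
    candidate kept at the previous position (a greedy, left-to-right pass)."""
    if not groups or any(not group for group in groups):
        raise ValueError("every position needs at least one candidate")
    chosen: List[Vector] = [groups[0][0]]  # assumption: no predecessor, take the first
    for group in groups[1:]:
        previous = chosen[-1]
        chosen.append(min(group, key=lambda c: splicing_cost(c, previous, weights)))
    return chosen
```

A greedy pass like this is cheap but can miss the globally best sequence, which is the motivation for combining both costs in a path search.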
4. Primitive selection based on the prosodic matching cost and the prosodic splicing cost
Likewise, the prosodic matching cost and the prosodic splicing cost can be considered together, so that the candidate speech primitives approach the target prosodic features in both respects. For a known text sequence, a corresponding optimal speech primitive sequence can be found, i.e. the speech primitives on the path that minimizes the following expression are screened from the corpus:
\[ D = \arg\min \left[ \omega_u \sum_{j=1}^{n} D_{uj} + \omega_c \sum_{j=2}^{n} D_{cj} \right] \tag{4} \]
This ensures that the candidate primitives satisfy a complete sentence's naturalness requirements in both prosodic features and spectral transition. Different weights change the relative importance of different features; for example, during primitive selection the fundamental frequency feature can be given more influence than the duration feature. A weight can also be set to zero, so that the influence of a particular feature is ignored during primitive selection. The weights of the matching cost also differ from those of the splicing cost.
The processing procedure of the primitive selection module for prosodic matching cost and prosodic splicing cost is as follows. First, set the maximum number of candidate primitives M, then set the dimension p of the p-dimensional prosodic feature parameter vector of the candidate primitives. Next, following the order of the input pinyin code sequence, compute, one by one from beginning to end, the weighted sum of the matching cost and the splicing cost of the p-dimensional prosodic feature parameter vector of each candidate primitive, and keep it in the buffer. Proceed in the same way until the whole input text sequence has been searched, then select the candidate primitives on the path that minimizes the weighted sum as the final optimally screened primitives.
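The minimum-cost path of formula (4) can be found by dynamic programming. The sketch below is one possible Viterbi-style realization in Python, not the patent's own code: the function names, the tuple representation of feature vectors, and the parameters `omega_u` and `omega_c` (the two global weights of formula (4)) are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]  # a p-dimensional prosodic feature vector (assumed layout)


def weighted_l1(a: Vector, b: Vector, weights: Vector) -> float:
    """Weighted sum of absolute differences between two feature vectors."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def viterbi_select(
    targets: Sequence[Vector],
    groups: Sequence[Sequence[Vector]],  # candidate vectors for each target position
    w_match: Vector,                     # per-feature matching weights
    w_splice: Vector,                    # per-feature splicing weights
    omega_u: float,                      # global weight of the matching cost
    omega_c: float,                      # global weight of the splicing cost
) -> Tuple[List[Vector], float]:
    """Return the candidate path minimizing the combined cost, and that cost."""
    if not targets or len(targets) != len(groups) or any(not g for g in groups):
        raise ValueError("need one non-empty candidate group per target")
    # best[j][k]: cheapest total cost of any path ending at candidate k of position j
    best = [[omega_u * weighted_l1(targets[0], c, w_match) for c in groups[0]]]
    back = [[-1] * len(groups[0])]       # back-pointers to the previous position
    for j in range(1, len(targets)):
        costs, pointers = [], []
        for cand in groups[j]:
            transitions = [
                best[j - 1][k] + omega_c * weighted_l1(cand, prev, w_splice)
                for k, prev in enumerate(groups[j - 1])
            ]
            k_best = min(range(len(transitions)), key=transitions.__getitem__)
            local = omega_u * weighted_l1(targets[j], cand, w_match)
            costs.append(transitions[k_best] + local)
            pointers.append(k_best)
        best.append(costs)
        back.append(pointers)
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    total = best[-1][k]
    path: List[Vector] = []
    for j in range(len(targets) - 1, -1, -1):  # trace the back-pointers
        path.append(groups[j][k])
        k = back[j][k]
    path.reverse()
    return path, total
```

Unlike a per-position choice, the path search may keep a candidate with a slightly worse matching cost when that makes the neighbouring transitions smoother.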
5. Construction of the sound bank
The sound bank is constructed by conventional methods, the same as those generally used to build a database, and is not described in detail here. After weighing various factors, the invention selects the syllable, the smallest perceptually distinguishable unit in Chinese, as the speech synthesis primitive. The sound bank stores multiple samples of each syllable, and the samples differ in stress (light or heavy), tone, and fundamental frequency contour.
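Since the construction of the sound bank is left to conventional methods, the following Python sketch only illustrates the one property stated above, namely that each syllable maps to several stored samples that differ in stress and fundamental frequency contour. The class names, field names, and the pinyin-with-tone-digit keys are hypothetical choices for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Tuple


@dataclass(frozen=True)
class Sample:
    syllable: str                  # pinyin code, e.g. "ma3" (notation assumed)
    stressed: bool                 # light vs. heavy stress
    f0_contour: Tuple[float, ...]  # fundamental frequency values in Hz (assumed unit)


class SoundBank:
    """Toy in-memory sound bank indexed by syllable."""

    def __init__(self) -> None:
        self._by_syllable: DefaultDict[str, List[Sample]] = defaultdict(list)

    def add(self, sample: Sample) -> None:
        self._by_syllable[sample.syllable].append(sample)

    def candidates(self, syllable: str) -> List[Sample]:
        """Return all stored samples of a syllable (empty list if none)."""
        return list(self._by_syllable.get(syllable, []))
```

Indexing by syllable corresponds to selecting candidates "by index" in the primitive determination step; a real sound bank would also store the waveform and further prosodic annotations.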
6, computingasystem environment
Fig. 6 has provided one can implement suitable computingasystem environment of the present invention.This computingasystem environment just can be implemented an embodiment of computingasystem environment of the present invention, and is not to be that range of application of the present invention or function are carried out any restriction.Computing environment should not be considered to that the combination of any one parts shown in the example operational environment or parts is had any dependence or requirement yet.
The present invention can be used for numerous specific or unspecific computingasystem environment or configurations, as: personal computer, small-size computer, medium-size computer, mainframe computer, network computer, server computer, hand or laptop devices, multicomputer system is based on the system of microprocessor, set-top box, the programmable electronic consumption device comprises any above-mentioned system or the distributed computing environment of device, or the like.
Can the use a computer general modfel of executable instruction of the present invention is described, for example the program module of computing machine.Program module comprises program, subroutine, object, control, assembly, data structure etc., and they are used for carrying out specific task or realize specific abstract data type.The present invention also can be applied to distributed computing environment, wherein executes the task by the teleprocessing device that utilizes the communication network link.In distributed computing environment, program module can leave in the local and remote computer-readable storage medium that comprises memory storage apparatus simultaneously.
Implement canonical system of the present invention and comprise a computing machine or a calculation element that is used for nonspecific purpose.The system bus that its parts comprise one or more processing units, internal storage, external memory storage, input interface, output interface and connect above-mentioned each unit or parts.System bus can be any bus structure that comprise in the bus structure of following several types: memory bus or memory controller, a peripheral bus and use the local bus of bus in the various bus structure.These bus structure: as industrial standard architectures (ISA) bus, MCA (MCA) bus, the ISA line of enhancing, VESA (VESA), local bus with peripheral component interconnect (PCI) bus (also be mezzanine bus (Mezzanine bus), or the like.
The user can enter commands and information into the computer through input devices. These input devices can be a keyboard, a microphone, and a pointing device such as a mouse, trackball, or touch pad. Other input devices (not shown in the figure) are also possible, for example a joystick, game pad, satellite dish, or scanner. These input devices are usually connected to the processing unit through a user input interface coupled to the system bus, but they can also be connected through other interfaces and bus structures, such as a parallel port, game port, or Universal Serial Bus (USB). A monitor or other type of display device is connected to the system bus through an interface such as a video interface. In addition to the monitor, the computer may include other peripheral output devices, such as speakers and a printer, which are connected through a peripheral output interface.
The computer can operate in a networked environment through logical connections to one or more remote computers. A remote computer can be a personal computer, handheld device, server, router, network computer, peer device, or other common network node, and it typically includes many or all of the components described above for the computer. The logical connections shown in Fig. 6 include a local area network (LAN) and a wide area network (WAN), but may also include other networks. Such networking environments are common in offices, enterprise-wide computer networks, intranets, and the Internet.

Claims (9)

1. A method for selecting Chinese speech synthesis primitives based on prosodic features, characterized in that it comprises the following computer-implementable steps:
Inputting a target primitive sequence;
A primitive determination step: the input target primitive sequence is sent to a statistical module, which uses a statistical method to select from the speech database, according to an index, the candidate primitive sequence corresponding to the target primitive sequence, and outputs the candidate primitive sequence to the matching step;
A matching step: the candidate primitive sequence from the primitive determination step is sent to a prosodic matching cost primitive selection module, which selects from the speech database, according to the prosodic matching cost, the candidate primitives whose prosodic features match each individual target primitive, and outputs the candidate primitive sequence to the splicing step;
A splicing step: the candidate primitive sequence from the matching step is sent to a prosodic splicing cost primitive selection module, which judges according to the splicing cost whether adjacent candidate primitives meet the requirements on prosodic features, selects the candidate primitives that satisfy the splicing cost requirement, and finally outputs the candidate primitive sequence to the optimization screening step;
An optimization screening step: the candidate primitive sequence from the splicing step is sent to a primitive selection module that combines the prosodic matching cost and the prosodic splicing cost, which finds, according to these two costs, the candidate primitives on the path that minimizes the prosodic matching cost and the prosodic splicing cost, and finally outputs the optimized primitive sequence to the speech synthesis program.
2. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the processing procedure of the statistical module is:
First, a conditional probability threshold is set, and then the maximum number M of candidate primitives is set;
For the input target primitive sequence and its corresponding pinyin code sequence, the conditional probability in the speech database of each target primitive and its corresponding pinyin code primitive is calculated one by one, from first to last;
If the conditional probability is judged to be greater than or equal to the preset threshold, the candidate primitive is selected from the speech database and kept in a buffer; otherwise, the search for candidate primitives in the speech database continues, and the conditional probability of the next target primitive is then calculated;
The search stops as soon as M primitives have been found; otherwise the whole speech database is searched;
And so on, until every input target primitive has been searched; the candidate primitive sequence kept in the buffer is output and the module exits.
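The statistical module of claim 2 can be illustrated with a minimal Python sketch. The claim does not say how the conditional probability is estimated, so the sketch takes it as a caller-supplied function; the names `select_candidates`, `cond_prob`, and the dictionary layout of a database unit are hypothetical and not taken from the patent.

```python
from typing import Callable, Dict, List, Sequence


def select_candidates(
    targets: Sequence[str],
    database: Sequence[Dict],
    cond_prob: Callable[[str, Dict], float],
    threshold: float,
    max_candidates: int,
) -> List[List[Dict]]:
    """For each target primitive, collect up to max_candidates (M) database
    units whose conditional probability reaches the preset threshold."""
    result: List[List[Dict]] = []
    for target in targets:  # process targets one by one, first to last
        buffer: List[Dict] = []
        for unit in database:
            if cond_prob(target, unit) >= threshold:
                buffer.append(unit)  # keep the candidate in the buffer
                if len(buffer) >= max_candidates:
                    break  # stop as soon as M candidates are found
        # if fewer than M were found, the whole database has been scanned
        result.append(buffer)
    return result


# Toy stand-in for the conditional probability (an assumption for the demo):
# 1.0 when the unit carries the same pinyin code as the target, else 0.0.
def toy_cond_prob(target: str, unit: Dict) -> float:
    return 1.0 if unit["pinyin"] == target else 0.0


toy_db = [
    {"id": 0, "pinyin": "ni3"},
    {"id": 1, "pinyin": "hao3"},
    {"id": 2, "pinyin": "ni3"},
    {"id": 3, "pinyin": "ni3"},
]
picked = select_candidates(["ni3", "hao3"], toy_db, toy_cond_prob, 0.5, 2)
```

In this toy run the first target stops after two matching units, while the second target scans the whole database and finds only one.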
3. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the processing procedure of the prosodic matching cost primitive selection module is:
First, the maximum number M of candidate primitives for a target primitive is set, and the dimension p of the target primitive's p-dimensional prosodic feature parameter vector is loaded;
For the input pinyin code sequence annotated with prosodic feature parameters, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each primitive and the p-dimensional prosodic feature parameter vector of each primitive in the speech database is calculated one by one, from first to last; once the whole speech database has been searched, the M candidate primitives with the smallest weighted sums are kept in the buffer;
And so on, until the whole input text has been searched and calculated; the candidate primitives kept in the buffer are output and the module exits.
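A minimal Python sketch of the matching-cost selection in claim 3 follows. It assumes that the "difference" between two feature vectors is the per-dimension absolute difference and that the weights sum to 1 (the normalization condition of claim 6); the function names and the toy feature layout (pitch, duration) are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def weighted_distance(a: Sequence[float], b: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Weighted sum of per-dimension absolute differences (assumed form)."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def top_m_by_match_cost(
    target: Sequence[float],
    database: Sequence[Sequence[float]],
    weights: Sequence[float],
    m: int,
) -> List[Tuple[float, int]]:
    """Scan the whole database and return the M (cost, index) pairs with the
    smallest prosodic matching cost, in ascending order of cost."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must satisfy the normalization condition")
    scored = (
        (weighted_distance(target, feats, weights), idx)
        for idx, feats in enumerate(database)
    )
    return heapq.nsmallest(m, scored)


# Toy p = 2 feature vectors: (pitch in Hz, duration in s).
target_vec = [200.0, 0.3]
db_vecs = [[210.0, 0.3], [200.0, 0.5], [260.0, 0.3], [201.0, 0.3]]
best_two = top_m_by_match_cost(target_vec, db_vecs, [0.5, 0.5], 2)
```

In practice the feature dimensions would likely be scaled to comparable ranges before weighting; the raw values here only keep the example short.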
4. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the processing procedure of the prosodic splicing cost primitive selection module is:
First, the maximum number M of candidate primitives is set, and then the dimension p of the candidate primitives' p-dimensional prosodic feature parameter vector is set;
Following the order of the input pinyin code sequence, the weighted sum of the differences between the p-dimensional prosodic feature parameter vector of each candidate primitive and the p-dimensional prosodic feature parameter vector of its previous candidate primitive is calculated one by one, from first to last, and the primitive with the smallest weighted sum is kept in the buffer;
And so on, until the whole input text sequence has been searched and calculated; the primitives kept in the buffer are output and the module exits.
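The left-to-right procedure of claim 4 can be sketched in Python as a greedy pass. The claim does not say how the first position is chosen, since it has no predecessor, so the sketch simply takes its first candidate; that choice, the absolute-difference form of the cost, and all names are assumptions made for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def splice_cost(prev: Vector, cur: Vector, weights: Sequence[float]) -> float:
    """Weighted sum of per-dimension absolute differences between the
    prosodic feature vectors of two adjacent candidates (assumed form)."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, prev, cur))


def greedy_splice(
    candidates_per_position: Sequence[Sequence[Vector]],
    weights: Sequence[float],
) -> List[Vector]:
    """At each position, keep the candidate with the smallest splicing cost
    relative to the candidate kept at the previous position."""
    if not candidates_per_position:
        return []
    # Assumption: the first position has no predecessor, take its first entry.
    kept: List[Vector] = [candidates_per_position[0][0]]
    for candidates in candidates_per_position[1:]:
        prev = kept[-1]
        best = min(candidates, key=lambda c: splice_cost(prev, c, weights))
        kept.append(best)
    return kept


# Toy p = 1 example: one pitch value per candidate.
toy_candidates = [
    [[200.0]],
    [[250.0], [205.0]],
    [[100.0], [210.0], [204.0]],
]
kept_path = greedy_splice(toy_candidates, [1.0])
```

A greedy pass like this only looks one step back, which is why the later optimization screening step searches over whole paths instead.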
5. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the processing procedure of the primitive selection module that combines the prosodic matching cost and the prosodic splicing cost is:
First, the maximum number M of candidate primitives is set, and then the dimension p of the candidate primitives' p-dimensional prosodic feature parameter vector is set;
Following the order of the input pinyin code sequence, the weighted sum of the matching cost and the splicing cost of the p-dimensional prosodic feature parameter vector of each candidate primitive is calculated one by one, from first to last, and kept in the buffer;
And so on, until the whole input text has been searched and calculated; from the candidate primitives that satisfy the conditions, the candidate primitives on the path that minimizes the weighted sum are selected and output as the final optimized primitives.
6. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the prosodic matching cost is the weighted sum of the differences between the prosodic feature vector of the current candidate primitive in the preset speech database and the prosodic feature vector of its target primitive, where the weights satisfy the normalization condition.
7. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the prosodic splicing cost of any two candidate primitives is the weighted sum of the differences between the prosodic feature vector of the current candidate primitive and the prosodic feature vector of the adjacent previous candidate primitive, where the weights satisfy the normalization condition.
8. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 5, characterized in that the step of finding the path with the minimum weighted sum of the prosodic matching cost and the prosodic splicing cost uses the Viterbi search algorithm.
9. The method for selecting Chinese speech synthesis primitives based on prosodic features according to claim 1, characterized in that the optimization screening uses the Viterbi search algorithm.
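The path search named in claims 5, 8 and 9 can be sketched as a standard Viterbi dynamic program over the candidate lattice. The sketch below is a minimal illustration under stated assumptions, not the patent's implementation: the combination weight `alpha` between matching and splicing cost, the absolute-difference form of both costs, and all function names are hypothetical, and every position is assumed to have at least one candidate.

```python
from typing import List, Sequence

Vector = Sequence[float]


def weighted_distance(a: Vector, b: Vector, weights: Sequence[float]) -> float:
    """Weighted sum of per-dimension absolute differences (assumed form)."""
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))


def viterbi_select(
    targets: Sequence[Vector],
    candidates: Sequence[Sequence[Vector]],
    w_match: Sequence[float],
    w_splice: Sequence[float],
    alpha: float = 0.5,
) -> List[int]:
    """Return, for each position, the index of the candidate on the path that
    minimizes alpha * matching cost + (1 - alpha) * splicing cost."""
    n = len(targets)
    if n == 0:
        return []
    # best[i][k]: minimum total cost of any path ending at candidate k of i.
    best = [[alpha * weighted_distance(targets[0], c, w_match)
             for c in candidates[0]]]
    back: List[List[int]] = [[-1] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for cur in candidates[i]:
            match = alpha * weighted_distance(targets[i], cur, w_match)
            # Total cost of reaching `cur` from each candidate at i - 1.
            costs = [
                best[i - 1][j]
                + (1 - alpha) * weighted_distance(prev, cur, w_splice)
                for j, prev in enumerate(candidates[i - 1])
            ]
            j_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[j_best] + match)
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate.
    k = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [k]
    for i in range(n - 1, 0, -1):
        k = back[i][k]
        path.append(k)
    path.reverse()
    return path


# Toy p = 1 lattice: two positions, two candidates each.
toy_targets = [[100.0], [120.0]]
toy_lattice = [[[100.0], [104.0]], [[120.0], [110.0]]]
smooth_path = viterbi_select(toy_targets, toy_lattice, [1.0], [1.0], alpha=0.2)
match_only = viterbi_select(toy_targets, toy_lattice, [1.0], [1.0], alpha=1.0)
```

With `alpha=1.0` the search reduces to picking the best-matching candidate at each position, while a small `alpha` favors smooth joins; in the toy lattice this changes the selected path, which is the trade-off the combined cost is meant to balance.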
CNA2006101511527A 2007-02-16 2007-02-16 Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm Pending CN101004909A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2006101511527A CN101004909A (en) 2007-02-16 2007-02-16 Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2006101511527A CN101004909A (en) 2007-02-16 2007-02-16 Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm

Publications (1)

Publication Number Publication Date
CN101004909A true CN101004909A (en) 2007-07-25

Family

ID=38704004

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2006101511527A Pending CN101004909A (en) 2007-02-16 2007-02-16 Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm

Country Status (1)

Country Link
CN (1) CN101004909A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178896B (en) * 2007-12-06 2012-03-28 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN106356052A (en) * 2016-10-17 2017-01-25 腾讯科技(深圳)有限公司 Voice synthesis method and device
CN108172211A (en) * 2017-12-28 2018-06-15 云知声(上海)智能科技有限公司 Adjustable waveform concatenation system and method
CN115294958A (en) * 2022-06-28 2022-11-04 北京奕斯伟计算技术股份有限公司 Unit selection method and device for speech synthesis
CN115294958B (en) * 2022-06-28 2024-07-02 北京奕斯伟计算技术股份有限公司 Unit selection method and device for speech synthesis

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178896B (en) * 2007-12-06 2012-03-28 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN106356052A (en) * 2016-10-17 2017-01-25 腾讯科技(深圳)有限公司 Voice synthesis method and device
CN106356052B (en) * 2016-10-17 2019-03-15 腾讯科技(深圳)有限公司 Phoneme synthesizing method and device
US10832652B2 (en) 2016-10-17 2020-11-10 Tencent Technology (Shenzhen) Company Limited Model generating method, and speech synthesis method and apparatus
CN108172211A (en) * 2017-12-28 2018-06-15 云知声(上海)智能科技有限公司 Adjustable waveform concatenation system and method
CN115294958A (en) * 2022-06-28 2022-11-04 北京奕斯伟计算技术股份有限公司 Unit selection method and device for speech synthesis
CN115294958B (en) * 2022-06-28 2024-07-02 北京奕斯伟计算技术股份有限公司 Unit selection method and device for speech synthesis

Similar Documents

Publication Publication Date Title
JP6916264B2 (en) Real-time speech recognition methods based on disconnection attention, devices, equipment and computer readable storage media
CN101000765B (en) Speech synthetic method based on rhythm character
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
JP7167074B2 (en) Speech recognition method, device, equipment and computer-readable storage medium
CN101000764B (en) Speech synthetic text processing method based on rhythm structure
CN110050302B (en) Speech synthesis
CN107464559B (en) Combined prediction model construction method and system based on Chinese prosody structure and accents
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN101192404B (en) System and method for identifying accent of input sound
CN101064103B (en) Chinese voice synthetic method and system based on syllable rhythm restricting relationship
JP6815899B2 (en) Output statement generator, output statement generator and output statement generator
JP7051919B2 (en) Speech recognition and decoding methods based on streaming attention models, devices, equipment and computer readable storage media
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
CN109976702A (en) A kind of audio recognition method, device and terminal
Hono et al. Sinsy: A deep neural network-based singing voice synthesis system
JP2019159654A (en) Time-series information learning system, method, and neural network model
CN110600002B (en) Voice synthesis method and device and electronic equipment
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
CN104392716B (en) The phoneme synthesizing method and device of high expressive force
Alías et al. Towards high-quality next-generation text-to-speech synthesis: A multidomain approach by automatic domain classification
Pollet et al. Unit Selection with Hierarchical Cascaded Long Short Term Memory Bidirectional Recurrent Neural Nets.
CN101004909A (en) Method for selecting primitives for synthesizing Chinese voice based on characters of rhythm
Mei et al. A particular character speech synthesis system based on deep learning
Wang et al. Synthesizing spoken descriptions of images
Zhou et al. Learning and Modeling Unit Embeddings for Improving HMM-based Unit Selection Speech Synthesis.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20070725