CN105895075B - Method and system for improving the prosodic naturalness of synthesized speech
- Publication number: CN105895075B
- Application number: CN201510038454.2A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
The invention discloses a method and system for improving the prosodic naturalness of synthesized speech. The method comprises: receiving text to be synthesized; determining the basic synthesis unit sequence corresponding to the text, the sequence comprising one or more basic synthesis units; determining whether each basic synthesis unit is weakly read; obtaining the synthesis parameter model corresponding to each basic synthesis unit and, if the unit is weakly read, applying weak-reading processing to that model to obtain an updated synthesis parameter model; generating the synthesis parameter model sequence corresponding to the basic synthesis unit sequence; and generating continuous speech according to the synthesis parameter model sequence. The invention improves the naturalness of continuous synthesized speech simply and effectively.
Description
Technical field
The present invention relates to the field of speech synthesis, and in particular to a method and system for improving the prosodic naturalness of synthesized speech.
Background
Humanized, intelligent, and effective human-machine interaction, together with an efficient and natural environment for human-machine communication, has become an urgent need in the application and development of information technology. Speech synthesis converts text into natural speech signals and can convert arbitrary text in real time. It replaces the cumbersome traditional approach of making machines speak through recording and playback, and it saves system storage space. It plays an increasingly important role as information exchange grows, especially in dynamic query applications whose content changes frequently.
In recent years, as the information society has developed, users have placed higher demands on human-machine interaction, and highly natural synthesized speech has become an important mark of a high-performance speech synthesis system. Prosodic problems that reflect the cadence of speech, such as phrase breaks (break) and word stress (focus), have drawn the attention of more and more researchers. Phrase breaks can be analyzed and resolved with syntactic information such as part of speech; given enough training data, an accuracy above 80% can be achieved, which meets functional needs. Word stress, however, involves semantic focus analysis and still cannot be handled well. Many speech synthesis systems therefore simply avoid providing a word-stress function, so the synthesized speech lacks rise and fall in intonation, which harms the naturalness of the output.
The prior art generally uses stress prediction based on semantic analysis: semantic analysis determines the focus of the continuous input text, the synthesis units to be stressed are identified and marked, the corresponding synthesis models are obtained from the stress prediction result and the synthesis features, and a continuous synthesized speech signal is then produced. Stress prediction, however, is highly uncertain and its results are often inaccurate, particularly for open-domain text, and stress applied in the wrong place has an obvious negative effect.
Summary of the invention
Embodiments of the present invention provide a method and system for improving the prosodic naturalness of synthesized speech, so as to improve the naturalness of continuous synthesized speech.
To achieve the above object, the technical solution of the present invention is as follows:
A method for improving the prosodic naturalness of synthesized speech, comprising:
Receiving text to be synthesized;
Determining the basic synthesis unit sequence corresponding to the text, the sequence comprising one or more basic synthesis units;
Determining whether each basic synthesis unit is weakly read;
Obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, applying weak-reading processing to that model to obtain an updated synthesis parameter model;
Generating the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
Generating continuous speech according to the synthesis parameter model sequence.
Preferably, determining whether the basic synthesis unit is weakly read comprises:
Obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;
Determining whether the syllable string and/or syllable is weakly read and, if so, determining that the basic synthesis unit is weakly read.
Preferably, determining whether the syllable string and/or syllable is weakly read comprises:
Checking whether the syllable string to which the basic synthesis unit belongs is in a preset weak reading vocabulary;
If so, determining that the basic synthesis unit is weakly read;
Otherwise, checking whether the syllable to which the basic synthesis unit belongs is in the preset weak reading vocabulary;
If the syllable is in the preset weak reading vocabulary, extracting the prosodic features of the syllable and then determining, from those features and a pre-built weak reading decision tree, whether the syllable is weakly read; if the syllable is weakly read, the basic synthesis unit is weakly read, and otherwise it is not;
If the syllable is not in the preset weak reading vocabulary, determining that the basic synthesis unit is not weakly read.
Preferably, the weak reading vocabulary is built as follows:
Obtaining candidate weak reading words to form a weak reading word set;
Obtaining a training corpus;
Computing, for each candidate in the weak reading word set in turn, its weak reading frequency in the training corpus;
If the weak reading frequency is greater than a frequency threshold, determining that the candidate is a weak reading word;
Generating the weak reading vocabulary from the weak reading words so determined.
Preferably, the weak reading decision tree is built as follows:
Obtaining a large amount of text based on the weak reading vocabulary as training data;
Performing word segmentation on the training data and determining the syllables contained in each segmented word;
Performing prosodic annotation on each syllable, the prosodic annotation information including weak reading information;
Training the weak reading decision tree from the training data and the prosodic annotation information of each syllable.
Preferably, applying weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain the updated synthesis parameter model comprises:
Obtaining the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
Updating the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.
A system for improving the prosodic naturalness of synthesized speech, the system comprising:
A receiving module, configured to receive text to be synthesized;
A basic synthesis unit sequence determining module, configured to determine the basic synthesis unit sequence corresponding to the text, the sequence comprising one or more basic synthesis units;
A weak reading prediction module, configured to determine whether each basic synthesis unit is weakly read;
A synthesis parameter model obtaining module, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
A weak-reading processing module, configured to apply weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit when that unit is weakly read, to obtain an updated synthesis parameter model;
A synthesis parameter model sequence generation module, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
A synthesis module, configured to generate continuous speech according to the synthesis parameter model sequence.
Preferably, the weak reading prediction module comprises:
An acquiring unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
A determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
Preferably, the determination unit comprises:
An inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in the preset weak reading vocabulary; if so, to determine that it is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak reading vocabulary; if so, to trigger the extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;
An extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;
A judging unit, configured to determine, from the prosodic features extracted by the extraction unit and the pre-built weak reading decision tree, whether the syllable is weakly read; if so, to determine that the basic synthesis unit is weakly read, and otherwise that it is not.
Preferably, the system further comprises a weak reading vocabulary building module, configured to build the weak reading vocabulary.
Preferably, the system further comprises a weak reading decision tree building module, configured to build the weak reading decision tree.
Preferably, the weak-reading processing module comprises:
A model parameter acquiring unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
A parameter updating unit, configured to update the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.
The method and system provided by embodiments of the present invention handle the weak reading phenomenon, which is comparatively easy to process, to give continuous speech an overall rise and fall. This fills the gap left by current semantic understanding technology, whose stress prediction for speech synthesis has not yet reached practical usefulness. Moreover, compared with the prior art, the scheme predicts weak reading both accurately and efficiently and substantially improves the naturalness of continuous synthesized speech.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the method for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention;
Fig. 2 is a flow chart of weak reading prediction for a basic synthesis unit in an embodiment of the present invention;
Fig. 3 is a flow chart of building the weak reading decision tree in an embodiment of the present invention;
Fig. 4 is a flow chart of applying weak-reading processing to a synthesis parameter model in an embodiment of the present invention;
Fig. 5 is a structural block diagram of the system for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention;
Fig. 6 is a structural block diagram of the weak-reading processing module in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Existing stress prediction based on semantic analysis is highly uncertain and its results are often inaccurate. The main reasons are as follows:
1. Most content words (such as nouns and verbs), which make up the bulk of the dictionary, may be stressed, so enumerating them exhaustively is impossible.
2. Syntactic-level control alone can hardly determine which word is stressed; only semantic information can do so. That requires higher-level intelligent processing, and the prior art's ability to process semantics is still very limited.
3. The feature parameters currently used for stress prediction are mainly part of speech (POS), word length, the position of the word in the prosodic structure, and similar parameters unrelated to semantics. They give no direct guidance for the prediction, so predictions based on them are not very reliable.
Based on the above analysis, and given that continuous speech synthesis systems need synthesized speech whose pitch rises and falls while the prior art cannot judge stress accurately, embodiments of the present invention propose a method and system for predicting weak reading in the text to be synthesized that yield efficient and accurate weak reading predictions. Correspondingly, a speech synthesis method and system based on weak reading prediction are also proposed. They handle the comparatively tractable weak reading phenomenon, that is, they use the "light" to set off the "heavy", to solve the problem of intonational rise and fall. Specifically, the scheme applies weak-reading processing to some of the words in continuous text so that the synthesized continuous speech rises and falls naturally, which substantially improves its naturalness.
Weak reading takes different forms in different languages, for example neutral-tone words in Standard Chinese, function words in Tibetan, and function words (prepositions, conjunctions, and so on) in English and many Western languages. The role of weakly read elements in a sentence is relatively clear-cut: it can usually be determined from part of speech, or even from pronunciation alone, and generally does not go beyond the syntactic level, that is, it does not involve semantics. Handling weak reading is therefore much cheaper than handling stress.
To this end, the method and system of the embodiments of the present invention rely on weak reading prediction to determine the weakly read units in the text to be synthesized efficiently and accurately, and thus supply accurate prosodic information to speech synthesis. On this basis, during synthesis, if the prosodic features of a basic synthesis unit include the weak reading feature, the weak reading synthesis parameter model or weak reading speech segment corresponding to that unit is obtained; if they do not, the conventional synthesis parameter model or conventional speech segment corresponding to that unit is obtained. Continuous speech is then generated from these synthesis parameter models or speech segments, which effectively solves the problem of intonational rise and fall.
Fig. 1 shows the flow of the method for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention, which includes the following steps:
Step 101, receive the text to be synthesized.
Step 102, determine the basic synthesis unit sequence corresponding to the text; the sequence comprises one or more basic synthesis units.
Specifically, the basic synthesis units corresponding to each character of the text can be obtained, and these units form the basic synthesis unit sequence corresponding to the text.
The basic synthesis unit is the smallest synthesis unit. Western languages usually use the phoneme as the basic synthesis unit; for example, the English word "tone" contains three phonemes: t, ow, ng. Syllable-based tonal languages can use initials and finals as basic synthesis units; for example, the initial/final sequence of the word meaning "initial consonant" is sh, eng, m, u, where the final eng contains two phonemes, e and ng.
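The initial/final decomposition described above can be sketched as follows. This is only an illustrative sketch that assumes toneless pinyin input; the helper names `split_syllable` and `unit_sequence` and the hard-coded list of initials are assumptions for illustration and do not come from the patent.

```python
# Hypothetical helpers: split toneless pinyin syllables into initial/final units.
# Two-letter initials are listed first so that "sh" is matched before "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(syllable):
    """Return the basic synthesis units (initial, final) of one syllable."""
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return [initial, syllable[len(initial):]]
    return [syllable]  # no listed initial: treat the whole syllable as a final

def unit_sequence(syllables):
    """Concatenate the units of every syllable, preserving text order."""
    units = []
    for syllable in syllables:
        units.extend(split_syllable(syllable))
    return units

print(unit_sequence(["sheng", "mu"]))  # ['sh', 'eng', 'm', 'u']
```

A phoneme-based language would use a pronunciation lexicon in place of `split_syllable`; the sequence-building step stays the same.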
Step 103, determine whether each basic synthesis unit is weakly read.
Specifically, the syllable string and/or syllable to which each basic synthesis unit belongs can be obtained, and it is then determined whether that syllable string and/or syllable is weakly read; if so, the basic synthesis unit is determined to be weakly read.
The syllable is the basic unit of phonetic structure. In Chinese, the pronunciation of one character is generally one syllable. In English, a vowel alone can form a syllable, and a vowel combined with one or more consonants can also form a syllable.
Note that one syllable can correspond to one or more basic synthesis units. For example, the word meaning "initial consonant" is one segmented word containing two syllables, each consisting of an initial and a final (sh, eng, m, u), so the word contains four basic synthesis units. Correspondingly, if a syllable string or syllable is weakly read, all of its basic synthesis units are weakly read.
Step 104, obtain the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, apply weak-reading processing to that model to obtain an updated synthesis parameter model.
The synthesis parameter model is an acoustic model. Note that the same basic synthesis unit may be weakly read in one context and not in another. In embodiments of the present invention, therefore, for a basic synthesis unit that needs to be weakly read, weak-reading processing is applied to its synthesis parameter model so that the model parameters better reflect the rise and fall of speech; for a unit that is not weakly read, no such processing is applied.
The detailed process of applying weak-reading processing to a synthesis parameter model is described later.
Step 105, generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence.
That is, the synthesis parameter models corresponding to the basic synthesis units are arranged in the order of the basic synthesis unit sequence to obtain the synthesis parameter model sequence, which contains both models that have undergone weak-reading processing and models that have not. In other words, if a basic synthesis unit is weakly read, its model in the sequence is the one produced by weak-reading processing; if it is not weakly read, its model is the one originally obtained, which can be regarded as the synthesis parameter model for normal pronunciation.
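Steps 104 and 105 can be sketched as follows. This is a minimal sketch under stated assumptions: the `AcousticModel` record, the `weaken` placeholder and its scaling factors, and the dictionary-based model lookup are all hypothetical stand-ins, since the patent does not specify the model representation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AcousticModel:
    """Hypothetical stand-in for a synthesis parameter (acoustic) model."""
    duration: float  # seconds
    f0: float        # mean fundamental frequency, Hz
    energy: float    # relative energy

def weaken(model):
    # Placeholder weak-reading transform; the factors are illustrative only.
    return replace(model, duration=model.duration * 0.8,
                   energy=model.energy * 0.7)

def build_model_sequence(units, weak_flags, lookup):
    """Fetch each unit's model, weaken it if flagged, and keep unit order."""
    sequence = []
    for unit, is_weak in zip(units, weak_flags):
        model = lookup[unit]                      # step 104: obtain the model
        sequence.append(weaken(model) if is_weak else model)
    return sequence                               # step 105: ordered sequence
```

Units that are not weakly read keep the originally obtained model object unchanged, which matches the description of the normal-pronunciation model.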
Step 106, generate continuous speech according to the synthesis parameter model sequence.
Thus the method provided by embodiments of the present invention handles the comparatively tractable weak reading phenomenon, using the "light" to set off the "heavy", and thereby effectively solves the problem of intonational rise and fall and better achieves an overall rise and fall in continuous speech.
Fig. 2 is a flow chart of weak reading prediction for a basic synthesis unit in an embodiment of the present invention.
Note that every basic synthesis unit in the basic synthesis unit sequence must be checked in turn to determine whether it is weakly read, through the following steps:
Step 201, obtain the basic synthesis unit currently being checked.
Step 202, check whether there is a syllable string to which the basic synthesis unit belongs; if so, go to step 203; otherwise, go to step 204.
Specifically, word segmentation can be performed on the text to be synthesized, and the syllable strings and/or syllables contained in each segmented word determined, so as to obtain the syllable string or syllable to which the basic synthesis unit belongs.
Step 203, check whether the syllable string is in the preset weak reading vocabulary; if so, go to step 208; otherwise, go to step 204.
Step 204, obtain the syllable to which the basic synthesis unit belongs.
Step 205, check whether the syllable is in the preset weak reading vocabulary; if so, go to step 206; otherwise, go to step 209.
Weakly read syllables are easy to identify and few in number, so they are relatively easy to enumerate. In embodiments of the present invention, the weak reading vocabulary can be built in advance from statistics over a training corpus, as follows:
(1) Obtain candidate weak reading words to form a weak reading word set. In practice, all function words can be taken as candidates.
(2) Obtain a training corpus.
(3) For each candidate in the weak reading word set in turn, compute its weak reading frequency in the training corpus.
(4) If the weak reading frequency is greater than a frequency threshold, determine that the candidate is a weak reading word.
(5) Generate the weak reading vocabulary from the weak reading words so determined.
Of course, in practice the weak reading vocabulary can also be built by other methods, such as statistical modeling; embodiments of the present invention place no limitation on this.
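The five vocabulary-building steps can be sketched as follows. The sketch assumes that "weak reading frequency" means the fraction of a candidate's occurrences that are annotated as weakly read, and that the corpus is available as `(word, is_weak)` pairs; both are assumptions, as are the function name and the default threshold.

```python
from collections import Counter

def build_weak_vocabulary(candidates, corpus, threshold=0.5):
    """Keep candidates whose weak-reading rate in the corpus exceeds threshold.

    candidates: set of candidate weak reading words (e.g. all function words).
    corpus: iterable of (word, is_weak) observations from annotated speech.
    """
    total, weak = Counter(), Counter()
    for word, is_weak in corpus:
        if word in candidates:
            total[word] += 1
            weak[word] += int(is_weak)
    # Candidates never seen in the corpus have no measurable rate and are dropped.
    return {w for w in candidates
            if total[w] and weak[w] / total[w] > threshold}
```

If the patent intends an absolute count rather than a rate, the comparison in the last line would use `weak[w]` directly.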
Step 206, extract the prosodic features of the syllable.
The prosodic features of the syllable may include one or more of the following: the part of speech of the segmented word containing the syllable, the position of the syllable within that word, and so on.
Step 207, determine whether the basic synthesis unit is weakly read from the prosodic features of the syllable and the pre-built weak reading decision tree.
Specifically, first determine from the prosodic features of the syllable and the pre-built weak reading decision tree whether the syllable is weakly read; if it is, the basic synthesis unit is weakly read, and otherwise it is not.
Step 208, determine that the basic synthesis unit is weakly read.
The same word functions differently in different contexts and, especially when it takes on different parts of speech, often carries different expressive force, so weak reading is to some degree uncertain. Embodiments of the present invention therefore further use the pre-built weak reading decision tree to determine whether the syllable currently being checked is weakly read in its specific context.
How the weak reading decision tree is built, and how it is used to determine whether a syllable is weakly read, are described in detail below.
Step 209, determine that the basic synthesis unit is not weakly read.
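The control flow of steps 201 to 209 can be sketched as a single function. The function signature, the use of a plain set for the weak reading vocabulary, and the `tree_predict` callable are assumptions for illustration; the patent does not prescribe these interfaces.

```python
def is_weak_unit(syllable_string, syllable, features, weak_vocab, tree_predict):
    """Decide whether one basic synthesis unit is weakly read.

    syllable_string: the multi-syllable string the unit belongs to, or None.
    syllable: the syllable the unit belongs to.
    features: prosodic features of the syllable (e.g. part of speech, position).
    tree_predict: callable implementing the weak reading decision tree.
    """
    # Steps 202-203: a syllable string listed in the vocabulary is weakly read.
    if syllable_string is not None and syllable_string in weak_vocab:
        return True                       # step 208
    # Steps 204-205: otherwise fall back to the single syllable.
    if syllable not in weak_vocab:
        return False                      # step 209
    # Steps 206-207: context-dependent decision from prosodic features.
    return bool(tree_predict(features))
```

Only syllables found in the vocabulary reach the decision tree, so the tree is consulted for a small, enumerable set of candidates.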
Fig. 3 shows the process of building the weak reading decision tree in an embodiment of the present invention, which includes the following steps:
Step 301, obtain a large amount of text based on the weak reading vocabulary as training data.
Step 302, perform word segmentation on the training data and determine the syllables contained in each segmented word.
Step 303, perform prosodic annotation on the syllables; the prosodic annotation information includes weak reading information.
Specifically, each syllable can be annotated according to the speech data corresponding to the training data.
In practice, the prosodic annotation information may further include the position of the weakly read syllable within its segmented word, the part of speech of the segmented word containing the weakly read syllable, and so on.
Step 304, train the weak reading decision tree from the training data and the prosodic annotation information of each syllable.
Specifically, the weak reading decision tree is first initialized. Then, starting from the root node, each non-leaf node is examined in turn against a pre-established question set (which contains all information relevant to weak reading). If the node under examination needs to be split, it is split, yielding child nodes and the training data corresponding to each child node; otherwise, the node is marked as a leaf node. When all non-leaf nodes have been examined, the weak reading decision tree is complete.
Note that in practice the weak reading decision tree can also be built by other methods; embodiments of the present invention place no limitation on this.
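The question-set-driven splitting in step 304 can be sketched as follows. The patent does not state the splitting criterion, so this sketch assumes Gini impurity and a minimum node size; the function names, the dictionary-based tree layout, and the representation of questions as named predicates are likewise assumptions.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def build_tree(samples, questions, min_size=2):
    """samples: list of (features, is_weak); questions: name -> predicate."""
    labels = [int(y) for _, y in samples]
    majority = sum(labels) * 2 >= len(labels)
    best = None
    for name, ask in questions.items():
        yes = [s for s in samples if ask(s[0])]
        no = [s for s in samples if not ask(s[0])]
        if not yes or not no:
            continue  # this question does not separate the data at this node
        score = (len(yes) * gini([int(y) for _, y in yes]) +
                 len(no) * gini([int(y) for _, y in no])) / len(samples)
        if best is None or score < best[0]:
            best = (score, name, yes, no)
    # Mark as leaf: node too small, already pure, or no question reduces impurity.
    if (len(samples) < min_size or gini(labels) == 0.0
            or best is None or best[0] >= gini(labels)):
        return {"leaf": majority}
    _, name, yes, no = best
    return {"question": name,
            "yes": build_tree(yes, questions, min_size),
            "no": build_tree(no, questions, min_size)}

def predict(tree, features, questions):
    """Walk from the root to a leaf, answering one question per node."""
    while "leaf" not in tree:
        branch = "yes" if questions[tree["question"]](features) else "no"
        tree = tree[branch]
    return tree["leaf"]
```

Each question is a yes/no predicate over a syllable's prosodic features, for example whether the containing word is a function word.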
The following example illustrates weak reading prediction based on the weak reading decision tree described above.
Text to be synthesized: The red team and the blue team have 49 books in total.
After word segmentation: red team / and (conjunction) / blue team / in total / have (existential verb) / 49 (numeral) / [measure word] / book.
Weak reading prediction: the syllables "and", "have", and "ten" are in the weak reading vocabulary, so only these three syllables need to be judged.
The weak reading decision tree makes the following judgments:
(1) Is the segmented word containing the syllable a function word? If so, the syllable is weakly read. "And" satisfies the condition and is judged weakly read.
(2) Is the segmented word containing the syllable an existential verb? If so, is it preceded by a negation word? If so, the syllable is weakly read. "Have" is an existential verb but is not preceded by a negation word, so it is judged not weakly read.
(3) Is the segmented word containing the syllable a numeral? If so, is the syllable word-medial? If so, the syllable is weakly read. The word containing "ten" is a numeral and "ten" is word-medial, so it is judged weakly read.
If a syllable is weakly read, all basic synthesis units corresponding to it are weakly read, and vice versa.
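The three judgments in this example can be written out as a hand-coded decision function. The part-of-speech labels and parameter names below are hypothetical; a trained tree would learn such rules from annotated data rather than have them hard-coded.

```python
def predict_weak(pos, negated=False, word_medial=False):
    """Hand-written version of the three example rules.

    pos: part of speech of the segmented word containing the syllable.
    negated: whether the word is preceded by a negation word.
    word_medial: whether the syllable sits in the middle of its word.
    """
    if pos == "function":          # rule (1): function words are weakly read
        return True
    if pos == "existential_verb":  # rule (2): weak only after a negation word
        return negated
    if pos == "numeral":           # rule (3): weak only in word-medial position
        return word_medial
    return False                   # not covered by the example rules

# The three syllables from the example sentence:
results = {
    "and": predict_weak("function"),
    "have": predict_weak("existential_verb", negated=False),
    "ten": predict_weak("numeral", word_medial=True),
}
```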
Note that the synthesis parameter model in embodiments of the present invention is an acoustic model.
Compared with normal pronunciation, a weakly read basic synthesis unit generally has the following characteristics:
(1) Its duration is usually shorter.
(2) Its fundamental frequency (F0) contour tends toward the middle of the pitch range: a unit whose original F0 contour is high is lowered, and a unit whose original F0 contour is low is raised.
(3) Its energy is lower.
Based on these characteristics, embodiments of the present invention can first train the acoustic model of each weakly read basic synthesis unit, compare it acoustically with the corresponding non-weakly-read unit, and determine the rules governing the differences between weak and non-weak reading in duration, energy, and fundamental frequency. When weak-reading processing is applied to a synthesis parameter model, the model parameters are then updated by rules such as shortening the duration, lowering or raising the fundamental frequency, and reducing the energy, to achieve the weak reading effect.
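The three characteristics can be illustrated with a small parameter transform. This is a sketch only: the function name, the linear pull of F0 toward the pitch-range median, and all default factors are assumptions chosen to show the direction of each change, not values from the patent.

```python
def weaken_parameters(duration, f0, energy, f0_median,
                      duration_ratio=0.8, f0_pull=0.5, energy_ratio=0.7):
    """Apply the three weak-reading tendencies to one unit's parameters.

    duration_ratio < 1 shortens the unit, energy_ratio < 1 lowers its energy,
    and f0_pull in [0, 1] moves F0 part of the way toward the median, so a
    high F0 is lowered and a low F0 is raised.
    """
    new_duration = duration * duration_ratio
    new_f0 = f0 + f0_pull * (f0_median - f0)
    new_energy = energy * energy_ratio
    return new_duration, new_f0, new_energy
```

For example, with a median of 200 Hz and `f0_pull=0.5`, a unit at 260 Hz moves to 230 Hz while a unit at 140 Hz moves to 170 Hz.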
Fig. 4 is a flow chart of applying weak-reading processing to a synthesis parameter model in an embodiment of the present invention, which includes the following steps:
Step 401, obtain the model parameters of the synthesis parameter model; the model parameters include a duration parameter, a fundamental frequency parameter, and an energy parameter;
Step 402, update the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.
The mapping rules are trained as follows:
In practice, separate mapping rules can be trained for the duration, fundamental frequency, and energy parameters of the synthesis parameter model, as follows:
1, duration parameters mapping ruler
(1) training data is obtained;
(2) the basic synthesis unit of weak reading in the training data is determined;
(3) calculate it is described it is weak read basic synthesis unit in the case that weak reading and it is non-it is weak read two kinds duration ratio, and by its
As duration parameters mapping ruler.
Since one syllable corresponds to one or more basic synthesis units, the mapping rule can be made more accurate by calculating, separately for each position of the basic synthesis unit within the syllable (syllable-initial, syllable-medial, and syllable-final), the mean duration in the weak-reading case and in the non-weak-reading case, and then calculating the duration ratio between the two cases from these means.
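The position-dependent duration ratio described above can be sketched as follows. This is only an illustrative sketch: the function name `train_duration_ratios`, the position labels, the sample format, and the toy durations are assumptions made for the example and do not come from the patent.

```python
from collections import defaultdict
from statistics import mean

# Assumed labels for the three positions named in the text.
POSITIONS = ("initial", "medial", "final")

def train_duration_ratios(samples):
    """Return {position: mean weak duration / mean non-weak duration}.

    samples is an iterable of (position, is_weak, duration) tuples,
    a hypothetical format chosen for this sketch.
    """
    buckets = defaultdict(lambda: {True: [], False: []})
    for position, is_weak, duration in samples:
        buckets[position][is_weak].append(duration)

    ratios = {}
    for position in POSITIONS:
        weak = buckets[position][True]
        non_weak = buckets[position][False]
        if weak and non_weak:  # skip positions that lack either kind of sample
            ratios[position] = mean(weak) / mean(non_weak)
    return ratios

# Toy data, invented purely for illustration.
samples = [
    ("initial", True, 40.0), ("initial", True, 60.0), ("initial", False, 100.0),
    ("final", True, 90.0), ("final", False, 120.0), ("final", False, 180.0),
]
ratios = train_duration_ratios(samples)
```

With this toy data no ratio is produced for the syllable-medial position, because no samples exist for it; a real system would need a fallback for such cases, which the text does not specify.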
Based on the above duration parameter mapping rule, when weak-reading processing is applied to a synthesis parameter model, the duration parameter in the synthesis parameter model can be adjusted by the above duration ratio according to the position of the basic synthesis unit within the syllable.
2. Fundamental frequency parameter mapping rule
Duration is a scalar, whereas fundamental frequency is a vector: each basic synthesis unit corresponds to a fundamental frequency contour. To simplify the rule, the average fundamental frequency of the basic synthesis unit may be used for the parameter mapping, as follows:
(1) obtain training data;
(2) determine the weakly read basic synthesis units in the training data;
(3) calculate the ratio between the average fundamental frequency of each weakly read basic synthesis unit when it is weakly read and its average fundamental frequency when it is not weakly read, and use this ratio as the fundamental frequency parameter mapping rule.
Based on the above fundamental frequency parameter mapping rule, when weak-reading processing is applied to a synthesis parameter model, the fundamental frequency parameter in the synthesis parameter model can be adjusted by the above fundamental frequency ratio according to the position of the basic synthesis unit within the syllable.
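A minimal sketch of the average-fundamental-frequency mapping follows. The function names are hypothetical, and the treatment of zero-valued frames as unvoiced (excluded from the average) is an assumption of this example; the patent does not say how unvoiced frames are handled. For brevity the sketch computes a single ratio, whereas the text applies the ratio according to syllable position.

```python
from statistics import mean

def mean_f0(contour):
    """Average F0 of one unit; zero-valued frames are treated as unvoiced (assumption)."""
    voiced = [f for f in contour if f > 0]
    return mean(voiced) if voiced else 0.0

def train_f0_ratio(weak_contours, non_weak_contours):
    """Ratio of the average F0 in the weak case to that in the non-weak case."""
    weak_avg = mean(mean_f0(c) for c in weak_contours)
    non_weak_avg = mean(mean_f0(c) for c in non_weak_contours)
    return weak_avg / non_weak_avg

def apply_f0_ratio(contour, ratio):
    """Scale every frame of a contour by the trained ratio."""
    return [f * ratio for f in contour]

# Toy contours in Hz, invented for illustration.
weak = [[180.0, 0.0, 180.0], [180.0, 180.0]]
non_weak = [[200.0, 200.0], [0.0, 200.0]]
f0_ratio = train_f0_ratio(weak, non_weak)
scaled = apply_f0_ratio([200.0, 0.0, 220.0], f0_ratio)
```

Because the rule is a ratio rather than a fixed offset, the same code covers both lowering (ratio below 1) and raising (ratio above 1) the fundamental frequency.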
3. Energy parameter mapping rule
Energy is also a vector: each basic synthesis unit corresponds to an energy contour. The energy parameter mapping can be performed with the same method as the fundamental frequency parameter mapping rule, so the details are not repeated here.
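Taken together, steps 401 and 402 with the three ratio-based rules can be sketched as below. The `SynthesisParams` and `MappingRule` classes, the position keys, and all numeric values are assumptions made for illustration. In an actual parametric synthesizer the model parameters would typically be statistical model parameters rather than raw contours, so this is a simplified sketch and not the patented implementation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SynthesisParams:
    duration: float  # scalar duration
    f0: tuple        # fundamental frequency contour
    energy: tuple    # energy contour

@dataclass(frozen=True)
class MappingRule:
    duration_ratio: float
    f0_ratio: float
    energy_ratio: float

def weaken(params, rules, position):
    """Step 402: update the parameters with the rule trained for this syllable position."""
    rule = rules[position]
    return replace(
        params,
        duration=params.duration * rule.duration_ratio,
        f0=tuple(f * rule.f0_ratio for f in params.f0),
        energy=tuple(e * rule.energy_ratio for e in params.energy),
    )

# Hypothetical rules and parameters; the ratios are not taken from the patent.
rules = {"final": MappingRule(duration_ratio=0.5, f0_ratio=0.75, energy_ratio=0.5)}
original = SynthesisParams(duration=10.0, f0=(200.0, 210.0), energy=(1.0, 0.8))
updated = weaken(original, rules, "final")
```

The sketch returns a new object instead of mutating the original, so the non-weak model remains available for units that are not weakly read.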
The method for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention addresses the need, in a continuous speech synthesis system, for the pitch of synthesized speech to rise and fall. Based on the prediction of weakly read segments, it applies weak-reading processing to the synthesis parameter models of the basic synthesis units corresponding to those segments, thereby giving the continuous speech an overall rise and fall. By handling the weak-reading phenomenon, which is comparatively easy to process, the solution uses the "light" to set off the "heavy". It fills the gap left by current semantic understanding technology, whose stress prediction has not yet reached a practical level in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.
In addition, it should be noted that weak reading and stress may also be considered together during speech synthesis, to further improve the naturalness of continuous synthesized speech.
Correspondingly, an embodiment of the present invention further provides a speech synthesis system. FIG. 5 is a structural block diagram of the system.
In this embodiment, the system includes:
a receiving module 501, configured to receive text to be synthesized;
a basic synthesis unit sequence determining module 502, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;
a weak-reading prediction module 503, configured to determine whether each basic synthesis unit is weakly read;
a synthesis parameter model obtaining module 504, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
a weak-reading processing module 505, configured to, when the basic synthesis unit is weakly read, perform weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
a synthesis parameter model sequence generation module 506, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
a synthesis module 507, configured to generate continuous speech according to the synthesis parameter model sequence.
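The way modules 501 to 507 cooperate can be sketched as a simple pipeline. The function `synthesize` and its callable parameters are hypothetical stand-ins for the modules, and the stub implementations in the usage lines are placeholders invented for this example rather than real text analysis or a real vocoder.

```python
def synthesize(text, to_units, is_weak, get_model, weaken, generate_speech):
    """Run the module chain on received text (module 501 supplies `text`)."""
    units = to_units(text)              # module 502: basic synthesis unit sequence
    models = []
    for unit in units:
        model = get_model(unit)         # module 504: synthesis parameter model
        if is_weak(unit):               # module 503: weak-reading prediction
            model = weaken(model)       # module 505: weak-reading processing
        models.append(model)            # module 506: model sequence
    return generate_speech(models)      # module 507: continuous speech

# Stub modules, for illustration only.
result = synthesize(
    "a b a",
    to_units=str.split,
    is_weak=lambda unit: unit == "b",
    get_model=lambda unit: {"unit": unit, "duration": 10.0},
    weaken=lambda model: {**model, "duration": model["duration"] * 0.5},
    generate_speech=lambda models: [m["duration"] for m in models],
)
```

Passing the modules in as callables keeps the sketch independent of how any single module is implemented, which matches the statement that the units and modules may or may not be physically separate.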
The weak-reading prediction module 503 may use the weak-reading prediction method described earlier to determine whether each basic synthesis unit is weakly read. One specific structure of the weak-reading prediction module 503 may include the following units:
an acquiring unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
The determination unit may include:
an inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that the syllable is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;
the extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;
a judging unit, configured to determine, according to the prosodic features extracted by the extraction unit and a weak-reading decision tree constructed in advance, whether the syllable is weakly read; if the syllable is weakly read, the basic synthesis unit is determined to be weakly read; otherwise, the basic synthesis unit is determined not to be weakly read.
The weak-reading vocabulary and the weak-reading decision tree may be constructed by the speech synthesis system of the present invention or by another system; the embodiments of the present invention place no limitation on this. If they are constructed by the speech synthesis system of the present invention, the system may further include a weak-reading vocabulary building module and a weak-reading decision tree building module, configured to construct the weak-reading vocabulary and the weak-reading decision tree, respectively. Depending on the specific construction method, each of the two modules may have a correspondingly suitable structure, which is not limited here.
One specific structure of the weak-reading processing module 505 is shown in FIG. 6 and includes:
a model parameter acquiring unit 601, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
a parameter updating unit 602, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain an updated synthesis parameter model.
In practical applications, the mapping rules may be trained in advance by the system of the present invention or by another system.
If they are trained by the system of the present invention, the system further needs to include a mapping rule training module (not shown), configured to construct mapping rules that reflect the correspondence between non-weak-reading synthesis parameters and weak-reading synthesis parameters.
The mapping rule training module may train a duration parameter mapping rule, a fundamental frequency parameter mapping rule, and an energy parameter mapping rule separately for the model parameters of the synthesis parameter model. For the specific training process, refer to the description in the foregoing method embodiment of the present invention; the details are not repeated here.
Correspondingly, the parameter updating unit 602 needs to update each model parameter according to its corresponding mapping rule.
The system for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention handles, during speech synthesis, the weak-reading phenomenon, which is comparatively easy to process, and thereby gives the continuous speech an overall rise and fall. It fills the gap left by current semantic understanding technology, whose stress prediction has not yet reached a practical level in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.
The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
The structure, features, and effects of the present invention have been described in detail above with reference to the embodiments shown in the drawings. The above are only preferred embodiments of the present invention, and the scope of implementation of the present invention is not limited to what is shown in the drawings. Any change made according to the concept of the present invention, or any equivalent embodiment modified into an equivalent variation, that does not go beyond the spirit covered by the description and the drawings shall fall within the scope of protection of the present invention.
Claims (12)
1. A method for improving the prosodic naturalness of synthesized speech, characterized by comprising:
receiving text to be synthesized;
determining a basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;
determining whether each basic synthesis unit is weakly read;
obtaining the synthesis parameter model corresponding to the basic synthesis unit, and, if the basic synthesis unit is weakly read, performing weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
generating a synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
generating continuous speech according to the synthesis parameter model sequence.
2. The method according to claim 1, characterized in that determining whether the basic synthesis unit is weakly read comprises:
obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;
determining whether the syllable string and/or syllable is weakly read, and, if so, determining that the basic synthesis unit is weakly read.
3. The method according to claim 2, characterized in that determining whether the syllable string and/or syllable is weakly read comprises:
checking whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary;
if so, determining that the basic synthesis unit is weakly read;
otherwise, checking whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary;
if the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary, extracting the prosodic features of the syllable, and then determining whether the syllable is weakly read according to the prosodic features of the syllable and a weak-reading decision tree constructed in advance; if the syllable is weakly read, the basic synthesis unit is weakly read; otherwise, the basic synthesis unit is not weakly read;
if the syllable to which the basic synthesis unit belongs is not in the preset weak-reading vocabulary, determining that the basic synthesis unit is not weakly read.
4. The method according to claim 3, characterized in that the process of constructing the weak-reading vocabulary comprises:
obtaining candidate weak-reading words to form a weak-reading word set;
obtaining a training corpus;
calculating, in turn, the weak-reading frequency in the training corpus of each candidate weak-reading word in the weak-reading word set;
if the weak-reading frequency is greater than a frequency threshold, determining that the candidate weak-reading word is a weak-reading word;
generating the weak-reading vocabulary from the determined weak-reading words.
5. The method according to claim 3, characterized in that the process of constructing the weak-reading decision tree comprises:
obtaining a large amount of text, based on the weak-reading vocabulary, as training data;
performing word segmentation on the training data, and determining the syllables contained in each segmented word;
performing prosodic annotation on each syllable, the prosodic annotation information including weak-reading information;
training the weak-reading decision tree according to the training text data and the prosodic annotation information of each corresponding syllable.
6. The method according to any one of claims 1 to 5, characterized in that performing weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model comprises:
obtaining the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
updating the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
7. A system for improving the prosodic naturalness of synthesized speech, characterized in that the system comprises:
a receiving module, configured to receive text to be synthesized;
a basic synthesis unit sequence determining module, configured to determine a basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;
a weak-reading prediction module, configured to determine whether each basic synthesis unit is weakly read;
a synthesis parameter model obtaining module, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;
a weak-reading processing module, configured to, when the basic synthesis unit is weakly read, perform weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;
a synthesis parameter model sequence generation module, configured to generate a synthesis parameter model sequence corresponding to the basic synthesis unit sequence;
a synthesis module, configured to generate continuous speech according to the synthesis parameter model sequence.
8. The system according to claim 7, characterized in that the weak-reading prediction module comprises:
an acquiring unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;
a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.
9. The system according to claim 8, characterized in that the determination unit comprises:
an inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that the syllable is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;
the extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;
a judging unit, configured to determine, according to the prosodic features of the syllable extracted by the extraction unit and a weak-reading decision tree constructed in advance, whether the syllable is weakly read; if the syllable is weakly read, to determine that the basic synthesis unit is weakly read; otherwise, to determine that the basic synthesis unit is not weakly read.
10. The system according to claim 9, characterized in that the system further comprises: a weak-reading vocabulary building module, configured to construct the weak-reading vocabulary.
11. The system according to claim 9, characterized in that the system further comprises: a weak-reading decision tree building module, configured to construct the weak-reading decision tree.
12. The system according to any one of claims 7 to 11, characterized in that the weak-reading processing module comprises:
a model parameter acquiring unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;
a parameter updating unit, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain the updated synthesis parameter model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510038454.2A CN105895075B (en) | 2015-01-26 | 2015-01-26 | Improve the method and system of synthesis phonetic-rhythm naturalness |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105895075A CN105895075A (en) | 2016-08-24 |
CN105895075B true CN105895075B (en) | 2019-11-15 |
Family
ID=56999749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510038454.2A Active CN105895075B (en) | 2015-01-26 | 2015-01-26 | Improve the method and system of synthesis phonetic-rhythm naturalness |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105895075B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109087627A (en) * | 2018-10-16 | 2018-12-25 | 百度在线网络技术(北京)有限公司 | Method and apparatus for generating information |
CN110751940B (en) * | 2019-09-16 | 2021-06-11 | 百度在线网络技术(北京)有限公司 | Method, device, equipment and computer storage medium for generating voice packet |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1604184A (en) * | 2003-09-29 | 2005-04-06 | 摩托罗拉公司 | Transformation from characters to sound for synthesizing text paragraph pronunciation |
CN1664922A (en) * | 2004-03-05 | 2005-09-07 | 雅马哈株式会社 | Pitch model production device, method and pitch model production program |
CN1685396A (en) * | 2002-09-23 | 2005-10-19 | 因芬尼昂技术股份公司 | Method for computer-aided speech synthesis of a stored electronic text into an analog speech signal, speech synthesis device and telecommunication apparatus |
CN101123089A (en) * | 2006-08-08 | 2008-02-13 | 苗玉水 | Voice mixing method for Chinese voice code |
CN101271687A (en) * | 2007-03-20 | 2008-09-24 | 株式会社东芝 | Method and device for pronunciation conversion estimation and speech synthesis |
CN101894547A (en) * | 2010-06-30 | 2010-11-24 | 北京捷通华声语音技术有限公司 | Speech synthesis method and system |
Also Published As
Publication number | Publication date |
---|---|
CN105895075A (en) | 2016-08-24 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||