CN105895075B

CN105895075B - Method and system for improving the prosodic naturalness of synthesized speech

Info

Publication number: CN105895075B
Application number: CN201510038454.2A
Authority: CN
Inventors: 祖漪清; 王祖燕; 黄维; 邵鹏飞; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2015-01-26
Filing date: 2015-01-26
Publication date: 2019-11-15
Anticipated expiration: 2035-01-26
Also published as: CN105895075A

Abstract

The invention discloses a method and system for improving the prosodic naturalness of synthesized speech. The method comprises: receiving text to be synthesized; determining a basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units; determining whether each basic synthesis unit is weakly read; obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, performing weak-reading processing on that synthesis parameter model to obtain an updated synthesis parameter model; generating a synthesis parameter model sequence corresponding to the basic synthesis unit sequence; and generating continuous speech according to the synthesis parameter model sequence. The invention can simply and effectively improve the naturalness of continuous synthesized speech.

Description

Method and system for improving the prosodic naturalness of synthesized speech

Technical field

The present invention relates to the field of speech synthesis technology, and in particular to a method and system for improving the prosodic naturalness of synthesized speech.

Background

Achieving humanized, intelligent, and effective human-machine interaction and building an efficient, natural environment for human-machine communication has become an urgent need in the application and development of current information technology. Speech synthesis technology converts text information into natural speech signals and realizes real-time conversion of arbitrary text. It replaces the cumbersome traditional practice of making machines speak by recording and playback, and it saves system storage space. It plays an increasingly important role as the volume of exchanged information grows, especially in dynamic query applications whose content changes frequently.

In recent years, as the needs of the information society have developed, users have placed higher demands on human-machine interaction, and highly natural synthesized speech has become an important hallmark of a high-performance speech synthesis system. Prosodic problems that reflect the cadence of speech, such as prosodic breaks (break) and word stress (focus), are receiving attention from more and more researchers. Breaks can be analyzed and resolved using syntactic information such as part of speech; with sufficient training data, an accuracy of 80% or more can be obtained, which meets functional needs. Word stress, however, involves semantic focus analysis and still cannot be solved well. For this reason many speech synthesis systems simply avoid providing a word stress function, so the synthesized speech lacks cadence, which impairs its naturalness.

In the prior art, stress prediction based on semantic analysis is generally used: semantic analysis determines the focus of the continuous input text, the synthesis units that need to be stressed are determined and marked, the corresponding synthesis models are obtained according to the stress prediction result and the synthesis features, and a continuous synthesized speech signal is then obtained. However, stress prediction is highly uncertain and its results are often not accurate enough, especially for text with unrestricted content, and stress information applied in the wrong place has an obvious negative effect.

Summary of the invention

Embodiments of the present invention provide a method and system for improving the prosodic naturalness of synthesized speech, so as to improve the naturalness of continuous synthesized speech.

To achieve the above object, the technical solution of the present invention is as follows:

A method for improving the prosodic naturalness of synthesized speech, comprising:

receiving text to be synthesized;

determining a basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;

determining whether each basic synthesis unit is weakly read;

obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, performing weak-reading processing on that synthesis parameter model to obtain an updated synthesis parameter model;

generating a synthesis parameter model sequence corresponding to the basic synthesis unit sequence;

generating continuous speech according to the synthesis parameter model sequence.

Preferably, determining whether the basic synthesis unit is weakly read comprises:

obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;

determining whether the syllable string and/or syllable is weakly read and, if so, determining that the basic synthesis unit is weakly read.

Preferably, determining whether the syllable string and/or syllable is weakly read comprises:

checking whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary;

if so, determining that the basic synthesis unit is weakly read;

otherwise, checking whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary;

if the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary, extracting the prosodic features of the syllable and then determining whether the syllable is weakly read according to those prosodic features and a pre-built weak-reading decision tree; if the syllable is weakly read, the basic synthesis unit is weakly read, otherwise the basic synthesis unit is not weakly read;

if the syllable to which the basic synthesis unit belongs is not in the preset weak-reading vocabulary, determining that the basic synthesis unit is not weakly read.

Preferably, the construction of the weak-reading vocabulary comprises:

obtaining candidate weak-reading words to form a weak-reading word set;

obtaining a training corpus;

calculating in turn the weak-reading frequency in the training corpus of each candidate weak-reading word in the weak-reading word set;

if the weak-reading frequency is greater than a frequency threshold, determining that the candidate weak-reading word is a weak-reading word;

generating the weak-reading vocabulary from the determined weak-reading words.

Preferably, the construction of the weak-reading decision tree comprises:

obtaining a large amount of text based on the weak-reading vocabulary as training data;

performing word segmentation on the training data and determining the syllables contained in each segmented word;

performing prosodic annotation on each syllable, the prosodic annotation information including weak-reading information;

training the weak-reading decision tree from the training data and the prosodic annotation information of each corresponding syllable.

Preferably, performing weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model comprises:

obtaining the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;

updating the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.

A system for improving the prosodic naturalness of synthesized speech, the system comprising:

a receiving module, configured to receive text to be synthesized;

a basic synthesis unit sequence determining module, configured to determine a basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units;

a weak-reading prediction module, configured to determine whether each basic synthesis unit is weakly read;

a synthesis parameter model obtaining module, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;

a weak-reading processing module, configured to perform weak-reading processing on the synthesis parameter model corresponding to the basic synthesis unit when the basic synthesis unit is weakly read, to obtain an updated synthesis parameter model;

a synthesis parameter model sequence generation module, configured to generate a synthesis parameter model sequence corresponding to the basic synthesis unit sequence;

a synthesis module, configured to generate continuous speech according to the synthesis parameter model sequence.

Preferably, the weak-reading prediction module comprises:

an obtaining unit, configured to obtain the syllable string and/or syllable to which each basic synthesis unit belongs;

a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.

Preferably, the determination unit comprises:

a checking unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that it is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; otherwise, to determine that the basic synthesis unit is not weakly read;

an extraction unit, configured to extract the prosodic features of the syllable when triggered by the checking unit;

a judging unit, configured to determine whether the syllable is weakly read according to the prosodic features extracted by the extraction unit and a pre-built weak-reading decision tree, and, if the syllable is weakly read, to determine that the basic synthesis unit is weakly read, otherwise to determine that the basic synthesis unit is not weakly read.

Preferably, the system further comprises a weak-reading vocabulary construction module, configured to construct the weak-reading vocabulary.

Preferably, the system further comprises a weak-reading decision tree construction module, configured to construct the weak-reading decision tree.

Preferably, the weak-reading processing module comprises:

a model parameter obtaining unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;

a parameter updating unit, configured to update the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.

The method and system for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention handle the comparatively tractable weak-reading phenomenon and thereby give continuous speech an overall rise and fall. This fills the gap left by current semantic understanding technology, whose stress prediction for speech synthesis has not yet reached practical usefulness. Moreover, compared with the prior art, the weak-reading prediction of the embodiments is both accurate and efficient, and it substantially improves the naturalness of continuous synthesized speech.

Brief description of the drawings

In order to explain the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of the method for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention;

Fig. 2 is a flowchart of weak-reading prediction for a basic synthesis unit in an embodiment of the present invention;

Fig. 3 is a flowchart of the construction of the weak-reading decision tree in an embodiment of the present invention;

Fig. 4 is a flowchart of weak-reading processing of a synthesis parameter model in an embodiment of the present invention;

Fig. 5 is a structural block diagram of the system for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention;

Fig. 6 is a structural block diagram of the weak-reading processing module in an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

Existing stress prediction based on semantic analysis is highly uncertain and its results are often not accurate enough. The main reasons are as follows:

1. In general, most content words (such as nouns and verbs), which make up the bulk of the dictionary, may be stressed, so enumerating them exhaustively is an impossible task.

2. Syntax-level information alone is hardly enough to determine which word is stressed; only with semantic information is it possible to determine stress. This requires higher-level intelligent processing, and the semantic processing capability of the prior art is still very limited.

3. The feature parameters currently used for stress prediction are mainly part of speech (POS), word length, the position of the word in the prosodic structure, and other parameters unrelated to semantics. They give no direct guidance for the prediction, so predictions based on them are not very reliable.

Based on the above analysis, and in view of the demand in continuous speech synthesis systems for high-low fluctuation in synthesized speech and the insufficient ability of the prior art to judge stress accurately, the embodiments of the present invention propose a method and system for weak-reading prediction on text to be synthesized, which yields weak-reading predictions efficiently and accurately. Correspondingly, a speech synthesis method and system based on weak-reading prediction are also proposed: by handling the comparatively tractable weak-reading phenomenon, that is, by using "light" to set off "heavy", the cadence problem is solved. Specifically, the scheme of the embodiments applies weak-reading processing to some of the words in continuous text, which gives the synthesized continuous speech a natural rise and fall and substantially improves its naturalness.

Across languages, weak reading typically shows up in different words and features, for example neutral-tone words in Standard Chinese, function words in Tibetan, and function words (prepositions, conjunctions, etc.) in English and many Western languages. The role of weak-reading factors in a sentence is relatively clear: it can usually be determined from part of speech or even from pronunciation, and it generally does not go beyond the syntactic level, i.e. it does not involve semantics. Handling weak reading is therefore much less costly than handling stress.

For this reason, the method and system of the embodiments of the present invention use weak-reading prediction to determine efficiently and accurately which units in the text to be synthesized are weakly read, thereby providing accurate prosodic information for speech synthesis. On this basis, during speech synthesis, if the prosodic features of a basic synthesis unit include the weak-reading feature, the weak-reading synthesis parameter model or weak-reading speech segment corresponding to that unit is obtained; if they do not, the conventional synthesis parameter model or conventional speech segment corresponding to that unit is obtained. Continuous speech is then generated from these synthesis parameter models or speech segments, which effectively solves the cadence problem.

Fig. 1 shows the flow of the method for improving the prosodic naturalness of synthesized speech according to an embodiment of the present invention, which comprises the following steps:

Step 101, receive text to be synthesized.

Step 102, determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence comprising one or more basic synthesis units.

Specifically, the basic synthesis units corresponding to the text can be obtained by lookup in a character library, and these basic synthesis units form the basic synthesis unit sequence corresponding to the text.

A basic synthesis unit is the smallest synthesis unit. For Western languages, phonemes are generally used as basic synthesis units; for example, the English word "tone" contains three phonemes: t, ow, ng. Syllable-based tonal languages can use initials and finals as basic synthesis units; for example, the initial/final sequence of the word "initial consonant" (shengmu in pinyin) is sh, eng, m, u, where the final eng contains two phonemes, e and ng.

Step 103, determine whether each basic synthesis unit is weakly read.

Specifically, the syllable string and/or syllable to which each basic synthesis unit belongs can be obtained, and it is then determined whether that syllable string and/or syllable is weakly read; if so, the basic synthesis unit is determined to be weakly read.

The syllable is the basic unit of phonetic structure. In Chinese, the pronunciation of one Chinese character is generally one syllable. In English, a single vowel may form a syllable, and a vowel combined with one or more consonant phonemes may also form a syllable.

It should be noted that a syllable can correspond to one or more basic synthesis units. For example, "initial consonant" (shengmu) is one segmented word containing two syllables, each of which comprises one initial and one final (sh, eng; m, u), so the word contains four basic synthesis units. Correspondingly, if a syllable string or syllable is weakly read, all of its basic synthesis units are weakly read.
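The relationship between a syllable and its basic synthesis units can be sketched as follows. This is a minimal illustration only: the `Syllable` class and the function name are hypothetical, and marking the second syllable as weak is an arbitrary choice made to show the propagation, not a claim about how the word is actually pronounced.

```python
from dataclasses import dataclass


@dataclass
class Syllable:
    text: str              # pinyin of the syllable, e.g. "sheng"
    units: list            # basic synthesis units (initial and final)
    weak: bool = False     # weak-reading decision made at syllable level


def unit_weak_flags(syllables):
    """Expand syllable-level decisions into one (unit, is_weak) pair per basic unit."""
    return [(unit, syl.weak) for syl in syllables for unit in syl.units]


# The word "shengmu": two syllables, four basic synthesis units.
# weak=True on the second syllable is an assumption for demonstration.
word = [Syllable("sheng", ["sh", "eng"]), Syllable("mu", ["m", "u"], weak=True)]
flags = unit_weak_flags(word)
```

Every basic synthesis unit inherits the decision of the syllable it belongs to, so the decision only has to be made once per syllable.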

Step 104, obtain the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, perform weak-reading processing on that synthesis parameter model to obtain an updated synthesis parameter model.

The synthesis parameter model is an acoustic model. It should be noted that the same basic synthesis unit may be weakly read in one context and not in another. Therefore, in the embodiments of the present invention, weak-reading processing is applied to the synthesis parameter model of a basic synthesis unit that needs to be weakly read, so that the model parameters better reflect the rise and fall of speech. For a basic synthesis unit that is not weakly read, no weak-reading processing is applied to its synthesis parameter model.

The specific process of weak-reading processing of a synthesis parameter model is described in detail later.

Step 105, generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence.

That is, the synthesis parameter models corresponding to the basic synthesis units are arranged in the order of the basic synthesis unit sequence to obtain the synthesis parameter model sequence. The sequence contains both models that have not undergone weak-reading processing and models that have. In other words, if a basic synthesis unit is weakly read, its synthesis parameter model is the one obtained after weak-reading processing; if it is not weakly read, its synthesis parameter model is the originally obtained one, which can be regarded as the model for normal pronunciation.
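Steps 104 and 105 can be sketched as a single pass over the unit sequence. In this sketch the model lookup and the weak-reading processing are passed in as callables, and the models themselves are plain strings; all of these are hypothetical stand-ins, since the text does not specify the data structures.

```python
def build_model_sequence(units, weak_flags, lookup_model, weaken):
    """Return one synthesis parameter model per basic synthesis unit, in order."""
    sequence = []
    for unit, is_weak in zip(units, weak_flags):
        model = lookup_model(unit)  # originally obtained model (normal pronunciation)
        # Only weakly read units get the processed (updated) model.
        sequence.append(weaken(model) if is_weak else model)
    return sequence


# Hypothetical stand-ins: a "model" is just a label string here.
units = ["sh", "eng", "m", "u"]
flags = [False, False, True, True]
sequence = build_model_sequence(
    units, flags,
    lookup_model=lambda u: f"model({u})",
    weaken=lambda m: m + "*weak",
)
```

The resulting sequence mixes unprocessed and processed models in the original unit order, which is what step 106 consumes.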

Step 106, generate continuous speech according to the synthesis parameter model sequence.

It can be seen that the method provided by the embodiments of the present invention handles the comparatively tractable weak-reading phenomenon, that is, uses "light" to set off "heavy", and thereby effectively solves the cadence problem and gives continuous speech a better overall rise and fall.

Fig. 2 is a flowchart of weak-reading prediction for a basic synthesis unit in an embodiment of the present invention.

It should be noted that each basic synthesis unit in the basic synthesis unit sequence must be checked in turn to determine whether it is weakly read. The check comprises the following steps:

Step 201, obtain the basic synthesis unit currently being checked.

Step 202, check whether there is a syllable string to which the basic synthesis unit belongs; if so, execute step 203; otherwise, execute step 204.

Specifically, word segmentation can be performed on the text to be synthesized, and the syllable strings and/or syllables contained in each resulting segmented word can be determined, so as to obtain the syllable string or syllable to which the basic synthesis unit belongs.

Step 203, check whether the syllable string is in the preset weak-reading vocabulary; if so, execute step 208; otherwise, execute step 204.

Step 204, obtain the syllable to which the basic synthesis unit belongs.

Step 205, check whether the syllable is in the preset weak-reading vocabulary; if so, execute step 206; otherwise, execute step 209.

Weakly read syllables are easy to capture and few in number, so they are relatively easy to enumerate exhaustively. In the embodiments of the present invention, the weak-reading vocabulary can be built in advance from statistics over a training corpus, specifically by the following procedure:

(1) Obtain candidate weak-reading words to form a weak-reading word set. In practical applications, all function words can be taken as candidate weak-reading words.

(2) Obtain a training corpus.

(3) Calculate in turn the weak-reading frequency in the training corpus of each candidate weak-reading word in the weak-reading word set.

(4) If the weak-reading frequency is greater than a frequency threshold, determine that the candidate weak-reading word is a weak-reading word.

(5) Generate the weak-reading vocabulary from the determined weak-reading words.

Of course, in practical applications the weak-reading vocabulary can also be built by other methods, such as statistical modeling; the embodiments of the present invention place no limitation on this.
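Procedure (1) to (5) can be sketched as a simple counting pass. The text does not define "weak-reading frequency" precisely; this sketch assumes it is the share of a candidate's occurrences that are annotated as weakly read. The corpus format, the threshold value, and the pinyin entries are likewise hypothetical.

```python
from collections import Counter


def build_weak_vocabulary(candidates, corpus, threshold=0.5):
    """corpus: iterable of (word, is_weak) pairs from an annotated training corpus."""
    total, weak = Counter(), Counter()
    for word, is_weak in corpus:
        if word in candidates:
            total[word] += 1
            weak[word] += int(is_weak)
    # Keep candidates that occur at all and whose weak share exceeds the threshold.
    return {w for w in candidates if total[w] and weak[w] / total[w] > threshold}


# Hypothetical annotated occurrences of three candidate words.
corpus = [("de", True), ("de", True), ("de", False), ("shi", False), ("shi", True)]
vocabulary = build_weak_vocabulary({"de", "shi", "le"}, corpus)
```

Here "de" is weakly read in two of three occurrences and is kept, "shi" sits exactly at the threshold and is dropped because the comparison is strict, and "le" never occurs in the corpus.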

Step 206, extract the prosodic features of the syllable.

The prosodic features of the syllable may include one or more of the following: the part of speech of the segmented word containing the syllable, the position of the syllable within that word, and so on.

Step 207, determine whether the basic synthesis unit is weakly read according to the prosodic features of the syllable and the pre-built weak-reading decision tree.

Specifically, it is first determined from the prosodic features of the syllable and the pre-built weak-reading decision tree whether the syllable is weakly read; if the syllable is weakly read, the basic synthesis unit is weakly read, otherwise the basic synthesis unit is not weakly read.

Step 208, determine that the basic synthesis unit is weakly read.

The same word plays different roles in different contexts and, especially when it takes on different parts of speech, often has different expressive force, so weak reading carries some uncertainty. For this reason, the embodiments of the present invention further determine, according to the pre-built weak-reading decision tree, whether the syllable currently being checked is weakly read in its specific context.

The construction of the weak-reading decision tree, and the process of using it to determine whether a syllable is weakly read, are described in detail below.

Step 209, determine that the basic synthesis unit is not weakly read.
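The control flow of steps 202 to 209 can be sketched as one function. The feature extractor and the decision tree are passed in as callables, and the vocabulary entries are hypothetical pinyin strings; none of these names come from the text.

```python
def is_unit_weak(syllable_string, syllable, weak_vocab, extract_features, tree_predict):
    """Return True if the basic synthesis unit should be weakly read."""
    # Steps 202-203: a syllable string found in the vocabulary is weak outright.
    if syllable_string is not None and syllable_string in weak_vocab:
        return True                      # step 208
    # Steps 204-205: otherwise fall back to the single syllable.
    if syllable not in weak_vocab:
        return False                     # step 209
    # Steps 206-207: the syllable is a candidate, so let the decision tree
    # decide from its prosodic features in this context.
    return bool(tree_predict(extract_features(syllable)))


# Hypothetical demo data and stand-in callables.
weak_vocab = {"de", "yi xia"}
features_of = lambda syl: {"syllable": syl}
tree_says_weak = lambda features: True
tree_says_normal = lambda features: False
```

Note the asymmetry in the flow: a vocabulary hit on the syllable string is final, while a hit on the single syllable only makes it a candidate for the decision tree.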

Fig. 3 shows the construction of the weak-reading decision tree in an embodiment of the present invention, which comprises the following steps:

Step 301, obtain a large amount of text based on the weak-reading vocabulary as training data.

Step 302, perform word segmentation on the training data and determine the syllables contained in each segmented word.

Step 303, perform prosodic annotation on the syllables, the prosodic annotation information including weak-reading information.

Specifically, each syllable can be prosodically annotated according to the speech data corresponding to the training data.

In practical applications, the prosodic annotation information may further include the position of the weakly read syllable within its segmented word, the part of speech of the segmented word containing the weakly read syllable, and so on.

Step 304, train the weak-reading decision tree from the training data and the prosodic annotation information of each corresponding syllable.

Specifically, the weak-reading decision tree is first initialized. Then, starting from its root node, each non-leaf node is examined in turn according to a pre-established question set (the question set contains all information relevant to weak reading). If the node under examination needs to be split, it is split, yielding the child nodes and the training data corresponding to each child node; otherwise, the node is marked as a leaf node. When all non-leaf nodes have been examined, the weak-reading decision tree is obtained.
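A minimal sketch of question-set-driven tree growth is given below. The text does not say how it is decided that a node "needs to be split"; this sketch assumes an entropy-gain criterion with a small minimum gain, which is a common choice and not necessarily the one used in the invention. The question names, feature keys, and training samples are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())


def build_tree(samples, questions, min_gain=1e-9):
    """samples: list of (features, is_weak); questions: name -> yes/no predicate."""
    labels = [y for _, y in samples]
    best = None
    for name, ask in questions.items():
        yes = [s for s in samples if ask(s[0])]
        no = [s for s in samples if not ask(s[0])]
        if not yes or not no:
            continue  # this question does not separate the data at this node
        remainder = (len(yes) * entropy([y for _, y in yes])
                     + len(no) * entropy([y for _, y in no])) / len(samples)
        gain = entropy(labels) - remainder
        if best is None or gain > best[0]:
            best = (gain, name, yes, no)
    if best is None or best[0] < min_gain:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority label
    _, name, yes, no = best
    return (name, build_tree(yes, questions, min_gain), build_tree(no, questions, min_gain))


def predict(tree, features, questions):
    while isinstance(tree, tuple):  # internal node: (question, yes-subtree, no-subtree)
        name, yes, no = tree
        tree = yes if questions[name](features) else no
    return tree


questions = {
    "is_function_word": lambda f: f["pos"] == "function",
    "is_word_medial": lambda f: f["position"] == "medial",
}
samples = [
    ({"pos": "function", "position": "single"}, True),
    ({"pos": "function", "position": "single"}, True),
    ({"pos": "numeral", "position": "medial"}, True),
    ({"pos": "numeral", "position": "initial"}, False),
    ({"pos": "verb", "position": "single"}, False),
]
tree = build_tree(samples, questions)
```

On this toy data the root asks the function-word question and the "no" branch asks the word-medial question, so each leaf ends up pure.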

It should be noted that in practical applications the weak-reading decision tree can also be built by other methods; the embodiments of the present invention place no limitation on this.

The process of weak-reading prediction based on the above weak-reading decision tree is illustrated below.

Take the following text to be synthesized: "The red team and the blue team have 49 books in total."

Word segmentation yields: red team / and (conjunction) / blue team / in total / have (existential verb) / forty-nine (numeral) / (measure word) / book.

Weak-reading prediction: the syllables "and" (和), "have" (有), and "ten" (十, the middle syllable of the Chinese numeral for forty-nine) are in the weak-reading vocabulary, so only these three syllables need to be judged.

The weak-reading decision tree makes the following judgments:

(1) Is the segmented word containing the candidate syllable a function word? If so, the syllable is weakly read. "And" meets the condition and is judged weakly read.

(2) Is the segmented word containing the candidate syllable an existential verb? If so, is it preceded by a negation word? If so, the syllable is weakly read. "Have" is an existential verb, but it is not preceded by a negation word, so it is judged not weakly read.

(3) Is the segmented word containing the candidate syllable a numeral? If so, is the syllable in the middle of the word? If so, the syllable is weakly read. The word containing "ten" is a numeral and the syllable is word-medial, so it is judged weakly read.

If a syllable is weakly read, all of its basic synthesis units are weakly read, and vice versa.
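The three judgments in this example can be written out as plain rules. The sketch below hard-codes them for illustration; a trained decision tree would learn such rules from data rather than have them written by hand. The part-of-speech labels, the negation word set, and the English glosses used as keys are hypothetical.

```python
FUNCTION_POS = {"conjunction", "preposition", "particle"}  # hypothetical label set
NEGATION_WORDS = {"not"}                                   # hypothetical gloss


def predict_weak(pos, position, previous_word):
    """Apply judgments (1)-(3) to one candidate syllable."""
    if pos in FUNCTION_POS:                 # (1) function word: weakly read
        return True
    if pos == "existential_verb":           # (2) weak only after a negation word
        return previous_word in NEGATION_WORDS
    if pos == "numeral":                    # (3) weak only in word-medial position
        return position == "medial"
    return False


# Candidate syllables from the example: (part of speech, position, previous word).
candidates = {
    "and": ("conjunction", "single", "red team"),
    "have": ("existential_verb", "single", "in total"),
    "ten": ("numeral", "medial", "have"),
}
decisions = {syl: predict_weak(*info) for syl, info in candidates.items()}
```

The decisions reproduce the outcome described above: "and" and "ten" are weakly read, "have" is not.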

It should be noted that the synthesis parameter model in the embodiments of the present invention is an acoustic model.

In general, compared with normal pronunciation, a weakly read basic synthesis unit has the following characteristics:

(1) its duration is usually shorter;

(2) its fundamental frequency curve tends toward the middle of the pitch range: a unit whose original fundamental frequency curve is high is lowered, and a unit whose original fundamental frequency curve is low is raised;

(3) its energy is lower.

Based on these characteristics, in the embodiments of the present invention the acoustic model of each weakly read basic synthesis unit can first be trained and compared acoustically with the corresponding non-weakly-read basic synthesis unit, to determine the pattern of differences between weak and non-weak reading in duration, energy, and fundamental frequency. Then, when weak-reading processing is applied to a synthesis parameter model, the model parameters are updated by rules such as shortening the duration, lowering or raising the fundamental frequency, and lowering the energy, so as to achieve the weak-reading effect.

Fig. 4 is a flowchart of weak-reading processing of a synthesis parameter model in an embodiment of the present invention, which comprises the following steps:

Step 401, obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter.

Step 402, update the model parameters according to mapping rules obtained by prior training, to obtain the updated synthesis parameter model.
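Steps 401 and 402 can be sketched as a per-parameter scaling. The sketch assumes each mapping rule is a multiplicative ratio keyed by the unit's position in the syllable; the ratio values, units, and dictionary layout are invented for illustration and are not taken from the text.

```python
import math

# Hypothetical mapping rules: one ratio per parameter and syllable position.
MAPPING_RULES = {
    "initial": {"duration": 0.80, "f0": 0.95, "energy": 0.90},
    "medial": {"duration": 0.75, "f0": 0.95, "energy": 0.85},
    "final": {"duration": 0.70, "f0": 0.90, "energy": 0.80},
}


def apply_weak_mapping(params, position, rules=MAPPING_RULES):
    """Return updated model parameters for a weakly read basic synthesis unit."""
    ratios = rules[position]
    return {name: value * ratios[name] for name, value in params.items()}


# Step 401: parameters of the original model (values are made up).
original = {"duration": 0.10, "f0": 200.0, "energy": 60.0}
# Step 402: parameters of the updated model.
updated = apply_weak_mapping(original, "final")
```

The original dictionary is left unchanged, so the normal-pronunciation model remains available for contexts in which the same unit is not weakly read.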

The training process of above-mentioned mapping ruler is as follows:

In practical applications, duration parameters, base frequency parameters, energy parameter pair in synthetic parameters model can be respectively trained The mapping ruler answered, specific as follows:

1, duration parameters mapping ruler

(1) training data is obtained；

(2) the basic synthesis unit of weak reading in the training data is determined；

(3) calculate it is described it is weak read basic synthesis unit in the case that weak reading and it is non-it is weak read two kinds duration ratio, and by its As duration parameters mapping ruler.

Since a syllable has corresponded to one or more basic synthesis units, it, can in order to keep mapping ruler more acurrate With calculate separately the basic synthesis unit in syllable different location (i.e. syllable is first, in syllable, these three last positions of syllable), Duration mean value when weak reading is with two kinds of non-weak reading；Then further according to the mean value computation in weak reading and two kinds of situations of non-weak reading Under duration ratio.

Based on the above duration parameter mapping rule, when weak-reading processing is applied to a synthesis parameter model, the duration parameter in the synthesis parameter model can be adjusted by the above duration ratio according to the position of the basic synthesis unit within the syllable.
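The position-dependent duration rule can be sketched as follows. This is only an illustration under stated assumptions: the training samples are taken to be `(position, is_weak, duration)` tuples for one basic synthesis unit, the position labels are free-form strings, and a position with no trained ratio is left unchanged. The function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# (position within the syllable, weakly read?, observed duration)
Sample = Tuple[str, bool, float]


def train_duration_ratios(samples: Iterable[Sample]) -> Dict[str, float]:
    """Per position, divide the mean weak duration by the mean non-weak duration."""
    weak = defaultdict(list)
    normal = defaultdict(list)
    for position, is_weak, duration in samples:
        (weak if is_weak else normal)[position].append(duration)
    # A ratio is only defined where both conditions were observed.
    return {
        position: mean(weak[position]) / mean(normal[position])
        for position in weak
        if position in normal
    }


def adjust_duration(duration: float, position: str, ratios: Dict[str, float]) -> float:
    """Scale a duration by the ratio for its position (assumption: 1.0 if untrained)."""
    return duration * ratios.get(position, 1.0)
```

For example, if weakly read syllable-initial units average half the duration of their non-weak counterparts, `adjust_duration` halves the duration parameter of a syllable-initial unit.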

2. Fundamental frequency parameter mapping rule

Duration is a scalar, whereas fundamental frequency is a vector: each basic synthesis unit corresponds to a fundamental frequency curve. To simplify the rule, the average fundamental frequency of the basic synthesis unit may be used for the parameter mapping, as follows:

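(1) Obtain training data;

(2) Determine the weakly read basic synthesis units in the training data;

(3) Calculate the ratio between the average fundamental frequency of the weakly read basic synthesis unit under weak reading and its average fundamental frequency under non-weak reading, and use this ratio as the fundamental frequency parameter mapping rule.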

Based on the above fundamental frequency parameter mapping rule, when weak-reading processing is applied to a synthesis parameter model, the fundamental frequency parameter in the synthesis parameter model can be adjusted by the above fundamental frequency ratio according to the position of the basic synthesis unit within the syllable.
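Because the fundamental frequency is a curve rather than a scalar, the rule is derived from curve averages. The sketch below assumes that each curve is a non-empty list of values and that the ratio is applied uniformly to every point of the curve; both are illustrative assumptions, and the function names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def mean_f0_ratio(
    weak_curves: Sequence[Sequence[float]],
    normal_curves: Sequence[Sequence[float]],
) -> float:
    """Ratio of the average fundamental frequency under weak vs. non-weak reading."""
    weak_mean = mean(mean(curve) for curve in weak_curves)
    normal_mean = mean(mean(curve) for curve in normal_curves)
    return weak_mean / normal_mean


def scale_curve(curve: Sequence[float], ratio: float) -> List[float]:
    """Apply the ratio to every point, which scales the curve's mean by the same ratio."""
    return [value * ratio for value in curve]
```

Scaling every point preserves the shape of the curve while shifting its average, which matches the simplification of using only the average fundamental frequency.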

3. Energy parameter mapping rule

Energy is also a vector: each basic synthesis unit corresponds to an energy curve. The energy parameter mapping can be performed using the same method as for the fundamental frequency parameter mapping rule, which is not repeated here.

The method for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention addresses the need, in continuous speech synthesis systems, for synthesized speech whose pitch rises and falls. Based on the prediction of weakly read segments, it applies weak-reading processing to the synthesis parameter models of the basic synthesis units corresponding to those segments, so that the continuous speech as a whole rises and falls. By handling the weak-reading phenomenon, which is comparatively easy to process, and using the "light" to set off the "heavy", the scheme achieves an overall rise and fall in continuous speech, fills the gap left by current semantic understanding technology, whose stress prediction has not yet reached a practical effect in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.

It should also be noted that weak reading and stress may be considered together in speech synthesis to further improve the naturalness of continuous synthesized speech.

Correspondingly, an embodiment of the present invention also provides a speech synthesis system; Fig. 5 is a structural block diagram of the system.

In this embodiment, the system includes:

a receiving module 501, configured to receive text to be synthesized;

a basic synthesis unit sequence determining module 502, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;

a weak-reading prediction module 503, configured to determine whether each basic synthesis unit is weakly read;

a synthesis parameter model obtaining module 504, configured to obtain the synthesis parameter model corresponding to the basic synthesis unit;

a weak-reading processing module 505, configured to, when the basic synthesis unit is weakly read, apply weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;

a synthesis parameter model sequence generation module 506, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;

a synthesis module 507, configured to generate continuous speech according to the synthesis parameter model sequence.
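The cooperation of modules 501 to 507 can be sketched as a single function whose collaborators are passed in as callables. This is an illustrative arrangement only; the function and parameter names are hypothetical, and each callable stands in for a module whose internals are not specified here.

```python
from typing import Any, Callable, Iterable, List


def synthesize(
    text: str,                                        # received by module 501
    to_units: Callable[[str], Iterable[Any]],         # module 502
    is_weak: Callable[[Any], bool],                   # module 503
    get_model: Callable[[Any], Any],                  # module 504
    weaken: Callable[[Any], Any],                     # module 505
    generate_speech: Callable[[List[Any]], Any],      # module 507
) -> Any:
    """Build the model sequence (module 506) and hand it to the synthesis module."""
    models: List[Any] = []
    for unit in to_units(text):
        model = get_model(unit)
        if is_weak(unit):
            # Only weakly read units have their model updated.
            model = weaken(model)
        models.append(model)
    return generate_speech(models)
```

Passing the modules in as callables reflects the statement that the units and modules may or may not be physically separate.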

The weak-reading prediction module 503 may use the weak-reading prediction method described above to determine whether each basic synthesis unit is weakly read. One specific structure of the weak-reading prediction module 503 may include the following units:

The determination unit may include:

the extraction unit, configured to extract the prosodic features of the syllable when triggered by the inspection unit;

a judging unit, configured to determine, according to the prosodic features extracted by the extraction unit and a weak-reading decision tree constructed in advance, whether the syllable is weakly read, and to determine that the basic synthesis unit is weakly read if the syllable is weakly read, and otherwise that the basic synthesis unit is not weakly read.
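The interplay of the inspection, extraction, and judging units can be sketched as a two-stage check: a vocabulary lookup first, then the decision tree only for syllables found in the vocabulary. The decision tree and feature extractor are passed in as callables because their construction is outside this sketch; all names are hypothetical.

```python
from typing import Any, Callable, Set


def is_weakly_read(
    syllable_string: str,
    syllable: str,
    weak_vocabulary: Set[str],
    extract_features: Callable[[str], Any],
    decision_tree: Callable[[Any], bool],
) -> bool:
    """Return True if the basic synthesis unit in this syllable is weakly read."""
    # Inspection: a syllable string listed in the vocabulary is weakly read outright.
    if syllable_string in weak_vocabulary:
        return True
    # Inspection: a syllable absent from the vocabulary is never weakly read.
    if syllable not in weak_vocabulary:
        return False
    # Extraction and judging: features are computed only when actually needed.
    return decision_tree(extract_features(syllable))
```

Deferring feature extraction until the vocabulary check has passed mirrors the way the inspection unit triggers the extraction unit.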

The weak-reading vocabulary and the weak-reading decision tree may be constructed by the speech synthesis system of the present invention or by another system; the embodiments of the present invention place no limitation on this. If they are constructed by the speech synthesis system of the present invention, the system may further include a weak-reading vocabulary construction module and a weak-reading decision tree construction module, used respectively to construct the weak-reading vocabulary and the weak-reading decision tree. Depending on the specific construction method, each of the two modules may have a correspondingly suited structure, which is not limited here.

One specific structure of the weak-reading processing module 505 is shown in Fig. 6 and includes:

a model parameter obtaining unit 601, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;

a parameter updating unit 602, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain an updated synthesis parameter model.

In practical applications, the mapping rules may be trained in advance by the system of the present invention or by another system.

If they are trained by the system of the present invention, the system further needs to include a mapping rule training module (not shown), configured to construct mapping rules that reflect the correspondence between non-weak-reading synthesis parameters and weak-reading synthesis parameters.

The mapping rule training module may train a duration parameter mapping rule, a fundamental frequency parameter mapping rule, and an energy parameter mapping rule separately for the model parameters of the synthesis parameter model. For the specific training process, refer to the description in the method embodiment above; details are not repeated here.

Correspondingly, the parameter updating unit 602 needs to update each model parameter according to its corresponding mapping rule.

The system for improving the prosodic naturalness of synthesized speech provided by the embodiments of the present invention handles, during speech synthesis, the weak-reading phenomenon, which is comparatively easy to process. It thereby achieves an overall rise and fall in continuous speech, fills the gap left by current semantic understanding technology, whose stress prediction has not yet reached a practical effect in speech synthesis, and substantially improves the naturalness of continuous synthesized speech.

The embodiments in this specification are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and the relevant points can be found in the description of the method embodiment. The system embodiment described above is merely illustrative: the units and modules described as separate components may or may not be physically separate. In addition, some or all of the units and modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The structure, features, and effects of the present invention have been described in detail above on the basis of the embodiments shown in the drawings. The foregoing describes only the preferred embodiments of the present invention, and the scope of implementation of the present invention is not limited to what is shown in the drawings. Any change made according to the concept of the present invention, or any equivalent embodiment modified into an equivalent variation, that does not go beyond the spirit covered by the description and the drawings shall fall within the protection scope of the present invention.

Claims

1. A method for improving the prosodic naturalness of synthesized speech, characterized by comprising:

receiving text to be synthesized;

determining whether each basic synthesis unit is weakly read;

obtaining the synthesis parameter model corresponding to the basic synthesis unit and, if the basic synthesis unit is weakly read, applying weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;

2. The method according to claim 1, characterized in that determining whether the basic synthesis unit is weakly read comprises:

obtaining the syllable string and/or syllable to which the basic synthesis unit belongs;

determining whether the syllable string and/or syllable is weakly read and, if so, determining that the basic synthesis unit is weakly read.

3. The method according to claim 2, characterized in that determining whether the syllable string and/or syllable is weakly read comprises:

if so, determining that the basic synthesis unit is weakly read;

if the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary, extracting the prosodic features of the syllable and then determining, according to the prosodic features of the syllable and a weak-reading decision tree constructed in advance, whether the syllable is weakly read; if the syllable is weakly read, the basic synthesis unit is weakly read, and otherwise the basic synthesis unit is not weakly read;

if the syllable to which the basic synthesis unit belongs is not in the preset weak-reading vocabulary, determining that the basic synthesis unit is not weakly read.

4. The method according to claim 3, characterized in that the process of constructing the weak-reading vocabulary comprises:

obtaining candidate weakly read words to form a weakly read word set;

obtaining a training corpus;

generating the weak-reading vocabulary from the words determined to be weakly read.

5. The method according to claim 3, characterized in that the process of constructing the weak-reading decision tree comprises:

6. The method according to any one of claims 1 to 5, characterized in that applying weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model comprises:

obtaining the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;

7. A system for improving the prosodic naturalness of synthesized speech, characterized in that the system comprises:

a receiving module, configured to receive text to be synthesized;

a basic synthesis unit sequence determining module, configured to determine the basic synthesis unit sequence corresponding to the text, the basic synthesis unit sequence including one or more basic synthesis units;

a weak-reading processing module, configured to, when the basic synthesis unit is weakly read, apply weak-reading processing to the synthesis parameter model corresponding to the basic synthesis unit to obtain an updated synthesis parameter model;

a synthesis parameter model sequence generation module, configured to generate the synthesis parameter model sequence corresponding to the basic synthesis unit sequence;

8. The system according to claim 7, characterized in that the weak-reading prediction module comprises:

a determination unit, configured to determine whether the syllable string and/or syllable is weakly read and, if so, to determine that the basic synthesis unit is weakly read.

9. The system according to claim 8, characterized in that the determination unit comprises:

an inspection unit, configured to check whether the syllable string to which the basic synthesis unit belongs is in a preset weak-reading vocabulary; if so, to determine that the syllable is weakly read; otherwise, to check whether the syllable to which the basic synthesis unit belongs is in the preset weak-reading vocabulary; if so, to trigger an extraction unit to extract the prosodic features of the syllable; and otherwise, to determine that the basic synthesis unit is not weakly read;

a judging unit, configured to determine, according to the prosodic features of the syllable extracted by the extraction unit and a weak-reading decision tree constructed in advance, whether the syllable is weakly read, and to determine that the basic synthesis unit is weakly read if the syllable is weakly read, and otherwise that the basic synthesis unit is not weakly read.

10. The system according to claim 9, characterized in that the system further comprises a weak-reading vocabulary construction module, configured to construct the weak-reading vocabulary.

11. The system according to claim 9, characterized in that the system further comprises a weak-reading decision tree construction module, configured to construct the weak-reading decision tree.

12. The system according to any one of claims 7 to 11, characterized in that the weak-reading processing module comprises:

a model parameter obtaining unit, configured to obtain the model parameters of the synthesis parameter model, the model parameters including a duration parameter, a fundamental frequency parameter, and an energy parameter;

a parameter updating unit, configured to update the model parameters according to mapping rules obtained by training in advance, to obtain an updated synthesis parameter model.