CN106157948B

CN106157948B - Fundamental frequency modeling method and system

Info

Publication number: CN106157948B
Application number: CN201510195120.6A
Authority: CN
Inventors: 殷翔; 江源; 王影; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2015-04-22
Filing date: 2015-04-22
Publication date: 2019-10-18
Anticipated expiration: 2035-04-22
Also published as: CN106157948A

Abstract

The invention discloses a fundamental frequency modeling method and system. The method comprises: dividing the prosodic hierarchy, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, where the phrase layer and the word layer are the higher prosodic layers and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers; determining the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers; and constructing fundamental frequency models layer by layer, from high to low, in an iterative manner according to the fundamental frequency features of the prosodic units, where, for the higher prosodic layers, the influence of the tone information contained in the syllable layer is removed when the fundamental frequency model is constructed. The invention effectively eliminates the influence of tone information on modeling of the higher prosodic layers, so that fundamental frequency features can be predicted more naturally.

Description

A fundamental frequency modeling method and system

Technical field

The present invention relates to speech signal processing technology, and in particular to a fundamental frequency modeling method and system.

Background art

As one of the important features in speech synthesis, the fundamental frequency (F0) feature carries both the prosodic information of short-time speech segments and the prosodic information of long-time speech segments (suprasegmental prosodic information), such as tone information. Predicting fundamental frequency features more naturally is one of the important goals for improving speech synthesis quality.

The fundamental frequency modeling method in general use today is hierarchical fundamental frequency modeling, which starts from the mechanism by which prosody is produced and from the additivity of fundamental frequency features in the log domain, as shown in formula (1) and Fig. 1:

F0_all=F0_state+F0_phone+F0_syllable+F0_word (1)

The prosodic hierarchy is divided, from high to low, into a word layer, a syllable layer, a phoneme layer and a state layer, as shown in Fig. 1, and the fundamental frequency feature of each layer corresponds to a different prosodic variation. Starting from the mechanism of prosody production, the existing scheme models separately the prosodic variation governed by the context attributes of each level.
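The additive relation in formula (1) can be illustrated with a minimal Python sketch. The function name `combine_layers` and all numeric values are hypothetical and chosen only for illustration; the sketch assumes every layer supplies one log-domain value per frame.

```python
import math

def combine_layers(layer_contours):
    """Sum per-layer log-F0 contours frame by frame, as in formula (1)."""
    contours = list(layer_contours.values())
    n_frames = len(contours[0])
    if any(len(c) != n_frames for c in contours):
        raise ValueError("all layer contours must have the same number of frames")
    return [sum(c[t] for c in contours) for t in range(n_frames)]

# Illustrative (made-up) log-domain contributions of each layer for 3 frames.
layers = {
    "word": [5.0, 5.0, 5.0],
    "syllable": [0.10, 0.00, -0.10],
    "phone": [0.02, 0.00, -0.02],
    "state": [0.00, 0.01, 0.00],
}
log_f0_all = combine_layers(layers)

# Because the model is additive in the log domain, Hz values come from exp().
f0_hz = [math.exp(v) for v in log_f0_all]
```

Adding a phrase layer, as the invention proposes, would amount to one more entry in `layers` under this view.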

However, existing hierarchical fundamental frequency modeling methods do not consider the prosodic variation of higher prosodic units, such as that of the phrase layer, so the synthesized sentence as a whole has weak rise and fall and sounds emotionally flat. In addition, although existing hierarchical methods model layer by layer from high to low, they do not consider the influence that the tone information of tonal languages has on the modeling of the higher prosodic layers. As a result, the traditional hidden Markov model (HMM) approach cannot capture the fundamental frequency features of the higher prosodic layers well, such as word-layer and phrase-layer information, and the modeling quality for those layers is greatly reduced. For these reasons, existing fundamental frequency modeling methods cannot predict fundamental frequency features naturally.

Summary of the invention

Embodiments of the present invention provide a fundamental frequency modeling method and system, to solve the problem that existing fundamental frequency modeling methods cannot predict fundamental frequency features naturally.

To this end, embodiments of the present invention provide the following technical solutions:

A fundamental frequency modeling method, comprising:

dividing the prosodic hierarchy, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, and determining the prosodic units of each layer, where the phrase layer and the word layer are the higher prosodic layers and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers;

determining the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers;

constructing fundamental frequency models layer by layer, from high to low, in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, removing the influence of the tone information contained in the syllable layer when constructing the fundamental frequency model.

Preferably, determining the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers comprises:

segmenting the natural fundamental frequency by syllable to obtain the natural fundamental frequency values of each syllable unit;

parameterizing the natural fundamental frequency values to obtain the natural fundamental frequency feature of each syllable unit;

obtaining the predicted fundamental frequency values of each syllable unit according to the natural fundamental frequency feature.

Preferably, parameterizing the natural fundamental frequency values comprises:

parameterizing the natural fundamental frequency values using an optimized DCT, where the optimized DCT estimates the DCT coefficients with the sum of squared differences between the generated fundamental frequency feature and the natural fundamental frequency feature as the objective function;

and obtaining the predicted fundamental frequency values of each syllable unit according to the natural fundamental frequency feature comprises:

performing fundamental frequency modeling on the natural fundamental frequency feature of each syllable unit according to the context attribute information of that syllable unit and the natural fundamental frequency feature;

according to the fundamental frequency model, taking the mean of the model to which each syllable unit belongs as the predicted fundamental frequency feature of that syllable unit;

applying the inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each syllable unit.

Preferably, constructing the phrase-layer fundamental frequency model comprises:

subtracting the predicted fundamental frequency values of each syllable unit from the natural fundamental frequency values of that syllable unit, to obtain the natural residual fundamental frequency values used for phrase-layer modeling with the syllable-layer influence removed;

segmenting the natural residual fundamental frequency values by phrase to obtain the natural fundamental frequency values of each phrase unit;

parameterizing the natural fundamental frequency values to obtain the natural fundamental frequency feature of each phrase unit;

constructing the phrase-layer fundamental frequency model from the natural fundamental frequency features of the phrase units, and obtaining the predicted fundamental frequency feature of each phrase unit.

Preferably, constructing the word-layer fundamental frequency model comprises:

subtracting the predicted fundamental frequency values of each phrase unit from the natural fundamental frequency values of that phrase unit, to obtain the natural residual fundamental frequency values used for word-layer modeling;

segmenting the natural residual fundamental frequency values by word to obtain the natural fundamental frequency values of each word unit;

parameterizing the natural fundamental frequency values to obtain the natural fundamental frequency feature of each word unit;

constructing the word-layer fundamental frequency model from the natural fundamental frequency features of the word units, and obtaining the predicted fundamental frequency feature of each word unit.

Preferably, the method further comprises:

characterizing the natural fundamental frequency features of the phrase units and the word units with DCT parameters.

Preferably, the method further comprises: optimizing the fundamental frequency model parameters of each prosodic layer using a DNN-based method.

A fundamental frequency modeling system, comprising:

a prosodic layer division module, configured to divide the prosodic hierarchy, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, and to determine the prosodic units of each layer, where the phrase layer and the word layer are the higher prosodic layers and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers;

an influence determining module, configured to determine the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers;

a modeling module, configured to construct fundamental frequency models layer by layer, from high to low, in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, to remove the influence of the tone information contained in the syllable layer when constructing the fundamental frequency model; the modeling module comprises a phrase layer modeling module, a word layer modeling module and a lower layer modeling module.

Preferably, the influence determining module comprises:

a natural fundamental frequency division unit, configured to segment the natural fundamental frequency by syllable to obtain the natural fundamental frequency values of each syllable unit;

a parameterization unit, configured to parameterize the natural fundamental frequency values to obtain the natural fundamental frequency feature of each syllable unit;

a predicted fundamental frequency value acquiring unit, configured to obtain the predicted fundamental frequency values of each syllable unit according to the natural fundamental frequency feature.

Preferably, the parameterization unit is specifically configured to parameterize the natural fundamental frequency values using an optimized DCT, where the optimized DCT estimates the DCT coefficients with the sum of squared differences between the generated fundamental frequency feature and the natural fundamental frequency feature as the objective function;

the predicted fundamental frequency value acquiring unit comprises:

a fundamental frequency modeling subunit, configured to perform fundamental frequency modeling on the natural fundamental frequency feature of each syllable unit according to the context attribute information of that syllable unit and the natural fundamental frequency feature;

a prediction subunit, configured to take, according to the fundamental frequency model, the mean of the model to which each syllable unit belongs as the predicted fundamental frequency feature of that syllable unit;

an inverse DCT subunit, configured to apply the inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each syllable unit.

Preferably, the phrase layer modeling module comprises:

a phrase layer acquiring unit, configured to subtract the predicted fundamental frequency values of each syllable unit from the natural fundamental frequency values of that syllable unit, to obtain the natural residual fundamental frequency values used for phrase-layer modeling with the syllable-layer influence removed;

a phrase layer division unit, configured to segment the natural residual fundamental frequency values by phrase to obtain the natural fundamental frequency values of each phrase unit;

a phrase layer parameterization unit, configured to parameterize the natural fundamental frequency values to obtain the natural fundamental frequency feature of each phrase unit;

a phrase layer prediction unit, configured to construct the phrase-layer fundamental frequency model from the natural fundamental frequency features of the phrase units and to obtain the predicted fundamental frequency feature of each phrase unit.

Preferably, the word layer modeling module comprises:

a word layer acquiring unit, configured to subtract the predicted fundamental frequency values of each phrase unit from the natural fundamental frequency values of that phrase unit, to obtain the natural residual fundamental frequency values used for word-layer modeling;

a word layer division unit, configured to segment the natural residual fundamental frequency values by word to obtain the natural fundamental frequency values of each word unit;

a word layer parameterization unit, configured to parameterize the natural fundamental frequency values to obtain the natural fundamental frequency feature of each word unit;

a word layer prediction unit, configured to construct the word-layer fundamental frequency model from the natural fundamental frequency features of the word units and to obtain the predicted fundamental frequency feature of each word unit.

Preferably, the system further comprises:

a model parameter optimization module, configured to optimize the fundamental frequency model parameters of each prosodic layer using a DNN-based method.

The fundamental frequency modeling method and system provided by the embodiments of the present invention divide the prosodic hierarchy, from high to low, into layers that include a phrase layer. This adds modeling of phrase-layer fundamental frequency features and thereby strengthens the rise and fall of synthesized sentences. Moreover, before the fundamental frequency features of the higher prosodic layers (phrase layer, word layer) are modeled, the influence of tone information on their modeling is removed, which improves the modeling of higher-layer fundamental frequency features.

Further, characterizing the fundamental frequency features of the higher prosodic layers with optimized DCT coefficients better captures the variation of the fundamental frequency feature over an entire prosodic unit, and helps ensure that the fundamental frequency feature predicted after modeling is closer to the natural one.

Further, the initialized fundamental frequency model parameters of the prosodic layers are optimized with a deep neural network (DNN). Because the nonlinear hierarchical structure of a DNN represents combinations of text attributes well, overfitting is less likely; and because a DNN does not partition the data during training, it better captures the constraints across the whole data space and effectively mitigates the data sparsity problem.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are only some of the embodiments recorded in the present invention; those of ordinary skill in the art can obtain other drawings from them.

Fig. 1 is a schematic diagram of the principle of existing hierarchical fundamental frequency modeling;

Fig. 2 is a flowchart of the fundamental frequency modeling method of an embodiment of the present invention;

Fig. 3 is a flowchart of fundamental frequency value parameterization in the fundamental frequency modeling method of an embodiment of the present invention;

Fig. 4 is a flowchart of determining the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers in an embodiment of the present invention;

Fig. 5 is a flowchart of constructing the fundamental frequency models in an iterative manner in an embodiment of the present invention;

Fig. 6 is a structural schematic diagram of a fundamental frequency modeling system of an embodiment of the present invention;

Fig. 7 is a detailed structural schematic diagram of the influence determining module in the fundamental frequency modeling system of an embodiment of the present invention;

Fig. 8 is another structural schematic diagram of the fundamental frequency modeling system of an embodiment of the present invention.

Detailed description of the embodiments

To help those skilled in the art better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings and implementations.

The fundamental frequency modeling method of the embodiments divides the prosodic hierarchy, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, so that the long-time prosodic variation of the phrase layer can be described well, which strengthens the rise and fall of the whole synthesized sentence. In addition, the influence of tone information on higher-layer fundamental frequency modeling is removed before modeling, which effectively prevents tone information from affecting the modeling of the higher prosodic layers and improves the naturalness of synthesized speech.

Fig. 2 is a flowchart of a fundamental frequency modeling method of an embodiment of the present invention, which comprises the following steps:

Step 201: divide the prosodic hierarchy, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, and determine the prosodic units of each layer. The phrase layer and the word layer are the higher prosodic layers; the syllable layer, the phoneme layer and the state layer are the lower prosodic layers.

In this embodiment, the prosodic hierarchy is divided, from high to low, into a phrase layer, a word layer, a syllable layer, a phoneme layer and a state layer, and the context attributes of each layer's prosodic units, together with the corresponding context attribute questions, are designed.

Then, based on the context attributes and their corresponding questions, phoneme duration is modeled on the training data using the traditional HMM method, yielding the duration information of each phoneme.

Next, the duration information and context attributes of each phoneme are used to analyze the context attributes of the prosodic units of every layer, yielding the duration information of the prosodic units of each layer.

For example, for a Chinese syllable unit, if the context attribute "relative position of the current phoneme in the syllable" is 1 or 0, the start time of the first state of that phoneme is taken as the start of the syllable unit; if that attribute is 3 (the context attribute design stipulates that a Chinese syllable contains at most three phonemes) or 0, the end time of the last state of that phoneme is taken as the end of the syllable. Analyzing the context attributes in this way yields the start and end positions of each syllable unit. The units of the other prosodic layers are delimited similarly.
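The boundary analysis in the example above might be sketched as follows. This is a simplified illustration, not the patented procedure: `syllable_spans` is a hypothetical name, phoneme durations are given in frames, and only the start rule (position attribute 1 or 0) is applied, with each syllable assumed to end where the next one starts. That assumption is made so that syllables with fewer than three phonemes are also closed.

```python
def syllable_spans(phones):
    """Derive syllable (start, end) frame spans from phoneme-level information.

    phones: list of (duration_in_frames, position_in_syllable) tuples.
    Returns half-open spans, one per syllable.
    """
    spans = []
    start = None
    t = 0  # running start frame of the current phoneme
    for duration, position in phones:
        if position in (0, 1):      # this phoneme opens a new syllable
            if start is not None:
                spans.append((start, t))
            start = t
        t += duration
    if start is not None:           # close the final syllable
        spans.append((start, t))
    return spans

# Made-up durations: two three-phoneme syllables, then a single-phoneme one.
phones = [(5, 1), (8, 2), (4, 3), (6, 1), (9, 2), (7, 0)]
spans = syllable_spans(phones)
```

The same accumulation over phoneme durations could be repeated with word-level or phrase-level position attributes to delimit the higher layers.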

Step 202: determine the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers.

To remove the influence of tone information on the prosodic variation of the higher prosodic layers, the syllable layer is preprocessed first. For example, the natural fundamental frequency can be segmented by syllable to obtain the natural fundamental frequency values of each syllable unit; the natural fundamental frequency values are then parameterized to obtain the natural fundamental frequency feature of each syllable unit; finally, the predicted fundamental frequency values of each syllable unit are obtained from the natural fundamental frequency feature.

In the embodiments of the present invention, the discrete cosine transform (DCT) can be used to parameterize the natural fundamental frequency values and obtain the natural fundamental frequency feature of each syllable unit. Fundamental frequency modeling is then performed on the natural fundamental frequency feature of each syllable unit according to the context attribute information of that unit and the natural fundamental frequency feature; according to the resulting model, the mean of the model to which each syllable unit belongs is taken as the predicted fundamental frequency feature of that unit; and the inverse DCT is applied to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each syllable unit.

Further, the existing DCT parameterization can itself be optimized, and the optimized DCT parameterization used to parameterize the natural fundamental frequency values. The optimized method estimates the DCT coefficients with the sum of squared differences between the generated and natural fundamental frequency features as the objective function, which further ensures that the fundamental frequency feature predicted after modeling is closer to the natural one. The optimized DCT parameterization proposed in the embodiments is described in detail below.

Fig. 3 shows the flow for parameterizing natural fundamental frequency values with the optimized DCT in an embodiment of the present invention, which comprises the following steps:

Step 301: set the objective function.

To make the predicted fundamental frequency feature after modeling closer to the natural fundamental frequency feature, this embodiment sets the objective function $L$ to the sum of squared differences between the natural fundamental frequency feature and the generated fundamental frequency feature, as shown in formula (1):

$$L = \sum_{t \in V} \left(s_t - \hat{s}_t\right)^2 \qquad (1)$$

where $s_t$ is the natural fundamental frequency value at frame $t$, $\hat{s}_t$ is the predicted fundamental frequency value at frame $t$, $V$ denotes the set of frame indices at which both the natural fundamental frequency feature and the generated fundamental frequency feature are voiced, and $C$ denotes the DCT coefficient vector.

Step 302: express the objective function in terms of the traditional DCT.

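According to the traditional DCT, $\hat{s}_t$ in formula (1) can be expressed as the product of a constant vector $D^{(t)}$ and the DCT coefficient vector $C$, so formula (1) can be converted into formula (2):

$$L = \sum_{t \in V} \left(s_t - D^{(t)\top} C\right)^2 \qquad (2)$$

where $D^{(t)}$ is the constant vector of DCT basis values at frame $t$, given by formula (3), and $N$ denotes the dimension of the DCT.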

Step 303: minimize the transformed objective function.

The DCT coefficients $C$ in formula (2) are estimated by minimizing $L$, as shown in formula (4):

$$C^* = \arg\min_{C} L \qquad (4)$$

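Step 304: compute the estimated DCT coefficients $C^*$ from the minimized objective function, as shown in formula (5):

$$C^* = R^{-1} q \qquad (5)$$

where $R = \sum_{t \in V} D^{(t)} D^{(t)\top}$ and $q = \sum_{t \in V} s_t D^{(t)}$, which follow from setting the derivative of the objective function with respect to $C$ to zero.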

The DCT coefficients estimated by the optimized DCT parameterization are a closed-form solution. Mathematically, this closed-form solution gives the optimal fit to the fundamental frequency feature under the objective function above, so the fundamental frequency feature predicted after the DCT coefficients are modeled is closer to the natural fundamental frequency feature than with the conventional method.
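The closed-form estimate can be illustrated with a short pure-Python sketch. The names `dct_basis`, `solve`, `fit_dct` and `reconstruct` are hypothetical, and the cosine basis used for the constant vector is an assumption (a DCT-II style basis without normalization), since the source text does not reproduce formula (3). The sketch accumulates the normal equations over voiced frames only and solves them by Gaussian elimination.

```python
import math

def dct_basis(t, n_frames, order):
    """Assumed DCT-II style basis vector for frame t (stand-in for formula (3))."""
    return [math.cos(math.pi * (t + 0.5) * n / n_frames) for n in range(order)]

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few voiced frames for this order")
        m[col], m[pivot] = m[pivot], m[col]
        for i in range(col + 1, n):
            factor = m[i][col] / m[col][col]
            for k in range(col, n + 1):
                m[i][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def fit_dct(f0, voiced, order):
    """Least-squares DCT coefficients using voiced frames only."""
    n_frames = len(f0)
    r = [[0.0] * order for _ in range(order)]
    q = [0.0] * order
    for t in range(n_frames):
        if not voiced[t]:
            continue  # unvoiced frames do not enter the objective
        d = dct_basis(t, n_frames, order)
        for i in range(order):
            q[i] += f0[t] * d[i]
            for j in range(order):
                r[i][j] += d[i] * d[j]
    return solve(r, q)

def reconstruct(coeffs, n_frames):
    """Inverse step: generate a contour from DCT coefficients."""
    return [sum(c * d for c, d in zip(coeffs, dct_basis(t, n_frames, len(coeffs))))
            for t in range(n_frames)]

# Synthetic check: a contour built from known coefficients, with two unvoiced frames.
true_coeffs = [5.0, 0.2, -0.1]
f0 = reconstruct(true_coeffs, 10)
voiced = [True] * 10
voiced[3] = voiced[4] = False
estimated = fit_dct(f0, voiced, order=3)
```

Because the synthetic contour lies exactly in the span of the assumed basis, the estimate recovers the original coefficients despite the unvoiced gap, which is the practical benefit of fitting over voiced frames rather than transforming an interpolated contour.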

Based on the optimized DCT parameterization above, Fig. 4 shows the flow for determining the influence of the tone information contained in the syllable layer on fundamental frequency modeling of the higher prosodic layers in an embodiment of the present invention, which comprises the following steps:

Step 401: segment the natural fundamental frequency by syllable to obtain the natural fundamental frequency values of each syllable unit.

Step 402: parameterize the natural fundamental frequency values of each syllable unit using the optimized DCT to obtain the DCT-transformed natural fundamental frequency feature.

Step 403: according to the context attribute information of each syllable unit and the DCT-transformed natural fundamental frequency feature, perform decision tree clustering on the natural fundamental frequency features of the syllable units to obtain the clustered model means.

In practice, the distribution of the fundamental frequency features within each cluster can be described by a single Gaussian model.

Step 404: take the mean of the cluster model to which each syllable unit belongs as the predicted fundamental frequency feature of that syllable unit, and apply the inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each syllable unit.
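Steps 403 and 404 can be sketched as follows, under simplifying assumptions: the decision tree is replaced by precomputed leaf labels (`leaf_ids`), the single Gaussian of each cluster is taken to have a diagonal covariance, and the names `fit_leaf_gaussians` and `predict_features` as well as the feature values are hypothetical.

```python
from collections import defaultdict

def fit_leaf_gaussians(features, leaf_ids):
    """Return {leaf: (mean_vector, variance_vector)}, one single Gaussian per cluster."""
    groups = defaultdict(list)
    for feature, leaf in zip(features, leaf_ids):
        groups[leaf].append(feature)
    stats = {}
    for leaf, rows in groups.items():
        n = len(rows)
        dim = len(rows[0])
        mean = [sum(row[d] for row in rows) / n for d in range(dim)]
        var = [sum((row[d] - mean[d]) ** 2 for row in rows) / n for d in range(dim)]
        stats[leaf] = (mean, var)
    return stats

def predict_features(leaf_ids, stats):
    """Predicted feature of each unit is the mean of the cluster it belongs to."""
    return [stats[leaf][0] for leaf in leaf_ids]

# Made-up two-dimensional DCT features for three syllable units.
features = [[5.0, 0.2], [5.2, 0.4], [4.8, -0.3]]
leaf_ids = ["leaf_a", "leaf_a", "leaf_b"]  # stand-in for decision tree leaves
stats = fit_leaf_gaussians(features, leaf_ids)
predicted = predict_features(leaf_ids, stats)
```

In a full system the predicted DCT features would then be passed through the inverse DCT to give frame-level predicted values for each syllable unit.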

Step 203: construct fundamental frequency models layer by layer, from high to low, in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, remove the influence of the tone information contained in the syllable layer when constructing the fundamental frequency model.

In practice, the higher prosodic layers can be modeled either on frame-level fundamental frequency values or on fundamental frequency values characterized by DCT parameters, whereas the lower prosodic layers can be modeled directly on frame-level fundamental frequency values.

Fig. 5 is a flowchart of constructing the fundamental frequency models in an iterative manner in an embodiment of the present invention, which comprises the following steps:

(1) Phrase layer modeling

First, the predicted fundamental frequency values of each syllable unit are subtracted from the natural fundamental frequency values of that syllable unit, giving the natural residual fundamental frequency values used for phrase-layer modeling with the syllable-layer influence removed. The following steps are then performed:

Step a) Segment the natural residual fundamental frequency values for phrase-layer modeling by phrase to obtain the natural fundamental frequency values of each phrase unit;

Step b) Parameterize the natural fundamental frequency values of the phrase units using the DCT to obtain the natural fundamental frequency feature DCT_F0_phrase of each phrase unit; preferably, the optimized DCT described above is used;

Step c) According to the context attribute information of each phrase unit and its natural fundamental frequency feature DCT_F0_phrase, and using the preset context attribute question set for phrase units, perform decision tree clustering on the fundamental frequency features of the phrase units to obtain the clustered model means; the distribution of the fundamental frequency features within each cluster can be described by a single Gaussian model;

Step d) According to the decision tree clustering result, take the mean of the cluster model to which each phrase unit belongs as the predicted fundamental frequency feature of that phrase unit (here, DCT coefficients), and apply the inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each phrase unit.

(2) Word layer modeling

First, the predicted fundamental frequency values of each phrase unit are subtracted from the natural fundamental frequency values of that phrase unit, giving the natural residual fundamental frequency values used for word-layer modeling. The following steps are then performed:

Step a) Segment the natural residual fundamental frequency values for word-layer modeling by word to obtain the natural fundamental frequency values of each word unit;

Step b) Parameterize the natural fundamental frequency values of the word units using the DCT to obtain the natural fundamental frequency feature DCT_F0_word of each word unit; preferably, the optimized DCT described above is used;

Step c) According to the context attribute information of each word unit and its natural fundamental frequency feature DCT_F0_word, and using the preset context attribute question set for word units, perform decision tree clustering on the fundamental frequency features of the word units to obtain the clustered model means; the distribution of the fundamental frequency features within each cluster can be described by a single Gaussian model;

Step d) According to the decision tree clustering result, take the mean of the cluster model to which each word unit belongs as the predicted fundamental frequency feature of that word unit (here, DCT coefficients), and apply the inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency values of each word unit.

(3) Lower prosodic layer modeling

First, the predicted fundamental frequency values of the phrase layer and the word layer are subtracted from the natural fundamental frequency values, giving the natural residual fundamental frequency values used for modeling the lower prosodic layers (syllable layer, phoneme layer, state layer).

Unlike the parameterized higher prosodic layers, the lower prosodic layers, which comprise the syllable layer, the phoneme layer and the state layer, can be modeled directly on frame-level fundamental frequency values. The modeling steps are as follows:

Step a) Using the natural residual fundamental frequency values for lower-layer modeling, perform HMM modeling of the lower-layer prosodic units to obtain the clustered models;

Step b) From the clustered models, predict the fundamental frequency feature using the maximum likelihood parameter generation algorithm, obtaining the predicted fundamental frequency values of the lower prosodic layers.

(4) Subtract the predicted fundamental frequency values of the lower layers from the natural fundamental frequency values, and use the result as the modeling target of the phrase layer in the next iteration. The phrase-layer, word-layer and lower-layer modeling is repeated iteratively to optimize the fundamental frequency parameters of each layer, and iteration stops when the mean square error reaches its minimum. Empirically, two iterations are usually enough to reach that minimum.
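The control flow of steps (1) to (4) might look like the sketch below. It only illustrates the residual bookkeeping and the stopping rule: every layer model is replaced by a placeholder that predicts the per-unit mean (`unit_mean_predict`), standing in for the DCT plus decision tree models of the higher layers and the HMM of the lower layers. All names, spans and values are hypothetical, and each span list is assumed to cover every frame.

```python
def unit_mean_predict(target, spans):
    """Placeholder layer model: predict each unit by the mean of its frames."""
    pred = [0.0] * len(target)
    for start, end in spans:
        mean = sum(target[start:end]) / (end - start)
        for t in range(start, end):
            pred[t] = mean
    return pred

def sub(a, b):
    """Frame-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def iterate(natural, tone_pred, phrase_spans, word_spans, low_spans, max_iter=5):
    # First pass: remove the syllable-layer (tone) prediction before phrase modeling.
    phrase_target = sub(natural, tone_pred)
    best, best_err = None, float("inf")
    for _ in range(max_iter):
        phrase_pred = unit_mean_predict(phrase_target, phrase_spans)
        # Word layer models what the phrase layer left unexplained.
        word_pred = unit_mean_predict(sub(phrase_target, phrase_pred), word_spans)
        # Lower layers model natural minus phrase and word predictions.
        low_pred = unit_mean_predict(sub(sub(natural, phrase_pred), word_pred), low_spans)
        total = [p + w + l for p, w, l in zip(phrase_pred, word_pred, low_pred)]
        err = mse(natural, total)
        if err >= best_err:
            break  # error stopped decreasing
        best, best_err = (phrase_pred, word_pred, low_pred), err
        # Next pass: phrase layer models natural minus the lower-layer prediction.
        phrase_target = sub(natural, low_pred)
    return best, best_err

natural = [5.0, 5.0, 5.2, 5.2, 4.8, 4.8, 4.6, 4.6]
tone_pred = [0.0] * 8
layers, err = iterate(natural, tone_pred,
                      phrase_spans=[(0, 8)],
                      word_spans=[(0, 4), (4, 8)],
                      low_spans=[(0, 2), (2, 4), (4, 6), (6, 8)])
```

With real layer models the error would typically fall over the first passes; the toy contour here is piecewise constant, so the placeholder models already reproduce it after one pass and the loop stops on the second.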

In the above modeling process, the fundamental frequency modeling of each prosodic layer is based on the assumption that the fundamental frequency models of the prosodic layers are independent of one another. However, research has shown that the model parameters of the prosodic layers are correlated, so a fundamental frequency model built on this assumption deviates from the actual situation. Therefore, the present invention may further optimize the fundamental frequency model parameters of each prosodic layer constructed above.

Specifically, the fundamental frequency model parameters of each prosodic layer may be optimized with an existing decision-tree-based method. In addition, the embodiment of the present invention also provides a training method based on the minimum generation error criterion for fundamental frequency features, which uses DNN models to perform global parameter optimization on the fundamental frequency features of each prosodic layer, thereby solving the above deviation problem.

This embodiment uses three DNN networks to optimize the fundamental frequency model parameters of the phrase layer, the word layer and the lower prosodic layers respectively. The detailed process is as follows:

First, the data is prepared, which includes determining the form of the input and output data, the training data, the test data and so on, specifically as follows:

Determining the input data form: the answers to the context-related attribute questions used when modeling the phrase layer, the word layer and the lower prosodic layers (syllable layer, phoneme layer, state layer) are used as input features. The input features take two forms: numeric text features and binary text features. A numeric text feature can take various numeric values, such as 7, 5 or 4, whereas a binary text feature takes only the value 0 or 1.

Determining the output data form: the fundamental frequency features of the units of each prosodic layer after initialization are used as the output features of the DNN networks, where the fundamental frequency features of the phrase layer and the word layer are represented by the DCT coefficients obtained with the optimized DCT, and the fundamental frequency features of the lower layers are represented by frame-level fundamental frequency values.

Then, the network topology is determined, specifically as follows:
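As an illustration of the input data form, the following sketch assembles one input vector from answers to numeric and binary questions. The function name `build_input_vector` and the example answers are hypothetical; only the split into numeric and binary features, and the phrase layer dimensions (5 numeric plus 9 binary), come from this description.

```python
import numpy as np

def build_input_vector(numeric_answers, binary_answers):
    """Concatenate numeric and binary question answers into one DNN input vector."""
    numeric = np.asarray(numeric_answers, dtype=np.float32)
    binary = np.asarray(binary_answers, dtype=np.float32)
    # Binary text features may only take the values 0 or 1.
    if not np.all((binary == 0) | (binary == 1)):
        raise ValueError("binary text features must be 0 or 1")
    return np.concatenate([numeric, binary])

# Hypothetical answers for one phrase unit: 5 numeric and 9 binary questions.
phrase_numeric = [7, 5, 4, 2, 3]
phrase_binary = [1, 0, 0, 0, 1, 0, 0, 1, 0]
x = build_input_vector(phrase_numeric, phrase_binary)
```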

The phrase layer DNN network has 14 input nodes (5-dimensional numeric text features and 9-dimensional binary text features). An example of a numeric text feature is the answer to the question "how many words does the current phrase contain"; an example of a binary text feature is the answer to the question "is the relative position of the current phrase in the sentence 1". The output nodes are the 5-dimensional DCT coefficients. The phrase layer DNN network uses 2 hidden layers, each with 512 nodes.

The word layer DNN network has 241 input nodes (21-dimensional numeric text features and 220-dimensional binary text features). An example of a numeric text feature is the answer to the question "how many syllables does the current word contain"; an example of a binary text feature is the answer to the question "is the relative position of the current word in the phrase 1". The output nodes are the 3-dimensional DCT coefficients. The word layer DNN network uses 2 hidden layers, each with 1024 nodes.

The lower prosodic layer DNN network has 570 input nodes (29-dimensional numeric text features and 541-dimensional binary text features). An example of a numeric text feature is the answer to the question "what is the forward position of the current syllable in the word"; an example of a binary text feature is the answer to the question "is the current phoneme 'g'". The output is the 3-dimensional frame-level fundamental frequency value (the static, first-order and second-order dynamic features of the current frame). The lower prosodic layer DNN network uses 3 hidden layers, each with 1024 nodes.
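The three topologies above can be written down as a small sketch. The layer sizes are taken from this description; the tanh activation, the linear output layer and the random initialization are assumptions, since the embodiment does not specify them.

```python
import numpy as np

# Input size, hidden layer sizes and output size of each network, as described above.
LAYER_SIZES = {
    "phrase": [14, 512, 512, 5],
    "word": [241, 1024, 1024, 3],
    "lower": [570, 1024, 1024, 1024, 3],
}

def init_mlp(sizes, rng):
    """Create (weight, bias) pairs for a fully connected network."""
    return [
        (rng.normal(0.0, 0.01, size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """Forward pass: tanh hidden layers (assumed) and a linear output layer."""
    h = x
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w, b = params[-1]
    return h @ w + b

rng = np.random.default_rng(0)
networks = {name: init_mlp(sizes, rng) for name, sizes in LAYER_SIZES.items()}
phrase_out = forward(networks["phrase"], np.zeros((2, 14)))  # batch of 2 phrase units
```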

Then, the models are trained. The predicted fundamental frequency features of all prosodic layers other than the current layer are subtracted from the natural fundamental frequency feature, and the model parameters of the current layer are updated based on the minimum generation error criterion, so that the fundamental frequency feature predicted by superposing the fundamental frequency features of the individual layers is closer to the natural fundamental frequency feature.

For example, in the i-th epoch of DNN back-propagation, when training the phrase layer model, two quantities are first subtracted from the natural fundamental frequency value: the fundamental frequency value obtained by applying the inverse DCT to the fundamental frequency feature predicted for the word units in the (i-1)-th epoch, and the frame-level fundamental frequency value predicted by the lower prosodic layer DNN network. The result is the natural residual fundamental frequency value of the phrase layer. Next, the optimized DCT is applied to the phrase layer natural residual fundamental frequency value, and the resulting DCT coefficients are used as the new output features for training the phrase layer DNN model. The phrase layer DNN model parameters are then updated with a conventional DNN parameter update method. Finally, the phrase layer fundamental frequency feature is predicted with the updated fundamental frequency model and is used in the subsequent updates of the word layer DNN model parameters and the lower prosodic layer DNN model parameters.
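The refresh of the phrase layer training targets in one epoch can be sketched as below. The sketch assumes that the word layer and lower layer predictions from the previous epoch are already available as frame-level arrays, and it assumes an orthonormal DCT-II basis fitted by least squares; the function names are hypothetical and the DNN update itself is not shown.

```python
import numpy as np

def dct_basis(n_frames, n_coeffs):
    """Orthonormal DCT-II basis (assumed): one row per frame, one column per coefficient."""
    t = np.arange(n_frames)[:, None]
    k = np.arange(n_coeffs)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n_frames) * np.sqrt(2.0 / n_frames)
    basis[:, 0] = np.sqrt(1.0 / n_frames)
    return basis

def phrase_targets(natural_f0, word_pred_f0, lower_pred_f0, phrase_bounds, n_coeffs=5):
    """New phrase layer output features: DCT coefficients of the phrase layer residual."""
    residual = (
        np.asarray(natural_f0, dtype=float)
        - np.asarray(word_pred_f0, dtype=float)
        - np.asarray(lower_pred_f0, dtype=float)
    )
    targets = []
    for start, end in zip(phrase_bounds[:-1], phrase_bounds[1:]):
        basis = dct_basis(end - start, n_coeffs)
        coeffs, *_ = np.linalg.lstsq(basis, residual[start:end], rcond=None)
        targets.append(coeffs)
    return np.stack(targets)

# Hypothetical data: 20 frames, two phrases of 10 frames each.
natural = np.full(20, 3.0)
targets = phrase_targets(natural, np.zeros(20), np.zeros(20), [0, 10, 20])
```

The word layer and lower layer targets would be refreshed in the same way, each time subtracting the most recent predictions of the other layers.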

After several such cycles, the parameters of the DNN models of all layers are updated jointly under the minimum generation error criterion, so that the fundamental frequency feature predicted by superposing the fundamental frequency features of the individual layers is closer to the natural fundamental frequency feature.

The fundamental frequency modeling method provided by the embodiment of the present invention divides the prosodic layers, from high to low, into layers that include a phrase layer, and thus adds modeling of the phrase layer fundamental frequency feature, which enhances the sense of rise and fall of the synthesized sentence. In addition, before the fundamental frequency features of the higher prosodic layers (phrase layer, word layer) are modeled, the influence of the tone information on the fundamental frequency modeling of the higher prosodic layers is removed, which improves the modeling of the fundamental frequency features of the higher prosodic layers.

Further, the fundamental frequency model parameters of the prosodic layers after initialization are optimized based on DNN. Because the nonlinear hierarchical structure of a DNN can better represent combinations of text attributes, over-fitting is less likely to occur. Moreover, a DNN does not partition the data during training, so it can better capture the constraints of the entire data space and effectively avoid the data sparsity problem.

Correspondingly, the embodiment of the present invention also provides a fundamental frequency modeling system. Fig. 6 is a structural schematic diagram of the fundamental frequency modeling system of the embodiment of the present invention.

The system includes:

A prosodic layer division module 601, configured to divide the prosodic layers, from high to low, into: phrase layer, word layer, syllable layer, phoneme layer and state layer, and to determine the prosodic units of each layer, where the phrase layer and the word layer are the higher prosodic layers, and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers;

An influence determining module 602, configured to determine the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers;

A modeling module 603, configured to construct fundamental frequency models layer by layer from high to low in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, to remove the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers when constructing the fundamental frequency model. The modeling module includes a phrase layer modeling module 631, a word layer modeling module 632 and a low layer modeling module 633.

The prosodic layer division module 601 may specifically work from the context attributes of the prosodic units of each layer and the corresponding context attribute questions. Phoneme duration is first modeled on the training data with a conventional HMM method to obtain the duration information of each phoneme. The duration information and context attributes of each phoneme are then used to analyze the context attributes of the prosodic units of each layer, which gives the duration information of the prosodic units of each layer and thereby determines the prosodic units of each layer.
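One simple way to picture how phoneme durations lead to the units of the higher layers is to accumulate them according to how many phonemes each unit contains. The sketch below assumes that these counts are known from the context attributes; the function name `unit_boundaries` and the example durations are hypothetical.

```python
from itertools import accumulate

def unit_boundaries(phoneme_durations, phonemes_per_unit):
    """Frame boundaries of each unit, given phoneme durations (in frames)
    and the number of phonemes that each unit contains."""
    durations = []
    idx = 0
    for count in phonemes_per_unit:
        durations.append(sum(phoneme_durations[idx:idx + count]))
        idx += count
    if idx != len(phoneme_durations):
        raise ValueError("phoneme counts do not cover all phonemes")
    return [0] + list(accumulate(durations))

phoneme_durations = [5, 7, 6, 4, 8, 3]  # e.g. from HMM duration modeling
syllable_bounds = unit_boundaries(phoneme_durations, [2, 2, 2])
word_bounds = unit_boundaries(phoneme_durations, [4, 2])
phrase_bounds = unit_boundaries(phoneme_durations, [6])
```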

When determining the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers, the influence determining module 602 mainly needs to calculate the predicted fundamental frequency value of each syllable unit of the syllable layer. One specific structure of the influence determining module 602 is shown in Fig. 7 and includes the following units:

A natural fundamental frequency division unit 701, configured to divide the natural fundamental frequency in units of syllables to obtain the natural fundamental frequency value corresponding to each syllable unit;

A parameterization unit 702, configured to parameterize the natural fundamental frequency value to obtain the natural fundamental frequency feature corresponding to each syllable unit;

A prediction fundamental frequency value acquiring unit 703, configured to obtain the predicted fundamental frequency value of each syllable unit according to the natural fundamental frequency feature.

In practical applications, the parameterization unit 702 may parameterize the natural fundamental frequency value using the existing DCT, or using the optimized DCT described above, in which the DCT coefficients are estimated with the sum of squared differences between the generated fundamental frequency feature and the natural fundamental frequency feature as the objective function. For the specific process, refer to the description in the foregoing method embodiment of the present invention, which is not repeated here.
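The objective described above, the sum of squared differences between the generated and the natural fundamental frequency, is a linear least-squares problem in the DCT coefficients. The following sketch solves it with `numpy.linalg.lstsq` under the assumption of an orthonormal DCT-II basis; the function name `fit_dct_coeffs` and the test contour are hypothetical, and the embodiment's exact estimation procedure is described elsewhere in the specification.

```python
import numpy as np

def fit_dct_coeffs(f0, n_coeffs):
    """Estimate n_coeffs DCT coefficients by minimizing the squared generation error."""
    f0 = np.asarray(f0, dtype=float)
    n = len(f0)
    t = np.arange(n)[:, None]
    k = np.arange(n_coeffs)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[:, 0] = np.sqrt(1.0 / n)
    coeffs, *_ = np.linalg.lstsq(basis, f0, rcond=None)
    generated = basis @ coeffs                    # contour generated from the coefficients
    sse = float(np.sum((generated - f0) ** 2))    # value of the objective function
    return coeffs, generated, sse

# Hypothetical syllable contour: a constant level plus one first-order cosine component.
frames = np.arange(12)
contour = 200.0 + 10.0 * np.cos(np.pi * (frames + 0.5) / 12)
_, _, sse_one = fit_dct_coeffs(contour, 1)
_, _, sse_two = fit_dct_coeffs(contour, 2)
```

With one coefficient only the mean level is captured, so the objective stays well above zero; with two coefficients this particular contour is reproduced almost exactly.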

The prediction fundamental frequency value acquiring unit 703 may include the following subunits:

The fundamental frequency modeling system provided by the embodiment of the present invention divides the prosodic layers, from high to low, into layers that include a phrase layer, and thus adds modeling of the phrase layer fundamental frequency feature, which enhances the sense of rise and fall of the synthesized sentence. In addition, before the fundamental frequency features of the higher prosodic layers (phrase layer, word layer) are modeled, the influence of the tone information on the fundamental frequency modeling of the higher prosodic layers is removed, which improves the modeling of the fundamental frequency features of the higher prosodic layers.

One specific structure of the phrase layer modeling module 631 may include the following units:

For the detailed process of constructing the phrase layer fundamental frequency model with the above units, refer to the description in the foregoing method embodiment of the present invention, which is not repeated here.

One specific structure of the word layer modeling module 632 may include the following units:

For the detailed process of constructing the word layer fundamental frequency model with the above units, refer to the description in the foregoing method embodiment of the present invention, which is not repeated here.

It should be noted that, in practical applications, the phrase layer modeling module 631 and the word layer modeling module 632 may model on frame-level fundamental frequency values, or on fundamental frequency values characterized by DCT parameters.

For the lower prosodic layers, the low layer modeling module 633 may model directly on frame-level fundamental frequency values. Specifically, the predicted fundamental frequency values of the phrase layer and the word layer are subtracted from the natural fundamental frequency value to obtain the natural residual fundamental frequency value for modeling the lower prosodic layers (syllable layer, phoneme layer, state layer), and the fundamental frequency model of the lower prosodic layers is then constructed from this natural residual fundamental frequency value.

The fundamental frequency modeling system of the embodiment of the present invention characterizes the fundamental frequency features of the higher prosodic layers with the coefficients of the optimized DCT, which better reflects the variation of the fundamental frequency feature over the entire prosodic unit and helps ensure that the fundamental frequency feature predicted after modeling is closer to the natural fundamental frequency feature.

In the modeling process, the fundamental frequency modeling of each prosodic layer is based on the assumption that the fundamental frequency models of the prosodic layers are independent of one another. However, research has shown that the model parameters of the prosodic layers are correlated, so a fundamental frequency model built on this assumption deviates from the actual situation. Therefore, as shown in Fig. 8, in another embodiment of the fundamental frequency modeling system of the present invention, the system may further include:

A model parameter optimization module 604, configured to optimize the fundamental frequency model parameters of each prosodic layer with a DNN-based method. For the specific optimization process, refer to the description in the foregoing method embodiment of the present invention, which is not repeated here.

The fundamental frequency modeling system of the embodiment of the present invention further optimizes, based on DNN, the fundamental frequency model parameters of the prosodic layers after initialization. Because the nonlinear hierarchical structure of a DNN can better represent combinations of text attributes, over-fitting is less likely to occur. Moreover, a DNN does not partition the data during training, so it can better capture the constraints of the entire data space and effectively avoid the data sparsity problem.

All the embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the description of the method embodiment. The system embodiment described above is merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

The embodiments of the present invention have been described in detail above. Specific embodiments are used herein to illustrate the present invention, and the description of the above embodiments is only intended to help understand the method and system of the present invention. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementations and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A fundamental frequency modeling method, characterized by comprising:

dividing the prosodic layers, from high to low, into: phrase layer, word layer, syllable layer, phoneme layer and state layer, and determining the prosodic units of each layer, wherein the phrase layer and the word layer are the higher prosodic layers, and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers;

constructing fundamental frequency models layer by layer from high to low in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, removing the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers when constructing the fundamental frequency model.

2. The method according to claim 1, characterized in that determining the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers comprises:

3. The method according to claim 2, characterized in that,

parameterizing the natural fundamental frequency value comprises:

parameterizing the natural fundamental frequency value using an optimized DCT, wherein the optimized DCT means that the DCT coefficients are estimated with the sum of squared differences between the generated fundamental frequency feature and the natural fundamental frequency feature as the objective function;

performing fundamental frequency modeling on the natural fundamental frequency feature corresponding to each syllable unit according to the context attribute information corresponding to each syllable unit and the natural fundamental frequency feature;

according to the fundamental frequency model, taking the mean of the model to which each syllable unit belongs as the predicted fundamental frequency feature of that syllable unit;

4. The method according to claim 2, characterized in that constructing the phrase layer fundamental frequency model comprises:

subtracting the predicted fundamental frequency value of the syllable unit from the natural fundamental frequency value corresponding to the syllable unit, to obtain the natural residual fundamental frequency value for phrase layer modeling after the influence of the syllable layer is removed;

dividing the natural residual fundamental frequency value in units of phrases to obtain the natural fundamental frequency value corresponding to each phrase unit;

constructing the phrase layer fundamental frequency model using the natural fundamental frequency feature corresponding to each phrase unit, to obtain the predicted fundamental frequency feature of each phrase unit.

5. The method according to claim 4, characterized in that constructing the word layer fundamental frequency model comprises:

subtracting the predicted fundamental frequency value of the phrase unit from the natural fundamental frequency value corresponding to the phrase unit, to obtain the natural residual fundamental frequency value for word layer modeling;

dividing the natural residual fundamental frequency value in units of words to obtain the natural fundamental frequency value corresponding to each word unit;

constructing the word layer fundamental frequency model using the natural fundamental frequency feature corresponding to each word unit, to obtain the predicted fundamental frequency feature of each word unit.

6. The method according to claim 4 or 5, characterized in that the method further comprises:

7. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

optimizing the fundamental frequency model parameters of each prosodic layer with a DNN-based method.

8. A fundamental frequency modeling system, characterized by comprising:

a prosodic layer division module, configured to divide the prosodic layers, from high to low, into: phrase layer, word layer, syllable layer, phoneme layer and state layer, and to determine the prosodic units of each layer, wherein the phrase layer and the word layer are the higher prosodic layers, and the syllable layer, the phoneme layer and the state layer are the lower prosodic layers;

an influence determining module, configured to determine the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers;

a modeling module, configured to construct fundamental frequency models layer by layer from high to low in an iterative manner according to the fundamental frequency features of the prosodic units, and, for the higher prosodic layers, to remove the influence of the tone information contained in the syllable layer on the fundamental frequency modeling of the higher prosodic layers when constructing the fundamental frequency model, wherein the modeling module comprises a phrase layer modeling module, a word layer modeling module and a low layer modeling module.

9. The system according to claim 8, characterized in that the influence determining module comprises:

a natural fundamental frequency division unit, configured to divide the natural fundamental frequency in units of syllables to obtain the natural fundamental frequency value corresponding to each syllable unit;

a parameterization unit, configured to parameterize the natural fundamental frequency value to obtain the natural fundamental frequency feature corresponding to each syllable unit;

a prediction fundamental frequency value acquiring unit, configured to obtain the predicted fundamental frequency value of each syllable unit according to the natural fundamental frequency feature.

10. The system according to claim 9, characterized in that,

the parameterization unit is specifically configured to parameterize the natural fundamental frequency value using an optimized DCT, wherein the optimized DCT means that the DCT coefficients are estimated with the sum of squared differences between the generated fundamental frequency feature and the natural fundamental frequency feature as the objective function;

the prediction fundamental frequency value acquiring unit comprises:

a fundamental frequency modeling subunit, configured to perform fundamental frequency modeling on the natural fundamental frequency feature corresponding to each syllable unit according to the context attribute information corresponding to each syllable unit and the natural fundamental frequency feature;

a prediction subunit, configured to take, according to the fundamental frequency model, the mean of the model to which each syllable unit belongs as the predicted fundamental frequency feature of that syllable unit;

an inverse DCT subunit, configured to apply an inverse DCT to the predicted fundamental frequency feature to obtain the predicted fundamental frequency value of each syllable unit.

11. The system according to claim 9, characterized in that the phrase layer modeling module comprises:

a phrase layer acquiring unit, configured to subtract the predicted fundamental frequency value of the syllable unit from the natural fundamental frequency value corresponding to the syllable unit, to obtain the natural residual fundamental frequency value for phrase layer modeling after the influence of the syllable layer is removed;

a phrase layer division unit, configured to divide the natural residual fundamental frequency value in units of phrases to obtain the natural fundamental frequency value corresponding to each phrase unit;

a phrase layer parameterization unit, configured to parameterize the natural fundamental frequency value to obtain the natural fundamental frequency feature corresponding to each phrase unit;

a phrase layer prediction unit, configured to construct the phrase layer fundamental frequency model using the natural fundamental frequency feature corresponding to each phrase unit, to obtain the predicted fundamental frequency feature of each phrase unit.

12. The system according to claim 11, characterized in that the word layer modeling module comprises:

a word layer acquiring unit, configured to subtract the predicted fundamental frequency value of the phrase unit from the natural fundamental frequency value corresponding to the phrase unit, to obtain the natural residual fundamental frequency value for word layer modeling;

a word layer division unit, configured to divide the natural residual fundamental frequency value in units of words to obtain the natural fundamental frequency value corresponding to each word unit;

a word layer parameterization unit, configured to parameterize the natural fundamental frequency value to obtain the natural fundamental frequency feature corresponding to each word unit;

a word layer prediction unit, configured to construct the word layer fundamental frequency model using the natural fundamental frequency feature corresponding to each word unit, to obtain the predicted fundamental frequency feature of each word unit.

13. The system according to any one of claims 8 to 12, characterized in that the system further comprises: