NZ721092B2

NZ721092B2 - System and method for synthesis of speech from provided text

Info

Publication number: NZ721092B2
Application number: NZ721092A
Authority: NZ
Inventors: Aravind Ganapathiraju; Yingyi Tan; Felix Immanuel Wyss
Original assignee: Interactive Intelligence Group Inc
Priority date: 2014-01-14
Filing date: 2015-01-14
Publication date: 2021-06-29

Abstract

A system and method are presented for the synthesis of speech from provided text. Particularly, the generation of parameters within the system is performed as a continuous approximation in order to mimic the natural flow of speech as opposed to a step-wise approximation of the parameter stream. Provided text may be partitioned and parameters generated using a speech model. The generated parameters from the speech model may then be used in a post-processing step to obtain a new set of parameters for application in speech synthesis.

Description

SYSTEM AND METHOD FOR SYNTHESIS OF SPEECH FROM PROVIDED TEXT

BACKGROUND

The present invention generally relates to telecommunications systems and methods, as well as speech synthesis. More particularly, the present invention pertains to synthesizing speech from provided text using parameter generation.

SUMMARY

A system and method are presented for the synthesis of speech from provided text.

Particularly, the generation of parameters within the system is performed as a continuous approximation in order to mimic the natural flow of speech as opposed to a step-wise approximation of the parameter stream. Provided text may be partitioned and parameters generated using a speech model. The generated parameters from the speech model may then be used in a post-processing step to obtain a new set of parameters for application in speech synthesis.

In one embodiment, a system is presented for synthesizing speech for provided text comprising: means for generating context labels for said provided text; means for generating a set of parameters for the context labels generated for said provided text using a speech model; means for processing said generated set of parameters, wherein said means for processing is capable of variance scaling; and means for synthesizing speech for said provided text, wherein said means for synthesizing speech is capable of applying the processed set of parameters to synthesizing speech.

In another embodiment, a method for generating parameters, using a continuous feature stream, for provided text for use in speech synthesis, is presented, comprising the steps of: partitioning said provided text into a sequence of phrases; generating parameters for said sequence of phrases using a speech model; and processing the generated parameters to obtain another set of parameters, wherein said other set of parameters are capable of use in speech synthesis for provided text.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a diagram illustrating an embodiment of a system for synthesizing speech.

Figure 2 is a diagram illustrating a modified embodiment of a system for synthesizing speech.

Figure 3 is a flowchart illustrating an embodiment of parameter generation.

Figure 4 is a diagram illustrating an embodiment of a generated parameter.

Figure 5 is a flowchart illustrating an embodiment of a process for f0 parameter generation.

Figure 6 is a flowchart illustrating an embodiment of a process for MCEPs generation.

DETAILED DESCRIPTION

For the purposes of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the drawings and specific language will be used to describe the same. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Any alterations and further modifications in the described embodiments, and any further applications of the principles of the invention as described herein are contemplated as would normally occur to one skilled in the art to which the invention relates.

In a traditional text-to-speech (TTS) system, written language, or text, may be automatically converted into a linguistic specification. The linguistic specification indexes the stored form of a speech corpus, or the model of the speech corpus, to generate a speech waveform. A statistical parametric speech system does not store any speech itself, but the model of speech instead. The model of the speech corpus and the output of the linguistic analysis may be used to generate a set of parameters which are used to synthesize the output speech. The model of the speech corpus includes the mean and covariance of the probability function that the speech parameters fit. The retrieved model may generate spectral parameters, such as fundamental frequency (f0) and mel-cepstral (MCEPs), to represent the speech signal. These parameters, however, are for a fixed frame rate and are derived from a state machine. A step-wise approximation of the parameter stream results, which does not mimic the natural flow of speech. Natural speech is continuous and not step-wise. In one embodiment, a system and method are disclosed that convert the step-wise approximation from the models to a continuous stream in order to mimic the natural flow of speech.

Figure 1 is a diagram illustrating an embodiment of a traditional system for synthesizing speech, indicated generally at 100. The basic components of a speech synthesis system may include a training module 105, which may comprise a speech corpus 106, linguistic specifications 107, and a parameterization module 108, and a synthesizing module 110, which may comprise text 111, context labels 112, a statistical parametric model 113, and a speech synthesis module 114.

The training module 105 may be used to train the statistical parametric model 113. The training module 105 may comprise a speech corpus 106, linguistic specifications 107, and a parameterization module 108. The speech corpus 106 may be converted into the linguistic specifications 107. The speech corpus may comprise written language or text that has been chosen to cover sounds made in a language in the context of syllables and words that make up the vocabulary of the language. The linguistic specification 107 indexes the stored form of the speech corpus or the model of the speech corpus to generate a speech waveform. Speech itself is not stored, but the model of speech is stored. The model includes the mean and the covariance of the probability function that the speech parameters fit.

The synthesizing module 110 may store the model of speech and generate speech. The synthesizing module 110 may comprise text 111, context labels 112, a statistical parametric model 113, and a speech synthesis module 114. Context labels 112 represent the contextual information in the text 111 which can be of a varied granularity, such as information about surrounding sounds, surrounding words, surrounding phrases, etc. The context labels 112 may be generated for the provided text from a language model. The statistical parametric model 113 may include the mean and covariance of the probability function that the speech parameters fit.

The speech synthesis module 114 receives the speech parameters for the text 111 and transforms the parameters into synthesized speech. This can be done using standard methods to transform spectral information into time domain signals, such as a mel log spectrum approximation (MLSA) filter.

Figure 2 is a diagram illustrating a modified embodiment of a system for synthesizing speech using parameter generation, indicated generally at 200. The basic components of a system may include similar components to those in Figure 1, with the addition of a parameter generation module 205. In a statistical parametric speech synthesis system, the speech signal is represented as a set of parameters at some fixed frame rate. The parameter generation module 205 receives the audio signal from the statistical parameter model 113 and transforms it. In an embodiment, the audio signal in the time domain has been mathematically transformed to another domain, such as the spectral domain, for more efficient processing. The spectral information is then stored in the form of frequency coefficients, such as f0 and MCEPs, to represent the speech signal. Parameter generation is such that it has an indexed speech model as input and the spectral parameters as output. In one embodiment, Hidden Markov Model (HMM) techniques are used. The model 113 includes not only the statistical distribution of parameters, also called static coefficients, but also their rate of change. The rate of change may be described as having first-order derivatives called delta coefficients and second-order derivatives referred to as delta-delta coefficients. The three types of parameters are stacked together into a single observation vector for the model. The process of generating parameters is described in greater detail below.
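The stacking of static, delta, and delta-delta coefficients into one observation vector can be illustrated with a minimal Python sketch. The source does not state how the derivatives are computed; the central-difference windows, the edge handling, and the function name `stack_observations` below are assumptions made only for illustration.

```python
def stack_observations(static):
    """Stack (static, delta, delta-delta) for each frame of one parameter.

    Assumed windows (not given in the source):
      delta(t)       = 0.5 * (c[t+1] - c[t-1])
      delta_delta(t) = c[t-1] - 2 * c[t] + c[t+1]
    Edge frames reuse the nearest available neighbour.
    """
    n = len(static)
    vectors = []
    for t in range(n):
        prev = static[max(t - 1, 0)]
        nxt = static[min(t + 1, n - 1)]
        delta = 0.5 * (nxt - prev)
        delta_delta = prev - 2.0 * static[t] + nxt
        vectors.append((static[t], delta, delta_delta))
    return vectors


# Hypothetical static coefficients for three frames.
for vector in stack_observations([1.0, 2.0, 4.0]):
    print(vector)
```

Each tuple corresponds to one frame of the observation vector for a single parameter dimension; a real model would hold one such triple per dimension.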

In the traditional statistical model of the parameters, only the mean and the variance of the parameter are considered. The mean parameter is used for each state to generate parameters. This generates piecewise constant parameter trajectories, which change value abruptly at each state transition, and is contrary to the behavior of natural sound. Further, only the statistical properties of the static coefficient are considered and not the speed with which the parameters change value. Thus, the statistical properties of the first- and second-order derivatives must be considered, as in the modified embodiment described in Figure 2.
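To make the step-wise behavior concrete, the following Python sketch builds a piecewise constant trajectory by repeating each state's mean for the number of frames assigned to that state. The function name and the numeric values are hypothetical and are not taken from the source.

```python
def stepwise_trajectory(state_means, state_durations):
    """Repeat each state's mean for the number of frames assigned to it."""
    trajectory = []
    for mean, n_frames in zip(state_means, state_durations):
        trajectory.extend([mean] * n_frames)
    return trajectory


# Hypothetical log-f0 means for three states and their frame counts.
means = [4.8, 5.0, 4.9]
durations = [2, 3, 2]
print(stepwise_trajectory(means, durations))
```

The value jumps at every state boundary, which is the abrupt change that the delta-based generation described later is intended to avoid.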

Maximum likelihood parameter generation (MLPG) is a method that considers the statistical properties of static coefficients and the derivatives. However, this method has a great computational cost that increases with the length of the sequence and thus is impractical to implement in a real-time system. A more efficient method is described below which generates parameters based on linguistic segments instead of the whole text message. A linguistic segment may refer to any group of words or sentences which can be separated by the context label “pause” in a TTS system.
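Partitioning into linguistic segments at the “pause” context label can be sketched as below. The label strings and the function name `split_on_pause` are illustrative assumptions; real context labels carry much richer information than a single word.

```python
def split_on_pause(labels, pause_label="pause"):
    """Group consecutive labels into segments, splitting at each pause label."""
    segments = []
    current = []
    for label in labels:
        if label == pause_label:
            if current:  # close the segment that just ended
                segments.append(current)
                current = []
        else:
            current.append(label)
    if current:  # trailing segment with no closing pause
        segments.append(current)
    return segments


# Hypothetical label sequence for two phrases.
labels = ["pause", "hello", "world", "pause", "good", "morning", "pause"]
print(split_on_pause(labels))
```

Because each segment is processed independently, the cost of parameter generation depends on segment length rather than on the length of the whole message.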

Figure 3 is a flowchart illustrating an embodiment of generating parameter trajectories, indicated generally at 300. Parameter trajectories are generated based on linguistic segments instead of the whole text message. Prior to parameter generation, a state sequence may be chosen using a duration model present in the statistical parameter model 113. This determines how many frames will be generated from each state in the statistical parameter model. As hypothesized by the parameter generation module, the parameters do not vary while in the same state. This trajectory will result in a poor quality speech signal. However, if a smoother trajectory is estimated using information from delta and delta-delta parameters, the speech synthesis output is more natural and intelligible.

In operation 305, the state sequence is chosen. For example, the state sequence may be chosen using the statistical parameter model 113, which determines how many frames will be generated from each state in the model 113. Control passes to operation 310 and process 300 continues.

In operation 310, segments are partitioned. In one embodiment, the segment partition is defined as a sequence of states encompassed by the pause model. Control is passed to at least one of operations 315a and 315b and process 300 continues.

In operations 315a and 315b, spectral parameters are generated. The spectral parameters represent the speech signal and comprise at least one of the fundamental frequency, 315a, and MCEPs, 315b. These processes are described in greater detail below in Figures 5 and 6. Control is passed to operation 320 and process 300 continues.

In operation 320, the parameter trajectory is created. For example, the parameter trajectory may be created by concatenating each parameter stream across all states along the time domain. In effect each dimension in the parametric model will have a trajectory. An illustration of a parameter trajectory creation for one such dimension is provided generally in Figure 4. Figure 4 (copied from: KING, Simon, “A beginners’ guide to statistical parametric speech synthesis”, The Centre for Speech Technology Research, University of Edinburgh, UK, 24 June 2010, page 9) is a generalized embodiment of a trajectory from MLPG that has been smoothed.

Figure 5 is a flowchart illustrating an embodiment of a process for fundamental spectral parameter generation, indicated generally at 500. The process may occur in the parameter generation module 205 (Figure 2) after the input text is split into linguistic segments. Parameters are predicted for each segment.

In operation 505, the frame is incremented. For example, a frame may be examined for linguistic segments which may contain several voiced segments. The parameter stream may be based on frame units such that i=1 represents the first frame, i=2 represents the second frame, etc. For frame incrementing, the value for “i” is increased by a desired interval. In an embodiment, the value for “i” may be increased by 1 each time. Control is passed to operation 510 and the process 500 continues.

In operation 510, it is determined whether or not linguistic segments are present in the signal. If it is determined that linguistic segments are present, control is passed to operation 515 and process 500 continues. If it is determined that linguistic segments are not present, control is passed to operation 525 and the process 500 continues.

The determination in operation 510 may be made based on any suitable criteria. In one embodiment, the segment partition of the linguistic segments is defined as a sequence of states encompassed by the pause model.

In operation 515, a global variance adjustment is performed. For example, the global variance may be used to adjust the variance of the linguistic segment. The f0 trajectory may tend to have a smaller dynamic range compared to natural sound due to the use of the mean of the static coefficient and the delta coefficient in parameter generation. Variance scaling may expand the dynamic range of the f0 trajectory so that the synthesized signal sounds livelier. Control is passed to operation 520 and process 500 continues.

In operation 520, a conversion to the linear frequency domain is performed on the fundamental frequency from the log domain and the process 500 ends.
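Operations 515 and 520 can be sketched in Python as follows. The source does not give a formula for variance scaling or name the logarithm base; this sketch assumes scaling about the segment mean toward a target standard deviation and a natural logarithm. The names `scale_variance`, `log_to_linear`, and `target_std` are hypothetical.

```python
import math


def scale_variance(values, target_std):
    """Rescale deviations from the mean so the result has target_std (assumed form)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0.0:  # a flat trajectory cannot be rescaled
        return list(values)
    gain = target_std / std
    return [mean + (v - mean) * gain for v in values]


def log_to_linear(log_f0):
    """Convert log-domain f0 values to linear frequency (natural log assumed)."""
    return [math.exp(v) for v in log_f0]


# Hypothetical voiced log-f0 values with a narrow dynamic range.
log_f0 = [4.70, 4.75, 4.80, 4.75, 4.70]
widened = scale_variance(log_f0, target_std=0.08)
print([round(hz, 1) for hz in log_to_linear(widened)])
```

The mean of the segment is preserved while the excursions around it are widened, which is one way to expand the dynamic range before converting to linear frequency.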

In operation 525, it is determined whether or not the voicing has started. If it is determined that the voicing has not started, control is passed to operation 530 and the process 500 continues. If it is determined that voicing has started, control is passed to operation 535 and the process 500 continues.

The determination in operation 525 may be based on any suitable criteria. In an embodiment, when the f0 model predicts valid values for f0, the segment is deemed a voiced segment and when the f0 model predicts zeros, the segment is deemed an unvoiced segment.

In operation 530, the frame has been determined to be unvoiced. The spectral parameter for that frame is 0 such that f0(i) = 0. Control is passed back to operation 505 and the process 500 continues.

In operation 535, the frame has been determined to be voiced and it is further determined whether or not the voicing is in the first frame. If it is determined that the voicing is in the first frame, control is passed to operation 540 and process 500 continues. If it is determined that the voicing is not in the first frame, control is passed to operation 545 and process 500 continues.

The determination in operation 535 may be based on any suitable criteria. In one embodiment it is based on predicted f0 values and in another embodiment it could be based on a specific model to predict voicing.

In operation 540, the spectral parameter for the first frame is the mean of the segment such that f0(i)=f0_mean(i). Control is passed back to operation 505 and the process 500 continues.

In operation 545, it is determined whether or not the delta value needs to be adjusted. If it is determined that the delta value needs to be adjusted, control is passed to operation 550 and the process 500 continues. If it is determined that the delta value does not need to be adjusted, control is passed to operation 555 and the process 500 continues.

The determination in operation 545 may be based on any suitable criteria. For example, an adjustment may need to be made in order to control the parameter change for each frame to a desired level.

In operation 550, the delta is clamped. The f0_deltaMean(i) may be represented as f0_new_deltaMean(i) after clamping. If clamping has not been performed, then the f0_new_deltaMean(i) is equivalent to f0_deltaMean(i). The purpose of clamping the delta is to ensure that the parameter change for each frame is controlled to a desired level. If the change is too large, and say lasts over several frames, the range of the parameter trajectory will not be in the desired natural sound’s range. Control is passed to operation 555 and the process 500 continues.

In operation 555, the value of the current parameter is updated to be the predicted value plus the value of delta for the parameter such that f0(i) = f0(i-1) + f0_new_deltaMean(i). This helps the trajectory ramp up or down as per the model. Control is then passed to operation 560 and the process 500 continues.
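The per-frame f0 update of operations 530, 540, 550, and 555 can be sketched in Python as below. Several details are assumptions for illustration only: the clamp is taken to be a symmetric limit `max_delta`, a predicted mean of zero marks an unvoiced frame, and the “first frame” is read as the first voiced frame after an unvoiced stretch. The function names are hypothetical.

```python
def clamp(value, limit):
    """Restrict value to the range [-limit, +limit] (assumed symmetric clamp)."""
    return max(-limit, min(limit, value))


def generate_log_f0(f0_mean, f0_delta_mean, max_delta):
    """Frame-by-frame f0 generation for one linguistic segment."""
    f0 = []
    previous_voiced = False
    for mean, delta in zip(f0_mean, f0_delta_mean):
        if mean == 0.0:
            # Operation 530: unvoiced frame, f0(i) = 0.
            f0.append(0.0)
            previous_voiced = False
        elif not previous_voiced:
            # Operation 540: first voiced frame, f0(i) = f0_mean(i).
            f0.append(mean)
            previous_voiced = True
        else:
            # Operations 550 and 555: f0(i) = f0(i-1) + clamped delta.
            f0.append(f0[-1] + clamp(delta, max_delta))
    return f0


# Hypothetical model outputs; the delta of 0.5 exceeds the limit and is clamped.
means = [0.0, 5.0, 5.1, 5.2, 0.0]
deltas = [0.0, 0.0, 0.05, 0.5, 0.0]
print(generate_log_f0(means, deltas, max_delta=0.1))
```

After the first voiced frame, the trajectory is driven only by the accumulated deltas, so clamping directly bounds how fast the value can change from one frame to the next.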

In operation 560, it is determined whether or not the voice has ended. If it is determined that the voice has not ended, control is passed to operation 505 and the process 500 continues. If it is determined that the voice has ended, control is passed to operation 565 and the process 500 continues.

The determination in operation 560 may be based on any suitable criteria. In an embodiment, the f0 values becoming zero for a number of consecutive frames may indicate the voice has ended.
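One way to apply the “zero for a number of consecutive frames” criterion is sketched below. The threshold of three frames and the function name `voiced_segments` are assumptions; the source leaves the number of frames open.

```python
def voiced_segments(f0, min_unvoiced_frames=3):
    """Return (start, end) index pairs of voiced segments, end exclusive.

    A voiced segment ends once f0 has been zero for min_unvoiced_frames
    consecutive frames (hypothetical threshold).
    """
    segments = []
    start = None
    zeros = 0
    for i, value in enumerate(f0):
        if value != 0.0:
            if start is None:
                start = i
            zeros = 0
        elif start is not None:
            zeros += 1
            if zeros >= min_unvoiced_frames:
                segments.append((start, i - zeros + 1))
                start, zeros = None, 0
    if start is not None:  # input ended while a voiced segment was still open
        segments.append((start, len(f0) - zeros))
    return segments


# Hypothetical f0 track: a single zero frame does not end the voiced segment.
f0 = [0.0, 0.0, 100.0, 110.0, 0.0, 120.0, 0.0, 0.0, 0.0, 130.0]
print(voiced_segments(f0))
```

Short dropouts inside a voiced stretch are tolerated, so the mean shift and smoothing that follow operate on the whole stretch rather than on fragments.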

In operation 565, a mean shift is performed. For example, once all of the voiced frames, or voiced segments, have ended, the mean of the voice segment may be adjusted to the desired value. Mean adjustment may also bring the parameter trajectory into the desired natural sound’s range. Control is passed to operation 570 and the process 500 continues.

In operation 570, the voice segment is smoothed. For example, the generated parameter trajectory may have abruptly changed somewhere, which makes the synthesized speech sound warbled and jumpy. Long window smoothing can make the f0 trajectory smoother and the synthesized speech sound more natural. Control is passed back to operation 505 and the process 500 continues.

The process may continuously cycle any number of times that are necessary. Each frame may be processed until the linguistic segment ends, which may contain several voiced segments. The variance of the linguistic segment may be adjusted based on global variance. Because the mean of static coefficients and delta coefficients are used in parameter generation, the parameter trajectory may have smaller dynamic ranges compared to natural sound. A variance scaling method may be utilized to expand the dynamic range of the parameter trajectory so that the synthesized signal does not sound muffled. The spectral parameters may then be converted from the log domain into the linear domain.
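The mean shift of operation 565 and the smoothing of operation 570 can be sketched in Python as follows. The source does not define the smoothing window; this sketch assumes a centered moving average that shrinks at the segment edges. The names `mean_shift`, `moving_average`, `target_mean`, and `window` are hypothetical.

```python
def mean_shift(values, target_mean):
    """Shift a voiced segment so that its mean equals target_mean."""
    offset = target_mean - sum(values) / len(values)
    return [v + offset for v in values]


def moving_average(values, window):
    """Centered moving average; the window is truncated at the segment edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


# Hypothetical voiced segment with one abrupt jump in the middle.
segment = [5.0, 5.0, 5.6, 5.0, 5.0]
shifted = mean_shift(segment, target_mean=5.0)
print(moving_average(shifted, window=3))
```

A longer window removes more of the abrupt change but also flattens genuine pitch movement, which is one reason a variance adjustment is applied afterwards.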

Figure 6 is a flowchart illustrating an embodiment of MCEPs generation, indicated generally at 600. The process may occur in the parameter generation module 205 (Figure 2).

In operation 605, the output parameter value is initialized. In an embodiment, the output parameter may be initialized at time i=0 because the output parameter value is dependent on the parameter generated for the previous frame. Thus, the initial mcep(0) = mcep_mean(1). Control is passed to operation 610 and the process 600 continues.

In operation 610, the frame is incremented. For example, a frame may be examined for linguistic segments which may contain several voiced segments. The parameter stream may be based on frame units such that i=1 represents the first frame, i=2 represents the second frame, etc. For frame incrementing, the value for “i” is increased by a desired interval. In an embodiment, the value for “i” may be increased by 1 each time. Control is passed to operation 615 and the process 600 continues.

In operation 615, it is determined whether or not the segment has ended. If it is determined that the segment has ended, control is passed to operation 620 and the process 600 continues. If it is determined that the segment has not ended, control is passed to operation 630 and the process 600 continues.

The determination in operation 615 is made using information from the linguistic module as well as the existence of a pause.

In operation 620, the voice segment is smoothed. For example, the generated parameter trajectory may have abruptly changed somewhere, which makes the synthesized speech sound warble and jumpy. Long window smoothing can make the trajectory smoother and the synthesized speech sound more natural. Control is passed to operation 625 and the process 600 continues.

In operation 625, a global variance adjustment is performed. For example, the global variance may be used to adjust the variance of the linguistic segment. The trajectory may tend to have a smaller dynamic range compared to natural sound due to the use of the mean of the static coefficient and the delta coefficient in parameter generation. Variance scaling may expand the dynamic range of the trajectory so that the synthesized signal does not sound muffled. The process 600 ends.

In operation 630, it is determined whether or not the voicing has started. If it is determined that the voicing has not started, control is passed to operation 635 and the process 600 continues. If it is determined that voicing has started, control is passed to operation 640 and the process 600 continues.

The determination in operation 630 may be made based on any suitable criteria. In an embodiment, when the f0 model predicts valid values for f0, the segment is deemed a voiced segment and when the f0 model predicts zeros, the segment is deemed an unvoiced segment.

In operation 635, the spectral parameter is determined. The spectral parameter for that frame becomes mcep(i) = (mcep(i-1)+mcep_mean(i))/2. Control is passed back to operation 610 and the process 600 continues.

In operation 640, the frame has been determined to be voiced and it is further determined whether or not the voice is in the first frame. If it is determined that the voice is in the first frame, control is passed back to operation 635 and process 600 continues. If it is determined that the voice is not in the first frame, control is passed to operation 645 and process 600 continues.

In operation 645, the voice is not in the first frame and the spectral parameter becomes mcep(i) = (mcep(i-1)+mcep_delta(i)+mcep_mean(i))/2. Control is passed back to operation 610 and process 600 continues. In an embodiment, multiple MCEPs may be present in the system. Process 600 may be repeated any number of times until all MCEPs have been processed.
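The MCEP recursion of operations 605, 635, and 645 can be sketched in Python for a single coefficient dimension as below. The list-based interface, the `voiced` flags, and the reading of “first frame” as the first voiced frame after an unvoiced stretch are assumptions made for illustration; smoothing and variance scaling at the end of the segment are omitted.

```python
def generate_mcep(mcep_mean, mcep_delta, voiced):
    """Generate one MCEP dimension frame by frame.

    Index 0 of each list corresponds to frame i = 1 in the text.
    """
    previous = mcep_mean[0]  # Operation 605: mcep(0) = mcep_mean(1).
    previous_voiced = False
    output = []
    for mean, delta, is_voiced in zip(mcep_mean, mcep_delta, voiced):
        if is_voiced and previous_voiced:
            # Operation 645: voiced frame that is not the first voiced frame.
            value = (previous + delta + mean) / 2
        else:
            # Operation 635: unvoiced frame or first voiced frame.
            value = (previous + mean) / 2
        output.append(value)
        previous = value
        previous_voiced = is_voiced
    return output


# Hypothetical model outputs for three frames of one MCEP dimension.
means = [1.0, 2.0, 4.0]
deltas = [0.0, 0.5, 1.0]
voiced = [False, True, True]
print(generate_mcep(means, deltas, voiced))
```

Averaging the previous output with the current mean pulls the trajectory toward the model mean gradually instead of jumping to it at each state boundary. For multiple MCEPs, the function would be called once per dimension.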

While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes, and modifications that come within the spirit of the invention as described herein and/or by the following claims are desired to be protected.

Hence, the proper scope of the present invention should be determined only by the broadest interpretation of the appended claims so as to encompass all such modifications as well as all relationships equivalent to those illustrated in the drawings and described in the specification.

Claims

1. A system for synthesizing speech for provided text comprising: a. means for generating context labels for said provided text, wherein the means for generating context labels is configured for partitioning said provided text into a sequence of phrases and each phrase into a plurality of frames; b. means for generating a set of parameters for the context labels generated for said provided text using a speech model, wherein the means for generating a set of parameters is configured for generating a set of parameters comprising a mean, a variance, a delta coefficient, and a delta-delta coefficient, for each frame of the plurality of frames; c. means for processing said generated set of parameters, wherein said means for processing is capable of variance scaling and is configured for generating a processed set of parameters comprising at least one clamped delta coefficient in order to control the parameter change for each frame to a desired level; and d. means for synthesizing speech for said provided text, wherein said means for synthesizing speech is capable of applying the processed set of parameters to synthesizing speech.

2. The system of claim 1, wherein said speech model comprises at least a statistical distribution of spectral parameters and a rate of change of said spectral parameters.

3. The system of claim 1, wherein said speech model comprises a predictive statistical parametric model.

4. The system of claim 1, wherein said means for generating context labels for said provided text comprises a language model.

5. The system of claim 1, wherein said means for synthesizing speech is capable of transforming spectral information into time domain signals.

6. The system of claim 1, wherein the means for processing said set of parameters is capable of determining the rate of change of said parameters and generating a trajectory of the parameters.

7. A method for generating parameters, using a continuous feature stream, for provided text for use in speech synthesis, comprising the steps of: a. partitioning said provided text into a sequence of phrases and each phrase into a plurality of frames; b. generating parameters for said sequence of phrases using a speech model, the generated parameters comprising: a mean, a variance, a delta coefficient, and a delta-delta coefficient for each frame of a plurality of frames; and c. processing the generated parameters to obtain another set of parameters, wherein said other set of parameters have a smoother trajectory than the generated parameters computed in accordance with the delta coefficient and the delta-delta coefficient of the generated parameters, characterized in that the step (c) of processing the generated parameters comprises the step of clamping the delta coefficient in order to control the parameter change for each frame to a desired level.

8. The method of claim 7, wherein said partitioning is performed based on linguistic knowledge.

9. The method of claim 7, wherein said speech model comprises a predictive statistical parametric model.

10. The method of claim 7, wherein the generated parameters for the phrases comprise spectral parameters.

11. The method of claim 10, wherein the spectral parameters comprise one or more of the following: phrase-based spectral parameter values, rate of change of spectral parameters, spectral envelope values, and rate of change of spectral envelope.

12. The method of claim 7, wherein the phrases comprise a grouping of words capable of being separated by at least one of: linguistic pauses and acoustic pauses.

13. The method of claim 7, wherein the partitioning of said provided text into a sequence of phrases further comprises the steps of: a. generating an output parameter based on predicted parameters, wherein said predicted parameters are determined by a model of a speech corpus as parameters that represent the text; b. incrementing a frame value; and c. determining state of a phrase, wherein i. if the phrase has started, determining if voicing has started by: predicting values for fundamental frequency; determining that voicing has started in response to predicting non-zero values for fundamental frequency; and determining voicing has not started in response to predicting zero values for fundamental frequency; and 1. If voicing has started, adjusting the output parameter based on parameters of voiced frames and restarting step (c); otherwise, 2. if voicing has ended, adjusting the output parameter based on parameters of unvoiced frames and restarting from step (c); ii. if the phrase has ended, smoothing the output parameter and performing a global variance adjustment by performing variance scaling to expand a dynamic range of a trajectory.

14. The method of claim 7, wherein the generation of the parameters comprises generating a parameter trajectory, which further comprises the steps of: a. initializing a first element of a plurality of generated output parameters; b. incrementing a frame value; c. determining if a linguistic segment is present, by examining a sequence of states for segment partition, the linguistic segment referring to one or more words separated by a context label of “pause” in a text-to-speech system, wherein; i. if the linguistic segment is not present, determining if voicing has started by: predicting values for fundamental frequency; determining that voicing has started in response to predicting non-zero values for fundamental frequency; and determining voicing has not started in response to predicting zero values for fundamental frequency; and 1. if voicing has not started, adjusting the output parameters based on parameters of voiced frames and restarting the process from step (a); 2. If voicing has started, determining if the voicing is in a first frame, wherein, if the voice is in the first frame, setting the fundamental frequency of the first frame to a mean of the fundamental frequency of the segment, and if the voice is not in the first frame, performing a clamp of the fundamental frequency of the frame, ii. if the linguistic segment is present, removing abrupt changes of the parameter trajectory, and performing a global variance adjustment by performing variance scaling to expand a dynamic range of a trajectory.

15. The method of claim 14, wherein step c.i. further comprises the step of determining if voicing has ended, wherein if voicing has not ended, repeating claim 14 from step (a), and if voicing has ended, adjusting the coefficient mean to a desired value and performing long window smoothing on the segment.

16. The method of claim 14, wherein said initializing is performed at time zero.

17. The method of claim 14, wherein said frame increment value comprises a desired integer.

18. The method of claim 17, wherein said desired integer is 1.