WO2013020329A1 - Parametric speech synthesis method and system - Google Patents

Parametric speech synthesis method and system

Info

Publication number
WO2013020329A1
WO2013020329A1 (application PCT/CN2011/081452, CN2011081452W)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
parameter
parameters
value
phoneme
Prior art date
Application number
PCT/CN2011/081452
Other languages
English (en)
French (fr)
Inventor
吴凤梁
职振华
Original Assignee
歌尔声学股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 歌尔声学股份有限公司
Priority to JP2013527464A (patent JP5685649B2)
Priority to DK11864132.3T (patent DK2579249T3)
Priority to EP11864132.3A (patent EP2579249B1)
Priority to KR1020127031341A (patent KR101420557B1)
Priority to US13/640,562 (patent US8977551B2)
Publication of WO2013020329A1

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/02 — Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/226 — Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics
    • G10L 2015/227 — Procedures used during a speech recognition process, e.g. man-machine dialogue, using non-speech characteristics of the speaker; Human-factor methodology

Definitions

  • the present invention relates to the field of parametric speech synthesis technology, and more particularly to a parametric speech synthesis method and system for continuously synthesizing speech of arbitrary duration.
  • Speech synthesis produces artificial speech by mechanical and electronic methods, which is an important technique for making human-computer interaction more natural.
  • There are two common types of speech synthesis techniques: one is speech synthesis based on unit selection and waveform concatenation, and the other is parametric speech synthesis based on acoustic statistical models. Since the parametric speech synthesis method has relatively small storage requirements, it is better suited to small electronic devices.
  • The parametric speech synthesis method is divided into two stages: training and synthesis.
  • In the training stage, the acoustic parameters of all speech in the corpus are first extracted, including static parameters, such as spectral envelope parameters and pitch (fundamental) frequency parameters, and dynamic parameters, such as the first- and second-order difference parameters of the spectral envelope and fundamental frequency parameters. Then, for each phoneme, the corresponding acoustic statistical model is trained according to its context annotation information, and a global variance model is trained for the entire corpus. Finally, the acoustic statistical models of all phonemes and the global variance model form a model library.
  • In the synthesis stage, speech is synthesized using layered offline processing.
  • The first layer: the entire input text is analyzed to obtain all phonemes with context information, which form a phoneme sequence.
  • The second layer: the model corresponding to each phoneme in the phoneme sequence is extracted from the trained model library to form a model sequence.
  • The third layer: the maximum likelihood algorithm is used to predict the acoustic parameters corresponding to each frame of speech from the model sequence, forming a speech parameter sequence.
  • The fourth layer: the speech parameter sequence is optimized as a whole using the global variance model.
  • The fifth layer: all the optimized speech parameter sequences are input into the parametric speech synthesizer to generate the final synthesized speech.
  • The existing parametric speech synthesis method adopts a horizontal processing mode in the layered operations of the synthesis stage: the parameters of all statistical models are taken out, the smoothed parameters of all frames are predicted and generated by the maximum likelihood algorithm, the optimized parameters of all frames are obtained from the global variance model, and finally the speech of all frames is output from the parameter synthesizer. That is, the relevant parameters of all frames need to be saved at every layer, so the capacity of random access memory (RAM) required for speech synthesis increases in proportion to the duration of the synthesized speech.
  • However, the size of the RAM on a chip is fixed; in many applications the chip's RAM is smaller than 100 K bytes, so the existing parametric speech synthesis method cannot continuously synthesize speech of arbitrary duration on a chip with such a small RAM.
  • In the third-layer operation of the synthesis stage, predicting the speech parameter sequence from the model sequence with the maximum likelihood algorithm must be implemented in two passes: a frame-by-frame forward recursion followed by a backward recursion.
  • After the first recursion pass, corresponding temporary parameters are generated for each frame of speech, and the temporary parameters of all frames are then fed into the second, backward recursion pass to predict the required parameter sequence.
  • The longer the synthesized speech, the larger the number of speech frames; a frame of temporary parameters is produced whenever the speech parameters of a frame are predicted, and the temporary parameters of all frames must be kept in RAM before the second recursion pass can be completed. As a result, speech of arbitrary duration cannot be synthesized continuously on a chip with a small RAM.
  • In view of the above, an object of the present invention is to solve the problem that the RAM required by the original speech synthesis process grows in proportion to the length of the synthesized speech, making it impossible to continuously synthesize speech of arbitrary duration on a chip with a small RAM.
  • According to one aspect of the present invention, a parametric speech synthesis method is provided, including a training stage and a synthesis stage, wherein the synthesis stage specifically includes performing the following processing in turn on each frame of speech of each phoneme in the phoneme sequence of the input text:
  • for the current phoneme in the phoneme sequence of the input text, extracting the corresponding statistical model from a statistical model library, and taking the model parameters of that statistical model corresponding to the current frame of the current phoneme as the rough value of the currently predicted speech parameters;
  • filtering the rough value by using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain a smoothed value of the currently predicted speech parameters; globally optimizing the smoothed value according to the statistically obtained global mean and global standard-deviation ratio of the speech parameters, to generate the required speech parameters; and synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme.
  • In a preferred scheme, the rough value is filtered using the rough value and the information of the speech frame at the previous moment to obtain the smoothed value of the currently predicted speech parameters, where the information of the speech frame at the previous moment is the smoothed value of the speech parameters predicted at the previous moment.
  • In a further preferred scheme, the smoothed value of the currently predicted speech parameter is globally optimized according to the statistically obtained global mean and global standard-deviation ratio of that speech parameter, using formulas of the form ŷ_t = r · (y_t − m) + m, where y_t is the smoothed value of the speech parameter at time t before optimization, ŷ_t is the initially optimized value, w is a weight value used in a second formula (shown only as an image in the source) that combines ŷ_t with y_t to give the required speech parameter after global optimization, r is the statistically obtained global standard-deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
  • Further, the scheme also includes: constructing a voiced sub-band filter and an unvoiced sub-band filter using the sub-band voicing parameters; passing a quasi-periodic pulse sequence constructed from the pitch frequency parameter through the voiced sub-band filter to obtain the voiced component of the speech signal; passing a random sequence constructed from white noise through the unvoiced sub-band filter to obtain the unvoiced component of the speech signal; adding the voiced component and the unvoiced component to obtain a mixed excitation signal; and passing the mixed excitation signal through a filter constructed from the spectral envelope parameters to output one frame of synthesized speech waveform.
  • Further, before the synthesis stage, the method also includes a training stage.
  • In the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include both static and dynamic parameters; only the static model parameters are retained in the model parameters of the statistical models obtained after training.
  • In the synthesis stage, for the current phoneme, the static model parameters of the statistical model obtained in the training stage that correspond to the current frame of the current phoneme are used as the rough value of the currently predicted speech parameters.
  • a parametric speech synthesis system comprising:
  • a loop synthesis device configured to perform speech synthesis on each frame of each phoneme in the phoneme sequence of the input text in sequence during the synthesis phase;
  • The loop synthesis device includes:
  • a coarse search unit, configured to extract, for the current phoneme in the phoneme sequence of the input text, the corresponding statistical model from a statistical model library, and to take the model parameters of that statistical model corresponding to the current frame of the current phoneme as the rough value of the currently predicted speech parameters;
  • a smoothing filtering unit configured to filter the coarse value by using the coarse value and information of a predetermined number of voice frames before the current time to obtain a smoothed value of the currently predicted voice parameter
  • a global optimization unit configured to globally optimize a smoothed value of the currently predicted speech parameter according to a statistically obtained global mean value and a global standard deviation ratio of the speech parameter to generate a required speech parameter
  • the parameter speech synthesis unit is configured to synthesize the generated speech parameters to obtain a frame of speech synthesized by the current frame of the current phoneme.
  • the smoothing filtering unit includes a low-pass filter group, configured to filter the coarse value by using the coarse value and the information of the previous time speech frame to obtain a smoothed value of the currently predicted speech parameter.
  • the information of the speech frame at the previous moment is the smoothed value of the predicted speech parameter at the previous moment.
  • Further, the global optimization unit includes a global parameter optimizer, configured to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard-deviation ratio of the speech parameter, using formulas of the form ŷ_t = r · (y_t − m) + m, where y_t is the smoothed value of the speech parameter at time t before optimization, ŷ_t is the initially optimized value, w is a weight value used in a second formula (shown only as an image in the source) that combines ŷ_t with y_t to give the required speech parameter obtained after global optimization, r is the statistically obtained global standard-deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
  • a filter construction module for constructing a voiced subband filter and an unvoiced subband filter using subband voiced parameters
  • the voiced subband filter is configured to filter a quasi-periodic pulse sequence constructed by a pitch frequency parameter to obtain a voiced component of the voice signal;
  • the unvoiced subband filter is configured to filter a random sequence constructed by white noise to obtain an unvoiced component of the voice signal
  • An adder configured to add the voiced component and the unvoiced component to obtain a mixed excitation signal
  • a synthesis filter configured to output the mixed excitation signal to a one-frame synthesized speech waveform after passing through a filter constructed by a spectral envelope parameter .
  • Further, the system also includes a training device, used so that, during the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include both static and dynamic parameters, and so that only the static model parameters are retained in the model parameters of the statistical models obtained after training.
  • The coarse search unit is then specifically configured to take, for the current phoneme, the static model parameters of the statistical model obtained in the training stage that correspond to the current frame of the current phoneme as the rough value of the currently predicted speech parameters.
  • As described above, the technical solution of the embodiments of the present invention provides a novel parametric speech synthesis scheme by using the information of the speech frames before the current frame, together with the pre-computed global mean and global standard-deviation ratio of the speech parameters.
  • The parametric speech synthesis method and system provided by the invention adopt a vertical processing mode of synthesis: the synthesis of each frame of speech goes through four steps — taking out the rough value from the statistical model, filtering to obtain the smoothed value, global optimization to obtain the optimized value, and parametric synthesis to obtain the speech — and the synthesis of the next frame then repeats these four steps. Thus, during parametric speech synthesis only the parameters of a fixed storage size needed for the current frame have to be kept, so the RAM required for speech synthesis does not grow as the length of the synthesized speech increases, and the duration of the synthesized speech is no longer limited by the RAM.
  • the acoustic parameters used in the present invention are static parameters, and only the static mean parameters of the respective models are saved in the model library, so that the size of the statistical model library can be effectively reduced.
  • Moreover, the present invention uses multi-sub-band voiced/unvoiced mixed excitation when synthesizing speech, so that the unvoiced and voiced components in each sub-band are mixed according to the degree of voicing. Unvoiced and voiced sounds therefore no longer have a sharp hard boundary in time, and obvious distortion of the sound quality of the synthesized speech is avoided.
  • FIG. 1 is a stage-by-stage schematic diagram of a prior-art parametric speech synthesis method based on dynamic parameters and the maximum likelihood criterion;
  • FIG. 2 is a flowchart of a parametric speech synthesis method according to an embodiment of the present invention;
  • FIG. 3 is a stage-by-stage schematic diagram of a parametric speech synthesis method according to an embodiment of the present invention;
  • FIG. 4 is a schematic diagram of prior-art maximum likelihood parameter prediction based on dynamic parameters;
  • FIG. 5 is a schematic diagram of filtering-based smoothing parameter prediction based on static parameters according to an embodiment of the present invention;
  • FIG. 6 is a schematic diagram of a mixed-excitation-based synthesis filter according to an embodiment of the present invention;
  • FIG. 7 is a schematic diagram of prior-art synthesis filtering based on the unvoiced/voiced decision;
  • FIG. 8 is a block diagram of a parametric speech synthesis system according to another embodiment of the present invention;
  • FIG. 9 is a schematic diagram of the logical structure of a parametric speech synthesis unit according to another embodiment of the present invention;
  • FIG. 10 is a flowchart of a parametric speech synthesis method according to still another embodiment of the present invention;
  • FIG. 11 is a schematic structural diagram of a parametric speech synthesis system according to still another embodiment of the present invention.
  • FIG. 2 shows a flow chart of a parametric speech synthesis method in accordance with one embodiment of the present invention.
  • As shown in FIG. 2, the parametric speech synthesis method provided by the present invention, which can continuously synthesize speech of arbitrary duration, includes the following steps:
  • S210: Analyze the input text and obtain a phoneme sequence containing context information from the analysis.
  • S220: Take out one phoneme of the phoneme sequence in turn, search the statistical model library for the statistical model corresponding to each acoustic parameter of the phoneme, and take out the statistical models of the phoneme frame by frame as the rough values of the speech parameters to be synthesized.
  • S230: Perform parameter smoothing on the rough values of the speech parameters to be synthesized using a filter bank, to obtain the smoothed speech parameters.
  • S240: Perform global parameter optimization on the smoothed speech parameters using a global parameter optimizer, to obtain the optimized speech parameters.
  • S250: Synthesize the optimized speech parameters using a parametric speech synthesizer, and output one frame of synthesized speech.
  • S260: Determine whether all frames of the phoneme have been processed; if not, repeat the speech synthesis processing of steps S220 to S250 for the next frame of the phoneme, until all frames of all phonemes in the phoneme sequence have been processed. (A minimal sketch of this frame-by-frame loop is given below.)
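  • For illustration only, the vertical, frame-by-frame control flow of steps S220–S260 can be sketched as follows. All object, method and variable names here are hypothetical; the patent does not prescribe any particular implementation.

```python
# Minimal sketch of the vertical (frame-by-frame) synthesis loop of steps
# S220-S260.  Helper objects and method names are hypothetical illustrations,
# not the patent's actual implementation.
def synthesize(phoneme_sequence, model_library, engine):
    for phoneme in phoneme_sequence:                         # one phoneme at a time
        models = model_library.lookup(phoneme)               # S220: statistical models of this phoneme
        prev_smoothed = None                                  # only previous-frame info is kept
        for frame in range(models.num_frames()):
            rough = models.rough_values(frame)                # S220: per-frame rough values
            smoothed = engine.smooth(rough, prev_smoothed)    # S230: filter-bank smoothing
            optimized = engine.optimize(smoothed)             # S240: global mean / std-ratio optimization
            yield engine.synthesize_frame(optimized)          # S250: one frame of speech, output at once
            prev_smoothed = smoothed                          # S260: loop on; RAM stays fixed
```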
  • FIG. 3 is a stage-by-stage schematic diagram of a parametric speech synthesis method according to an embodiment of the present invention.
  • As shown in FIG. 3, the implementation of the parametric speech synthesis of the present invention likewise includes two stages, training and synthesis. The training stage is used to extract the acoustic parameters of the speech from the speech data in the corpus, and to train, from the extracted acoustic parameters, the statistical model corresponding to each phoneme under each kind of context information, thereby forming the statistical model library of phonemes required by the synthesis stage.
  • Steps S210 to S260 belong to the synthesis stage, which comprises three parts: text analysis, parameter prediction and speech synthesis.
  • the parameter prediction part can be subdivided into three parts: target model search, parameter generation and parameter optimization.
  • In the training stage, a main difference between the present invention and existing parametric speech synthesis technology is that the acoustic parameters extracted in the prior art must include dynamic parameters, whereas in the present invention the extracted acoustic parameters may be static parameters only, or may additionally include dynamic parameters that characterize the changes between neighbouring frames, such as first-order or second-order difference parameters, in order to improve the accuracy of the trained models.
  • Specifically, the acoustic parameters extracted from the corpus in the present invention include at least three kinds of static parameters — spectral envelope parameters, pitch frequency parameters and sub-band voicing parameters — and optionally other parameters such as formant frequencies.
  • The spectral envelope parameters may be linear prediction coefficients (LPC) or parameters derived from them, such as line spectral pair (LSP) parameters, or cepstrum-type parameters; they may also be the parameters (frequency, bandwidth, amplitude) of the first few formants, or discrete Fourier transform coefficients.
  • In addition, variants of these spectral envelope parameters in the Mel domain can be used to improve the sound quality of the synthesized speech.
  • The pitch frequency is represented as the logarithmic fundamental frequency, and the sub-band voicing degree is the proportion of voiced content within each sub-band.
  • Optionally, the acoustic parameters extracted from the corpus may also include dynamic parameters that characterize the changes of the acoustic parameters between neighbouring frames, such as first-order or second-order difference parameters of the fundamental frequency between the preceding and following frames.
  • each phoneme is automatically aligned to a large number of speech segments in the corpus, and then the acoustic parameter model corresponding to the phoneme is calculated from these speech segments.
  • the accuracy of automatic alignment using static parameters and dynamic parameters is slightly higher than that of using only static parameters, making the parameters of the model more accurate.
  • Since the present invention does not require dynamic parameters in the models during the synthesis stage, only the static parameters are retained in the final trained model library.
  • each acoustic parameter is modeled by a Hidden Markov Model (HMM).
  • This modeling scheme already exists in the prior art, so it is only briefly described below.
  • The HMM is a typical statistical signal processing method. Because of its stochastic nature, its ability to handle string inputs of unknown length, its effectiveness in avoiding the segmentation problem, and the availability of many fast and effective training and recognition algorithms, it is widely used in various fields of signal processing.
  • The HMM structure used here is a five-state, left-to-right type, and the observation probability distribution of each state is a single Gaussian density function, which is uniquely determined by the mean and variance of the parameters.
  • the mean is composed of the mean of the static parameters and the mean of the dynamic parameters (first and second order differences).
  • the variance consists of the variance of the static parameters and the variance of the dynamic parameters (first and second order differences).
  • a model is trained for each acoustic parameter of each phoneme according to the context information.
  • In addition, the related phonemes need to be clustered according to the context information of the phonemes, for example using a decision-tree-based clustering method (a generic tree-lookup sketch is given below).
  • After that, the models are used to perform frame-to-state forced alignment of the speech in the training corpus, and the duration information generated during alignment (i.e., the number of frames corresponding to each state) is used to train the decision-tree-clustered state duration models of the phonemes under different context information. Finally, the statistical models corresponding to each acoustic parameter of each phoneme under different context information form the statistical model library.
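  • As an illustration of how a decision-tree clustering result might be consulted at run time (the node structure and question handling below are hypothetical; the patent only states that decision-tree-based clustering can be used):

```python
# Generic decision-tree lookup sketch: walk yes/no context questions down to a
# leaf that holds the clustered model parameters.  Structure is illustrative only.
class TreeNode:
    def __init__(self, question=None, yes=None, no=None, leaf_model=None):
        self.question = question      # callable taking a context dict, returning True/False
        self.yes, self.no = yes, no
        self.leaf_model = leaf_model  # set only on leaf nodes

def find_leaf_model(node, context):
    """context: dict of label features, e.g. {'right_phoneme': 'a', 'tone': 3}."""
    while node.leaf_model is None:
        node = node.yes if node.question(context) else node.no
    return node.leaf_model
```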
  • the present invention saves only the static mean parameters of each model in the model library.
  • In contrast, the existing parametric speech synthesis method needs to retain the static mean parameters, the first-order and second-order difference mean parameters, and the variance parameters corresponding to all of these, so its statistical model library is large.
  • the size of the statistical model library for storing only the static mean parameters of each model is only about 1/6 of that of the statistical model library formed in the prior art, which greatly reduces the storage space of the statistical model library.
  • The data removed in this way is necessary for the existing parametric speech synthesis technology but is not required by the parametric speech synthesis solution provided by the present invention; therefore, the reduction in data volume does not affect the implementation of speech synthesis in the present invention. (A rough storage-size comparison is sketched below.)
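  • A rough, purely illustrative comparison of the storage needed per model stream (the dimensions below are invented; only the 6-to-1 ratio reflects the reduction described above):

```python
import numpy as np

# One 5-state single-Gaussian HMM stream, e.g. a 40-dimensional spectral envelope.
DIM, STATES = 40, 5

# Prior method: mean and variance for static, first- and second-order difference
# parameters -> 6 vectors per state.
full_model = np.zeros((STATES, 6, DIM), dtype=np.float32)

# Model library of the present invention: static means only -> 1 vector per state.
static_only = np.zeros((STATES, 1, DIM), dtype=np.float32)

print(static_only.nbytes / full_model.nbytes)   # 1/6 of the storage
```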
  • the input text needs to be analyzed first to extract a phoneme sequence containing the context information (step S210) as a basis for parameter synthesis.
  • The context information of a phoneme refers to information about the phonemes adjacent to the current phoneme; it may be the names of one or several phonemes before and after it, and may also include information from other linguistic or phonological levels.
  • In this embodiment, the context information of a phoneme includes the current phoneme name, the two phonemes before and after it, the tone or stress of the syllable it belongs to, and optionally the part of speech of the word (an illustrative representation is sketched below).
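  • Purely as an illustration, such a context-annotated phoneme could be represented as a record like the following (all field names are hypothetical):

```python
# Hypothetical context annotation for one phoneme in the analysed sequence.
context_phoneme = {
    "phoneme":        "a",           # current phoneme name
    "prev_phonemes":  ["sh", "n"],   # the two phonemes before it
    "next_phonemes":  ["m", "i"],    # the two phonemes after it
    "tone":           3,             # tone / stress of the syllable it belongs to
    "part_of_speech": "noun",        # optional word-level information
}
```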
  • Then one phoneme of the sequence is taken out in turn, the statistical model corresponding to each acoustic parameter of the phoneme is searched for in the statistical model library, and the statistical models of the phoneme are taken out frame by frame as the rough values of the speech parameters to be synthesized (step S220).
  • Specifically, the context annotation information of the phoneme is fed into the clustering decision trees, from which the statistical models corresponding to the spectral envelope parameters, the pitch frequency parameters, the sub-band voicing parameters and the state duration parameters can be found.
  • the state duration parameter is not a static acoustic parameter extracted from the original corpus. It is a new parameter generated when the state is aligned with the frame during training.
  • From each state of the models, the saved static mean parameters corresponding to each acoustic parameter are taken out.
  • The state duration mean parameters are used directly to decide how many frames each state of the phoneme to be synthesized lasts (see the small sketch below), while the static mean parameters such as the spectral envelope, fundamental frequency and sub-band voicing serve as the rough values of the speech parameters to be synthesized.
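  • A small sketch of how the state duration mean parameters could be turned into per-state frame counts (the rounding policy is an assumption, not specified by the patent):

```python
# Turn per-state duration means (in frames) into an explicit frame schedule
# for one phoneme to be synthesized.
def frames_per_state(duration_means):
    return [max(1, int(round(d))) for d in duration_means]

print(frames_per_state([3.2, 6.7, 9.1, 5.4, 2.8]))   # e.g. a 5-state model -> [3, 7, 9, 5, 3]
```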
  • the determined speech parameter coarse value is filtered based on the filter bank to predict the speech parameter (step S230).
  • a special set of filters is used to filter the spectral envelope, the fundamental frequency and the sub-band voiced sound to predict the better synthesized speech parameter values.
  • the filtering method employed in the step S230 of the present invention is a smoothing filtering method based on static parameters.
  • FIG. 5 is a schematic diagram of filtering smoothing parameter prediction based on static parameters according to the present invention.
  • the present invention replaces the maximum likelihood parameter predictor in the existing parameter speech synthesis technology with the set of parameter prediction filters.
  • This group of low-pass filters is used to predict, respectively, the spectral envelope parameters, pitch frequency parameters and sub-band voicing parameters of the speech to be synthesized, using a filtering relation of the form y_t = h_t ∗ x_t, where x_t is the rough value of a speech parameter obtained from the model at frame t, y_t is the filtered smoothed value, the operator ∗ denotes convolution, and h_t is the impulse response of a pre-designed filter. For different types of acoustic parameters, h_t can be designed differently according to the characteristics of the parameter.
  • For the spectral envelope parameters and the sub-band voicing parameters, the filter shown in equation (2) can be used, and the pitch frequency parameter can be predicted using the filter shown in equation (3); both take the recursive form y_t = α · y_{t−1} + (1 − α) · x_t, with different values of the fixed, pre-designed filter coefficient α for the different parameters. The coefficient values can be determined experimentally according to how strongly the corresponding parameter (for example, the fundamental frequency) of real speech varies over time.
  • In the present invention, the parameters involved in predicting the speech parameters to be synthesized do not extend to future parameters: the output frame at a given moment depends only on the input frames at or before that moment, or on the output frame at the previous moment, and never on future input or output frames, so the RAM size required by the filter bank can be fixed in advance. That is, when the acoustic parameters of the speech are predicted using equations (2) and (3), the output parameters of the current frame depend only on the input of the current frame and the output parameters of the previous frame.
  • Therefore, the entire parameter prediction process can be realized with a fixed-size RAM buffer that does not grow with the duration of the speech to be synthesized, so speech parameters of arbitrary duration can be predicted continuously. This solves the prior-art problem that the RAM required by maximum-likelihood-criterion parameter prediction grows in proportion to the duration of the synthesized speech.
  • In other words, when the filter bank performs parameter smoothing on the rough value of the speech parameter to be synthesized at the current moment, the rough value at that moment and the information of the speech frame at the previous moment are used to filter the rough value and obtain the smoothed speech parameter, where the information of the speech frame at the previous moment is the smoothed value of the speech parameter predicted at the previous moment. (A minimal recursive-smoothing sketch is given below.)
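  • The recursive smoothing described above can be sketched as follows, using the first-order form y_t = a · y_{t−1} + (1 − a) · x_t given for equations (2) and (3); the coefficient values chosen here are illustrative, not the patent's tuned constants.

```python
import numpy as np

# First-order recursive smoothing: y_t = a * y_{t-1} + (1 - a) * x_t.
# Only the previous smoothed value is kept, so the memory needed does not grow
# with the duration of the speech being synthesized.
class ParamSmoother:
    def __init__(self, a):
        self.a = a
        self.prev = None                    # smoothed value of the previous frame

    def smooth(self, rough):
        rough = np.asarray(rough, dtype=np.float64)
        if self.prev is None:               # first frame: nothing to smooth against yet
            self.prev = rough
        else:
            self.prev = self.a * self.prev + (1.0 - self.a) * rough
        return self.prev

spectrum_smoother = ParamSmoother(a=0.6)    # spectral envelope / sub-band voicing (illustrative a)
f0_smoother = ParamSmoother(a=0.3)          # pitch frequency uses a different coefficient
```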
  • the smoothed speech parameters can be optimized using the global parameter optimizer to determine the optimized speech parameters (step S240).
  • In the process of optimizing the speech parameters, the present invention adjusts the variation range of the synthesized speech parameters using formula (4) (shown only as an image in the source), in which m is the mean of the synthesized speech parameter, r is the ratio of the standard deviation of the training speech to that of the synthesized speech, and w is a fixed weight that controls the strength of the adjustment.
  • When determining m and r, the existing parametric speech synthesis method needs the values of a given speech parameter in all frames in order to compute the mean and variance, and then uses the global variance model to adjust the parameters of all frames so that the variance of the adjusted synthesized speech parameters is consistent with the global variance model, as in formula (5), to improve the sound quality.
  • In that method, the standard deviation of a given speech parameter over the training corpus is provided by the global variance model, while the standard deviation of the speech parameters currently being synthesized has to be recomputed for every piece of text. Because computing the mean and standard deviation requires the speech parameter values of all frames of the pre-adjustment synthesized speech, RAM is needed to hold the parameters of all frames before optimization; the required RAM therefore grows with the duration of the speech to be synthesized, and a fixed-size RAM cannot meet the need of continuously synthesizing speech of arbitrary length.
  • To solve this, the present invention redesigns the global parameter optimizer and optimizes the speech parameters using formula (6) during synthesis.
  • The global mean and the global standard-deviation ratio are determined in advance: a long passage of speech, for example about one hour, is synthesized without global parameter optimization, and the mean and the standard-deviation ratio of each acoustic parameter are then computed from it using formula (5) and stored as constants.
  • The global parameter optimizer designed by the present invention thus contains a global mean and a global variance (standard-deviation) ratio, where the global mean represents the mean of each acoustic parameter of the synthesized speech and the global ratio represents the ratio between the standard deviations of the training speech and the synthesized speech for that parameter.
  • With this optimizer, one frame of speech parameters can be optimized directly at each synthesis step; the mean and standard-deviation ratio no longer need to be recomputed from all synthesized speech frames, so the values of all frames of the speech parameters to be synthesized do not have to be stored. A fixed, small amount of RAM suffices, which solves the problem that the RAM of the existing parametric speech synthesis method grows in proportion to the synthesized speech duration.
  • Moreover, the present invention uses the same m and r for every synthesized utterance, whereas the original method uses newly computed m and r for each synthesis; the consistency of the speech synthesized from different texts is therefore better with the present invention, and its computational complexity is also lower than that of the original method. (A per-frame optimization sketch is given below.)
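  • A per-frame optimization sketch with fixed statistics is shown below. The first step, ŷ_t = r · (y_t − m) + m, follows the formula quoted earlier; the final weighted combination with w is an assumed form, since that equation appears only as an image in the source.

```python
# Per-frame global parameter optimisation with fixed, pre-computed statistics.
# m and r come from offline statistics over a long synthesized passage; the
# blending step with w is an assumption (the source shows it only as an image).
def optimize_frame(y, m, r, w):
    y_init = r * (y - m) + m            # pull the parameter toward the training-speech range
    return w * y_init + (1.0 - w) * y   # assumed fixed-weight blend with the unoptimised value

m, r, w = 4.7, 1.3, 0.8                 # illustrative constants, e.g. for per-frame log-F0
print(optimize_frame(4.9, m, r, w))
```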
  • the optimized speech parameters may be synthesized by a parametric speech synthesizer to synthesize a frame of speech waveforms (step S250).
  • FIG. 6 is a schematic diagram of a synthesis filter based on hybrid excitation according to an embodiment of the present invention
  • FIG. 7 is a schematic diagram of a prior-art synthesis filter based on the unvoiced/voiced decision.
  • The mixed-excitation-based synthesis filter employed in the present invention takes the source–filter form, whereas the excitation of the prior-art filter is a simple binary excitation.
  • In the prior art, the technique used when synthesizing speech with the parameter synthesizer is parametric speech synthesis based on an unvoiced/voiced decision: a preset threshold is used to make a hard unvoiced/voiced judgement, and each frame of synthesized speech is determined to be either entirely voiced or entirely unvoiced. This causes unvoiced frames to appear suddenly inside stretches of synthesized voiced speech, producing clearly audible distortion.
  • After the unvoiced/voiced decision is predicted, the excitation is generated separately: white noise is used as the excitation for unvoiced frames and a quasi-periodic pulse train for voiced frames, and the excitation is finally passed through the synthesis filter to obtain the waveform of the synthesized speech. This way of constructing the excitation inevitably produces a sharp hard boundary between the synthesized unvoiced and voiced sounds, resulting in significant distortion of the sound quality of the synthesized speech.
  • In the present invention, multi-sub-band voiced/unvoiced mixed excitation is used instead: no hard unvoiced/voiced prediction is made, and the unvoiced and voiced components in each sub-band are mixed according to the degree of voicing, so that unvoiced and voiced sounds no longer have a sharp hard boundary in time. This solves the problem of the original method that the sudden appearance of unvoiced frames within voiced segments causes obvious distortion.
  • During training, the degree of voicing of the current frame of a sub-band can be extracted from the corpus speech using a formula (shown only as an image in the source) whose terms are the value of a given speech sample of the current frame of the sub-band, the value of the speech sample separated from it by a time interval, and the number of samples in one frame. (A plausible normalized-autocorrelation reading of this measure is sketched below.)
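  • One plausible reading of such a voicing measure is a normalized autocorrelation of the sub-band frame at a given lag; since the patent's exact formula is only available as an image, the sketch below should be treated as an assumption rather than the patented formula.

```python
import numpy as np

# Assumed normalised-autocorrelation voicing measure for one sub-band frame.
def subband_voicing(frame, lag):
    frame = np.asarray(frame, dtype=np.float64)
    n = len(frame)
    if lag <= 0 or lag >= n:
        return 0.0
    num = np.sum(frame[:n - lag] * frame[lag:])                          # correlation at the lag
    den = np.sqrt(np.sum(frame[:n - lag] ** 2) * np.sum(frame[lag:] ** 2))
    return float(num / den) if den > 0 else 0.0                          # near 1 for strongly voiced frames
```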
  • In the synthesis stage, a quasi-periodic pulse sequence is first constructed from the pitch frequency parameter of the speech parameters, and a random sequence is constructed from white noise.
  • The voiced sub-band filter constructed from the sub-band voicing degrees extracts the voiced component of the signal from the quasi-periodic pulse sequence, and the unvoiced sub-band filter, also constructed from the voicing degrees, extracts the unvoiced component of the signal from the random sequence.
  • The mixed excitation signal is obtained by adding the voiced component and the unvoiced component, and this mixed excitation signal is then passed through a synthesis filter constructed from the spectral envelope parameters to output one frame of synthesized speech waveform. (A simplified end-to-end sketch of this mixed-excitation synthesis is given below.)
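  • A simplified end-to-end sketch of this mixed-excitation synthesis for a single frame is given below. The per-band mixing and the all-pole spectral-envelope filter are simplifications chosen for illustration, not the patent's exact filter constructions.

```python
import numpy as np
from scipy.signal import lfilter

# Sketch of mixed-excitation synthesis for one frame.  Sub-band filtering is
# reduced to per-band spectral mixing by the voicing degrees, and the spectral
# envelope is applied as an all-pole (LPC) synthesis filter.
def synthesize_frame(f0, voicing, lpc, frame_len=200, fs=16000):
    period = int(fs / f0) if f0 > 0 else frame_len
    pulses = np.zeros(frame_len)
    pulses[::period] = 1.0                          # quasi-periodic pulse train
    noise = np.random.randn(frame_len)              # white-noise excitation

    P, N = np.fft.rfft(pulses), np.fft.rfft(noise)
    bands = np.array_split(np.arange(len(P)), len(voicing))
    mixed = np.zeros_like(P)
    for b, idx in enumerate(bands):                 # mix voiced/unvoiced per sub-band
        mixed[idx] = voicing[b] * P[idx] + (1.0 - voicing[b]) * N[idx]
    excitation = np.fft.irfft(mixed, frame_len)     # mixed excitation signal

    # Spectral-envelope filter (here an LPC synthesis filter 1 / A(z)).
    return lfilter([1.0], np.concatenate(([1.0], lpc)), excitation)

# Example: 4 sub-band voicing degrees and a couple of LPC coefficients.
wave = synthesize_frame(f0=120.0, voicing=[0.9, 0.7, 0.3, 0.1], lpc=np.array([-0.9, 0.2]))
```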
  • A preferred embodiment of the present invention is therefore one in which the above hard unvoiced/voiced prediction is not performed and multi-sub-band voiced/unvoiced mixed excitation is used.
  • Because the present invention has the advantage of continuously synthesizing speech of any length, the next frame of speech can be processed cyclically after the output of one frame of speech waveform is completed. Since the optimized speech parameters of the next frame are not generated in advance and stored in RAM, after the current frame has been processed it is necessary to return to step S220 to take the rough values of the next frame's speech parameters from the model, and to repeat steps S220 to S250 to perform speech synthesis processing on the next frame of the phoneme and finally output its speech waveform. This loops until the parameters of all phoneme models have been processed and all the speech has been synthesized.
  • FIG. 8 shows a block schematic diagram of a parametric speech synthesis system 800 in accordance with another embodiment of the present invention.
  • the parametric speech synthesis system 800 includes an input text analysis unit 830, a coarse search unit 840, a smoothing filter unit 850, a global optimization unit 860, a parametric speech synthesis unit 870, and a loop determination unit 880.
  • an acoustic parameter extraction unit and a statistical model training unit (not shown) for corpus training may also be included.
  • the acoustic parameter extraction unit is configured to extract acoustic parameters of the speech in the training corpus;
  • The statistical model training unit is configured to train the statistical model corresponding to each acoustic parameter of each phoneme from the acoustic parameters extracted by the acoustic parameter extraction unit, and to save the statistical models in the statistical model library.
  • The input text analysis unit 830 is configured to analyze the input text and to acquire a phoneme sequence containing context information from that analysis. The coarse search unit 840 is configured to take out one phoneme at a time from the phoneme sequence acquired by the input text analysis unit 830, to search the statistical model library for the statistical models corresponding to the acoustic parameters of that phoneme, and to take out the models frame by frame as the rough values of the speech parameters to be synthesized.
  • The smoothing filtering unit 850 is configured to filter the rough values of the speech parameters to be synthesized using the filter bank, to obtain the smoothed speech parameters.
  • the global optimization unit 860 is configured to perform global parameter optimization on each speech parameter smoothed by the smoothing filtering unit 850 using the global parameter optimizer to obtain an optimized speech parameter;
  • The parametric speech synthesis unit 870 is configured to synthesize, using a parametric speech synthesizer, the speech parameters optimized by the global optimization unit 860 and to output the synthesized speech.
  • The loop judging unit 880 is connected between the parametric speech synthesis unit 870 and the coarse search unit 840. After the output of one frame of speech waveform has been completed, it determines whether the phoneme still has unprocessed frames; if so, for the next frame of the phoneme it again uses the coarse search unit, the smoothing filtering unit, the global optimization unit and the parametric speech synthesis unit to obtain the rough values from the statistical models of the acoustic parameters, the filtered smoothed values, the globally optimized values and the parametric synthesis of speech, looping in this way until all frames of all phonemes in the phoneme sequence have been processed.
  • In other words, after the current frame, the coarse search unit 840, the smoothing filtering unit 850, the global optimization unit 860 and the parametric speech synthesis unit 870 perform the speech synthesis processing for the next frame and finally output its speech waveform; this loop continues until the parameters of all frames of all phonemes in the phoneme sequence have been processed and all of the speech has been synthesized.
  • Specifically, the statistical model training unit further includes an acoustic parameter model training unit, a clustering unit, a forced alignment unit, a state duration model training unit and a model statistics unit (not shown in the figure):
  • An acoustic parameter model training unit configured to train a model for each acoustic parameter of each phoneme according to context information of each phoneme;
  • a clustering unit configured to cluster related phonemes according to context information of the phoneme
  • a forced alignment unit for performing frame-to-state forced alignment of speech in the training corpus using the model
  • a state duration model training unit configured to use the duration information generated by the forced alignment unit in the forced alignment process to train a state duration model after the phonemes are clustered in different context information
  • the model statistic unit is configured to form a statistical model library for a statistical model corresponding to each acoustic parameter of each phoneme in different context information.
  • Figure 9 is a diagram showing the logical structure of a parametric speech synthesis unit in accordance with a preferred embodiment of the present invention.
  • the parametric speech synthesis unit 870 further includes a quasi-periodic pulse generator 871, a white noise generator 872, a voiced subband filter 873, an unvoiced subband filter 874, an adder 875, and a synthesis filter 876.
  • the quasi-periodic pulse generator 871 is configured to construct a quasi-periodic pulse sequence according to the pitch frequency parameter in the speech parameter;
  • the white noise generator 872 is configured to construct a random sequence by white noise;
  • The voiced sub-band filter 873 is used to determine, based on the sub-band voicing degrees, the voiced component of the signal from the constructed quasi-periodic pulse sequence.
  • The unvoiced sub-band filter 874 is used to determine, based on the sub-band voicing degrees, the unvoiced component of the signal from the random sequence; the voiced component and the unvoiced component are then added by the adder 875 to obtain the mixed excitation signal.
  • Finally, the mixed excitation signal is filtered by the synthesis filter 876 constructed from the spectral envelope parameters to output a corresponding frame of synthesized speech waveform.
  • It can be seen that the synthesis method adopted by the present invention is a vertical processing mode: the synthesis of each frame of speech goes through the steps of taking out the rough value from the statistical model, filtering to obtain the smoothed value, global optimization to obtain the optimized value and parametric synthesis to obtain the speech, and the synthesis of the next frame then repeats these steps.
  • The existing parametric speech synthesis method, in contrast, adopts horizontal offline processing: the rough parameters of all models are taken out, the smoothed parameters of all frames are generated by the maximum likelihood algorithm, the optimized parameters of all frames are obtained from the global variance model, and finally the parameter synthesizer outputs the speech of all frames, so every layer has to store the parameters of all frames.
  • Compared with this, the vertical processing mode of the present invention only needs to store the fixed amount of parameters required by the current frame, so it also solves the limitation on the length of synthesized speech caused by the horizontal processing of the original method.
  • the present invention reduces the size of the model library to about 1/6 of the original method by using only static parameters in the synthesis stage, and no longer using dynamic parameters and variance information.
  • Furthermore, by using a specially designed filter bank instead of the maximum-likelihood parameter method for smooth parameter generation, and by using the new global parameter optimizer in place of the global variance model of the original method for speech parameter optimization, combined with the vertical processing structure, speech parameters of arbitrary duration can be predicted continuously with a fixed-size RAM. This solves the problem that the original method cannot continuously predict speech parameters of arbitrary duration on a chip with a small RAM, and helps the speech synthesis method spread to chips with small memory.
  • a parameter speech synthesis method provided by still another embodiment of the present invention, referring to FIG. 10, the method includes:
  • In the synthesis stage, each frame of each phoneme in the phoneme sequence of the input text is processed in turn as follows:
  • Step 101: For the current phoneme in the phoneme sequence of the input text, extract the corresponding statistical model from the statistical model library, and take the model parameters of that statistical model corresponding to the current frame of the current phoneme as the rough value of the currently predicted speech parameters.
  • Step 102 Filter the coarse value by using the coarse value and information of a predetermined number of voice frames before the current time to obtain a smoothed value of the currently predicted voice parameter.
  • Here, the parameters involved in the prediction do not extend to future parameters: the output frame at a given moment depends only on the input frames at or before that moment, or on the output frame at the previous moment, and never on future input or output frames.
  • Specifically, the rough value and the information of the speech frame at the previous moment may be used to filter the rough value to obtain the smoothed value of the currently predicted speech parameters, where the information of the speech frame at the previous moment is the smoothed value of the speech parameters predicted at that previous moment.
  • When the predicted speech parameter is a spectral envelope parameter or a sub-band voicing parameter, the scheme filters the rough value, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, according to the formula y_t = a · y_{t−1} + (1 − a) · x_t; when the predicted speech parameter is the pitch frequency parameter, the rough value is filtered in the same recursive form but with a different value of the filter coefficient, to obtain the smoothed value of the currently predicted speech parameter. In these formulas, t denotes the t-th frame, x_t is the rough value of the predicted speech parameter at frame t, y_t is the filtered smoothed value, and the filter coefficients take different values for the different parameters.
  • Further, the scheme may specifically include the following processing: constructing the voiced sub-band filter and the unvoiced sub-band filter using the sub-band voicing parameters; passing the quasi-periodic pulse sequence constructed from the pitch frequency parameter through the voiced sub-band filter to obtain the voiced component of the speech signal; passing a random sequence constructed from white noise through the unvoiced sub-band filter to obtain the unvoiced component of the speech signal; adding the voiced and unvoiced components to obtain the mixed excitation signal; and passing the mixed excitation signal through a filter constructed from the spectral envelope parameters to output one frame of synthesized speech waveform.
  • the program includes a training phase before the above-mentioned synthesis phase.
  • the acoustic parameters extracted from the corpus only include static parameters, or the acoustic parameters extracted from the corpus include static parameters and dynamic parameters; only the static model parameters are retained in the model parameters of the statistical model obtained after training;
  • the step 101 in the synthesizing stage may specifically include: according to the current phoneme, the corresponding static model parameter of the statistical model obtained in the training phase in the current frame of the current phoneme is used as a rough value of the currently predicted speech parameter.
  • a further embodiment of the present invention also provides a parametric speech synthesis system. Referring to FIG. 11, the system includes:
  • The system includes a loop synthesis device 110, configured to perform speech synthesis in turn, during the synthesis stage, on each frame of each phoneme in the phoneme sequence of the input text. The loop synthesis device 110 includes:
  • a coarse search unit 111, configured to extract, for the current phoneme in the phoneme sequence of the input text, the corresponding statistical model from the statistical model library, and to take the model parameters of that statistical model corresponding to the current frame of the current phoneme as the rough value of the currently predicted speech parameters;
  • the smoothing filtering unit 112 is configured to filter the coarse value by using the coarse value and information of a predetermined number of voice frames before the current time to obtain a smoothed value of the currently predicted voice parameter;
  • the global optimization unit 113 is configured to globally optimize the smoothed value of the currently predicted speech parameter according to the global mean value and the global standard deviation ratio of the voice parameter obtained by the statistics, to generate a required voice parameter;
  • the parameter speech synthesis unit 114 is configured to synthesize the generated speech parameters to obtain a frame of speech synthesized by the current frame of the current phoneme.
  • the smoothing filtering unit 112 includes a low-pass filter group, configured to filter the coarse value by using the coarse value and the information of the previous time speech frame to obtain a smoothed value of the currently predicted speech parameter.
  • the information of the speech frame at the last moment is a smoothed value of the predicted speech parameter at the previous moment.
  • Specifically, when the predicted speech parameter is a spectral envelope parameter or a sub-band voicing parameter, the low-pass filter bank filters the rough value, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, according to the formula y_t = a · y_{t−1} + (1 − a) · x_t, to obtain the smoothed value of the currently predicted speech parameter; when the predicted speech parameter is the pitch frequency parameter, the low-pass filter bank filters the rough value in the same recursive form but with a different value of the filter coefficient. In these formulas, t denotes the t-th frame, x_t is the rough value of the predicted speech parameter at frame t, y_t is the filtered smoothed value, and the filter coefficients take different values for the different parameters.
  • The global optimization unit 113 includes a global parameter optimizer, configured to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard-deviation ratio of the speech parameter, generating the required speech parameter, using formulas of the form ŷ_t = r · (y_t − m) + m, where y_t is the smoothed value of the speech parameter at time t before optimization, ŷ_t is the initially optimized value, w is a weight value used in a second formula (shown only as an image in the source) that combines ŷ_t with y_t to give the required speech parameter obtained after global optimization, r is the statistically obtained global standard-deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
  • the parameter speech synthesis unit 114 includes:
  • a filter construction module for constructing a voiced subband filter and an unvoiced subband filter using subband voiced parameters
  • the voiced subband filter is configured to filter a quasi-periodic pulse sequence constructed by a pitch frequency parameter to obtain a voiced component of the voice signal;
  • the unvoiced subband filter is configured to filter a random sequence constructed by white noise to obtain an unvoiced component of the voice signal
  • An adder configured to add the voiced component and the unvoiced component to obtain a mixed excitation signal
  • a synthesis filter configured to output the mixed excitation signal to a one-frame synthesized speech waveform after passing through a filter constructed by a spectral envelope parameter .
  • Further, the system also includes a training device, used so that, during the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include both static and dynamic parameters, and so that only the static model parameters are retained in the model parameters of the statistical models obtained after training.
  • The coarse search unit 111 is then specifically configured to take, for the current phoneme, the static model parameters of the statistical model obtained in the training stage that correspond to the current frame of the current phoneme as the rough value of the currently predicted speech parameters.
  • For the related operations of the coarse search unit 111, the smoothing filtering unit 112, the global optimization unit 113 and the parametric speech synthesis unit 114 in this embodiment, reference may be made respectively to the related content of the coarse search unit 840, the smoothing filtering unit 850, the global optimization unit 860 and the parametric speech synthesis unit 870 in the foregoing embodiments.
  • As described above, the technical solution of the embodiments of the present invention provides a novel parametric speech synthesis scheme by using the information of the speech frames before the current frame together with the statistically obtained global mean and global standard-deviation ratio of the speech parameters.
  • the vertical processing method is adopted in the synthesis stage, and each frame of speech is separately synthesized one by one, and only the parameters of the fixed capacity required by the current frame are saved in the synthesis process.
  • the new vertical processing architecture of the solution enables the synthesis of speech of any length of time using a fixed-size RAM, which significantly reduces the RAM capacity requirement for speech synthesis, thereby enabling continuous synthesis of arbitrary on a chip with smaller RAM. Duration speech.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Machine Translation (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The present invention provides a parametric speech synthesis method and system. The method includes performing the following processing in turn on each frame of speech of each phoneme in the phoneme sequence of an input text: for the current phoneme, extracting the corresponding statistical model from a statistical model library, and taking the model parameters of that statistical model corresponding to the current frame of the current phoneme as the rough value of the currently predicted speech parameters; obtaining a smoothed value of the currently predicted speech parameters by using the rough value and the information of a predetermined number of speech frames before the current moment; globally optimizing the smoothed value of the speech parameters according to the statistically obtained global mean and global standard-deviation ratio of the speech parameters, to generate the required speech parameters; and synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme. With this scheme, the RAM required for speech synthesis does not increase as the length of the synthesized speech increases, and the duration of the synthesized speech is no longer limited by the RAM.

Description

Parametric speech synthesis method and system — Technical Field
The present invention relates to the technical field of parametric speech synthesis, and more specifically to a parametric speech synthesis method and system capable of continuously synthesizing speech of arbitrary duration.
Background Art — Speech synthesis produces artificial speech by mechanical and electronic means and is an important technology for making human–computer interaction more natural. There are currently two common types of speech synthesis technology: one is speech synthesis based on unit selection and waveform concatenation, and the other is parametric speech synthesis based on acoustic statistical models. Since the parametric speech synthesis method has relatively small storage requirements, it is better suited to small electronic devices.
The parametric speech synthesis method is divided into two stages, training and synthesis. In the training stage, referring to FIG. 1, the acoustic parameters of all speech in the corpus are first extracted, including static parameters such as spectral envelope parameters and pitch (fundamental) frequency parameters, and dynamic parameters such as the first- and second-order difference parameters of the spectral envelope and fundamental frequency parameters (a small sketch of such difference features is given below). Then, for each phoneme, the corresponding acoustic statistical model is trained according to its context annotation information, and a global variance model is trained for the entire corpus. Finally, the acoustic statistical models of all phonemes and the global variance model form a model library.
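For illustration, first- and second-order difference (dynamic) features of a static parameter track, such as per-frame log fundamental frequency, can be computed roughly as follows; the simple difference windows used here are an assumption, since the text does not specify them.

```python
import numpy as np

# Illustrative delta / delta-delta computation for one static parameter track.
def deltas(static_track):
    x = np.asarray(static_track, dtype=np.float64)
    d1 = np.gradient(x)        # first-order difference (dynamic) parameters
    d2 = np.gradient(d1)       # second-order difference parameters
    return d1, d2

d1, d2 = deltas([4.6, 4.7, 4.9, 5.0, 4.8])   # e.g. a short log-F0 track
```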
In the synthesis stage, speech is synthesized using layered offline processing. As shown in FIG. 1, this includes the first layer: analyzing the entire input text to obtain all phonemes with context information, which form a phoneme sequence; the second layer: extracting from the trained model library the model corresponding to each phoneme in the phoneme sequence to form a model sequence; the third layer: using the maximum likelihood algorithm to predict from the model sequence the acoustic parameters corresponding to each frame of speech, forming a speech parameter sequence; the fourth layer: optimizing the speech parameter sequence as a whole using the global variance model; and the fifth layer: inputting all the optimized speech parameter sequences into the parametric speech synthesizer to generate the final synthesized speech.
In the course of implementing the present invention, the inventors found that the prior art has at least the following defects:
The existing parametric speech synthesis method adopts a horizontal processing mode in the layered operations of the synthesis stage: the parameters of all statistical models are taken out, the smoothed parameters of all frames are predicted and generated by the maximum likelihood algorithm, the optimized parameters of all frames are obtained from the global variance model, and finally the speech of all frames is output from the parameter synthesizer. That is, the relevant parameters of all frames have to be saved at every layer, so the capacity of random access memory (RAM) required for speech synthesis increases in proportion to the duration of the synthesized speech; yet the size of the RAM on a chip is fixed, and in many applications the chip's RAM is smaller than 100 K bytes, so the existing parametric speech synthesis method cannot continuously synthesize speech of arbitrary duration on a chip with such a small RAM.
The causes of this problem are explained in more detail below in connection with the operations of the third and fourth layers of the above synthesis stage.
In the third-layer operation of the synthesis stage, referring to FIG. 4, predicting the speech parameter sequence from the model sequence with the maximum likelihood algorithm must be implemented in two passes, a frame-by-frame forward recursion and a backward recursion. After the first recursion pass, corresponding temporary parameters are generated for each frame of speech, and the temporary parameters of all frames must be fed into the second, backward recursion pass before the required parameter sequence can be predicted. The longer the synthesized speech, the larger the number of speech frames, and a frame of temporary parameters is produced whenever the speech parameters of a frame are predicted. The temporary parameters of all frames must be kept in RAM before the second recursion pass can be completed, so speech of arbitrary duration cannot be synthesized continuously on a chip with a small RAM.
Moreover, the operation of the fourth layer needs to compute the mean and variance from all the frames of speech parameters output by the third layer, and then use the global variance model to optimize the smoothed speech parameters as a whole to generate the final speech parameters. RAM corresponding to the number of frames is therefore also needed to hold the speech parameters of all frames output by the third layer, which likewise makes it impossible to continuously synthesize speech of arbitrary duration on a chip with a small RAM. Summary of the Invention — In view of the above problems, the object of the present invention is to solve the problem that the RAM required by the original speech synthesis process grows in proportion to the length of the synthesized speech, making it impossible to continuously synthesize speech of arbitrary duration on a chip with a small RAM.
According to one aspect of the present invention, a parametric speech synthesis method is provided, comprising a training stage and a synthesis stage, wherein the synthesis stage specifically comprises:
processing each frame of speech of each phoneme in the phoneme sequence of the input text in turn as follows:
for the current phoneme in the phoneme sequence of the input text, extracting the corresponding statistical model from a statistical model library, and taking the model parameters of that statistical model for the current frame of the current phoneme as the rough value of the currently predicted speech parameter;
filtering the rough value using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain the smoothed value of the currently predicted speech parameter;
globally optimizing the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme.
In a preferred scheme, the rough value is filtered using the rough value and the information of the speech frame at the previous moment to obtain the smoothed value of the currently predicted speech parameter, the information of the speech frame at the previous moment being the smoothed value of the speech parameter predicted at the previous moment.
In addition, in a preferred scheme, the smoothed value of the currently predicted speech parameter is globally optimized according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, using the following formulas, to generate the required speech parameter:
y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t
where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, w is a weight value, z_t is the required speech parameter obtained after global optimization, r is the statistically obtained global standard deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
Further, the scheme also comprises: constructing a voiced sub-band filter and an unvoiced sub-band filter from sub-band voicing parameters; passing a quasi-periodic pulse sequence constructed from the pitch frequency parameter through the voiced sub-band filter to obtain the voiced component of the speech signal; passing a random sequence constructed from white noise through the unvoiced sub-band filter to obtain the unvoiced component of the speech signal; adding the voiced and unvoiced components to obtain a mixed excitation signal; and passing the mixed excitation signal through a filter constructed from the spectral envelope parameters and outputting one frame of synthesized speech waveform. Further, before the synthesis stage the method also comprises a training stage,
in which the acoustic parameters extracted from the corpus include only static parameters, or include static parameters and dynamic parameters, and in which only the static model parameters are retained in the model parameters of the statistical models obtained after training;
in the synthesis stage, according to the current phoneme, the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme are taken as the rough value of the currently predicted speech parameter.
According to another aspect of the present invention, a parametric speech synthesis system is provided, comprising:
a loop synthesis device, configured to perform speech synthesis, in the synthesis stage, on each frame of speech of each phoneme in the phoneme sequence of the input text in turn;
the loop synthesis device comprising:
a coarse search unit, configured to extract, for the current phoneme in the phoneme sequence of the input text, the corresponding statistical model from the statistical model library, and to take the model parameters of that statistical model for the current frame of the current phoneme as the rough value of the currently predicted speech parameter;
a smoothing filter unit, configured to filter the rough value using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain the smoothed value of the currently predicted speech parameter;
a global optimization unit, configured to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
a parametric speech synthesis unit, configured to synthesize the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme. Further, the smoothing filter unit comprises a low-pass filter bank, configured to filter the rough value using the rough value and the information of the speech frame at the previous moment to obtain the smoothed value of the currently predicted speech parameter, the information of the speech frame at the previous moment being the smoothed value of the speech parameter predicted at the previous moment. Further, the global optimization unit comprises a global parameter optimizer, configured to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, using the following formulas, to generate the required speech parameter:
y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t
where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, w is a weight value, z_t is the required speech parameter obtained after global optimization, r is the statistically obtained global standard deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants. Further, the parametric speech synthesis unit comprises:
a filter construction module, configured to construct a voiced sub-band filter and an unvoiced sub-band filter from the sub-band voicing parameters;
the voiced sub-band filter, configured to filter a quasi-periodic pulse sequence constructed from the pitch frequency parameter to obtain the voiced component of the speech signal;
the unvoiced sub-band filter, configured to filter a random sequence constructed from white noise to obtain the unvoiced component of the speech signal;
an adder, configured to add the voiced and unvoiced components to obtain a mixed excitation signal; and a synthesis filter, configured to pass the mixed excitation signal through a filter constructed from the spectral envelope parameters and to output one frame of synthesized speech waveform. Further, the system also comprises a training device, configured such that, in the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include static parameters and dynamic parameters, and such that only the static model parameters are retained in the model parameters of the statistical models obtained after training;
the coarse search unit is specifically configured, in the synthesis stage, to take, according to the current phoneme, the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme as the rough value of the currently predicted speech parameter. As described above, the technical solution of the embodiments of the present invention provides a novel parametric speech synthesis scheme by using the information of the speech frames preceding the current frame together with the statistically pre-computed global mean and global standard deviation ratio of the speech parameters.
The parametric speech synthesis method and system provided by the present invention adopt a vertical synthesis approach: synthesizing each frame of speech goes through four steps, fetching the rough value from the statistical model, filtering to obtain the smoothed value, global optimization to obtain the optimized value, and parametric synthesis to obtain the speech, and these four steps are repeated for every subsequent frame. During parametric speech synthesis only a fixed amount of parameters needed by the current frame has to be stored, so the RAM required for speech synthesis does not grow with the length of the synthesized speech, and the duration of the synthesized speech is no longer limited by the RAM.
In addition, the acoustic parameters used in the present invention are static parameters, and only the static mean parameters of each model are kept in the model library, which effectively reduces the size of the statistical model library.
Furthermore, the present invention uses multi-sub-band mixed voiced/unvoiced excitation when synthesizing speech, mixing the unvoiced and voiced contributions in each sub-band according to the degree of voicing, so that unvoiced and voiced speech no longer have a hard boundary in time, avoiding an audible distortion of the synthesized speech.
The scheme can synthesize speech with high continuity, consistency and naturalness, which helps the popularization and application of speech synthesis methods on chips with small storage space. To achieve the above and related objects, one or more aspects of the present invention comprise the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings describe certain exemplary aspects of the present invention in detail; these aspects, however, indicate only some of the various ways in which the principles of the invention may be used. Moreover, the present invention is intended to include all such aspects and their equivalents.

Brief Description of the Drawings

Other objects and results of the present invention will become clearer and easier to understand from the following description and the claims, taken together with the accompanying drawings and with a fuller understanding of the invention. In the drawings:

Fig. 1 is a stage-by-stage schematic diagram of a prior-art parametric speech synthesis method based on dynamic parameters and the maximum likelihood criterion;
Fig. 2 is a flowchart of a parametric speech synthesis method according to an embodiment of the present invention;
Fig. 3 is a stage-by-stage schematic diagram of the parametric speech synthesis method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of prior-art maximum likelihood parameter prediction based on dynamic parameters;
Fig. 5 is a schematic diagram of filter-based smoothed parameter prediction based on static parameters according to an embodiment of the present invention; Fig. 6 is a schematic diagram of a mixed-excitation-based synthesis filter according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of prior-art synthesis filtering based on a voiced/unvoiced decision;
Fig. 8 is a block diagram of a parametric speech synthesis system according to another embodiment of the present invention;
Fig. 9 is a schematic diagram of the logical structure of a parametric speech synthesis unit according to another embodiment of the present invention;
Fig. 10 is a flowchart of a parametric speech synthesis method according to a further embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a parametric speech synthesis system according to a further embodiment of the present invention.
Throughout the drawings the same reference numerals indicate similar or corresponding features or functions.

Detailed Description

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Fig. 2 shows a flowchart of a parametric speech synthesis method according to an embodiment of the present invention.
As shown in Fig. 2, the parametric speech synthesis method provided by the present invention, which can continuously synthesize speech of arbitrary duration, is implemented with the following steps:
S210: analyze the input text and obtain from the analysis a phoneme sequence containing context information;
S220: take one phoneme from the phoneme sequence in turn, search the statistical model library for the statistical model corresponding to each acoustic parameter of that phoneme, and take out, frame by frame, the statistical models of the phoneme as the rough values of the speech parameters to be synthesized;
S230: smooth the rough values of the speech parameters to be synthesized with a filter bank to obtain smoothed speech parameters;
S240: globally optimize the smoothed speech parameters with a global parameter optimizer to obtain optimized speech parameters;
S250: synthesize the optimized speech parameters with a parametric speech synthesizer and output one frame of synthesized speech;
S260: judge whether all frames of the phoneme have been processed; if not, repeat the speech synthesis processing of steps S220 to S250 for the next frame of the phoneme, until all frames of all phonemes in the phoneme sequence have been processed (a simplified sketch of this per-frame loop is given below).
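For illustration only, the following Python sketch shows how the vertical, frame-by-frame flow of steps S220–S250 could be organized. The coefficient names and the toy numbers are assumptions introduced here for readability, not values taken from the patent; the point of the sketch is that only one frame of state is carried between iterations, so the working memory stays constant no matter how many frames are synthesized.

```python
import numpy as np

# Toy illustration (not the patent's code) of the vertical, per-frame processing of
# steps S220-S250: each frame is smoothed, optimized and (conceptually) synthesized
# immediately, and only the previous frame's smoothed value is kept in memory.
def run_vertical_loop(coarse_frames, alpha=0.6, m=0.0, r=1.0, w=0.5):
    prev = None                        # smoothed value of the previous frame (fixed size)
    out_frames = []
    for x in coarse_frames:            # S220: rough parameter vector of one frame
        y = x if prev is None else alpha * prev + (1 - alpha) * x   # S230: recursive smoothing
        y_prelim = r * (y - m) + m     # S240: preliminary global optimization
        z = w * (y_prelim - y) + y     #        weighted final value
        out_frames.append(z)           # S250 would turn z into one frame of waveform
        prev = y
    return np.array(out_frames)

frames = np.random.rand(10, 3)         # 10 frames of a 3-dimensional toy parameter
print(run_vertical_loop(frames).shape) # (10, 3)
```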
To explain the parametric speech synthesis technique of the present invention more clearly and to highlight its technical features, it is compared below, stage by stage and step by step, with the prior-art parametric speech synthesis method. Fig. 3 is a stage-by-stage schematic diagram of the parametric speech synthesis method of an embodiment of the present invention. As shown in Fig. 3, similarly to the prior-art method based on dynamic parameters and the maximum likelihood criterion, the parametric speech synthesis of the present invention also comprises a training stage and a synthesis stage. The training stage extracts the acoustic parameters of the speech in the corpus and, from the extracted acoustic parameters, trains the statistical model corresponding to each phoneme in each context, forming the statistical model library of phonemes needed by the synthesis stage. Steps S210 to S260 belong to the synthesis stage, which mainly comprises three parts, text analysis, parameter prediction and speech synthesis; the parameter prediction part can be further divided into target model search, parameter generation and parameter optimization.
First, regarding the extraction of the acoustic parameters of the training corpus in the training stage, the main difference between the present invention and the existing parametric speech synthesis technique is this: the acoustic parameters extracted in the prior art contain dynamic parameters, whereas the acoustic parameters extracted in the present invention may be entirely static parameters, or may additionally include dynamic parameters characterizing the change of the parameters between adjacent frames, such as first- or second-order difference parameters, to improve the accuracy of the trained models.
Specifically, the acoustic parameters extracted from the corpus in the present invention include at least three kinds of static parameters, spectral envelope parameters, pitch frequency parameters and sub-band voicing parameters, and may optionally include other parameters such as formant frequencies.
The spectral envelope parameters may be linear prediction coefficients (LPC) or parameters derived from them, such as line spectral pairs (LSP), or cepstral parameters; they may also be the parameters (frequency, bandwidth, amplitude) of the first few formants, or discrete Fourier transform coefficients. Mel-domain variants of these spectral envelope parameters may also be used to improve the quality of the synthesized speech. The pitch frequency used is the logarithmic pitch frequency, and the sub-band voicing is the proportion of voiced energy in a sub-band.
Besides the above static parameters, the acoustic parameters extracted from the corpus may also include dynamic parameters characterizing the change of the acoustic parameters between adjacent frames, such as first- or second-order differences of the pitch frequency across neighbouring frames. During training, each phoneme has to be automatically aligned to a large number of speech segments in the corpus, and the acoustic parameter model corresponding to the phoneme is then estimated from these segments. Automatic alignment using both static and dynamic parameters is slightly more accurate than alignment using only static parameters, making the model parameters more accurate. However, since the present invention does not need the dynamic parameters of the models in the synthesis stage, only the static parameters are retained in the finally trained model library.
When the statistical models corresponding to the acoustic parameters of each phoneme in different contexts are trained from the extracted acoustic parameters, hidden Markov models (HMM, Hidden Markov Model) are used to model the acoustic parameters. Specifically, the spectral envelope parameters and the sub-band voicing parameters are modeled with continuous-probability-distribution HMMs, while the pitch frequency is modeled with multi-space probability distribution HMMs. This modeling scheme already exists in the prior art and is therefore only briefly described below.
The HMM is a typical statistical signal-processing method. Because of its stochastic nature, its ability to process input strings of unknown length, its ability to effectively avoid the segmentation problem, and the availability of a large number of fast and effective training and recognition algorithms, it is widely used in many fields of signal processing. The HMM structure used here is a five-state left-to-right model, and the observation probability distribution of each state is a single Gaussian density function. This function is uniquely determined by the mean and variance of the parameters. The mean consists of the means of the static parameters and of the dynamic parameters (first- and second-order differences); the variance consists of the variances of the static parameters and of the dynamic parameters (first- and second-order differences).
During training, one model is trained for each acoustic parameter of each phoneme according to its context information. To improve the robustness of model training, related phonemes need to be clustered according to their context information, for example with a decision-tree-based clustering method. After the models corresponding to the above acoustic parameters have been trained, these models are used to perform frame-to-state forced alignment of the speech in the training corpus; the duration information produced during alignment (i.e. the number of frames corresponding to each state) is then used to train decision-tree-clustered state duration models of the phonemes in different contexts. Finally, the statistical models corresponding to the acoustic parameters of every phoneme in different contexts form the statistical model library.
After training, the present invention keeps only the static mean parameters of each model in the model library, whereas the existing parametric speech synthesis method must keep the static mean parameters, the first-order difference mean parameters, the second-order difference mean parameters and the variance parameters corresponding to these parameters, resulting in a larger statistical model library. Practice shows that, in the present invention, the statistical model library that keeps only the static mean parameters of each model is only about one sixth the size of the library formed in the prior art, greatly reducing the storage space of the statistical model library. Although the removed data are indispensable in the existing parametric speech synthesis technique, they are not needed by the technical scheme provided by the present invention, so the reduction in the amount of data does not affect the implementation of the parametric speech synthesis of the present invention.
In the synthesis stage, the input text is first analyzed in order to extract from it a phoneme sequence containing context information (step S210), which serves as the basis of the parameter synthesis.
Here, the context information of a phoneme refers to information about the phonemes adjacent to the current phoneme; it can be the names of one or several preceding and following phonemes, and can also include information at other linguistic or phonological levels. For example, the context information of a phoneme includes the name of the current phoneme, the names of the two neighbouring phonemes and the tone or stress of the syllable it belongs to, and may optionally include the part of speech of the word it belongs to, and so on.
After the phoneme sequence with context information of the input text has been determined, one phoneme of the sequence is taken in turn, the statistical models corresponding to the acoustic parameters of that phoneme are searched for in the statistical model library, and the statistical models of the phoneme are then taken out frame by frame as the rough values of the speech parameters to be synthesized (step S220).
In the search for the target statistical models, the context annotation of the phoneme is fed into the clustering decision trees, which yields the statistical models corresponding to the spectral envelope parameters, the pitch frequency parameters, the sub-band voicing parameters and the state duration parameters. The state duration parameter is not a static acoustic parameter extracted from the original corpus; it is a new parameter generated during the state-to-frame alignment in training. Taking the stored static means from each state of the models in turn gives the static mean parameters corresponding to each parameter. The state duration means are used directly to decide how many frames each state of the phoneme to be synthesized should last, while the static means of the spectral envelope, pitch frequency and sub-band voicing are the rough values of the speech parameters to be synthesized (a small sketch of this expansion is given below).
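The following minimal Python sketch illustrates, under assumptions about the data layout (the function name, the list-of-vectors representation of state means and the toy durations are all hypothetical, not the patent's API), how the state-duration means could be used to expand the stored static means of each state into frame-level rough values.

```python
import numpy as np

# Hypothetical sketch: each HMM state contributes its static mean vector as the rough
# value for as many frames as the state-duration model predicts for that state.
def coarse_values_for_phoneme(state_means, state_durations):
    """state_means: per-state static mean vectors (spectral envelope, pitch, voicing, ...).
    state_durations: predicted number of frames for each state."""
    for mean, n_frames in zip(state_means, state_durations):
        for _ in range(int(n_frames)):
            yield np.asarray(mean)     # same static mean for every frame of the state

# Example: a 5-state phoneme model with 3-dimensional toy means
means = [np.full(3, float(k)) for k in range(5)]
durations = [2, 3, 4, 3, 2]
frames = list(coarse_values_for_phoneme(means, durations))
print(len(frames))   # 14 coarse frames for this phoneme
```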
After the rough values of the speech parameters to be synthesized have been determined, the determined rough values are filtered with a filter bank in order to predict the speech parameters (step S230). In this step, a set of dedicated filters is used to filter the spectral envelope, the pitch frequency and the sub-band voicing separately, so as to predict speech parameter values that give a better synthesis result.
The filtering used by the present invention in step S230 is a smoothing filtering method based on static parameters. Fig. 5 is a schematic diagram of the filter-based smoothed parameter prediction based on static parameters of the present invention. As shown in Fig. 5, this group of parameter prediction filters replaces the maximum likelihood parameter predictor of the existing parametric speech synthesis technique: a group of low-pass filters is used to predict, respectively, the spectral envelope parameters, the pitch frequency parameters and the sub-band voicing parameters of the speech to be synthesized. The processing is given by formula (1):

y_t = h_t * x_t  (1)

where t denotes the t-th frame, x_t is the rough value of a speech parameter at frame t obtained from the model, y_t is the value after filtering and smoothing, the operator * denotes convolution, and h_t is the impulse response of a filter designed in advance. Because the characteristics of different types of acoustic parameters differ, h_t can be designed differently for each.
For the spectral envelope parameters and the sub-band voicing parameters, the filter of formula (2) can be used:

y_t = a · y_{t−1} + (1 − a) · x_t  (2)

where a is a fixed filter coefficient designed in advance; the choice of a can be determined experimentally from how quickly the spectral envelope and sub-band voicing parameters change over time in real speech.
For the pitch frequency parameter, the filter of formula (3), of the same first-order recursive form but with a different coefficient, can be used to predict the parameter:

y_t = b · y_{t−1} + (1 − b) · x_t  (3)

where b is a fixed filter coefficient designed in advance; the choice of b can be determined experimentally from how quickly the pitch frequency changes over time in real speech.
It can be seen that the parameters involved when this group of filters predicts the speech parameters to be synthesized never extend to future parameters: the output frame at a given moment depends only on the input frames at and before that moment, or on the output frame of the previous moment, and is independent of future input or output frames, so that the RAM size needed by the filter bank can be fixed in advance. In other words, in the present invention, when formulas (2) and (3) are used to predict the acoustic parameters of speech, the output parameter of the current frame depends only on the input of the current frame and the output parameter of the previous frame.
Thus the whole parameter prediction process can be realized with a RAM buffer of fixed size, which does not grow with the duration of the speech to be synthesized, so speech parameters of arbitrary duration can be predicted continuously. This solves the prior-art problem that the RAM required for parameter prediction with the maximum likelihood criterion grows in proportion to the duration of the synthesized speech.
As can be seen from formulas (2) and (3), when the filter bank is used to smooth the rough value of the speech parameter to be synthesized at the current moment, the rough value can be filtered according to the rough value at that moment and the information of the speech frame at the previous moment, to obtain the smoothed speech parameter. Here, the information of the speech frame at the previous moment is the smoothed value of the speech parameter predicted at the previous moment.
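As a minimal sketch of the recursive smoothing of formula (2), assuming an arbitrary coefficient value of 0.4 chosen only for illustration, the following Python class shows that a single stored value (the previous output) is all the state the filter needs.

```python
# Minimal sketch of formula (2): y_t = a*y_{t-1} + (1-a)*x_t.
# Only the previous output is stored, so memory does not depend on utterance length.
class OnePoleSmoother:
    def __init__(self, a):
        self.a = a
        self.prev = None               # y_{t-1}; None until the first frame arrives

    def step(self, x):
        y = x if self.prev is None else self.a * self.prev + (1.0 - self.a) * x
        self.prev = y
        return y

smoother = OnePoleSmoother(a=0.4)      # 0.4 is an illustrative value, not from the patent
for x in [1.0, 1.0, 0.0, 0.0, 1.0]:
    print(round(smoother.step(x), 3))
```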
After the smoothed values of the speech parameters have been predicted, the global parameter optimizer can be used to optimize the smoothed speech parameters and thereby determine the optimized speech parameters (step S240).
To make the variance of the synthesized speech parameters consistent with the variance of the speech parameters in the training corpus and to improve the quality of the synthesized speech, the present invention adjusts the range of variation of the synthesized speech parameters during parameter optimization using formula (4):

y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t  (4)

where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, z_t is the finally optimized value, m is the mean of the synthesized speech, r is the ratio of the standard deviation of the training speech to that of the synthesized speech, and w is a fixed weight controlling the strength of the adjustment. In the existing parametric speech synthesis method, however, determining m and r requires the values of the speech parameter in all frames in order to compute the mean and variance; only then can the global variance model be used to adjust the parameters of all frames so that the variance of the adjusted synthesized speech parameters agrees with the global variance model and the sound quality is improved, as expressed by formula (5):

m = (1/T) · Σ_{t=1..T} y_t,  r = σ / σ'  (5)

where T denotes the total duration of the speech to be synthesized in frames, σ is the standard deviation of the speech parameter computed over all speech in the training corpus (provided by the global variance model), and σ' is the standard deviation of the speech parameters currently to be synthesized, which has to be recomputed for every piece of text that is synthesized. Because the computation of m and r needs the values of the speech parameter in all frames of the synthesized speech before adjustment, RAM is required to store the unoptimized parameters of all frames; the required RAM therefore grows with the duration of the speech to be synthesized, so a fixed-size RAM cannot satisfy the need of continuously synthesizing speech of arbitrary duration.
To overcome this defect of the prior art, the present invention redesigns the global parameter optimizer used when optimizing the parametric speech, using formula (6):

m = M
r = R  (6)

where M and R are constants whose values are the mean and the standard deviation ratio of a given parameter, measured separately from a large amount of synthesized speech. A preferred way of determining them is to synthesize, without global parameter optimization, a fairly long stretch of speech, for example about one hour, then compute the mean and the standard deviation ratio corresponding to each acoustic parameter with formula (5), and assign them as fixed values to the M and R of that acoustic parameter. It can be seen that the global parameter optimizer designed in the present invention contains a global mean and a global variance ratio: the global mean characterizes the mean of each acoustic parameter of the synthesized speech, and the global variance ratio characterizes the ratio between the variances of the parameters of the synthesized speech and of the training speech. With the global parameter optimizer of the present invention, each input frame of speech parameters can be optimized directly at every synthesis; it is no longer necessary to recompute the mean and standard deviation ratio of the speech parameters from all synthesized speech frames, so the values of all frames of the speech parameters to be synthesized need not be stored. A fixed RAM thus solves the problem of the existing parametric speech synthesis method that the RAM grows in proportion to the duration of the synthesized speech. In addition, the present invention adjusts every synthesized utterance with the same m and r, whereas the original method uses newly computed m and r in every synthesis, so the consistency between utterances synthesized from different texts is better in the present invention than in the original method. It is also evident that the computational complexity of the present invention is lower than that of the original method.
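The per-frame optimizer of formulas (4) and (6) can be sketched in a few lines of Python. The numeric constants below are made up for illustration; in practice M, R and the weight would be measured once offline from long synthesized speech, as described above.

```python
# Sketch of the per-frame global parameter optimizer of formulas (4) and (6):
# with m and r fixed to the offline constants M and R, each incoming smoothed value
# can be adjusted immediately, without access to the other frames of the utterance.
def optimize_frame(y, m, r, w):
    y_prelim = r * (y - m) + m        # preliminary optimization: rescale around the global mean
    return w * (y_prelim - y) + y     # blend with the unoptimized value using weight w

# Example with made-up constants (real values come from offline statistics)
M, R, W = 5.0, 1.2, 0.7
print(optimize_frame(4.0, M, R, W))   # 3.86
```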
After the optimized speech parameters have been determined, the parametric speech synthesizer can be used to synthesize the optimized speech parameters into one frame of speech waveform (step S250).
Fig. 6 is a schematic diagram of the mixed-excitation-based synthesis filter according to an embodiment of the present invention, and Fig. 7 is a schematic diagram of prior-art synthesis filtering based on a voiced/unvoiced decision. As shown in Figs. 6 and 7, the mixed-excitation-based synthesis filter adopted by the present invention uses the source-filter form, whereas the filter excitation in the prior art is a simple binary excitation.
In the existing parametric speech synthesis technique, the synthesis performed by the parameter synthesizer is parametric speech synthesis based on a voiced/unvoiced decision: a preset threshold is used to make a hard voiced/unvoiced decision, and each frame of synthesized speech is judged to be either voiced or unvoiced. This causes unvoiced frames to appear suddenly in the middle of some voiced segments, with a clearly audible distortion of the sound quality. In the synthesis filtering shown in Fig. 7, a voiced/unvoiced prediction is made before synthesizing the speech and the excitation is then generated accordingly: white noise is used as the excitation for unvoiced frames and a quasi-periodic pulse train for voiced frames, and this excitation is finally passed through the synthesis filter to obtain the waveform of the synthesized speech. Inevitably, this excitation method leads to a sharp, hard boundary in time between the synthesized unvoiced and voiced speech, so that the synthesized speech has clearly audible distortions.
In the mixed-excitation-based synthesis filtering provided by the present invention and shown in Fig. 6, however, multi-sub-band voiced/unvoiced mixed excitation is used: no voiced/unvoiced prediction is made, and instead the unvoiced and voiced contributions in each sub-band are mixed according to the degree of voicing, so that unvoiced and voiced speech no longer have a hard boundary in time, solving the problem of the original method that the sound quality is clearly distorted when unvoiced frames appear suddenly in the middle of some voiced segments. The voicing degree of the current frame of a sub-band can be extracted from the corpus speech by correlating the sub-band signal with itself at a delay of δ samples over the frame, where s_t is the value of the t-th speech sample of the current frame of the sub-band, s_{t+δ} is the value of the speech sample at an interval of δ from t, and τ is the number of samples in a frame; when δ is taken as the pitch period, the resulting correlation is the voicing degree of the current frame of the current sub-band.
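The exact formula for the voicing degree appears only as an image placeholder in the source, so the following Python sketch assumes the common normalized-autocorrelation definition, which is consistent with the variables described above (s_t, s_{t+δ}, a frame of τ samples, δ set to the pitch period); treat it as an assumption rather than the patent's exact expression.

```python
import numpy as np

# Assumed definition: normalized correlation of the sub-band frame with itself
# delayed by the pitch period delta; values near 1 mean strongly voiced.
def voicing_degree(subband_frame, delta):
    s = np.asarray(subband_frame, dtype=float)
    tau = len(s)                                   # number of samples in the frame
    a, b = s[:tau - delta], s[delta:]              # s_t and s_{t+delta}
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

fs, f0 = 16000, 200
t = np.arange(320) / fs
frame = np.sin(2 * np.pi * f0 * t)                 # strongly voiced toy signal
print(round(voicing_degree(frame, delta=fs // f0), 2))   # close to 1.0
```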
Specifically, as shown in Fig. 6, the speech parameters generated after global optimization are fed into the parametric speech synthesizer. First, a quasi-periodic pulse sequence is constructed from the pitch frequency parameter of the speech parameters, and a random sequence is constructed from white noise; then the voiced component of the signal is obtained from the constructed quasi-periodic pulse sequence through the voiced sub-band filter constructed from the voicing degrees, and the unvoiced component of the signal is obtained from the random sequence through the unvoiced sub-band filter constructed from the voicing degrees; adding the voiced and unvoiced components gives the mixed excitation signal. Finally, the mixed excitation signal is passed through the synthesis filter constructed from the spectral envelope parameters, and one frame of synthesized speech waveform is output.
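A simplified Python sketch of this per-frame mixed-excitation synthesis is given below. The band edges, filter orders and the use of an all-pole (LPC) filter for the spectral envelope are assumptions made for illustration; the patent only specifies the overall source-filter structure with per-sub-band voiced/unvoiced mixing.

```python
import numpy as np
from scipy.signal import butter, lfilter

# Simplified sketch of one frame of mixed-excitation synthesis (Fig. 6).
def synthesize_frame(f0, voicing, lpc_a, frame_len=320, fs=16000,
                     bands=((50, 1000), (1000, 3000), (3000, 7000))):
    pulses = np.zeros(frame_len)
    if f0 > 0:
        pulses[::max(1, int(fs // f0))] = 1.0       # quasi-periodic pulse train from the pitch
    noise = np.random.randn(frame_len)              # white-noise random sequence
    excitation = np.zeros(frame_len)
    for (lo, hi), v in zip(bands, voicing):         # mix voiced/unvoiced per sub-band
        b, a = butter(2, [lo, hi], btype="band", fs=fs)
        excitation += v * lfilter(b, a, pulses) + (1.0 - v) * lfilter(b, a, noise)
    return lfilter([1.0], lpc_a, excitation)        # spectral-envelope (synthesis) filter

frame = synthesize_frame(f0=200, voicing=[0.9, 0.6, 0.2], lpc_a=[1.0, -0.9])
print(frame.shape)                                  # (320,)
```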
Of course, after the optimized speech parameters have been determined, a voiced/unvoiced decision can still be made first, with mixed excitation used for voiced frames and only white noise for unvoiced frames. This scheme, however, also suffers from quality distortion caused by hard boundaries; the present invention therefore prefers the above embodiment, which makes no voiced/unvoiced prediction and uses multi-sub-band voiced/unvoiced mixed excitation.
Because of the advantage of the present invention in continuously synthesizing speech of arbitrary duration, the next frame of speech can be processed in a further loop after the output of one frame of speech waveform is completed. Since the optimized speech parameters of the next frame have not been generated in advance and stored in RAM, after the current frame has been processed it is necessary to return to step S220, take the rough values of the next frame of speech parameters of the phoneme from the model, and repeat steps S220 to S250 to perform the speech synthesis processing of the next frame of the phoneme before the speech waveform of the next frame can finally be output. This loop continues until the parameters of all frames of all phoneme models have been processed and all speech has been synthesized.
The above parametric speech synthesis method of the present invention can be implemented in software, in hardware, or in a combination of software and hardware. Fig. 8 shows a block diagram of a parametric speech synthesis system 800 according to another embodiment of the present invention. As shown in Fig. 8, the parametric speech synthesis system 800 comprises an input text analysis unit 830, a coarse search unit 840, a smoothing filter unit 850, a global optimization unit 860, a parametric speech synthesis unit 870 and a loop judgment unit 880. It may further comprise an acoustic parameter extraction unit and a statistical model training unit for corpus training (not shown in the figure).
The acoustic parameter extraction unit is used to extract the acoustic parameters of the speech in the training corpus; the statistical model training unit is used to train, from the acoustic parameters extracted by the acoustic parameter extraction unit, the statistical model corresponding to each acoustic parameter of each phoneme in different contexts, and to store these statistical models in the statistical model library. The input text analysis unit 830 is used to analyze the input text and obtain from the analysis a phoneme sequence containing context information; the coarse search unit 840 is used to take one phoneme of the phoneme sequence in turn, search the statistical model library for the statistical models corresponding to the acoustic parameters of the phoneme obtained by the input text analysis unit 830, and take out the statistical models of the phoneme frame by frame as the rough values of the speech parameters to be synthesized; the smoothing filter unit 850 is used to filter the rough values of the speech parameters to be synthesized with a filter bank to obtain smoothed speech parameters; the global optimization unit 860 is used to perform global parameter optimization, with a global parameter optimizer, on the speech parameters smoothed by the smoothing filter unit 850 to obtain optimized speech parameters; and the parametric speech synthesis unit 870 is used to synthesize, with a parametric speech synthesizer, the speech parameters optimized by the global optimization unit 860 and to output the synthesized speech.
The loop judgment unit 880 is connected between the parametric speech synthesis unit 870 and the coarse search unit 840 and is used, after the output of one frame of speech waveform is completed, to judge whether there is any unprocessed frame in the phoneme; if there is, the coarse search unit, the smoothing filter unit, the global optimization unit and the parametric speech synthesis unit are reused for the next frame of the phoneme to continue the loop of searching for the rough values of the statistical models corresponding to the acoustic parameters, filtering to obtain the smoothed values, global optimization, and parametric speech synthesis, until all frames of all phonemes in the phoneme sequence have been processed.
Since the optimized speech parameters of the next frame have not been generated in advance and stored in RAM, after the current frame has been processed it is necessary to return to the coarse search unit 840, take the next frame of the phoneme from the model, and reuse the coarse search unit 840, the smoothing filter unit 850, the global optimization unit 860 and the parametric speech synthesis unit 870 for the speech synthesis processing before the speech waveform of the next frame can finally be output. This loop continues until the parameters of all frames of all phonemes in the phoneme sequence have been processed and all speech has been synthesized.
Corresponding to the above method, in a preferred embodiment of the present invention the statistical model training unit further comprises an acoustic parameter model training unit, a clustering unit, a forced alignment unit, a state duration model training unit and a model statistics unit (not shown in the figure); specifically:
the acoustic parameter model training unit is used to train one model for each acoustic parameter of each phoneme according to the context information of the phoneme;
the clustering unit is used to cluster related phonemes according to the context information of the phonemes; the forced alignment unit is used to perform frame-to-state forced alignment of the speech in the training corpus using the models;
the state duration model training unit is used to train, using the duration information produced by the forced alignment unit during forced alignment, the clustered state duration models of the phonemes in different contexts;
the model statistics unit is used to form the statistical model library from the statistical models corresponding to the acoustic parameters of every phoneme in different contexts.
Fig. 9 shows a schematic diagram of the logical structure of a parametric speech synthesis unit according to a preferred embodiment of the present invention. As shown in Fig. 9, the parametric speech synthesis unit 870 further comprises a quasi-periodic pulse generator 871, a white noise generator 872, a voiced sub-band filter 873, an unvoiced sub-band filter 874, an adder 875 and a synthesis filter 876, wherein the quasi-periodic pulse generator 871 is used to construct a quasi-periodic pulse sequence from the pitch frequency parameter of the speech parameters; the white noise generator 872 is used to construct a random sequence from white noise; the voiced sub-band filter 873 is used to determine the voiced component of the signal from the constructed quasi-periodic pulse sequence according to the sub-band voicing; the unvoiced sub-band filter 874 is used to determine the unvoiced component of the signal from the random sequence according to the sub-band voicing; the voiced and unvoiced components are then added by the adder 875 to obtain the mixed excitation signal. Finally the mixed excitation signal is filtered by the synthesis filter 876 constructed from the spectral envelope parameters, and the corresponding frame of synthesized speech waveform is output. It can be seen that the synthesis method adopted by the present invention is vertical processing: the synthesis of every frame of speech goes through the four steps of fetching the rough value from the statistical model, filtering to obtain the smoothed value, global optimization to obtain the optimized value, and parametric synthesis to obtain the speech, and these four steps are repeated for each subsequent frame. The existing parametric speech synthesis method, in contrast, uses horizontal off-line processing: the rough parameters of all models are fetched, the smoothed parameters of all frames are generated with the maximum likelihood algorithm, the optimized parameters of all frames are obtained with the global variance model, and finally the speech of all frames is output from the parameter synthesizer. Compared with the existing method, in which every layer must store the parameters of all frames, the vertical processing of the present invention only needs to store a fixed amount of parameters required by the current frame, and therefore also solves the problem of the limited duration of synthesized speech caused by the horizontal processing of the original method.
In addition, by using only static parameters in the synthesis stage and no longer using dynamic parameters or variance information, the present invention reduces the size of the model library to about one sixth of that of the original method. By using a specially designed filter bank instead of the maximum likelihood method for smooth parameter generation, and a new global parameter optimizer instead of the global variance model of the original method for optimizing the speech parameters, combined with the vertical processing structure, the present invention achieves continuous prediction of speech parameters of arbitrary duration with a fixed-size RAM, solving the problem that the original method cannot continuously predict speech parameters of arbitrary duration on a chip with a small RAM, and helping to extend the application of speech synthesis methods to chips with small storage space. By using mixed voiced/unvoiced excitation at every moment instead of making a hard voiced/unvoiced decision before synthesizing the waveform as in the original method, the problem of quality distortion caused by unvoiced frames appearing suddenly in the middle of some voiced segments is solved, making the generated speech more consistent and coherent. A further embodiment of the present invention provides a parametric speech synthesis method; referring to Fig. 10, the method comprises:
in the synthesis stage, processing each frame of speech of each phoneme in the phoneme sequence of the input text in turn as follows:
101: for the current phoneme in the phoneme sequence of the input text, extracting the corresponding statistical model from the statistical model library, and taking the model parameters of that statistical model for the current frame of the current phoneme as the rough value of the currently predicted speech parameter;
102: filtering the rough value using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain the smoothed value of the currently predicted speech parameter;
103: globally optimizing the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
104: synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme.
Further, in this scheme the parameters involved in predicting the speech parameters to be synthesized never extend to future parameters: the output frame at a given moment depends only on the input frames at and before that moment, or on the output frame of the previous moment, and is independent of future input or output frames. Specifically, in step 102 the rough value can be filtered using the rough value and the information of the speech frame at the previous moment to obtain the smoothed value of the currently predicted speech parameter, where the information of the speech frame at the previous moment is the smoothed value of the speech parameter predicted at the previous moment.
Further, when the predicted speech parameter is a spectral envelope parameter or a sub-band voicing parameter, referring to formula (2) above, the scheme filters the rough value according to the following formula, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, to obtain the smoothed value of the currently predicted speech parameter:
y_t = a · y_{t−1} + (1 − a) · x_t.
When the predicted speech parameter is a pitch frequency parameter, referring to formula (3) above, the scheme filters the rough value according to the following formula, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, to obtain the smoothed value of the currently predicted speech parameter:
y_t = b · y_{t−1} + (1 − b) · x_t,
where in the above formulas t denotes the t-th frame, x_t denotes the rough value of the predicted speech parameter at frame t, y_t denotes the value of x_t after filtering and smoothing, a and b are the coefficients of the filters, and the values of a and b are different.
Further, in step 104 the scheme may specifically include the following processing: constructing a voiced sub-band filter and an unvoiced sub-band filter from the sub-band voicing parameters; passing the quasi-periodic pulse sequence constructed from the pitch frequency parameter through the voiced sub-band filter to obtain the voiced component of the speech signal; and passing the random sequence constructed from white noise through the unvoiced sub-band filter to obtain the unvoiced component of the speech signal;
adding the voiced and unvoiced components to obtain the mixed excitation signal; and passing the mixed excitation signal through a filter constructed from the spectral envelope parameters and outputting one frame of synthesized speech waveform.
Further, before the above synthesis stage the scheme also includes a training stage. In the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include static parameters and dynamic parameters; only the static model parameters are retained in the model parameters of the statistical models obtained after training;
step 101 of the synthesis stage may specifically include: according to the current phoneme, taking the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme as the rough value of the currently predicted speech parameter. A further embodiment of the present invention also provides a parametric speech synthesis system; referring to Fig. 11, the system comprises:
a loop synthesis device 110, used in the synthesis stage to perform speech synthesis on each frame of speech of each phoneme in the phoneme sequence of the input text in turn; the loop synthesis device 110 comprising:
a coarse search unit 111, used to extract, for the current phoneme in the phoneme sequence of the input text, the corresponding statistical model from the statistical model library, and to take the model parameters of that statistical model for the current frame of the current phoneme as the rough value of the currently predicted speech parameter;
a smoothing filter unit 112, used to filter the rough value using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain the smoothed value of the currently predicted speech parameter;
a global optimization unit 113, used to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
a parametric speech synthesis unit 114, used to synthesize the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme. Further, the smoothing filter unit 112 comprises a low-pass filter bank, used to filter the rough value using the rough value and the information of the speech frame at the previous moment to obtain the smoothed value of the currently predicted speech parameter, the information of the speech frame at the previous moment being the smoothed value of the speech parameter predicted at the previous moment. Further, when the predicted speech parameter is a spectral envelope parameter or a sub-band voicing parameter, the low-pass filter bank filters the rough value according to the formula y_t = a · y_{t−1} + (1 − a) · x_t, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, to obtain the smoothed value of the currently predicted speech parameter; when the predicted speech parameter is a pitch frequency parameter, the low-pass filter bank filters the rough value according to the formula y_t = b · y_{t−1} + (1 − b) · x_t, using the rough value and the smoothed value of the speech parameter predicted at the previous moment, to obtain the smoothed value of the currently predicted speech parameter, where t denotes the t-th frame, x_t denotes the rough value of the predicted speech parameter at frame t, y_t denotes the value of x_t after filtering and smoothing, a and b are the coefficients of the filters, and the values of a and b are different. Further, the global optimization unit 113 comprises a global parameter optimizer, used to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, using the following formulas, to generate the required speech parameter:
y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t
where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, w is a weight value, z_t is the required speech parameter obtained after global optimization, r is the statistically obtained global standard deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants. Further, the parametric speech synthesis unit 114 comprises:
a filter construction module, used to construct a voiced sub-band filter and an unvoiced sub-band filter from the sub-band voicing parameters;
the voiced sub-band filter, used to filter the quasi-periodic pulse sequence constructed from the pitch frequency parameter to obtain the voiced component of the speech signal;
the unvoiced sub-band filter, used to filter the random sequence constructed from white noise to obtain the unvoiced component of the speech signal;
an adder, used to add the voiced and unvoiced components to obtain the mixed excitation signal; and a synthesis filter, used to pass the mixed excitation signal through a filter constructed from the spectral envelope parameters and to output one frame of synthesized speech waveform. Further, the system also comprises a training device, used such that, in the training stage, the acoustic parameters extracted from the corpus include only static parameters, or include static parameters and dynamic parameters, and such that only the static model parameters are retained in the model parameters of the statistical models obtained after training;
the coarse search unit 111 is specifically used, in the synthesis stage, to take, according to the current phoneme, the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme as the rough value of the currently predicted speech parameter. For the related operations of the coarse search unit 111, the smoothing filter unit 112, the global optimization unit 113 and the parametric speech synthesis unit 114 in this embodiment of the present invention, reference may be made to the relevant content of the coarse search unit 840, the smoothing filter unit 850, the global optimization unit 860 and the parametric speech synthesis unit 870 in the above embodiments. As described above, the technical solution of the embodiments of the present invention provides a novel parametric speech synthesis scheme by using the information of the speech frames preceding the current frame together with the statistically pre-computed global mean and global standard deviation ratio of the speech parameters.
In the synthesis stage the scheme adopts a vertical processing manner, synthesizing each frame of speech one by one and storing during synthesis only a fixed amount of parameters needed by the current frame. This new vertical processing architecture can synthesize speech of arbitrary duration with a RAM of fixed size, significantly reducing the RAM capacity required for speech synthesis, and can therefore continuously synthesize speech of arbitrary duration on a chip with a small RAM.
The scheme can synthesize speech with high continuity, consistency and naturalness, and helps the popularization and application of speech synthesis methods on chips with small storage space. The parametric speech synthesis method and system according to the present invention have been described above by way of example with reference to the accompanying drawings. Those skilled in the art will understand, however, that various improvements can be made to the parametric speech synthesis method and system proposed above without departing from the content of the present invention. The scope of protection of the present invention should therefore be determined by the content of the appended claims.

Claims

Claims
1. A parametric speech synthesis method, comprising:
in a synthesis stage, processing each frame of speech of each phoneme in a phoneme sequence of an input text in turn as follows:
for a current phoneme in the phoneme sequence of the input text, extracting a corresponding statistical model from a statistical model library, and taking model parameters of the statistical model for the current frame of the current phoneme as a rough value of a currently predicted speech parameter;
filtering the rough value using the rough value and information of a predetermined number of speech frames before the current moment, to obtain a smoothed value of the currently predicted speech parameter;
globally optimizing the smoothed value of the currently predicted speech parameter according to a statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme.
2. The method according to claim 1, wherein filtering the rough value using the rough value and the information of a predetermined number of speech frames before the current moment, to obtain the smoothed value of the currently predicted speech parameter, specifically comprises:
filtering the rough value using the rough value and the information of the speech frame at the previous moment, to obtain the smoothed value of the currently predicted speech parameter;
wherein the information of the speech frame at the previous moment is the smoothed value of the speech parameter predicted at the previous moment.
3. The method according to claim 1, wherein
the smoothed value of the currently predicted speech parameter is globally optimized according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, using the following formulas, to generate the required speech parameter:
y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t
where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, w is a weight value, z_t is the required speech parameter obtained after global optimization, r is the statistically obtained global standard deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
4. The method according to claim 1, wherein synthesizing the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme comprises: constructing a voiced sub-band filter and an unvoiced sub-band filter from sub-band voicing parameters; passing a quasi-periodic pulse sequence constructed from a pitch frequency parameter through the voiced sub-band filter to obtain a voiced component of a speech signal;
passing a random sequence constructed from white noise through the unvoiced sub-band filter to obtain an unvoiced component of the speech signal;
adding the voiced component and the unvoiced component to obtain a mixed excitation signal;
passing the mixed excitation signal through a filter constructed from spectral envelope parameters and outputting one frame of synthesized speech waveform.
5. The method according to claim 1, wherein before the synthesis stage the method further comprises a training stage,
in which the acoustic parameters extracted from a corpus include only static parameters, or include static parameters and dynamic parameters;
only static model parameters are retained in the model parameters of the statistical models obtained after training; and taking the model parameters of the statistical model for the current frame of the current phoneme as the rough value of the currently predicted speech parameter in the synthesis stage specifically comprises:
according to the current phoneme, taking the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme as the rough value of the currently predicted speech parameter.
6. A parametric speech synthesis system, comprising: a loop synthesis device, configured to perform speech synthesis, in a synthesis stage, on each frame of speech of each phoneme in a phoneme sequence of an input text in turn;
the loop synthesis device comprising:
a coarse search unit, configured to extract, for a current phoneme in the phoneme sequence of the input text, a corresponding statistical model from a statistical model library, and to take model parameters of the statistical model for the current frame of the current phoneme as a rough value of a currently predicted speech parameter;
a smoothing filter unit, configured to filter the rough value using the rough value and information of a predetermined number of speech frames before the current moment, to obtain a smoothed value of the currently predicted speech parameter;
a global optimization unit, configured to globally optimize the smoothed value of the currently predicted speech parameter according to a statistically obtained global mean and global standard deviation ratio of the speech parameter, to generate the required speech parameter;
a parametric speech synthesis unit, configured to synthesize the generated speech parameters to obtain one frame of speech synthesized for the current frame of the current phoneme.
7. The system according to claim 6, wherein the smoothing filter unit comprises a low-pass filter bank,
the low-pass filter bank being configured to filter the rough value using the rough value and the information of the speech frame at the previous moment, to obtain the smoothed value of the currently predicted speech parameter; wherein the information of the speech frame at the previous moment is the smoothed value of the speech parameter predicted at the previous moment.
8. The system according to claim 6, wherein the global optimization unit comprises a global parameter optimizer,
the global parameter optimizer being configured to globally optimize the smoothed value of the currently predicted speech parameter according to the statistically obtained global mean and global standard deviation ratio of the speech parameter, using the following formulas, to generate the required speech parameter:
y'_t = r · (y_t − m) + m
z_t = w · (y'_t − y_t) + y_t
where y_t is the smoothed value of the speech parameter at time t before optimization, y'_t is the preliminarily optimized value, w is a weight value, z_t is the required speech parameter obtained after global optimization, r is the statistically obtained global standard deviation ratio of the predicted speech parameter, m is the statistically obtained global mean of the predicted speech parameter, and the values of r and m are constants.
9. The system according to claim 6, wherein the parametric speech synthesis unit comprises:
a filter construction module, configured to construct a voiced sub-band filter and an unvoiced sub-band filter from sub-band voicing parameters;
the voiced sub-band filter, configured to filter a quasi-periodic pulse sequence constructed from a pitch frequency parameter to obtain a voiced component of a speech signal;
the unvoiced sub-band filter, configured to filter a random sequence constructed from white noise to obtain an unvoiced component of the speech signal;
an adder, configured to add the voiced component and the unvoiced component to obtain a mixed excitation signal; and a synthesis filter, configured to pass the mixed excitation signal through a filter constructed from spectral envelope parameters and to output one frame of synthesized speech waveform.
10. The system according to claim 6, wherein the system further comprises a training device,
the training device being configured such that, in a training stage, the acoustic parameters extracted from a corpus include only static parameters, or include static parameters and dynamic parameters, and such that only static model parameters are retained in the model parameters of the statistical models obtained after training;
the coarse search unit being specifically configured, in the synthesis stage, to take, according to the current phoneme, the static model parameters of the statistical model obtained in the training stage for the current frame of the current phoneme as the rough value of the currently predicted speech parameter.
PCT/CN2011/081452 2011-08-10 2011-10-27 参数语音合成方法和系统 WO2013020329A1 (zh)

Priority Applications (5)

Application Number Priority Date Filing Date Title
JP2013527464A JP5685649B2 (ja) 2011-08-10 2011-10-27 パラメータ音声の合成方法及びシステム
DK11864132.3T DK2579249T3 (en) 2011-08-10 2011-10-27 PARAMETER SPEECH SYNTHESIS PROCEDURE AND SYSTEM
EP11864132.3A EP2579249B1 (en) 2011-08-10 2011-10-27 Parameter speech synthesis method and system
KR1020127031341A KR101420557B1 (ko) 2011-08-10 2011-10-27 파라미터 음성 합성 방법 및 시스템
US13/640,562 US8977551B2 (en) 2011-08-10 2011-10-27 Parametric speech synthesis method and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2011102290132A CN102270449A (zh) 2011-08-10 2011-08-10 参数语音合成方法和系统
CN201110229013.2 2011-08-10

Publications (1)

Publication Number Publication Date
WO2013020329A1 true WO2013020329A1 (zh) 2013-02-14

Family

ID=45052729

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/081452 WO2013020329A1 (zh) 2011-08-10 2011-10-27 参数语音合成方法和系统

Country Status (7)

Country Link
US (1) US8977551B2 (zh)
EP (1) EP2579249B1 (zh)
JP (1) JP5685649B2 (zh)
KR (1) KR101420557B1 (zh)
CN (2) CN102270449A (zh)
DK (1) DK2579249T3 (zh)
WO (1) WO2013020329A1 (zh)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103854643B (zh) * 2012-11-29 2017-03-01 株式会社东芝 用于合成语音的方法和装置
CN103226946B (zh) * 2013-03-26 2015-06-17 中国科学技术大学 一种基于受限玻尔兹曼机的语音合成方法
US9484015B2 (en) * 2013-05-28 2016-11-01 International Business Machines Corporation Hybrid predictive model for enhancing prosodic expressiveness
AU2015206631A1 (en) * 2014-01-14 2016-06-30 Interactive Intelligence Group, Inc. System and method for synthesis of speech from provided text
US9472182B2 (en) * 2014-02-26 2016-10-18 Microsoft Technology Licensing, Llc Voice font speaker and prosody interpolation
KR20160058470A (ko) * 2014-11-17 2016-05-25 삼성전자주식회사 음성 합성 장치 및 그 제어 방법
JP5995226B2 (ja) * 2014-11-27 2016-09-21 インターナショナル・ビジネス・マシーンズ・コーポレーションInternational Business Machines Corporation 音響モデルを改善する方法、並びに、音響モデルを改善する為のコンピュータ及びそのコンピュータ・プログラム
JP6483578B2 (ja) * 2015-09-14 2019-03-13 株式会社東芝 音声合成装置、音声合成方法およびプログラム
CN107924678B (zh) * 2015-09-16 2021-12-17 株式会社东芝 语音合成装置、语音合成方法及存储介质
EP3363015A4 (en) * 2015-10-06 2019-06-12 Interactive Intelligence Group, Inc. METHOD FOR FORMING THE EXCITATION SIGNAL FOR A PARAMETRIC SPEECH SYNTHESIS SYSTEM BASED ON GLOTTAL PULSE MODEL
CN105654939B (zh) * 2016-01-04 2019-09-13 极限元(杭州)智能科技股份有限公司 一种基于音向量文本特征的语音合成方法
US10044710B2 (en) 2016-02-22 2018-08-07 Bpip Limited Liability Company Device and method for validating a user using an intelligent voice print
JP6852478B2 (ja) * 2017-03-14 2021-03-31 株式会社リコー 通信端末、通信プログラム及び通信方法
JP7209275B2 (ja) * 2017-08-31 2023-01-20 国立研究開発法人情報通信研究機構 オーディオデータ学習装置、オーディオデータ推論装置、およびプログラム
CN107481715B (zh) * 2017-09-29 2020-12-08 百度在线网络技术(北京)有限公司 用于生成信息的方法和装置
CN107945786B (zh) * 2017-11-27 2021-05-25 北京百度网讯科技有限公司 语音合成方法和装置
US11264010B2 (en) 2018-05-11 2022-03-01 Google Llc Clockwork hierarchical variational encoder
EP3776531A1 (en) 2018-05-11 2021-02-17 Google LLC Clockwork hierarchical variational encoder
CN109036377A (zh) * 2018-07-26 2018-12-18 中国银联股份有限公司 一种语音合成方法及装置
CN108899009B (zh) * 2018-08-17 2020-07-03 百卓网络科技有限公司 一种基于音素的中文语音合成系统
CN109102796A (zh) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 一种语音合成方法及装置
CN109285535A (zh) * 2018-10-11 2019-01-29 四川长虹电器股份有限公司 基于前端设计的语音合成方法
CN109285537B (zh) * 2018-11-23 2021-04-13 北京羽扇智信息科技有限公司 声学模型建立、语音合成方法、装置、设备及存储介质
US11302301B2 (en) * 2020-03-03 2022-04-12 Tencent America LLC Learnable speed control for speech synthesis
CN111862931B (zh) * 2020-05-08 2024-09-24 北京嘀嘀无限科技发展有限公司 一种语音生成方法及装置
US11495200B2 (en) * 2021-01-14 2022-11-08 Agora Lab, Inc. Real-time speech to singing conversion
CN112802449B (zh) * 2021-03-19 2021-07-02 广州酷狗计算机科技有限公司 音频合成方法、装置、计算机设备及存储介质
CN113160794B (zh) * 2021-04-30 2022-12-27 京东科技控股股份有限公司 基于音色克隆的语音合成方法、装置及相关设备
CN115440205A (zh) * 2021-06-04 2022-12-06 中国移动通信集团浙江有限公司 语音处理方法、装置、终端以及程序产品
CN113571064B (zh) * 2021-07-07 2024-01-30 肇庆小鹏新能源投资有限公司 自然语言理解方法及装置、交通工具及介质
CN114822492B (zh) * 2022-06-28 2022-10-28 北京达佳互联信息技术有限公司 语音合成方法及装置、电子设备、计算机可读存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101178896A (zh) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 基于声学统计模型的单元挑选语音合成方法
US7478039B2 (en) * 2000-05-31 2009-01-13 At&T Corp. Stochastic modeling of spectral adjustment for high quality pitch modification
CN101369423A (zh) * 2007-08-17 2009-02-18 株式会社东芝 语音合成方法和装置
US7996222B2 (en) * 2006-09-29 2011-08-09 Nokia Corporation Prosody conversion

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03102399A (ja) * 1989-09-18 1991-04-26 Fujitsu Ltd 規則音声合成装置
AU1941697A (en) * 1996-03-25 1997-10-17 Arcadia, Inc. Sound source generator, voice synthesizer and voice synthesizing method
GB0112749D0 (en) * 2001-05-25 2001-07-18 Rhetorical Systems Ltd Speech synthesis
US6912495B2 (en) * 2001-11-20 2005-06-28 Digital Voice Systems, Inc. Speech model and analysis, synthesis, and quantization methods
US20030135374A1 (en) * 2002-01-16 2003-07-17 Hardwick John C. Speech synthesizer
CN1262987C (zh) * 2003-10-24 2006-07-05 无敌科技股份有限公司 母音间转音的平滑处理方法
WO2006032744A1 (fr) * 2004-09-16 2006-03-30 France Telecom Procede et dispositif de selection d'unites acoustiques et procede et dispositif de synthese vocale
US20060129399A1 (en) * 2004-11-10 2006-06-15 Voxonic, Inc. Speech conversion system and method
US20060229877A1 (en) * 2005-04-06 2006-10-12 Jilei Tian Memory usage in a text-to-speech system
JP4662139B2 (ja) * 2005-07-04 2011-03-30 ソニー株式会社 データ出力装置、データ出力方法、およびプログラム
CN1835075B (zh) * 2006-04-07 2011-06-29 安徽中科大讯飞信息科技有限公司 一种结合自然样本挑选与声学参数建模的语音合成方法
US8321222B2 (en) * 2007-08-14 2012-11-27 Nuance Communications, Inc. Synthesis by generation and concatenation of multi-form segments
KR100932538B1 (ko) * 2007-12-12 2009-12-17 한국전자통신연구원 음성 합성 방법 및 장치
EP2357646B1 (en) * 2009-05-28 2013-08-07 International Business Machines Corporation Apparatus, method and program for generating a synthesised voice based on a speaker-adaptive technique.
US20110071835A1 (en) * 2009-09-22 2011-03-24 Microsoft Corporation Small footprint text-to-speech engine
GB2478314B (en) * 2010-03-02 2012-09-12 Toshiba Res Europ Ltd A speech processor, a speech processing method and a method of training a speech processor
US20120143611A1 (en) * 2010-12-07 2012-06-07 Microsoft Corporation Trajectory Tiling Approach for Text-to-Speech

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7478039B2 (en) * 2000-05-31 2009-01-13 At&T Corp. Stochastic modeling of spectral adjustment for high quality pitch modification
US7996222B2 (en) * 2006-09-29 2011-08-09 Nokia Corporation Prosody conversion
CN101369423A (zh) * 2007-08-17 2009-02-18 株式会社东芝 语音合成方法和装置
CN101178896A (zh) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 基于声学统计模型的单元挑选语音合成方法

Also Published As

Publication number Publication date
KR101420557B1 (ko) 2014-07-16
DK2579249T3 (en) 2018-05-28
US8977551B2 (en) 2015-03-10
EP2579249A1 (en) 2013-04-10
US20130066631A1 (en) 2013-03-14
JP2013539558A (ja) 2013-10-24
EP2579249B1 (en) 2018-03-28
CN102270449A (zh) 2011-12-07
CN102385859A (zh) 2012-03-21
JP5685649B2 (ja) 2015-03-18
CN102385859B (zh) 2012-12-19
EP2579249A4 (en) 2015-04-01
KR20130042492A (ko) 2013-04-26

Similar Documents

Publication Publication Date Title
WO2013020329A1 (zh) 参数语音合成方法和系统
CN109147758B (zh) 一种说话人声音转换方法及装置
CN112420026B (zh) 优化关键词检索系统
Ardaillon et al. Fully-convolutional network for pitch estimation of speech signals
KR20150016225A (ko) 타겟 운율 또는 리듬이 있는 노래, 랩 또는 다른 가청 표현으로의 스피치 자동 변환
WO2014183411A1 (en) Method, apparatus and speech synthesis system for classifying unvoiced and voiced sound
GB2603776A (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
WO2015025788A1 (ja) 定量的f0パターン生成装置及び方法、並びにf0パターン生成のためのモデル学習装置及び方法
CN116994553A (zh) 语音合成模型的训练方法、语音合成方法、装置及设备
CN108369803A (zh) 用于形成基于声门脉冲模型的参数语音合成系统的激励信号的方法
JP6594251B2 (ja) 音響モデル学習装置、音声合成装置、これらの方法及びプログラム
JP4945465B2 (ja) 音声情報処理装置及びその方法
CN116168678A (zh) 语音合成方法、装置、计算机设备和存储介质
CN111862931B (zh) 一种语音生成方法及装置
JP7088796B2 (ja) 音声合成に用いる統計モデルを学習する学習装置及びプログラム
CN112951256A (zh) 语音处理方法及装置
CN112164387A (zh) 音频合成方法、装置及电子设备和计算机可读存储介质
CN111739547B (zh) 语音匹配方法、装置、计算机设备和存储介质
JP6234134B2 (ja) 音声合成装置
CN111696530B (zh) 一种目标声学模型获取方法及装置
US20140343934A1 (en) Method, Apparatus, and Speech Synthesis System for Classifying Unvoiced and Voiced Sound
JP6587308B1 (ja) 音声処理装置、および音声処理方法
CN114005467A (zh) 一种语音情感识别方法、装置、设备及存储介质
Galajit et al. ThaiSpoof: A Database for Spoof Detection in Thai Language
CN117765898A (zh) 一种数据处理方法、装置、计算机设备及存储介质

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 13640562

Country of ref document: US

WWE Wipo information: entry into national phase

Ref document number: 2011864132

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2013527464

Country of ref document: JP

Kind code of ref document: A

Ref document number: 20127031341

Country of ref document: KR

Kind code of ref document: A

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11864132

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE