CN104916282B - Speech synthesis method and apparatus - Google Patents

Speech synthesis method and apparatus

Info

Publication number
CN104916282B
CN104916282B
Authority
CN
China
Prior art keywords
syllable
fundamental frequency
model
segment
parameters
Prior art date
Legal status
Active
Application number
CN201510142395.3A
Other languages
Chinese (zh)
Other versions
CN104916282A
Inventor
王愈
李健
张连毅
武卫东
Current Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Original Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority date
Filing date
Publication date
Application filed by BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority to CN201510142395.3A
Publication of CN104916282A
Application granted
Publication of CN104916282B
Status: Active


Abstract

An embodiment of the present invention provides a speech synthesis method and apparatus, and a training method and apparatus for a fundamental frequency model. The speech synthesis method includes: performing segment-model decision on each segment in a text to be synthesized to determine a baseline HTS fundamental frequency model corresponding to each segment; performing syllable-model decision on each syllable in the text to be synthesized to determine a continuous-voiced-segment fundamental frequency model corresponding to each syllable; jointly generating fused fundamental frequency parameters from the baseline HTS fundamental frequency models corresponding to the segments and the continuous-voiced-segment fundamental frequency models corresponding to the syllables according to a multi-layer fusion algorithm; and synthesizing speech according to the fused fundamental frequency parameters and corresponding spectral parameters. The embodiments of the present invention can improve the accuracy of pitch, bringing the prosody closer to real speech.

Description

Speech synthesis method and apparatus
Technical Field
The present invention relates to the technical field of speech recognition, and more particularly, to a speech synthesis method and apparatus and to a training method and apparatus for a fundamental frequency model.
Background
With the development of speech synthesis technology, the sound quality, naturalness, and intelligence of synthesized speech have improved greatly, and HTS (HMM-based Speech Synthesis System) technology has become the core technology of today's speech synthesis industry. The HMM (Hidden Markov Model), a statistical analysis model, was founded in the 1970s, and was propagated and developed in the 1980s, becoming an important direction of signal processing and being successfully applied to speech recognition.
HTS technology can be divided into two stages: the training stage and the synthesis stage. In the training stage, acoustic parameters (spectral parameters and fundamental frequency parameters) are extracted from speech with signal analysis tools, and HMMs are then built over the acoustic parameters with the segment as the granularity. In the synthesis stage, the Markov chain of the whole sentence is constructed, the spectral parameters, fundamental frequency parameters, and durations are generated on it by the maximum-likelihood principle, and a signal synthesizer reconstructs the speech.
Traditional HTS usually takes the segment, i.e., the initial or the final, as the speech granularity unit in both the training and synthesis stages. However, synthesizing speech with so small a granularity unit makes the synthesized prosodic effect flat and stiff, with a large gap from real speech. In addition, decision-tree clustering merges context types that were originally finely distinguished into coarser subclasses, each lumped together under a single Gaussian model, losing many individual details and causing 'over-averaging' of the fundamental frequency parameters; parameter linkage between states further aggravates the over-averaging problem, and the over-averaged fundamental frequency parameters make the tone of every character mechanical and lacking in variation, with an obvious machine style.
Summary of the Invention
The technical problem to be solved by the embodiments of the present invention is to provide a speech synthesis method and apparatus and a training method and apparatus for a fundamental frequency model that can improve the accuracy of pitch, so as to bring the prosody closer to real speech.
To solve the above problems, the present invention discloses a speech synthesis method, including:
performing segment-model decision on each segment in a text to be synthesized, and determining a baseline HTS fundamental frequency model corresponding to each segment;
performing syllable-model decision on each syllable in the text to be synthesized, and determining a continuous-voiced-segment fundamental frequency model corresponding to each syllable;
jointly generating fused fundamental frequency parameters from the baseline HTS fundamental frequency model corresponding to each segment and the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to a multi-layer fusion algorithm; and
synthesizing speech according to the fused fundamental frequency parameters and corresponding spectral parameters.
Preferably, the step of performing syllable-model decision on each syllable in the text to be synthesized and determining the continuous-voiced-segment fundamental frequency model corresponding to each syllable includes:
performing syllable fundamental frequency model prediction on each syllable in the text to be synthesized;
determining an optimal syllable fundamental frequency model for each syllable based on a multi-channel preference method of trend-line fitting; and
generating the continuous-voiced-segment fundamental frequency model according to the optimal syllable fundamental frequency models of the syllables.
Preferably, the step of generating the trend line includes:
determining a plurality of syllable fundamental frequency candidate models for each syllable in the text to be synthesized; and
fitting a straight line to the plurality of syllable fundamental frequency candidate models in a two-dimensional space by the least-squares criterion, the straight line being the trend line.
Preferably, generating the continuous-voiced-segment fundamental frequency model according to the optimal syllable fundamental frequency models of the syllables includes:
merging the optimal syllable fundamental frequency models of the syllables in order, taking continuous voiced segments as units; and
obtaining the continuous-voiced-segment fundamental frequency models by duration-weighted averaging of the Gaussian models corresponding to each continuous voiced segment.
Preferably, the method further includes:
controlling the intonation of the speech synthesis according to the trend line.
Preferably, the multi-layer fusion algorithm jointly computes over the parameter set of the state-layer models and the parameter set of the continuous-voiced-segment models according to the respective optimality criteria of the state layer and the continuous-voiced-segment layer.
According to another aspect of the present invention, a training method for a syllable fundamental frequency model is provided, including:
extracting acoustic parameters from speech samples, the acoustic parameters including fundamental frequency parameters;
generating syllable fundamental frequency mean parameters according to the fundamental frequency parameters; and
training multiple sets of syllable fundamental frequency models according to the syllable fundamental frequency mean parameters.
Preferably, generating the syllable fundamental frequency mean parameters according to the fundamental frequency parameters includes:
extracting features from the fundamental frequency parameters taking the syllable as a unit, and generating the syllable fundamental frequency mean parameters by computing the statistical mean per syllable.
Preferably, the step of training multiple sets of syllable fundamental frequency models according to the syllable fundamental frequency mean parameters includes:
generating segment-level context information and syllable-level context information for the speech samples, respectively, according to the various annotations in the speech corpus; and
training the multiple sets of syllable fundamental frequency models from the syllable fundamental frequency mean parameters in combination with the syllable context information.
According to another aspect of the present invention, a speech synthesis apparatus is provided, including:
a segment-model decision module, configured to perform segment-model decision on each segment in a text to be synthesized and determine a baseline HTS fundamental frequency model corresponding to each segment;
a syllable-model decision module, configured to perform syllable-model decision on each syllable in the text to be synthesized and determine a continuous-voiced-segment fundamental frequency model corresponding to each syllable;
a fusion-parameter generation module, configured to jointly generate fused fundamental frequency parameters from the baseline HTS fundamental frequency models corresponding to the segments and the continuous-voiced-segment fundamental frequency models corresponding to the syllables according to a multi-layer fusion algorithm; and
a speech synthesis module, configured to synthesize speech according to the fused fundamental frequency parameters and corresponding spectral parameters.
According to yet another aspect of the present invention, a training apparatus for a syllable fundamental frequency model is provided, including:
an acoustic-parameter extraction module, configured to extract acoustic parameters from speech samples, the acoustic parameters including fundamental frequency parameters;
a syllable-parameter generation module, configured to generate syllable fundamental frequency mean parameters according to the fundamental frequency parameters; and
a syllable fundamental frequency model training module, configured to train multiple sets of syllable fundamental frequency models according to the syllable fundamental frequency mean parameters.
Compared with the prior art, the embodiments of the present invention have the following advantages:
The embodiment of the present invention adds the continuous voiced segment as a high-level granularity unit in the synthesis stage, jointly generates fused fundamental frequency parameters from the baseline HTS fundamental frequency model corresponding to each segment and the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to a multi-layer fusion algorithm, and synthesizes speech according to the fused fundamental frequency parameters and corresponding spectral parameters. Because the fused fundamental frequency parameters are the result of joint generation by the baseline HTS models and the high-level (continuous-voiced-segment fundamental frequency) models according to the multi-layer fusion algorithm, they can both retain the baseline HTS features through the baseline HTS fundamental frequency models and further correct the tones and prosody of the speech through the high-level models; the accuracy of pitch can therefore be improved, bringing the prosody closer to real speech.
Description of the Drawings
Fig. 1 shows a schematic diagram of traditional HMM speech modeling;
Fig. 2 shows a flowchart of the steps of an embodiment of a speech synthesis method of the present invention;
Fig. 3 shows a flowchart of the steps of performing syllable fundamental frequency model decision on each syllable in the text to be synthesized and determining the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to the present invention;
Fig. 4 shows a schematic diagram of trend-line generation according to the present invention;
Fig. 5a shows a flowchart of the steps of an example of speech synthesis according to the present invention;
Fig. 5b shows a system flowchart of speech synthesis according to the present invention;
Fig. 6 shows a flowchart of the steps of a training method for a syllable fundamental frequency model in speech synthesis according to the present invention;
Fig. 7 shows a schematic diagram of context-based decision-tree clustering according to the present invention;
Fig. 8 shows a system flowchart of fundamental frequency model training for speech synthesis according to the present invention;
Fig. 9 shows a schematic diagram of an experimental sentence trained and synthesized with the syllable as the high-level granularity unit;
Fig. 10 shows a partially enlarged schematic diagram of the experimental sentence trained and synthesized with the syllable as the high-level granularity unit;
Fig. 11 shows a schematic diagram of an experimental sentence trained and synthesized with the continuous voiced segment as the high-level granularity unit according to the present invention;
Fig. 12 shows a structural block diagram of a speech synthesis apparatus of the present invention; and
Fig. 13 shows a structural block diagram of a training apparatus for a syllable fundamental frequency model of the present invention.
Detailed Description
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
For ease of understanding, the speech concepts involved in the present invention are explained first:
Fundamental frequency: speech signals can be divided into two broad classes, voiced and unvoiced. Only voiced speech has a fundamental frequency; voiced excitation is a periodic pulse train, and the frequency of that pulse train is the fundamental frequency (F0). Owing to physiological differences in the vocal organs, the fundamental frequency ranges of men and women differ: typically, the range for men is 50~250 Hz, for women 120~500 Hz, and for infants roughly 250~800 Hz; the fundamental frequency of a newborn's cry is higher still.
Syllable and segment (speech granularity units): in Mandarin Chinese, one character is one syllable; a syllable may consist of an initial plus a final, or of a final alone. In the embodiments of the present invention, initials and finals are collectively referred to as segments. The speech synthesis field usually models all segments uniformly.
Prosodic phrase: a prosodic unit of speech with a complete intonation structure, ending in a pause for breath; colloquially, a stretch of speech uttered without a break. Under ordinary declarative intonation the pitch trends downward, a phenomenon called pitch declination. A sentence contains one or more prosodic phrases.
In the training stage, traditional HTS regards the acoustic characterization of a segment as a random process varying over time; this process passes through a certain number of states, connected into a Markov chain by probabilistic transitions. The signal is assumed to stay stable within each state and is statistically described with a GMM (Gaussian Mixture Model). Referring to Fig. 1, a schematic diagram of traditional HMM speech modeling is shown, in which one stretch of speech is described with a 5-state HMM: the upper half is the HMM, the lower half is the speech parameter data, and the two correspond section by section. In the HMM, Sx is a state, axy is the transition probability between states, and bx is the generation probability of the data (called the observations) given the state. In addition, the duration dx of each state is statistically described with a Gaussian model (durations are counted when HMM training completes, in units of the time shift of the signal analysis window, called frames).
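For concreteness, the following is a minimal sketch, in Python, of the data structure just described (the names are illustrative, not from the patent):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Gaussian:
    mean: float
    var: float

@dataclass
class State:
    output: Gaussian    # b_x: generation probability of the observed frames
    duration: Gaussian  # d_x: state duration in frames, Gaussian-modeled

@dataclass
class SegmentHMM:
    states: List[State]   # S_1..S_5, a left-to-right chain
    trans: List[float]    # a_xy: transition probabilities along the chain
```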
In the synthesis stage, the text to be synthesized is fed into a text analysis module, which performs phonetic, prosodic, and grammatical analysis with various natural-language algorithms, annotates pronunciations, generates the segment sequence, and produces context information for each segment. This is then fed into the model decision module, which, based on the context information, determines a subclass Gaussian model for each state of each segment; the duration of each state is determined with a similar algorithm. Next, the Gaussian models of the states of the segments are concatenated in series, each state is expanded by its duration (repeated the corresponding number of times) to form the model sequence of the whole sentence, which is fed into the parameter generation module, and the optimal parameters (fundamental frequency and spectral parameters) are solved under the maximum-likelihood optimization criterion. Finally, the resulting acoustic parameters are fed into the synthesizer to generate speech.
However, the existing HTS technology statistically describes the acoustic parameters at state granularity, and also generates acoustic parameters at state granularity during synthesis. The state, unlike real phonetic or phonological units, cannot embody the connections between characters or between phrases, so the synthesized prosody is not natural enough.
On the basis of the existing HTS technology, the present invention adds high-level models on top of the existing state-layer models and uses the high-level models to guide the state-layer models (the baseline HTS models), so that the prosody comes closer to real speech.
Embodiment One
Referring to Fig. 2, a flowchart of the steps of an embodiment of a speech synthesis method of the present invention is shown, which may specifically include the following steps:
Step 201: perform segment-model decision on each segment in the text to be synthesized, and determine the baseline HTS fundamental frequency model corresponding to each segment.
Specifically, the baseline HTS fundamental frequency model corresponding to each segment may be determined according to the traditional HTS algorithm. A segment may specifically be an initial or a final, or a smaller phoneme; the present invention does not limit this.
Step 202: perform syllable-model decision on each syllable in the text to be synthesized, and determine the continuous-voiced-segment fundamental frequency model corresponding to each syllable.
The present invention adds a high-level model on the basis of the existing HTS technology and uses the high-level model to guide the baseline HTS model, so that the prosody comes closer to real speech. In the phonetic sense the syllable is the smallest unit with a consistent structural specification, but the inventors found that when high-level models are trained with the syllable as the high-level granularity unit, tones tremble and distort at continuous 'voiced-voiced' junctions. The inventors therefore creatively propose to take the continuous voiced segment as the high-level granularity unit in the synthesis stage, and to jointly generate fused fundamental frequency parameters from the continuous-voiced-segment fundamental frequency models and the baseline HTS fundamental frequency models according to a multi-layer fusion algorithm. In this way, the fused fundamental frequency parameters not only retain the features of the baseline HTS models but are also readjusted in overall direction by the high-level model, i.e., the continuous-voiced-segment fundamental frequency model, so that the finally synthesized tones are more accurate and the prosody is closer to real speech.
Because in the training stage the present invention trains multiple syllable fundamental frequency models for each syllable, when model decision is performed on each syllable in the text to be synthesized, the most suitable syllable fundamental frequency model can be chosen among the multiple candidates; for example, an optimal syllable fundamental frequency model can be selected by a preset algorithm, or a suitable model can be selected for the current syllable according to actual needs. The chosen syllable fundamental frequency models of the syllables are then merged in order, taking continuous voiced segments as units, and within each segment the syllable fundamental frequency models of the syllables are averaged with their respective durations as weights to obtain the continuous-voiced-segment fundamental frequency model.
Step 203: jointly generate fused fundamental frequency parameters from the baseline HTS fundamental frequency model corresponding to each segment and the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to a multi-layer fusion algorithm.
Specifically, the baseline HTS fundamental frequency models corresponding to the segments and the continuous-voiced-segment fundamental frequency models corresponding to the syllables may each be concatenated into a whole sentence and fed into the parameter generation module based on multi-layer fusion. Two model sequences of equal length thus converge at the parameter generation module, which jointly generates the final fundamental frequency parameters according to the multi-layer fusion algorithm; for ease of description, the present invention calls these the fused fundamental frequency parameters. The fused fundamental frequency parameters are fed into the acoustic synthesizer and generate speech together with the spectral parameters.
The fused fundamental frequency parameters are the result of joint generation by the baseline HTS fundamental frequency models and the high-level (continuous-voiced-segment) fundamental frequency models; they can be generated comprehensively by trading off the respective optimality criteria of the two layers, so that the generated result comes entirely from the statistical models without any forced modification, avoiding the rough fundamental frequency curves that forced correction can cause. Moreover, on the premise of retaining the existing HTS features, the fused fundamental frequency parameters can be corrected under the effect of the high-level model, i.e., the continuous-voiced-segment fundamental frequency model, so that the synthesized tones are finally more accurate and the prosody is closer to real speech.
In an application example of the present invention, let the parameter set of the state-layer HMMs be λpm, with optimal path Qmax under output fundamental frequency parameters O; let the parameter set of the high-level (continuous-voiced-segment) models be λsm; and let the fused fundamental frequency parameters finally generated jointly by the two be C. The optimal solution of C must maximize the following joint probability distribution:
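The original formula (1) appears only as an image in the patent; a reconstruction consistent with the symbols above and with standard HTS maximum-likelihood parameter generation (the output parameters being O = WpmC) is:

$$\hat{C} = \underset{C}{\arg\max}\; P\left(W_{pm}C \mid Q_{max}, \lambda_{pm}\right) \cdot P\left(W_{sm}C \mid \lambda_{sm}\right)^{k} \qquad (1)$$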
where the subscripts pm and sm denote the state-layer and high-level systems, respectively, and k is a weight controlling how strongly the high-level information acts; in general, the larger k is, the greater the influence of the high-level information on the fused fundamental frequency parameters C.
On the voiced/unvoiced question, the strategy of baseline HTS is retained: whether each frame of the whole sentence is voiced or unvoiced is predetermined by the MSD (Multi-Space Probability Distribution) of the state-layer models; the fundamental frequency solution involves only voiced frames, and unvoiced frames are not computed but signalled to the synthesizer directly with a flag bit. Thus C is the front-to-back accumulation of the voiced frames of the whole sentence, skipping the unvoiced frames (segments).
The state-layer system is the algorithmic structure of baseline HTS: λpm is the model parameter set decided from the context of each segment, including the Gaussian model and duration model of each state; Wpm computes adjacent differences on the parameters C (the 0th-order static values, 1st-order differences over the preceding and following frames, and 2nd-order differences over the preceding, current, and following frames), and its concrete structure is not repeated here; Qmax is determined by the duration model of each state, so the state chain can be expanded in advance, i.e., the Gaussian model of each state is repeated for the number of frames given by its duration.
The high-level system is the algorithmic structure newly added by the present invention. The high-level models are trained with the fundamental frequency statistics within high-level granularity units as training samples; the training algorithm is similar to the basic HTS algorithm, namely HMM training combined with context clustering, and only the model structure differs: in the state-layer models each segment is a concatenation of 5 states, whereas in the high-level models each granularity unit contains only a single state. λsm is the model parameter set decided from the context of each high-level granularity unit. As mentioned above, to guarantee training-sample coverage, the invention first trains and makes decisions with the syllable as the granularity unit and then merges by continuous voiced segments to obtain the Gaussian model of each segment. Durations follow the state layer (the syllable durations are jointly computed from the state durations, and the durations of the continuous voiced segments are in turn jointly computed from these) to guarantee close alignment when the two layers are expanded and concatenated.
Wsm computes segment-wise means on the fundamental frequency parameters C; its structure is shown in formula (2) below, where the number of rows equals the number of high-level units, i.e., the number of continuous voiced segments. Each row is a window function averaging the fundamental frequencies of the voiced frames within the current unit's range; the total number of columns equals the length of C; the nonzero values correspond to the voiced frames within the unit, the zeros to the voiced frames of other units (which do not take effect), and each row sums to 1. Viewed column-wise, each column has only one nonzero value, indicating that each frame belongs to only one high-level unit. Formula (2) describes a sentence containing 8 voiced frames in total and 3 continuous voiced segments whose voiced-frame counts are 2, 1, and 5 in turn: the fundamental frequency characteristic value of the first segment is the average of the fundamental frequencies of the first two frames, that of the second segment directly equals the fundamental frequency of the third frame, and that of the third segment is the average of the fundamental frequencies of the last five frames.
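Formula (2) likewise appears only as an image; from the worked example just described, its reconstruction is:

$$W_{sm} = \begin{pmatrix} \frac{1}{2} & \frac{1}{2} & 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} & \frac{1}{5} \end{pmatrix} \qquad (2)$$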
To find the optimal parameters C, take the logarithm of both sides of the joint probability formula (1) and take the partial derivative with respect to C, which yields the following equation:
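This equation is also shown only as an image in the original; under the definitions that follow, the standard closed form is:

$$\left(W_{pm}^{\top} U_{pm}^{-1} W_{pm} + k\, W_{sm}^{\top} U_{sm}^{-1} W_{sm}\right) C = W_{pm}^{\top} U_{pm}^{-1} M_{pm} + k\, W_{sm}^{\top} U_{sm}^{-1} M_{sm}$$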
where Upm and Mpm are the overall covariance matrix and overall mean matrix accumulated in block-diagonal structure from the Gaussian models of the voiced frames of the state layer, and Usm and Msm are the overall covariance matrix and overall mean matrix accumulated in block-diagonal structure from the Gaussian models of the high-level units corresponding to the voiced frames.
Both Wpm and Wsm can be regarded as sets of window functions, each row being a window function applied to some part of C; they differ only in window length and shift speed. Every three consecutive rows of Wpm act on the same frame of C, computing the three orders of differences in turn with a window length of 3, then shift back one frame and compute the three orders of differences of the next frame. Each row of Wsm acts on one continuous voiced segment in C, computing the mean of all valid fundamental frequency values within that segment's range with a window length equal to the number of voiced frames, then shifts back to the next continuous voiced segment. The newly added high-level algorithmic structure is therefore isomorphic to the original state layer, which provides convenience for the concrete implementation of the present invention.
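Because of this isomorphism, the fused track reduces to a single linear solve. The following is a minimal NumPy sketch of that solve under the reconstructed equation above; all names are illustrative assumptions (U_pm_inv and U_sm_inv stand for the accumulated inverse covariance matrices, m_pm and m_sm for the stacked mean vectors of the two layers):

```python
import numpy as np

def build_wsm(voiced_segment_lengths):
    """Segment-wise averaging matrix Wsm: one row per continuous voiced
    segment, averaging the voiced frames within that segment's range."""
    total = sum(voiced_segment_lengths)
    W = np.zeros((len(voiced_segment_lengths), total))
    start = 0
    for row, n in enumerate(voiced_segment_lengths):
        W[row, start:start + n] = 1.0 / n
        start += n
    return W

def fuse_f0(W_pm, U_pm_inv, m_pm, W_sm, U_sm_inv, m_sm, k=1.0):
    """Solve the two-layer system for the fused F0 track C: the state layer
    constrains statics and deltas, the high layer the per-segment means."""
    A = W_pm.T @ U_pm_inv @ W_pm + k * (W_sm.T @ U_sm_inv @ W_sm)
    b = W_pm.T @ U_pm_inv @ m_pm + k * (W_sm.T @ U_sm_inv @ m_sm)
    return np.linalg.solve(A, b)

# For the worked example above, build_wsm([2, 1, 5]) reproduces the
# 3-by-8 matrix of formula (2).
```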
The above application example jointly generates the fused fundamental frequency parameters from the baseline HTS fundamental frequency models and the continuous-voiced-segment fundamental frequency models. In concrete implementations, those skilled in the art may, according to actual needs, select various other high-level models for multi-layer fusion with the baseline HTS fundamental frequency models to generate the final acoustic parameters; the present invention places no specific restriction on the type or number of the high-level models.
Step 204: synthesize speech according to the fused fundamental frequency parameters and the corresponding spectral parameters.
In practical applications, sound quality is determined by the spectral parameters, while prosody is embodied jointly by the fundamental frequency parameters and the segment durations. The present invention therefore mainly improves the generation method of the fundamental frequency parameters, so that the pitch curve of the synthesized speech comes closer to natural pronunciation and achieves an undulating, well-cadenced prosodic effect. The model sequence of the spectral parameters continues to be computed according to the baseline HTS algorithm flow, i.e., it is concatenated into the whole sentence and the final spectral parameters are generated by the traditional parameter generation algorithm.
In a concrete implementation, the fused fundamental frequency parameters and the spectral parameters corresponding to the text to be synthesized may be fed into an acoustic synthesizer to synthesize speech.
In summary, the embodiment of the present invention adds the continuous voiced segment as a high-level granularity unit in the synthesis stage, jointly generates fused fundamental frequency parameters from the baseline HTS fundamental frequency model corresponding to each segment and the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to the multi-layer fusion algorithm, and synthesizes speech according to the fused fundamental frequency parameters and corresponding spectral parameters. Because the fused fundamental frequency parameters are the result of joint generation by the baseline HTS models and the high-level (continuous-voiced-segment fundamental frequency) models according to the multi-layer fusion algorithm, they both retain the baseline HTS features through the baseline HTS fundamental frequency models and further correct the tones and prosody of the speech through the high-level models, so the accuracy of pitch can be improved and the prosody brought closer to real speech.
Embodiment Two
On the basis of Embodiment One above, the speech synthesis method of this embodiment may further include the following optional technical solutions.
Referring to Fig. 3, a flowchart is shown of the steps of performing syllable fundamental frequency model decision on each syllable in the text to be synthesized and determining the continuous-voiced-segment fundamental frequency model corresponding to each syllable according to the present invention, which may specifically include:
Step 301: perform syllable fundamental frequency model prediction on each syllable in the text to be synthesized.
Because in the training stage the present invention trains multiple syllable fundamental frequency models for each syllable, the multiple models must be decided one by one; for the multiple syllable fundamental frequency models corresponding to each syllable of the text to be synthesized, the globally planned method of trend-line fitting selects one optimal model for each syllable.
Step 302: determine the optimal syllable fundamental frequency model of each syllable based on the multi-channel preference method of trend-line fitting.
To improve the accuracy of the continuous-voiced-segment fundamental frequency models, the key is to improve the quality of the syllable fundamental frequency models. Precision must of course first be improved in every link of model training, but because clustering has an innate 'over-averaging' defect, precision loss is still unavoidable. The present invention proposes the idea of multi-channel preference, synthesizing multiple channel sources to improve the hit rate: information from different sources has its own strengths and weaknesses, the channels complement one another, and skeleton information is extracted on the greatest-common-divisor principle, so part of the lost detail can be recovered, compensating for the fundamental frequency over-averaging caused by clustering.
In a preferred embodiment of the present invention, the trend line may be generated as follows:
determining a plurality of syllable fundamental frequency candidate models for each syllable in the text to be synthesized; and
fitting a straight line to the plurality of syllable fundamental frequency candidate models in a two-dimensional space by the least-squares criterion, the straight line being the trend line.
Referring to Fig. 4, a schematic diagram of trend-line generation of the present invention is shown. For each syllable in the text to be synthesized, multiple syllable fundamental frequency mean candidate models are planned out through different channels; then, from a geometric viewpoint on the two-dimensional space of the whole sentence (as in Fig. 4, each fundamental frequency model corresponding to each syllable is regarded as a sample point in the space, with the syllable index as its abscissa and its mean parameter as its ordinate), a straight line is fitted by the least-squares criterion. This line embodies the overall tendency of the point set in the space and is called the trend line. For each column (the candidate models of one syllable), the point (optimal model) nearest the trend line is selected by a certain criterion as the final model decision result of that syllable.
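As an illustration of the fitting step only, the following is a minimal sketch assuming each channel proposes one mean-F0 value per syllable (the names are illustrative, not from the patent):

```python
import numpy as np

def fit_trend_line(candidates):
    """Least-squares trend line over all candidate points.
    candidates[i]: list of candidate mean-F0 values for syllable i;
    abscissa = syllable index, ordinate = candidate mean value."""
    xs = np.concatenate([np.full(len(c), i, dtype=float)
                         for i, c in enumerate(candidates)])
    ys = np.concatenate([np.asarray(c, dtype=float) for c in candidates])
    slope, intercept = np.polyfit(xs, ys, deg=1)
    return slope, intercept
```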
In an application example of the present invention, an example of applying the multi-channel preference method based on trend-line fitting of the present invention is given as follows.
Specifically, the multi-channel preference method based on trend-line fitting of the present invention may include two links: multi-channel construction and fitting-based preference.
In terms of multi-channel construction, besides the basic syllable fundamental frequency mean candidate models trained with the syllable as the training object, three more high-level channels are added, specifically two models of the same class and one reference-datum model. With the same syllable-mean training data, the same model structure (single-state HMM), and the same algorithm, but with the training object changed to the segment, different results can be obtained through this new algorithmic approach; these are provisionally called segment fundamental frequency mean models, whose physical meaning may be taken as the overall pitch situation of the syllable in which the segment (initial/final) sits under a specific context. When model decision is performed on a syllable: according to its context, decision among the syllable fundamental frequency mean models yields the syllable-channel model; according to the context of its final, decision among the segment fundamental frequency mean models yields the final-channel model; if there is an initial, decision among the segment fundamental frequency mean models according to the context of the initial yields the initial-channel model, and otherwise (no initial) the decision result of the syllable fundamental frequency mean model is copied to serve as the initial-channel model. In addition, the state-layer information also participates, as a conservative reference that keeps the syllable-layer result from deviating too far: in the model sequence already concatenated at the state layer, the mean parameters of the models within the syllable's range are averaged to serve as the state-layer-channel syllable fundamental frequency mean. What it reflects is the overall absolute pitch of the state-layer models; it is a value rather than a model, yet it has the same standing in the two-dimensional space as the other three channels, except that when it is selected, the fundamental frequency of that syllable is generated without high-level information, computed purely by the state layer itself.
In this way, each syllable has four channel candidates, and each column in the two-dimensional space has four points. In Fig. 4, the circles are the three model channels and the triangle is the state-layer reference point. Next, the greatest-common-divisor information of the point set in the space is extracted by means of curve fitting.
According to phonological theory, intonation exhibits pitch declination of varying degrees within the range of a prosodic phrase: the topline embodies focus and stress, while the baseline declines strictly, and a reset of the baseline marks a prosodic phrase boundary. Based on such theory, we consider that the tendency of the candidate point set in the two-dimensional space within each prosodic phrase range can be fitted with a straight line. The sentence in Fig. 4 is very short and is itself one prosodic phrase, so the trend line is fitted by the least-squares algorithm with all candidate points as data; the downward slope of the fitted trend line confirms the pitch-declination theory.
What the trend line reflects is global trunk information, and taking it as the standard for point selection is a kind of global planning. For each syllable (the four points in one column of the two-dimensional space), the basic criterion is to select the point nearest the trend line, on top of which a voting criterion is preset: only when the three circles all stand on the opposite bank from the triangle (distributed on the two sides of the trend line) is selection on the majority (circle) side allowed; otherwise selection is allowed only on the triangle's side. This criterion embodies deference to the state-layer reference value and can exclude outlier circles. If the state-layer reference value is selected, the current syllable needs no high-level guidance or adjustment and is computed purely from the state layer's own model sequence. The chosen points are marked with an X in Fig. 4 and handed, as the final syllable decision results, to the subsequent model concatenation module.
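A sketch of this selection rule (three model-channel values, the 'circles', plus one state-layer reference, the 'triangle', per syllable; the handling of edge cases is an assumption paraphrased from the description above):

```python
def select_for_syllable(index, circles, triangle, slope, intercept):
    """Pick the candidate nearest the trend line, subject to the voting rule:
    the circles' side is eligible only when all three circles oppose the
    triangle across the line; otherwise stay on the triangle's side."""
    target = slope * index + intercept
    tri_above = triangle > target
    if all((c > target) != tri_above for c in circles):
        pool = list(circles)
    else:
        pool = [c for c in circles if (c > target) == tri_above] + [triangle]
    return min(pool, key=lambda v: abs(v - target))
```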
In summary, the multi-channel preference method based on trend-line fitting of the present invention plans out multiple high-level fundamental frequency models through different channels and then prefers the best model sequence according to the globally fitted trend line; that is, the trend line can determine whether the state-layer model or the high-level model should currently be chosen and whether the current syllable needs the guidance and adjustment of the high-level model. Information from different channel sources has its own strengths and weaknesses and is complementary, and extracting skeleton information on the greatest-common-divisor principle improves the hit rate, guiding the finally synthesized prosody closer to real speech; part of the lost detail can also be recovered, compensating for the fundamental frequency over-averaging caused by clustering.
In another preferred embodiment of the present invention, the method may further include:
controlling the intonation of the speech synthesis according to the trend line.
Further, the trend line also has the function of actively controlling intonation. By actively controlling the slope of the trend line, the selection result of each syllable can be influenced, thereby controlling the overall pitch of each syllable and shaping the intonation of the whole sentence. Because the fundamental frequency mean of each syllable is still preferred among the four channels of statistical models according to context, it is not conjured out of nothing; compared with geometric intonation models that act directly on the pitch envelope of the whole sentence, this is more authentic and reliable and less prone to an artificial feel. For example, the trend line can be tilted upward to realize a mildly interrogative tone, or changed into a broken line to realize a rhetorical tone, achieving a multi-tone, emotion-enriching effect to a certain degree.
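As a sketch of how such active control might be expressed (the slope offsets below are illustrative assumptions, not values from the patent):

```python
def shaped_trend(slope, intercept, n_syllables, style="declarative"):
    """Return a target-height function over the syllable index, against
    which the per-syllable candidates are then reselected."""
    if style == "question":                       # tilt the line upward
        return lambda i: intercept + (slope + 2.0) * i
    if style == "rhetorical":                     # broken line: fall, then rise
        turn = n_syllables // 2
        return lambda i: (intercept + slope * i if i <= turn
                          else intercept + slope * turn + 3.0 * (i - turn))
    return lambda i: intercept + slope * i        # plain declination
```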
Step 303: generate the continuous-voiced-segment fundamental frequency model according to the optimal syllable fundamental frequency models of the syllables.
Specifically, the optimal syllable fundamental frequency models of the syllables are arranged in order and then merged in order, taking continuous voiced segments as units; within each segment, the Gaussian models of the syllables are averaged with their respective durations as weights to obtain the fundamental frequency mean Gaussian model of the continuous voiced segment. The model and duration of each continuous voiced segment thus obtained are concatenated into the sentence, including expansion by repetition according to duration, and fed into the parameter generation module based on multi-layer fusion to generate the continuous-voiced-segment fundamental frequency model.
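A minimal sketch of this merge step, assuming each chosen syllable model is a (mean, variance) Gaussian with a known duration (weighting the variances the same way as the means is an assumption; the patent specifies only duration-weighted averaging of the Gaussian models):

```python
import numpy as np

def merge_voiced_segment(models, durations):
    """Duration-weighted average of the chosen per-syllable Gaussians
    within one continuous voiced segment; returns the segment's
    single-state F0 mean Gaussian as (mean, variance)."""
    w = np.asarray(durations, dtype=float)
    w = w / w.sum()
    mean = float(w @ np.asarray([m for m, _ in models]))
    var = float(w @ np.asarray([v for _, v in models]))
    return mean, var
```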
Referring to Fig. 5a, a flowchart of the steps of an example of speech synthesis of the present invention is shown, which may specifically include the following steps; Fig. 5b shows a system flowchart of speech synthesis of the present invention.
Step 501: feed the segment sequence and the syllable sequence with context information generated by text analysis into the segment model system (the baseline HTS basic algorithm) and the syllable fundamental frequency model system, respectively, which independently decide the state-level model sequence (spectral parameters, fundamental frequency, and state durations) and the syllable-level model-set sequence (fundamental frequency means).
Step 502: generate the fused fundamental frequency parameters according to the multi-layer fusion method.
Specifically, on the one hand, the syllable system has multiple sets of models that must be decided one by one, so each syllable of the text to be synthesized obtains multiple models; the globally planned method of trend-line fitting then selects one optimal model for each syllable, the optimal models of the syllables are arranged in order and merged in order taking continuous voiced segments as units, and the Gaussian models of the syllables within each segment are averaged with their respective durations as weights to obtain the fundamental frequency mean Gaussian model of the continuous voiced segment. The models and durations of the continuous voiced segments thus obtained are concatenated into the sentence, including expansion by repetition according to duration, and fed into the parameter generation module based on multi-layer fusion.
On the other hand, the fundamental frequency models decided by baseline HTS are also concatenated into the sentence and fed into the parameter generation module based on multi-layer fusion. The two model sequences of equal length converge at the parameter generation module and, by the multi-layer fusion algorithm of the present invention, jointly generate the fused fundamental frequency parameters, which are fed into the synthesizer.
Step 503: generate the model sequence of the spectral parameters according to the baseline HTS algorithm.
The model sequence of the spectral parameters continues to be computed according to the baseline HTS algorithm flow, i.e., it is concatenated into the whole sentence and the final spectral parameters are generated by the original parameter generation algorithm and fed into the synthesizer.
Step 504: synthesize speech according to the fused fundamental frequency parameters and the spectral parameters.
Embodiment Three
Referring to Fig. 6, a flowchart of the steps of a training method for a syllable fundamental frequency model in speech synthesis of the present invention is shown, which may specifically include:
Step 601: extract acoustic parameters from speech samples; the acoustic parameters may specifically include fundamental frequency parameters.
Specifically, in the training stage, the annotated speech corpus can be processed as samples: on the one hand, acoustic parameters (fundamental frequency and spectral parameters) are extracted from the speech files therein; on the other hand, context information (such as the position within the prosodic phrase and the part of speech of the containing word) is generated for each segment according to the various annotations therein (phonetic, prosodic, grammatical, and other types). The acoustic parameters and context information, together with a description question set for classifying segments by context, are fed into the model training module to train a context-based hidden Markov model set with the segment as the unit.
After the acoustic parameters are extracted from the speech samples, the segment model corresponding to each segment can first be trained according to the acoustic parameters.
Specifically, the segment models can be trained by the original HTS basic algorithm according to the acoustic parameters, the segment context information, and the segment context-clustering question set; the segment models include the fundamental frequency models, the spectral parameter models, and the state duration models. In real speech, segments vary with their prosodic roles, grammatical roles, and neighbouring sounds; these influencing factors are collectively called the context, and the same initial/final may differ greatly under different contexts. To improve modeling accuracy, the same initial/final is usually classified according to context, assuming that under each specific context the acoustic characterization of the segment has relatively fixed characteristics, and each class is modeled and described separately. Such models are called context-based HMMs.
However, because contexts are so varied, the classification results are excessively fragmentary, and real training data cannot cover so enormous a number of types; even the types that are covered may have only a few rare training samples, making effective statistical modeling impossible. A decision-tree clustering mechanism can therefore be introduced to cluster the fragmentary context types automatically into larger subclasses according to the distribution of the training data, on the one hand guaranteeing that each subclass has enough training samples, and on the other hand pursuing clearer distinctions between different subclasses. Referring to Fig. 7, a schematic diagram of context-based decision-tree clustering of the present invention is shown; each state can be clustered independently, and one Gaussian model is estimated within the range of each clustered subclass. Certain states can be shared between different models to further overcome data sparsity, and the state durations can be modeled with the same clustering. In a concrete implementation, to further increase the reuse of the training data and overcome data sparsity, the state transition probability matrices of certain states can also be bound and shared.
After decision-tree clustering, the decision-tree files for the fundamental frequency, the spectral parameters, and the state durations are obtained.
Step 602: generate syllable fundamental frequency mean parameters according to the fundamental frequency parameters.
Because HMM-based statistical parametric speech synthesis statistically describes the acoustic parameters at segment granularity, and from the macroscopic viewpoint of phonology the state granularity is excessively microscopic, statistical description at that level sees the trees but misses the forest: it cannot portray the macroscopic pitch-variation trajectories between characters, between words, or even across phrase ranges. Yet it is exactly at the suprasegmental level that the pitch of natural speech embodies a great deal of prosodic and semantic information; the speech synthesis of the prior art therefore yields flat, stiff prosody with a large gap from real speech. Consequently, more information must be supplemented during fundamental frequency generation, and this information should come from a macroscopic granularity level, which can be regarded as high-level information relative to the microscopic state level. The embodiment of the present invention therefore adds the syllable as a high-level granularity unit in the training stage of speech synthesis.
Specifically, features are further extracted from the fundamental frequency parameters at the high-level (syllable) granularity; that is, the syllable fundamental frequency mean parameters are generated by computing the statistical mean per syllable.
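For illustration, a sketch of this extraction, assuming a frame-level F0 track in which unvoiced frames are coded as 0 and assuming known per-syllable frame boundaries (both conventions are assumptions):

```python
import numpy as np

def syllable_f0_means(f0, syllable_bounds):
    """Mean F0 per syllable over voiced frames only (unvoiced frames == 0).
    f0: per-frame F0 values; syllable_bounds: (start, end) frame indices."""
    f0 = np.asarray(f0, dtype=float)
    means = []
    for start, end in syllable_bounds:
        voiced = f0[start:end][f0[start:end] > 0]
        means.append(float(voiced.mean()) if voiced.size else 0.0)
    return np.asarray(means)
```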
Step 603: train multiple sets of syllable fundamental frequency models according to the syllable fundamental frequency mean parameters.
In practical applications, the embodiment of the present invention places no particular limit on how many sets of syllable fundamental frequency models are trained. Embodiment Two above takes 4 sets of syllable fundamental frequency models as an example for illustration, but in concrete implementations more sets may be trained so that the synthesized pitch is more accurate and the prosody closer to real speech; that is, the embodiment of the present invention does not limit the specific number of the syllable fundamental frequency models.
Specifically, the step of training multiple sets of syllable fundamental frequency models according to the syllable fundamental frequency mean parameters may include:
generating segment-level context information and syllable-level context information for the speech files, respectively, according to the various annotations in the speech corpus; and
training the multiple sets of syllable fundamental frequency models from the syllable fundamental frequency mean parameters in combination with the syllable context information.
The high-level system is the algorithmic structure newly added by the present invention. The high-level models are trained with the fundamental frequency statistics within high-level granularity units as training samples; the training algorithm is similar to the basic HTS algorithm, namely HMM training combined with context clustering, and only the model structure differs, each model containing only a single state.
After decision-tree clustering is performed on the multiple sets of trained syllable fundamental frequency models, the decision-tree files of the syllable fundamental frequency can be obtained.
The embodiment of the present invention trains multiple sets of syllable fundamental frequency models with different objects in the training stage, in order to prepare for the multi-channel preference over high-level models in the synthesis stage. Accordingly, two sets of context information, by segment and by syllable, are generated according to the various annotations in the corpus. Here the syllable fundamental frequency means are used, with the syllable/segment context information and classification question sets as two classified-description systems respectively, to train multiple sets of syllable fundamental frequency mean models. The training datum of each syllable is simply its fundamental frequency mean, which can be a constant value involving no notion of process; the purpose of the modeling is to describe the fundamental frequency mean magnitude under different contexts. For convenience, however, the baseline HTS algorithmic structure is retained and the syllable fundamental frequency mean is described with a single-state HMM, the key significance lying in the context clustering.
In summary, in the embodiment of the present invention, the syllable is added as a high-level granularity unit in the training stage of speech synthesis; features are further extracted from the original fundamental frequency parameters at the high-level (syllable) granularity, i.e., syllable fundamental frequency mean parameters are generated by computing the statistical mean per syllable, and multiple sets of syllable fundamental frequency models are trained according to the syllable fundamental frequency mean parameters, preparing for the multi-channel preference over high-level models in the synthesis stage. This solves the problem that the existing HTS statistically describes the acoustic parameters only at segment granularity, yielding flat, stiff prosody with a large gap from real speech.
In an application example of the present invention, referring to Fig. 8, a system flowchart of fundamental frequency model training for speech synthesis of the present invention is shown.
Specifically, in the training stage, acoustic parameters (fundamental frequency and spectral parameters) are first extracted from the speech files in the corpus, and then features are further extracted from the fundamental frequency parameters at the high-level granularity, i.e., by computing the statistical mean per syllable. On the other hand, two sets of context information, by segment and by syllable, are generated according to the various annotations in the corpus. From the acoustic parameters extracted in the first step, the segment context information, and the context-clustering question set, the segment models are trained (the baseline HTS basic algorithm), including the models of the fundamental frequency and spectral parameters and the state duration models. Because another innovation of the present invention is the multi-channel preference over high-level models, multiple sets of candidate models must be trained with different objects in the training stage. Accordingly, the syllable fundamental frequency means here are used, with the syllable/segment context information and classification question sets as two classified-description systems respectively, to train multiple sets of syllable fundamental frequency mean models: the training datum of each syllable is simply its fundamental frequency mean, a constant value involving no notion of process, and the purpose of the modeling is only to describe the fundamental frequency mean magnitude under different contexts; but for convenience the baseline HTS algorithmic structure is retained and the syllable fundamental frequency mean is described with a single-state HMM, the key significance lying in the context clustering.
Embodiment five
In an application example of the present invention, referring to Fig. 9, a schematic diagram of a test sentence trained and synthesized with the syllable as the high-level granularity unit is shown. Curve 1 is the pitch contour of the original recording; step dotted line 2 is the mean taken over each syllable range of the original recording (silent sections are meaningless and their display is inaccurate); curve 3 is the fundamental frequency parameters generated by the baseline HTS algorithm; step solid line 4 is the model mean determined by the context-dependent HMM sequence decision at the syllable layer; and curve 5 is the fundamental frequency parameters finally generated by the multilayer fusion. Perceptually, the tone of the two syllables between roughly moments 410 and 500 (whose content is "mystery") sounds quite strange. Magnifying this section yields Figure 10, a close-up schematic of the test sentence trained and synthesized with the syllable as the high-level granularity unit. In Fig. 10, envelope 3 is essentially identical to envelope 1 in shape and absolute height, whereas envelope 5 differs greatly from both: the first syllable is translated almost uniformly upward under the guidance of step solid line 4; the second syllable should be translated slightly downward under the same guidance, but its first half runs in the opposite direction and is distorted, evidently influenced by the first syllable. The source of this mutual influence is the adjacent dynamic (delta) features: when the optimal result is solved under the maximum-likelihood criterion, the adjacent deltas act as a smoothness constraint, pushing consecutive points of the final result to be as smooth as possible.
To solve this problem, the relationship with the adjacent delta features must be coordinated. In the parameter generation algorithm of the state layer, sentence beginnings, sentence endings, and voiced/unvoiced boundaries are treated as interruption boundaries of the adjacent deltas: at these locations the first-order and second-order delta variances are set to infinity, which is equivalent to invalidating the adjacent delta constraints there. In most cases two finals are separated by an unvoiced initial, so the adjacent deltas are interrupted where the former final ends and the latter begins; the two finals do not affect each other, and the two syllables can be adjusted independently. If, however, a final is followed by a voiced initial (l, m, n, r) or a zero initial (i.e., directly by another final), the adjacent deltas are unbroken at the syllable boundary, a strong interaction occurs within this local range, and the two ends blend into each other. The case of "mystery" in the example above is exactly that of an adjacent voiced initial. Extensive listening tests show that the tone distortion problem indeed arises mainly in these two cases, voiced initials and zero initials.
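A minimal sketch of the variance manipulation described above, under the standard MLPG view in which each variance enters the generation criterion through its inverse (precision): setting the delta variance to infinity at an interruption frame makes its precision zero, so the smoothness constraint across that frame vanishes. The function and argument names are illustrative assumptions.

import numpy as np

def delta_precisions(delta_var, interrupt_frames):
    """Per-frame precisions for the delta stream of parameter generation.

    delta_var: 1-D array of delta variances, one per frame.
    interrupt_frames: frame indices at sentence boundaries and
    voiced/unvoiced transitions where the delta constraint must be cut.
    Infinite variance corresponds to zero precision, i.e., no constraint.
    """
    prec = 1.0 / np.asarray(delta_var, dtype=float)
    prec[interrupt_frames] = 0.0   # variance -> infinity, constraint disabled
    return prec

dv = np.full(8, 0.25)
print(delta_precisions(dv, [0, 3, 7]))   # precision 4.0 except at frames 0, 3, 7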
From the standpoint of the speech production mechanism, as a speaker finishes one final, the mouth and lip shapes already begin transitioning toward the next syllable. If the next syllable has a voiced initial or a zero initial, the vocal cords do not stop vibrating; the current inertia carries over into the pitch transition of the next syllable, and this continues as long as the following syllables are of the same kind, stopping only when an unvoiced initial is encountered. When the mouth, lips, and nose finish that unvoiced initial, the vocal cords restart and the next stretch begins. Seen this way, in terms of the excitation source, several consecutive syllables containing voiced or zero initials form one temporal unit, corresponding to one continuous voiced segment in the fundamental frequency envelope, whereas in the common case of unvoiced initials each syllable contains one isolated voiced segment. Within each voiced segment the fundamental frequency varies smoothly. Adjusting adjacent syllables inside one voiced segment in different directions, or by widely different amplitudes, neither conforms to the speech production mechanism nor avoids distorting the normal tones through the adjacent deltas.
Based on the above analysis, the embodiment of the present invention proposes to take the continuous voiced segment as the high-level granularity unit for fundamental frequency statistics, model training, and fusion with the state layer. Within each continuous voiced segment only one high-level model is provided, which guides the whole state-layer planning over that range so that the adjustment follows a single direction and amplitude, avoiding distortion and unsmoothness within the segment. Referring to Figure 11, a schematic diagram of a test sentence trained and synthesized with the continuous voiced segment as the high-level granularity unit is shown. It can be seen that step solid line 4 is a constant value over the range of the continuous voiced segment "mystery", guiding envelope 5 to adjust uniformly over the whole segment; the result is comparatively reasonable and much closer to real speech.
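The grouping rule described above can be sketched as follows, assuming Pinyin-style initials and taking shen-mi as a stand-in for the "mystery" example; the names and the input format are illustrative assumptions, not part of the patent.

VOICED_INITIALS = {"l", "m", "n", "r"}   # voiced initials named in the text

def group_voiced_segments(syllables):
    """Group consecutive syllables into continuous voiced segments.

    syllables: list of (initial, syllable) pairs. A syllable whose initial
    is voiced (l, m, n, r) or empty (zero initial) continues the previous
    segment; an unvoiced initial starts a new segment.
    """
    segments = []
    for initial, syl in syllables:
        joins_previous = segments and (initial in VOICED_INITIALS or initial == "")
        if joins_previous:
            segments[-1].append(syl)
        else:
            segments.append([syl])
    return segments

# "m" is a voiced initial, so the two syllables form one voiced segment.
print(group_voiced_segments([("sh", "shen"), ("m", "mi")]))   # [['shen', 'mi']]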
Device embodiment one
Referring to Fig. 12, a structural block diagram of a speech synthesis apparatus according to the present invention is shown. The apparatus may specifically include:
a segment model decision module 1210, configured to perform segment model decision on each segment in a text to be synthesized and determine the baseline HTS fundamental frequency model corresponding to each segment;
a syllable model decision module 1220, configured to perform syllable model decision on each syllable in the text to be synthesized and determine the continuous voiced segment fundamental frequency model corresponding to each syllable;
a fusion parameter generation module 1230, configured to combine, according to the multilayer fusion algorithm, the baseline HTS fundamental frequency model corresponding to each segment with the continuous voiced segment fundamental frequency model corresponding to each syllable to generate fused fundamental frequency parameters (a simplified sketch follows this list); and
a speech synthesis module 1240, configured to synthesize speech according to the fused fundamental frequency parameters and the corresponding spectrum parameters.
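By way of illustration only, the following simplified Python sketch shows one thing the fusion parameter generation module 1230 achieves within a single continuous voiced segment: the baseline state-layer trajectory is shifted uniformly toward a precision-weighted compromise between its own mean and the voiced-segment model mean, mirroring the "translation" effect visible in Figs. 9 to 11. The patent's actual multilayer fusion integrates both layers' optimal criteria jointly (see claim 6), so this uniform shift is an assumed simplification, and all names are illustrative.

import numpy as np

def fuse_segment_mean(baseline_f0, seg_mean, seg_prec, base_prec):
    # Precision-weighted compromise between the baseline trajectory mean and
    # the continuous voiced segment model mean; the whole trajectory is then
    # shifted by one common offset, so the segment adjusts with a single
    # direction and amplitude (an assumed simplification of the fusion).
    base_mean = baseline_f0.mean()
    fused_mean = (base_prec * base_mean + seg_prec * seg_mean) / (base_prec + seg_prec)
    return baseline_f0 + (fused_mean - base_mean)

# A baseline segment with mean 205 Hz pulled toward a segment model mean of
# 220 Hz, with equal confidence in both layers.
seg = np.array([200.0, 205.0, 210.0])
print(fuse_segment_mean(seg, seg_mean=220.0, seg_prec=1.0, base_prec=1.0))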
Preferably, the syllable model decision module 1220 may specifically include:
a syllable model prediction submodule, configured to perform syllable fundamental frequency model prediction on each syllable in the text to be synthesized;
an optimal model determination submodule, configured to determine the optimal syllable fundamental frequency model of each syllable by the multi-path selection method based on trend line fitting; and
a continuous voiced segment model generation submodule, configured to generate the continuous voiced segment fundamental frequency models from the optimal syllable fundamental frequency models of the syllables.
Preferably, the optimal model determination submodule may specifically include:
a syllable fundamental frequency candidate determination unit, configured to determine multiple syllable fundamental frequency candidate models for each syllable in the text to be synthesized; and
a trend line generation unit, configured to fit a straight line in two-dimensional space from the multiple syllable fundamental frequency candidate models under the least-squares criterion, this straight line being the trend line (see the sketch after this list).
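A sketch of one plausible reading of the trend-line-based multi-path selection: fit a least-squares line over all candidate means in the (syllable index, mean) plane, then keep, for each syllable, the candidate nearest the line. The patent specifies the least-squares line fitting but not the candidate scoring, so the nearest-to-line rule and all names here are assumptions.

import numpy as np

def pick_models_by_trend_line(candidate_means):
    """Select one candidate model per syllable using a fitted trend line.

    candidate_means: list over syllables; each entry is a sequence of the
    mean F0 values proposed by that syllable's candidate models.
    Returns the chosen candidate index per syllable and the fitted line.
    """
    xs = np.concatenate([np.full(len(c), i, dtype=float)
                         for i, c in enumerate(candidate_means)])
    ys = np.concatenate([np.asarray(c, dtype=float) for c in candidate_means])
    slope, intercept = np.polyfit(xs, ys, 1)    # least-squares line = trend line

    best = []
    for i, cands in enumerate(candidate_means):
        target = slope * i + intercept
        best.append(int(np.argmin(np.abs(np.asarray(cands) - target))))
    return best, (slope, intercept)

# Three syllables with three candidate model means each.
cands = [[200.0, 220.0, 260.0], [210.0, 235.0, 250.0], [215.0, 240.0, 280.0]]
print(pick_models_by_trend_line(cands))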
Preferably, the continuous voiced segment model generation submodule may specifically include:
a merging unit, configured to merge the optimal syllable fundamental frequency models of the syllables, one continuous voiced segment at a time; and
a generation unit, configured to average the Gaussian models corresponding to each continuous voiced segment, weighted by duration, to obtain the continuous voiced segment fundamental frequency model (a sketch follows this list).
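A minimal sketch of the duration-weighted merge, assuming each syllable model is a one-dimensional Gaussian. The duration weighting of the means follows the text; applying the same weights to the variances is an added assumption, since the patent does not give that formula.

import numpy as np

def merge_to_segment_gaussian(means, variances, durations):
    """Merge the optimal per-syllable Gaussians of one continuous voiced
    segment into a single segment-level Gaussian by duration weighting."""
    w = np.asarray(durations, dtype=float)
    w = w / w.sum()                          # normalize durations into weights
    mean = float(np.dot(w, means))           # duration-weighted mean
    var = float(np.dot(w, variances))        # assumed: same weighting for variances
    return mean, var

# Two syllables in one voiced segment, lasting 0.22 s and 0.18 s.
print(merge_to_segment_gaussian([215.0, 177.5], [90.0, 110.0], [0.22, 0.18]))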
Preferably, the apparatus may further include:
an intonation control module, configured to control the intonation of the speech synthesis according to the trend line.
Device embodiment two
Referring to Fig. 13, a structural block diagram of a training apparatus for a syllable fundamental frequency model according to the present invention is shown. The apparatus may specifically include:
an acoustic parameter extraction module 1310, configured to extract acoustic parameters from speech samples, the acoustic parameters including fundamental frequency parameters and spectrum parameters;
a syllable parameter generation module 1320, configured to generate syllable fundamental frequency mean parameters from the fundamental frequency parameters; and
a syllable fundamental frequency model training module 1330, configured to train multiple sets of syllable fundamental frequency models from the syllable fundamental frequency mean parameters.
Preferably, the syllable parameter generation module 1320 may specifically include:
a syllable parameter generation submodule, configured to extract features from the fundamental frequency parameters in units of syllables, generating the syllable fundamental frequency mean parameters by averaging statistics per syllable.
Preferably, the syllable fundamental frequency model training module 1330 may specifically include:
a context information generation submodule, configured to generate, from the various annotations in the corpus, per-segment context information and per-syllable context information for the speech samples; and
a syllable fundamental frequency model training submodule, configured to train multiple sets of syllable fundamental frequency models from the syllable fundamental frequency mean parameters in combination with the syllable context information (a sketch follows this list).
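A compact sketch of training one such set of syllable fundamental frequency mean models: since a single-state HMM over a constant value reduces to a Gaussian, the sketch fits one Gaussian per context class. Real HTS training clusters contexts with decision trees over the question sets; the flat context labels here stand in for the tree leaves, an assumption made for brevity, and all names are illustrative.

from collections import defaultdict
import numpy as np

def train_syllable_f0_models(samples):
    """Fit one Gaussian (mean, variance) per context class.

    samples: iterable of (context_label, syllable_f0_mean) pairs, where each
    syllable contributes its single fundamental frequency mean as training data.
    """
    buckets = defaultdict(list)
    for context, f0_mean in samples:
        buckets[context].append(f0_mean)
    return {ctx: (float(np.mean(v)), float(np.var(v)))
            for ctx, v in buckets.items()}

models = train_syllable_f0_models([
    ("tone4_phrase_final", 180.0), ("tone4_phrase_final", 174.0),
    ("tone1_phrase_initial", 240.0), ("tone1_phrase_initial", 236.0),
])
print(models)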
As the device embodiments are substantially similar to the method embodiments, their description is relatively brief; for related details, refer to the corresponding parts of the method embodiments.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the others, and the identical or similar parts of the embodiments may be referred to one another.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The embodiments of the present invention are described with reference to flowcharts and/or block diagrams of methods, terminal devices (systems), and computer program products according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing terminal device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing terminal device, such that a series of operational steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present invention have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the embodiments of the present invention.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device including that element.
The method and apparatus for speech synthesis and the training method and apparatus for a fundamental frequency model provided by the present invention have been described in detail above. Specific examples have been used herein to explain the principles and implementation of the present invention; the descriptions of the above embodiments are intended only to help understand the method of the present invention and its core idea. Meanwhile, those of ordinary skill in the art may make changes to the specific implementation and the scope of application according to the idea of the present invention. In summary, the contents of this specification should not be construed as limiting the present invention.

Claims (7)

1. A method of speech synthesis, characterized in that the method includes:
performing segment model decision on each segment in a text to be synthesized, and determining a baseline HTS fundamental frequency model corresponding to each segment;
performing syllable model decision on each syllable in the text to be synthesized, and determining a continuous voiced segment fundamental frequency model corresponding to each syllable;
combining, according to a multilayer fusion algorithm, the baseline HTS fundamental frequency model corresponding to each segment with the continuous voiced segment fundamental frequency model corresponding to each syllable to generate fused fundamental frequency parameters; and
synthesizing speech according to the fused fundamental frequency parameters and corresponding spectrum parameters.
2. The method according to claim 1, characterized in that the step of performing syllable model decision on each syllable in the text to be synthesized and determining the continuous voiced segment fundamental frequency model corresponding to each syllable includes:
performing syllable fundamental frequency model prediction on each syllable in the text to be synthesized;
determining an optimal syllable fundamental frequency model of each syllable by a multi-path selection method based on trend line fitting; and
generating the continuous voiced segment fundamental frequency models from the optimal syllable fundamental frequency models of the syllables.
3. The method according to claim 2, characterized in that the step of generating the trend line includes:
determining, for each syllable in the text to be synthesized, multiple syllable fundamental frequency candidate models; and
fitting a straight line in two-dimensional space from the multiple syllable fundamental frequency candidate models under the least-squares criterion, the straight line being the trend line.
4. The method according to claim 2, characterized in that generating the continuous voiced segment fundamental frequency models from the optimal syllable fundamental frequency models of the syllables includes:
merging the optimal syllable fundamental frequency models of the syllables, one continuous voiced segment at a time; and
averaging the Gaussian models corresponding to each continuous voiced segment, weighted by duration, to obtain the continuous voiced segment fundamental frequency model.
5. The method according to claim 2, characterized in that the method further includes:
controlling the intonation of the speech synthesis according to the trend line.
6. The method according to claim 1, characterized in that the multilayer fusion algorithm unites the parameter set of the state layer models with the parameter set of the continuous voiced segment models, and performs an integrated computation according to the respective optimal criteria of the state layer and the continuous voiced segment layer.
7. A speech synthesis apparatus, characterized in that the apparatus includes:
a segment model decision module, configured to perform segment model decision on each segment in a text to be synthesized and determine a baseline HTS fundamental frequency model corresponding to each segment;
a syllable model decision module, configured to perform syllable model decision on each syllable in the text to be synthesized and determine a continuous voiced segment fundamental frequency model corresponding to each syllable;
a fusion parameter generation module, configured to combine, according to a multilayer fusion algorithm, the baseline HTS fundamental frequency model corresponding to each segment with the continuous voiced segment fundamental frequency model corresponding to each syllable to generate fused fundamental frequency parameters; and
a speech synthesis module, configured to synthesize speech according to the fused fundamental frequency parameters and corresponding spectrum parameters.