CN101901598A - Humming synthesis method and system - Google Patents
- Publication number: CN101901598A
- Authority: CN (China)
- Legal status: Pending
Abstract
The invention provides a humming synthesis method and system. The method comprises the following steps: receiving a text input by a user; performing text analysis to obtain a syllable sequence corresponding to the text and the syllable name of each syllable in the sequence; for each syllable in the sequence, planning corresponding duration, fundamental-frequency (F0) and spectrum parameters from a statistical parameter model according to the syllable name and context; adjusting the planned duration and fundamental-frequency parameters according to a song template selected by the user and the number of syllables in the sequence, the song template storing per-syllable duration and fundamental-frequency parameters; interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters; and synthesizing, from the duration, fundamental-frequency and spectrum parameters of the syllables in the sequence, the speech data. The method and system can output speech data that carries the rhythm and melody of a song.
Description
Technical field
The present invention relates to the field of speech synthesis, and in particular to a humming synthesis method and system.
Background art
Speech synthesis, also known as text-to-speech (TTS) technology, converts arbitrary text into fluent, standard speech that is read aloud.
Existing speech synthesis methods first record a speech corpus and then build a synthesis system on top of it. The intonation and rhythm of the synthesized speech are determined by the corpus, so the output sounds like the recorded speaker talking.
In some entertainment applications, however, the user wants to control the intonation and rhythm of the synthesized speech, for example to have a text "sung" with the melody of a song.
In short, an urgent technical problem for those skilled in the art is how to synthesize speech that carries the intonation and rhythm of a song.
Summary of the invention
The technical problem addressed by the invention is to provide a humming synthesis method and system for outputting speech data that carries the rhythm and melody of a song.
To solve the above problem, the invention discloses a humming synthesis method, comprising:
receiving a text input by a user;
performing text analysis to obtain a syllable sequence corresponding to the text, together with the syllable name of each syllable in the sequence;
for each syllable in the sequence, planning corresponding duration, fundamental-frequency and spectrum parameters from a statistical parameter model according to its syllable name and context;
adjusting the planned duration and fundamental-frequency parameters according to a song template selected by the user and the number of syllables in the sequence, wherein the song template stores per-syllable duration and fundamental-frequency parameters;
interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters;
synthesizing, from the duration, fundamental-frequency and spectrum parameters of the syllables in the sequence, speech data corresponding to the syllable sequence.
Preferably, the step of adjusting the duration and fundamental-frequency parameters comprises:
obtaining the number of syllables in the syllable sequence;
extracting from the song template duration and fundamental-frequency parameters matching that number of syllables, and overwriting the planned duration and fundamental-frequency parameters with them.
Preferably, the text analysis step comprises:
performing word segmentation on the text;
converting numeric characters in the text into words;
performing prosody prediction on the converted text according to the segmentation result;
according to the prosody prediction result, converting the text into a syllable sequence and obtaining the syllable name of each syllable from a syllable mapping table.
Preferably, the song template is generated as follows:
for a song sample, extracting the duration and fundamental-frequency parameters of each syllable;
saving the extracted duration and fundamental-frequency parameters into the song template.
Preferably, the song sample comprises an a cappella (unaccompanied) song sample.
In another aspect, the invention also discloses a humming synthesis system, comprising:
an interface module for receiving the text input by a user;
a text analysis module for performing text analysis to obtain a syllable sequence corresponding to the text, together with the syllable name of each syllable in the sequence;
a parameter planning module for planning, for each syllable in the sequence, corresponding duration, fundamental-frequency and spectrum parameters from a statistical parameter model according to its syllable name and context;
a first parameter adjustment module for adjusting the planned duration and fundamental-frequency parameters according to a song template selected by the user and the number of syllables in the sequence, wherein the song template stores per-syllable duration and fundamental-frequency parameters;
a second parameter adjustment module for interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters;
a synthesis module for synthesizing, from the duration, fundamental-frequency and spectrum parameters of the syllables in the sequence, speech data corresponding to the syllable sequence.
Preferably, the first parameter adjustment module comprises:
an acquiring unit for obtaining the number of syllables in the syllable sequence;
an adjustment unit for extracting from the song template parameter information matching that number of syllables, overwriting the planned duration and fundamental-frequency parameters with it, and interpolating the spectrum parameters according to the planned duration.
Preferably, the text analysis module comprises:
a word segmentation unit for performing word segmentation on the text;
a numeric character conversion unit for converting numeric characters in the text into words;
a prosody prediction unit for performing prosody prediction on the converted text according to the segmentation result;
a syllable conversion unit for converting the text into a syllable sequence according to the prosody prediction result and obtaining the syllable name of each syllable from a syllable mapping table.
Preferably, the system further comprises a song template generation module, which comprises:
an extraction unit for extracting, from a song sample, the duration and fundamental-frequency parameters of each syllable;
a saving unit for saving the duration and fundamental-frequency parameters into the song template.
Preferably, the song sample comprises an a cappella (unaccompanied) song sample.
Compared with the prior art, the invention has the following advantages:
The song template stores duration and fundamental-frequency parameters per syllable, and templates can be named after features that identify the melody, such as the song title. The user can therefore pick a suitable template according to personal habits, application scenario and other actual needs; the planned duration and fundamental-frequency parameters are adjusted with the template, and speech data for the input text is finally produced by parametric synthesis. Among the speech parameters, duration and fundamental frequency jointly determine rhythm and melody, while the spectrum parameters determine timbre, i.e. the voice identity of the speaker. By combining the duration and fundamental-frequency parameters of the song template with the spectrum parameters of the corpus speaker, the invention obtains a humming speech stream whose timbre is that of the corpus speaker and whose intonation and rhythm follow the song's melody.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of a humming synthesis method according to the invention;
Fig. 2 is a block diagram of an embodiment of a humming synthesis system according to the invention.
Embodiment
For above-mentioned purpose of the present invention, feature and advantage can be become apparent more, the present invention is further detailed explanation below in conjunction with the drawings and specific embodiments.
One of core idea of the embodiment of the invention is, generate the song template based on duration parameters and base frequency parameters, and, when user input text, can adjust duration and base frequency parameters that planning obtains according to described song template, utilize compositor to obtain the speech data of described text then.Because in speech parameter, duration and base frequency parameters determine the information of rhythm, melody aspect jointly, spectrum parameter decision tone color information, i.e. the characteristic voice information of speaker; Thereby above-mentioned duration with the song template, base frequency parameters combine with the spectrum parameter of sound storehouse speaker, and can access tone color is that sound storehouse speaker, tone rhythm are song and the humming voice flow that has certain melody.
Referring to Fig. 1, a flowchart of an embodiment of a humming synthesis method according to the invention is shown, which may comprise:
Step 101: receiving the text input by the user.
The input text may contain both words and numeric characters; the words may be Chinese characters, Japanese, Korean, English and so on, or a combination of several of these (for example mixed Chinese and English). The invention is not limited to a particular kind of text; the examples below mainly use Chinese characters.
The text analysis step is described below using the example text "Beijing held the grand opening ceremony of the Olympic Games on 2008-8-8" (北京在2008-8-8举行了盛大的奥运会开幕式). It may comprise the following sub-steps:
Sub-step A1: performing word segmentation on the text.
Segmentation result: Beijing / at / 2008-8-8 / held / grand / Olympic Games / opening ceremony.
Sub-step A2: converting the numeric characters in the text into words.
In this example, "2008-8-8" is converted into "August 8, 2008" (二零零八年八月八日), giving the converted text "Beijing held the grand opening ceremony of the Olympic Games on August 8, 2008".
Sub-step A3: performing prosody prediction on the converted text according to the segmentation result.
Prosody prediction result: Beijing / on August 8, 2008 / held / the grand Olympic Games opening ceremony.
Sub-step A4: according to the prosody prediction result, converting the text into a syllable sequence and obtaining the syllable name of each syllable from a syllable mapping table.
Syllable sequence: bei3 jing1 zai4 er4 ling2 ling2 ba1 nian2 ba1 yue4 ba1 ri4 ju3 xing2 le5 sheng4 da4 de5 ao4 yun4 hui4 kai1 mu4 shi4
Here the digits 1 to 5 denote the tone: the first, second, third and fourth tone of Mandarin, and the neutral tone. In practice, the syllable name of a Chinese character is obtained by looking the character up in a character-to-syllable mapping table; "bei3" in the example above is such a syllable name.
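By way of illustration only (not part of the original disclosure), the lookup in sub-step A4 can be sketched as follows; the mapping-table fragment is hypothetical and covers just the example characters, whereas a real system needs a full dictionary and polyphone handling.

```python
# Hypothetical fragment of a character-to-syllable mapping table; a real
# system would cover all characters and disambiguate polyphonic ones.
SYLLABLE_TABLE = {
    "北": "bei3", "京": "jing1", "在": "zai4", "举": "ju3", "行": "xing2",
}

def text_to_syllables(text):
    """Return the syllable name (pinyin plus tone digit) of each character."""
    return [SYLLABLE_TABLE[ch] for ch in text if ch in SYLLABLE_TABLE]

print(text_to_syllables("北京在举行"))  # ['bei3', 'jing1', 'zai4', 'ju3', 'xing2']
```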
The context mainly refers to the position of the syllable, which may be sentence-initial, sentence-medial or sentence-final. In the example above, the context of "shi4" is sentence-final, while that of "er4" is sentence-medial.
In practice, the statistical parameter model can be obtained by offline training; it stores the parameters of each syllable under different contexts.
For example, a first statistical model is trained offline for the duration parameters, a second for the fundamental-frequency parameters, and a third for the spectrum parameters; during online planning, the duration, fundamental-frequency and spectrum parameters of a syllable can then be obtained directly from these three models.
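As an illustrative sketch (the table entries below are invented placeholders, not trained model output), the online planning step amounts to looking up each (syllable name, context) pair in the offline-trained models:

```python
# Toy stand-ins for two of the three offline-trained statistical models;
# a real model generalizes rather than memorizing (syllable, context) pairs.
DURATION_MODEL = {("bei3", "sentence-initial"): 220, ("shi4", "sentence-final"): 310}  # ms
F0_MODEL = {("bei3", "sentence-initial"): 230.0, ("shi4", "sentence-final"): 180.0}    # Hz

def plan_parameters(syllable, context):
    """Look up the planned duration and fundamental frequency for a syllable."""
    duration = DURATION_MODEL.get((syllable, context))
    f0 = F0_MODEL.get((syllable, context))
    return duration, f0

print(plan_parameters("bei3", "sentence-initial"))  # (220, 230.0)
```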
In practice, the song template can be built offline by the following steps:
for a song sample, extracting the duration and fundamental-frequency parameters of each syllable;
saving the extracted duration and fundamental-frequency parameters into the song template.
Since an ordinary song consists of a vocal part and an accompaniment, and the acoustic characteristics of musical instruments differ greatly from those of the human voice, extracting parameters from an accompanied recording introduces many errors; the invention therefore preferentially uses a cappella (unaccompanied) song samples.
Among the speech parameters, the duration parameter is the length of time each syllable is voiced and can be determined from the waveform file; the fundamental-frequency parameter is the vibration frequency of the sound wave, and can be obtained by first detecting the period of the waveform and then taking its reciprocal.
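The period-then-reciprocal extraction just described can be sketched with a plain autocorrelation period detector. This is a simplified illustration only; as the text notes, mature extraction tools are far more robust.

```python
import math

def estimate_f0(frame, sample_rate):
    """Estimate F0 by locating the waveform period with autocorrelation,
    then taking the reciprocal of that period (simplified sketch)."""
    n = len(frame)
    mean = sum(frame) / n
    x = [v - mean for v in frame]
    # Search lags corresponding to a plausible voice range of 50-500 Hz.
    lo, hi = int(sample_rate / 500), int(sample_rate / 50)
    best_lag, best_corr = lo, float("-inf")
    for lag in range(lo, min(hi, n)):
        c = sum(x[i] * x[i + lag] for i in range(n - lag))
        if c > best_corr:
            best_corr, best_lag = c, lag
    return sample_rate / best_lag

sr = 8000
frame = [math.sin(2 * math.pi * 200 * i / sr) for i in range(320)]  # 200 Hz tone
print(round(estimate_f0(frame, sr)))  # 200
```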
In specific implementations, mature tools can be used to extract the duration and fundamental-frequency parameters from the song sample automatically; the invention is not limited to a particular extraction method.
In general, one song template is generated per song sample; the sample may be a complete song or a fragment of one. For the user's convenience in selecting, each template can be named, for example after the song title: "Sometime in Winter", "The Moon Represents My Heart", "Story of Spring", and so on.
When the user inputs text, the system can present the offline-built song templates as options, and the user selects a suitable template according to personal habits, application scenario and other actual needs.
Specifically, step 104 can be realized by the following sub-steps:
Sub-step B1: obtaining the number of syllables in the syllable sequence.
Sub-step B2: extracting from the song template duration and fundamental-frequency parameters matching that number of syllables, and overwriting the planned duration and fundamental-frequency parameters with them.
Suppose the obtained syllable sequence contains N syllables and the song template contains M syllables, with M and N natural numbers. The adjustment then falls into two cases:
Case 1: M ≥ N.
The duration and fundamental-frequency parameters of the first N syllables are simply taken from the song template.
Case 2: M < N.
The duration and fundamental-frequency parameters of the template's M syllables are reused cyclically. If the template syllables are numbered 1, 2, ..., M and, say, N > 2M, the template syllable numbers corresponding to the finally obtained duration and fundamental-frequency parameters are 1, 2, ..., M, 1, 2, ..., M, 1, 2, ... until N entries have been produced.
Here, overwriting the planned duration and fundamental-frequency parameters means replacing the originally planned values with the values from the song template.
In practice, the overwrite may be performed immediately after each syllable's duration and fundamental-frequency parameters are extracted, before moving on to the next syllable; or the parameters of all N syllables may be extracted first and overwritten afterwards. The invention is not limited to a particular order of operations.
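The two cases above (truncate when M ≥ N, cycle when M < N) both reduce to indexing the template modulo M. A minimal sketch with invented parameter values:

```python
def adjust_from_template(template_durs, template_f0s, n_syllables):
    """Take the first N template entries; if the template has fewer than N
    syllables (M < N), cycle through it until N entries are produced."""
    m = len(template_durs)
    idx = [i % m for i in range(n_syllables)]
    return [template_durs[i] for i in idx], [template_f0s[i] for i in idx]

# Template with M = 3 syllables, text with N = 5 syllables (values invented).
durs, f0s = adjust_from_template([300, 400, 500], [220.0, 240.0, 260.0], 5)
print(durs)  # [300, 400, 500, 300, 400]
print(f0s)   # [220.0, 240.0, 260.0, 220.0, 240.0]
```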
A precondition for synthesis with the synthesizer is that the fundamental-frequency and spectrum parameters correspond one to one, i.e. each fundamental-frequency value must have exactly one spectrum frame. This step therefore adjusts the spectrum parameters so that they match the fundamental-frequency parameters obtained by the adjustment of step 104, ready for the subsequent synthesis.
The adjustment is illustrated below with a concrete example:
Suppose the duration planned for a syllable in step 103 is 400 ms and the parameters are sampled 1000 times per second, i.e. at a rate of 1000 Hz; the number of fundamental-frequency and spectrum frames is then 400.
Suppose further that, after step 104 adjusts the duration according to the user-selected song template and the number of syllables, the duration becomes 500 ms, i.e. there are now 500 fundamental-frequency values.
This step then interpolates the 400 spectrum frames of step 103 into 500 spectrum frames.
Many interpolation methods exist, for example linear or non-linear interpolation, two-point or multi-point interpolation; those skilled in the art may adopt any of them as needed, and the invention is not limited in this respect.
For example, with two-point linear interpolation the formula may be Qs = (a·Q1 + b·Q2 + u1)/(a + b), where Q1 and Q2 are the spectrum values of two known points 1 and 2 (which may be original points from step 103 or new points already produced in this step), a and b are natural numbers representing the weights that points 1 and 2 contribute to the point S to be interpolated, and 0 < u1 < a + b.
In summary, this step interpolates M spectrum frames into N frames, so that each spectrum frame corresponds to one fundamental-frequency value, where M is given by step 103, N by step 104, and M and N are natural numbers.
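The M-to-N adjustment can be sketched as linear resampling of the spectrum track, using the two-point linear interpolation the text mentions. For simplicity each frame is a single number here, whereas a real spectrum frame is a vector interpolated component-wise:

```python
def resample_spectrum(frames, n_target):
    """Linearly interpolate a per-frame parameter track from M frames to N,
    so that each adjusted F0 value gets exactly one spectrum frame."""
    m = len(frames)
    out = []
    for j in range(n_target):
        # Map target index j onto the source track's [0, m-1] range.
        pos = j * (m - 1) / (n_target - 1) if n_target > 1 else 0.0
        i = int(pos)
        frac = pos - i
        nxt = frames[min(i + 1, m - 1)]
        out.append(frames[i] * (1 - frac) + nxt * frac)
    return out

print(resample_spectrum([0.0, 1.0, 2.0], 5))  # [0.0, 0.5, 1.0, 1.5, 2.0]
```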
Parametric synthesis is widely used in speech synthesis because of its strong controllability and the malleability of the resulting voice. In practice, an LPC (linear predictive coding) filter can be adopted as the synthesizer; the invention is not limited to a particular synthesizer.
Because the duration and fundamental-frequency parameters of the song template have been incorporated, the synthesized speech data has the same melody and rhythm as the song.
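An LPC synthesizer is, at its core, an all-pole filter driven by an excitation signal. The generic sketch below (invented coefficients; a simplified illustration, not the patent's specific synthesizer) shows the recursion y[n] = e[n] - Σ a[k]·y[n-k]:

```python
def allpole_filter(excitation, a):
    """Run an excitation signal through an all-pole (LPC synthesis) filter:
    y[n] = e[n] - sum_k a[k] * y[n - k]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y.append(acc)
    return y

# An impulse through a one-pole filter with a = [-0.5] decays by halves.
print(allpole_filter([1.0, 0.0, 0.0, 0.0], [-0.5]))  # [1.0, 0.5, 0.25, 0.125]
```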
Referring to Fig. 2, a block diagram of an embodiment of a humming synthesis system according to the invention is shown, which may comprise:
a first parameter adjustment module 204 for adjusting the planned duration and fundamental-frequency parameters according to the song template selected by the user and the number of syllables in the syllable sequence, wherein the song template stores per-syllable duration and fundamental-frequency parameters;
a second parameter adjustment module 205 for interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters.
In practice, the text analysis module 202 may further comprise:
a word segmentation unit C1 for performing word segmentation on the text;
a numeric character conversion unit C2 for converting numeric characters in the text into words;
a prosody prediction unit C3 for performing prosody prediction on the converted text according to the segmentation result;
a syllable conversion unit C4 for converting the text into a syllable sequence according to the prosody prediction result and obtaining the syllable name of each syllable from a syllable mapping table.
The song template can be built offline by a song template generation module, which may comprise:
an extraction unit D1 for extracting, from a song sample, the duration and fundamental-frequency parameters of each syllable;
a saving unit D2 for saving the duration and fundamental-frequency parameters, together with the corresponding sampling rate, into the song template.
Since an ordinary song consists of a vocal part and an accompaniment, and the acoustic characteristics of musical instruments differ greatly from those of the human voice, extracting parameters from an accompanied recording introduces many errors; a cappella (unaccompanied) song samples are therefore preferred.
When the user inputs text, the system can present the offline-built song templates as options, and the user selects a suitable template according to personal habits, application scenario and other actual needs.
Specifically, the first parameter adjustment module 204 may comprise the following units:
an acquiring unit E1 for obtaining the number of syllables in the syllable sequence;
an adjustment unit E2 for extracting from the song template duration and fundamental-frequency parameters matching that number of syllables and overwriting the planned duration and fundamental-frequency parameters with them.
The embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and for the parts they share, reference between embodiments suffices. Since the system embodiments are basically similar to the method embodiments, their description is brief; for details, refer to the description of the method embodiments.
The invention can be applied to various computer terminals and mobile digital devices, converting any text received or input by the system into a speech stream carrying the rhythm and melody of a song.
The humming synthesis method and system provided by the invention have been described in detail above. Specific examples have been used to explain the principles and embodiments of the invention; the above description is intended only to help understand the method of the invention and its core ideas. A person of ordinary skill in the art may, following the ideas of the invention, make changes to the specific embodiments and application scope. In summary, the contents of this specification should not be construed as limiting the invention.
Claims (10)
1. A humming synthesis method, characterized by comprising:
receiving a text input by a user;
performing text analysis to obtain a syllable sequence corresponding to the text, together with the syllable name of each syllable in the sequence;
for each syllable in the sequence, planning corresponding duration, fundamental-frequency and spectrum parameters from a statistical parameter model according to its syllable name and context;
adjusting the planned duration and fundamental-frequency parameters according to a song template selected by the user and the number of syllables in the sequence, wherein the song template stores per-syllable duration and fundamental-frequency parameters;
interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters;
synthesizing, from the duration, fundamental-frequency and spectrum parameters of the syllables in the sequence, speech data corresponding to the syllable sequence.
2. The method of claim 1, characterized in that the step of adjusting the duration and fundamental-frequency parameters comprises:
obtaining the number of syllables in the syllable sequence;
extracting from the song template duration and fundamental-frequency parameters matching that number of syllables, and overwriting the planned duration and fundamental-frequency parameters with them.
3. The method of claim 1, characterized in that the text analysis step comprises:
performing word segmentation on the text;
converting numeric characters in the text into words;
performing prosody prediction on the converted text according to the segmentation result;
according to the prosody prediction result, converting the text into a syllable sequence and obtaining the syllable name of each syllable from a syllable mapping table.
4. The method of claim 1, characterized in that the song template is generated as follows:
for a song sample, extracting the duration and fundamental-frequency parameters of each syllable;
saving the extracted duration and fundamental-frequency parameters into the song template.
5. The method of claim 4, characterized in that the song sample comprises an a cappella (unaccompanied) song sample.
6. A humming synthesis system, characterized by comprising:
an interface module for receiving the text input by a user;
a text analysis module for performing text analysis to obtain a syllable sequence corresponding to the text, together with the syllable name of each syllable in the sequence;
a parameter planning module for planning, for each syllable in the sequence, corresponding duration, fundamental-frequency and spectrum parameters from a statistical parameter model according to its syllable name and context;
a first parameter adjustment module for adjusting the planned duration and fundamental-frequency parameters according to a song template selected by the user and the number of syllables in the sequence, wherein the song template stores per-syllable duration and fundamental-frequency parameters;
a second parameter adjustment module for interpolating the spectrum parameters of the corresponding syllables according to the adjusted duration parameters;
a synthesis module for synthesizing, from the duration, fundamental-frequency and spectrum parameters of the syllables in the sequence, speech data corresponding to the syllable sequence.
7. The system of claim 6, characterized in that the first parameter adjustment module comprises:
an acquiring unit for obtaining the number of syllables in the syllable sequence;
an adjustment unit for extracting from the song template parameter information matching that number of syllables, overwriting the planned duration and fundamental-frequency parameters with it, and interpolating the spectrum parameters according to the planned duration.
8. The system of claim 6, characterized in that the text analysis module comprises:
a word segmentation unit for performing word segmentation on the text;
a numeric character conversion unit for converting numeric characters in the text into words;
a prosody prediction unit for performing prosody prediction on the converted text according to the segmentation result;
a syllable conversion unit for converting the text into a syllable sequence according to the prosody prediction result and obtaining the syllable name of each syllable from a syllable mapping table.
9. The system of claim 6, characterized by further comprising a song template generation module, which comprises:
an extraction unit for extracting, from a song sample, the duration and fundamental-frequency parameters of each syllable;
a saving unit for saving the duration and fundamental-frequency parameters into the song template.
10. The system of claim 9, characterized in that the song sample comprises an a cappella (unaccompanied) song sample.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN2010102234975A | 2010-06-30 | 2010-06-30 | Humming synthesis method and system |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN101901598A (en) | 2010-12-01 |

Family ID: 43227091
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN2010102234975A (pending) | Humming synthesis method and system | 2010-06-30 | 2010-06-30 |
Country Status (1)

| Country | Link |
|---|---|
| CN | CN101901598A (en) |
Cited By (13)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN103456295A | 2013-08-05 | 2013-12-18 | Method and system for generating fundamental frequency parameters in singing synthesis |
| CN103915093A | 2012-12-31 | 2014-07-09 | Method and device for realizing voice singing |
| CN104391980A | 2014-12-08 | 2015-03-04 | Song generating method and device |
| CN105740394A | 2016-01-27 | 2016-07-06 | Music generation method, terminal, and server |
| CN105788589A | 2016-05-04 | 2016-07-20 | Audio data processing method and device |
| CN107705782A | 2017-09-29 | 2018-02-16 | Method and apparatus for determining phoneme pronunciation duration |
| CN107749301A | 2017-09-18 | 2018-03-02 | Timbre sample reconstruction method and system, storage medium and terminal device |
| CN108053814A | 2017-11-06 | 2018-05-18 | Speech synthesis system and method for simulating a user's singing |
| CN108257609A | 2017-12-05 | 2018-07-06 | Audio content correction method and intelligent device therefor |
| CN108877753A | 2018-06-15 | 2018-11-23 | Music synthesis method and system, terminal and computer readable storage medium |
| CN109801618A | 2017-11-16 | 2019-05-24 | Audio information generation method and device |
| CN110047462A | 2019-01-31 | 2019-07-23 | Speech synthesis method and device, and electronic equipment |
| CN112270917A | 2020-10-20 | 2021-01-26 | Voice synthesis method and device, electronic equipment and readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH11184490A (en) * | 1997-12-25 | 1999-07-09 | Nippon Telegr & Teleph Corp <Ntt> | Singing synthesizing method by rule voice synthesis |
CN1661674A (en) * | 2004-01-23 | 2005-08-31 | 雅马哈株式会社 | Singing generator and portable communication terminal having singing generation function |
EP1256932B1 (en) * | 2001-05-11 | 2006-05-10 | Sony France S.A. | Method and apparatus for synthesising an emotion conveyed on a sound |
CN101308652A (en) * | 2008-07-17 | 2008-11-19 | 安徽科大讯飞信息科技股份有限公司 | Synthesizing method of personalized singing voice |
WO2010031437A1 (en) * | 2008-09-19 | 2010-03-25 | Asociacion Centro De Tecnologias De Interaccion Visual Y Comunicaciones Vicomtech | Method and system of voice conversion |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103915093A (en) * | 2012-12-31 | 2014-07-09 | 安徽科大讯飞信息科技股份有限公司 | Method and device for realizing voice singing |
CN103915093B (en) * | 2012-12-31 | 2019-07-30 | 科大讯飞股份有限公司 | A kind of method and apparatus for realizing singing of voice |
CN103456295B (en) * | 2013-08-05 | 2016-05-18 | 科大讯飞股份有限公司 | Sing synthetic middle base frequency parameters and generate method and system |
CN103456295A (en) * | 2013-08-05 | 2013-12-18 | 安徽科大讯飞信息科技股份有限公司 | Method and system for generating fundamental frequency parameters in singing synthesis |
CN104391980B (en) * | 2014-12-08 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | The method and apparatus for generating song |
CN104391980A (en) * | 2014-12-08 | 2015-03-04 | 百度在线网络技术(北京)有限公司 | Song generating method and device |
CN105740394A (en) * | 2016-01-27 | 2016-07-06 | 广州酷狗计算机科技有限公司 | Music generation method, terminal, and server |
WO2017190674A1 (en) * | 2016-05-04 | 2017-11-09 | 腾讯科技(深圳)有限公司 | Method and device for processing audio data, and computer storage medium |
CN105788589B (en) * | 2016-05-04 | 2021-07-06 | 腾讯科技(深圳)有限公司 | Audio data processing method and device |
CN105788589A (en) * | 2016-05-04 | 2016-07-20 | 腾讯科技(深圳)有限公司 | Audio data processing method and device |
US10789290B2 (en) | 2016-05-04 | 2020-09-29 | Tencent Technology (Shenzhen) Company Limited | Audio data processing method and apparatus, and computer storage medium |
CN107749301B (en) * | 2017-09-18 | 2021-03-09 | 得理电子(上海)有限公司 | Tone sample reconstruction method and system, storage medium and terminal device |
CN107749301A (en) * | 2017-09-18 | 2018-03-02 | 得理电子(上海)有限公司 | A kind of tone color sample reconstructing method and system, storage medium and terminal device |
CN107705782A (en) * | 2017-09-29 | 2018-02-16 | 百度在线网络技术(北京)有限公司 | Method and apparatus for determining phoneme pronunciation duration |
CN108053814A (en) * | 2017-11-06 | 2018-05-18 | 芋头科技(杭州)有限公司 | A kind of speech synthesis system and method for analog subscriber song |
CN108053814B (en) * | 2017-11-06 | 2023-10-13 | 芋头科技(杭州)有限公司 | Speech synthesis system and method for simulating singing voice of user |
CN109801618A (en) * | 2017-11-16 | 2019-05-24 | 深圳市腾讯计算机系统有限公司 | A kind of generation method and device of audio-frequency information |
CN108257609A (en) * | 2017-12-05 | 2018-07-06 | 北京小唱科技有限公司 | The modified method of audio content and its intelligent apparatus |
CN108877753B (en) * | 2018-06-15 | 2020-01-21 | 百度在线网络技术(北京)有限公司 | Music synthesis method and system, terminal and computer readable storage medium |
US10971125B2 (en) | 2018-06-15 | 2021-04-06 | Baidu Online Network Technology (Beijing) Co., Ltd. | Music synthesis method, system, terminal and computer-readable storage medium |
CN108877753A (en) * | 2018-06-15 | 2018-11-23 | 百度在线网络技术(北京)有限公司 | Music synthesis method and system, terminal and computer readable storage medium |
CN110047462A (en) * | 2019-01-31 | 2019-07-23 | 北京捷通华声科技股份有限公司 | A kind of phoneme synthesizing method, device and electronic equipment |
CN110047462B (en) * | 2019-01-31 | 2021-08-13 | 北京捷通华声科技股份有限公司 | Voice synthesis method and device and electronic equipment |
CN112270917A (en) * | 2020-10-20 | 2021-01-26 | 网易(杭州)网络有限公司 | Voice synthesis method and device, electronic equipment and readable storage medium |
CN112270917B (en) * | 2020-10-20 | 2024-06-04 | 网易(杭州)网络有限公司 | Speech synthesis method, device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101901598A (en) | Humming synthesis method and system | |
US20220013106A1 (en) | Multi-speaker neural text-to-speech synthesis | |
KR101120710B1 (en) | Front-end architecture for a multilingual text-to-speech system | |
KR20220004737A (en) | Multilingual speech synthesis and cross-language speech replication | |
EP2704092A2 (en) | System for creating musical content using a client terminal | |
Latorre et al. | New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer | |
CN101578659A (en) | Voice tone converting device and voice tone converting method | |
Santra et al. | Development of GUI for text-to-speech recognition using natural language processing | |
JP4829477B2 (en) | Voice quality conversion device, voice quality conversion method, and voice quality conversion program | |
JP2006517037A (en) | Prosodic simulated word synthesis method and apparatus | |
CN102543081A (en) | Controllable rhythm re-estimation system and method and computer program product | |
CN112102811B (en) | Optimization method and device for synthesized voice and electronic equipment | |
CN112037755B (en) | Voice synthesis method and device based on timbre clone and electronic equipment | |
Macon et al. | Concatenation-based midi-to-singing voice synthesis | |
KR20200069264A (en) | System for outputing User-Customizable voice and Driving Method thereof | |
JP2014062970A (en) | Voice synthesis, device, and program | |
JP2001034280A (en) | Electronic mail receiving device and electronic mail system | |
CN112242134A (en) | Speech synthesis method and device | |
CN115762471A (en) | Voice synthesis method, device, equipment and storage medium | |
CN113539236B (en) | Speech synthesis method and device | |
CN113724684A (en) | Voice synthesis method and system for air traffic control instruction | |
Li et al. | A lyrics to singing voice synthesis system with variable timbre | |
CN1979636B (en) | Method for converting phonetic symbol to speech | |
JP2021148942A (en) | Voice quality conversion system and voice quality conversion method | |
Kiruthiga et al. | Annotating Speech Corpus for Prosody Modeling in Indian Language Text to Speech Systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C02 | Deemed withdrawal of patent application after publication (patent law 2001) | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20101201 |