CN103594082A - Sound synthesis device, sound synthesis method and storage medium - Google Patents


Info

Publication number
CN103594082A
CN103594082A (application CN201310357397.5A)
Authority
CN
China
Prior art keywords
prosodic
parameter
speaker
control dictionary
normalization
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310357397.5A
Other languages
Chinese (zh)
Inventor
橘健太郎 (Kentaro Tachibana)
笼岛岳彦 (Takehiko Kagoshima)
森田真弘 (Masahiro Morita)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Corp
Original Assignee
Toshiba Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Corp
Publication of CN103594082A
Legal status: Pending


Abstract

The invention relates to a speech synthesis device, a speech synthesis method, and a storage medium. The speech synthesis device can generate synthesized speech with diverse prosodic features while preserving the characteristics of a target speaker. The device comprises: a text analysis unit that analyzes an input text and outputs linguistic information; a dictionary storage unit that stores a first prosodic control dictionary of the target speaker and a second prosodic control dictionary of each of one or more reference speakers; a prosodic parameter generation unit that, based on the linguistic information, generates first prosodic parameters using the first prosodic control dictionary and one or more second prosodic parameters using the second prosodic control dictionaries; a normalization unit that normalizes the one or more second prosodic parameters based on a normalization parameter; a prosodic parameter interpolation unit that interpolates the first prosodic parameters and the normalized second prosodic parameters based on weight information to generate third prosodic parameters; and a speech synthesis unit that generates synthesized speech according to the third prosodic parameters.

Description

Speech synthesis device, speech synthesis method, and storage medium
Technical field
Embodiments of the present invention relate to a speech synthesis device, a speech synthesis method, and a storage medium.
Background art
Text-to-speech synthesis refers to artificially generating a speech signal from an arbitrary text. It is typically carried out in three stages: text analysis, synthesis parameter generation, and speech synthesis.
In a typical text-to-speech system, a text analysis unit first performs morphological analysis, syntactic analysis, and so on, on the input text and outputs linguistic information. The linguistic information includes a phonetic symbol string corresponding to the pronunciation of the text, information on the accent phrases that serve as units of prosodic control, accent positions, parts of speech, and the like. Next, a synthesis parameter generation unit performs prosodic control based on this linguistic information with reference to a prosodic control dictionary, and generates synthesis parameters. The synthesis parameters include prosodic parameters, such as the fundamental frequency contour (F0 contour), phoneme durations, and volume, and phonemic parameters such as a phoneme symbol string. Finally, a speech synthesis unit generates synthesized speech from these synthesis parameters.
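As a purely illustrative sketch of this three-stage flow (the patent defines no API; every name below is an assumption and each stage is reduced to a stub):

```python
from dataclasses import dataclass

@dataclass
class LinguisticInfo:
    phonetic_symbols: str        # pronunciation of the text
    accent_phrases: list[str]    # units of prosodic control

@dataclass
class SynthesisParameters:
    f0_contour: list[float]      # prosodic parameter: F0 per frame (Hz)
    durations: list[float]       # prosodic parameter: phoneme durations (s)
    phoneme_symbols: list[str]   # phonemic parameter

def analyze_text(text: str) -> LinguisticInfo:
    # stage 1: morphological/syntactic analysis (stubbed)
    return LinguisticInfo(phonetic_symbols=text, accent_phrases=text.split())

def generate_parameters(info: LinguisticInfo) -> SynthesisParameters:
    # stage 2: prosodic control via dictionary lookup (stubbed with constants)
    n = max(len(info.accent_phrases), 1)
    return SynthesisParameters([120.0] * n, [0.1] * n, list(info.phonetic_symbols))

def synthesize(params: SynthesisParameters) -> list[float]:
    # stage 3: waveform generation (stubbed: returns an empty "waveform")
    return []

waveform = synthesize(generate_parameters(analyze_text("hello world")))
```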
Such text-to-speech synthesis normally produces speech in the tone a person uses when reading a text aloud (a so-called reading style), but in recent years a number of methods for realizing diverse prosodic features have been proposed. For example, one proposed method generates a new prosodic parameter by interpolation processing among a plurality of prosodic parameters and uses it to generate synthesized speech, thereby providing synthesized speech with diverse prosodic features.
With this method, however, depending on the relation between the prosodic parameters (for example, when there is a large difference between their feature quantities), interpolation can sometimes produce an undesirable result. Taking the F0 contour as an example of a prosodic parameter: when interpolating between the prosodic parameter of a male target speaker and that of a female speaker, the female F0 contour is in general higher, so the mean F0 of the contour generated by interpolation becomes higher than the mean of the target (male) speaker's F0 contour. As a result, the generated prosodic parameter loses the target speaker's characteristics.
The prior art does not provide a speech synthesis device that can generate synthesized speech with diverse prosodic features while preserving the target speaker's characteristics.
Summary of the invention
The problem to be solved by the invention is to provide a speech synthesis device, method, and storage medium that can generate synthesized speech with diverse prosodic features while preserving the target speaker's characteristics.
According to one embodiment, a device includes a text analysis unit, a dictionary storage unit, a prosodic parameter generation unit, a normalization unit, a prosodic parameter interpolation unit, and a speech synthesis unit. The text analysis unit analyzes an input text and outputs linguistic information. The dictionary storage unit stores a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers. Based on the linguistic information, the prosodic parameter generation unit generates a first prosodic parameter using the first prosodic control dictionary and one or more second prosodic parameters using the respective second prosodic control dictionaries. The normalization unit normalizes each of the one or more second prosodic parameters based on a normalization parameter. The prosodic parameter interpolation unit performs interpolation processing on the first prosodic parameter and the normalized second prosodic parameters based on weight information to generate a third prosodic parameter. The speech synthesis unit generates synthesized speech according to the third prosodic parameter.
A device configured as above can generate synthesized speech with diverse prosodic features while preserving the target speaker's characteristics.
According to another embodiment, a device includes a text analysis unit, a dictionary storage unit, a prosodic parameter generation unit, a prosodic parameter interpolation unit, a normalization unit, and a speech synthesis unit. The text analysis unit analyzes an input text and outputs linguistic information. The dictionary storage unit stores a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers. Based on the linguistic information, the prosodic parameter generation unit generates a first prosodic parameter using the first prosodic control dictionary and one or more second prosodic parameters using the respective second prosodic control dictionaries. The prosodic parameter interpolation unit performs interpolation processing on the first prosodic parameter and the one or more second prosodic parameters based on weight information to generate a third prosodic parameter. The normalization unit normalizes the third prosodic parameter based on a normalization parameter. The speech synthesis unit generates synthesized speech according to the normalized third prosodic parameter.
A device configured as above can likewise generate synthesized speech with diverse prosodic features while preserving the target speaker's characteristics.
Brief description of the drawings
Fig. 1 is a basic block diagram of the speech synthesis device of the 1st embodiment.
Fig. 2 is a block diagram of the 1st configuration example of the 1st embodiment.
Fig. 3 is a flowchart showing an operation example of the speech synthesis device of the 1st embodiment.
Fig. 4 is a diagram for explaining the normalization method based on the mean value.
Fig. 5 is a diagram for explaining the normalization method based on the dynamic range.
Fig. 6 is a diagram showing an example of weight adjustment.
Fig. 7 is a diagram for explaining interpolation.
Fig. 8 is a diagram for explaining extrapolation.
Fig. 9 is a diagram for explaining interpolation processing.
Fig. 10 is a diagram for explaining extrapolation processing.
Fig. 11 is a block diagram of the 2nd configuration example of the 1st embodiment.
Fig. 12 is a basic block diagram of the speech synthesis device of the 2nd embodiment.
Fig. 13 is a block diagram of the 1st configuration example of the 2nd embodiment.
Fig. 14 is a flowchart showing an operation example of the speech synthesis device of the 2nd embodiment.
Fig. 15 is a block diagram of the 2nd configuration example of the 2nd embodiment.
Embodiment
Hereinafter, speech synthesis devices according to embodiments of the present invention will be described in detail with reference to the drawings. In the embodiments below, parts given the same reference numerals perform the same operations, and repeated description is omitted.
As described in detail below, the 1st embodiment performs normalization processing before the interpolation processing of the prosodic parameters, and the 2nd embodiment performs normalization processing after the interpolation processing of the prosodic parameters.
(the 1st embodiment)
The 1st embodiment will now be described.
Fig. 1 shows an example of the block diagram of the speech synthesis device of the 1st embodiment.
As shown in Fig. 1, the speech synthesis device of the present embodiment includes a text analysis unit 1, a prosodic control dictionary storage unit 2, a synthesis parameter generation unit 3, a normalization unit that performs normalization before interpolation processing (hereinafter referred to as the 1st normalization unit) 4, a synthesis parameter interpolation unit 5, and a speech synthesis unit 6.
Note that Fig. 1 centers on the structure related to the prosodic parameters among the synthesis parameters; parts related to other parameters or information are omitted as appropriate. The same applies to the other figures, and the description below likewise centers on prosodic parameters.
In the following, whenever a concrete example of processing related to prosodic parameters is described, the F0 contour is used as the example.
In the present embodiment, various configuration examples are possible regarding the generation of the normalization parameter and so on. Several configuration examples are described in turn below, and the details of each part of the speech synthesis device are described in the course of that explanation.
(the 1st configuration example of the 1st embodiment)
First, the 1st configuration example of the present embodiment will be described.
Fig. 2 shows the block diagram of the speech synthesis device of this configuration example.
As shown in Fig. 2, the speech synthesis device of this configuration example includes a text analysis unit 1, a prosodic control dictionary storage unit 2, a synthesis parameter generation unit 3, a normalization parameter generation unit 7, a 1st normalization unit 4, a synthesis parameter interpolation unit 5, and a speech synthesis unit 6.
Each part is described below.
The text analysis unit 1 performs linguistic processing (for example, morphological analysis and syntactic analysis) on the input text (character string) to generate linguistic information 101.
The linguistic information includes the various information necessary for generating the synthesis parameters, such as a phonetic symbol string corresponding to the pronunciation of the text, information on the accent phrases serving as units of prosodic control, accent positions, and parts of speech.
The prosodic control dictionary storage unit 2 stores one target speaker's prosodic control dictionary and n reference speakers' prosodic control dictionaries, where n is an arbitrary number greater than or equal to 1. The target speaker's prosodic control dictionary contains parameters for controlling the target speaker's prosody, and each reference speaker's prosodic control dictionary contains parameters for controlling that reference speaker's prosody. There is no structural difference between the target speaker's dictionary and a reference speaker's dictionary.
More specifically, a prosodic control dictionary is referenced in order to control the prosody of the synthesized speech, such as the F0 contour, phoneme durations, volume, and pauses. Its contents may be, for example (though not limited to these), parameters of statistical models, or rules represented by decision trees, for typical variation shapes of the F0 contour, control amounts of accent components, phoneme durations, volume, pause lengths, and so on.
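As a purely illustrative image of such a dictionary (the field names below are assumptions, not part of the patent), one might store representative F0 shapes per accent-phrase type together with a few control values:

```python
from dataclasses import dataclass, field

@dataclass
class ProsodicControlDictionary:
    """Illustrative container for one speaker's prosodic control data."""
    # representative F0 shapes, keyed by accent-phrase type (e.g. by mora
    # count and accent position), each a sequence of F0 values in Hz
    f0_shapes: dict[str, list[float]] = field(default_factory=dict)
    # mean phoneme durations in seconds, keyed by phoneme symbol
    durations: dict[str, float] = field(default_factory=dict)
    # default pause length in seconds
    pause_length: float = 0.2
```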
The prosodic control dictionary storage unit 2 may also store a plurality of target speakers' prosodic control dictionaries in advance, and which target speaker's dictionary to use may then be selected (for example, by a user's instruction). A target speaker's prosodic control dictionary other than the one in use may also be used as a reference speaker's prosodic control dictionary.
Based on the linguistic information 101, the synthesis parameter generation unit 3 references the target speaker's prosodic control dictionary to generate the target speaker's synthesis parameters (phonemic parameters and a 1st prosodic parameter), and similarly references each reference speaker's prosodic control dictionary to generate each reference speaker's synthesis parameters (phonemic parameters and a 2nd prosodic parameter). The prosodic parameter generation unit is a part of the synthesis parameter generation unit 3.
The synthesis parameters include prosodic parameters and phonemic parameters. A prosodic parameter is a set of parameters characterizing the prosody of the synthesized speech, such as the fundamental frequency contour (F0 contour), phoneme durations, volume, and pauses. A phonemic parameter is, for example, a phoneme symbol string.
The prosodic parameters vary from speaker to speaker and are generated for each speaker. The phonemic parameters, in contrast, are common to and independent of the speakers; however, there is no problem even if the phonemic parameters are generated per speaker, and once they have been generated, their generation may be omitted thereafter.
The normalization parameter generation unit 7 generates a predetermined normalization parameter 701 based on the target speaker's prosodic parameter (1st prosodic parameter) 301 and the one or more reference speakers' prosodic parameters (2nd prosodic parameters) 302. A normalization parameter 701 is generated for each reference speaker's prosodic parameter.
The 1st normalization unit 4 normalizes each generated reference speaker's prosodic parameter 302 based on the corresponding normalization parameter 701.
Here, normalization is the following processing: for each reference speaker's prosodic parameter 302, one or more feature quantities of that prosodic parameter 302 are brought to within a predetermined threshold of (or made to match) the corresponding feature quantities of the target speaker's prosodic parameter 301. Possible feature quantities include, for example, the mean, the variance, and the dynamic range.
When normalization is performed for multiple kinds of prosodic parameters, a normalization parameter 701 is generated for each kind of prosodic parameter.
Based on arbitrary weight information 901, the synthesis parameter interpolation unit 5 performs interpolation processing on the target speaker's prosodic parameter (1st prosodic parameter) 301 and each reference speaker's normalized prosodic parameter (normalized 2nd prosodic parameter) 401 to generate a 3rd prosodic parameter, and outputs synthesis parameters 501 comprising the 3rd prosodic parameter and the above phonemic parameters. The prosodic parameter interpolation unit is a part of the synthesis parameter interpolation unit 5.
Here, interpolation processing of prosodic parameters is, for example, processing that generates a prosodic parameter intermediate between a plurality of prosodic parameters by weighted averaging among them. The expression "interpolation processing" as used here covers not only the case where all weights are positive but also the case where a negative weight is present (so-called extrapolation processing). When a negative weight is present, the generated prosodic parameter may instead emphasize the features of a particular speaker's prosodic parameter. In the description below, the expression "extrapolation processing" is sometimes used when it is necessary to distinguish interpolation involving a negative weight from interpolation with only positive weights.
Interpolation processing may be performed on all kinds of prosodic parameters, or only on some of them (for example, only the F0 contour). For a prosodic parameter on which interpolation processing is not performed, the target speaker's prosodic parameter is adopted as-is, for example.
Likewise, normalization may be performed on all kinds of prosodic parameters subject to interpolation processing, or alternatively on only some of them.
The weights used for interpolation may be specified commonly, independent of the kind of prosodic parameter; for example, the same interpolation weights may be used for the F0 contour and the phoneme durations. Alternatively, the weights may be specified per kind of prosodic parameter; for example, different interpolation weights may be used for the F0 contour and the phoneme durations.
The weight information may be constant within the text, or it may vary within the text.
The speech synthesis unit 6 generates synthesized speech according to the phonemic information and prosodic information specified by the synthesis parameters 501.
Next, an operation example of this configuration example will be described with reference to Fig. 3.
Here, the F0 contour is used as the concrete example of a prosodic parameter, but as mentioned above, this is not limiting.
First, the text analysis unit 1 generates linguistic information 101 (step S1).
Next, based on the linguistic information 101, the synthesis parameter generation unit 3 references the target speaker's prosodic control dictionary and one or more reference speakers' prosodic control dictionaries to generate each speaker's synthesis parameters (step S2).
A dictionary for controlling the F0 contour (an F0 contour control dictionary) is stored as part of the prosodic control dictionary. One possible construction is, for example, to store representative F0 shapes in units of accent phrases and to select a representative F0 shape based on the generated linguistic information 101.
Next, the normalization parameter generation unit 7 dynamically generates a normalization parameter 701 for each reference speaker's prosodic parameter (step S3).
Next, the 1st normalization unit 4 normalizes each reference speaker's prosodic parameter 302 using the corresponding normalization parameter 701 (step S4).
Here, concrete examples of normalization parameter generation and normalization are described.
One normalization method uses, for example, the mean of the F0 contour. Taking the mean of the reference speaker's F0 contour as the base, the difference between it and the mean of the target speaker's F0 contour (or a value obtained by adding a predetermined threshold to this difference, or by multiplying this difference by a predetermined threshold, etc.) can be used as the normalization parameter. In Fig. 4, for example, 41 denotes the trajectory of the target speaker's F0 contour, 42 the trajectory of the reference speaker's F0 contour, 43 the mean of the target speaker's F0 contour, and 44 the mean of the reference speaker's F0 contour; the normalization parameter is then, for example, the difference d (= target speaker's F0 mean 43 - reference speaker's F0 mean 44). In this case, the reference speaker's normalized F0 contour is generated by adding the difference d to the reference speaker's F0 contour, whereby the mean 43 of the target speaker's F0 contour and the mean 44 of the reference speaker's F0 contour are made to coincide.
When the normalization parameter is instead set to the difference d plus a threshold Thre, the reference speaker's normalized F0 contour is generated by adding d + Thre to the reference speaker's F0 contour, whereby the mean of the reference speaker's F0 contour is brought to within Thre of the mean of the target speaker's F0 contour. In Fig. 4, 45 denotes the level obtained by adding the threshold Thre to the mean 43 of the target speaker's F0 contour, and 46 denotes the reference speaker's F0 contour after normalization.
For example, when the target speaker is male and the reference speaker is female, normalization makes the mean of the female speaker's F0 contour match (or approach) the mean of the male speaker's F0 contour. The target speaker's characteristics can thereby be preserved.
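A minimal numerical sketch of this mean-based normalization (array names and sample values are illustrative only):

```python
import numpy as np

def normalize_f0_by_mean(target_f0: np.ndarray,
                         reference_f0: np.ndarray,
                         thre: float = 0.0) -> np.ndarray:
    """Shift the reference F0 contour so its mean lands within `thre`
    of the target speaker's F0 mean (the difference d of Fig. 4)."""
    d = target_f0.mean() - reference_f0.mean()   # normalization parameter
    return reference_f0 + (d + thre)

# illustrative contours in Hz: male target, female reference
target = np.array([110.0, 130.0, 120.0, 100.0])
reference = np.array([220.0, 260.0, 240.0, 200.0])
normalized = normalize_f0_by_mean(target, reference)
print(normalized.mean(), target.mean())  # 115.0 115.0: the means coincide
```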
Another normalization method uses the dynamic range of the F0 contour; that is, the mean in the method above is replaced by the dynamic range, and the difference by a ratio. In Fig. 5, for example, 51 denotes the trajectory of the target speaker's F0 contour, 52 the trajectory of the reference speaker's F0 contour, 53 the dynamic range of the target speaker's F0 contour, and 54 the dynamic range of the reference speaker's F0 contour. In this case, the dynamic range 53 is first computed from the maximum and minimum of the target speaker's F0 contour, and the dynamic range 54 from the maximum and minimum of the reference speaker's F0 contour. Then, taking the computed reference speaker's dynamic range 54 as the base, the ratio α to the target speaker's dynamic range 53 is computed to obtain the normalization parameter. The reference speaker's normalized F0 contour is then generated by multiplying the reference speaker's F0 contour by the ratio α, whereby the dynamic range of the reference speaker's normalized F0 contour is made to match the dynamic range of the target speaker's F0 contour. In Fig. 5, 55 denotes the dynamic range after normalization, and 56 denotes the reference speaker's F0 contour after normalization.
As when using the mean, the above ratio may be further adjusted; for example, the normalization parameter may be obtained by further adding a predetermined threshold to the ratio, or by multiplying the ratio by a predetermined threshold.
Normalization may also use both the mean and the dynamic range of the F0 contour, as in the second function of the sketch below.
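A corresponding sketch for the dynamic-range method, including the combined use of both feature quantities just mentioned (again with illustrative names and values):

```python
import numpy as np

def normalize_f0_by_range(target_f0: np.ndarray,
                          reference_f0: np.ndarray) -> np.ndarray:
    """Scale the reference F0 contour so its dynamic range (max - min)
    matches the target speaker's (the ratio alpha of Fig. 5)."""
    alpha = (target_f0.max() - target_f0.min()) / (reference_f0.max() - reference_f0.min())
    return reference_f0 * alpha

def normalize_f0_by_range_and_mean(target_f0: np.ndarray,
                                   reference_f0: np.ndarray) -> np.ndarray:
    """Use both feature quantities: match the dynamic range by scaling,
    then match the mean by shifting."""
    scaled = normalize_f0_by_range(target_f0, reference_f0)
    return scaled + (target_f0.mean() - scaled.mean())

target = np.array([110.0, 130.0, 120.0, 100.0])     # range 30 Hz
reference = np.array([200.0, 280.0, 240.0, 220.0])  # range 80 Hz
scaled = normalize_f0_by_range(target, reference)
print(scaled.max() - scaled.min())  # 30.0: the ranges now match
```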
Besides these methods, various other normalization methods can be adopted.
Next, based on arbitrary weight information 901, the synthesis parameter interpolation unit 5 performs interpolation processing on the target speaker's prosodic parameter 301 and each reference speaker's normalized prosodic parameter 401 (step S5).
A weight is set for each synthesis parameter (each speaker). The weights can be specified in various ways, without particular limitation: the value of each weight may be entered directly, or a graphical user interface (GUI) such as a slider may be used.
Fig. 6 shows an example of selecting the weight with a GUI when there is one reference speaker. In the example of Fig. 6, 61 is a slider. By moving the slider 61 to an arbitrary position, the interpolation ratio between the target speaker and the reference speaker can be changed freely (the left end corresponds to the target speaker, the right end to the reference speaker). Further, in the example of Fig. 6, an extrapolation ratio can also be specified, for example by placing the target speaker at 62 and the reference speaker at 63.
A GUI can also be used when there are two reference speakers. In this case, for example, images of the target speaker, the 1st reference speaker, and the 2nd reference speaker are displayed at the vertices of a triangle on the GUI screen; the user indicates an arbitrary position inside or outside the triangle with a pointer, and the weights are determined from the relation between the positions of the triangle's vertices and the position of the pointer, as in the sketch below.
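One natural way to turn such a pointer position into weights, though the patent does not specify it, is to take the barycentric coordinates of the pointer with respect to the triangle; positions outside the triangle then yield negative weights, i.e. extrapolation. A sketch:

```python
def barycentric_weights(p, a, b, c):
    """Weights (w_target, w_ref1, w_ref2) of point p w.r.t. triangle (a, b, c).
    Each argument is an (x, y) pair; the weights sum to 1, and a weight
    goes negative when p lies outside the triangle (extrapolation)."""
    (px, py), (ax, ay), (bx, by), (cx, cy) = p, a, b, c
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)  # twice the signed area
    w0 = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    w1 = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    return w0, w1, 1.0 - w0 - w1

# vertices: target speaker, reference speaker 1, reference speaker 2
tgt, ref1, ref2 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
print(barycentric_weights((0.25, 0.25), tgt, ref1, ref2))   # (0.5, 0.25, 0.25)
print(barycentric_weights((-0.5, 0.25), tgt, ref1, ref2))   # w1 < 0: extrapolation
```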
Here, the interpolation of prosodic parameters is described taking the case of one reference speaker as an example. As mentioned above, "interpolation" here covers both the case where the weights are only positive and the case where a negative weight is present.
Fig. 7 illustrates interpolation with positive weights. Here, tgt denotes the target speaker, std denotes the reference speaker, and int denotes the interpolated point when the reference speaker's weight is m and the target speaker's weight is n, with m ≥ 0 and n ≥ 0.
Fig. 8 illustrates so-called extrapolation. Here, ext denotes the extrapolated point when the reference speaker's weight is m and the target speaker's weight is n, with m ≥ 0 and n ≤ 0.
Fig. 8 shows extrapolation that emphasizes the reference speaker, but an extrapolated point that emphasizes the target speaker is also possible; in that case, m ≤ 0 and n ≥ 0.
Fig. 9 shows an example of interpolation of prosodic parameters when there is one reference speaker. In Fig. 9, 91 is the target speaker's F0 contour, 92 is the reference speaker's F0 contour, and 93 is the F0 contour obtained by interpolation processing between them. Interpolating between them at a ratio of m:n as in Fig. 9 can be expressed by formula (1) below.
(m·std + n·tgt) / (m + n)    (m ≥ 0, n ≥ 0)    (1)
Fig. 10 shows an example of extrapolation of prosodic parameters when there is one reference speaker. In Fig. 10, 101 is the target speaker's F0 contour, 102 is the reference speaker's F0 contour, and 103 is the F0 contour obtained by extrapolation from them. Extrapolating at a ratio of m:n as in Fig. 10 can be expressed by formula (2) below.
(m·std + n·tgt) / (m + n)    (m ≥ 0, n ≤ 0)    (2)
The other extrapolation can be expressed by formula (3) below.
(m·std + n·tgt) / (m + n)    (m ≤ 0, n ≥ 0)    (3)
Interpolation (including extrapolation) when there are n reference speakers can be expressed, for example, by formula (4) below, where stdi denotes the i-th reference speaker, w0 the target speaker's weight, and wi the i-th reference speaker's weight.
(w0·tgt + w1·std1 + w2·std2 + … + wn·stdn) / (w0 + w1 + w2 + … + wn)    (4)
Various interpolation (including extrapolation) methods other than the above are also possible.
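Formula (4) is a plain weighted average of the target contour and the (normalized) reference contours. A direct sketch, with negative weights permitted for extrapolation (all names and values illustrative):

```python
import numpy as np

def interpolate_prosody(tgt: np.ndarray,
                        stds: list[np.ndarray],
                        w0: float,
                        ws: list[float]) -> np.ndarray:
    """Formula (4): weighted average of the target contour `tgt` (weight w0)
    and the normalized reference contours `stds` (weights ws). Negative
    weights give extrapolation, as in formulas (2) and (3)."""
    total = w0 + sum(ws)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    mix = w0 * tgt + sum(w * std for w, std in zip(ws, stds))
    return mix / total

tgt = np.array([110.0, 130.0, 120.0, 100.0])
std = np.array([105.0, 145.0, 125.0, 85.0])   # e.g. a normalized reference contour
print(interpolate_prosody(tgt, [std], w0=1.0, ws=[1.0]))   # midpoint (interpolation)
print(interpolate_prosody(tgt, [std], w0=-1.0, ws=[2.0]))  # emphasizes the reference
```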
The weight information 901 may take various forms: it may be entered by the user, provided by another program (process), assigned per predetermined unit of text (for example, per sentence or per sentence constituent), or generated by the text analysis unit 1 when analyzing the text.
Finally, the speech synthesis unit 6 generates synthesized speech according to the phonemic information and prosodic information specified by the synthesis parameters 501 (step S6).
As described above, according to the present embodiment, the prosodic parameters are normalized before their interpolation processing, so it is possible to generate synthesized speech that has diverse prosodic features, or prosodic features matching one's preferences, while also preserving the target speaker's characteristics.
(the 2nd configuration example of the 1st embodiment)
Next, the 2nd configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st configuration example of the 1st embodiment.
Fig. 11 shows the block diagram of the speech synthesis device of this configuration example.
The difference from the 1st configuration example (Fig. 2) lies in the normalization parameter generation unit 7.
The operation example of this configuration example is basically the same as Fig. 3. In this configuration example, however, the normalization parameter generation of step S3 may also be performed before step S2 or before step S1. Moreover, once the normalization parameter has been generated at the first execution of the flowchart of Fig. 3 (or in other processing), it may be stored in a normalization parameter storage unit (not shown), and the normalization parameter generation of step S3 may be omitted in subsequent executions of the flowchart.
The normalization parameter generation unit 7 of this configuration example statically generates each corresponding normalization parameter from the target speaker's prosodic control dictionary and the one or more reference speakers' prosodic control dictionaries.
Specifically, for example, the mean of all the representative F0 shapes stored in the target speaker's prosodic control dictionary is computed, and likewise the mean of all the representative F0 shapes stored in a reference speaker's prosodic control dictionary. The normalization parameter is then obtained from these means in the same way as in the 1st configuration example.
For example, as described in the 1st configuration example, the normalization parameter may be computed as the difference of these means, optionally with a predetermined threshold added to or multiplied into the difference. Alternatively, taking the reference speaker's mean as the base, the ratio to the target speaker's mean may be computed, again optionally adjusted by adding or multiplying a predetermined threshold, and used as the normalization parameter. As in the 1st configuration example, the dynamic range and a ratio may also be used instead of the mean and a difference. A sketch of this static computation follows.
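A small sketch of the static variant, computing a mean-difference normalization parameter once from the representative F0 shapes stored in each dictionary (the dictionary layout is an assumption, matching the illustrative structure sketched earlier):

```python
import numpy as np

def static_mean_difference(target_shapes: dict[str, list[float]],
                           reference_shapes: dict[str, list[float]]) -> float:
    """Normalization parameter d computed statically: the difference between
    the mean over all representative F0 shapes in the target dictionary and
    the mean over all representative F0 shapes in the reference dictionary."""
    target_mean = np.mean(np.concatenate([np.asarray(s) for s in target_shapes.values()]))
    reference_mean = np.mean(np.concatenate([np.asarray(s) for s in reference_shapes.values()]))
    return float(target_mean - reference_mean)

# illustrative dictionaries of representative F0 shapes (Hz)
target_dict = {"2-mora/acc1": [120.0, 100.0], "3-mora/acc2": [110.0, 130.0, 115.0]}
reference_dict = {"2-mora/acc1": [240.0, 200.0], "3-mora/acc2": [220.0, 260.0, 230.0]}
d = static_mean_difference(target_dict, reference_dict)
# at synthesis time, each reference contour is simply shifted by d
```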
(the 3rd configuration example of the 1st embodiment)
Next, the 3rd configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st and 2nd configuration examples of the present embodiment.
In the 1st and 2nd configuration examples, the normalization parameter is obtained for each target speaker with the target speaker's prosodic parameter as the base, and the prosodic parameters are normalized based on that normalization parameter. As an alternative, a base other than the target speaker's prosodic parameter may be used; for example, the mean of a specified F0 contour may serve as the base instead of the mean of the target speaker's F0 contour.
In this case, a normalization parameter based on the mean of the specified F0 contour may be obtained for the target speaker as well, in the same way as for the reference speakers, and the target speaker's prosodic parameter normalized based on it.
(the 4th configuration example of the 1st embodiment)
Next, the 4th configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st to 3rd configuration examples of the present embodiment.
The configurations of Fig. 2 and Fig. 11 include the normalization parameter generation unit 7, but the normalization parameter may instead be obtained from outside. In that case the normalization parameter generation unit 7 is unnecessary, and the same configuration as Fig. 1 can be adopted.
(the 2nd embodiment)
The 2nd embodiment will now be described.
In the 1st embodiment, interpolation processing is performed after normalization; in the 2nd embodiment, normalization is performed after interpolation processing.
The description of the present embodiment centers on the differences from the 1st embodiment.
Fig. 12 shows an example of the block diagram of the speech synthesis device of the 2nd embodiment.
As shown in Fig. 12, the speech synthesis device of the present embodiment includes a text analysis unit 1, a prosodic control dictionary storage unit 2, a synthesis parameter generation unit 3, a synthesis parameter interpolation unit 5, a normalization unit that performs normalization after interpolation processing (hereinafter referred to as the 2nd normalization unit) 8, and a speech synthesis unit 6.
In the present embodiment, too, the F0 contour is used as the concrete example of a prosodic parameter.
The differences from the 1st embodiment lie in the synthesis parameter interpolation unit 5 and the 2nd normalization unit 8. Before normalization, the synthesis parameter interpolation unit 5 performs interpolation processing, based on arbitrary weight information 901, on the prosodic parameter 301 generated from the target speaker's prosodic control dictionary and the prosodic parameters 302 generated from each reference speaker's prosodic control dictionary. The 2nd normalization unit 8 then normalizes the interpolated prosodic parameter using a predetermined normalization parameter.
In the present embodiment, too, various configuration examples are possible regarding the generation of the normalization parameter and so on. Several configuration examples are described in order below.
(the 1st configuration example of the 2nd embodiment)
First, the 1st configuration example of the present embodiment will be described.
Fig. 13 shows the block diagram of the speech synthesis device of this configuration example.
As shown in Fig. 13, the speech synthesis device of this configuration example includes a text analysis unit 1, a prosodic control dictionary storage unit 2, a synthesis parameter generation unit 3, a synthesis parameter interpolation unit 5, a normalization parameter generation unit 7, a 2nd normalization unit 8, and a speech synthesis unit 6.
Each part is described below.
The text analysis unit 1 and the linguistic information 101 are the same as in the 1st embodiment.
The prosodic control dictionary storage unit 2, the target speaker's prosodic control dictionary, and the reference speakers' prosodic control dictionaries are the same as in the 1st embodiment.
Based on the linguistic information 101, the synthesis parameter generation unit 3 references each prosodic control dictionary to generate the target speaker's synthesis parameters (phonemic parameters and a 1st prosodic parameter) and each reference speaker's synthesis parameters (phonemic parameters and a 2nd prosodic parameter). The prosodic parameter generation unit is a part of the synthesis parameter generation unit 3.
Based on arbitrary weight information 901, the synthesis parameter interpolation unit 5 performs interpolation processing on the target speaker's prosodic parameter 301 and each reference speaker's prosodic parameter 302 to generate a 3rd prosodic parameter, and outputs synthesis parameters 502 comprising the 3rd prosodic parameter and the above phonemic parameters. The prosodic parameter interpolation unit is a part of the synthesis parameter interpolation unit 5.
By a method such as those described in the 1st embodiment, the normalization parameter generation unit 7 generates a normalization parameter 702 from the target speaker's prosodic parameter 301, taking the interpolated prosodic parameter 502 as the base.
By a method such as those described in the 1st embodiment, the 2nd normalization unit 8 normalizes the interpolated prosodic parameter 502 based on the normalization parameter 702, and outputs synthesis parameters 801 comprising the normalized 3rd prosodic parameter and the above phonemic parameters.
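The interpolate-then-normalize order can be made concrete with the same building blocks, assuming the mean-based normalization described in the 1st embodiment (a minimal sketch; names and values are illustrative):

```python
import numpy as np

def synthesize_prosody_2nd_embodiment(tgt: np.ndarray,
                                      refs: list[np.ndarray],
                                      w0: float,
                                      ws: list[float]) -> np.ndarray:
    """2nd embodiment: interpolate first (formula (4)), then normalize the
    interpolated contour back toward the target speaker's F0 mean."""
    mix = (w0 * tgt + sum(w * r for w, r in zip(ws, refs))) / (w0 + sum(ws))
    d = tgt.mean() - mix.mean()   # normalization parameter 702 (mean-based)
    return mix + d                # normalized 3rd prosodic parameter

tgt = np.array([110.0, 130.0, 120.0, 100.0])
ref = np.array([220.0, 260.0, 240.0, 200.0])   # raw (un-normalized) reference
out = synthesize_prosody_2nd_embodiment(tgt, [ref], w0=1.0, ws=[1.0])
print(out.mean(), tgt.mean())  # 115.0 115.0: the target's mean F0 is preserved
```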
The speech synthesis unit 6 generates synthesized speech according to the phonemic information and prosodic information specified by the synthesis parameters 801.
Next, an operation example of this configuration example will be described with reference to Fig. 14.
Here, the F0 contour is used as the concrete example of a prosodic parameter, but as mentioned above, this is not limiting.
First, the text analysis unit 1 generates linguistic information 101 (step S11).
Next, based on the linguistic information 101, the synthesis parameter generation unit 3 references the target speaker's prosodic control dictionary and one or more reference speakers' prosodic control dictionaries to generate each speaker's synthesis parameters (step S12).
Next, based on arbitrary weight information 901, the synthesis parameter interpolation unit 5 performs interpolation processing on the target speaker's prosodic parameter 301 and each reference speaker's prosodic parameter 302 (step S13).
Next, the normalization parameter generation unit 7 dynamically generates a normalization parameter 702 for the interpolated prosodic parameter 502 (step S14). For example, in the one-reference-speaker method described in the 1st embodiment, it suffices to substitute the interpolated prosodic parameter 502 for the reference speaker's prosodic parameter 302.
Next, the 2nd normalization unit 8 normalizes the interpolated prosodic parameter 502 using the normalization parameter 702 (step S15). Here too, in the one-reference-speaker method described in the 1st embodiment, it suffices to substitute the interpolated prosodic parameter 502 for the reference speaker's prosodic parameter 302.
Finally, the speech synthesis unit 6 generates synthesized speech according to the phonemic information and prosodic information specified by the synthesis parameters 801 (step S16).
As described above, according to the present embodiment, the prosodic parameter is normalized after the interpolation processing of the prosodic parameters, so it is possible to generate synthesized speech that has diverse prosodic features, or prosodic features matching one's preferences, while also preserving the target speaker's characteristics.
(the 2nd configuration example of the 2nd embodiment)
Next, the 2nd configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st configuration example of the 2nd embodiment.
Fig. 15 shows the block diagram of the speech synthesis device of this configuration example.
The difference from the 1st configuration example (Fig. 13) lies in the normalization parameter generation unit 7.
The operation example of this configuration example is basically the same as Fig. 14. In this configuration example, however, the normalization parameter generation of step S14 may also be performed before step S13, before step S12, or before step S11. Moreover, once the normalization parameter has been generated at the first execution of the flowchart of Fig. 14 (or in other processing), it may be stored in a normalization parameter storage unit (not shown), and the normalization parameter generation of step S14 may be omitted in subsequent executions of the flowchart.
As in the 2nd configuration example of the 1st embodiment, the normalization parameter generation unit 7 of this configuration example statically generates the normalization parameter from the target speaker's prosodic control dictionary and the one or more reference speakers' prosodic control dictionaries. Specifically, for example, the mean of all the representative F0 shapes stored in the target speaker's prosodic control dictionary is computed, and the mean of all the representative F0 shapes stored in all speakers' prosodic control dictionaries (or a weighted mean with predetermined weights) is computed. The normalization parameter is then obtained from these means in the same way as in the 1st configuration example. The dynamic range may also be used, for example.
(the 3rd configuration example of the 2nd embodiment)
Next, the 3rd configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st and 2nd configuration examples of the present embodiment.
In the 1st and 2nd configuration examples, the normalization parameter is obtained with the target speaker's prosodic parameter as the base, and the prosodic parameter is normalized based on it. As an alternative, a base other than the target speaker's prosodic parameter may be used; for example, the mean of a specified F0 contour may serve as the base instead of the mean of the target speaker's F0 contour.
(the 4th configuration example of the 2nd embodiment)
Next, the 4th configuration example of the present embodiment will be described.
The description here centers on the differences from the 1st to 3rd configuration examples of the present embodiment.
The configurations of Fig. 13 and Fig. 15 include the normalization parameter generation unit 7, but the normalization parameter may instead be obtained from outside. In that case the normalization parameter generation unit 7 is unnecessary, and the same configuration as Fig. 12 can be adopted.
In the embodiments described so far, a model based on representative shapes has been assumed, but other models may also be used, for example the source-filter model typified by speech synthesis based on hidden Markov models, or a vocal tract model. In that case, it suffices to modify the prosodic control dictionaries, the synthesis parameter generation, the normalization parameter generation, and so on appropriately.
For example, in the 1st embodiment, the normalization parameter generation unit may generate the normalization parameter based on a predetermined prosodic parameter statistic corresponding to the target speaker's prosodic control dictionary and a predetermined prosodic parameter statistic corresponding to a reference speaker's prosodic control dictionary.
Likewise, in the 2nd embodiment, the normalization parameter generation unit may generate the 2nd normalization parameter based on a predetermined prosodic parameter statistic corresponding to the target speaker's prosodic control dictionary and a predetermined prosodic parameter statistic corresponding to a reference speaker's prosodic control dictionary (and possibly also on the weight information).
As is understood from the above, according to the embodiments, the prosodic parameters are normalized before or after their interpolation processing, so synthesized speech that has diverse prosodic features while preserving the target speaker's characteristics can be generated.
The processing steps shown in the above embodiments can be executed as a program, i.e., by software. A general-purpose computer system that stores this program in advance and reads it in can obtain the same effects as the speech synthesis device of the above embodiments. The instructions described in the above embodiments are recorded, as a program executable by a computer, on a magnetic disk (floppy disk, hard disk, etc.), an optical disc (CD-ROM, CD-R, CD-RW, DVD-ROM, DVD±R, DVD±RW, etc.), a semiconductor memory, or a similar storage medium. The storage format may be any form as long as the medium is readable by a computer or an embedded system. A computer that reads the program from this storage medium and, based on the program, has its CPU execute the instructions described in it can realize the same operations as the speech synthesis device of the above embodiments. Of course, the computer may also obtain the program, or read it in, through a network.
In addition, based on the instructions of the program installed from the storage medium onto the computer or embedded system, the OS (operating system) running on the computer, or middleware (MW) such as database management software or networking software, may execute a part of each process for realizing the present embodiments.
Furthermore, the storage medium in the present embodiments is not limited to a medium independent of the computer or embedded system; it also includes a storage medium that stores, or temporarily stores, a program downloaded and transmitted over a LAN, the Internet, or the like.
The number of storage media is not limited to one; the case where the processing in the present embodiments is executed from a plurality of media is also included in the storage medium of the present embodiments, and the media may have any structure.
The computer or embedded system in the present embodiments is for executing each process in the present embodiments based on the program stored in the storage medium, and may have any structure, such as a single device (for example, a personal computer or microcomputer) or a system in which a plurality of devices are connected by a network.
The computer in the present embodiments is not limited to a personal computer; it also includes an arithmetic processing device included in information processing equipment, a microcomputer, and the like, and is a general term for equipment and devices that can realize the functions of the present embodiments by means of a program.
While several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, substitutions, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and in the invention described in the claims and its equivalents.

Claims (13)

1. A device comprising:
a text analysis unit that analyzes an input text and outputs linguistic information;
a dictionary storage unit that stores a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers;
a prosodic parameter generation unit that, based on the linguistic information, generates a first prosodic parameter using the first prosodic control dictionary and generates one or more second prosodic parameters using the respective second prosodic control dictionaries;
a normalization unit that normalizes each of the one or more second prosodic parameters based on a normalization parameter;
a prosodic parameter interpolation unit that performs interpolation processing on the first prosodic parameter and the normalized one or more second prosodic parameters based on weight information to generate a third prosodic parameter; and
a speech synthesis unit that generates synthesized speech according to the third prosodic parameter.
2. The device according to claim 1, further comprising:
a normalization parameter generation unit that generates the normalization parameter based on the first prosodic parameter and the one or more second prosodic parameters.
3. The device according to claim 1, further comprising:
a normalization parameter generation unit that generates the normalization parameter based on a predetermined prosodic parameter statistic corresponding to the first prosodic control dictionary and a predetermined prosodic parameter statistic corresponding to the second prosodic control dictionary.
4. The device according to claim 1, wherein the normalization parameter is a predetermined parameter.
5. The device according to claim 1, wherein
the normalization unit also normalizes the first prosodic parameter, and
the prosodic parameter interpolation unit performs interpolation processing on the normalized first prosodic parameter and the normalized one or more second prosodic parameters.
6. A speech synthesis device comprising:
a text analysis unit that analyzes an input text and outputs linguistic information;
a dictionary storage unit that stores a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers;
a prosodic parameter generation unit that, based on the linguistic information, generates a first prosodic parameter using the first prosodic control dictionary and generates one or more second prosodic parameters using the respective second prosodic control dictionaries;
a prosodic parameter interpolation unit that performs interpolation processing on the first prosodic parameter and the one or more second prosodic parameters based on weight information to generate a third prosodic parameter;
a normalization unit that normalizes the third prosodic parameter based on a normalization parameter; and
a speech synthesis unit that generates synthesized speech according to the normalized third prosodic parameter.
7. The device according to claim 6, further comprising:
a normalization parameter generation unit that generates the normalization parameter based on the first prosodic parameter and the generated third prosodic parameter.
8. The device according to claim 6, further comprising:
a normalization parameter generation unit that generates the normalization parameter based on a predetermined prosodic parameter statistic corresponding to the first prosodic control dictionary, a predetermined prosodic parameter statistic corresponding to the second prosodic control dictionary, and the weight information.
9. The device according to claim 6, wherein the normalization parameter is a predetermined parameter.
10. a speech synthesizing method, it is the speech synthesizing method of speech synthesizing device, comprises the following steps:
Text to input is resolved and output language information;
Storage object talker's the 1st prosodic control dictionary and one or more benchmark talker's the 2nd prosodic control dictionary;
Based on described language message, utilize described the 1st prosodic control dictionary to generate the 1st prosodic parameter, and utilize respectively described the 2nd prosodic control dictionary to generate one or more the 2nd prosodic parameters;
Based on normalizing parameter, respectively described one or more the 2nd prosodic parameters are carried out to standardization;
Based on weight information, one or more the 2nd prosodic parameters after described the 1st prosodic parameter and described standardization are carried out to interpolation processing, generate the 3rd prosodic parameter; With
According to described the 3rd prosody generation synthetic video.
11. 1 kinds of speech synthesizing methods, it is the speech synthesizing method of speech synthesizing device, comprises the following steps:
Text to input is resolved and output language information;
Storage object talker's the 1st prosodic control dictionary and one or more benchmark talker's the 2nd prosodic control dictionary;
Based on described language message, utilize described the 1st prosodic control dictionary generate the 1st prosodic parameter and utilize respectively described the 2nd prosodic control dictionary to generate one or more the 2nd prosodic parameters;
Based on weight information, described the 1st prosodic parameter and described one or more the 2nd prosodic parameters are carried out to interpolation processing, generate the 3rd prosodic parameter;
Based on normalizing parameter, described the 3rd prosodic parameter is carried out to standardization; With
According to the 3rd prosody generation synthetic video after described standardization.
12. A storage medium storing a program for speech synthesis, wherein the program causes a computer to execute:
a step of analyzing an input text and outputting linguistic information;
a step of storing a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers;
a step of generating, based on the linguistic information, a first prosodic parameter using the first prosodic control dictionary and one or more second prosodic parameters using the respective second prosodic control dictionaries;
a step of normalizing each of the one or more second prosodic parameters based on a normalization parameter;
a step of performing interpolation processing on the first prosodic parameter and the normalized one or more second prosodic parameters based on weight information to generate a third prosodic parameter; and
a step of generating synthesized speech according to the third prosodic parameter.
13. A storage medium storing a program for speech synthesis, wherein the program causes a computer to execute:
a step of analyzing an input text and outputting linguistic information;
a step of storing a first prosodic control dictionary of a target speaker and a second prosodic control dictionary of each of one or more reference speakers;
a step of generating, based on the linguistic information, a first prosodic parameter using the first prosodic control dictionary and one or more second prosodic parameters using the respective second prosodic control dictionaries;
a step of performing interpolation processing on the first prosodic parameter and the one or more second prosodic parameters based on weight information to generate a third prosodic parameter;
a step of normalizing the third prosodic parameter based on a normalization parameter; and
a step of generating synthesized speech according to the normalized third prosodic parameter.
CN201310357397.5A 2012-08-16 2013-08-16 Sound synthesis device, sound synthesis method and storage medium Pending CN103594082A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012180446A JP5726822B2 (en) 2012-08-16 2012-08-16 Speech synthesis apparatus, method and program
JP180446/2012 2012-08-16

Publications (1)

Publication Number Publication Date
CN103594082A 2014-02-19

Family

ID=50084189

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310357397.5A Pending CN103594082A (en) 2012-08-16 2013-08-16 Sound synthesis device, sound synthesis method and storage medium

Country Status (2)

Country Link
JP (1) JP5726822B2 (en)
CN (1) CN103594082A (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509552B * 2020-11-27 2023-09-26 Beijing Baidu Netcom Science and Technology Co., Ltd. Speech synthesis method, device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2951514B2 * 1993-10-04 1999-09-20 ATR Interpreting Telecommunications Research Laboratories Voice quality control type speech synthesizer
JPH09244693A * 1996-03-07 1997-09-19 NTT Data Tsushin KK Method and device for speech synthesis
JP4455610B2 * 2007-03-28 2010-04-21 Toshiba Corp Prosody pattern generation device, speech synthesizer, program, and prosody pattern generation method
JPWO2010050103A1 * 2008-10-28 2012-03-29 NEC Corp Speech synthesizer

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1014337A2 (en) * 1998-11-30 2000-06-28 Matsushita Electronics Corporation Method and apparatus for speech synthesis whereby waveform segments represent speech syllables
JP2001242882A (en) * 2000-02-29 2001-09-07 Toshiba Corp Method and device for voice synthesis
CN101156196A * 2005-03-28 2008-04-02 Lessac Technologies, Inc. Hybrid speech synthesizer, method and use
CN101490740A * 2006-06-05 2009-07-22 Matsushita Electric Industrial Co., Ltd. Audio combining device
US20110087488A1 (en) * 2009-03-25 2011-04-14 Kabushiki Kaisha Toshiba Speech synthesis apparatus and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110718209A * 2014-02-26 2020-01-21 Microsoft Technology Licensing, LLC Voice font speaker and prosody interpolation
CN107924678A * 2015-09-16 2018-04-17 Toshiba Corp Speech synthesis device, speech synthesis method, speech synthesis program, speech synthesis model learning device, speech synthesis model learning method, and speech synthesis model learning program
CN111128116A * 2019-12-20 2020-05-08 Gree Electric Appliances, Inc. of Zhuhai Voice processing method and device, computing equipment and storage medium
CN111128116B * 2019-12-20 2021-07-23 Gree Electric Appliances, Inc. of Zhuhai Voice processing method and device, computing equipment and storage medium

Also Published As

Publication number Publication date
JP2014038208A (en) 2014-02-27
JP5726822B2 (en) 2015-06-03

Similar Documents

Publication Publication Date Title
JP5293460B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
US9905219B2 (en) Speech synthesis apparatus, method, and computer-readable medium that generates synthesized speech having prosodic feature
JP5471858B2 (en) Database generating apparatus for singing synthesis and pitch curve generating apparatus
JP4080989B2 (en) Speech synthesis method, speech synthesizer, and speech synthesis program
US7809572B2 (en) Voice quality change portion locating apparatus
US8626489B2 (en) Method and apparatus for processing data
JP2007249212A (en) Method, computer program and processor for text speech synthesis
CN104916284A (en) Prosody and acoustics joint modeling method and device for voice synthesis system
JP2017058483A (en) Voice processing apparatus, voice processing method, and voice processing program
JP2017058411A (en) Speech synthesis device, speech synthesis method, and program
CN103403797A (en) Speech synthesis device and speech synthesis method
CN103594082A (en) Sound synthesis device, sound synthesis method and storage medium
JP2014038282A (en) Prosody editing apparatus, prosody editing method and program
JP6669081B2 (en) Audio processing device, audio processing method, and program
CN113724686A (en) Method and device for editing audio, electronic equipment and storage medium
JP6091938B2 (en) Speech synthesis dictionary editing apparatus, speech synthesis dictionary editing method, and speech synthesis dictionary editing program
JP2013164609A (en) Singing synthesizing database generation device, and pitch curve generation device
JP2008256942A (en) Data comparison apparatus of speech synthesis database and data comparison method of speech synthesis database
JP4247289B1 (en) Speech synthesis apparatus, speech synthesis method and program thereof
JP2003345400A (en) Method, device, and program for pitch conversion
JP2009175345A (en) Speech information processing device and its method
JP6400526B2 (en) Speech synthesis apparatus, method thereof, and program
CN111191421B (en) Text processing method and device, computer storage medium and electronic equipment
JP6314828B2 (en) Prosody model learning device, prosody model learning method, speech synthesis system, and prosody model learning program
CN112164387A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140219