WO2014017024A1 - Speech synthesizer, speech synthesizing method, and speech synthesizing program - Google Patents

Speech synthesizer, speech synthesizing method, and speech synthesizing program Download PDF

Info

Publication number
WO2014017024A1
Authority
WO
WIPO (PCT)
Prior art keywords
waveform generation
speech
generation parameter
unit
segment
Prior art date
Application number
PCT/JP2013/004023
Other languages
French (fr)
Japanese (ja)
Inventor
正徳 加藤
玲史 近藤
康行 三井
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2014526737A priority Critical patent/JPWO2014017024A1/en
Publication of WO2014017024A1 publication Critical patent/WO2014017024A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 13/07 - Concatenation rules
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a speech synthesis technique, and more particularly to a speech synthesizer, a speech synthesis method, and a speech synthesis program for synthesizing speech based on input text.
  • a speech synthesizer that analyzes an input character string and generates synthesized speech from the speech information indicated by the character string is known. Such a speech synthesizer first generates prosodic information of the synthesized speech, namely sound pitch (pitch), sound length (phoneme duration), sound volume (power), and the like, based on a language processing result obtained by analyzing the input character string.
  • the speech synthesizer then selects a plurality of optimal segments from a segment dictionary based on the language processing result and the generated prosodic information (referred to as "target prosodic information"), and an optimal segment sequence is obtained.
  • the segment is sometimes referred to as a speech segment, and is generated in advance for each semi-syllable, for example, based on the recorded speech.
  • a plurality of types of segments are generated from various recorded voices for a single sound (here, a sound of about half a syllable).
  • a synthesized speech can be obtained by forming a waveform generation parameter sequence from the optimal segment sequence and generating a speech waveform from the sequence.
  • Segments stored in the segment dictionary are extracted and generated from a large amount of natural speech using various methods.
  • such a speech synthesizer generates a speech waveform whose prosody is close to the generated prosodic information in order to ensure high sound quality when generating the synthesized speech waveform from the selected segments. For this purpose, for example, the method described in Non-Patent Document 1 is used both to generate the synthesized speech waveform and to generate the segments used for generating it.
  • FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1.
  • the waveform generation parameter generated by the method described in Non-Patent Document 1 is a waveform (pitch waveform) cut out from the recorded speech waveform by a window function whose time width is calculated from the pitch, centered on a pitch synchronization position that is also calculated from the pitch of the recorded speech.
  • the waveform generation parameter (pitch waveform) is selected from the segment based on the pitch generated from the language processing result, that is, the pitch of the synthesized speech.
  • a synthesized speech waveform is generated by concatenating the selected pitch waveforms. The selection of the pitch waveform is basically performed based on the correspondence between the pitch synchronization positions of the recorded voice and the synthesized voice.
  • Non-Patent Document 7 describes that a power spectrum, a linear prediction coefficient, a cepstrum, a mel cepstrum, an LSP (Line Spectrum Pair), and the like can be used as waveform generation parameters in addition to the pitch waveform.
  • however, the method described in Non-Patent Document 1 has the problem that the sound quality of the synthesized speech deteriorates because appropriate waveform generation parameters are not always selected.
  • in that method, the waveform generation parameters are selected so that the target prosodic information is faithfully reproduced for each speech unit, based on predetermined boundary positions of the segments. As a result, thinning and insertion of waveform generation parameters are repeated many times, the temporal change of the spectrum of the synthesized speech becomes biased, and a smooth spectral change is difficult to realize. This causes the above problem.
  • an object of the present invention is therefore to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating synthesized speech with a smooth spectral change in sections for which segments that are continuous in the recorded speech are selected.
  • the speech synthesizer according to the present invention includes a unit selection unit that selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection unit that selects waveform generation parameters extracted from the speech units; and a waveform generation unit that generates synthesized speech using the selected waveform generation parameters. The waveform generation parameter selection unit generates a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech, in consideration of the continuity of the selected speech units, and selects the waveform generation parameters based on that function.
  • the speech synthesis method according to the present invention selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech; selects, based on the waveform generation parameter selection function, waveform generation parameters extracted from the speech units; and generates synthesized speech using the selected waveform generation parameters.
  • the speech synthesis program according to the present invention causes a computer to execute: a unit selection process for selecting, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection process for generating, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech, and selecting, based on that function, waveform generation parameters extracted from the speech units; and a waveform generation process for generating synthesized speech using the selected waveform generation parameters.
  • FIG. 1 is a block diagram showing the configuration of a first embodiment (Embodiment 1) of a speech synthesizer according to the present invention.
  • the speech synthesizer of this embodiment includes a language processing unit 1, a prosody generation unit 2, a segment selection unit 3, a waveform generation unit 4, and a segment information storage unit 10.
  • the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7.
  • the voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
  • the unit information storage unit 10 stores speech unit information representing speech units and attribute information representing attributes of each speech unit.
  • a speech segment is a part of basic speech (speech uttered by humans (natural speech)) that serves as the basis of the speech synthesis processing, and is generated by dividing the basic speech into speech synthesis units.
  • the speech unit information includes time series data of waveform generation parameters extracted from the speech unit and used for generating a synthesized speech waveform.
  • a pitch waveform is used in the following description, but may be, for example, a power spectrum, a linear prediction coefficient, a cepstrum, a mel cepstrum, or an LSP (see Non-Patent Document 7).
  • it is preferable to use a linear prediction coefficient, an LSP, or the like as the waveform generation parameter, particularly when the data amount of the segments needs to be reduced.
  • in this embodiment, the speech synthesis unit is a syllable. Note that the speech synthesis unit may instead be a phoneme, a half-phoneme, or a semi-syllable such as CV (Consonant-Vowel), CVC, or VCV, as disclosed in Patent Document 2.
  • Attribute information includes language information including information representing a character string (recorded sentence) corresponding to basic speech and prosodic information of basic speech.
  • the language information is, for example, information expressed in a kanji / kana mixed sentence.
  • the language information may include information such as readings, syllable strings, phoneme strings, accent positions, accent phrase breaks, morpheme parts of speech.
  • the prosodic information includes the pitch (fundamental frequency), the amplitude, a time series of short-time power, and the durations of the syllables, phonemes, and pauses included in the natural speech.
  • the language processing unit 1 analyzes the character string of the input text. Specifically, the language processing unit 1 performs analysis such as morphological analysis, syntactic analysis, and reading assignment. Then, based on the analysis results, the language processing unit 1 outputs information representing a symbol string of the "reading", such as phoneme symbols, together with information representing the parts of speech, conjugations, accent types, and the like of the morphemes, to the prosody generation unit 2 and the segment selection unit 3.
  • the prosody generation unit 2 generates the prosody of the synthesized speech based on the language analysis result output from the language processing unit 1, and outputs prosodic information indicating the generated prosody, as target prosodic information, to the unit selection unit 3 and the waveform generation unit 4. For example, the method described in Patent Document 3 is used to generate the prosody.
  • the segment selection unit 3 selects segments that satisfy a predetermined requirement from the segments stored in the segment information storage unit 10, based on the language analysis result and the target prosodic information, and outputs the selected segments and their attribute information to the waveform generation unit 4.
  • based on the input language analysis result and the target prosodic information, the segment selection unit 3 generates, for each speech synthesis unit, information indicating the characteristics of the synthesized speech (hereinafter referred to as the "target segment environment").
  • the target segment environment includes information such as the corresponding phoneme that constitutes the synthesized speech for which the target segment environment is generated, the preceding phoneme (the phoneme before the corresponding phoneme), the succeeding phoneme (the phoneme after the corresponding phoneme), the presence or absence of stress, the distance from the accent nucleus, the pitch frequency for each speech synthesis unit, the power, the duration of the speech synthesis unit, the cepstrum, the MFCC (Mel Frequency Cepstral Coefficients), and their amounts of change per unit time.
  • the segment selection unit 3 acquires, for each synthesis unit, a plurality of segments corresponding to consecutive phonemes from the segment information storage unit 10, based on the information included in the generated target segment environment. That is, the segment selection unit 3 acquires a plurality of segments corresponding to each of the corresponding phoneme, the preceding phoneme, and the succeeding phoneme.
  • the acquired segment is a candidate for a segment used to generate a synthesized speech, and is hereinafter referred to as a candidate segment.
  • the unit selection unit 3 calculates, for each combination of the acquired candidate segments (for example, a combination of a candidate segment corresponding to the corresponding phoneme and a candidate segment corresponding to the preceding phoneme), a cost, which is an index indicating the appropriateness of using the combination for speech synthesis.
  • the cost is calculated from the difference between the target segment environment and the attribute information of a candidate segment, and from the differences between the attribute information of adjacent candidate segments.
  • the cost, which is the value of this calculation, decreases as the similarity between the synthesized speech characteristics indicated by the target segment environment and the candidate segment increases, that is, as the appropriateness for synthesizing the speech increases. Likewise, the smaller the difference in attribute information between adjacent candidate segments, that is, the smaller the gap at segment connection, the lower the cost. The lower the cost, the higher the naturalness, i.e., the degree to which the synthesized speech resembles speech uttered by humans. Therefore, the segment selection unit 3 selects the segments with the lowest calculated cost.
  • the cost calculated by the segment selection unit 3 includes a unit cost and a connection cost.
  • the unit cost indicates the degree of sound quality degradation estimated to occur when the candidate segment is used in the environment indicated by the target segment environment.
  • the unit cost is calculated based on the similarity between the attribute information of the candidate segment and the target segment environment.
  • the connection cost indicates the degree of sound quality degradation estimated to be caused by discontinuity of the segment environments between connected speech segments.
  • the connection cost is calculated based on the affinity of the segment environments between adjacent candidate segments.
  • various generally proposed methods are used for calculating the unit cost and the connection cost.
  • the segment selection unit 3 selects, from the candidate segments, the segments of the combination that minimizes the calculated cost as the segments most suitable for speech synthesis.
  • the segment selected by the segment selection unit 3 is referred to as “optimal segment”.
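As an illustration of the selection described above, the following sketch picks one candidate segment per synthesis unit by dynamic programming over unit costs and connection costs. The squared-difference cost functions and the tuple attribute vectors are illustrative assumptions; the patent does not specify concrete cost formulas.

```python
# Hypothetical sketch of cost-based segment selection. Attributes are plain
# numeric tuples; real systems use richer features and weighted costs.

def unit_cost(candidate, target):
    """Mismatch between a candidate's attributes and the target environment."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

def connection_cost(prev_candidate, candidate):
    """Mismatch between adjacent candidates at the connection boundary."""
    return sum((a - b) ** 2 for a, b in zip(prev_candidate, candidate))

def select_segments(candidates_per_unit, targets):
    """Viterbi search: choose one candidate per unit minimizing total cost."""
    n = len(targets)
    # best[i][j] = (lowest total cost ending in candidate j of unit i, backpointer)
    best = [{j: (unit_cost(c, targets[0]), None)
             for j, c in enumerate(candidates_per_unit[0])}]
    for i in range(1, n):
        layer = {}
        for j, c in enumerate(candidates_per_unit[i]):
            pj, prev_total = min(
                ((pj, best[i - 1][pj][0]
                  + connection_cost(candidates_per_unit[i - 1][pj], c))
                 for pj in best[i - 1]),
                key=lambda pair: pair[1])
            layer[j] = (prev_total + unit_cost(c, targets[i]), pj)
        best.append(layer)
    # trace back the lowest-cost path
    j = min(best[-1], key=lambda j: best[-1][j][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))
```

In the usage below, the combination with both low unit costs and a small boundary gap wins over candidates far from the target.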
  • based on the target prosodic information supplied from the prosody generation unit 2 and the selected segments and their attribute information supplied from the segment selection unit 3, the waveform generation unit 4 generates speech waveforms having a prosody that matches or is similar to the target prosody, and connects the generated speech waveforms to produce the synthesized speech.
  • the segments represented by the segment information supplied from the segment selection unit 3 are classified into segments composed of voiced sound and segments composed of unvoiced sound.
  • the method used for performing prosody control for voiced sound and the method used for performing prosody control for unvoiced sound are different from each other. Therefore, the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7 that connects voiced sound and unvoiced sound.
  • the unvoiced sound generation unit 6 generates an unvoiced sound waveform having a prosody that matches or is similar to the prosodic information supplied from the prosody generation unit 2 based on the segments supplied from the segment selection unit 3.
  • since an unvoiced sound segment supplied from the segment selection unit 3 is a speech waveform that has been cut out, the unvoiced sound generation unit 6 can generate an unvoiced sound waveform using the method described in Non-Patent Document 4. Alternatively, the method described in Non-Patent Document 5 may be used.
  • the voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
  • the waveform generation parameter selection unit 50 selects a waveform generation parameter used to generate a voiced sound waveform based on the segment information supplied from the segment selection unit 3 and the prosody information supplied from the prosody generation unit 2.
  • FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit 50.
  • the waveform generation parameter selection unit 50 generates a function for determining which waveform generation parameter is arranged on the time axis of the synthesized speech from the time length of the optimum segment and the target time length (step S1). Since this function is a function used for selecting a waveform generation parameter, in the present embodiment, this function is referred to as a “waveform generation parameter selection function”.
  • for example, the waveform generation parameter selection unit 50 generates, for each optimal segment, a linear function such as the following equation (1) as the waveform generation parameter selection function.
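The image of equation (1) is not reproduced in this text. A plausible reconstruction, assuming (consistent with the surrounding description) that T_u denotes the time length of the optimal segment and T_o the target time length, is a linear mapping from the synthesized-speech time t to a position on the segment's time axis:

```latex
% Hypothetical reconstruction of equation (1): a linear selection function
% mapping synthesized-speech time t to a position on the segment's time axis.
F_{u}(t) = \frac{T_{u}}{T_{o}}\, t \qquad (0 \le t \le T_{o}) \tag{1}
```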
  • the waveform generation parameter selection unit 50 checks whether each selected segment is continuous with its subsequent segment (step S2).
  • being continuous with the subsequent segment means being continuous in the source recorded speech stored in the segment information storage unit 10.
  • suppose that the unit of a segment is a syllable, that the syllable of the segment being checked (hereinafter referred to as the "preceding segment") is "u", and that the syllable of the subsequent segment is "ma". If the preceding and subsequent segments are selected from different recorded utterances, for example "ushi" and "mari", the preceding and subsequent segments are discontinuous.
  • if both are selected from consecutive sections of the same recorded utterance, for example "umai" (delicious), the preceding and subsequent segments are continuous.
  • when they are continuous, the waveform generation parameter selection unit 50 obtains a common waveform generation parameter selection function used for both from the waveform generation parameter selection functions of the preceding and subsequent segments. For example, if the time lengths of the preceding and subsequent optimal segments are T u1 and T u2 and the target time lengths are T o1 and T o2 , a polygonal line function such as the following equation (2) is obtained.
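The image of equation (2) is likewise not reproduced. Given the description (a common polygonal line covering the preceding segment over [0, T_o1] and the subsequent segment over (T_o1, T_o1 + T_o2]), a plausible reconstruction is:

```latex
% Hypothetical reconstruction of equation (2): a common piecewise-linear
% selection function for a continuous preceding/subsequent segment pair.
F_{u2}(t) =
\begin{cases}
\dfrac{T_{u1}}{T_{o1}}\, t & (0 \le t \le T_{o1}) \\[1.5ex]
T_{u1} + \dfrac{T_{u2}}{T_{o2}}\,(t - T_{o1}) & (T_{o1} < t \le T_{o1} + T_{o2})
\end{cases} \tag{2}
```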
  • FIG. 3 is an explanatory diagram showing assignment of waveform generation parameters.
  • FIG. 3 shows an example in which waveform generation parameters are assigned in accordance with the target time lengths when the segments are continuous. The "Nth segment" represents the preceding segment, and the "N+1th segment" represents the subsequent segment.
  • FIG. 4 is an explanatory diagram showing an example in which Fu2 (t) is plotted based on the assignment shown in FIG.
  • the waveform generation parameter selection unit 50 corrects the waveform generation parameter selection function used to select appropriate waveform generation parameters from the preceding and subsequent optimal segments, and obtains a waveform generation parameter selection function that takes continuity into consideration (step S3). Several methods for obtaining the corrected waveform generation parameter selection function are described below.
  • FIG. 5 is an explanatory diagram showing a first example of a waveform generation parameter selection function.
  • the first example of the waveform generation parameter selection function is generated by introducing a straight line passing through the midpoints of the preceding and subsequent segments.
  • in this case, a polygonal line function such as the following equation (3) is used as the waveform generation parameter selection function.
  • FIG. 6 is an explanatory diagram illustrating a second example of the waveform generation parameter selection function.
  • the second example of the waveform generation parameter selection function shown in FIG. 6 is obtained based on a linear function that connects the start of the preceding segment and the end of the subsequent segment. For example, as shown in FIG. 6, a polygonal line function passing through the intersection (T o1 , Q) of the segment connection boundary line and the straight-line function, and through the midpoint of the end of the preceding segment (T o1 , T u1 ), is used as the waveform generation parameter selection function.
  • in equation (4), T um is expressed as in equation (5) below.
  • FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function.
  • the third example of the waveform generation parameter selection function shown in FIG. 7 is obtained by smoothing the polygonal line function Fu2 (t).
  • as the smoothing method, for example, the polygonal line function is regarded as a time series and smoothed by a moving average or by first-order leaky integration.
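The two smoothing methods mentioned here can be sketched as follows, treating a sampled selection function as a time series. The window size and the leak coefficient are illustrative choices, not values from the patent.

```python
# Sketch of the two smoothing methods: centered moving average and
# first-order leaky integration, applied to a sampled selection function.

def moving_average(values, window=3):
    """Centered moving average; the edges use only the available neighbors."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def leaky_integration(values, alpha=0.5):
    """First-order leaky integrator: y[n] = alpha*y[n-1] + (1-alpha)*x[n]."""
    out = []
    y = values[0]
    for x in values:
        y = alpha * y + (1 - alpha) * x
        out.append(y)
    return out
```

Either filter rounds off the corner of the polygonal line at the segment connection boundary, which is the effect the correction is after.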
  • the waveform generation parameter selection unit 50 smoothes the change in the slope of the waveform generation parameter selection function by using the methods of the first to third examples. Thereby, the speech synthesizer of this embodiment can generate synthesized speech with a smooth spectrum change.
  • the above correction methods have been described on the assumption that the waveform generation parameter selection function to be corrected is a polygonal line function, but the same methods can be applied to functions other than polygonal lines, such as curves. Further, regarding the first example shown in FIG. 5, the corrected waveform generation parameter selection function was described as passing through the midpoints of the preceding and subsequent segments, but it may instead be a function that passes through points other than the midpoints. Likewise, regarding the second example shown in FIG. 6, the corrected waveform generation parameter selection function was described as passing through the intersection (T o1 , Q) of the segment connection boundary line and the straight-line function and the end of the preceding segment (T o1 , T u1 ), but it may also be a function that passes through points other than the midpoint.
  • the waveform generation parameter selection unit 50 calculates pitch synchronization times (also referred to as pitch marks) from the pitch time series generated by the prosody generation unit 2 (step S4).
  • a method for calculating the pitch synchronization position from the pitch time series is described in Non-Patent Document 6, for example.
  • the waveform generation unit 4 may calculate the pitch synchronization position by the method described in Non-Patent Document 6.
  • the waveform generation parameter selection unit 50 uses the waveform generation parameter selection function to select the waveform generation parameter closest to the pitch synchronization time (step S5).
  • specifically, the ideal waveform generation parameter position (time) is first calculated from each pitch synchronization position of the synthesized speech using the waveform generation parameter selection function.
  • the waveform generation parameter selection unit 50 then adopts the waveform generation parameter closest to that time. For example, suppose the position of the nth waveform generation parameter is 100 milliseconds, the position of the (n+1)th waveform generation parameter is 180 milliseconds, and the time obtained by the waveform generation parameter selection function is 160 milliseconds. In this case, the (n+1)th waveform generation parameter is selected.
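Step S5 can be sketched as follows. The identity selection function and the millisecond positions in the test are illustrative, echoing the numeric example above.

```python
# Sketch of step S5: map each pitch synchronization time of the synthesized
# speech through the selection function, then take the waveform generation
# parameter whose position on the segment's time axis is nearest.

def select_parameters(pitch_marks, param_times, selection_function):
    """Return, for each pitch mark, the index of the nearest parameter."""
    chosen = []
    for t in pitch_marks:
        ideal = selection_function(t)  # ideal position on the segment's axis
        chosen.append(min(range(len(param_times)),
                          key=lambda i: abs(param_times[i] - ideal)))
    return chosen
```

With parameters at 100 ms and 180 ms and a mapped time of 160 ms, the 180 ms parameter is chosen, matching the example in the text.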
  • FIG. 8 is an explanatory diagram showing a state in which a voiced sound waveform is generated from two speech segments composed of nine waveform generation parameters.
  • the function shown in FIG. 5 is used as the waveform generation parameter selection function.
  • in the example of FIG. 8, the waveform generation parameters corresponding to the pitch synchronization times are the 1st, 3rd, 4th, 5th, 6th, 7th, 8th, 8th (selected twice), and 9th waveform generation parameters. The waveform generation unit 4 generates a waveform using these waveform generation parameters.
  • the voiced sound waveform generator 51 generates a voiced sound waveform based on the waveform generation parameters supplied from the waveform generation parameter selector 50 and the prosody information supplied from the prosody generator 2.
  • the voiced sound waveform generator 51 generates a voiced sound waveform by arranging the center of each selected waveform generation parameter at the pitch synchronization time.
  • when the waveform generation parameters are pitch waveforms, the voiced sound waveform generation unit 51 generates a voiced sound waveform by arranging each pitch waveform at its pitch synchronization time.
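A minimal sketch of this arrangement (pitch-synchronous overlap-add): each selected pitch waveform is centered at its pitch synchronization time and summed into an output buffer. Representing pitch marks as discrete sample indices is an assumption for illustration.

```python
# Sketch of pitch-synchronous overlap-add: center each pitch waveform at its
# pitch mark (a sample index) and accumulate into the output buffer.

def overlap_add(pitch_waveforms, pitch_marks, length):
    """Place each pitch waveform centered at its pitch mark and sum."""
    out = [0.0] * length
    for wave, mark in zip(pitch_waveforms, pitch_marks):
        start = mark - len(wave) // 2  # center the waveform on the mark
        for k, sample in enumerate(wave):
            idx = start + k
            if 0 <= idx < length:      # clip at the buffer edges
                out[idx] += sample
    return out
```

Spacing the pitch marks closer together or farther apart is what realizes the target pitch of the synthesized speech.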
  • the waveform connecting unit 7 connects the voiced sound waveform supplied from the voiced sound generating unit 5 and the unvoiced sound waveform supplied from the unvoiced sound generating unit 6 and outputs it as a synthesized speech waveform.
  • that is, the voiced sound waveform v(t) and the unvoiced sound waveform u(t) are concatenated to generate and output the synthesized speech waveform x(t).
  • as described above, the speech synthesizer of this embodiment corrects the waveform generation parameter selection function in consideration of continuity. Therefore, in sections for which segments that are continuous in the recorded speech are selected, the speech synthesizer of this embodiment can generate synthesized speech whose spectral change is smoother than that of the general method disclosed in Non-Patent Document 1 and the like.
  • next, a speech synthesizer according to a second embodiment (Embodiment 2) of the present invention will be described.
  • the speech synthesizer of the second embodiment differs from that of the first embodiment in that the degree of spectral change is estimated from the attribute information of the speech units, and the waveform generation parameter selection function is controlled based on the estimated degree of spectral change. The differences are therefore mainly described below.
  • FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention.
  • compared with the configuration of the speech synthesizer of the first embodiment shown in FIG. 1, the configuration of this embodiment shown in FIG. 9 replaces the waveform generation parameter selection unit 50 with a waveform generation parameter selection unit 60, and newly includes a spectrum shape change degree estimation unit 62.
  • the spectrum shape change degree estimation unit 62 estimates the degree of change of the spectrum shape at the unit connection boundary based on the unit attribute information supplied from the unit information storage unit 10.
  • the spectrum shape change degree estimation unit 62 uses language information and prosodic information included in the attribute information for estimation of the change degree of the spectrum shape.
  • a method of estimating the rate of change of the speech spectrum shape for each phoneme type is effective. For example, if the unit obtained by combining the preceding and subsequent segments is a syllable of a long vowel, the change in spectrum shape at the segment connection boundary is small, so the estimated degree of spectrum shape change is reduced. The same applies when the preceding and subsequent segments are the same phoneme. If the preceding or subsequent segment is a voiced consonant, the change in spectrum shape at the segment connection boundary is large, so the estimated degree of spectrum shape change is increased.
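The heuristic described here can be sketched as a simple rule table. The phoneme set and the numeric magnitudes below are illustrative assumptions, not values from the patent.

```python
# Rule-based sketch of spectrum-shape change estimation at a segment
# connection boundary. The consonant set and return values are illustrative.

VOICED_CONSONANTS = {"b", "d", "g", "m", "n", "r", "z"}  # hypothetical set

def estimate_spectrum_change(prev_phoneme, next_phoneme):
    """Return a relative estimate of spectral-shape change at the boundary."""
    if prev_phoneme == next_phoneme:
        return 0.1   # same phoneme (including long vowels): small change
    if prev_phoneme in VOICED_CONSONANTS or next_phoneme in VOICED_CONSONANTS:
        return 0.9   # voiced consonant at the boundary: large change
    return 0.5       # otherwise: moderate change
```

The waveform generation parameter selection unit 60 would then lengthen or shorten the correction section in inverse proportion to this estimate.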
  • the waveform generation parameter selection unit 60 selects the waveform generation parameters used for generating the voiced sound waveform based on the segment information supplied from the segment selection unit 3, the prosodic information supplied from the prosody generation unit 2, and the spectrum shape change degree supplied from the spectrum shape change degree estimation unit 62.
  • the waveform generation parameter selection unit 60 generates a waveform generation parameter selection function based on the estimated amount of spectrum shape change.
  • the waveform generation parameter selection unit 60 adjusts the length of the correction section, for example, when using the selection function shown in FIG.
  • the waveform generation parameter selection unit 60 makes the spectrum shape smoother by lengthening the correction section if the degree of change in the spectrum shape is small.
  • the waveform generation parameter selection unit 60 adjusts the length of the correction section according to the magnitude of the spectrum shape change degree.
  • the waveform generation parameter selection unit 60 similarly adjusts the distance between the end of the preceding segment on the segment boundary and the corrected selection function.
  • the waveform generation parameter selection unit 60 increases the distance between the end of the preceding segment and the corrected selection function on the segment boundary if the degree of change in the spectrum shape is small.
  • the waveform generation parameter selection function is controlled according to the attribute information of the speech unit.
  • the speech synthesizer of this embodiment can generate synthesized speech with a smooth spectrum change, particularly in a section where the degree of change in spectrum shape is small.
  • the present invention is not limited to the speech synthesizer described in each embodiment, and the configuration and operation thereof can be changed as appropriate without departing from the spirit of the invention.
  • FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention.
  • the speech synthesizer according to the present invention includes, as its main configuration, a segment selection unit 3 that selects the speech units to be used for synthesis from a plurality of previously stored speech units based on an input character string, and a waveform generation unit 4 that includes a waveform generation parameter selection unit 50 for selecting waveform generation parameters extracted from the speech units and that generates synthesized speech using the selected waveform generation parameters.
  • the waveform generation parameter selection unit 50 also generates a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech.
  • the waveform generation parameters are selected based on the waveform generation parameter selection function.
  • speech synthesis apparatuses as shown in the following (1) to (4) are also disclosed.
  • the waveform generation parameter selection unit generates a waveform generation parameter selection function that connects a first function, which connects the start and end of a preceding segment (one of the selected speech units), with a second function, which connects the start and end of a succeeding segment that follows the preceding segment; if the preceding segment and the succeeding segment are continuous, the speech synthesizer corrects the waveform generation parameter selection function so as to smooth the change in its slope.
  • the waveform generation parameter selection unit can also smooth the change in slope by correcting the waveform generation parameter selection function so that it passes through an internal dividing point of the straight line connecting the end of the preceding segment and the point, at the end time of the preceding segment on the time axis of the synthesized speech, on the straight line connecting the start of the preceding segment and the end of the succeeding segment.
  • the waveform generation parameter selection unit may also generate the waveform generation parameter selection function so as to smooth the change in slope by correcting it using a line connecting an internal dividing point of the first function and an internal dividing point of the second function.
  • the speech synthesizer includes a spectral shape change degree estimation unit (for example, a spectral shape change degree estimation unit 62) that estimates the spectral change degree at the connection boundary of the speech unit based on the attribute information of the speech unit.
  • the waveform generation parameter selection unit may be configured to generate a waveform generation parameter selection function based on the estimated degree of spectrum change.
  • the present invention can be applied to information providing services using synthesized speech.
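The configuration summarized above can be illustrated with a small sketch. This is not the patented implementation; the function shapes follow equation (2) and the FIG. 6 style correction described later in the specification, and all function and variable names here are hypothetical:

```python
def selection_function(t, T_o1, T_o2, T_u1, T_u2):
    """Uncorrected piecewise-linear map from synthesized-speech time t to
    segment time: each segment is stretched independently (cf. equation (2))."""
    if t <= T_o1:
        return (T_u1 / T_o1) * t
    return T_u1 + (T_u2 / T_o2) * (t - T_o1)

def corrected_selection_function(t, T_o1, T_o2, T_u1, T_u2):
    """Correction in the spirit of the second example (FIG. 6): when the
    segments are continuous, the polyline passes at the segment boundary
    through T_um, the midpoint of Q (the start-to-end straight line's value
    at the boundary) and T_u1, which smooths the slope change."""
    slope = (T_u1 + T_u2) / (T_o1 + T_o2)   # straight line from start to end
    Q = slope * T_o1
    T_um = (Q + T_u1) / 2.0
    if t <= T_o1:
        return (T_um / T_o1) * t
    return T_um + ((T_u1 + T_u2 - T_um) / T_o2) * (t - T_o1)

def select_parameter_indices(synth_times, func, param_times):
    """For each pitch-synchronous position of the synthesized speech, map it
    into segment time with the selection function and take the nearest
    waveform generation parameter (e.g. pitch waveform)."""
    indices = []
    for t in synth_times:
        u = func(t)
        indices.append(min(range(len(param_times)),
                           key=lambda i: abs(param_times[i] - u)))
    return indices
```

Note that the corrected function still reaches the same endpoint as the uncorrected one, so the total target duration is preserved; only the distribution of parameters around the segment boundary changes.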

Abstract

Provided is a speech synthesizer capable of generating synthesized speech with a smooth spectrum change during a section in which consecutive phones in recorded speech are selected. The speech synthesizer comprises: a phone selection unit (3) for selecting phones used for synthesis from a plurality of preliminarily stored phones on the basis of an input text string; and a waveform generation unit (4) which includes a waveform generation parameter selection unit (50) for selecting waveform generation parameters extracted from the phones and generates synthesized speech using the selected waveform generation parameters. Taking into account the continuity of the selected phones, the waveform generation parameter selection unit (50) generates a waveform generation parameter selection function indicating where the waveform generation parameters on the time axis of the phones are to be placed on the time axis of the synthesized speech, and selects the waveform generation parameters on the basis of the waveform generation parameter selection function.

Description

Speech synthesizer, speech synthesis method, and speech synthesis program
 The present invention relates to speech synthesis technology, and more particularly to a speech synthesizer, a speech synthesis method, and a speech synthesis program for synthesizing speech based on input text.
 A speech synthesizer that analyzes an input character string and generates synthesized speech from the speech information indicated by the character string is known. Such a speech synthesizer first generates prosodic information for the synthesized speech (information on pitch, phoneme duration, power, and the like) based on the language processing result obtained by analyzing the input character string.
 Next, the speech synthesizer selects a plurality of optimal segments from a segment dictionary based on the language processing result and the generated prosodic information (referred to as "target prosodic information"), and creates a single optimal segment sequence. A segment, sometimes called a speech unit, is generated in advance, for example for each half-syllable, based on recorded speech. In general, a plurality of types of segments are generated from various recorded voices for one sound (here, a sound of about half a syllable). A waveform generation parameter sequence is then formed from the optimal segment sequence, and synthesized speech is obtained by generating a speech waveform from that sequence. The segments stored in the segment dictionary are extracted and generated from a large amount of natural speech using various methods.
 When generating a synthesized speech waveform from the selected segments, such a speech synthesizer creates, from the segments, a speech waveform whose prosody is close to the generated prosodic information, in order to ensure high sound quality. As a method of generating both the synthesized speech waveform and the segments used to generate it, for example, the method described in Non-Patent Document 1 is used.
 FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1. As shown in FIG. 11, a waveform generation parameter generated by the method described in Non-Patent Document 1 is a waveform (pitch waveform) cut out from the speech waveform using a window function centered at a pitch synchronization position calculated from the pitch of the recorded speech and having a time width calculated from that pitch. When a synthesized speech waveform is generated by the method of Non-Patent Document 1, waveform generation parameters (pitch waveforms) are selected from the waveform generation parameter sequence based on the pitch generated from the language processing result, that is, the pitch of the synthesized speech. A synthesized speech waveform is then generated by concatenating the selected pitch waveforms. The selection of pitch waveforms is basically performed based on the correspondence between the pitch synchronization positions of the recorded speech and those of the synthesized speech.
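The pitch-synchronous selection described above can be sketched schematically as follows. This is only an illustration of the idea (constant pitch and a nearest-neighbor mapping are assumed for simplicity), not the actual method of Non-Patent Document 1, and all names are hypothetical:

```python
def synth_pitch_marks(duration, f0):
    """Place pitch-synchronous positions for the synthesized speech at
    intervals of one pitch period 1/f0 (constant pitch assumed)."""
    period = 1.0 / f0
    n = int(duration / period)          # number of full periods that fit
    return [i * period for i in range(n)]

def choose_pitch_waveforms(synth_marks, recorded_marks):
    """For each synthesized pitch mark, pick the index of the nearest recorded
    pitch mark; the pitch waveform cut out around that recorded mark is then
    overlap-added at the synthesized position."""
    return [min(range(len(recorded_marks)),
                key=lambda i: abs(recorded_marks[i] - m))
            for m in synth_marks]
```

When the target pitch differs strongly from the recorded pitch, this nearest-neighbor correspondence repeats or skips pitch waveforms, which is exactly the thinning/insertion behavior criticized below.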
 Note that Non-Patent Document 7 describes that, in addition to a pitch waveform, a power spectrum, linear prediction coefficients, a cepstrum, a mel-cepstrum, an LSP (Line Spectrum Pair), and the like can be used as waveform generation parameters.
 However, the waveform generation method described in Non-Patent Document 1 has the problem that appropriate waveform generation parameters are not selected, and the sound quality of the synthesized speech deteriorates.
 According to Non-Patent Document 1, waveform generation parameters are selected so that the target prosodic information is faithfully reproduced for each individual speech unit, based on predetermined segment boundary positions. As a result, thinning and insertion of waveform generation parameters are repeated many times, the temporal change in the spectrum of the synthesized speech becomes biased, and it is difficult to realize a smooth spectral change. Hence the above problem arises.
 The present invention therefore aims to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating synthesized speech with a smooth spectrum change in sections where segments that are continuous in the recorded speech have been selected.
 The speech synthesizer according to the present invention includes a segment selection unit that selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance, and a waveform generation unit that includes a waveform generation parameter selection unit for selecting waveform generation parameters extracted from the speech units and that generates synthesized speech using the selected waveform generation parameters. The waveform generation parameter selection unit generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech, and selects the waveform generation parameters based on that function.
 The speech synthesis method according to the present invention selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech; selects, based on that function, the waveform generation parameters extracted from the speech units; and generates synthesized speech using the selected waveform generation parameters.
 The speech synthesis program according to the present invention causes a computer to execute: a segment selection process of selecting, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection process of generating, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech, and selecting, based on that function, the waveform generation parameters extracted from the speech units; and a waveform generation process of generating synthesized speech using the selected waveform generation parameters.
 According to the present invention, synthesized speech with a smooth spectrum change can be generated in sections where segments that are continuous in the recorded speech have been selected.
FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech synthesizer according to the present invention.
FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit.
FIG. 3 is an explanatory diagram showing the assignment of waveform generation parameters.
FIG. 4 is an explanatory diagram showing an example in which F_u2(t) is plotted based on the assignment shown in FIG. 3.
FIG. 5 is an explanatory diagram showing a first example of the waveform generation parameter selection function.
FIG. 6 is an explanatory diagram showing a second example of the waveform generation parameter selection function.
FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function.
FIG. 8 is an explanatory diagram showing how a voiced sound waveform is generated from two speech units each composed of nine waveform generation parameters.
FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention.
FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention.
FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
Embodiment 1.
 FIG. 1 is a block diagram showing the configuration of the first embodiment (Embodiment 1) of the speech synthesizer according to the present invention. As shown in FIG. 1, the speech synthesizer of this embodiment includes a language processing unit 1, a prosody generation unit 2, a segment selection unit 3, a waveform generation unit 4, and a segment information storage unit 10. The waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7. The voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
 The segment information storage unit 10 stores speech unit information representing speech units and attribute information representing the attributes of each speech unit. A speech unit is a part of the basic speech (natural speech uttered by a human) on which the speech synthesis process is based, and is generated by dividing the basic speech into speech synthesis units.
 In this embodiment, the speech unit information includes time-series data of the waveform generation parameters that are extracted from the speech unit and used for generating the synthesized speech waveform. A pitch waveform is used as the waveform generation parameter in the following description, but the parameter may be, for example, a power spectrum, linear prediction coefficients, a cepstrum, a mel-cepstrum, or an LSP (see Non-Patent Document 7). In particular, when the data amount of the segments needs to be reduced, it is preferable to use linear prediction coefficients, an LSP, or the like as the waveform generation parameter. The speech synthesis unit is a syllable. Note that, as disclosed in Patent Document 2, the speech synthesis unit may also be a phoneme, a half-phoneme, a half-syllable such as CV (Consonant-Vowel), CVC, VCV, or the like.
 The attribute information includes language information, which contains information representing the character string (recorded sentence) corresponding to the basic speech, and prosodic information of the basic speech. The language information is, for example, information expressed as a sentence mixing kanji and kana. The language information may further include information such as readings, syllable strings, phoneme strings, accent positions, accent phrase breaks, and the parts of speech of morphemes. The prosodic information includes the pitch (fundamental frequency), amplitude, time series of short-time power, and the durations of the syllables, phonemes, and pauses contained in the natural speech.
 The language processing unit 1 analyzes the character string of the input text. Specifically, the language processing unit 1 performs analyses such as morphological analysis, syntactic analysis, and reading assignment. Based on the analysis results, the language processing unit 1 outputs, as the language analysis result, information representing a symbol string such as phoneme symbols that expresses the "reading", together with information representing the parts of speech, conjugations, accent types, and the like of the morphemes, to the prosody generation unit 2 and the segment selection unit 3.
 The prosody generation unit 2 generates the prosody of the synthesized speech based on the language analysis result output by the language processing unit 1, and outputs prosodic information indicating the generated prosody to the segment selection unit 3 and the waveform generation unit 4 as target prosodic information. For generating the prosody, for example, the method described in Patent Document 3 is used.
 The segment selection unit 3 selects, from the segments stored in the segment information storage unit 10, segments that satisfy predetermined requirements based on the language analysis result and the target prosodic information, and outputs the selected segments and their attribute information to the waveform generation unit 4.
 The operation of the segment selection unit 3 will now be described in detail. Based on the input language analysis result and the target prosodic information, the segment selection unit 3 generates, for each speech synthesis unit, information indicating the characteristics of the synthesized speech (hereinafter referred to as the "target segment environment").
 The target segment environment is information that includes the phoneme in question constituting the synthesized speech for which the target segment environment is generated, the preceding phoneme (the phoneme before it), the succeeding phoneme (the phoneme after it), the presence or absence of stress, the distance from the accent nucleus, the pitch frequency for each speech synthesis unit, the power, the duration of the speech synthesis unit, the cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and the amounts of change of these per unit time.
 Next, based on the information included in the generated target segment environment, the segment selection unit 3 acquires from the segment information storage unit 10, for each speech synthesis unit, a plurality of segments corresponding to consecutive phonemes. That is, based on the information included in the target segment environment, the segment selection unit 3 acquires a plurality of segments corresponding to each of the phoneme in question, the preceding phoneme, and the succeeding phoneme. The acquired segments are candidates for the segments used to generate the synthesized speech, and are hereinafter referred to as candidate segments.
 The segment selection unit 3 then calculates, for each combination of acquired adjacent candidate segments (for example, a combination of a candidate segment corresponding to the phoneme in question and a candidate segment corresponding to the preceding phoneme), a cost, which is an index indicating the appropriateness of using the segments for synthesizing speech. The cost is the calculated result of the difference between the target segment environment and the attribute information of the candidate segment, and of the difference between the attribute information of adjacent candidate segments.
 The cost decreases as the similarity between the characteristics of the synthesized speech indicated by the target segment environment and the candidate segment increases, that is, as the appropriateness for synthesizing the speech increases. The cost also decreases as the difference in attribute information between adjacent candidate segments decreases, that is, as the gap at the segment connection becomes smaller. The lower the cost of the segments used, the higher the naturalness of the synthesized speech, that is, the degree to which it resembles speech uttered by a human. Therefore, the segment selection unit 3 selects the segments with the lowest calculated cost.
 Specifically, the cost calculated by the segment selection unit 3 consists of a unit cost and a connection cost. The unit cost indicates the degree of sound quality degradation estimated to occur when a candidate segment is used in the environment indicated by the target segment environment; it is calculated based on the similarity between the attribute information of the candidate segment and the target segment environment. The connection cost indicates the degree of sound quality degradation estimated to be caused by discontinuity of the segment environments between connected speech units; it is calculated based on the affinity of the segment environments of adjacent candidate segments. Various commonly proposed methods are used to calculate the unit cost and the connection cost.
 The segment selection unit 3 selects, from among the candidate segments, the combination of segments that minimizes the calculated cost as the segments most suitable for speech synthesis. The segments selected by the segment selection unit 3 are referred to as "optimal segments".
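The minimum-cost search described above can be sketched as a standard dynamic-programming (Viterbi-style) search over the candidate segments. The concrete unit-cost and connection-cost functions are not specified in this excerpt, so they are passed in as placeholder callables; all names here are hypothetical:

```python
def select_optimal_segments(candidates, unit_cost, connection_cost):
    """Find the candidate-segment sequence minimizing the total cost
    (sum of unit costs plus sum of connection costs between neighbors).
    candidates: one list of candidate segments per synthesis unit."""
    # best[i][j]: (cost of best path ending at candidate j of unit i, backpointer)
    best = [[(unit_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for cand in candidates[i]:
            prev_j, prev_cost = min(
                ((k, best[i - 1][k][0] + connection_cost(candidates[i - 1][k], cand))
                 for k in range(len(candidates[i - 1]))),
                key=lambda kv: kv[1])
            row.append((prev_cost + unit_cost(i, cand), prev_j))
        best.append(row)
    # trace back the minimum-cost path
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(candidates))]
```

The search is linear in the number of synthesis units and quadratic in the number of candidates per unit, which is the usual complexity of unit-selection synthesis.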
 Based on the target prosodic information supplied from the prosody generation unit 2 and on the selected segments and their attribute information supplied from the segment selection unit 3, the waveform generation unit 4 generates speech waveforms having a prosody that matches or is similar to the target prosody, and connects the generated speech waveforms to generate synthesized speech.
 The segments represented by the segment information supplied from the segment selection unit 3 are classified into segments consisting of voiced sounds and segments consisting of unvoiced sounds. The method used for prosody control of voiced sounds differs from that used for unvoiced sounds. Therefore, the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7 that connects the voiced and unvoiced sounds.
 The unvoiced sound generation unit 6 generates, based on the segments supplied from the segment selection unit 3, an unvoiced sound waveform having a prosody that matches or is similar to the prosodic information supplied from the prosody generation unit 2. In this embodiment, since the unvoiced segments supplied from the segment selection unit 3 are cut-out speech waveforms, the unvoiced sound generation unit 6 can generate the unvoiced sound waveform using the method described in Non-Patent Document 4. The method described in Non-Patent Document 5 may also be used.
 The voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51. The waveform generation parameter selection unit 50 selects the waveform generation parameters used to generate the voiced sound waveform, based on the segment information supplied from the segment selection unit 3 and the prosodic information supplied from the prosody generation unit 2.
 FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit 50. First, the waveform generation parameter selection unit 50 generates, from the time length of the optimal segment and the target time length, a function that determines where on the time axis of the synthesized speech each waveform generation parameter is to be placed (step S1). Since this function is used for selecting waveform generation parameters, it is referred to in this embodiment as the "waveform generation parameter selection function".
 For example, if the time length of the optimal segment is T_u and the target time length is T_o, the waveform generation parameter selection unit 50 generates, for each optimal segment, a linear function such as the following equation (1) as the waveform generation parameter selection function.
[Equation (1)]
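The image of equation (1) is not reproduced in this text. Based on the surrounding description (a linear function mapping the time axis of the synthesized speech onto the time axis of the optimal segment, with segment length T_u and target length T_o), a plausible reconstruction is:

```latex
F_u(t) = \frac{T_u}{T_o}\, t, \qquad 0 \le t \le T_o
```

Here t is time on the synthesized-speech axis and F_u(t) is the corresponding position on the segment's time axis.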
 Next, the waveform generation parameter selection unit 50 checks, for every selected segment, whether it is continuous with its succeeding segment (step S2). Here, being continuous with the succeeding segment means being continuous in the source recorded speech stored in the segment information storage unit 10. For example, suppose the segment unit is the syllable, the syllable of the segment being checked (here called the "preceding segment") is "u", and the syllable of the succeeding segment being checked is "ma". If the preceding segment and the succeeding segment were selected from separate recordings, such as "ushi" and "mari", the two segments are discontinuous. On the other hand, if they were selected from consecutive sections of the same recording, such as "umai" or "shimauma", the preceding segment and the succeeding segment are continuous.
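A minimal sketch of the continuity check in step S2, assuming each segment's attribute information carries a recording identifier and a sample range (these fields are hypothetical; the specification does not define how segment origin is recorded):

```python
def is_continuous(preceding, succeeding):
    """Two selected segments are 'continuous' if they come from the same
    recorded utterance and the succeeding one starts exactly where the
    preceding one ends in that recording."""
    return (preceding["recording_id"] == succeeding["recording_id"]
            and preceding["end_sample"] == succeeding["start_sample"])
```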
 If the segments selected by the segment selection unit 3 are continuous, it is preferable to realize a smooth spectrum change by taking that continuity into account. Therefore, the waveform generation parameter selection unit 50 uses the waveform generation parameter selection functions of the preceding and succeeding segments to obtain a common waveform generation parameter selection function used by both. For example, if the time lengths of the preceding and succeeding optimal segments are T_u1 and T_u2 and the target time lengths are T_o1 and T_o2, a polyline function such as the following equation (2) is obtained.
[Equation (2): image not reproduced]
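Equation (2) appears above only as an image, so the following Python sketch shows one plausible form of the common polygonal-line function: each segment is assumed to be stretched linearly from its target duration To to its original duration Tu. The exact functional form is an assumption, not the patent's verbatim formula.

```python
def selection_function(t, To1, To2, Tu1, Tu2):
    """Map a time t on the synthesized-speech axis (0 <= t <= To1 + To2)
    to a time on the concatenated original segments.

    Assumed reading of Equation (2): each segment is linearly stretched
    from its target duration To to its original duration Tu.
    """
    if t <= To1:
        return (Tu1 / To1) * t                # inside the preceding segment
    return Tu1 + (Tu2 / To2) * (t - To1)      # inside the succeeding segment
```

Under this reading the mapping is continuous at the segment boundary: the value at t = To1 is Tu1, the end of the preceding segment.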
 FIG. 3 is an explanatory diagram showing the assignment of waveform generation parameters. It illustrates how, when two segments are continuous, waveform generation parameters are assigned to match the target durations. The "N-th segment" is the preceding segment, and the "(N+1)-th segment" is the succeeding segment. FIG. 4 is an explanatory diagram plotting Fu2(t) based on the assignment shown in FIG. 3.
 Next, the waveform generation parameter selection unit 50 corrects the waveform generation parameter selection function used to pick appropriate waveform generation parameters from the preceding and succeeding optimum segments, and thereby obtains a selection function that takes continuity into account (step S3). Several methods for obtaining the corrected selection function are described below.
 FIG. 5 is an explanatory diagram showing a first example of the waveform generation parameter selection function. As shown in FIG. 5, the first example is generated by introducing a straight line that passes through the midpoints of the preceding and succeeding segments. In this case, a polygonal-line function such as the following Equation (3) is used as the waveform generation parameter selection function.
[Equation (3): image not reproduced]
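Equation (3) is likewise reproduced only as an image. One plausible reading of the first example is a three-piece polyline: each segment keeps its original line up to its midpoint, and the two midpoints are bridged by a straight line. A sketch under that assumption:

```python
def midpoint_smoothed(t, To1, To2, Tu1, Tu2):
    """First correction example (FIG. 5), assumed form: keep the original
    per-segment lines outside the boundary region and bridge the two
    segment midpoints with a straight line."""
    m1t, m1y = To1 / 2.0, Tu1 / 2.0                 # midpoint of the preceding segment's line
    m2t, m2y = To1 + To2 / 2.0, Tu1 + Tu2 / 2.0     # midpoint of the succeeding segment's line
    if t <= m1t:
        return (Tu1 / To1) * t
    if t >= m2t:
        return Tu1 + (Tu2 / To2) * (t - To1)
    slope = (m2y - m1y) / (m2t - m1t)               # bridging line through both midpoints
    return m1y + slope * (t - m1t)
```

At the segment boundary this bridging line replaces the abrupt slope change of the uncorrected polyline with a single intermediate slope.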
 FIG. 6 is an explanatory diagram showing a second example of the waveform generation parameter selection function. The second example is obtained from the straight-line function connecting the start of the preceding segment to the end of the succeeding segment. For example, as shown in FIG. 6, a polygonal-line function that passes through the midpoint between the intersection (To1, Q) of that straight line with the segment connection boundary and the end point (To1, Tu1) of the preceding segment is used as the selection function. Writing this midpoint as (To1, Tum), the polygonal-line function given by the following Equation (4) is used as the waveform generation parameter selection function.
[Equation (4): image not reproduced]
 In Equation (4), Tum is given by the following Equation (5).
[Equation (5): image not reproduced]
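Equations (4) and (5) are also reproduced only as images. From the surrounding description, Q is plausibly the value at t = To1 of the straight line joining the start of the preceding segment to the end of the succeeding one, Tum is the midpoint of Q and Tu1, and the corrected polyline passes through (To1, Tum). A sketch under those assumptions:

```python
def boundary_midpoint_corrected(t, To1, To2, Tu1, Tu2):
    """Second correction example (FIG. 6), assumed form."""
    # Q: value at the segment boundary of the straight line from
    # (0, 0) to (To1 + To2, Tu1 + Tu2)
    Q = (Tu1 + Tu2) * To1 / (To1 + To2)
    Tum = (Q + Tu1) / 2.0                 # midpoint of Q and Tu1 (assumed Equation (5))
    if t <= To1:
        return (Tum / To1) * t            # first piece: (0, 0) -> (To1, Tum)
    # second piece: (To1, Tum) -> (To1 + To2, Tu1 + Tu2)
    return Tum + (Tu1 + Tu2 - Tum) / To2 * (t - To1)
```

The corrected function still starts at the beginning of the preceding segment and ends at the end of the succeeding one; only the boundary point is moved toward the overall straight line.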
 FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function. The third example is obtained by smoothing the polygonal-line function Fu2(t). As a smoothing method, the polygonal line can, for example, be regarded as a time series and smoothed by a moving average or by first-order leaky integration.
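The two smoothing methods named for the third example, a moving average and first-order leaky integration, can be sketched over a sampled version of Fu2(t) as follows. The window size and leak coefficient are illustrative choices, not values from the patent.

```python
def moving_average(ys, window=5):
    """Centered moving average over a sampled polyline;
    the window shrinks near the edges of the sequence."""
    half = window // 2
    out = []
    for i in range(len(ys)):
        lo, hi = max(0, i - half), min(len(ys), i + half + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def leaky_integration(ys, alpha=0.3):
    """First-order leaky integrator: acc[n] = (1 - alpha) * acc[n-1] + alpha * y[n]."""
    out, acc = [], ys[0]
    for y in ys:
        acc = (1.0 - alpha) * acc + alpha * y
        out.append(acc)
    return out
```

Both filters round off the corner of a polyline: a step input is turned into a gradual transition instead of an abrupt slope change.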
 By using any of the first to third methods, the waveform generation parameter selection unit 50 smooths the change in slope of the waveform generation parameter selection function. As a result, the speech synthesizer of this embodiment can generate synthesized speech whose spectrum changes smoothly.
 The above correction methods have been described on the assumption that the selection function to be corrected is a polygonal-line function, but the same methods can be applied to other functions, such as curves. Regarding the first example shown in FIG. 5, the corrected selection function was described as passing through the midpoints of the preceding and succeeding segments, but it may instead pass through points other than the midpoints. Likewise, in the second example shown in FIG. 6, the corrected selection function was described as passing through the midpoint between the intersection (To1, Q) of the segment connection boundary with the straight-line function and the end point (To1, Tu1) of the preceding segment, but here too a function passing through a point other than the midpoint may be used.
 Next, the waveform generation parameter selection unit 50 calculates pitch synchronization times (also called pitch marks) from the pitch time series generated by the prosody generation unit 2 (step S4). A method for calculating pitch synchronization positions from a pitch time series is described, for example, in Non-Patent Document 6; the waveform generation unit 4 may calculate the pitch synchronization positions by that method.
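Placing pitch marks from a pitch time series can be sketched as stepping forward one pitch period at a time through the F0 contour. This is a simplified stand-in for, not a reproduction of, the method of Non-Patent Document 6; the function and parameter names are hypothetical.

```python
def pitch_marks(f0, frame_period=0.005, duration=None):
    """Place pitch-synchronous times given a sampled F0 contour,
    where f0[i] is the fundamental frequency in Hz at time i * frame_period.
    Unvoiced frames (f0 <= 0) are skipped one frame at a time."""
    if duration is None:
        duration = len(f0) * frame_period
    marks, t = [], 0.0
    while t < duration:
        idx = min(int(t / frame_period), len(f0) - 1)
        if f0[idx] <= 0:
            t += frame_period          # unvoiced: no mark, advance one frame
            continue
        marks.append(t)
        t += 1.0 / f0[idx]             # advance by one pitch period
    return marks
```

For a constant 64 Hz contour, for example, this produces marks spaced exactly one period (15.625 ms) apart.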
 Then, the waveform generation parameter selection unit 50 uses the waveform generation parameter selection function to select the waveform generation parameter closest to each pitch synchronization time (step S5). As in the case where continuity is not considered, the time of the ideal waveform generation parameter position is first computed from the pitch synchronization position of the synthesized speech using the selection function, and the waveform generation parameter closest to that time is then adopted. For example, if the n-th waveform generation parameter is located at 100 milliseconds, the (n+1)-th at 180 milliseconds, and the time obtained from the selection function is 160 milliseconds, the (n+1)-th waveform generation parameter is selected.
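The nearest-parameter rule of step S5 can be sketched directly; the function name is hypothetical, but the 100 ms / 180 ms example reproduces the one in the text.

```python
def select_nearest_parameter(ideal_time_ms, param_times_ms):
    """Return the index of the stored waveform generation parameter whose
    position is closest to the ideal time given by the selection function."""
    return min(range(len(param_times_ms)),
               key=lambda n: abs(param_times_ms[n] - ideal_time_ms))
```

With parameters at 100 ms and 180 ms and an ideal time of 160 ms, the second parameter (the (n+1)-th in the text's numbering) is chosen.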
 FIG. 8 is an explanatory diagram showing how a voiced sound waveform is generated from two speech segments each consisting of nine waveform generation parameters. In the example of FIG. 8, the function shown in FIG. 5 is used as the waveform generation parameter selection function. The waveform generation parameters corresponding to the pitch synchronization times are the 1st, 3rd, 4th, 5th, 6th, 7th, 8th, 8th, and 9th parameters, so the waveform generation unit 4 generates the waveform using these parameters.
 The voiced sound waveform generation unit 51 generates a voiced sound waveform based on the waveform generation parameters supplied from the waveform generation parameter selection unit 50 and the prosody information supplied from the prosody generation unit 2. It does so by placing the center of each selected waveform generation parameter at the corresponding pitch synchronization time. When the waveform generation parameter is a pitch waveform, the voiced sound waveform generation unit 51 generates the voiced sound waveform by placing the pitch waveform at the pitch synchronization time.
 The waveform connecting unit 7 connects the voiced sound waveform supplied from the voiced sound generation unit 5 and the unvoiced sound waveform supplied from the unvoiced sound generation unit 6, and outputs the result as a synthesized speech waveform. Specifically, for example, when the voiced waveform generated by the voiced sound generation unit 5 is v(t) (where t = 1, 2, 3, ..., t_v) and the unvoiced waveform generated by the unvoiced sound generation unit 6 is u(t) (where t = 1, 2, 3, ..., t_u), the waveform connecting unit 7 concatenates them to generate and output the following synthesized speech waveform x(t).
x(t) = v(t) for t = 1, ..., t_v
x(t) = u(t − t_v) for t = t_v + 1, ..., t_v + t_u
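The piecewise definition of x(t) amounts to appending the unvoiced samples after the voiced ones, as the following minimal sketch shows:

```python
def concatenate_waveforms(voiced, unvoiced):
    """x(t) = v(t) for t = 1..t_v, then x(t) = u(t - t_v) for
    t = t_v + 1 .. t_v + t_u: simple end-to-end concatenation."""
    return list(voiced) + list(unvoiced)
```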
 As described above, the speech synthesizer of this embodiment corrects the waveform generation parameter selection function in consideration of continuity. Consequently, in sections where segments that are contiguous in the recorded speech have been selected, it can generate synthesized speech whose spectrum changes more smoothly than with the general method disclosed in Non-Patent Document 1 and elsewhere.
Embodiment 2.
 Next, a speech synthesizer according to the second embodiment of the present invention will be described. It differs from the first embodiment in that it estimates the degree of spectral change from the attribute information of the speech segments and controls the waveform generation parameter selection function based on the estimated degree. The description below therefore focuses on this difference.
 FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention. Compared with the configuration of the first embodiment shown in FIG. 1, the waveform generation parameter selection unit 50 is replaced by a waveform generation parameter selection unit 60, and a spectral shape change degree estimation unit 62 is newly provided.
 The spectral shape change degree estimation unit 62 estimates the degree of change of the spectral shape at a segment connection boundary, based on the segment attribute information supplied from the segment information storage unit 10. For this estimation it uses the linguistic and prosodic information included in the attribute information. When phoneme or syllable types from the linguistic information are used, it is effective to estimate the rate of change of the speech spectrum shape for each type. For example, if the preceding and succeeding segments together form a long-vowel syllable, the spectral shape changes little at the boundary, so the estimated degree is made small; the same applies when the preceding and succeeding segments are the same phoneme. Conversely, if the preceding or succeeding segment is a voiced consonant, the spectral shape changes greatly at the boundary, so the estimated degree is made large.
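The attribute-based rules above can be sketched as a simple heuristic. The numeric degrees and attribute keys below are illustrative assumptions, since the text only says the estimate is made "small" or "large":

```python
def estimate_spectral_change_degree(prev_seg, next_seg):
    """Heuristic spectral-shape change degree at a segment boundary.

    prev_seg / next_seg are dicts of segment attribute information;
    the attribute keys and the numeric return values are illustrative,
    not taken from the patent."""
    if prev_seg.get("phoneme") == next_seg.get("phoneme"):
        return 0.1                 # same phoneme: small spectral change
    if prev_seg.get("long_vowel") and next_seg.get("long_vowel"):
        return 0.1                 # long-vowel syllable spanning the boundary
    if prev_seg.get("voiced_consonant") or next_seg.get("voiced_consonant"):
        return 0.9                 # voiced consonant: large spectral change
    return 0.5                     # otherwise: mid-range default
```

A table of per-type change rates learned from data could replace these hand-set constants without changing the interface.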
 The waveform generation parameter selection unit 60 selects the waveform generation parameters used to generate the voiced sound waveform based on the segment information supplied from the segment selection unit 3, the prosody information supplied from the prosody generation unit 2, and the degree of spectral shape change supplied from the spectral shape change degree estimation unit 62. It generates the waveform generation parameter selection function based on the estimated degree of spectral shape change.
 When using the selection function shown in FIG. 5, for example, the waveform generation parameter selection unit 60 adjusts the length of the correction interval. If the degree of spectral shape change is small, lengthening the correction interval yields a smoother spectral shape. If the degree is large, lengthening the interval is undesirable, because the amount of correction grows and the prosodic difference between the speech segment and the synthesized segment increases. The unit 60 therefore adjusts the interval length according to the magnitude of the degree of spectral shape change. Similarly, when using the selection function shown in FIG. 6, the unit 60 adjusts the distance, on the segment boundary, between the end point of the preceding segment and the corrected selection function: if the degree of spectral shape change is small, it makes that distance larger.
 According to the speech synthesizer of this embodiment, the waveform generation parameter selection function is controlled according to the attribute information of the speech segments. As a result, synthesized speech with smooth spectral change can be generated, particularly in sections where the degree of spectral shape change is small.
 The present invention is not limited to the speech synthesizers described in the embodiments; their configuration and operation can be modified as appropriate without departing from the spirit of the invention.
 FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention. As shown in FIG. 10, its main components are a segment selection unit 3, which selects the speech segments to be used for synthesis from a plurality of pre-stored speech segments based on an input character string, and a waveform generation unit 4, which includes a waveform generation parameter selection unit 50 that selects waveform generation parameters extracted from the speech segments and which generates synthesized speech using the selected parameters. The waveform generation parameter selection unit 50 generates, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, that is, a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selects the waveform generation parameters based on that function.
 The above embodiments also disclose speech synthesizers as described in (1) to (4) below.
(1) A speech synthesizer in which the waveform generation parameter selection unit generates a waveform generation parameter selection function by connecting a first function, which joins the start and end of a preceding segment that is one of the selected speech segments, with a second function, which joins the start and end of the succeeding segment that follows the preceding segment, and in which, when the preceding and succeeding segments are continuous, the selection function is corrected so that the change in its slope is smooth.
(2) The waveform generation parameter selection unit may be configured to smooth the change in slope by correcting the selection function so that it passes through an internally dividing point of the line segment joining (a) the point, on the straight line connecting the start of the preceding segment to the end of the succeeding segment, at the end time of the preceding segment on the time axis of the synthesized speech, and (b) the end point of the preceding segment.
(3) The waveform generation parameter selection unit may be configured to generate a selection function whose change in slope has been smoothed by correcting it with a line connecting an internally dividing point of the first function and an internally dividing point of the second function.
(4) The speech synthesizer may include a spectral shape change degree estimation unit (for example, the spectral shape change degree estimation unit 62) that estimates the degree of spectral change at a connection boundary of the speech segments based on the attribute information of the speech segments, and the waveform generation parameter selection unit may be configured to generate the waveform generation parameter selection function based on the estimated degree of spectral change.
 This application claims priority based on Japanese Patent Application No. 2012-167220 filed on July 27, 2012, the entire disclosure of which is incorporated herein.
 Although the present invention has been described above with reference to embodiments, it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to its configuration and details within the scope of the present invention.
Industrial applicability
 The present invention can be applied to information providing services and the like that use synthesized speech.
Description of Symbols
1 Language processing unit
2 Prosody generation unit
3 Segment selection unit
4 Waveform generation unit
5 Voiced sound generation unit
6 Unvoiced sound generation unit
7 Waveform connecting unit
10 Segment information storage unit
50, 60 Waveform generation parameter selection unit
51 Voiced sound waveform generation unit
62 Spectral shape change degree estimation unit

Claims (7)

  1.  A speech synthesizer comprising:
      a segment selection unit which selects, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments; and
      a waveform generation unit which includes a waveform generation parameter selection unit that selects waveform generation parameters extracted from the speech segments, and which generates synthesized speech using the selected waveform generation parameters,
      wherein the waveform generation parameter selection unit generates, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selects the waveform generation parameters based on that function.
  2.  The speech synthesizer according to claim 1, wherein the waveform generation parameter selection unit
      generates a waveform generation parameter selection function by connecting a first function, which joins the start and end of a preceding segment that is one of the selected speech segments, with a second function, which joins the start and end of the succeeding segment that follows the preceding segment, and,
      when the preceding segment and the succeeding segment are continuous, corrects the waveform generation parameter selection function so that the change in its slope is smooth.
  3.  The speech synthesizer according to claim 2, wherein the waveform generation parameter selection unit
      smooths the change in slope by correcting the waveform generation parameter selection function so that it passes through an internally dividing point of the line segment joining (a) the point, on the straight line connecting the start of the preceding segment to the end of the succeeding segment, at the end time of the preceding segment on the time axis of the synthesized speech, and (b) the end point of the preceding segment.
  4.  The speech synthesizer according to claim 2, wherein the waveform generation parameter selection unit
      generates a waveform generation parameter selection function whose change in slope has been smoothed by correcting it with a line connecting an internally dividing point of the first function and an internally dividing point of the second function.
  5.  The speech synthesizer according to any one of claims 1 to 4, further comprising a spectral shape change degree estimation unit which estimates the degree of spectral change at a connection boundary of the speech segments based on the attribute information of the speech segments,
      wherein the waveform generation parameter selection unit generates the waveform generation parameter selection function based on the estimated degree of spectral change.
  6.  A speech synthesizing method comprising:
      selecting, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments;
      generating, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selecting the waveform generation parameters extracted from the speech segments based on that function; and
      generating synthesized speech using the selected waveform generation parameters.
  7.  A speech synthesizing program for causing a computer to execute:
      a segment selection process of selecting, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments; and
      a waveform generation process which includes a waveform generation parameter selection process of generating, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and of selecting the waveform generation parameters extracted from the speech segments based on that function, and which generates synthesized speech using the selected waveform generation parameters.
PCT/JP2013/004023 2012-07-27 2013-06-27 Speech synthesizer, speech synthesizing method, and speech synthesizing program WO2014017024A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014526737A JPWO2014017024A1 (en) 2012-07-27 2013-06-27 Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-167220 2012-07-27
JP2012167220 2012-07-27

Publications (1)

Publication Number Publication Date
WO2014017024A1 (en)

Family

ID=49996852

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/004023 WO2014017024A1 (en) 2012-07-27 2013-06-27 Speech synthesizer, speech synthesizing method, and speech synthesizing program

Country Status (2)

Country Link
JP (1) JPWO2014017024A1 (en)
WO (1) WO2014017024A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0990987A (en) * 1995-09-26 1997-04-04 Toshiba Corp Method and device for voice synthesis
JPH11338488A (en) * 1998-05-26 1999-12-10 Ricoh Co Ltd Voice synthesizing device and voice synthesizing method
JP2009069179A (en) * 2007-09-10 2009-04-02 Toshiba Corp Device and method for generating fundamental frequency pattern, and program
JP2010026223A (en) * 2008-07-18 2010-02-04 Nippon Hoso Kyokai <Nhk> Target parameter determination device, synthesis voice correction device and computer program
JP2010078808A (en) * 2008-09-25 2010-04-08 Toshiba Corp Voice synthesis device and method


Also Published As

Publication number Publication date
JPWO2014017024A1 (en) 2016-07-07

Similar Documents

Publication Publication Date Title
JP3913770B2 (en) Speech synthesis apparatus and method
US8175881B2 (en) Method and apparatus using fused formant parameters to generate synthesized speech
JP4966048B2 (en) Voice quality conversion device and speech synthesis device
JP5159325B2 (en) Voice processing apparatus and program thereof
JP4551803B2 (en) Speech synthesizer and program thereof
US20080027727A1 (en) Speech synthesis apparatus and method
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
JP2006309162A (en) Pitch pattern generating method and apparatus, and program
JP5983604B2 (en) Segment information generation apparatus, speech synthesis apparatus, speech synthesis method, and speech synthesis program
US20110196680A1 (en) Speech synthesis system
JP2009133890A (en) Voice synthesizing device and method
JP5874639B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
JP5930738B2 (en) Speech synthesis apparatus and speech synthesis method
US8407054B2 (en) Speech synthesis device, speech synthesis method, and speech synthesis program
WO2011118207A1 (en) Speech synthesizer, speech synthesis method and the speech synthesis program
JPH09319391A (en) Speech synthesizing method
JP2003208188A (en) Japanese text voice synthesizing method
WO2014017024A1 (en) Speech synthesizer, speech synthesizing method, and speech synthesizing program
JP2011141470A (en) Phoneme information-creating device, voice synthesis system, voice synthesis method and program
Koriyama et al. An F0 modeling technique based on prosodic events for spontaneous speech synthesis
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium
JP2010078808A (en) Voice synthesis device and method
EP1589524B1 (en) Method and device for speech synthesis
JP2006084854A (en) Device, method, and program for speech synthesis
JP2008299266A (en) Speech synthesis device and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13823266

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014526737

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13823266

Country of ref document: EP

Kind code of ref document: A1