WO2014017024A1 - Speech synthesizer, speech synthesizing method, and speech synthesizing program - Google Patents

Speech synthesizer, speech synthesizing method, and speech synthesizing program Download PDF

Info

Publication number
WO2014017024A1
Authority
WO
WIPO (PCT)
Prior art keywords
waveform generation
speech
generation parameter
unit
segment
Prior art date
Application number
PCT/JP2013/004023
Other languages
French (fr)
Japanese (ja)
Inventor
正徳 加藤
玲史 近藤
康行 三井
Original Assignee
NEC Corporation (日本電気株式会社)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corporation (日本電気株式会社)
Priority to JP2014526737A priority Critical patent/JPWO2014017024A1/en
Publication of WO2014017024A1 publication Critical patent/WO2014017024A1/en

Links

Images

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/06 - Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 13/07 - Concatenation rules
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation

Definitions

  • the present invention relates to a speech synthesis technique, and more particularly to a speech synthesizer, a speech synthesis method, and a speech synthesis program for synthesizing speech based on input text.
  • a speech synthesizer that analyzes an input character string and generates synthesized speech from the speech information indicated by the character string is known. Such a speech synthesizer first generates prosodic information of the synthesized speech, namely sound pitch (pitch), sound length (phoneme duration), sound volume (power), and the like, based on a language processing result obtained by analyzing the input character string.
  • the speech synthesizer then selects a plurality of optimal segments from a segment dictionary based on the language processing result and the generated prosodic information (referred to as "target prosodic information"), and an optimal segment sequence is obtained.
  • the segment is sometimes referred to as a speech segment, and is generated in advance for each semi-syllable, for example, based on the recorded speech.
  • a plurality of types of segments are generated from various recorded voices for a single sound (here, a sound of about half a syllable).
  • a synthesized speech can be obtained by forming a waveform generation parameter sequence from the optimal segment sequence and generating a speech waveform from the sequence.
  • Segments stored in the segment dictionary are extracted and generated from a large amount of natural speech using various methods.
  • such a speech synthesizer generates a speech waveform whose prosody is close to the generated prosodic information in order to ensure high sound quality when generating the synthesized speech waveform from the selected segments. For this purpose, for example, the method described in Non-Patent Document 1 is used both to generate the synthesized speech waveform and to generate the segments used for generating it.
  • FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1.
  • the waveform generation parameter generated by the method described in Non-Patent Document 1 is a waveform (pitch waveform) cut out from the recorded speech waveform by a window function whose time width is calculated from the pitch, centered on a pitch synchronization position that is also calculated from the pitch of the recorded speech.
  • the waveform generation parameter (pitch waveform) is selected from the segment based on the pitch generated from the language processing result, that is, the pitch of the synthesized speech.
  • a synthesized speech waveform is generated by concatenating the selected pitch waveforms. The selection of the pitch waveform is basically performed based on the correspondence between the pitch synchronization positions of the recorded voice and the synthesized voice.
  • Non-Patent Document 7 describes that a power spectrum, a linear prediction coefficient, a cepstrum, a mel cepstrum, an LSP (Line Spectrum Pair), and the like can be used as waveform generation parameters in addition to the pitch waveform.
  • however, the method described in Non-Patent Document 1 has the problem that the sound quality of the synthesized speech deteriorates because appropriate waveform generation parameters are not always selected.
  • in that method, the waveform generation parameters are selected so that the target prosodic information is faithfully reproduced for each speech unit, based on predetermined boundary positions of the segments. As a result, thinning and insertion of waveform generation parameters are repeated many times, the temporal change of the spectrum of the synthesized speech becomes biased, and a smooth spectral change is difficult to realize. This causes the above problem.
  • an object of the present invention is therefore to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating synthesized speech with a smooth spectral change in sections for which segments that are continuous in the recorded speech are selected.
  • the speech synthesizer according to the present invention includes a unit selection unit that selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection unit that selects waveform generation parameters extracted from the speech units; and a waveform generation unit that generates synthesized speech using the selected waveform generation parameters. The waveform generation parameter selection unit generates a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech, in consideration of the continuity of the selected speech units, and selects the waveform generation parameters based on that function.
  • the speech synthesis method according to the present invention selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech; selects, based on the waveform generation parameter selection function, waveform generation parameters extracted from the speech units; and generates synthesized speech using the selected waveform generation parameters.
  • the speech synthesis program according to the present invention causes a computer to execute: a unit selection process for selecting, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection process for generating, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where a waveform generation parameter on the time axis of a speech unit is to be placed on the time axis of the synthesized speech, and selecting, based on that function, waveform generation parameters extracted from the speech units; and a waveform generation process for generating synthesized speech using the selected waveform generation parameters.
  • FIG. 1 is a block diagram showing the configuration of a first embodiment (Embodiment 1) of a speech synthesizer according to the present invention.
  • the speech synthesizer of this embodiment includes a language processing unit 1, a prosody generation unit 2, a segment selection unit 3, a waveform generation unit 4, and a segment information storage unit 10.
  • the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7.
  • the voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
  • the unit information storage unit 10 stores speech unit information representing speech units and attribute information representing attributes of each speech unit.
  • a speech segment is a part of basic speech (speech uttered by humans (natural speech)) that serves as the basis of the speech synthesis processing, and is generated by dividing the basic speech into speech synthesis units.
  • the speech unit information includes time series data of waveform generation parameters extracted from the speech unit and used for generating a synthesized speech waveform.
  • a pitch waveform is used in the following description, but may be, for example, a power spectrum, a linear prediction coefficient, a cepstrum, a mel cepstrum, or an LSP (see Non-Patent Document 7).
  • it is preferable to use a linear prediction coefficient, an LSP, or the like as the waveform generation parameter, particularly when the data amount of the segments needs to be reduced.
  • in this embodiment, the speech synthesis unit is a syllable. Note that the speech synthesis unit may instead be a phoneme, a half-phoneme, or a semi-syllable such as CV (Consonant-Vowel), CVC, or VCV, as disclosed in Patent Document 2.
  • Attribute information includes language information including information representing a character string (recorded sentence) corresponding to basic speech and prosodic information of basic speech.
  • the language information is, for example, information expressed in a kanji / kana mixed sentence.
  • the language information may include information such as readings, syllable strings, phoneme strings, accent positions, accent phrase breaks, morpheme parts of speech.
  • the prosodic information includes the pitch (fundamental frequency), the amplitude, a time series of short-time power, and the durations of the syllables, phonemes, and pauses included in the natural speech.
  • the language processing unit 1 analyzes the character string of the input text. Specifically, the language processing unit 1 performs analysis such as morphological analysis, syntactic analysis, and reading assignment. Then, based on the analysis results, the language processing unit 1 outputs information representing a symbol string of the "reading", such as phoneme symbols, together with information representing the parts of speech, conjugations, accent types, and the like of the morphemes, to the prosody generation unit 2 and the segment selection unit 3.
  • the prosody generation unit 2 generates the prosody of the synthesized speech based on the language analysis result output from the language processing unit 1, and outputs prosodic information indicating the generated prosody, as target prosodic information, to the unit selection unit 3 and the waveform generation unit 4. For example, the method described in Patent Document 3 is used to generate the prosody.
  • the segment selection unit 3 selects segments that satisfy a predetermined requirement from the segments stored in the segment information storage unit 10, based on the language analysis result and the target prosodic information, and outputs the selected segments and their attribute information to the waveform generation unit 4.
  • based on the input language analysis result and the target prosodic information, the segment selection unit 3 generates, for each speech synthesis unit, information indicating the characteristics of the synthesized speech (hereinafter referred to as the "target segment environment").
  • the target segment environment includes information such as the corresponding phoneme that constitutes the synthesized speech for which the target segment environment is generated, the preceding phoneme (the phoneme before the corresponding phoneme), the succeeding phoneme (the phoneme after the corresponding phoneme), the presence or absence of stress, the distance from the accent nucleus, the pitch frequency for each speech synthesis unit, the power, the duration of the speech synthesis unit, the cepstrum, the MFCC (Mel Frequency Cepstral Coefficients), and their amounts of change per unit time.
  • the segment selection unit 3 acquires, for each synthesis unit, a plurality of segments corresponding to consecutive phonemes from the segment information storage unit 10, based on the information included in the generated target segment environment. That is, the segment selection unit 3 acquires a plurality of segments corresponding to each of the corresponding phoneme, the preceding phoneme, and the succeeding phoneme.
  • the acquired segment is a candidate for a segment used to generate a synthesized speech, and is hereinafter referred to as a candidate segment.
  • the unit selection unit 3 calculates, for each combination of the acquired candidate segments (for example, a combination of a candidate segment corresponding to the corresponding phoneme and a candidate segment corresponding to the preceding phoneme), a cost, which is an index indicating the appropriateness of using the combination for speech synthesis.
  • the cost is calculated from the difference between the target segment environment and the attribute information of a candidate segment, and from the differences between the attribute information of adjacent candidate segments.
  • the cost, which is the value of this calculation, decreases as the similarity between the synthesized speech characteristics indicated by the target segment environment and the candidate segment increases, that is, as the appropriateness for synthesizing the speech increases. Likewise, the smaller the difference in attribute information between adjacent candidate segments, that is, the smaller the gap at segment connection, the lower the cost. The lower the cost, the higher the naturalness, i.e., the degree to which the synthesized speech resembles speech uttered by humans. Therefore, the segment selection unit 3 selects the segments with the lowest calculated cost.
  • the cost calculated by the segment selection unit 3 includes a unit cost and a connection cost.
  • the unit cost indicates the degree of sound quality degradation estimated to occur when the candidate segment is used in the environment indicated by the target segment environment.
  • the unit cost is calculated based on the similarity between the attribute information of the candidate segment and the target segment environment.
  • the connection cost indicates the degree of sound quality degradation estimated to be caused by discontinuity of the segment environments between connected speech segments.
  • the connection cost is calculated based on the affinity of the segment environments between adjacent candidate segments.
  • various generally proposed methods are used for calculating the unit cost and the connection cost.
  • the segment selection unit 3 selects, from the candidate segments, the segments of the combination that minimizes the calculated cost as the segments most suitable for speech synthesis.
  • the segment selected by the segment selection unit 3 is referred to as “optimal segment”.
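As an illustration of the selection described above, the following sketch picks one candidate segment per synthesis unit by dynamic programming over unit costs and connection costs. The squared-difference cost functions and the tuple attribute vectors are illustrative assumptions; the patent does not specify concrete cost formulas.

```python
# Hypothetical sketch of cost-based segment selection. Attributes are plain
# numeric tuples; real systems use richer features and weighted costs.

def unit_cost(candidate, target):
    """Mismatch between a candidate's attributes and the target environment."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

def connection_cost(prev_candidate, candidate):
    """Mismatch between adjacent candidates at the connection boundary."""
    return sum((a - b) ** 2 for a, b in zip(prev_candidate, candidate))

def select_segments(candidates_per_unit, targets):
    """Viterbi search: choose one candidate per unit minimizing total cost."""
    n = len(targets)
    # best[i][j] = (lowest total cost ending in candidate j of unit i, backpointer)
    best = [{j: (unit_cost(c, targets[0]), None)
             for j, c in enumerate(candidates_per_unit[0])}]
    for i in range(1, n):
        layer = {}
        for j, c in enumerate(candidates_per_unit[i]):
            pj, prev_total = min(
                ((pj, best[i - 1][pj][0]
                  + connection_cost(candidates_per_unit[i - 1][pj], c))
                 for pj in best[i - 1]),
                key=lambda pair: pair[1])
            layer[j] = (prev_total + unit_cost(c, targets[i]), pj)
        best.append(layer)
    # trace back the lowest-cost path
    j = min(best[-1], key=lambda j: best[-1][j][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    return list(reversed(path))
```

In the usage below, the combination with both low unit costs and a small boundary gap wins over candidates far from the target.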
  • based on the target prosodic information supplied from the prosody generation unit 2 and the selected segments and their attribute information supplied from the segment selection unit 3, the waveform generation unit 4 generates speech waveforms having a prosody that matches or is similar to the target prosody, and connects the generated speech waveforms to produce the synthesized speech.
  • the segments represented by the segment information supplied from the segment selection unit 3 are classified into segments composed of voiced sound and segments composed of unvoiced sound.
  • the method used for performing prosody control for voiced sound and the method used for performing prosody control for unvoiced sound are different from each other. Therefore, the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7 that connects voiced sound and unvoiced sound.
  • the unvoiced sound generation unit 6 generates an unvoiced sound waveform having a prosody that matches or is similar to the prosodic information supplied from the prosody generation unit 2 based on the segments supplied from the segment selection unit 3.
  • since an unvoiced sound segment supplied from the segment selection unit 3 is a speech waveform that has been cut out, the unvoiced sound generation unit 6 can generate an unvoiced sound waveform using the method described in Non-Patent Document 4. Alternatively, the method described in Non-Patent Document 5 may be used.
  • the voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
  • the waveform generation parameter selection unit 50 selects a waveform generation parameter used to generate a voiced sound waveform based on the segment information supplied from the segment selection unit 3 and the prosody information supplied from the prosody generation unit 2.
  • FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit 50.
  • the waveform generation parameter selection unit 50 generates a function for determining which waveform generation parameter is arranged on the time axis of the synthesized speech from the time length of the optimum segment and the target time length (step S1). Since this function is a function used for selecting a waveform generation parameter, in the present embodiment, this function is referred to as a “waveform generation parameter selection function”.
  • for example, the waveform generation parameter selection unit 50 generates, for each optimal segment, a linear function such as the following equation (1) as the waveform generation parameter selection function.
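The image of equation (1) is not reproduced in this text. A plausible reconstruction, assuming (consistent with the surrounding description) that T_u denotes the time length of the optimal segment and T_o the target time length, is a linear mapping from the synthesized-speech time t to a position on the segment's time axis:

```latex
% Hypothetical reconstruction of equation (1): a linear selection function
% mapping synthesized-speech time t to a position on the segment's time axis.
F_{u}(t) = \frac{T_{u}}{T_{o}}\, t \qquad (0 \le t \le T_{o}) \tag{1}
```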
  • the waveform generation parameter selection unit 50 checks whether each selected segment is continuous with its subsequent segment (step S2).
  • being continuous with the subsequent segment means being continuous in the source recorded speech stored in the segment information storage unit 10.
  • suppose that the unit of a segment is a syllable, that the syllable of the segment being checked (hereinafter referred to as the "preceding segment") is "u", and that the syllable of the subsequent segment is "ma". If the preceding and subsequent segments are selected from different recorded utterances, for example "ushi" and "mari", the preceding and subsequent segments are discontinuous.
  • if both are selected from consecutive sections of the same recorded utterance, for example "umai" (delicious), the preceding and subsequent segments are continuous.
  • when they are continuous, the waveform generation parameter selection unit 50 obtains a common waveform generation parameter selection function used for both from the waveform generation parameter selection functions of the preceding and subsequent segments. For example, if the time lengths of the preceding and subsequent optimal segments are T u1 and T u2 and the target time lengths are T o1 and T o2 , a polygonal line function such as the following equation (2) is obtained.
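The image of equation (2) is likewise not reproduced. Given the description (a common polygonal line covering the preceding segment over [0, T_o1] and the subsequent segment over (T_o1, T_o1 + T_o2]), a plausible reconstruction is:

```latex
% Hypothetical reconstruction of equation (2): a common piecewise-linear
% selection function for a continuous preceding/subsequent segment pair.
F_{u2}(t) =
\begin{cases}
\dfrac{T_{u1}}{T_{o1}}\, t & (0 \le t \le T_{o1}) \\[1.5ex]
T_{u1} + \dfrac{T_{u2}}{T_{o2}}\,(t - T_{o1}) & (T_{o1} < t \le T_{o1} + T_{o2})
\end{cases} \tag{2}
```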
  • FIG. 3 is an explanatory diagram showing assignment of waveform generation parameters.
  • FIG. 3 shows an example in which waveform generation parameters are assigned in accordance with the target time lengths when the segments are continuous. The "Nth segment" represents the preceding segment, and the "N+1th segment" represents the subsequent segment.
  • FIG. 4 is an explanatory diagram showing an example in which Fu2 (t) is plotted based on the assignment shown in FIG.
  • the waveform generation parameter selection unit 50 corrects the waveform generation parameter selection function used to select appropriate waveform generation parameters from the preceding and subsequent optimal segments, and obtains a waveform generation parameter selection function that takes continuity into consideration (step S3). Several methods for obtaining the corrected waveform generation parameter selection function are described below.
  • FIG. 5 is an explanatory diagram showing a first example of a waveform generation parameter selection function.
  • the first example of the waveform generation parameter selection function is generated by introducing a straight line passing through the midpoints of the preceding and subsequent segments.
  • in this case, a polygonal line function such as the following equation (3) is used as the waveform generation parameter selection function.
  • FIG. 6 is an explanatory diagram illustrating a second example of the waveform generation parameter selection function.
  • the second example of the waveform generation parameter selection function shown in FIG. 6 is obtained based on a linear function that connects the start of the preceding segment and the end of the subsequent segment. For example, as shown in FIG. 6, a polygonal line function passing through the intersection (T o1 , Q) of the segment connection boundary line and the straight-line function, and through the midpoint of the end of the preceding segment (T o1 , T u1 ), is used as the waveform generation parameter selection function.
  • in equation (4), T um is expressed as in equation (5) below.
  • FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function.
  • the third example of the waveform generation parameter selection function shown in FIG. 7 is obtained by smoothing the polygonal line function Fu2 (t).
  • as the smoothing method, for example, the polygonal line function is regarded as a time series and smoothed by a moving average or by first-order leaky integration.
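The two smoothing methods mentioned here can be sketched as follows, treating a sampled selection function as a time series. The window size and the leak coefficient are illustrative choices, not values from the patent.

```python
# Sketch of the two smoothing methods: centered moving average and
# first-order leaky integration, applied to a sampled selection function.

def moving_average(values, window=3):
    """Centered moving average; the edges use only the available neighbors."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def leaky_integration(values, alpha=0.5):
    """First-order leaky integrator: y[n] = alpha*y[n-1] + (1-alpha)*x[n]."""
    out = []
    y = values[0]
    for x in values:
        y = alpha * y + (1 - alpha) * x
        out.append(y)
    return out
```

Either filter rounds off the corner of the polygonal line at the segment connection boundary, which is the effect the correction is after.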
  • the waveform generation parameter selection unit 50 smoothes the change in the slope of the waveform generation parameter selection function by using the methods of the first to third examples. Thereby, the speech synthesizer of this embodiment can generate synthesized speech with a smooth spectrum change.
  • the above correction methods have been described on the assumption that the waveform generation parameter selection function to be corrected is a polygonal line function, but the same methods can be applied to functions other than polygonal lines, such as curves. Further, regarding the first example shown in FIG. 5, the corrected waveform generation parameter selection function was described as passing through the midpoints of the preceding and subsequent segments, but it may instead be a function that passes through points other than the midpoints. Likewise, regarding the second example shown in FIG. 6, the corrected waveform generation parameter selection function was described as passing through the intersection (T o1 , Q) of the segment connection boundary line and the straight-line function and the end of the preceding segment (T o1 , T u1 ), but it may also be a function that passes through points other than the midpoint.
  • the waveform generation parameter selection unit 50 calculates pitch synchronization times (also referred to as pitch marks) from the pitch time series generated by the prosody generation unit 2 (step S4).
  • a method for calculating the pitch synchronization position from the pitch time series is described in Non-Patent Document 6, for example.
  • the waveform generation unit 4 may calculate the pitch synchronization position by the method described in Non-Patent Document 6.
  • the waveform generation parameter selection unit 50 uses the waveform generation parameter selection function to select the waveform generation parameter closest to the pitch synchronization time (step S5).
  • specifically, the ideal waveform generation parameter position (time) is first calculated from each pitch synchronization position of the synthesized speech using the waveform generation parameter selection function.
  • the waveform generation parameter selection unit 50 then adopts the waveform generation parameter closest to that time. For example, suppose the position of the nth waveform generation parameter is 100 milliseconds, the position of the (n+1)th waveform generation parameter is 180 milliseconds, and the time obtained by the waveform generation parameter selection function is 160 milliseconds. In this case, the (n+1)th waveform generation parameter is selected.
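Step S5 can be sketched as follows. The identity selection function and the millisecond positions in the test are illustrative, echoing the numeric example above.

```python
# Sketch of step S5: map each pitch synchronization time of the synthesized
# speech through the selection function, then take the waveform generation
# parameter whose position on the segment's time axis is nearest.

def select_parameters(pitch_marks, param_times, selection_function):
    """Return, for each pitch mark, the index of the nearest parameter."""
    chosen = []
    for t in pitch_marks:
        ideal = selection_function(t)  # ideal position on the segment's axis
        chosen.append(min(range(len(param_times)),
                          key=lambda i: abs(param_times[i] - ideal)))
    return chosen
```

With parameters at 100 ms and 180 ms and a mapped time of 160 ms, the 180 ms parameter is chosen, matching the example in the text.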
  • FIG. 8 is an explanatory diagram showing a state in which a voiced sound waveform is generated from two speech segments composed of nine waveform generation parameters.
  • the function shown in FIG. 5 is used as the waveform generation parameter selection function.
  • in the example of FIG. 8, the waveform generation parameters corresponding to the pitch synchronization times are the 1st, 3rd, 4th, 5th, 6th, 7th, 8th, 8th (selected twice), and 9th waveform generation parameters. The waveform generation unit 4 generates a waveform using these waveform generation parameters.
  • the voiced sound waveform generator 51 generates a voiced sound waveform based on the waveform generation parameters supplied from the waveform generation parameter selector 50 and the prosody information supplied from the prosody generator 2.
  • the voiced sound waveform generator 51 generates a voiced sound waveform by arranging the center of each selected waveform generation parameter at the pitch synchronization time.
  • when the waveform generation parameters are pitch waveforms, the voiced sound waveform generation unit 51 generates a voiced sound waveform by arranging each pitch waveform at its pitch synchronization time.
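A minimal sketch of this arrangement (pitch-synchronous overlap-add): each selected pitch waveform is centered at its pitch synchronization time and summed into an output buffer. Representing pitch marks as discrete sample indices is an assumption for illustration.

```python
# Sketch of pitch-synchronous overlap-add: center each pitch waveform at its
# pitch mark (a sample index) and accumulate into the output buffer.

def overlap_add(pitch_waveforms, pitch_marks, length):
    """Place each pitch waveform centered at its pitch mark and sum."""
    out = [0.0] * length
    for wave, mark in zip(pitch_waveforms, pitch_marks):
        start = mark - len(wave) // 2  # center the waveform on the mark
        for k, sample in enumerate(wave):
            idx = start + k
            if 0 <= idx < length:      # clip at the buffer edges
                out[idx] += sample
    return out
```

Spacing the pitch marks closer together or farther apart is what realizes the target pitch of the synthesized speech.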
  • the waveform connecting unit 7 connects the voiced sound waveform supplied from the voiced sound generating unit 5 and the unvoiced sound waveform supplied from the unvoiced sound generating unit 6 and outputs it as a synthesized speech waveform.
  • that is, the voiced sound waveform v(t) and the unvoiced sound waveform u(t) are concatenated to generate and output the synthesized speech waveform x(t).
  • as described above, the speech synthesizer of this embodiment corrects the waveform generation parameter selection function in consideration of continuity. Therefore, in sections for which segments that are continuous in the recorded speech are selected, the speech synthesizer of this embodiment can generate synthesized speech whose spectral change is smoother than that of the general method disclosed in Non-Patent Document 1 and the like.
  • next, a speech synthesizer according to a second embodiment (Embodiment 2) of the present invention will be described.
  • the speech synthesizer of the second embodiment differs from that of the first embodiment in that the degree of spectral change is estimated from the attribute information of the speech units, and the waveform generation parameter selection function is controlled based on the estimated degree of spectral change. The differences are therefore mainly described below.
  • FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention.
  • compared with the configuration of the speech synthesizer of the first embodiment shown in FIG. 1, the configuration of this embodiment shown in FIG. 9 replaces the waveform generation parameter selection unit 50 with a waveform generation parameter selection unit 60, and newly includes a spectrum shape change degree estimation unit 62.
  • the spectrum shape change degree estimation unit 62 estimates the degree of change of the spectrum shape at the unit connection boundary based on the unit attribute information supplied from the unit information storage unit 10.
  • the spectrum shape change degree estimation unit 62 uses language information and prosodic information included in the attribute information for estimation of the change degree of the spectrum shape.
  • a method of estimating the rate of change of the speech spectrum shape for each phoneme type is effective. For example, if the unit obtained by combining the preceding and subsequent segments is a syllable of a long vowel, the change in spectrum shape at the segment connection boundary is small, so the estimated degree of spectrum shape change is reduced. The same applies when the preceding and subsequent segments are the same phoneme. If the preceding or subsequent segment is a voiced consonant, the change in spectrum shape at the segment connection boundary is large, so the estimated degree of spectrum shape change is increased.
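The heuristic described here can be sketched as a simple rule table. The phoneme set and the numeric magnitudes below are illustrative assumptions, not values from the patent.

```python
# Rule-based sketch of spectrum-shape change estimation at a segment
# connection boundary. The consonant set and return values are illustrative.

VOICED_CONSONANTS = {"b", "d", "g", "m", "n", "r", "z"}  # hypothetical set

def estimate_spectrum_change(prev_phoneme, next_phoneme):
    """Return a relative estimate of spectral-shape change at the boundary."""
    if prev_phoneme == next_phoneme:
        return 0.1   # same phoneme (including long vowels): small change
    if prev_phoneme in VOICED_CONSONANTS or next_phoneme in VOICED_CONSONANTS:
        return 0.9   # voiced consonant at the boundary: large change
    return 0.5       # otherwise: moderate change
```

The waveform generation parameter selection unit 60 would then lengthen or shorten the correction section in inverse proportion to this estimate.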
  • the waveform generation parameter selection unit 60 selects the waveform generation parameters used for generating the voiced sound waveform based on the segment information supplied from the segment selection unit 3, the prosodic information supplied from the prosody generation unit 2, and the spectrum shape change degree supplied from the spectrum shape change degree estimation unit 62.
  • the waveform generation parameter selection unit 60 generates a waveform generation parameter selection function based on the estimated amount of spectrum shape change.
  • the waveform generation parameter selection unit 60 adjusts the length of the correction section, for example, when using the selection function shown in FIG.
  • the waveform generation parameter selection unit 60 makes the spectrum shape smoother by lengthening the correction section if the degree of change in the spectrum shape is small.
  • the waveform generation parameter selection unit 60 adjusts the length of the correction section according to the magnitude of the spectrum shape change degree.
  • the waveform generation parameter selection unit 60 similarly adjusts the distance between the end of the preceding segment on the segment boundary and the corrected selection function.
  • the waveform generation parameter selection unit 60 increases the distance between the end of the preceding segment and the corrected selection function on the segment boundary if the degree of change in the spectrum shape is small.
  • the waveform generation parameter selection function is controlled according to the attribute information of the speech unit.
  • the speech synthesizer of this embodiment can generate synthesized speech with a smooth spectrum change, particularly in a section where the degree of change in spectrum shape is small.
  • the present invention is not limited to the speech synthesizer described in each embodiment, and the configuration and operation thereof can be changed as appropriate without departing from the spirit of the invention.
  • FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention.
  • the speech synthesizer according to the present invention includes, as its main configuration, a segment selection unit 3 that selects the speech units to be used for synthesis from a plurality of previously stored speech units based on an input character string, and a waveform generation unit 4 that includes a waveform generation parameter selection unit 50 for selecting waveform generation parameters extracted from the speech units and that generates synthesized speech using the selected waveform generation parameters.
  • the waveform generation parameter selection unit 50 also generates a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech.
  • the waveform generation parameters are selected based on the waveform generation parameter selection function.
  • speech synthesis apparatuses as shown in the following (1) to (4) are also disclosed.
  • the waveform generation parameter selection unit generates a waveform generation parameter selection function that connects a first function, which connects the start and end of a preceding segment (one of the selected speech units), with a second function, which connects the start and end of a succeeding segment that follows the preceding segment; if the preceding segment and the succeeding segment are continuous, the speech synthesizer corrects the waveform generation parameter selection function so as to smooth the change in its slope.
  • the waveform generation parameter selection unit can also smooth the change in slope by correcting the waveform generation parameter selection function so that it passes through an internal dividing point of the straight line connecting the end of the preceding segment and the point, at the end time of the preceding segment on the time axis of the synthesized speech, on the straight line connecting the start of the preceding segment and the end of the succeeding segment.
  • the waveform generation parameter selection unit may also generate the waveform generation parameter selection function so as to smooth the change in slope by correcting it using a line connecting an internal dividing point of the first function and an internal dividing point of the second function.
  • the speech synthesizer includes a spectral shape change degree estimation unit (for example, a spectral shape change degree estimation unit 62) that estimates the spectral change degree at the connection boundary of the speech unit based on the attribute information of the speech unit.
  • the waveform generation parameter selection unit may be configured to generate a waveform generation parameter selection function based on the estimated degree of spectrum change.
  • the present invention can be applied to information providing services using synthesized speech.
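The configuration summarized above can be illustrated with a small sketch. This is not the patented implementation; the function shapes follow equation (2) and the FIG. 6 style correction described later in the specification, and all function and variable names here are hypothetical:

```python
def selection_function(t, T_o1, T_o2, T_u1, T_u2):
    """Uncorrected piecewise-linear map from synthesized-speech time t to
    segment time: each segment is stretched independently (cf. equation (2))."""
    if t <= T_o1:
        return (T_u1 / T_o1) * t
    return T_u1 + (T_u2 / T_o2) * (t - T_o1)

def corrected_selection_function(t, T_o1, T_o2, T_u1, T_u2):
    """Correction in the spirit of the second example (FIG. 6): when the
    segments are continuous, the polyline passes at the segment boundary
    through T_um, the midpoint of Q (the start-to-end straight line's value
    at the boundary) and T_u1, which smooths the slope change."""
    slope = (T_u1 + T_u2) / (T_o1 + T_o2)   # straight line from start to end
    Q = slope * T_o1
    T_um = (Q + T_u1) / 2.0
    if t <= T_o1:
        return (T_um / T_o1) * t
    return T_um + ((T_u1 + T_u2 - T_um) / T_o2) * (t - T_o1)

def select_parameter_indices(synth_times, func, param_times):
    """For each pitch-synchronous position of the synthesized speech, map it
    into segment time with the selection function and take the nearest
    waveform generation parameter (e.g. pitch waveform)."""
    indices = []
    for t in synth_times:
        u = func(t)
        indices.append(min(range(len(param_times)),
                           key=lambda i: abs(param_times[i] - u)))
    return indices
```

Note that the corrected function still reaches the same endpoint as the uncorrected one, so the total target duration is preserved; only the distribution of parameters around the segment boundary changes.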

Abstract

Provided is a speech synthesizer capable of generating synthesized speech with a smooth spectrum change during a section in which consecutive phones in recorded speech are selected. The speech synthesizer comprises: a phone selection unit (3) for selecting phones used for synthesis from a plurality of preliminarily stored phones on the basis of an input text string; and a waveform generation unit (4) which includes a waveform generation parameter selection unit (50) for selecting waveform generation parameters extracted from the phones and generates synthesized speech using the selected waveform generation parameters. Taking into account the continuity of the selected phones, the waveform generation parameter selection unit (50) generates a waveform generation parameter selection function indicating where the waveform generation parameters on the time axis of the phones are to be placed on the time axis of the synthesized speech, and selects the waveform generation parameters on the basis of the waveform generation parameter selection function.

Description

Speech synthesizer, speech synthesis method, and speech synthesis program
 The present invention relates to speech synthesis technology, and more particularly to a speech synthesizer, a speech synthesis method, and a speech synthesis program for synthesizing speech based on input text.
 A speech synthesizer that analyzes an input character string and generates synthesized speech from the speech information indicated by the character string is known. Such a speech synthesizer first generates prosodic information for the synthesized speech (information on pitch, phoneme duration, power, and the like) based on the language processing result obtained by analyzing the input character string.
 Next, the speech synthesizer selects a plurality of optimal segments from a segment dictionary based on the language processing result and the generated prosodic information (referred to as "target prosodic information"), and creates a single optimal segment sequence. A segment, sometimes called a speech unit, is generated in advance, for example for each half-syllable, based on recorded speech. In general, a plurality of types of segments are generated from various recorded voices for one sound (here, a sound of about half a syllable). A waveform generation parameter sequence is then formed from the optimal segment sequence, and synthesized speech is obtained by generating a speech waveform from that sequence. The segments stored in the segment dictionary are extracted and generated from a large amount of natural speech using various methods.
 When generating a synthesized speech waveform from the selected segments, such a speech synthesizer creates, from the segments, a speech waveform whose prosody is close to the generated prosodic information, in order to ensure high sound quality. As a method of generating both the synthesized speech waveform and the segments used to generate it, for example, the method described in Non-Patent Document 1 is used.
 FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1. As shown in FIG. 11, a waveform generation parameter generated by the method described in Non-Patent Document 1 is a waveform (pitch waveform) cut out from the speech waveform using a window function centered at a pitch synchronization position calculated from the pitch of the recorded speech and having a time width calculated from that pitch. When a synthesized speech waveform is generated by the method of Non-Patent Document 1, waveform generation parameters (pitch waveforms) are selected from the waveform generation parameter sequence based on the pitch generated from the language processing result, that is, the pitch of the synthesized speech. A synthesized speech waveform is then generated by concatenating the selected pitch waveforms. The selection of pitch waveforms is basically performed based on the correspondence between the pitch synchronization positions of the recorded speech and those of the synthesized speech.
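The pitch-synchronous selection described above can be sketched schematically as follows. This is only an illustration of the idea (constant pitch and a nearest-neighbor mapping are assumed for simplicity), not the actual method of Non-Patent Document 1, and all names are hypothetical:

```python
def synth_pitch_marks(duration, f0):
    """Place pitch-synchronous positions for the synthesized speech at
    intervals of one pitch period 1/f0 (constant pitch assumed)."""
    period = 1.0 / f0
    n = int(duration / period)          # number of full periods that fit
    return [i * period for i in range(n)]

def choose_pitch_waveforms(synth_marks, recorded_marks):
    """For each synthesized pitch mark, pick the index of the nearest recorded
    pitch mark; the pitch waveform cut out around that recorded mark is then
    overlap-added at the synthesized position."""
    return [min(range(len(recorded_marks)),
                key=lambda i: abs(recorded_marks[i] - m))
            for m in synth_marks]
```

When the target pitch differs strongly from the recorded pitch, this nearest-neighbor correspondence repeats or skips pitch waveforms, which is exactly the thinning/insertion behavior criticized below.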
 Note that Non-Patent Document 7 describes that, in addition to a pitch waveform, a power spectrum, linear prediction coefficients, a cepstrum, a mel-cepstrum, an LSP (Line Spectrum Pair), and the like can be used as waveform generation parameters.
 However, the waveform generation method described in Non-Patent Document 1 has the problem that appropriate waveform generation parameters are not selected, and the sound quality of the synthesized speech deteriorates.
 According to Non-Patent Document 1, waveform generation parameters are selected so that the target prosodic information is faithfully reproduced for each individual speech unit, based on predetermined segment boundary positions. As a result, thinning and insertion of waveform generation parameters are repeated many times, the temporal change in the spectrum of the synthesized speech becomes biased, and it is difficult to realize a smooth spectral change. Hence the above problem arises.
 The present invention therefore aims to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program capable of generating synthesized speech with a smooth spectrum change in sections where segments that are continuous in the recorded speech have been selected.
 The speech synthesizer according to the present invention includes a segment selection unit that selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance, and a waveform generation unit that includes a waveform generation parameter selection unit for selecting waveform generation parameters extracted from the speech units and that generates synthesized speech using the selected waveform generation parameters. The waveform generation parameter selection unit generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech, and selects the waveform generation parameters based on that function.
 The speech synthesis method according to the present invention selects, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; generates, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function, which is a function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech; selects, based on that function, the waveform generation parameters extracted from the speech units; and generates synthesized speech using the selected waveform generation parameters.
 The speech synthesis program according to the present invention causes a computer to execute: a segment selection process of selecting, based on an input character string, the speech units used for synthesis from a plurality of speech units stored in advance; a waveform generation parameter selection process of generating, in consideration of the continuity of the selected speech units, a waveform generation parameter selection function indicating where the waveform generation parameters on the time axis of the speech unit are to be placed on the time axis of the synthesized speech, and selecting, based on that function, the waveform generation parameters extracted from the speech units; and a waveform generation process of generating synthesized speech using the selected waveform generation parameters.
 According to the present invention, synthesized speech with a smooth spectrum change can be generated in sections where segments that are continuous in the recorded speech have been selected.
FIG. 1 is a block diagram showing the configuration of the first embodiment of the speech synthesizer according to the present invention.
FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit.
FIG. 3 is an explanatory diagram showing the assignment of waveform generation parameters.
FIG. 4 is an explanatory diagram showing an example in which F_u2(t) is plotted based on the assignment shown in FIG. 3.
FIG. 5 is an explanatory diagram showing a first example of the waveform generation parameter selection function.
FIG. 6 is an explanatory diagram showing a second example of the waveform generation parameter selection function.
FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function.
FIG. 8 is an explanatory diagram showing how a voiced sound waveform is generated from two speech units each composed of nine waveform generation parameters.
FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention.
FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention.
FIG. 11 is an explanatory diagram showing the assignment of waveform generation parameters in Non-Patent Document 1.
 Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
Embodiment 1.
 FIG. 1 is a block diagram showing the configuration of the first embodiment (Embodiment 1) of the speech synthesizer according to the present invention. As shown in FIG. 1, the speech synthesizer of this embodiment includes a language processing unit 1, a prosody generation unit 2, a segment selection unit 3, a waveform generation unit 4, and a segment information storage unit 10. The waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7. The voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51.
 The segment information storage unit 10 stores speech unit information representing speech units and attribute information representing the attributes of each speech unit. A speech unit is a part of the basic speech (natural speech uttered by a human) on which the speech synthesis process is based, and is generated by dividing the basic speech into speech synthesis units.
 In this embodiment, the speech unit information includes time-series data of the waveform generation parameters that are extracted from the speech unit and used for generating the synthesized speech waveform. A pitch waveform is used as the waveform generation parameter in the following description, but the parameter may be, for example, a power spectrum, linear prediction coefficients, a cepstrum, a mel-cepstrum, or an LSP (see Non-Patent Document 7). In particular, when the data amount of the segments needs to be reduced, it is preferable to use linear prediction coefficients, an LSP, or the like as the waveform generation parameter. The speech synthesis unit is a syllable. Note that, as disclosed in Patent Document 2, the speech synthesis unit may also be a phoneme, a half-phoneme, a half-syllable such as CV (Consonant-Vowel), CVC, VCV, or the like.
 The attribute information includes language information, which contains information representing the character string (recorded sentence) corresponding to the basic speech, and prosodic information of the basic speech. The language information is, for example, information expressed as a sentence mixing kanji and kana. The language information may further include information such as readings, syllable strings, phoneme strings, accent positions, accent phrase breaks, and the parts of speech of morphemes. The prosodic information includes the pitch (fundamental frequency), amplitude, time series of short-time power, and the durations of the syllables, phonemes, and pauses contained in the natural speech.
 The language processing unit 1 analyzes the character string of the input text. Specifically, the language processing unit 1 performs analyses such as morphological analysis, syntactic analysis, and reading assignment. Based on the analysis results, the language processing unit 1 outputs, as the language analysis result, information representing a symbol string such as phoneme symbols that expresses the "reading", together with information representing the parts of speech, conjugations, accent types, and the like of the morphemes, to the prosody generation unit 2 and the segment selection unit 3.
 The prosody generation unit 2 generates the prosody of the synthesized speech based on the language analysis result output by the language processing unit 1, and outputs prosodic information indicating the generated prosody to the segment selection unit 3 and the waveform generation unit 4 as target prosodic information. For generating the prosody, for example, the method described in Patent Document 3 is used.
 The segment selection unit 3 selects, from the segments stored in the segment information storage unit 10, segments that satisfy predetermined requirements based on the language analysis result and the target prosodic information, and outputs the selected segments and their attribute information to the waveform generation unit 4.
 The operation of the segment selection unit 3 will now be described in detail. Based on the input language analysis result and the target prosodic information, the segment selection unit 3 generates, for each speech synthesis unit, information indicating the characteristics of the synthesized speech (hereinafter referred to as the "target segment environment").
 The target segment environment is information that includes the phoneme in question constituting the synthesized speech for which the target segment environment is generated, the preceding phoneme (the phoneme before it), the succeeding phoneme (the phoneme after it), the presence or absence of stress, the distance from the accent nucleus, the pitch frequency for each speech synthesis unit, the power, the duration of the speech synthesis unit, the cepstrum, MFCC (Mel Frequency Cepstral Coefficients), and the amounts of change of these per unit time.
 Next, based on the information included in the generated target segment environment, the segment selection unit 3 acquires from the segment information storage unit 10, for each speech synthesis unit, a plurality of segments corresponding to consecutive phonemes. That is, based on the information included in the target segment environment, the segment selection unit 3 acquires a plurality of segments corresponding to each of the phoneme in question, the preceding phoneme, and the succeeding phoneme. The acquired segments are candidates for the segments used to generate the synthesized speech, and are hereinafter referred to as candidate segments.
 The segment selection unit 3 then calculates, for each combination of acquired adjacent candidate segments (for example, a combination of a candidate segment corresponding to the phoneme in question and a candidate segment corresponding to the preceding phoneme), a cost, which is an index indicating the appropriateness of using the segments for synthesizing speech. The cost is the calculated result of the difference between the target segment environment and the attribute information of the candidate segment, and of the difference between the attribute information of adjacent candidate segments.
 The cost decreases as the similarity between the characteristics of the synthesized speech indicated by the target segment environment and the candidate segment increases, that is, as the appropriateness for synthesizing the speech increases. The cost also decreases as the difference in attribute information between adjacent candidate segments decreases, that is, as the gap at the segment connection becomes smaller. The lower the cost of the segments used, the higher the naturalness of the synthesized speech, that is, the degree to which it resembles speech uttered by a human. Therefore, the segment selection unit 3 selects the segments with the lowest calculated cost.
 Specifically, the cost calculated by the segment selection unit 3 consists of a unit cost and a connection cost. The unit cost indicates the degree of sound quality degradation estimated to occur when a candidate segment is used in the environment indicated by the target segment environment; it is calculated based on the similarity between the attribute information of the candidate segment and the target segment environment. The connection cost indicates the degree of sound quality degradation estimated to be caused by discontinuity of the segment environments between connected speech units; it is calculated based on the affinity of the segment environments of adjacent candidate segments. Various commonly proposed methods are used to calculate the unit cost and the connection cost.
 The segment selection unit 3 selects, from among the candidate segments, the combination of segments that minimizes the calculated cost as the segments most suitable for speech synthesis. The segments selected by the segment selection unit 3 are referred to as "optimal segments".
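The minimum-cost search described above can be sketched as a standard dynamic-programming (Viterbi-style) search over the candidate segments. The concrete unit-cost and connection-cost functions are not specified in this excerpt, so they are passed in as placeholder callables; all names here are hypothetical:

```python
def select_optimal_segments(candidates, unit_cost, connection_cost):
    """Find the candidate-segment sequence minimizing the total cost
    (sum of unit costs plus sum of connection costs between neighbors).
    candidates: one list of candidate segments per synthesis unit."""
    # best[i][j]: (cost of best path ending at candidate j of unit i, backpointer)
    best = [[(unit_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for cand in candidates[i]:
            prev_j, prev_cost = min(
                ((k, best[i - 1][k][0] + connection_cost(candidates[i - 1][k], cand))
                 for k in range(len(candidates[i - 1]))),
                key=lambda kv: kv[1])
            row.append((prev_cost + unit_cost(i, cand), prev_j))
        best.append(row)
    # trace back the minimum-cost path
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(candidates))]
```

The search is linear in the number of synthesis units and quadratic in the number of candidates per unit, which is the usual complexity of unit-selection synthesis.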
 Based on the target prosodic information supplied from the prosody generation unit 2 and on the selected segments and their attribute information supplied from the segment selection unit 3, the waveform generation unit 4 generates speech waveforms having a prosody that matches or is similar to the target prosody, and connects the generated speech waveforms to generate synthesized speech.
 The segments represented by the segment information supplied from the segment selection unit 3 are classified into segments consisting of voiced sounds and segments consisting of unvoiced sounds. The method used for prosody control of voiced sounds differs from that used for unvoiced sounds. Therefore, the waveform generation unit 4 includes a voiced sound generation unit 5, an unvoiced sound generation unit 6, and a waveform connection unit 7 that connects the voiced and unvoiced sounds.
 The unvoiced sound generation unit 6 generates, based on the segments supplied from the segment selection unit 3, an unvoiced sound waveform having a prosody that matches or is similar to the prosodic information supplied from the prosody generation unit 2. In this embodiment, since the unvoiced segments supplied from the segment selection unit 3 are cut-out speech waveforms, the unvoiced sound generation unit 6 can generate the unvoiced sound waveform using the method described in Non-Patent Document 4. The method described in Non-Patent Document 5 may also be used.
 The voiced sound generation unit 5 includes a waveform generation parameter selection unit 50 and a voiced sound waveform generation unit 51. The waveform generation parameter selection unit 50 selects the waveform generation parameters used to generate the voiced sound waveform, based on the segment information supplied from the segment selection unit 3 and the prosodic information supplied from the prosody generation unit 2.
 FIG. 2 is a flowchart showing the operation of the waveform generation parameter selection unit 50. First, the waveform generation parameter selection unit 50 generates, from the time length of the optimal segment and the target time length, a function that determines where on the time axis of the synthesized speech each waveform generation parameter is to be placed (step S1). Since this function is used for selecting waveform generation parameters, it is referred to in this embodiment as the "waveform generation parameter selection function".
 For example, if the time length of the optimal segment is T_u and the target time length is T_o, the waveform generation parameter selection unit 50 generates, for each optimal segment, a linear function such as the following equation (1) as the waveform generation parameter selection function.
[Equation (1)]
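The image of equation (1) is not reproduced in this text. Based on the surrounding description (a linear function mapping the time axis of the synthesized speech onto the time axis of the optimal segment, with segment length T_u and target length T_o), a plausible reconstruction is:

```latex
F_u(t) = \frac{T_u}{T_o}\, t, \qquad 0 \le t \le T_o
```

Here t is time on the synthesized-speech axis and F_u(t) is the corresponding position on the segment's time axis.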
 Next, the waveform generation parameter selection unit 50 checks, for every selected segment, whether it is continuous with its succeeding segment (step S2). Here, being continuous with the succeeding segment means being continuous in the source recorded speech stored in the segment information storage unit 10. For example, suppose the segment unit is the syllable, the syllable of the segment being checked (here called the "preceding segment") is "u", and the syllable of the succeeding segment being checked is "ma". If the preceding segment and the succeeding segment were selected from separate recordings, such as "ushi" and "mari", the two segments are discontinuous. On the other hand, if they were selected from consecutive sections of the same recording, such as "umai" or "shimauma", the preceding segment and the succeeding segment are continuous.
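A minimal sketch of the continuity check in step S2, assuming each segment's attribute information carries a recording identifier and a sample range (these fields are hypothetical; the specification does not define how segment origin is recorded):

```python
def is_continuous(preceding, succeeding):
    """Two selected segments are 'continuous' if they come from the same
    recorded utterance and the succeeding one starts exactly where the
    preceding one ends in that recording."""
    return (preceding["recording_id"] == succeeding["recording_id"]
            and preceding["end_sample"] == succeeding["start_sample"])
```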
 If the segments selected by the segment selection unit 3 are continuous, it is preferable to realize a smooth spectrum change by taking that continuity into account. Therefore, the waveform generation parameter selection unit 50 uses the waveform generation parameter selection functions of the preceding and succeeding segments to obtain a common waveform generation parameter selection function used by both. For example, if the time lengths of the preceding and succeeding optimal segments are T_u1 and T_u2 and the target time lengths are T_o1 and T_o2, a polyline function such as the following equation (2) is obtained.
[Equation (2): image not reproduced]
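Equation (2) appears above only as an image, so the following Python sketch shows one plausible form of the common polygonal-line function: each segment is assumed to be stretched linearly from its target duration To to its original duration Tu. The exact functional form is an assumption, not the patent's verbatim formula.

```python
def selection_function(t, To1, To2, Tu1, Tu2):
    """Map a time t on the synthesized-speech axis (0 <= t <= To1 + To2)
    to a time on the concatenated original segments.

    Assumed reading of Equation (2): each segment is linearly stretched
    from its target duration To to its original duration Tu.
    """
    if t <= To1:
        return (Tu1 / To1) * t                # inside the preceding segment
    return Tu1 + (Tu2 / To2) * (t - To1)      # inside the succeeding segment
```

Under this reading the mapping is continuous at the segment boundary: the value at t = To1 is Tu1, the end of the preceding segment.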
 FIG. 3 is an explanatory diagram showing the assignment of waveform generation parameters. It illustrates how, when two segments are continuous, waveform generation parameters are assigned to match the target durations. The "N-th segment" is the preceding segment, and the "(N+1)-th segment" is the succeeding segment. FIG. 4 is an explanatory diagram plotting Fu2(t) based on the assignment shown in FIG. 3.
 Next, the waveform generation parameter selection unit 50 corrects the waveform generation parameter selection function used to pick appropriate waveform generation parameters from the preceding and succeeding optimum segments, and thereby obtains a selection function that takes continuity into account (step S3). Several methods for obtaining the corrected selection function are described below.
 FIG. 5 is an explanatory diagram showing a first example of the waveform generation parameter selection function. As shown in FIG. 5, the first example is generated by introducing a straight line that passes through the midpoints of the preceding and succeeding segments. In this case, a polygonal-line function such as the following Equation (3) is used as the waveform generation parameter selection function.
[Equation (3): image not reproduced]
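Equation (3) is likewise reproduced only as an image. One plausible reading of the first example is a three-piece polyline: each segment keeps its original line up to its midpoint, and the two midpoints are bridged by a straight line. A sketch under that assumption:

```python
def midpoint_smoothed(t, To1, To2, Tu1, Tu2):
    """First correction example (FIG. 5), assumed form: keep the original
    per-segment lines outside the boundary region and bridge the two
    segment midpoints with a straight line."""
    m1t, m1y = To1 / 2.0, Tu1 / 2.0                 # midpoint of the preceding segment's line
    m2t, m2y = To1 + To2 / 2.0, Tu1 + Tu2 / 2.0     # midpoint of the succeeding segment's line
    if t <= m1t:
        return (Tu1 / To1) * t
    if t >= m2t:
        return Tu1 + (Tu2 / To2) * (t - To1)
    slope = (m2y - m1y) / (m2t - m1t)               # bridging line through both midpoints
    return m1y + slope * (t - m1t)
```

At the segment boundary this bridging line replaces the abrupt slope change of the uncorrected polyline with a single intermediate slope.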
 FIG. 6 is an explanatory diagram showing a second example of the waveform generation parameter selection function. The second example is obtained from the straight-line function connecting the start of the preceding segment to the end of the succeeding segment. For example, as shown in FIG. 6, a polygonal-line function that passes through the midpoint between the intersection (To1, Q) of that straight line with the segment connection boundary and the end point (To1, Tu1) of the preceding segment is used as the selection function. Writing this midpoint as (To1, Tum), the polygonal-line function given by the following Equation (4) is used as the waveform generation parameter selection function.
[Equation (4): image not reproduced]
 In Equation (4), Tum is given by the following Equation (5).
[Equation (5): image not reproduced]
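Equations (4) and (5) are also reproduced only as images. From the surrounding description, Q is plausibly the value at t = To1 of the straight line joining the start of the preceding segment to the end of the succeeding one, Tum is the midpoint of Q and Tu1, and the corrected polyline passes through (To1, Tum). A sketch under those assumptions:

```python
def boundary_midpoint_corrected(t, To1, To2, Tu1, Tu2):
    """Second correction example (FIG. 6), assumed form."""
    # Q: value at the segment boundary of the straight line from
    # (0, 0) to (To1 + To2, Tu1 + Tu2)
    Q = (Tu1 + Tu2) * To1 / (To1 + To2)
    Tum = (Q + Tu1) / 2.0                 # midpoint of Q and Tu1 (assumed Equation (5))
    if t <= To1:
        return (Tum / To1) * t            # first piece: (0, 0) -> (To1, Tum)
    # second piece: (To1, Tum) -> (To1 + To2, Tu1 + Tu2)
    return Tum + (Tu1 + Tu2 - Tum) / To2 * (t - To1)
```

The corrected function still starts at the beginning of the preceding segment and ends at the end of the succeeding one; only the boundary point is moved toward the overall straight line.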
 FIG. 7 is an explanatory diagram showing a third example of the waveform generation parameter selection function. The third example is obtained by smoothing the polygonal-line function Fu2(t). As a smoothing method, the polygonal line can, for example, be regarded as a time series and smoothed by a moving average or by first-order leaky integration.
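The two smoothing methods named for the third example, a moving average and first-order leaky integration, can be sketched over a sampled version of Fu2(t) as follows. The window size and leak coefficient are illustrative choices, not values from the patent.

```python
def moving_average(ys, window=5):
    """Centered moving average over a sampled polyline;
    the window shrinks near the edges of the sequence."""
    half = window // 2
    out = []
    for i in range(len(ys)):
        lo, hi = max(0, i - half), min(len(ys), i + half + 1)
        out.append(sum(ys[lo:hi]) / (hi - lo))
    return out

def leaky_integration(ys, alpha=0.3):
    """First-order leaky integrator: acc[n] = (1 - alpha) * acc[n-1] + alpha * y[n]."""
    out, acc = [], ys[0]
    for y in ys:
        acc = (1.0 - alpha) * acc + alpha * y
        out.append(acc)
    return out
```

Both filters round off the corner of a polyline: a step input is turned into a gradual transition instead of an abrupt slope change.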
 By using any of the first to third methods, the waveform generation parameter selection unit 50 smooths the change in slope of the waveform generation parameter selection function. As a result, the speech synthesizer of this embodiment can generate synthesized speech whose spectrum changes smoothly.
 The above correction methods have been described on the assumption that the selection function to be corrected is a polygonal-line function, but the same methods can be applied to other functions, such as curves. Regarding the first example shown in FIG. 5, the corrected selection function was described as passing through the midpoints of the preceding and succeeding segments, but it may instead pass through points other than the midpoints. Likewise, in the second example shown in FIG. 6, the corrected selection function was described as passing through the midpoint between the intersection (To1, Q) of the segment connection boundary with the straight-line function and the end point (To1, Tu1) of the preceding segment, but here too a function passing through a point other than the midpoint may be used.
 Next, the waveform generation parameter selection unit 50 calculates pitch synchronization times (also called pitch marks) from the pitch time series generated by the prosody generation unit 2 (step S4). A method for calculating pitch synchronization positions from a pitch time series is described, for example, in Non-Patent Document 6; the waveform generation unit 4 may calculate the pitch synchronization positions by that method.
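Placing pitch marks from a pitch time series can be sketched as stepping forward one pitch period at a time through the F0 contour. This is a simplified stand-in for, not a reproduction of, the method of Non-Patent Document 6; the function and parameter names are hypothetical.

```python
def pitch_marks(f0, frame_period=0.005, duration=None):
    """Place pitch-synchronous times given a sampled F0 contour,
    where f0[i] is the fundamental frequency in Hz at time i * frame_period.
    Unvoiced frames (f0 <= 0) are skipped one frame at a time."""
    if duration is None:
        duration = len(f0) * frame_period
    marks, t = [], 0.0
    while t < duration:
        idx = min(int(t / frame_period), len(f0) - 1)
        if f0[idx] <= 0:
            t += frame_period          # unvoiced: no mark, advance one frame
            continue
        marks.append(t)
        t += 1.0 / f0[idx]             # advance by one pitch period
    return marks
```

For a constant 64 Hz contour, for example, this produces marks spaced exactly one period (15.625 ms) apart.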
 Then, the waveform generation parameter selection unit 50 uses the waveform generation parameter selection function to select the waveform generation parameter closest to each pitch synchronization time (step S5). As in the case where continuity is not considered, the time of the ideal waveform generation parameter position is first computed from the pitch synchronization position of the synthesized speech using the selection function, and the waveform generation parameter closest to that time is then adopted. For example, if the n-th waveform generation parameter is located at 100 milliseconds, the (n+1)-th at 180 milliseconds, and the time obtained from the selection function is 160 milliseconds, the (n+1)-th waveform generation parameter is selected.
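The nearest-parameter rule of step S5 can be sketched directly; the function name is hypothetical, but the 100 ms / 180 ms example reproduces the one in the text.

```python
def select_nearest_parameter(ideal_time_ms, param_times_ms):
    """Return the index of the stored waveform generation parameter whose
    position is closest to the ideal time given by the selection function."""
    return min(range(len(param_times_ms)),
               key=lambda n: abs(param_times_ms[n] - ideal_time_ms))
```

With parameters at 100 ms and 180 ms and an ideal time of 160 ms, the second parameter (the (n+1)-th in the text's numbering) is chosen.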
 FIG. 8 is an explanatory diagram showing how a voiced sound waveform is generated from two speech segments each consisting of nine waveform generation parameters. In the example of FIG. 8, the function shown in FIG. 5 is used as the waveform generation parameter selection function. The waveform generation parameters corresponding to the pitch synchronization times are the 1st, 3rd, 4th, 5th, 6th, 7th, 8th, 8th, and 9th parameters, so the waveform generation unit 4 generates the waveform using these parameters.
 The voiced sound waveform generation unit 51 generates a voiced sound waveform based on the waveform generation parameters supplied from the waveform generation parameter selection unit 50 and the prosody information supplied from the prosody generation unit 2. It does so by placing the center of each selected waveform generation parameter at the corresponding pitch synchronization time. When the waveform generation parameter is a pitch waveform, the voiced sound waveform generation unit 51 generates the voiced sound waveform by placing the pitch waveform at the pitch synchronization time.
 The waveform connecting unit 7 connects the voiced sound waveform supplied from the voiced sound generation unit 5 and the unvoiced sound waveform supplied from the unvoiced sound generation unit 6, and outputs the result as a synthesized speech waveform. Specifically, for example, when the voiced waveform generated by the voiced sound generation unit 5 is v(t) (where t = 1, 2, 3, ..., t_v) and the unvoiced waveform generated by the unvoiced sound generation unit 6 is u(t) (where t = 1, 2, 3, ..., t_u), the waveform connecting unit 7 concatenates them to generate and output the following synthesized speech waveform x(t).
x(t) = v(t) for t = 1, ..., t_v
x(t) = u(t − t_v) for t = t_v + 1, ..., t_v + t_u
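The piecewise definition of x(t) amounts to appending the unvoiced samples after the voiced ones, as the following minimal sketch shows:

```python
def concatenate_waveforms(voiced, unvoiced):
    """x(t) = v(t) for t = 1..t_v, then x(t) = u(t - t_v) for
    t = t_v + 1 .. t_v + t_u: simple end-to-end concatenation."""
    return list(voiced) + list(unvoiced)
```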
 As described above, the speech synthesizer of this embodiment corrects the waveform generation parameter selection function in consideration of continuity. Consequently, in sections where segments that are contiguous in the recorded speech have been selected, it can generate synthesized speech whose spectrum changes more smoothly than with the general method disclosed in Non-Patent Document 1 and elsewhere.
Embodiment 2.
 Next, a speech synthesizer according to the second embodiment of the present invention will be described. It differs from the first embodiment in that it estimates the degree of spectral change from the attribute information of the speech segments and controls the waveform generation parameter selection function based on the estimated degree. The description below therefore focuses on this difference.
 FIG. 9 is a block diagram showing the configuration of the second embodiment of the speech synthesizer according to the present invention. Compared with the configuration of the first embodiment shown in FIG. 1, the waveform generation parameter selection unit 50 is replaced by a waveform generation parameter selection unit 60, and a spectral shape change degree estimation unit 62 is newly provided.
 The spectral shape change degree estimation unit 62 estimates the degree of change of the spectral shape at a segment connection boundary, based on the segment attribute information supplied from the segment information storage unit 10. For this estimation it uses the linguistic and prosodic information included in the attribute information. When phoneme or syllable types from the linguistic information are used, it is effective to estimate the rate of change of the speech spectrum shape for each type. For example, if the preceding and succeeding segments together form a long-vowel syllable, the spectral shape changes little at the boundary, so the estimated degree is made small; the same applies when the preceding and succeeding segments are the same phoneme. Conversely, if the preceding or succeeding segment is a voiced consonant, the spectral shape changes greatly at the boundary, so the estimated degree is made large.
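The attribute-based rules above can be sketched as a simple heuristic. The numeric degrees and attribute keys below are illustrative assumptions, since the text only says the estimate is made "small" or "large":

```python
def estimate_spectral_change_degree(prev_seg, next_seg):
    """Heuristic spectral-shape change degree at a segment boundary.

    prev_seg / next_seg are dicts of segment attribute information;
    the attribute keys and the numeric return values are illustrative,
    not taken from the patent."""
    if prev_seg.get("phoneme") == next_seg.get("phoneme"):
        return 0.1                 # same phoneme: small spectral change
    if prev_seg.get("long_vowel") and next_seg.get("long_vowel"):
        return 0.1                 # long-vowel syllable spanning the boundary
    if prev_seg.get("voiced_consonant") or next_seg.get("voiced_consonant"):
        return 0.9                 # voiced consonant: large spectral change
    return 0.5                     # otherwise: mid-range default
```

A table of per-type change rates learned from data could replace these hand-set constants without changing the interface.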
 The waveform generation parameter selection unit 60 selects the waveform generation parameters used to generate the voiced sound waveform based on the segment information supplied from the segment selection unit 3, the prosody information supplied from the prosody generation unit 2, and the degree of spectral shape change supplied from the spectral shape change degree estimation unit 62. It generates the waveform generation parameter selection function based on the estimated degree of spectral shape change.
 When using the selection function shown in FIG. 5, for example, the waveform generation parameter selection unit 60 adjusts the length of the correction interval. If the degree of spectral shape change is small, lengthening the correction interval yields a smoother spectral shape. If the degree is large, lengthening the interval is undesirable, because the amount of correction grows and the prosodic difference between the speech segment and the synthesized segment increases. The unit 60 therefore adjusts the interval length according to the magnitude of the degree of spectral shape change. Similarly, when using the selection function shown in FIG. 6, the unit 60 adjusts the distance, on the segment boundary, between the end point of the preceding segment and the corrected selection function: if the degree of spectral shape change is small, it makes that distance larger.
 According to the speech synthesizer of this embodiment, the waveform generation parameter selection function is controlled according to the attribute information of the speech segments. As a result, synthesized speech with smooth spectral change can be generated, particularly in sections where the degree of spectral shape change is small.
 The present invention is not limited to the speech synthesizers described in the embodiments; their configuration and operation can be modified as appropriate without departing from the spirit of the invention.
 FIG. 10 is a block diagram showing the configuration of the main part of the speech synthesizer according to the present invention. As shown in FIG. 10, its main components are a segment selection unit 3, which selects the speech segments to be used for synthesis from a plurality of pre-stored speech segments based on an input character string, and a waveform generation unit 4, which includes a waveform generation parameter selection unit 50 that selects waveform generation parameters extracted from the speech segments and which generates synthesized speech using the selected parameters. The waveform generation parameter selection unit 50 generates, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, that is, a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selects the waveform generation parameters based on that function.
 The above embodiments also disclose speech synthesizers as described in (1) to (4) below.
(1) A speech synthesizer in which the waveform generation parameter selection unit generates a waveform generation parameter selection function by connecting a first function, which joins the start and end of a preceding segment that is one of the selected speech segments, with a second function, which joins the start and end of the succeeding segment that follows the preceding segment, and in which, when the preceding and succeeding segments are continuous, the selection function is corrected so that the change in its slope is smooth.
(2) The waveform generation parameter selection unit may be configured to smooth the change in slope by correcting the selection function so that it passes through an internally dividing point of the line segment joining (a) the point, on the straight line connecting the start of the preceding segment to the end of the succeeding segment, at the end time of the preceding segment on the time axis of the synthesized speech, and (b) the end point of the preceding segment.
(3) The waveform generation parameter selection unit may be configured to generate a selection function whose change in slope has been smoothed by correcting it with a line connecting an internally dividing point of the first function and an internally dividing point of the second function.
(4) The speech synthesizer may include a spectral shape change degree estimation unit (for example, the spectral shape change degree estimation unit 62) that estimates the degree of spectral change at a connection boundary of the speech segments based on the attribute information of the speech segments, and the waveform generation parameter selection unit may be configured to generate the waveform generation parameter selection function based on the estimated degree of spectral change.
 This application claims priority based on Japanese Patent Application No. 2012-167220 filed on July 27, 2012, the entire disclosure of which is incorporated herein.
 Although the present invention has been described above with reference to embodiments, it is not limited to those embodiments. Various changes that those skilled in the art can understand may be made to its configuration and details within the scope of the present invention.
Industrial applicability
 The present invention can be applied to information providing services and the like that use synthesized speech.
Description of Symbols
1 Language processing unit
2 Prosody generation unit
3 Segment selection unit
4 Waveform generation unit
5 Voiced sound generation unit
6 Unvoiced sound generation unit
7 Waveform connecting unit
10 Segment information storage unit
50, 60 Waveform generation parameter selection unit
51 Voiced sound waveform generation unit
62 Spectral shape change degree estimation unit

Claims (7)

  1.  A speech synthesizer comprising:
      a segment selection unit which selects, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments; and
      a waveform generation unit which includes a waveform generation parameter selection unit that selects waveform generation parameters extracted from the speech segments, and which generates synthesized speech using the selected waveform generation parameters,
      wherein the waveform generation parameter selection unit generates, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selects the waveform generation parameters based on that function.
  2.  The speech synthesizer according to claim 1, wherein the waveform generation parameter selection unit
      generates a waveform generation parameter selection function by connecting a first function, which joins the start and end of a preceding segment that is one of the selected speech segments, with a second function, which joins the start and end of the succeeding segment that follows the preceding segment, and,
      when the preceding segment and the succeeding segment are continuous, corrects the waveform generation parameter selection function so that the change in its slope is smooth.
  3.  The speech synthesizer according to claim 2, wherein the waveform generation parameter selection unit
      smooths the change in slope by correcting the waveform generation parameter selection function so that it passes through an internally dividing point of the line segment joining (a) the point, on the straight line connecting the start of the preceding segment to the end of the succeeding segment, at the end time of the preceding segment on the time axis of the synthesized speech, and (b) the end point of the preceding segment.
  4.  The speech synthesizer according to claim 2, wherein the waveform generation parameter selection unit
      generates a waveform generation parameter selection function whose change in slope has been smoothed by correcting it with a line connecting an internally dividing point of the first function and an internally dividing point of the second function.
  5.  The speech synthesizer according to any one of claims 1 to 4, further comprising a spectral shape change degree estimation unit which estimates the degree of spectral change at a connection boundary of the speech segments based on the attribute information of the speech segments,
      wherein the waveform generation parameter selection unit generates the waveform generation parameter selection function based on the estimated degree of spectral change.
  6.  A speech synthesizing method comprising:
      selecting, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments;
      generating, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and selecting the waveform generation parameters extracted from the speech segments based on that function; and
      generating synthesized speech using the selected waveform generation parameters.
  7.  A speech synthesizing program for causing a computer to execute:
      a segment selection process of selecting, based on an input character string, the speech segments to be used for synthesis from a plurality of pre-stored speech segments; and
      a waveform generation process which includes a waveform generation parameter selection process of generating, taking the continuity of the selected speech segments into account, a waveform generation parameter selection function, which is a function indicating where on the time axis of the synthesized speech the waveform generation parameters on the time axis of the speech segments are to be placed, and of selecting the waveform generation parameters extracted from the speech segments based on that function, and which generates synthesized speech using the selected waveform generation parameters.
PCT/JP2013/004023 2012-07-27 2013-06-27 Speech synthesizer, speech synthesizing method, and speech synthesizing program WO2014017024A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2014526737A JPWO2014017024A1 (en) 2012-07-27 2013-06-27 Speech synthesis apparatus, speech synthesis method, and speech synthesis program

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-167220 2012-07-27
JP2012167220 2012-07-27

Publications (1)

Publication Number Publication Date
WO2014017024A1 (en)

Family

ID=49996852

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/004023 WO2014017024A1 (en) 2012-07-27 2013-06-27 Speech synthesizer, speech synthesizing method, and speech synthesizing program

Country Status (2)

Country Link
JP (1) JPWO2014017024A1 (en)
WO (1) WO2014017024A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0990987A (en) * 1995-09-26 1997-04-04 Toshiba Corp Method and device for voice synthesis
JPH11338488A (en) * 1998-05-26 1999-12-10 Ricoh Co Ltd Voice synthesizing device and voice synthesizing method
JP2009069179A (en) * 2007-09-10 2009-04-02 Toshiba Corp Device and method for generating fundamental frequency pattern, and program
JP2010026223A (en) * 2008-07-18 2010-02-04 Nippon Hoso Kyokai <Nhk> Target parameter determination device, synthesis voice correction device and computer program
JP2010078808A (en) * 2008-09-25 2010-04-08 Toshiba Corp Voice synthesis device and method


Also Published As

Publication number Publication date
JPWO2014017024A1 (en) 2016-07-07

Similar Documents

Publication Publication Date Title
JP3913770B2 (en) Speech synthesis apparatus and method
US8175881B2 (en) Method and apparatus using fused formant parameters to generate synthesized speech
JP4966048B2 (en) Voice quality conversion device and speech synthesis device
JP5159325B2 (en) Voice processing apparatus and program thereof
JP4551803B2 (en) Speech synthesizer and program thereof
US20080027727A1 (en) Speech synthesis apparatus and method
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
JP2006309162A (en) Pitch pattern generating method and apparatus, and program
JP5983604B2 (en) Segment information generation apparatus, speech synthesis apparatus, speech synthesis method, and speech synthesis program
US20110196680A1 (en) Speech synthesis system
JP2009133890A (en) Voice synthesizing device and method
JP5874639B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
JP5930738B2 (en) Speech synthesis apparatus and speech synthesis method
US8407054B2 (en) Speech synthesis device, speech synthesis method, and speech synthesis program
WO2011118207A1 (en) Speech synthesizer, speech synthesis method and the speech synthesis program
JPH09319391A (en) Speech synthesizing method
JP2003208188A (en) Japanese text voice synthesizing method
WO2014017024A1 (en) Speech synthesizer, speech synthesizing method, and speech synthesizing program
JP2011141470A (en) Phoneme information-creating device, voice synthesis system, voice synthesis method and program
Koriyama et al. An F0 modeling technique based on prosodic events for spontaneous speech synthesis
JP5245962B2 (en) Speech synthesis apparatus, speech synthesis method, program, and recording medium
JP2010078808A (en) Voice synthesis device and method
EP1589524B1 (en) Method and device for speech synthesis
JP2006084854A (en) Device, method, and program for speech synthesis
JP2008299266A (en) Speech synthesis device and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13823266

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2014526737

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13823266

Country of ref document: EP

Kind code of ref document: A1