WO2013008384A1 - Speech synthesis device, speech synthesis method, and speech synthesis program - Google Patents
Speech synthesis device, speech synthesis method, and speech synthesis program
- Publication number
- WO2013008384A1 (PCT/JP2012/003760)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- phoneme
- state
- index
- voiced
- voicedness
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Definitions
- The present invention relates to a speech synthesizer, a speech synthesis method, and a speech synthesis program that perform waveform generation in speech synthesis using phoneme duration information generated by a statistical method.
- HMM speech synthesis, which uses a hidden Markov model, is known as a speech synthesis method based on a statistical approach. The prosody generated by HMM speech synthesis is expressed using a specific number of states, and such prosody can be modeled with an MSD-HMM (Multi-Space Probability Distribution HMM). In the following, an index indicating the degree of voicedness is called a voicedness index.
- Depending on the speech data, the phoneme duration of each phoneme may be shortened. When speech synthesis is performed using such data, it is difficult to reproduce the phoneme duration.
- An object of the present invention is therefore to provide a speech synthesizer, a speech synthesis method, and a speech synthesis program that can represent phonemes with a duration shorter than the duration obtained when modeling with a statistical method.
- The speech synthesizer according to the present invention includes phoneme boundary updating means for updating the phoneme boundary position, which is the boundary with another phoneme adjacent to a phoneme, using a voicedness index, an index indicating the degree of voicedness of each state expressing the phoneme modeled by a statistical method.
- The speech synthesis method according to the present invention updates the phoneme boundary position, which is the boundary with another phoneme adjacent to a phoneme, using such a voicedness index.
- The speech synthesis program according to the present invention causes a computer to execute a phoneme boundary update process that updates the phoneme boundary position, which is the boundary with another phoneme adjacent to a phoneme, using such a voicedness index.
- According to the present invention, a phoneme can be expressed with a duration shorter than the duration obtained when modeling with a statistical method.
- FIG. 1 is a block diagram showing a configuration example of a first embodiment of a speech synthesizer according to the present invention. The speech synthesizer in this embodiment includes a language analysis unit 11, a state duration generation unit 12, a pitch pattern generation unit 13, a voicedness index extraction unit 14, a phoneme boundary moving direction determination unit 15, a phoneme duration generation unit 16, a waveform generation unit 17, a prosody model storage unit 18, and a speech unit database (hereinafter, speech unit DB) storage unit 19.
- The prosody model storage unit 18 stores prosodic models generated by a statistical method. In this embodiment, the prosodic model is a model created by MSD-HMM.
- The voicedness index 22 is an index indicating the degree of voicedness, and is information derived for each state from the prosodic model during training with a statistical method. The prosody model storage unit 18 may store the voicedness index 22 itself, set for each state; alternatively, it may not store the voicedness index 22, in which case the voicedness index extraction unit 14 described later derives the voicedness index from the prosodic model.
- The voicedness index is an index indicating whether each state expressed by the HMM has the characteristics of a voiced sound or of an unvoiced sound (including silence). The larger the voicedness index, the stronger the state's voiced character is judged to be.
- One method of deriving a voicedness index is to use the mixture weight of a Gaussian mixture model (GMM) as the voicedness index, as represented by Equation (27) of Non-Patent Document 1.
- Unvoiced sounds also have the property that the energy in the high-frequency part is large; that is, the spectrum is larger in the high band than in the low band. Therefore, the result of analyzing the spectral components using a Fourier transform (FFT) or the like may be used as the voicedness index. Alternatively, the result of a voiced/unvoiced determination method based on quantities such as linear prediction coefficients, the zero-crossing rate, and waveform power may be used as the voicedness index.
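- As a concrete illustration, the following is a minimal sketch of deriving a per-frame voicedness index from the zero-crossing rate and waveform power mentioned above; the weighting and normalization are assumptions for illustration, not values taken from the patent or from Non-Patent Document 1.

```python
import numpy as np

def voicedness_index(frame: np.ndarray) -> float:
    """Return a score in [0, 1]; larger means the frame sounds more voiced."""
    # Zero-crossing rate: unvoiced (noise-like) frames cross zero often.
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    # Waveform power: voiced frames usually carry more energy.
    power = float(np.mean(frame ** 2))
    power_score = power / (power + 1e-6)  # crude normalization (assumption)
    # Low ZCR and high power are both treated as evidence of voicing.
    return float(0.5 * (1.0 - zcr) + 0.5 * power_score)
```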
- In this embodiment, a voicedness index is set for each state, but a voicedness index may instead be set for each frame.
- The speech unit DB storage unit 19 stores attributes of each unit (segment) used to create speech. These attributes include the waveform of each phoneme, information indicating vowel/consonant, information indicating voiced/unvoiced, and so on, and are stored in the speech unit DB storage unit 19 in advance. Note that the information indicating voiced/unvoiced need not be stored in the speech unit DB storage unit 19; the phoneme boundary moving direction determination unit 15, described later, may instead determine voiced or unvoiced from the information indicating the phoneme. However, if the speech unit DB storage unit 19 stores the voiced/unvoiced information, the phoneme boundary moving direction determination unit 15 does not need to perform that determination, so storing the voiced/unvoiced information in advance is preferable from the viewpoint of processing speed.
- The prosody model storage unit 18 and the speech unit DB storage unit 19 are realized by, for example, a magnetic disk.
- The language analysis unit 11 performs language analysis processing, such as morphological analysis, on the input text 21. It also adds or modifies supplementary information needed for speech synthesis, such as accent positions and accent phrase boundaries, in the language analysis result, and analyzes the readings of the characters contained in the input text 21. The language analysis processing performed by the language analysis unit 11 is not limited to the above.
- The state duration generation unit 12 calculates the state durations based on the analysis result of the language analysis unit 11 and the prosodic model. (The phoneme durations themselves are generated by the phoneme duration generation unit 16, described later.) In the following description, the case where one phoneme is expressed by five states is used as an example. The pitch pattern generation unit 13 generates a pitch pattern based on the calculation result of the state duration generation unit 12 and the prosodic model.
- The voicedness index extraction unit 14 extracts the voicedness index corresponding to each state from the prosody model storage unit 18. For example, when the prosody model storage unit 18 stores the voicedness index 22 set for each state, the voicedness index extraction unit 14 may simply read the voicedness index 22 corresponding to each state. Alternatively, the voicedness index extraction unit 14 may read the prosodic model from the prosody model storage unit 18 and derive the voicedness index of each state from it; in this case, it is desirable that the prosodic model include spectral information.
- The phoneme boundary moving direction determination unit 15 updates the boundary between a phoneme and another phoneme adjacent to it (hereinafter, the phoneme boundary position) using the voicedness index of each state expressing the phoneme modeled by a statistical method.
- First, the phoneme boundary moving direction determination unit 15 specifies whether each state representing a phoneme indicates a voiced state or an unvoiced state. Specifically, it determines whether the voicedness index of each state exceeds a predetermined threshold: if the index exceeds the threshold, the state is identified as voiced; otherwise, it is identified as unvoiced.
- After identifying each state as voiced or unvoiced, the phoneme boundary moving direction determination unit 15 may set a flag on each state: for example, a flag "H" on states identified as voiced and a flag "L" on states identified as unvoiced.
- Hereinafter, the result of determining whether a state is voiced or unvoiced based on the voicedness index (here, the flags "H" and "L") is referred to as voicedness determination information.
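- A minimal sketch of this thresholding step, assuming an illustrative threshold of 0.5 (the patent only requires "a predetermined threshold"):

```python
THRESHOLD = 0.5  # assumed value; the patent does not fix a number

def to_voicedness_determination(indices):
    """Map per-state voicedness indices to the 'H'/'L' flags described above."""
    return ["H" if v > THRESHOLD else "L" for v in indices]

print(to_voicedness_determination([0.9, 0.8, 0.2, 0.7, 0.1]))  # ['H', 'H', 'L', 'H', 'L']
```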
- The phoneme boundary moving direction determination unit 15 then determines the direction in which to move the phoneme boundary position, depending on whether the phonemes before and after the boundary are unvoiced or voiced and whether the states before and after the boundary are voiced or unvoiced.
- Hereinafter, a phoneme that is an unvoiced sound is denoted "U" and a phoneme that is a voiced sound is denoted "V"; this classification is referred to as the UV type. In other words, the UV type is information identifying whether each phoneme is an unvoiced or a voiced sound.
- The phoneme boundary moving direction determination unit 15 extracts the unit phoneme information 23 corresponding to each phoneme from the speech unit DB storage unit 19 and determines whether the phoneme is an unvoiced or a voiced sound.
- FIG. 2 is a flowchart showing an example of processing for determining the direction in which the phoneme boundary position is moved.
- First, the phoneme boundary moving direction determination unit 15 determines whether the adjacent phonemes (that is, the phonemes before and after the phoneme boundary) are of the same UV type (step S11). If the UV types are the same (Yes in step S11), the process ends. If the UV types differ (No in step S11), the unit examines the relationship between the voicedness determination information of the states before and after the phoneme boundary (step S12) and determines the movement direction of the phoneme boundary position according to a predetermined correspondence.
- FIG. 3 is an explanatory diagram showing an example of the correspondence relationship between the voicedness determination information and the movement direction of the phoneme boundary position.
- The table illustrated in FIG. 3 defines the direction in which the phoneme boundary position is moved according to the voicedness determination information (L or H) of the states of the unvoiced sound (U) and the voiced sound (V). For example, when adjacent phonemes are arranged in the order unvoiced (U) then voiced (V), and the unvoiced state's determination information is "L" while the voiced state's is "H", the table indicates that the phoneme boundary is left unchanged (that is, not moved).
- When both of the adjacent states indicate "L" (LL in step S12), the phoneme boundary moving direction determination unit 15 moves the phoneme boundary position to the V side (step S13).
- When both of the adjacent states indicate "H" (HH in step S12), the phoneme boundary moving direction determination unit 15 moves the phoneme boundary position to the U side (step S14).
- Otherwise, the phoneme boundary moving direction determination unit 15 ends the process without moving the phoneme boundary position.
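- The decision of steps S11 to S14 can be sketched as follows; the function and argument names are illustrative, and the rule follows the text above (LL moves the boundary toward V, HH toward U, anything else leaves it in place):

```python
def boundary_move_direction(uv_left, uv_right, flag_left, flag_right):
    """uv_*: 'U' or 'V' for the phonemes on each side of the boundary;
    flag_*: 'H' or 'L' for the states adjacent to the boundary.
    Returns the side ('U' or 'V') toward which the boundary moves, or None."""
    if uv_left == uv_right:        # step S11: same UV type, nothing to do
        return None
    pair = flag_left + flag_right  # step S12: compare the adjacent states
    if pair == "LL":               # both states sound unvoiced
        return "V"                 # step S13: move into the voiced phoneme
    if pair == "HH":               # both states sound voiced
        return "U"                 # step S14: move into the unvoiced phoneme
    return None                    # 'LH'/'HL': boundary already consistent
```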
- FIG. 4 is an explanatory diagram illustrating an example of a method for changing a phoneme boundary.
- The example shown in FIG. 4 illustrates moving a phoneme boundary between an unvoiced sound and a voiced sound (hereinafter, a UV boundary) according to the voicedness determination information of the adjacent states.
- Each phoneme illustrated in FIG. 4 is represented by five states, and each cell represents one state. In this example, the phoneme "a" is an unvoiced sound and the phoneme "b" is a voiced sound.
- When both states adjacent to the UV boundary indicate "L", the phoneme boundary moving direction determination unit 15 moves the phoneme boundary to the V side (that is, toward the voiced sound) by the width of one state.
- After moving the phoneme boundary position, the phoneme boundary moving direction determination unit 15 again examines the voicedness determination information of the states now adjacent to the boundary and repeats the same processing.
- When the voicedness determination information of the adjacent states does not call for a move, the phoneme boundary moving direction determination unit 15 leaves the phoneme boundary position unchanged (see FIGS. 4(c) and 4(d)).
- The phoneme boundary moving direction determination unit 15 moves the phoneme boundary position by a length corresponding to the width of one state. For example, if each state corresponds to one frame and one frame is 5 msec, the phoneme boundary position is moved in steps of 5 msec.
- As described above, the phoneme boundary moving direction determination unit 15 sets the voicedness determination information according to whether the voicedness index of each state exceeds a predetermined threshold, and updates the phoneme boundary position accordingly. However, the method of updating the phoneme boundary position is not limited to this.
- For example, the phoneme boundary moving direction determination unit 15 may update the phoneme boundary position based on the difference in voicedness index between adjacent states. In that case, it is unnecessary to specify whether each state is voiced or unvoiced.
- Specifically, the phoneme boundary moving direction determination unit 15 sequentially calculates the difference Δv_i between the voicedness indices of adjacent states, and determines the boundary to lie between the (i-1)-th and i-th states when Δv_i exceeds a preset threshold.
- FIG. 5 is an explanatory diagram illustrating an example of determining a boundary without specifying voiced/unvoiced states. In the example of FIG. 5, the threshold is 0.8.
- As described above, the phoneme boundary moving direction determination unit 15 determines the boundary by comparing the voicedness-index difference Δv_i with the threshold. Alternatively, it may determine the boundary from changes in the difference Δv_i, or from the difference of Δv_i itself (the second difference Δ²v_i).
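- A minimal sketch of the difference-based decision, using the threshold of 0.8 from the example of FIG. 5; taking the absolute difference so that both U-to-V and V-to-U boundaries are covered is an assumption for illustration:

```python
def boundary_from_difference(v, threshold=0.8):
    """v: voicedness indices of consecutive states spanning two phonemes.
    Returns i such that the boundary lies between states i-1 and i,
    or None when no difference exceeds the threshold."""
    for i in range(1, len(v)):
        dv = abs(v[i] - v[i - 1])  # difference between adjacent states
        if dv > threshold:
            return i
    return None

print(boundary_from_difference([0.05, 0.1, 0.95, 0.9, 0.85]))  # boundary between states 1 and 2 -> 2
```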
- The phoneme duration generation unit 16 calculates the duration of each phoneme based on the phoneme boundary positions moved by the phoneme boundary moving direction determination unit 15. For example, suppose the phoneme boundary moving direction determination unit 15 moved the boundary in the direction that shortens the target phoneme, the moved width is one frame, and one frame is 5 msec; the phoneme duration generation unit 16 may then set the phoneme duration to the original duration reduced by 5 msec. The method of calculating the phoneme duration is not limited to this, however.
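- As a sketch of this duration calculation, assuming one state corresponds to one 5 msec frame as in the example above (names are illustrative):

```python
FRAME_MS = 5  # one frame per state, 5 msec per frame, as in the example

def updated_duration_ms(n_states: int, states_moved_in: int) -> int:
    """Duration after the boundary moved `states_moved_in` states into
    this phoneme (positive values shorten the phoneme)."""
    return (n_states - states_moved_in) * FRAME_MS

print(updated_duration_ms(5, 1))  # shortened by one frame: 25 msec -> 20 msec
```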
- The waveform generation unit 17 generates a speech waveform, that is, synthesized speech, based on the phoneme durations calculated by the phoneme duration generation unit 16 and the pitch pattern generated by the pitch pattern generation unit 13.
- The language analysis unit 11, state duration generation unit 12, pitch pattern generation unit 13, voicedness index extraction unit 14, phoneme boundary moving direction determination unit 15, phoneme duration generation unit 16, and waveform generation unit 17 are realized by the CPU of a computer operating according to a program (the speech synthesis program).
- For example, the program is stored in a storage unit (not shown) of the speech synthesizer, and the CPU reads the program and operates as the above units according to it.
- Alternatively, each of these units may be realized by dedicated hardware.
- FIG. 6 is a flowchart showing an example of the operation of the speech synthesizer in this embodiment.
- First, the language analysis unit 11 performs language analysis processing, such as morphological analysis, on the input text 21 (step S21).
- Next, the state duration generation unit 12 calculates the state durations based on the analysis result of the language analysis unit 11 and the prosodic model stored in the prosody model storage unit 18 (step S22).
- The pitch pattern generation unit 13 then generates a pitch pattern based on the calculation result of the state duration generation unit 12 and the prosodic model (step S23).
- Next, the voicedness index extraction unit 14 extracts the voicedness index 22 of each state from the prosody model storage unit 18 (step S24).
- Next, the phoneme boundary moving direction determination unit 15 updates the phoneme boundary positions using the voicedness index of each state expressing the phonemes modeled by the HMM (step S25).
- The phoneme boundary moving direction determination unit 15 may determine the direction in which to move the phoneme boundary position based on the voicedness determination information.
- Alternatively, it may determine the direction based on the difference in voicedness index between adjacent states.
- Next, the phoneme duration generation unit 16 calculates the duration of each phoneme based on the phoneme boundary positions moved by the phoneme boundary moving direction determination unit 15 (step S26). Then, the waveform generation unit 17 generates a speech waveform based on the phoneme durations calculated by the phoneme duration generation unit 16 and the pitch pattern generated by the pitch pattern generation unit 13 (step S27).
- As described above, in this embodiment the phoneme boundary moving direction determination unit 15 updates the phoneme boundary position between a phoneme and its adjacent phonemes using the voicedness index of each state expressing the phoneme modeled by a statistical method (for example, MSD-HMM). Therefore, phonemes can be expressed with a duration shorter than the duration obtainable when modeled by the statistical method alone.
- FIG. 7 is an explanatory diagram showing an example of the result of changing phoneme boundary positions.
- In ordinary HMM-based synthesis, the phoneme duration is at least the analysis frame length × the number of states (for example, 5 msec × 5 states = 25 msec). Therefore, as illustrated in FIG. 7(b), the phoneme d immediately after a pause must be expressed by five states even though its duration is less than 25 msec.
- In this embodiment, the phoneme boundary position is updated using the voicedness index. For example, by expressing the phoneme d immediately after the pause with three states, a phoneme with a short duration can be represented (see FIG. 7(c)).
- Embodiment 2. Next, a second embodiment of the present invention will be described. This embodiment assumes that the voicedness index may contain inappropriate values.
- The voicedness index is a value derived by calculation (in this embodiment, by a statistical method), so an appropriate value is not always obtained. If the voicedness index is inappropriate, it becomes difficult to determine the voiced/unvoiced boundary appropriately with the method of the first embodiment.
- There are roughly two cases in which the voicedness index is inappropriate. The first is when the voicedness determination information switches twice or more among the states within the target phoneme. The second is when all states (frames) indicate a voiced or unvoiced state opposite to the unit phoneme information of the target phoneme.
- FIG. 8 is an explanatory diagram showing examples of inappropriate voicedness indices.
- The cases illustrated in FIGS. 8(a) and 8(c) correspond to the first case described above. In FIG. 8(a), the voicedness determination information is "L" only in the center state of a voiced phoneme. In FIG. 8(c) as well, there are a plurality of boundary candidates, so it is likewise difficult to determine the boundary.
- The case illustrated in FIG. 8(b) corresponds to the second case: all states (frames) of a voiced phoneme indicate an unvoiced state. Since there is no point where the voicedness determination information switches between "H" and "L", there is no boundary candidate, and it is again difficult to determine the boundary.
- In this embodiment, a method is described for determining the phoneme boundary position appropriately even when the voicedness index contains inappropriate values.
- FIG. 9 is a block diagram showing a configuration example of the second embodiment of the speech synthesizer according to the present invention. The speech synthesizer in this embodiment includes a language analysis unit 11, a state duration generation unit 12, a pitch pattern generation unit 13, a voicedness index extraction unit 14, a phoneme boundary moving direction determination unit 15, a phoneme duration generation unit 16, a waveform generation unit 17, a prosody model storage unit 18, a speech unit DB storage unit 19, and a voicedness index determination unit 20. That is, it differs from the speech synthesizer of the first embodiment in further including the voicedness index determination unit 20.
- The voicedness index determination unit 20 determines whether the voicedness index of each state is appropriate and changes inappropriate voicedness indices to appropriate values. As described above, the voicedness index determination unit 20 may judge the voicedness index inappropriate when the voicedness determination information switches twice or more within one phoneme. It may also judge the voicedness index inappropriate when the voicedness determination information (voiced/unvoiced) of the target phoneme indicates information different from (for example, opposite to) the unit information. In other words, the voicedness index determination unit 20 judges the voicedness index inappropriate when there are multiple candidates for the phoneme boundary position or when no candidate exists.
- There are several ways in which the voicedness index determination unit 20 can change an inappropriate voicedness index to an appropriate value. For example, it may change the voicedness determination information based on the unit phoneme information 23 of the corresponding phoneme stored in the speech unit DB storage unit 19.
- Specifically, when the unit phoneme information of the corresponding phoneme indicates voiced, the voicedness index determination unit 20 judges all frames belonging to the phoneme to be voiced (that is, sets the voicedness determination information to "H"). Conversely, when the unit phoneme information indicates unvoiced, it judges all frames belonging to the phoneme to be unvoiced (sets the voicedness determination information to "L"). It then replaces the phoneme's original voicedness determination information with the judged values.
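- A minimal sketch of this correction, combining the two inappropriateness tests described above with the overwrite based on the unit phoneme information (helper names are assumptions):

```python
def is_inappropriate(flags, unit_is_voiced):
    """flags: per-frame 'H'/'L' list; unit_is_voiced: voiced/unvoiced
    from the unit phoneme information in the speech unit DB."""
    switches = sum(1 for a, b in zip(flags, flags[1:]) if a != b)
    opposite = "L" if unit_is_voiced else "H"
    all_opposite = all(f == opposite for f in flags)
    return switches >= 2 or all_opposite

def corrected_flags(flags, unit_is_voiced):
    if is_inappropriate(flags, unit_is_voiced):
        return ["H" if unit_is_voiced else "L"] * len(flags)
    return flags

print(corrected_flags(["L", "H", "L", "H", "L"], True))  # -> ['H', 'H', 'H', 'H', 'H']
```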
- Alternatively, the voicedness index determination unit 20 may determine one of the switching points as the phoneme boundary position without using the unit phoneme information stored in the speech unit DB storage unit 19. In that case, it may, for example, take the switching point closest to the original phoneme boundary position as the phoneme boundary position, or it may keep the original boundary as the voiced/unvoiced boundary.
- FIG. 10 is an explanatory diagram illustrating an example of processing for determining the phoneme boundary position.
- FIG. 10(1) shows the initial state of the voicedness determination information: the voicedness determination information of the voiced phoneme (V) is arranged in the order "LHLHL". The frames of the voiced phoneme (V) are denoted F1 to F5; the last frame of the unvoiced phoneme (U1) preceding it is F0, and the first frame of the unvoiced phoneme (U2) following it is F6.
- First, the voicedness index determination unit 20 examines the boundary between the unvoiced and voiced phonemes (that is, frames F0 and F1). The voicedness determination information of both F0 and F1 is "L", so the unit looks at the determination information of the neighboring frames.
- The voicedness determination information of frame F1 is "L" and that of frame F2 is "H", so the voicedness index determination unit 20 identifies the boundary between F1 and F2 as the switching point closest to the original phoneme boundary position and moves the phoneme boundary there (see FIG. 10(2)).
- Similarly, the voicedness index determination unit 20 identifies the boundary between frames F4 and F5 as the switching point closest to the other original phoneme boundary position (that is, the boundary between F5 and F6) and moves the phoneme boundary there (see FIG. 10(3)). In this example, the unvoiced state of the center frame F3 is ignored.
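- The FIG. 10 procedure can be sketched as follows; the frame indexing follows the F0 to F6 example above, and the helper name is an assumption:

```python
def nearest_switch(flags, original_boundary):
    """flags: per-frame 'H'/'L' list; original_boundary: index i meaning
    the boundary lies between frames i-1 and i. Returns the switching
    point closest to the original boundary, or the original boundary
    itself when no switching point exists."""
    switches = [i for i in range(1, len(flags)) if flags[i] != flags[i - 1]]
    if not switches:
        return original_boundary
    return min(switches, key=lambda i: abs(i - original_boundary))

# FIG. 10 example: U1 ends at F0 ('L'), V is 'LHLHL' (F1..F5), U2 starts at F6 ('L').
flags = ["L", "L", "H", "L", "H", "L", "L"]
print(nearest_switch(flags, 1))  # left boundary moves to between F1 and F2 -> 2
print(nearest_switch(flags, 6))  # right boundary moves to between F4 and F5 -> 5
```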
- When the boundary is determined from the difference in the voicedness index, the voicedness index determination unit 20 may take, as the phoneme boundary position, the point showing the maximum difference among the points exceeding a predetermined threshold. When there is no voiced/unvoiced switching point, no difference exceeds the threshold; in that case, the unit may, for example, take the point of maximum difference as the phoneme boundary position, so that a boundary can be determined even then.
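- A sketch of the maximum-difference fallback just described (illustrative names; one voicedness index per state is assumed):

```python
def boundary_at_max_difference(v):
    """Return i such that the boundary lies between states i-1 and i,
    chosen where the adjacent-state difference is largest."""
    diffs = [abs(v[i] - v[i - 1]) for i in range(1, len(v))]
    return 1 + max(range(len(diffs)), key=lambda k: diffs[k])

print(boundary_at_max_difference([0.2, 0.3, 0.6, 0.65, 0.7]))  # -> 2
```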
- FIG. 11 is a flowchart showing an example of the operation of the speech synthesizer in this embodiment. The processing from performing language analysis of the input text 21 through generating the durations and pitch pattern from the prosodic model and extracting the voicedness index 22 is the same as steps S21 to S24 shown in FIG. 6.
- Next, the voicedness index determination unit 20 determines whether the voicedness index of each state is appropriate (step S31). If a voicedness index is inappropriate (No in step S31), the voicedness index determination unit 20 changes the phoneme's voicedness determination information to appropriate values (step S32).
- The phoneme boundary moving direction determination unit 15 then updates the phoneme boundary position (step S25). Thereafter, the phoneme duration generation unit 16 calculates the duration of each phoneme based on the phoneme boundary positions, and the waveform generation unit 17 generates the speech waveform based on the phoneme durations and the pitch pattern; this processing is the same as steps S26 and S27 shown in FIG. 6.
- As described above, in this embodiment the voicedness index determination unit 20 determines whether the voicedness index of each state is appropriate and changes indices judged inappropriate to appropriate values. Therefore, in addition to the effects of the first embodiment, correcting the voicedness index of each state to an appropriate value prevents errors in the boundary determination.
- FIG. 12 is a block diagram showing an example of the minimum configuration of the speech synthesizer according to the present invention.
- The speech synthesizer 80 according to the present invention includes phoneme boundary updating means 81 (for example, the phoneme boundary moving direction determination unit 15) that updates the phoneme boundary position, which is the boundary with another phoneme adjacent to a phoneme, using a voicedness index (for example, extracted for each state from a prosodic model), an index indicating the degree of voicedness of each state expressing the phoneme modeled by a statistical method (for example, MSD-HMM). Therefore, phonemes can be expressed with a duration shorter than the duration obtained when modeling with the statistical method.
- The phoneme boundary updating means 81 may specify whether each state expressing a phoneme indicates a voiced state (for example, a state flagged "H") or an unvoiced state (for example, a state flagged "L"), and, when one of two adjacent phonemes is an unvoiced sound (for example, "U" in the UV type) and the other is a voiced sound (for example, "V"), determine the moving direction of the phoneme boundary position according to a rule determined in advance based on the relationship between the voiced and unvoiced states (for example, the correspondence illustrated in FIG. 3).
- The phoneme boundary updating means 81 may identify a state expressing a phoneme as voiced when its voicedness index exceeds a predetermined threshold, and as unvoiced when the index is equal to or below the threshold.
- The phoneme boundary updating means 81 may update the phoneme boundary position based on the difference (for example, Δv_i) in the voicedness index between adjacent states.
- Specifically, when the difference between the voicedness index of one adjacent state and that of the other exceeds a predetermined threshold, the phoneme boundary updating means 81 may determine the position between the two states as the phoneme boundary position.
- The speech synthesizer 80 may also include phoneme duration calculation means (for example, the phoneme duration generation unit 16) that calculates phoneme durations based on the updated phoneme boundary positions.
- The phoneme boundary updating means 81 may update the phoneme boundary position in units of length corresponding to the state width (for example, the frame length).
- The speech synthesizer 80 may further include voicedness index determination means (for example, the voicedness index determination unit 20) that determines whether the voicedness index of each state is appropriate and changes indices judged inappropriate to appropriate values. With such a configuration, the voicedness index of each state is corrected to an appropriate value, which prevents errors in the boundary determination.
- The voicedness index determination means may judge the voicedness index inappropriate when the voicedness determination information, which is the result of determining a voiced or unvoiced state from the voicedness index, switches two or more times within one phoneme, or when the voicedness determination information of the target phoneme indicates information different from the unit information predetermined as indicating the phoneme's property (for example, indicates a voiced or unvoiced state opposite to the unit phoneme information).
- The present invention is suitably applied to speech synthesizers that use phoneme duration information generated by a statistical method.
Description
12 State duration generation unit
13 Pitch pattern generation unit
14 Voicedness index extraction unit
15 Phoneme boundary moving direction determination unit
16 Phoneme duration generation unit
17 Waveform generation unit
18 Prosody model storage unit
19 Speech unit database storage unit
20 Voicedness index determination unit
Claims (15)
- 1. A speech synthesizer comprising phoneme boundary updating means for updating a phoneme boundary position, which is a boundary with another phoneme adjacent to a phoneme, using a voicedness index, which is an index indicating the degree of voicedness of each state expressing the phoneme modeled by a statistical method.
- 2. The speech synthesizer according to claim 1, wherein the phoneme boundary updating means specifies whether each state expressing a phoneme indicates a voiced state or an unvoiced state, and, when one of two adjacent phonemes is an unvoiced sound and the other is a voiced sound, determines the moving direction of the phoneme boundary position according to a rule predetermined based on the relationship between the voiced and unvoiced states.
- 3. The speech synthesizer according to claim 2, wherein the phoneme boundary updating means identifies a state expressing a phoneme as voiced when its voicedness index exceeds a predetermined threshold, and as unvoiced when the voicedness index is equal to or below the threshold.
- 4. The speech synthesizer according to claim 1, wherein the phoneme boundary updating means updates the phoneme boundary position based on the difference in the voicedness index between adjacent states.
- 5. The speech synthesizer according to claim 4, wherein, when the difference between the voicedness index of one adjacent state and that of the other exceeds a predetermined threshold, the phoneme boundary updating means determines the position between the two states as the phoneme boundary position.
- 6. The speech synthesizer according to any one of claims 1 to 5, further comprising phoneme duration calculation means for calculating the duration of a phoneme based on the updated phoneme boundary position.
- 7. The speech synthesizer according to any one of claims 1 to 6, wherein the phoneme boundary updating means updates the phoneme boundary position in units of length corresponding to the width of a state.
- 8. The speech synthesizer according to any one of claims 1 to 7, further comprising voicedness index determination means for determining whether the voicedness index of each state is appropriate and changing a voicedness index judged inappropriate to an appropriate value.
- 9. The speech synthesizer according to claim 8, wherein the voicedness index determination means judges the voicedness index inappropriate when the voicedness determination information, which is the result of determining a voiced or unvoiced state based on the voicedness index, switches two or more times within one phoneme, or when the voicedness determination information of the target phoneme indicates information different from the unit information predetermined as information indicating the property of the phoneme.
- 10. A speech synthesis method comprising updating a phoneme boundary position, which is a boundary with another phoneme adjacent to a phoneme, using a voicedness index, which is an index indicating the degree of voicedness of each state expressing the phoneme modeled by a statistical method.
- 11. The speech synthesis method according to claim 10, comprising specifying whether each state expressing a phoneme indicates a voiced state or an unvoiced state, and, when one of two adjacent phonemes is an unvoiced sound and the other is a voiced sound, determining the moving direction of the phoneme boundary position according to a rule predetermined based on the relationship between the voiced and unvoiced states.
- 12. The speech synthesis method according to claim 10, comprising updating the phoneme boundary position based on the difference in the voicedness index between adjacent states.
- 13. A speech synthesis program for causing a computer to execute a phoneme boundary update process of updating a phoneme boundary position, which is a boundary with another phoneme adjacent to a phoneme, using a voicedness index, which is an index indicating the degree of voicedness of each state expressing the phoneme modeled by a statistical method.
- 14. The speech synthesis program according to claim 13, causing the computer, in the phoneme boundary update process, to specify whether each state expressing a phoneme indicates a voiced state or an unvoiced state, and, when one of two adjacent phonemes is an unvoiced sound and the other is a voiced sound, to determine the moving direction of the phoneme boundary position according to a rule predetermined based on the relationship between the voiced and unvoiced states.
- 15. The speech synthesis program according to claim 13, causing the computer, in the phoneme boundary update process, to update the phoneme boundary position based on the difference in the voicedness index between adjacent states.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/131,409 US9520125B2 (en) | 2011-07-11 | 2012-06-08 | Speech synthesis device, speech synthesis method, and speech synthesis program |
JP2013523777A JP5979146B2 (ja) | 2011-07-11 | 2012-06-08 | Speech synthesis device, speech synthesis method, and speech synthesis program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011-152849 | 2011-07-11 | ||
JP2011152849 | 2011-07-11 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013008384A1 true WO2013008384A1 (ja) | 2013-01-17 |
Family
ID=47505695
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/003760 WO2013008384A1 (ja) | 2011-07-11 | 2012-06-08 | 音声合成装置、音声合成方法および音声合成プログラム |
Country Status (3)
Country | Link |
---|---|
US (1) | US9520125B2 (ja) |
JP (1) | JP5979146B2 (ja) |
WO (1) | WO2013008384A1 (ja) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2017015821A (ja) * | 2015-06-29 | 2017-01-19 | Nippon Telegraph and Telephone Corp. | Speech synthesis device, speech synthesis method, and program |
CN107945786A (zh) * | 2017-11-27 | 2018-04-20 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Speech synthesis method and device |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9672811B2 (en) * | 2012-11-29 | 2017-06-06 | Sony Interactive Entertainment Inc. | Combining auditory attention cues with phoneme posterior scores for phone/vowel/syllable boundary detection |
CN107481715B (zh) * | 2017-09-29 | 2020-12-08 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method and apparatus for generating information |
CN112242132B (zh) * | 2019-07-18 | 2024-06-14 | Alibaba Group Holding Ltd. | Data annotation method, apparatus, and system for speech synthesis |
CN114360587A (zh) * | 2021-12-27 | 2022-04-15 | Beijing Baidu Netcom Science and Technology Co., Ltd. | Method, apparatus, device, medium, and product for audio recognition |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6236964B1 (en) * | 1990-02-01 | 2001-05-22 | Canon Kabushiki Kaisha | Speech recognition apparatus and method for matching inputted speech and a word generated from stored referenced phoneme data |
EP1138038B1 (en) * | 1998-11-13 | 2005-06-22 | Lernout & Hauspie Speech Products N.V. | Speech synthesis using concatenation of speech waveforms |
KR100486735B1 (ko) * | 2003-02-28 | 2005-05-03 | 삼성전자주식회사 | 최적구획 분류신경망 구성방법과 최적구획 분류신경망을이용한 자동 레이블링방법 및 장치 |
JP4551803B2 (ja) * | 2005-03-29 | 2010-09-29 | Toshiba Corp. | Speech synthesis device and program therefor |
CN101346758B (zh) * | 2006-06-23 | 2011-07-27 | Matsushita Electric Industrial Co., Ltd. | Emotion recognition device |
US8155964B2 (en) * | 2007-06-06 | 2012-04-10 | Panasonic Corporation | Voice quality edit device and voice quality edit method |
JP5159279B2 (ja) * | 2007-12-03 | 2013-03-06 | Toshiba Corp. | Speech processing device and speech synthesis device using the same |
JP5665780B2 (ja) * | 2012-02-21 | 2015-02-04 | Toshiba Corp. | Speech synthesis device, method, and program |
2012
- 2012-06-08 US US14/131,409 patent/US9520125B2/en active Active
- 2012-06-08 WO PCT/JP2012/003760 patent/WO2013008384A1/ja active Application Filing
- 2012-06-08 JP JP2013523777A patent/JP5979146B2/ja active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002169583A (ja) * | 2000-12-05 | 2002-06-14 | Hitachi Ulsi Systems Co Ltd | Speech synthesis method and speech synthesis device |
JP2004341259A (ja) * | 2003-05-15 | 2004-12-02 | Matsushita Electric Ind Co Ltd | Speech segment expansion and contraction device and method |
JP2007233181A (ja) * | 2006-03-02 | 2007-09-13 | Casio Comput Co Ltd | Speech synthesis device, speech synthesis method, and program |
Non-Patent Citations (1)
Title |
---|
MIYAZAKI et al., "A Study on Pitch Pattern Generation Using HMMs Based on Multi-Space Probability Distributions," Technical Report of IEICE SP, vol. 98, no. 33, 24 April 1998 (1998-04-24), pages 27-34 *
Also Published As
Publication number | Publication date |
---|---|
JP5979146B2 (ja) | 2016-08-24 |
JPWO2013008384A1 (ja) | 2015-02-23 |
US20140149116A1 (en) | 2014-05-29 |
US9520125B2 (en) | 2016-12-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 12810798; Country of ref document: EP; Kind code of ref document: A1 |
| ENP | Entry into the national phase | Ref document number: 2013523777; Country of ref document: JP; Kind code of ref document: A |
| WWE | Wipo information: entry into national phase | Ref document number: 14131409; Country of ref document: US |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 12810798; Country of ref document: EP; Kind code of ref document: A1 |