WO2016043322A1 - Speech synthesis method, program, and apparatus - Google Patents
Speech synthesis method, program, and apparatus
- Publication number
- WO2016043322A1 (application PCT/JP2015/076743)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech
- database
- feature quantity
- speech synthesis
- target person
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Definitions
- the present invention relates to a speech synthesis method, program, and apparatus.
- an object of the present invention is to provide a speech synthesis parameter generation method, program, and apparatus for synthesizing natural speech of a specific individual from a small amount of speech data based on that individual's voice.
- the present invention is a speech synthesis parameter generation method for generating synthesized speech of a target person.
- the speech synthesis parameter generation method comprises a target person speech database storage step of calculating a target person acoustic feature quantity and/or target person prosodic feature quantity, which are the acoustic and/or prosodic feature quantities of the target person's speech for a plurality of sample texts, and saving them as a target person speech database;
- an abnormal speech detection step of detecting, as an abnormal speech part, a part of the target person's speech whose acoustic and/or prosodic feature quantity is abnormal among the target person acoustic and/or prosodic feature quantities;
- a manual correction processing reception step of receiving, in the target person speech database, correction of the acoustic and/or prosodic feature quantity corresponding to the abnormal speech part detected in the abnormal speech detection step;
- a manual correction database storage step of saving a speech database having the corrected acoustic and/or prosodic feature quantities as a manual correction database;
- a mixing step of mixing the target person speech database and the manual correction database at a predetermined mixing ratio; and a speech synthesis parameter generation step of generating, based on the database mixed in the mixing step, speech synthesis parameters for performing speech synthesis of the target person.
- with this configuration, the feature quantities can be manually corrected only for the parts that are highly likely to be abnormal speech, while otherwise using the automatically calculated feature quantities of the target person, so no excessive burden is imposed on the database creator.
- moreover, since correction processing can be performed in advance on the parts that are highly likely to be abnormal speech, the quality of the database on which speech synthesis parameter generation is based can be improved, and so can the quality of the final synthesized speech. At the same time, speech synthesis of sufficient quality is possible even with a small amount of sample data.
- preferably, the mixing ratio in the mixing step comprises a plurality of different mixing ratios,
- and a plurality of speech synthesis parameters are generated based on the plurality of mixing ratios.
- the speech synthesis parameter generation method preferably further includes a speech quality comparison step that compares the speech generated from each of the plurality of speech synthesis parameters with reference speech consisting of the target person's own voice, and selects, as a high-quality speech synthesis database, the database that generates the speech most similar to the reference speech.
- with this configuration, the voice most similar to the target person's own voice is selected from the plurality of databases mixed at the plurality of mixing ratios, so the voice most likely to sound like the target person can be selected, improving the naturalness of the synthesized speech.
- preferably, the acoustic feature quantity is the power spectrum of the sound,
- and the prosodic feature quantities are the pitch of the sound and the phoneme duration.
- the phoneme duration used as the target person prosodic feature quantity is preferably generated based on an acoustic model obtained from sample speech data of a plurality of persons.
- the above invention may also be realized as a computer program for generating speech synthesis parameters. That is, the computer program causes a computer to execute: a target person speech database storage step of calculating the target person acoustic and/or prosodic feature quantities of the target person's speech for a plurality of sample texts and storing them as a target person speech database; an abnormal speech detection step of detecting, as an abnormal speech part, a part of the target person's speech whose acoustic and/or prosodic feature quantity is abnormal; a manual correction processing reception step of receiving, in the target person speech database, correction of the acoustic and/or prosodic feature quantity corresponding to the abnormal speech part detected in the abnormal speech detection step; a manual correction database storage step of storing a speech database having the corrected acoustic and/or prosodic feature quantities as a manual correction database;
- a mixing step of mixing the target person speech database and the manual correction database at a predetermined mixing ratio; and a speech synthesis parameter generation step of generating, based on the database mixed in the mixing step, speech synthesis parameters for performing speech synthesis of the target person.
- the above invention may also be realized as a speech synthesis parameter generation apparatus. That is, the apparatus comprises a target person speech database storage unit that calculates the target person acoustic and/or prosodic feature quantities of the target person's speech for a plurality of sample texts and saves them as a target person speech database;
- an abnormal speech detection unit that detects, as an abnormal speech part, a part of the target person's speech whose acoustic and/or prosodic feature quantity is abnormal among the target person acoustic and/or prosodic feature quantities;
- a manual correction processing reception unit that receives, in the target person speech database, correction of the acoustic and/or prosodic feature quantity corresponding to the detected abnormal speech part;
- a manual correction database storage unit that stores a speech database having the corrected acoustic and/or prosodic feature quantities as a manual correction database;
- a mixing unit that mixes the target person speech database and the manual correction database at a predetermined mixing ratio; and a speech synthesis parameter generation unit that generates, based on the database mixed by the mixing unit, speech synthesis parameters for performing speech synthesis of the target person.
- the apparatus is a concept that includes a server; in that case, input/output processing is performed with a client device via a network.
- FIG. 1 is a diagram showing a hardware configuration according to the present invention.
- FIG. 2 is a functional block diagram according to the present invention.
- FIG. 3 is a diagram showing a general flowchart of speech synthesis according to the present invention.
- FIG. 4 is a diagram showing an acoustic model generation flowchart according to the present invention.
- FIG. 5 is a diagram showing a subject voice database generation flowchart according to the present invention.
- FIG. 6 is a diagram showing an abnormal sound detection flowchart according to the present invention.
- FIG. 7 is a diagram showing a manual correction database generation flowchart according to the present invention.
- FIG. 8 is a diagram showing a mixed database generation flowchart according to the present invention.
- FIG. 9 is a diagram showing a flowchart of a high quality database selection process according to the present invention.
- FIG. 10 is a flowchart of the audio output process according to the present invention.
- FIG. 11 is a diagram showing an evaluation text example (part 1) of the experiment according to the present invention.
- FIG. 12 is a diagram showing an evaluation text example (part 2) of the experiment according to the present invention.
- FIG. 13 is a diagram showing experimental results of experiments according to the present invention.
- First, the hardware configuration on which the speech synthesis software according to the present invention runs will be described with reference to FIG. 1.
- a general personal computer 100 is employed as hardware for executing the speech synthesis software according to the present invention.
- the personal computer 100 comprises: a control unit 10 including a CPU that executes various programs including the speech synthesis software described later; a storage unit 20 comprising memory and other storage devices in which various programs including the speech synthesis software are stored;
- a character input unit 30 that processes character information received from a keyboard (not shown) or the like; a voice input unit 40 that processes voice input from a microphone (not shown) or the like;
- an audio output unit 50 that processes audio output to a speaker (not shown) based on commands from the control unit 10; a display processing unit 60 that performs display processing on a display (not shown);
- and an input/output interface 70 that enables connection to external storage media and the like, as well as function expansion.
- although the personal computer 100, a general-purpose device, is assumed as the hardware in the present embodiment, the hardware may of course be realized as a dedicated speech synthesis device.
- FIG. 2 is a functional block diagram illustrating functions provided in the control unit 10 and the storage unit 20 of FIG.
- the control unit 10 performs the processing indicated by each functional block described later, writing the results of that processing to the storage unit 20 and reading stored content from it.
- the control unit 10 includes: a feature quantity calculation unit 11 that calculates acoustic and prosodic feature quantities from speech data; an abnormal speech detection unit 12 that detects parts of the speech data corresponding to abnormal speech;
- a manual correction processing unit 13 that allows a user to manually correct the parts corresponding to abnormal speech; a mixed database generation unit 14 that mixes texts from the target person speech database and texts from the manual correction database based on predetermined mixing ratios; a speech synthesis parameter generation unit 15 that generates parameters for outputting synthesized speech based on a given speech synthesis database; a speech quality comparison unit 16 that selects the highest quality speech from a plurality of speech data; and an audio output processing unit 17 that outputs audio based on input text and the generated high-quality speech synthesis database.
- the storage unit 20 stores data calculated as a result of processing by the control unit 10 or data stored in advance.
- the storage unit 20 includes: an acoustic model storage unit 21 that stores data calculated by the feature quantity calculation unit 11 based on the utterances of a plurality of unspecified persons; a target person speech database 22 that stores data calculated by the feature quantity calculation unit 11 based on the speech data of the database creation target person; a manual correction database 23 that stores speech data after manual correction by the manual correction processing unit;
- a mixing ratio holding unit 24 that stores the plurality of mixing ratios used in the mixed database generation unit 14;
- a mixed database 25 that stores a plurality of speech synthesis parameters generated from databases mixed at the plurality of mixing ratios;
- and a high-quality speech database 26 that stores the speech synthesis parameters yielding the highest quality voice among the plurality of speech synthesis parameters.
- the data stored in the storage unit 20 is not limited to the representative databases listed above; it also includes various data described later, namely utterance data read aloud for each of the plurality of sample texts by N unspecified speakers other than the speech database generation target person, utterance data by the speech database generation target person, the abnormal speech section flag provided for each feature quantity, reference speech data, and the like.
- FIG. 3 is a general flow showing a method for generating a high-quality voice database from the voice data of the database creation target person.
- First, an acoustic model is generated from speech data uttered by a plurality of unspecified persons, using the feature quantity calculation unit 11 and an HMM (Hidden Markov Model) or the like (S100).
- Next, various feature quantities are calculated based on the speech data of the target person, and the target person speech database 22 is generated (S200).
- Next, the abnormal speech detection unit 12 performs a predetermined condition determination process on the target person speech database 22 to detect abnormal speech parts (S300).
- Next, the manual correction processing unit 13 accepts manual corrections of the abnormal speech parts of the target person's speech data to generate the manual correction database 23 (S400).
- Next, the mixed database generation unit 14 mixes the manual correction database 23 and the target person speech database 22 using the plurality of mixing ratios stored in the mixing ratio holding unit 24, and the speech synthesis
- parameter generation unit 15 generates a plurality of speech synthesis parameters to produce the mixed database 25 (S500).
- Finally, the speech quality comparison unit 16 generates a plurality of voices from the plurality of speech synthesis parameters stored in the mixed database 25, compares them with predetermined speech data to select a high-quality speech database,
- and stores it as the high-quality speech synthesis database 26 (S600).
- FIG. 4 is a flowchart for explaining details of the acoustic model generation step (S100).
- First, the speech data read aloud for each of the plurality of sample texts by N unspecified speakers other than the speech database generation target person is read from the storage unit 20 (S101). This reading process is repeated until the data for all N speakers has been read (NO in S102).
- in the present embodiment, the speech data of the N unspecified speakers is stored in the storage unit 20 in advance, but it may instead be input each time via the voice input unit 40 from a microphone or the like.
- each sample text consists of a short sentence, and the plurality of sample texts is a collection of several tens to several hundreds of such short sentences.
- Next, an acoustic model generation process is performed based on the utterance data of the N speakers (S103).
- the acoustic model generation process is specifically an HMM (Hidden Markov Model) learning process based on the uttered speech; as a result of the learning, the duration of each phoneme can be calculated by the model.
- here, a phoneme is each unit obtained by decomposing a series of speech: for example, “Hello (KO-N-NI-CHI-WA)” decomposes into the phonemes “K”, “O”, “N”, “N”, “I”, “CHI”, “W”, and “A”, and the duration means how long each phoneme lasts.
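The decomposition above can be illustrated with a small sketch. This is not part of the patent; the syllable-to-phoneme mapping, the helper names, and the example are illustrative assumptions only:

```python
# Illustrative only: this syllable inventory and mapping are hypothetical,
# not taken from the patent text.
SYLLABLE_TO_PHONEMES = {
    "KO": ["K", "O"],
    "N": ["N"],
    "NI": ["N", "I"],
    "CHI": ["CHI"],
    "WA": ["W", "A"],
}

def decompose(syllables):
    """Flatten a romanized syllable sequence into its phoneme sequence."""
    phonemes = []
    for syllable in syllables:
        phonemes.extend(SYLLABLE_TO_PHONEMES[syllable])
    return phonemes

# "KO-N-NI-CHI-WA" -> ['K', 'O', 'N', 'N', 'I', 'CHI', 'W', 'A']
print(decompose(["KO", "N", "NI", "CHI", "WA"]))
```

A real system would also attach a duration (in msec) to each phoneme, which is what the HMM learning step estimates.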
- FIG. 5 is a flowchart for explaining details of the speech database generation step (S200) of the subject who is to generate the speech synthesis database.
- First, the utterance data read aloud by the target person for each of the plurality of sample texts is read from the storage unit 20 (S201).
- in the present embodiment, the utterance data of the database creation target person is stored in the storage unit 20 in advance, but it may instead be input each time via the voice input unit 40 from a microphone or the like.
- as above, each sample text consists of a short sentence, and the plurality of sample texts is a collection of several tens to several hundreds of such short sentences.
- Next, an acoustic feature quantity and a prosodic feature quantity are calculated for each piece of the read speech data (S202).
- the acoustic feature quantity is a feature of the sound obtained by analyzing the speech data in the frequency domain; specific examples include the power spectrum of the sound and the cepstrum.
- the prosodic feature quantity is a feature quantity representing how the speaker speaks, including pitch, intonation, rhythm, pauses, and the like; specific examples include the phoneme duration and the pitch of the sound.
- in the present embodiment, the power spectrum of the sound, analyzed over successive 5 [msec] sections, is calculated as the acoustic feature quantity.
- various known methods can be applied to calculate the power spectrum of the sound. For example, a power spectrum extraction method using Fourier transform is preferable.
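As a rough illustration of this step, the sketch below computes a power spectrum for each 5-msec section using NumPy's FFT. The rectangular, non-overlapping framing is a simplifying assumption, not a detail given in the patent:

```python
import numpy as np

def frame_power_spectra(signal, sample_rate, frame_msec=5):
    """Compute a power spectrum for each short analysis frame.

    The patent analyzes 5-msec sections; the windowing details here
    (rectangular frames, no overlap) are simplifying assumptions.
    """
    frame_len = int(sample_rate * frame_msec / 1000)
    n_frames = len(signal) // frame_len
    spectra = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        spectrum = np.abs(np.fft.rfft(frame)) ** 2  # power spectrum of the frame
        spectra.append(spectrum)
    return np.array(spectra)

# Example: a 10-msec 440 Hz tone sampled at 16 kHz yields two 5-msec frames.
sr = 16000
t = np.arange(sr // 100) / sr
spectra = frame_power_spectra(np.sin(2 * np.pi * 440 * t), sr)
print(spectra.shape)  # (2, 41): 2 frames, 80-sample rFFT -> 41 bins
```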
- the pitch of the sound analyzed for each 5 [msec] section and the duration ([msec]) of each phoneme are calculated as prosodic feature values.
- Various known methods can be applied to calculate the pitch of the sound. For example, an extraction method using an autocorrelation method is preferable.
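A minimal sketch of the autocorrelation method mentioned above; the search range and the peak-picking details are assumptions, not patent specifics:

```python
import numpy as np

def estimate_pitch(frame, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency of one frame by autocorrelation.

    Sketch of the autocorrelation method the patent cites as one known
    option; the [fmin, fmax] search band is a hypothetical choice.
    """
    frame = frame - frame.mean()
    # Autocorrelation for non-negative lags only.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

sr = 16000
t = np.arange(sr // 10) / sr  # 100 msec of signal
f0 = estimate_pitch(np.sin(2 * np.pi * 200 * t), sr)
print(round(f0))  # ~200 Hz
```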
- the analysis interval of the power spectrum and the pitch of the sound is exemplarily set to 5 [msec], but the analysis sample interval can be changed as appropriate.
- the phoneme duration is calculated using the acoustic model generated in the acoustic model generation step (S100) and stored in the acoustic model storage unit 21. Specifically, the phoneme duration is calculated by inputting the target person's speech data into the HMM obtained in the acoustic model generation step (S100). In this example, the target person's speech data is simply input into the existing acoustic model; alternatively, the acoustic model learning process may first be performed again with the target person's speech data included, and the input processing to the HMM, that is, the phoneme duration calculation, performed afterwards.
- the feature quantities thus calculated are stored in the target person speech database 22 (S203).
- by using these feature quantities, the abnormal speech detection described later can be performed from the viewpoints of both the frequency characteristics of the speech and the characteristics of the speaker's manner of speaking.
- FIG. 6 is a flowchart for explaining in detail the step (S300) of detecting the abnormal voice part of the target person voice data based on the target person voice database 22.
- First, the data on the acoustic and prosodic feature quantities is read from the target person speech database 22 (S301).
- Next, an abnormality determination process is performed for each read acoustic and prosodic feature quantity (S302). If a feature quantity does not satisfy the predetermined abnormality determination condition (NO in S303), processing for the next feature quantity begins (NO in S305). On the other hand, when a feature quantity satisfies the predetermined abnormality determination condition (YES in S303), the abnormal speech part flag is set to “true” to mark the abnormal speech part (S304), and processing for the next feature quantity begins (NO in S305).
- the abnormal speech part flag is provided for each feature quantity and is set to “false” in the initial state.
- in the power spectrum abnormality determination process (S302), the power spectrum is compared with those of the preceding and following speech sections, and when a predetermined difference is found between the power spectra, the part is determined to be an abnormal speech part and the abnormal speech part flag for the power spectrum is set to “true” (S304). Likewise, in the pitch abnormality determination process (S302), the pitch is compared with those of the preceding and following speech sections, and when a predetermined difference exists between the pitches, the part is determined to be an abnormal speech part and the abnormal speech part flag for the pitch is set to “true” (S304). Here the comparison is made with the feature values in the preceding and following speech sections, but since it is only necessary to eliminate obvious outliers, the abnormality determination may instead be performed, for example, by comparison with the moving average of each feature value in the time direction.
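The moving-average variant mentioned above can be sketched as follows. The window size and deviation threshold are hypothetical values, not taken from the patent:

```python
import numpy as np

def flag_abnormal(values, window=5, threshold=2.0):
    """Flag feature values that deviate strongly from a moving average.

    Sketch of the alternative the patent mentions: comparing each
    feature value with the moving average in the time direction.
    The window size and deviation threshold are hypothetical.
    """
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    moving_avg = np.convolve(values, kernel, mode="same")
    deviation = np.abs(values - moving_avg)
    return deviation > threshold * deviation.std()

pitches = [120, 121, 119, 122, 400, 120, 118]  # one obvious pitch outlier (Hz)
print(flag_abnormal(pitches).tolist())
# -> [False, False, False, False, True, False, False]
```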
- the phoneme duration abnormality determination process (S302) makes its determination from the viewpoints of the vowel length, the closed section (Cl) length, and the length of specific consonants (k, t).
- the reason for focusing on these parameters is that automatic phoneme detection based on a predetermined phoneme detection algorithm causes problems, described below, that are undesirable in speech synthesis. That is, when certain vowels (for example “a, i, u, e, o, n”) occur consecutively, what should be recognized as two vowels may be detected as a single vowel because the boundary between the phonemes is missed.
- the closed section (Cl) is a so-called silent section produced during vocalization; it corresponds, for example, to the silence that occurs before “p” when the character “pu” is pronounced.
- if the speech included in such a silent section is used as a basis for creating the speech synthesis database, unintended pauses appear in the generated speech, which hinders natural speech synthesis.
- the consonants “k” and “t” tend to be assigned longer phoneme lengths than the actual ones, and as a result the listener may perceive a poor tempo in the synthesized speech.
- specifically, the phoneme duration abnormality is determined by the following method. For the vowel length, it is determined whether the determination condition is satisfied; when the formula is not satisfied, that is, when the vowel is shorter or longer than a predetermined range, that part is detected as an abnormal speech part
- and the abnormal speech flag for the phoneme duration is set to “true” (S304).
- here, l is a non-negative integer for adjusting the range to be extracted as abnormal (the number of extracted abnormal speech parts).
- for the length of the closed section (Cl), it is determined whether the determination condition is satisfied; when the formula is satisfied, that is, when the closed section is longer than a predetermined period, that part is detected as an abnormal speech part
- and the abnormal speech flag for the phoneme duration is set to “true” (S304).
- here, m is a non-negative integer for adjusting the range to be extracted as abnormal (the number of extracted abnormal speech parts).
- for the phoneme lengths of the consonants “k” and “t”, it is determined whether the determination condition is satisfied; when the formula is satisfied, that is, when the consonant length for k or t is longer than a predetermined period,
- that part is detected as an abnormal speech part, and the abnormal speech flag for the phoneme duration is set to “true” (S304).
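The patent's exact inequality conditions are not reproduced in this text, so the sketch below substitutes hypothetical band checks: vowels are flagged when too short or too long, while closed sections and the consonants “k”/“t” are flagged only when too long. The per-phoneme duration statistics and the tolerances l and m are illustrative assumptions, not the patent's formulas:

```python
def check_durations(durations, stats, l=2, m=2):
    """Return labels of phonemes whose duration looks abnormal.

    durations: list of (phoneme_label, duration_msec)
    stats: per-phoneme (mean, std) duration statistics, e.g. derived from
           the multi-speaker acoustic model the patent mentions.
    The band checks and l, m tolerances are hypothetical stand-ins for the
    patent's (unreproduced) determination conditions.
    """
    vowels = {"a", "i", "u", "e", "o", "n"}
    long_only = {"cl", "k", "t"}  # closed section and target consonants
    flagged = []
    for label, dur in durations:
        mean, std = stats[label]
        if label in vowels and abs(dur - mean) > l * std:
            flagged.append(label)   # too short OR too long
        elif label in long_only and dur > mean + m * std:
            flagged.append(label)   # too long only
    return flagged

stats = {"a": (90, 15), "k": (40, 10), "cl": (30, 8)}
durations = [("a", 95), ("a", 200), ("k", 75), ("cl", 35)]
print(check_durations(durations, stats))  # -> ['a', 'k']
```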
- the above truth determination processing for the abnormal speech part flags is performed for each text, and is carried out for all texts (YES in S305, YES in S306). That is, when the truth determination has been completed for all feature quantities of a given text, it is determined whether the determination process has been performed for all sample texts (S306). If it has not been completed for all sample texts, the above processing is repeated for the next sample text (NO in S306). When the abnormality determination has been performed for all sample texts (YES in S306), the processing ends. The information on the abnormal speech part flag for each feature quantity is stored in the storage unit 20 after calculation.
- FIG. 7 is a flowchart for explaining the details of the step of generating the manual correction database 23 (S400).
- when the process starts, the information on the target person speech database 22 and the abnormal speech part flags is first read from the storage unit 20 (S401).
- Next, the truth of the abnormal speech part flags for the speech stored in the target person speech database 22 is checked, starting from the flag at the beginning of each text (S402).
- when the abnormal speech part flag is “false” (NO in S402), processing for the next feature quantity begins without performing the manual correction process for that feature quantity (NO in S406).
- when the abnormal speech part flag is “true” (YES in S402), the manual correction acceptance process (S403 to S405) is performed for the corresponding feature quantity before processing for the next feature quantity begins (NO in S406).
- specifically, information on the feature quantity corresponding to the abnormal speech part flag is first output via the audio output unit 50 and the display processing unit 60 to a speaker and a display (neither shown), respectively (S403).
- when the feature quantity is the power spectrum of the sound, which is an acoustic feature quantity, the power spectrum distribution of the target person's voice is shown on the display, and the target person's voice corresponding to the feature quantity is output to the speaker.
- when the feature quantity is the pitch of the sound, which is a prosodic feature quantity, the sound frequency [Hz] is shown on the display,
- and the target person's voice corresponding to the feature quantity is output to the speaker.
- when the feature quantity is a phoneme duration, which is a prosodic feature quantity,
- the duration of the phoneme is shown on the display, and the target person's voice corresponding to the feature quantity is output to the speaker.
- such output to the display and speaker facilitates the feature quantity correction described later.
- the output to the display and speaker is not limited to the above; other configurations may be used as long as correction is easy from a visual or auditory viewpoint.
- Next, a correction acceptance process is performed (S404). The correction acceptance process accepts a correction input by the database creator or the like for the feature quantity of the speech part regarded as abnormal, that is, the part whose abnormal speech part flag is set to “true”.
- specifically, while checking the display or the speaker output, the database creator corrects, through the keyboard, mouse, or the like, the feature quantity of the part whose abnormal speech part flag is “true”; the speech of that part is then replaced with speech based on the corrected feature quantity, or the speech of that part is set to be excluded from the basis for creating the later mixed database 25.
- whether the feature quantity is the power spectrum of the sound (an acoustic feature quantity), the pitch of the sound (a prosodic feature quantity), or a phoneme duration (a prosodic feature quantity), its value can be adjusted manually while checking the display and the sound output from the speaker.
- in this way, the database creator or the like can manually correct only the abnormal speech parts while confirming the speech information visually or audibly.
- the correction acceptance process is repeated until a correction end command is input (NO in S405), and for each correction a display or audio output reflecting the corrected feature quantity is presented.
- with this configuration, the feature quantities can be manually corrected only for the parts that are highly likely to be abnormal speech, while otherwise using the target person's feature quantities automatically calculated by the feature quantity calculation unit, so no excessive burden is imposed on the database creator.
- moreover, since correction processing can be performed in advance on the parts that are highly likely to be abnormal speech, the quality of the database on which speech synthesis parameter generation is based can be improved, and so can the quality of the final synthesized speech. At the same time, speech synthesis of sufficient quality is possible even with a small amount of sample data.
- FIG. 8 is a flowchart illustrating details of the generation step (S500) of the mixed database 25.
- The mixing database generation unit 14 first reads the manual correction database 23 and the subject speech database 22 (S501), and then reads a plurality of mixing ratios from the mixing ratio holding unit 24 of the storage unit 20 (S502).
- Each mixing ratio represents the ratio of the number of texts taken from the manual correction database 23 to the number taken from the subject speech database 22, for example 9:1, 8:2, 7:3, 6:4, 5:5, 4:6, 3:7, 2:8, or 1:9.
- The mixing database generation unit 14 then generates a plurality of databases by mixing texts randomly selected from the manual correction database 23 with texts randomly selected from the subject speech database 22, based on each of the read mixing ratios (S503). No unmixed database (that is, one with a mixing ratio of 10:0 or 0:10) is generated. This is because, as the experimental results described later make clear, the quality of the final synthesized speech is better when the manual correction database and the subject speech database are mixed to a certain extent.
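The per-ratio random selection in S503 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function and list names are invented, and the patent does not specify how many texts each mixed database contains, so a fixed total of 10 is assumed here.

```python
import random

def mix_databases(manual_texts, subject_texts, ratio, total=10, seed=0):
    """Randomly select texts from the manual-correction database and the
    subject speech database according to a mixing ratio such as (7, 3).
    All names and the `total` size are illustrative assumptions."""
    m, s = ratio
    n_manual = total * m // (m + s)   # texts drawn from the manual database
    n_subject = total - n_manual      # remainder drawn from the subject database
    rng = random.Random(seed)
    return rng.sample(manual_texts, n_manual) + rng.sample(subject_texts, n_subject)

# One database per mixing ratio, as in S503 (no 10:0 or 0:10 database).
ratios = [(9, 1), (8, 2), (7, 3), (6, 4), (5, 5), (4, 6), (3, 7), (2, 8), (1, 9)]
manual = [f"man_{i}" for i in range(50)]
subject = [f"sub_{i}" for i in range(50)]
mixed_dbs = {r: mix_databases(manual, subject, r) for r in ratios}
```

Each entry of `mixed_dbs` would then feed the speech synthesis parameter generation of S504.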
- the speech synthesis parameter generation unit 15 generates speech synthesis parameters for each database (S504).
- A speech synthesis parameter is a parameter (or coefficient) for producing synthesized speech output, and various known methods can be applied to calculate it: for example, an HMM (Hidden Markov Model) trained on a speech database, or a waveform-segment-concatenation speech synthesis technique may be used.
- These speech synthesis parameters are stored in the storage unit 20 as the mixed database 25 (S505).
- FIG. 9 is a flowchart for explaining the details of the high quality speech synthesis database selection processing step (S600).
- The speech quality comparison unit 16 reads a plurality of speech synthesis parameters and a predetermined sample text from the storage unit 20 (S601).
- Synthesized speech corresponding to the sample text is generated for each speech synthesis parameter, based on the read speech synthesis parameters and the sample text (S602).
- The speech synthesis algorithm used here is the same as the one used when the speech synthesis parameters were generated, for example HMM-based or waveform-concatenation speech synthesis.
- Next, the speech actually read aloud by the synthesis subject for the same sample text (the real voice) is read as one or more items of reference speech data (S603).
- The reference speech data may be read from the subject speech database 22, or may be stored in the storage unit in advance.
- The synthesized speech items generated for the same sample text as that used for the read reference speech data are compared with the reference speech, and the acoustic similarity between the speech data items is calculated (S604).
- The speech synthesis parameter that generated the synthesized speech with the highest similarity to the reference speech data is identified, selected as the high-quality speech synthesis database 26, and stored in the storage unit 20 (S604).
- Next, the method for calculating the similarity between the reference speech and each item of speech data will be described in detail.
- First, the feature amounts of the reference speech and of each synthesized speech (in this embodiment, the power spectrum, the pitch, and the phoneme duration) are calculated by the feature amount calculation unit 11 or, if already calculated, read from the storage unit 20.
- Next, for each feature amount, the difference between the feature amount of the reference speech and that of each synthesized speech is calculated, and the differences are summed per feature amount.
- The synthesized speech items are then ranked, for each feature amount, in ascending order of total difference, that is, starting from the feature amount closest to that of the reference speech data.
- Ranking is performed separately for the power spectrum, the pitch, and the phoneme duration, and points are assigned so that higher-ranked items (smaller total difference) receive fewer points: for example, 10 points for 1st place, 20 points for 2nd place, 30 points for 3rd place, and so on.
- The points for the individual feature amounts are then added together to give a total point score for each synthesized speech.
- The feature amounts may be weighted at this stage. For example, coefficients may be applied so that the power spectrum points contribute at a rate of 50%, and the pitch points and phoneme duration points each contribute at a rate of 25%.
- The speech synthesis parameter associated with the synthesized speech having the lowest total points is identified as the parameter with the highest similarity and selected as the high-quality speech synthesis parameter.
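The rank-and-points selection described above can be sketched as follows, assuming the per-feature total differences have already been computed. The function names are illustrative, and the 50/25/25 weighting follows the example given in the text.

```python
def rank_points(total_diffs):
    """Assign 10, 20, 30, ... points in ascending order of total difference,
    so the synthesized speech closest to the reference gets the fewest points."""
    order = sorted(range(len(total_diffs)), key=lambda i: total_diffs[i])
    points = [0] * len(total_diffs)
    for rank, i in enumerate(order):
        points[i] = (rank + 1) * 10
    return points

def select_best(spectrum_diffs, pitch_diffs, duration_diffs,
                weights=(0.5, 0.25, 0.25)):
    """Weight the per-feature points (50% spectrum, 25% pitch, 25% duration)
    and return the index of the synthesized speech with the lowest total."""
    ws, wp, wd = weights
    totals = [ws * a + wp * b + wd * c
              for a, b, c in zip(rank_points(spectrum_diffs),
                                 rank_points(pitch_diffs),
                                 rank_points(duration_diffs))]
    return min(range(len(totals)), key=lambda i: totals[i])
```

The returned index identifies which mixed database's parameters are stored as the high-quality speech synthesis database 26.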
- A method based on differences between feature amounts has been described as the similarity calculation method, but the calculation is not limited to this method; any known method for calculating the acoustic similarity between speech data may be employed.
- With such a configuration, the database producing the speech most similar to the subject's own voice is selected from among the plurality of databases mixed at the plurality of mixing ratios, so the speech synthesis most appropriate for the subject can be performed, improving the naturalness of the synthesized speech.
- In this embodiment, a plurality of speech synthesis parameters is generated using a plurality of mixing ratios (S501 to S505), and the highest-quality speech synthesis parameter is selected from among them (S601 to S604).
- However, if the quality of the synthesized speech is not being pursued, or if mixing at a single predetermined ratio is desired, one speech synthesis parameter may be generated using one mixing ratio and stored as the final speech synthesis parameter.
- <3. Speech output using the generated speech synthesis database> Next, the process for generating high-quality synthesized speech based on the high-quality speech synthesis database 26 and text input from the outside will be described.
- FIG. 9 is a flowchart for explaining a specific flow of audio output processing.
- When processing starts, the speech synthesis output processing unit 17 begins accepting text input in cooperation with the character input unit 30, which processes character information obtained through keyboard input and the like (S701). This text input acceptance is repeated until a speech synthesis command is input (NO in S702), and the input text is stored sequentially in the storage unit 20.
- When a speech synthesis command is input, the speech synthesis process (S703 to S705) starts.
- The speech synthesis output processing unit 17 reads the speech synthesis parameters stored in the storage unit 20 from the high-quality speech synthesis database 26 (S703), and then reads the series of texts input through the text input process (S701, NO in S702) (S704).
- Next, the speech synthesis output processing unit 17 performs HMM-based speech synthesis from the read speech synthesis parameters and the series of texts, and outputs the result via the speech output unit 50 (S705). With this configuration, high-quality synthesized speech corresponding to arbitrary input text can be output.
- <4. Listening experiment on synthesized speech> The inventors conducted a listening experiment to verify that mixing the subject speech database 22 and the manual correction database 23 improves the quality of the synthesized speech.
- In the experiment, three sets of speech synthesis parameters were computed in advance: parameters generated based only on the subject speech database 22 (hereinafter, AL parameters); parameters generated based on a database in which the subject speech database 22 and the manual correction database 23 were mixed at a ratio of 7:3 (hereinafter, HHL parameters); and parameters generated based only on the manual correction database 23 (hereinafter, HL parameters).
- The acoustic model was generated based on the data of seven speakers.
- Next, based on each of the AL, HHL, and HL parameters, synthesized speech for the evaluation texts shown in FIGS. 11 and 12 was generated: AL synthesized speech, HHL synthesized speech, and HL synthesized speech.
- The sentences in FIGS. 11 and 12 are short sentences likely to be used in various situations: 50 in total, consisting of 20 short sentences likely to be used in daily life, 10 likely to be used in news scripts, 10 likely to be used in meetings, and 10 likely to be used by a career-change consultant.
- FIG. 13 is a graph summarizing the five-stage evaluation results of four subjects.
- the horizontal axis represents AL synthesized speech, HHL synthesized speech, and HL synthesized speech from the left, and the vertical axis represents five-level evaluation results of 1 to 5.
- Each plot in the graph represents the average value of the evaluation results, and the vertical line segment represents the deviation.
- The speech synthesis software may instead be executed on a server, with a client machine accessing the server via a LAN or the Internet to use the speech synthesis function.
- The storage unit need not be configured within the server and may be provided separately from the server, such as in external storage.
- As described above, a speech synthesis method, program, apparatus, and server capable of high-quality speech synthesis based on a speech database can be provided, thereby contributing to industries such as welfare and entertainment and to other businesses.
Abstract
Description
<1. Hardware configuration>
The hardware configuration on which the speech synthesis software according to the present invention runs will be described with reference to FIG. 1. A general-purpose personal computer 100 is employed as the hardware that executes the speech synthesis software according to the present invention. The personal computer 100 comprises: a control unit 10 consisting of a CPU that executes various programs including the speech synthesis software described later; a storage unit 20 consisting of memory, an HDD, or another storage device in which those programs are stored; a character input unit 30 that processes character information input from a keyboard (not shown) or the like; a speech input unit 40 that processes speech input from a microphone (not shown) or the like; a speech output unit 50 that, under instructions from the control unit 10, processes speech to be output to a speaker (not shown) or the like; a display processing unit 60 that handles output to a display (not shown); and an input/output interface 70 that enables connection to external storage media and the like and functional extension. Although the hardware in this embodiment is assumed to be the personal computer 100, a general-purpose device, it may of course be realized as a dedicated speech synthesis device.
<2. Generation of the speech synthesis database by the speech synthesis software>
The operation of the apparatus up to the generation of the speech synthesis database and parameters for a specific individual using the speech synthesis software according to this embodiment is described below with reference to FIGS. 2 to 9.
<2.1 Overview of speech synthesis database generation>
FIG. 2 is a functional block diagram explaining the functions of the control unit 10 and the storage unit 20 of FIG. 1. The control unit 10 writes the results of the processing shown by each functional block into the storage unit 20, or performs the processing shown by each block by reading the stored contents. As described later, the control unit 10 comprises: a feature amount calculation unit 11 that calculates acoustic and prosodic feature amounts from speech data; an abnormal speech detection unit 12 that detects the portions of the speech data corresponding to abnormal speech; a manual correction processing unit 13 that allows the user to manually correct the portions corresponding to abnormal speech; a mixing database generation unit 14 that mixes texts from the subject speech database and texts from the manual correction database according to predetermined mixing ratios; a speech synthesis parameter generation unit 15 that generates parameters for synthesized speech output based on a given speech synthesis database; a speech quality comparison unit 16 that selects the highest-quality speech from a plurality of speech data; and a speech output processing unit 17 that outputs speech based on input text and the generated high-quality speech synthesis database.
<2.2 Generation of the acoustic model>
FIG. 4 is a flowchart explaining the details of the acoustic model generation step (S100). When processing starts, speech data read aloud for each of a plurality of sample texts by N unspecified speakers other than the speech database generation subject is read from the storage unit 20 (S101). This reading is repeated until all data for the N speakers has been read (NO in S102). In this embodiment, the speech data of the N unspecified speakers is stored in the storage unit 20 in advance, but it may instead be input each time through the speech input unit 40 from a microphone or the like. Each sample text consists of a short sentence, and the plurality of sample texts is a collection of several tens to several hundreds of such sentences.
<2.3 Generation of the subject speech database>
FIG. 5 is a flowchart explaining the details of the step (S200) of generating the speech database of the subject for whom the speech synthesis database is to be generated. When processing starts, utterance data read aloud by the subject for each of a plurality of sample texts is first read (S201). In this embodiment, the subject's utterance data is stored in the storage unit 20 in advance, but it may instead be input each time through the speech input unit 40 from a microphone or the like. Each sample text consists of a short sentence, and the plurality of sample texts is a collection of several tens to several hundreds of such sentences.
<2.4 Detection of abnormal speech>
FIG. 6 is a flowchart explaining in detail the step (S300) of detecting abnormal speech portions in the subject speech data based on the subject speech database 22. When processing starts, data on the acoustic and prosodic feature amounts is first read from the subject speech database 22 (S301).
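This passage does not fix a particular detection criterion for flagging abnormal feature values. As one hedged illustration only, portions whose feature values deviate strongly from the rest of the data could be flagged with a simple z-score test; the function name and the threshold are assumptions, not the patent's method.

```python
import statistics

def flag_abnormal(values, threshold=2.5):
    """Flag feature values (e.g. per-phoneme pitch or duration) whose
    z-score exceeds `threshold`. Illustrative sketch only: the patent
    does not specify a z-score detector or this threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)  # no variation, nothing to flag
    return [abs(v - mean) / sd > threshold for v in values]
```

Flagged portions would then be presented to the database creator for the manual correction step (S400).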
<2.5 Generation of the manual correction database>
FIG. 7 is a flowchart explaining the details of the step (S400) of generating the manual correction database 23. When processing starts, information on the subject speech database 22 and the abnormal speech portion flags is first read from the storage unit 20 (S401).
<2.6 Generation of the mixed database>
FIG. 8 is a flowchart explaining the details of the step (S500) of generating the mixed database 25. When processing starts, the mixing database generation unit 14 first reads the manual correction database 23 and the subject speech database 22 (S501), and then reads a plurality of mixing ratios from the mixing ratio holding unit 24 of the storage unit 20 (S502). Each mixing ratio represents the ratio of the number of texts taken from the manual correction database 23 to the number taken from the subject speech database 22, for example 9:1, 8:2, 7:3, 6:4, 5:5, 4:6, 3:7, 2:8, or 1:9.
<2.7 Selection of the high-quality database>
FIG. 9 is a flowchart explaining the details of the high-quality speech synthesis database selection step (S600). When processing starts, the speech quality comparison unit 16 reads a plurality of speech synthesis parameters and a predetermined sample text from the storage unit 20 (S601).
<3. Speech output using the generated speech synthesis database>
Next, the process for generating high-quality synthesized speech based on the high-quality speech synthesis database 26 and externally input text will be described.
<4. Listening experiment on synthesized speech>
The inventors conducted a listening experiment to verify that mixing the subject speech database 22 and the manual correction database 23 improves the quality of the synthesized speech.
<4.1 Method>
In the experiment, three sets of speech synthesis parameters were first computed in advance: parameters generated based only on the subject speech database 22 (hereinafter, AL parameters); parameters generated based on a database in which texts from the subject speech database 22 and texts from the manual correction database 23 were mixed at a ratio of 7:3 (hereinafter, HHL parameters); and parameters generated based only on the manual correction database 23 (hereinafter, HL parameters). The acoustic model was generated based on the data of seven speakers. Next, based on each of the AL, HHL, and HL parameters, synthesized speech for the evaluation texts shown in FIGS. 11 and 12 was generated, i.e. AL synthesized speech, HHL synthesized speech, and HL synthesized speech. The sentences shown in FIGS. 11 and 12 are short sentences likely to be used in various situations: 50 in total, consisting of 20 short sentences likely to be used in daily life, 10 from news scripts, 10 likely to be used in meetings, and 10 likely to be used by a career-change consultant.
<4.2 Results>
FIG. 13 is a graph summarizing the five-level evaluation results of four test subjects. The horizontal axis shows, from the left, AL synthesized speech, HHL synthesized speech, and HL synthesized speech, and the vertical axis shows the evaluation results on a five-level scale of 1 to 5. Each plot in the graph represents the mean of the evaluation results, and the vertical line segments represent the deviation. The figure shows that speech synthesis based on the HHL parameters was evaluated more favorably than synthesis based only on the AL parameters or only on the HL parameters. In other words, this experiment confirmed that mixing the manual correction database and the subject speech database to a certain extent yields better final speech output quality.
<5. Other embodiments>
The present invention is not limited to the above embodiment and can be modified in various ways without departing from its gist. For example, the above embodiment described speech synthesis software (a program) executed on a stand-alone device. However, the speech synthesis software need not be executed on a stand-alone device: it may be realized as a network system in which the software runs on a server and a client machine accesses the server via a LAN or the Internet to use the speech synthesis function. In that case, the apparatus means the server. Also, while all processing in the above embodiment was performed within one device, the functions may be distributed. For example, the storage unit need not be configured inside the server and may be provided separately from the server, such as in external storage.
20 Storage unit
30 Character input unit
40 Speech input unit
50 Speech output unit
60 Display processing unit
70 Input/output I/F
100 Personal computer
Claims (6)
- A speech synthesis parameter generation method for generating synthesized speech of a subject, comprising:
a subject speech database saving step of calculating subject acoustic feature amounts and/or subject prosodic feature amounts, which are the acoustic and/or prosodic feature amounts of the subject's speech for a plurality of sample texts, and saving them as a subject speech database;
an abnormal speech detection step of detecting, as abnormal speech portions, portions of the subject's speech having abnormal acoustic and/or prosodic feature amounts among the subject acoustic feature amounts and/or subject prosodic feature amounts;
a manual correction acceptance step of accepting corrections to the acoustic feature amounts and/or subject prosodic feature amounts in the subject speech database that correspond to the abnormal speech portions detected in the abnormal speech detection step;
a manual correction database saving step of saving the speech database having the corrected acoustic and/or prosodic feature amounts as a manual correction database;
a mixing step of mixing the subject speech database and the manual correction database at a predetermined mixing ratio; and
a speech synthesis parameter generation step of generating, based on the database mixed in the mixing step, speech synthesis parameters for performing speech synthesis of the subject.
- The speech synthesis parameter generation method according to claim 1, wherein
in the mixing step, the mixing ratio is a plurality of mutually different mixing ratios,
in the speech synthesis parameter generation step, a plurality of speech synthesis parameters is generated based on the plurality of mixing ratios,
the method further comprises a speech quality comparison step, and
the speech quality comparison step compares the speech generated from the plurality of speech synthesis parameters with reference speech consisting of the subject's own speech, and selects, as a high-quality speech synthesis database, the speech synthesis database that generated the speech most similar to the reference speech.
- The speech synthesis parameter generation method according to claim 2, wherein
the acoustic feature amount is the power spectrum of the speech, and
the prosodic feature amounts are the pitch of the speech and the duration of the phonemes.
- The speech synthesis parameter generation method according to claim 3, wherein,
in the subject speech database saving step, the phoneme duration that is a subject prosodic feature amount is generated based on an acoustic model obtained from the sample speech data of a plurality of persons.
- A speech synthesis parameter generation program for generating synthesized speech of a subject, the program causing a computer to execute:
a subject speech database saving step of calculating subject acoustic feature amounts and/or subject prosodic feature amounts, which are the acoustic and/or prosodic feature amounts of the subject's speech for a plurality of sample texts, and saving them as a subject speech database;
an abnormal speech detection step of detecting, as abnormal speech portions, portions of the subject's speech having abnormal acoustic and/or prosodic feature amounts among the subject acoustic feature amounts and/or subject prosodic feature amounts;
a manual correction acceptance step of accepting corrections to the acoustic feature amounts and/or subject prosodic feature amounts in the subject speech database that correspond to the abnormal speech portions detected in the abnormal speech detection step;
a manual correction database saving step of saving the speech database having the corrected acoustic and/or prosodic feature amounts as a manual correction database;
a mixing step of mixing the subject speech database and the manual correction database at a predetermined mixing ratio; and
a speech synthesis parameter generation step of generating, based on the database mixed in the mixing step, speech synthesis parameters for performing speech synthesis of the subject.
- A speech synthesis parameter generation apparatus for generating synthesized speech of a subject, comprising:
a subject speech database saving unit that calculates subject acoustic feature amounts and/or subject prosodic feature amounts, which are the acoustic and/or prosodic feature amounts of the subject's speech for a plurality of sample texts, and saves them as a subject speech database;
an abnormal speech detection unit that detects, as abnormal speech portions, portions of the subject's speech having abnormal acoustic and/or prosodic feature amounts among the subject acoustic feature amounts and/or subject prosodic feature amounts;
a manual correction acceptance unit that accepts corrections to the acoustic feature amounts and/or subject prosodic feature amounts in the subject speech database that correspond to the abnormal speech portions detected by the abnormal speech detection unit;
a manual correction database saving unit that saves the speech database having the corrected acoustic and/or prosodic feature amounts as a manual correction database;
a mixing unit that mixes the subject speech database and the manual correction database at a predetermined mixing ratio; and
a speech synthesis parameter generation unit that generates, based on the mixed database, speech synthesis parameters for performing speech synthesis of the subject.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201462052597P | 2014-09-19 | 2014-09-19 | |
US62/052,597 | 2014-09-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2016043322A1 true WO2016043322A1 (ja) | 2016-03-24 |
Family
ID=55533358
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2015/076743 WO2016043322A1 (ja) | 2015-09-18 | Speech synthesis method, program, and device |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2016043322A1 (ja) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11488603B2 (en) * | 2019-06-06 | 2022-11-01 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for processing speech |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2008233542A (ja) * | 2007-03-20 | 2008-10-02 | Fujitsu Ltd | 韻律修正装置、韻律修正方法、および、韻律修正プログラム |
JP2008256942A (ja) * | 2007-04-04 | 2008-10-23 | Toshiba Corp | 音声合成データベースのデータ比較装置及び音声合成データベースのデータ比較方法 |
JP2008292587A (ja) * | 2007-05-22 | 2008-12-04 | Fujitsu Ltd | 韻律生成装置、韻律生成方法、および、韻律生成プログラム |
- 2015-09-18: WO application PCT/JP2015/076743 filed (WO2016043322A1, active, Application Filing)
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15841794 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16/06/2017) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15841794 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: JP |
|