WO2017082717A2 - Method and system for text to speech synthesis - Google Patents

Method and system for text to speech synthesis

Publication number
WO2017082717A2
Authority
WO
WIPO (PCT)
Prior art keywords
malay
given
speech
text input
word
Prior art date
Application number
PCT/MY2016/050076
Other languages
French (fr)
Other versions
WO2017082717A3 (en)
Inventor
Mumtaz Begum PEER MUSTAFA
Mansoor Ali MOHAMED YUSOOF
Original Assignee
Universiti Malaya
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Universiti Malaya
Publication of WO2017082717A2
Publication of WO2017082717A3

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present invention generally relates to telecommunications systems and methods. More particularly the present invention relates to an improved method and system of text to speech synthesis. Most particularly, the present invention relates to an improved method and system of Malay language text to speech synthesis.
  • the state of the art of speech synthesizing systems is populated by three general categories of speech synthesizers namely articulatory speech synthesizers, formant speech synthesizers, and concatenative speech synthesizers.
  • Articulatory speech synthesizers are based on the physics of sound generation in the vocal apparatus. Individual parameters related to the position and movement of the vocal cords are provided, and the resulting sound generation is determined according to the laws of classical physics. Due to the complexity of the physics involved and the difficulty of modelling it mathematically, practical realization of articulatory speech synthesizers appears to be a distant prospect.
  • the second type of speech synthesizer, the formant synthesizer, unlike said articulatory speech synthesizers, does not use mathematical models of the physical laws that govern speech formation in the vocal cords, but instead models the acoustic features of speech or the spectra of a speech signal.
  • Aforementioned formant synthesizers more particularly utilize a set of rules in combination with said spectra or acoustic features of speech signals to effectively synthesize speech.
  • a phoneme is modelled with "formants" wherein each formant has a distinct frequency 'trajectory' and distinct bandwidth which varies over the duration of the phoneme.
  • An audio signal is synthesized by using the frequency and bandwidth trajectories to control a formant synthesizer.
  • Concatenative speech synthesizers for generating speech from text input operate on an entirely different principle. More particularly, aforementioned concatenative systems operate using pre-recorded actual speech which forms a large database or corpus. The corpus is segmented based on phonological features.
  • the drawback of concatenative speech synthesizing systems, however, similar to formant synthesizers, is a lack of "naturalness" despite good intelligibility.
  • concatenative speech synthesizers provide a higher degree of "natural" sounding synthesized speech compared to either formant synthesizers or articulatory synthesizers.
  • the decrease in "naturalness" of the synthesized speech in aforementioned concatenative synthesizers is particularly pronounced for under-resourced languages such as Malay, Bengali and Vietnamese. These languages are, coincidentally, among the 20 most widely spoken languages in the world, yet because they are under-resourced they have to rely on resources developed for more popular and sufficiently resourced languages such as English.
  • segmental labels and duration model are required for use by a HMM text to speech synthesis system to hence yield a HMM text to speech synthesis system that is adapted to receive Malay input and synthesize Malay speech output which is intelligible and natural sounding.
  • Malay is a widely spoken language that is spoken by more than 150 million people in Malay-speaking countries such as Malaysia, Indonesia, Brunei, Singapore and Southern Thailand.
  • Malay language text input to Malay speech synthesis may find utility in a wide range of applications that include teaching and reading aids for the visually impaired.
  • the state of the prior art of HMM-based speech synthesis systems fails to disclose a HMM-based speech synthesis system for under-resourced languages and more particularly, the state of the prior art fails to disclose a HMM-based speech synthesis system tailored to Malay text input to speech output.
  • the development of a HMM-based speech synthesis system tailored to Malay text input to speech output requires the development of speech synthesis resources that include a speech database and segmental labels.
  • the development of a speech database can be easily conceived by a person skilled in the art of speech synthesis.
  • the challenge in the development of Malay text input to speech synthesis systems is the generation of segment phonetic labels, and more particularly the generation of context dependent labels.
  • the present invention provides a HMM (Hidden Markov Model) based speech synthesis system comprising a plurality of software modules in the form of computer readable instructions residing in memory of a computer and executed by a central processing unit (CPU) of said computer, the system including: a training module for training a plurality of HMM (Hidden Markov Model) statistical predictive models to generate an acoustic speech model of Malay speech; and a synthesizing module for synthesizing Malay speech from Malay text input utilizing context dependent labels generated from the Malay text input and the acoustic model for Malay speech; characterized in that the speech synthesis module includes a context dependent label generation unit that automatically generates context dependent labels that correspond to phoneme units of Malay language from a Malay language input text string, derived in part from syllabification rules that are exclusive to Malay language.
  • aforementioned training module comprises a database housing a Malay speech corpus that includes a Malay language speech model embodied in the form of speech recordings of speech signals of a native Malay speaker and corresponding phonetic symbols representative of said speech recordings, and a parametrization module which extracts and/or generates parameters that include spectral parameters and excitation parameters for training a plurality of HMM (Hidden Markov Model) probabilistic models.
  • aforementioned synthesizing module comprises a statistical parameter model database that includes a plurality of HMM probabilistic models that in combination represent an acoustic speech model derived from the plurality of excitation and spectral parameters generated by the parametrization module of the training module.
  • Said acoustic speech model housed in the statistical parameter model database and the plurality of context dependent labels generated from a Malay text input string and that correspond to a given plurality of phoneme units of said Malay text input string are utilized to generate acoustic speech parameters representative of Malay speech corresponding to said Malay text input string.
  • the synthesizing module further includes a speech synthesis module which correlates the plurality of context dependent labels of a given plurality of phoneme units corresponding to the input text string and the acoustic speech model defined by the plurality of HMMs residing in the statistical parameter model database to generate a corresponding plurality of acoustic parameters that correspond to said given plurality of phoneme units of the Malay text input string.
  • Said synthesis module of the synthesizing module further concatenates the acoustic speech parameters representative of Malay speech corresponding to the input Malay text string to synthesize Malay speech.
  • the automatic context dependent label generation unit generates context dependent labels from a Malay text input string that include phoneme duration information and part-of-speech (POS) tags.
  • the speech synthesis module of the synthesizing module includes a filter such as a Mel Log Spectrum Approximation (MLSA) filter.
  • the acoustic speech model formed by the plurality of HMM probabilistic models is a probabilistic model in the form of a plurality of decision trees.
  • spectral parameters generated and/or extracted by the parametrization module of the training module include the fundamental frequency (F0) and Mel-Cepstral coefficients (MCEPs).
  • the excitation parameters generated and/or extracted by the parametrization module of the training module include the log of the fundamental frequency of speech signals (log F0), the delta coefficients, which represent the first order differential of the Mel-Cepstral coefficients (MCEPs) of speech signals, and the delta-delta coefficients, which represent the second order differential of the MCEPs of speech signals.
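  • The delta and delta-delta coefficients described above can be illustrated with a minimal sketch. It assumes a central-difference window with the edge frames repeated, which is a common choice but is not specified in this document, and it approximates the second order differential by applying the first order differential twice. The function names `delta` and `delta_delta` are hypothetical.

```python
from typing import List

Frame = List[float]


def delta(frames: List[Frame]) -> List[Frame]:
    """First order differential of a sequence of feature vectors.

    Assumption: central difference 0.5 * (x[t+1] - x[t-1]), with the first
    and last frames repeated at the edges. The source does not state which
    window is used.
    """
    n = len(frames)
    out: List[Frame] = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([0.5 * (b - a) for a, b in zip(prev_frame, next_frame)])
    return out


def delta_delta(frames: List[Frame]) -> List[Frame]:
    """Second order differential, approximated as the delta of the delta."""
    return delta(delta(frames))


if __name__ == "__main__":
    # Four frames of a one-dimensional feature that rises linearly.
    mcep = [[0.0], [1.0], [2.0], [3.0]]
    print(delta(mcep))
    print(delta_delta(mcep))
```

  In this sketch each frame would hold the Mel-Cepstral coefficients of one analysis window; the static, delta and delta-delta vectors are typically stacked into one observation vector per frame before HMM training.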
  • the present invention provides a method for Malay speech synthesis comprising: i.) a step of extracting and/or generating a plurality of parameters, including spectral parameters and excitation parameters, representative of a Malay speech model of a native Malay speaker whose speech recordings of speech signals and corresponding phonetic symbols representative of said speech recordings are stored in a speech corpus; ii.) a step of training a plurality of HMMs (Hidden Markov Models) which represent a plurality of probabilistic models, utilizing the spectral and excitation parameters extracted and/or generated from the Malay speech model and representative of said Malay speech model to generate a Malay speech acoustic model; iii.
  • the step of generating a plurality of context dependent labels corresponding to a plurality of phoneme units of an arbitrary Malay input text string comprises: i.) a step of segmenting an arbitrary Malay text input string into a plurality of words and sentences, in which the position of each word in a given sentence of the Malay text input string is identified; ii.) a step of executing grapheme to phoneme conversion for each identified word in a given sentence of the Malay text input string, in which graphemes of each word are converted to corresponding phoneme units; iii.) a step of syllabification in which each word identified in a given sentence of the given Malay text input string is segmented into identifiable syllables, in which a position of each syllable in a given word of a given sentence is identified; iv.) a step of compiling contextual information of each phonetic symbol of the Malay text input string from the results of the step of text input segmentation, the step of grapheme to phoneme conversion and the step of syllabification; and v.) a step of generating a plurality of context dependent labels for each phonetic symbol of the Malay text input string which are in essence acoustic labels that incorporate the compiled contextual information of the previous step and that correspond to the Malay text input string.
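  • The label generation steps listed above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the patented method: the grapheme to phoneme table, the phoneme symbols (`N`, `J`, `S`, `x`), the `sil` boundary marker and the label layout are all hypothetical, and the syllable features are omitted for brevity.

```python
import re
from typing import List

# Assumption: a toy table in which common Malay digraphs map to one phoneme
# symbol each. The symbols themselves are invented for this sketch.
DIGRAPHS = {"ng": "N", "ny": "J", "sy": "S", "kh": "x"}


def segment(text: str) -> List[List[str]]:
    """Split text into sentences, then each sentence into lowercase words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [re.findall(r"[a-z]+", s.lower()) for s in sentences]


def g2p(word: str) -> List[str]:
    """Toy grapheme to phoneme conversion: digraphs first, else one letter."""
    phones: List[str] = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        else:
            phones.append(word[i])
            i += 1
    return phones


def labels(text: str) -> List[str]:
    """Build one label per phoneme: previous-current+next plus word position."""
    out: List[str] = []
    for words in segment(text):
        # Compile (phoneme, 1-based word position) pairs for the sentence.
        units = [(p, w_idx + 1) for w_idx, w in enumerate(words) for p in g2p(w)]
        for k, (phone, w_pos) in enumerate(units):
            prev_p = units[k - 1][0] if k > 0 else "sil"
            next_p = units[k + 1][0] if k + 1 < len(units) else "sil"
            out.append(f"{prev_p}-{phone}+{next_p}/W:{w_pos}_{len(words)}")
    return out


if __name__ == "__main__":
    for label in labels("Saya makan."):
        print(label)
```

  Each label here carries only the neighbouring phonemes and the position of the word in its sentence; a fuller version would append the syllable position and count obtained from the syllabification step.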
  • the step of grapheme to phoneme conversion comprises the following steps, listed in no particular order: i.) a step of generating a phoneme unit for each phonetic symbol representative of a consonant in a given word of a given sentence of a given Malay text input string; and ii.) a step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay text input string.
  • the step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay input text string is as detailed by the following steps: i.) a step of identifying at least three preceding phonetic symbols representative of three preceding consonants that precede a given phonetic symbol representative of a vowel identified in a given word of a given sentence of a given Malay text input string; ii.) a step of identifying at least three subsequent phonetic symbols representative of three subsequent consonants that are located subsequent to the given phonetic symbol representative of the vowel identified in the given word of the given sentence of the given Malay input text string; and iii.) a step of correlating said phonetic symbol representative of the vowel identified, said at least three phonetic symbols representative of the consonants preceding said phonetic symbol representative of the vowel, and said at least three phonetic symbols representative of the consonants subsequent to said phonetic symbol
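  • The gathering of consonant context around a vowel, as described above, might look like the following sketch. It assumes that letters stand in for phonetic symbols, that the context is limited to the same word, and that words with fewer than three consonants on a side simply yield a shorter context. The function name `consonant_context` is hypothetical, and how the gathered context is then mapped to a phoneme unit is not reproduced here.

```python
from typing import Tuple

VOWELS = set("aeiou")


def consonant_context(word: str, index: int, width: int = 3) -> Tuple[str, str, str]:
    """Return (preceding consonants, vowel, following consonants).

    Assumption: up to `width` nearest consonants are collected on each side
    of the vowel at `index`, skipping over any intervening vowels.
    """
    if word[index] not in VOWELS:
        raise ValueError(f"symbol at index {index} is not a vowel: {word[index]!r}")
    left = [c for c in word[:index] if c not in VOWELS][-width:]
    right = [c for c in word[index + 1:] if c not in VOWELS][:width]
    return "".join(left), word[index], "".join(right)


if __name__ == "__main__":
    # Context of each "e" in the word "membeli".
    print(consonant_context("membeli", 1))
    print(consonant_context("membeli", 4))
```

  A lookup keyed on the returned triple could then select among candidate pronunciations of the vowel.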
  • the step of syllabification of each word identified in a given sentence comprises: i.) a step of counting a number of syllables in a given word of a given sentence; and ii.) a step of identifying and marking each syllable in a given word of a given sentence.
  • the step of identifying and marking each syllable in a given word of a given sentence is as detailed by the following steps: i.) a step of identifying a phonetic symbol representative of a vowel or diphthong of a given word in a given sentence of a given Malay input text string, starting from the rear of the word; ii.
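  • A minimal sketch of counting syllables by locating vowel and diphthong nuclei from the rear of a word is given below. It assumes that letters stand in for phonetic symbols and that the sequences `ai`, `au` and `oi` are always diphthongs, which does not hold for every Malay word (for example, vowel sequences that span two syllables). The remaining marking rules are not spelled out in this text and are not reproduced. The function names are hypothetical.

```python
from typing import List, Tuple

VOWELS = set("aeiou")
# Assumption: these letter pairs are always treated as a single nucleus.
DIPHTHONGS = {"ai", "au", "oi"}


def find_nuclei(word: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans of syllable nuclei, scanning from the rear."""
    nuclei: List[Tuple[int, int]] = []
    i = len(word) - 1
    while i >= 0:
        if i >= 1 and word[i - 1:i + 1] in DIPHTHONGS:
            nuclei.append((i - 1, i + 1))
            i -= 2
        elif word[i] in VOWELS:
            nuclei.append((i, i + 1))
            i -= 1
        else:
            i -= 1
    nuclei.reverse()  # report spans in left-to-right order
    return nuclei


def count_syllables(word: str) -> int:
    """One syllable per nucleus."""
    return len(find_nuclei(word))


if __name__ == "__main__":
    for w in ("makan", "sekolah", "pandai"):
        print(w, count_syllables(w), find_nuclei(w))
```

  The spans returned by `find_nuclei` give the anchor points around which syllable boundaries would be marked.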
  • FIG. 1 is a block diagram of an exemplary computing environment for implementing a HMM based Malay text input to Malay speech synthesis (TTS) system in accordance with a preferable embodiment of the present invention
  • Figure 2 is a block diagram illustrating a schematic view of a general software architecture of a concatenative HMM based Malay text to Malay speech synthesis system in accordance to a preferable embodiment of the present invention
  • Figure 3 is a flow-chart detailing general steps executed by a method of HMM based Malay text to Malay speech synthesis of the HMM based Malay text to Malay speech synthesis system in accordance to a preferable embodiment of the present invention
  • Figure 4 is a flow-chart detailing a method of context dependent label generation in accordance to a method of HMM based text to speech synthesis for Malay language by way of a context dependent label generation unit software module, in accordance to a preferable embodiment of the present invention
  • Figure 5 is a flow-chart detailing a method of Malay text input segmentation in accordance to a preferable embodiment of the method of context dependent label generation detailed in the flow-chart of figure 4;
  • Figure 6 is a flow-chart detailing a method of grapheme to phoneme conversion in accordance to a preferable embodiment of the method of context dependent label generation detailed in the flow-chart of figure 4;
  • Figure 7 is a flow-chart detailing a method of phoneme unit identification associated with a vowel, as executed by the method of grapheme to phoneme conversion illustrated in figure 6, in accordance to a method of context dependent label generation, in accordance to a preferable embodiment of the HMM based Malay text to Malay speech synthesis of the present invention
  • Figure 8 is a flow-chart detailing a method of syllabification in accordance to a method of context dependent label generation detailed in the flow-chart of figure 4;
  • Figure 9 is a flowchart detailing a method of syllable identification and marking as executed by the method of syllabification illustrated by the flowchart of figure 8, in accordance to a method of context dependent label generation, in accordance to a preferable embodiment of the HMM based Malay text input to speech synthesis of the present invention.
  • FIG. 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented.
  • the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types.
  • the invention may be practiced with other computer system configurations, including hand-held devices, multi-processor systems, micro-processor-based or programmable consumer electronics, network PCs, minicomputers, main-frame computers, and the like.
  • the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network.
  • program modules may be located in both local and remote memory storage devices.
  • FIG. 1 illustrates an exemplary system for implementing the invention which includes a general purpose computing device in the form of a conventional personal computer 20, including a central processing unit (CPU) 21, a system memory 22, referred to hereinafter as memory, and a system bus 23 that couples various system components including the system memory 22 to the processing unit 21.
  • the system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures.
  • the system memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25.
  • a basic input/output system (BIOS) 26, containing a basic routine that helps to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24.
  • the personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD-ROM.
  • the hard disk drive 27, magnetic disk drive 28 and the optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively.
  • the drives and the associated computer readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20.
  • Although the exemplary environment described herein employs the hard disk, the removable magnetic disk 29 and the removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as removable flash memories, may also be used in the exemplary operating environment.
  • a number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38.
  • a user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40, pointing device 42 and a microphone 43.
  • these input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a parallel port and/or a universal serial bus (USB) port.
  • a monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48.
  • personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).
  • the personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49.
  • the remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in figure 1.
  • the logical connections depicted in figure 1 include a local area network (LAN) 51 and a wide area network (WAN) 52.
  • When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52, such as the Internet.
  • the modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46.
  • program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
  • a context dependent label generation unit 112 software module is conceived to automatically generate context dependent labels of phoneme units that correspond to a given Malay text input 111.
  • the present invention is specifically directed toward a HMM based Malay text to Malay speech synthesis system 100 that is characterized in that the system 100 comprises a context dependent label generation unit 112 that generates context dependent labels for phoneme units that correspond to a given Malay text input 111, derived in part from syllabification rules that are exclusive to Malay language.
  • Referring to FIG. 2, there is shown a diagram illustrating an embodiment of a HMM based speech synthesis system 100 for Malay speech synthesis from Malay text input, in accordance to a preferable embodiment of the present invention.
  • the fundamental building blocks of said HMM based speech synthesis system 100 for Malay speech synthesis from a Malay text input string 111 include a training module 105, which comprises a speech corpus 106 and a parametrization module 107, as well as a synthesizing module 110, which comprises a speech synthesis module 114 and a statistical parametric model database 113 which houses a plurality of HMM probabilistic models that in combination form an acoustic speech model of Malay speech.
  • the training module 105 is utilized to train the synthesizing module 110 of the HMM based speech synthesis system 100 in accordance to a preferable embodiment of the present invention. More particularly, the training module 105 is utilized to train a plurality of HMMs (Hidden Markov Models), which are in essence probabilistic models in the form of a plurality of decision trees. Each HMM is in essence a decision tree that contains states with assigned probabilities, in which each state corresponds to a context dependent label which contains contextual information pertaining to a given phoneme unit that corresponds to a given word in a given sentence of a given Malay text input string.
  • a context dependent label provides contextual information pertaining to a given phonetic symbol which when considered in combination correspond to a given phoneme unit.
  • the plurality of HMMs represent probabilistic models that model a given speaker's acoustic speech and hence represent an acoustic speech model of a given speaker's speech, which is used in a given HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
  • the training module 105 comprises a speech corpus 106 and a parametrization module 107.
  • the speech corpus 106 may comprise of recordings of speech signals of a given speaker's speech as well as phonetic symbols representative of said recordings that have been chosen to cover sounds made in Malay language in the context of syllables and words that make up the vocabulary of the Malay language.
  • the parametrization module 107 serves to extract and/or generate parameters such as spectral parameters and excitation parameters of the recorded speech (comprising recordings of speech signals).
  • Said spectral and excitation parameters, together with the plurality of phonetic symbols that are stored with the speech recordings and are representative of said speech recordings, form a speech model and are used to train the plurality of HMMs. After the completion of training, the HMMs in combination acoustically model the speech of the speaker whose speech recordings are stored in the speech corpus 106 and utilized in the HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
  • Aforementioned plurality of HMMs form a statistical parametric model housed in a statistical parameter model database 113 and effectively represent a speech acoustic model of a native Malay speaker's speech, said statistical parameter model database 113 forming part of a synthesizing module 110 of the HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
  • aforementioned parametrization module 107 serves to extract and/or generate spectral parameters of said speech recordings of speech signals stored in the speech corpus 106, such as the fundamental frequency (F0) of the speech signals corresponding to said speech recordings and the Mel-Cepstral coefficients (MCEPs) of said speech signals.
  • Aforementioned parametrization module 107 further serves to extract and/or generate excitation parameters such as the logarithm of the fundamental frequency (log F0) and the delta (first order derivative) and delta-delta (second order derivative) of the Mel-Cepstral coefficients (MCEPs) of said speech signals corresponding to said speech recordings stored in the speech corpus 106.
  • the speech corpus 106 of the training module 105 is a Malay neutral speech database which consists of 1,000 recorded utterances uttered by two Malay native speakers, one male and the other female.
  • the 1000 recorded speech utterances were constructed from 1000 Malay sentences that contain a representative range of Malay words, syllables and phonemes which were taken from different sources that include Malay newspapers (43%), educational text books (39%), and other general reading materials (18%).
  • Aforementioned 1000 Malay sentences, which correspond to aforementioned 1000 speech utterances, include, in an exemplary embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention, 577 short sentences (fewer than seven words) and 412 long sentences (seven words or more), with a range from 3 to 12 words per sentence.
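  • The short and long sentence categories described above follow a simple word-count rule, sketched below. The sketch assumes whitespace-separated words; the function names and the example sentences are hypothetical and are not taken from the corpus.

```python
from collections import Counter
from typing import Iterable


def sentence_length_class(sentence: str, threshold: int = 7) -> str:
    """Classify as 'short' (fewer than `threshold` words) or 'long'."""
    return "short" if len(sentence.split()) < threshold else "long"


def length_profile(sentences: Iterable[str]) -> Counter:
    """Count how many sentences fall into each length class."""
    return Counter(sentence_length_class(s) for s in sentences)


if __name__ == "__main__":
    examples = [
        "Saya suka makan nasi",                   # 4 words
        "Dia pergi ke pasar untuk membeli ikan",  # 7 words
    ]
    print(length_profile(examples))
```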
  • the synthesizing module 110 comprises of a statistical parameter model database 113 which houses said plurality of HMMs which effectively form a speech acoustic model of a native Malay speaker whose speech recordings and corresponding phonetic symbols representing said speech recordings are utilized to synthesize Malay speech from Malay text input in accordance to preferable embodiments of the HMM based Malay text to Malay speech synthesis system 100 of the present invention.
  • Aforementioned synthesizing module 110 further includes a speech synthesis module 114 that serves to extract relevant acoustic parameters (i.e. phoneme units) that are representative of context dependent labels of given phoneme units in a given text input, said context dependent labels being automatically generated by a context dependent label generation unit 112. Subsequent to extraction of said acoustic parameters, the speech synthesis module 114 concatenates said acoustic parameters to effectively synthesize Malay speech corresponding to Malay text input by a user.
  • Said speech synthesis module 114 of the synthesizing module 110 further includes a synthesis filter that, in accordance to a preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention, is a Mel Log Spectrum Approximation (MLSA) filter.
  • Aforementioned synthesizing module 110 readily accepts a Malay text input string 111 from a user via a user input peripheral device such as a keyboard 40 in an exemplary computer system 20 environment that houses the HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention and outputs synthesized Malay speech through an output device such as speaker 45 in an exemplary computer system 20 environment that houses said HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
  • the HMM (Hidden Markov Model) based Malay text to Malay speech synthesis system 100 of the present invention as alluded to in figure 2 is specifically directed to a HMM based speech synthesis system 100 adapted for Malay text input to Malay speech synthesis in which a context dependent label generation unit 112 software module is conceived to automatically generate context dependent labels of phonemes units of a given Malay text input string 111.
  • the HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention is characterized in that the system 100 comprises of a context dependent label generation unit 112 that generates context dependent labels for phoneme units of a Malay text input string 111 derived in part on syllabification rules that are exclusive to Malay language to hence provide for Malay text to Malay speech synthesis that is more natural sounding and has a high degree of intelligibility.
  • the HMM based Malay text to Malay speech synthesis system 100 comprising of said training module 105 and synthesizing module 110 executes a method of HMM based Malay text to Malay speech synthesis 200 in accordance to preferable embodiments of the present invention, comprising of: i.) a step 201 of extracting and/or generating a plurality of parameters representative of a Malay speech model of a native Malay speaker whose speech recordings of speech signals and corresponding phonetic symbols representative of said speech recordings are stored in a Malay speech corpus, that include spectral parameters and excitation parameters; ii.
  • the HMM based Malay text to Malay speech synthesis system 100 of the present invention is characterized by the provision of automatic context dependent label generation by a context dependent label generation unit 112 comprised in the synthesizing module 110.
  • said context dependent label generation unit 112 is a software module residing in an exemplary computing environment, in an area of memory 22 of a computer system 20 that generates context dependent labels of phoneme units of a given Malay text input string 111 according to a method 203 comprising of the following general steps: i.
  • a step 307 of generating a plurality of context dependent labels which are in essence acoustic labels that incorporate the compiled contextual information of a previous step and that correspond to a plurality of phoneme units of a given sentence of the given Malay text input string 111 for every sentence of said given Malay text input string 111.
  • the term "grapheme” refers to a phonetic symbol which corresponds to a given phoneme unit in a given word of a given sentence of a given string of Malay text input 111.
  • the method executed by the context dependent label generation unit 112 includes a step in which a phonetic symbol, i.e.
  • a grapheme of each word in a given sentence, is converted to a corresponding phoneme unit.
  • a phoneme unit corresponds to a fundamental phonological unit of a given language, and hence in the context of this description, a phoneme unit refers to a fundamental phonological unit of Malay language.
  • the step 301 of segmenting of an arbitrary Malay text input string 111 into a plurality of words and sentences, in which the position of each word in a given sentence of the given Malay text input string 111 is identified comprises of: i.) a step 301 a of receiving a Malay text input string 111 provided by a user; ii.) a step 301 b of segmenting the Malay text input string 111 into sentences and words and hence identifying sentences and words in said Malay text input string 111 ; iii.
  • word information such as word numbers assigned to each word in a sentence that alludes to a position of a given word in relation to a given sentence for every word and every sentence of said Malay input text string 111.
  • the step of segmenting an arbitrary Malay text input string 111 includes a step of assigning a word number for each word in a sentence that alludes to a position of a given word in relation to the Malay text input string 111 , as well as a step of updating a register for storage of word information such as word numbers assigned to each word within the given Malay text input string 111 apart from storing of word numbers assigned to each word in a sentence that alludes to a position of a given word in relation to a given sentence for every word and every sentence of said Malay input text string 111 .
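The segmentation step 301 described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `segment_text`, the sentence-delimiter set and the dictionary standing in for the word-information register are all assumptions introduced here.

```python
import re

def segment_text(text):
    """Split a text input string into sentences and words, and record
    two word numbers per word: its position within its sentence and its
    position within the whole input string."""
    # Assumption: a sentence ends at '.', '!' or '?'.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    register = []          # stands in for the word-information register
    word_in_text = 0
    for sentence_no, sentence in enumerate(sentences, start=1):
        # Hyphenated reduplications such as "buku-buku" stay one word.
        words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)
        for word_in_sentence, word in enumerate(words, start=1):
            word_in_text += 1
            register.append({
                "word": word,
                "sentence": sentence_no,
                "word_in_sentence": word_in_sentence,
                "word_in_text": word_in_text,
            })
    return register

if __name__ == "__main__":
    for entry in segment_text("Saya makan nasi. Dia minum air."):
        print(entry)
```

Keeping both word numbers in one record mirrors the description, in which the register holds the position of a word relative to its sentence as well as relative to the whole input string.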
  • the register designated for storage of word information is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention.
  • the step 302 of grapheme to phoneme conversion of a given word comprises generally of the following steps: i.) a step of generating a phoneme unit for each phonetic symbol representative of a consonant in a given word of a given sentence of said Malay text input string 111 (refer to blocks 302a, 302cb, 302db of figure 6); and ii.) a step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of said Malay text input string 111 (refer to blocks 302a, 302ca, 302da of figure 6).
  • step 302a segmenting a given word of a given sentence of a given Malay text input string 111 into a plurality of graphemes (i.e. phonetic symbols); ii.) a step 302cb of identifying from the plurality of graphemes that result from the step of segmenting said given word, phonetic symbols representative of a consonant; iii. ) a step 302ca of identifying from the plurality of graphemes that result from the step of segmenting said given word, phonetic symbols representative of a vowel; iv.
  • a step 302db of mapping phonetic symbols representative of consonants to phoneme units representative of said consonants; v.) a step 302da of mapping phonetic symbols representative of vowels to phoneme units representative of said vowels; vi.) a step 302f of updating a register designated for storage of phoneme units representative of phonetic symbols that correspond to consonants and phoneme units representative of phonetic symbols that correspond to vowels, as well as phoneme information of each phoneme unit such as phoneme number within a given word of a given sentence, for every word and for every sentence of the given Malay input text string 111.
  • the register mentioned above when alluding to step 302f and designated for the storage of phoneme units representative of consonants and phoneme units representative of vowels with a given word as well as phoneme information such as phoneme number within a given word of a given sentence, for every word and for every sentence of the given Malay text input string 111 is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention.
  • mapping of phonetic symbols/graphemes representative of consonants to corresponding phoneme units is a direct and simple one to one mapping as evident in block 302db of figure 6. This fact is well understood and is evident to an ordinary person skilled in the art of Malay linguistics.
  • the mapping of phonetic symbols/graphemes representative of vowels to corresponding phoneme units, in a given word is not and cannot be a direct one to one mapping of a phoneme unit to a vowel, as single vowel can have a plurality of phoneme units associated with it depending on the context of the vowel in a given word, i.e. depending on the position of the vowel in a given word and the preceding or subsequent phonetic symbols that abut a given vowel in a given word.
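The split between the direct consonant mapping and the context dependent vowel mapping can be sketched as below. The sketch assumes single-letter graphemes and an identity mapping for consonants; the description states only that the consonant mapping is one to one, so the concrete table, the function name and the record layout are hypothetical.

```python
VOWEL_GRAPHEMES = set("aeiou")

def graphemes_to_phonemes(word):
    """Segment a word into graphemes (step 302a), classify each as a
    consonant or a vowel (steps 302cb, 302ca) and map consonants directly
    (step 302db). Vowels are left unresolved because their phoneme unit
    depends on the surrounding symbols."""
    records = []
    for phoneme_no, grapheme in enumerate(word.lower(), start=1):
        if grapheme in VOWEL_GRAPHEMES:
            records.append({"phoneme_no": phoneme_no, "grapheme": grapheme,
                            "kind": "vowel", "phoneme": None})
        else:
            # Assumption: the one-to-one consonant mapping is the identity.
            records.append({"phoneme_no": phoneme_no, "grapheme": grapheme,
                            "kind": "consonant", "phoneme": grapheme})
    return records

if __name__ == "__main__":
    for record in graphemes_to_phonemes("makan"):
        print(record)
```

The returned list plays the role of the register of step 302f: each entry carries the phoneme number within the word, and vowel entries would be completed by a context dependent lookup.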
  • FIG 7 there is shown a flow-chart detailing a method of vowel identification as executed by the method of grapheme to phoneme conversion illustrated in figure 6 and alluded to in the preceding discussion, in accordance to a method of context dependent label generation, in accordance to a preferable embodiment of a method of HMM based Malay text to Malay speech synthesis of the present invention. More particularly, alluding to the step of grapheme to phoneme 302 conversion above, the step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay input text string is as detailed by the following steps: i.
  • step of syllabification which is executed by the context dependent label generation unit 112 comprised within the synthesizing module 110 of the HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention will be detailed.
  • the step of syllabification 303 as alluded to in figure 4 is a step 303 executed by the context dependent label generation unit 112 that similar to the step of text input segmentation 301 and the step of grapheme to phoneme conversion 302, is executed for every word across all sentences of a given Malay text input string in order to obtain contextual information of all the phoneme units represented by said given Malay text input string.
  • aforementioned step of syllabification 303 which is executed for each word of a given sentence, for all words and all sentences of a given Malay text input string, is performed based on syllabification rules that are exclusive to Malay language and comprises the general steps of: i. ) a step of counting a number of syllables in a given word of a given sentence; and ii. ) a step of identifying and marking each syllable in a given word of a given sentence.
  • the step of identifying and marking each syllable 303b in a given word of a given sentence, across all sentences in a given Malay text input string, in accordance to a preferable embodiment of the method and system 100 of HMM based Malay text input to Malay speech synthesis of the present invention is unique by virtue of the uniqueness of the syllabification rules of Malay, which form the basis of the identification and marking of a syllable within a given word as executed by the context dependent label generation unit 112 in accordance to a preferable embodiment of the present invention.
  • the process of syllabification is essentially the process of marking and identifying of a syllable of given word and in the context of the HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention, further comprises steps of counting a number of syllables in a given word and updating syllable information in a designated register in an area of memory 22 of an exemplary computer system 20 housing the HMM Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
  • the step of syllabification 303 executed on each word of a given sentence, across all sentences in a Malay text string as also alluded to in figure 4, comprises of the following steps: i. ) a step 303a counting a number of syllables in a given word of a given sentence; ii. ) a step 303b of identifying and marking each syllable in a given word of a given sentence; and iii. ) a step 303c of updating a register designated for storage of syllabification data of a given word of a given sentence, for every word and for every sentence of the given Malay input text string 111 .
  • the step 303b of identification and marking of a syllable in a given Malay word of a given Malay sentence in a given Malay text input string comprises of the following steps: i.) a step 303ba of identifying a phonetic symbol representative of a vowel or diphthong of a given word in a given sentence of a given Malay input text string, starting from the rear of the word; ii.) a step 303bb of identifying a phonetic symbol representative of a consonant that supersedes in position referenced from the rear of the given word, the position of the identified phonetic symbol representative of the vowel or diphthong in the previous step; iii.
  • the register designated for the storage of syllabification information is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention.
  • Aforementioned syllabification information stored in said register designated for the storage of syllabification information include information such as syllable position and syllable number within a given word of a given sentence for all syllables in all words in all sentences of a given Malay text input string.
  • Aforementioned syllabification information stored in said register designated for the storage of syllabification information may further include information such as syllable position with a given sentence for all syllables and for all sentences of the given Malay text input string.
  • the context dependent label generation unit 112 compiles contextual information from the results of execution of the steps of text input segmentation 301 , grapheme to phoneme conversion 302 and syllabification 303 of a given Malay text input string 111.
  • the context dependent label generation unit 112 in accordance to a preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention, compiles syllabification information from the register alluded to in step 303bf of figure 9, compiles word information from the register alluded to in step 301 d of figure 5 and phoneme information from the register alluded to in step 302f of figure 6 to hence generate context dependent labels for each phoneme unit of every word in every sentence of a given Malay text input string 111.
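The compilation of word, syllable and phoneme information into one label per phoneme unit can be sketched as follows. The label layout used here (`prev-cur+next/P:../S:../W:..`) is a hypothetical format chosen for illustration; the description does not specify the textual form of a context dependent label. The input is assumed to be a sentence that has already been segmented and syllabified, with one character per phoneme unit.

```python
def context_labels(sentence):
    """sentence: list of words, each word a list of syllables, each
    syllable a string with one character per phoneme unit."""
    flat = []
    for word_no, word in enumerate(sentence, start=1):
        for syl_no, syllable in enumerate(word, start=1):
            for ph_no, phoneme in enumerate(syllable, start=1):
                flat.append((phoneme, ph_no, len(syllable),
                             syl_no, len(word), word_no, len(sentence)))
    labels = []
    for i, (ph, ph_no, ph_cnt, syl_no, syl_cnt, w_no, w_cnt) in enumerate(flat):
        # "sil" marks the sentence boundary (an assumption of this sketch).
        prev_ph = flat[i - 1][0] if i > 0 else "sil"
        next_ph = flat[i + 1][0] if i + 1 < len(flat) else "sil"
        labels.append(
            f"{prev_ph}-{ph}+{next_ph}"
            f"/P:{ph_no}_{ph_cnt}"      # phoneme position in syllable
            f"/S:{syl_no}_{syl_cnt}"    # syllable position in word
            f"/W:{w_no}_{w_cnt}"        # word position in sentence
        )
    return labels

if __name__ == "__main__":
    for label in context_labels([["sa", "ya"], ["ma", "kan"]]):
        print(label)
```

Each label combines the neighbouring phoneme units with the positional information drawn from the three registers, which is the contextual information the description says the labels incorporate.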
  • the abovementioned sequence of steps is executed by the context dependent label generation unit 112 software module. More particularly, as would be readily understood by a person of ordinary skill in the art of programming, in the exemplary computing environment of the computer system 20 alluded to in figure 1 , said method of context dependent label generation alluded to in figures 4 to 9, is executed by a central processing unit (CPU) 21 of said computer system 20 that reads and executes computer readable and executable instructions that embody the context dependent label generation unit 112 software module in accordance to a preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a HMM (Hidden Markov Model) based Malay text to Malay speech synthesis system (100) comprising of a plurality of software modules (105, 110) in the form of computer readable instructions residing in memory (22) of a computer (20) and executed by a central processing unit (CPU) (21) of said computer (20), the system (100) comprising of a training module (105) for the training of a plurality of HMM (Hidden Markov Model) statistical predictive models residing in a statistical parameter model database (113) to generate an acoustic speech model of Malay speech and a synthesizing module (110) for synthesizing Malay speech from a Malay text input string (111) utilizing context dependent labels generated from the Malay text input string (111) and the acoustic model for Malay speech; characterized in that, the synthesizing module (110) includes a context dependent label generation unit (112) that automatically generates context dependent labels for Malay language from a Malay language input text string (111), derived in part from syllabification rules that are exclusive to Malay language.

Description

METHOD AND SYSTEM FOR TEXT TO SPEECH SYNTHESIS
The present invention generally relates to telecommunications systems and methods. More particularly the present invention relates to an improved method and system of text to speech synthesis. Most particularly, the present invention relates to an improved method and system of Malay language text to speech synthesis.
BACKGROUND OF THE INVENTION
More particularly, the state of the art of speech synthesizing systems is populated by three general categories of speech synthesizers, namely articulatory speech synthesizers, formant speech synthesizers, and concatenative speech synthesizers. Articulatory speech synthesizers are based on the physics of sound generation in the vocal apparatus. Individual parameters related to the position and movement of the vocal cords are provided and the sound generation therefrom is determined according to the laws of classical physics. Due to the complex nature of the physics involved and the difficulties that have to be overcome to mathematically model said complicated physics, practical realization of said articulatory type of speech synthesizers appears to be a distant prospect.
The second type of speech synthesizer, the formant synthesizer, unlike said articulatory speech synthesizers, does not use mathematical models of the physical laws that govern speech formation in the vocal cords, but instead models acoustic features of speech or spectra of a speech signal. Aforementioned formant synthesizers more particularly utilize a set of rules in combination with said spectra or acoustic features of speech signals to effectively synthesize speech. In formant synthesizers, a phoneme is modelled with "formants" wherein each formant has a distinct frequency "trajectory" and a distinct bandwidth which varies over the duration of the phoneme. An audio signal is synthesized by using the frequency and bandwidth trajectories to control a formant synthesizer. While formant synthesizers can achieve high intelligibility, their "naturalness" is typically low, since it is very difficult to accurately describe the process of speech generation by a set of rules.
The third and most popular and feasible type of speech synthesizer is the concatenative speech synthesizer. Concatenative speech synthesizers for generating speech from text input operate on an entirely different principle. More particularly, aforementioned concatenative systems operate using pre-recorded actual speech which forms a large database or corpus. The corpus is segmented based on phonological features. The drawback of concatenative speech synthesizing systems, similar to formant synthesizers, is a lack of "naturalness" despite good intelligibility. Nevertheless, concatenative speech synthesizers provide a higher degree of "natural" sounding synthesized speech compared to either formant synthesizers or articulatory synthesizers.
The decrease in "naturalness" in sound quality of the synthesized speech in aforementioned concatenative synthesizers is particularly pronounced for under-resourced languages such as Malay, Bengali and Vietnamese, which coincidentally are amongst the 20 most widely spoken languages in the world and which, as a result of being under-resourced, have to rely on resources for other more popular and sufficiently resourced languages such as English. This is particularly true for concatenative speech synthesizers that fall under the class of HMM (Hidden Markov Model) text input to speech synthesizers. Specifically, HMM text input to speech synthesizers for a given language require resources like segmental labels that provide contextual information of phonemic units, which for the Malay language are at the moment unavailable by virtue of it being an under-resourced language. Hence in view of a deficiency of said speech synthesis resources, the state of the prior art fails to disclose a HMM text to speech synthesis system tailored to the Malay language, as said resources (i.e. segmental labels and duration model) are required for use by a HMM text to speech synthesis system to hence yield a HMM text to speech synthesis system that is adapted to receive Malay input and synthesize Malay speech output which is intelligible and natural sounding.
Malay is a widely spoken language that is spoken by more than 150 million people in Malay-speaking countries such as Malaysia, Indonesia, Brunei, Singapore and Southern Thailand. Hence, it is apparent that there is a large market for a HMM (Hidden Markov Model) text input to speech synthesis system for providing high quality Malay language text input to Malay speech synthesis. Moreover, Malay language text input to Malay speech synthesis may find utility in a wide range of applications that include teaching and reading aids for the visually impaired. Thus far, the state of the prior art of HMM-based speech synthesis systems fails to disclose a HMM-based speech synthesis system for under-resourced languages and more particularly, the state of the prior art fails to disclose a HMM-based speech synthesis system tailored to Malay text input to speech output. The development of a HMM-based speech synthesis system tailored to Malay text input to speech output requires the development of speech synthesis resources that include a speech database and segmental labels. The development of a speech database can be easily conceived by a person skilled in the art of speech synthesis. The challenge in the development of Malay text input to speech synthesis systems, however, is the generation of segmental phonetic labels, and more particularly the generation of context dependent labels. Since the Malay language is an under-resourced language, context dependent labelling to produce context dependent labels for the Malay language has to be done manually. This is expensive both in terms of time and man-power.
As such, it would be advantageous and desirable if an automatic context dependent label generation unit is conceived and integrated in a HMM speech synthesizing system, to enable the automatic generation of context dependent labels for use in a HMM speech synthesizing system to hence enable Malay text input to speech synthesis that has a high degree of intelligibility and is further very natural sounding.
SUMMARY OF THE INVENTION
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. This summary is not intended to identify key/critical elements or essential features of the claimed subject matter. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later. In one aspect, the present invention provides a HMM (Hidden Markov Model) based speech synthesis system comprising of a plurality of software modules in the form of computer readable instructions residing in memory of a computer and executed by a central processing unit (CPU) of said computer, the system including: a training module for training of a plurality of HMM (Hidden Markov Model) statistical predictive models to generate an acoustic speech model of Malay speech; and a synthesizing module for synthesizing Malay speech from Malay text input utilizing context dependent labels generated from Malay text input and the acoustic model for Malay speech; characterized in that; the speech synthesis module includes a context dependent label generation unit that automatically generates context dependent labels that correspond to phoneme units of Malay language from Malay language input text string, derived in part from syllabification rules that are exclusive to Malay language.
In accordance to a preferable embodiment of the HMM (Hidden Markov Model) based Malay text to Malay speech synthesis system of the present invention, aforementioned training module comprises of a database housing a Malay speech corpus that includes a Malay language speech model embodied in the form of speech recordings of speech signals of a native Malay speaker and corresponding phonetic symbols representative of said speech recordings of a Malay native speaker, and a parametrization module which extracts and/or generates parameters that include spectral parameters and excitation parameters for training of a plurality of HMM (Hidden Markov Model) probabilistic models.
In accordance to a preferable embodiment of the HMM based Malay text to Malay speech synthesis system of the present invention, aforementioned synthesizing module, comprises a statistical parameter model database that includes a plurality of HMM probabilistic models that in combination represent an acoustic speech model derived from the plurality of excitation and spectral parameters generated by the parametrization module of the training module. Said acoustic speech model housed in the statistical parameter model database and the plurality of context dependent labels generated from a Malay text input string and that correspond to a given plurality of phoneme units of said Malay text input string, are utilized to generate acoustic speech parameters representative of Malay speech corresponding to said Malay text input string. The synthesizing module further includes a speech synthesis module which correlates the plurality of context dependent labels of a given plurality of phoneme units corresponding to the input text string and the acoustic speech model defined by the plurality of HMMs residing in the statistical parameter model database to generate a corresponding plurality of acoustic parameters that correspond to said given plurality of phoneme units of the Malay text input string. Said synthesis module of the synthesizing module further concatenates the acoustic speech parameters representative of Malay speech corresponding to the input Malay text string to synthesize Malay speech. In accordance to a preferable embodiment of the HMM based Malay text to Malay speech synthesis system of the present invention, the automatic context dependent label generation unit, generates context dependent labels from a Malay text input string that include phoneme duration information and POS (Part of speech tagging).
In accordance to a preferable embodiment of the HMM based speech synthesis system of the present invention, the speech synthesis module of the synthesizing module includes a filter such as a Mel-Log Spectrum Approximation (MLSA) filter.
In accordance to a preferable embodiment of the HMM based speech synthesis system of the present invention, the acoustic speech model formed by the plurality of HMM probabilistic models is a probabilistic model in the form of a plurality of decision trees. In accordance to a preferable embodiment of the HMM based speech synthesis system of the present invention, spectral parameters generated and/or extracted by the parametrization module of the training module include fundamental frequency (F0) and Mel-Cepstral (MCEPs) Component coefficients.
In accordance to a preferable embodiment of the HMM based speech synthesis system of the present invention, the excitation parameters generated and/or extracted by the parametrization module of the training module include the log of the fundamental frequency of speech signals (log F0), the delta coefficients which represent the first order differential of MCEPs (Mel-Cepstrals) of speech signals, and the delta-delta coefficients which represent the second order differential of MCEPs (Mel-Cepstrals) of speech signals.
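The delta and delta-delta coefficients mentioned above can be illustrated with a short sketch. The description does not state which differencing windows are used; the windows below (a centred first difference and a three-point second difference, with the edge frames repeated) are an assumption, chosen because they are a common choice for dynamic features. The functions operate on one coefficient track, i.e. the value of a single parameter over successive frames.

```python
def delta(track):
    """First order differential: 0.5 * (c[t+1] - c[t-1])."""
    n = len(track)
    return [0.5 * (track[min(t + 1, n - 1)] - track[max(t - 1, 0)])
            for t in range(n)]

def delta_delta(track):
    """Second order differential: c[t-1] - 2*c[t] + c[t+1]."""
    n = len(track)
    return [track[max(t - 1, 0)] - 2.0 * track[t] + track[min(t + 1, n - 1)]
            for t in range(n)]

if __name__ == "__main__":
    frames = [0.0, 1.0, 4.0, 9.0]   # made-up values for one coefficient
    print(delta(frames))
    print(delta_delta(frames))
```

In a full system the same computation would be repeated for every coefficient and for log F0, and the static, delta and delta-delta values of a frame would be stacked into one observation vector for HMM training.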
In accordance to another aspect, the present invention provides a method for Malay speech synthesis comprising of: i. ) a step of extracting and/or generating a plurality of parameters representative of a Malay speech model of a native Malay speaker whose speech recordings of speech signals and corresponding phonetic symbols representative of said speech recordings are stored in a speech corpus, that include spectral parameters and excitation parameters; ii. ) a step of training a plurality of HMMs (Hidden Markov Models) which represent a plurality of probabilistic models, utilizing the spectral and excitation parameters extracted and/or generated from the Malay speech model and representative of said Malay speech model to generate a Malay speech acoustic model; iii. ) a step of generating a plurality of context dependent labels for a plurality of phoneme units from an arbitrary Malay text input string, said plurality of context dependent labels comprising contextual information pertaining to said plurality of phoneme units of said arbitrary Malay text input string such as part of speech (POS) tagging; iv.) a step of utilizing said generated plurality of context dependent labels representative of a plurality of contextual information of a plurality of phoneme units present in said arbitrary Malay text input of the previous step, to obtain a plurality of acoustic parameters of Malay speech based on a generated Malay speech acoustic model, said obtained plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels; and v.) a step of synthesizing Malay speech based on the plurality of acoustic parameters obtained in the previous step, by concatenating said plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels that in turn correspond to said Malay text input string of a preceding step.
In accordance to a preferable embodiment of the method of HMM based Malay text input to Malay speech synthesis of the present invention, the step of generating a plurality of context dependent labels corresponding to a plurality of phoneme units of an arbitrary Malay input text string, comprises of: i.) a step of segmenting of an arbitrary Malay text input string into a plurality of words and sentences, in which the position of each word in a given sentence of the Malay text input string is identified; ii.) a step of executing grapheme to phoneme conversion for each identified word in a given sentence of the Malay text input string, in which graphemes of each word are converted to corresponding phoneme units; iii.) a step of syllabification in which each word identified in a given sentence of the given Malay text input string, is segmented into identifiable syllables, in which a position of each syllable in a given word of a given sentence is identified; iv.) a step of compiling contextual information of each phonetic symbol of the Malay text input string from the results of the step of text input segmentation, the step of grapheme to phoneme conversion and the step of syllabification; and v.) a step of generating a plurality of context dependent labels for each phonetic symbol of the Malay text input string which are in essence acoustic labels that incorporate the compiled contextual information of the previous step and that correspond to the Malay text input string.
In accordance to a preferable embodiment of the method of HMM based Malay text to Malay speech synthesis of the present invention, the step of grapheme to phoneme conversion comprises of the following steps listed in no particular order: i.) a step of generating a phoneme unit for each phonetic symbol representative of a consonant in a given word of a given sentence of a given Malay text input string; and ii.) a step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay text input string.
Referring to the step of grapheme to phoneme conversion above, the step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay text input string is detailed by the following steps:
i.) a step of identifying at least three preceding phonetic symbols representative of three preceding consonants that precede a given phonetic symbol representative of a vowel identified in a given word of a given sentence of a given Malay text input string;
ii.) a step of identifying at least three subsequent phonetic symbols representative of three subsequent consonants that are located subsequent to the given phonetic symbol representative of the vowel identified in the given word of the given sentence of the given Malay text input string; and
iii.) a step of correlating said phonetic symbol representative of the vowel identified, said at least three phonetic symbols representative of the consonants preceding said phonetic symbol representative of the vowel, and said at least three phonetic symbols representative of the consonants subsequent to said phonetic symbol representative of the vowel, to a state of a binary decision tree in which each state of said decision tree corresponds to a phoneme unit, whereby the correlation of said phonetic symbol representative of the vowel identified to the state of said decision tree which corresponds to a phoneme unit results in the generation of said phoneme unit.
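The vowel conversion steps above can be illustrated with a minimal Python sketch. It is not the decision tree of the invention: the tree below, the padding symbol `#`, the helper names `consonant_context` and `walk`, and the output symbols `e` and `@` are all hypothetical placeholders, and the two questions in the toy tree carry no linguistic validity. The sketch also assumes one reading of the text, namely that the context consists of the three nearest consonants on each side of the vowel.

```python
from dataclasses import dataclass
from typing import Callable, List, Union

PAD = "#"                      # hypothetical filler when fewer than three consonants exist
VOWEL_LETTERS = set("aeiou")   # assumption: plain Malay vowel letters


def consonant_context(symbols: List[str], index: int, width: int = 3):
    """Collect the `width` nearest consonants before and after symbols[index]."""
    before = [s for s in symbols[:index] if s not in VOWEL_LETTERS][-width:]
    after = [s for s in symbols[index + 1:] if s not in VOWEL_LETTERS][:width]
    before = [PAD] * (width - len(before)) + before
    after = after + [PAD] * (width - len(after))
    return before, after


@dataclass
class Node:
    """One yes/no question of a binary decision tree; leaves are plain strings."""
    question: Callable[[List[str], List[str]], bool]
    yes: Union["Node", str]
    no: Union["Node", str]


def walk(tree: Union[Node, str], before: List[str], after: List[str]) -> str:
    """Descend from the root until a leaf (a phoneme unit) is reached."""
    node = tree
    while isinstance(node, Node):
        node = node.yes if node.question(before, after) else node.no
    return node


# Toy tree for the letter "e" only; the questions are illustrative placeholders.
TOY_E_TREE = Node(
    question=lambda before, after: after[0] == PAD,       # no consonant follows
    yes="e",
    no=Node(
        question=lambda before, after: before[-1] == PAD,  # no consonant precedes
        yes="e",
        no="@",
    ),
)

symbols = list("sekolah")
before, after = consonant_context(symbols, 1)
print(before, after, walk(TOY_E_TREE, before, after))
```

In a trained system the questions and leaves would come from data rather than being written by hand; the sketch only shows how a vowel and its consonant context select one leaf.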
In accordance to a preferable embodiment of the HMM based speech synthesis system of the present invention, the step of syllabification of each word identified in a given sentence comprises of: i. ) a step of counting a number of syllables in a given word of a given sentence; and ii. ) a step of identifying and marking each syllable in a given word of a given sentence.
Referring to the step of syllabification above, in accordance with a preferable embodiment of the method of HMM based speech synthesis of the present invention, the step of identifying and marking each syllable in a given word of a given sentence is detailed by the following steps:
i.) a step of identifying a phonetic symbol representative of a vowel or diphthong of a given word in a given sentence of a given Malay text input string, starting from the rear of the word;
ii.) a step of identifying a phonetic symbol representative of a consonant that, referenced from the rear of the given word, comes next after the position of the phonetic symbol representative of the vowel or diphthong identified in the previous step (that is, the consonant that precedes it in reading order);
iii.) a step of marking a boundary of a syllable from the rear of the given word, in which a syllable is delineated by a combination of a phonetic symbol representative of a consonant followed (as referenced from the beginning of a syllable) by a phonetic symbol representative of a vowel or a diphthong;
iv.) a step of determining whether all phonetic symbols representative of all vowels or diphthongs in a given word have been identified;
v.) repeating steps (i) to (iv) for a subsequent phonetic symbol representative of a vowel or diphthong in the given word from the rear of said given word, if not all phonetic symbols representative of vowels or diphthongs in said given word have been identified; and
vi.) repeating steps (i) to (v) for all words of a given sentence of a given Malay text input string.
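The rear-to-front syllable marking described above may be sketched in Python as follows. This is an illustrative sketch only, under stated assumptions: the word is already split into phonetic symbols, the `VOWELS` set (including the diphthongs `ai`, `au`, `oi`) is a hypothetical inventory, and the handling of consonants that follow a vowel (attached to the same syllable as a coda) and of leftover word-initial consonants is a choice made for the sketch, since the text does not specify it.

```python
from typing import List

# Assumption: vowels and diphthongs arrive as single symbols.
VOWELS = {"a", "e", "i", "o", "u", "ai", "au", "oi"}


def syllabify_from_rear(symbols: List[str]) -> List[List[str]]:
    """Mark syllables by scanning a word's phonetic symbols from the rear."""
    syllables = []
    end = len(symbols)          # exclusive right edge of the syllable being built
    i = len(symbols) - 1
    while i >= 0:
        if symbols[i] in VOWELS:                       # step (i): find a vowel
            start = i
            if i > 0 and symbols[i - 1] not in VOWELS:
                start = i - 1                          # step (ii): consonant before it
            syllables.append(symbols[start:end])       # step (iii): mark the boundary
            end = start
            i = start - 1                              # steps (iv)-(v): keep scanning
        else:
            i -= 1
    if end > 0:
        # Leftover leading consonants: attach to the first syllable (sketch's choice).
        if syllables:
            syllables[-1] = symbols[:end] + syllables[-1]
        else:
            syllables.append(symbols[:end])
    syllables.reverse()
    return syllables


for word in (list("makan"), list("sekolah"), ["p", "a", "n", "t", "ai"]):
    parts = syllabify_from_rear(word)
    print(len(parts), ["".join(p) for p in parts])
```

The length of the returned list gives the syllable count of the counting step, and the index of each sub-list gives the position of the syllable within the word.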
It is an advantage of the present invention to provide a HMM based text to speech synthesis system and method adapted for Malay text input to Malay speech output.
It is another advantage of the present invention to provide a HMM based text to speech synthesis system and method for Malay text input to Malay speech output such that the synthesized Malay speech has a high degree of intelligibility and is natural sounding.
It is an advantage of the present invention to provide an automatic context dependent label generation unit for automatic context dependent label generation from Malay text input, for synthesizing of Malay Speech.
It is yet another advantage of the present invention to provide a method for automatic context dependent label generation for Malay text to speech synthesizing applications.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other objects, features and other advantages of the present invention will be more clearly understood from the detailed description taken in conjunction with the accompanying drawings, in which:
Figure 1 is a block diagram of an exemplary computing environment for implementing a HMM based Malay text input to Malay speech synthesis (TTS) system in accordance with a preferable embodiment of the present invention;
Figure 2 is a block diagram illustrating a schematic view of a general software architecture of a concatenative HMM based Malay text to Malay speech synthesis system in accordance to a preferable embodiment of the present invention;
Figure 3 is a flow-chart detailing general steps executed by a method of HMM based Malay text to Malay speech synthesis of the HMM based Malay text to Malay speech synthesis system in accordance to a preferable embodiment of the present invention;
Figure 4 is a flow-chart detailing a method of context dependent label generation in accordance to a method of HMM based text to speech synthesis for Malay language by way of a context dependent label generation unit software module, in accordance to a preferable embodiment of the present invention;
Figure 5 is a flow-chart detailing a method of Malay text input segmentation in accordance to a preferable embodiment of the method of context dependent label generation detailed in the flow-chart of figure 4;
Figure 6 is a flow-chart detailing a method of grapheme to phoneme conversion in accordance to a preferable embodiment of the method of context dependent label generation detailed in the flow-chart of figure 4;
Figure 7 is a flow-chart detailing a method of phoneme unit identification associated with a vowel, as executed by the method of grapheme to phoneme conversion illustrated in figure 6, in accordance to a method of context dependent label generation, in accordance to a preferable embodiment of the HMM based Malay text to Malay speech synthesis of the present invention;
Figure 8 is a flow-chart detailing a method of syllabification in accordance to a method of context dependent label generation detailed in the flow-chart of figure 4; and
Figure 9 is a flow-chart detailing a method of syllable identification and marking as executed by the method of syllabification illustrated by the flow-chart of figure 8, in accordance to a method of context dependent label generation, in accordance to a preferable embodiment of the HMM based Malay text input to speech synthesis of the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The detailed description set forth below in connection with the appended drawings is intended as a description of an exemplary embodiment and is not intended to represent the only form in which the embodiment may be constructed and/or utilized. The description sets forth the functions and the sequence for constructing the exemplary embodiment. However, it is to be understood that the same or equivalent functions and sequences may be accomplished by different embodiments that are also intended to be encompassed within the scope of this disclosure.
Figure 1 and the related discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. Although not required, the invention will be described, at least in part, in the general context of computer-executable instructions, such as program modules, being executed by a personal computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multi-processor systems, micro-processor-based or programmable consumer electronics, network PCs, minicomputers, main-frame computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.
More particularly, figure 1 illustrates an exemplary system for implementing the invention, which includes a general purpose computing device in the form of a conventional personal computer 20, including a central processing unit (CPU) 21, a system memory 22, referred to hereinafter as memory, and a system bus 23 that couples various system components including the system memory 22 to the processing unit 21. The system bus 23 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. The system memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system (BIOS) 26, containing a basic routine that helps to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27 for reading from and writing to a hard disk (not shown), a magnetic disk drive 28 for reading from or writing to a removable magnetic disk 29, and an optical disk drive 30 for reading from or writing to a removable optical disk 31 such as a CD-ROM. The hard disk drive 27, magnetic disk drive 28 and the optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical disk drive interface 34, respectively. The drives and the associated computer readable media provide non-volatile storage of computer readable instructions, data structures, program modules and other data for the personal computer 20. Although the exemplary environment described herein employs the hard disk, the removable magnetic disk 29 and the removable optical disk 31, it should be appreciated by those skilled in the art that other types of computer readable media which can store data that is accessible by a computer, such as removable flash memories, may also be used in the exemplary operating environment.
A number of program modules may be stored on the hard disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through input devices such as a keyboard 40, pointing device 42 and a microphone 43. These input devices are connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus 23, but may be connected by other interfaces, such as a parallel port and/or a universal serial bus (USB) port. A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor 47, personal computers may typically include other peripheral output devices, such as a speaker 45 and printers (not shown).
The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be another personal computer, a server, a router, a network PC, a peer device or other network node, and typically includes many or all of the elements described above relative to the personal computer 20, although only a memory storage device 50 has been illustrated in figure 1. The logical connections depicted in figure 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer network intranets and the Internet.
When used in a LAN networking environment, the personal computer 20 is connected to the local area network 51 through a network interface or adapter 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the wide area network 52 such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a network environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage devices. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.
Having described an exemplary computing environment with reference to the personal computer 20 of figure 1, which houses in an area of system memory or memory 22 a HMM based Malay text to Malay speech synthesis system 100 in accordance to preferable embodiments of the present invention, we will now proceed with a description of an exemplary embodiment of the HMM (Hidden Markov Model) based speech synthesis system 100 of the present invention as shown in figure 2. The system 100 is specifically adapted for Malay text input to Malay speech synthesis, and includes a context dependent label generation unit 112 software module conceived to automatically generate context dependent labels of phoneme units that correspond to a given Malay text input 111.
Hence the reader is advised to note that the present invention is specifically directed toward a HMM based Malay text to Malay speech synthesis system 100 that is characterized in that the system 100 comprises of a context dependent label generation unit 112 that generates context dependent labels for phoneme units that correspond to a given Malay text input 111 , derived in part from syllabification rules that are exclusive to Malay language.
More particularly, with reference to figure 2, there is shown a diagram illustrating an embodiment of a HMM based speech synthesis system 100 for Malay speech synthesis from Malay text input, in accordance to a preferable embodiment of the present invention. The fundamental building blocks of said HMM based speech synthesis system 100 for Malay speech synthesis from a Malay text input string 111 include a training module 105, which comprises a speech corpus 106 and a parametrization module 107, as well as a synthesizing module 110, which comprises a speech synthesis module 114 and a statistical parameter model database 113 housing a plurality of HMM probabilistic models that in combination form an acoustic speech model of Malay speech. The training module 105 is utilized to train the synthesizing module 110 of the HMM based speech synthesis system 100 in accordance to a preferable embodiment of the present invention. More particularly, the training module 105 is utilized to train a plurality of HMMs (Hidden Markov Models) which are in essence probabilistic models in the form of a plurality of decision trees. Each HMM (Hidden Markov Model) is in essence a decision tree that contains states assigned probabilities, in which each state corresponds to a context dependent label which contains contextual information pertaining to a given phoneme unit that corresponds to a given word in a given sentence of a given Malay text input string. A context dependent label provides contextual information pertaining to a given phonetic symbol which, when considered in combination, corresponds to a given phoneme unit.
As may be understood by one of ordinary skill in the art of HMM based text to speech synthesis systems, the plurality of HMMs (Hidden Markov Models) represent probabilistic models that model a given speaker's acoustic speech and hence represents an acoustic speech model of a given speaker's speech which is used in a given HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
The training module 105 comprises a speech corpus 106 and a parametrization module 107. The speech corpus 106 may comprise recordings of speech signals of a given speaker's speech as well as phonetic symbols representative of said recordings that have been chosen to cover sounds made in Malay language in the context of syllables and words that make up the vocabulary of the Malay language. The parametrization module 107 serves to extract and/or generate parameters such as spectral parameters and excitation parameters of the recorded speech (comprising recordings of speech signals). Said spectral and excitation parameters, together with the plurality of phonetic symbols stored with and representative of the speech recordings, form a speech model and are used to train the plurality of HMMs. After the completion of training, the plurality of HMMs in combination acoustically model the speech of the speaker whose speech recordings are stored in the speech corpus 106 and utilized in the HMM based Malay text input to speech synthesis system 100 in accordance to a preferable embodiment of the present invention for Malay text input to Malay speech synthesis. The aforementioned plurality of HMMs form a statistical parametric model housed in a statistical parameter model database 113 and effectively represent a speech acoustic model of a native Malay speaker's speech, said statistical parameter model database 113 forming part of a synthesizing module 110 of the HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
In accordance to a preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention, the aforementioned parametrization module 107 serves to extract and/or generate spectral parameters of said speech recordings of speech signals stored in the speech corpus 106, such as the fundamental frequency (F0) of speech signals corresponding to said speech recordings stored in the speech corpus 106 and the Mel-Cepstral coefficients (MCEPs) of said speech signals. The aforementioned parametrization module 107 further serves to extract and/or generate excitation parameters such as the logarithm of the fundamental frequency (Log F0) and the delta (first order derivative) and delta-delta (second order derivative) of the Mel-Cepstral coefficients (MCEPs) of said speech signals corresponding to said speech recordings stored in the speech corpus 106.
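The derived parameters named above (Log F0, delta and delta-delta) can be illustrated with a short Python sketch. The text does not state which regression windows are used, so the sketch assumes the commonly used three-frame windows (-0.5, 0, 0.5) for the delta and (1, -2, 1) for the delta-delta, with the first and last frames repeated at the edges; the function names and the use of `None` for unvoiced frames are likewise assumptions of the sketch.

```python
import math
from typing import List, Optional, Sequence


def log_f0(f0_track: Sequence[float]) -> List[Optional[float]]:
    """Natural log of F0 per frame; unvoiced frames (F0 <= 0) are marked None."""
    return [math.log(f) if f > 0 else None for f in f0_track]


def _pad(seq: Sequence[float]) -> List[float]:
    # Repeat the first and last frame so every frame has two neighbours.
    return [seq[0]] + list(seq) + [seq[-1]]


def delta(seq: Sequence[float]) -> List[float]:
    """First order derivative with the assumed window (-0.5, 0, 0.5)."""
    p = _pad(seq)
    return [0.5 * (p[t + 2] - p[t]) for t in range(len(seq))]


def delta_delta(seq: Sequence[float]) -> List[float]:
    """Second order derivative with the assumed window (1, -2, 1)."""
    p = _pad(seq)
    return [p[t] - 2.0 * p[t + 1] + p[t + 2] for t in range(len(seq))]


# One coefficient trajectory over four frames (made-up numbers for illustration).
trajectory = [1.0, 2.0, 4.0, 7.0]
print(delta(trajectory))        # [0.5, 1.5, 2.5, 1.5]
print(delta_delta(trajectory))  # [1.0, 1.0, 1.0, -3.0]
```

In practice each Mel-Cepstral coefficient has its own trajectory over time, and the same two functions would be applied to each trajectory in turn.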
In accordance to a preferable embodiment of the HMM based Malay text input to speech synthesizing system 100 of the present invention, the speech corpus 106 of the training module 105 is a Malay neutral speech database which consists of 1,000 recorded utterances uttered by two Malay native speakers, one male and the other female. The 1,000 recorded speech utterances were constructed from 1,000 Malay sentences that contain a representative range of Malay words, syllables and phonemes, taken from different sources that include Malay newspapers (43%), educational text books (39%), and other general reading materials (18%). The aforementioned 1,000 Malay sentences were chosen to ensure phonetic diversity of the Malay speech corpora comprised within the speech corpus 106 of the training module 105 of the HMM based Malay text input to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention. The aforementioned 1,000 Malay sentences, which correspond to the aforementioned 1,000 speech utterances, in an exemplary embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention, include 577 short sentences (less than seven words) and 412 long sentences (seven words or more), with a range from 3 to 12 words per sentence. The Malay speech corpora of the speech corpus 106 thus contain 1,000 speech utterances corresponding to the aforementioned 1,000 Malay sentences, which are made up of 2,763 word types, 12,666 syllables and 39,996 phoneme units.
As mentioned in a preceding paragraph of this detailed description, the synthesizing module 110 comprises a statistical parameter model database 113 which houses said plurality of HMMs which effectively form a speech acoustic model of a native Malay speaker whose speech recordings and corresponding phonetic symbols representing said speech recordings are utilized to synthesize Malay speech from Malay text input in accordance to preferable embodiments of the HMM based Malay text to Malay speech synthesis system 100 of the present invention. The aforementioned synthesizing module 110 further includes a speech synthesis module 114 that serves to extract relevant acoustic parameters (i.e. phoneme units) that are representative of context dependent labels of given phoneme units in a given text input, said context dependent labels being automatically generated by a context dependent label generation unit 112. Subsequent to extraction of said acoustic parameters, the speech synthesis module 114 concatenates said acoustic parameters to effectively synthesize Malay speech corresponding to Malay text input by a user. Said speech synthesis module 114 of the synthesizing module 110 further includes a synthesis filter that, in accordance to a preferable embodiment of the HMM based Malay text input to Malay speech synthesis system 100 of the present invention, is a Mel Log Spectrum Approximation (MLSA) filter.
Aforementioned synthesizing module 110 readily accepts a Malay text input string 111 from a user via a user input peripheral device such as a keyboard 40 in an exemplary computer system 20 environment that houses the HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention and outputs synthesized Malay speech through an output device such as speaker 45 in an exemplary computer system 20 environment that houses said HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention.
Thus far, a general description of the components of the HMM based Malay text to Malay speech synthesis system 100 of the present invention has been provided. A detailed description, particularly that pertaining to the extraction and/or generation of spectral parameters and excitation parameters by the parametrization module 107 and utilization of said spectral and excitation parameters to train a plurality of HMM (Hidden Markov Models) probabilistic models residing in a statistical parameter model database 113 of a synthesizing module 110 to hence form a speech acoustic model of a native Malay speaker whose speech is utilized in Malay text to Malay speech synthesis in accordance to preferable embodiments of HMM based Malay text to Malay speech synthesis systems 100 of the present invention, will not be elaborated on any further and described in detail as it falls under the ambit of prior art and is already well known by persons skilled in the art of HMM based speech synthesis. Moreover a detailed description of how the plurality of HMMs residing in the statistical parameter model database 113 of the synthesizing module 110 actually provide for context dependent modelling of a native Malay speaker, as well as how the synthesis module 114 of the synthesizing module 110 extracts and correlates acoustic parameters (i.e. in the form of phoneme units) to context dependent labels representative of phoneme units of a given Malay input text string to hence synthesize speech, will not be elaborated on any further and described in detail as it falls under the ambit of prior art and is already well known by persons skilled in the art of HMM based speech synthesis. 
As mentioned in a preceding paragraph of this detailed description, the HMM (Hidden Markov Model) based Malay text to Malay speech synthesis system 100 of the present invention as shown in figure 2 is specifically directed to a HMM based speech synthesis system 100 adapted for Malay text input to Malay speech synthesis in which a context dependent label generation unit 112 software module is conceived to automatically generate context dependent labels of phoneme units of a given Malay text input string 111. The HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention is characterized in that the system 100 comprises a context dependent label generation unit 112 that generates context dependent labels for phoneme units of a Malay text input string 111, derived in part from syllabification rules that are exclusive to Malay language, to hence provide for Malay text to Malay speech synthesis that is more natural sounding and has a high degree of intelligibility.
With reference to figure 3, the HMM based Malay text to Malay speech synthesis system 100 comprising said training module 105 and synthesizing module 110 executes a method of HMM based Malay text to Malay speech synthesis 200 in accordance to preferable embodiments of the present invention, comprising:
i.) a step 201 of extracting and/or generating a plurality of parameters, including spectral parameters and excitation parameters, representative of a Malay speech model of a native Malay speaker whose speech recordings of speech signals and corresponding phonetic symbols representative of said speech recordings are stored in a Malay speech corpus;
ii.) a step 202 of training a plurality of HMMs (Hidden Markov Models) which represent a plurality of probabilistic models, utilizing the spectral and excitation parameters extracted and/or generated from the Malay speech model and representative of said Malay speech model to generate a Malay speech acoustic model;
iii.) a step 203 of generating a plurality of context dependent labels for a plurality of phoneme units of an arbitrary Malay text input string 111, said plurality of context dependent labels comprising contextual information pertaining to said plurality of phoneme units of said arbitrary Malay text input string 111 such as part of speech (POS) tagging;
iv.) a step 204 of utilizing said generated plurality of context dependent labels representative of a plurality of contextual information of a plurality of phoneme units present in said arbitrary Malay text input string 111 of the previous step 203, to obtain a plurality of acoustic parameters of Malay speech based on a generated Malay speech acoustic model, said obtained plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels; and
v.) a step 205 of synthesizing Malay speech based on the plurality of acoustic parameters obtained in the previous step 204, by concatenating said plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels that in turn correspond to said Malay text input string 111 of a preceding step.
As mentioned earlier, the HMM based Malay text to Malay speech synthesis system 100 of the present invention is characterized by the provision of automatic context dependent label generation by a context dependent label generation unit 112 comprised in the synthesizing module 110. With reference to figure 4, said context dependent label generation unit 112 is a software module residing in an exemplary computing environment, in an area of memory 22 of a computer system 20, that generates context dependent labels of phoneme units of a given Malay text input string 111 according to a method 203 comprising the following general steps:
i.) a step 301 of segmenting an arbitrary Malay text input string 111 into a plurality of words and sentences, in which the position of each word in a given sentence of the given Malay text input string 111 is identified;
ii.) a step 302 of executing grapheme to phoneme conversion for each identified word in the given sentence, in which a grapheme of each word is converted to a corresponding phoneme unit;
iii.) a step 303 of syllabification in which each word identified in the given sentence of the given Malay text input string 111 is segmented into identifiable syllables, in which a position of each syllable in a given word of a given sentence is identified;
iv.) a step 304 of compiling contextual information of each phonetic symbol of a given word in a given sentence of the Malay text input string 111 from the results of the step 301 of text input segmentation, the step 302 of grapheme to phoneme conversion and the step 303 of syllabification;
v.) a step 305 of repeating steps 303 and 304 until every word in a given sentence has been subjected to grapheme to phoneme conversion and syllabification;
vi.) a step 306 of repeating steps 302 to 305 for every sentence of the given Malay text input string 111; and
vii.) a step 307 of generating a plurality of context dependent labels which are in essence acoustic labels that incorporate the compiled contextual information of a previous step and that correspond to a plurality of phoneme units of a given sentence of the given Malay text input string 111, for every sentence of said given Malay text input string 111.
Referring to the sequence of steps listed above, the term "grapheme" refers to a phonetic symbol which corresponds to a given phoneme unit in a given word of a given sentence of a given string of Malay text input 111. More particularly, the method executed by the context dependent label generation unit 112 includes a step in which a phonetic symbol, i.e. a grapheme, of each word in a given sentence is converted to a corresponding phoneme unit. It should be noted by one of ordinary skill in the art of linguistics that a phoneme unit corresponds to a fundamental phonological unit of a given language, and hence in the context of this description, a phoneme unit refers to a fundamental phonological unit of Malay language.
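The compiling and label generation steps 304 and 307 can be pictured with the following Python sketch. It is a hedged illustration and not the label format of the invention: the label layout `prev-cur+next/P:../S:../W:..`, the boundary symbol `sil`, and the function name `build_labels` are hypothetical, and contextual information mentioned elsewhere in the description, such as part of speech tagging, is left out for brevity. The input is assumed to be one sentence that has already been segmented, converted to phoneme units and syllabified.

```python
from typing import List

Syllable = List[str]           # phoneme units of one syllable
Word = List[Syllable]          # syllables of one word
Sentence = List[Word]          # words of one sentence


def build_labels(sentence: Sentence) -> List[str]:
    """Build one context dependent label per phoneme unit of a sentence."""
    flat = []
    for w_no, word in enumerate(sentence, start=1):
        for s_no, syllable in enumerate(word, start=1):
            for p_no, phoneme in enumerate(syllable, start=1):
                # Compile positional context from segmentation and syllabification.
                flat.append((phoneme, p_no, len(syllable),
                             s_no, len(word), w_no, len(sentence)))
    labels = []
    for i, (ph, p_no, p_tot, s_no, s_tot, w_no, w_tot) in enumerate(flat):
        prev_ph = flat[i - 1][0] if i > 0 else "sil"
        next_ph = flat[i + 1][0] if i + 1 < len(flat) else "sil"
        labels.append(
            f"{prev_ph}-{ph}+{next_ph}"
            f"/P:{p_no}_{p_tot}/S:{s_no}_{s_tot}/W:{w_no}_{w_tot}"
        )
    return labels


# "saya makan", already syllabified as sa-ya ma-kan.
example = [[["s", "a"], ["y", "a"]], [["m", "a"], ["k", "a", "n"]]]
for label in build_labels(example):
    print(label)
```

Each label carries the neighbouring phoneme units plus the position of the phoneme in its syllable, of the syllable in its word, and of the word in its sentence, which is the kind of contextual information the steps above compile.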
With reference to figures 4 and 5, and referring to the sequence of steps 301 to 307 of the method of context dependent label generation as executed by the context dependent label generation unit 112 comprised within the synthesizing module 110 of the HMM based Malay text to Malay speech synthesis system 100 in accordance to a preferable embodiment of the present invention, the step 301 of segmenting an arbitrary Malay text input string 111 into a plurality of words and sentences, in which the position of each word in a given sentence of the given Malay text input string 111 is identified, comprises:
i.) a step 301a of receiving a Malay text input string 111 provided by a user;
ii.) a step 301b of segmenting the Malay text input string 111 into sentences and words and hence identifying sentences and words in said Malay text input string 111;
iii.) a step 301c of assigning a word number for each word in a sentence that indicates a position of a given word in relation to a given sentence for every word of said Malay text input string 111, said word number for each word being one of a number of contextual informational parameters pertaining to a phoneme unit corresponding to a phonetic symbol/grapheme of said given word in said given sentence of said Malay text input string 111 that form part of a context dependent label for said phoneme unit; and
iv.) a step 301d of updating a register designated for storage of word information, such as word numbers assigned to each word in a sentence that indicate a position of a given word in relation to a given sentence, for every word and every sentence of said Malay text input string 111.
In accordance to a preferable embodiment, the step of segmenting an arbitrary Malay text input string 111 , includes a step of assigning a word number for each word in a sentence that alludes to a position of a given word in relation to the Malay text input string 111 , as well as a step of updating a register for storage of word information such as word numbers assigned to each word within the given Malay text input string 111 apart from storing of word numbers assigned to each word in a sentence that alludes to a position of a given word in relation to a given sentence for every word and every sentence of said Malay input text string 111 .
With regard to the step 301 of segmenting an arbitrary Malay text input string 111 into a plurality of words and sentences, and specifically to the last step 301d of that method, it is readily understood by one of ordinary skill in the art of computing that the register designated for storage of word information is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention.
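By way of illustration only, steps 301a to 301d may be sketched as follows. This is a minimal sketch and not the implementation of the embodiment: the names `WordInfo` and `segment_text`, the use of a Python list as the "register", and the assumption that a sentence ends at a full stop, exclamation mark or question mark are all hypothetical choices made for the example.

```python
import re
from dataclasses import dataclass


@dataclass
class WordInfo:
    text: str
    sentence_number: int   # 1-based index of the sentence in the input string
    word_in_sentence: int  # 1-based position of the word within its sentence
    word_in_input: int     # 1-based position of the word within the whole input


def segment_text(text: str) -> list[WordInfo]:
    """Segment an input string into sentences and words and number each word."""
    register: list[WordInfo] = []  # stands in for the word-information register
    # Assumption: '.', '!' and '?' terminate a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    running = 0
    for s_no, sentence in enumerate(sentences, start=1):
        # Assumption: a word is a run of letters, optionally hyphenated.
        words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", sentence)
        for w_no, word in enumerate(words, start=1):
            running += 1
            register.append(WordInfo(word.lower(), s_no, w_no, running))
    return register


register = segment_text("Saya suka makan nasi. Dia pergi ke sekolah.")
for entry in register:
    print(entry)
```

Each entry carries both the word number within its sentence and the word number within the whole input string, which correspond to the two kinds of word number described above.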
With reference to figures 4 and 6, and to the sequence of steps 301 to 307 of the method of context dependent label generation executed by the context dependent label generation unit 112 of the synthesizing module 110 of the HMM based Malay text to Malay speech synthesis system 100 in accordance with a preferable embodiment of the present invention, the step 302 of grapheme to phoneme conversion of a given word comprises generally the following steps:
i.) a step of generating a phoneme unit for each phonetic symbol representative of a consonant in a given word of a given sentence of said Malay text input string 111 (refer to blocks 302a, 302cb and 302db of figure 6); and
ii.) a step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of said Malay text input string 111 (refer to blocks 302a, 302ca and 302da of figure 6).
More particularly, the aforementioned step of grapheme to phoneme conversion of a given word, shown as block 302 of the flowchart of figure 4, comprises the following steps:
i.) a step 302a of segmenting a given word of a given sentence of a given Malay text input string 111 into a plurality of graphemes (i.e. phonetic symbols);
ii.) a step 302cb of identifying, from the plurality of graphemes that result from the step of segmenting said given word, phonetic symbols representative of a consonant;
iii.) a step 302ca of identifying, from the plurality of graphemes that result from the step of segmenting said given word, phonetic symbols representative of a vowel;
iv.) a step 302db of mapping phonetic symbols representative of consonants to phoneme units representative of said consonants;
v.) a step 302da of mapping phonetic symbols representative of vowels to phoneme units representative of said vowels; and
vi.) a step 302f of updating a register designated for storage of phoneme units representative of phonetic symbols that correspond to consonants and phoneme units representative of phonetic symbols that correspond to vowels, as well as phoneme information of each phoneme unit such as phoneme number within a given word of a given sentence, for every word and for every sentence of the given Malay text input string 111.
Before proceeding any further, with regard to the step 302 of grapheme to phoneme conversion as executed by the context dependent label generation unit 112 of the HMM based Malay text to Malay speech synthesis system 100 in accordance with a preferable embodiment of the present invention, the register mentioned in step 302f, which is designated for the storage of phoneme units representative of consonants and phoneme units representative of vowels within a given word, as well as phoneme information such as phoneme number within a given word of a given sentence, for every word and for every sentence of the given Malay text input string 111, is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention.
The mapping of phonetic symbols/graphemes representative of consonants to corresponding phoneme units is a direct and simple one to one mapping, as evident in block 302db of figure 6. This fact is well understood by and evident to an ordinary person skilled in the art of Malay linguistics. However, the mapping of phonetic symbols/graphemes representative of vowels to corresponding phoneme units in a given word is not and cannot be a direct one to one mapping of a phoneme unit to a vowel, as a single vowel can have a plurality of phoneme units associated with it depending on the context of the vowel in a given word, i.e. depending on the position of the vowel in a given word and on the preceding or subsequent phonetic symbols that abut the given vowel in the given word.
With reference to figure 7, there is shown a flow-chart detailing a method of vowel identification as executed within the method of grapheme to phoneme conversion illustrated in figure 6 and discussed above, as part of the method of context dependent label generation of a preferable embodiment of the method of HMM based Malay text to Malay speech synthesis of the present invention. More particularly, within the step 302 of grapheme to phoneme conversion, the step of generating a phoneme unit for each phonetic symbol representative of a vowel in a given word of a given sentence of a given Malay text input string comprises the following steps:
i.) a step 302da(i) of identifying at least three phonetic symbols, representative of consonants, that precede a given phonetic symbol representative of a vowel identified in a given word of a given sentence of a given Malay text input string;
ii.) a step 302da(ii) of identifying at least three phonetic symbols, representative of consonants, that are located subsequent to the given phonetic symbol representative of the vowel identified in the given word of the given sentence of the given Malay text input string;
iii.) a step 302da(iii) of correlating said phonetic symbol representative of the vowel identified, said at least three phonetic symbols representative of the consonants preceding said phonetic symbol representative of the vowel, and said at least three phonetic symbols representative of the consonants subsequent to said phonetic symbol representative of the vowel, to a state of a binary decision tree in which each state of said decision tree corresponds to a phoneme unit; and
iv.) a step 302da(iv) of generating a phoneme unit based on the result of the correlation, in the preceding step, of said phonetic symbol representative of the vowel to the state of said decision tree which corresponds to a phoneme unit.
Now, with reference to figures 8 and 4, a step of syllabification which is executed by the context dependent label generation unit 112 of the synthesizing module 110 of the HMM based Malay text to Malay speech synthesis system 100 in accordance with a preferable embodiment of the present invention will be detailed. More particularly, the step of syllabification 303 shown in figure 4 is a step executed by the context dependent label generation unit 112 that, similar to the step of text input segmentation 301 and the step of grapheme to phoneme conversion 302, is executed for every word across all sentences of a given Malay text input string in order to obtain contextual information of all the phoneme units represented by said given Malay text input string.
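Returning briefly to the vowel mapping of steps 302da(i) to 302da(iv), the traversal of a binary decision tree over the consonant context of a vowel may be sketched as below. This is an illustrative sketch only. The tree `TOY_TREE_E`, its two questions and the phoneme labels `"e"` and `"@"` are invented for the example and carry no linguistic authority; the specification does not disclose the questions of the actual tree. The sketch also collects up to three consonants on each side, since short words may contain fewer.

```python
from dataclasses import dataclass
from typing import Callable, Union

VOWELS = {"a", "e", "i", "o", "u"}


@dataclass
class Context:
    vowel: str
    preceding: tuple  # up to three consonants before the vowel, nearest first
    following: tuple  # up to three consonants after the vowel, nearest first


@dataclass
class Leaf:
    phoneme: str  # a terminal state of the tree corresponds to a phoneme unit


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes/no question about the context
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def vowel_context(graphemes: list, index: int, width: int = 3) -> Context:
    """Steps 302da(i) and 302da(ii): gather the neighbouring consonants."""
    before = [g for g in reversed(graphemes[:index]) if g not in VOWELS][:width]
    after = [g for g in graphemes[index + 1:] if g not in VOWELS][:width]
    return Context(graphemes[index], tuple(before), tuple(after))


# Hypothetical tree for the vowel "e"; the questions are placeholders.
TOY_TREE_E = Node(
    question=lambda c: len(c.preceding) == 0,
    yes=Leaf("@"),
    no=Node(
        question=lambda c: c.following[:1] == ("j",),
        yes=Leaf("e"),
        no=Leaf("@"),
    ),
)
TREES = {"e": TOY_TREE_E}


def vowel_to_phoneme(graphemes: list, index: int) -> str:
    """Steps 302da(iii) and 302da(iv): walk the tree down to a phoneme unit."""
    ctx = vowel_context(graphemes, index)
    node = TREES.get(ctx.vowel)
    if node is None:
        return ctx.vowel  # assumption: vowels without a tree map to themselves
    while isinstance(node, Node):
        node = node.yes if node.question(ctx) else node.no
    return node.phoneme


print(vowel_to_phoneme(list("meja"), 1), vowel_to_phoneme(list("besar"), 1))
```

The outputs demonstrate only that different consonant contexts reach different terminal states; a practical tree would be built from data or from expert rules for Malay.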
In accordance with said preferable embodiment of the HMM based Malay text to Malay speech synthesis system and method of the present invention, the aforementioned step of syllabification 303, which is executed for each word of a given sentence, for all words and all sentences of a given Malay text input string, is performed based on syllabification rules that are exclusive to the Malay language and comprises the general steps of: i.) a step of counting a number of syllables in a given word of a given sentence; and ii.) a step of identifying and marking each syllable in a given word of a given sentence. The step 303b of identifying and marking each syllable in a given word of a given sentence, across all sentences in a given Malay text input string, in accordance with a preferable embodiment of the method and system 100 of HMM based Malay text to Malay speech synthesis of the present invention, is unique by virtue of the uniqueness of the syllabification rules of Malay, which form the basis of the identification and marking of a syllable within a given word as executed by the context dependent label generation unit 112. More particularly, the process of syllabification is essentially the process of identifying and marking the syllables of a given word and, in the context of the HMM based Malay text to Malay speech synthesis system 100 in accordance with a preferable embodiment of the present invention, further comprises steps of counting a number of syllables in a given word and updating syllable information in a designated register in an area of memory 22 of an exemplary computer system 20 housing the HMM based Malay text to Malay speech synthesis system 100.
In view of the preceding passage and the flow chart of figure 8, the step of syllabification 303 executed on each word of a given sentence, across all sentences in a Malay text input string as also shown in figure 4, comprises the following steps: i.) a step 303a of counting a number of syllables in a given word of a given sentence; ii.) a step 303b of identifying and marking each syllable in a given word of a given sentence; and iii.) a step 303c of updating a register designated for storage of syllabification data of a given word of a given sentence, for every word and for every sentence of the given Malay text input string 111.
Elaborating further, as shown in figure 9, the step 303b of identifying and marking a syllable in a given Malay word of a given Malay sentence in a given Malay text input string comprises the following steps:
i.) a step 303ba of identifying a phonetic symbol representative of a vowel or diphthong of a given word in a given sentence of a given Malay text input string, starting from the rear of the word;
ii.) a step 303bb of identifying a phonetic symbol representative of a consonant that supersedes, in position referenced from the rear of the given word, the position of the phonetic symbol representative of the vowel or diphthong identified in the previous step (that is, the consonant that comes immediately before said vowel or diphthong in reading order);
iii.) a step 303bc of marking a boundary of a syllable from the rear of the given word, in which a syllable is delineated by a combination of a phonetic symbol representative of a consonant followed (as referenced from the beginning of a syllable) by a phonetic symbol representative of a vowel or a diphthong;
iv.) a step 303bd of repeating steps 303ba (i) to 303bc (iii) for a subsequent phonetic symbol representative of a vowel or diphthong in the given word from the rear of said given word, if not all phonetic symbols representative of vowels or diphthongs in said given word have been identified and the consonant identification and syllable boundary delineation for said subsequent phonetic symbol representative of a vowel have not been executed;
v.) a step 303be of repeating steps 303ba (i) to 303bd (iv) for all words of a given sentence, across all sentences of a given Malay text input string; and
vi.) a final step 303bf of updating a register designated for storage of syllabification information that results from the execution of steps 303ba to 303be for a given word of a given sentence, for all words and for all sentences of a given Malay text input string 111.
With reference to the preceding paragraph that details the steps involved in the method of syllabification 303 as executed by the context dependent label generation unit 112 in accordance with a preferable embodiment of the HMM based Malay text to Malay speech synthesis of the present invention, and particularly to the last step 303bf of syllabification 303, the register designated for the storage of syllabification information is a hardware resource that forms part of memory 22 of an exemplary computer system 20 that houses the preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention. The syllabification information stored in said register includes information such as syllable position and syllable number within a given word of a given sentence, for all syllables in all words in all sentences of a given Malay text input string. The syllabification information stored in said register may further include information such as syllable position within a given sentence, for all syllables and for all sentences of the given Malay text input string.
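The rear-to-front syllable marking of steps 303ba to 303bd can be sketched as follows, as an illustration only. Several points are assumptions introduced for the example and are not stated in the specification: the letter pairs in `MULTI` are matched greedily as single symbols, consonants that follow a vowel are kept in that vowel's syllable, and any consonants left over at the front of a word are attached to the first syllable.

```python
VOWELS = {"a", "e", "i", "o", "u"}
DIPHTHONGS = {"ai", "au", "oi"}
NUCLEI = VOWELS | DIPHTHONGS

# Assumption: consonant digraphs and diphthongs are matched greedily.
MULTI = ("ng", "ny", "sy", "kh", "gh", "ai", "au", "oi")


def split_graphemes(word: str) -> list[str]:
    """Segment a word into phonetic symbols, keeping MULTI pairs together."""
    graphemes = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in MULTI:
            graphemes.append(pair)
            i += 2
        else:
            graphemes.append(word[i])
            i += 1
    return graphemes


def syllabify(graphemes: list[str]) -> list[str]:
    """Mark syllables from the rear of the word and return them front to back."""
    syllables = []
    end = len(graphemes)  # exclusive right edge of the syllable being marked
    i = end - 1
    while i >= 0:
        if graphemes[i] in NUCLEI:                        # step 303ba
            start = i
            if i > 0 and graphemes[i - 1] not in NUCLEI:  # step 303bb
                start = i - 1
            syllables.append("".join(graphemes[start:end]))  # step 303bc
            end = start
            i = start - 1                                 # step 303bd
        else:
            i -= 1
    if end > 0:
        # Assumption: leftover leading consonants join the first syllable.
        leftover = "".join(graphemes[:end])
        if syllables:
            syllables[-1] = leftover + syllables[-1]
        else:
            syllables.append(leftover)
    syllables.reverse()
    return syllables


for word in ("makan", "sekolah", "bunga"):
    marked = syllabify(split_graphemes(word))
    # Syllable count (step 303a) and syllable position within the word.
    print(word, len(marked), list(enumerate(marked, start=1)))
```

The length of the returned list gives the syllable count of step 303a, and enumerating it gives the syllable position within the word that is stored as syllabification information.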
Though not explicitly shown in any of the attached figures, the context dependent label generation unit 112 compiles contextual information from the results of execution of the steps of text input segmentation 301, grapheme to phoneme conversion 302 and syllabification 303 of a given Malay text input string 111. More particularly, the context dependent label generation unit 112 in accordance with a preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention compiles syllabification information from the register of step 303bf of figure 9, word information from the register of step 301d of figure 5 and phoneme information from the register of step 302f of figure 6, and thereby generates context dependent labels for each phoneme unit of every word in every sentence of a given Malay text input string 111.
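The compilation of word, syllable and phoneme information into one label per phoneme unit may be pictured with the sketch below. The label layout, the field names of `PhonemeRecord`, and the `"sil"` placeholder used at the edges are hypothetical; the specification does not prescribe a label format, so this shows only how the three registers could be combined.

```python
from dataclasses import dataclass


@dataclass
class PhonemeRecord:
    phoneme: str
    phoneme_in_word: int     # from the phoneme register (step 302f)
    syllable_in_word: int    # from the syllabification register (step 303bf)
    syllables_in_word: int   # syllable count of the word (step 303a)
    word_in_sentence: int    # from the word register (step 301d)
    sentence_number: int


def make_labels(records: list[PhonemeRecord]) -> list[str]:
    """Build one context dependent label string per phoneme unit."""
    labels = []
    for i, rec in enumerate(records):
        # Neighbouring phonemes give the immediate phonetic context.
        prev_p = records[i - 1].phoneme if i > 0 else "sil"
        next_p = records[i + 1].phoneme if i + 1 < len(records) else "sil"
        labels.append(
            f"{prev_p}-{rec.phoneme}+{next_p}"
            f"/P:{rec.phoneme_in_word}"
            f"/S:{rec.syllable_in_word}_{rec.syllables_in_word}"
            f"/W:{rec.word_in_sentence}"
            f"/T:{rec.sentence_number}"
        )
    return labels


# The word "makan" (syllables "ma" and "kan") as the third word of sentence 1.
records = [
    PhonemeRecord("m", 1, 1, 2, 3, 1),
    PhonemeRecord("a", 2, 1, 2, 3, 1),
    PhonemeRecord("k", 3, 2, 2, 3, 1),
    PhonemeRecord("a", 4, 2, 2, 3, 1),
    PhonemeRecord("n", 5, 2, 2, 3, 1),
]
labels = make_labels(records)
for label in labels:
    print(label)
```

Keeping each kind of information in its own field of the label mirrors the three separate registers from which the label generation unit draws.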
The abovementioned sequence of steps is executed by the context dependent label generation unit 112 software module. More particularly, as would be readily understood by a person of ordinary skill in the art of programming, in the exemplary computing environment of the computer system 20 shown in figure 1, said method of context dependent label generation shown in figures 4 to 9 is executed by a central processing unit (CPU) 21 of said computer system 20, which reads and executes computer readable and executable instructions that embody the context dependent label generation unit 112 software module in accordance with a preferable embodiment of the HMM based Malay text to Malay speech synthesis system 100 of the present invention.
While the invention has been illustrated and described in detail in the drawings and foregoing description, the same is to be considered as illustrative and not restrictive in character, it being understood that only the preferred embodiment has been shown and described and that all equivalents, changes and modifications that come within the scope of the invention as described herein and/or by the following claims are desired to be protected.

Claims

1. A HMM (Hidden Markov Model) based Malay text to Malay speech synthesis system (100) comprising of a plurality of software modules in the form of computer readable instructions residing in memory (22) of a computer system (20) and executed by a central processing unit (CPU) (21) of said computer system (20), the HMM based speech synthesis system (100) including: a training module (105) for the training of a plurality of HMM (Hidden Markov Model) statistical predictive models to generate an acoustic speech model of Malay speech; and a synthesizing module (110) for synthesizing Malay speech from Malay text input utilizing context dependent labels generated from a Malay text input string (111) and the acoustic speech model of Malay speech; characterized in that; the synthesizing module (110) includes a context dependent label generation unit (112) that automatically generates context dependent labels that correspond to phoneme units of Malay language from Malay language text input string (111), derived in part from syllabification rules that are exclusive to Malay language.
2. A system (100) according to claim 1, wherein the training module (105) comprises of a database (106) housing a Malay speech corpus that includes a Malay language speech model embodied in the form of speech recordings of speech signals of a native Malay speaker and corresponding phonetic symbols representative of said speech recordings of a Malay native speaker.
3. A system (100) according to claim 1 or 2, wherein the training module (105) further includes a parametrization module (107) which extracts and/or generates parameters that include spectral parameters and excitation parameters for training of a plurality of HMM (Hidden Markov Model) probabilistic models.
4. A system (100) according to claim 3, wherein the spectral parameters and excitation parameters extracted and/or generated by the parametrization module (107) include parameters extracted from the Malay speech model housed in the Malay speech corpus residing in the database (106) of the training module (105).
5. A system (100) according to claim 4, wherein the spectral parameters generated include a fundamental frequency (F0) of a given speech signal and MCEPs (Mel-Cepstral Coefficients) of the given speech signal and the excitation parameters include a logarithm of the fundamental frequency (F0) of a given speech signal, a delta and delta-delta coefficients of MCEPs of the given speech signals.
6. A system (100) according to any one of claims 4 or 5, wherein the synthesizing module (110) comprises a statistical parameter model database (113) that includes a plurality of HMM probabilistic models that in combination represent an acoustic speech model derived from the plurality of excitation and spectral parameters extracted and/or generated by the parametrization module (107) of the training module (105).
7. A system (100) according to claim 6, wherein the synthesizing module (110) includes a speech synthesis module (114) which correlates a plurality of context dependent labels of a given plurality of phoneme units of a Malay input text string (111) and an acoustic speech model defined by a plurality of HMMs residing in the statistical parameter model database (113) to generate a corresponding plurality of acoustic parameters that correspond to said given plurality of phoneme units of the Malay text input string (111) and further concatenates the acoustic speech parameters representative of Malay speech corresponding to the phoneme units of the input Malay text input string (111) to synthesize Malay speech.
8. A system (100) according to claim 7, wherein the speech synthesis module (114) of the synthesizing module (110) includes a Mel-Log Spectrum Approximating Filter (MLSA).
9. A method (200) for Malay text input to Malay speech synthesis (200) as executed by a HMM based Malay text input to Malay speech synthesis system (100), characterized in that, the method comprises of: i.) a step (201) of extracting and/or generating a plurality of parameters representative of a Malay speech model of a native Malay speaker whose speech recordings of speech signals and corresponding phonetic symbols representative of said speech recordings are stored in a speech corpus, that include spectral parameters and excitation parameters; ii.) a step (202) of training a plurality of HMMs (Hidden Markov Models) which represent a plurality of probabilistic models, utilizing the spectral and excitation parameters extracted and/or generated from the Malay speech model and representative of said Malay speech model to generate a Malay speech acoustic model; iii.) a step (203) of generating a plurality of context dependent labels for a plurality of phoneme units from an arbitrary Malay text input string (111), said plurality of context dependent labels comprising contextual information pertaining to said plurality of phoneme units of said arbitrary Malay text input string (111) such as part of speech (POS) tagging; iv.) a step (204) of utilizing said generated plurality of context dependent labels representative of a plurality of contextual information of a plurality of phoneme units present in said arbitrary Malay text input string (111) of the previous step (203), to obtain a plurality of acoustic parameters of Malay speech based on a generated Malay speech acoustic model, said obtained plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels; and v.) a step (205) of synthesizing Malay speech based on the plurality of acoustic parameters obtained in the previous step (204), by concatenating said plurality of acoustic parameters corresponding to a plurality of phoneme units that correspond to said plurality of context dependent labels that in turn correspond to said Malay text input string (111) of a preceding step.
10. A method (200) according to claim 9, wherein the step (203) of generating a plurality of context dependent labels for a Malay text input string (111) as executed by a context dependent label generation unit (112) of the HMM based Malay text to Malay speech synthesis system (100) comprises the steps of: i.) a step (301) of segmenting of an arbitrary Malay text input string (111) into a plurality of words and sentences, in which the position of each word in a given sentence of the given Malay text input string (111) is identified; ii.) a step (302) of executing grapheme to phoneme conversion for each identified word in the given sentence of the given Malay text input string (111), in which a grapheme of each word is converted to a corresponding phoneme unit; iii.) a step (303) of syllabification in which each word identified in the given sentence of the given Malay text input string (111), is segmented into identifiable syllables, in which a position of each syllable in a given word of a given sentence is identified; iv.) a step (304) of compiling contextual information of each phonetic symbol of a given word in a given sentence of the Malay input text string (111) from the results of the step (301) of text input segmentation, the step (302) of grapheme to phoneme conversion and the step (303) of syllabification; v.) a step (305) of repeating steps (302) and (304) until every word in a given sentence has been subjected to grapheme to phoneme conversion and syllabification; vi.) a step (306) of repeating steps (302) to (305) for every sentence of the given Malay text input string (111); and vii.) a step (307) of generating a plurality of context dependent labels which are in essence acoustic labels that incorporate the compiled contextual information of a previous step and that correspond to a plurality of phoneme units of a given sentence of the given Malay text input string (111) for every sentence of said given Malay text input string (111).
11. A method (200) according to claim 10, wherein the step of syllabification (303) comprises the steps of: i.) a step (303a) of counting a number of syllables in a given word of a given sentence; ii.) a step (303b) of identifying and marking each syllable in a given word of a given sentence; and iii.) a step (303c) of updating a register designated for storage of syllabification data of a given word of a given sentence, for every word and for every sentence of a given Malay input text string (111).
12. A method (200) according to claim 11, wherein the step (303b) of identifying and marking each syllable in a given word of a given sentence for every word and every sentence of a Malay text input string (111) comprises of: i.) a step (303ba) of identifying a phonetic symbol representative of a vowel or diphthong of a given word in a given sentence of a given Malay input text string, starting from the rear of the word; ii.) a step (303bb) of identifying a phonetic symbol representative of a consonant that supersedes in position referenced from the rear of the given word, the position of the identified phonetic symbol representative of the vowel or diphthong in the previous step; iii.) a step (303bc) of marking a boundary of a syllable from the rear of the given word in which a syllable in Malay is a phonetic symbol delineated by a combination of a phonetic symbol representative of a consonant followed by a phonetic symbol representative of a vowel or a diphthong; iv.) a step (303bd) of repeating step (303ba) (i) to step (303bc) (iii) for a subsequent phonetic symbol representative of a vowel or diphthong in the given word from the rear of said given word, if not all phonetic symbols representative of vowels or diphthongs in said given word have been identified and subsequent consonant identification and syllable boundary delineation for said subsequent phonetic symbol representative of a vowel has not been executed; v.) a step (303be) of repeating step (303ba) (i) to step (303bd) (iv) for all words of a given sentence, across all sentences of a given Malay input text string; and vi.) a final step of (303bf) of updating a register designated for storage of syllabification information that result from the execution of steps (303ba) to (303be) for a given word of a given sentence, for all words and for all sentences of a given Malay text input string (111).
13. A method according to any one of claims 9 to 12, wherein each of the context dependent labels generated for a plurality of phoneme units of a given Malay text input string incorporate part of speech information such as word information of a given phoneme unit in the Malay text input string, as well as syllable information and phoneme information of said given phoneme unit.
14. A method according to claim 13, wherein the syllable information of the given phoneme unit comprises of syllable number within a word containing said given phoneme unit, syllable position within a word containing said given phoneme unit and syllable position within a sentence containing said given phoneme unit.
PCT/MY2016/050076 2015-11-09 2016-11-09 Method and system for text to speech synthesis WO2017082717A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2015704052 2015-11-09
MYPI2015704052 2015-11-09

Publications (2)

Publication Number Publication Date
WO2017082717A2 true WO2017082717A2 (en) 2017-05-18
WO2017082717A3 WO2017082717A3 (en) 2018-02-15

Family

ID=58695826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2016/050076 WO2017082717A2 (en) 2015-11-09 2016-11-09 Method and system for text to speech synthesis

Country Status (1)

Country Link
WO (1) WO2017082717A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108073679A (en) * 2017-11-10 2018-05-25 中国科学院信息工程研究所 Stochastic model set of strings generation method, equipment and readable storage medium storing program for executing under a kind of String matching scene
CN108073679B (en) * 2017-11-10 2021-09-28 中国科学院信息工程研究所 Random pattern string set generation method and device in string matching scene and readable storage medium
CN112818089A (en) * 2021-02-23 2021-05-18 掌阅科技股份有限公司 Text phonetic notation method, electronic equipment and storage medium

Also Published As

Publication number Publication date
WO2017082717A3 (en) 2018-02-15

Similar Documents

Publication Publication Date Title
JP4302788B2 (en) Prosodic database containing fundamental frequency templates for speech synthesis
JP4054507B2 (en) Voice information processing method and apparatus, and storage medium
US8126717B1 (en) System and method for predicting prosodic parameters
KR20190085883A (en) Method and apparatus for voice translation using a multilingual text-to-speech synthesis model
US11763797B2 (en) Text-to-speech (TTS) processing
McGraw et al. Learning lexicons from speech using a pronunciation mixture model
CN115485766A (en) Speech synthesis prosody using BERT models
US20090157408A1 (en) Speech synthesizing method and apparatus
JP7379756B2 (en) Prediction of parametric vocoder parameters from prosodic features
Rashad et al. An overview of text-to-speech synthesis techniques
JP2023505670A (en) Attention-Based Clockwork Hierarchical Variational Encoder
CN115101046A (en) Method and device for synthesizing voice of specific speaker
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
WO2017082717A2 (en) Method and system for text to speech synthesis
Mullah A comparative study of different text-to-speech synthesis techniques
Chen et al. A Mandarin Text-to-Speech System
Chinathimatmongkhon et al. Implementing Thai text-to-speech synthesis for hand-held devices
Ronanki et al. The CSTR entry to the Blizzard Challenge 2017
Bahaadini et al. Implementation and evaluation of statistical parametric speech synthesis methods for the Persian language
Hendessi et al. A speech synthesizer for Persian text using a neural network with a smooth ergodic HMM
Klabbers Text-to-Speech Synthesis
IMRAN ADMAS UNIVERSITY SCHOOL OF POST GRADUATE STUDIES DEPARTMENT OF COMPUTER SCIENCE
Trinh et al. HMM-based Vietnamese speech synthesis
Liu et al. Design and Implementation of Burmese Speech Synthesis System Based on HMM-DNN
Nitisaroj et al. The Lessac Technologies system for Blizzard Challenge 2010

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16864641

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16864641

Country of ref document: EP

Kind code of ref document: A2