KR101735195B1 - Method, system and recording medium for converting grapheme to phoneme based on prosodic information - Google Patents
- Publication number
- KR101735195B1 (application KR1020150111644A)
- Authority
- KR
- South Korea
- Prior art keywords
- unit
- text
- prosody
- phoneme
- string
- Prior art date
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
A method, system, and recording medium for grapheme-to-phoneme conversion based on prosodic information are disclosed. A computer-implemented grapheme-to-phoneme conversion method comprises the steps of receiving a text to be converted into speech, estimating a prosodic unit of the text based on a predefined prosodic structure built from intonation phrases (IP), accentual phrases (AP), and clitics (CL), and converting the character string of the text into a phoneme string based on the estimated prosodic unit.
Description
Embodiments of the present invention relate to a technology for converting a character string into a phoneme string based on a prosodic structure estimated from text.
The present disclosure relates to a grapheme-to-phoneme conversion technique, which is applicable both to speech synthesis and to speech recognition technology.
Generally, prosody modeling in speech synthesis is an important factor that directly affects naturalness and intelligibility. Prosodic modeling depends on the prosodic characteristics of each individual language.
For example, English is a stress-accent language and requires modeling of sentence stress, intermediate phrases, and intonation phrases, whereas Japanese is a pitch-accent language and requires modeling of accentual phrases together with their accents.
A method for predicting prosody for speech synthesis in line with these trends is disclosed in Korean Patent Laid-Open Publication No. 10-2006-0008330. In existing studies, however, grapheme-to-phoneme conversion has relied on syllable modules, phonological knowledge, and rules, and no relation between the generated phoneme string and the prosodic information or prosodic structure has been established.
In general, grapheme-to-phoneme conversion has applied phonological rules or pronunciation models with each word phrase of the text as the basic unit, independently of the prosodic structure.
Korean text is written in units of word phrases, but the actual pronunciation differs depending on which prosodic unit each word phrase is realized as within the prosodic structure.
Therefore, in order to generate an accurate phoneme string from a text string, the problem of which prosodic unit each word phrase is realized as must be solved in advance.
Accordingly, a method, system, and recording medium are proposed that, assuming a prosodic structure having a hierarchy of intonation phrases, accentual phrases, and clitics, map each word phrase of the text to a prosodic unit and convert the character string of each mapped word phrase into a phoneme string.
Since Korean text consists of word phrases, and since the pronunciation actually changes according to the prosodic unit in which each word phrase is realized within the prosodic structure, grapheme-to-phoneme conversion should be performed after estimating the prosodic unit of each word phrase.
A computer-implemented grapheme-to-phoneme conversion method comprises the steps of: receiving a text to be converted into speech; estimating a prosodic unit of the text based on a predefined prosodic structure built from intonation phrases (IP), accentual phrases (AP), and clitics (CL); and converting the character string of the text into a phoneme string based on the estimated prosodic unit.
A computer-readable medium comprises instructions for controlling a computer system to provide speech synthesis, the instructions comprising the steps of: receiving a text to be converted into speech; estimating a prosodic unit of the text based on a predefined prosodic structure built from intonation phrases (IP), accentual phrases (AP), and clitics (CL); converting the character string of the text into a phoneme string based on the estimated prosodic unit; and converting the text into TTS (Text-To-Speech) speech based on the phoneme string.
The speech synthesis system includes: a memory into which a text to be converted into speech is loaded; a prosodic unit estimation unit for estimating a prosodic unit of the text based on a predefined prosodic structure built from intonation phrases (IP), accentual phrases (AP), and clitics (CL); a grapheme-to-phoneme conversion unit for converting the character string of the text into a phoneme string based on the estimated prosodic unit; a speech synthesis unit for converting the text into TTS (Text-To-Speech) speech based on the phoneme string; and a voice output unit for outputting the TTS speech through a speaker of a user terminal.
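The three-stage pipeline summarized above (receive text, estimate prosodic units per word phrase, convert to a phoneme string) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names and the toy labeling heuristic are assumptions, and a real system would use a trained prosodic-boundary predictor and Korean phonological rules.

```python
# Hypothetical sketch of the pipeline: text -> prosodic-unit labels -> phoneme string.
# Boundary labels follow the hierarchy described in the text: 3 = IP, 2 = AP, 1 = CL.

def estimate_prosodic_units(eojeols):
    """Assign each space-separated word phrase (eojeol) a prosodic boundary label.
    The length-based rule below is a placeholder, not the patent's predictor."""
    labels = [1 if len(w) <= 2 else 2 for w in eojeols]  # toy rule: short phrases as clitics
    labels[-1] = 3  # the sentence end closes an intonation phrase
    return labels

def to_phoneme_string(eojeols, labels):
    """Emit a phoneme string that carries the prosodic boundary label of each
    word phrase; sound-change rules would be applied across CL boundaries here."""
    return " ".join(w + str(b) for w, b in zip(eojeols, labels))

text = "this is a toy sentence"
eojeols = text.split()
labels = estimate_prosodic_units(eojeols)
print(labels)
print(to_phoneme_string(eojeols, labels))
```

In a full system the phoneme string produced by the second stage would then feed the TTS synthesizer, as the system summary above describes.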
According to an embodiment of the present invention, each word phrase of the text is mapped to a prosodic unit according to a prosodic structure having a hierarchy of intonation phrases, accentual phrases, and clitics, and the character string of each mapped word phrase is then converted into a phoneme string, so that natural speech close to the actual pronunciation can be output.
According to an embodiment of the present invention, improving the quality of phonetic transcription and the performance of grapheme-to-phoneme conversion can ultimately contribute directly to speech synthesis performance.
FIG. 1 illustrates an overview of a user terminal and a speech synthesis system in an embodiment of the present invention.
FIG. 2 is a block diagram for explaining an internal configuration of a speech synthesis system according to an embodiment of the present invention.
FIG. 3 is a flow chart provided to illustrate a speech synthesis method based on a prosodic structure, in one embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of receiving text to be converted into speech in an embodiment of the present invention.
FIG. 5 is a diagram showing a prosodic structure composed of intonation phrases, accentual phrases, and clitics in one embodiment of the present invention.
FIG. 6 is a block diagram for explaining an example of an internal configuration of a computer system in an embodiment of the present invention.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
The embodiments can be applied to a speech synthesis system that converts text included in a document into speech. In particular, the grapheme-to-phoneme conversion method converts a character string written in orthography into a phoneme string based on a prosodic structure having intonation phrases (IP), accentual phrases (AP), and clitics (CL), and can also be applied to a speech recognition system that converts speech into text, in addition to a text-to-speech system.
In the case of the speech synthesis system, it can be used for services that convert into speech the texts representing example sentences, words, and idioms included in a dictionary database, services for reading aloud news articles and e-books, keyword search results, translation results, and the like.
In the present specification, 'prosody' refers to phenomena realized in speech, such as stress, length, rhythm, and intonation, which distinguish meaning above the level of individual phonemes. A 'word phrase' denotes a unit of text separated by spaces.
Korean prosody is generally known to consist of intonation phrases, accentual phrases, and clitics; however, most word phrases in text are reported to be realized as accentual phrases or intonation phrases.
In the present invention, the definition of the 'clitic' (CL) is made more precise, and it is proposed that predicting clitics, in addition to intonation phrases and accentual phrases, is essential for accurate grapheme-to-phoneme conversion of actual character strings.
Here, a clitic is a prosodically non-independent unit, meaning a word phrase that cannot form an independent prosodic word on its own. For example, clitics include word phrases that are written separately in orthography, such as bound nouns and auxiliary predicates.
The method proposed herein of converting a character string of text into a phoneme string and outputting it is expected to be usable not only for converting into speech text in languages such as Chinese or Japanese, which are not word-segmented in orthography, but also in languages such as German, which combine several words into compound words.
Based on the fact that the same string is pronounced differently according to different morpheme boundaries, the G2P (grapheme-to-phoneme) method generally assumes a morpheme as its basic unit. In Korean orthography, words or phrases are separated by spaces. K-ToBI (Korean Tone and Break Indices), generally used as the Korean prosodic annotation system, consists of hierarchical prosodic units: intonation phrases (IP) and accentual phrases (AP). According to the present invention, a new unit called the clitic (CL) can be added to the prosodic structure to form a hierarchy, where the clitic may form a lower level of the accentual phrase. Since one accentual phrase can form an intonation phrase, a clitic can in practice also occur inside an intonation phrase. Hereinafter, considering the characteristic that the actual pronunciation changes according to the prosodic phrase, the operation of converting text into speech using a prosodic structure with a hierarchy of intonation phrases, accentual phrases, and clitics as basic units will be described.
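The hierarchical prosodic structure described above (IP containing APs, an AP optionally containing CLs) can be modeled with a small set of nested types. This is an illustrative sketch only; the class names are assumptions, not terms from the patent, and the example phrases are placeholders for Korean word phrases.

```python
from dataclasses import dataclass, field
from typing import List

# Minimal data model of the hierarchy: IntonationPhrase > AccentualPhrase > Clitic.

@dataclass
class Clitic:
    text: str  # a prosodically non-independent word phrase

@dataclass
class AccentualPhrase:
    text: str                                    # the head word phrase
    clitics: List[Clitic] = field(default_factory=list)

@dataclass
class IntonationPhrase:
    aps: List[AccentualPhrase] = field(default_factory=list)

    def word_phrases(self):
        """Flatten the tree back to the surface word phrases, in order."""
        out = []
        for ap in self.aps:
            out.append(ap.text)
            out.extend(cl.text for cl in ap.clitics)
        return out

# One IP whose single AP carries a clitic, mirroring the structure of FIG. 5.
ip = IntonationPhrase(aps=[AccentualPhrase("doctor", clitics=[Clitic("seems-to-be")])])
print(ip.word_phrases())
```

Such a structure makes the key constraint of the text explicit: phonological rules apply across a clitic's boundary with its host, while AP and IP boundaries block them.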
FIG. 1 illustrates an overview of a user terminal and a text-to-speech system, that is, a speech synthesis system, in an embodiment of the present invention.
The
The
FIG. 2 is a block diagram for explaining an internal configuration of a speech synthesis system according to an embodiment of the present invention, and FIG. 3 is a flow chart of a speech synthesis method according to an embodiment of the present invention.
The
The
The text to be converted into speech may be loaded into the
The
The
The
The
The
The prosody
In
In
For example, when the sentence is loaded into the
In
For example, the prosody
In
FIG. 4 is a diagram illustrating an example of receiving text to be voice-converted in an embodiment of the present invention.
Referring to FIG. 4, when the dictionary application is executed in the
FIG. 5 is a diagram showing a prosodic structure composed of intonation phrases, accentual phrases, and clitics in one embodiment of the present invention.
The prosodic structure can be composed of three prosodic units (IP, AP, CL) with a hierarchical structure. Referring to FIG. 5, in the prosodic structure, one or more accentual phrases (AP) are located below an intonation phrase (IP), and one or more clitics (CL) can be located below an accentual phrase (AP). The word phrase (W) is the unit in which text is input. In actual pronunciation, when a word phrase is realized as an accentual phrase or an intonation phrase, its boundary is preserved, whereas in the case of a clitic, phonological rules are applied across the word-phrase boundary.
For example, as shown in FIG. 5, a sentence consisting of four word phrases (meaning roughly "seems to be a doctor") can be divided into its four word phrases. If the spacing is strictly observed and each word phrase is read one by one, each is pronounced in its citation form. However, when several of these word phrases are realized as a single intonation phrase, phonological rules apply at the word-phrase boundaries, and sound changes such as liaison are realized there. In other words, when two or more word phrases are read consecutively in actual pronunciation, sound changes may appear depending on the relation between the phoneme at the word-phrase boundary and the preceding sound. As described above, according to phonological rules predefined on the basis of the sound changes occurring in prosodic units, the grapheme-to-phoneme conversion unit can generate a phoneme string reflecting the actual pronunciation.
Table 1 below shows example pronunciations in the International Phonetic Alphabet generated by the conventional G2P method, which does not consider actual pronunciation or prosodic information.
In Table 1, it can be seen that tensification, liaison, and /n/-insertion occur at prosodic boundaries. For example: (a) in 'this weekend', tensification occurs at the word-phrase boundary in actual pronunciation; (b) 'Thursday morning' is pronounced with liaison; (c) 'around 8 a.m.' is pronounced with /n/-insertion and liaison; and (d) 'I think it will be' is pronounced with tensification.
The prosodic boundaries can include no boundary ('0'), a CL boundary ('1'), an AP boundary ('2'), and an IP boundary ('3'). The prosodic unit estimation unit can label the boundary of each word phrase of the text with one of these values.
For example, the prosody
Thus, the AP boundary '2' and the IP boundary '3' indicate that no sound change occurs between the given phoneme and its preceding phoneme, while the CL boundary '1' indicates a case where a sound change always occurs between the given phoneme and its preceding phoneme. No boundary ('0') generally allows all sound changes that occur between concatenated phonemes.
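The boundary-conditioned behavior just described can be sketched as a small conversion loop. This is a toy illustration under stated assumptions: `apply_sandhi` is a placeholder where real phonological rules (tensification, liaison, /n/-insertion) would be applied, and the phoneme symbols are arbitrary.

```python
# Boundary labels as in the text: 0 = no boundary, 1 = CL, 2 = AP, 3 = IP.
# AP/IP boundaries block sound change; CL boundaries always trigger it;
# no boundary allows whatever change the concatenated phonemes license.

ALWAYS_CHANGE = {0, 1}  # apply rules within a word and across CL boundaries

def apply_sandhi(prev_phone, phone):
    """Placeholder for a real phonological rule; here it just marks the
    phoneme as changed so the effect of the boundary label is visible."""
    return phone.upper()

def convert(phones, boundaries):
    """phones: phoneme list; boundaries[i]: label between phones[i-1] and phones[i]."""
    out = [phones[0]]
    for i in range(1, len(phones)):
        if boundaries[i] in ALWAYS_CHANGE:
            out.append(apply_sandhi(phones[i - 1], phones[i]))
        else:  # AP ('2') or IP ('3') boundary: keep the citation-form phoneme
            out.append(phones[i])
    return out

print(convert(["k", "a", "p", "o"], [0, 0, 1, 2]))
```

Note how the phoneme after the CL boundary ('1') is altered while the phoneme after the AP boundary ('2') is left unchanged, matching the behavior attributed to each label above.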
Table 2 below shows the number of syllables per intonation phrase (#syll/IP), per accentual phrase (#syll/AP), and per clitic (#syll/CL). The speech corpus used for the statistics in Table 2 consists of 5,915 sentences covering 87,465 word phrases (280,635 syllables), with an average of 14.79 word phrases per sentence. The speech was recorded by a female speaker, and two experts listened to the recordings and annotated the clitic, accentual-phrase, and intonation-phrase boundaries.
Table 3 below shows the distributions of clitics (CL), accentual phrases (AP), and intonation phrases (IP) resulting from the annotations described for Table 2.
According to Table 3, 11.24% of all word phrases correspond to clitic boundaries (CL). Only when a clitic is predicted as such can the phonological rules be applied across its boundary with the preceding or following unit to which it attaches; if correct prediction is not made, correct grapheme-to-phoneme conversion is impossible.
The performance of the prosodic unit prediction system can be confirmed by k-fold cross-validation based on the prosodic structure of intonation phrases, accentual phrases, and clitics; 10-fold cross-validation was performed on the above data.
The mean F1 scores for clitics, accentual phrases, and intonation phrases were 79.81%, 86.64%, and 75.24%, respectively, with an error rate of 18.53%.
For the evaluation of the G2P system, the ninth fold of the data used in the prosodic unit prediction above was used. Tables 4 and 5 below show statistical data for the ninth fold.
The performance of the grapheme-to-phoneme conversion system can be evaluated at the phoneme level, the syllable level, and the word level.
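The three evaluation levels named above can be computed with one generic accuracy routine applied to phoneme, syllable, and word-phrase sequences. A minimal sketch follows; the positional alignment (equal-length reference and hypothesis) is a simplifying assumption, and the romanized units are placeholders, not data from the patent's tables.

```python
# Accuracy over aligned units: the same function serves the phoneme,
# syllable, and word levels mentioned in the text.

def level_accuracy(ref_units, hyp_units):
    assert len(ref_units) == len(hyp_units), "sketch assumes 1:1 alignment"
    correct = sum(r == h for r, h in zip(ref_units, hyp_units))
    return correct / len(ref_units)

# One wrong phoneme propagates upward: it also breaks its syllable and word.
ref_phones = ["k", "a", "m", "s", "a"]
hyp_phones = ["k", "a", "m", "s", "o"]
ref_sylls, hyp_sylls = ["kam", "sa"], ["kam", "so"]
ref_words, hyp_words = ["kamsa"], ["kamso"]

print(level_accuracy(ref_phones, hyp_phones))  # 4/5 phonemes correct
print(level_accuracy(ref_sylls, hyp_sylls))    # 1/2 syllables correct
print(level_accuracy(ref_words, hyp_words))    # 0/1 words correct
```

The example shows why word-level scores are the strictest of the three: a single phoneme error lowers the word-level accuracy to zero for that word phrase.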
Table 6 below compares the performance of the conventional grapheme-to-phoneme conversion system, which converts text according to a prosodic structure based only on intonation phrases and accentual phrases, with that of the present invention, which converts text according to a prosodic structure based on intonation phrases, accentual phrases, and clitics.
Table 6 shows that the performance of the system of the present invention, which considers clitics at prosodic boundaries, is remarkably improved over the existing G2P system, which converts text based on a prosodic structure of intonation phrases and accentual phrases only. This is because the prediction of clitics, that is, word phrases pronounced within a single accentual phrase or intonation phrase, was included, and the correct pronunciation for them could therefore be generated.
The methods according to embodiments of the present invention may be implemented in the form of a program instruction that can be executed through various computer systems and recorded in a computer-readable medium.
The program according to the present embodiment can be configured as a PC-based program or an application dedicated to a mobile terminal. The service application for speech synthesis according to the present embodiment may be implemented as an independently operating program, or may be implemented as an in-app component of a specific application and operate on that application.
FIG. 6 is a block diagram for explaining an example of an internal configuration of a computer system in an embodiment of the present invention. The
The
The input /
The
The
FIG. 6 is merely an example of the
Embodiments of the present invention may include fewer operations or additional operations based on the details described with reference to FIGS. 1-6. In addition, two or more operations may be combined, and the order or positions of the operations may be changed.
As described above, according to embodiments of the present invention, sound changes are checked not only at intonation-phrase and accentual-phrase boundaries but also at clitic boundaries when generating the phoneme string, so that the pronunciation of the converted TTS speech can be output naturally, as if actually spoken.
The apparatus described above may be implemented as a hardware component, a software component, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device may be described as being used singly, but those skilled in the art will recognize that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may comprise a plurality of processors, or one processor and one controller. Other processing configurations, such as a parallel processor, are also possible.
The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired, or may command the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device, or may be permanently or temporarily embodied in a transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.
The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the embodiments, or those known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.
While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described methods, and/or components of the described systems, structures, devices, and circuits are combined or coupled in a form different from that described, or are replaced or substituted by other components or equivalents.
Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.
200: Speech synthesis system
211: Prosodic unit estimation unit
212: Grapheme-to-phoneme conversion unit
213: Voice output unit
Claims (15)
Receiving a text to be converted into speech;
Estimating a prosodic unit of the text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL); and
Converting a character string of the text into a phoneme string based on the estimated prosodic unit,
wherein the clitic
is a prosodically non-independent unit,
the accentual phrase
is composed of one or more word phrases or two or more clitics, and
the intonation phrase
is composed of one or more accentual phrases,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
Receiving a text to be converted into speech;
Estimating a prosodic unit of the text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL); and
Converting a character string of the text into a phoneme string based on the estimated prosodic unit,
wherein converting the character string into the phoneme string comprises
generating the phoneme string by mapping, to each corresponding word phrase, phonetic symbols that reflect the pronunciation changes occurring when two or more word phrases constituting the text are realized as one accentual phrase or intonation phrase,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
Receiving a text to be converted into speech;
Estimating a prosodic unit of the text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL); and
Converting a character string of the text into a phoneme string based on the estimated prosodic unit,
wherein estimating the prosodic unit comprises
estimating the prosodic unit for each word phrase constituting the text,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
Receiving a text to be converted into speech;
Estimating a prosodic unit of the text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL); and
Converting a character string of the text into a phoneme string based on the estimated prosodic unit,
wherein converting the character string into the phoneme string comprises
generating a phoneme string that contains the prosodic boundaries before and after the phonemes of each word phrase,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
wherein the prosodic boundary
comprises at least one of an IP boundary, an AP boundary, a CL boundary, and no boundary,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
wherein receiving the text to be converted comprises
receiving the text from a news article, a mobile translator, or a mobile dictionary,
in a method for converting a grapheme string into a phoneme string based on prosodic information.
A prosodic unit estimation unit for estimating a prosodic unit of a text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL);
A grapheme-to-phoneme conversion unit for converting a character string of the text into a phoneme string based on the estimated prosodic unit and converting the text into TTS (Text-To-Speech) speech based on the phoneme string; and
A voice output unit for outputting the TTS speech through a speaker of a user terminal,
wherein the clitic
is a prosodically non-independent unit,
the accentual phrase
is composed of one or more word phrases or two or more clitics, and
the intonation phrase
is composed of one or more accentual phrases,
in a speech synthesis system.
A prosodic unit estimation unit for estimating a prosodic unit of a text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL);
A grapheme-to-phoneme conversion unit for converting a character string of the text into a phoneme string based on the estimated prosodic unit and converting the text into TTS (Text-To-Speech) speech based on the phoneme string; and
A voice output unit for outputting the TTS speech through a speaker of a user terminal,
wherein the grapheme-to-phoneme conversion unit
generates the phoneme string by mapping, to each corresponding word phrase according to predefined phonological rules, phonetic symbols that reflect the pronunciation changes occurring when two or more word phrases constituting the text are realized as one accentual phrase or intonation phrase,
in a speech synthesis system.
A prosodic unit estimation unit for estimating a prosodic unit of a text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL);
A grapheme-to-phoneme conversion unit for converting a character string of the text into a phoneme string based on the estimated prosodic unit and converting the text into TTS (Text-To-Speech) speech based on the phoneme string; and
A voice output unit for outputting the TTS speech through a speaker of a user terminal,
wherein the prosodic unit estimation unit
estimates the prosodic unit for each word phrase constituting the text,
in a speech synthesis system.
A prosodic unit estimation unit for estimating a prosodic unit of a text based on a predefined prosodic structure built from an intonation phrase (IP), an accentual phrase (AP), and a clitic (CL);
A grapheme-to-phoneme conversion unit for converting a character string of the text into a phoneme string based on the estimated prosodic unit and converting the text into TTS (Text-To-Speech) speech based on the phoneme string; and
A voice output unit for outputting the TTS speech through a speaker of a user terminal,
wherein the grapheme-to-phoneme conversion unit
generates a phoneme string that contains the prosodic boundaries before and after the phonemes of each word phrase according to a predefined phonological rule,
in a speech synthesis system.
wherein the prosodic boundary
comprises at least one of an IP boundary, an AP boundary, a CL boundary, and no boundary,
in a speech synthesis system.
wherein the text is received from a news article, a mobile translator, or a mobile dictionary,
in a speech synthesis system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150111644A KR101735195B1 (en) | 2015-08-07 | 2015-08-07 | Method, system and recording medium for converting grapheme to phoneme based on prosodic information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020150111644A KR101735195B1 (en) | 2015-08-07 | 2015-08-07 | Method, system and recording medium for converting grapheme to phoneme based on prosodic information |
Publications (2)
Publication Number | Publication Date |
---|---|
KR20170017545A KR20170017545A (en) | 2017-02-15 |
KR101735195B1 true KR101735195B1 (en) | 2017-05-12 |
Family
ID=58111955
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020150111644A KR101735195B1 (en) | 2015-08-07 | 2015-08-07 | Method, system and recording medium for converting grapheme to phoneme based on prosodic information |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101735195B1 (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11443732B2 (en) * | 2019-02-15 | 2022-09-13 | Lg Electronics Inc. | Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium |
US11227578B2 (en) | 2019-05-15 | 2022-01-18 | Lg Electronics Inc. | Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium |
WO2020256170A1 (en) * | 2019-06-18 | 2020-12-24 | 엘지전자 주식회사 | Voice synthesis device using artificial intelligence, operation method of voice synthesis device, and computer-readable recording medium |
KR102281504B1 (en) * | 2019-09-16 | 2021-07-26 | 엘지전자 주식회사 | Voice sythesizer using artificial intelligence and operating method thereof |
WO2021071221A1 (en) * | 2019-10-11 | 2021-04-15 | Samsung Electronics Co., Ltd. | Automatically generating speech markup language tags for text |
US11380300B2 (en) | 2019-10-11 | 2022-07-05 | Samsung Electronics Company, Ltd. | Automatically generating speech markup language tags for text |
KR102222597B1 (en) * | 2020-02-03 | 2021-03-05 | (주)라이언로켓 | Voice synthesis apparatus and method for 'call me' service |
- 2015-08-07: KR application KR1020150111644A, patent KR101735195B1 (en), active IP Right Grant
Non-Patent Citations (1)
Title |
---|
Lim Ki-jung, Lee Jung-chul, 'A Study on Naturalness Improvement of HMM-based Korean TTS Using Prosodic Boundary Information', Journal of The Korea Society of Computer and Information, September 2012.* |
Also Published As
Publication number | Publication date |
---|---|
KR20170017545A (en) | 2017-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101735195B1 (en) | Method, system and recording medium for converting grapheme to phoneme based on prosodic information | |
JP7280386B2 (en) | Multilingual speech synthesis and cross-language voice cloning | |
US8990089B2 (en) | Text to speech synthesis for texts with foreign language inclusions | |
US11450313B2 (en) | Determining phonetic relationships | |
US8626510B2 (en) | Speech synthesizing device, computer program product, and method | |
JP2008134475A (en) | Technique for recognizing accent of input voice | |
Ekpenyong et al. | Statistical parametric speech synthesis for Ibibio | |
KR20220108169A (en) | Attention-Based Clockwork Hierarchical Variant Encoder | |
US9129596B2 (en) | Apparatus and method for creating dictionary for speech synthesis utilizing a display to aid in assessing synthesis quality | |
Sangeetha et al. | Speech translation system for english to dravidian languages | |
JP2008243080A (en) | Device, method, and program for translating voice | |
KR20230158603A (en) | Phonemes and graphemes for neural text-to-speech conversion | |
KR20080045413A (en) | Method for predicting phrase break using static/dynamic feature and text-to-speech system and method based on the same | |
Kayte et al. | A text-to-speech synthesis for Marathi language using festival and Festvox | |
Stöber et al. | Speech synthesis using multilevel selection and concatenation of units from large speech corpora | |
Lin et al. | Hierarchical prosody modeling for Mandarin spontaneous speech | |
KR101097186B1 (en) | System and method for synthesizing voice of multi-language | |
Alam et al. | Development of annotated Bangla speech corpora | |
US20220189455A1 (en) | Method and system for synthesizing cross-lingual speech | |
Watts et al. | based speech synthesis | |
JP2001117921A (en) | Device and method for translation and recording medium | |
Gros et al. | SI-PRON pronunciation lexicon: a new language resource for Slovenian | |
WO2010113396A1 (en) | Device, method, program for reading determination, computer readable medium therefore, and voice synthesis device | |
Lazaridis et al. | Comparative evaluation of phone duration models for Greek emotional speech | |
WO2023047623A1 (en) | Information processing device, information processing method, and information processing program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E902 | Notification of reason for refusal | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant |