KR101735195B1 - Method, system and recording medium for converting grapheme to phoneme based on prosodic information - Google Patents

Method, system and recording medium for converting grapheme to phoneme based on prosodic information Download PDF

Info

Publication number
KR101735195B1
KR101735195B1 KR1020150111644A KR20150111644A
Authority
KR
South Korea
Prior art keywords
unit
text
prosody
phoneme
string
Prior art date
Application number
KR1020150111644A
Other languages
Korean (ko)
Other versions
KR20170017545A (en)
Inventor
김선희
홍진표
김재민
Original Assignee
네이버 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 네이버 주식회사 filed Critical 네이버 주식회사
Priority to KR1020150111644A priority Critical patent/KR101735195B1/en
Publication of KR20170017545A publication Critical patent/KR20170017545A/en
Application granted granted Critical
Publication of KR101735195B1 publication Critical patent/KR101735195B1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

A method, system, and recording medium for converting a grapheme string to a phoneme string based on prosodic information are disclosed. A computer-implemented grapheme-to-phoneme conversion method comprises the steps of: receiving text to be converted; estimating a prosodic unit for each word phrase of the text based on a predefined prosodic structure built on intonation phrases (IP), accentual phrases (AP), and clitics (CL); and converting the grapheme string of the text into a phoneme string based on the estimated prosodic units.

Description

TECHNICAL FIELD [0001] The present invention relates to a method, system, and recording medium for converting a grapheme string to a phoneme string based on prosodic information.

Embodiments of the present invention relate to technology for converting a character string into a phoneme string based on a prosodic structure estimated from text.

Grapheme-to-phoneme conversion, which converts a character string into a phoneme string, is a core component of speech synthesis (text-to-speech) systems, and is also relevant to speech recognition technology.

Generally, prosody modeling in speech synthesis is an important factor that directly affects naturalness and intelligibility. Prosodic modeling depends on the prosodic characteristics of each individual language.

For example, English is a stress-accent language and requires modeling of sentence stress, intermediate phrases, and intonation phrases, whereas Japanese is a pitch-accent language and requires modeling of accentual phrases and their accents.

A method for predicting prosody for speech synthesis along these lines is disclosed in Korean Patent Laid-Open Publication No. 10-2006-0008330. In existing studies, however, grapheme-to-phoneme conversion has relied on morphological analysis, phonological knowledge, and rules, and the relation between prosodic information and the prosodic structure has not been considered.

In general, grapheme-to-phoneme conversion has applied phonological rules or pronunciation models with each word phrase of the text as the basic unit, independently of the prosodic structure.

Korean text is written in units of word phrases, but the actual pronunciation differs depending on which prosodic unit each word phrase is realized as within the prosodic structure.

Therefore, in order to generate an accurate phoneme string from a text string, the problem of how each word phrase is realized as a prosodic unit must be solved first.

For Korean, assuming a hierarchical prosodic structure of intonation phrases, accentual phrases, and clitics, a step of mapping each word phrase of the text to a prosodic unit is required. Accordingly, a method, system, and recording medium are proposed for converting the grapheme string into a phoneme string according to the prosodic units mapped to the word phrases.

Korean text consists of word phrase units. Based on the fact that pronunciation actually changes according to which prosodic unit is realized in the prosodic structure, grapheme-to-phoneme conversion should be performed after first estimating the prosodic unit of each word phrase.

A computer-implemented grapheme-to-phoneme conversion method comprises the steps of: receiving text to be converted; estimating a prosodic unit of the text based on a predefined prosodic structure built on intonation phrases (IP), accentual phrases (AP), and clitics (CL); and converting the grapheme string of the text into a phoneme string based on the estimated prosodic units.

A computer-readable medium comprises instructions for controlling a computer system to provide speech synthesis, the instructions comprising the steps of: receiving text to be converted into speech; estimating a prosodic unit of the text based on a predefined prosodic structure built on intonation phrases (IP), accentual phrases (AP), and clitics (CL); converting the grapheme string of the text into a phoneme string based on the estimated prosodic units; and converting the text into TTS (Text To Speech) audio based on the phoneme string.

The speech synthesis system includes: a memory into which text to be converted is loaded; a prosody unit estimation unit for estimating a prosodic unit of the text based on a predefined prosodic structure built on intonation phrases (IP), accentual phrases (AP), and clitics (CL); a phoneme string conversion unit for converting the grapheme string of the text into a phoneme string based on the estimated prosodic units; a speech synthesizer for converting the text into TTS (Text To Speech) audio based on the phoneme string; and a voice output unit for outputting the TTS audio through a speaker of the user terminal.

According to the embodiment of the present invention, each word phrase of the text is mapped to a prosodic unit according to a hierarchical prosodic structure of intonation phrases, accentual phrases, and clitics, and the grapheme string is then converted into a phoneme string, so that natural-sounding speech close to the actual pronunciation can be output.

According to the embodiment of the present invention, the improved quality of phonetic transcription and the improved performance of grapheme-to-phoneme conversion can ultimately contribute directly to speech synthesis performance.

FIG. 1 illustrates an overview of a user terminal and a speech synthesis system in an embodiment of the present invention.
FIG. 2 is a block diagram for explaining an internal configuration of a speech synthesis system according to an embodiment of the present invention.
FIG. 3 is a flow chart provided to illustrate a speech synthesis method based on a prosodic structure, in one embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of receiving text to be converted into speech in an embodiment of the present invention.
FIG. 5 is a diagram showing a prosodic structure composed of intonation phrases, accentual phrases, and clitics in one embodiment of the present invention.
FIG. 6 is a block diagram for explaining an example of an internal configuration of a computer system in an embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The embodiments can be applied to a speech synthesis system that converts text included in a document into speech. In particular, the grapheme-to-phoneme conversion method converts a character string written in orthography into a phoneme string based on a prosodic structure comprising Intonation Phrases (IP), Accentual Phrases (AP), and Clitics (CL), and can also be applied to a speech recognition system that converts speech to text, in addition to a text-to-speech system.

In the case of the speech synthesis system, it can be used to convert into speech text such as the examples and words/idioms included in a dictionary database, and can be used in services for reading out news articles, e-books, keyword search results, translation results, and the like.

In the present specification, 'prosody' refers to phenomena realized in speech, such as stress, length, rhythm, and intonation, and is an element that distinguishes meaning above the segmental (phonemic) level. The term 'word phrase' (eojeol) denotes a unit of text delimited by spaces.

The prosody of Korean is generally known to consist of intonation phrases and accentual phrases, and most word phrases in a text are reported to be realized as accentual phrases or intonation phrases.

In the present invention, the definition of the 'clitic' (CL) is made more precise, and it is proposed that predicting clitics, in addition to intonation phrases and accentual phrases, is essential for actual grapheme-to-phoneme conversion.

Here, a clitic is a prosodically non-independent unit, meaning a word phrase that cannot form an independent prosodic word. For example, clitics include word phrases such as bound (incomplete) nouns, which are written with spacing in the orthography but are not prosodically independent.

The method of converting the grapheme string of a text into a phoneme string proposed herein is expected to be usable not only for converting into speech text in languages such as Chinese or Japanese, which are not word-segmented in the orthography, but also in languages such as German, which combine several words into a single compound word.

Based on the fact that the same string is pronounced differently according to its morpheme boundaries, G2P (grapheme-to-phoneme) methods generally assume the morpheme as the basic unit. In Korean orthography, words or phrases are separated by spaces. K-ToBI (Korean Tone and Break Indices), which is commonly used as the Korean prosodic annotation system, consists of hierarchical prosodic units: Intonation Phrases (IP) and Accentual Phrases (AP). According to the present invention, a new unit called the Clitic (CL) can be added to this prosodic structure to form a hierarchy, in which the clitic forms a level below the accentual phrase. Since a single accentual phrase can itself form an intonation phrase, clitics can in practice also occur inside an intonation phrase. Hereinafter, considering that actual pronunciation changes according to the prosodic phrase, the operation of converting text into speech using the hierarchical prosodic structure of intonation phrases, accentual phrases, and clitics as the basic unit will be described.
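To make the hierarchy concrete, the IP > AP > CL nesting described above can be sketched as nested containers. This is an illustrative reconstruction rather than code from the patent; all class and field names are hypothetical:

```python
# Hypothetical sketch of the hierarchical prosodic structure IP > AP > CL.
# An intonation phrase (IP) contains one or more accentual phrases (AP),
# and an AP may in turn contain one or more clitic (CL) word phrases.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Clitic:                 # CL: prosodically non-independent word phrase
    text: str

@dataclass
class AccentualPhrase:        # AP: groups one or more clitic word phrases
    clitics: List[Clitic] = field(default_factory=list)

@dataclass
class IntonationPhrase:       # IP: groups one or more APs
    aps: List[AccentualPhrase] = field(default_factory=list)

    def word_phrases(self) -> List[str]:
        return [cl.text for ap in self.aps for cl in ap.clitics]

# A single AP can itself form an IP, so clitics can occur inside an IP.
ip = IntonationPhrase(aps=[AccentualPhrase(clitics=[Clitic("w1"), Clitic("w2")])])
```

The nesting mirrors FIG. 5: walking the tree from IP down to CL recovers the sequence of word phrases.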

FIG. 1 illustrates an overview of a user terminal and a text-to-speech system, that is, a speech synthesis system, in an embodiment of the present invention. FIG. 1 shows a speech synthesis system 100 and a user terminal 101, and the arrows indicate that data can be transmitted and received between the speech synthesis system 100 and the user terminal 101 via a wired/wireless network.

The user terminal 101 may be any terminal capable of accessing a web/mobile site related to the speech synthesis system 100 or of installing and executing a service-dedicated application (hereinafter, a 'service app'), such as a PC, a smartphone, or a tablet. The user terminal 101 can perform overall service operations such as service screen configuration, data input, data transmission/reception, and data storage under the control of the web/mobile site or the service app.

The speech synthesis system 100 serves as a service platform that converts a document selected or input by a user into speech and provides it to the user terminal 101 as a client. For example, the speech synthesis system 100 may provide to the user terminal 101 various services that convert text to speech, such as an e-book service, a translation service, a dictionary service, and a news reading service. The speech synthesis system 100 may be implemented as an application on the user terminal 101, but is not limited thereto; it may also be implemented as a service platform providing the service in a client-server environment.

FIG. 2 is a block diagram for explaining an internal configuration of a speech synthesis system according to an embodiment of the present invention, and FIG. 3 is a flow chart illustrating a speech synthesis method according to an embodiment of the present invention.

The speech synthesis system 200 according to the present embodiment may include a processor 210, a bus 220, a network interface 230, a database 240, and a memory 250. The memory 250 may include an operating system 251 and a service providing routine 252. The processor 210 may include a prosody unit estimation unit 211, a phoneme string conversion unit 212, and a voice output unit 213. In other embodiments, the speech synthesis system 200 may include more components than those shown in FIG. 2; however, most well-known prior-art components need not be explicitly illustrated.

The memory 250 may be a computer-readable recording medium and may include a permanent mass storage device such as a random access memory (RAM), a read only memory (ROM), and a disk drive. The memory 250 may also store program code for the operating system 251 and the service providing routine 252. These software components may be loaded from a computer-readable recording medium separate from the memory 250 using a drive mechanism (not shown). Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, or a memory card (not shown). In other embodiments, the software components may be loaded into the memory 250 via the network interface 230 rather than from a computer-readable recording medium.

The text to be converted into speech may be loaded into the memory 250. For example, text corresponding to a news article, an e-book, a translation, and a dictionary example selected by the user may be loaded into the memory 250.

The bus 220 may enable communication and data transmission between the components of the speech synthesis system 200. The bus 220 may be configured using a high-speed serial bus, a parallel bus, a Storage Area Network (SAN), and / or other suitable communication technology.

The network interface 230 may be a computer hardware component for coupling the speech synthesis system 200 to a computer network. The network interface 230 may connect the text-to-speech system 200 to a computer network via a wireless or wired connection.

The database 240 may store and maintain all information required to provide the speech synthesis service. In particular, the database 240 may store in advance prosodic-model information for converting text into phoneme strings, including prosodic-unit estimation results.

The database 240 may be included in the speech synthesis system 200, may be included in the user terminal 101, or both, as needed, or may be implemented as an external database.

The processor 210 may be configured to process instructions of a computer program by performing basic arithmetic, logic, and input/output operations of the speech synthesis system 200. The instructions may be provided to the processor 210 by the memory 250 or the network interface 230 via the bus 220. The processor 210 may be configured to execute program code for the prosody unit estimation unit 211, the phoneme string conversion unit 212, and the voice output unit 213. Such program code may be stored in a recording device such as the memory 250.

The prosody unit estimation unit 211, the phoneme string conversion unit 212, and the voice output unit 213 may be configured to perform steps 301 to 304 of FIG. 3.

In step 301, text to be converted into speech may be loaded into the memory 250. For example, as a dictionary-related service application is executed on the user terminal 101, the meaning and example sentences corresponding to a specific word selected by the user can be displayed on the screen of the user terminal 101. At this time, the text corresponding to the example sentence and its translation may be loaded into the memory 250.

In step 302, the prosody unit estimation unit 211 can estimate the prosodic unit of the text loaded in the memory 250 based on the prosodic structure predefined in the database 240. At this time, the prosody unit estimation unit 211 can estimate the prosodic unit for each word phrase constituting the text using a statistical decision model (CART: Classification And Regression Trees).

For example, when a sentence is loaded into the memory 250, the prosody unit estimation unit 211 may segment the text constituting the sentence into word phrases and classify each word phrase as an intonation phrase (IP), an accentual phrase (AP), or a clitic (CL). Here, the prosodic structure has a hierarchical structure of Intonation Phrases (IP), Accentual Phrases (AP), and Clitics (CL), and will be described in detail with reference to FIG. 5.
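As a rough sketch of this per-word-phrase classification, each word phrase can be mapped to features and assigned one of the three labels. The features and thresholds below are invented stand-ins for a trained CART model, not the patent's actual model:

```python
# Illustrative stand-in for CART-based prosodic-unit estimation (step 302).
# A real system would train a classification tree on annotated speech data;
# here the "learned" tree is faked with two hand-written rules.

def features(word, index, total):
    return {
        "length": len(word),            # short phrases tend to be clitics
        "is_last": index == total - 1,  # sentence-final phrase often closes an IP
    }

def estimate_prosodic_unit(word, index, total):
    f = features(word, index, total)
    if f["is_last"]:
        return "IP"   # hypothetical rule: last word phrase closes an IP
    if f["length"] <= 2:
        return "CL"   # hypothetical rule: very short phrases are clitics
    return "AP"

def estimate_sentence(words):
    n = len(words)
    return [(w, estimate_prosodic_unit(w, i, n)) for i, w in enumerate(words)]

print(estimate_sentence(["doctor", "to", "be-the-same"]))
```

The interface (word phrases in, per-phrase IP/AP/CL labels out) is the part that matches the text; the decision logic itself is a placeholder.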

In step 303, the phoneme string conversion unit 212 may generate a phoneme string that includes the estimated prosodic units, according to predefined phonological rules or a pronunciation modeling method. At this time, the phoneme string conversion unit 212 can generate a phoneme string that includes a prosodic boundary before or after each phoneme, depending on the relation between the phonemes contained in each word phrase and those of the preceding or following word phrase.

For example, the prosody unit estimation unit 211 may estimate a prosodic unit for each word phrase in the input text; each word phrase is estimated as one of the individual prosodic units, that is, an intonation phrase, an accentual phrase, or a clitic. Clitics together constitute a higher-level accentual phrase or intonation phrase. When two or more consecutive word phrases are classified as clitics, phonological phenomena such as tensification or liaison may occur between them, and the phoneme string can be generated according to the predefined phonological rules or pronunciation modeling method.

In step 304, the text may be converted into TTS (Text To Speech) audio based on the generated phoneme string, and the voice output unit 213 can output the resulting speech. The output speech may be transmitted to the user terminal 101 and played through a speaker provided in the user terminal 101.

FIG. 4 is a diagram illustrating an example of receiving text to be converted into speech in an embodiment of the present invention.

Referring to FIG. 4, when the dictionary application is executed on the user terminal 101 and the word 'doctor' is searched, the meaning and example sentences corresponding to 'doctor' may be displayed on the screen of the user terminal 101. At this time, when the voice conversion indicator 401 is selected, the text corresponding to the example sentence 'likely to be a doctor' may be loaded into the memory 250. The prosody unit estimation unit 211 can then estimate the prosodic units by segmenting the example sentence loaded in the memory 250 into word phrases. The phoneme string conversion unit 212 may then generate a phoneme string including prosodic boundaries according to the predefined phonological rules or pronunciation modeling method, based on the estimated prosodic units. For example, the prosody unit estimation unit 211 can estimate several of the word phrases as clitics that together constitute one higher-level prosodic unit.

FIG. 5 is a diagram showing a prosodic structure composed of intonation phrases, accentual phrases, and clitics in one embodiment of the present invention.

The prosodic structure can be composed of three prosodic units (IP, AP, CL) having a hierarchical structure. According to FIG. 5, in the prosodic structure, at least one accentual phrase (AP) is located below an intonation phrase (IP), and one or more clitics (CL) can be located below an accentual phrase (AP). The word phrase (W) is the unit in which the text is input. In actual pronunciation, a word phrase realized as an accentual phrase or intonation phrase has phonological rules applied word-internally, whereas a word phrase realized as a clitic has phonological rules applied across its word-phrase boundary.

For example, as shown in FIG. 5, the sentence 'seems likely to be a doctor' may be segmented into the four word phrases 'doctor', 'to be', 'to', and 'to be the same'. If the spacing is observed exactly and each word phrase is read out one by one, each is pronounced in its isolated citation form. However, when the word phrases 'to be', 'to', and 'to be the same' are realized as a single intonation phrase, phonological rules apply at the word-phrase boundaries, so that, for example, liaison is realized at the boundary. In other words, when two or more word phrases are read consecutively in actual pronunciation, pronunciation changes such as tensification may appear depending on the relation between the phoneme at the word-phrase boundary and the preceding sound. Accordingly, based on phonological rules predefined from the pronunciation changes occurring in prosodic units, the phoneme string conversion unit 212 can generate a phoneme string mapping pronunciation symbols to each word phrase, reflecting the pronunciation changes that occur when two or more word phrases are realized as one accentual phrase or intonation phrase. The phonological rules are modeled in advance from all rules that occur within accentual phrases and intonation phrases according to the prosodic structure of intonation phrases, accentual phrases, and clitics, and can be stored, maintained, and managed in the database 240.
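The boundary-sensitive rule application described above might be sketched as follows. The rule table and phoneme forms are invented placeholders; real Korean rules such as tensification and liaison would be modeled from annotated data:

```python
# Sketch: apply phonological rules only across clitic (CL) boundaries, while
# AP/IP boundaries keep each word phrase's citation pronunciation unchanged.
# The rule table below is a hypothetical placeholder, not real Korean phonology.

RULES = {
    # (final phoneme of previous phrase, initial phoneme of next phrase)
    #   -> (replacement for previous final, replacement for next initial)
    ("t", "i"): ("", "di"),   # liaison-like resyllabification (invented)
    ("k", "k"): ("k", "K"),   # tensification; 'K' marks a tense stop (invented)
}

def join_phrases(phrases, boundaries):
    """phrases: list of phoneme lists; boundaries[i]: label before phrase i+1."""
    out = list(phrases[0])
    for phrase, label in zip(phrases[1:], boundaries):
        nxt = list(phrase)
        if label == "CL":                 # rules apply across CL boundaries only
            key = (out[-1], nxt[0])
            if key in RULES:
                prev_rep, next_rep = RULES[key]
                out[-1:] = [prev_rep] if prev_rep else []
                nxt[0] = next_rep
        out += nxt
    return out

print(join_phrases([["o", "t"], ["i", "n"]], ["CL"]))
print(join_phrases([["o", "t"], ["i", "n"]], ["AP"]))
```

With a CL boundary the rule fires and the phonemes merge; with an AP boundary the two phrases are simply concatenated, matching the behavior the paragraph attributes to the two boundary types.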

Table 1 below shows example pronunciations, transcribed in the International Phonetic Alphabet, generated by the conventional G2P method, which does not consider actual pronunciation and prosodic information.

[Table 1: example pronunciations in the International Phonetic Alphabet; provided as an image in the original document]

In Table 1, it can be seen that tensification, liaison, and /n/-insertion occur at prosodic boundaries. For example: (a) 'this weekend' is pronounced with tensification at the word-phrase boundary; (b) 'Thursday morning' is pronounced with liaison; (c) 'around 8 a.m.' is pronounced with /n/-insertion and liaison; and (d) 'I think it's going to be' is pronounced with tensification.

The prosodic boundaries can include 'no boundary' (0), a CL boundary (1), an AP boundary (2), and an IP boundary (3). The prosody unit estimation unit 211 estimates the boundary label for each word phrase of the input text. Subsequently, the phoneme string conversion unit 212 can generate a phoneme string that includes a prosodic boundary before and after each phoneme according to the word phrase.

For example, the prosody unit estimation unit 211 receives the text 'this weekend' as the word phrases 'this' and 'weekend' and estimates the prosodic boundary corresponding to each word phrase. According to the estimated boundary information for each word phrase, the phoneme string conversion unit 212 generates, for a word-phrase-initial phoneme /i/, [2 i 0] when the word phrase is estimated to be an accentual phrase, and [1 i 0] when it is estimated to be a clitic, as in the actual pronunciation.

Thus, the AP boundary '2' and the IP boundary '3' indicate that no pronunciation change occurs between the given phoneme and its preceding phoneme, while the CL boundary '1' indicates that a pronunciation change always occurs between the given phoneme and its preceding phoneme. The 'no boundary' label '0' generally allows all pronunciation changes that occur between concatenated phonemes.
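The [left-boundary, phoneme, right-boundary] encoding can be sketched directly, using the boundary codes given in the text (0 none, 1 CL, 2 AP, 3 IP). The helper below is a minimal illustration, not the patent's implementation:

```python
# Encode each phoneme as [left_boundary_code, phoneme, right_boundary_code],
# as in the [2 i 0] / [1 i 0] examples: the code before a word phrase's first
# phoneme is its estimated boundary label; word-internal joints are coded 0.

CODES = {"none": 0, "CL": 1, "AP": 2, "IP": 3}

def encode(word_phonemes, boundary_labels):
    """word_phonemes: phoneme list per word phrase.
    boundary_labels: boundary label preceding each word phrase."""
    triples = []
    for phonemes, label in zip(word_phonemes, boundary_labels):
        for i, p in enumerate(phonemes):
            left = CODES[label] if i == 0 else 0   # boundary only phrase-initially
            triples.append([left, p, 0])           # simplification: right code 0
    return triples

# A word phrase starting with /i/, estimated as an AP vs. as a CL:
print(encode([["i"]], ["AP"]))
print(encode([["i"]], ["CL"]))
```

The AP case yields the [2 i 0] triple and the CL case the [1 i 0] triple from the paragraph above; always writing 0 as the right-hand code is a simplification of the full before-and-after scheme.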

Table 2 below shows the average number of syllables per accentual phrase (#syll/AP), per clitic (#syll/CL), and per intonation phrase (#syll/IP), and the average number of accentual phrases per intonation phrase (#AP/IP). The speech data used for the statistics in Table 2 consist of 5,915 sentences covering 87,465 word phrases (280,635 syllables), with an average of 14.79 word phrases per sentence. The speech was recorded by a female speaker, and two experts listened to the recordings and annotated the clitic, accentual-phrase, and intonation-phrase boundaries.

#syll/AP   #syll/CL   #syll/IP   #AP/IP
5.40       1.57       14.25      3.24

Table 3 below shows the distribution of clitic (CL), accentual-phrase (AP), and intonation-phrase (IP) boundaries resulting from the annotation described for Table 2.

Boundary   Ratio (count)
CL         11.24% (9,828)
AP         59.47% (52,014)
IP         29.29% (25,614)

According to Table 3, 11.24% of all word phrases correspond to a clitic boundary (CL). Where a clitic is predicted, phonological rules must be applied across the boundary with the preceding/following word phrase; if clitics are not predicted correctly, correct grapheme-to-phoneme conversion is impossible.

The performance of the prosodic-unit prediction system can be confirmed by k-fold cross-validation based on the prosodic structure of intonation phrases, accentual phrases, and clitics; here, 10-fold cross-validation was performed on the above data.

The average F-1 scores for clitics, accentual phrases, and intonation phrases were 79.81%, 86.64%, and 75.24%, respectively, at an error rate of 18.53%.
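For reference, the F-1 values reported here and in Table 5 are the harmonic mean of precision and recall; for example, the clitic row of Table 5 (precision 91.20%, recall 71.25%) yields an F-1 of 80.00%:

```python
# Harmonic-mean F-1, as used in the prosodic-unit prediction evaluation.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Clitic row of Table 5: precision 91.20%, recall 71.25% -> F-1 = 80.00%
print(round(f1(0.9120, 0.7125), 4))  # → 0.8
```

The AP row (83.17%, 91.67%) likewise reproduces the tabulated 87.21%.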

For the evaluation of the G2P system, the 9th fold of the data used in the prosodic-unit prediction above was used. Table 4 below shows statistics for the 9th fold, and Table 5 shows the prosodic-unit prediction performance.

Fold no.   Sentences   Word phrases   Phonemes
9          589         8,675          27,828

      Precision   Recall    F-1
CL    91.20%      71.25%    80.00%
AP    83.17%      91.67%    87.21%
IP    81.55%      71.38%    76.13%

The performance of the grapheme-to-phoneme conversion system can be evaluated at the phoneme level, the syllable level, and the word-phrase level.

Table 6 below compares the performance of the conventional grapheme-to-phoneme conversion system, which converts text according to a prosodic structure based on intonation phrases and accentual phrases only, with that of the system of the present invention, which converts text according to a prosodic structure based on intonation phrases, accentual phrases, and clitics.

                                  Phoneme   Syllable   Word phrase
Conventional G2P system           90.69%    78.38%     41.27%
System of the present invention   94.54%    87.19%     63.75%

Table 6 shows that the performance of the system of the present invention, which additionally considers clitics at prosodic boundaries, is remarkably improved over the existing G2P system, which converts text based on a prosodic structure of intonation phrases and accentual phrases only. This is because the prediction of clitics, that is, word phrases pronounced within a single accentual or intonation phrase, was included, and the correct pronunciation for them could accordingly be generated.
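The three evaluation levels in Table 6 can be sketched as exact-match accuracy over progressively smaller units. The segmentation convention below (word phrase → syllables → phonemes) is an assumption for illustration:

```python
# Sketch: evaluate a G2P hypothesis against a reference at three levels.
# Units are assumed pre-segmented: each word phrase is a list of syllables,
# each syllable a list of phonemes, e.g. [[['k','a'], ['i']], ...].

def flatten(seq):
    return [x for sub in seq for x in sub]

def accuracy(ref_units, hyp_units):
    correct = sum(r == h for r, h in zip(ref_units, hyp_units))
    return correct / max(len(ref_units), 1)

def evaluate(ref, hyp):
    ref_sylls, hyp_sylls = flatten(ref), flatten(hyp)
    return {
        "word": accuracy(ref, hyp),                  # whole word phrase must match
        "syllable": accuracy(ref_sylls, hyp_sylls),  # whole syllable must match
        "phoneme": accuracy(flatten(ref_sylls), flatten(hyp_sylls)),
    }

ref = [[["k", "a"], ["i"]], [["t", "o"]]]
hyp = [[["k", "a"], ["u"]], [["t", "o"]]]   # one wrong phoneme in the first word
print(evaluate(ref, hyp))
```

A single wrong phoneme hurts the word-level score most and the phoneme-level score least, which is consistent with the spread between the columns of Table 6.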

The methods according to embodiments of the present invention may be implemented in the form of a program instruction that can be executed through various computer systems and recorded in a computer-readable medium.

The program according to the present embodiment can be configured as a PC-based program or an application dedicated to a mobile terminal. The service application for speech synthesis according to the present embodiment may be implemented as an independently operating program, or in an in-app form of a specific application so as to operate on that application.

FIG. 6 is a block diagram for explaining an example of an internal configuration of a computer system in an embodiment of the present invention. The computer system 600 may include at least one processor 610, a memory 620, a peripheral device interface 630, an input/output subsystem 640, a power circuit 650, and a communication circuit 660. In this case, the computer system 600 may correspond to a user terminal.

The memory 620 may include, for example, high-speed random access memory, a magnetic disk, SRAM, DRAM, ROM, flash memory, or non-volatile memory. The memory 620 may include software modules, instruction sets, or various other data required for the operation of the computer system 600. Access to the memory 620 from other components, such as the processor 610 or the peripheral device interface 630, may be controlled by the processor 610. The text to be converted into speech may be loaded into the memory 620; for example, text such as dictionary entries and news articles selected by the user may be loaded into the memory 620.

The peripheral device interface 630 may couple the input and/or output peripheral devices of the computer system 600 to the processor 610 and the memory 620. The processor 610 may perform various functions and process data for the computer system 600 by executing a software module or a set of instructions stored in the memory 620.

The input / output subsystem 640 may couple various input / output peripheral devices to the peripheral interface 630. For example, the input / output subsystem 640 may include a controller for coupling a peripheral device such as a monitor, keyboard, mouse, printer, or a touch screen or sensor as needed, to the peripheral interface 630. According to another aspect, the input / output peripheral devices may be coupled to the peripheral device interface 630 without going through the input / output subsystem 640.

The power circuit 650 may provide power to all or some of the components of the terminal. For example, the power circuit 650 may include a power management system, one or more power sources such as a battery or alternating current (AC), a charging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other components for power generation, management, and distribution.

The communication circuit 660 may enable communication with other computer systems using at least one external port. Alternatively, as described above, the communication circuit 660 may include RF circuitry as needed and communicate with other computer systems by transmitting and receiving RF signals, also known as electromagnetic signals.

FIG. 6 shows merely an example of the computer system 600; the computer system 600 may include additional components not shown in FIG. 6, some components shown in FIG. 6 may be omitted, or two or more components may be combined. For example, a computer system for a mobile communication terminal may further include a touch screen, sensors, and the like in addition to the components shown in FIG. 6, and the communication circuit 660 may include circuitry for various communication methods (WiFi, Bluetooth, NFC, Zigbee, etc.). The components that may be included in the computer system 600 may be implemented in hardware, software, or a combination of both, including one or more signal-processing or application-specific integrated circuits.

Embodiments of the present invention may include fewer or additional operations based on the details described with reference to FIGS. 1 to 6. In addition, two or more operations may be combined, and the order or position of the operations may be changed.

The methods according to embodiments of the present invention may be implemented in the form of program instructions that can be executed through various computer systems and recorded in a computer-readable medium.

As described above, according to the embodiments of the present invention, pronunciation changes are checked at both strong and weak prosodic boundaries, so that phoneme changes that should not occur are not generated; by verifying these changes at each boundary and generating the phoneme string accordingly, the converted TTS voice can be output as naturally as actual speech.
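The boundary-sensitive conversion described above can be illustrated with a short sketch: a toy cross-word pronunciation rule is applied only across a weak (accentual-phrase) boundary and blocked across a strong (intonational-phrase) boundary, so the resulting phoneme string follows the estimated prosodic structure. The lexicon, the voicing rule, and the boundary labels below are illustrative assumptions, not the rule set disclosed in the patent.

```python
IP, AP = "IP", "AP"  # prosodic boundary strength labels (strong, weak)

# toy lexicon: graphemic word -> phoneme string pronounced in isolation
LEXICON = {"mat": "m a t", "above": "a b o v"}

def convert(words, boundaries):
    """boundaries[i] is the prosodic boundary after words[i] (len(words)-1 entries)."""
    phones = [LEXICON[w].split() for w in words]
    out = []
    for i, p in enumerate(phones):
        # toy sandhi rule: final /t/ voices to /d/ before a vowel,
        # but only across a weak (non-IP) prosodic boundary
        if (i < len(boundaries) and boundaries[i] != IP
                and p[-1] == "t" and phones[i + 1][0] in "aeiou"):
            p = p[:-1] + ["d"]
        out.extend(p)
        if i < len(boundaries):
            out.append(f"<{boundaries[i]}>")  # keep the boundary in the phoneme string
    return " ".join(out)

print(convert(["mat", "above"], [AP]))  # weak boundary: rule applies
print(convert(["mat", "above"], [IP]))  # strong boundary: rule blocked
```

With a weak AP boundary the rule fires ("m a d <AP> a b o v"); across a strong IP boundary the citation pronunciation is kept ("m a t <IP> a b o v").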

The apparatus described above may be implemented as a hardware component, a software component, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field-programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may execute an operating system (OS) and one or more software applications running on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing device is sometimes described as being used singly, but those skilled in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may comprise a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired, or may instruct the processing device independently or collectively. The software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or in a transmitted signal wave, so as to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed through various computer means and recorded in a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be those specially designed and configured for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine language code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

While the present invention has been particularly shown and described with reference to exemplary embodiments thereof, it is to be understood that the invention is not limited to the disclosed exemplary embodiments. For example, suitable results may be achieved even if the described techniques are performed in a different order, and/or if components of the described systems, structures, devices, or circuits are combined or arranged in a different form, or replaced or supplemented by other components or their equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims are also within the scope of the following claims.

200: Speech synthesis system
211: Prosody unit estimation unit
212: Grapheme-to-phoneme conversion unit
213: Voice output unit
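The numbered components above form a three-stage pipeline. The sketch below is a hypothetical skeleton of that flow (211 → 212 → 213); the class names, the one-phoneme-per-letter mapping, and the boundary heuristic are illustrative assumptions, since the patent discloses no API.

```python
class ProsodyUnitEstimator:                 # 211: prosody unit estimation unit
    def estimate(self, text):
        # Tag each word phrase with a toy prosodic-boundary label:
        # the final phrase closes an intonational phrase (IP),
        # every other phrase closes an accentual phrase (AP).
        words = text.split()
        return [(w, "IP" if i == len(words) - 1 else "AP")
                for i, w in enumerate(words)]

class GraphemeToPhonemeConverter:           # 212: grapheme-to-phoneme conversion unit
    def convert(self, units):
        # One phoneme per letter stands in for real G2P rules; the
        # estimated boundary labels are kept inside the phoneme string.
        return " ".join(" ".join(word) + f" <{boundary}>"
                        for word, boundary in units)

class VoiceOutputUnit:                      # 213: voice output unit
    def speak(self, phoneme_string):
        # Stand-in for waveform synthesis and speaker playback.
        return f"[TTS] {phoneme_string}"

units = ProsodyUnitEstimator().estimate("ab cd")
phonemes = GraphemeToPhonemeConverter().convert(units)
print(VoiceOutputUnit().speak(phonemes))  # prints "[TTS] a b <AP> c d <IP>"
```

The point of the design is that the boundary labels produced by stage 211 travel inside the phoneme string to stage 212, so pronunciation decisions can depend on prosodic structure.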

Claims (15)

1. A computer-implemented grapheme-to-phoneme conversion method, comprising:
receiving a text to be voice-converted;
estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL); and
converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit,
wherein the clitic is a prosodically non-independent unit,
the accentual phrase is composed of one or more word phrases or two or more clitics, and
the intonational phrase is composed of one or more accentual phrases,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
2. (Deleted)

3. A computer-implemented grapheme-to-phoneme conversion method, comprising:
receiving a text to be voice-converted;
estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL); and
converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit,
wherein converting the grapheme string into the phoneme string comprises:
generating the phoneme string by mapping, to each corresponding phrase, phonetic symbols reflecting the pronunciation changes that occur as two or more word phrases constituting the text are realized as one accentual phrase or one intonational phrase,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
4. A computer-implemented grapheme-to-phoneme conversion method, comprising:
receiving a text to be voice-converted;
estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL); and
converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit,
wherein estimating the prosodic unit comprises:
estimating the prosodic unit for each word phrase constituting the text,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
5. A computer-implemented grapheme-to-phoneme conversion method, comprising:
receiving a text to be voice-converted;
estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL); and
converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit,
wherein converting the grapheme string into the phoneme string comprises:
generating a phoneme string containing the phonemes in each word phrase and the prosodic boundaries before and after the phonemes,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
6. The method of claim 5,
wherein the prosodic boundary comprises:
at least one of an IP boundary, an AP boundary, a CL boundary, and a no-boundary,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
7. The method according to claim 1,
wherein receiving the text to be converted comprises:
receiving the text from a news article, a mobile translator, or a mobile dictionary,
the method being a grapheme-to-phoneme conversion method based on prosodic information.
8. A computer-readable recording medium storing a program for executing the method according to any one of claims 1 to 7.

9. A speech synthesis system comprising:
a memory into which a text to be voice-converted is loaded;
a prosody unit estimation unit estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL);
a grapheme-to-phoneme conversion unit converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit, and converting the text into a TTS (Text To Speech) voice based on the phoneme string; and
a voice output unit outputting the TTS voice through a speaker of a user terminal,
wherein the clitic is a prosodically non-independent unit,
the accentual phrase is composed of one or more word phrases or two or more clitics, and
the intonational phrase is composed of one or more accentual phrases.
10. (Deleted)

11. A speech synthesis system comprising:
a memory into which a text to be voice-converted is loaded;
a prosody unit estimation unit estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL);
a grapheme-to-phoneme conversion unit converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit, and converting the text into a TTS (Text To Speech) voice based on the phoneme string; and
a voice output unit outputting the TTS voice through a speaker of a user terminal,
wherein the grapheme-to-phoneme conversion unit generates the phoneme string by mapping, to each corresponding phrase, phonetic symbols reflecting the pronunciation changes that occur as two or more word phrases constituting the text are realized as one accentual phrase or one intonational phrase.
12. A speech synthesis system comprising:
a memory into which a text to be voice-converted is loaded;
a prosody unit estimation unit estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL);
a grapheme-to-phoneme conversion unit converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit, and converting the text into a TTS (Text To Speech) voice based on the phoneme string; and
a voice output unit outputting the TTS voice through a speaker of a user terminal,
wherein the prosody unit estimation unit estimates the prosodic unit for each word phrase constituting the text.
13. A speech synthesis system comprising:
a memory into which a text to be voice-converted is loaded;
a prosody unit estimation unit estimating a prosodic unit of the text based on a predefined prosodic structure composed of an intonational phrase (IP), an accentual phrase (AP), and a clitic (CL);
a grapheme-to-phoneme conversion unit converting a grapheme string of the text into a phoneme string based on the estimated prosodic unit, and converting the text into a TTS (Text To Speech) voice based on the phoneme string; and
a voice output unit outputting the TTS voice through a speaker of a user terminal,
wherein the grapheme-to-phoneme conversion unit generates, according to a predefined phonological rule, a phoneme string containing the phonemes in each word phrase and the prosodic boundaries before and after the phonemes.
14. The system of claim 13,
wherein the prosodic boundary comprises:
at least one of an IP boundary, an AP boundary, a CL boundary, and a no-boundary.
15. The system of claim 9,
wherein the text is received from a news article, a mobile translator, or a mobile dictionary.
KR1020150111644A 2015-08-07 2015-08-07 Method, system and recording medium for converting grapheme to phoneme based on prosodic information KR101735195B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020150111644A KR101735195B1 (en) 2015-08-07 2015-08-07 Method, system and recording medium for converting grapheme to phoneme based on prosodic information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020150111644A KR101735195B1 (en) 2015-08-07 2015-08-07 Method, system and recording medium for converting grapheme to phoneme based on prosodic information

Publications (2)

Publication Number Publication Date
KR20170017545A KR20170017545A (en) 2017-02-15
KR101735195B1 true KR101735195B1 (en) 2017-05-12

Family

ID=58111955

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020150111644A KR101735195B1 (en) 2015-08-07 2015-08-07 Method, system and recording medium for converting grapheme to phoneme based on prosodic information

Country Status (1)

Country Link
KR (1) KR101735195B1 (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11443732B2 (en) * 2019-02-15 2022-09-13 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
US11227578B2 (en) 2019-05-15 2022-01-18 Lg Electronics Inc. Speech synthesizer using artificial intelligence, method of operating speech synthesizer and computer-readable recording medium
WO2020256170A1 (en) * 2019-06-18 2020-12-24 엘지전자 주식회사 Voice synthesis device using artificial intelligence, operation method of voice synthesis device, and computer-readable recording medium
KR102281504B1 (en) * 2019-09-16 2021-07-26 엘지전자 주식회사 Voice sythesizer using artificial intelligence and operating method thereof
WO2021071221A1 (en) * 2019-10-11 2021-04-15 Samsung Electronics Co., Ltd. Automatically generating speech markup language tags for text
US11380300B2 (en) 2019-10-11 2022-07-05 Samsung Electronics Company, Ltd. Automatically generating speech markup language tags for text
KR102222597B1 (en) * 2020-02-03 2021-03-05 (주)라이언로켓 Voice synthesis apparatus and method for 'call me' service

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
임기정, 이정철, 'A Study on Improving the Naturalness of HMM-based Korean TTS Using Prosodic Boundary Information', Journal of The Korea Society of Computer and Information, September 2012.*

Also Published As

Publication number Publication date
KR20170017545A (en) 2017-02-15

Similar Documents

Publication Publication Date Title
KR101735195B1 (en) Method, system and recording medium for converting grapheme to phoneme based on prosodic information
JP7280386B2 (en) Multilingual speech synthesis and cross-language voice cloning
US8990089B2 (en) Text to speech synthesis for texts with foreign language inclusions
US11450313B2 (en) Determining phonetic relationships
US8626510B2 (en) Speech synthesizing device, computer program product, and method
JP2008134475A (en) Technique for recognizing accent of input voice
Ekpenyong et al. Statistical parametric speech synthesis for Ibibio
KR20220108169A (en) Attention-Based Clockwork Hierarchical Variant Encoder
US9129596B2 (en) Apparatus and method for creating dictionary for speech synthesis utilizing a display to aid in assessing synthesis quality
Sangeetha et al. Speech translation system for english to dravidian languages
JP2008243080A (en) Device, method, and program for translating voice
KR20230158603A (en) Phonemes and graphemes for neural text-to-speech conversion
KR20080045413A (en) Method for predicting phrase break using static/dynamic feature and text-to-speech system and method based on the same
Kayte et al. A text-to-speech synthesis for Marathi language using festival and Festvox
Stöber et al. Speech synthesis using multilevel selection and concatenation of units from large speech corpora
Lin et al. Hierarchical prosody modeling for Mandarin spontaneous speech
KR101097186B1 (en) System and method for synthesizing voice of multi-language
Alam et al. Development of annotated Bangla speech corpora
US20220189455A1 (en) Method and system for synthesizing cross-lingual speech
Watts et al. based speech synthesis
JP2001117921A (en) Device and method for translation and recording medium
Gros et al. SI-PRON pronunciation lexicon: a new language resource for Slovenian
WO2010113396A1 (en) Device, method, program for reading determination, computer readable medium therefore, and voice synthesis device
Lazaridis et al. Comparative evaluation of phone duration models for Greek emotional speech
WO2023047623A1 (en) Information processing device, information processing method, and information processing program

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E701 Decision to grant or registration of patent right
GRNT Written decision to grant