GB2378877A - Prosodic boundary markup mechanism - Google Patents

Info

Publication number
GB2378877A
Authority
GB
United Kingdom
Prior art keywords
text
prosodic
prosodic boundary
text portion
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0119842A
Other versions
GB0119842D0 (en
GB2378877B (en
Inventor
Peter Phelan
Kalika Bali
David Horowitz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Vox Generation Ltd
Original Assignee
Vox Generation Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Vox Generation Ltd filed Critical Vox Generation Ltd
Priority to GB0119842A priority Critical patent/GB2378877B/en
Publication of GB0119842D0 publication Critical patent/GB0119842D0/en
Priority to PCT/GB2002/003738 priority patent/WO2003017251A1/en
Publication of GB2378877A publication Critical patent/GB2378877A/en
Application granted granted Critical
Publication of GB2378877B publication Critical patent/GB2378877B/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

An automated prosodic boundary markup mechanism is implemented in a text-to-speech converter system. The mechanism parses text at 116 and assigns prosodic boundaries to the text at 118 based on the parse. However, this results in overmarking of prosodic boundaries. Prosodic boundary constraints are applied to the marked up text based on the parts of speech assigned at 114 to the words in the text portion to remove inappropriate boundaries. For example, a prosodic boundary is removed if it immediately follows a deaccented word, if there is only one word between the boundary and punctuation, if the boundary lies between two words of the same syntactic class, or if the boundary immediately follows a function word.

Description

PROSODIC BOUNDARY MARKUP MECHANISM
The present invention relates to an automated prosodic boundary markup mechanism, a method for automated prosodic boundary markup, a program element for prosodic boundary markup and a computer system configured to implement prosodic boundary markup. In particular, but not exclusively, the present invention relates to an automated prosodic boundary markup mechanism for an automated text-to-speech (TTS) converter.
The use of text-to-speech converters, sometimes referred to as text-to-speech synthesisers, is widespread in multimedia and telecommunications applications for oral and aural human-computer interaction. A text-to-speech converter is a computer-based system that is able to read aloud text that is input to it. Although recent progress in speech reproduction technology has produced TTS converters with a very high level of intelligibility, the sound quality and the naturalness or humanness of the speech still remain a major problem. Many, and increasingly more, TTS applications are implemented in multi-task scenarios where the human-machine communication takes place via audio channels, for example via a telephone call. Many such multi-task scenarios require a relatively long period of interaction between the human user and the machine, and whilst existing TTS systems can synthesise reasonably intelligible spoken responses, they nevertheless produce what is perceived by a user to be a monotonous, highly "mechanical" voice. Long periods of having to interact with such a "mechanical" voice often put users off using the application. Thus, "natural" sound quality and rhythm in synthesised speech responses are important in order for TTS applications to provide an enhanced user experience, and thereby encourage use of such applications.
It is an aim of an embodiment of the present invention to ameliorate and relieve one or more problems associated with known TTS converters, and to seek to improve the "natural" sound quality of synthesised speech.
Particular and preferred aspects of the invention are set out in the accompanying independent claims. Combinations of features from the dependent and/or independent claims may be combined as appropriate and not merely as set out in the claims.
The terms 'break' and 'boundary' are used substantially interchangeably herein.
In general terms, an initial syntactic estimation of the appropriate sites for prosodic breaks/boundaries in a text portion can be redefined, using accent-based and other constraints, to give an improved estimation of where these breaks/boundaries can be marked. This yields improved synthesised speech generated from such marked text.
Viewed from a first aspect, the present invention provides an automated prosodic boundary markup mechanism for an automated text-to-speech converter, the markup mechanism operable to perform a syntactic analysis on a text portion input thereto and to assign one or more prosodic boundaries to said text portion, the markup mechanism further operable to apply a constraint to said one or more prosodic boundaries and to remove a prosodic boundary satisfying said constraint.
Viewed from a second aspect, the present invention provides a method for configuring a computer system, including memory and a processor, to mark up a text portion input thereto with one or more prosodic boundaries, the method comprising configuring said computer system to: store said text portion in said memory; process said text portion in said processor to perform a syntactic analysis thereof and to assign one or more prosodic boundaries to said text portion; store in said memory prosodic boundary marks associated with said prosodic boundaries; process said text portion having prosodic boundaries assigned thereto to apply a constraint; and delete prosodic boundary marks satisfying said constraint.
An advantage of an embodiment in accordance with said first or second aspect of the present invention is that overmarking with prosodic boundaries, which typically occurs following a syntactic analysis of a text portion, is reduced. Thus, speech synthesised from a text portion processed in accordance with an embodiment of the present invention may be perceived to sound more like natural speech than known TTS converters, and thereby the intelligibility of the synthesised speech is improved.
Preferably, the prosodic boundary markup mechanism is operable to assign a prosodic boundary mark to each of said one or more prosodic boundaries, and to remove a prosodic boundary mark satisfying said constraint. The assigning of such prosodic boundary marks to identified prosodic boundaries within the text portion aids the automated analysis of the prosodic boundary content of the text portion, since such prosodic boundary marks may be identified by a machine and processed accordingly.
In a preferred embodiment, the prosodic boundary markup mechanism is operable to identify a deaccented word in said text portion and to apply a constraint comprising removing a prosodic boundary immediately subsequent to said deaccented word. Advantageously, inappropriate prosodic boundaries identified in the text portion on the basis of solely a grammatical syntactic analysis, and incorrectly associated with a deaccented word, may be removed.
Preferably, the prosodic boundary markup mechanism is operable to identify punctuation in said text portion, and to apply a constraint comprising removing a prosodic boundary conditional on there being only one word between a prosodic boundary and preceding or subsequent punctuation, thereby further removing inappropriate prosodic boundaries assigned by the syntactic analysis.
Preferably, the prosodic boundary markup mechanism is operable to assign a major syntactic class to one or more words in said text portion, and to apply a constraint comprising removing a prosodic boundary bounded by two words having the same major syntactic class assigned to them. Suitably, the major syntactic class is selected from the set comprising the following major syntactic classes: noun; verb; adjective and adverb. Thereby, further inappropriately assigned prosodic boundaries may be removed. The syntactic classes may be identified by part of speech (POS) tags assigned to the words.
Preferably, the prosodic boundary markup mechanism may be operable to identify one or more function words in said text portion, and to apply a constraint comprising removing a prosodic boundary immediately subsequent to a function word, which further removes inappropriately assigned prosodic boundaries.
Yet more preferably, the function words are identified in accordance with one or more function definitions, drawn from the Penn TreeBank tag set {dt, ex, ttin, ls, md, pdt, pos, prp, pps, rp, sym, tto, uh, vbp, wdt, wp, wpz}.
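By way of illustration only, the four removal constraints described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the mechanism itself: the representation of a boundary as the index of the word it follows, the order in which the constraints are tested, the `deaccented` index set (how deaccented words are identified is not given here), the punctuation set, and the reduced `FUNCTION_TAGS` set using standard upper-case Penn Treebank spellings are all hypothetical choices made for the example.

```python
# Illustrative subset of function-word tags (assumption: standard Penn Treebank
# spellings, not the exact tag set listed in the text).
FUNCTION_TAGS = {"DT", "EX", "IN", "MD", "PDT", "PRP", "RP", "TO", "UH", "WDT", "WP"}
PUNCTUATION = {",", ".", ";", ":", "?", "!"}  # assumed punctuation tokens
MAJOR_CLASS_PREFIXES = (("NN", "noun"), ("VB", "verb"),
                        ("JJ", "adjective"), ("RB", "adverb"))


def major_class(tag):
    """Map a POS tag to one of the four major syntactic classes, or None."""
    for prefix, name in MAJOR_CLASS_PREFIXES:
        if tag.startswith(prefix):
            return name
    return None


def one_word_from_punctuation(tokens, i):
    """True if exactly one word separates the boundary after tokens[i]
    from the preceding or the following punctuation token."""
    before = i - 1 >= 0 and tokens[i - 1][0] in PUNCTUATION
    after = i + 2 < len(tokens) and tokens[i + 2][0] in PUNCTUATION
    return before or after


def prune_boundaries(tokens, boundaries, deaccented=frozenset()):
    """tokens: list of (word, tag); boundaries: set of indices i, each meaning
    'break after tokens[i]'. Returns the boundaries that survive."""
    kept = set()
    for i in sorted(boundaries):
        tag = tokens[i][1]
        nxt_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        if i in deaccented:                        # follows a deaccented word
            continue
        if one_word_from_punctuation(tokens, i):   # one word from punctuation
            continue
        cls = major_class(tag)
        if nxt_tag and cls is not None and cls == major_class(nxt_tag):
            continue                               # same major class both sides
        if tag in FUNCTION_TAGS:                   # follows a function word
            continue
        kept.add(i)
    return kept


sentence = [("the", "DT"), ("old", "JJ"), ("man", "NN"), ("walked", "VBD"),
            ("slowly", "RB"), ("home", "NN"), (".", ".")]
print(prune_boundaries(sentence, {0, 2, 4}))
```

In this example the boundary after "the" is dropped by the function-word constraint and the boundary after "slowly" by the punctuation constraint, leaving only the boundary after "man".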
Preferably, the prosodic boundary markup mechanism is operable to determine a first number being the number of words between a first prosodic boundary and a second prosodic boundary being the next subsequent prosodic boundary from said first prosodic boundary; to determine a second number being the number of words between said second prosodic boundary and a third prosodic boundary being the next subsequent prosodic boundary from said second prosodic boundary; and to apply a constraint comprising removing said second prosodic boundary conditional on the ratio of said first number to said second number being greater than a predetermined threshold, suitably the threshold being two. Thereby, further inappropriately assigned prosodic boundaries may be removed.
More preferably, the foregoing constraint based on the ratio of the first number to the second number is applied conditional on the second number being less than four.
Thus, the constraint is only applied where there is only a relatively short distance between the second and third prosodic boundaries.
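The length-ratio constraint can be illustrated with a short sketch. It assumes, purely for the example, that each boundary is recorded as the count of words preceding it, so that the number of words between two boundaries is the difference of their positions. It also assumes a single left-to-right pass in which the "first" boundary is the most recently kept one; the text does not specify how successive removals interact, so that choice is hypothetical.

```python
def prune_by_length_ratio(boundaries, threshold=2.0, max_second=4):
    """Remove a middle boundary when the preceding span is more than
    `threshold` times longer than the following span, and the following
    span is shorter than `max_second` words.

    boundaries: iterable of word positions (assumed representation).
    """
    positions = sorted(set(boundaries))
    kept = positions[:1]                          # first boundary is always kept
    for k in range(1, len(positions) - 1):
        first = positions[k] - kept[-1]           # words before the candidate
        second = positions[k + 1] - positions[k]  # words after the candidate
        if second < max_second and first / second > threshold:
            continue                              # candidate boundary removed
        kept.append(positions[k])
    if len(positions) > 1:
        kept.append(positions[-1])                # last boundary is always kept
    return kept


print(prune_by_length_ratio([0, 7, 9, 15]))
```

With boundaries at positions 0, 7, 9 and 15, the boundary at 7 has seven words before it and two after it; the ratio 3.5 exceeds two and the second span is shorter than four words, so that boundary is removed.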
Suitably, the prosodic boundary markup mechanism is operable to assign a part of speech to one or more words, preferably each word, in said text portion, and to parse said text portion including said part of speech assigned words, thereby to perform said syntactic analysis on said text portion. Parsing a text portion on the basis of the parts of speech assigned to individual words in the text portion is a well-known technique, and for which automated processes are known.
Advantageously, the parts of speech assigned to each word may be used to determine a condition for applying a constraint, thereby utilising the results of the parts of speech analysis for two separate functions of the markup mechanism.
Suitably, the prosodic boundary markup mechanism is operable to implement an automated Brill Tagger mechanism to assign a part of speech to said one or more words in said text portion, which is a well known and available Tagger mechanism.
Yet more suitably, the prosodic boundary markup mechanism comprises a database including a Penn Treebank set of part of speech tags, and is operable to implement said automated Brill Tagger mechanism to interrogate said database to obtain a part of speech tag to assign to one or more words in said text portion. The use of the Penn TreeBank set of part of speech tags is advantageous in that it is a well known nomenclature for marking parts of speech, and is well understood by persons practised in the relevant art.
Preferably, the prosodic boundary markup mechanism is operable to parse said text portion including said part of speech assigned words to return a partial parse of said text portion, and yet more preferably to return the longest possible parse. Even more preferably a parsed sentence is returned. An advantage of a prosodic boundary markup mechanism in which a longest possible parse is sought to be returned, preferably a parsed sentence, is that there is a graceful degradation in the parse quality should it not be possible to provide a complete parse of a sentence. That is to say, a failure to provide a completely parsed sentence results in the next longest possible parse being returned. In a particular implementation a parse corresponding to the longest spanning edge from a vertex is returned. Such an implementation minimises the number of prosodic boundaries assigned to a text portion, thereby reducing the risk of inappropriate prosodic boundaries being assigned.
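One way to picture the "longest spanning edge from a vertex" fallback is the following sketch. It assumes a chart-parser style representation in which each edge is a hypothetical `(start, end, label)` triple over word positions, and it assumes a greedy walk that restarts from the end of each chosen edge; neither detail is specified in the text. A word with no edge starting at its position is emitted as an unlabelled single-word chunk.

```python
def longest_parse_chunks(edges, n_words):
    """Cover words 0..n_words with the longest edge available at each vertex.

    edges: iterable of (start, end, label) with end > start (assumed format).
    Returns a list of (start, end, label) chunks; label is None for a word
    that no edge covers from its own start position.
    """
    longest_from = {}
    for start, end, label in edges:
        best = longest_from.get(start)
        if end > start and (best is None or end > best[1]):
            longest_from[start] = (start, end, label)

    chunks, vertex = [], 0
    while vertex < n_words:
        edge = longest_from.get(vertex, (vertex, vertex + 1, None))
        chunks.append(edge)
        vertex = edge[1]        # continue from where the chosen edge ends
    return chunks


# A complete parse exists: one chunk spans the whole sentence.
print(longest_parse_chunks([(0, 2, "NP"), (0, 5, "S"), (2, 5, "VP")], 5))
# No complete parse: the result degrades to shorter chunks.
print(longest_parse_chunks([(0, 2, "NP"), (3, 5, "PP")], 5))
```

Fewer, longer chunks mean fewer chunk edges at which a boundary could be assigned, which is the effect the paragraph above attributes to returning the longest parse.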
Viewed from a third aspect, the present invention provides a computer system comprising memory, a processor and a prosodic boundary markup mechanism as described above.
Viewed from a fourth aspect, the present invention provides an automated text-to-speech converter system including a computer system as described above, a text input mechanism for the computer system and an audio output mechanism for outputting speech from the text-to-speech converter system. Suitably, the automated text-to-speech converter system includes a text source, and the computer system is operable to communicate with said text source to provide text to said text input mechanism. In particular embodiments, the text source may comprise a news report database, a sports report database, a children's story database or an e-mail database for example.
Suitably, the text input mechanism comprises a keyboard for the computer system, and the automated text-to-speech converter system may be operable to provide editing of text originating from a text source with or via the keyboard. Thus, a human operator may process text prior to it being input to the TTS converter. This is an important process since the better formed a text portion is, the better the text-to-speech conversion. For example, the text portion should be correctly marked in accordance with grammatical and punctuation rules, and have appropriate capitalisation. The text of a news report or sports report provided by a professional journalist, for example, is likely to be well formed. However, e-mails are unlikely to be so well formed and in accordance with grammatical rules, in particular where the author has adopted common e-mail shorthand conventions.
In a particularly suitable embodiment, the automated text-to-speech converter system is configured with an audio output mechanism which outputs u-law, A-law, or MPEG formatted audio. Thus, the audio output may be put onto the regular Public Switched Telephone Network (PSTN), or sent via a suitable digital communications conduit. Optionally, or additionally, the audio output mechanism may comprise a speaker.
Viewed from a fifth aspect, the present invention provides an automated text-to-speech converter including a prosodic boundary markup mechanism as described above.
Viewed from a sixth aspect, the present invention provides a user device which includes a prosodic boundary markup mechanism such as described above, or a text-to-speech converter such as referred to above. The user device may comprise a Personal Digital Assistant (PDA), a hand held computer, a mobile telephone or a laptop computer, for example. Thus, such user devices, which are preferably portable, are capable of providing text-to-speech conversion.
Viewed from a seventh aspect, the present invention provides a communications network which comprises an automated text-to-speech converter system, a network interface connecting the automated text-to-speech converter system to the network and a user device connected to the network. The user device may be such as described above, but in this aspect need not include the prosodic boundary markup mechanism, since it is connected to a text-to-speech converter via the network, and thereby will receive synthesised speech converted from text.
Particular embodiments of the present invention will now be described, by way of example only, and with reference to the following drawings, in which like numerals refer to like parts, and in which:
Figure 1 is a schematic representation of a computer system;
Figure 2 is a schematic and simplified representation of an implementation of the computer system of Figure 1;
Figure 3 is a schematic illustration of the main processes in a TTS converter system;
Figure 4 is a schematic illustration of a TTS converter system coupled by a network to user devices;
Figure 5 is a schematic illustration of a TTS converter system in accordance with an embodiment of the present invention;
Figure 6 is a flow diagram for an embodiment of the present invention;
Figure 7 is a flow diagram illustrating in more detail a part of the flow diagram of Figure 6;
Figure 8 is a flow diagram illustrating in more detail a part of the flow diagram of Figure 7; and
Figure 9 is a schematic illustration of the main processes of an embodiment of the present invention.
There will now be described examples of a computer system which may provide a platform for an implementation of an embodiment of the present invention. As will be evident to a person of ordinary skill in the art, processing platforms other than a computer system such as described below may be suitable for providing a platform for an embodiment of the present invention.
Referring now to Figure 1, there is a schematic representation of an illustrative example of a computer system 11. The computer system 11 comprises a system unit 12, a display device 18 with a display screen 20, and user input devices, including a keyboard 22 and a mouse 24. A printer 21 is also connected to the system.
Each system unit 12 comprises media drives, including an optical disk drive 14, a floppy disk drive 16 and an internal hard disk drive not explicitly shown in Figure 1. A CD-ROM 15 and a floppy disk 17 are also illustrated.
The basic operations of the computer system 11 are controlled by an operating system which is a computer program typically supplied already loaded into the computer system. The computer system may be configured to perform other functions by loading it with a computer program known as an application program, for example.
A computer program for implementing various functions or conveying various information may be supplied on media such as one or more CD-ROMs and/or floppy disks and then stored on a hard disk, for example. The computer system shown in Figure 1 may also be connected to a network, which may be the Internet or a local or wide area dedicated or private network, for example. A program or program element implementable by a computer system may also be supplied on a telecommunications medium, for example over a telecommunications network and/or the Internet, and embodied as an electronic signal. For a computer system 11 operating as a mobile terminal over a radio telephone network, the telecommunications medium may be a radio frequency carrier wave carrying suitably encoded signals representing the computer program and data or information. Optionally, the carrier wave may be an optical carrier wave for an optical fibre link or any other suitable carrier medium for a land line link telecommunication system.
Referring now to Figure 2, there is shown a schematic and simplified representation of an illustrative implementation of a computer system such as that referred to with reference to Figure 1. As shown in Figure 2, the computer system comprises various data processing resources such as a processor (CPU) 30 coupled to a bus structure 38. Also connected to the bus structure 38 are further data processing resources such as read only memory 32 and random access memory 34. A display adapter 36 connects a display device 18 to the bus structure 38. One or more user-input device adapters 40 connect the user-input devices, including the keyboard 22 and mouse 24, to the bus structure 38. An adapter 41 for the connection of the printer 21 may also be provided. One or more media drive adapters 42 can be provided for connecting the media drives, for example the optical disk drive 14, the floppy disk drive 16 and hard disk drive 19, to the bus structure 38. One or more telecommunications adapters 44 can be provided, thereby providing processing resource interface means for connecting the computer system to one or more networks or to other computer systems. The telecommunications adapters 44 could include a local area network adapter, a modem and/or ISDN terminal adapter, or serial or parallel port adapter etc, as required.
It will be appreciated that Figure 2 is a schematic representation of one possible implementation of a computer system, and that from the following description of embodiments of the present invention, the computer system in which the invention could be implemented may take many forms. For example, the computer system may be a non-PC type of computer which is Internet- or network-compatible, for example a Web TV, or set-top box for a domestic TV capable of providing access to a computer network such as the Internet.
Optionally, the computer system may be in the form of a wireless PDA or a multimedia terminal.
The main processes and functions for a TTS converter, such as may be implemented on a computer system platform as described above with reference to Figures 1 and 2, will now be described with reference to Figure 3 of the drawings.
The TTS converter 49 illustrated in Figure 3 comprises two main stages. The first stage is a text analysis module 50, comprising a text normalisation process 52 and a linguistic analysis process 54. A text portion 56, comprising a single sentence or multiple sentences, is input to the text analysis module 50 and undergoes a text normalisation process 52. Text normalisation, sometimes referred to as text pre-processing, involves breaking up the text portion string into individual words. For a language such as English this is relatively easy, as words are separated by spaces.
However, in an ideographic language such as Japanese or Chinese a more complicated set of rules could be involved. Text normalisation also involves converting abbreviations, non-alpha characters, numbers and acronyms into a fully spelt out form.
For example, "RD" would be converted to "road", and "$" to "dollar", a date such as "1997" would be converted to "nineteen ninety seven" or "one thousand nine hundred and ninety seven" (if it is a number), and "UN" would be converted to "United Nations". This is generally achieved by use of one-to-one look up dictionaries stored in a suitable database associated with a computer system, or through a set of rules that may take into account preceding and following contexts for a word. Optionally, control sequences are used to separate different modes. For example, a particular control signal may indicate a "mathematical mode" for numbers and mathematical expressions, another control sequence may indicate a "date mode" for dates, and a further control sequence may indicate an "e-mail mode" for e-mail specific characters.
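The dictionary look-up and date expansion described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the `ABBREVIATIONS` table holds only the three examples from the text, the `date_mode` flag stands in for the control sequences, and the rule that any four-digit token is read as a year (with "oh" and "hundred" readings for years such as 1905 and 1900) is a hypothetical simplification.

```python
ABBREVIATIONS = {"RD": "road", "$": "dollar", "UN": "United Nations"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def year_to_words(year):
    """Read a four-digit year as two pairs of digits (assumed convention)."""
    high, low = divmod(year, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def normalise(text, date_mode=True):
    """Split on spaces, expand known abbreviations and, in date mode,
    spell out four-digit numbers as years."""
    out = []
    for token in text.split():
        if token in ABBREVIATIONS:
            out.append(ABBREVIATIONS[token])
        elif date_mode and token.isdigit() and len(token) == 4:
            out.append(year_to_words(int(token)))
        else:
            out.append(token)
    return " ".join(out)


print(normalise("UN report 1997"))
```

A fuller normaliser would also need the context-sensitive rules mentioned above, for example to decide whether "1997" is a date or a quantity.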
The normalised text is fed to the linguistic analysis process 54, where it is analysed to derive phonetic information 58 and prosodic information 60 of input text portion 56.
The phonetic information 58 may be derived by linguistic analysis process 54 to provide a phonetic representation of the input text, for example by way of grapheme to phoneme conversion, which is important for deriving the correct pronunciation of words. Preferably, the grapheme to phoneme conversion is done by way of a combination of grammatical rules and a look up dictionary for exceptions to the grammar rules. The grammar typically consists of simple and complex rules to deal with all phonological variations in the language of the text to be converted. Additionally, a morphological and syntactic parser may also be utilised in order to derive a phonetic representation for certain pronunciations that are conditioned by morphosyntactic contexts. For example, such a parser may be used to resolve the ambiguity between the words "read" (present tense) and "read" (past tense) by way of analysis of the context in which the words are used. Idiosyncratic pronunciations, for example proper names, are stored in a dictionary. The linguistic analysis process 54 applies the grammar rules to the input text, and also checks the dictionary in order to identify any exceptions and to derive their phonetic representation.
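The combination of an exceptions dictionary and letter-to-sound rules might look like the following toy sketch. Everything in it is an assumption made for illustration: the handful of rules, the ARPAbet-style phoneme symbols, the use of a POS tag as part of the dictionary key to separate the two readings of "read", and the behaviour of silently skipping letters that no rule covers.

```python
# Exceptions are keyed by (word, POS tag); a tag of None matches any context.
EXCEPTIONS = {
    ("read", "VBD"): ["R", "EH", "D"],          # past tense
    ("read", "VBN"): ["R", "EH", "D"],          # past participle
    ("colonel", None): ["K", "ER", "N", "AH", "L"],
}

# Toy letter-to-sound rules; two-letter graphemes are tried before single letters.
RULES = {"sh": "SH", "ee": "IY", "ea": "IY",
         "p": "P", "r": "R", "d": "D", "s": "S", "h": "HH", "e": "EH"}


def to_phonemes(word, tag=None):
    """Return a phoneme list: dictionary exceptions first, then rules."""
    word = word.lower()
    for key in ((word, tag), (word, None)):
        if key in EXCEPTIONS:
            return EXCEPTIONS[key]

    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):                     # longest grapheme first
            chunk = word[i:i + size]
            if len(chunk) == size and chunk in RULES:
                phones.append(RULES[chunk])
                i += size
                break
        else:
            i += 1                              # no rule applies: skip the letter
    return phones


print(to_phonemes("sheep"))
print(to_phonemes("read", "VB"), to_phonemes("read", "VBD"))
```

The present-tense "read" falls through to the rules, while the past-tense form is caught by the dictionary, which mirrors the division of labour described above.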
The linguistic analysis process 54 also provides prosodic information 60 regarding the input text portion 56. Prosodic information relates to duration, intonation and rhythm of a language, and the linguistic analysis process 54 seeks to provide prosodic information 60 regarding such aspects of the input text 56. However, determining prosody from raw text is extremely difficult because indications of prosodic features are not always marked on text. Additionally, prosody is linked with the fundamental frequency contour at different levels in the hierarchy in speech, i.e. words, phrases and boundaries in speech. Fundamental frequency is the frequency of vibration of the vocal cords, which is usually represented as a number of cycles of variation in air pressure per second. In this regard, fundamental frequency is the acoustic correlate of pitch. An increase in pitch corresponds to an increase in the fundamental frequency. Prosody is typically manifested in terms of pitch accents on words, phrase accents on phrases and boundary tones at the end of major breaks in the prosodic structure. Each of these prosodic "events" or "boundaries" is associated with some change in the movement of the fundamental frequency. Prosody markup in TTS conversion involves the marking up of such events or boundaries on the text, which are later mapped onto acoustic parameters related to pause durations and change in the fundamental frequency contours. A significant problem in predicting prosodic events or boundaries lies in the fact that many of them are governed by extra-linguistic factors such as a speaker's attitude or emotive state.
The phonetic information 58 and prosodic information 60 derived by the linguistic analysis process 54 are forwarded to speech generation module 62.
As described above, the term prosody refers to those features of speech that deal with pitch (fundamental frequency), emphasis (amplitude), and length (duration).
Prosodic features have specific functions in speech communication. The most common function is to group words together in order to facilitate the interpretation of the meaning of a sentence. For example, the different groupings of the words in the two sentences below cause a change in the meaning of the sentence:
(1) John said James is a liar.
(2) John, said James, is a liar.
In the first sentence, John is identifying James as a liar. In the second sentence, it is James who is identifying John as a liar.
Through tonal variations on a word (pitch accent) it is possible to emphasise what is important in a particular sentence, as in:
(3) I want to go to New York.
(4) I want to go to New York.
(5) I want to go to New York.
In a spoken language interface, for example, a conversation between two human beings or an interaction between a human being and a TTS conversion system, a difference between an ordinary statement and a question can only be understood by using different tones at the end of the sentences:
(6) That is your book.
(7) That is your book?
Without any tonal variation between sentences (6) and (7), the different meanings of the sentences cannot be determined. Thus, an incorrect prosodic analysis may cause a TTS conversion system to produce sentences that are strange at best, and at worst unintelligible.
Text may be labelled in order to identify its prosodic structure. An example of such prosodic labelling is the Tones and Break Indices (ToBI) annotation protocol for prosody based on Pierrehumbert's theory of English intonation, see Pierrehumbert, J. B. (1980) The Phonology and Phonetics of English Intonation, PhD Thesis, The Massachusetts Institute of Technology. Distributed by the Indiana University Linguistics Club.
Pierrehumbert's intonation description is used as a standard to label prosodic boundaries at different levels in English, and other languages also. An example of the ToBI nomenclature is laid out below:
1. Pitch accents: These are local maxima or minima in the fundamental frequency contour associated with intonationally prominent words in a speech utterance. There are six pitch accents marked on a word:
   a) L*: Low tone (f0 valley) aligned with the stressed syllable
   b) H*: High tone (f0 peak) aligned with the stressed syllable
   c) LH*: Low tone followed by high tone aligned with the stressed syllable
   d) L*H: Low tone aligned with the stressed syllable followed by a high tone
   e) HL*: High tone followed by a low tone aligned with the stressed syllable
   f) H*L: High tone aligned with the stressed syllable followed by a low tone.
2. Phrase accents: A simple rise or fall in the fundamental frequency contour over a minor prosodic phrase. There are two phrase accents:
   a) L-: a valley extending from the preceding pitch accent to the end of the prosodic phrase.
   b) H-: a plateau extending from the preceding pitch accent to the end of the prosodic phrase.
3. Boundary tones: A rise or fall in the fundamental frequency contour at the end of a major prosodic phrase. The value of a boundary tone is higher than the corresponding phrase accent. There are two boundary tones:
   a) L%: a valley extending from the preceding pitch accent to the end of the prosodic phrase.
   b) H%: a rise at the end of a major phrase.
4. Minor break (Intermediate Phrase end): One or more pitch accents plus a phrase accent, and is marked by L- or H-.
5. Major break (Intonational Phrase end): These consist of one or more intermediate/minor phrases plus a boundary tone. Thus, the end of a major phrase always corresponds with the end of a minor phrase, but not vice versa. These are marked by L- or H-, followed by either L% or H%. This gives a total of four possible major break markers: L-L%, L-H%, H-L%, H-H%.
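The way the break markers in the nomenclature above combine can be made concrete with a short sketch. The constant and function names are hypothetical; the only facts used are those stated above, namely that a minor break is marked by a phrase accent alone and a major break by a phrase accent followed by a boundary tone.

```python
from itertools import product

PHRASE_ACCENTS = ("L-", "H-")    # mark the end of a minor (intermediate) phrase
BOUNDARY_TONES = ("L%", "H%")    # added at the end of a major (intonational) phrase

MINOR_BREAKS = list(PHRASE_ACCENTS)
MAJOR_BREAKS = [accent + tone
                for accent, tone in product(PHRASE_ACCENTS, BOUNDARY_TONES)]


def break_level(marker):
    """Classify a break marker as 'major', 'minor', or None if unrecognised."""
    if marker in MAJOR_BREAKS:
        return "major"
    if marker in MINOR_BREAKS:
        return "minor"
    return None


print(MAJOR_BREAKS)
print(break_level("H-"), break_level("L-H%"))
```

Because every major marker begins with a phrase accent, the sketch also reflects the point that the end of a major phrase always coincides with the end of a minor phrase.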
An important step for marking prosody on raw text is to group the words together into prosodic phrases such as intonational phrases and/or intermediate phrases. The simplest and typically the most common way of doing this is to insert prosodic boundaries corresponding to all punctuation marks. Such a method is arranged to predict major breaks at the following punctuation ["?", "!", ".", "...", ":", ";"], and to predict minor breaks at [",", "(", ")", "-"]. Although the error rate for this method is low, as there are always prosodic breaks at locations of punctuation, it tends to predict too few boundaries and leaves out many prosodic boundaries that might occur at places not marked by punctuation.
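The punctuation-only method described above can be sketched in a few lines. This is an illustration under stated assumptions, not the prior-art implementation: the input is assumed to be already tokenised, the function name is hypothetical, and the full stop is assumed to belong to the major-break set (the corresponding entry in the source list is illegible).

```python
# Punctuation sets as listed above. The "." entry is an assumption.
MAJOR_PUNCTUATION = {"?", "!", ".", "...", ":", ";"}
MINOR_PUNCTUATION = {",", "(", ")", "-"}


def punctuation_breaks(tokens):
    """Return (token_index, kind) for every punctuation token that triggers a break."""
    breaks = []
    for index, token in enumerate(tokens):
        if token in MAJOR_PUNCTUATION:
            breaks.append((index, "major"))
        elif token in MINOR_PUNCTUATION:
            breaks.append((index, "minor"))
    return breaks


tokens = "When it rains , we stay in .".split()
print(punctuation_breaks(tokens))  # [(3, 'minor'), (7, 'major')]
```

No break is predicted anywhere between words, which is the under-prediction the passage describes.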
Another method for marking prosodic breaks for TTS conversion is to mark minor breaks at punctuation and between a content word and a function word, and major breaks at the end of the sentence. However, this causes the opposite problem to the foregoing punctuation-only method: always inserting a break between a content word and a function word produces too many breaks. Furthermore, this method does not take into account the fact that certain function words, like particles, may be accented, or that certain content words, like phrasal verbs, are de-accented. Nor does the method take into account clitics, that is, phonologically weakened (reduced, de-accented) words that phonologically form a part of the adjacent content word. Although the two foregoing methods have drawbacks, they are nevertheless capable of at least some level of automation since they rely upon the identification of the grammatical or syntactic construction of a sentence.
Another method involves the manual labelling of data sets by experienced human labellers, and then training prosodic models on such data sets. A typical method involves training prosodic models on a simplified ToBI annotated data set. Although these methods work satisfactorily, since the human labellers can use an understanding of the semantic content or context of words to determine the prosody labelling, they do require a large investment of resources. Additionally, the human labellers can disagree on the prosody annotation and marking, and fairly large training sets are necessary in order to produce highly predictive models. Thus, although human labellers can provide improved prosody marking, there is an associated risk of inconsistency in such labelling.
Embodiments of the present invention seek to address, and preferably overcome, at least one of the drawbacks associated with the prior art, by providing an automated prosody boundary markup mechanism which seeks to constrain prosody over-marking resulting from syntactic analysis of a text portion. A broad outline of an illustrative embodiment of a prosodic boundary markup mechanism 78, in accordance with an aspect of the present invention, is illustrated in Figure 9. A text portion 56 is input to the markup mechanism 78 by way of a suitable text input mechanism, for example an interface to a text source database comprising text files, or by way of text typed in via a keyboard. The text portion should be unformatted, that is to say it should not include special text such as underlining, emboldening, italicisation, and different sized fonts, for example, and should merely comprise plain text such as ASCII, or as supported by Unicode tables for example. However, more complex text sources, such as (Microsoft) WORD documents, may be used provided they are arranged in plain text form. Additionally, the text portion 56 should also be, or have been, normalised by a text normalisation process such as described above with reference to Figure 3. The text portion may comprise a single sentence, or multiple sentences of a corpus of text.
The normalised text portion is input to a Part Of Speech (POS) tagger, where the text portion is tagged with POS tags. The tagged text portion is input to a parser 82 which performs a syntactic analysis on the tagged text in order to obtain "chunks", or groups of words that form a syntactic phrase. The chunks correspond to the way words are grouped together in spoken utterances. Whilst, as mentioned above, research has indicated that syntactic factors such as parts of speech do influence a grouping of words into prosodic phrases, there is not always a one-to-one correspondence between a syntactic phrase and a prosodic phrase. A prosodic boundary marked at the end of each syntactic phrase would over-predict the number of prosodic phrases. Thus, the number of phrase boundaries predicted on the basis of syntactic phrases generated by the chunking parser 82 is greater than the actual number of prosodic phrase boundaries for a given text portion. Using the results of such a chunking parser in a TTS converter in order to synthesise speech would result in strange, unnatural, and possibly unintelligible speech. A markup mechanism in accordance with the present invention applies one or more constraints to the output of chunking parser 82 in order to reduce the over-prediction of prosodic phrases.
In the embodiment illustrated in Figure 9, a constraints module 84 provides information regarding the prosodic quality of the text portion that influences the occurrence of a potential prosodic boundary, and can therefore be used to modify the output of parser 82. In a particular example, the constraints module 84 receives the text portion tagged up with POS tags, and determines which words are accented or deaccented on the basis of their associated POS tag. A word which is deaccented is one which is not accented.
The text portion marked up with prosodic boundaries based on the syntactic analysis performed in parser 82 is then forwarded to a prosodic boundary elimination module 86, together with the identification of deaccented words from constraints module 84. Prosodic boundary elimination module 86 is configured to remove or eliminate prosodic boundaries assigned by the chunking parser 82 which follow deaccented words as identified by constraints module 84. A prosodic boundary following a deaccented word is considered illegal. Other constraints influencing the potential prosodic boundaries of the text portion may also be determined and applied in respective modules 84 and 86. Having eliminated prosodic boundaries identified as being illegal in module 86, the remaining prosodic boundaries or breaks are classified as major or minor breaks, 88.
Next, all the major/minor breaks and accented words are assigned appropriate ToBI labels, 90, such as the annotations defined by Pierrehumbert referred to above. The ToBI marked up text portion is output, 92, to a speech synthesiser unit for producing synthesised speech.
Synthesised speech derived from a text portion marked up by a markup mechanism in accordance with an embodiment of the invention as illustrated in Figure 9 better simulates natural speech, and improves the intelligibility of the synthesised speech.
A TTS converter system may be used for a number of applications. A particularly useful application is to provide speech output to users who are in environments or situations where reading text is inappropriate or not possible, or where a user may wish to listen to speech whilst performing some other task.
Although a suitable TTS conversion system may be implemented on a user device, such as a personal computer, personal digital assistant, mobile phone, hand-held or laptop computer, the processing overhead for TTS conversion is generally high, and therefore a particularly suitable application for a TTS converter system is one in which it is connected to user devices by way of a communications network. However, the use of a TTS converter within user devices themselves is not precluded from falling within the scope of the present invention.
An illustrative example of a TTS converter system for a network environment is illustrated in Figure 4. TTS converter system 90 is configured to operate as a server for user devices wishing to receive synthesised speech output. The TTS converter system 90 is connected to a text source 92 including databases of various types of text material, such as e-mail, news reports, sports reports and children's stories. Each text database may be coupled to the TTS converter system 90 by way of a suitable server. For example, an e-mail database may be connected to TTS converter system 90 by way of a mail server 92(1) which forwards e-mail text to the TTS converter system. Suitable servers such as a news server 92(2) and a story server 92(n) are also connected to the TTS converter system 90. The output of the TTS converter system is forwarded to the communications network 96 via a network interface 94. The communications network may be any suitable communications network, or combination of suitable communications networks, for example Internet backbone services, the Public Switched Telephone Network (PSTN) or cellular radio telephone networks. Various user devices may be connected to the communications network 96, for example a personal computer 98, a regular landline telephone 100 or a mobile telephone 102. Other sorts of user devices may also be connected to the communications network 96. The user devices 98, 100, 102 are connected to the TTS converter system 90 via communications network 96 and network interface 94. In the particular example illustrated in Figure 4, network interface 94 is configured to receive requests from user devices 98, 100 and 102 for speech corresponding to a particular text source 92. For example, a user of mobile telephone 102 may request, via network interface 94, their e-mails. Upon receiving such a request, network interface 94 accesses mail server 92(1) to cause the requested e-mail(s) to be forwarded to the TTS converter system 90. The e-mails are converted into speech and forwarded to network interface 94, from where they are communicated back to the user via mobile telephone 102.
Optionally, the network interface may be connected to the text source 92 by way of the TTS converter system 90 which controls access to the various text source servers, retrieves requested text sources and converts them into speech for output to communications network 96 via network interface 94. As will be evident to persons of ordinary skill in the art, other configurations and arrangements may be utilised and embodiments of the invention are not limited to the arrangement described with reference to Figure 4.
An illustrative embodiment of a TTS converter system 90 will now be described with reference to Figure 5 of the drawings. A text source 92 supplies a portion of text to tokeniser module 112, either directly or via editing workstation 110. As described above, the text portion should be unformatted, and preferably well-structured. Via editing workstation 110 a human operator may edit a text portion from text source 92 in order to ensure that it is well formed. For example, proper capitalisation may be inserted, and the text portion edited to ensure that it conforms with grammatical and punctuation rules. Special formatting such as underlining, emboldening, italicisation, etc. may also be removed at this stage. Optionally, such formatting may be automatically removed if the appropriate control characters are recognised by the system.
The text portion, arriving at the tokeniser module 112 either directly or via the editing workstation 110, is then processed in order to insert spaces between words and punctuation. Additionally, the location and type of punctuation is noted in order that it can be forwarded on for use in the prosodic break insertion module 118, described later.
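The tokenising step lends itself to a short illustration. The sketch below is one possible way of inserting spaces between words and punctuation while recording the location and type of each punctuation mark; the regular expression and the function name are assumptions, not the behaviour of module 112 itself.

```python
import re

# A word (allowing internal hyphens or apostrophes), an ellipsis, or any
# single non-space, non-word character. This pattern is an assumption.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|\.\.\.|[^\w\s]")


def tokenise(text):
    """Return the space-separated token string and a list of (index, mark) pairs."""
    tokens = TOKEN_RE.findall(text)
    # A token is treated as punctuation when it does not start with a word character.
    punctuation = [(i, tok) for i, tok in enumerate(tokens) if re.match(r"\w", tok) is None]
    return " ".join(tokens), punctuation


spaced, marks = tokenise('a "lamentable failure" by Railtrack.')
print(spaced)  # a " lamentable failure " by Railtrack .
print(marks)   # [(1, '"'), (4, '"'), (7, '.')]
```

The recorded positions are what a later stage would need in order to re-insert boundaries at the sites of punctuation.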
The tokenised text is input to POS tagger 114, which in the described example is a Brill Tagger and therefore requires the tokenised text prepared by tokeniser module 112. Brill POS tagger 114 assigns a tag to each word in the tokenised text portion in accordance with a Penn TreeBank POS tag set stored in database 136. The Penn TreeBank POS set of tags will be described in detail hereinafter, but is a well-known set of tags, an example of which is available from the URL "http://www.ccl.umist.ac.uk/teaching/material/1019/Lect6/tsld006.htm", accessible on 18 July 2001.
POS tagged text is forwarded to parser 116 where it undergoes syntactic analysis. Parser 116 is connected to a memory module 138 in which parser 116 can store parse trees and other parsing and syntactic information for use in the parsing operation. Memory module 138 may be a dedicated unit, or a logical part of a memory resource shared by other parts of the TTS converter system.
Parsed text is forwarded to a prosodic break insertion module 118, which inserts prosodic boundary marks into the parsed text in accordance with the syntactic analysis carried out by parser 116. Prosodic break insertion module 118 also receives punctuation 120 from tokeniser module 112. The punctuation information is used by prosodic break insertion module 118 to assign further prosodic boundaries to the text portion.
The prosodic boundary markup text configured in module 118 is forwarded to constraints module 122. POS tagger 114 is connected 124 to the constraints module 122 to provide information regarding POS tagged text. Constraints module 122 uses the POS tagged text in order to apply constraints to the prosodic boundary markup text output from module 118, in order to reduce over-marking and to delete "illegal" boundary marks. The prosodic boundary constrained text is then output from constraints module 122 to TTS synthesiser unit 126, wherein synthesised speech is generated from the constrained text.
An audio output mechanism 128 receives the synthesised speech from the TTS synthesiser 126, and outputs the speech by way of, for example, a speaker 130, a mu-law or A-law audio output 132 for a PSTN, or a digital MPEG audio output 134; any suitable encoder may be used.
A prosodic boundary markup mechanism in accordance with an embodiment of the present invention may be implemented as a computer program or a computer program element. The operation of such a computer program or computer program element will now be described with reference to the flow diagram of Figure 6. Plain text 150, i.e. unformatted text, which in this example is a news report, is tokenised at step 152 in order to ensure that spaces are inserted between words and punctuation marks.
Tokenised text is then tagged with parts of speech at step 154 by way of a Brill POS tagger. A Brill POS tagger is a computer program written by Eric Brill and available from the Massachusetts Institute of Technology (M.I.T.) for use without fee or royalty.
Eric Brill's POS tagger is available from "http://www.cs.jhu.edu/~brill/" or from the Carnegie Mellon University Artificial Intelligence (A.I.) repository at "http://www.cgi.cs.cmu.edu/cgi-bin/airkeys?keywords=tagger&index=CMU+AI+Repository".
The Brill POS tagger applies POS tags using the notation of the Penn TreeBank tag set. An example of the Penn TreeBank tag set, downloaded from the URL "http://www.ccl.umist.ac.uk/teaching/material/1019/Lect6/tsld006.htm" on 18 July 2001, is provided below.
Penn TreeBank Tag Set:
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition/subord. conj.
JJ    Adjective
JJR   Comparative adjective
JJS   Superlative adjective
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PP$   Possessive pronoun
RB    Adverb
RBR   Comparative adverb
RBS   Superlative adverb
RP    Particle
SYM   Symbol (math/sci)
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or pres. participle
VBN   Verb, past participle
VBP   Verb, non-3s present
VBZ   Verb, 3s present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
The POS tagged text is then post-processed at step 156, which will be described in more detail later with reference to Figures 7 and 8.
At step 158 the prosodic boundaries are placed in the text based on the post processing at step 156, and prosodic boundaries corresponding to punctuation are inserted at step 160.
At step 162 constraints are applied to the prosodic boundary marked up text in order to reduce overmarking, and the constrained marked up text is then output to a TTS speech synthesiser 164.
An example of the operation of the computer program or computer program element in accordance with an embodiment of the present invention will now be described in detail by way of a worked example with reference to the flow diagrams of Figures 7 and 8.
Figure 7 illustrates the operation of the parser, referred to herein as a "chunking" parser since the parser identifies syntactic fragments of a sentence based on the sentence syntax, the fragments being referred to as chunks. The Applicant has recognised that there is some correspondence between the chunks and the sites of prosodic boundaries. The chunk boundaries are identified by using a modified chart parser and a phrase structure grammar.
Chart parsing is a well-known and efficient parsing technique. It uses a particular kind of data structure called a chart, which contains a number of so-called "edges". Parsing is in essence a search problem, and chart parsing is efficient in performing the necessary search since the edges contain information about all the partial solutions previously found for a particular parse. The principal advantage of this technique is that it is not necessary, for example, to attempt to construct an entirely new parse tree in order to investigate every possible parse. Thus, repeatedly encountering the same dead-ends, a problem which arises in other approaches, is avoided.
The parser used in the present embodiment is a modification of a chart parser known as Gazdar & Mellish's bottom-up chart parser, downloadable from the URL "http://www.coli.uni-sb.de/~brawer/prolog/botupchart", modified to:
1) recover tree structures from the chart;
2) return the best complete parse of a sentence; and
3) return the best (longest) partial parse, in the case when no complete sentence parse is available.
The parser is loaded with a phrase-structure grammar (PSG) capable of identifying chunk boundaries, such as may be implemented by reference to suitable text books, and the parser memory is initialised by clearing it of any information relating to a previous parsing activity.
Referring now to Figure 7, the input for the parser is the text:
A report into the Ladbroke Grove train crash has blamed a "lamentable failure" by Railtrack to respond to safety warnings before the accident.
The foregoing sentence will have been tokenised at step 152 of the flow diagram of Figure 6 to yield the tokenised text:
A report into the Ladbroke Grove train crash has blamed a " lamentable failure " by Railtrack to respond to safety warnings before the accident .
in which spaces have been inserted between words and punctuation. The tagged sentence 170 is produced by a Brill Tagger using the Penn TreeBank set of tags. Each word simply receives a tag indicating the word class (part of speech) played by the word. In the current text example, each word receives the following POS tags:
A/DT report/NN into/IN the/DT Ladbroke/NNP Grove/NNP train/NN crash/NN has/VBZ blamed/VBN a/DT "/" lamentable/JJ failure/NN "/" by/IN Railtrack/NNP to/TO respond/VB to/TO safety/NN warnings/NNS before/IN the/DT accident/NN ./.
The notation used in the foregoing example comprises the Penn TreeBank set of tags paired with each word or punctuation mark by way of a forward slash following the relevant word or punctuation mark, and then the POS tag.
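As an illustration of this notation, the following sketch splits a tagged string into (word, tag) pairs. The function name is hypothetical, and the sketch assumes that items are separated by whitespace and that the tag follows the last forward slash of each item.

```python
def parse_tagged(tagged_text):
    """Split 'word/TAG' items into (word, tag) pairs."""
    pairs = []
    for item in tagged_text.split():
        # Split on the LAST slash so that a word containing "/" keeps its slash.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


print(parse_tagged('A/DT report/NN has/VBZ blamed/VBN a/DT "/" lamentable/JJ'))
# [('A', 'DT'), ('report', 'NN'), ('has', 'VBZ'), ('blamed', 'VBN'),
#  ('a', 'DT'), ('"', '"'), ('lamentable', 'JJ')]
```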
At step 172 the tagged sentence is read into the parser. Each word-tag pair is read into the parser until a full stop is encountered. Each word-tag pair is stored as a term in the program's run-time database at step 174. Information about the location and type of punctuation marks is also retained for later use, although the punctuation itself is discarded for the purpose of parsing. The word-tag terms are used in the parsing itself, and in the subsequent evaluation of prosodic boundary constraints.
At step 176 the chart parser routine is called, and the word-tag terms are used to initialise it. The parsing proceeds by processing the sentence in the direction of reading, i.e. for English-type languages processing would proceed left to right. As the chart parser executes, edges are gradually added, until a parse is found or there are no further alternatives left to explore. The parser regards the sentence as having numbered vertices in the gaps between each word in the sentence, as well as before the first word, and after the last word of the sentence.
Following initialisation of the parser, either an active or an inactive edge may be added for the text portion undergoing parsing. An inactive edge represents the unification of a text chunk spanning two vertices with a grammar rule. The unification is complete, i.e. there is nothing left over. This may be described by way of the following example. A grammar rule: rule(s, [np, vp]) may be interpreted as: 's' rewrites as an 'np' followed by a 'vp'. (Other grammar rules may expand these components. In full, np denotes a noun phrase and vp a verb phrase.) A noun phrase and a verb phrase, amongst other grammatical constructs, correspond to "chunks".
In the first example of an edge given below, edge (152), an active edge, the entire [np, vp] sequence of constituents is not found in that edge. There remains a vp 'left over' to be found.
In inactive edge 150, however, the np has been found completely. Hence there is nothing 'left over'.
Active edges requiring components just identified in the inactive edge may also be added to the chart at this point. Active edges are added during the process of attempting to unify a chunk of a sentence with a grammar rule. When the leftmost constituent of a grammar rule is found in the chunk, an active edge is added which records what has been found so far, and the vertices it spans, along with what remains to be found.
New edges are added only when no similar edge already exists in the chart.
Two examples of an edge are now provided as an illustration. An edge has the syntax:
/* edge(ID, FromVertex, ToVertex, SyntacticCategory, Found, Still-to-Find) */
of which a particular example is:
edge(152, 11, 16, s, [150], [vp]).
Edge 152 is an active edge. It says that (part of) a sentence, s, has been found between vertex 11 and vertex 16, in edge 150. A vp remains to be found, from vertex 17 onwards in order to complete the sentence.
A second example is:
edge(150, 11, 16, np, [149, 147, 145, 143, 138], []).
Edge 150 is an inactive edge. It says that a noun phrase (np) has been found between vertices 11 and 16. Subcomponents of the phrase are found in edges 149, 147, 145, 143, 138. Nothing remains to be found in order to complete the noun phrase (np); this is what qualifies 150 as an inactive edge.
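The edge structure and the active/inactive distinction can be mirrored in a small data structure. The sketch below is illustrative only: the described parser is a Prolog program, whereas this uses a Python dataclass whose field names are hypothetical, and the notion of a "similar" edge used in add_edge (same span, category and contents, ignoring the ID) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    edge_id: int
    from_vertex: int
    to_vertex: int
    category: str
    found: tuple = ()          # IDs of the edges holding the constituents found so far
    still_to_find: tuple = ()  # categories that remain to be found

    @property
    def is_active(self):
        """An edge is active while something remains to be found."""
        return len(self.still_to_find) > 0


def add_edge(chart, edge):
    """Append edge to chart unless an edge differing only in its ID is already present."""
    def key(e):
        return (e.from_vertex, e.to_vertex, e.category, e.found, e.still_to_find)

    if any(key(existing) == key(edge) for existing in chart):
        return False
    chart.append(edge)
    return True


# The two edges from the example above.
edge_150 = Edge(150, 11, 16, "np", (149, 147, 145, 143, 138), ())
edge_152 = Edge(152, 11, 16, "s", (150,), ("vp",))
print(edge_150.is_active, edge_152.is_active)  # False True
```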
Typically there may be hundreds or thousands of edges considered in the parsing activity for a single sentence.
When no further edges can be added to the chart, this part of the parser terminates. The parser then explores the entire set of inactive edges at step 178, seeking paths through the set of edges which span the entire sentence from left to right. The parser operates in a top-down fashion, and considers all inactive edges in which the currently sought category (e.g. s, np, vp etc.) is found. If a sequence of edges can be found which spans all vertices, from left to right, with no gaps or overlaps, then there is a complete parse.
For the input above, the following parse tree is returned at step 180:
(8) [s([np([(dt --> a), (nn --> report), (in --> into), (dt --> the), (nnp --> ladbroke), (nnp --> grove), (nn --> train), (nn --> crash)]), vp([(vbz --> has), (vbn --> blamed)]), np([(dt --> a), (jj --> lamentable), (nn --> failure), (in --> by), (nnp --> railtrack)]), vp([(to --> to), (vb --> respond), (to --> to), (nn --> safety), (nns --> warnings), (in --> before), (dt --> the), (nn --> accident)])])]
where s stands for sentence, np for noun phrase, and vp for verb phrase. Notice that punctuation has been discarded for the purposes of the parse. Arrow-linked terms are simply the tags and the words which correspond with them. It is relatively straightforward to transform the above tree structure into a flat list, wherein chunk boundaries are marked with '|'.
At step 182 this structure is re-written using the symbol '|' to mark chunk boundaries, to obtain the following text portion:
(9) [a, report, into, the, ladbroke, grove, train, crash, |, has blamed, |, a, lamentable, failure, |, by railtrack, |, to, respond, to, safety, warnings, before, the, accident, |]
The re-written text is then exported at step 184 to the main procedure illustrated in Figure 8 to import boundaries corresponding to punctuation.
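The flattening of a chunked parse into a marked-up list can be sketched as follows. The tree representation used here (a list of (category, leaves) pairs, each leaf being a (tag, word) pair) and the function name are assumptions made for illustration; unlike the list shown above, this sketch keeps every word as a separate list element.

```python
def flatten_chunks(chunks, marker="|"):
    """Emit the words of each chunk in order, followed by a chunk-boundary marker."""
    flat = []
    for _category, leaves in chunks:
        flat.extend(word for _tag, word in leaves)
        flat.append(marker)
    return flat


chunks = [
    ("np", [("dt", "a"), ("nn", "report")]),
    ("vp", [("vbz", "has"), ("vbn", "blamed")]),
]
print(flatten_chunks(chunks))  # ['a', 'report', '|', 'has', 'blamed', '|']
```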
The main procedure for the described embodiment of the prosodic boundary markup mechanism is illustrated in Figure 8. At step 190, POS text tagged by a Brill Tagger is input to the get parse routine 192, described above with reference to the flow diagram of Figure 7. The output of get parse routine 192 is the re-written text labelled (9) above.
At step 194, prosodic boundary marks corresponding to the punctuation in the original text portion are imported into the re-written text portion (9), and are marked with '||'. Note that '||' supersedes the '|' after "failure".
(10) [a, report, into, the, ladbroke, grove, train, crash, |, has blamed, |, a, ||, lamentable, failure, ||, by railtrack, |, to, respond, to, safety, warnings, before, the, accident, ||]
As described above, when discussing problems of the prior art, it is known that attempting to predict prosodic boundaries purely on the basis of syntax results in an overestimation of the number of actual prosodic boundaries in a given text portion.
Therefore, embodiments of the present invention apply prosodic boundary constraints to text incorporating prosodic boundary marks based on a purely syntactic analysis, depending upon whether particular words are accented or deaccented. Other constraints take account of the relative positions of prosodic boundary marks already posited in the text portion or sentence. These constraints result in the removal of at least some of the over-predicted prosodic boundary marks.
It is noticeable in the resulting text portion (10) of the parsing process that prosodic boundaries surround the word "a" in the phrase "a lamentable failure". If uncorrected, this would produce an unusual prosodic effect in any speech synthesised therefrom.
The application of the constraints is relatively straightforward, and is by way of passing the marked up text (10) through a number of filters in sequence. Each filter implements one of the constraints. Each time a prosodic boundary is identified as illegal, according to a constraint, it is removed.
Referring back to Figure 8, the constraints are applied at steps 196 and 198. An overriding rule is that all boundaries at punctuation are legal and cannot be deleted.
Major and minor boundaries are defined as follows:
a) all boundaries at punctuation, except commas in lists, are major and are designated "||";
b) all other boundaries are minor.
The first constraint is applied at step 196, and comprises applying filter (1) to prosodic boundary markup text (10), as defined below:
Filter (1): If the preceding word is deaccented, then the boundary is illegal.
The determination of whether or not a word is deaccented may be based on identifying deaccented and accented words. Any words not being accented are deaccented. The criteria for identifying words as deaccented or accented are set out below. These criteria are applied to each word as part of Filter (1) to remove prosodic boundaries following deaccented words.
Accents:
Deaccented:
1. Clitics.
2. Function words: all except the ones listed below.
3. Given: a 'nominal' (and the pronoun associated with it) that repeats in the next 5 (?) lines.
4. Phrasal verb: any verb preceding RP.
5. Second nominal of a nominal compound (unless both are NNP, in which case both are accented).
Accented:
1. Content words: CD, JJ, JJR, JJS, NN, NNS, NNP, NNPS, RB, RBR, RBS, VB, VBD, VBG, VBN, VBZ, WRB.
2. Function words:
Nominal pronouns: everybody, anybody, everything, anything (something, somebody, nobody and nothing are not accented)
Nominative pronouns: he, she, I, we, you (it and they are not accented)
Reflexive pronouns: himself, herself, yourself, itself, themselves, myself
Pre-quantifier: PDT (e.g. 'all')
Post-quantifiers: same as PDT except that they don't occur before the Determiner (e.g. 'quite')
Post-determiner: everything that occurs after a determiner but is not a JJ, JJR, JJS (e.g. 'next')
Nominal adverbials: here, there, then
Negative modals: MD followed by 'not'
Negative do: do, does, did + 'not'
Interjections: UH
Particles: RP
Wh-words: all wh-words except WP$ and WRB
Some prepositions: despite, unlike, although, beside, above, below, behind, around, towards, during
Cue-phrases (e.g. 'well', 'now', 'however', sentence-initial 'and', 'but')
Preposed adverbials and fronted prepositions: prepositions at the start of the sentence, e.g.
Without his glasses, he couldn't see a thing (preposed adverbial)
Among the candidates was Jane Smith, the actress (fronted preposition)
NPs and AdvPs can also function as preposed adverbials, but for the present embodiment this is not relevant because their heads would be accented in any case since they are content words.
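A much simplified sketch of Filter (1) is given below. It covers only those criteria above that can be decided from a single word and its POS tag; the context-dependent criteria (clitics, given nominals, phrasal verbs, nominal compounds, negative modals, cue-phrases, post-determiners and preposed adverbials) are deliberately omitted. All names are hypothetical, and the '|' and '||' markers are assumed to denote minor and punctuation boundaries respectively.

```python
# Tags and words drawn from the "Accented" criteria above (subset only).
CONTENT_TAGS = {"CD", "JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS", "RB", "RBR",
                "RBS", "VB", "VBD", "VBG", "VBN", "VBZ", "WRB"}
ACCENTED_FUNCTION_TAGS = {"PDT", "UH", "RP", "WDT", "WP"}
ACCENTED_FUNCTION_WORDS = {
    "everybody", "anybody", "everything", "anything",
    "he", "she", "i", "we", "you",
    "himself", "herself", "yourself", "itself", "themselves", "myself",
    "here", "there", "then",
    "despite", "unlike", "although", "beside", "above", "below",
    "behind", "around", "towards", "during",
}


def is_accented(word, tag):
    """Context-free approximation: any word not accented counts as deaccented."""
    return (tag in CONTENT_TAGS or tag in ACCENTED_FUNCTION_TAGS
            or word.lower() in ACCENTED_FUNCTION_WORDS)


def filter_1(items):
    """Drop each minor boundary '|' that directly follows a deaccented word.

    items mixes (word, tag) pairs with boundary markers; '||' is never removed.
    """
    kept = []
    for item in items:
        follows_word = bool(kept) and isinstance(kept[-1], tuple)
        if item == "|" and follows_word and not is_accented(*kept[-1]):
            continue
        kept.append(item)
    return kept


items = [("the", "DT"), "|", ("report", "NN"), "|", ("a", "DT"), "||"]
print(filter_1(items))
# [('the', 'DT'), ('report', 'NN'), '|', ('a', 'DT'), '||']
```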
The constraints analysis then proceeds to step 198, where additional constraints are applied in the sequence laid out below:
Filter (2): A boundary is illegal if there is only one word between it and the preceding, or subsequent, punctuation. In the fragment "... blamed, |, a, ||, lamentable ..." therefore the first boundary can be removed. The second boundary, marked by "||", is at the site of punctuation, and therefore not removable. The result of applying constraint Filter (2) is:
Output (4) [a, report, into, the, ladbroke, grove, train, crash, |, has blamed, a, ||, lamentable, failure, ||, by railtrack, |, to, respond, to, safety, warnings, before, the, accident, ||]
Filter (3): Two words of the same major class cannot surround a boundary. A major class is defined as one of noun, verb, adjective, adverb. The identification of the major class is by way of the POS-tag information, i.e. POS-tagged words.
Filter (4): Function words cannot precede a boundary. A function word is any word whose correct tag is drawn from the following subset of the Penn TreeBank notation set:
{dt, ex, ttin, ls, md, pdt, pos, prp, pps, rp, sym, tto, uh, vbp, wdt, wp, wpz}
Filter (5): Starting from the beginning of a sentence, or from a boundary, the next two boundaries are found. If the ratio of the lengths, measured as number of words, from the start point to the first boundary, and from the first boundary to the second boundary, is greater than 2, then the first boundary is illegal. Preferably, this only applies where the number of words between the first and second boundaries is 3 or less.
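The sequential application of filters, together with Filter (2), can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's code: every word is held as a separate list element, '|' marks a minor boundary and '||' a boundary at punctuation, and the function names are hypothetical. Filters (3) to (5) are not implemented here.

```python
MINOR, MAJOR = "|", "||"


def filter_2(items):
    """Remove a minor boundary separated from a punctuation boundary by exactly one word."""
    def is_word(token):
        return token not in (MINOR, MAJOR)

    kept = []
    for i, token in enumerate(items):
        if token == MINOR:
            one_word_before = i >= 2 and items[i - 2] == MAJOR and is_word(items[i - 1])
            one_word_after = (i + 2 < len(items) and items[i + 2] == MAJOR
                              and is_word(items[i + 1]))
            if one_word_before or one_word_after:
                continue  # illegal boundary: drop it
        kept.append(token)  # words and punctuation boundaries are always kept
    return kept


def apply_filters(items, filters):
    """Pass the marked-up text through each filter in sequence."""
    for constraint in filters:
        items = constraint(items)
    return items


marked = ("a report into the ladbroke grove train crash | has blamed | a || "
          "lamentable failure || by railtrack | to respond to safety warnings "
          "before the accident ||").split()
print(" ".join(apply_filters(marked, [filter_2])))
```

On this input only the boundary between "blamed" and "a" is dropped, in line with the fragment discussed under Filter (2).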
Thus, in the worked example, just one constraint is activated, and the constraint application process yields the result:
(5) [a, report, into, the, ladbroke, grove, train, crash, |, has blamed, a, ||, lamentable, failure, ||, by railtrack, |, to, respond, to, safety, warnings, before, the, accident, ||]
Text portion (5) is a structure in which reasonable prosodic boundaries have been identified. The text portion (5) is then marked up with ToBI prosodic annotation in order to achieve realistic (or more realistic) prosodic speech quality in synthesised speech, by adjusting one or more of the following quantities: pitch accent, amplitude, and duration. Each of the prosodic boundaries identifies a point at which some treatment would be applied.
In the described embodiment a reduced ToBI markup is used, as set forth below:
a) Accents: the only labels marked are:
   i) H*: default
   ii) L*: preposed adverbials, fronted preposition, accented word preceding an H tone at a major or minor boundary
b) Minor boundaries:
   i) L-: default
   ii) L*: preposed adverbials, fronted preposition, accented word preceding an H tone at a major or minor boundary
c) Major boundaries:
   i) End of sentence:
      1) L-L%: default
      2) H-H%: at the end of a yes-no question (non wh-questions)
   ii) Intrasentential: L-H% (continuation)
(The fourth boundary tone (H-L%) referred to in the general description above depends on semantic factors, for example when contradiction is implied in a sentence like 'He's not stupid (H-L%).' Thus, it is not marked in this automatically marked embodiment.)
The main routine illustrated in Figure 8 then returns to the process flow illustrated in Figure 6, in which the prosody marked-up (ToBI) text is input to a TTS synthesiser at step 164.
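The default boundary tones of the reduced ToBI scheme listed above can be expressed as a small lookup. The sketch below is illustrative only. The function name and its parameters are hypothetical, and it assumes the caller already knows whether a boundary is minor or major, whether it ends the sentence, and whether the sentence is a yes-no question. It covers boundary tones only and does not attempt the accent labels or the L* cases.

```python
def boundary_tone(kind, sentence_final=False, yes_no_question=False):
    """Return the reduced-ToBI boundary label for one prosodic boundary.

    kind: "minor" or "major" (hypothetical encoding of the boundary type).
    """
    if kind == "minor":
        return "L-"       # minor boundary default
    if kind != "major":
        raise ValueError(f"unknown boundary kind: {kind!r}")
    if not sentence_final:
        return "L-H%"     # intrasentential major boundary (continuation)
    if yes_no_question:
        return "H-H%"     # end of a yes-no question
    return "L-L%"         # end-of-sentence default
```

The H-L% tone is deliberately absent, since the text says it depends on semantic factors and is not marked automatically.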
Due to the complexity of natural languages, and the fact that they are typically in a state of flux, it is very difficult to design a grammar which can precisely accommodate every legitimate expression in a natural language, for example English.
There is an indefinitely large number of legitimate expressions. Therefore, not every sentence can be tagged or parsed with complete accuracy.
Any inaccurate POS tagging will almost certainly result in inaccurate parsing.
However, in a preferred embodiment of the prosodic boundary markup mechanism the parser is configured to return partial parse results, and therefore the inability to render a completely accurate parse is not catastrophic. Preferably, the parser seeks a parse which spans an entire sentence. However, if no parse can be found which spans the entire sentence, the longest available partial parse is sought. Thus, it is still possible to recover useful, although incomplete, information about where prosodic boundaries may be posited. Thus, the invention exhibits a graceful degradation in its performance, rather than abrupt or catastrophic failure.
In cases where multiple parses compete, there is a choice of which edges to use for the parse, starting from the same vertex. In a preferred embodiment, the longest spanning edge is chosen. Other complete parses may be valid, and may represent another interpretation of the meaning of the sentence. However, since semantic information is not available to the automated system to resolve the ambiguity, no decision can be made between competing parses on the basis of meaning. Therefore, the preferred embodiment of the invention always chooses the longest spanning edge, since it tends to minimise the number of prosodic boundaries, and thereby reduces the risk of an inappropriate boundary misleading a listener with regard to the semantic content.
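The longest-spanning-edge preference, combined with the fallback to partial parses, can be sketched as a greedy walk over chart edges. This is a minimal illustration under assumptions that are not in the source: edges are given as hypothetical (start, end, label) tuples over word vertices, and a vertex with no outgoing edge is simply skipped so that the rest of the sentence can still be covered. The parser of the embodiment may work quite differently.

```python
def longest_edge_cover(edges, n_words):
    """Greedily pick, from each vertex, the edge that spans the most words.

    edges: iterable of (start, end, label) with start < end.
    Returns the chosen edges in left-to-right order; words not covered by
    any chosen edge are left unparsed (a partial parse).
    """
    longest_from = {}
    for start, end, label in edges:
        if end <= start:
            continue  # ignore empty or malformed edges
        best = longest_from.get(start)
        if best is None or end > best[1]:
            longest_from[start] = (start, end, label)

    chosen = []
    vertex = 0
    while vertex < n_words:
        edge = longest_from.get(vertex)
        if edge is None:
            vertex += 1        # no edge starts here: skip one word
        else:
            chosen.append(edge)
            vertex = edge[1]   # continue from the end of the chosen edge
    return chosen
```

Given competing edges `(0, 2, "NP")` and `(0, 3, "NP")` from vertex 0, the walk takes the three-word edge, which yields fewer constituents and therefore fewer candidate boundaries.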
There has now been described an automated prosodic boundary markup mechanism suitable for an automated TTS conversion system, which results in improved synthesised speech. The markup mechanism reduces overmarking of prosodic boundaries typically generated by syntactic analysis of text.
Insofar as embodiments of the invention described above are implementable, at least in part, using a software-controlled programmable processing device such as a Digital Signal Processor, microprocessor, other processing devices, data processing apparatus or computer system, it will be appreciated that a computer program or program element for configuring a programmable device, apparatus or system to implement the foregoing described methods is envisaged as an aspect of the present invention. The computer program or program element may be embodied as source code and undergo compilation for implementation on a processing device, apparatus or system, or may be embodied as object code, for example. The skilled person would readily understand that the term computer in its most general sense encompasses programmable devices such as referred to above, and data processing apparatus and computer systems.
Suitably, the computer program or program element is stored on a carrier medium in machine or device readable form, for example in solid-state memory or magnetic memory such as disc or tape and the processing device utilises the program, program element or a part thereof to configure it for operation. The computer program or program element may be supplied from a remote source embodied in a communications medium such as an electronic signal, including radio frequency carrier wave or optical carrier wave. Such carrier media are also envisaged as aspects of the present invention.
In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention. For example, a POS tagger other than the Brill Tagger may be used. Also, it is not necessary to use the Penn TreeBank set of POS tags; any other suitable notation for annotating words with a particular part of speech may be used. Furthermore, it is evident that a text portion input to a prosodic boundary markup mechanism according to the present invention may comprise one or more sentences. Additionally, the text sources (servers in Figure 4) may provide any suitable textual content beyond that illustrated, other examples being horoscopes.
The scope of the present disclosure includes any novel feature or combination of features disclosed therein either explicitly or implicitly or any generalisation thereof irrespective of whether or not it relates to the claimed invention or mitigates any or all of the problems addressed by the present invention. The applicant hereby gives notice that new claims may be formulated to such features during the prosecution of this application or of any such further application derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims and features from respective independent claims may be combined in any appropriate manner and not merely in the specific combinations enumerated in the claims.
For the avoidance of doubt, the term "comprising" used in the description and claims should not be construed to mean only "consisting only of".

Claims (57)

1. An automated prosodic boundary markup mechanism for an automated text to speech converter, the markup mechanism operable to perform a syntactic analysis on a text portion input thereto and to assign one or more prosodic boundaries to said text portion, the markup mechanism further operable to apply a constraint to said one or more prosodic boundaries and to remove a prosodic boundary satisfying said constraint.
2. A prosodic boundary markup mechanism according to claim 1, operable to assign a prosodic boundary mark to each of said one or more prosodic boundaries, and to remove a prosodic boundary mark satisfying said constraint.
3. A prosodic boundary markup mechanism according to claim 1 or claim 2, operable to identify a deaccented word in said text portion and to apply a constraint comprising removing a prosodic boundary immediately subsequent to said deaccented word.
4. A prosodic boundary markup mechanism according to any preceding claim, operable to identify punctuation in said text portion and to apply a constraint comprising removing a prosodic boundary conditional on there being only one word between a prosodic boundary and preceding or subsequent punctuation.
5. A prosodic boundary markup mechanism according to any preceding claim, operable to assign a major syntactic class to one or more words in said text portion and to apply a constraint comprising removing a prosodic boundary bounded by two words having the same major syntactic class assigned to them.
6. A prosodic boundary markup mechanism according to claim 5, wherein said major syntactic class is selected from a set comprising the following major syntactic classes: noun; verb; adjective and adverb.
7. A prosodic boundary markup mechanism according to any preceding claim, operable to identify one or more function words in said text portion and to apply a constraint comprising removing a prosodic boundary immediately subsequent to a function word.
8. A prosodic boundary markup mechanism according to claim 7, operable to identify said one or more function words in accordance with one or more function definitions drawn from the Penn TreeBank tag set {dt, ex, ttin, Is, md, pdt, pos, prp, pp$, rp, sym, tto, uh, vbp, wdt, wp, wpz}.
9. A prosodic boundary markup mechanism according to any preceding claim, operable to: determine a first number being the number of words between a first prosodic boundary and a second prosodic boundary being the next subsequent prosodic boundary from said first prosodic boundary; to determine a second number being the number of words between said second prosodic boundary and a third prosodic boundary being the next subsequent prosodic boundary from said second prosodic boundary; and to apply a constraint comprising removing said second prosodic boundary conditional on the ratio of said first number to said second number being greater than a predetermined threshold.
10. A prosodic boundary markup mechanism according to claim 9, wherein said threshold is 2.
11. A prosodic boundary markup mechanism according to claim 9 or 10, operable to apply said constraint dependent on said ratio conditional on said second number being less than 4.
12. A prosodic boundary markup mechanism according to any preceding claim, further operable to assign a part of speech to one or more words in said text portion, and to parse said text portion including said part of speech assigned words, thereby to perform said syntactic analysis on said text portion.
13. A prosodic boundary markup mechanism according to claim 12, operable to utilise said part of speech assigned words to determine a condition for applying said constraint.
14. A prosodic boundary markup mechanism according to claim 12 or 13, operable to implement an automated Brill Tagger mechanism to assign a part of speech to said one or more words in said text portion.
15. A prosodic boundary markup mechanism according to claim 14, comprising a database including a Penn Treebank set of part of speech tags, and operable to implement said automated Brill Tagger mechanism to interrogate said database to obtain a part of speech tag to assign to said one or more words in said text portion.
16. A prosodic boundary markup mechanism according to any one of claims 12 to 15, operable to parse said text portion including said part of speech assigned words to return a partial parse of said text portion.
17. A prosodic boundary markup mechanism according to any one of claims 12 to 16, operable to parse said text portion including said part of speech assigned words to return the longest possible parse.
18. A prosodic boundary markup mechanism according to any one of claims 12 to 15, operable to parse said text portion including said part of speech assigned words to return a parsed sentence.
19. A prosodic boundary markup mechanism according to any one of claims 16 to 18, operable to return a parse corresponding to the longest spanning edge from a vertex.
20. A computer system comprising memory, a processor and a prosodic boundary markup mechanism according to any preceding claim.
21. An automated text to speech converter system, including a computer system according to claim 20, a text input mechanism for said computer system and an audio output mechanism for outputting speech from said text to speech converter system.
22. An automated text to speech converter system according to claim 21, further including a text source, said computer system operable to communicate with said text source to provide text to said text input mechanism.
23. An automated text to speech converter system according to claim 22, said text source comprising one or more of the following: a news report database; a sports report database; an e-mail database; and any other appropriate textual content.
24. An automated text to speech converter system according to any one of claims 21 to 23, wherein said text input mechanism comprises a keyboard for said computer system.
25. An automated text to speech converter system according to claim 24, operable to provide editing of text originating from said text source with said keyboard.
26. An automated text to speech converter system according to any one of claims 21 to 25, operable to configure said audio output mechanism to output μ-law, A-law or MPEG formatted audio output.
27. An automated text to speech converter system according to any one of claims 21 to 25, said audio output mechanism comprising a speaker.
28. An automated text to speech converter including a prosodic boundary markup mechanism according to any one of claims 1 to 19.
29. A user device including a prosodic boundary markup mechanism according to any one of claims 1 to 19, or including a text to speech converter according to claim 28.
30. A communications network comprising an automated text to speech converter system according to any one of claims 21 to 27, a network interface connecting said automated text to speech converter system to said network and a user device connected to said network.
31. A user device according to claim 29 or claim 30, comprising a Personal Digital Assistant, mobile telephone, hand held computer or lap top computer.
32. A method of configuring a computer system, including memory and a processor, to mark up a text portion input thereto with one or more prosodic boundaries, the method comprising configuring said computer system to : store said text portion in said memory; process said text portion in said processor to perform a syntactic analysis thereof and to assign one or more prosodic boundaries to said text portion; store in said memory prosodic boundary marks associated with said prosodic boundaries; process said text portion having prosodic boundaries assigned thereto to apply a constraint; and delete prosodic boundary marks satisfying said constraint.
33. A method according to claim 32, further comprising configuring said computer system to process said text portion in said processor to identify a deaccented word in said text portion, and to apply a constraint comprising deleting a prosodic boundary mark immediately subsequent to said deaccented word.
34. A method according to claim 32 or claim 33, further comprising configuring said computer system to process said text portion in said processor to identify punctuation in said text portion, and to apply a constraint comprising deleting a prosodic boundary mark conditional on there being only one word between a prosodic boundary mark and preceding or subsequent punctuation.
35. A method according to any one of claims 32 to 34, further comprising configuring said computer system to process said text portion in said processor to assign a major syntactic class to one or more words in said text portion, and to apply a constraint comprising deleting a prosodic boundary bounded by two words having the same major syntactic class assigned to them.
36. A method according to claim 35, wherein said major syntactic class is selected from the following set comprising the major syntactic classes: noun; verb; adjective and adverb.
37. A method according to any one of claims 32 to 36, further comprising configuring said computer system to process said text portion in said processor to identify one or more function words in said text portion, and to apply a constraint comprising deleting a prosodic boundary mark immediately subsequent to a function word.
38. A method according to claim 37, further comprising configuring said computer system to process said text portion in said processor to identify said one or more function words in accordance with one or more function definitions drawn from the Penn TreeBank tag set {dt, ex, ttin, Is, md, pdt, pos, prp, pp$, rp, sym, tto, uh, vbp, wdt, wp, wpz}.
39. A method according to any one of claims 32 to 38, further comprising configuring said computer system to process said text portion in said processor to:
determine a first number being the number of words between a first prosodic boundary mark and a second prosodic boundary mark being the next subsequent prosodic boundary mark from said first prosodic boundary mark; determine a second number being the number of words between said second prosodic boundary mark and a third prosodic boundary mark being the next subsequent prosodic boundary mark from said second prosodic boundary mark; and apply a constraint comprising deleting said second prosodic boundary mark conditional on the ratio of said first number to said second number being greater than a predetermined threshold.
40. A method according to claim 39, wherein said threshold is 2.
41. A method according to claim 39 or 40, further comprising configuring said computer system to process said text portion in said processor to apply said constraint conditional on said ratio, dependent on the second number being less than 4.
42. A method according to any one of claims 32 to 41, further comprising configuring said computer system to: process said text portion in said processor to assign a part of speech to one or more words in said text portion; store said text portion, including said part of speech assigned words, in said memory; process said text portion, including said part of speech assigned words, stored in said memory in said processor to parse said text portion thereby to perform said syntactic analysis.
43. A method according to claim 42, further comprising configuring said computer system to process said text portion, including said part of speech assigned words, in said processor to utilise said part of speech assigned words to determine a condition for applying said constraint.
44. A method according to claim 42 or 43, further comprising configuring said computer system to process said text portion in said processor to implement an automated Brill Tagger mechanism to assign a part of speech to said one or more words in said text portion.
45. A method according to any one of claims 42 to 44, further comprising configuring said computer system to interrogate a database including a Penn Treebank set of part of speech tags for use in said automated Brill Tagger mechanism.
46. A method according to any one of claims 42 to 45, further comprising configuring said computer system to process said text portion in said processor to parse said text portion including said part of speech assigned words to return a partial parse of said text portion.
47. A method according to any one of claims 42 to 46, further comprising configuring said computer system to process said text portion in said processor to parse said text portion including said part of speech assigned words to return the longest possible parse.
48. A method according to any one of claims 42 to 45, further comprising configuring said computer system to process said text portion in said processor to parse said text portion including said part of speech assigned words to return a parsed sentence.
49. A method according to any one of claims 46 to 48, further comprising configuring said computer system to process said text portion in said processor to return a parse corresponding to the longest spanning edge from a vertex.
50. A program element comprising computer- or machine-readable instructions for configuring a computer to implement the prosodic boundary mechanism of claims 1 to 19, or to implement the method of any one of claims 32 to 49.
51. A program element comprising computer- or machine-readable instructions translatable for configuring a computer to implement the prosodic boundary mechanism of claims 1 to 19, or to implement the method of any one of claims 32 to 49.
52. The program element of claim 50 or 51 on a carrier medium.
53. A prosodic boundary markup mechanism substantially as described herein, and with reference to the drawings.
54. An automated text to speech converter system substantially as described herein, and with reference to the drawings.
55. An automated text to speech converter substantially as described herein, and with reference to the drawings.
56. A user device substantially as described herein, and with reference to the drawings.
57. A communications network substantially as described herein, and with reference to the drawings.
GB0119842A 2001-08-14 2001-08-14 Prosodic boundary markup mechanism Expired - Lifetime GB2378877B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB0119842A GB2378877B (en) 2001-08-14 2001-08-14 Prosodic boundary markup mechanism
PCT/GB2002/003738 WO2003017251A1 (en) 2001-08-14 2002-08-14 Prosodic boundary markup mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0119842A GB2378877B (en) 2001-08-14 2001-08-14 Prosodic boundary markup mechanism

Publications (3)

Publication Number Publication Date
GB0119842D0 GB0119842D0 (en) 2001-10-10
GB2378877A true GB2378877A (en) 2003-02-19
GB2378877B GB2378877B (en) 2005-04-13

Family

ID=9920400

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0119842A Expired - Lifetime GB2378877B (en) 2001-08-14 2001-08-14 Prosodic boundary markup mechanism

Country Status (2)

Country Link
GB (1) GB2378877B (en)
WO (1) WO2003017251A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103035241A (en) * 2012-12-07 2013-04-10 中国科学院自动化研究所 Model complementary Chinese rhythm interruption recognition system and method
US10127901B2 (en) 2014-06-13 2018-11-13 Microsoft Technology Licensing, Llc Hyper-structure recurrent neural networks for text-to-speech
CN105185374B (en) * 2015-09-11 2017-03-29 百度在线网络技术(北京)有限公司 Prosody hierarchy mask method and device
RU2643467C1 (en) 2017-05-30 2018-02-01 Общество с ограниченной ответственностью "Аби Девелопмент" Comparison of layout similar documents
BE1025287B1 (en) * 2017-10-09 2019-01-08 Mind The Tea Sas Method of transforming an electronic file into a digital audio file
CN112528014B (en) * 2019-08-30 2023-04-18 成都启英泰伦科技有限公司 Method and device for predicting word segmentation, part of speech and rhythm of language text
CN110782918B (en) * 2019-10-12 2024-02-20 腾讯科技(深圳)有限公司 Speech prosody assessment method and device based on artificial intelligence
CN110782880B (en) * 2019-10-22 2024-04-09 腾讯科技(深圳)有限公司 Training method and device for prosody generation model
CN112786023B (en) * 2020-12-23 2024-07-02 竹间智能科技(上海)有限公司 Mark model construction method and voice broadcasting system
CN113392645B (en) * 2021-06-22 2023-12-15 云知声智能科技股份有限公司 Prosodic phrase boundary prediction method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5749071A (en) * 1993-03-19 1998-05-05 Nynex Science And Technology, Inc. Adaptive methods for controlling the annunciation rate of synthesized speech

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Proc. 4th Int. Conf. Spoken Language, Oct. 1996, vol.3, pages 1720-1723 *

Also Published As

Publication number Publication date
WO2003017251A1 (en) 2003-02-27
GB0119842D0 (en) 2001-10-10
GB2378877B (en) 2005-04-13

Legal Events

Date Code Title Description
PCNP Patent ceased through non-payment of renewal fee

Effective date: 20200814

S28 Restoration of ceased patents (sect. 28/pat. act 1977)

Free format text: APPLICATION FILED

732E Amendments to the register in respect of changes of name or changes affecting rights (sect. 32/1977)

Free format text: REGISTERED BETWEEN 20210826 AND 20210901

S28 Restoration of ceased patents (sect. 28/pat. act 1977)

Free format text: APPLICATION WITHDRAWN

Effective date: 20220331