WO1996012271A9

WO1996012271A9 - Speech synthesis apparatus and method for synthesizing a finite set of sentences and numbers using one program

Info

Publication number: WO1996012271A9
Application number: PCT/US1995/013134
Authority: WO
Filing date: 1995-10-16
Publication date: 1996-07-04

Abstract

An apparatus and method for synthesizing a finite set of sentences and numbers in one of several languages using an application program which is independent of the language being synthesized. The invention includes a system controller which communicates with a sentence and word synthesizer by means of a communication link. The sentence and word synthesizer responds to instructions from the controller by accessing a vocabulary and sentence database which contains all of the language dependent information found in an application program contained in the controller in standard implementations of speech synthesizers. The language dependent information such as grammar rules, etc., is encoded in a language independent format in the database. Therefore, the application program can be written in a form which is independent of the specific language to be synthesized. If it is desired to synthesize a different language, only that portion of the database containing the language specific grammar rules, sentence structure, etc. need be replaced instead of adding new code to the application program. Thus, by instructing the speech synthesizer to access a different database or by changing the memory containing the database a new language can be synthesized. This simplifies the process of synthesizing speech in multiple languages and reduces the development cost for such a device and the complexity of the controller.

Description

SPEECH SYNTHESIS APPARATUS AND METHOD FOR SYNTHESIZING A FINITE SET OF SENTENCES AND NUMBERS USING ONE PROGRAM

TECHNICAL FIELD

The present invention relates to techniques for synthesizing speech for use in data processing systems, telephone answering machines, and other devices, and more specifically, to an apparatus and method capable of synthesizing speech in multiple languages using a single application program.

BACKGROUND OF THE INVENTION

Synthesized speech is used in many electronic devices as part of the user interface to enable a user to interact with or obtain information from the device. Such devices typically contain a speech synthesizer chip which consists of a processor having speech synthesis capability. The synthesized speech may be output through any one of several mediums, e.g., audio voice synthesis, morse code, message display, etc. The speech synthesizer chip may be separate from the other functional units of the device, or it may be incorporated with additional functions such as memory, digital signal processing, timers, etc. As shown in Fig. 1, a typical speech synthesis chip 1 contains a system controller 10 which is linked to a word synthesizer 12 by means of a communication link 14. Word synthesizer 12 accesses vocabulary database 16 in order to retrieve word data needed to construct sentences in response to instructions issued by controller 10. Vocabulary database 16 stores the words or groups of words used to synthesize the sentences requested by controller 10 in a non-volatile memory.

Controller 10 typically contains an application program stored in a read-only memory (ROM), with the program being designed for the specific application for which the synthesized words are required. The application program includes routines written for each sentence which the speech synthesis chip 1 is expected to produce for the desired application. Each routine generates a desired sentence by causing controller 10 to issue a set of commands to word synthesizer 12, where each command causes a word or group of words in that sentence to be synthesized. The grammar rules, word order structure, and rules for constructing numbers (among other characteristics) specific to a particular language are embedded in the application program and are reflected in the order and types of commands which the program causes controller 10 to issue.

Because different languages have different structural characteristics (grammar rules, etc.), it is very difficult to design a speech synthesis device which is capable of synthesizing speech in multiple languages, or which is capable of synthesizing speech in one of a selected set of languages depending upon the need. Present systems use a different application program for each intended language, so that the languages whose speech is to be synthesized and the sentences which will be produced must be identified prior to developing the system. In addition, the applications programs for each language must be designed and tested prior to production of the speech synthesis chip, leading to a lengthy development cycle. This is a result of the need to include the applications programs for all languages which are expected to be synthesized in the controller at the time a speech synthesis chip is being designed. Note that the language specific routines can also be produced using a context free grammar tool, thereby reducing the amount of code and/or memory required.

Even if a speech synthesis chip is designed with a multi-language capability, if synthesis of a new language is required, or if new sentences need to be synthesized in an existing language, new application program code is necessary. Replacement of the controller ROM and alterations to the vocabulary database are also necessary in this situation. This reduces the flexibility of the speech synthesis chip and makes the use of controllers containing masked ROM impractical. Another problem encountered with currently available speech synthesis chips is that in some languages, the manner in which numbers and certain words are pronounced depends upon the context in which they are used. For instance, a given number may be pronounced in different ways or represented by different words in the same sentence. In order to account for this situation, an application program needs to be able to recognize different contexts and determine the appropriate word or pronunciation to be used, a capability lacking in most word-by-word speech synthesizers. Even if available, this capability further increases the complexity of the program and the load on the communication link connecting the controller and word synthesizer.

What is desired is a method of synthesizing a finite set of sentences and numbers in an arbitrary language in a manner which does not require that a new application program be written for each language. It is also desired to have a speech synthesis chip which implements the above method.

SUMMARY OF THE INVENTION

The present invention is directed to an apparatus and method for synthesizing a finite set of sentences and numbers in one of several languages using an application program which is independent of the language being synthesized. The invention includes a system controller which communicates with a sentence and word synthesizer by means of a communication link. The sentence and word synthesizer responds to instructions from the controller by accessing a vocabulary and sentence database which contains all of the language specific information usually found in a controller resident application program in standard implementations of speech synthesizers. The language specific information is encoded in a language independent format in the database. Therefore, the application program can be written in a form which is independent of the language to be synthesized. The database contains all of the language specific information and its contents are retrieved by an indexing system which assigns an index number to each sentence. The application program causes the controller to issue a command to retrieve a desired sentence by using its index number, where the command includes information regarding the specific data needed to produce the desired sentence. Each sentence is constructed as a set of words, variables, and control terms. The words are fixed entries and the variables are typically numbers. The control terms act to control the operation of the sentence synthesizer and determine the structure of the sentence being synthesized. For example, they may determine whether the singular or plural form of a word is appropriate, or act to produce the proper pronunciation of a number depending upon its context.

In operation, the controller issues a command instructing the sentence synthesizer to produce a sentence having a prescribed index number. The command includes the values of any variables needed to complete the sentence. The sentence synthesizer retrieves the sentence content from the database and then implements the sentence according to the words, control terms, and variables contained in it. Each data word in the sentence is read by a word decoder which determines if the data word is a word, variable, or control term. For each word to be synthesized, the sentence synthesizer instructs a word synthesizer to retrieve that word from the database and produce it in spoken form. For each variable, the sentence synthesizer points to a data table which contains the spoken word equivalents of the number or numbers to be produced by the speech synthesizer. The data table points to the entries in the word database corresponding to the words needed to produce the spoken number. These words are then retrieved and produced as speech by the action of the word synthesizer. The control terms are interpreted by the sentence synthesizer as commands to carry out operations which implement the grammar rules, contextual checking, etc., of the language and thereby determine the final sentence structure. If it is desired to synthesize a different language or to alter the sentences which can be synthesized, only that portion of the database containing the language specific grammar rules, sentence structure, etc., needs to be replaced. This is a more efficient means of expanding the capabilities of the speech synthesis chip than adding new code to the application program. By instructing the speech synthesizer to access a different database, or by changing the memory containing the database, a new language can be synthesized. This simplifies the process of synthesizing speech in multiple languages and reduces the development costs for such a device and the complexity of the controller.

Further objects and advantages of the present invention will become apparent from the following detailed description and accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 is a block diagram of a typical speech synthesis chip.

Fig. 2 is a block diagram of a speech synthesis chip constructed according to the present invention.

Fig. 3 is a flowchart showing the operation of the sentence synthesizer module of the present invention.

Fig. 4 shows how a simple sentence is constructed and synthesized by the speech synthesis chip of the present invention.

Fig. 5 is a block diagram of a telephone answering machine which incorporates the speech synthesizer chip of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Fig. 2 is a block diagram of a speech synthesis chip 100 constructed according to the present invention. Speech chip 100 includes a system controller 102 which communicates with a sentence and word synthesizer 104 via a communication link 103. System controller 102 can take the form of a separate processor which interacts with synthesizer 104 via communication link 103 in a master/slave type of architecture, or controller 102 can be a separate software module running on the same processor as synthesizer 104. In the latter situation, communication between controller 102 and synthesizer 104 occurs via the internal registers of the processor or by means of a variable in memory.

Synthesizer 104 accesses vocabulary and sentence database 108 in order to construct synthesized speech sentences in response to commands issued by controller 102. Database 108 is typically separated into two sections; a vocabulary or word section 109 and a sentence section 110. Database 108 contains the words, grammar rules, numbers, and contextual information needed for synthesizer 104 to synthesize sentences in response to commands from controller 102. Synthesizer 104 typically contains two modules, a sentence synthesizer 105 and a word synthesizer 106. Sentence synthesizer 105 acts to control the production of a desired sentence by interpreting the data retrieved from database 108 in response to a command from controller 102 to synthesize a particular sentence. Word synthesizer 106 acts to synthesize specific words in response to commands from sentence synthesizer 105.

Database 108 contains all of the language specific information needed to synthesize any of the set of sentences which system 100 is capable of synthesizing. This is accomplished by use of a data structure which includes the language specific information in the definition of the sentence. Thus, as will be described in greater detail later, when controller 102 issues a command to synthesize a particular sentence by providing its index, sentence synthesizer 105 retrieves that sentence structure from sentence section 110 of database 108, where the sentence structure contains all of the grammar and contextual rules of the language being synthesized. This significantly reduces the complexity of the application program which is resident in controller 102, and makes the speech synthesis system more flexible and capable of being used to synthesize multiple languages.

Fig. 3 is a flowchart showing the operation of the sentence synthesizer 105 module of the present invention. Sentence synthesizer 105 receives an instruction from controller 102 to synthesize sentence (n), where n represents the index of the sentence to be produced (box 200). In response, a pointer is set to the sentence with index (n) (box 210) in the sentence section 110 of database 108. The sentence content is retrieved from database 108 and the data is read one data word at a time by a word decoder contained in sentence synthesizer 105 (box 220). A test is then performed to determine if the data word which has been read by the decoder is an end marker, signifying the end of the sentence data (box 230). If the data word is an end marker, the program ends (box 250). If the data word is not an end marker, the character of the data word determines whether a number is constructed by means of a data table, a word is synthesized by word synthesizer 106, or a control term which modifies the final sentence structure is implemented (box 240).

As an example of the operation of speech synthesis chip 100 of the present invention, the process of synthesizing a sentence produced by a telephone answering machine will be described. The example sentence is, "You have 21 messages". Fig. 4 shows how this simple sentence is constructed and synthesized by the speech synthesis chip of the present invention. Controller 102 issues a command to synthesizer 104 via communication link 103. The command is of the form "synthesize sentence (n, x₁, x₂, x₃, ...)", where n is a number corresponding to the sentence index, and x₁, x₂, x₃, etc. represent values of the arguments or variables to be inserted into the sentence structure. In response to this command, synthesizer 104 accesses database 108, using a pointer to retrieve the sentence corresponding to index (n) from the sentence database portion 110 of database 108.

As shown in Fig. 4, a sentence contained in database 108 is composed of data words representing three types of objects: words, variables, and control terms. The words are fixed entries ("You have", etc. in the example sentence) for the invariant parts of the sentence. The sentence structure in database 108 contains pointers for the words to be synthesized which direct the word synthesizer portion 106 of synthesizer 104 to retrieve those word(s) from the word section 109 of database 108 and then synthesize them. The variables or arguments correspond to portions of the sentence which change with the situation in which the sentence is being synthesized. They are usually numerals and the sentence structure contains a pointer to a numeral decoder or table 300 which translates the number (in this case 21) to its corresponding words ("twenty-one"). The control terms are instructions which cause the synthesizer to check for a particular condition, such as the existence of a plural argument. If the condition is satisfied, the index to the next word to be synthesized by the word synthesizer is automatically incremented, resulting in the production of the next word in word section 109 of database 108. A more detailed description of this process is given below.

After retrieval of the appropriate sentence, a word decoder reads each data word from the sentence, where the data words correspond to the words, variables, and control terms previously described. If the data word corresponds to a word or word group, that word or word group is retrieved from the vocabulary or word section 109 of database 108 and then is synthesized by word synthesizer 106. Sentence synthesizer 105 then reads the next data word, which in the case of the example of Fig. 4 is an instruction to go to table 1 to retrieve a number. The instruction to go to table 1 can, if necessary, be followed by a logic step which determines the context in which the number is being used in the sentence so that the appropriate spoken form of the number will be synthesized. This logic step is important in languages, such as German, in which the form of a number (the actual words used to express that number) depends upon the context in which the number is being used. In Fig. 4, this context determining logic is represented by a context selector (box 310). The word following the instruction to go to table 1 is read next and provides the argument for the variable in the sentence, in this case the number of messages. Based on this argument and the results of the context selector logic, the appropriate entry in table 1 or another data table is located. A pointer or pointers from that entry indicates the words in word section 109 of database 108 which correspond to the argument needed for the sentence. This is followed by an instruction to word synthesizer 106 to synthesize those words.
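The table lookup and context selection described above can be illustrated with a minimal sketch. The table names, the "cardinal" and "ordinal" context labels, and all index values are assumptions made for illustration; the specification does not define a concrete layout. Plain strings stand in for the stored speech samples.

```python
# Hypothetical word section 109: position in the list is the word index.
WORD_SECTION = ["one", "two", "twenty", "first", "second"]

# Hypothetical number tables, one per context. Each entry maps a number
# to the word indices that make up its spoken form in that context.
CARDINAL_TABLE = {1: [0], 2: [1], 21: [2, 0]}
ORDINAL_TABLE = {1: [3], 2: [4], 21: [2, 3]}

# The context selector picks which table to consult.
NUMBER_TABLES = {"cardinal": CARDINAL_TABLE, "ordinal": ORDINAL_TABLE}


def number_to_word_indices(value, context):
    """Locate the table entry for `value` in the table chosen by `context`."""
    return NUMBER_TABLES[context][value]


def speak_number(value, context="cardinal"):
    """Follow the pointers from the table entry into the word section."""
    return [WORD_SECTION[i] for i in number_to_word_indices(value, context)]


print(speak_number(21))             # words for the cardinal form
print(speak_number(21, "ordinal"))  # words for the ordinal form
```

Because the number tables hold only word indices, supporting another context in this sketch means adding a table rather than changing the lookup code.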

Sentence synthesizer 105 then reads the next data word, which in this case is a control term or instruction to check if the argument is singular or plural. If the argument is singular, the word "message" is retrieved from the word section 109 of database 108 and is then spoken by word synthesizer 106. If the argument is plural, then sentence synthesizer 105 increments the word index by one, thereby causing the word "messages" to be retrieved and synthesized. As can be understood from the previous example, sentence synthesizer 105 of the present invention performs the processing steps necessary to retrieve the sentence to be synthesized, parse through the data words which comprise the content of that sentence, and control the synthesizing of each of the words or variables in that sentence. In this way the complete sentence is synthesized by a sequence of logic steps and instructions to retrieve words from the word section 109 of database 108 and then synthesize those words.
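The complete example can be sketched as a small interpreter loop. This is an illustration under stated assumptions, not the encoding used by the invention: the `Kind` tags, the tuple layout of a data word, and every index value are hypothetical, and text strings replace the synthesized audio.

```python
from enum import Enum, auto


class Kind(Enum):
    WORD = auto()          # operand: index into the word section
    NUMBER = auto()        # operand: position of the argument to speak
    PLURAL_CHECK = auto()  # operand: position of the argument to test
    END = auto()           # end marker (box 230 of Fig. 3)


# Hypothetical word section 109. "messages" is stored directly after "message"
# so that incrementing the word index by one selects the plural form.
WORD_SECTION = ["You have", "message", "messages", "one", "twenty"]

# Hypothetical number table: number -> word indices of its spoken form.
NUMBER_TABLE = {1: [3], 21: [4, 3]}

# Hypothetical sentence section 110: sentence index -> string of data words.
SENTENCES = {
    0: [
        (Kind.WORD, 0),
        (Kind.NUMBER, 0),
        (Kind.PLURAL_CHECK, 0),
        (Kind.WORD, 1),
        (Kind.END, 0),
    ],
}


def synthesize(n, *args):
    """Interpret sentence n, returning text in place of synthesized speech."""
    spoken = []
    offset = 0  # set by a plural check, consumed by the next WORD
    for kind, operand in SENTENCES[n]:
        if kind is Kind.END:
            break
        if kind is Kind.WORD:
            spoken.append(WORD_SECTION[operand + offset])
            offset = 0
        elif kind is Kind.NUMBER:
            spoken.extend(WORD_SECTION[i] for i in NUMBER_TABLE[args[operand]])
        elif kind is Kind.PLURAL_CHECK:
            offset = 0 if args[operand] == 1 else 1
    return " ".join(spoken)


print(synthesize(0, 21))
print(synthesize(0, 1))
```

The caller supplies only the sentence index and the argument value, mirroring the single "synthesize sentence (n, x₁, ...)" command issued by controller 102.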

The present invention solves the problems inherent in the prior art approach of synthesizing a sentence word by word under the control of a controller by incorporating the language specific information in the same database as the word information. The application program being run by the master controller need only issue a command to synthesize a desired sentence which is identified by its index number. Control is then passed to the sentence synthesizer which retrieves the sentence content data and parses through it to carry out the process of synthesizing the sentence. Because the language specific information is expressed in terms of the sentence content, i.e., as data, it can be stored in a standard memory device instead of being expressed as code which runs on the controller. This reduces the complexity of the application program while also reducing system costs and increasing the flexibility of the system. In addition, because the controller need issue only a single command in order to synthesize an entire sentence (as opposed to currently available systems in which the controller must issue a command for each word to be synthesized), the controller can be used to perform or monitor other system functions during the synthesis process. In order to synthesize a sentence in a different language, controller 102 can issue a command to sentence synthesizer 105 to select a different portion of database 108 to use when retrieving sentence and word data, or database 108 may be replaced by a different memory device which contains the sentence content and words needed for the new language. Because the sentence content for the new language contains all of the language specific information required when synthesizing a sentence in that language, the application program being executed by controller 102 does not have to be changed.
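A short sketch can make the language switch concrete. Here the same application routine runs against two hypothetical databases; the dictionary layout, the data word tags, and the vocabulary are assumptions chosen for illustration. The German entry stores "eine" as the spoken form of 1 in this sentence, which is the kind of context-dependent choice the database is meant to hold.

```python
# Two hypothetical language databases with identical structure.
ENGLISH = {
    "words": ["You have", "message", "messages", "one", "two"],
    "numbers": {1: [3], 2: [4]},
    "sentences": {0: [("word", 0), ("number", 0), ("plural", 0), ("word", 1)]},
}
GERMAN = {
    "words": ["Sie haben", "Nachricht", "Nachrichten", "eine", "zwei"],
    "numbers": {1: [3], 2: [4]},
    "sentences": {0: [("word", 0), ("number", 0), ("plural", 0), ("word", 1)]},
}


def synthesize(database, n, *args):
    """Language-independent interpreter: all language data comes from `database`."""
    spoken = []
    offset = 0
    for kind, operand in database["sentences"][n]:
        if kind == "word":
            spoken.append(database["words"][operand + offset])
            offset = 0
        elif kind == "number":
            indices = database["numbers"][args[operand]]
            spoken.extend(database["words"][i] for i in indices)
        elif kind == "plural":
            offset = 0 if args[operand] == 1 else 1
    return " ".join(spoken)


# Application program: knows only the sentence index, not the language.
MESSAGE_COUNT_SENTENCE = 0


def announce_message_count(database, count):
    return synthesize(database, MESSAGE_COUNT_SENTENCE, count)


print(announce_message_count(ENGLISH, 2))
print(announce_message_count(GERMAN, 1))
```

In this sketch a language with a different word order would store a different sequence of data words under the same sentence index, and `announce_message_count` would remain unchanged.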

As noted, support for synthesizing numbers is provided through the use of data tables which contain the sequence of words to produce in order to synthesize a given number, when that number is spoken in a particular context. Different contexts are supported by using different tables, with pointers from context selector 310 indicating which table to use. Because the selection of the appropriate context is determined by the control terms within the sentence data, different contexts may be used within the same sentence.

A typical speech synthesis system constructed according to the present invention is shown in Fig. 5, which is a block diagram of a telephone answering machine which incorporates the speech synthesizer chip of the present invention. Although Fig. 5 represents a typical application of the present invention, it is only one example of many environments in which the present invention may be utilized to provide efficient multi-language speech synthesis capabilities.

In order to retrieve messages from a telephone answering machine, a user depresses a key on keypad 401. System controller 102 decodes which key has been depressed and translates the keystroke into an action to be implemented by the speech synthesizer. If, for example, the action is to announce the current time, system controller 102 will issue a command to module interface 402 to synthesize a particular sentence from sentence database 110, which is part of language database 108. The sentence to be synthesized is identified by its index number, n. Module interface 402 sends the sentence index to sentence and word synthesizer 104. As previously described with reference to Figs. 3-4, sentence synthesizer module 105 of synthesizer 104 will retrieve the sentence definition corresponding to the sentence with index n from sentence section 110 of database 108, decode its content, and convert it to a series of words to be synthesized or control terms, where the words may include the steps of converting numbers into the equivalent words. The decoded words which are to be spoken are passed to the word synthesizer 106 module of synthesizer 104, along with instructions to synthesize those words. Word synthesizer 106 retrieves the desired words from the word section 109 of database 108, where they may be stored as compressed digitized samples of previously recorded speech. Word synthesizer 106 then acts to decompress the data into a series of samples, and sends the decompressed samples to a codec module 403. Codec module 403 is under the control of module interface 402 and is responsible for performing the digital-to-analog and analog-to-digital conversion functions required by the system. Codec module 403 converts the decompressed digital samples to analog signals which are then produced as audible speech by means of a loudspeaker 408. If desired, the sentence can also be displayed visually by means of a display 409.

If a request to the answering machine is provided by means of a signal transmitted over a telephone line instead of keypad 401, that request enters the system via telephone line interface 404. The incoming signal is passed by interface 404 to analog multiplexer 405 which controls the input, output, and processing of analog signals. Analog multiplexer 405 sends the signal to codec module 403 which reads the signal and converts it to digital form. A digital signal processing (DSP) and systems functions module 406 decodes the signal read by codec module 403 and determines if the decoded signal corresponds to an instruction the system is designed to recognize. If so, module interface 402 informs system controller 102 what the instruction or digit represented by the incoming signal is, and controller 102 then implements that instruction as previously described.

In more complicated systems, DSP and systems functions module 406 can also perform other functions such as voice compression and decompression, tone generation and detection, real-time clock generation, memory management, etc. In addition, most telephone answering systems include a microphone 407 for recording messages, and a loudspeaker 408 for playing back messages and the synthesized speech. As mentioned previously, analog multiplexer 405 controls the input/output functions which cause analog signals to enter the other system modules or cause analog signals to be produced by the system. A display 409 may also be included to visually display system information or messages to the user.

The sentence synthesizer module 105 can be modified, depending upon the application, by altering the set of variables which are recognized automatically and the set of grammar rules. The set of variables may be expanded to account for a larger set of numbers, while the grammar rules, which are usually encoded as control terms in a sentence, can be changed to permit the synthesis of other languages or of additional aspects of the same language. Such modifications allow the speech synthesizer system of the present invention to more efficiently adapt to new uses or markets in which a product incorporating the system will be sold. As is evident from the foregoing description of the present invention, creation of database 108 is an important part of the invention, specifically, the structure and contents of sentence section 110 of database 108. As previously discussed, the process of synthesizing a sentence is begun by controller 102 issuing a command to sentence synthesizer 105 to "synthesize sentence (n)", where (n) is the index of the desired sentence. Sentence synthesizer 105 responds to this command by accessing sentence section 110 of database 108 and retrieving the sentence data corresponding to the index (n). Sentence section 110 may be organized as a data table with each entry being identified by its respective index.

The structure of the sentence database data table can be expressed as two columns; a first column containing an index number for the line of sentence data words contained in the second column, and a corresponding line of data words which define the contents of the sentence. The line of data words can be expressed as a sequence of numbers in a fixed-size binary representation. Each number corresponds to an index for an entry representing the sampled spoken form of a particular word stored in word section 109 of database 108, or to a control word which affects the structure of the sentence or an option code which selects the number table to use when generating the spoken word equivalents of a given number. Note that if certain words are pronounced differently depending upon the context, the different versions of the word should be stored as different entries in word section 109.
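One possible fixed-size binary layout for a line of data words is sketched below. The 16-bit width, the 2-bit type tag, the tag values, and the little-endian byte order are all assumptions; the specification states only that the data words are numbers in a fixed-size binary representation.

```python
import struct

# Hypothetical 16-bit data word: 2-bit type tag, 14-bit operand.
TAG_WORD, TAG_CONTROL, TAG_OPTION, TAG_END = 0, 1, 2, 3
OPERAND_BITS = 14
OPERAND_MASK = (1 << OPERAND_BITS) - 1


def pack_line(data_words):
    """Encode (tag, operand) pairs as little-endian unsigned 16-bit numbers."""
    values = [(tag << OPERAND_BITS) | (operand & OPERAND_MASK)
              for tag, operand in data_words]
    return struct.pack(f"<{len(values)}H", *values)


def unpack_line(blob):
    """Decode a packed line back into (tag, operand) pairs."""
    count = len(blob) // 2
    return [(value >> OPERAND_BITS, value & OPERAND_MASK)
            for value in struct.unpack(f"<{count}H", blob)]


# Two-column sentence table: index -> packed line of data words.
EXAMPLE_LINE = [
    (TAG_WORD, 0),     # fixed word, e.g. "You have"
    (TAG_OPTION, 1),   # option code selecting number table 1
    (TAG_CONTROL, 0),  # control word, e.g. singular/plural test
    (TAG_WORD, 1),     # fixed word, e.g. "message"
    (TAG_END, 0),      # end marker
]
SENTENCE_TABLE = {0: pack_line(EXAMPLE_LINE)}

print(SENTENCE_TABLE[0].hex())
print(unpack_line(SENTENCE_TABLE[0]))
```

A layout of this kind keeps each sentence as pure data that could be stored in a ROM alongside the word samples, with the word decoder needing only the tag bits to classify each data word.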

As sentence synthesizer 105 reads each data word in the line of data words retrieved from sentence section 110, it either instructs word synthesizer 106 to retrieve a particular word from word section 109 of database 108 and produce that word as spoken speech, implement a conditional test or other instruction defined by a control word, or point to a designated number table containing the word indices for the spoken word equivalents of a variable in the sentence. If a number is spoken differently depending upon the context (as in the number one being spoken as "one" or "first"), a different number table should be constructed for each context. The entries in a number table represent the index numbers for the words contained in word section 109 which correspond to the spoken words for the number to be synthesized.

Control words can be used to determine whether the word "AM" or "PM" should be used in a time announcement, whether a singular or plural term should be used (and point to the appropriate word in the word section of the database), select the proper day of the week to announce, etc. The option codes can be used to select the appropriate number table, determine whether the time is announced in 12 or 24 hour format, or perform other functions which involve synthesizing words for numbers.
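The time announcement case can be sketched as follows, assuming a hypothetical option code for the 12 or 24 hour format and a control word that appends "AM" or "PM". The data word names, the number table contents, and the simplified phrasing of minutes are illustrative assumptions.

```python
UNITS = ["zero", "one", "two", "three", "four", "five", "six", "seven",
         "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {20: "twenty", 30: "thirty", 40: "forty", 50: "fifty"}


def build_number_table():
    """Build a hypothetical number table for 0..59: number -> spoken words."""
    table = {}
    for n in range(60):
        if n < 20:
            table[n] = [UNITS[n]]
        else:
            unit = n % 10
            table[n] = [TENS[n - unit]] + ([UNITS[unit]] if unit else [])
    return table


NUMBER_TABLE = build_number_table()

# Hypothetical option codes for the hour format.
OPTION_12_HOUR, OPTION_24_HOUR = 0, 1

# Hypothetical sentence data for a time announcement.
TIME_SENTENCE = [("word", "the time is"), ("hour", None),
                 ("minute", None), ("am_pm", None)]


def announce_time(hour24, minute, option):
    """Interpret TIME_SENTENCE, returning text in place of synthesized speech."""
    spoken = []
    for kind, operand in TIME_SENTENCE:
        if kind == "word":
            spoken.append(operand)
        elif kind == "hour":
            hour = hour24 if option == OPTION_24_HOUR else (hour24 % 12 or 12)
            spoken.extend(NUMBER_TABLE[hour])
        elif kind == "minute":
            spoken.extend(NUMBER_TABLE[minute])
        elif kind == "am_pm" and option == OPTION_12_HOUR:
            # Control word: choose between the "AM" and "PM" word entries.
            spoken.append("PM" if hour24 >= 12 else "AM")
    return " ".join(spoken)


print(announce_time(15, 30, OPTION_12_HOUR))
print(announce_time(15, 30, OPTION_24_HOUR))
```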

After the structure of the database has been defined, the data representing the various indices and the digital data representing the spoken words is burned into a ROM. In this way, a memory device can store both the data representing the word samples for each word to be spoken, and the various links which allow sentence synthesizer 105 to control how a sentence is produced. If it is desired to have the synthesizer be able to produce speech in more than one language, a different ROM or section of an existing ROM should be used to store those words. Thus, the database structure and design of the sentence synthesizer of the present invention permit multiple languages to be produced by a controller running a single application program.

The terms and expressions which have been employed herein are used as terms of description and not of limitation, and there is no intention in the use of such terms and expressions of excluding equivalents of the features shown and described, or portions thereof, it being recognized that various modifications are possible within the scope of the invention claimed.

Claims

We claim:

1. A method of synthesizing a sentence in a desired language using a speech synthesis system, comprising: issuing a command to synthesize the sentence, wherein the command includes a sentence index which identifies the sentence; retrieving a set of data corresponding to the identified sentence from a database by using the sentence index, wherein the set of data includes a string of data words corresponding to words and numerical variables contained in the sentence, and further wherein, the data words can include control terms which incorporate grammar rules of the language being synthesized into the data words and determine the structure of the sentence being synthesized; reading each data word in the string of data words; producing synthesized speech corresponding to a word or numeral represented by the data word if the data word corresponds to a word or numeral in the sentence to be synthesized; and implementing an action which affects the structure of the sentence if the data word is a control term, whereby a sentence may be efficiently synthesized in one of a given set of languages by retrieving the identified sentence from a part of the database containing data words for sentences in that language.

2. The speech synthesis method of claim 1, wherein the command to synthesize the sentence is issued by a controller which communicates with the sentence synthesizer by means of a communication link.

3. The speech synthesis method of claim 1, wherein the synthesized speech corresponding to a word is produced by issuing an instruction from the sentence synthesizer to a word synthesizer.

4. The speech synthesis method of claim 1, wherein the step of synthesizing a numeral further comprises: retrieving a word or words which represent a number or numbers to be synthesized; and issuing an instruction from the sentence synthesizer to a word synthesizer to synthesize that word or words.

5. The speech synthesis method of claim 1, wherein the control terms include instructions which determine the appropriate form for a word depending upon its context.

6. The speech synthesis method of claim 1, wherein the control terms include instructions which determine whether a word to be synthesized is singular or plural.

7. The speech synthesis method of claim 1, wherein the database contains sentences corresponding to a plurality of languages.

8. A speech synthesis system capable of efficiently synthesizing sentences in multiple languages, comprising: a system controller which issues a single command to synthesize a particular sentence; a speech synthesizer which receives the command issued by the controller, wherein the speech synthesizer includes a sentence synthesizer module which controls the production of the synthesized sentence and a word synthesizer module which produces synthesized speech corresponding to a desired word or group of words; and a database which includes a sentence database and a word database and from which the sentence or word synthesizer retrieves a set of data corresponding to the sentence to be synthesized, wherein the set of data includes a string of data words corresponding to the words and numerical variables contained in the sentence, and further wherein, the data words can include control terms which incorporate grammar rules of the language being synthesized into the data words and determine the structure of the sentence being synthesized.

9. The speech synthesis system of claim 8, wherein the database further comprises: a data table containing a word or words representing a number or numbers to be synthesized, wherein an entry in the data table is retrieved by the sentence synthesizer when a data word is a numerical variable.

10. The speech synthesis system of claim 8, wherein the sentence synthesizer further comprises: a word decoder which reads each data word in the string of data words and determines if that data word is a word, a numerical variable, or a control term.

11. The speech synthesis system of claim 8, wherein the database includes data words for more than one language.

12. The speech synthesis system of claim 11, wherein the data words for different languages are stored in different parts of the database.

13. The speech synthesis system of claim 12, further comprising: control means for selecting the part of the database from which to retrieve the set of data corresponding to the sentence to be synthesized based on the language in which the sentence is to be synthesized.

14. The speech synthesis system of claim 8, further comprising: a plurality of databases, wherein each database contains data words for a different language, and further, wherein the sentence and word synthesizer selects which database to retrieve the set of data corresponding to the sentence to be synthesized based on the language in which the sentence is to be synthesized.