EP0762384A2 - Method and apparatus for modifying voice characteristics of synthesized speech - Google Patents


Info

Publication number
EP0762384A2
Authority
EP
European Patent Office
Prior art keywords
speech
text
parameter values
speech parameter
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP96306091A
Other languages
English (en)
French (fr)
Inventor
Bruce Melvin Buntschuh
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AT&T Corp
Original Assignee
AT&T Corp
AT&T IPM Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AT&T Corp and AT&T IPM Corp
Publication of EP0762384A2

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present invention relates to speech synthesizer systems, and more particularly to a speech synthesizer system that utilizes an interactive graphical user interface that controls acoustical characteristics of synthesized speech.
  • Kun-Shan Lin, Alva E. Henderson and Gene A. Frantz disclosed in U.S. patent number 4,624,012 a method and apparatus for modifying voice characteristics of synthesized speech. Their method relies upon separating selected acoustical characteristics, such as the pitch period, the vocal tract model and the speech rate, into their respective speech parameters. These speech parameters are then varied and recombined with the original voice to create a modified synthesized voice having acoustical characteristics differing from the original voice.
  • the Bell Labs text-to-speech synthesizer also permits users to manipulate the speech parameters that control the acoustical characteristics of synthesized speech.
  • users can modify the speech parameters using escape sequences.
  • the escape sequences consist of ASCII codes that indicate to the Bell Labs text-to-speech synthesizer the manner in which to alter one or more speech parameters.
  • At least the following speech parameters are controllable in the Bell Labs text-to-speech synthesizer system: three pitch parameters, rate, the front and back head of the vocal tract, and aspiration.
  • a virtual continuum of new voices can be created from a base synthesized voice.
  • a user is often required to undergo a time-consuming process of experimenting with various combinations of speech parameters before ascertaining which particular combination achieves the desired sound. Experimentation is facilitated if the user is familiar with the text-to-speech synthesizer and the manner in which the speech parameters modify the base voice.
  • the present invention is directed to a method and system that satisfies the need for a facility to explore new combinations of speech parameters in a simple, efficient manner.
  • the method utilizes a graphical user interface for manipulating the speech parameters that control acoustical characteristics of a base synthesized voice.
  • the method comprises the steps of: (1) generating and displaying the graphical user interface; (2) modifying current speech parameter values through the graphical user interface; (3) forming a text string; and (4) outputting the text string to a text to speech synthesizer.
  • the text string includes the current speech parameter values, which indicate to the text to speech synthesizer the change in the corresponding acoustical characteristics of the base synthesized voice.
  • the text string may also include test utterances and escape codes. The test utterances represent text to be converted to speech by the text to speech synthesizer. The escape codes indicate to the text to speech synthesizer the particular acoustical characteristics to alter.
  • modifying the current speech parameter values may be accomplished by selecting a named voice from a listbox in the graphical user interface or by manipulating any combination of parameter scales in the graphical user interface.
  • the named voices in the listbox have associated speech parameter values which are assigned as the current speech parameter values when selected by a user.
  • the graphical user interface includes the following manipulable parameter scales: three pitches, front and rear head of the vocal tract, rate and aspiration. The position of sliders within the parameter scales determines the current speech parameter values.
  • the speech synthesizer system for carrying out the above described method has a text to speech synthesizer operative to modify acoustical characteristics of a base synthesized voice.
  • the speech synthesizer system comprises a facilitating means for manipulating speech parameters and an output means.
  • the facilitating means includes a graphical user interface.
  • the graphical user interface includes parameter scales and formation means.
  • the parameter scales are responsive to input from a user for altering current speech parameter values. By manipulating sliders within the parameter scales, the user can modify the current speech parameter values.
  • the values of the current speech parameter are determined by the positions of the sliders within the parameter scales.
  • the formation means are operative to create a text string which includes the current speech parameter values.
  • the text string may also include test utterances and escape codes.
  • the output means transmits the text string from the facilitating means to the text to speech synthesizer. Includable within the speech synthesizer system is an opening means for initiating the text to speech synthesizer so the text to speech synthesizer is operative to receive the text string from the output means.
  • the present invention includes a dialogue processing means for preparing a dialogue script to be converted to speech.
  • the dialogue processing means is operative (1) to detect speaker names in the dialogue script, (2) to match detected speaker names against named voices in a library of named voices, (3) to modify said dialogue script by replacing the detected speaker names with escape sequences, and (4) to output the modified dialogue script to the text to speech synthesizer.
  • the named voices in the library each have associated speech parameter values.
  • the escape sequences are ASCII codes comprised of escape codes and associated speech parameter values.
  • the escape codes indicate to the text to speech synthesizer particular acoustical characteristics to alter.
  • the associated speech parameter values indicate to the text to speech synthesizer change in acoustical characteristics of the base synthesized voice.
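As an illustration of this escape-sequence scheme, the following Python sketch pairs escape codes with their associated speech parameter values and appends a test utterance. The specific code strings and parameter names are invented placeholders, not the actual Bell Labs codes.

```python
# Illustrative sketch of forming a text string from escape sequences and a
# test utterance. The escape-code strings below are invented placeholders;
# the real Bell Labs escape codes are not reproduced here.
ESCAPE_CODES = {
    "pitchT": "\x1bT",   # assumed code for top pitch
    "pitchR": "\x1bR",   # assumed code for pitch range
    "pitchB": "\x1bB",   # assumed code for baseline pitch
    "rate": "\x1bS",     # assumed code for speaking rate
}

def build_text_string(params, utterance):
    """Pair each escape code with its parameter value, string the pairs
    together as escape sequences, and follow them with the test utterance."""
    sequences = "".join(ESCAPE_CODES[name] + str(value)
                        for name, value in params.items())
    return sequences + utterance

text = build_text_string({"pitchB": 120, "rate": 180}, "Hello world")
```

The synthesizer would then strip the leading escape sequences, set the named parameters, and speak only the trailing utterance.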
  • a computer-based speech synthesizer system 02 comprises a processing unit 07, a display screen terminal 08, and input devices, e.g., a keyboard 10 and a mouse 12.
  • the processing unit 07 includes a processor 04 and a memory 06.
  • the mouse 12 includes switches 13 having a positive on and a positive off position for generating signals to the speech synthesizer system 02.
  • the screen 08, keyboard 10 and pointing device 12 are collectively known as the display.
  • the speech synthesizer system 02 utilizes UNIX® as the computer operating system and X Windows® as the windowing system for providing an interface between the user and a graphical user interface.
  • UNIX and X Windows can be found resident in the memory 06 of the speech synthesizer system 02 or in a memory of a centralized computer, not shown, to which the speech synthesizer system 02 is connected.
  • X Windows is designed around what is described as client/server architecture. This term denotes a cooperative data processing effort between certain computer programs, called servers, and other computer programs, called clients.
  • X Windows is a display server, which is a program that handles the task of controlling the display.
  • Graphical user interfaces, also referred to herein as "GUIs", are clients, which are programs that need to gain access to the display in order to receive input from the keyboard 10 and/or mouse 12 and to transmit output to the screen 08.
  • X Windows provides data processing services to the GUI since the GUI cannot perform operations directly on the display. Through X Windows, the GUI is able to interact with the display.
  • X Windows and the GUI communicate with each other by exchanging messages.
  • X Windows uses what is called an event model.
  • the GUI informs X Windows of the events of interest to the GUI, such as information entered via the keyboard 10 or clicking the mouse 12 in a predetermined area, and then waits for any of the events of interest to occur. Upon such occurrence, X Windows notifies the GUI so the GUI can process the data.
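The register-and-notify pattern described above can be sketched in a toy form; the class and event names here are illustrative, not part of the X protocol.

```python
# Toy sketch of the X event model: a client registers interest in events,
# and the server invokes the client's handler when a matching event occurs.
class ToyDisplayServer:
    def __init__(self):
        self.handlers = {}          # event name -> list of callbacks

    def register(self, event, callback):
        # The client declares which events are of interest to it.
        self.handlers.setdefault(event, []).append(callback)

    def dispatch(self, event, data):
        # Only clients that declared interest in this event are notified.
        for callback in self.handlers.get(event, []):
            callback(data)

server = ToyDisplayServer()
received = []
server.register("ButtonPress", received.append)
server.dispatch("KeyPress", "x")                  # no handler; ignored
server.dispatch("ButtonPress", "click at (10, 20)")  # handler runs
```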
  • the present invention is a graphical user interface and can be found resident in the memory 06 of the speech synthesizer system 02 or the memory of the centralized computer.
  • the interface provides an interactive means for facilitating experimentation with the speech parameters that control the acoustical characteristics of synthesized speech.
  • the present invention is written in the Tcl-Tk language and operates with the standard windowing shell provided with the Tcl-Tk package.
  • Tcl is a simple scripting language (its name stands for "tool command language") for controlling and extending applications.
  • Tk is an X Windows toolkit which extends the core Tcl facilities with commands for building user interfaces having Motif "look and feel" in Tcl scripts instead of C code.
  • the preferred embodiment of the present invention utilizes UNIX's multitasking and pipe features to create an efficient speech synthesizer system that provides effectively instant feedback for facilitating experimentation with speech parameters.
  • the multitasking feature allows more than one application program to run concurrently on the same computer system.
  • the pipe feature involves multitasking and allows the output of one program to be directly passed as input to another program.
  • the Tcl scripting language utilizes these two UNIX features to provide a mechanism for communicating with other programs.
  • the present invention program (written in the Tcl language) communicates with a concurrently running Bell Labs text-to-speech synthesizer program through a UNIX pipe.
  • the Bell Labs text-to-speech synthesizer program can be found resident in the memory 06 of the speech synthesizer system 02 or in the memory of the centralized computer.
  • the present invention uses UNIX pipes to send a text string comprised of a series of escape sequences and test utterances to the Bell Labs text-to-speech synthesizer.
  • the escape sequences are ASCII codes comprised of pairs of escape codes and associated speech parameter values.
  • the escape codes and parameter values identify to the Bell Labs Text-to-speech synthesizer which speech parameters are to be set and the values to be assigned to each of the speech parameters, respectively.
  • the test utterances represent the text to be converted to speech by the Bell Labs text-to-speech synthesizer.
  • upon receipt of the text string, the Bell Labs text-to-speech synthesizer will convert the test utterances to speech using a base synthesized voice altered according to the escape sequences.
  • users are able to explore combinations of speech parameters that would normally be time consuming if they were to be manually entered into the Bell Labs text-to-speech synthesizer.
  • the fact that the user is actually manipulating the Bell Labs text-to-speech escape sequences is entirely transparent.
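The pipe mechanism can be approximated in Python with the subprocess module. Here `cat` stands in for the synthesizer process, since the Bell Labs program itself is not available: it simply echoes its input, which shows what the synthesizer would receive on its end of the pipe. The escape code in the string is an invented placeholder.

```python
# Sketch of sending a text string to a concurrently running program through
# a pipe. `cat` is a stand-in for the text-to-speech process; a real
# synthesizer would consume the string and produce audio instead of echoing.
import subprocess

proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE,
                        stdout=subprocess.PIPE, text=True)
# The escape sequence "\x1bB120" is illustrative, not an actual code.
out, _ = proc.communicate("\x1bB120Hello world\n")
```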
  • Figs. 3 and 5-8 are flowcharts illustrating the sequence of steps utilized by the present invention for processing data to the Bell Labs text-to-speech synthesizer.
  • Fig. 3 illustrates the main routine for the present invention graphical user interface and Figs. 5-8 illustrate how changes to the speech parameters are detected and handled by the main routine.
  • the program begins with the initialization process in step 3a, as shown in Fig. 3.
  • a display is generated and a default initialization file in a user's home directory is used to set current values for each speech parameter.
  • Step 3a also creates a pipeline using a Tcl open command and command-line arguments. The pipeline allows the present invention to send data directly to the Bell Labs text-to-speech synthesizer.
  • Tk implements a ready-made set of controls called "widgets" with the Motif "look and feel."
  • the display 20 comprises the following controls: parameter scales 22, a scrollable list 24 of named voices, a male voice button 26a, a female voice button 26b, an input box 28 for entering test utterances, a "Say It" button 30 and a display box 32.
  • Manipulating any of the controls (except for the display box 32) will cause a change to the current speech parameter values or test utterances. Step 3b in Fig. 3 will detect any of these changes.
  • the parameter scales 22 are created using the Tk scale widgets and button widgets.
  • the parameter scales 22 provide means to modify the current values for the following speech parameters: pitchT, pitchR, pitchB, rate, front head, back head and aspiration.
  • Each of the parameter scales 22 are manipulable within a range of values set according to acceptable ranges of the Bell Labs text-to-speech synthesizer. Additional scales can be included in the display 20 for manipulating other speech parameters.
  • Each parameter scale 22 has a slider 22a, a "-" button 22b and a "+" button 22c.
  • the parameter scales 22 display a scale value 22d that corresponds to the relative position of the slider 22a within the range of the corresponding parameter scale 22.
  • the scale widget evaluates a Tcl command that causes the current speech parameter values to be updated with the scale values 22d.
  • repositioning the sliders 22a has the effect of changing the current speech parameter values.
  • the present invention graphical user interface provides three techniques for changing the scale values 22d by repositioning the slider 22a with a mouse 12, joystick or other similar device: clicking on or selecting the "-" or "+" buttons 22b and 22c, dragging the slider 22a, or clicking in the scale 22. Any of these actions will trigger the occurrence of an event of interest to the present invention graphical user interface.
  • buttons 22b and 22c are linked to the parameter scales 22 by a Tcl bind command. Clicking on either the "-" button 22b or the "+" button 22c in step 5a, as shown in Fig. 5, will cause the corresponding parameter scale 22 to be repositioned left or right a predetermined increment in step 5b. Dragging the sliders 22a or clicking in the parameter scales 22 will also cause the sliders 22a to be repositioned.
  • Whenever any parameter scale 22 is repositioned, as in steps 5a-c, it becomes necessary to update the current speech parameter value with the current scale value 22d of the repositioned parameter scale 22. This is done by step 5d and is detected by step 3b.
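The scale behaviour of steps 5a-5d can be modelled without a GUI. The class below is a hedged sketch: a slider position clamped to the scale's range, "-"/"+" buttons moving it by an increment, and every repositioning pushing the new value into a shared dictionary of current parameter values (the analogue of step 5d). The names and increment are assumptions for illustration.

```python
# Non-GUI model of a parameter scale with a slider and "-"/"+" buttons.
class ParameterScale:
    def __init__(self, name, lo, hi, value, current, increment=1):
        self.name, self.lo, self.hi = name, lo, hi
        self.increment = increment
        self.current = current      # shared dict of current speech parameter values
        self.set(value)

    def set(self, value):
        # Clamp to the scale's acceptable range, then update the current
        # value (step 5d, which the main loop detects in step 3b).
        self.value = max(self.lo, min(self.hi, value))
        self.current[self.name] = self.value

    def minus(self):                # "-" button: reposition left (step 5b)
        self.set(self.value - self.increment)

    def plus(self):                 # "+" button: reposition right (step 5b)
        self.set(self.value + self.increment)

current = {}
rate = ParameterScale("rate", 50, 500, 180, current, increment=10)
rate.plus()                         # current["rate"] is now 190
```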
  • the present invention can utilize a graphical user interface that has entry boxes for users to change the current speech parameter values by typing in the desired number.
  • the scrollable list 24 is created with the Tk listbox widget and provides a collection of previously created voices stored as named voices. These named voices are loaded in the list 24 in step 3a from the user's default initialization file or from a system default initialization file.
  • the default initialization file includes named voices and associated speech parameter values.
  • the user can select a named voice from the list 24 by double-clicking on one with the mouse 12.
  • a Tcl bind command is used to link a Tcl script to the double-clicking action.
  • the Tcl script causes the speech parameters values associated with the selected named voice to be assigned as the current speech parameter values, as shown by steps 6a and 6b in Fig. 6. This provides a quick mechanism for recalling previously formed combinations of speech parameter values.
  • the sliders 22a are subsequently repositioned to reflect the current speech parameter values. This change will also be detected by step 3b in Fig. 3.
  • the Bell Labs text-to-speech synthesizer provides one base male speaker and one base female speaker.
  • the present invention permits the user to select one of the two speakers as a base voice by clicking on either button 26a or 26b created with the Tk radio button widget.
  • the acoustical characteristics of the selected base voice are altered according to the current speech parameter values.
  • the current speech parameter value for the sex of the base voice is subsequently updated in step 8b. Step 3b of the main routine will detect this change.
  • the input box 28 is created with the Tk entry widget to permit the user to enter the test utterances, i.e., the text the user desires to have the Bell Labs text-to-speech synthesizer convert to speech. Any change to the input box 28 (or the test utterances) is detected in step 3b.
  • Tcl script forms and transmits to the Bell Labs text-to-speech synthesizer via the UNIX pipe, as shown in Figs. 2 and 3 by step 3c, the text string comprised of the series of escape sequences followed by the test utterances from the input box 28.
  • the Tcl script first pairs the escape codes with their associated speech parameter values and then strings them together to form the series of escape sequences.
  • the test utterances are converted to speech providing users with effectively instant feedback regarding the effects of the new combination of speech parameters on the selected base voice.
  • the display box 32 shows the series of escape sequences that were ultimately transmitted by the present invention graphical user interface to the Bell Labs text-to-speech synthesizer.
  • An escape sequence for the base voice and each speech parameter except for aspiration is included in the series of escape sequences.
  • the current Bell Labs text-to-speech synthesizer does not allow for aspiration to be controlled by an escape code. Changes to aspiration are handled by the present invention through a command-line argument that opens the Bell Labs text-to-speech synthesizer. Normally only one Bell Labs text-to-speech synthesizer is opened per session unless the current speech parameter value for aspiration has been changed.
  • following step 7a, the present invention determines in step 7b whether the parameter value for aspiration has changed since the last transmission of the text string to the Bell Labs text-to-speech synthesizer. If it has not changed, the present invention proceeds to step 7d and passes the text string to the current Bell Labs text-to-speech synthesizer. If it has changed, the present invention proceeds to step 7c before proceeding to step 7d. In step 7c, the present invention closes the pipeline for the current Bell Labs text-to-speech synthesizer and opens another one using the current parameter value for aspiration.
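The reopen-on-aspiration-change logic of steps 7b-7d can be sketched as follows. `open_synthesizer` is a stand-in for launching the synthesizer process with a command-line aspiration argument; the dictionary pipeline is purely illustrative.

```python
# Sketch of steps 7b-7d: reopen the synthesizer pipeline only when the
# aspiration value differs from the one used to open the current pipeline.
def open_synthesizer(aspiration):
    # Stand-in for opening a pipe to the synthesizer with aspiration
    # passed as a command-line argument.
    return {"aspiration": aspiration, "received": []}

def send(pipeline, text, aspiration):
    if aspiration != pipeline["aspiration"]:      # step 7b: has it changed?
        pipeline = open_synthesizer(aspiration)   # step 7c: close and reopen
    pipeline["received"].append(text)             # step 7d: pass the text string
    return pipeline

p = open_synthesizer(0.5)
p = send(p, "first utterance", 0.5)    # unchanged: same pipeline reused
p = send(p, "second utterance", 0.8)   # changed: a new pipeline is opened
```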
  • the present invention allows users to save the current speech parameter values as a newly created named voice.
  • the present invention provides an entry box 34 entitled "Name of new voice:", as shown in Fig. 4, to record new combinations of speech parameter values as a named voice. This new named voice will subsequently be added to the list 24 and stored in the default initialization file with its associated speech parameter values.
  • Another embodiment of the present invention includes a companion preprocessor.
  • This embodiment takes advantage of the named voices created with the present invention interface. Once some voices have been created and stored, they can be used to process dialogue scripts or other applications.
  • An example of a dialogue script having speaker names and utterances is shown in Fig. 10.
  • the preprocessor accesses data in step 9a from a .voice file, as shown in Fig. 9, which contains a list of named voices and their associated speech parameter values.
  • in steps 9b and 9c the preprocessor filters out the bracket-enclosed speaker names and then replaces them with escape sequences formed using the speech parameter values associated with the named voices matching the speaker names.
  • the escape sequences and the utterances are output in step 9d to the Bell Labs text-to-speech synthesizer to be converted to speech.
  • the result is a spoken colloquy with different voices. If the voice file does not have a named voice matching the speaker name, a default substitute named voice may be used or the program can prompt the user for an alternate named voice.
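A minimal version of this preprocessing pass could look like the sketch below. The `[Name]` markup follows the bracket-enclosed speaker names described above; the voice library contents and the escape-code format are invented for illustration, including the default-substitute behaviour for unmatched names.

```python
# Sketch of the dialogue preprocessor: replace bracketed speaker names with
# escape sequences built from a library of named voices (the .voice file).
import re

VOICES = {                      # named voices with associated parameter values
    "Alice": {"pitchB": 200, "rate": 160},
    "Bob": {"pitchB": 110, "rate": 180},
}
CODES = {"pitchB": "\x1bB", "rate": "\x1bS"}   # invented escape codes
DEFAULT = "Bob"                 # substitute voice for unmatched speaker names

def escape_sequence(voice):
    params = VOICES.get(voice, VOICES[DEFAULT])
    return "".join(CODES[k] + str(v) for k, v in params.items())

def preprocess(script):
    # Steps 9b-9c: filter out "[Speaker]" markers and substitute escape
    # sequences for the matching named voices.
    return re.sub(r"\[([^\]]+)\]", lambda m: escape_sequence(m.group(1)), script)

out = preprocess("[Alice] Hello. [Bob] Hi there.")
```

Feeding the resulting string to the synthesizer would yield the spoken colloquy, with each utterance rendered in its speaker's voice.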

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)
  • Digital Computer Display Output (AREA)
EP96306091A 1995-09-01 1996-08-21 Method and apparatus for modifying voice characteristics of synthesized speech Withdrawn EP0762384A2 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US522895 1983-08-12
US52289595A 1995-09-01 1995-09-01

Publications (1)

Publication Number Publication Date
EP0762384A2 (de) 1997-03-12

Family

ID=24082821

Family Applications (1)

Application Number Title Priority Date Filing Date
EP96306091A Withdrawn EP0762384A2 (de) 1995-09-01 1996-08-21 Method and apparatus for modifying voice characteristics of synthesized speech

Country Status (2)

Country Link
EP (1) EP0762384A2 (de)
JP (1) JPH09127970A (de)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0880127A2 (de) * 1997-05-21 1998-11-25 Nippon Telegraph and Telephone Corporation Method and apparatus for editing/creating synthetic speech messages, and recording medium
WO2000016310A1 (de) * 1998-09-11 2000-03-23 Hans Kull Device and method for digital speech processing
EP1045372A2 (de) * 1999-04-16 2000-10-18 Matsushita Electric Industrial Co., Ltd. Speech communication system
US6956864B1 (en) 1998-05-21 2005-10-18 Matsushita Electric Industrial Co., Ltd. Data transfer method, data transfer system, data transfer controller, and program recording medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000352991A (ja) * 1999-06-14 2000-12-19 Nippon Telegr & Teleph Corp <Ntt> Speech synthesizer with spectrum correction function
GB2501062B (en) * 2012-03-14 2014-08-13 Toshiba Res Europ Ltd A text to speech method and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0880127A2 (de) * 1997-05-21 1998-11-25 Nippon Telegraph and Telephone Corporation Method and apparatus for editing/creating synthetic speech messages, and recording medium
EP0880127A3 (de) * 1997-05-21 1999-07-07 Nippon Telegraph and Telephone Corporation Method and apparatus for editing/creating synthetic speech messages, and recording medium
US6226614B1 (en) 1997-05-21 2001-05-01 Nippon Telegraph And Telephone Corporation Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon
US6334106B1 (en) 1997-05-21 2001-12-25 Nippon Telegraph And Telephone Corporation Method for editing non-verbal information by adding mental state information to a speech message
US6956864B1 (en) 1998-05-21 2005-10-18 Matsushita Electric Industrial Co., Ltd. Data transfer method, data transfer system, data transfer controller, and program recording medium
WO2000016310A1 (de) * 1998-09-11 2000-03-23 Hans Kull Device and method for digital speech processing
EP1045372A2 (de) * 1999-04-16 2000-10-18 Matsushita Electric Industrial Co., Ltd. Speech communication system
EP1045372A3 (de) * 1999-04-16 2001-08-29 Matsushita Electric Industrial Co., Ltd. Speech communication system

Also Published As

Publication number Publication date
JPH09127970A (ja) 1997-05-16

Similar Documents

Publication Publication Date Title
US6006187A (en) Computer prosody user interface
US6826282B1 (en) Music spatialisation system and method
US7013297B2 (en) Expert system for generating user interfaces
Aspinall Proof General: A generic tool for proof development
US6384829B1 (en) Streamlined architecture for embodied conversational characters with reduced message traffic
JP3843155B2 (ja) Voice identification system for devices in a home environment
EP0534409A2 (de) Method and system for controlling the execution of a user program
US7444627B2 (en) System and method for creating a performance tool and a performance tool yield
US20130110517A1 (en) Enabling speech within a multimodal program using markup
US20020118220A1 (en) System and method for dynamic assistance in software applications using behavior and host application models
WO2007139040A1 (ja) Speech condition data generation device, speech condition visualization device, speech condition data editing device, speech data reproduction device, and speech communication system
JP3454285B2 (ja) Data processing apparatus and data processing method
Eisenstein et al. Agents and GUIs from task models
US8725505B2 (en) Verb error recovery in speech recognition
EP0762384A2 (de) Method and apparatus for modifying voice characteristics of synthesized speech
JPH08115081 (ja) Musical score display device
GB2304945A (en) An object-oriented interface controlling multimedia devices
KR20030031202 (ko) User interface method via computer
JP3525817B2 (ja) Operation response sound generation device and recording medium storing operation response sound generation program
US5864814A (en) Voice-generating method and apparatus using discrete voice data for velocity and/or pitch
CA2180390A1 (en) Method and apparatus for modifying voice characteristics of synthesized speech
JP3294691B2 (ja) Object-oriented system construction method
McGlashan et al. A speech interface to virtual environments
US20030097486A1 (en) Method for automatically interfacing collaborative agents to interactive applications
Melin ATLAS: A generic software platform for speech technology based applications

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): DE FR GB IT

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN

18W Application withdrawn

Withdrawal date: 19970513