AU769036B2 - Device and method for digital voice processing - Google Patents

Device and method for digital voice processing

Info

Publication number
AU769036B2
AU769036B2 AU60813/99A AU6081399A
Authority
AU
Australia
Prior art keywords
prosody
generating
speech
speaker
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU60813/99A
Other versions
AU6081399A (en)
Inventor
Hans Kull
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of AU6081399A publication Critical patent/AU6081399A/en
Application granted granted Critical
Publication of AU769036B2 publication Critical patent/AU769036B2/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

The invention relates to a device for digital voice processing which comprises a sentence melody generating device for generating a sentence melody for a text, and an editing device for displaying and modifying the generated sentence melody.

Description

APPARATUS AND METHOD FOR DIGITAL SPEECH PROCESSING

The present invention relates to an apparatus and a method for digital speech processing and speech generation, respectively. Present systems for speech output are typically applied in areas in which synthetic speech is acceptable or even desired. The present invention, by contrast, relates to a system which makes it possible to synthetically generate speech that gives a natural impression.
In present systems for digital speech generation the information regarding prosody and intonation is generated automatically, as described e.g. in EP 0689706. In some systems it is possible to insert additional commands into the text stream before it is handed over to the speech generator, e.g. in EP 0598599.
Those commands are input e.g. as (non-pronounceable) special characters, as described e.g. in EP 0598598.
The commands inserted into the text stream may also contain indications regarding the characteristics of the speaker (parameters of the speaker's model). EP 0762384 describes a system in which those speaker characteristics may be inserted on the screen by means of a graphical user interface.
The speech synthesis is carried out using auxiliary information which is stored in a database (as waveform sequences in the case of EP 0831460).
Nevertheless, for the pronunciation of words which are not stored in the database, pronunciation rules need to be provided in the program. The combination of the individual sequences leads to distortions and acoustic artefacts if no measures are provided for their suppression. This problem (commonly called "segmental quality"), however, is nowadays mostly solved (cp. for example Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997). Nevertheless, even in modern speech synthesis systems several further problems arise.
One such problem is, for example, multiple language capability.
Another problem consists in improving the prosodic quality, i.e. the quality of the intonation (cp. e.g. Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997). This difficulty is based on the fact that the intonation can only insufficiently be constructed from the orthographic input information alone. It also depends on higher levels such as semantics and pragmatics, as well as on the situation and the type of the speaker. In general it can be said that the quality of today's speech output systems fulfils the requirements where the listener expects or accepts synthetic speech. Often, however, the quality of synthetic speech is considered insufficient or unsatisfactory.
It is therefore an object of the present invention to provide an apparatus and a method for digital speech processing which allow the generation of synthetic speech of better quality. It is a further object of the invention to synthetically generate speech giving a natural impression. The applications range from the generation of simple texts for multimedia applications up to the sound generation for the dubbing of movies, radio dramas, and audio books.
Even if the synthetically generated speech gives a natural impression, the possibility to intervene is sometimes needed in order to generate dramaturgical effects. Another object of the present invention therefore resides in the provision of such possibilities to intervene.
The present invention is defined in the independent claims. The dependent claims define particular embodiments of the invention.
The problem of the invention is substantially solved by providing the capability to modify the prosody generated for a text by means of an editor.
Particular embodiments of the invention provide for the addition of further characteristics of the synthetically generated speech in addition to the editing of the prosody.
Thereby the starting point is the written text. However, in order to achieve sufficient (in particular prosodic) quality as well as dramaturgical effects, the user is, in a particular embodiment, provided with far-reaching capabilities to intervene. The user is in the position of a director who defines the speakers on the system and assigns to them speech rhythm and prosody, pronunciation and intonation.
Preferably the present invention also comprises the generation of a phonetic transcription for a written text, as well as the capability to modify the generated phonetic transcription or to modify it based on modifiable rules. Thereby, for example, a particular accent of a speaker may be generated.
According to a further preferred embodiment the present invention comprises a dictionary means in which the words of one or more languages are stored together with their pronunciation. In the latter case this allows for multilingual capability (multilinguality), i.e. the processing of text in different languages.
Preferably the editing of the generated phonetic transcription or the prosody, respectively, is carried out by means of an editor which is easy to use, such as a graphical user interface.
According to a further preferred embodiment, speaker's models which are either predefined or defined and modified by a user are taken into account in the speech processing. Thereby the characteristics of different speakers can be realized, be they male or female voices or different accents of a speaker, such as a Bavarian, a Swabian or a Northern German accent.
According to a particularly preferred embodiment the apparatus comprises: a dictionary in which for all words the pronunciation is also stored in a phonetic transcription (where reference is made hereinafter to a phonetic transcription, this may mean an arbitrary phonetic transcription, such as for example the SAMPA notation, compare for example "Multilingual speech input/output assessment, methodology and standardization, standard computer-compatible transcription, pp. 29-31, in Esprit Project 2589 (SAM) Fin. Report SAM-UCC-037", or the international phonetic script known from teaching materials, compare for example "The Principles of the International Phonetic Association: A Description of the International Phonetic Alphabet and the Manner of Using It. International Phonetic Association, Dept. of Phonetics, Univ. College of London"); a translator which converts input text into a phonetic transcription and generates a prosody; an editor with which text can be input and assigned to a speaker, and in which the generated phonetic transcription as well as the prosody can be displayed and modified; an input module in which the speaker's models can be defined; a system for digital speech generation which generates, from the phonetic transcription together with the prosody, signals representing spoken speech or data representing such signals, and which is capable of processing different speaker's models; a system of digital filters and other devices (for reverberation, echo, etc.) with which particular effects can be generated; a sound archive; as well as a mixing device in which the generated speech signals can be mixed with sounds from the archive and edited with effects. The invention can be realized either in a hybrid manner by means of software and hardware or fully by means of software. The generated digital speech signals can be output by means of a particular device for digital audio or by means of a PC sound board.
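The flow through the components enumerated above (dictionary, translator, speech generation) can be sketched in a few lines of Python. This is a minimal illustration only; all function names, dictionary entries and pitch values are invented for the example and are not taken from the patent, and the synthesis step merely pairs phonemes with pitch values instead of producing audio.

```python
# Illustrative sketch of the pipeline: dictionary lookup -> prosody
# draft -> "synthesis". Names and data are invented for the example.

def transcribe(text, dictionary):
    """Translator step: replace each word by its phonetic entry
    (unknown words are passed through unchanged)."""
    return [dictionary.get(w.lower(), w) for w in text.split()]

def generate_prosody(phonemes):
    """Placeholder heuristic: flat pitch with a final fall."""
    pitch = [120.0] * len(phonemes)
    if pitch:
        pitch[-1] = 100.0
    return pitch

def synthesize(phonemes, pitch):
    """Stand-in for the speech generation unit: pair each phoneme
    with its pitch value instead of producing real audio."""
    return list(zip(phonemes, pitch))

dictionary = {"hello": "h@loU", "world": "w3:ld"}  # SAMPA-like entries
phonemes = transcribe("Hello world", dictionary)
frames = synthesize(phonemes, generate_prosody(phonemes))
```

In the full system the prosody draft would be displayed in the editor and modified by the user before the synthesis step runs.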
The present invention will be described hereinafter in detail by means of several embodiments and by referring to the accompanying drawings.
Fig. 1 shows a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.
In the embodiment of the present invention described hereinafter the invention comprises several individual components which may be realized by means of one or more digital processing apparatuses and the combination and operation of which is described in more detail hereinafter.
The dictionary 100 comprises simple tables (one for each language) in which the words of a language are stored together with their pronunciation. The tables may be extended arbitrarily for the incorporation of additional words and their pronunciation. For particular purposes, e.g. for the generation of accents, there may also be generated additional tables with different phonetic entries in one language.
Each of the different speakers is assigned one table of the dictionary.
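The per-language tables (with an additional table for accent variants, as described above) can be sketched as nested dictionaries. The entries below are illustrative SAMPA-like strings invented for the example, not taken from the patent.

```python
# Sketch of the dictionary component: one table per language, plus an
# extra table with deviating entries for a regional accent. All
# entries are illustrative.

dictionary = {
    "en": {"house": "haUs", "speech": "spi:tS"},
    "de": {"haus": "haUs", "sprache": "SpRa:x@"},
}

# An additional table for the same language with different phonetic
# entries, e.g. to model an accent.
dictionary["de-accent"] = dict(dictionary["de"], sprache="Spra:xe")

def lookup(word, table):
    """Return the stored pronunciation, or None if the word is unknown
    (the real system would then fall back to pronunciation rules)."""
    return dictionary[table].get(word.lower())
```

Extending a table for additional words is then simply a matter of adding entries; a speaker is pointed at one of the tables by name.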
The translator 110 on the one hand generates the phonetic transcriptions by replacing the words of the input text with their phonetic correspondences in the dictionary. If modifiers, which will be described later in more detail, are stored in the speaker's model, they are used for modifying the pronunciation.
Additionally, it generates the prosody using heuristics known in speech processing. Such heuristics are e.g. the model of Fujisaki (1992) and other acoustic methods, or the perceptual models, e.g. that of d'Alessandro and Mertens (1995). Both of these, as well as older linguistic models, are described e.g. in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997". There one can also find methods for the segmentation (setting of breaks), which is likewise generated by the translator.
The selection of the method is thereby of rather low importance, since the translator only generates a first version of the prosody which can still be modified by the user.
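A prosody draft of the kind the translator produces can be sketched with a crude declination-line heuristic (pitch falls over the utterance, stressed syllables are raised). This is an invented stand-in for the published models named above, chosen only to illustrate why the exact model matters little: the output is merely a starting point for editing.

```python
# Illustrative prosody draft: linear pitch declination plus a fixed
# boost on stressed syllables. All values are invented.

def prosody_draft(syllables, stressed, start=130.0, end=100.0):
    """Return one pitch value (Hz) per syllable."""
    n = len(syllables)
    contour = []
    for i, syl in enumerate(syllables):
        pitch = start + (end - start) * i / max(n - 1, 1)  # declination
        if syl in stressed:
            pitch += 15.0  # accentuation
        contour.append(round(pitch, 1))
    return contour

draft = prosody_draft(["gu:", "tn", "ta:k"], stressed={"ta:k"})
# The user can now edit this contour in the editor before synthesis.
```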
Editor 120 provides the user with an instrument with which he can input and modify pronunciation, intonation, accentuation, speed, volume, breaks (interruptions), etc.
At first the user assigns a speaker's model to the text segments to be processed; its composition and operation will be explained in more detail later. The translator reacts to this assignment by adapting the phonetics and possibly the prosody to the speaker's model and generating them anew. The phonetics is displayed to the user in a phonetic transcription; the prosody is displayed e.g. in a symbolic notation taken from music (musical notation). The user then has the possibility to modify them, to listen to individual text segments, to improve his inputs once again, and so on.
The texts themselves may of course be entered in the editor if they cannot be imported directly from another text processing system.
Speaker's models 130 are, for example, parameterizations for speech generation. In these models the characteristics of the human speech organs are modelled. The function of the vocal cords is represented by a sequence of pulses of which only the frequency (pitch) can be varied. The remaining characteristics (oral cavity, nasal cavity) of the speech organs are realized by means of digital filters. Their parameters are stored in the speaker's model. Standard models (child, young lady, old man, etc.) are stored. The user may generate additional models from them by suitably choosing or amending the parameters and storing the model. The parameters stored therein are used during the speech generation, which will be explained later, together with the prosody information for the intonation.
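The source-filter parameterization described above can be sketched as follows: the vocal cords become a pulse train whose only free parameter is the pitch, while the rest of the vocal tract is a stored set of filter parameters. All parameter values and model names below are invented for illustration, not taken from the patent.

```python
# Sketch of a speaker's model: glottal pulse source plus stored
# filter parameters. Values are illustrative only.

def pulse_train(pitch_hz, duration_s, sample_rate=8000):
    """Glottal source: 1.0 at each pitch period, 0.0 elsewhere."""
    period = int(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

# Stored standard models: only the filter parameters and default
# pitch differ between speakers.
SPEAKER_MODELS = {
    "child":   {"pitch": 300.0, "filter": [0.9, -0.2]},
    "old_man": {"pitch": 110.0, "filter": [0.7, -0.4]},
}

def derive_model(base, **changes):
    """User-defined model: copy a standard model, amend parameters."""
    model = dict(SPEAKER_MODELS[base])
    model.update(changes)
    return model

lady = derive_model("child", pitch=220.0)
```

The derived model keeps the base model's filter parameters while overriding only the chosen parameters, mirroring how a user would amend and store a standard model.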
Thereby also particularities of the speaker, such as accents or speech impediments, may be input. These are used by the translator for modifying the pronunciation. A simple example of such a modifier is the rule to replace (in the phonetic transcription) one sound by another, e.g. to generate the accent of a person from Hamburg.
A speaker's model may e.g. concern the rules according to which the translator generates the phonetic transcription. Different speaker's models may thereby proceed according to different rules. A speaker's model may, however, also correspond to a certain set of filter parameters in order to process the speech signals according to the speech characteristics thereby prescribed. Of course, various combinations of these two aspects of a speaker's model are also conceivable.
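The rule aspect of a speaker's model, i.e. the pronunciation modifiers applied by the translator, can be sketched as ordered search-and-replace rules over the phonetic transcription. The concrete rule shown ("St" to "st" in a SAMPA-like notation) is only an invented illustration of the mechanism; the patent does not specify the actual substitutions.

```python
# Sketch of pronunciation modifiers: ordered (old, new) substitution
# rules applied to a phonetic transcription. The rule is illustrative.

def apply_modifiers(transcription, rules):
    """Apply each (old, new) rule in order to the transcription."""
    for old, new in rules:
        transcription = transcription.replace(old, new)
    return transcription

accent_rules = [("St", "st")]          # invented accent rule
modified = apply_modifiers("Stra:s@", accent_rules)
```

Because the rules are data stored per speaker's model, two models can share the same filter parameters yet pronounce the same text differently, or vice versa.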
The task of the speech generation unit 140 consists in generating a numerical data stream representing digital speech signals, based on the given text together with the phonetic and prosodic additional information generated by the translator and edited by the user. This data stream can then be converted into analog sound signals (the text to be output) by an output device 150, which may be a digital audio device or a sound board in a PC.
For generating the speech a conventional text-to-speech conversion method can be used, in which, however, the pronunciation and the prosody have already been generated. In general one distinguishes between rule-based synthesizers and concatenation-based synthesizers.
Rule-based synthesizers operate using rules for the generation of the sounds and the transitions between them. These synthesizers operate with a large number of parameters, the determination of which is very demanding. However, very good results may be achieved with this type of synthesizer. An overview of such systems and further references may be found in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997".
Concatenation-based synthesizers, on the other hand, are easier to handle. They work with a database which stores all possible pairs of sounds. These can easily be concatenated; however, systems providing good quality require high computational power. Systems of this type are described in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997" and in "Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997".
In principle both types of systems can be used. In rule-based synthesizers the prosodic information directly influences the rules, while in concatenation-based systems the prosodic information is superposed in a suitable manner.
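The concatenation-plus-superposition path can be sketched as follows. The "waveforms" are string placeholders and the pitch values are invented; a real system would store sampled diphone waveforms and shift their pitch (e.g. by resampling or PSOLA-like techniques) rather than merely tagging the units.

```python
# Sketch of concatenation-based synthesis with prosody superposed
# afterwards. All data are illustrative stand-ins for real waveforms.

DIPHONE_DB = {("h", "a"): "wave_ha", ("a", "t"): "wave_at"}

def concatenate(phonemes):
    """Look up each adjacent sound pair in the diphone database."""
    return [DIPHONE_DB[pair] for pair in zip(phonemes, phonemes[1:])]

def superpose_prosody(units, pitch_contour):
    """Attach one pitch value per unit; a real system would modify
    the stored waveform accordingly."""
    return list(zip(units, pitch_contour))

units = concatenate(["h", "a", "t"])
out = superpose_prosody(units, [120.0, 105.0])
```

In a rule-based synthesizer the edited prosody would instead feed directly into the sound generation rules, so no separate superposition step is needed.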
For the generation of particular effects 160, known techniques from digital signal processing are used, such as digital filters (e.g. bandpass filters for a telephone effect), reverberation generators, etc. They may also be applied to sounds stored in an archive 170.
In archive 170, sounds such as street noise, railway noise, noise of children, sound of the sea, background music etc. are stored. The archive may be extended arbitrarily. The archive may simply be a collection of files containing digitized noises; it may, however, also be a database in which noises are stored as BLOBs (binary large objects).
In the mixing device 180 the speech signals thus generated are combined with the background noises. The volume of each signal may thereby be adjusted before combination. Additionally, it is possible to apply effects to each signal individually or to all of them together.
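The mixing step (per-signal volume adjustment followed by sample-wise addition) can be sketched in a few lines. The signals are plain lists of float samples and the gain values are invented; a software realization of the mixing device would operate on the digital speech and archive signals in essentially this way.

```python
# Sketch of the mixing device: scale each signal by its gain, then
# sum the signals sample by sample. Data are illustrative.

def mix(signals, gains):
    """Return the sample-wise sum of the gain-adjusted signals."""
    scaled = [[s * g for s in sig] for sig, g in zip(signals, gains)]
    return [sum(samples) for samples in zip(*scaled)]

speech = [0.5, -0.5, 0.25]       # generated speech samples
street_noise = [0.1, 0.1, 0.1]   # background sound from the archive
mixed = mix([speech, street_noise], gains=[1.0, 0.2])
```

Effects (filters, reverberation) could be applied to the individual lists before mixing, or to the mixed result, matching the two possibilities described above.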
The signal thus generated may be handed over (transmitted) to a suitable device for digital audio 150, such as a sound board of a PC, and may thereby be acoustically checked or output. Additionally, a storage means (not shown) is provided in order to store the signal so that it may later be transmitted in a suitable manner to the target medium.
As a mixing device one can use a device classically implemented in hardware, or it can be realized in software and form a part of the whole program.
For the skilled person modifications of the above embodiment are easily apparent. For example, in a further embodiment of the present invention the output device 150 may be replaced by a further computer which is connected by means of a network connection to mixing device 180. Thereby the generated speech signal can be transmitted through a computer network, e.g. the Internet, to another computer.
In another embodiment the speech signal generated from the speech generation unit 140 can be transmitted directly to the output device 150, without passing through mixing device 180. Further comparable modifications are easily apparent for the skilled person.
When used in this specification and claims, the terms "comprises" and "comprising" and variations thereof mean that the specified features, steps or integers are included. The terms are not to be interpreted to exclude the presence of other features, steps or components.

Claims (5)

1. A digital speech processing apparatus comprising: a prosody generation means for generating a prosody for a text; and an editing means for displaying and modifying the generated prosody.
2. The apparatus of claim 1 further comprising: translation means for translating the text into a phonetic transcription, said translation means further comprising: means for displaying and modifying the generated phonetic transcription.
3. The apparatus of claim 1 or 2, wherein said prosody generating means and/or said translation means generates said prosody and/or said phonetic transcription based on, or in dependence on, a particular speaker's model.
4. The apparatus of any one of claims 1 to 3, further comprising: means for selecting and/or modifying one or more speaker's models.
5. The apparatus of claim 4, wherein said speaker's model modification means comprises: means for modifying phonetic transcription elements for the generation of accents.

6. An apparatus for generating digital speech comprising: an apparatus for digital speech processing according to any one of claims 1 to 4; and means for generating speech signals based on said phonetic transcription, which may have been edited using said editing means, and/or based on said prosody.

7. The apparatus of claim 6, wherein said speech signal generating means further comprises: a speaker's model processing means for generating said speech signals based on, or depending on, a particular speaker's model.

8. The apparatus of claim 7, wherein said speaker's model processing means comprises one or more of the following: a digital filter system; means for adopting a set of filter parameters representing a particular speaker's model.

9. The apparatus of claim 7 or 8, wherein said speaker's model processing means further comprises: means for selecting and/or modifying a speaker's model.

10. The apparatus of any one of claims 6 to 9, further comprising: effect generating means for generating sound effects.

11. The apparatus of claim 10, wherein said effect generating means comprises one or more of the following: digital filter means for modifying the generated speech signals, and/or a reverberation generator for generating a reverberation effect.

12. The apparatus of any one of claims 6 to 11, further comprising: archive means for storing sounds; and mixing means for mixing the generated speech signals with the sounds stored in said archive means.

13. The apparatus of any one of the preceding claims, further comprising: a graphical user interface for editing the generated phonetic transcription and/or prosody.

14. The apparatus of any one of the preceding claims, further comprising: means for modifying speech rhythm and/or pronunciation and/or intonation.
15. The apparatus of any one of the preceding claims, further comprising: display means for displaying the prosody by means of a symbolic notation.

16. The apparatus of any one of the preceding claims, further comprising: dictionary means in which the words of one or more languages are stored together with their pronunciation.

17. The apparatus of claim 16, wherein for at least one dictionary entry different phonetic entries are stored in said dictionary means.

18. The apparatus of any one of claims 6 to 17, further comprising: means for converting said digital speech signals into acoustic signals.

19. A digital speech processing method comprising: generating a prosody for a text; displaying said generated prosody; and editing said generated and displayed prosody.

20. The method of claim 19, further comprising: using an apparatus according to any one of claims 1 to 18 for generating digital speech.

21. A computer program product comprising: a medium, in particular a data carrier, for storing and/or transmitting digital data readable by a computer, wherein said stored and/or transmitted data comprise: a sequence of computer-executable instructions causing said computer to carry out a method according to claim 19 or 20.

22. A digital speech processing apparatus substantially as hereinbefore described with reference to the accompanying drawings.

Dated this 15th day of October 2003
PATENT ATTORNEY SERVICES
Attorneys for HANS KULL
AU60813/99A 1998-09-11 1999-09-10 Device and method for digital voice processing Ceased AU769036B2 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
DE19841683A DE19841683A1 (en) 1998-09-11 1998-09-11 Device and method for digital speech processing
DE19841683 1998-09-11
PCT/EP1999/006712 WO2000016310A1 (en) 1998-09-11 1999-09-10 Device and method for digital voice processing

Publications (2)

Publication Number Publication Date
AU6081399A AU6081399A (en) 2000-04-03
AU769036B2 true AU769036B2 (en) 2004-01-15

Family

ID=7880683

Family Applications (1)

Application Number Title Priority Date Filing Date
AU60813/99A Ceased AU769036B2 (en) 1998-09-11 1999-09-10 Device and method for digital voice processing

Country Status (7)

Country Link
EP (1) EP1110203B1 (en)
JP (1) JP2002525663A (en)
AT (1) ATE222393T1 (en)
AU (1) AU769036B2 (en)
CA (1) CA2343071A1 (en)
DE (2) DE19841683A1 (en)
WO (1) WO2000016310A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10117367B4 (en) * 2001-04-06 2005-08-18 Siemens Ag Method and system for automatically converting text messages into voice messages
JP2002318593A (en) * 2001-04-20 2002-10-31 Sony Corp Language processing system and language processing method as well as program and recording medium
AT6920U1 (en) 2002-02-14 2004-05-25 Sail Labs Technology Ag METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS
DE10207875A1 (en) * 2002-02-19 2003-08-28 Deutsche Telekom Ag Parameter-controlled, expressive speech synthesis from text, modifies voice tonal color and melody, in accordance with control commands
KR20070004788A (en) * 2004-03-05 2007-01-09 레삭 테크놀로지스 인코포레이티드. Prosodic speech text codes and their use in computerized speech systems
DE102004012208A1 (en) 2004-03-12 2005-09-29 Siemens Ag Individualization of speech output by adapting a synthesis voice to a target voice
DE102008044635A1 (en) 2008-07-22 2010-02-04 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for providing a television sequence
US10424288B2 (en) 2017-03-31 2019-09-24 Wipro Limited System and method for rendering textual messages using customized natural voice

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1996008813A1 (en) * 1994-09-12 1996-03-21 Arcadia, Inc. Sound characteristic convertor, sound/label associating apparatus and method to form them
EP0880127A2 (en) * 1997-05-21 1998-11-25 Nippon Telegraph and Telephone Corporation Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon
JPH11202884A (en) * 1997-05-21 1999-07-30 Nippon Telegr & Teleph Corp <Ntt> Method and device for editing and generating synthesized speech message and recording medium where same method is recorded

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5695295A (en) * 1979-12-28 1981-08-01 Sharp Kk Voice sysnthesis and control circuit
FR2494017B1 (en) * 1980-11-07 1985-10-25 Thomson Csf METHOD FOR DETECTING THE MELODY FREQUENCY IN A SPEECH SIGNAL AND DEVICE FOR CARRYING OUT SAID METHOD
JPS58102298A (en) * 1981-12-14 1983-06-17 キヤノン株式会社 Electronic appliance
US4623761A (en) * 1984-04-18 1986-11-18 Golden Enterprises, Incorporated Telephone operator voice storage and retrieval system
US5559927A (en) * 1992-08-19 1996-09-24 Clynes; Manfred Computer system producing emotionally-expressive speech messages
US5956685A (en) * 1994-09-12 1999-09-21 Arcadia, Inc. Sound characteristic converter, sound-label association apparatus and method therefor
DE19503419A1 (en) * 1995-02-03 1996-08-08 Bosch Gmbh Robert Method and device for outputting digitally coded traffic reports using synthetically generated speech
JPH08263094A (en) * 1995-03-10 1996-10-11 Winbond Electron Corp Synthesizer for generation of speech mixed with melody
EP0762384A2 (en) * 1995-09-01 1997-03-12 AT&T IPM Corp. Method and apparatus for modifying voice characteristics of synthesized speech
DE19610019C2 (en) * 1996-03-14 1999-10-28 Data Software Gmbh G Digital speech synthesis process


Also Published As

Publication number Publication date
JP2002525663A (en) 2002-08-13
WO2000016310A1 (en) 2000-03-23
DE59902365D1 (en) 2002-09-19
DE19841683A1 (en) 2000-05-11
AU6081399A (en) 2000-04-03
CA2343071A1 (en) 2000-03-23
EP1110203B1 (en) 2002-08-14
ATE222393T1 (en) 2002-08-15
EP1110203A1 (en) 2001-06-27

Similar Documents

Publication Publication Date Title
US7979274B2 (en) Method and system for preventing speech comprehension by interactive voice response systems
AU769036B2 (en) Device and method for digital voice processing
JPH08335096A (en) Text voice synthesizer
JP2008058379A (en) Speech synthesis system and filter device
JP2006349787A (en) Method and device for synthesizing voices
JPH07200554A (en) Sentence read-aloud device
JP3113101B2 (en) Speech synthesizer
JP2577372B2 (en) Speech synthesis apparatus and method
JP3575919B2 (en) Text-to-speech converter
JPH09179576A (en) Voice synthesizing method
JP2658109B2 (en) Speech synthesizer
JP2703253B2 (en) Speech synthesizer
Morton Adding emotion to synthetic speech dialogue systems
JP2573586B2 (en) Rule-based speech synthesizer
JP3292218B2 (en) Voice message composer
JP3862300B2 (en) Information processing method and apparatus for use in speech synthesis
KR100269215B1 (en) Method for producing fundamental frequency contour of prosodic phrase for tts
JP2809769B2 (en) Speech synthesizer
JP2573585B2 (en) Speech spectrum pattern generator
JP2573587B2 (en) Pitch pattern generator
JPH06138894A (en) Device and method for voice synthesis
JP2001166787A (en) Voice synthesizer and natural language processing method
JPH0553595A (en) Speech synthesizing device
JPH09244680A (en) Device and method for rhythm control
JPH0756589A (en) Voice synthesis method

Legal Events

Date Code Title Description
FGA Letters patent sealed or granted (standard patent)