CA2343071A1 - Device and method for digital voice processing - Google Patents

Device and method for digital voice processing

Info

Publication number
CA2343071A1
Authority
CA
Canada
Prior art keywords
prosody
generating
speech
speaker
generated
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
CA002343071A
Other languages
French (fr)
Inventor
Hans Kull
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Publication of CA2343071A1 publication Critical patent/CA2343071A1/en
Abandoned legal-status Critical Current


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention relates to a device for digital voice processing which comprises a sentence melody generating device for generating a sentence melody for a text, and an editing device for displaying and modifying the generated sentence melody.

Description

APPARATUS AND METHOD FOR DIGITAL SPEECH PROCESSING
The present invention relates to an apparatus and a method for digital speech processing and speech generation. Present speech output systems are typically applied in areas in which synthetic-sounding speech is acceptable or even desired. The present invention, by contrast, relates to a system which makes it possible to synthetically generate speech giving a natural impression.
In present systems for digital speech generation, the information regarding prosody and intonation is generated automatically, as described e.g. in EP 0689706. In some systems it is possible to insert additional commands into the text before it is handed over to the speech generator, e.g. in EP 0598599. Those commands are inputted e.g. as (non-pronounceable) special characters, as described e.g. in EP 0598598.
The commands inserted into the text may also contain indications regarding the characteristics of the speaker (e.g. parameters of the speaker's model). EP 0762384 describes a system in which those speaker characteristics may be inserted on the screen by means of a graphical user interface.
The speech synthesis is carried out using auxiliary information which is stored in a database (e.g. as waveform sequences in the case of EP 0831460).
Nevertheless, for the pronunciation of words which are not stored in the database, rules regarding the pronunciation need to be provided in the program. The combination of the individual sequences leads to distortions and acoustic artefacts if no measures are provided for their suppression. This problem (commonly called "segmental quality"), however, is nowadays mostly solved (cp. for example Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997). Nevertheless, even in modern speech synthesis systems several further problems arise.
One problem with digital speech output is, for example, multiple language capability.
Another problem consists in the improvement of the prosodic quality, i.e. the quality of the intonation, cp. e.g. "Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997". This difficulty is based on the fact that the intonation can only insufficiently be constructed from the orthographic input information alone. It also depends on higher levels such as semantics and pragmatics, as well as on the situation and the type of the speaker. In general it can be said that the quality of today's speech output systems fulfills the requirements where the listener expects or accepts synthetic speech. Often, however, the quality of synthetic speech is considered insufficient or unsatisfactory.
It is therefore an object of the present invention to provide an apparatus and a method for digital speech processing which allow the generation of synthetic speech of better quality. It is a further object of the invention to synthetically generate speech giving a natural impression. The applications range from the generation of simple texts for multimedia applications up to the sound generation for the dubbing of movies, radio dramas, and audio books.
Even if the synthetically generated speech gives a natural impression, there is sometimes a need for the possibility to intervene in order to generate dramaturgical effects. Another object of the present invention therefore resides in providing such possibilities to intervene.
The present invention is defined in the independent claims. The dependent claims define particular embodiments of the invention.
The problem of the invention is substantially solved by providing the capability to modify the prosody generated for a text by means of an editor.
Particular embodiments of the invention provide for the addition of further characteristics of the synthetically generated speech in addition to the editing of the prosody.
Thereby the starting point is the written text. However, in order to achieve a sufficient (in particular prosodic) quality, as well as to achieve dramaturgical effects, the user in a particular embodiment is provided with far-reaching capabilities to intervene. The user is in the position of a director who defines the speakers in the system and assigns to them speech rhythm and prosody, pronunciation and intonation.
Preferably the present invention also comprises the generation of a phonetic transcription for a written text, as well as the provision of the capability to modify the generated phonetic transcription, or to modify the phonetic transcription based on modifiable rules, respectively. Thereby, for example, a particular accent of a speaker may be generated.
According to a further preferred embodiment the present invention comprises a dictionary means in which the words of one or more languages are stored together with their pronunciation. In the latter case this allows for multiple language capability (multilinguality), i.e. the processing of text in different languages.
Preferably the editing of the generated phonetic transcription or the prosody, respectively, is carried out by means of an editor which is easy to use, such as a graphical user interface.
According to a further preferred embodiment, speaker's models which are either predefined or which are defined or modified by a user are taken into account in the speech processing. Thereby the characteristics of different speakers can be realized, be they male or female voices, or different accents of a speaker, such as a Bavarian, a Swabian or a Northern German accent.
According to a particularly preferred embodiment, the apparatus comprises: a dictionary in which for all words the pronunciation is also stored in a phonetic transcription (where reference is made hereinafter to a phonetic transcription, this may mean an arbitrary phonetic transcription, such as for example the SAMPA notation, compare for example "Multilingual speech input/output assessment, methodology and standardization, standard computer-compatible transcription, pp. 29-31, in Esprit Project 2589 (SAM) Final Report SAM-UCC-037", or the international phonetic script known from teaching materials, compare for example "The Principles of the International Phonetic Association: A description of the International Phonetic Alphabet and the Manner of Using it. International Phonetic Association, Dept. of Phonetics, Univ. College London"); a translator which converts inputted text into a phonetic transcription and generates a prosody; an editor with which text can be inputted and assigned to a speaker, and in which the generated phonetic transcription as well as the prosody can be displayed and modified; an input module in which the speaker's models can be defined; a system for digital speech generation which generates, from the phonetic transcription together with the prosody, signals representing spoken speech or data representing such signals, and which is capable of processing different speaker's models; a system of digital filters and other devices (for reverberation, echo, etc.) with which particular effects can be generated; a sound archive; as well as a mixing device in which the generated speech signals can be mixed with sounds from the archive and edited with effects. The invention can be realized either in a hybrid manner by means of software and hardware, or fully in software. The generated digital speech signals can be outputted by means of a particular device for digital audio or by means of a PC sound board.
The present invention will be described hereinafter in detail by means of several embodiments and by referring to the accompanying drawings.
Fig. 1 shows a block diagram of an apparatus for generating digital speech according to an embodiment of the present invention.
In the embodiment of the present invention described hereinafter, the invention comprises several individual components which may be realized by means of one or more digital processing apparatuses, and whose combination and operation are described in more detail hereinafter.
The dictionary 100 comprises simple tables (one for each language) in which the words of a language are stored together with their pronunciation. The tables may be extended arbitrarily for the incorporation of additional words and their pronunciation. For particular purposes, e.g. for the generation of accents, additional tables with different phonetic entries may also be generated for one language. One table of the dictionary is assigned to each of the different speakers.
The translator 110 on the one hand generates the phonetic transcriptions by replacing the words of the inputted text with their phonetic correspondences in the dictionary. If modifiers, which will be described later in more detail, are stored in the speaker's model, then they are used for modifying the pronunciation. Additionally, it generates the prosody using heuristics known in speech processing. Such heuristics are e.g. the model of Fujisaki (1992) or other acoustic methods, as well as the perceptual models, e.g. the one of d'Alessandro and Mertens (1995). Both of these, but also older linguistic models, are described e.g. in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997". Therein one can also find methods for segmentation (setting of breaks), which are likewise generated by the translator. The selection of the method is thereby of rather minor importance, since the translator only generates an initial version of the prosody which can still be modified by the user.
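As a rough illustration of the translator's lookup step, the following sketch replaces each word of a text by its entry in a per-language dictionary table. The table contents and the SAMPA-like notation are invented for illustration; they are not taken from the patent.

```python
# Hypothetical per-language dictionary tables mapping orthographic
# words to phonetic transcriptions (SAMPA-like, purely illustrative).
dictionary = {
    "de": {"hallo": "halo:", "welt": "vElt"},
}

def to_phonetic(text, language="de"):
    """Replace each word by its dictionary transcription.
    Words missing from the table would need pronunciation rules;
    here they are simply kept unchanged."""
    table = dictionary[language]
    return [table.get(word.lower(), word) for word in text.split()]

print(to_phonetic("Hallo Welt"))  # -> ['halo:', 'vElt']
```

A real system would fall back to rule-based grapheme-to-phoneme conversion instead of passing unknown words through unchanged.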
Editor 120 provides the user with an instrument with which he can input and modify pronunciation, intonation, accentuation, speed, volume, breaks (pauses), etc.
At first the user assigns a speaker's model to the text segments to be processed; its composition and operation will be explained in more detail later. The translator reacts to this assignment by adapting the phonetics, and possibly the prosody, to the speaker's model and generating them anew. The phonetics is displayed to the user in a phonetic transcription; the prosody is displayed e.g. in a symbolic notation taken from music (musical notation). The user then has the possibility to modify them, to listen to individual text segments, to improve his inputs once again, and so on.
The texts themselves may of course be kept in the editor if they cannot be directly imported from another text processing system.
Speaker's models 130 are for example parameterizations for speech generation. In those models the characteristics of the human speech organs are modelled. The function of the vocal cords is represented by a sequence of pulses of which only the frequency (pitch) can be varied. The remaining characteristics (oral cavity, nasal cavity) of the speech organs are realized by means of digital filters. Their parameters are stored in the speaker's model. Standard models (child, young lady, old man, etc.) are stored. The user may generate additional models from them by suitably choosing or amending the parameters and storing the model. The parameters stored therein are used during the speech generation, which will be explained later, together with the prosody information for the intonation.
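The source-filter idea described above — a pulse train at the pitch frequency, shaped by digital filters whose parameters form the speaker's model — can be pictured with the following minimal sketch. The single one-pole filter, the sample rate and the parameter values are deliberately crude stand-ins, not the patent's actual models.

```python
def pulse_train(pitch_hz, duration_s, sample_rate=8000):
    """Crude stand-in for the vocal-cord excitation: a unit impulse
    every 1/pitch seconds, zeros elsewhere."""
    n = int(duration_s * sample_rate)
    period = int(sample_rate / pitch_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def one_pole_filter(signal, a):
    """Stand-in for the vocal-tract filter bank: a single recursive
    filter y[n] = x[n] + a * y[n-1]."""
    out, y = [], 0.0
    for x in signal:
        y = x + a * y
        out.append(y)
    return out

# A "speaker's model" here is just a pitch plus one filter parameter.
child = {"pitch_hz": 300, "a": 0.5}
old_man = {"pitch_hz": 90, "a": 0.8}

def synthesize(model, duration_s=0.01):
    excitation = pulse_train(model["pitch_hz"], duration_s)
    return one_pole_filter(excitation, model["a"])
```

A realistic speaker's model would use a bank of filters for the oral and nasal cavities rather than one coefficient, but the division into excitation and filter parameters is the same.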
Thereby also particularities of the speaker, such as e.g. accents or speech impediments, may be inputted. Those are used by the translator for modifying the pronunciation. A simple example of such a modifier is e.g. the rule to replace (in the phonetic transcription) "ʃt" by "st", to generate the accent of a person from Hamburg.
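A modifier of this kind can be pictured as a list of string-replacement rules applied to the phonetic transcription. The SAMPA-style symbols (where "S" stands for "sch") and the rule below are illustrative assumptions, not the patent's own notation.

```python
def apply_modifiers(transcription, rules):
    """Apply each (old, new) replacement rule to the transcription."""
    for old, new in rules:
        transcription = transcription.replace(old, new)
    return transcription

# Hypothetical Hamburg-accent rule: pronounce "scht" as plain "st".
hamburg = [("St", "st")]
print(apply_modifiers("StaIn", hamburg))  # -> "staIn"
```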
A speaker's model may e.g. concern the rules according to which the translator generates the phonetic transcription. Different speaker's models may thereby proceed according to different rules. A speaker's model may, however, correspond to a certain set of filter parameters in order to process the speech signals according to the thereby prescribed speech characteristics. Of course one can also imagine different combinations of these two aspects of a speaker's model.
The task of the speech generation unit 140 consists in generating a numerical data stream representing digital speech signals, based on the given text together with the phonetic and prosodic additional information generated by the translator and edited by the user. This data stream can then be converted into analog sound signals, i.e. the text to be outputted, by an output device 150, which may be a digital audio device or a sound board in a PC.
For generating the speech, a conventional text-to-speech conversion method can be used, in which, however, the pronunciation and the prosody have already been generated. In general one distinguishes between rule-based synthesizers and concatenation-based synthesizers.
Rule-based synthesizers operate using rules for the generation of the sounds and the transitions between them. Those synthesizers operate with a large number of parameters, the determination of which is very demanding. However, very good results may be achieved with this type of synthesizer. An overview of such systems and further references may be found in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997".
On the other hand, concatenation-based synthesizers are easier to handle. They work with a database which stores all possible pairs of sounds. These can be easily concatenated; however, systems providing a good quality require a high computational power. Those types of systems are described in "Thierry Dutoit: An Introduction to Text-to-Speech Synthesis, Kluwer 1997" and in "Volker Kraft: Verkettung natürlichsprachlicher Bausteine zur Sprachsynthese: Anforderungen, Techniken und Evaluierung. Fortschr.-Ber. VDI Reihe 10, Nr. 468, VDI-Verlag 1997".
In principle both types of systems can be used. In rule-based synthesizers the prosodic information directly influences the rules, while in concatenation-based systems it is superposed on the stored units in a suitable manner.
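A concatenation-based synthesizer in its simplest form can be sketched as below: a database keyed by adjacent sound pairs (diphones) whose stored waveform fragments are joined in order. The two-sample "waveforms" are placeholders for real stored segments, and the unit inventory is invented for illustration.

```python
# Toy diphone database: each adjacent pair of sounds maps to a stored
# waveform fragment (here just two placeholder samples each).
diphone_db = {
    ("h", "a"): [0.1, 0.2],
    ("a", "l"): [0.3, 0.2],
    ("l", "o"): [0.1, 0.0],
}

def concatenate(phones):
    """Join the stored fragments for each adjacent pair of sounds."""
    samples = []
    for pair in zip(phones, phones[1:]):
        samples.extend(diphone_db[pair])
    return samples

print(concatenate(["h", "a", "l", "o"]))  # -> [0.1, 0.2, 0.3, 0.2, 0.1, 0.0]
```

The distortions at the joins mentioned earlier ("segmental quality") arise exactly at the boundaries between these fragments; real systems smooth them during concatenation.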
For the generation of particular effects 160, known techniques from digital signal processing are used, such as e.g. digital filters (e.g. bandpass filters for a telephone effect), reverberation generators, etc. They may also be applied to sounds stored in an archive 170.
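The telephone effect mentioned above amounts to band-limiting the signal. A minimal sketch with first-order high-pass and low-pass stages could look like this; the filter coefficients are chosen arbitrarily and are not tuned to real telephone bandwidth.

```python
def lowpass(signal, a=0.5):
    """First-order low-pass: y[n] = (1-a)*x[n] + a*y[n-1]."""
    out, y = [], 0.0
    for x in signal:
        y = (1 - a) * x + a * y
        out.append(y)
    return out

def highpass(signal, a=0.5):
    """First-order high-pass: y[n] = a*(y[n-1] + x[n] - x[n-1])."""
    out, prev_x, y = [], 0.0, 0.0
    for x in signal:
        y = a * (y + x - prev_x)
        prev_x = x
        out.append(y)
    return out

def telephone_effect(signal):
    # Band-limit the signal, as a narrow telephone channel would.
    return lowpass(highpass(signal))
```

The high-pass stage removes the constant (DC) component, so a steady input decays toward zero at the output, while transients pass through attenuated.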
In archive 170, sounds like e.g. street noise, railway noise, noise of children, sound of the sea, background music etc. are stored. The archive may be extended arbitrarily. It may just be a collection of files containing digitized sounds; it may, however, also be a database in which sounds are stored as BLOBs (binary large objects).
In the mixing device 180 the speech signals thus generated are combined with the background sounds. The volume of all signals may thereby be adjusted before combination. Additionally, it is possible to apply effects to each signal individually or to all of them together.
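The mixing step — per-signal volume adjustment followed by sample-wise summation — can be sketched as follows. The signals and gain values are invented for illustration.

```python
def mix(signals, gains):
    """Scale each signal by its gain, then sum them sample by sample.
    Shorter signals are implicitly padded with silence."""
    length = max(len(s) for s in signals)
    out = [0.0] * length
    for sig, gain in zip(signals, gains):
        for i, x in enumerate(sig):
            out[i] += gain * x
    return out

speech = [0.5, 0.5, 0.5]
street_noise = [0.2, -0.2, 0.2, 0.1]   # background sound from the archive
mixed = mix([speech, street_noise], [1.0, 0.25])
```

Effects such as the telephone filter or reverberation would be applied to the individual signals before, or to the summed output after, this combination step.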
The signal thus generated may be handed over (transmitted) to a suitable device for digital audio 150, such as a sound board of a PC, and may thereby be acoustically checked or outputted. Additionally, a storage means (not shown) is provided in order to store the signal so that it may later be transmitted in a suitable manner to the target medium.

As a mixing device one can use a device classically implemented in hardware, or it can be realized in software and form a part of the whole program.
For the skilled person, modifications of the above embodiment are easily apparent. For example, in a further embodiment of the present invention the output device 150 may be replaced by a further computer which is connected to mixing device 180 by means of a network connection. Thereby the generated speech signal can be transmitted through a computer network, e.g. the Internet, to another computer.
In another embodiment, the speech signal generated by the speech generation unit 140 can be transmitted directly to the output device 150, without passing through mixing device 180. Further comparable modifications are easily apparent to the skilled person.

Claims (21)

1. A digital speech processing apparatus comprising:
a prosody generation means for generating a prosody for a text; and an editing means for displaying and modifying the generated prosody.
2. The apparatus of claim 1 further comprising:
translation means for translating the text into a phonetic transcription, said translation means further comprising:
means for displaying and modifying the generated phonetic transcription.
3. The apparatus of claim 1 or 2, wherein said prosody generating means and/or said translation means generates said prosody and/or said phonetic transcription based on, or in dependence on, a particular speaker's model.
4. The apparatus of one of claims 1 to 3, further comprising:
means for displaying and/or modifying one or more speaker's models.
5. The apparatus of claim 4, wherein said speaker's model modification means comprises:
means for modifying phonetic transcription elements for the generation of accents.
6. An apparatus for generating digital speech comprising:
an apparatus for digital speech processing according to one of claims 1 to 4; and means for generating speech signals based on said phonetic transcription which may have been edited using said editing means and/or based on said prosody.
7. The apparatus of claim 6, wherein said speech signal generating means further comprises:
a speaker's model processing means for generating said speech signals based on, or depending on, a particular speaker's model.
8. The apparatus of claim 7, wherein said speaker's model processing means comprises one or more of the following:
a digital filter system;
means for adopting a set of filter parameters representing a particular speaker's model.
9. The apparatus of claim 7 or 8, wherein said speaker's model processing means further comprises:
means for selecting and/or modifying a speaker's model.
10. The apparatus of one of claims 6 to 9, further comprising:
effect generating means for generating sound effects.
11. The apparatus of claim 10, wherein said effect generating means comprises one or more of the following:
digital filter means for modifying the generated speech signals, and/or a reverberation generator for generating a reverberation effect.
12. The apparatus of one of claims 6 to 11, further comprising:
archive means for storing sounds; and mixing means for mixing the generated speech signals with the sounds stored in said archive means.
13. The apparatus of one of the preceding claims, further comprising:
a graphical user interface for editing the generated phonetic transcription and/or prosody.
14. The apparatus of one of the preceding claims, further comprising:
means for modifying speech rhythm and/or pronunciation and/or intonation.
15. The apparatus of one of the preceding claims, further comprising:
display means for displaying the prosody by means of a symbolic notation.
16. The apparatus of one of the preceding claims, further comprising:
dictionary means in which the words of one or more languages are stored together with their pronunciation.
17. The apparatus of claim 16, wherein for at least one dictionary entry different phonetic entries are stored in said dictionary means.
18. The apparatus of one of claims 6 to 17, further comprising:
means for converting said digital speech signals into acoustic signals.
19. A digital speech processing method comprising:
generating a prosody for a text;
displaying said generated prosody; and editing said generated and displayed prosody.
20. The method of claim 19, further comprising:
using an apparatus according to one of claims 1 to 18 for generating digital speech.
21. A computer program product comprising:
a medium, in particular a data carrier for storing and/or transmitting digital data readable by a computer, wherein said stored and/or transmitted data comprise:

a sequence of computer-executable instructions causing said computer to carry out a method according to one of claims 19 or 20.
CA002343071A 1998-09-11 1999-09-10 Device and method for digital voice processing Abandoned CA2343071A1 (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
DE19841683A DE19841683A1 (en) 1998-09-11 1998-09-11 Device and method for digital speech processing
DE19841683.0 1998-09-11
PCT/EP1999/006712 WO2000016310A1 (en) 1998-09-11 1999-09-10 Device and method for digital voice processing

Publications (1)

Publication Number Publication Date
CA2343071A1 true CA2343071A1 (en) 2000-03-23

Family

ID=7880683

Family Applications (1)

Application Number Title Priority Date Filing Date
CA002343071A Abandoned CA2343071A1 (en) 1998-09-11 1999-09-10 Device and method for digital voice processing

Country Status (7)

Country Link
EP (1) EP1110203B1 (en)
JP (1) JP2002525663A (en)
AT (1) ATE222393T1 (en)
AU (1) AU769036B2 (en)
CA (1) CA2343071A1 (en)
DE (2) DE19841683A1 (en)
WO (1) WO2000016310A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566880B2 (en) 2008-07-22 2013-10-22 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Device and method for providing a television sequence using database and user inputs

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10117367B4 (en) * 2001-04-06 2005-08-18 Siemens Ag Method and system for automatically converting text messages into voice messages
JP2002318593A (en) * 2001-04-20 2002-10-31 Sony Corp Language processing system and language processing method as well as program and recording medium
AT6920U1 (en) 2002-02-14 2004-05-25 Sail Labs Technology Ag METHOD FOR GENERATING NATURAL LANGUAGE IN COMPUTER DIALOG SYSTEMS
DE10207875A1 (en) * 2002-02-19 2003-08-28 Deutsche Telekom Ag Parameter-controlled, expressive speech synthesis from text, modifies voice tonal color and melody, in accordance with control commands
US7877259B2 (en) 2004-03-05 2011-01-25 Lessac Technologies, Inc. Prosodic speech text codes and their use in computerized speech systems
DE102004012208A1 (en) 2004-03-12 2005-09-29 Siemens Ag Individualization of speech output by adapting a synthesis voice to a target voice
US10424288B2 (en) 2017-03-31 2019-09-24 Wipro Limited System and method for rendering textual messages using customized natural voice

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS5695295A (en) * 1979-12-28 1981-08-01 Sharp Kk Voice sysnthesis and control circuit
FR2494017B1 (en) * 1980-11-07 1985-10-25 Thomson Csf METHOD FOR DETECTING THE MELODY FREQUENCY IN A SPEECH SIGNAL AND DEVICE FOR CARRYING OUT SAID METHOD
JPS58102298A (en) * 1981-12-14 1983-06-17 キヤノン株式会社 Electronic appliance
US4623761A (en) * 1984-04-18 1986-11-18 Golden Enterprises, Incorporated Telephone operator voice storage and retrieval system
US5559927A (en) * 1992-08-19 1996-09-24 Clynes; Manfred Computer system producing emotionally-expressive speech messages
US5956685A (en) * 1994-09-12 1999-09-21 Arcadia, Inc. Sound characteristic converter, sound-label association apparatus and method therefor
WO1996008813A1 (en) * 1994-09-12 1996-03-21 Arcadia, Inc. Sound characteristic convertor, sound/label associating apparatus and method to form them
DE19503419A1 (en) * 1995-02-03 1996-08-08 Bosch Gmbh Robert Method and device for outputting digitally coded traffic reports using synthetically generated speech
JPH08263094A (en) * 1995-03-10 1996-10-11 Winbond Electron Corp Synthesizer for generation of speech mixed with melody
EP0762384A2 (en) * 1995-09-01 1997-03-12 AT&T IPM Corp. Method and apparatus for modifying voice characteristics of synthesized speech
DE19610019C2 (en) * 1996-03-14 1999-10-28 Data Software Gmbh G Digital speech synthesis process
JP3616250B2 (en) * 1997-05-21 2005-02-02 日本電信電話株式会社 Synthetic voice message creation method, apparatus and recording medium recording the method
US6226614B1 (en) * 1997-05-21 2001-05-01 Nippon Telegraph And Telephone Corporation Method and apparatus for editing/creating synthetic speech message and recording medium with the method recorded thereon

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8566880B2 (en) 2008-07-22 2013-10-22 Fraunhofer-Gesellschaft Zur Foerderung Der Angewandten Forschung E.V. Device and method for providing a television sequence using database and user inputs

Also Published As

Publication number Publication date
JP2002525663A (en) 2002-08-13
DE59902365D1 (en) 2002-09-19
AU6081399A (en) 2000-04-03
AU769036B2 (en) 2004-01-15
EP1110203A1 (en) 2001-06-27
EP1110203B1 (en) 2002-08-14
WO2000016310A1 (en) 2000-03-23
DE19841683A1 (en) 2000-05-11
ATE222393T1 (en) 2002-08-15

Similar Documents

Publication Publication Date Title
US7979274B2 (en) Method and system for preventing speech comprehension by interactive voice response systems
WO2006123539A1 (en) Speech synthesizer
JPH0833744B2 (en) Speech synthesizer
AU769036B2 (en) Device and method for digital voice processing
JPH08335096A (en) Text voice synthesizer
JP2008058379A (en) Speech synthesis system and filter device
JP2006349787A (en) Method and device for synthesizing voices
JPH07200554A (en) Sentence read-aloud device
JP2577372B2 (en) Speech synthesis apparatus and method
JP3113101B2 (en) Speech synthesizer
JPH09179576A (en) Voice synthesizing method
JP2703253B2 (en) Speech synthesizer
JP2573586B2 (en) Rule-based speech synthesizer
JP2658109B2 (en) Speech synthesizer
JP3862300B2 (en) Information processing method and apparatus for use in speech synthesis
JP3292218B2 (en) Voice message composer
KR100269215B1 (en) Method for producing fundamental frequency contour of prosodic phrase for tts
JP2573587B2 (en) Pitch pattern generator
JP2573585B2 (en) Speech spectrum pattern generator
JP2586040B2 (en) Voice editing and synthesis device
JPH0644247A (en) Speech synthesizing device
JPH1011083A (en) Text voice converting device
JPH06138894A (en) Device and method for voice synthesis
JP2001166787A (en) Voice synthesizer and natural language processing method
JPH0553595A (en) Speech synthesizing device

Legal Events

Date Code Title Description
FZDE Discontinued