WO2000011647A1 - Verfahren und Vorrichtungen zur koartikulationsgerechten Konkatenation von Audiosegmenten (Method and devices for co-articulation-appropriate concatenation of audio segments) - Google Patents
- Publication number
- WO2000011647A1 (PCT/EP1999/006081)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
- G10L13/07—Concatenation rules
Definitions
- the invention relates to a method and a device for concatenating audio segments for generating synthesized acoustic data, in particular synthesized speech.
- the invention further relates to synthesized speech signals which were generated by the co-articulation-appropriate concatenation of speech segments according to the invention, and to a data carrier which contains a computer program for the generation of synthesized acoustic data, in particular synthesized speech, according to the invention.
- the invention also relates to a data memory which contains audio segments suitable for co-articulation-appropriate concatenation according to the invention, and to a sound carrier which contains acoustic data synthesized according to the invention.
- both the prior art presented below and the present invention relate to the entire area of synthesis of acoustic data by concatenation of individual audio segments obtained in any way.
- the following statements relate specifically to synthesized speech data through concatenation of individual speech segments.
- data-based speech synthesis is increasingly being carried out, in which corresponding segments are selected from a database comprising individual speech segments and linked (concatenated) with one another.
- the speech quality depends primarily on the number and type of available speech segments, because only speech can be synthesized that is represented by speech segments in the database.
- various methods are known that perform a concatenation of the speech segments according to complex rules.
- an inventory, i.e. a database comprising the speech audio segments, can be used which is complete and manageable.
- An inventory is complete if it can be used to generate any phonetic sequence of the language to be synthesized, and is manageable if the number and type of data in the inventory can be processed in a desired manner using the technically available means.
- such a method must ensure that the concatenation of the individual inventory elements generates a synthesized language that differs as little as possible from a naturally spoken language.
- a synthesized language must be fluent and exhibit the same co-articulatory effects, i.e. the mutual influence of adjacent speech sounds, as natural speech.
- the inventory elements should be such that they take into account the co-articulation of individual successive speech sounds. Furthermore, a procedure for concatenating the inventory elements should chain the elements, taking into account the co-articulation of individual consecutive speech sounds as well as the superordinate co-articulation of several consecutive speech sounds, also across word and sentence boundaries.
- a sound is a class of arbitrary sound events (noises, sounds, tones, etc.).
- the sound events are divided into sound classes according to a classification scheme.
- a sound event belongs to a sound if, with regard to the parameters used for classification (e.g. spectrum, pitch, volume, chest or head voice, co-articulation, resonance spaces, emotion, etc.), the values of the sound event lie within the value ranges defined for that sound.
- the classification scheme for sounds depends on the type of application.
- the definition of the term "sound" used here is not limited to these parameters; any other parameters can be used.
- if the pitch or the emotional expression is also included as a parameter in the classification, two 'a' sounds with different pitch or with different emotional expression belong to different sounds in the sense of this definition.
- Sounds can also be the tones of a musical instrument, such as a violin, at different pitches or in different playing styles (détaché, spiccato, col legno, etc.). Sounds can also be the barking of a dog or the squeak of a car door.
- Sounds can be played through audio segments that contain corresponding acoustic data.
- In the sense of the previous definition, the term "phon" can be used in place of the term "sound", and the term "phoneme" in place of the term "phonetic character". (This also applies the other way around, since phones are sounds classified according to the IPA classification.)
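The classification scheme described above can be sketched in code. The following minimal example is purely illustrative (the class names, parameters, and value ranges are all invented, not taken from the patent): a sound event belongs to a sound class when every one of its parameter values lies within the value ranges defined for that class.

```python
# Hypothetical sketch of the sound classification scheme: a sound class is
# a set of value ranges over measured parameters; a sound event belongs to
# the class if all its values fall inside the corresponding ranges.

SOUND_CLASSES = {
    # sound class name -> {parameter: (min, max)}
    "a_low":  {"pitch_hz": (80.0, 180.0),  "volume_db": (40.0, 80.0)},
    "a_high": {"pitch_hz": (180.0, 400.0), "volume_db": (40.0, 80.0)},
}

def classify(event):
    """Return the name of the first sound class whose value ranges contain
    all parameter values of the sound event, or None if no class matches."""
    for name, ranges in SOUND_CLASSES.items():
        if all(lo <= event.get(param, float("nan")) <= hi
               for param, (lo, hi) in ranges.items()):
            return name
    return None

# Two 'a' events that differ only in pitch fall into different sounds:
print(classify({"pitch_hz": 120.0, "volume_db": 60.0}))  # a_low
print(classify({"pitch_hz": 250.0, "volume_db": 60.0}))  # a_high
```

This mirrors the statement that two 'a' sounds with different pitch are different sounds in the sense of the definition once pitch is included as a classification parameter.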
- a static sound has areas that are similar to preceding or subsequent areas of the same sound.
- the similarity does not necessarily have to be an exact correspondence, such as that between the periods of a sine tone, but is analogous to the similarity that exists between the areas of the static phones defined below.
- a dynamic sound has no areas that resemble previous or subsequent areas of the dynamic sound, such as the sound event of an explosion or a dynamic phone.
- a phon is a sound generated by the speech organs (a speech sound).
- the phones are divided into static and dynamic phones.
- Static phones include vowels, diphthongs, nasals, laterals, vibrants and fricatives.
- the dynamic phones include plosives, affricates, glottal stops and flapped sounds (taps/flaps).
- a phoneme is the formal description of a phon, the formal description generally being given by phonetic characters.
- co-articulation describes the phenomenon that a sound, and thus also a phon, is influenced by upstream and downstream sounds or phones; the co-articulation occurs between immediately adjacent sounds/phones, but can also extend as a superordinate co-articulation over a sequence of several sounds/phones (for example, when rounding the lips).
- the initial co-articulation area covers the area from the beginning of the sound / phone to the end of the co-articulation due to an upstream sound / phone.
- the solo articulation range is the range of the sound / phon that is not influenced by a preceding or following sound or a preceding or following phon.
- the end co-articulation area covers the area from the start of co-articulation due to a downstream sound / phone to the end of the sound / phone.
- the co-articulation area comprises an end co-articulation area and the adjacent initial co-articulation area of the adjacent sound / phone.
- a polyphone is a series of phones.
- the elements of an inventory are coded audio segments that reproduce sounds, parts of sounds, sound sequences or parts of sound sequences, or phones, parts of phones, polyphones or parts of polyphones.
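The area structure defined above (initial co-articulation area, solo articulation area, end co-articulation area) can be sketched as a simple data structure. This is a minimal illustration, not the patent's actual encoding; the field names and sample representation are assumptions.

```python
# Sketch of an inventory element: an audio segment carries its samples plus
# the boundaries of the initial co-articulation, solo articulation and end
# co-articulation areas, as defined in the text above.

from dataclasses import dataclass

@dataclass
class AudioSegment:
    samples: list        # coded audio data (here: plain sample values)
    initial_end: int     # end index of the initial co-articulation area
    solo_end: int        # end index of the solo articulation area

    @property
    def initial_area(self):
        return self.samples[:self.initial_end]

    @property
    def solo_area(self):
        return self.samples[self.initial_end:self.solo_end]

    @property
    def end_area(self):  # the end co-articulation area
        return self.samples[self.solo_end:]

seg = AudioSegment(samples=list(range(10)), initial_end=3, solo_end=7)
print(len(seg.initial_area), len(seg.solo_area), len(seg.end_area))  # 3 4 3
```

With this representation, the co-articulation area of two adjacent segments is simply the end area of the first plus the initial area of the second, matching the definition given above.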
- FIG. 2a shows a conventional audio segment
- FIGS. 2b-2l show audio segments according to the invention.
- audio segments can also be formed from smaller or larger audio segments that are contained in the inventory or a database.
- audio segments can also be present in a transformed form (e.g. a Fourier-transformed form) in the inventory or in a database.
- Audio segments for the present method can also originate from an upstream synthesis step (which is not part of the method). Audio segments contain at least part of an initial co-articulation area, a solo articulation area and / or an end co-articulation area. Instead of audio segments, areas of audio segments can also be used.
- Concatenation means the joining of two audio segments.
- the concatenation moment is the point in time at which two audio segments are joined together.
- the concatenation can be done in different ways, e.g. with a crossfade or a hardfade (see also Figures 3a-3e):
- crossfade: a temporally rear area of a first audio segment area and a temporally front area of a second audio segment area are processed with suitable transition functions, and these two areas are then added in an overlapping manner such that the shorter of the two areas is at most completely overlapped by the longer one.
- hardfade: a temporally rear area of a first audio segment and a temporally front area of a second audio segment are processed with suitable transition functions, these two audio segments being joined together in such a way that the rear area of the first audio segment and the front area of the second audio segment do not overlap.
- the coarticulation area is particularly noticeable in that a concatenation in it is associated with discontinuities (e.g. spectral jumps).
- a hardfade represents a limit case of a crossfade in which the overlap of the temporally rear area of the first audio segment and the temporally front area of the second audio segment has a length of zero. This allows a crossfade to be replaced by a hardfade in certain, e.g. extremely time-critical, applications; however, such a procedure must be considered carefully, since it leads to significant quality losses when concatenating audio segments that should actually be concatenated by a crossfade.
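The two concatenation types defined above can be sketched on plain sample lists. Linear transition functions are assumed here for simplicity (the patent allows arbitrary suitable transition functions): a crossfade overlap-adds the faded tail of the first segment and the faded head of the second, and a hardfade is the limit case with an overlap length of zero.

```python
# Sketch of crossfade and hardfade concatenation with linear fades.

def crossfade(a, b, overlap):
    """Concatenate a and b, overlapping the last `overlap` samples of a
    with the first `overlap` samples of b using linear transition functions."""
    if overlap == 0:                 # degenerate case: this is a hardfade
        return a + b
    n = overlap
    tail = [x * (1 - i / n) for i, x in enumerate(a[-n:])]  # fade out
    head = [x * (i / n) for i, x in enumerate(b[:n])]       # fade in
    mixed = [t + h for t, h in zip(tail, head)]             # overlap-add
    return a[:-n] + mixed + b[n:]

def hardfade(a, b):
    """Join a and b with no overlap (overlap length zero)."""
    return crossfade(a, b, 0)

a = [1.0] * 6
b = [2.0] * 6
print(crossfade(a, b, 4))   # length 6 + 6 - 4 = 8, ramping from 1.0 to 2.0
print(len(hardfade(a, b)))  # 12
```

The overlap length corresponds to the co-articulation area in which the two segment areas are joined; a hardfade is simply `overlap == 0`, consistent with the limit-case description above.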
- WO 95/30193 discloses a method and a device for converting text into audible speech signals using a neural network.
- the text to be converted into speech is converted into a sequence of phonemes using a conversion unit, with additional information being generated about the syntactic boundaries of the text and the emphasis of the individual syntactic components. These are forwarded together with the phonemes to a facility that determines the duration of the pronunciation of the individual phonemes based on rules.
- a processor generates a suitable input for the neural network from each individual phoneme in conjunction with the corresponding syntactic and temporal information, this input for the neural network also comprising the corresponding prosodic information for the entire phoneme sequence. From the available audio segments, the neural network now selects those that best reproduce the entered phonemes and links these audio segments accordingly. In this concatenation, the duration, total amplitude and frequency of the individual audio segments are adapted to upstream and downstream audio segments, taking into account the prosodic information of the speech to be synthesized, and are connected to one another in time. A change in individual areas of the audio segments is not described here.
- the neural network is also used to generate the audio segments required for this method.
- No. 5,524,172 describes a device for generating synthesized speech which uses the so-called diphone method.
- a text that is to be converted into synthesized speech is divided into phoneme sequences, with corresponding prosodic information being assigned to each phoneme sequence.
- two diphones representing the phoneme are selected for each phoneme in the sequence and concatenated taking into account the corresponding prosodic information.
- the two diphones are each weighted using a suitable filter and the duration and pitch of both diphones are changed so that when the diphones are concatenated, a synthesized phoneme sequence is generated, the duration and pitch of which correspond to the duration and pitch of the desired phoneme sequence.
- the individual diphones are added in such a way that a temporally rear area of a first diphone and a temporally front area of a second diphone overlap, the concatenation moment generally being in the stationary region of the individual diphones (see FIG. 2a). Since a variation of the concatenation moment taking into account the co-articulation of successive audio segments (diphones) is not provided here, the quality (naturalness and intelligibility) of a speech synthesized in this way can be negatively influenced.
- the database also provides audio segments that differ slightly, but are suitable for synthesizing the same phoneme. In this way, the natural variation of the language is to be simulated in order to achieve a higher quality of the synthesized language.
- Both the use of the smoothing filter and the selection from a number of different audio segments for realizing a phoneme require high computing power from the system components used when implementing this method.
- the size of the database increases due to the increased number of audio segments provided.
- in this method, too, a co-articulation-dependent choice of the concatenation moment of individual audio segments is not provided, which can reduce the quality of the synthesized speech.
- DE 689 15 353 T2 aims to improve the sound quality by specifying a procedure for how the transition between two adjacent samples is to be designed. This is particularly relevant for low sampling rates.
- the speech synthesis described in this document uses waveforms that represent the sounds to be concatenated. For waveforms of upstream sounds, a corresponding end sample value and an assigned zero crossing point are determined in each case, while for waveforms of downstream sounds, a first upper sample value and an assigned zero crossing point are determined in each case.
- sounds are connected to one another in a maximum of four different ways.
- the number of connection types is reduced to two if the waveforms are generated in accordance with the Nyquist theorem.
- DE 689 15 353 T2 describes that the range of waveforms used extends between the last sample of the upstream waveform and the first sample of the downstream waveform.
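One ingredient of the prior-art scheme described above is locating the zero crossing point assigned to the last sample of an upstream waveform. The following sketch shows a plausible reading of that step; it is an illustration, not code taken from DE 689 15 353 T2.

```python
# Sketch: find the zero crossing point nearest the end of a waveform,
# i.e. the last index where the signal touches or crosses zero.

def last_zero_crossing(wave):
    """Index of the last sign change (or exact zero) in the waveform,
    searching backwards from the end; None if the sign never changes."""
    for i in range(len(wave) - 1, 0, -1):
        if wave[i - 1] == 0 or (wave[i - 1] > 0) != (wave[i] > 0):
            return i
    return None

w = [0.2, 0.5, 0.1, -0.3, -0.6, -0.2, 0.1]
print(last_zero_crossing(w))  # 6: the sign changes from -0.2 to 0.1
```

Joining waveforms at such zero crossings avoids amplitude discontinuities at the connection point, which is the stated goal of the transition design in the prior art.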
- a synthesized phoneme sequence has an authentic speech quality if it cannot be distinguished by the listener from the same phoneme sequence spoken by a real speaker.
- the acoustic data synthesized with the invention, in particular synthesized speech data, should have an authentic acoustic quality, in particular an authentic speech quality.
- the invention provides a method according to claim 1, a device according to claim 14, synthesized speech signals according to claim 28, a data carrier according to claim 39, a data memory according to claim 51, and a sound carrier according to claim 60.
- the invention thus makes it possible to generate synthesized acoustic data which reproduce a sequence of sounds, in that, when concatenating audio segment areas, the moment of concatenation of two audio segment areas is determined as a function of properties of the audio segment areas to be linked, in particular the co-articulation effects relating to the two audio segment areas.
- according to the invention, the concatenation moment is preferably chosen in the vicinity of the limits of the solo articulation area. In this way, a speech quality is achieved that cannot be achieved with the prior art.
- the computing power required is not higher than in the prior art.
- the invention provides for a different selection of the audio segment areas and different types of concatenation, appropriate to the co-articulation.
- a higher degree of naturalness of the synthesized acoustic data is achieved when a temporally downstream audio segment area whose beginning reproduces a static sound is connected to a temporally preceding audio segment area by means of a crossfade, or when a temporally downstream audio segment area whose beginning reproduces a dynamic sound is connected to a temporally preceding audio segment area by means of a hardfade.
- the invention makes it possible to reduce the number of audio segment areas necessary for data synthesis by using audio segment areas whose beginnings always reproduce a dynamic sound, whereby all concatenations of these audio segment areas can be carried out by means of a hardfade.
- in this case, downstream audio segment areas are connected with upstream audio segment areas, the beginnings of which each reproduce a dynamic sound.
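The selection rule stated above (crossfade when the downstream segment area begins with a static sound, hardfade when it begins with a dynamic sound) can be written as a small dispatch function. The phone-class sets follow the static/dynamic division given earlier in the text; the string-based representation is an assumption for illustration.

```python
# Sketch of the rule: concatenation type is chosen from the class of the
# sound with which the temporally downstream audio segment area begins.

STATIC_PHONES = {"vowel", "diphthong", "nasal", "lateral",
                 "vibrant", "fricative"}
DYNAMIC_PHONES = {"plosive", "affricate", "glottal_stop", "flap"}

def concatenation_type(downstream_start_class):
    """Return 'crossfade' for a static beginning, 'hardfade' for a
    dynamic beginning, following the rule described in the text."""
    if downstream_start_class in STATIC_PHONES:
        return "crossfade"
    if downstream_start_class in DYNAMIC_PHONES:
        return "hardfade"
    raise ValueError("unknown phone class: %r" % downstream_start_class)

print(concatenation_type("vowel"))    # crossfade
print(concatenation_type("plosive"))  # hardfade
```

This also makes the inventory-reduction argument concrete: if every stored segment area begins with a dynamic sound, the function always returns "hardfade", so no crossfade machinery is needed at synthesis time.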
- synthesized acoustic data of high quality can also be generated according to the invention, even with low computing power (for example in the case of answering machines or car control systems).
- the invention provides for the simulation of acoustic phenomena which result from the mutual influence of individual segments of corresponding natural acoustic data.
- individual audio segments or individual areas of the audio segments are processed using suitable functions.
- the frequency, the duration, the amplitude or the spectrum of the audio segments can be changed.
- prosodic information and / or superordinate co-articulation effects are preferably taken into account to solve this task.
- the signal curve of synthesized acoustic data can additionally be improved if the concatenation moment is placed at points of the individual audio segment regions to be linked, at which the two regions used match in terms of one or more suitable properties.
- suitable properties can include: zero crossing, amplitude value, slope, derivative of any degree, spectrum, pitch, amplitude value in a frequency range, volume, speaking style, speech emotion, or other properties considered in the sound classification scheme.
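The idea above (place the concatenation moment where the two areas agree best in one or more suitable properties) can be sketched as a small search. Here the properties are just the amplitude value and the local slope, and the squared-mismatch cost and search window are illustrative assumptions, not taken from the claims.

```python
# Sketch: choose the concatenation moment within a window so that the
# tail of the first area and the head of the second agree best in
# amplitude and slope.

def best_moment(tail, head, window):
    """Offset k in [0, window) at which tail[-window+k] and head[k]
    agree best (smallest squared mismatch in amplitude and slope)."""
    def props(seg, i):
        slope = seg[i + 1] - seg[i] if i + 1 < len(seg) else 0.0
        return seg[i], slope

    best_k, best_cost = 0, float("inf")
    for k in range(window):
        a, da = props(tail, len(tail) - window + k)
        b, db = props(head, k)
        cost = (a - b) ** 2 + (da - db) ** 2
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

tail = [0.9, 0.7, 0.4, 0.1, -0.2]
head = [0.4, 0.2, 0.0, -0.2, -0.4]
print(best_moment(tail, head, 4))  # 2
```

A real implementation could use any of the properties listed above (spectrum, pitch, derivatives of higher degree, etc.) in the cost function; the structure of the search stays the same.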
- the invention makes it possible to improve the selection of the audio segment areas for generating the synthesized acoustic data and to make their concatenation more efficient by using heuristic knowledge.
- audio segment areas are preferably used that reproduce sounds/phones or parts of sound sequences/phone sequences.
- the invention allows the use of the synthesized acoustic data generated by converting these data into acoustic signals and / or voice signals and / or storing them on a data carrier.
- the invention can be used to provide synthesized speech signals which differ from known synthesized speech signals in that they do not differ in their naturalness and intelligibility from real speech.
- audio segment areas are concatenated in accordance with the articulation, each reproducing parts of the phonetic sequence / phoneme sequence of the speech to be synthesized, by determining the areas of the audio segments to be used and the moment of concatenation of these areas according to the invention as defined in claim 28.
- An additional improvement of the synthesized speech can be achieved if a downstream audio segment area, the beginning of which is a static sound or reproduces a static phone, is connected to a temporally preceding audio segment area by means of a crossfade, or if a temporally downstream audio segment area, the beginning of which reproduces a dynamic sound or a dynamic phon, is connected to a temporally preceding audio segment area by means of a hardfade.
- a fast and efficient procedure is particularly desirable when generating synthesized speech.
- Such audio segment areas can be generated beforehand with the invention by concatenation of corresponding audio segment areas in accordance with the articulation.
- the invention provides speech signals which have a natural speech flow, speech melody and speech rhythm in that audio segment areas are processed before and / or after concatenation in their entirety or in individual areas with the aid of suitable functions. It is particularly advantageous to additionally carry out this variation in areas in which the corresponding moments of the concatenations lie, in order, inter alia, to change the frequency, duration, amplitude or spectrum.
- An additionally improved signal curve can be achieved if the concatenation moments are located at locations of the audio segment regions to be linked, at which these correspond in one or more suitable properties.
- the speech signals can be converted into acoustic signals or stored on a data carrier.
- a data carrier is provided which contains a computer program which enables the method according to the invention to be carried out or the device according to the invention and its various embodiments to be controlled.
- the data carrier according to the invention also allows the generation of voice signals which have concatenations that are appropriate for co-articulation.
- the invention provides a data memory which contains audio segments which are suitable for to be concatenated according to the invention into synthesized acoustic data.
- a data carrier preferably contains audio segments which are suitable for carrying out the method according to the invention and for use in the device according to the invention or the data carrier according to the invention.
- the data carrier can also include voice signals according to the invention.
- the invention also provides a sound carrier that contains data generated at least partially by the method according to the invention or the device according to the invention, or by using the data carrier or the data memory according to the invention, in particular synthesized speech signals.
- Figure 1a Schematic representation of an inventive device for generating synthesized acoustic data
- Figure 1b Structure of a sound / phon.
- Figure 2a Structure of a conventional audio segment according to the prior art, consisting of parts of two sounds, ie a diphone for speech. It is essential that the solo articulation areas are only partially contained in the conventional diphone audio segment.
- Figure 2b Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with downstream co-articulation areas (quasi a 'shifted' diphone for speech).
- Figure 2c Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with upstream coarticulation areas.
- Figure 2d Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with downstream coarticulation areas and contains additional areas.
- Figure 2e Structure of an audio segment according to the invention, which reproduces parts of a sound / phon with upstream coarticulation areas and contains additional areas.
- Figure 2f Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with downstream co-articulation areas. Sounds/phones 2 to (n-1) are completely contained in the audio segment.
- Figure 2g Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with upstream co-articulation areas. Sounds/phones 2 to (n-1) are completely contained in the audio segment.
- Figure 2h Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with downstream co-articulation areas, and contains additional areas. Sounds/phones 2 to (n-1) are completely contained in the audio segment.
- Figure 2i Structure of an audio segment according to the invention, which reproduces parts of several sounds/phones (for speech: a polyphone), each with upstream co-articulation areas, and contains additional areas. Sounds/phones 2 to (n-1) are completely contained in the audio segment.
- Figure 2j Structure of an audio segment according to the invention, which reproduces part of a sound/phon from the beginning of a sound sequence/phone sequence.
- Figure 2k Structure of an audio segment according to the invention, which reproduces parts of sounds/phones from the beginning of a sound sequence/phone sequence.
- Figure 2l Structure of an audio segment according to the invention, which reproduces a sound/phon from the end of a sound sequence/phone sequence.
- Figure 3a Concatenation according to the prior art using the example of two conventional audio segments. The segments begin and end with parts of the solo articulation areas (usually half each).
- Figure 3aI Concatenation according to the prior art.
- the solo articulation area of the middle phone comes from two different audio segments.
- Figure 3b Concatenation according to the inventive method using the example of two audio segments according to the invention, each containing a sound/phon with downstream co-articulation areas. Both sounds/phones come from the middle of a sound sequence/phone sequence.
- Figure 3bI Concatenation of these audio segments using a crossfade.
- the solo articulation area comes from an audio segment.
- the transition between the audio segments takes place between two areas and is therefore less sensitive to differences (in the spectrum, frequency, amplitude, etc.).
- the audio segments can also be edited with additional transition functions before concatenation.
- Figure 3bII Concatenation of these audio segments using a hardfade.
- Figure 3c Concatenation according to the inventive method using the example of two audio segments according to the invention, each containing a sound/phon with downstream co-articulation areas, the first audio segment coming from the beginning of a sound sequence/phone sequence.
- Figure 3cII Concatenation of these audio segments using a hardfade.
- Figure 3d Concatenation according to the inventive method using the example of two audio segments according to the invention, each of which contains a sound / a phon with upstream co-articulation areas. Both audio segments come from the middle of a sound sequence.
- Figure 3dI Concatenation of these audio segments using a crossfade.
- the solo articulation area comes from an audio segment.
- Figure 3dII Concatenation of these audio segments using a hardfade.
- Figure 3eI Concatenation of these audio segments using a crossfade.
- Figure 3eII Concatenation of these audio segments using a hardfade.
- Figure 4 Schematic representation of the steps of a method according to the invention for generating synthesized acoustic data.
- to use the invention, for example, to convert a text into synthesized speech, it is necessary in a preceding step to subdivide this text into a sequence of phonetic characters or phonemes using known methods or devices. Prosodic information corresponding to the text should preferably also be generated.
- the phonetic sequence or phoneme sequence as well as the prosodic and additional information serve as input variables for the method and the device according to the invention.
- the sounds / phones to be synthesized are fed to an input unit 101 of the device 1 for generating synthesized speech data and stored in a first storage unit 103 (see FIG. 1a).
- with the aid of a selection device 105, audio segments (elements) are selected from an inventory stored in a database 107, or obtained from an upstream synthesis device 108 (which is not part of the invention). The selected audio segment areas reproduce sounds or phones, or parts thereof, which correspond to the individual entered phonetic characters or phonemes or parts thereof, and are stored in a second memory unit 109 in an order which corresponds to the sequence of the input phonetic characters or phonemes.
- the selection device 105 preferably selects those audio segments which reproduce the largest parts of sound sequences or polyphones corresponding to sections of the input sound sequence or phoneme sequence, so that a minimum number of audio segments is required for the synthesis of the input phoneme sequence.
- the selection device 105 preferably selects the longest audio segment areas which reproduce parts of the sequence of sounds/phones, in order to synthesize the entered sequence of sounds or phonemes from a minimal number of audio segment areas. In this case, it is advantageous to use concatenated audio segment areas reproducing sounds/phones that have a static sound/phone upstream.
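The preferred selection strategy above (cover the entered sequence with the longest available inventory elements so that a minimal number of audio segment areas is needed) can be sketched with a greedy longest-match search. The inventory contents below are invented for illustration, and greedy matching is only one simple realization; it is not guaranteed to be globally minimal in every inventory.

```python
# Sketch of longest-match segment selection over a phone sequence.
# Keys of the inventory are the phone sequences each element reproduces.

INVENTORY = {("h", "e"), ("e", "l", "o"), ("h",), ("e",), ("l",), ("o",)}

def select_segments(phones):
    """Greedily pick the longest matching inventory element at each
    position of the entered phone sequence."""
    out, i = [], 0
    while i < len(phones):
        for length in range(len(phones) - i, 0, -1):
            cand = tuple(phones[i:i + length])
            if cand in INVENTORY:
                out.append(cand)
                i += length
                break
        else:
            # the inventory is not complete (cannot synthesize this phone)
            raise ValueError("inventory is not complete for %r" % (phones[i],))
    return out

print(select_segments(["h", "e", "l", "o"]))  # [('h', 'e'), ('l',), ('o',)]
```

The `else` branch makes the completeness requirement from earlier in the text concrete: a complete inventory never raises, because every phone sequence of the language can be covered.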
- the concatenation moments of two successive audio segment areas are determined with the aid of a concatenation device 111 as follows:
- step 1: if an audio segment area is to be used to synthesize the beginning of the entered sound sequence/phoneme sequence, then an audio segment area that reproduces the beginning of a sound sequence/phoneme sequence is to be selected and chained with a temporally downstream audio segment area (see FIG. 3c and step 3 in FIG. 4).
- the concatenation is carried out in the form of a crossfade, with the moment of concatenation being placed in the rear area of the first audio segment area and in the front area of the second audio segment area, these two areas overlapping in the concatenation or at least immediately adjoining one another (see Figures 3bI, 3cI, 3dI and 3eI, concatenation using crossfade).
- the concatenation is carried out in the form of a hardfade, the moment of the concatenation being temporally immediately behind the rear area of the first audio segment area and temporally immediately before the front area of the second audio segment area (see Figures 3bII, 3cII, 3dII and 3eII, concatenation using hardfade).
- new audio segments can be generated from these originally available audio segment areas, which begin with the reproduction of a static sound / phone. This is achieved by concatenating audio segment areas, which start with the reproduction of a dynamic sound / phone, with audio segment areas, which begin with the playback of a static sound / phone. Although this increases the number of audio segments or the scope of the inventory, it can represent a computing advantage in the generation of synthesized speech data, since fewer individual concatenations are required to generate a phonetic sequence / phoneme sequence and concatenations only have to be carried out in the form of a crossfade.
- the new chained audio segments thus generated are preferably fed to the database 107 or another storage unit 113.
- a further advantage of this concatenation of the original audio segment areas into new, longer audio segments arises if, for example, a sequence of sounds / phones occurs frequently in the entered sound sequence / phone sequence. Then one of the new, correspondingly linked audio segments can be used, and it is not necessary to re-concatenate the originally available audio segment areas each time this sequence of sounds / phones occurs.
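The reuse described above amounts to memoising the concatenation result per sound sequence. A minimal sketch follows; the class name and the tuple key layout are assumptions for illustration, not taken from the patent.

```python
class SegmentCache:
    """Cache pre-concatenated audio segments so a frequently recurring
    sound sequence / phone sequence is linked only once."""

    def __init__(self):
        self._store = {}  # phone-sequence key -> concatenated waveform

    def get_or_build(self, key, build):
        """Return the cached audio for `key`, building it on first use."""
        if key not in self._store:
            self._store[key] = build()
        return self._store[key]
```

On every later occurrence of the same sound sequence, `get_or_build` returns the stored waveform instead of repeating the concatenation.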
- overlapping co-articulation effects are preferably also recorded, or specific co-articulation effects in the form of additional data are assigned to the stored chained audio segment.
- if an audio segment area is to be used to synthesize the end of the entered sound sequence / phoneme sequence, an audio segment area that reproduces the end of a sound sequence / phoneme sequence is to be selected from the inventory and concatenated with a temporally preceding audio segment area (see FIG. 3e and step 8 in FIG. 4).
- the individual audio segments are stored in coded form in the database 107. In addition to the waveform of the respective audio segment, the coded form can indicate which parts of sound sequences / phoneme sequences the respective audio segment reproduces, which type of concatenation (e.g. hardfade, linear or exponential crossfade) is to be carried out with which temporally subsequent audio segment area, and at which moment the concatenation with which temporally subsequent audio segment area takes place.
- the encoded form of the audio segments preferably also contains information relating to prosody, superordinate co-articulations and transition functions, which are used to achieve an additional improvement in speech quality.
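The coded form described above might be modeled as a record like the following sketch; the field names and types are assumptions chosen for illustration, not the patent's storage format.

```python
from dataclasses import dataclass, field


@dataclass
class CodedAudioSegment:
    """Illustrative coded form of a stored audio segment."""
    waveform: list       # sample values of the segment
    phone_parts: tuple   # parts of sound / phoneme sequences it reproduces
    concat_type: str     # e.g. "hardfade", "linear_crossfade", "exp_crossfade"
    concat_moment: int   # sample index at which concatenation takes place
    prosody: dict = field(default_factory=dict)      # optional prosodic information
    transitions: dict = field(default_factory=dict)  # optional transition functions
```

A record like this lets the selection device match a candidate's concatenation type and moment against the upstream segment before loading its waveform.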
- those audio segment areas are selected as temporally downstream whose properties, including the type of concatenation and the concatenation moment, correspond to those of the respectively upstream audio segment areas.
- the concatenation of two successive audio segment areas takes place with the aid of the concatenation device 111.
- the waveform, the type of concatenation, the concatenation moment and any additional information of the first audio segment area and the second audio segment area are loaded from the database or the synthesis device (FIG. 3b and steps 10 and 11).
- preferably, those audio segment areas are selected which match one another with regard to their type of concatenation and their concatenation moment. In this case, it is no longer necessary to load the information regarding the type of concatenation and the concatenation moment of the second audio segment area.
- the waveform of the first audio segment area in a temporally rear area and the waveform of the second audio segment area in a temporally front area are each processed with suitable transition functions, e.g. multiplied by a suitable weighting function (see Figure 3b, steps 12 and 13).
- the lengths of the temporally rear area of the first audio segment area and of the temporally front area of the second audio segment area result from the type of concatenation and the temporal position of the concatenation moment; these lengths can also be stored in the coded form of the audio segments in the database.
- if the two audio segment areas are to be linked with a crossfade, they are added in an overlapping manner in accordance with the respective concatenation moment (see FIGS. 3bI, 3cI, 3dI and 3eI, step 15).
- a linear symmetric crossfade is preferably used here, but any other type of crossfade or any other transition function can also be used.
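The preferred linear symmetric weighting over the overlap, and one alternative transition function, can be sketched as follows. This is an illustrative sketch under assumed naming; the raised-cosine variant is a common alternative, not one named in the patent.

```python
import math


def linear_weights(n):
    """Symmetric linear fade-out / fade-in weights over n overlap samples.
    At every sample the two weights sum to 1, so the mixed signal level
    stays balanced across the concatenation moment."""
    fade_in = [(i + 1) / (n + 1) for i in range(n)]
    fade_out = [1.0 - w for w in fade_in]
    return fade_out, fade_in


def cosine_weights(n):
    """Raised-cosine alternative: smoother at the area boundaries."""
    fade_in = [0.5 - 0.5 * math.cos(math.pi * (i + 1) / (n + 1)) for i in range(n)]
    return [1.0 - w for w in fade_in], fade_in
```

Either pair can be multiplied sample-by-sample onto the rear area of the first segment and the front area of the second before the overlapping addition.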
- if the concatenation is to be carried out in the form of a hardfade, the two audio segment areas are connected one after the other without overlapping (see FIGS. 3bII, 3cII, 3dII and 3eII, step 15).
- the two audio segment areas are arranged directly one behind the other in time. In order to be able to further process the synthesized speech data generated in this way, these are preferably stored in a third memory unit 115.
- the previously linked audio segment areas are regarded as the first audio segment area (step 1)
- the prosodic and additional information which is entered in addition to the sound sequence / phoneme sequence should preferably also be taken into account when concatenating the audio segment areas.
- the frequency, duration, amplitude and / or spectral properties of the audio segment areas are changed before and / or after their concatenation so that the synthesized speech data have a natural word and / or sentence melody (steps 14, 17 or 18).
- the processing of the two audio segment areas with the aid of suitable functions in the area of the concatenation moment is also provided, in order, inter alia, to adapt the frequencies, durations, amplitudes and spectral properties.
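Amplitude and duration adaptation of the kind described in the two points above can be sketched minimally; the nearest-neighbour resampling below is deliberately naive and purely illustrative (real systems would use e.g. pitch-synchronous methods, which the patent does not prescribe here).

```python
def scale_amplitude(wave, gain):
    """Change loudness by a constant gain factor."""
    return [s * gain for s in wave]


def stretch_duration(wave, factor):
    """Naive duration change by nearest-neighbour resampling:
    factor > 1 lengthens the segment, factor < 1 shortens it."""
    n = max(1, round(len(wave) * factor))
    return [wave[min(len(wave) - 1, int(i / factor))] for i in range(n)]
```

Applied before or after concatenation, such functions adapt the areas around the concatenation moment toward a natural word and sentence melody.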
- the invention also allows superordinate acoustic phenomena of real speech, such as superordinate co-articulation effects or speaking style (e.g. whispering, emphasis, singing voice, falsetto, emotional expression), to be taken into account when synthesizing the sound sequence / phoneme sequence.
- information relating to such higher-level phenomena is additionally stored in coded form with the corresponding audio segments, so that when selecting the audio segment areas, only those are selected which correspond to the higher-level co-articulation properties of the audio segment areas upstream and / or downstream.
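Selection constrained by such higher-level properties could look like the following sketch; the dictionary keys (`concat_type`, `style`) are hypothetical names standing in for the coded properties the patent describes.

```python
def select_matching(candidates, upstream):
    """Keep only candidate segment records whose concatenation type and
    higher-level (co-articulation / style) property fit the upstream one."""
    return [
        c for c in candidates
        if c["concat_type"] == upstream["concat_type"]
        and c.get("style") == upstream.get("style")
    ]
```

Filtering the inventory this way before waveform loading ensures a whispered or emphasized passage is only continued with segments carrying the same style property.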
- the synthesized speech data thus generated preferably have a form which, using an output unit 117, allows the speech data to be converted into acoustic speech signals and the speech data and / or speech signals to be stored on an acoustic, optical, magnetic or electrical data carrier (step 19).
- inventory elements are created by recording real spoken language.
- the degree of training of the speaker building the inventory, i.e. his or her ability to control the speech to be recorded (e.g. to control the pitch of the speech or to speak exactly at one pitch), also influences the quality of the inventory.
- the invention can be used for the synthesis of any acoustic data or any sound events. Therefore, this invention can also be used for the generation and / or provision of synthesized speech data and / or speech signals for any languages or dialects as well as for the synthesis of music.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Stereophonic System (AREA)
- Telephone Function (AREA)
- Stereo-Broadcasting Methods (AREA)
- Document Processing Apparatus (AREA)
- Photoreceptors In Electrophotography (AREA)
- Machine Translation (AREA)
- Circuits Of Receivers In General (AREA)
- Reverberation, Karaoke And Other Acoustics (AREA)
Abstract
Description
Claims
Priority Applications (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AT99942891T ATE243876T1 (de) | 1998-08-19 | 1999-08-19 | Verfahren und vorrichtungen zur koartikulationsgerechten konkatenation von audiosegmenten |
EP99942891A EP1105867B1 (de) | 1998-08-19 | 1999-08-19 | Verfahren und vorrichtungen zur koartikulationsgerechten konkatenation von audiosegmenten |
DE59906115T DE59906115D1 (de) | 1998-08-19 | 1999-08-19 | Verfahren und vorrichtungen zur koartikulationsgerechten konkatenation von audiosegmenten |
AU56231/99A AU5623199A (en) | 1998-08-19 | 1999-08-19 | Method and device for the concatenation of audiosegments, taking into account coarticulation |
US09/763,149 US7047194B1 (en) | 1998-08-19 | 1999-08-19 | Method and device for co-articulated concatenation of audio segments |
CA002340073A CA2340073A1 (en) | 1998-08-19 | 1999-08-19 | Method and device for the concatenation of audiosegments, taking into account coarticulation |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE1998137661 DE19837661C2 (de) | 1998-08-19 | 1998-08-19 | Verfahren und Vorrichtung zur koartikulationsgerechten Konkatenation von Audiosegmenten |
DE19837661.8 | 1998-08-19 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2000011647A1 true WO2000011647A1 (de) | 2000-03-02 |
Family
ID=7878051
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP1999/006081 WO2000011647A1 (de) | 1998-08-19 | 1999-08-19 | Verfahren und vorrichtungen zur koartikulationsgerechten konkatenation von audiosegmenten |
Country Status (7)
Country | Link |
---|---|
US (1) | US7047194B1 (de) |
EP (1) | EP1105867B1 (de) |
AT (1) | ATE243876T1 (de) |
AU (1) | AU5623199A (de) |
CA (1) | CA2340073A1 (de) |
DE (2) | DE19861167A1 (de) |
WO (1) | WO2000011647A1 (de) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102004044649B3 (de) * | 2004-09-15 | 2006-05-04 | Siemens Ag | Verfahren zur integrierten Sprachsynthese |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7369994B1 (en) * | 1999-04-30 | 2008-05-06 | At&T Corp. | Methods and apparatus for rapid acoustic unit selection from a large speech corpus |
US7941481B1 (en) | 1999-10-22 | 2011-05-10 | Tellme Networks, Inc. | Updating an electronic phonebook over electronic communication networks |
US7308408B1 (en) * | 2000-07-24 | 2007-12-11 | Microsoft Corporation | Providing services for an information processing system using an audio interface |
DE10042571C2 (de) * | 2000-08-22 | 2003-02-06 | Univ Dresden Tech | Verfahren zur konkatenativen Sprachsynthese mittels graphenbasierter Bausteinauswahl mit variabler Bewertungsfunktion |
JP3901475B2 (ja) * | 2001-07-02 | 2007-04-04 | 株式会社ケンウッド | 信号結合装置、信号結合方法及びプログラム |
US7379875B2 (en) * | 2003-10-24 | 2008-05-27 | Microsoft Corporation | Systems and methods for generating audio thumbnails |
US20080154601A1 (en) * | 2004-09-29 | 2008-06-26 | Microsoft Corporation | Method and system for providing menu and other services for an information processing system using a telephone or other audio interface |
US8510113B1 (en) * | 2006-08-31 | 2013-08-13 | At&T Intellectual Property Ii, L.P. | Method and system for enhancing a speech database |
US8374868B2 (en) * | 2009-08-21 | 2013-02-12 | General Motors Llc | Method of recognizing speech |
US20110046957A1 (en) * | 2009-08-24 | 2011-02-24 | NovaSpeech, LLC | System and method for speech synthesis using frequency splicing |
JP6047922B2 (ja) * | 2011-06-01 | 2016-12-21 | ヤマハ株式会社 | 音声合成装置および音声合成方法 |
US9368104B2 (en) * | 2012-04-30 | 2016-06-14 | Src, Inc. | System and method for synthesizing human speech using multiple speakers and context |
WO2016002879A1 (ja) * | 2014-07-02 | 2016-01-07 | ヤマハ株式会社 | 音声合成装置、音声合成方法およびプログラム |
BR112018008874A8 (pt) * | 2015-11-09 | 2019-02-26 | Sony Corp | aparelho e método de decodificação, e, programa. |
CN111145723B (zh) * | 2019-12-31 | 2023-11-17 | 广州酷狗计算机科技有限公司 | 转换音频的方法、装置、设备以及存储介质 |
CN113066459B (zh) * | 2021-03-24 | 2023-05-30 | 平安科技(深圳)有限公司 | 基于旋律的歌曲信息合成方法、装置、设备及存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0351848A2 (de) * | 1988-07-21 | 1990-01-24 | Sharp Kabushiki Kaisha | Einrichtung zur Sprachsynthese |
WO1995030193A1 (en) * | 1994-04-28 | 1995-11-09 | Motorola Inc. | A method and apparatus for converting text into audible signals using a neural network |
US5524172A (en) * | 1988-09-02 | 1996-06-04 | Represented By The Ministry Of Posts Telecommunications And Space Centre National D'etudes Des Telecommunicationss | Processing device for speech synthesis by addition of overlapping wave forms |
US5659664A (en) * | 1992-03-17 | 1997-08-19 | Televerket | Speech synthesis with weighted parameters at phoneme boundaries |
EP0813184A1 (de) * | 1996-06-10 | 1997-12-17 | Faculté Polytechnique de Mons | Verfahren zur Tonsynthese |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5463715A (en) * | 1992-12-30 | 1995-10-31 | Innovation Technologies | Method and apparatus for speech generation from phonetic codes |
- 1998
- 1998-08-19 DE DE19861167A patent/DE19861167A1/de not_active Ceased
- 1999
- 1999-08-19 US US09/763,149 patent/US7047194B1/en not_active Expired - Lifetime
- 1999-08-19 AT AT99942891T patent/ATE243876T1/de not_active IP Right Cessation
- 1999-08-19 EP EP99942891A patent/EP1105867B1/de not_active Expired - Lifetime
- 1999-08-19 CA CA002340073A patent/CA2340073A1/en not_active Abandoned
- 1999-08-19 AU AU56231/99A patent/AU5623199A/en not_active Abandoned
- 1999-08-19 WO PCT/EP1999/006081 patent/WO2000011647A1/de active IP Right Grant
- 1999-08-19 DE DE59906115T patent/DE59906115D1/de not_active Expired - Lifetime
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0351848A2 (de) * | 1988-07-21 | 1990-01-24 | Sharp Kabushiki Kaisha | Einrichtung zur Sprachsynthese |
US5524172A (en) * | 1988-09-02 | 1996-06-04 | Represented By The Ministry Of Posts Telecommunications And Space Centre National D'etudes Des Telecommunicationss | Processing device for speech synthesis by addition of overlapping wave forms |
US5659664A (en) * | 1992-03-17 | 1997-08-19 | Televerket | Speech synthesis with weighted parameters at phoneme boundaries |
WO1995030193A1 (en) * | 1994-04-28 | 1995-11-09 | Motorola Inc. | A method and apparatus for converting text into audible signals using a neural network |
EP0813184A1 (de) * | 1996-06-10 | 1997-12-17 | Faculté Polytechnique de Mons | Verfahren zur Tonsynthese |
Non-Patent Citations (2)
Title |
---|
DETTWEILER H ET AL: "Concatenation rules for demisyllable speech synthesis", PROCEEDINGS OF IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (ICASSP '85), TAMPA, FL, USA, vol. 2, 26 March 1985 (1985-03-26) - 29 March 1985 (1985-03-29), IEEE, New York, NY, USA, pages 752 - 755, XP002128522 * |
YIOURGALIS N ET AL: "A TtS system for the Greek language based on concatenation of formant coded segments", SPEECH COMMUNICATION,NL,ELSEVIER SCIENCE PUBLISHERS, AMSTERDAM, vol. 19, no. 1, pages 21-38, XP004013506, ISSN: 0167-6393 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102004044649B3 (de) * | 2004-09-15 | 2006-05-04 | Siemens Ag | Verfahren zur integrierten Sprachsynthese |
Also Published As
Publication number | Publication date |
---|---|
EP1105867A1 (de) | 2001-06-13 |
CA2340073A1 (en) | 2000-03-02 |
EP1105867B1 (de) | 2003-06-25 |
DE19861167A1 (de) | 2000-06-15 |
US7047194B1 (en) | 2006-05-16 |
DE59906115D1 (de) | 2003-07-31 |
AU5623199A (en) | 2000-03-14 |
ATE243876T1 (de) | 2003-07-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE60112512T2 (de) | Kodierung von Ausdruck in Sprachsynthese | |
DE19610019C2 (de) | Digitales Sprachsyntheseverfahren | |
DE69821673T2 (de) | Verfahren und Vorrichtung zum Editieren synthetischer Sprachnachrichten, sowie Speichermittel mit dem Verfahren | |
DE4237563C2 (de) | Verfahren zum Synthetisieren von Sprache | |
EP1105867B1 (de) | Verfahren und vorrichtungen zur koartikulationsgerechten konkatenation von audiosegmenten | |
DE69909716T2 (de) | Formant Sprachsynthetisierer unter Verwendung von Verkettung von Halbsilben mit unabhängiger Überblendung im Filterkoeffizienten- und Quellenbereich | |
DE69031165T2 (de) | System und methode zur text-sprache-umsetzung mit hilfe von kontextabhängigen vokalallophonen | |
DE60035001T2 (de) | Sprachsynthese mit Prosodie-Mustern | |
DE60126575T2 (de) | Vorrichtung und Verfahren zur Synthese einer singenden Stimme und Programm zur Realisierung des Verfahrens | |
DE69925932T2 (de) | Sprachsynthese durch verkettung von sprachwellenformen | |
DE60216651T2 (de) | Vorrichtung zur Sprachsynthese | |
DE2115258A1 (de) | Sprachsynthese durch Verkettung von in Formant Form codierten Wortern | |
DD143970A1 (de) | Verfahren und anordnung zur synthese von sprache | |
US6424937B1 (en) | Fundamental frequency pattern generator, method and program | |
DE60202161T2 (de) | Verfahren, Vorrichtung und Programm zur Analyse und Synthese von Sprache | |
DE60205421T2 (de) | Verfahren und Vorrichtung zur Sprachsynthese | |
EP0058130B1 (de) | Verfahren zur Synthese von Sprache mit unbegrenztem Wortschatz und Schaltungsanordnung zur Durchführung des Verfahrens | |
EP1110203B1 (de) | Vorrichtung und verfahren zur digitalen sprachbearbeitung | |
EP1344211B1 (de) | Vorrichtung und verfahren zur differenzierten sprachausgabe | |
DE60305944T2 (de) | Verfahren zur synthese eines stationären klangsignals | |
DE60316678T2 (de) | Verfahren zum synthetisieren von sprache | |
DE19837661C2 (de) | Verfahren und Vorrichtung zur koartikulationsgerechten Konkatenation von Audiosegmenten | |
DE60311482T2 (de) | Verfahren zur steuerung der dauer bei der sprachsynthese | |
DE3232835C2 (de) | ||
DE60131521T2 (de) | Verfahren und Vorrichtung zur Steuerung des Betriebs eines Geräts bzw. eines Systems sowie System mit einer solchen Vorrichtung und Computerprogramm zur Ausführung des Verfahrens |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A1 Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT UA UG US UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A1 Designated state(s): GH GM KE LS MW SD SL SZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
ENP | Entry into the national phase |
Ref document number: 2340073 Country of ref document: CA Ref country code: CA Ref document number: 2340073 Kind code of ref document: A Format of ref document f/p: F |
|
WWE | Wipo information: entry into national phase |
Ref document number: 1999942891 Country of ref document: EP |
|
WWE | Wipo information: entry into national phase |
Ref document number: 09763149 Country of ref document: US |
|
WWP | Wipo information: published in national office |
Ref document number: 1999942891 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
WWG | Wipo information: grant in national office |
Ref document number: 1999942891 Country of ref document: EP |