WO2004029929A1 - Method for computer-aided speech synthesis of a stored electronic text into an analog speech signal, speech synthesis device and telecommunication device - Google Patents
Method for computer-aided speech synthesis of a stored electronic text into an analog speech signal, speech synthesis device and telecommunication device (original German title: Verfahren zur rechnergestützten Sprachsynthese eines gespeicherten elektronischen Textes zu einem analogen Sprachsignal, Sprachsyntheseeinrichtung und Telekommunikationsgerät)
- Publication number
- WO2004029929A1 (PCT/DE2003/003158)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- electronic
- text
- spoken
- sequence
- units
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Definitions
- the invention relates to a method for computer-aided speech synthesis of a stored electronic text into an analog speech signal, a speech synthesis device, and a telecommunication device.
- speech synthesis is gaining importance as a means of communication for outputting information to humans in systems in which other output media, such as graphics, cannot be used, for example because no monitor is available to display the information or a monitor cannot be used due to space constraints.
- a speech synthesis device and a method for speech synthesis are therefore required which make very low demands on the available resources in terms of computing power and required storage space, and which nevertheless provide a full synthesis, for example for "reading aloud" a text, preferably an electronic message.
- [5] describes a text-to-speech conversion device in which the text-to-speech conversion is carried out using a special exception lexicon.
- [6] describes a parser device for determining predetermined expressions from a spoken speech signal sequence.
- the invention is based on the problem of providing a speech synthesis which requires less storage space than is required in known speech synthesis methods or speech synthesis devices.
- the problem is solved by the method for computer-assisted speech synthesis of a stored electronic text into an analog speech signal
- Speech synthesis device and solved by a telecommunications device with the features according to the independent claims.
- the stored electronic text is subjected to a text analysis using the specified text analysis rules.
- the stored electronic text is usually stored in a predetermined electronic word processing format, such as ASCII.
- control characters of a word processing system, such as page break control characters or formatting control characters, can also be contained in the electronic text.
- this text is converted into an analog voice signal, which is output to a user by means of a loudspeaker.
- text analysis rules are to be understood as a set of rules which are processed one after the other and which, as will be explained in more detail below, usually represent language-specific rules which describe the usual mapping of certain parts of the electronic text onto one or more spoken units.
- the following units in particular can be used as spoken units for the subsequent concatenating speech synthesis: phonemes, diphones, syllables or words.
- the abbreviation lexicon contains a mapping table of given abbreviations, coded in the format in which the electronic text is available, and the associated phonetic transcription of each abbreviation, for example coded in SAMPA, as the corresponding representation of the given abbreviation.
- the electronic function word lexicon is a mapping table with predefined function words, again coded in the electronic text format used in each case, and the spoken units assigned to the respective function word, coded in the respective phonetic transcription, preferably SAMPA, as the corresponding representation of the respective predefined function word.
- a function word is to be understood as a word which functionally connects nouns or verbs with one another, for example the words "for", "under", "on", "with", etc.
- the exception lexicon in turn contains predefined exception character strings, which can be specified by a user, and the associated sequence of spoken units; each data entry again contains a data tuple of two elements, the first element of the data tuple being the respective term, encoded in the format of the electronic text, and the second element of the data tuple being the respective phonetic representation of the first element, encoded in the respective phonetic transcription.
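- purely as an illustration, such a two-element data tuple and its lookup can be sketched in C as follows; all names and the SAMPA example strings are hypothetical and not part of the disclosure:

```c
#include <string.h>

/* Minimal sketch of a lexicon entry as a two-element data tuple, assuming
 * the graphemic form is coded in the text format and the phonetic form in
 * SAMPA, as described above. */
typedef struct {
    const char *grapheme; /* first element: term in the format of the text  */
    const char *phonetic; /* second element: phonetic transcription (SAMPA) */
} LexiconEntry;

static const LexiconEntry exception_lexicon[] = {
    { "z.B.", "tsUm b'aISpi:l" }, /* illustrative entries only */
    { "SMS",  "EsEm'Es" },
};

/* Linear lookup of a token in the exception lexicon; NULL if not found. */
static const char *lookup_exception(const char *token) {
    size_t n = sizeof exception_lexicon / sizeof exception_lexicon[0];
    for (size_t i = 0; i < n; ++i)
        if (strcmp(exception_lexicon[i].grapheme, token) == 0)
            return exception_lexicon[i].phonetic;
    return NULL;
}
```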
- a prosody is generated for the respectively formed sequence of spoken units using the given prosody rules, and the respective sequence of spoken units and the prosody created for it are then used to generate the speech signal, preferably the analog speech signal to be output.
- a speech synthesis device for synthesizing a stored electronic text into an analog speech signal has a text memory for storing the electronic text, as well as a rule memory for storing text analysis rules and for storing prosody rules.
- a lexicon memory is also provided for storing an electronic abbreviation lexicon, an electronic function word lexicon and an electronic exception lexicon.
- the speech synthesis device also has a processor which is set up in such a way that it carries out the method steps described above using the stored text analysis rules and prosody rules as well as the stored electronic dictionaries.
- a telecommunication device with a speech synthesis device is provided.
- Another advantage of the invention is the very easy scalability to increase the achievable quality of the speech synthesis, since the respective electronic lexicons and the rules can be expanded in a very simple manner.
- the spoken units are stored in compressed form, and at least some of the stored compressed spoken units, in particular the compressed spoken units required to form the sequence of spoken units, are decompressed before the respective sequence of spoken units is formed, in particular before the first sequence of spoken units is formed.
- ADPCM (Adaptive Differential Pulse Code Modulation) is preferably used as the compression method.
- Diphones are preferably used as spoken units.
- the method is preferably used in an embedded system, which is why the speech synthesis device is set up as an embedded system according to one embodiment of the invention.
- FIG. 1 is a block diagram of a telecommunications terminal with a speech synthesis device according to an embodiment of the invention
- FIG. 2 is a block diagram showing the individual components integrated in the telecommunications terminal
- Figure 3 is a block diagram showing the individual components for speech synthesis according to an embodiment of the invention.
- Figure 4 is a block diagram showing the components of word processing and prosody control in greater detail
- Figures 5a to 5d are time diagrams showing the individual components of the intonation contour and their additive superimposition
- Figure 6 is a structogram showing the individual steps of the module selection
- Figure 7 is a structogram showing the individual steps of the acoustic synthesis
- FIG. 1 shows a telecommunications terminal 100 with a data display unit 101 for displaying information, an antenna 102 for receiving and/or transmitting radio signals, a loudspeaker 103 for outputting an analog voice signal, and a keypad 104 with input keys.
- the mobile radio telephone 100 is set up for communication in accordance with the GSM standard, alternatively in accordance with the UMTS standard, the GPRS standard or any other suitable mobile radio standard.
- the mobile radio telephone 100 is set up to send and receive textual information, for example SMS messages (Short Message Service) or MMS messages (Multimedia Messaging Service).
- FIG. 2 shows in a block diagram the individual components integrated in the mobile radio telephone 100, in particular a speech synthesis unit explained in detail below, which is integrated in the mobile radio telephone 100 as an embedded system.
- microphone 106 is coupled to an input interface 201.
- a central processor unit 202, a memory 203, an ADPCM coding/decoding unit 204 and an output interface 205 are also provided.
- the individual components are coupled to one another via a computer bus 206.
- the loudspeaker 103 is coupled to the output interface 205.
- the central processor unit 202 is set up in such a way that the method steps described below for speech synthesis are carried out, as well as the method steps necessary for operating the mobile radio telephone, in particular for coding and decoding mobile radio signals.
- the mobile radio telephone 100 is additionally set up for voice recognition.
- a predetermined number of abbreviations customary for the respective language, together with the sequence of spoken units assigned to the respective abbreviation, are stored in the abbreviation lexicon 210.
- a predetermined number of function words and the representations associated with them in the phonetic transcription, in other words the sequence of spoken units assigned to the respective function word, are stored in the function word lexicon 211.
- the following function words are provided in the German language, for example: "for", "under", "with", "on", etc.
- a corresponding mapping to a sequence of spoken units is defined and stored in the exception lexicon 212 for certain predefinable textual units.
- Diphones are used as phonetic units in this exemplary embodiment.
- the diphones used in the context of the speech synthesis are stored in a diphone dictionary 213, which is also stored in the memory 203.
- for the compression of the diphones, an LPC method, a CELP method or the GSM method can be used; in general, any compression method is suitable that achieves a sufficiently high compression even with small signal sections while ensuring a sufficiently small loss of information due to the compression.
- the block diagram 300 in FIG. 3 is used to explain the speech synthesis of a text message which is stored in the memory 203 and is to be output as an analog speech signal.
- the stored electronic text is stored in an electronic file 301 and, in addition to preferably ASCII-coded words, contains special characters or control characters such as, for example, a "new line" control character, a "new paragraph" control character, or a control character for formatting part or all of the electronic text stored in the electronic file 301.
- the electronic text is subjected to different preprocessing rules as part of a word processor (block 302).
- the processed electronic text 303 is then supplied to a module, i.e. a computer program component, for prosody control 304, in which, as will be explained in more detail below, the prosody for the electronic text is generated.
- in the course of the further processing, an ADPCM decoding is carried out using the ADPCM coding/decoding unit 204, as well as a module selection, i.e. a selection of spoken units, according to this exemplary embodiment a selection of the required diphones 307 (block 308).
- the selected diphones 307, i.e. generally the selected spoken units, are supplied to a computer program component for acoustic synthesis (block 309) and combined there into a voice signal to be output; this voice signal is initially digital and is converted digital/analog into an analog voice signal 310, which is supplied to the loudspeaker 103 via the output interface 205 and output to the user of the mobile radio telephone 100.
- FIG. 4 shows the blocks of word processor 302 and prosody control 304 in greater detail.
- a sufficiently long electronic text is stored in the electronic file 301 and is transferred to the processor unit 202 in a complete, contiguous memory area.
- the electronic text has at least one partial sentence, so that an appropriate generation of prosody is made possible.
- in the event that the electronic text transferred from the electronic file 301 is shorter than a partial sentence, i.e. in the event that no punctuation marks are found within the transferred electronic text, the text is treated as a partial sentence and a period is artificially added as a punctuation mark.
- the text preprocessing (block 401) has the function of converting the entered electronic text into the internally used character set. This conversion is necessary because, for example, the German umlauts are not assigned the same codes in all character sets. Control characters are also removed from the text, and line feeds in combination with hyphens are eliminated.
- a character table is provided which encodes format information for each character. The table (not shown), which is also stored in the memory 203, is accessed via the numerical value of the character.
- Control characters or characters that are not included in the table are deleted from the entered electronic text.
- the table is used by the two program components text preprocessing (block 401) and the program component "spelling" (block 408) described below.
- the respective character class is coded in one byte, and the pronunciation form of the character is added as a character string, i.e. as a sequence of spoken units, according to this embodiment as a diphone sequence. Overall, this results in a memory requirement of approximately one kbyte.
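- as a sketch, such a character table could be laid out in C as follows; the concrete layout, the class values and the pronunciation strings are hypothetical illustrations:

```c
/* Sketch of the character table: indexed by the numeric character value,
 * one class byte per character plus the pronunciation form as a string of
 * spoken units; with 256 entries this stays in the region of the one kbyte
 * mentioned in the text. */
typedef struct {
    unsigned char char_class; /* character class, coded in one byte        */
    const char   *pronounce;  /* pronunciation as a diphone/phoneme string */
} CharTableEntry;

static const CharTableEntry char_table[256] = {
    ['A'] = { 1, "a:"   },  /* letter  (illustrative values) */
    ['B'] = { 1, "be:"  },
    ['1'] = { 2, "aIns" },  /* digit   */
    ['+'] = { 3, "plUs" },  /* special */
    /* characters without an entry are deleted from the input text */
};
```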
- the input text 402 filtered by the text preprocessing 401 is then evaluated as part of a grapheme-phoneme conversion (block 403) using a special text analysis rule set, which is stored in the memory 203 and by means of which the various forms of numbers in the filtered input text 402 are recognized and converted (block 404). Since numbers can contain not only sequences of digits but also measurement or currency information, this evaluation is carried out before the further decomposition of the filtered electronic text 402.
- the filtered electronic text with converted numbers 405 is then divided into sub-chains (i.e. words and sentences) by the tokenizer program component (block 406).
- the partial chains are referred to below as tokens.
- the number rules of the number conversion text analysis rules are implemented in such a way that the rule interpreter, which is language-independent, and the rules themselves, which are language-dependent, are strictly separated.
- the determined character string is converted into the sequence of diphones assigned to the respective text analysis rule 208; in other words, the found string is replaced by the rule target. The rule target contains placeholders for the determined numbers, which are filled in by the rules of the second level.
- the number to be converted must first meet one condition, otherwise the next text analysis rule is checked.
- a second condition can be tested, for which the number can be transformed beforehand. Arithmetic operations then generate two numbers that are used in the rule target for the final conversion.
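- for illustration only, a second-level number rule can be sketched in C as follows, under the assumption that each rule carries a condition, an optional pre-transformation and two arithmetic operations whose results fill the two placeholders of the rule target; all names and the example rule are hypothetical:

```c
/* Sketch of a second-level (language-dependent) number rule, evaluated by
 * a language-independent rule interpreter as described above. */
typedef struct {
    int  (*condition)(long n);  /* rule applies only if this holds          */
    long (*transform)(long n);  /* optional change before a second test     */
    long (*op1)(long n);        /* first generated number                   */
    long (*op2)(long n);        /* second generated number                  */
    const char *target;         /* rule target with two number placeholders */
} NumberRule;

static int  is_two_digit(long n) { return n >= 13 && n <= 99; }
static long identity(long n)     { return n; }
static long ones_digit(long n)   { return n % 10; }
static long tens_part(long n)    { return n - n % 10; }

/* German reads 42 as "two and forty": <ones> und <tens> */
static const NumberRule two_digit_rule = {
    is_two_digit, identity, ones_digit, tens_part, "<%ld> und <%ld>"
};
```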
- model rules, i.e. the rules of the first level, and number rules, i.e. the rules of the second level, additionally support a conversion into a standard language form for easier troubleshooting. Arbitrary messages can be generated there in order to be able to follow the exact course of the rule replacement from outside.
- in spelling mode (block 408), a token is converted into a sequence of diphones, each letter being converted separately; the result is converted into the analog voice signal 306 and output to the user.
- word boundaries are detected by the "tokenizer" program component, i.e. individual words are detected on the basis of the white space characters in between. According to the character types, each token is classified either as a word (upper and lower case letters) or as a special format (special characters).
- sentence boundaries are marked at all points at which a punctuation mark followed by a space is detected immediately after a word. If a token that is not a number contains more than one special character, it is converted and output in the analog voice signal by the spelling mode.
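- a minimal C sketch of this token classification and sentence boundary test, with hypothetical helper names, could look as follows:

```c
#include <ctype.h>

/* A token consisting only of upper and lower case letters is a word;
 * anything else is treated as a special format (handled by the number
 * rules or the spelling mode). */
typedef enum { TOK_WORD, TOK_SPECIAL } TokenClass;

static TokenClass classify_token(const char *tok) {
    for (; *tok; ++tok)
        if (!isalpha((unsigned char)*tok))
            return TOK_SPECIAL;
    return TOK_WORD;
}

/* Sentence boundary: punctuation mark directly after a word, followed by
 * a space. */
static int is_sentence_boundary(char c, char next) {
    return (c == '.' || c == '!' || c == '?') && next == ' ';
}
```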
- using the abbreviation lexicon 210 and the function word lexicon 211, those words or expressions contained in the lexica 210, 211 are determined, and the abbreviations or function words found are converted into the corresponding sequence of diphones.
- the structure of the lexicons is the same for all stored entries: the graphemic form of the word, the phonemic form with word accent marks and syllable markers, and the word class.
- the word classes according to this exemplary embodiment are:
- the class of function words contains words that occur very frequently and therefore have a low information content and are rarely accentuated; this property is used in the context of the acoustic synthesis 309, as will be explained in more detail below.
- the word classes are encoded in a byte for later accentuation and assigned to the respective word.
- the remaining words are transcribed using the phonemic text analysis rules, which are structured according to the following scheme: X Y Z → W
- the phonemic text analysis rules are processed as follows:
- Y is substituted by W if it appears to the right of X and to the left of Z in the word to be transcribed.
- X, Z and W can be empty or contain one to five characters or class symbols.
- Class symbols are placeholders for a group of letters or sequences of letters, as shown in the following table:
- N → chen, ler, lein, ling, nis (unstressed derivation suffixes for nouns)
- X and Z can contain the characters "@" and "#", where "@" is a placeholder for any character and "#" represents the word boundary.
- the rules are arranged according to the first letter of the rule body, so that only a part of all rules needs to be searched.
- within these sections, the rules are arranged from the most specific to the most general, so that it is ensured that at least the last rule can always be applied. If a rule is applicable, rule processing is exited, the rule result W is appended to the sequence of phonemes already existing for the current word, and the pointer into the character string to be converted is advanced by the number of characters in the rule body.
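- a minimal C sketch of the application of a single rule X Y Z → W, omitting the class symbols and the "@"/"#" placeholders (contexts are compared literally) and using hypothetical names, could look as follows:

```c
#include <string.h>

typedef struct {
    const char *left;   /* X: left context (may be empty)          */
    const char *body;   /* Y: character string to be transcribed   */
    const char *right;  /* Z: right context (may be empty)         */
    const char *result; /* W: phoneme sequence appended on success */
} PhonemeRule;

/* Checks whether a rule applies at position pos of word. */
static int rule_matches(const PhonemeRule *r, const char *word, size_t pos) {
    size_t llen = strlen(r->left), blen = strlen(r->body);
    if (pos < llen || strncmp(word + pos - llen, r->left, llen) != 0)
        return 0;                                   /* left context X  */
    if (strncmp(word + pos, r->body, blen) != 0)
        return 0;                                   /* rule body Y     */
    return strncmp(word + pos + blen, r->right,
                   strlen(r->right)) == 0;          /* right context Z */
}
```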
- the considerations on how to represent the rule set efficiently when storing it in the memory 203 are based on a rule count of 1254 rules. If all four parts of a rule are stored in a table with a fixed number of rows and columns, one rule per row, the length of the longest overall rule must be used as the table width, in this case 19 bytes. Access to the rules is very easy due to the field structure, but the memory requirement is 23 kbytes.
- alternatively, the rule components are packed tightly in an array, which requires an additional field of pointers with a length of 2500 bytes for access, but results in a total memory requirement of only 15 kbytes. If all transcription attempts have failed, i.e. if the mapping according to the phonemic text analysis rules has not worked either, the token is spelled out by replacing each character with its corresponding phonetic representation and outputting it accordingly. Due to the disproportionate lengthening of the text caused by this (substitution of each character by n new characters), the number of characters that can be spelled per token is limited to a maximum of 10 according to this exemplary embodiment.
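- the packed storage variant can be sketched in C as follows; the example rule ("sch" → "S") is a real German grapheme-phoneme correspondence, but the concrete layout and names are hypothetical:

```c
/* Sketch of the packed storage: all rule parts lie '\0'-separated in one
 * byte array, and an offset field indexes the start of each rule. */
#define NUM_RULES 1254

static const char rule_pool[] =
    /* per rule: left '\0' body '\0' right '\0' result '\0' */
    "\0" "sch\0" "\0" "S\0"
    /* ... remaining rules appended here ... */;

/* One offset per rule plus a terminating offset; with roughly 1250 rules
 * an index of 16-bit offsets accounts for the additional pointer field of
 * about 2500 bytes mentioned above. The first rule occupies bytes 0..7. */
static const unsigned short rule_offset[NUM_RULES + 1] = { 0, 8 /* ... */ };
```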
- after the grapheme-phoneme conversion, the sequence of phonemes is available as a sequence of spoken units.
- for the prosodic processing modules in the context of the prosody control 304, namely accentuation and syllable control (block 409), duration control (block 410) and intonation control (block 411), it is important to know the syllable boundaries and accent positions or accent types, which are determined by means of the computer program component 409.
- Some of the relevant information is already contained in the phoneme sequence of the token, provided that it was generated with the help of one of the dictionaries 210, 211, 212, with the rules for converting numbers and number intervals, or in spelling mode. In this part, the information mentioned is collected from the phoneme sequence.
- Accentuation information is not yet available, so it is generated via further heuristic rules, which are explained in more detail below.
- for this purpose, the phoneme table also stored in the memory 203 is used. The phoneme table contains 49 phonemes and special characters (main and secondary accent, hyphenation character, pauses) as well as classification characteristics (long vowel, short vowel, diphthong, consonant class, etc.).
- syllable kernel types are determined, and the syllable boundary within an intervocalic consonant sequence is determined according to heuristic rules. An accent is assigned to the first syllable of the word with a long vowel or diphthong; if neither of these two syllable kernel types is present, the accent is assigned to the first syllable with a short vowel.
- an output sound length in milliseconds, which is different for each sound class and is stored in the phoneme table, is modified using a set of rules; according to this exemplary embodiment, accent situations, neighboring sounds (co-articulation factors), the position of the sound in the syllable, and the position of the syllable in the word and in the sentence are used as influencing factors. Other suitable criteria can of course also be taken into account. The output sound length can be stretched or shortened via the factors assigned to these influences, a shortening being permitted only down to a minimum duration.
- the duration of the sound is calculated according to the following rule:
- sound duration = k · ((D_inh - D_min) · Prcnt + D_min)
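- a minimal C sketch of this duration rule follows, reading D_inh as the inherent sound length from the phoneme table, D_min as the minimum permitted duration, Prcnt as the percentage factor accumulated from the rule set and k as a global tempo factor; this reading of the symbols is an interpretation of the formula above:

```c
/* Sketch of the duration rule; all durations in milliseconds. */
static double sound_duration_ms(double d_inh, double d_min,
                                double prcnt, double k) {
    double d = k * ((d_inh - d_min) * prcnt + d_min);
    return d < d_min ? d_min : d; /* shortening only down to the minimum */
}
```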
- the model provides a specific duration for each sound as well as the duration of pauses at syntactic boundaries. Phrase boundaries, partial sentence boundaries and paragraph boundaries produce pauses of increasing length.
- using the sound durations determined by the program component duration control (block 410), the determined accentuation information and the sentence type information determined in the grapheme-phoneme conversion 403, a speech melody is calculated for the entire electronic text in the context of the intonation control 411.
- the following model is used for this, which meets the following requirements:
- Phrasal and functional structures are audible (pauses, melody contours),
- intonation contours are composed of linear sub-components (see FIG. 5a to FIG. 5d) by additive superimposition.
- the phrase-based component is formed using the knowledge that the fundamental frequency decreases continuously over every phrase from the beginning to the end of the phrase (declination).
- the interval width of the fundamental frequency movement can be freely selected as the control variable of the model.
- FIG. 5a shows in a time diagram 500 a minimum fundamental frequency 501 and a relative mean fundamental frequency 502 as well as the course 503 of the fundamental frequency over time.
- the knowledge is used that, depending on the type of sentence to be realized (statement, continuation, exclamation, question), at the end of each phrase the declination line is linked to a phrase-typical final movement.
- This movement extends from the position of the last sentence accent in the phrase to the end of the phrase, but at most over the last five syllables of the phrase.
- a first fundamental frequency curve 511 represents a terminal final movement, a second fundamental frequency curve 512 a forward-pointing one, i.e. a continuation sentence, and a third fundamental frequency curve 513 a question.
- an accent-based component is taken into account as a further component of the overall prosody, the knowledge being used that, in the event that a syllable bears a sentence accent, the fundamental frequency is raised over the entire syllable.
- the accent stroke can be freely selected as a control variable for the model.
- a first accent contour 521 consists of three areas: in a first, ascending area (a first time area 522) the fundamental frequency is raised from the declination line to the accent stroke 523; it is kept there during a second time period 524 and is only returned to the declination line in a third time period 525.
- a second accent contour 526 is formed from just two such areas.
- FIG. 5d shows a total prosody 531 in a fourth time diagram 530, the total prosody representing the additive superimposition of the individual components shown in FIGS. 5a to 5c.
- the total contour 531 is assigned to the phonemes involved, i.e. each phoneme in the word sequence for which the overall melody was determined is assigned a value corresponding to the determined overall prosody.
- the intonation contour is then reproduced by linearly interpolating between the phoneme-based reference points.
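- a minimal C sketch of this linear interpolation between phoneme-based reference points, with hypothetical names (time in milliseconds, fundamental frequency in Hz), could look as follows:

```c
/* One reference point of the intonation contour. */
typedef struct { double t_ms; double f0_hz; } F0Point;

/* Returns the fundamental frequency at time t_ms by linear interpolation
 * between the n reference points pts (assumed sorted by time, n >= 1). */
static double f0_at(const F0Point *pts, int n, double t_ms) {
    if (t_ms <= pts[0].t_ms)     return pts[0].f0_hz;
    if (t_ms >= pts[n - 1].t_ms) return pts[n - 1].f0_hz;
    for (int i = 1; i < n; ++i)
        if (t_ms <= pts[i].t_ms) {
            double a = (t_ms - pts[i - 1].t_ms) /
                       (pts[i].t_ms - pts[i - 1].t_ms);
            return pts[i - 1].f0_hz + a * (pts[i].f0_hz - pts[i - 1].f0_hz);
        }
    return pts[n - 1].f0_hz; /* not reached */
}
```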
- the accentuation takes place on the first long vowel or, if none can be found, on the first short vowel of the word.
- otherwise, the penultimate syllable is considered. If the penultimate syllable can be emphasized, that is, it is not a "schwa syllable", it is emphasized; otherwise, in each step the syllable under consideration is moved forward towards the beginning of the word until a syllable that can be emphasized has been determined or the beginning of the word is reached.
- the syllables are differentiated into the phonetic categories “heavy syllables”, “light syllables” and “Schwa syllables” according to the definition given in [3] and [4].
- Syllables that have no coda are basically light syllables. If the coda consists of two or more consonants, it is a heavy syllable.
- if the coda consists of exactly one consonant, it is decided on the basis of the syllable nucleus whether it is a light syllable (with a short vowel as the syllable nucleus) or a heavy syllable (with a long vowel or diphthong in the syllable nucleus).
- the syllable sound (onset) plays no role in determining the syllable weight.
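- a minimal C sketch of this syllable weight decision, assuming the syllable nucleus and coda have already been determined, could look as follows:

```c
typedef enum { LIGHT_SYLLABLE, HEAVY_SYLLABLE, SCHWA_SYLLABLE } SyllableWeight;

typedef struct {
    int is_schwa;        /* nucleus is a schwa                     */
    int nucleus_is_long; /* long vowel or diphthong in the nucleus */
    int coda_consonants; /* number of consonants in the coda       */
} Syllable;

static SyllableWeight syllable_weight(const Syllable *s) {
    if (s->is_schwa)              return SCHWA_SYLLABLE;
    if (s->coda_consonants == 0)  return LIGHT_SYLLABLE;  /* no coda      */
    if (s->coda_consonants >= 2)  return HEAVY_SYLLABLE;  /* complex coda */
    /* exactly one coda consonant: decided by the syllable nucleus */
    return s->nucleus_is_long ? HEAVY_SYLLABLE : LIGHT_SYLLABLE;
}
```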
- the intensity parameter is generated by preprocessing and is used to influence the dynamic range (and thus the naturalness) of the speech-synthesized signal.
- s_p(i) denotes the i-th sample of the p-th period of the spoken unit u to be synthesized. The desired intensity I_p is recalculated for each period p of the spoken unit u by linearly interpolating the target intensities of the speech signal, which are specified at support points, between these support points.
- the mode of operation of the intensity control is thus comparable to the mode of operation of the basic frequency control as described above.
- the respective support points of the intensity control and the fundamental frequency control can be freely selected independently of one another.
- the target intensities are given in the unit [dB]. A target intensity of 0 dB does not change the sample values of the signal components.
- the target intensities to be set specify the relative change in intensity applied to the inventory units. It is therefore advantageous to use an inventory with balanced intensity profiles.
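- purely as an illustration, the intensity control can be sketched in C as follows, converting an interpolated target intensity in dB into a linear gain (a target intensity of 0 dB leaves the samples unchanged, as stated above); all names are hypothetical:

```c
#include <math.h>

/* Applies the interpolated target intensity of one period p to its
 * samples s_p(i); 16-bit samples are clipped to the valid range. */
static void apply_intensity(short *period, int num_samples, double target_db) {
    double gain = pow(10.0, target_db / 20.0);  /* 0 dB -> gain 1.0 */
    for (int i = 0; i < num_samples; ++i) {
        double v = period[i] * gain;
        if (v > 32767.0)  v = 32767.0;          /* clip to 16-bit range */
        if (v < -32768.0) v = -32768.0;
        period[i] = (short)v;
    }
}
```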
- the module selection 308 shown in FIG. 3 is explained in more detail below.
- the function of the module selection is to determine and select from the inventory or the inventory description, depending on the symbol sequence supplied by the preprocessing (phoneme sequence or syllable sequence), the suitable building blocks for the acoustic synthesis, according to the exemplary embodiment the suitable diphones.
- the sequence of building blocks generated in this way is provided with the additional prosodic information explained above (sound duration, loudness, fundamental frequency curve), which was generated by the preprocessing.
- Each element of the array contains the information for a symbol (phoneme, syllable, ).
- an array of the data structure SM is generated by the module selection and transferred to the acoustic synthesis.
- the data structure SM has the following structure:
- the component unit contains the name of the building block; the component anz indicates the number of symbols (phonemes, syllables, ...) that occur in the building block.
- the array of the data structure INV contains the description data for an inventory. Before starting, the array is read from the corresponding binary file of the inventory to be used.
- the structure INV has the following structure:
-
      struct INV {
          char  canon[MAX_UNIT_LENGTH];
          long  startBin;
          int   num;
          long  startPm;
          int   face;
          int  *lastPer;
      };
- Each element of the INV array contains the data of a spoken block.
- the elements are sorted by the starting symbol of the element canon of the structure, by the number of symbols contained in the block (phonemes, syllables, ...) and by the length of the element sequence canon of the structure (in this order). This enables an effective search for the required component in the array.
- FIG. 6 shows in a structure diagram 600 the procedure for selecting the blocks according to the exemplary embodiment of the invention.
- a pause of length 0 is inserted before the first element, which is identified by the pointer *SMPROS. This pause is used to find the start building block in the inventory.
- the variable i is then initialized to the value 0 (step 602), and the following steps are carried out in a first iteration loop 603 for all elements of the respective SMPROS structure (all sounds): the longest sound sequence in the inventory that matches the element sequence at the current position i of the structure is determined (step 604).
- in step 606, the building block is added to the data structure SM, and the variable i is increased by the value anz, the maximum number of symbols whose symbol sequence matches the symbol sequence in *(SMPROS + i + j).
- it is also checked whether there are substitute sounds for the sounds contained in the building block (test step 607); in the event that such a substitute sound exists, the sound is replaced (step 608). Otherwise, the value of the variable i is increased by the value 1 (step 609), and the iteration loop of steps 604 to 609 is repeated for the new value of the variable i until all elements of the SMPROS structure have been checked.
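- a minimal, self-contained C sketch of the greedy longest-match selection of steps 604 to 609 follows; the inventory contents are purely illustrative and printf output stands in for the SM data structure:

```c
#include <stdio.h>
#include <string.h>

/* Inventory entries sorted from longest to shortest symbol sequence, so
 * the first hit is automatically the longest match. */
static const char *inventory[] = { "aIn", "aI", "na", "a", "I", "n" };
enum { INV_SIZE = sizeof inventory / sizeof inventory[0] };

static void select_units(const char *phonemes) {
    size_t i = 0, len = strlen(phonemes);
    while (i < len) {
        int chosen = -1;
        for (int u = 0; u < INV_SIZE; ++u)
            if (strncmp(phonemes + i, inventory[u],
                        strlen(inventory[u])) == 0) {
                chosen = u;                      /* step 604: longest match */
                break;
            }
        if (chosen >= 0) {
            printf("unit: %s\n", inventory[chosen]); /* step 606: add to SM */
            i += strlen(inventory[chosen]);          /* advance by anz      */
        } else {
            printf("substitute or spell: %c\n",
                   phonemes[i]);                     /* steps 607/608       */
            i += 1;                                  /* step 609            */
        }
    }
}
```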
- the acoustic synthesis 309 is explained in more detail below.
- the function of the acoustic synthesis 309 is to concatenate the signal sections according to the specification of the module selection.
- the basic frequency and the duration of the sound are manipulated using the PSOLA algorithm.
- the input variable of the acoustic synthesis 309 is the SM structure, which is supplied by the program component "module selection".
- the SM structure contains the building blocks to be linked and the information on the fundamental frequency and duration, which were generated by the preprocessing.
- the individual steps of the acoustic synthesis are shown in the structogram 700 in FIG. 7. In a first step, it is checked whether the sound j represents a pause (step 702).
- if so, the pause is synthesized as a speech signal (step 703).
- otherwise, the desired sound duration is calculated in step 705.
- variable k is then assigned the value of the start period of the sound j (step 706).
- in steps 707 and 708, a support point with the next target fundamental frequency is determined. The desired period duration is then calculated on the basis of the interpolated fundamental frequency contour (step 709).
- it is now checked whether the duration synthesized so far is less than or equal to the proportionate desired duration (step 710); if this condition is fulfilled, the period is synthesized with the desired period duration according to the PSOLA algorithm (step 711). It is then checked again whether the duration synthesized so far is less than or equal to the proportionate desired duration (step 712).
- the value of the variable k is incremented by the value 1 (step 713).
- the fundamental frequency contour is determined from the desired period durations that are achieved using the PSOLA algorithm.
- the specified duration of sounds is approximately achieved by introducing and omitting periods.
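- a minimal C sketch of this duration manipulation by repeating and omitting periods follows; it deliberately omits the overlap-add windowing and fundamental frequency manipulation of the actual PSOLA algorithm, and all names are hypothetical:

```c
#include <string.h>

/* Copies pitch periods into out, repeating or skipping source periods so
 * that the desired sound duration is approximately reached. Returns the
 * number of samples written. */
static int synthesize_sound(const short *const *periods, const int *period_len,
                            int num_periods, double desired_ms,
                            double sample_rate, short *out, int out_cap) {
    double desired_samples = desired_ms * sample_rate / 1000.0;
    if (desired_samples < 1.0)
        return 0;
    int written = 0, k = 0;
    while (written < (int)desired_samples &&
           written + period_len[k] <= out_cap) {
        memcpy(out + written, periods[k],
               (size_t)period_len[k] * sizeof(short));
        written += period_len[k];
        /* map progress through the target duration back onto the source
         * periods: a slow advance repeats periods (stretching), a fast
         * advance skips periods (shortening) */
        int next = (int)(written / desired_samples * num_periods);
        k = next < num_periods ? next : num_periods - 1;
    }
    return written;
}
```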
- the signal sections, i.e. the building blocks, are stored one after the other in the memory (short *).
- the start samples of the building blocks, the number of periods, the start samples of the periods etc. are stored in the structure INV; the information about the number of samples of each period is stored in the PERIOD structure.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP03757683A EP1554715B1 (de) | 2002-09-23 | 2003-09-23 | Verfahren zur rechnergestützten sprachsynthese eines gespeicherten elektronischen textes zu einem analogen sprachsignal, sprachsyntheseeinrichtung und telekommunikationsgerät |
DE50312627T DE50312627D1 (de) | 2002-09-23 | 2003-09-23 | Verfahren zur rechnergestützten sprachsynthese eines gespeicherten elektronischen textes zu einem analogen sprachsignal, sprachsyntheseeinrichtung und telekommunikationsgerät |
US11/086,801 US7558732B2 (en) | 2002-09-23 | 2005-03-22 | Method and system for computer-aided speech synthesis |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE10244166.9 | 2002-09-23 | ||
DE10244166 | 2002-09-23 |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/086,801 Continuation US7558732B2 (en) | 2002-09-23 | 2005-03-22 | Method and system for computer-aided speech synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2004029929A1 true WO2004029929A1 (de) | 2004-04-08 |
Family
ID=32038177
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/DE2003/003158 WO2004029929A1 (de) | 2002-09-23 | 2003-09-23 | Verfahren zur rechnergestützten sprachsynthese eines gespeicherten elektronischen textes zu einem analogen sprachsignal, sprachsyntheseeinrichtung und telekommunikationsgerät |
Country Status (4)
Country | Link |
---|---|
EP (1) | EP1554715B1 (de) |
CN (1) | CN100354928C (de) |
DE (1) | DE50312627D1 (de) |
WO (1) | WO2004029929A1 (de) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE102013219828B4 (de) * | 2013-09-30 | 2019-05-02 | Continental Automotive Gmbh | Verfahren zum Phonetisieren von textenthaltenden Datensätzen mit mehreren Datensatzteilen und sprachgesteuerte Benutzerschnittstelle |
CN105895076B (zh) * | 2015-01-26 | 2019-11-15 | 科大讯飞股份有限公司 | 一种语音合成方法及系统 |
CN105895075B (zh) * | 2015-01-26 | 2019-11-15 | 科大讯飞股份有限公司 | 提高合成语音韵律自然度的方法及系统 |
CN108231058A (zh) * | 2016-12-17 | 2018-06-29 | 鸿富锦精密电子(天津)有限公司 | 语音辅助测试系统及语音辅助测试方法 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP1217610A1 (de) * | 2000-11-28 | 2002-06-26 | Siemens Aktiengesellschaft | Verfahren und System zur multilingualen Spracherkennung |
JP2002169581A (ja) * | 2000-11-29 | 2002-06-14 | Matsushita Electric Ind Co Ltd | 音声合成方法およびその装置 |
2003
- 2003-09-23 CN CNB038226553A patent/CN100354928C/zh not_active Expired - Fee Related
- 2003-09-23 WO PCT/DE2003/003158 patent/WO2004029929A1/de active Application Filing
- 2003-09-23 DE DE50312627T patent/DE50312627D1/de not_active Expired - Lifetime
- 2003-09-23 EP EP03757683A patent/EP1554715B1/de not_active Expired - Lifetime
Non-Patent Citations (3)
Title |
---|
MACCHI M: "Issues in text-to-speech synthesis", INTELLIGENCE AND SYSTEMS, 1998. PROCEEDINGS., IEEE INTERNATIONAL JOINT SYMPOSIA ON ROCKVILLE, MD, USA 21-23 MAY 1998, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 21 May 1998 (1998-05-21), pages 318 - 325, XP010288887, ISBN: 0-8186-8548-4 * |
MOBERG M ET AL: "Optimizing speech synthesizer memory footprint through phoneme set reduction", PROCEEDINGS OF 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS (CAT. NO.02EX555), PROCEEDINGS OF 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, SANTA MONICA, CA, USA, 11-13 SEPT. 2002, 2002, Piscataway, NJ, USA, IEEE, USA, pages 171 - 174, XP002267880, ISBN: 0-7803-7395-2 * |
VAN DER VRECKEN O ET AL: "New techniques for the compression of synthesizer databases", CIRCUITS AND SYSTEMS, 1997. ISCAS '97., PROCEEDINGS OF 1997 IEEE INTERNATIONAL SYMPOSIUM ON HONG KONG 9-12 JUNE 1997, NEW YORK, NY, USA,IEEE, US, 9 June 1997 (1997-06-09), pages 2641 - 2644, XP010236271, ISBN: 0-7803-3583-X * |
Also Published As
Publication number | Publication date |
---|---|
EP1554715B1 (de) | 2010-04-14 |
DE50312627D1 (de) | 2010-05-27 |
CN100354928C (zh) | 2007-12-12 |
EP1554715A1 (de) | 2005-07-20 |
CN1685396A (zh) | 2005-10-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A1; Designated state(s): CN US |
AL | Designated countries for regional patents | Kind code of ref document: A1; Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR |
121 | Ep: the epo has been informed by wipo that ep was designated in this application | |
WWE | Wipo information: entry into national phase | Ref document number: 2003757683; Country of ref document: EP |
WWE | Wipo information: entry into national phase | Ref document number: 11086801; Country of ref document: US |
WWE | Wipo information: entry into national phase | Ref document number: 20038226553; Country of ref document: CN |
WWP | Wipo information: published in national office | Ref document number: 2003757683; Country of ref document: EP |