CN101685633A - Voice synthesizing apparatus and method based on rhythm reference - Google Patents


Info

Publication number
CN101685633A
Authority
CN
China
Prior art keywords
parameter
rhythm
text
synthesized
nature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810166002A
Other languages
Chinese (zh)
Inventor
郭庆
陆应亮
王彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN200810166002A
Publication of CN101685633A

Abstract

The invention provides a speech synthesis apparatus and method based on a prosody reference. The apparatus comprises a prosodic parameter acquiring part and a sound composing part. The prosodic parameter acquiring part obtains natural or near-natural prosodic parameters either by analyzing a recording of the text to be synthesized read aloud by a human speaker (or obtained in other ways), or by analyzing a prosodic parameter annotation file produced by labeling the text with prosodic parameters according to a predetermined annotation standard. Taking these natural or near-natural prosodic parameters as a reference, the sound composing part selects corresponding speech synthesis units for the text from a pre-recorded voice library and concatenates them to generate a synthesized speech file corresponding to the text. With this apparatus and method, highly natural, expressive synthesized speech whose cadence closely approaches that of natural speech can be generated according to the user's requirements.

Description

Speech synthesis device and method based on a prosody reference
Technical field
The present invention relates to an apparatus and method for performing speech synthesis based on a prosody reference. More particularly, it relates to an apparatus and method that synthesize speech with high naturalness by taking as a reference the prosodic features extracted from natural speech, or from a prosodic feature annotation file produced according to a specific standard.
Background technology
Speech synthesis (Text-To-Speech, abbreviated TTS) is the technology of converting text into speech; specifically, it is a technology that converts arbitrary written information into standard, fluent speech. Speech synthesis draws on several advanced fields, including natural language processing, prosody, speech signal processing, and auditory perception, and spans disciplines such as acoustics, linguistics, and digital signal processing. It is a cutting-edge technology in the field of Chinese information processing.
Speech synthesis technology can be widely applied in industries such as telecommunications, finance, electric power, postal services, and government. It lets users send and receive e-mail with ease, obtain stock quotations, and check weather, traffic, and road conditions; in the near future it will provide even more comprehensive and valuable application services.
A speech synthesis system is intended to synthesize speech with high intelligibility and high naturalness.
In general, a speech synthesis system first chooses a basic synthesis unit, for example the phoneme in English, or the semi-syllable or tonal syllable in Chinese. Then, guided by the predictions of a prosody model (duration, fundamental frequency, and so on), it searches a pre-recorded, well-annotated standard voice library for the globally optimal synthesis units, uses a specific waveform generation technique (such as the TD-PSOLA algorithm) to adjust and modify the prosodic characteristics of the chosen speech segments, and finally concatenates them into satisfactory speech.
After more than a decade of research, the synthetic speech quality of current systems has reached a practical level: intelligibility satisfies the actual needs of applications, but naturalness is still not high enough, and a large gap remains compared with natural human speech.
Most current speech synthesis systems adopt a concatenative synthesis technique based on a large-scale speech database. That is, guided by predicted prosodic parameters, probabilistic statistical methods are used to search the pre-recorded voice library for the synthesis units that are globally optimal for the input text, and the units are then concatenated with waveform adjustment according to the predicted prosodic parameters.
In general, a speech synthesis system comprises three modules: a text analysis module, a prosodic parameter prediction module, and a back-end synthesis module. For Chinese, the text analysis module performs word segmentation, part-of-speech tagging, phonetic notation, prosodic structure prediction, and so on. The prosodic parameter prediction module predicts acoustic parameters such as duration, fundamental frequency, and energy on the basis of the text analysis result. The back-end synthesis module generally consists of a unit selection submodule and a waveform generation submodule: guided by the prosodic parameters, the unit selection submodule searches the voice library by probabilistic statistical methods for the synthesis units that are globally optimal for the input text, and the waveform generation submodule uses a specific waveform generation technique (such as the TD-PSOLA algorithm) to adjust and modify the prosodic characteristics of the chosen speech segments, finally concatenating them into satisfactory speech.
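As a rough illustration, the three-module pipeline described above can be sketched as follows. Every function body here is a toy stand-in, and the names and per-character "phonemes" are assumptions for illustration, not the structure of any particular system.

```python
# Toy sketch of the conventional three-module TTS pipeline: text
# analysis -> prosodic parameter prediction -> back-end synthesis.

def analyze_text(text):
    """Front end: stands in for segmentation, POS tagging, notation."""
    return [{"phoneme": ch, "pos": "x"} for ch in text if not ch.isspace()]

def predict_prosody(phonemes):
    """Predict acoustic parameters (duration, F0, energy) per phoneme."""
    return [{"duration_ms": 120, "f0_hz": 200.0, "energy": 1.0}
            for _ in phonemes]

def synthesize(phonemes, prosody):
    """Back end: stands in for unit selection and waveform generation."""
    return [(p["phoneme"], q["duration_ms"])
            for p, q in zip(phonemes, prosody)]

def tts(text):
    phonemes = analyze_text(text)
    prosody = predict_prosody(phonemes)
    return synthesize(phonemes, prosody)

units = tts("ni hao")  # five toy "phonemes": n, i, h, a, o
```

The point of the sketch is only the data flow: the prosody module sits between analysis and synthesis, which is exactly the stage the invention replaces with a natural reference.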
Duration (phoneme length) is one of the most important prosodic features, and it matters greatly for the perceived naturalness of synthetic speech. Variation in duration helps listeners recognize the phoneme itself and also helps them locate word and phrase boundaries in continuous speech, thereby improving both the naturalness and the intelligibility of the speech.
Fundamental frequency is also one of the most important prosodic features. It is especially important for Chinese, which is a tonal language, and it matters greatly for both the perceived naturalness and the intelligibility of synthetic speech.
In natural continuous speech, the duration and fundamental frequency of a phoneme are highly correlated with its surrounding context. Many contextual factors, such as the identity of the phoneme itself, the identities of the preceding and following phonemes, the grade of the surrounding prosodic boundaries, and whether the phoneme is stressed, all constrain its duration and fundamental frequency. The basic goal of duration prediction and fundamental frequency prediction research is to describe the influence of these contextual factors on phoneme duration and fundamental frequency, thereby improving the naturalness of the speech synthesis system.
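A context-conditioned duration model of the kind described above can be caricatured as a lookup table over contextual factors with back-off. All feature names and millisecond values here are invented for illustration; real systems fit such tables or decision trees from annotated speech data.

```python
# Toy duration model conditioned on (phone, stressed, boundary grade),
# backing off to a phone-only mean, then a global mean.

CONTEXT_MEAN_MS = {
    ("a", True, "intonation-phrase"): 180,
    ("a", True, "prosodic-word"): 150,
    ("a", False, "prosodic-word"): 110,
}
PHONE_MEAN_MS = {"a": 120, "n": 70}
GLOBAL_MEAN_MS = 100

def predict_duration(phone, stressed, boundary):
    key = (phone, stressed, boundary)
    if key in CONTEXT_MEAN_MS:      # full context seen in "training"
        return CONTEXT_MEAN_MS[key]
    # back-off: phone mean, then global mean for unseen phones
    return PHONE_MEAN_MS.get(phone, GLOBAL_MEAN_MS)
```

The back-off chain mirrors the problem the text raises next: when training data is limited, many contexts are unseen and the model falls back to coarse averages, which flattens the predicted prosody.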
However, prosodic parameter prediction is limited in many respects: besides the problems inherent in the modeling itself, it is constrained by limited training data and by the accuracy of the front-end text analysis, among other factors. Although various prediction techniques attempt to account for the many phenomena present in coarticulation and intonation, the prosodic rhythm of synthetic speech still cannot match that of natural speech.
The critical defect of current speech synthesis systems is that they attend to local details while neglecting the whole, which makes the synthesized sentences flat and lifeless and has prevented current systems from being widely adopted in markets such as audiobooks.
Today's audiobook market relies entirely on recordings read aloud by real people. It is very difficult to find a reader who combines a beautiful voice with expressive, emotionally engaged intonation, and hiring a professional announcer to do the recording is bound to be very costly.
Generally, a traditional speech synthesis system first analyzes the text (word segmentation, part-of-speech tagging, numeric and symbol handling, phonetic notation, prosodic structure analysis, etc.), then predicts prosodic parameters such as duration, fundamental frequency, and energy on that basis, then, guided by these parameters, searches the pre-recorded voice library for the globally optimal synthesis units, and finally concatenates them with waveform adjustment. Because many unavoidable problems remain in text analysis and prosody prediction, traditional systems cannot accurately grasp the content of the text or the prediction of prosodic parameters, cannot control the joins between synthesis units well, and cannot control the intonation of the synthesized speech well, so users ultimately cannot obtain satisfactory speech. For wide applications such as audiobook production, which needs both low cost and synthesized speech with a pleasant timbre and natural prosody, a speech synthesis system that can synthesize speech with high naturalness is urgently needed.
The following documents concern research in this area; they are incorporated herein by reference as if fully set forth here.
[1] Meron and Joram, US Patent Publication No. 6,829,581, July 31, 2001, "Method for prosody generation by unit selection from an imitation speech database";
[2] Baraff and David R., US Patent Publication No. 6,795,807, August 17, 2000, "Method and means for creating prosody in speech regeneration for laryngectomees";
[3] Holm, Frode, Hata and Kazue, US Patent Publication No. 6,260,016, November 25, 1998, "Speech synthesis employing prosody templates";
[4] Holm, Frode, Hata and Kazue, US Patent Publication No. 6,185,533, March 15, 1999, "Generation and synthesis of prosody templates";
[5] Shih, C.L., "The Prosodic Domain of Tone Sandhi in Mandarin Chinese", PhD Dissertation, UC San Diego, 1986;
[6] Chu M. and Qian Y., "Locating boundaries for prosodic constituents in unrestricted Mandarin texts", Journal of Computational Linguistics and Chinese Language Processing, 6(1), 61-82, 2001;
[7] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones", Speech Communication, 9:453-467, 1990;
[8] Qing Guo, Nobuyuki Katae, "Statistical Prosody Generation in Mandarin TTS System", OCOCOSDA 2007;
[9] Qing Guo, Nobuyuki Katae, Hao Yu, Hitoshi Iwamida, "Decision Tree based Duration Prediction in Mandarin TTS System", Journal of Chinese Language and Computing;
[10] Guo Qing, Nobuyuki Katae, Yu Hao, Hitoshi Iwamida, "High Quality Prosody Generation in a Text-to-speech System", Journal of Chinese Information Processing, Vol. 22 No. 2: 110-115, 2008;
[11] Qing Guo, Jie Zhang, Nobuyuki Katae, "Prosodic Word Grouping with Global Probability Estimation Method", Speech Prosody, 2008.
Summary of the invention
The present invention has been made in view of the above problems of traditional speech synthesis techniques. An object of the present invention is to provide a speech synthesis system based on a prosody reference, which cleverly bypasses problems that are hard to solve in traditional speech synthesis, such as word segmentation and semantic analysis in text analysis and prosodic parameter prediction, and which can obtain good synthetic speech with a beautiful timbre and a prosodic rhythm very close to natural speech. With the present invention, speech files whose intonation closely approaches natural speech can be generated according to the user's requirements. By combining the standard, pleasant timbre of the voice library with an expressive reading by someone whose own timbre or pronunciation may be poor, anyone's reading can be used to produce synthetic speech works that are both pleasant in timbre and expressive in intonation. The production cost of audiobooks can thus be reduced significantly, and audiobooks with a consistent voice and rich prosodic rhythm can be mass-produced.
To achieve these objects, according to a first aspect of the invention, there is provided a speech synthesis apparatus that performs speech synthesis based on a prosody reference, comprising: a prosodic parameter acquisition unit, which obtains natural prosodic parameters or approximately natural prosodic parameters by analyzing a recording file obtained by having a person read aloud the text to be synthesized, or by analyzing a prosodic parameter annotation file obtained by annotating the text with prosodic parameters according to a predetermined annotation standard; and a sound composition unit, which, taking the natural or approximately natural prosodic parameters as a reference, selects corresponding speech synthesis units for the text from a pre-recorded voice library and concatenates them to produce a synthetic speech file corresponding to the text.
According to a second aspect of the invention, there is provided the speech synthesis apparatus of the first aspect, wherein the sound composition unit comprises: a voice unit selection unit, which, taking the natural or approximately natural prosodic parameters as a reference, selects from the pre-recorded voice library the speech synthesis units that are globally optimal for the text to be synthesized; and a waveform generation unit, which, based on those prosodic parameters, concatenates the selected speech synthesis units and performs waveform adjustment on the resulting speech, to obtain a synthetic speech file with high naturalness corresponding to the text.
According to a third aspect of the invention, there is provided the speech synthesis apparatus of the second aspect, wherein the prosodic parameter acquisition unit comprises: a recording unit, which obtains a recording file of the text to be synthesized by having a person read it aloud; and a prosodic parameter extraction unit, which obtains prosodic parameters, including duration, fundamental frequency, and energy, from the waveform data of the recording file.
According to a fourth aspect of the invention, there is provided the speech synthesis apparatus of the second aspect, wherein the prosodic parameter acquisition unit comprises: a prosodic parameter annotation unit, which, with reference to a knowledge base defining prosodic parameter annotation rules, annotates the text to be synthesized with prosodic parameters according to the predetermined annotation standard, to obtain a prosodic parameter annotation file; and a prosodic parameter generation unit, which analyzes the annotation file to obtain approximate prosodic parameters, including duration, fundamental frequency, and energy.
According to a fifth aspect of the invention, there is provided the speech synthesis apparatus of the first aspect, wherein the recording file of the text to be synthesized is obtained by means of electronic-cartoon-style pronunciation.
According to a sixth aspect of the invention, there is provided a method of performing speech synthesis based on a prosody reference, comprising the following steps: a prosodic parameter acquisition step of obtaining natural prosodic parameters or approximately natural prosodic parameters by analyzing a recording file obtained by having a person read aloud the text to be synthesized, or by analyzing a prosodic parameter annotation file obtained by annotating the text with prosodic parameters according to a predetermined annotation standard; and a sound composition step of, taking the natural or approximately natural prosodic parameters as a reference, selecting corresponding speech synthesis units for the text from a pre-recorded voice library and concatenating them to produce a synthetic speech file corresponding to the text.
According to a seventh aspect of the invention, there is provided the method of the sixth aspect, wherein the sound composition step comprises: taking the natural or approximately natural prosodic parameters as a reference, selecting from the pre-recorded voice library the speech synthesis units that are globally optimal for the text to be synthesized; and, based on those prosodic parameters, concatenating the selected units and performing waveform adjustment on the resulting speech, to obtain a synthetic speech file with high naturalness corresponding to the text.
According to an eighth aspect of the invention, there is provided the method of the seventh aspect, wherein the prosodic parameter acquisition step comprises: obtaining a recording file of the text to be synthesized by having a person read it aloud; and obtaining prosodic parameters, including duration, fundamental frequency, and energy, from the waveform data of the recording file.
According to a ninth aspect of the invention, there is provided the method of the seventh aspect, wherein the prosodic parameter acquisition step comprises: with reference to a knowledge base defining prosodic parameter annotation rules, annotating the text to be synthesized with prosodic parameters according to the predetermined annotation standard, to obtain a prosodic parameter annotation file; and analyzing the annotation file to obtain approximate prosodic parameters, including duration, fundamental frequency, and energy.
According to a tenth aspect of the invention, there is provided the method of the sixth aspect, wherein the recording file of the text to be synthesized is obtained by means of electronic-cartoon-style pronunciation.
According to an eleventh aspect of the invention, there is provided a computer program comprising computer instruction code which, when the program is loaded onto a computer and its instruction code is executed by that computer, implements the method of performing speech synthesis based on a prosody reference according to any of the sixth to tenth aspects described above.
According to a twelfth aspect of the invention, there is provided a computer-readable recording medium carrying the computer program of the eleventh aspect, readable by a computer so that the computer program can be loaded onto the computer and its instruction code executed by the computer, thereby implementing the method of performing speech synthesis based on a prosody reference according to any of the sixth to tenth aspects described above.
With the speech synthesis apparatus and method of the above aspects, the computer program implementing the method, and the computer-readable recording medium carrying that program, synthetic speech whose intonation closely approaches natural speech can be generated according to the user's requirements, thereby improving the naturalness of the synthesized speech.
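A minimal sketch of the two-part structure summarized above may help: a prosodic parameter acquisition part (here fed by a prosody annotation file) supplies reference parameters to a sound composition part that selects and concatenates units from a pre-recorded voice library. The annotation format ("phone duration_ms f0_hz"), the single-candidate cost, and all names are assumptions for illustration, not the patent's implementation.

```python
# Sketch: acquire natural prosody as a reference, then compose sound
# by picking, per phone, the library unit closest to that reference.

def prosody_from_markup(lines):
    """Parse an assumed "phone duration_ms f0_hz" annotation format."""
    params = []
    for line in lines:
        phone, dur, f0 = line.split()
        params.append({"phone": phone, "duration": int(dur), "f0": float(f0)})
    return params

def compose(prosody, voice_db):
    """For each phone, choose the voice-library unit whose duration is
    closest to the natural reference duration, then concatenate."""
    out = []
    for target in prosody:
        units = voice_db[target["phone"]]
        best = min(units, key=lambda u: abs(u["duration"] - target["duration"]))
        out.append((target["phone"], best["duration"]))
    return out

voice_db = {"n": [{"duration": 60}],
            "a": [{"duration": 100}, {"duration": 160}]}
reference = prosody_from_markup(["n 70 220", "a 150 210"])
speech = compose(reference, voice_db)
```

The key contrast with the traditional pipeline is that `reference` comes from analysis of an actual reading or annotation rather than from statistical prediction.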
The above summary and the following detailed description are both exemplary descriptions of the invention, not limitations of it. After reading this disclosure, those skilled in the art will readily conceive of embodiments in various other forms based on the spirit of the invention. As long as such embodiments include the following technical features, they fall within the scope of the invention: analyzing a recording file obtained by having a real person read the text to be synthesized aloud, or obtained in some other way (for example, by electronic-cartoon pronunciation), or a prosodic parameter annotation file obtained by annotating the text with reference to a prosodic-parameter-annotation-rule knowledge base according to a predetermined standard, so as to obtain natural or approximate prosodic parameters; and using these parameters as a reference against the pre-recorded voice library to synthesize a synthetic speech file with high naturalness. The scope of protection of the invention is specifically defined by the appended claims.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Description of drawings
The accompanying drawings are included to provide a further understanding of the invention; they constitute a part of the specification, illustrate preferred embodiments of the invention, and together with the description serve to explain the principles of the invention. Throughout the drawings, identical elements are denoted by identical reference numerals. In the drawings:
Fig. 1 is a block diagram illustrating a configuration example of the speech synthesis system according to the first embodiment of the invention;
Fig. 2 is a block diagram illustrating a configuration example of the natural prosodic parameter acquisition unit in the speech synthesis system according to the first embodiment of the invention;
Fig. 3 is a block diagram illustrating a configuration example of the sound composition unit in the speech synthesis system according to the first embodiment of the invention;
Fig. 4 is a block diagram illustrating a configuration example of the speech synthesis system according to the second embodiment of the invention;
Fig. 5 is a block diagram illustrating a configuration example of the approximate natural prosodic parameter acquisition unit in the speech synthesis system according to the second embodiment of the invention; and
Fig. 6 is a diagram showing an example of a prosodic-parameter-annotation-rule knowledge base.
Embodiment
These and other aspects of the invention will become apparent from the following description and drawings, which specifically disclose certain embodiments of the invention and indicate some of the ways in which its principles may be carried out. It should be understood, however, that the scope of the invention is not limited thereby; rather, the invention includes all changes, modifications, and equivalents falling within the spirit and scope of the appended claims.
Features described and/or illustrated for one embodiment may be used in the same or a similar way in one or more other embodiments, and/or combined with or substituted for features of other embodiments.
It should be emphasized that the word "comprises", when used in this specification, indicates the presence of the stated features, integers, steps, or components, but does not exclude the presence or addition of one or more other features, integers, steps, components, or combinations thereof.
Exemplary embodiments of the present invention are described in detail below with reference to Figs. 1 to 6.
Fig. 1 is a block diagram illustrating a configuration example of the speech synthesis system according to the first embodiment of the invention. The figure also shows the workflow of this system, that is, the signal flow between its components.
The speech synthesis system according to the first embodiment comprises a to-be-synthesized text supply unit 101, a natural prosodic parameter acquisition unit 102, a natural prosodic parameter storage unit 103, a phoneme list storage unit 104, a voice library 105, a sound composition unit 106, and a synthetic speech storage unit 107.
The text supply unit 101 provides the arbitrary text on which speech synthesis is to be performed, and can preprocess the text to eliminate defects and errors so that it is standard, accurate, and convenient for producing a good recording file.
Based on the text provided by the text supply unit 101, the natural prosodic parameter acquisition unit 102 generates the natural prosodic parameters and the phoneme list corresponding to that text, and supplies them to the natural prosodic parameter storage unit 103 and the phoneme list storage unit 104, respectively, for storage.
Natural prosodic parameters are the acoustic prosodic parameters corresponding to each phoneme in speech, generally including duration, fundamental frequency, and energy. For example, the duration describes the length of time the phoneme occupies in the speech stream.
A phoneme list is the sequence obtained from the linguistic and phonetic analysis performed by the front end of the speech synthesis system. As a rule, it corresponds to one sentence of text. For Chinese, after front-end processing it includes information of the following kinds: Chinese characters, word segmentation, part of speech, pinyin (syllable or semi-syllable), and prosodic boundary grades.
Below is one possible phoneme list example. It contains word segmentation, pinyin, part-of-speech, and prosodic structure information.
(有 you3)/v (一 yi1)/m (次 ci4)/q ||| (我们 wo3 men5)/r || (和 he2)/p | (外 wai4)/f (校 xiao4)/ng || (搞 gao3)/v (联谊 lian2 yi4)/v | (爬 pa2)/v (香山 xiang1 shan1)/ns ||| (我们 wo3 men5)/r (的 de5)/u | (学生 xue2 sheng1)/n || (没有 mei2 you3)/v (一个 yi2 ge4)/m | (掉队 diao4 dui4)/v (的 de5)/u ||| (蹭蹭蹭 ceng1 ceng1 ceng1)/o || (就 jiu4)/d (爬 pa2)/v (上 shang4)/v (了 le5)/u | (山顶 shan1 ding3)/n @
Because written Chinese has no delimiters between words, lexical analysis, comprising word segmentation and part-of-speech tagging, is the first problem that must be solved to facilitate subsequent processing. In general, automatic segmentation is performed with the aid of a dictionary. The main automatic segmentation methods currently in use include forward maximum matching, backward maximum matching, language-model methods, hidden Markov models, and maximum entropy models. In the example above, the one or more characters before a "/" form a word, and the letter after the "/" indicates that word's part of speech: for example, "(to) have" (you3) is a verb ("v"), and "we" (wo3 men5) is a pronoun ("r").
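The first dictionary-based method named above, forward maximum matching, can be written in a few lines: at each position, take the longest dictionary word that starts there, falling back to a single character. The tiny dictionary below is illustrative only.

```python
# Forward maximum matching (FMM) word segmentation with a toy dictionary.

DICT = {"我们", "的", "学生", "没有", "掉队"}
MAX_WORD_LEN = max(len(w) for w in DICT)

def fmm_segment(sentence):
    words, i = [], 0
    while i < len(sentence):
        # try the longest candidate first, shrinking down to one char
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in DICT:
                words.append(candidate)
                i += length
                break
    return words

print(fmm_segment("我们的学生没有掉队"))
```

Backward maximum matching works identically from the end of the sentence; real segmenters combine such baselines with the statistical models mentioned above.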
Wherein, " | ", " || ", " || | " are represented prosodic phrase, prosodic phrase and intonation phrase respectively." @ " is used for mark sentence tail.Like this, " once ", " we and ", " outer school ", " doing friendship ties ", " climbing Fragrance Hill " etc. all are rhythm speech.Further, " we and outer school " and " do friendship ties and climb Fragrance Hill " are prosodic phrase, and " we and outer school are done friendship ties and climbed Fragrance Hill " then is the intonation phrase.
Finally, there is the pinyin information. For example, "you3" is the pinyin of the word "(to) have", and "wo3 men5" is the pinyin of the word "we".
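A small parser makes the structure of such phoneme-list entries concrete: "(word pinyin ...)/POS" tokens separated by "|" / "||" / "|||" boundary markers, with "@" marking the sentence end. The exact file format is an assumption inferred from the example above.

```python
import re

# Tokenize a phoneme-list line into words (with pinyin syllables and
# POS tag), prosodic boundary markers, and the sentence-end mark.

TOKEN = re.compile(r"\(\s*([^)]*?)\s*\)\s*/(\w+)|(\|{1,3})|(@)")
BOUNDARY = {1: "prosodic word", 2: "prosodic phrase", 3: "intonation phrase"}

def parse_phoneme_list(line):
    items = []
    for m in TOKEN.finditer(line):
        if m.group(3):                      # "|" / "||" / "|||"
            items.append(("boundary", BOUNDARY[len(m.group(3))]))
        elif m.group(4):                    # "@" sentence end
            items.append(("end", "@"))
        else:                               # "(syllables)/POS"
            items.append(("word", m.group(1).split(), m.group(2)))
    return items

items = parse_phoneme_list("( you3 )/v ||| ( wo3 men5 )/r @")
```

Such a structured list is what the unit selection stage consumes later, one candidate search per syllable.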
Based on the natural prosodic parameters stored in the natural prosodic parameter storage unit 103 and the phoneme list stored in the phoneme list storage unit 104, the sound composition unit 106 selects from the pre-recorded voice library 105 a plurality of voice units corresponding to the text to be synthesized, performs waveform synthesis and adjustment on the selected units to generate the final synthetic speech file, and supplies that file to the synthetic speech storage unit 107 for storage.
The voice library 105 can be a voice library for a general-purpose speech synthesis system, and the sound composition unit 106 can be the back-end module of such a system. The natural prosodic parameter storage unit 103, the phoneme list storage unit 104, and the synthetic speech storage unit 107 can be ordinary read-write storage devices in a computer system, such as RAM, flash memory, hard disks, rewritable optical discs, or magneto-optical discs, or a dedicated storage server. Although these storage units are described here as discrete components, in practice they may share the same physical storage device.
Fig. 2 is a block diagram illustrating a concrete configuration example of the natural prosodic parameter acquisition unit 102.
The natural prosodic parameter acquisition unit 102 comprises a recording unit 201, a natural prosodic parameter extraction unit 202, and a phoneme list generation unit 203.
The recording unit 201 generates a recording file from the text provided by the text supply unit 101. Here, a person may be asked to read the text aloud with expressive intonation that conveys the emotion of its content; the text may also be read in a voice resembling a common electronic cartoon character, or in some other distinctive tone, and recorded accordingly to form the recording file. Alternatively, recorded material corresponding to, or containing, the content of the text can be found in an existing recording database and edited appropriately to obtain the required recording file. The recording unit 201 can be implemented in any of the various ways fully described in the prior art, which are not repeated here.
Then, based on the recording file provided by the recording unit 201, the natural prosodic parameter extraction unit 202 performs digital signal processing on the waveform data of the file to extract the prosodic parameters, such as duration, fundamental frequency, and energy, corresponding to the specific content (word or phrase units) of the text.
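Two of the parameters named above can be illustrated on raw samples: short-time energy, and fundamental frequency via a simple autocorrelation peak search. This is a minimal sketch; real extraction uses far more robust pitch trackers, and the sample rate, frame size, and search range here are assumptions.

```python
import math

SR = 8000  # assumed sample rate in Hz

def frame_energy(frame):
    """Mean squared amplitude of one analysis frame."""
    return sum(s * s for s in frame) / len(frame)

def estimate_f0(frame, fmin=80, fmax=400):
    """Pick the lag with maximum autocorrelation in [SR/fmax, SR/fmin]."""
    best_lag, best_r = 0, 0.0
    for lag in range(SR // fmax, SR // fmin + 1):
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return SR / best_lag if best_lag else 0.0

# Sanity check on a pure 200 Hz tone (true period = 40 samples).
frame = [math.sin(2 * math.pi * 200 * n / SR) for n in range(400)]
f0 = estimate_f0(frame)
energy = frame_energy(frame)
```

Duration would come from aligning such frames against the phoneme list, which is the alignment role the phoneme list generation unit plays below.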
The phoneme list generation unit 203 analyzes the text to be synthesized, based on the analysis result of the prosodic parameter extraction unit 202, to obtain a phoneme list comprising the series of phonemes that make up the continuous speech flow of the text. Since automatic phoneme list generation by machine may run into certain problems, such as the handling of polyphonic characters, manual verification may be added to the processing of the phoneme list generation unit 203 to produce a correct phoneme list corresponding to the text to be synthesized.
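To make the polyphone problem concrete, here is a hypothetical sketch of dictionary-based phoneme list generation that flags polyphonic characters for the manual verification the patent mentions. The lexicon entries and pinyin readings are illustrative assumptions, not data from the patent:

```python
# hypothetical mini-lexicon; a real system uses a full pronunciation dictionary
LEXICON = {
    "\u4f60": ["ni3"],                    # 你: one reading
    "\u597d": ["hao3", "hao4"],           # 好: polyphone, needs disambiguation
    "\u884c": ["xing2", "hang2"],         # 行: polyphone
    "\u8d70": ["zou3"],                   # 走: one reading
}

def text_to_phonemes(text):
    """Return (phoneme list, indices of characters needing manual check)."""
    phonemes, to_verify = [], []
    for i, ch in enumerate(text):
        readings = LEXICON.get(ch, ["<unk>"])
        phonemes.append(readings[0])      # default to the most common reading
        if len(readings) > 1:
            to_verify.append(i)           # flag polyphones for a human editor
    return phonemes, to_verify

phonemes, flagged = text_to_phonemes("\u4f60\u597d")
```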
The natural prosodic parameters extracted by the natural prosodic parameter extraction unit 202 and the phoneme list generated by the phoneme list generation unit 203 are stored in the natural prosodic parameter storage 103 and the phoneme list storage 104, respectively.
Fig. 3 illustrates an example configuration of the sound production unit 106.
The sound production unit 106 comprises a unit selection portion 301 and a waveform generation portion 302. Using the natural prosodic parameters stored in the natural prosodic parameter storage 103 as a reference, the unit selection portion 301 searches among the speech units stored in the sound bank 105, for the phoneme list stored in the phoneme list storage 104, for a globally optimal speech synthesis unit list comprising a series of speech synthesis units.
Global optimality is explained as follows.
Suppose a given sentence consists of N phonemes. For each phoneme, according to its prosodic parameter features, the sample in the sound bank whose prosodic parameters are most similar can be found as its synthesis unit. However, for the sentence as a whole, simply splicing together the N synthesis units found in this way is not optimal. In fact, besides making the synthesized prosodic parameters of each phoneme as similar as possible to those of the sample, the timbre loss caused by spectral mismatch at the joins between adjacent synthesis units must also be considered. In view of this consideration, we call this unit selection strategy globally optimal unit selection.
Then, the waveform generation portion 302 splices together the speech synthesis unit list selected by the unit selection portion 301 and, with reference to the natural prosodic parameters in the natural prosodic parameter storage 103, performs waveform adjustment on the spliced result to obtain a highly natural synthesized speech file, which is stored in the synthesized speech storage 107.
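The patent leaves the splicing method open. One common way to join units while softening the spectral discontinuity at each boundary is a short linear cross-fade; this sketch is an assumed illustration of that idea, with the overlap length chosen arbitrarily:

```python
def crossfade_concat(units, overlap):
    """Concatenate unit waveforms, blending `overlap` samples at each
    join with a linear cross-fade to smooth the transition."""
    out = list(units[0])
    for unit in units[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)        # fade-in weight for new unit
            out[-overlap + i] = (1 - w) * out[-overlap + i] + w * unit[i]
        out.extend(unit[overlap:])
    return out

# two constant "waveforms" make the blend at the join easy to see
a = [1.0] * 6
b = [0.0] * 6
y = crossfade_concat([a, b], overlap=2)
```

In the system described here, this step would additionally stretch or shift each unit toward the reference prosodic parameters before joining.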
According to the first embodiment of the invention, by using the standard, pleasant-sounding pronunciation material provided by the sound bank, combined with the natural prosodic parameters obtained by analyzing a recording of a human reading (or obtained in other ways), a synthesized speech file with graceful timbre and natural prosody can be produced.
The speech synthesis system according to the second embodiment of the invention is described below with reference to Figs. 4 to 6. It differs from the first embodiment in that the second embodiment annotates the text to be synthesized with prosodic parameters and analyzes the annotation file to obtain approximate natural prosodic parameters.
As shown in Fig. 4, the speech synthesis system according to the second embodiment of the invention comprises a text providing unit 101, an approximate natural prosodic parameter acquisition unit 401, an approximate natural prosodic parameter storage 402, a phoneme list storage 104, a sound bank 105, a sound production unit 106, and a synthesized speech storage 107. The text providing unit 101, the phoneme list storage 104, the sound bank 105, the sound production unit 106, and the synthesized speech storage 107 are the same as in the first embodiment and are not described again here.
The approximate natural prosodic parameter acquisition unit 401 annotates the text to be synthesized, provided by the text providing unit 101, with reference to predetermined prosodic parameter annotation rules, and then analyzes the annotated text to obtain approximate natural prosodic parameters. The resulting approximate natural prosodic parameters are stored in the approximate natural prosodic parameter storage 402.
Fig. 5 illustrates an example configuration of the approximate natural prosodic parameter acquisition unit 401.
The approximate natural prosodic parameter acquisition unit 401 comprises a prosodic parameter annotation rule knowledge base 501, a prosodic parameter annotation unit 502, a prosodic parameter annotation file storage 503, a prosodic parameter generation unit 504, and a phoneme list generation unit 505.
The prosodic parameter annotation rule knowledge base 501 stores data on the various rules for prosodic parameter annotation. For example, Fig. 6 shows an example of the data structure of a prosodic parameter annotation rule. A prosodic parameter annotation rule stores at least a prosodic parameter name (for example, duration, pitch, speed), an annotation symbol (for example, P, D, S), and an annotation interval (for example, 0 to 9).
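Without access to Fig. 6 itself, the rule records it describes (parameter name, symbol, interval) might be sketched as follows; the `valid_label` helper is an added convenience for illustration, not something the patent specifies:

```python
# a sketch of the Fig. 6 rule records: name, annotation symbol, interval
ANNOTATION_RULES = [
    {"parameter": "duration", "symbol": "D", "interval": (0, 9)},
    {"parameter": "pitch",    "symbol": "P", "interval": (0, 9)},
    {"parameter": "speed",    "symbol": "S", "interval": (0, 9)},
]

def valid_label(label):
    """Check an annotation label such as 'D7' against the rule base:
    the symbol must be known and the level must lie in the interval."""
    for rule in ANNOTATION_RULES:
        if label[:1] == rule["symbol"]:
            lo, hi = rule["interval"]
            return label[1:].isdigit() and lo <= int(label[1:]) <= hi
    return False
```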
The prosodic parameter annotation unit 502 annotates the text to be synthesized, provided by the text providing unit 101, with prosodic parameters, with reference to the prosodic parameter annotation rules stored in the prosodic parameter annotation rule knowledge base 501, and stores the annotated file in the prosodic parameter annotation file storage 503.
In practice, the prosodic annotation rules are combined with automatic prosodic parameter prediction. Specifically, prosodic parameters are first predicted for each phoneme in the input phoneme list by an automatic prosodic parameter prediction module. On this basis, some of the parameters can then be revised manually according to the needs of the application, for example based on the emotional needs of the context, such as lengthening certain phonemes or emphasizing the tone of certain phonemes; the former can be expressed with the annotation symbol D and the latter with the annotation symbol P. For example, suppose "D5" represents keeping the original duration; then "D6, D7, D8, D9" can represent lengthening the duration by 10%, 20%, 30%, and 40% relative to the original, respectively. Conversely, "D4, D3, D2, D1, D0" can represent shortening (speeding up) the duration by 10%, 20%, 30%, 40%, and 50% relative to the original, respectively.
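The D-label scheme just described reduces to a simple linear mapping from label level to a duration scale factor, which can be sketched directly from the worked example in the text:

```python
def duration_scale(label):
    """Map a duration label 'D0'..'D9' to a duration scale factor:
    D5 keeps the original duration; each step above D5 lengthens it by
    10% of the original, each step below D5 shortens it by 10%."""
    n = int(label[1])
    return 1.0 + 0.1 * (n - 5)
```

So "D8" scales a phoneme's predicted duration by 1.3 (30% longer) and "D0" by 0.5 (50% shorter), matching the intervals given above.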
The prosodic parameter generation unit 504 analyzes the prosodic parameter annotation file stored in the prosodic parameter annotation file storage 503 to obtain various quantified prosodic parameters, and stores them in the approximate natural prosodic parameter storage 402. Although these prosodic parameters cannot be fully equal to the natural prosodic parameters obtained in the first embodiment, they still have good naturalness.
The phoneme list generation unit 505 parses the prosodic parameter annotation file stored in the prosodic parameter annotation file storage 503 to generate the phoneme list corresponding to that file, and stores the generated phoneme list in the phoneme list storage 104.
Finally, in the same way as described in the first embodiment, the sound production unit 106 uses the approximate natural prosodic parameters stored in the approximate natural prosodic parameter storage 402 as a reference; among the speech units stored in the sound bank 105, it searches, for the phoneme list stored in the phoneme list storage 104, for a globally optimal speech synthesis unit list comprising a series of speech synthesis units; it then splices these speech synthesis units together and performs waveform adjustment on the spliced result with reference to the approximate natural prosodic parameters, to obtain a synthesized speech file with improved naturalness.
According to the second embodiment of the invention, not only can a synthesized speech file with improved naturalness be obtained compared with prior-art speech synthesis systems, but also, compared with the first embodiment of the invention, no human reading is required, so efficiency can be improved and cost can be saved.
It should further be noted that although the natural prosodic parameter storage 103, the phoneme list storage 104, the synthesized speech storage 107, the approximate natural prosodic parameter storage 402, and so on have been described separately above, these components may in fact share the same physical storage device. The sound bank 105 and the prosodic parameter annotation rule knowledge base 501 can use dedicated storage devices, but can also use part of a general-purpose storage device.
The speech synthesis system of the present invention can be realized by a general-purpose computer system equipped with corresponding computer programs. However, it is not limited to this: each component can also be implemented as a dedicated electronic device (for example, firmware), and a complete speech synthesis system can be realized by integrating these devices.
The present invention provides a speech synthesis system that can improve the naturalness of synthesized speech. With the speech synthesis system of the invention, synthesized speech with graceful timbre and modulated intonation very close to natural speech can be generated according to the user's requirements.
By combining a standard, pleasant timbre with a reading that is expressive but poorly voiced or non-standard in pronunciation, synthesized speech works with pleasant timbre and modulated intonation can be produced from a reading by anyone. Thus, the production cost of audio e-books can be reduced significantly, and audio e-books with consistent voice characteristics and rich prosodic rhythm can be mass-produced.
With the speech synthesis system of the invention, one can still produce one's own original, pleasant voice even after one's voice has changed. For example, an announcer with a cold can use the system to broadcast in a voice not much different from before; the system can also let a woman, in old age, produce the voice she had when young.
Although only preferred embodiments have been selected above to illustrate the present invention, those skilled in the art can readily make various changes and modifications in light of the disclosure herein without departing from the scope of the invention defined by the following claims. The description of the above embodiments is merely exemplary and does not limit the invention as defined by the claims and their equivalents.

Claims (10)

1. A speech synthesis apparatus for performing speech synthesis based on a prosody reference, comprising:
a prosodic parameter acquisition unit, which obtains natural prosodic parameters or approximate natural prosodic parameters by analyzing a recording file of the text to be synthesized, obtained by means including a natural person reading the text aloud, or by analyzing a prosodic parameter annotation file obtained by annotating the text to be synthesized with prosodic parameters according to a predetermined annotation standard; and
a sound production unit, which, using the natural prosodic parameters or approximate natural prosodic parameters as a reference, selects corresponding speech synthesis units for the text to be synthesized from a pre-recorded sound bank, and splices the speech synthesis units together to produce a synthesized speech file corresponding to the text to be synthesized.
2. The speech synthesis apparatus according to claim 1, wherein the sound production unit comprises:
a speech unit selection portion, which, using the natural prosodic parameters or approximate natural prosodic parameters as a reference, selects globally optimal speech synthesis units for the text to be synthesized from the pre-recorded sound bank; and
a waveform generation portion, which, based on the natural prosodic parameters or approximate natural prosodic parameters, splices together the speech synthesis units selected by the speech unit selection portion and performs waveform adjustment on the spliced speech file, to obtain a highly natural synthesized speech file corresponding to the text to be synthesized.
3. The speech synthesis apparatus according to claim 2, wherein the prosodic parameter acquisition unit comprises:
a recording portion, which obtains a recording file of the text to be synthesized by a natural person reading the text aloud; and
a prosodic parameter extraction portion, which obtains prosodic parameters from the waveform data of the recording file, the prosodic parameters including duration, pitch, and energy.
4. The speech synthesis apparatus according to claim 2, wherein the prosodic parameter acquisition unit comprises:
a prosodic parameter annotation portion, which, with reference to a knowledge base of annotation rules defining prosodic parameters, annotates the text to be synthesized with prosodic parameters according to the predetermined annotation standard, to obtain a prosodic parameter annotation file; and
a prosodic parameter generation portion, which analyzes the prosodic parameter annotation file to obtain approximate prosodic parameters, the approximate prosodic parameters including duration, pitch, and energy.
5. The speech synthesis apparatus according to claim 1, wherein the recording file of the text to be synthesized is obtained by electronic cartoon voicing.
6. A method of performing speech synthesis based on a prosody reference, comprising:
a prosodic parameter acquisition step of obtaining natural prosodic parameters or approximate natural prosodic parameters by analyzing a recording file of the text to be synthesized, obtained by means including a natural person reading the text aloud, or by analyzing a prosodic parameter annotation file obtained by annotating the text to be synthesized with prosodic parameters according to a predetermined annotation standard; and
a sound production step of, using the natural prosodic parameters or approximate natural prosodic parameters as a reference, selecting corresponding speech synthesis units for the text to be synthesized from a pre-recorded sound bank, and splicing the speech synthesis units together to produce a synthesized speech file corresponding to the text to be synthesized.
7. The method according to claim 6, wherein the sound production step comprises:
using the natural prosodic parameters or approximate natural prosodic parameters as a reference, selecting globally optimal speech synthesis units for the text to be synthesized from the pre-recorded sound bank; and
based on the natural prosodic parameters or approximate natural prosodic parameters, splicing together the selected speech synthesis units and performing waveform adjustment on the spliced speech file, to obtain a highly natural synthesized speech file corresponding to the text to be synthesized.
8. The method according to claim 7, wherein the prosodic parameter acquisition step comprises:
obtaining a recording file of the text to be synthesized by a natural person reading the text aloud; and
obtaining prosodic parameters from the waveform data of the recording file, the prosodic parameters including duration, pitch, and energy.
9. The method according to claim 7, wherein the prosodic parameter acquisition step comprises:
annotating the text to be synthesized with prosodic parameters according to the predetermined annotation standard, with reference to a knowledge base of annotation rules defining prosodic parameters, to obtain a prosodic parameter annotation file; and
analyzing the prosodic parameter annotation file to obtain approximate prosodic parameters, the approximate prosodic parameters including duration, pitch, and energy.
10. The method according to claim 6, wherein the recording file of the text to be synthesized is obtained by electronic cartoon voicing.
CN200810166002A 2008-09-28 2008-09-28 Voice synthesizing apparatus and method based on rhythm reference Pending CN101685633A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810166002A CN101685633A (en) 2008-09-28 2008-09-28 Voice synthesizing apparatus and method based on rhythm reference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810166002A CN101685633A (en) 2008-09-28 2008-09-28 Voice synthesizing apparatus and method based on rhythm reference

Publications (1)

Publication Number Publication Date
CN101685633A true CN101685633A (en) 2010-03-31

Family

ID=42048753

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810166002A Pending CN101685633A (en) 2008-09-28 2008-09-28 Voice synthesizing apparatus and method based on rhythm reference

Country Status (1)

Country Link
CN (1) CN101685633A (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102881285A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method for marking rhythm and special marking equipment
CN102881282A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method and system for obtaining prosodic boundary information
CN102971787A (en) * 2010-10-29 2013-03-13 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
TWI474274B (en) * 2012-03-27 2015-02-21 Mitake Information Corp Device and method for voice reporting the portfolio news of the finance software
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN109300468A (en) * 2018-09-12 2019-02-01 科大讯飞股份有限公司 A kind of voice annotation method and device
CN110277086A (en) * 2019-06-25 2019-09-24 中国科学院自动化研究所 Phoneme synthesizing method, system and electronic equipment based on dispatching of power netwoks knowledge mapping
CN110517662A (en) * 2019-07-12 2019-11-29 云知声智能科技股份有限公司 A kind of method and system of Intelligent voice broadcasting
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
CN111128116A (en) * 2019-12-20 2020-05-08 珠海格力电器股份有限公司 Voice processing method and device, computing equipment and storage medium
CN111292719A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111611208A (en) * 2020-05-27 2020-09-01 北京太极华保科技股份有限公司 File storage and query method and device and storage medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112767912A (en) * 2020-12-28 2021-05-07 深圳市优必选科技股份有限公司 Cross-language voice conversion method and device, computer equipment and storage medium

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102971787B (en) * 2010-10-29 2014-04-23 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
US9330667B2 (en) 2010-10-29 2016-05-03 Iflytek Co., Ltd. Method and system for endpoint automatic detection of audio record
CN102971787A (en) * 2010-10-29 2013-03-13 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102881285A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method for marking rhythm and special marking equipment
CN102881282B (en) * 2011-07-15 2014-08-20 富士通株式会社 Method and system for obtaining prosodic boundary information
CN102881285B (en) * 2011-07-15 2015-10-21 富士通株式会社 A kind of method of prosodic labeling and prosodic labeling equipment
CN102881282A (en) * 2011-07-15 2013-01-16 富士通株式会社 Method and system for obtaining prosodic boundary information
TWI474274B (en) * 2012-03-27 2015-02-21 Mitake Information Corp Device and method for voice reporting the portfolio news of the finance software
CN108364631A (en) * 2017-01-26 2018-08-03 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN108364631B (en) * 2017-01-26 2021-01-22 北京搜狗科技发展有限公司 Speech synthesis method and device
CN107705783A (en) * 2017-11-27 2018-02-16 北京搜狗科技发展有限公司 A kind of phoneme synthesizing method and device
CN107705783B (en) * 2017-11-27 2022-04-26 北京搜狗科技发展有限公司 Voice synthesis method and device
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN108364632B (en) * 2017-12-22 2021-09-10 东南大学 Emotional Chinese text voice synthesis method
CN110556092A (en) * 2018-05-15 2019-12-10 中兴通讯股份有限公司 Speech synthesis method and device, storage medium and electronic device
CN109300468A (en) * 2018-09-12 2019-02-01 科大讯飞股份有限公司 A kind of voice annotation method and device
CN109300468B (en) * 2018-09-12 2022-09-06 科大讯飞股份有限公司 Voice labeling method and device
CN110277086A (en) * 2019-06-25 2019-09-24 中国科学院自动化研究所 Phoneme synthesizing method, system and electronic equipment based on dispatching of power netwoks knowledge mapping
CN110277086B (en) * 2019-06-25 2021-11-19 中国科学院自动化研究所 Voice synthesis method and system based on power grid dispatching knowledge graph and electronic equipment
CN110517662A (en) * 2019-07-12 2019-11-29 云知声智能科技股份有限公司 A kind of method and system of Intelligent voice broadcasting
CN110534088A (en) * 2019-09-25 2019-12-03 招商局金融科技有限公司 Phoneme synthesizing method, electronic device and storage medium
CN111128116A (en) * 2019-12-20 2020-05-08 珠海格力电器股份有限公司 Voice processing method and device, computing equipment and storage medium
CN111292719A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111611208A (en) * 2020-05-27 2020-09-01 北京太极华保科技股份有限公司 File storage and query method and device and storage medium
CN112309368A (en) * 2020-11-23 2021-02-02 北京有竹居网络技术有限公司 Prosody prediction method, device, equipment and storage medium
CN112767912A (en) * 2020-12-28 2021-05-07 深圳市优必选科技股份有限公司 Cross-language voice conversion method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN101685633A (en) Voice synthesizing apparatus and method based on rhythm reference
Bulyko et al. A bootstrapping approach to automating prosodic annotation for limited-domain synthesis
US8352270B2 (en) Interactive TTS optimization tool
EP0833304B1 (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
Sridhar et al. Exploiting acoustic and syntactic features for automatic prosody labeling in a maximum entropy framework
US20080177543A1 (en) Stochastic Syllable Accent Recognition
US20090048841A1 (en) Synthesis by Generation and Concatenation of Multi-Form Segments
Patil et al. A syllable-based framework for unit selection synthesis in 13 Indian languages
US20120143611A1 (en) Trajectory Tiling Approach for Text-to-Speech
Khan et al. Concatenative speech synthesis: A review
US7069216B2 (en) Corpus-based prosody translation system
CN115485766A (en) Speech synthesis prosody using BERT models
Dutoit A short introduction to text-to-speech synthesis
Bettayeb et al. Speech synthesis system for the holy quran recitation.
Kayte et al. A Marathi Hidden-Markov Model Based Speech Synthesis System
Stöber et al. Speech synthesis using multilevel selection and concatenation of units from large speech corpora
Hamad et al. Arabic text-to-speech synthesizer
KR100373329B1 (en) Apparatus and method for text-to-speech conversion using phonetic environment and intervening pause duration
Van Do et al. Non-uniform unit selection in Vietnamese speech synthesis
Anushiya Rachel et al. A small-footprint context-independent HMM-based synthesizer for Tamil
Saraswathi et al. Design of multilingual speech synthesis system
Iriondo et al. Objective and subjective evaluation of an expressive speech corpus
Maia et al. An HMM-based Brazilian Portuguese speech synthesizer and its characteristics
Demeke et al. Duration modeling of phonemes for amharic text to speech system
EP1589524B1 (en) Method and device for speech synthesis

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20100331