CN1658281A - Voice operation device, method and recording medium for recording voice operation program - Google Patents

Voice operation device, method and recording medium for recording voice operation program Download PDF

Info

Publication number
CN1658281A
CN1658281A · CN2005100074542A · CN200510007454A
Authority
CN
China
Prior art keywords
mentioned
phoneme
information
voice quality
formant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2005100074542A
Other languages
Chinese (zh)
Other versions
CN100337104C (en)
Inventor
川原毅彦
剑持秀纪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Publication of CN1658281A publication Critical patent/CN1658281A/en
Application granted granted Critical
Publication of CN100337104C publication Critical patent/CN100337104C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Document Processing Apparatus (AREA)
  • Telephonic Communication Services (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

Provided are a speech synthesizer and related method capable of synthesizing speech of various voice qualities even in an environment where severe restrictions are imposed on hardware resources. The speech synthesizer 100, which holds only one kind of phoneme data, is provided with a voice quality change unit 250 and a voice quality database 260. The voice quality change unit 250 searches the voice quality database 260 using a voice quality data number supplied from a text analysis unit 220 as a search key, and obtains a voice quality parameter. Based on the obtained voice quality parameter, the voice quality change unit 250 changes the voice quality of each sound represented by the phoneme data obtained by a phoneme data acquisition unit 230.

Description

Speech synthesis device, speech synthesis method, and recording medium storing a speech synthesis program
Technical field
The present invention relates to a speech synthesis device, a speech synthesis method, and a recording medium storing a speech synthesis program, for generating synthetic speech from input text information.
Background technology
Fig. 17 shows the structure of a conventional speech synthesizer 100 that generates synthetic speech from input text information.
The input unit 110 receives text information such as "こんにちわ" from an operating unit or the like (not shown) and supplies it to the text analysis unit 120. Using a word dictionary and the like, the text analysis unit 120 performs word analysis, syntax analysis, and so on on the received text information, generating phoneme information representing each phoneme of the moras "こ", "ん", "に", "ち", "わ", together with prosodic information representing the length, pitch, and intensity of each phoneme, and supplies these to the speech synthesis unit 130. For each piece of phoneme information supplied from the text analysis unit 120, the speech synthesis unit 130 obtains the speech data of the corresponding mora (hereinafter, phoneme data) from the phoneme database 140. The speech synthesis unit 130 then processes and concatenates the obtained phoneme data as appropriate according to the prosodic information to generate a synthetic speech signal, which is output as synthetic speech from a loudspeaker or the like. By listening to the synthetic speech output from the speech synthesizer, the user can confirm the content of the input text information.
However, the above phoneme database registers only one kind of phoneme data, uttered by a specific speaker (for example, a male speaker). Consequently, when text of the kind favored by, for example, a young woman is output as synthetic speech having that specific speaker's voice quality, the user perceives a mismatch between the voice quality and the content of the speech.
To solve this problem, a technique has been proposed in which multiple kinds of phoneme data (for example, different phoneme data for a man, a woman, a child, an elderly person, and so on) are registered in the phoneme database in advance, the optimum phoneme data is selected according to the content of the input text information and the like, and the selected phoneme data is used to generate the synthetic speech (see, for example, Patent Document 1).
Patent Document 1: Japanese Unexamined Patent Publication No. 2000-339137 (pages 3-4)
Summary of the invention
The technique disclosed in Patent Document 1 can indeed produce suitable synthetic speech, but to do so it must register multiple kinds of phoneme data in the phoneme database. This raises a problem: in a portable terminal or other device whose hardware resources, such as memory and CPU, are severely limited, such multiple kinds of phoneme data cannot be installed, and consequently the device cannot generate synthetic speech of various voice qualities.
The present invention was made in view of the above problems, and its object is to provide a speech synthesis device, a speech synthesis method, and a recording medium storing a speech synthesis program that can generate synthetic speech of various voice qualities even in an environment where hardware resources are severely limited.
To solve the above problems, the speech synthesis device of the present invention is characterized by comprising: an acquisition unit that obtains, from text information input to the speech synthesis device, phoneme designation information designating the phonemes of the synthetic speech and voice quality designation information designating the voice quality of the synthetic speech; a first storage unit that stores a plurality of phoneme data each representing a phoneme; a second storage unit that stores multiple kinds of phoneme data processing information, each being information for changing the voice quality of the phonemes and representing the processing to be applied to the phoneme data; a first extraction unit that extracts, from the first storage unit, the phoneme data corresponding to the phonemes represented by the phoneme designation information; a second extraction unit that extracts, from the second storage unit, the phoneme data processing information corresponding to the voice quality represented by the voice quality designation information; and a generation unit that processes the extracted phoneme data according to the extracted phoneme data processing information and generates the synthetic speech.
In the speech synthesis device of the present invention, preferably each phoneme data includes formant information representing the formants of the phoneme, the phoneme data processing information includes formant change information representing how the formants are to be changed, and the generation unit changes the formant information according to the formant change information and generates the synthetic speech from the signal waveform obtained by summing the changed formants.
In the speech synthesis device of the present invention, preferably the formant information consists of pairs of a formant frequency and a formant amplitude, the formant change information includes formant frequency change information representing how the formant frequencies are to be changed and formant amplitude change information representing how the formant amplitudes are to be changed, and the generation unit changes each formant frequency and each formant amplitude of the phoneme represented by the phoneme data according to the formant frequency change information and the formant amplitude change information, thereby obtaining the changed formants.
In the speech synthesis device of the present invention, preferably the acquisition unit obtains from the text information, in addition to the phoneme designation information and the voice quality designation information, pitch designation information designating the pitch of the synthetic speech, and the generation unit obtains the synthetic speech by imparting the pitch represented by the pitch designation information to the signal waveform produced by summing the changed formants.
In the speech synthesis device of the present invention, preferably the text information includes the voice quality designation information, and the acquisition unit obtains the voice quality designation information from the text information.
In the speech synthesis device of the present invention, preferably the acquisition unit extracts keywords from the text information and, from the extracted keywords, judges the voice quality suited to the text information.
The present invention also provides a speech synthesis method characterized by comprising: an acquisition step of obtaining, from text information input to a speech synthesis device, phoneme designation information designating the phonemes of synthetic speech and voice quality designation information designating the voice quality of the synthetic speech; a first extraction step of extracting, from a first storage unit storing a plurality of phoneme data each representing a phoneme, the phoneme data corresponding to the phonemes represented by the phoneme designation information; a second extraction step of extracting, from a second storage unit storing multiple kinds of phoneme data processing information, each being information for changing the voice quality of the phonemes and representing the processing to be applied to the phoneme data, the phoneme data processing information corresponding to the voice quality represented by the voice quality designation information; and a generation step of processing the extracted phoneme data according to the extracted phoneme data processing information and generating the synthetic speech.
Effect of the Invention
As described above, according to the present invention, synthetic speech of various voice qualities can be generated even in an environment where hardware resources are severely limited.
Description of drawings
Fig. 1 is a block diagram of the functional structure of the speech synthesis device of the present embodiment.
Fig. 2 is a diagram illustrating an example of the text information of the embodiment.
Fig. 3 is a diagram illustrating an example of the registered contents of the phoneme database of the embodiment.
Fig. 4 is a diagram illustrating an example of the phoneme data structure of the embodiment.
Fig. 5 is a diagram for explaining the frame information contained in the phoneme data of the embodiment.
Fig. 6 is a diagram illustrating an example of the registered contents of the voice quality database of the embodiment.
Fig. 7 is a diagram showing an example of the structure of a voice quality parameter of the embodiment.
Fig. 8 is a flowchart showing the voice quality change processing of the embodiment.
Fig. 9 is a diagram illustrating an example of the mapping function of the embodiment.
Fig. 10 is a diagram showing the analysis result of a male phoneme in the embodiment.
Fig. 11 is a diagram showing the analysis result of a female phoneme in the embodiment.
Fig. 12 is a diagram illustrating an example of the oscillation table of the embodiment.
Fig. 13 is a diagram illustrating the relation between the jitter values read from the oscillation table and time in the embodiment.
Fig. 14 is a diagram for explaining the oscillation of the formant frequency in the embodiment.
Fig. 15 is a diagram for explaining the pitch imparting processing of the embodiment.
Fig. 16 is a diagram illustrating the waveform of a specific formant after the voice quality change processing and the pitch imparting processing of the embodiment.
Fig. 17 is a diagram showing the functional structure of a conventional speech synthesis device.
Embodiment
Embodiments of the present invention are described below with reference to the drawings.
A. Present embodiment
Fig. 1 shows the functional structure of the speech synthesis device 100 of the present embodiment. The present embodiment assumes that the speech synthesis device 100 is installed in a portable terminal whose hardware resources are severely limited, such as a mobile phone, a PHS (Personal Handyphone System), or a PDA (Personal Digital Assistant); however, the invention is not limited to this and can be applied to various electronic devices.
The input unit 210 supplies text information, input via an operating unit or the like (not shown), to the text analysis unit 220. Fig. 2 illustrates an example of the text information.
The text content information represents the text to be output as synthetic speech (for example "こんにちわ"). Although Fig. 2 shows text content information written only in hiragana, the text content information is not limited to hiragana and may be expressed in various characters such as kanji, Roman letters, and katakana, as well as various symbols.
The voice quality data number (voice quality designation information) is a unique number (K1 to Kn in Fig. 2) for identifying one of a plurality of voice quality parameters (phoneme data processing information) described later. In the present embodiment, by appropriately selecting among these voice quality parameters, synthetic speech of various voice qualities can be obtained from the single kind of phoneme data uttered by a specific speaker (assumed here to be a male speaker), as described in detail later.
The pitch information (pitch designation information) is information for imparting a pitch to the synthetic speech (in other words, for designating the pitch of the synthetic speech); it consists of information designating a note of the scale from "C" to "B" (see Fig. 2).
The text analysis unit 220 analyzes the text information supplied from the input unit 210 and supplies the analysis results to the phoneme data acquisition unit 230, the voice quality change unit 250, and the speech signal generation unit 270. Specifically, when supplied with the text information shown in Fig. 2, the text analysis unit 220 first decomposes the text content information "こんにちわ" into the mora phonemes "こ", "ん", "に", "ち", "わ". A mora is a unit of pronunciation, basically a syllable consisting of one consonant and one vowel.
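The decomposition into moras can be sketched as follows. This is a minimal illustration assuming hiragana input; the small-kana rule is an invented detail for illustration, since the patent does not specify the algorithm used by the text analysis unit 220.

```python
# Minimal sketch: split hiragana text into moras, as the text analysis
# unit 220 is described as doing. The small-kana rule is an assumption
# for illustration; the patent does not specify the algorithm.
SMALL_KANA = set("ゃゅょぁぃぅぇぉ")

def split_moras(text: str) -> list[str]:
    """Split a hiragana string into moras, attaching a small kana
    (e.g. the ゃ of 'しゃ') to the preceding character."""
    moras: list[str] = []
    for ch in text:
        if ch in SMALL_KANA and moras:
            moras[-1] += ch  # a contracted sound forms one mora with its base
        else:
            moras.append(ch)
    return moras

print(split_moras("こんにちわ"))  # → ['こ', 'ん', 'に', 'ち', 'わ']
```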
After decomposing the text content information into mora phonemes in this manner, the text analysis unit (acquisition unit) 220 generates phoneme information (phoneme designation information) designating each phoneme of the synthetic speech and supplies it in turn to the phoneme data acquisition unit 230. Next, the text analysis unit 220 obtains the voice quality data number (for example K3) and the pitch information (for example C) from the text information, supplies the obtained voice quality data number to the voice quality change unit 250, and supplies the obtained pitch information to the speech signal generation unit 270.
The phoneme data acquisition unit (first extraction unit) 230 searches the phoneme database 240 using the phoneme information supplied from the text analysis unit 220 as a key, thereby obtaining the phoneme data corresponding to the phoneme represented by the phoneme information. Fig. 3 illustrates the registered contents of the phoneme database 240. As shown in Fig. 3, the phoneme database (first storage unit) 240 registers a series of phoneme data 1 to m representing the mora phonemes ("あ", "い", …, "ん", etc.) uttered by the male speaker, together with the number of these registered phoneme data (hereinafter, the registered phoneme data count) and the like.
Fig. 4 illustrates the structure of the phoneme data representing a certain phoneme (for example "こ"), and Fig. 5 explains the frame information contained in the phoneme data. Part A of Fig. 5 shows the relation between the speech waveform vw produced when the above male speaker utters a certain phoneme (for example "こ") and the frames FR, while parts B, C, and D of Fig. 5 show the formant analysis results for the 1st frame FR1, the 2nd frame FR2, and the n-th frame FRn, respectively.
As shown in Fig. 4, the phoneme data consists of the 1st to n-th frame information. Each frame information contains the 1st to k-th formant information, obtained by formant analysis of the corresponding frame FR (see Fig. 5), and a voiced/unvoiced flag indicating whether the speech of the frame is voiced or unvoiced (for example, "1" = voiced, "0" = unvoiced).
The 1st to k-th formant information of each frame information consists of pairs of a formant frequency F and a formant amplitude A representing the corresponding formant (see parts B to D of Fig. 5). For example, the 1st to k-th formant information of the 1st frame information consists of the pairs of formant frequency and formant amplitude (F11, A11), (F12, A12), …, (F1k, A1k) (see part B of Fig. 5), and the 1st to k-th formant information of the n-th frame information consists of the pairs (Fn1, An1), (Fn2, An2), …, (Fnk, Ank) (see part D of Fig. 5).
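The frame and formant layout described above can be sketched as simple data structures. The field names and example values below are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass

# Sketch of the phoneme-data layout of Figs. 4 and 5: per-frame formant
# (frequency, amplitude) pairs plus a voiced/unvoiced flag. Field names
# and example values are illustrative assumptions.
@dataclass
class FormantInfo:
    frequency: float  # formant frequency F (Hz)
    amplitude: float  # formant amplitude A

@dataclass
class FrameInfo:
    formants: list    # 1st to k-th formant information of the frame
    voiced: bool      # voiced/unvoiced flag: True = voiced ("1")

@dataclass
class PhonemeData:
    phoneme: str      # e.g. "こ"
    frames: list      # 1st to n-th frame information

frame1 = FrameInfo(
    formants=[FormantInfo(frequency=250.0, amplitude=0.8),   # (F11, A11)
              FormantInfo(frequency=2100.0, amplitude=0.3)], # (F12, A12)
    voiced=True,
)
ko = PhonemeData(phoneme="こ", frames=[frame1])
print(len(ko.frames[0].formants))  # → 2
```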
After obtaining the phoneme data corresponding to each phoneme information supplied from the text analysis unit 220 (the phoneme information representing "こ", "ん", "に", "ち", "わ", etc.), the phoneme data acquisition unit 230 supplies these phoneme data to the voice quality change unit 250.
The voice quality change unit 250 changes the voice quality of the phonemes represented by the phoneme data obtained by the phoneme data acquisition unit 230. Specifically, the voice quality change unit (second extraction unit) 250 first searches the voice quality database (second storage unit) 260 using the voice quality data number supplied from the text analysis unit 220 as a search key, and obtains the corresponding voice quality parameter. The voice quality change unit 250 then changes the voice quality of each phoneme according to the obtained voice quality parameter.
Fig. 6 illustrates the registered contents of the voice quality database 260.
As shown in Fig. 6, the voice quality database (second storage unit) 260 stores, as the information needed to change the voice quality of the phonemes, multiple voice quality parameters 1 to L representing the processing to be applied to the phoneme data, together with registration count information representing the number of these voice quality parameters.
Fig. 7 shows an example of the structure of a voice quality parameter.
As shown in Fig. 7, a voice quality parameter (phoneme data processing information) contains the voice quality data number that identifies the parameter, a gender change flag indicating whether the gender of the synthetic speech is to be changed, and the 1st to k-th formant change information representing how the 1st to k-th formants are to be changed. When the gender change flag is set to "1", the voice quality change unit 250 performs processing that changes the gender of the synthetic speech (hereinafter, gender change processing); when the flag is set to "0", the gender change processing is not performed (described in detail later). In the present embodiment, since the single kind of phoneme data is assumed to be uttered by a male speaker, setting the gender change flag to "1" changes the character of the synthetic speech from male to female; with the flag set to "0", the synthetic speech keeps its male character unchanged.
Each formant change information, in turn, contains basic waveform selection information for selecting the basic waveform (sine wave or the like, described later) of the formant, formant frequency change information representing how the formant frequency is to be changed, and formant amplitude change information representing how the formant amplitude is to be changed.
The formant frequency change information contains values representing the change amount, oscillation speed, and oscillation amplitude of the formant frequency, and the formant amplitude change information likewise contains values representing the change amount, oscillation speed, and oscillation amplitude of the formant amplitude. The change amount, oscillation speed, and oscillation amplitude of the formant frequency and formant amplitude are described in detail later.
Fig. 8 is a flowchart of the voice quality change processing performed by the voice quality change unit 250.
On receiving a voice quality data number from the text analysis unit 220, the voice quality change unit (generation unit) 250 searches the voice quality database 260 using the number as a search key and obtains the corresponding voice quality parameter (step S1). The voice quality change unit 250 then refers to the gender change flag in the obtained voice quality parameter and judges whether the gender of the synthetic speech should be changed, that is, whether the gender change processing should be performed (step S2). When the gender change flag is set to "0", so that the voice quality change unit 250 judges that no gender change should be performed, step S3 is skipped and processing proceeds to step S4; when the flag is set to "1", so that the voice quality change unit 250 judges that the gender change should be performed, processing proceeds to step S3 and the gender change processing is carried out.
Fig. 9 illustrates the mapping function mf, stored in a storage unit (not shown), that is used for the gender change processing; Figs. 10 and 11 show the analysis results of a male and a female uttering the same phoneme (for example "あ"), respectively. In Fig. 9, the horizontal axis of the mapping function mf represents the input frequency (the formant frequency input to the voice quality change unit 250), the vertical axis represents the output frequency (the changed formant frequency output from the voice quality change unit 250), and fmax is the maximum formant frequency that can be input. In the analysis graphs g1 and g2 of Figs. 10 and 11, the horizontal axis represents frequency and the vertical axis represents amplitude.
Comparison of the analysis graphs g1 and g2 of Figs. 10 and 11 shows that the 1st to 4th formant frequencies fm1 to fm4 of the male phoneme are lower than the 1st to 4th formant frequencies ff1 to ff4 of the female phoneme. In the present embodiment, therefore, as shown in Fig. 9, a mapping function mf lying above the line n1 (input frequency = output frequency; see the broken line) is used to change a phoneme with male characteristics into a phoneme with female characteristics (see the solid line).
Specifically, the voice quality change unit 250 uses the mapping function mf of Fig. 9 to shift each formant frequency of the input phoneme data toward higher frequencies. Each formant frequency of the input male phoneme is thereby changed to a formant frequency with female characteristics. Conversely, when the formant frequencies of a female phoneme are input, a mapping function mf' lying below the line n1 (see the dash-dot line in Fig. 9) can be used.
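As an illustration, a mapping function lying above the identity line can be realized with a power-law warp. The specific curve below is an assumption: the patent only defines mf graphically as a function above the line n1, and fmax is given an assumed value here.

```python
# Sketch of the gender-change mapping of Fig. 9. A mapping function that lies
# above the line "output = input" raises every formant frequency, shifting a
# male phoneme toward female characteristics; the power-law curve used here
# is an assumption (the patent defines mf only graphically).
F_MAX = 8000.0  # assumed maximum input formant frequency (fmax)

def map_male_to_female(freq_in: float, warp: float = 0.85) -> float:
    """Warp a formant frequency upward: for 0 < warp < 1 the curve
    fmax * (f/fmax)**warp lies above the identity line on (0, fmax)."""
    x = min(max(freq_in, 0.0), F_MAX) / F_MAX
    return F_MAX * (x ** warp)

def map_female_to_male(freq_in: float, warp: float = 0.85) -> float:
    """Inverse curve, lying below the identity line (mf' in Fig. 9)."""
    x = min(max(freq_in, 0.0), F_MAX) / F_MAX
    return F_MAX * (x ** (1.0 / warp))

f_male = 700.0
f_female = map_male_to_female(f_male)
assert f_female > f_male  # formant frequencies move toward higher values
assert abs(map_female_to_male(f_female) - f_male) < 1e-6  # mf' inverts mf
```

The endpoints 0 and fmax map to themselves, so the warp only reshapes the interior of the frequency range, matching the shape of the curves in Fig. 9.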
After the gender change processing, the voice quality change unit 250 proceeds to step S4 and converts each formant frequency according to the change amount given in the corresponding formant change information. The voice quality change unit 250 then oscillates each converted formant frequency, performing the frequency oscillation processing (step S5).
Fig. 12 illustrates the oscillation table TA, stored in a storage unit (not shown), that is used in the frequency oscillation processing, and Fig. 13 illustrates the relation between the jitter values read from this oscillation table TA and time. For convenience of explanation, the present embodiment assumes that the same oscillation table TA is used to oscillate each of the formant frequencies, but different oscillation tables, with different jitter values and the like, may be used for the individual formant frequencies.
The oscillation table TA is a table in which jitter values are registered in time order. The voice quality change unit 250 controls the speed at which the jitter values registered in the oscillation table TA are read (or the number of jitter values skipped, i.e., not read) according to the oscillation speed of the formant frequency given in the formant change information, and performs the frequency oscillation processing by multiplying each jitter value read by the oscillation amplitude of the formant frequency given in the formant change information. This yields a waveform in which the formant frequency fm oscillates with oscillation speed sp and oscillation amplitude lv, as shown in Fig. 14. In the present embodiment, to reduce the amount of computation for the oscillation of the formant frequency, the oscillation table TA is used as illustrated; however, instead of using the oscillation table TA, the oscillation of the formant frequency may be obtained from a predetermined function.
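The table read-out described above can be sketched as follows. The table contents (one sine period) and the parameter units are assumptions for illustration; the patent does not specify what jitter values the table holds.

```python
import math

# Sketch of the table-driven frequency oscillation of Figs. 12-14: jitter
# values are read from a table in time order, the read step is governed by
# the oscillation speed, and each value read is multiplied by the oscillation
# amplitude before being added to the formant frequency. The table contents
# (one sine period) and parameter units are assumptions.
VIBRATION_TABLE = [math.sin(2 * math.pi * i / 64) for i in range(64)]

def oscillated_frequency(base_freq, n_samples, speed, amplitude):
    """Return n_samples formant-frequency values oscillating around base_freq.
    `speed` is the table read step per sample (a larger step skips entries,
    i.e. oscillates faster); `amplitude` scales the jitter values (Hz)."""
    out = []
    pos = 0.0
    for _ in range(n_samples):
        jitter = VIBRATION_TABLE[int(pos) % len(VIBRATION_TABLE)]
        out.append(base_freq + amplitude * jitter)
        pos += speed
    return out

freqs = oscillated_frequency(base_freq=500.0, n_samples=128, speed=2.0, amplitude=10.0)
print(round(min(freqs)), round(max(freqs)))  # oscillates within 500 ± 10 Hz
```

Reading the table instead of evaluating a function per sample is exactly the computation-saving trade-off the embodiment describes.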
After the frequency oscillation processing, the voice quality change unit 250 proceeds to step S6 and converts each formant amplitude according to the change amount given in the corresponding formant change information. The voice quality change unit 250 then oscillates each converted formant amplitude, performing the amplitude oscillation processing (step S7), and ends the processing. The oscillation table used in the amplitude oscillation processing, and the operation of oscillating each formant amplitude using it, can be explained in much the same way as the oscillation of the formant frequencies described above, so the explanation is omitted here. The formant amplitudes may be oscillated using the same oscillation table as the formant frequencies, or using a different one.
After changing the voice quality of each phoneme according to the obtained voice quality parameter (phoneme data processing information), that is, after processing the phoneme data, the voice quality change unit (generation unit) 250 supplies the basic waveform selection information, the formant frequencies, and the formant amplitudes of each formant to the speech signal generation unit 270.
On receiving the basic-waveform selection information supplied from the voice-quality changing unit 250, the voice signal generating unit 270 acquires from the waveform database 280 the waveform data indicated by that selection information. The basic waveform indicated by the selection information may differ from formant to formant: for example, the basic waveform of a low-frequency formant may be a sine wave, while the basic waveform of a high-frequency formant, which expresses the speaker's individuality, may be a non-sinusoidal waveform (for example a square wave or a sawtooth wave). Of course, a single basic waveform (for example a sine wave) may be used throughout instead of multiple basic waveforms.
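The per-formant waveform choice can be illustrated with a small sketch. The waveform shapes and the selection table below are assumptions for illustration; the patent only states that the basic waveform may differ per formant and names sine, square, and sawtooth as possibilities.

```python
import math

def basic_waveform(kind, n=64):
    """One cycle of a named basic waveform, sampled at n points."""
    if kind == "sine":
        return [math.sin(2 * math.pi * i / n) for i in range(n)]
    if kind == "square":
        return [1.0 if i < n // 2 else -1.0 for i in range(n)]
    if kind == "sawtooth":
        return [2.0 * i / n - 1.0 for i in range(n)]
    raise ValueError(kind)

# Hypothetical selection: a sine for the low formants, a sawtooth (richer in
# harmonics) for a high formant carrying the speaker's individuality.
selection = {"F1": "sine", "F2": "sine", "F3": "sawtooth"}
waves = {name: basic_waveform(kind) for name, kind in selection.items()}
```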
Having selected each piece of waveform data in this manner, the voice signal generating unit (generation unit) 270 generates the formant waveform of each formant from the selected waveform data, the formant frequencies, and the formant amplitudes. The voice signal generating unit (generation unit) 270 then adds the formant waveforms together to generate a synthetic speech signal. The voice signal generating unit 270 then performs processing that gives the generated synthetic speech signal a pitch (hereinafter called pitch-giving processing), the pitch being that indicated by the pitch information (pitch designation information) supplied from the text analysis unit 220.
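The additive step can be sketched compactly. The sample rate, function names, and the use of a plain sine for every formant are assumptions made for brevity; as noted above, the patent allows a different basic waveform per formant.

```python
import math

SR = 8000  # assumed sample rate in Hz

def formant_wave(freq, amp, n_samples, sr=SR):
    """One formant rendered as an amplitude-scaled sine at the formant frequency."""
    return [amp * math.sin(2 * math.pi * freq * i / sr) for i in range(n_samples)]

def synthesize(formants, n_samples):
    """Add the per-formant waveforms sample by sample (the additive operation)."""
    total = [0.0] * n_samples
    for freq, amp in formants:
        wave = formant_wave(freq, amp, n_samples)
        total = [t + w for t, w in zip(total, wave)]
    return total

# (frequency in Hz, amplitude) pairs for three illustrative formants.
signal = synthesize([(500, 1.0), (1500, 0.5), (2500, 0.25)], n_samples=160)
```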
Figure 15 illustrates the pitch-giving processing. For ease of explanation, Figure 15 shows the case in which a pitch is given to a sinusoidal synthetic speech signal.
The voice signal generating unit 270 calculates the period of the temporal envelope tp shown in Figure 15 from the pitch information supplied from the text analysis unit 220. The pitch of the synthetic speech depends on the period of the temporal envelope tp: the longer the period, the lower the pitch, and the shorter the period, the higher the pitch. Having obtained the period of the temporal envelope tp in this manner, the voice signal generating unit 270 repeatedly multiplies the synthetic speech signal by the temporal envelope tp at the obtained period, thereby obtaining a synthetic speech signal that has been given the prescribed pitch.
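The envelope multiplication can be sketched as follows. The envelope shape (a simple linear decay per period) and the numbers are assumptions; the patent specifies only that an envelope is repeated at the pitch period and multiplied into the signal, and that a longer period yields a lower pitch.

```python
def apply_pitch(signal, envelope_period):
    """Multiply the signal by a repeating decaying envelope; the repetition
    period of the envelope sets the perceived pitch (longer period, lower pitch)."""
    out = []
    for i, s in enumerate(signal):
        phase = (i % envelope_period) / envelope_period
        envelope = 1.0 - phase  # assumed linear decay within each pitch period
        out.append(s * envelope)
    return out

# At an assumed 8 kHz sample rate, an 80-sample envelope period
# corresponds to a 100 Hz pitch.
pitched = apply_pitch([1.0] * 160, envelope_period=80)
```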
Figure 16 shows an example of the formant waveform of a particular formant after the voice-quality change processing and the pitch-giving processing have been performed. As shown in Figure 16, the processing related to the voice-quality change (for example the oscillation of the formant frequency and the formant amplitude) can be controlled per frame (in frame units). After obtaining the synthetic speech signal given the prescribed pitch as described above, the voice signal generating unit (generation unit) 270 outputs it to the outside as synthetic speech. The user can thus confirm, as synthetic speech of the desired voice quality, the content of the text (such as "こんにちわ") input to the speech synthesis device 100.
As described above, the speech synthesis device of the present embodiment can perform various kinds of voice-quality change processing on a per-formant basis in the voice-quality changing unit, so even when only one kind of phoneme data is stored (that is, phoneme data of only one specific speaker), speech of various voice qualities can be synthesized.
B. Other
In the embodiment described above, the text information input to the speech synthesis device 100 contains pitch information (see Fig. 2), but the text information may also omit pitch information. For that case, alternative pitch information (see the bracketed part of Fig. 3) may be registered in the phoneme database 240 in advance, and when the text information contains no pitch information, the pitch indicated by this alternative pitch information (for example C, with vibrato) may be used as the pitch of the synthetic speech. Besides the alternative pitch information, the number of pieces of formant information in each frame shown in Fig. 4 (formant-count information; see the bracketed part of Fig. 3) may also be registered in the phoneme database 240 in advance.
The various functions of the speech synthesis device 100 described above are realized by a CPU (or DSP) executing a program stored in a memory such as a ROM. This program may be recorded on a recording medium such as a CD-ROM and distributed, or may be distributed via a communication network such as the Internet.
In the above description, the voice-quality change processing is performed on the basis of voice-quality data obtained from the text information. Alternatively, keywords may be automatically extracted from the input text information, and the extracted keywords may be used to consult a database, set in advance in the electronic device, that associates keywords with voice qualities, thereby automatically determining the voice quality suited to the text information.
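The keyword-based determination can be sketched with a toy lookup. The table entries, keyword strings, and function name are hypothetical; the patent says only that a preset keyword-to-voice-quality database is consulted.

```python
# Hypothetical keyword -> voice-quality table; entries are illustrative only.
KEYWORD_TO_VOICE_QUALITY = {
    "congratulations": "bright",
    "condolence": "soft",
    "warning": "firm",
}

def infer_voice_quality(text, default="neutral"):
    """Pick a voice quality by scanning the text for registered keywords."""
    lowered = text.lower()
    for keyword, quality in KEYWORD_TO_VOICE_QUALITY.items():
        if keyword in lowered:
            return quality
    return default

result = infer_voice_quality("A WARNING was issued this morning")
```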

Claims (7)

1. A speech synthesis device, characterized by comprising:
an acquiring unit that acquires, from text information input to the speech synthesis device, phoneme designation information for designating a phoneme of synthetic speech and voice-quality designation information for designating the voice quality of the synthetic speech;
a first storage unit that stores a plurality of phoneme data each representing a phoneme;
a second storage unit that stores multiple kinds of phoneme data processing information, each being information for changing the voice quality of a phoneme and representing the processing to be applied to the phoneme data;
a first extracting unit that extracts, from the first storage unit, the phoneme data corresponding to the phoneme indicated by the phoneme designation information;
a second extracting unit that extracts, from the second storage unit, the phoneme data processing information corresponding to the voice quality indicated by the voice-quality designation information; and
a generation unit that processes the extracted phoneme data according to the extracted phoneme data processing information and generates the synthetic speech.
2. The speech synthesis device according to claim 1, characterized in that
each phoneme data includes formant information representing the formants of the phoneme,
the phoneme data processing information includes formant modification information representing the changes to be made to the formants, and
the generation unit changes the formant information according to the formant modification information and then adds together the signal waveforms generated from the changed formant information, thereby generating the synthetic speech.
3. The speech synthesis device according to claim 2, characterized in that
the formant information consists of paired formant frequencies and formant amplitudes,
the formant modification information includes formant frequency modification information representing the changes to be made to the formant frequencies and formant amplitude modification information representing the changes to be made to the formant amplitudes, and
the generation unit changes each formant frequency and each formant amplitude of the phoneme represented by the phoneme data according to the formant frequency modification information and the formant amplitude modification information, respectively, thereby obtaining the changed formant information.
4. The speech synthesis device according to claim 2 or 3, characterized in that
the acquiring unit acquires from the text information, in addition to the phoneme designation information and the voice-quality designation information, pitch designation information for designating the pitch of the synthetic speech, and
the generation unit obtains the synthetic speech by giving the pitch indicated by the pitch designation information to a composite signal waveform, the composite signal waveform being obtained by adding together the signal waveforms generated from the changed formant information.
5. The speech synthesis device according to claim 1, characterized in that the text information includes the voice-quality designation information, and the acquiring unit acquires the voice-quality designation information from the text information.
6. The speech synthesis device according to claim 1, characterized in that the acquiring unit extracts keywords from the text information and determines, from the extracted keywords, the voice quality suited to the text information.
7. A speech synthesis method, characterized by comprising the following steps:
an acquiring step of acquiring, from text information input to a speech synthesis device, phoneme designation information for designating a phoneme of synthetic speech and voice-quality designation information for designating the voice quality of the synthetic speech;
a first extraction step of extracting, from a first storage unit that stores a plurality of phoneme data each representing a phoneme, the phoneme data corresponding to the phoneme indicated by the phoneme designation information;
a second extraction step of extracting, from a second storage unit, the phoneme data processing information corresponding to the voice quality indicated by the voice-quality designation information, the second storage unit storing multiple kinds of phoneme data processing information, each being information for changing the voice quality of a phoneme and representing the processing to be applied to the phoneme data; and
a generation step of processing the extracted phoneme data according to the extracted phoneme data processing information and generating the synthetic speech.
CNB2005100074542A 2004-02-20 2005-02-21 Voice operation device, method and recording medium for recording voice operation program Expired - Fee Related CN100337104C (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2004044852A JP2005234337A (en) 2004-02-20 2004-02-20 Device, method, and program for speech synthesis
JP2004044852 2004-02-20

Publications (2)

Publication Number Publication Date
CN1658281A true CN1658281A (en) 2005-08-24
CN100337104C CN100337104C (en) 2007-09-12

Family

ID=35007713

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005100074542A Expired - Fee Related CN100337104C (en) 2004-02-20 2005-02-21 Voice operation device, method and recording medium for recording voice operation program

Country Status (4)

Country Link
JP (1) JP2005234337A (en)
KR (1) KR100759172B1 (en)
CN (1) CN100337104C (en)
TW (1) TW200535235A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111542875A (en) * 2018-01-11 2020-08-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5510852B2 (en) * 2010-07-20 2014-06-04 独立行政法人産業技術総合研究所 Singing voice synthesis system reflecting voice color change and singing voice synthesis method reflecting voice color change
JP7309155B2 (en) * 2019-01-10 2023-07-18 グリー株式会社 Computer program, server device, terminal device and audio signal processing method

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3633963B2 (en) * 1994-09-14 2005-03-30 株式会社河合楽器製作所 Musical sound generating apparatus and musical sound generating method
DE69617480T2 (en) * 1995-01-13 2002-10-24 Yamaha Corp Device for processing a digital sound signal
JPH1078952A (en) * 1996-07-29 1998-03-24 Internatl Business Mach Corp <Ibm> Voice synthesizing method and device therefor and hypertext control method and controller
CN1113330C (en) * 1997-08-15 2003-07-02 英业达股份有限公司 Phoneme regulating method for phoneme synthesis
US6823309B1 (en) * 1999-03-25 2004-11-23 Matsushita Electric Industrial Co., Ltd. Speech synthesizing system and method for modifying prosody based on match to database
JP2000339137A (en) * 1999-05-31 2000-12-08 Sanyo Electric Co Ltd Electronic mail receiving system
JP2002268699A (en) * 2001-03-09 2002-09-20 Sony Corp Device and method for voice synthesis, program, and recording medium
JP3732793B2 (en) * 2001-03-26 2006-01-11 株式会社東芝 Speech synthesis method, speech synthesis apparatus, and recording medium
JP2003031936A (en) * 2001-07-19 2003-01-31 Murata Mach Ltd Printed board
JP2003295882A (en) * 2002-04-02 2003-10-15 Canon Inc Text structure for speech synthesis, speech synthesizing method, speech synthesizer and computer program therefor
EP1630791A4 (en) * 2003-06-05 2008-05-28 Kenwood Corp Speech synthesis device, speech synthesis method, and program
KR20050041749A (en) * 2003-10-31 2005-05-04 한국전자통신연구원 Voice synthesis apparatus depending on domain and speaker by using broadcasting voice data, method for forming voice synthesis database and voice synthesis service system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111542875A (en) * 2018-01-11 2020-08-14 雅马哈株式会社 Speech synthesis method, speech synthesis device, and program
CN111542875B (en) * 2018-01-11 2023-08-11 雅马哈株式会社 Voice synthesis method, voice synthesis device and storage medium

Also Published As

Publication number Publication date
JP2005234337A (en) 2005-09-02
TWI300551B (en) 2008-09-01
KR20060043023A (en) 2006-05-15
KR100759172B1 (en) 2007-09-14
CN100337104C (en) 2007-09-12
TW200535235A (en) 2005-11-01

Similar Documents

Publication Publication Date Title
CN1108603C (en) Voice synthesis method and device, and computer ready-read medium with recoding voice synthesizing program
CN1169115C (en) Prosodic databases holding fundamental frequency templates for use in speech synthesis
CN101872615B (en) System and method for distributed text-to-speech synthesis and intelligibility
CN1117344C (en) Voice synthetic method and device, dictionary constructional method and computer ready-read medium
US20080126093A1 (en) Method, Apparatus and Computer Program Product for Providing a Language Based Interactive Multimedia System
JP6806662B2 (en) Speech synthesis system, statistical model generator, speech synthesizer, speech synthesis method
CN100337104C (en) Voice operation device, method and recording medium for recording voice operation program
CN1801321A (en) System and method for text-to-speech
CN1201284C (en) Rapid decoding method for voice identifying system
CN1108572C (en) Mechanical Chinese to japanese two-way translating machine
CN1811912A (en) Minor sound base phonetic synthesis method
CN1032391C (en) Chinese character-phonetics transfer method and system edited based on waveform
CN1661673A (en) Speech synthesizer,method and recording medium for speech recording synthetic program
CN1787072A (en) Method for synthesizing pronunciation based on rhythm model and parameter selecting voice
CN1664922A (en) Pitch model production device, method and pitch model production program
CN1455388A (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1190773C (en) Voice identifying system and compression method of characteristic vector set for voice identifying system
CN1251175C (en) An audio synthesis method
CN1238805C (en) Method and apparatus for compressing voice library
CN1607576A (en) A speech recognition system
CN1979636B (en) Method for converting phonetic symbol to speech
CN1647152A (en) Method for synthesizing speech
KR20060056406A (en) Improvements to an utterance waveform corpus
CN1828723A (en) Dispersion type language processing system and its method for outputting agency information
CN1259648C (en) Phonetic recognition system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20070912

Termination date: 20150221

EXPY Termination of patent right or utility model