CN100524456C - Singing voice synthesizing method - Google Patents

Singing voice synthesizing method

Info

Publication number
CN100524456C
Authority
CN
China
Prior art keywords
data
spectrum
sound
synthesis unit
spectral amplitude
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB031275516A
Other languages
Chinese (zh)
Other versions
CN1581290A (en)
Inventor
Hideki Kenmochi
Jordi Bonada
Alex Loscos
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yamaha Corp
Original Assignee
Yamaha Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yamaha Corp filed Critical Yamaha Corp
Priority to CNB031275516A priority Critical patent/CN100524456C/en
Publication of CN1581290A publication Critical patent/CN1581290A/en
Application granted
Publication of CN100524456C publication Critical patent/CN100524456C/en


Abstract

A frequency analysis is performed on an audio waveform, composed of phonemes and phoneme chains, for each sound synthesis unit, and a frequency spectrum is detected. Local peaks are detected on the frequency spectrum, and a spectral distribution region containing each local peak is designated. For each spectral distribution region, amplitude spectrum data representing the distribution of the amplitude spectrum along the frequency axis are generated, as are phase spectrum data representing the distribution of the phase spectrum along the frequency axis. Based on an input tone color and pitch, the amplitude spectrum data are adjusted by moving them along the frequency axis, and the phase spectrum data are adjusted in accordance with the adjustment of the amplitude spectrum data. The spectral intensities are matched to a spectral envelope corresponding to the desired tone color and pitch. The adjusted amplitude spectrum data and adjusted phase spectrum data are then converted into a synthesized audio signal in the time domain.

Description

Singing voice synthesis method and apparatus
This application is based on Japanese patent application No. 2002-052006, filed February 27, 2002, which is incorporated herein by reference.
Technical field
The present invention relates to a singing voice synthesis method, a singing voice synthesis apparatus, and a storage medium, each employing phase vocoder technology.
Background art
Conventionally, singing voice synthesis using the spectral modeling synthesis (SMS) technique taught in the specification of U.S. Patent No. 5,029,509 has been common (see, for example, Japanese Patent No. 2906970).
Fig. 21 is a flow diagram of a singing voice synthesis process employing the technique described in Japanese Patent No. 2906970. At step S1, a singing voice signal is input; at step S2, the input singing voice signal is subjected to SMS analysis processing and segment division processing.
In the SMS analysis processing, the input singing voice signal is divided into a series of time frames, and for each frame a set of magnitude spectrum data is generated by fast Fourier transform (FFT) or the like. Line spectra are extracted corresponding to the peaks obtained from each frame's set of magnitude spectrum data. The data representing the amplitudes and frequencies of these line spectra are called the deterministic component. The spectrum of the deterministic component is then subtracted from the spectrum of the input voice waveform to obtain a residual spectrum. This residual spectrum is called the stochastic component.
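As a rough illustration of the per-frame analysis described above, the following sketch detects the local peaks of an FFT magnitude spectrum, from which the line spectra of the deterministic component would be taken. The Hanning window and the -60 dB floor are illustrative assumptions, not values from the patent.

```python
import numpy as np

def detect_local_peaks(frame, sample_rate, threshold_db=-60.0):
    """FFT one time frame and return (freq_hz, magnitude) pairs at local peaks."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.fft.rfft(windowed)
    mags = np.abs(spectrum)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    # ignore anything far below the strongest component
    floor = mags.max() * 10.0 ** (threshold_db / 20.0)
    return [
        (freqs[i], mags[i])
        for i in range(1, len(mags) - 1)
        if mags[i] > mags[i - 1] and mags[i] > mags[i + 1] and mags[i] > floor
    ]
```

In SMS, everything these peaks represent would be resynthesized as sinusoids, and the remainder of the spectrum treated as noise.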
In the segment division processing, the deterministic component data and the stochastic component data obtained by the SMS analysis are separated on a per-sound-synthesis-unit basis. A sound synthesis unit is a structural element of the lyrics. For example, a sound synthesis unit consists of a single phoneme such as [a] or [i], or of a phoneme chain (a concatenation of phonemes) such as [a_i] or [a_p].
In a sound synthesis unit database DB, deterministic component data and stochastic component data are stored for each sound synthesis unit.
In singing voice synthesis, lyrics data and melody data are input at step S3. At step S4, the phoneme sequence represented by the lyrics data is subjected to a phoneme sequence/sound synthesis unit conversion, which divides the phoneme sequence into sound synthesis units. For each sound synthesis unit, the deterministic component data and stochastic component data are then read from the database DB as sound synthesis unit data.
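The phoneme sequence/sound synthesis unit conversion of step S4 might be sketched as follows. The rule used here (a phoneme chain for each adjacent pair, plus a stationary unit after each vowel) and the vowel set are assumptions for illustration; note that it emits a [t_a] chain which the example list in the text omits.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory

def to_synthesis_units(phonemes):
    """Split a phoneme sequence into sound synthesis units: '#' marks the
    silence at both ends, each adjacent pair becomes a phoneme chain, and
    each vowel additionally yields a stationary single-phoneme unit."""
    seq = ["#"] + list(phonemes) + ["#"]
    units = []
    for prev, cur in zip(seq, seq[1:]):
        if prev == "#":
            units.append("#" + cur)          # silence-to-phoneme chain
        elif cur == "#":
            units.append(prev + "#")         # phoneme-to-silence chain
        else:
            units.append(prev + "_" + cur)   # phoneme-to-phoneme chain
        if cur in VOWELS:
            units.append(cur)                # stationary vowel unit
    return units
```

For [saita] this yields ['#s', 's_a', 'a', 'a_i', 'i', 'i_t', 't_a', 'a', 'a#'], each of which would key a lookup in the database DB.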
At step S5, the sound synthesis unit data (deterministic component data and stochastic component data) read from the database DB are subjected to sound synthesis unit connection processing, whereby the sound synthesis unit data are connected in a given pronunciation order. At step S6, new deterministic component data suited to the pitch specified by the melody data are generated for each sound synthesis unit on the basis of the original deterministic component data. At this time, if the spectral intensities are adjusted so as to follow the spectral envelope of the deterministic component data processed at step S5, the timbre of the voice signal input at step S1 can be reproduced in the new deterministic component data.
At step S7, within each sound synthesis unit, the deterministic component data generated at step S6 are added to the stochastic component data processed at step S5. Then, at step S8, the summed data of each sound synthesis unit are converted into a synthesized voice signal in the time domain by inverse fast Fourier transform or the like.
For example, to synthesize the singing voice [saita], the sound synthesis units corresponding to [#s], [s_a], [a], [a_i], [i], [i_t], [a] and [a#] (where # denotes silence) are read from the database DB and connected to one another at step S5. Then, at step S6, deterministic component data corresponding to the input pitch are generated in each sound synthesis unit. After the addition of step S7 and the conversion of step S8, the singing voice signal for [saita] is obtained.
According to the above prior art, the coherence between the deterministic component and the stochastic component is unsatisfactory. More precisely, because the pitch of the voice signal input at step S1 is changed according to the input pitch at step S6, while the stochastic component data are added to the deterministic component data at step S7 without such pitch conversion, the synthesized singing tends to sound artificial. For example, during a long sound such as the [i] in [saita], the sound produced by the stochastic component data is heard as if detached.
To eliminate this tendency, the present inventors have proposed adjusting the amplitude spectral distribution in the weak region represented by the stochastic component data according to the previously input pitch (see Japanese patent application No. 2000-401041). However, if the stochastic component data are adjusted in this manner, the detached and reverberant character of the stochastic component becomes rather difficult to control fully.
Likewise, in the SMS technique, fricatives and plosives are very difficult to analyze, and synthesized sounds of these kinds become quite artificial. The SMS technique rests entirely on the assumption that a voice signal is composed of a deterministic component and a stochastic component; its fundamental problem is that a voice signal cannot, in fact, be cleanly divided into a deterministic component and a stochastic component.
On the other hand, a phase vocoder technique is described in the specification of U.S. Patent No. 3,360,610. In the phase vocoder technique, a signal was formerly represented by the outputs of a filter bank, and is nowadays represented by the frequency bins of a fast Fourier transform of the input signal. At present, the phase vocoder technique is widely used for time stretching (extending or compressing the time axis without changing the original pitch), pitch shifting (changing the pitch without changing the time length), and so on. In such pitch-shifting techniques, the FFT result of the input signal is not used in its raw form. It is known that pitch shifting is accomplished by dividing the FFT spectrum into a number of spectral distributions around the local peaks, and then moving each spectral distribution along the frequency axis within its spectral distribution region (see, for example, J. Laroche and M. Dolson, "New Phase-Vocoder Techniques for Real-Time Pitch Shifting, Chorusing, Harmonizing, and Other Exotic Audio Modifications," J. Audio Eng. Soc., Vol. 47, No. 11, 1999). However, no clear connection has been drawn between this pitch-shifting technique and singing voice synthesis.
Summary of the invention
It is an object of the present invention to provide a novel singing voice synthesis method, apparatus, and storage medium which realize natural, high-quality sound synthesis by using phase vocoder technology.
According to one aspect of the present invention, there is provided a singing voice synthesis method comprising the steps of: (a) detecting a frequency spectrum by analyzing the frequency of an audio waveform corresponding to a sound synthesis unit of the sound to be synthesized; (b) detecting a plurality of local peaks of spectral intensity on the frequency spectrum; (c) designating, for each of the plurality of local peaks, a spectral distribution region containing the local peak and the spectra before and after it on the frequency spectrum, and generating, for each spectral distribution region, amplitude spectrum data representing the amplitude spectrum distribution along the frequency axis; (d) generating, for each spectral distribution region, phase spectrum data representing the phase spectrum distribution along the frequency axis; (e) designating a pitch for the sound to be synthesized; (f) adjusting the amplitude spectrum data for each spectral distribution region so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; (g) adjusting, for each spectral distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and (h) converting the adjusted amplitude spectrum data and phase spectrum data into a synthesized audio signal in the time domain.
According to the first singing voice synthesis method, the audio waveform corresponding to a sound synthesis unit (a phoneme or phoneme chain) is frequency-analyzed and a frequency spectrum is detected. Amplitude spectrum data and phase spectrum data are then generated on the basis of the frequency spectrum. After the desired pitch is designated, the amplitude spectrum data and phase spectrum data are adjusted according to the designated pitch, and a synthesized audio signal in the time domain is generated on the basis of the adjusted amplitude spectrum data and phase spectrum data. Since the frequency analysis result of the audio waveform need not be divided into a deterministic component and a stochastic component during synthesis, the stochastic component cannot become detached or reverberant. A natural synthesized sound is therefore obtained; moreover, fricatives and plosives also yield natural synthesized sound.
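Under heavy simplifying assumptions, steps (a) through (h) of the first method can be sketched for a single frame as follows. A real implementation of step (g) would also account for frame-to-frame phase advance; here the phase of each bin is simply carried along with its region, and all parameter choices are illustrative.

```python
import numpy as np

def synthesize_frame(waveform, sample_rate, target_pitch_hz, source_pitch_hz):
    """Steps (a)-(h) for one frame: FFT analysis, per-peak-region shift of
    the amplitude and phase spectra along the frequency axis, inverse FFT."""
    ratio = target_pitch_hz / source_pitch_hz
    spectrum = np.fft.rfft(waveform * np.hanning(len(waveform)))    # (a)
    mags, phases = np.abs(spectrum), np.angle(spectrum)             # (c), (d)
    n = len(mags)
    peaks = [i for i in range(1, n - 1)
             if mags[i] > mags[i - 1] and mags[i] > mags[i + 1]]    # (b)
    # region boundaries at the midpoints between adjacent peaks
    bounds = [0] + [(a + b) // 2 for a, b in zip(peaks, peaks[1:])] + [n]
    new_mags = np.zeros(n)
    new_phases = np.zeros(n)
    for peak, lo, hi in zip(peaks, bounds, bounds[1:]):
        shift = int(round(peak * (ratio - 1.0)))                    # (e), (f)
        for i in range(lo, hi):
            if 0 <= i + shift < n:
                new_mags[i + shift] = mags[i]
                new_phases[i + shift] = phases[i]                   # (g), simplified
    return np.fft.irfft(new_mags * np.exp(1j * new_phases), len(waveform))  # (h)
```

Shifting whole peak regions, rather than scaling the frequency axis, keeps the shape of each harmonic's spectral distribution intact, which is the point of the peak-region formulation.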
According to another aspect of the present invention, there is provided a singing voice synthesis method comprising the steps of: (a) obtaining amplitude spectrum data and phase spectrum data corresponding to a sound synthesis unit of the sound to be synthesized, wherein the amplitude spectrum data represent, for each spectral distribution region, the amplitude spectrum distribution along the frequency axis, each spectral distribution region containing one of a plurality of local peaks of spectral intensity together with the spectra before and after it on a frequency spectrum obtained by frequency analysis of the audio waveform of the sound synthesis unit, and wherein the phase spectrum data represent, for each spectral distribution region, the phase spectrum distribution along the frequency axis; (b) designating a pitch for the sound to be synthesized; (c) adjusting the amplitude spectrum data for each spectral distribution region so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; (d) adjusting, for each spectral distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and (e) converting the adjusted amplitude spectrum data and adjusted phase spectrum data into a synthesized audio signal in the time domain.
The second singing voice synthesis method corresponds to the case where, after the phase spectrum data generating step, the amplitude spectrum data and phase spectrum data are stored in a database for each sound synthesis unit, or the case where the processing after the phase spectrum data generating step is performed by a separate apparatus. Specifically, in the second singing voice synthesis method, the obtaining step acquires the amplitude spectrum data and phase spectrum data corresponding to the sound synthesis unit of the sound to be synthesized from the other apparatus or from the database, and the processing from the pitch designating step onward is the same as in the first singing voice synthesis method. Therefore, according to the second singing voice synthesis method, the same natural synthesized sound as with the first singing voice synthesis method is obtained.
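A minimal sketch of the storage arrangement implied by the second method, with amplitude and phase spectrum data held per sound synthesis unit; all field names, the in-memory dict standing in for the database, and the region representation are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class UnitSpectra:
    """Spectral data stored for one sound synthesis unit."""
    unit: str                              # e.g. "a" or "a_i"
    mags: List[np.ndarray]                 # amplitude spectrum, one array per frame
    phases: List[np.ndarray]               # phase spectrum, one array per frame
    regions: List[List[Tuple[int, int]]]   # per frame: (lo, hi) bin bounds per peak region

database: Dict[str, UnitSpectra] = {}

def store_unit(entry: UnitSpectra) -> None:
    database[entry.unit] = entry

def fetch_unit(name: str) -> UnitSpectra:
    """The 'obtaining' step of the second method: look up precomputed spectra."""
    return database[name]
```

Keeping the peak-region boundaries alongside the spectra means the pitch-adjustment steps can run without repeating the analysis.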
According to a further aspect of the present invention, there is provided a singing voice synthesis apparatus comprising: designating means for designating a sound synthesis unit and a pitch for each sound to be synthesized; reading means for reading, from a sound synthesis unit database, audio waveform data representing the waveform corresponding to the sound synthesis unit; first detecting means for detecting a frequency spectrum by analyzing the frequency of the audio waveform represented by the audio waveform data; second detecting means for detecting a plurality of local peaks of spectral intensity on the frequency spectrum; first generating means for designating, for each of the plurality of local peaks, a spectral distribution region containing the local peak and the spectra before and after it on the frequency spectrum, and for generating, for each spectral distribution region, amplitude spectrum data representing the amplitude spectrum distribution along the frequency axis; second generating means for generating, for each spectral distribution region, phase spectrum data representing the phase spectrum distribution along the frequency axis; first adjusting means for adjusting the amplitude spectrum data for each spectral distribution region so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; second adjusting means for adjusting, for each spectral distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and converting means for converting the adjusted amplitude spectrum data and phase spectrum data into a synthesized audio signal in the time domain.
According to a further aspect of the present invention, there is provided a singing voice synthesis apparatus comprising: designating means for designating a sound synthesis unit and a pitch for each sound to be synthesized; reading means for reading, from a sound synthesis unit database, amplitude spectrum data and phase spectrum data corresponding to the sound synthesis unit, wherein the amplitude spectrum data represent, for each spectral distribution region, the amplitude spectrum distribution along the frequency axis, each spectral distribution region containing one of a plurality of local peaks of spectral intensity together with the spectra before and after it on a frequency spectrum obtained by frequency analysis of the audio waveform of the sound synthesis unit, and wherein the phase spectrum data represent, for each spectral distribution region, the phase spectrum distribution along the frequency axis; first adjusting means for adjusting the amplitude spectrum data for each spectral distribution region so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; second adjusting means for adjusting, for each spectral distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and converting means for converting the adjusted amplitude spectrum data and phase spectrum data into a synthesized audio signal in the time domain.
The first and second singing voice synthesis apparatuses carry out the first and second singing voice synthesis methods described above using a sound synthesis unit database, and thereby obtain natural synthesized singing.
According to a further aspect of the present invention, there is provided a singing voice synthesis apparatus comprising: designating means for designating a sound synthesis unit and a pitch for each of the sounds to be synthesized in sequence; reading means for reading, from a sound synthesis unit database, audio waveform data corresponding to each sound synthesis unit designated by the designating means; first detecting means for detecting a frequency spectrum by analyzing the frequency of each audio waveform; second detecting means for detecting a plurality of local peaks of spectral intensity on the frequency spectrum of each audio waveform; first generating means for designating, for each of the plurality of local peaks of each sound synthesis unit, a spectral distribution region containing the local peak and the spectra before and after it on the frequency spectrum, and for generating, for each spectral distribution region, amplitude spectrum data representing the amplitude spectrum distribution along the frequency axis; second generating means for generating, for each spectral distribution region of each sound synthesis unit, phase spectrum data representing the phase spectrum distribution along the frequency axis; first adjusting means for adjusting the amplitude spectrum data for each spectral distribution region of each sound synthesis unit so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; second adjusting means for adjusting, for each spectral distribution region of each sound synthesis unit, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; first connecting means for connecting the adjusted amplitude spectrum data of the corresponding successive sound synthesis units according to the pronunciation order of the sounds to be synthesized, wherein the spectral intensities at the connection point of successive sound synthesis units are adjusted to coincide or nearly coincide with each other; second connecting means for connecting the adjusted phase spectrum data of the corresponding successive sound synthesis units according to the pronunciation order of the sounds to be synthesized, wherein the phases at the connection point of successive sound synthesis units are adjusted to coincide or nearly coincide with each other; and converting means for converting the connected amplitude spectrum data and connected phase spectrum data into a synthesized audio signal in the time domain.
According to another aspect of the present invention, there is provided a singing voice synthesis apparatus comprising: designating means for designating a sound synthesis unit and a pitch for each of the sounds to be synthesized in sequence; reading means for reading, from a sound synthesis unit database, amplitude spectrum data and phase spectrum data corresponding to each sound synthesis unit designated by the designating means, wherein the amplitude spectrum data represent, for each spectral distribution region, the amplitude spectrum distribution along the frequency axis, each spectral distribution region containing one of a plurality of local peaks of spectral intensity together with the spectra before and after it on a frequency spectrum obtained by frequency analysis of the audio waveform of the sound synthesis unit, and wherein the phase spectrum data represent, for each spectral distribution region, the phase spectrum distribution along the frequency axis; first adjusting means for adjusting the amplitude spectrum data for each spectral distribution region of each sound synthesis unit so as to move the amplitude spectrum distribution represented by the amplitude spectrum data along the frequency axis according to the pitch; second adjusting means for adjusting, for each spectral distribution region of each sound synthesis unit, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; first connecting means for connecting the adjusted amplitude spectrum data of the corresponding successive sound synthesis units according to the pronunciation order of the sounds to be synthesized, wherein the spectral intensities at the connection point of successive sound synthesis units are adjusted to coincide or nearly coincide with each other; second connecting means for connecting the adjusted phase spectrum data of the corresponding successive sound synthesis units according to the pronunciation order of the sounds to be synthesized, wherein the phases at the connection point of successive sound synthesis units are adjusted to coincide or nearly coincide with each other; and converting means for converting the connected amplitude spectrum data and connected phase spectrum data into a synthesized audio signal in the time domain.
The third and fourth singing voice synthesis apparatuses carry out the first or second singing voice synthesis method described above using a sound synthesis unit database, and thereby obtain natural synthesized singing. In addition, in the process of connecting the sound synthesis units in a given pronunciation order, when the adjusted amplitude spectrum data and phase spectrum data are connected, the spectral intensities and phases at the connection point of successive sound synthesis units are adjusted to coincide or nearly coincide with each other; this prevents noise from being produced when the synthesized sound is generated.
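The level and phase matching at connection points might look like the following much-simplified sketch, which scales the second unit's amplitude spectra so the levels coincide at the joint and rotates its phase spectra so the phases coincide at one reference bin. The patent's actual smoothing and level adjustment (Figs. 19 and 20) are more elaborate; this only illustrates the idea.

```python
import numpy as np

def match_at_joint(mags_a, phases_a, mags_b, phases_b, ref_bin=1):
    """Given per-frame spectra of two successive units (arrays of shape
    (frames, bins)), return unit B's spectra adjusted so that level and
    phase coincide with unit A at the joint (A's last frame, B's first)."""
    eps = 1e-12
    # scale all of B so its first frame's total level matches A's last frame
    gain = np.sum(mags_a[-1]) / (np.sum(mags_b[0]) + eps)
    # rotate all of B's phases so the reference bin agrees at the joint
    offset = phases_a[-1][ref_bin] - phases_b[0][ref_bin]
    return mags_b * gain, phases_b + offset
```

Applying a single gain and phase offset to the whole second unit is a deliberate simplification; a per-frame crossfade around the joint would be the more usual refinement.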
According to the present invention, the amplitude spectrum data and phase spectrum data are generated on the basis of the result of frequency analysis of the audio waveform corresponding to the sound synthesis unit, and are adjusted according to the designated pitch. Since the synthesized audio signal is generated on the basis of the adjusted amplitude spectrum data and phase spectrum data, the detached and reverberant stochastic component that arises in the conventional example from splitting the frequency analysis result into a deterministic component and a stochastic component cannot occur in principle, and natural, high-quality singing voice synthesis is obtained.
Description of drawings
Fig. 1 is a block diagram showing the circuit configuration of a singing voice synthesis apparatus according to an embodiment of the invention.
Fig. 2 is a flowchart showing an example of a singing voice analysis process.
Fig. 3 is a chart showing the storage state of a sound synthesis unit database.
Fig. 4 is a flowchart showing an example of a singing voice synthesis process.
Fig. 5 is a flowchart showing an example of the conversion process of step 76 in Fig. 4.
Fig. 6 is a flowchart showing another example of a singing voice analysis process.
Fig. 7 is a flowchart showing another example of a singing voice synthesis process.
Fig. 8A is a waveform diagram showing an input voice signal to be analyzed, and Fig. 8B is a spectrum diagram showing the frequency analysis result.
Fig. 9A is a spectrum diagram showing the designation of spectral distribution regions before pitch conversion, and Fig. 9B is a spectrum diagram showing the spectral distribution regions after pitch conversion.
Fig. 10A is a chart showing the amplitude spectrum and phase spectrum distributions before pitch conversion, and Fig. 10B is a chart showing the amplitude spectrum and phase spectrum distributions after pitch conversion.
Fig. 11 is a chart for explaining the designation of spectral distribution regions when the pitch is lowered.
Fig. 12A is a chart showing the local peak points and spectral envelope before a pitch change, and Fig. 12B is a chart showing the local peak points and spectral envelope after the pitch change.
Fig. 13 is a chart showing an example of a spectral envelope.
Fig. 14 is a block diagram showing a pitch conversion process and a long-sound adjustment process.
Fig. 15 is a block diagram showing an example of the long-sound adjustment process.
Fig. 16 is a block diagram showing another example of the long-sound adjustment process.
Fig. 17 is a chart for explaining the modeling of a spectral envelope.
Fig. 18 is a chart for explaining the timbre and level mismatches that occur when sound synthesis units are connected.
Fig. 19 is a chart for explaining a smoothing process.
Fig. 20 is a chart for explaining level adjustment.
Fig. 21 is a block diagram showing an example of a conventional singing voice synthesis process.
Embodiment
Fig. 1 is a block diagram showing the circuit configuration of a singing voice synthesis apparatus according to an embodiment of the invention. The apparatus has a configuration in which operations are controlled by a small computer 10.
A central processing unit (CPU) 12, a read-only memory (ROM) 14, a random-access memory (RAM) 16, a singing voice input unit 17, a lyrics/melody input unit 18, a control parameter input unit 20, an external storage unit 22, a display unit 24, a timer 26, a digital-to-analog (D/A) conversion unit 28, a musical instrument digital interface (MIDI) 30, a communication interface 32, and so on are all connected to a bus 11.
The CPU 12 performs various processes relating to singing voice synthesis according to programs stored in the ROM 14. These processes will be explained later with reference to Figs. 2 to 7 and others.
The RAM 16 includes various storage areas, such as a working area used by the CPU 12 during its various processes. As storage areas according to the embodiment of the invention, input data storage areas are provided corresponding to the input units 17, 18 and 20, for example, as will be explained in detail later.
The singing voice input unit 17 includes a microphone, a voice input terminal and the like for inputting a singing voice signal, and is equipped with an analog-to-digital (A/D) converter for converting the input singing voice signal into digital waveform data. The input digital waveform data are stored in a predetermined area of the RAM 16.
The lyrics/melody input unit 18 is equipped with a keyboard for inputting characters and numerals and a reading device capable of reading musical scores. It can input lyrics data representing the phoneme sequence of the lyrics of the desired singing voice, and melody data representing the series of notes (including rests) constituting the melody. The input lyrics data and melody data are stored in predetermined areas of the RAM 16.
The control parameter input unit 20 is equipped with parameter setting devices such as switches and volume controls, and can set control parameters for controlling the performance of the synthesized singing voice. Tone color, pitch level (high, medium, low, etc.), pitch fluctuation (pitch bend, vibrato, etc.), dynamics level (volume, such as high, medium, low), tempo level (fast, medium, slow tempo) and the like can all be set as control parameters. Control parameter data representing the set control parameters are stored in a predetermined area of the RAM 16.
The external storage unit 22 accommodates one or more types of removable storage media, such as a flexible disk (FD), a compact disc (CD), a digital versatile disc (DVD), or a magneto-optical disk (MO). When the external storage unit 22 is loaded with the desired medium, data can be transferred from the medium to the RAM 16. When the loaded medium is writable, like a hard disk (HD) or an FD, data can be transferred from the RAM 16 to the medium.
A medium in the external storage unit 22 can be used in place of the ROM 14 as a program storage unit. In this case, a program stored on the medium is transferred to the RAM 16 via the external storage unit 22, and the CPU 12 then operates according to the program stored in the RAM 16. This makes it easy to add programs and upgrade versions.
The display unit 24 includes a display device such as a liquid crystal display, and can display various kinds of information, such as the aforementioned frequency analysis results.
The timer 26 generates a tempo clock signal TCL with a period corresponding to tempo data TM, and the tempo clock signal TCL is supplied to the CPU 12. The CPU 12 performs signal output processing to the D/A conversion unit 28 according to the tempo clock signal TCL. The tempo specified by the tempo data TM can be varied by the tempo setting device of the input unit 20.
The D/A conversion unit 28 converts the synthesized digital voice signal into an analog voice signal. The analog voice signal output from the D/A conversion unit 28 is converted into audible sound by a sound system 34 comprising an amplifier, a loudspeaker and the like.
The MIDI interface 30 performs MIDI communication with a MIDI device 36 separate from this singing voice synthesis apparatus, and is used in the present invention to receive singing voice synthesis data from the MIDI device 36. The received singing voice synthesis data include lyrics data and melody data for the desired singing voice, and control parameter data for controlling the musical performance. These singing voice synthesis data are generated in MIDI format, corresponding to the lyrics data and melody data input by the input unit 18 and the control parameter data input by the input unit 20.
As for the lyrics data, melody data and control parameter data received via the MIDI interface 30, MIDI system exclusive data in a manufacturer-defined proprietary format may be used. Also, with respect to the control parameter data input by the input unit 20 or received via the MIDI interface 30, when sound synthesis unit data are stored in the database described later for each singer (or tone color), a singer (or tone color) designating parameter is needed. In this case, MIDI program change data may be used as the singer (or tone color) designating data.
Communication interface 32 provides data communication with another computer 38 over a communication network 37 (for example a LAN, the Internet, or a telephone line). Programs and data required for carrying out the invention (for example lyrics data, melody data, and sound synthesis unit data) can be downloaded from computer 38 into RAM 16 or external storage unit 22 over communication network 37 in response to a download request.
An example of the singing voice synthesis process is described below with reference to Fig. 2. In step 40, a singing voice signal input to input unit 17 through a microphone or a sound input terminal is A/D converted, and digital waveform data representing the input sound waveform are stored in RAM 16. Fig. 8A shows an example of an input sound waveform; "t" denotes time in Fig. 8A and the other figures.
In step 42, the stored digital waveform data are divided into segment waveforms corresponding to the segments of the individual sound synthesis units (phonemes or phoneme chains). The sound synthesis units include vowel phonemes; vowel-to-consonant or consonant-to-vowel phoneme chains; consonant-to-consonant phoneme chains; vowel-to-vowel phoneme chains; silence-to-consonant or silence-to-vowel phoneme chains; vowel-to-silence or consonant-to-silence phoneme chains; and so on. Among vowel phonemes there are also "long" phonemes in which the vowel articulation is prolonged. For example, for the singing voice [saita], segment waveforms are separated corresponding to [#s], [s_a], [a], [a_i], [i], [i_t], [t_a], [a], and [a#].
In step 44, one or more time frames are set for each segment waveform, and each frame is frequency-analyzed by fast Fourier transform (FFT) to obtain a frequency spectrum (amplitude spectrum and phase spectrum). Data representing the frequency spectrum are then stored in a predetermined area of RAM 16. The length of each frame may be fixed or variable. To make the frame length variable, frequency analysis is first performed on a frame of fixed length, a pitch is detected from the analysis result, and after a frame length corresponding to the detected pitch has been set, frequency analysis is performed on that frame again. Alternatively, after frequency analysis of one frame of fixed length, a pitch is detected from the analysis result, the length of the next frame is set according to the detected pitch, and frequency analysis is then performed on the next frame. For a single phoneme consisting of a vowel, the number of frames may be one or more; for a phoneme chain, it is plural. Fig. 8B shows the frequency spectrum obtained by FFT frequency analysis of the sound waveform of Fig. 8A; "f" denotes frequency in Fig. 8B and the other figures.
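The frame-wise analysis of step 44 can be sketched as follows. This is a minimal illustration with numpy, not the patent's implementation; the frame length, hop size, and window choice are assumptions for the example.

```python
import numpy as np

def analyze_frames(x, frame_len=1024, hop=256, sr=44100):
    """Split a waveform into time frames and FFT each one, returning an
    (amplitude spectrum, phase spectrum) pair per frame, as in step 44."""
    win = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = x[start:start + frame_len] * win
        spec = np.fft.rfft(seg)                  # one-sided complex spectrum
        frames.append((np.abs(spec), np.angle(spec)))
    return frames

# toy input: a 440 Hz sine, standing in for one segment waveform
sr = 44100
t = np.arange(sr // 10) / sr
frames = analyze_frames(np.sin(2 * np.pi * 440 * t), sr=sr)
amp0, ph0 = frames[0]
peak_bin = int(np.argmax(amp0))
peak_hz = peak_bin * sr / 1024   # bin index -> frequency
```

The amplitude/phase pair per frame is exactly the representation the later steps (local peak detection, region assignment, phase adjustment) operate on.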
Then, in step 46, a pitch is detected for the sound synthesis unit on the basis of the amplitude spectrum, pitch data representing the detected pitch are generated, and the data are stored in a predetermined area of RAM 16. Pitch detection may be performed by averaging, over all frames, the pitches obtained for the individual frames.
In step 48, a plurality of local peaks of the amplitude spectrum are detected for each frame from the spectral intensity (amplitude). To detect a local peak, a method may be used that selects the peak having the largest amplitude value among a run of consecutive spectrum samples (for example, four). Fig. 8B shows detected local peaks P1, P2, P3, and so on.
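A simple version of the step-48 peak picking can be written as below. The neighbourhood test used here (a bin must be the maximum of its surrounding span) is one plausible reading of the "largest among consecutive samples" method; the span width is an assumption.

```python
import numpy as np

def detect_local_peaks(amp, span=4):
    """Return indices of bins that are the maximum within `span` bins on
    each side, approximating the local-peak detection of step 48."""
    peaks = []
    for i in range(span, len(amp) - span):
        window = amp[i - span:i + span + 1]
        if amp[i] == window.max() and amp[i] > 0:
            peaks.append(i)
    return peaks

# synthetic amplitude spectrum with two clear peaks at bins 20 and 60
amp = np.zeros(100)
amp[20], amp[60] = 1.0, 0.8
amp[19], amp[21], amp[59], amp[61] = 0.5, 0.5, 0.4, 0.4
peaks = detect_local_peaks(amp)
```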
In step 50, a spectrum distribution region is designated for each local peak of each frame of the amplitude spectrum, and amplitude spectrum data representing, along the frequency axis, the amplitude spectrum distribution of each designated region are stored in a predetermined area of RAM 16. Methods for designating the spectrum distribution regions include the following: one method assigns each half of the frequency-axis span between two adjacent local peaks to the spectrum distribution region containing the local peak nearer to that half; another method takes the point of lowest amplitude between two local peaks as the valley and uses its frequency as the boundary between the adjacent spectrum distribution regions. Fig. 8B shows an example of the former method, in which spectrum distribution regions R1, R2, R3, ... are assigned to local peaks P1, P2, P3, ..., respectively.
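The midpoint method of step 50 (each half-span between adjacent peaks goes to the nearer peak) can be sketched as bin-range bookkeeping. This is an illustrative sketch; bin indices stand in for frequencies.

```python
import numpy as np

def assign_regions(peaks, n_bins):
    """Assign each local peak a spectrum distribution region by the
    midpoint method of step 50: the boundary between two adjacent peaks
    lies halfway between them. Returns inclusive (lo, hi) bin ranges."""
    regions = []
    for k, p in enumerate(peaks):
        lo = 0 if k == 0 else (peaks[k - 1] + p) // 2 + 1
        hi = n_bins - 1 if k == len(peaks) - 1 else (p + peaks[k + 1]) // 2
        regions.append((lo, hi))
    return regions

regions = assign_regions([20, 60, 90], n_bins=128)
```

The alternative valley method would instead search for the minimum-amplitude bin between each pair of peaks and use it as the boundary.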
In step 52, phase spectrum data representing, along the frequency axis, the phase distribution of each spectrum distribution region of each frame of the phase spectrum are generated and stored in a predetermined area of RAM 16. In Fig. 10A, the amplitude spectrum distribution and the phase spectrum distribution of one spectrum distribution region of one frame are shown by curves AM1 and PH1, respectively.
In step 54, the pitch data, amplitude spectrum data, and phase spectrum data of each sound synthesis unit are stored in a sound synthesis unit database. RAM 16 or external storage 22 can serve as the sound synthesis unit database.
Fig. 3 shows an example of the storage state of a sound synthesis unit database DBS. Sound synthesis units corresponding to single phonemes such as [a] and [i], and sound synthesis units corresponding to phoneme chains such as [a_i] and [s_a], are stored in database DBS. In step 54, the pitch data, amplitude spectrum data, and phase spectrum data are stored as the sound synthesis unit data.
When storing the sound synthesis unit data, a natural (high-quality) singing voice can be synthesized by storing, for each sound synthesis unit, multiple data sets that differ in singer (timbre), pitch level, dynamics level, and tempo level. For example, for the sound synthesis unit [a], singer A sings under all combinations of tempo level "slow"/"medium"/"fast", pitch level "high"/"medium"/"low", and dynamics level "large"/"medium"/"small"; when the pitch level is "low" and the dynamics level is "small", sound synthesis unit data M1, M2, M3 corresponding to tempo levels "slow", "medium", and "fast" are recorded. Sound synthesis unit data corresponding to the other combinations are recorded in the same manner. The pitch data generated in step 46 are used to judge to which of the pitch levels "high", "medium", and "low" the sound synthesis unit data belong.
For a singer B whose voice differs from that of singer A, singer B is made to sing in a manner similar to that described for singer A, and multiple sound synthesis unit data having different pitch levels, dynamics levels, and tempo levels are recorded in database DBS. Likewise, sound synthesis units other than [a] are recorded in the manner described above.
Although in the foregoing example the sound synthesis unit data are generated from the singing voice signal input through input unit 17, the sound synthesis unit data may also be generated from a singing voice signal input through interfaces 30 and 32. In addition, database DBS can be stored not only in RAM 16 or external storage unit 22 but also in ROM 14, in the storage unit of MIDI device 36, in the storage unit of computer 38, and so on.
Fig. 4 shows an example of the singing voice synthesis process. In step 60, the lyrics data and melody data of the desired song are input from input unit 18 and stored in RAM 16. The lyrics data and melody data may also be input through interfaces 30 and 32.
In step 62, the phoneme sequence corresponding to the input lyrics data is converted into individual sound synthesis units. Thereafter, in step 64, the sound synthesis unit data (pitch data, amplitude spectrum data, and phase data) corresponding to each sound synthesis unit are read out from database DBS. In step 64, timbre, pitch level, dynamics level, tempo level, and the like can be input from input unit 20 as control parameters, and the sound synthesis unit data designated by these control parameters are read out.
The pronunciation duration of a sound synthesis unit corresponds to the quantity of its sound synthesis unit data. That is, when sound synthesis is performed using the stored sound synthesis unit data without modification, a pronunciation duration corresponding to the quantity of the sound synthesis unit data is obtained. However, such a pronunciation duration is not always appropriate for the note duration (input note length), the tempo setting, and so on, so the pronunciation duration needs to be adjusted. To meet this need, the number of frames read from the sound synthesis unit data can be controlled according to the input note length, the tempo setting, and the like.
For example, to shorten the pronunciation duration of a sound synthesis unit, some frames are skipped when the sound synthesis unit data are read. Likewise, to lengthen the pronunciation duration, the sound synthesis unit data are read repeatedly. In addition, when synthesizing the long sound of a single phoneme such as [a], the pronunciation duration is often modified. The synthesis of long sounds will be explained in detail later with reference to Figs. 14 to 16.
In step 66, the amplitude spectrum data of each frame are adjusted according to the input note pitch of each sound synthesis unit. That is, the amplitude spectrum distribution represented by the amplitude spectrum data of each spectrum distribution region is shifted along the frequency axis so as to produce a pitch corresponding to the input note pitch.
Figs. 10A and 10B show an example in which the pitch of a spectrum distribution region having local peak frequency fi is raised, shifting the spectrum distribution from AM1 to AM2; the lower and upper limit frequencies of the region are fl and fu, respectively. In this case, for the shifted spectrum distribution AM2, the local peak frequency is Fi = T·fi, where T = Fi/fi is the pitch conversion ratio. The lower limit frequency Fl and the upper limit frequency Fu are determined so as to preserve the corresponding frequency offsets (fl − fi) and (fu − fi) relative to the local peak.
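The region shift just described amounts to a few lines of arithmetic. The sketch below, with made-up frequencies, moves a region's peak to a target pitch while preserving the edge offsets, as in step 66.

```python
def shift_region(peak_hz, lo_hz, hi_hz, target_peak_hz):
    """Shift a spectrum distribution region so its local peak lands on the
    target frequency: T = Fi / fi, with the region edges keeping their
    offsets (fl - fi) and (fu - fi) relative to the peak."""
    T = target_peak_hz / peak_hz          # pitch conversion ratio
    Fi = T * peak_hz                      # new local peak frequency
    Fl = Fi + (lo_hz - peak_hz)           # preserved lower-edge offset
    Fu = Fi + (hi_hz - peak_hz)           # preserved upper-edge offset
    return T, Fi, Fl, Fu

# shift a region peaked at A4 (440 Hz) up to C5 (~523.25 Hz)
T, Fi, Fl, Fu = shift_region(440.0, 420.0, 465.0, 523.25)
```

Because only the offsets, not the region width ratio, are preserved, the fine structure around the peak keeps its shape after the shift.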
Fig. 9 A shows has the P corresponding to local peaking 1, P 2, P 3Spectrum distributed areas R 1, R 2, R 3(shown in Fig. 8 B), Fig. 9 B show one on frequency axis the high-pitched tone direction move the spectrum distributed areas example.At the spectrum distributed areas R shown in Fig. 9 B 1In, the P of local peaking 1Frequency, the lower bound frequency f 11With high limit frequency f 12All determine with reference to the described same procedure of Figure 10 by the front.It can be applied to other spectrum distributed areas equally.
Although in the above example the spectrum distribution regions are shifted toward the high-pitch end of the frequency axis to raise the pitch, they can also be shifted toward the low-pitch end to lower the pitch. In that case adjacent regions may come to overlap; Fig. 11 shows two partially overlapping spectrum distribution regions Ra and Rb. In the example of Fig. 11, region Ra, with local peak Pa and upper limit frequency fa2, and region Rb, with local peak Pb, lower limit frequency fb1 (fb1 < fa2), and upper limit frequency fb2 (fb2 > fa2), overlap in the frequency range fb1 to fa2. To avoid this, the overlapping range fb1 to fa2 can, for example, be divided at a center frequency fc: the upper limit frequency fa2 of region Ra is changed to a predetermined frequency not higher than fc, and the lower limit frequency fb1 of region Rb is changed to a predetermined frequency not lower than fc. In this way, the spectrum distribution of region Ra is used in the frequency range below fc and the spectrum distribution of region Rb in the frequency range above fc.
As described above, when the spectrum distributions containing the local peaks are shifted along the frequency axis, the spectrum envelope is merely stretched or compressed in frequency, so there is the problem that the timbre differs from that of the input sound waveform. To reproduce the timbre of the input sound waveform, the spectral intensities of one or more spectrum distribution regions must be adjusted along the spectrum envelope, that is, along the line connecting the local peaks of the series of spectrum distributions of each frame.
Fig. 12 shows an example of spectral intensity adjustment. Fig. 12A shows a spectrum envelope EV corresponding to local peaks P11 to P18 before pitch conversion. When, in order to raise the pitch according to the input-note pitch ratio, the local peaks P11 to P18 are moved along the frequency axis to the positions P21 to P28 shown in Fig. 12B, their spectral intensities are raised or lowered so that they follow the spectrum envelope EV. A timbre identical to that of the input sound waveform is thereby obtained.
In Figure 12 A, R fIt is the frequency field that lacks spectrum envelope.When the rising pitch, may need such as P 27, P 28Local peaking transfer to frequency field R shown in Figure 12 B fFor fear of this situation, frequency field R fSpectrum envelope can obtain by the method for interpolation shown in Figure 12 B, can adjust the spectral density of local peaking according to acquisition spectrum envelope EV.
In the above example the timbre of the input sound waveform is reproduced, but a timbre different from that of the input sound waveform can also be given to the synthesized sound. In that case, the spectral intensities are adjusted using a transformed version of the spectrum envelope EV of Fig. 12 or a newly prepared spectrum envelope.
To simplify the processing that uses the spectrum envelope, the envelope is preferably represented as a curve or a polyline. Fig. 13 shows two different spectrum envelope representations EV1 and EV2. EV1 represents the spectrum envelope simply as a polyline connecting the local peaks by straight lines. EV2 represents the spectrum envelope using a cubic spline function. With EV2, interpolation can be performed more accurately.
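The polyline form EV1 is a one-liner with numpy; the peak frequencies and amplitudes below are made-up stand-ins for one frame's local peaks.

```python
import numpy as np

# local peaks of one frame: (frequency in Hz, amplitude in dB)
peak_f = np.array([100.0, 200.0, 300.0, 400.0])
peak_a = np.array([-6.0, -9.0, -15.0, -24.0])

def envelope_ev1(f):
    """EV1 of Fig. 13: the spectrum envelope as a polyline connecting the
    local peaks; np.interp performs the straight-line interpolation."""
    return np.interp(f, peak_f, peak_a)

# amplitude the envelope prescribes halfway between the first two peaks
mid = envelope_ev1(150.0)
```

The smoother form EV2 could be built the same way with `scipy.interpolate.CubicSpline` over the same peak points, at the cost of an extra dependency.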
Subsequently, in step 68 of Fig. 4, the phase spectrum data of each sound synthesis unit are adjusted according to the adjustment of the amplitude spectrum data of each frame. That is, in the spectrum distribution region containing the i-th local peak of a frame shown in Fig. 10A, the phase spectrum distribution PH1 corresponds to the amplitude spectrum distribution AM1. When the amplitude spectrum distribution AM1 is moved to AM2 in step 66, the phase spectrum distribution PH1 must be adjusted in accordance with the amplitude spectrum distribution AM2. This is done so that the component at the frequency of the local peak at the destination position remains a sine wave.
When the time interval between frames is Δt, the local peak frequency is fi, and the pitch conversion ratio is T, the phase interpolation amount Δψi for the spectrum distribution region containing the i-th local peak is obtained by the following equation A1:

Δψi = 2πfi(T − 1)Δt ........ (A1)
Shown in Figure 10 B, the interpolation amount Δ ψ i that is obtained by equation A1 is added to regional F iTo F uOn the phase place of each interior phase spectrum, frequency is F iThe phase place of local peaking be ψ i+ Δ ψ i
The above phase interpolation is carried out for every spectrum distribution region. As a special case, when the local peak frequencies of a frame are perfectly harmonic (each harmonic frequency is an exact integer multiple of the fundamental), with the fundamental frequency of the input sound (the pitch represented by the pitch data in the sound synthesis unit data) denoted f0 and the spectrum distribution regions numbered k = 1, 2, 3, ..., the phase interpolation amount Δψk is obtained by the following equation A2:

Δψk = 2πkf0(T − 1)Δt ........ (A2)
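Equations A1 and A2 translate directly into code; for a perfectly harmonic peak at k·f0 the two give the same increment, which the sketch below checks with made-up values.

```python
import numpy as np

def phase_increment(f_peak, T, dt):
    """Equation A1: phase interpolation amount for the region whose local
    peak frequency is f_peak, with pitch conversion ratio T and frame
    interval dt (seconds)."""
    return 2 * np.pi * f_peak * (T - 1) * dt

def phase_increment_harmonic(k, f0, T, dt):
    """Equation A2: the harmonic special case, for the k-th region of a
    perfectly harmonic frame with fundamental f0."""
    return 2 * np.pi * k * f0 * (T - 1) * dt

dpsi = phase_increment(440.0, 1.5, 0.005)
# a peak at 440 Hz is the 2nd harmonic of f0 = 220 Hz: A1 and A2 agree
same = np.isclose(phase_increment_harmonic(2, 220.0, 1.5, 0.005), dpsi)
```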
In step 70, the reproduction start time of each sound synthesis unit is determined according to the tempo setting and the like. The reproduction start time depends on the tempo setting and the input note length, and is represented by a clock count value of the tempo clock signal TCL. For the singing voice [saita], for example, the reproduction start time of sound synthesis unit [s_a] is set so that the [a] sound, rather than the [s] sound, begins at the note time determined by the input note length and the tempo setting. When lyrics data and melody data are input in real time in step 60, that is, when real-time singing voice synthesis is performed, the lyrics data and melody data must be input ahead of the note time in order to set the reproduction start time as described above.
In step 72, the spectral intensity levels are adjusted between sound synthesis units. The level adjustment process is applied to both the amplitude spectrum data and the phase spectrum data, thereby preventing noise when the synthesized sound is connected in the following step 74. The processes used here include a smoothing process, a level adjustment process, and the like, which will be explained in detail later with reference to Figs. 17 to 20.
In step 74, the amplitude spectrum data are connected to one another, and the phase spectrum data are likewise connected to one another. Then, in step 76, the amplitude spectrum data and phase spectrum data of each sound synthesis unit are converted into a time-domain synthesized sound signal (digital waveform data).
Fig. 5 shows an example of the conversion process of step 76. In step 76a, the frequency-domain data of each frame (amplitude spectrum data and phase spectrum data) are subjected to an inverse fast Fourier transform (inverse FFT) to obtain a time-domain synthesized sound signal. Then, in step 76b, windowing is applied to the time-domain synthesized sound signal: the signal is multiplied by a time window function. In step 76c, overlap processing is performed on the time-domain synthesized sound signal: the time-domain signals are overlapped in sequence and added, thereby joining the waveforms of the sound synthesis units.
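Steps 76a to 76c form a standard inverse-FFT overlap-add resynthesis loop. The sketch below is a minimal version under assumed frame/hop sizes, verified by round-tripping a sine through analysis and resynthesis; it is an illustration, not the patent's exact windowing scheme.

```python
import numpy as np

def frames_to_signal(amp_frames, phase_frames, frame_len=1024, hop=256):
    """Steps 76a-76c: inverse-FFT each frame's (amplitude, phase) pair,
    apply a time window, and overlap-add the frames into one signal."""
    win = np.hanning(frame_len)
    out = np.zeros(hop * (len(amp_frames) - 1) + frame_len)
    for n, (amp, ph) in enumerate(zip(amp_frames, phase_frames)):
        spec = amp * np.exp(1j * ph)            # rebuild complex spectrum
        frame = np.fft.irfft(spec, frame_len)   # step 76a: inverse FFT
        out[n * hop:n * hop + frame_len] += frame * win  # steps 76b + 76c
    return out

# round-trip check: analyze a 200 Hz sine, then resynthesize it
sr, frame_len, hop = 8000, 1024, 256
t = np.arange(4 * frame_len) / sr
x = np.sin(2 * np.pi * 200 * t)
amps, phs = [], []
for s in range(0, len(x) - frame_len + 1, hop):
    spec = np.fft.rfft(x[s:s + frame_len])
    amps.append(np.abs(spec))
    phs.append(np.angle(spec))
y = frames_to_signal(amps, phs, frame_len, hop)
```

With a hop of a quarter frame, the shifted window functions sum to a nearly constant gain in the interior, so the resynthesized interior closely tracks the input.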
In step 78, the synthesized sound signal is output to D/A conversion unit 28 at the reproduction start time determined in step 70. A synthesized singing voice is thereby produced by sound system 34.
Fig. 6 shows an example of another singing voice analysis process. In step 80, a singing voice signal is input by the same method as described for step 40, and digital waveform data representing the input sound waveform are stored in a predetermined area of RAM 16. The singing voice signal may also be input through interfaces 30 and 32.
In step 82, the saved digital waveform data are divided into segment waveforms, segment by segment, by the same method as described for step 42.
In step 84, segment waveform data (sound synthesis unit data) representing the segment waveform of each sound synthesis unit are stored in a sound synthesis unit database. RAM 16 or external storage unit 22 can serve as the sound synthesis unit database; ROM 14, the storage device of MIDI device 36, or the storage device of computer 38 may also be used as required. When storing the sound synthesis unit data, segment waveform data m1, m2, m3, ... that differ in singer (timbre), pitch level, dynamics level, tempo level, and the like are stored for each sound synthesis unit in sound synthesis unit database DBS, in the same manner as described above with reference to Fig. 3.
An example of another singing voice synthesis process is explained below with reference to Fig. 7. In step 90, the lyrics data and melody data of the singing voice to be synthesized are input by the same method as described for step 60.
In step 92, the phoneme sequence represented by the lyrics data is converted into individual sound synthesis units by the same method as described for step 62. Then, in step 94, the segment waveform data of each sound synthesis unit are read from the database in which they were stored in step 84. In this case, data such as timbre, pitch level, dynamics level, and tempo level are input from input unit 20 as control parameters, and the segment waveform data corresponding to these control parameters are read out. Also, by the same method as described for step 64, the pronunciation duration can be changed according to the input note length and the tempo setting: while reading the sound waveform, part of it may be omitted, or part or all of it may be read repeatedly, so as to obtain the required pronunciation duration.
In step 96, one or more time frames are determined for each segment waveform from the read segment waveform data, and each frame is frequency-analyzed by FFT or the like to detect its frequency spectrum (amplitude spectrum and phase spectrum). Data representing the frequency spectrum are then stored in a predetermined area of RAM 16.
In step 98, the same processes as steps 46 to 52 of Fig. 2 are performed to generate pitch data, amplitude spectrum data, and phase spectrum data for each sound synthesis unit. Then, in step 100, the same processes as steps 66 to 78 of Fig. 4 are performed to synthesize and reproduce the singing voice.
Comparing the two singing voice synthesis processes of Figs. 4 and 7: in the process of Fig. 4, singing voice synthesis is performed from the amplitude spectrum data and phase spectrum data of each sound synthesis unit obtained from the database, whereas in the process of Fig. 7 it is performed from the segment waveform data of each sound synthesis unit obtained from the database. Apart from this difference, the two synthesis processes are essentially the same. In both the processes of Figs. 4 and 7, because the frequency analysis result of the input sound waveform is not divided into a deterministic component and a stochastic component, no artifacts arise from the separation and recombination of a stochastic component. A natural (high-quality) synthesized sound is therefore obtained, including natural synthesis of fricatives and plosives.
Fig. 14 shows the pitch conversion process and the timbre adjustment process (corresponding to step 66 in Fig. 4) for the long sound of a single phoneme such as [a]. Here, the database provides, as shown in Fig. 3, sets of data each composed of pitch data, amplitude spectrum data, and phase spectrum data. Sound synthesis unit data that differ in singer (timbre), pitch level, dynamics level, and tempo level are likewise stored in the database. When control parameters such as the desired singer (desired timbre), pitch level, dynamics level, and tempo level are designated at input unit 20, the sound synthesis unit data designated by the control parameters are read out.
In step 110, the amplitude spectrum data FSP obtained from the long-sound synthesis unit data SD are subjected to the same pitch conversion process as in step 66. Specifically, for each spectrum distribution region of every frame of the amplitude spectrum data FSP, the spectrum distribution is shifted along the frequency axis to the position corresponding to the input note pitch indicated by input note pitch data PT.
When a pronunciation duration longer than the time length of the sound synthesis unit data SD is required for a long sound, the data SD are read to the end and then read again from the beginning; in this manner, forward reading can be repeated as many times as required. As another method, after the data are read to the end, they are read backwards from the end toward the beginning, and forward and backward reading can be repeated alternately as required. In this method, the read start point within the backward pass can be set arbitrarily.
For the pitch conversion process of step 110, pitch fluctuation data representing a temporally continuous pitch variation may be stored in database DBS of Fig. 3 for each long-sound synthesis unit data M1 (or m1), M2 (or m2), M3 (or m3) corresponding to, for example, [a]. In this case, in step 112, the read pitch fluctuation data are superimposed on the input note pitch, and the pitch conversion of step 110 is controlled according to the pitch control data resulting from the superposition. By this method, pitch fluctuation (such as pitch bend or vibrato) can be added to the synthesized sound, yielding a natural synthesized sound. Moreover, since the pitch fluctuation style can be varied by control parameters such as timbre, pitch level, dynamics level, and tempo level, the naturalness of the synthesized sound is further improved. The pitch fluctuation data can be used by interpolating, according to the control parameters such as timbre, one or more pitch fluctuation data corresponding to the sound synthesis unit and modifying them.
In step 114, the timbre adjustment process is applied to the amplitude spectrum data FSP' resulting from the pitch conversion process of step 110. In this process, the spectral intensities of each frame are adjusted according to the spectrum envelope, as described above with reference to Fig. 12, to set the timbre of the synthesized sound.
Fig. 15 shows an example of the timbre adjustment process of step 114. In this example, spectrum envelope data representing a typical spectrum envelope corresponding to, for example, the long-sound synthesis unit [a] are stored in the database shown in Fig. 3.
In step 116, the spectrum envelope data corresponding to the long-sound synthesis unit are read from database DBS. Then, in step 118, a spectrum envelope assignment process is performed according to the read spectrum envelope data. That is, for each of the n frames of amplitude spectrum data FR1 to FRn in the long-sound frame group FR, the spectral intensities are adjusted together so that the amplitude spectrum data of each frame follow the spectrum envelope indicated by the read spectrum envelope data. A suitable timbre can thereby be imparted to the long sound.
For the spectrum envelope assignment process of step 118, spectrum envelope fluctuation data representing, for example, a temporally continuous variation of the spectrum envelope may be stored in database DBS of Fig. 3 for each long-sound synthesis unit data M1 (m1), M2 (m2), M3 (m3) corresponding to, for example, [a]; in response to the designation at input unit 20 of control parameters such as timbre, pitch level, dynamics level, and tempo level, the spectrum envelope fluctuation data corresponding to the designated control parameters are read out. In this case, in step 118, the read spectrum envelope fluctuation data VE are superimposed on the spectrum envelope data read in step 116, and the spectrum envelope setting of step 118 is controlled according to the spectrum envelope control data resulting from the superposition. By this method, timbre fluctuation (analogous to pitch bend) can be added to the synthesized sound, yielding a natural synthesized sound. Moreover, since the fluctuation style can be varied according to control parameters such as timbre, pitch level, dynamics level, and tempo level, the naturalness of the synthesized sound is improved. The spectrum envelope fluctuation data can be used by interpolating, according to the control parameters such as timbre, one or more fluctuation data corresponding to the sound synthesis unit and modifying them.
Fig. 16 shows another example of the timbre adjustment process of step 114. A typical case in singing voice synthesis is, as in the aforementioned [saita], the synthesis of a phoneme chain (for example [s_a]) followed by a single phoneme (for example [a]) followed by another phoneme chain (for example [a_i]), and Fig. 16 shows this typical case. In Fig. 16, the amplitude spectrum data PFR of the last frame of the preceding note correspond, for example, to the phoneme chain [s_a]; the n frames of amplitude spectrum data FR1 to FRn of the long sound correspond, for example, to the single phoneme [a]; and the amplitude spectrum data NFR of the first frame of the following note correspond, for example, to the phoneme chain [a_i].
In step 120, a spectrum envelope is extracted from the amplitude spectrum data PFR of the last frame of the preceding note, and another is extracted from the amplitude spectrum data NFR of the first frame of the following note. These two extracted spectrum envelopes are then interpolated over time to form spectrum envelope data representing the spectrum envelope of the long sound.
In step 122, the spectral intensities are adjusted so that the amplitude spectrum data of each of the n frames FR1 to FRn follow the spectrum envelope represented by the spectrum envelope data formed in step 120. A suitable timbre can thereby be imparted to the long sound between the two phoneme chains.
In addition, in step 122, the spectrum envelope setting may be controlled by fluctuation data read from database DBS in accordance with control parameters such as timbre, by the same process as described above for step 118. A natural synthesized sound is obtained by this method.
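The temporal interpolation of step 120 can be sketched as a per-frame blend between the two boundary envelopes. The envelope values here are made-up samples; in practice each envelope would be sampled at the local peak frequencies.

```python
import numpy as np

def long_sound_envelopes(env_prev, env_next, n_frames):
    """Step 120: interpolate over time between the envelope of the last
    frame of the preceding note (PFR) and that of the first frame of the
    following note (NFR), giving one target envelope per long-sound frame
    FR1..FRn."""
    envs = []
    for i in range(n_frames):
        w = i / (n_frames - 1) if n_frames > 1 else 0.0
        envs.append((1 - w) * env_prev + w * env_next)
    return envs

env_prev = np.array([0.0, -6.0, -12.0])    # stand-in envelope samples (dB)
env_next = np.array([-2.0, -8.0, -20.0])
envs = long_sound_envelopes(env_prev, env_next, 5)
```

Because the first and last interpolated envelopes coincide with the boundary envelopes, the long sound joins both neighbouring phoneme chains without a timbre jump.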
An example of the smoothing process (corresponding to step 72) is explained below with reference to Figs. 17 to 19. In this example, to make the data easy to handle and to simplify computation, the spectrum envelope of each frame of a sound synthesis unit is resolved, as shown in Fig. 17, into a tilt component represented by a straight line (or an exponential function) and one or more harmonic components. That is, the intensities of the harmonic components are computed relative to the tilt component, and the spectrum envelope is obtained by adding the tilt component and the harmonic components. The value obtained by extending the tilt component to 0 Hz is called the gain of the tilt component.
As an example, the two sound synthesis units [a_i] and [i_a] are connected as shown in Fig. 18. Because these two sound synthesis units were originally extracted from different recordings, their timbre and level do not match each other at the connection point [i]. A waveform step therefore forms at the connection point, as shown in Fig. 18, which is heard as noise. By cross-fading the tilt component and harmonic component parameters of the two sound synthesis unit data over several frames before and after the connection point, the step at the connection point can be eliminated and noise can be prevented.
For example, as shown in Fig. 19, to cross-fade the harmonic component parameters, the harmonic component parameters of the two sound synthesis unit data are each multiplied by a function (cross-fade parameter) whose value becomes 0.5 at the connection point, and the two products are added together. Fig. 19 shows a cross-fade example obtained by adding the weighted waveforms, where each waveform represents the temporal variation of the first-harmonic component intensity of sound synthesis unit [a_i] or [i_a], and each waveform is multiplied by its cross-fade parameter.
Parameters other than the above harmonic components, such as the tilt component, can be cross-faded in the same way.
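The cross-fade of Fig. 19 reduces to two complementary weight ramps that both equal 0.5 at the junction. The sketch below uses a linear ramp and constant parameter tracks as assumptions for illustration.

```python
import numpy as np

def crossfade(param_a, param_b):
    """Fig. 19 style cross-fade: each parameter track is multiplied by a
    fade function equal to 0.5 at the connection point (the midpoint
    here), and the two weighted tracks are summed."""
    n = len(param_a)
    fade_out = np.linspace(1.0, 0.0, n)   # weight for the earlier unit
    fade_in = 1.0 - fade_out              # weight for the later unit
    return param_a * fade_out + param_b * fade_in

a = np.full(5, 2.0)   # e.g. first-harmonic intensity of [a_i] near [i]
b = np.full(5, 4.0)   # the same parameter of [i_a] near [i]
mixed = crossfade(a, b)
```

Since the two weights always sum to one, the blended track moves smoothly from the first unit's value to the second's, removing the step at the junction.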
Fig. 20 shows an example of the level adjustment process (corresponding to step 72). In this example, as before, the level adjustment process is explained for the case where [a_i] and [i_a] are connected for synthesis.
In this case, replace the weak amplitude of sound synthesis unit tie point front and back that makes of intersection to be close to identical by the grade adjustment.The grade adjustment can be by multiply by sound synthesis unit amplitude fixing or variable coefficient carries out.
This example considers the connection of the slope-component gains of the two sound synthesis units. First, as shown in Figures 20A and 20B, for each of [a_i] and [i_a], the slope-component gain is interpolated between the first and last frames, and the difference between the actual slope-component gain and the interpolated value is computed as a parameter (shown by the dotted lines in the figures).
Next, a representative sample (the slope component and each harmonic-component parameter) is calculated for each of the phonemes [a] and [i]. As the representative samples, the spectral amplitude data of the first and last frames of [a_i] can be used.
From the representative samples of [a] and [i], a linearly interpolated slope-component gain is obtained between [a] and [i], and likewise between [i] and [a], as shown by the dotted lines in Figure 20C. Then, by adding the differences computed in Figures 20A and 20B to the corresponding interpolated parameters, the parameters always coincide at the boundary, so no discontinuity in the slope-component gain is produced. Discontinuities in the other parameters, such as the harmonic-component parameters, can be prevented in the same way.
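The procedure of Figures 20A–20C can be sketched as follows. This Python fragment is a simplified illustration, not the patented implementation: the function name `smooth_gain_tracks` is invented, and the representative samples are taken here simply from the boundary frames (averaged at the junction) rather than computed as in the embodiment.

```python
import numpy as np

def smooth_gain_tracks(gain_ai, gain_ia):
    """Sketch of the level adjustment for the slope-component gain.

    For each unit, linearly interpolate the gain between its first and
    last frames and keep the per-frame difference from that line
    (Figs. 20A/20B).  Then build a new baseline interpolating between
    representative samples of the phonemes and add the differences back
    (Fig. 20C), so the two tracks agree at the boundary and no
    discontinuity remains.
    """
    def detail(track):
        base = np.linspace(track[0], track[-1], len(track))
        return track - base        # difference from the interpolated line

    d_ai, d_ia = detail(gain_ai), detail(gain_ia)

    # representative samples (an assumption for this sketch):
    # endpoints kept, shared boundary value averaged from both units
    rep_a = gain_ai[0]
    rep_i = 0.5 * (gain_ai[-1] + gain_ia[0])
    rep_a2 = gain_ia[-1]

    new_ai = np.linspace(rep_a, rep_i, len(gain_ai)) + d_ai
    new_ia = np.linspace(rep_i, rep_a2, len(gain_ia)) + d_ia
    return new_ai, new_ia
```

Because the per-frame differences are zero at each track's endpoints, the rebuilt tracks meet exactly at the boundary value `rep_i`, eliminating the gain step.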
In step 72 described above, the smoothing process and the level adjustment process are applied not only to the spectral amplitude data but also to the phase spectrum data, so that the phase is adjusted as well. Noise is thereby prevented and a high-quality synthesized singing voice is obtained. Note that in the smoothing and level adjustment processes, the spectral densities are made to coincide at the connection point, while the overall spectral densities remain roughly the same.
The present invention has been described with reference to specific embodiments, but it is not limited to those embodiments. Various modifications, combinations, and the like will be readily apparent to those skilled in the art.

Claims (14)

1. A singing voice synthesizing method comprising the steps of:
(a) detecting a spectrum by frequency-analyzing a sound waveform corresponding to a sound synthesis unit of a sound to be synthesized;
(b) detecting a plurality of local peaks of spectral density on the spectrum;
(c) for each of the plurality of local peaks, designating a spectrum distribution region containing the local peak and the spectra before and after the local peak on the spectrum, and generating amplitude spectrum data representing an amplitude spectrum distribution along the frequency axis of each spectrum distribution region;
(d) generating phase spectrum data representing a phase spectrum distribution along the frequency axis of each spectrum distribution region;
(e) designating a pitch for the sound to be synthesized;
(f) for each spectrum distribution region, adjusting the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
(g) for each spectrum distribution region, adjusting the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and
(h) converting the adjusted amplitude spectrum data and the adjusted phase spectrum data into a synthesized sound signal in the time domain.
2. A singing voice synthesizing method according to claim 1, wherein the pitch designating step (e) designates the pitch in accordance with pitch fluctuation data representing a variation of the pitch over time.
3. A singing voice synthesizing method according to claim 2, wherein the pitch fluctuation data correspond to a control parameter for controlling a musical expression of the sound to be synthesized.
4. A singing voice synthesizing method according to claim 1, wherein the amplitude spectrum data adjusting step (f), prior to the adjustment, adjusts the spectral density of any local peak that does not coincide with a spectrum envelope connecting the plurality of local peaks corresponding to the respective spectral lines, so that the local peak coincides with the spectrum envelope.
5. A singing voice synthesizing method according to claim 1, wherein the amplitude spectrum adjusting step (f) adjusts the spectral density of any local peak that does not coincide with a predetermined spectrum envelope, so that the local peak coincides with the predetermined spectrum envelope.
6. A singing voice synthesizing method according to claim 4, wherein the amplitude spectrum adjusting step (f) sets a spectrum envelope that varies over time by adjusting the densities over successive time frames in accordance with spectrum envelope fluctuation data representing a variation of the spectrum envelope.
7. A singing voice synthesizing method according to claim 6, wherein the spectrum envelope fluctuation data correspond to a control parameter for controlling a musical expression of the sound to be synthesized.
8. A singing voice synthesizing method comprising the steps of:
(a) obtaining amplitude spectrum data and phase spectrum data corresponding to a sound synthesis unit of a sound to be synthesized, wherein the amplitude spectrum data represent a determined amplitude spectrum distribution along the frequency axis of each spectrum distribution region corresponding to each of a plurality of local peaks of spectral density, each spectrum distribution region containing the local peak and the spectra before and after the local peak in a spectrum obtained by frequency analysis of a sound waveform of the sound synthesis unit, and the phase spectrum data represent a determined phase spectrum distribution along the frequency axis of each spectrum distribution region;
(b) designating a pitch for the sound to be synthesized;
(c) for each spectrum distribution region, adjusting the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
(d) for each spectrum distribution region, adjusting the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and
(e) converting the adjusted amplitude spectrum data and the adjusted phase spectrum data into a synthesized sound signal in the time domain.
9. A singing voice synthesizing apparatus comprising:
a designating device for designating a sound synthesis unit and a pitch for a sound to be synthesized;
a reading device for reading, from a sound synthesis unit database, sound waveform data representing a waveform corresponding to the sound synthesis unit, as sound synthesis unit data;
a first detecting device for detecting a spectrum by frequency-analyzing the sound waveform represented by the sound waveform data;
a second detecting device for detecting a plurality of local peaks of spectral density on the spectrum;
a first generating device for designating, for each of the plurality of local peaks, a spectrum distribution region containing the local peak and the spectra before and after the local peak on the spectrum, and generating amplitude spectrum data representing an amplitude spectrum distribution along the frequency axis of each spectrum distribution region;
a second generating device for generating, for each spectrum distribution region, phase spectrum data representing a phase spectrum distribution along the frequency axis;
a first adjusting device for adjusting, for each spectrum distribution region, the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
a second adjusting device for adjusting, for each spectrum distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and
a converting device for converting the adjusted amplitude spectrum data and the adjusted phase spectrum data into a synthesized sound signal in the time domain.
10. A singing voice synthesizing apparatus according to claim 9, wherein
the designating device designates a control parameter for controlling a musical expression of the sound to be synthesized, and
the reading device reads the sound synthesis unit data corresponding to the sound synthesis unit and the control parameter.
11. A singing voice synthesizing apparatus according to claim 9, wherein
the designating device designates a note length and/or a tempo of the sound to be synthesized, and
the reading device reads the sound synthesis unit data for a time corresponding to the note length and/or the tempo by omitting a part of, or repeating a part or the whole of, the sound synthesis unit data.
12. A voice synthesizing apparatus comprising:
a designating device for designating a sound synthesis unit and a pitch for a sound to be synthesized;
a reading device for reading, from a sound synthesis unit database, amplitude spectrum data and phase spectrum data corresponding to the sound synthesis unit, as sound synthesis unit data, wherein the amplitude spectrum data represent a determined amplitude spectrum distribution along the frequency axis of each spectrum distribution region corresponding to each of a plurality of local peaks of spectral density, each spectrum distribution region containing the local peak and the spectra before and after the local peak in a spectrum obtained by frequency analysis of a sound waveform of the sound synthesis unit, and the phase spectrum data represent a determined phase spectrum distribution along the frequency axis of each spectrum distribution region;
a first adjusting device for adjusting, for each spectrum distribution region, the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
a second adjusting device for adjusting, for each spectrum distribution region, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data; and
a converting device for converting the adjusted amplitude spectrum data and the adjusted phase spectrum data into a synthesized sound signal in the time domain.
13. A singing voice synthesizing apparatus comprising:
a designating device for designating a sound synthesis unit and a pitch for each of sounds to be synthesized in sequence;
a reading device for reading, from a sound synthesis unit database, sound waveform data corresponding to each sound synthesis unit designated by the designating device;
a first detecting device for detecting a spectrum by frequency-analyzing the sound waveform represented by each set of sound waveform data;
a second detecting device for detecting a plurality of local peaks of spectral density on the spectrum corresponding to each sound waveform;
a first generating device for designating, for each of the plurality of local peaks of each sound synthesis unit, a spectrum distribution region containing the local peak and the spectra before and after the local peak on the spectrum, and generating amplitude spectrum data representing an amplitude spectrum distribution along the frequency axis of each spectrum distribution region;
a second generating device for generating phase spectrum data representing a phase spectrum distribution along the frequency axis of each spectrum distribution region of each sound synthesis unit;
a first adjusting device for adjusting, for each spectrum distribution region of each sound synthesis unit, the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
a second adjusting device for adjusting, for each spectrum distribution region of each sound synthesis unit, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data;
a first connecting device for connecting the adjusted amplitude spectrum data so as to connect the successive sound synthesis units for the sounds synthesized successively in pronunciation order, wherein at the connection point of successive sound synthesis units the spectral densities are adjusted to coincide or approximately coincide with each other;
a second connecting device for connecting the adjusted phase spectrum data so as to connect the successive sound synthesis units for the sounds synthesized successively in pronunciation order, wherein at the connection point of successive sound synthesis units the phases are adjusted to coincide or approximately coincide with each other; and
a converting device for converting the connected amplitude spectrum data and the connected phase spectrum data into a synthesized sound signal in the time domain.
14. A singing voice synthesizing apparatus comprising:
a designating device for designating a sound synthesis unit and a pitch for each of sounds to be synthesized in sequence;
a reading device for reading, from a sound synthesis unit database, amplitude spectrum data and phase spectrum data corresponding to each sound synthesis unit designated by the designating device, wherein the amplitude spectrum data represent a determined amplitude spectrum distribution along the frequency axis of each spectrum distribution region corresponding to each of a plurality of local peaks of spectral density, each spectrum distribution region containing the local peak and the spectra before and after the local peak in a spectrum obtained by frequency analysis of a sound waveform of the sound synthesis unit, and the phase spectrum data represent a determined phase spectrum distribution along the frequency axis of each spectrum distribution region;
a first adjusting device for adjusting, for each spectrum distribution region of each sound synthesis unit, the amplitude spectrum data so that the amplitude spectrum distribution represented by the amplitude spectrum data is moved along the frequency axis in accordance with the pitch;
a second adjusting device for adjusting, for each spectrum distribution region of each sound synthesis unit, the phase spectrum distribution represented by the phase spectrum data in accordance with the adjustment of the amplitude spectrum data;
a first connecting device for connecting the adjusted amplitude spectrum data so as to connect the successive sound synthesis units for the sounds synthesized successively in pronunciation order, wherein at the connection point of successive sound synthesis units the spectral densities are adjusted to coincide or approximately coincide with each other;
a second connecting device for connecting the adjusted phase spectrum data so as to connect the successive sound synthesis units for the sounds synthesized successively in pronunciation order, wherein at the connection point of successive sound synthesis units the phases are adjusted to coincide or approximately coincide with each other; and
a converting device for converting the connected amplitude spectrum data and the connected phase spectrum data into a synthesized sound signal in the time domain.
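As a non-authoritative illustration of the pitch-adjusting step recited in the claims (step (f) of claim 1: moving each amplitude spectrum distribution region along the frequency axis according to the pitch), the Python sketch below shifts each local-peak region of a discrete amplitude spectrum to a new peak frequency. The function name, the fixed half-width region, and the bin-rounding scheme are all invented for illustration and are not the patented implementation.

```python
import numpy as np

def shift_peak_regions(amplitude, peak_bins, region_halfwidth, pitch_ratio):
    """Move each spectrum distribution region (local peak plus the spectra
    before and after it) along the frequency axis so that the peak lands
    at `pitch_ratio` times its original bin, preserving the region's shape.
    """
    out = np.zeros_like(amplitude)
    n = len(amplitude)
    for p in peak_bins:
        lo = max(p - region_halfwidth, 0)
        hi = min(p + region_halfwidth + 1, n)
        target = int(round(p * pitch_ratio))   # new peak position
        shift = target - p
        for i in range(lo, hi):
            j = i + shift
            if 0 <= j < n:
                out[j] += amplitude[i]         # copy the region, shifted
    return out
```

For example, a triangular peak centred on bin 10, shifted with a pitch ratio of 1.5, reappears unchanged in shape but centred on bin 15.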
CNB031275516A 2003-08-06 2003-08-06 Singing voice synthesizing method Expired - Fee Related CN100524456C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB031275516A CN100524456C (en) 2003-08-06 2003-08-06 Singing voice synthesizing method


Publications (2)

Publication Number Publication Date
CN1581290A CN1581290A (en) 2005-02-16
CN100524456C true CN100524456C (en) 2009-08-05

Family

ID=34578836

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB031275516A Expired - Fee Related CN100524456C (en) 2003-08-06 2003-08-06 Singing voice synthesizing method

Country Status (1)

Country Link
CN (1) CN100524456C (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101308652B (en) * 2008-07-17 2011-06-29 安徽科大讯飞信息科技股份有限公司 Synthesizing method of personalized singing voice
CN102024453B (en) * 2009-09-09 2012-05-23 财团法人资讯工业策进会 Singing sound synthesis system, method and device
CN106328111B (en) * 2016-08-22 2018-09-04 广州酷狗计算机科技有限公司 Audio-frequency processing method and device
JP6587007B1 (en) * 2018-04-16 2019-10-09 カシオ計算機株式会社 Electronic musical instrument, electronic musical instrument control method, and program
CN109359212A (en) * 2018-09-04 2019-02-19 路双双 A kind of five notes of traditional Chinese music therapy song classification method based on attribute partial order theory
CN109147757B (en) * 2018-09-11 2021-07-02 广州酷狗计算机科技有限公司 Singing voice synthesis method and device
CN112289330A (en) * 2020-08-26 2021-01-29 北京字节跳动网络技术有限公司 Audio processing method, device, equipment and storage medium

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
Cheng-Yuan Lin et al., "An On-the-Fly Mandarin Singing Voice Synthesis System," Advances in Multimedia Information Processing, 2002. *
Jean Laroche and Mark Dolson, "New Phase-Vocoder Techniques for Pitch-Shifting, Harmonizing and Other Exotic Effects," Applications of Signal Processing to Audio and Acoustics, 1999. *
E. Moulines et al., "Non-parametric techniques for pitch-scale and time-scale modification of speech," Speech Communication, Elsevier Science Publishers, Amsterdam, NL, Vol. 16, No. 2, 1995. *
P. Depalle et al., "The Recreation of a Castrato Voice, Farinelli's Voice," IEEE ASSP Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, 1995.
P. R. Cook, "Toward the Perfect Audio Morph? Singing Voice Synthesis and Processing," Workshop on Digital Audio Effects, 1998.



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090805

Termination date: 20180806