Summary of the Invention
According to an aspect of the present invention, there is provided a method for providing an encoded speech corpus, the method comprising:
selecting an acoustic unit from a first speech corpus, the acoustic unit having an identifier and a plurality of associated frames representing a sampled speech waveform;
determining whether each frame is an unvoiced frame or a voiced frame;
classifying the frames into at least voiced frames and unvoiced frames;
encoding the frames into encoded frames according to at least four Code Excited Linear Prediction (CELP) parameters, wherein the encoding depends on the classifying; and
storing the encoded frames in a second speech corpus.
Suitably, the determining comprises comparing a frame energy representing the frame with a frame energy threshold and a mel-frequency cepstral coefficient threshold.
Suitably, the classifying is characterized in that voiced frames are classified into three subclasses, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The classifying may be further characterized in that the two classes of stable voiced frames are distinguished by comparing the line spectral pair of a stable voiced frame with the line spectral pair of the previous frame.
Suitably, the encoding of the frames is characterized in that the encoding depends on the classification into the three subclasses. Suitably, the encoding is characterized in that the number of bits required to encode each subclass is different. Further, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, an encoded frame includes a field for identifying its classification.
According to a further aspect of the invention, there is provided an electronic device, the device comprising:
a device processor;
a speech synthesizer coupled to the device processor; and
a speech corpus of encoded frames of acoustic units, wherein the encoded frames are encoded according to their classification as unvoiced frames or voiced frames.
Suitably, the voiced frames in the speech corpus are encoded according to three subclasses, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The two classes of stable voiced frames may be distinguished by comparing the line spectral pair of a stable voiced frame with the line spectral pair of the previous frame.
The encoded frames in the corpus may be characterized in that the number of bits required to encode each subclass is different. Further, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, an encoded frame includes a field for identifying its classification.
Embodiment
Referring to Fig. 1, there is shown an electronic device 100, in the form of a wireless telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104, typically a touch screen or a display screen and keypad. The electronic device 100 also has a speech corpus 106 (an encoded second speech corpus), a speech synthesizer 110, a non-volatile memory 120, a read-only memory 118 and a radio frequency communications unit 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a loudspeaker 112. The building of the speech corpus 106 is described later in this specification. In use, the non-volatile memory 120 (a storage module) stores text for text-to-speech (TTS) synthesis, where the text may be received by the communications unit 116 or otherwise.
As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102 via the bus 103. Also, in this embodiment, the non-volatile memory 120 (storage module) stores a user-programmable phonebook database Db, and the read-only memory 118 stores operating code (OC) for the device processor 102.
In Fig. 2 there is shown a schematic block diagram of an encoding structure 200. The structure 200 comprises a first speech corpus 210, a frame processing and encoding unit 220 and an encoded second speech corpus 230. The first speech corpus 210 and the second speech corpus 230 are operatively coupled to the frame processing and encoding unit 220. There is also a user interface 240 operatively coupled to the frame processing and encoding unit 220. In use, the frame processing and encoding unit 220 encodes acoustic units AU from the first speech corpus 210 into the encoded second speech corpus 230, as described below with reference to Fig. 3. Also, the user interface 240 includes input keys for entering commands and a display screen for providing visual feedback to a user, the user interface 240 allowing the user to initiate the encoding.
Referring to Fig. 3, there is shown a method 300 for providing the encoded second speech corpus 230 from the first speech corpus 210, the method being performed essentially by the frame processing and encoding unit 220. After the method 300 is invoked at start block 305, an initialization process is performed at block 310 by suitable actuation of the user interface 240. The initialization process includes setting an integer counter i to 1 (i := 1) and setting an integer speech corpus size value N (speech corpus size := N), where N is the number of acoustic units AU in the first speech corpus 210. Next, at block 315, the method 300 selects the i-th acoustic unit AU in the first speech corpus 210. Thus, because the counter i was initialized to 1 at block 310, the first acoustic unit AU in the first speech corpus 210 is selected. Here, as is known to a person skilled in the art, the first acoustic unit AU is the acoustic unit AU at the first memory location of the first speech corpus 210, and it has an identifier and a plurality of associated frames representing a sampled speech waveform.
In the present embodiment, each frame typically comprises 240 samples, each sample being represented as 16 bits. Also, an acoustic unit AU typically has 8 to 20 frames of sampled speech. These frames are ordered sequentially and arranged in a frame sequence (FRS), as will be known to a person skilled in the art. Thus, at block 315, the frame number (FN) for the i-th acoustic unit AU is determined simply by counting the frames of the sampled speech waveform representing the i-th acoustic unit AU.
After block 315, at block 320 the method 300 calculates a frame energy (FE) and mel-frequency cepstral coefficient (MFCC) parameters for each of frames 1 to FN of the i-th acoustic unit (the 1st acoustic unit in this example). The frame energy FE is based on the squares of the frame sample values; both the frame energy FE and the MFCC parameters are readily calculated by techniques well known to a person skilled in the art.
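By way of illustration only, the frame segmentation of block 315 and the feature computation of block 320 might be sketched as follows in Python. The helper names, the 13-coefficient MFCC, and the use of the librosa library are assumptions made for the sketch, not part of the embodiment:

```python
# A minimal sketch, assuming a mono 16-bit waveform sampled at 8 kHz and
# the 240-sample frames of this embodiment. Helper names are illustrative.
import numpy as np
import librosa  # assumed here only for the MFCC computation

FRAME_LEN = 240  # samples per frame; each sample is represented as 16 bits

def split_into_frames(waveform: np.ndarray) -> list:
    """Arrange the sampled waveform of an acoustic unit AU into its frame
    sequence FRS; the number of whole frames is the frame number FN."""
    fn = len(waveform) // FRAME_LEN
    return [waveform[j * FRAME_LEN:(j + 1) * FRAME_LEN] for j in range(fn)]

def frame_energy_db(frame: np.ndarray) -> float:
    """Frame energy FE, based on the squares of the frame sample values."""
    return 10.0 * np.log10(np.sum(frame.astype(np.float64) ** 2) + 1e-12)

def frame_mfcc(frame: np.ndarray, sr: int = 8000) -> np.ndarray:
    """MFCC parameters of one frame (13 coefficients is a common choice)."""
    return librosa.feature.mfcc(y=frame.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=FRAME_LEN,
                                hop_length=FRAME_LEN, center=False).ravel()
```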
Next, at block 325, a variable k is set to 1; this variable k is used to select frames 1 to FN in the order of the frame sequence FRS. Thereafter, the frame energy FE and the MFCC parameters of each frame are used at test block 330 to determine whether the k-th frame is an unvoiced frame (UF) or a voiced frame (VF). An unvoiced frame UF is a frame representing an essentially noise-like or aperiodic signal, for which the vocal cords do not vibrate when a human utters that part of speech (for example, fricative phonemes). In contrast, a voiced frame VF is a frame representing a periodic signal, for which the vocal cords vibrate at a fundamental frequency when a human utters that part of speech. For example, the waveforms of most vowel phonemes belong to voiced frames VF.
At test block 330, the frame energy (signal energy) of the frame in question is compared with a frame energy threshold ETV and an MFCC threshold MFTV. In this embodiment, the energy threshold ETV is set to 45 dB and the MFCC threshold MFTV is set to -10.00. If the frame energy of a frame is less than ETV, or its MFCC coefficient is less than MFTV, the frame is classified as an unvoiced frame UF. The thresholds ETV and MFTV are determined by off-line analysis of acoustic units AU selected from the first speech corpus 210. This analysis includes a perceptual listening test and subsequent manual classification of each frame of each selected acoustic unit AU. In addition, the associated frame energy FE and MFCC parameter values are calculated for every frame. Every frame is thus classified as unvoiced or voiced, and has associated with it a calculated signal frame energy value and MFCC parameter value. The thresholds are then determined by statistical analysis of the means and corresponding deviations over this small set of acoustic units AU.
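A minimal sketch of the test of block 330 follows, using the thresholds of this embodiment (ETV = 45 dB, MFTV = -10.00); which MFCC coefficient is compared against MFTV is not specified above, so the use of the first coefficient is an assumption of the sketch:

```python
ETV = 45.0     # frame energy threshold, in dB
MFTV = -10.00  # MFCC threshold

def is_unvoiced(fe_db: float, mfcc) -> bool:
    """Test block 330: a frame is unvoiced UF if its frame energy is below
    ETV or its MFCC coefficient (assumed: the first) is below MFTV."""
    return fe_db < ETV or float(mfcc[0]) < MFTV
```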
If the k-th frame is determined at test block 330 to be a voiced frame VF, then at block 335 the method 300 calculates CELP parameters for the k-th frame. These CELP parameters are: line spectral pair (LSP) parameters and adaptive codebook index (Ac) parameters. It should be noted that the CELP gain index (Gf & Ga) parameters and the fixed codebook index (Fc) parameters need not be calculated at block 335. These CELP parameters, and CELP in general, are well known to persons skilled in the art; for brevity, a detailed explanation is therefore not provided here.
At block 340, the method 300 then classifies the voiced frame VF into one of three classes, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames. These three classes are referred to as voiced frame transition (VFT), voiced frame stable type 1 (VFS1) and voiced frame stable type 2 (VFS2). The classification is made according to a distance measure between the adaptive codebook Ac parameter of this frame and the adaptive codebook Ac parameter of the previous frame (frame k-1), so that a voiced frame VF is determined to be either a voiced frame transition VFT or one of the two stable types VFS1, VFS2. However, if the k-th frame is the first frame (k = 1), there is no previous frame for the i-th acoustic unit AU, and the frame is classified as a voiced frame transition VFT by default. If the frame is not the first frame, the difference between the adaptive codebook vectors of two consecutive adjacent frames can be used to describe the degree of speech periodicity. The following test is therefore used to determine the classification of the frame:
IF DistAc(Ac_CurFrame, Ac_PreFrame) < Threshold1
THEN the current frame k is identified as one of the stable types VFS1, VFS2;
ELSE the current frame k is identified as a voiced frame transition VFT.

where Ac_CurFrame is the adaptive codebook Ac parameter of the current frame k; Ac_PreFrame is the adaptive codebook Ac parameter of the previous frame k-1; and DistAc is the distance measure calculated by subtracting Ac_CurFrame from Ac_PreFrame.
The adaptive codebook Ac parameter corresponds roughly to the fundamental frequency (pitch) of the speech segment. In this example embodiment, its range is 20 to 147 samples at an 8 kHz speech sampling rate. If the current frame k has the same or a very similar adaptive codebook Ac parameter compared with the previous frame k-1, it can be regarded as a stable frame. After experimentation, the value of Threshold1 was set to 8.
If the current frame k is not identified as a voiced frame transition, it can be identified as one of the stable types VFS1, VFS2. These two stable voiced types VFS1, VFS2 are associated with frames that have almost the same pitch as the preceding VF frame and, moreover, a similar timbre. The following test classifies a stable voiced frame as VFS1 or VFS2:
IF DistLSP(LSP_CurFrame, LSP_PreFrame) < Threshold2
THEN the current frame k is classified as VFS2;
ELSE the current frame k is classified as VFS1.

where LSP_CurFrame is the LSP vector of the current frame k; LSP_PreFrame is the LSP vector of the previous frame k-1; and DistLSP is the distance measure calculated as a conventional Euclidean distance. In this example embodiment, the value of Threshold2 is chosen to be 100. Thus, the two types of stable voiced frames are sorted out by comparing the LSP of a stable voiced frame with the LSP of the previous frame.
Returning to test block 330, if the k-th frame is determined to be an unvoiced frame UF, then at block 345 the k-th frame is classified as an unvoiced frame UF.
After the k-th frame has been classified at block 340 or 345 as voiced VF or unvoiced UF, at block 350 the method 300 encodes the frame as an encoded frame (EncF) according to at least four CELP parameters, these being: line spectral pair (LSP) parameters; adaptive codebook index (Ac) parameters; gain index (Gf & Ga) parameters; and fixed codebook index (Fc) parameters. The encoding depends on the classification performed at blocks 340 and 345. Thus, if the k-th frame is classified as a voiced frame VF, the encoding depends on the three classes (subclasses), the number of bits required to encode each subclass being different. The number of bits used to encode unvoiced frames UF also differs from the number of bits used to encode each subclass.
Tables 1 to 4 show the encoding of the voiced and unvoiced frames.
Table 1: Bit allocation of a VFT frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 8 |
| Adaptive codebook (Ac) | 7 | 2 | 7 | 2 | 18 |
| Gain (Gf & Ga) | 12 | 12 | 12 | 12 | 48 |
| Fixed codebook (Fc) | 11 | 11 | 11 | 11 | 44 |
| Total bits | | | | | 120 |
Table 2: Bit allocation of a VFS1 frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 8 |
| Adaptive codebook (Ac) | 7 | 2 | 5 | 2 | 16 |
| Gain (Gf & Ga) | 12 | 8 | 12 | 8 | 40 |
| Fixed codebook (Fc) | 11 | | 11 | | 22 |
| Total bits | | | | | 88 |
Table 3: Bit allocation of a VFS2 frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 0 |
| Adaptive codebook (Ac) | 7 | 2 | 5 | 2 | 16 |
| Gain (Gf & Ga) | 7.5 | 7.5 | 7.5 | 7.5 | 30 |
| Fixed codebook (Fc) | | | | | 0 |
| Total bits | | | | | 48 |
Table 4: Bit allocation of a UF frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 6 |
| Adaptive codebook (Ac) | | | | | 0 |
| Gain (Gf & Ga) | 4 | 4 | 4 | 4 | 16 |
| Fixed codebook (Fc) | | | | | 0 |
| Total bits | | | | | 24 |
As shown in Tables 1 to 4, an encoded frame includes a 2-bit field for identifying its classification. After listening tests and analysis, a CELP encoding with a different frame size was chosen for the unvoiced frame UF class and for each of the three voiced frame VF subclasses. The CELP encoding takes decoding performance and real-time synthesis operation into account. Tables 1 to 4 clearly show that, for the gain (Gf & Ga) field, a voiced frame transition VFT frame is allocated 48 bits per frame; a voiced frame stable type 1 VFS1 frame is allocated 40 bits per frame; a voiced frame stable type 2 VFS2 frame is allocated 30 bits per frame; and an unvoiced frame UF is allocated 16 bits per frame. Further, after CELP encoding, the total number of bits allocated per frame is: 120 bits for a voiced frame transition VFT; 88 bits for voiced frame stable type 1 VFS1; 48 bits for voiced frame stable type 2 VFS2; and 24 bits for an unvoiced frame UF.
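The bit budgets of Tables 1 to 4 can also be summarized programmatically; the following sketch is illustrative only, with the per-field totals taken directly from the tables:

```python
# Total bits per field and per frame for each frame class (Tables 1 to 4).
BIT_ALLOCATION = {
    "VFT":  {"Fid": 2, "LSP": 8, "Ac": 18, "Gain": 48, "Fc": 44},  # 120 bits
    "VFS1": {"Fid": 2, "LSP": 8, "Ac": 16, "Gain": 40, "Fc": 22},  #  88 bits
    "VFS2": {"Fid": 2, "LSP": 0, "Ac": 16, "Gain": 30, "Fc": 0},   #  48 bits
    "UF":   {"Fid": 2, "LSP": 6, "Ac": 0,  "Gain": 16, "Fc": 0},   #  24 bits
}

def frame_bits(frame_class: str) -> int:
    """Total encoded size of one frame of the given class, in bits."""
    return sum(BIT_ALLOCATION[frame_class].values())

assert [frame_bits(c) for c in ("VFT", "VFS1", "VFS2", "UF")] == [120, 88, 48, 24]
```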
After the encoding at block 350, a test is performed at test block 355 to determine whether all FN frames of the i-th acoustic unit AU have been classified. If not all frames have been classified, the method 300 increments k at block 360 and then selectively repeats some of blocks 330 to 355.
When all FN frames of the i-th acoustic unit AU have been classified (k = FN), block 365 stores the encoded frames EncF of the i-th acoustic unit AU in the second speech corpus 230.
A test is then performed at test block 370 to determine whether there are acoustic units AU remaining to be selected in the first speech corpus 210. At this point, if i < N, then i is incremented at block 375 and some of blocks 320 to 370 are selectively repeated. If at test block 370 i = N, then all acoustic units AU have been selected and the method 300 ends at block 380.
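For orientation, the overall flow of method 300 can be sketched as a pair of nested loops over acoustic units and frames, reusing the helpers sketched earlier; celp_ac_and_lsp(), encode_frame() and the corpus interfaces are hypothetical placeholders for the CELP analysis, encoding and storage of blocks 335, 350 and 365:

```python
def method_300(first_corpus, second_corpus):
    N = len(first_corpus)                          # block 310: corpus size N
    for i in range(1, N + 1):                      # blocks 315, 370, 375
        au = first_corpus[i - 1]                   # i-th acoustic unit AU
        frames = split_into_frames(au.waveform)    # frame sequence FRS, 1..FN
        encoded, ac_pre, lsp_pre = [], None, None
        for k, frame in enumerate(frames, start=1):               # blocks 325, 360
            fe, mfcc = frame_energy_db(frame), frame_mfcc(frame)  # block 320
            if is_unvoiced(fe, mfcc):                             # test block 330
                cls = "UF"                                        # block 345
            else:
                ac, lsp = celp_ac_and_lsp(frame)                  # block 335 (hypothetical)
                cls = classify_voiced_frame(ac, ac_pre, lsp, lsp_pre)  # block 340
                ac_pre, lsp_pre = ac, lsp
            encoded.append(encode_frame(frame, cls))              # block 350 (hypothetical)
        second_corpus.store(au.identifier, encoded)               # block 365
```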
Advantageously, the invention provides a speech corpus (the encoded second speech corpus) that is selectively encoded according to the classification of its frames. Further, voiced frames are subclassified into voiced frame transition (VFT), voiced frame stable type 1 (VFS1) and voiced frame stable type 2 (VFS2). In contrast to the standard CELP encoding format, frames can be selectively encoded using different numbers of bits based on their classification, whereas in the standard CELP encoding format every frame requires 120 bits, because that format does not use the voiced and unvoiced classification described here. Hence, the selective encoding according to classification described in this specification improves decoding efficiency, reduces the required corpus size, and eliminates the need for the device processor to process high-bit-rate data during TTS.
The detailed description provides exemplary embodiments only and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the exemplary embodiments provides a person skilled in the art with an enabling description for implementing an exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of the elements without departing from the spirit and scope of the invention as set forth in the appended claims.