Summary of the Invention
According to an aspect of the present invention, there is provided a method for providing an encoded speech corpus, the method comprising:
selecting an acoustic unit from a first speech corpus, the acoustic unit having an identifier and a plurality of associated frames representing a sampled speech waveform;
determining whether each frame is an unvoiced frame or a voiced frame;
classifying the frames into at least voiced frames and unvoiced frames;
encoding the frames into encoded frames according to at least four Code Excited Linear Prediction (CELP) parameters, wherein the encoding depends on the classifying; and
storing the encoded frames in a second speech corpus.
Suitably, the determining comprises comparing a frame energy representing the frame with a frame energy threshold and a mel-frequency cepstral coefficient threshold.
Suitably, the classifying is characterized in that voiced frames are classified into three subclasses, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The classifying may be further characterized in that the two classes of stable voiced frames are distinguished by comparing the line spectral pair of a stable voiced frame with the line spectral pair of the previous frame.
Suitably, the encoding of the frames is characterized in that the encoding depends on the classification into the three subclasses. Suitably, the encoding is characterized in that the number of bits required to encode each subclass is different. Further, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, an encoded frame includes a field for identifying its classification.
According to a further aspect of the invention, there is provided an electronic device, the device comprising:
a device processor;
a speech synthesizer coupled to the device processor; and
a speech corpus of encoded frames of acoustic units, wherein the encoded frames are encoded according to their classification as unvoiced frames or voiced frames.
Suitably, the voiced frames in the speech corpus are encoded according to three subclasses, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The two classes of stable voiced frames may be distinguished by comparing the line spectral pair of a stable voiced frame with the line spectral pair of the previous frame.
The encoded frames in the corpus may be characterized in that the number of bits required to encode each subclass is different. Further, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, an encoded frame includes a field for identifying its classification.
Embodiment
Referring to Fig. 1, there is shown an electronic device 100, in the form of a wireless telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104, typically a touch screen or a display screen and keypad. The electronic device 100 also has a speech corpus 106 (an encoded second speech corpus), a speech synthesizer 110, a non-volatile memory 120, a read-only memory 118 and a radio frequency communications unit 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a loudspeaker 112. The building of the speech corpus 106 is described later in this specification. In use, the non-volatile memory 120 (a storage module) stores text for text-to-speech (TTS) synthesis, where the text may be received by the communications unit 116 or otherwise.
As will be apparent to a person skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver is also coupled to a combined modulator/demodulator that couples the communications unit 116 to the processor 102 via the bus 103. Also, in this embodiment, the non-volatile memory 120 (storage module) stores a user-programmable phonebook database Db, and the read-only memory 118 stores operating code (OC) for the device processor 102.
In Fig. 2 there is shown a schematic block diagram of an encoding structure 200. The structure 200 comprises a first speech corpus 210, a frame processing and encoding unit 220 and an encoded second speech corpus 230. The first speech corpus 210 and the second speech corpus 230 are operatively coupled to the frame processing and encoding unit 220. There is also a user interface 240 operatively coupled to the frame processing and encoding unit 220. In use, the frame processing and encoding unit 220 encodes acoustic units AU from the first speech corpus 210 into the encoded second speech corpus 230, as described below with reference to Fig. 3. Also, the user interface 240 includes input keys for entering commands and a display screen for providing visual feedback to a user, the user interface 240 allowing the user to initiate the encoding.
Referring to Fig. 3, there is shown a method 300 for providing the encoded second speech corpus 230 from the first speech corpus 210, the method being performed essentially by the frame processing and encoding unit 220. After the method 300 is invoked at start block 305, an initialization process is performed at block 310 by suitable actuation of the user interface 240. The initialization process includes setting an integer counter i to 1 (i := 1) and setting an integer speech corpus size value N (speech corpus size := N), where N is the number of acoustic units AU in the first speech corpus 210. Next, at block 315, the method 300 selects the i-th acoustic unit AU in the first speech corpus 210. Thus, because the counter i was initialized to 1 at block 310, the first acoustic unit AU in the first speech corpus 210 is selected. Here, as is known to a person skilled in the art, the first acoustic unit AU is the acoustic unit AU at the first memory location of the first speech corpus 210, and it has an identifier and a plurality of associated frames representing a sampled speech waveform.
In the present embodiment, each frame typically comprises 240 samples, each sample being represented as 16 bits. Also, an acoustic unit AU typically has 8 to 20 frames of sampled speech. These frames are ordered sequentially and arranged in a frame sequence (FRS), as will be known to a person skilled in the art. Thus, at block 315, the frame number (FN) for the i-th acoustic unit AU is determined simply by counting the frames of the sampled speech waveform representing the i-th acoustic unit AU.
After block 315, at block 320 the method 300 calculates a frame energy (FE) and mel-frequency cepstral coefficient (MFCC) parameters for each of frames 1 to FN of the i-th acoustic unit (the 1st acoustic unit in this example). The frame energy FE is based on the squares of the frame sample values; both the frame energy FE and the MFCC parameters are readily calculated by techniques well known to a person skilled in the art.
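By way of illustration only, the frame segmentation of block 315 and the feature computation of block 320 might be sketched as follows in Python. The helper names, the 13-coefficient MFCC, and the use of the librosa library are assumptions made for the sketch, not part of the embodiment:

```python
# A minimal sketch, assuming a mono 16-bit waveform sampled at 8 kHz and
# the 240-sample frames of this embodiment. Helper names are illustrative.
import numpy as np
import librosa  # assumed here only for the MFCC computation

FRAME_LEN = 240  # samples per frame; each sample is represented as 16 bits

def split_into_frames(waveform: np.ndarray) -> list:
    """Arrange the sampled waveform of an acoustic unit AU into its frame
    sequence FRS; the number of whole frames is the frame number FN."""
    fn = len(waveform) // FRAME_LEN
    return [waveform[j * FRAME_LEN:(j + 1) * FRAME_LEN] for j in range(fn)]

def frame_energy_db(frame: np.ndarray) -> float:
    """Frame energy FE, based on the squares of the frame sample values."""
    return 10.0 * np.log10(np.sum(frame.astype(np.float64) ** 2) + 1e-12)

def frame_mfcc(frame: np.ndarray, sr: int = 8000) -> np.ndarray:
    """MFCC parameters of one frame (13 coefficients is a common choice)."""
    return librosa.feature.mfcc(y=frame.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=FRAME_LEN,
                                hop_length=FRAME_LEN, center=False).ravel()
```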
Next, at block 325, a variable k is set to 1; this variable k is used to select frames 1 to FN in the order of the frame sequence FRS. Thereafter, the frame energy FE and the MFCC parameters of each frame are used at test block 330 to determine whether the k-th frame is an unvoiced frame (UF) or a voiced frame (VF). An unvoiced frame UF is a frame representing an essentially noise-like or aperiodic signal, for which the vocal cords do not vibrate when a human utters that part of speech (for example, fricative phonemes). In contrast, a voiced frame VF is a frame representing a periodic signal, for which the vocal cords vibrate at a fundamental frequency when a human utters that part of speech. For example, the waveforms of most vowel phonemes belong to voiced frames VF.
At test block 330, the frame energy (signal energy) of the frame in question is compared with a frame energy threshold ETV and an MFCC threshold MFTV. In this embodiment, the energy threshold ETV is set to 45 dB and the MFCC threshold MFTV is set to -10.00. If the frame energy of a frame is less than ETV, or its MFCC coefficient is less than MFTV, the frame is classified as an unvoiced frame UF. The thresholds ETV and MFTV are determined by off-line analysis of acoustic units AU selected from the first speech corpus 210. This analysis includes a perceptual listening test and subsequent manual classification of each frame of each selected acoustic unit AU. In addition, the associated frame energy FE and MFCC parameter values are calculated for every frame. Every frame is thus classified as unvoiced or voiced, and has associated with it a calculated signal frame energy value and MFCC parameter value. The thresholds are then determined by statistical analysis of the means and corresponding deviations over this small set of acoustic units AU.
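A minimal sketch of the test of block 330 follows, using the thresholds of this embodiment (ETV = 45 dB, MFTV = -10.00); which MFCC coefficient is compared against MFTV is not specified above, so the use of the first coefficient is an assumption of the sketch:

```python
ETV = 45.0     # frame energy threshold, in dB
MFTV = -10.00  # MFCC threshold

def is_unvoiced(fe_db: float, mfcc) -> bool:
    """Test block 330: a frame is unvoiced UF if its frame energy is below
    ETV or its MFCC coefficient (assumed: the first) is below MFTV."""
    return fe_db < ETV or float(mfcc[0]) < MFTV
```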
If the k-th frame is determined at test block 330 to be a voiced frame VF, then at block 335 the method 300 calculates CELP parameters for the k-th frame. These CELP parameters are: line spectral pair (LSP) parameters and adaptive codebook index (Ac) parameters. It should be noted that the CELP gain index (Gf & Ga) parameters and the fixed codebook index (Fc) parameters need not be calculated at block 335. These CELP parameters, and CELP in general, are well known to persons skilled in the art; for brevity, a detailed explanation is therefore not provided here.
At block 340, the method 300 then classifies the voiced frame VF into one of three classes, these being: a voiced frame transition class representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames. These three classes are referred to as voiced frame transition (VFT), voiced frame stable type 1 (VFS1) and voiced frame stable type 2 (VFS2). The classification is made according to a distance measure between the adaptive codebook Ac parameter of this frame and the adaptive codebook Ac parameter of the previous frame (frame k-1), so that a voiced frame VF is determined to be either a voiced frame transition VFT or one of the two stable types VFS1, VFS2. However, if the k-th frame is the first frame (k = 1), there is no previous frame for the i-th acoustic unit AU, and the frame is classified as a voiced frame transition VFT by default. If the frame is not the first frame, the difference between the adaptive codebook vectors of two consecutive adjacent frames can be used to describe the degree of speech periodicity. The following test is therefore used to determine the classification of the frame:
IF DistAc(Ac_CurFrame, Ac_PreFrame) < Threshold1
THEN the current frame k is identified as one of the stable types VFS1, VFS2;
ELSE the current frame k is identified as a voiced frame transition VFT.

where Ac_CurFrame is the adaptive codebook Ac parameter of the current frame k; Ac_PreFrame is the adaptive codebook Ac parameter of the previous frame k-1; and DistAc is the distance measure calculated by subtracting Ac_CurFrame from Ac_PreFrame.
The adaptive codebook Ac parameter corresponds roughly to the fundamental frequency (pitch) of the speech segment. In this example embodiment, its range is 20 to 147 samples at an 8 kHz speech sampling rate. If the current frame k has the same or a very similar adaptive codebook Ac parameter compared with the previous frame k-1, it can be regarded as a stable frame. After experimentation, the value of Threshold1 was set to 8.
If the current frame k is not identified as a voiced frame transition, it can be identified as one of the stable types VFS1, VFS2. These two stable voiced types VFS1, VFS2 are associated with frames that have almost the same pitch as the preceding VF frame and, moreover, a similar timbre. The following test classifies a stable voiced frame as VFS1 or VFS2:
IF DistLSP(LSP_CurFrame, LSP_PreFrame) < Threshold2
THEN the current frame k is classified as VFS2;
ELSE the current frame k is classified as VFS1.

where LSP_CurFrame is the LSP vector of the current frame k; LSP_PreFrame is the LSP vector of the previous frame k-1; and DistLSP is the distance measure calculated as a conventional Euclidean distance. In this example embodiment, the value of Threshold2 is chosen to be 100. Thus, the two types of stable voiced frames are sorted out by comparing the LSP of a stable voiced frame with the LSP of the previous frame.
Returning to test block 330, if the k-th frame is determined to be an unvoiced frame UF, then at block 345 the k-th frame is classified as an unvoiced frame UF.
After the k-th frame has been classified at block 340 or 345 as voiced VF or unvoiced UF, at block 350 the method 300 encodes the frame as an encoded frame (EncF) according to at least four CELP parameters, these being: line spectral pair (LSP) parameters; adaptive codebook index (Ac) parameters; gain index (Gf & Ga) parameters; and fixed codebook index (Fc) parameters. The encoding depends on the classification performed at blocks 340 and 345. Thus, if the k-th frame is classified as a voiced frame VF, the encoding depends on the three classes (subclasses), the number of bits required to encode each subclass being different. The number of bits used to encode unvoiced frames UF also differs from the number of bits used to encode each subclass.
Tables 1 to 4 show the encoding of the voiced and unvoiced frames.
Table 1: Bit allocation of a VFT frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 8 |
| Adaptive codebook (Ac) | 7 | 2 | 7 | 2 | 18 |
| Gain (Gf & Ga) | 12 | 12 | 12 | 12 | 48 |
| Fixed codebook (Fc) | 11 | 11 | 11 | 11 | 44 |
| Total bits | | | | | 120 |
Table 2: Bit allocation of a VFS1 frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 8 |
| Adaptive codebook (Ac) | 7 | 2 | 5 | 2 | 16 |
| Gain (Gf & Ga) | 12 | 8 | 12 | 8 | 40 |
| Fixed codebook (Fc) | 11 | | 11 | | 22 |
| Total bits | | | | | 88 |
Table 3: Bit allocation of a VFS2 frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 0 |
| Adaptive codebook (Ac) | 7 | 2 | 5 | 2 | 16 |
| Gain (Gf & Ga) | 7.5 | 7.5 | 7.5 | 7.5 | 30 |
| Fixed codebook (Fc) | | | | | 0 |
| Total bits | | | | | 48 |
Table 4: Bit allocation of a UF frame

| Parameter | Subframe 1 | Subframe 2 | Subframe 3 | Subframe 4 | Total bits per frame |
|---|---|---|---|---|---|
| Frame type (Fid) | | | | | 2 |
| LSP index (LPC) | | | | | 6 |
| Adaptive codebook (Ac) | | | | | 0 |
| Gain (Gf & Ga) | 4 | 4 | 4 | 4 | 16 |
| Fixed codebook (Fc) | | | | | 0 |
| Total bits | | | | | 24 |
As shown in Tables 1 to 4, an encoded frame includes a 2-bit field for identifying its classification. After listening tests and analysis, a CELP encoding with a different frame size was chosen for the unvoiced frame UF class and for each of the three voiced frame VF subclasses. The CELP encoding takes decoding performance and real-time synthesis operation into account. Tables 1 to 4 clearly show that, for the gain (Gf & Ga) field, a voiced frame transition VFT frame is allocated 48 bits per frame; a voiced frame stable type 1 VFS1 frame is allocated 40 bits per frame; a voiced frame stable type 2 VFS2 frame is allocated 30 bits per frame; and an unvoiced frame UF is allocated 16 bits per frame. Further, after CELP encoding, the total number of bits allocated per frame is: 120 bits for a voiced frame transition VFT; 88 bits for voiced frame stable type 1 VFS1; 48 bits for voiced frame stable type 2 VFS2; and 24 bits for an unvoiced frame UF.
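The bit budgets of Tables 1 to 4 can also be summarized programmatically; the following sketch is illustrative only, with the per-field totals taken directly from the tables:

```python
# Total bits per field and per frame for each frame class (Tables 1 to 4).
BIT_ALLOCATION = {
    "VFT":  {"Fid": 2, "LSP": 8, "Ac": 18, "Gain": 48, "Fc": 44},  # 120 bits
    "VFS1": {"Fid": 2, "LSP": 8, "Ac": 16, "Gain": 40, "Fc": 22},  #  88 bits
    "VFS2": {"Fid": 2, "LSP": 0, "Ac": 16, "Gain": 30, "Fc": 0},   #  48 bits
    "UF":   {"Fid": 2, "LSP": 6, "Ac": 0,  "Gain": 16, "Fc": 0},   #  24 bits
}

def frame_bits(frame_class: str) -> int:
    """Total encoded size of one frame of the given class, in bits."""
    return sum(BIT_ALLOCATION[frame_class].values())

assert [frame_bits(c) for c in ("VFT", "VFS1", "VFS2", "UF")] == [120, 88, 48, 24]
```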
After the encoding at block 350, a test is performed at test block 355 to determine whether all FN frames of the i-th acoustic unit AU have been classified. If not all frames have been classified, the method 300 increments k at block 360 and then selectively repeats some of blocks 330 to 355.
When all FN frames of the i-th acoustic unit AU have been classified (k = FN), block 365 stores the encoded frames EncF of the i-th acoustic unit AU in the second speech corpus 230.
A test is then performed at test block 370 to determine whether there are acoustic units AU remaining to be selected in the first speech corpus 210. At this point, if i < N, then i is incremented at block 375 and some of blocks 320 to 370 are selectively repeated. If at test block 370 i = N, then all acoustic units AU have been selected and the method 300 ends at block 380.
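For orientation, the overall flow of method 300 can be sketched as a pair of nested loops over acoustic units and frames, reusing the helpers sketched earlier; celp_ac_and_lsp(), encode_frame() and the corpus interfaces are hypothetical placeholders for the CELP analysis, encoding and storage of blocks 335, 350 and 365:

```python
def method_300(first_corpus, second_corpus):
    N = len(first_corpus)                          # block 310: corpus size N
    for i in range(1, N + 1):                      # blocks 315, 370, 375
        au = first_corpus[i - 1]                   # i-th acoustic unit AU
        frames = split_into_frames(au.waveform)    # frame sequence FRS, 1..FN
        encoded, ac_pre, lsp_pre = [], None, None
        for k, frame in enumerate(frames, start=1):               # blocks 325, 360
            fe, mfcc = frame_energy_db(frame), frame_mfcc(frame)  # block 320
            if is_unvoiced(fe, mfcc):                             # test block 330
                cls = "UF"                                        # block 345
            else:
                ac, lsp = celp_ac_and_lsp(frame)                  # block 335 (hypothetical)
                cls = classify_voiced_frame(ac, ac_pre, lsp, lsp_pre)  # block 340
                ac_pre, lsp_pre = ac, lsp
            encoded.append(encode_frame(frame, cls))              # block 350 (hypothetical)
        second_corpus.store(au.identifier, encoded)               # block 365
```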
Advantageously, the invention provides a speech corpus (the encoded second speech corpus) that is selectively encoded according to the classification of its frames. Further, voiced frames are subclassified into voiced frame transition (VFT), voiced frame stable type 1 (VFS1) and voiced frame stable type 2 (VFS2). In contrast to the standard CELP encoding format, frames can be selectively encoded using different numbers of bits based on their classification, whereas in the standard CELP encoding format every frame requires 120 bits, because that format does not use the voiced and unvoiced classification described here. Hence, the selective encoding according to classification described in this specification improves decoding efficiency, reduces the required corpus size, and eliminates the need for the device processor to process high-bit-rate data during TTS.
The detailed description provides exemplary embodiments only and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the exemplary embodiments provides a person skilled in the art with an enabling description for implementing an exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of the elements without departing from the spirit and scope of the invention as set forth in the appended claims.