Background Art
Speech synthesis is the production of speech audio signals from non-speech signals. It is most often associated with text-to-speech (TTS) conversion, in which a text stream is converted into speech audio signals. This involves receiving the text stream, analyzing it and converting it into speech sounds, and then generating the corresponding speech signal. The means of generating such a signal usually fall into one of two classes: (i) generation from a speech signal model, and (ii) concatenation of prerecorded speech signals. The latter of these two approaches is known as concatenative speech synthesis.
At present, nearly all high-quality text-to-speech systems are based on concatenative synthesis, because this approach tends to produce more natural-sounding synthetic speech. It works from an inventory of speech units, and the larger the inventory, the better the results. This approach therefore needs more storage space than model-based approaches. Where storage space is not particularly limited, such as in a desktop computer, this matters little. In other devices, however, such as portable equipment like mobile phones and personal digital assistants (PDAs), it must be taken into account, particularly as more and more functions are packed into subminiature or even smaller devices. To reduce the memory footprint of a TTS system and meet the limited resources of portable equipment, the TTS inventory is therefore compressed and encoded, for example using a low-bit-rate speech coding technique with low computational complexity.
To obtain the sound inventory, recorded speech signals must be obtained from a speaker. The speaker spends several hours reading a predetermined text aloud, and the reading is recorded. The text is carefully designed so that as many phoneme-sequence combinations as possible are recorded, preferably with several instances of each required combination. The recorded reading is processed by a speech recognizer to determine where each phoneme begins and ends. Because the text is known, the position of each phoneme and phoneme combination is also known, and extracting the correct stretch of recorded speech to provide a required speech unit, whether it is a phoneme, diphone, triphone or some other element, is relatively easy. When several samples of a given phoneme or phoneme combination are available, the best among them is selected. The selected speech unit recordings are compressed and stored in a database.
Figure 1 shows a system operating a known encoding and decoding technique. The set of selected speech unit recordings is provided as original unit sample signals OSi, which are compressed and encoded by an encoder 10. Each signal is segmented into frames Fi by a signal segmenter 12 within the encoder 10. The individual frames Fi are downsampled and encoded by a downsampler and encoder 14. The downsampling process consists of keeping every second sample of each frame. The reduced frames are then encoded by a code-excited linear prediction (CELP) scheme 20. The encoded frames are output from the downsampler and encoder 14, and from the encoder 10, as single compressed and encoded unit bitstreams CUi, and are stored as such in an inventory 30.
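The two encoder-side steps just described, segmentation into frames followed by 2:1 downsampling by keeping every second sample, can be sketched as follows. This is a minimal illustration assuming frames are plain sample lists; the CELP encoding itself is omitted.

```python
def segment(signal, frame_len):
    """Split a speech-unit signal into fixed-length frames Fi."""
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]

def downsample_2to1(frame):
    """Keep every second sample of a frame, halving it before encoding."""
    return frame[::2]

# Toy signal: 12 samples, 4-sample frames -> three frames, each reduced to 2 samples.
frames = segment(list(range(12)), 4)
reduced = [downsample_2to1(f) for f in frames]
```

In the real system the reduced frames would then be fed to the CELP scheme 20; here they are simply left as raw sample lists.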
For any given corpus of speech units, the compression, encoding and storage described above are carried out only once. Thereafter the inventory of speech units is fixed and can be accessed repeatedly.
During TTS synthesis, the inventory is accessed on the basis of the indices of the speech units Uj to be synthesized. Each index is input to a decoder 40, where it is received by a selector 42. Based on the input index, the selector 42 selects and extracts the appropriate speech unit UCUj from the inventory 30.
The extracted speech unit stream UCUj is input to a decoder and upsampler 44, where the downsampling and encoding performed by the downsampler and encoder 14 are reversed. The extracted speech unit UCUj is decoded using the same CELP scheme 20 that was used in the encoding process. The decoded speech unit is then interpolated to provide an upsampled speech unit. These upsampled units are then output from the decoder 40 as synthesized samples of the speech unit Uj.
Although this and other parametric narrowband speech coding methods based on a speech production model can achieve low-bit-rate speech coding, the quality of the reconstructed speech is not as good as might be hoped, and parts of the synthesized speech lack clarity.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the accompanying drawings, like numerals in different figures denote identical elements. Referring to Fig. 2, there is shown an encoding/decoding system for the sound inventory of a concatenative TTS synthesizer according to an embodiment of the invention.
As described above, a set of selected speech unit recordings is provided by processing a recorded reading. This set of selected speech unit recordings is input to the encoder 100 of the embodiment of the invention as a wideband speech signal of original unit samples OSi. The signal is segmented into speech frames Fi by the signal segmenter 12 within the encoder 100, to provide a frame stream.
In this embodiment, a frame is a fixed-length segment of the speech signal, for example 30 milliseconds long, and each speech unit in the speech signal comprises 2 to 20 frames. The definition of a segment can vary, for instance to encompass two (or more) frames, or can be determined in other ways. The definition of the speech units used can also vary according to phonetic and statistical considerations; a unit may be a phoneme, a diphone, a syllable, or a phone string longer than a syllable.
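For concreteness: assuming a 16 kHz wideband sampling rate (the text does not state the rate), a 30-millisecond frame works out to 480 samples, which is consistent with the 480-sample frames discussed later.

```python
SAMPLE_RATE_HZ = 16000     # assumed wideband rate; not specified in the text
FRAME_MS = 30              # frame length given in the text
FRAMES_PER_UNIT = (2, 20)  # a speech unit spans 2 to 20 frames

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame
```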
A detector and splitter 102 detects two frame types in the frame stream and splits the frame stream into two sub-streams, each containing a different frame type: the first frame type goes into the first sub-stream and the second frame type into the second sub-stream.
More particularly, the detector and splitter 102 detects voiced and unvoiced frames, distinguishing the normal voiced frames NFi from the unvoiced frames UFi, which have a high noise content.
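The text does not specify how the detector 102 classifies frames. A common heuristic, shown here purely as an illustrative stand-in, treats frames with a high zero-crossing rate and low energy as unvoiced; the thresholds below are arbitrary.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(frame) - 1, 1)

def energy(frame):
    """Mean squared sample value."""
    return sum(s * s for s in frame) / len(frame)

def is_unvoiced(frame, zcr_threshold=0.3, energy_threshold=0.05):
    """Heuristic stand-in for the voiced/unvoiced detector 102."""
    return zero_crossing_rate(frame) > zcr_threshold and energy(frame) < energy_threshold

# A low-frequency tone looks voiced; weak, rapidly alternating noise looks unvoiced.
voiced_like = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
unvoiced_like = [0.01 * (1 if i % 2 == 0 else -1) for i in range(200)]
```

Real systems typically combine several such features (pitch detection, spectral tilt, LSF measurements as mentioned later in the text) rather than relying on two fixed thresholds.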
A downsampler and encoder 104 receives the voiced frame stream and downsamples and encodes the individual voiced frames NFi. In this exemplary embodiment, downsampling consists of keeping every second sample of each voiced frame, providing a narrowband signal of downsampled frames. The reduced frames are encoded by a low-bit-rate code-excited linear prediction (CELP) scheme 120. The compressed and encoded voiced frames are output as voiced output parameter arrays PNFi.
A splitter and encoder 106 receives the frame stream of unvoiced frames UFi. Each unvoiced frame UFi is divided into two parts, forming two pseudo-narrowband frames. These two pseudo-narrowband frames are each encoded by the low-bit-rate code-excited linear prediction (CELP) scheme 120. The split and encoded frames are output as unvoiced output parameter arrays PUFia and PUFib.
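The split applied by the splitter and encoder 106 can be sketched as follows, again treating frames as plain sample lists (an assumed representation): each 2N-sample unvoiced frame yields two N-sample pseudo-narrowband frames.

```python
def split_unvoiced(frame):
    """Split one unvoiced frame UFi into two equal pseudo-frames."""
    half = len(frame) // 2
    return frame[:half], frame[half:]

def join_unvoiced(a, b):
    """Inverse operation, used on the decoding side."""
    return a + b

pair = split_unvoiced([1, 2, 3, 4, 5, 6])   # -> ([1, 2, 3], [4, 5, 6])
restored = join_unvoiced(*pair)             # -> the original frame
```

Unlike downsampling, the split discards nothing, which is why the scheme preserves the full bandwidth of the unvoiced material.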
The encoded parameters PNFi, PUFia and PUFib are all input to a packer 108, where they are quantized and packed into a single bitstream BUi for each speech unit. The parameters are packed in the same order as the original frames appeared in the frame stream.
All the speech unit bitstreams BUi of the concatenative TTS system are stored in a parameter inventory 130, where they can be accessed for synthesis.
During TTS synthesis, the parameter inventory 130 is accessed on the basis of the indices of the speech units Uj to be synthesized. The extracted speech units are then processed in a manner opposite to the encoding that was carried out for storage in the inventory. The speech unit index Uj is input to a decoder 140, where it is received by a selector 42. Based on the input index, the selector 42 selects and extracts the appropriate speech unit UBUj from the parameter inventory 130.
A type discriminator and splitter 142 distinguishes the two different frame types and separates the frame stream into two sub-streams, with the first frame type in the first sub-stream and the second frame type in the second sub-stream. More particularly, the type discriminator determines from its type information whether an extracted frame of the unit UBUj is voiced or unvoiced, and separates the extracted bitstream accordingly into voiced output parameter arrays PNFj and unvoiced output parameter arrays PUFj.
A decoder and upsampler 144 receives the voiced output parameter arrays PNFj, and there the downsampling and encoding performed by the downsampler and encoder 104 are reversed. The frames of the voiced output parameter arrays PNFj are decoded according to the same CELP scheme 120 used in the encoding process. The corresponding speech segments are reconstructed and then upsampled into a wideband signal; in particular, the decoded speech units are interpolated to provide upsampled speech units. These are then output as reconstructed voiced frames SNFj.
A decoder and combiner 146 receives the unvoiced output parameter arrays PUFj, and there the splitting and encoding performed by the splitter and encoder 106 are reversed. The frames of the unvoiced output parameter arrays PUFj are decoded according to the same CELP scheme 120 used in the encoding process. The corresponding speech segments are reconstructed and then joined together. They are then output as reconstructed unvoiced frames SUFj.
A frame combiner 148 concatenates the reconstructed voiced and unvoiced frames SNFj and SUFj in sequence, in the same order in which the frames from which they were derived were extracted from the inventory, to form the synthesized speech signal of the corresponding speech unit. The concatenated frames are then output from the decoder 140 as synthesized samples of the speech unit Uj.
Although additions or other alterations to the inventory are possible, for any given corpus of speech units the compression, encoding and storage described above need only be carried out once. The inventory can then be accessed repeatedly. Although the embodiment described above shows the encoder 100 and the decoder 140 together, in most equipment they are not found together; a PDA or mobile phone, for example, usually has a preloaded inventory (produced by an encoder elsewhere) and itself contains only the inventory and the decoder (together with the other components or code used for TTS synthesis).
In the exemplary embodiment described above, compression means in the form of a downsampler and a splitter reduce the size of voiced frames by downsampling and reduce the size of unvoiced frames by splitting. The speech unit segments could also be compressed by other means.
In the exemplary embodiment described above, decompression means in the form of an upsampler and a combiner increase the size of voiced frames by upsampling and increase the frame size of unvoiced frames by joining. The speech unit segments could also be decompressed by other means.
Fig. 3 is a flowchart of the operation of the encoder of an exemplary embodiment of the invention.
At step S202, the speech signal OSi of a speech unit is segmented into frames Fi. Step S204 detects whether an incoming frame is unvoiced. If the incoming frame is not unvoiced, it is downsampled at step S206, the downsampled frame NFi is encoded at step S208, and the encoded frame PNFi is packed into the bitstream at step S210. If the incoming frame Fi is detected at step S204 to be an unvoiced frame UFi, it is split into two pseudo-frames at step S212, and the split unvoiced frames are encoded in succession at step S214. At step S210, the encoded frames PUFi are packed into the bitstream in the same order as the other encoded frames, that is, in the order in which the frames from which the encoded frames were derived appeared in the input speech signal.
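The encoder loop of Fig. 3 (steps S202 through S214) might be sketched as follows, with the CELP encoding replaced by a tagging stub and the frame classifier passed in as a parameter, since neither is given in code form in the text.

```python
def encode_unit(signal, frame_len, is_unvoiced_frame):
    """Sketch of the Fig. 3 loop: segment (S202), branch on frame type (S204),
    downsample (S206) or split (S212), and pack in original order (S210).
    Tags ('V'/'U') stand in for the CELP-encoded parameter arrays."""
    packed = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        if is_unvoiced_frame(frame):
            half = len(frame) // 2
            packed.append(('U', frame[:half]))   # pseudo-frame PUFia
            packed.append(('U', frame[half:]))   # pseudo-frame PUFib
        else:
            packed.append(('V', frame[::2]))     # downsampled frame PNFi
    return packed
```

With a toy two-frame signal and a classifier that calls the second frame unvoiced, the unit packs into three entries, in the order the frames occurred.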
After the processing of Fig. 3, the bitstream is recorded in the inventory.
Fig. 4 is a flowchart of the operation of the decoder of an exemplary embodiment of the invention.
At step S252, the indices of the speech units Uj to be synthesized are input. In the same order as the corresponding speech units appear in the indices, the encoded frames UBUj appropriate to the indexed speech units are selected from the inventory at step S254 and extracted at step S256. At step S258, it is detected whether an incoming encoded frame is an unvoiced encoded frame PUFj. If it is not unvoiced, it is decoded at step S260, the decoded frame is upsampled at step S262, and the upsampled frame SNFj is concatenated into the output stream at step S264. If the incoming frame UBUj is detected at step S258 to be unvoiced, it is decoded at step S266. Such frames generally occur in pairs, and the two decoded frames of each pair are joined together. At step S264, the joined frames are concatenated into the output stream. The concatenation at step S264 is carried out in the same order relative to the other decoded frames, that is, in the order in which the encoded frames from which the decoded frames were derived appear in the indices of the units to be synthesized.
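The decoding loop of Fig. 4 might be sketched as follows, with linear interpolation standing in for the upsampling step (S262) and tagged tuples standing in for the encoded frames; the CELP decoding itself is omitted.

```python
def upsample_2to1(half):
    """Crude linear-interpolation upsampling back to full frame length (S262)."""
    out = []
    for i, s in enumerate(half):
        nxt = half[i + 1] if i + 1 < len(half) else s
        out.extend([s, (s + nxt) / 2])
    return out

def decode_unit(packed):
    """Voiced entries are upsampled; unvoiced entries arrive in pairs and are
    joined; all frames are concatenated in their original order (S264)."""
    frames, pending = [], None
    for kind, samples in packed:
        if kind == 'V':
            frames.append(upsample_2to1(samples))
        elif pending is None:
            pending = samples                 # first pseudo-frame of a pair
        else:
            frames.append(pending + samples)  # join the unvoiced pair
            pending = None
    return [s for frame in frames for s in frame]
```

Note that the unvoiced pair comes back exactly, whereas the voiced frame is only an interpolated approximation of the original, which is the trade-off the scheme is built around.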
In the exemplary embodiment described above, the distinction between the frames that are downsampled and the frames that are split (and the reversal of that processing) is made on the basis of whether a frame is voiced or unvoiced. In other embodiments, the distinction can be based on other criteria. For example, the two relevant frame types might be noisy unvoiced and non-noisy unvoiced; that is, the distinction between them would depend on how noisy the unvoiced frames are. In this way, only the noisier unvoiced frames are split, saving some storage capacity. This distinction can be based on measurements of the line spectral frequency (LSF) parameters of the frames.
The exemplary embodiment described above uses a CELP scheme that encodes frames as 240-sample vectors. The incoming frames Fi start out as 480-sample vectors, so the voiced frames are downsampled at a ratio of 2:1. Where the composition of the incoming frames Fi differs, and/or where the coding scheme used differs, whether it is still a CELP scheme or some other coding scheme, other downsampling ratios will arise.
In the exemplary embodiment described above, each unvoiced frame is split into two pseudo-frames. This again is because the incoming frames Fi start out as 480-sample vectors while the present CELP scheme encodes frames as 240-sample vectors. Depending on changes in the composition of the incoming frames Fi and/or in the coding scheme used, the number of pseudo-frames can vary. If the number of samples in an incoming frame is not an exact multiple of the number of samples in the required pseudo-frames, a degree of downsampling (or upsampling) can also be applied.
The exemplary embodiment described above shows a hybrid scheme in which there are only two different treatments: downsampling or splitting on the encoding side, and upsampling or joining on the decoding side. Other treatments are also possible, for instance where the number of samples in an incoming frame is not twice the number of samples in the required pseudo-frames (for example, where the incoming frames Fi start out as 720-sample vectors while the CELP scheme encodes frames as 240-sample vectors). In that case, an incoming frame can be judged to be one of three different types, for example voiced, noisy unvoiced and non-noisy unvoiced. Before encoding, voiced frames can be downsampled at a ratio of 3:1, and noisy unvoiced frames can be split into three. Non-noisy unvoiced frames, on the other hand, can be downsampled at a ratio of 3:2 before encoding (to 480 samples each) and then split into two. The decoding side carries out the opposite of the above processing.
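The size arithmetic of this three-way variant (using the 720- and 240-sample figures from the text) can be checked directly; this is just the bookkeeping, not the coding itself.

```python
FRAME = 720        # incoming wideband frame, in samples
CELP_FRAME = 240   # frame size the assumed CELP scheme encodes

# Voiced: 3:1 downsampling -> one CELP-sized frame.
voiced = FRAME // 3

# Noisy unvoiced: split into three CELP-sized pseudo-frames, no downsampling.
noisy_unvoiced = [FRAME // 3] * 3

# Non-noisy unvoiced: 3:2 downsampling to 480 samples, then split into two.
downsampled = FRAME * 2 // 3
clean_unvoiced = [downsampled // 2] * 2
```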
The exemplary embodiment described above uses a speech coding technique that takes the CELP model as its basic framework, built on a low-bit-rate, mixed-bandwidth CELP coding technique, as a new tool for compressing the TTS inventory. Each speech unit of the TTS inventory is encoded with the mixed-bandwidth CELP scheme when the TTS inventory is built, and during synthesis the unit bitstreams are decoded by the mixed-bandwidth CELP scheme.
For the CELP scheme in the above embodiment, the coding method can be customized to the characteristics of TTS or speech synthesis. To this end, a low-bit-rate narrowband CELP scheme is established as the speech coding basis for the sound inventory of the concatenative TTS system. Because only one (or two) specific, known speakers record the speech corpus for the TTS system, the characteristics of those speakers can be fully exploited to customize the coding scheme. In particular, a line spectral pair (LSP) vector codebook representing the speaker's pronunciation characteristics can be obtained by training. These measures can reduce the coding rate to a low bit rate while still achieving high-quality speech reconstruction.
In the exemplary embodiments of Fig. 2 and Fig. 3, the encoding of the frames is carried out separately, with the voiced frames encoded in the downsampler and encoder 104 and the unvoiced frames encoded in the splitter and encoder 106. In another exemplary embodiment, all the frames can be encoded in the same encoding device, for example after the frames have been suitably downsampled or split and before they are packed by the packer 108 for storage in the inventory.
In the exemplary embodiments of Fig. 2 and Fig. 4, the decoding of the frames is carried out separately, with the voiced frames decoded in the decoder and upsampler 144 and the unvoiced frames decoded in the decoder and combiner 146. In another exemplary embodiment, all the frames can be decoded in the same decoding device, for example after the frames have been extracted and identified as voiced or unvoiced by the type discriminator 142.
In the exemplary embodiment described above, before the speech signals of the concatenative speech synthesis inventory are encoded, the voiced frames of the speech signals are downsampled while the unvoiced frames are split into pseudo-downsampled frames of the same size as the downsampled voiced frames. Voiced and unvoiced frames are thus treated differently in the compression process.
In the exemplary embodiment described above, after the frames selected from the inventory have been decoded and before the concatenative speech synthesis is carried out, the decoded voiced frames derived from the inventory are upsampled, while two (or some other number of) decoded unvoiced frames derived from the inventory are joined into frames of the same size as the upsampled voiced frames. Voiced and unvoiced frames are thus treated differently in the decompression process.
Referring to Fig. 5, there is shown a radio telephone 300 embodying the invention. The radio telephone 300 has a radio frequency communications unit 302 communicatively coupled to a processor 304. Input interfaces in the form of a screen 306 and a keypad 308 are also communicatively coupled to the processor 304.
The processor 304 includes an encoder/decoder 310 with an associated read-only memory (ROM) 312 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 300. The processor 304 also includes: a microprocessor 314, coupled by a common data and address bus 316 to the encoder/decoder 310 and to an associated character read-only memory (ROM) 318; a speech unit inventory read-only memory (ROM) 320 (serving as the inventory 130 of the exemplary embodiment of Fig. 2); a random access memory (RAM) 322; a static programmable memory 324; and a removable SIM module 326. The static programmable memory 324 and the SIM module 326 can each store, among other things, selected input text messages and a phonebook database of telephone numbers.
The microprocessor 314 has ports for coupling to the keypad 308, to an alert module 328 containing a vibrator motor and associated drives, to a microphone 330 and to a speaker 332.
The character ROM 318 stores code for decoding or encoding text messages that may be received by the communications unit 302 or input via the keypad 308. The character ROM 318 and the inventory ROM 320 also store operating code (OC) for the microprocessor 314, and the OC associated with the inventory ROM 320 is used for TTS synthesis. In particular, it includes OC enabling the microprocessor 314 to function as the decoder 140 in the system of Fig. 2.
The radio frequency communications unit 302 is a combined receiver and transmitter having a common antenna 334. The communications unit 302 has a transceiver 336 coupled to the antenna 334 via a radio frequency amplifier 338. The transceiver 336 is also coupled to a combined modulator/demodulator 340, which couples the communications unit 302 to the processor 304.
The exemplary embodiment described above uses a low-bit-rate, mixed-bandwidth speech coding method to improve the clarity of the unvoiced speech segments in the reconstructed speech units, by retaining a wideband signal for the unvoiced parts of each speech unit. This helps to achieve the high-quality, high-compression-rate coding required for the speech unit inventory of a TTS system, while demanding reasonably little storage capacity and computation.
Embodiments of the invention can be used in various TTS systems based on concatenative speech synthesis. They are particularly useful when a TTS system with low storage and computational requirements is to be embedded in a device such as a mobile phone or PDA. The exemplary embodiment provides a means of achieving high-quality, efficient compressed encoding of a TTS inventory at a very low bit rate, which is helpful in embedded, natural-sounding concatenative TTS systems.
The exemplary embodiments described above and the variations mentioned comprise various steps that can be realized in one of several forms, for example as dedicated hardware or as machine-executable instructions executed by a general-purpose or special-purpose processor or by logic circuits. Exemplary embodiments of the invention also cover steps carried out by a combination of software and hardware.
An additional embodiment may take the form of a computer program product, for example a computer program stored on the Internet or another network, or a machine-readable medium with instructions stored on it. Such instructions can be used to program a mobile phone, another portable or non-portable device, or a computer. Typical machine-readable media include disks, cards, memory sticks and other storage devices, whether optical or magnetic, and whether read-only or rewritable.
The detailed description above presents only preferred exemplary embodiments and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiments provides those skilled in the art with an enabling description for implementing a preferred embodiment of the invention. It should be understood that various changes can be made in the function and arrangement of elements without departing from the spirit and scope of the invention as defined by the claims.