Background Art
Speech synthesis is the production of speech audio signals from non-speech signals. It is most often associated with text-to-speech (TTS) conversion, in which a text stream is converted into speech audio signals. This involves receiving the text stream, analyzing it and converting it into speech sounds, and then generating the corresponding speech signal. The means of generating such a signal usually fall into one of two classes: (i) generation from a speech signal model, and (ii) concatenation of prerecorded speech signals. The latter of these two approaches is known as concatenative speech synthesis.
At present, nearly all high-quality text-to-speech systems are based on concatenative synthesis, because this approach tends to produce more natural-sounding synthetic speech. It works from an inventory of speech units, and the larger the inventory, the better the results. This approach therefore needs more storage space than model-based approaches. Where storage space is not particularly limited, such as in a desktop computer, this matters little. In other devices, however, such as portable equipment like mobile phones and personal digital assistants (PDAs), it must be taken into account, particularly as more and more functions are packed into subminiature or even smaller devices. To reduce the memory footprint of a TTS system and meet the limited resources of portable equipment, the TTS inventory is therefore compressed and encoded, for example using a low-bit-rate speech coding technique with low computational complexity.
To obtain the sound inventory, recorded speech signals must be obtained from a speaker. The speaker spends several hours reading a predetermined text aloud, and the reading is recorded. The text is carefully designed so that as many phoneme-sequence combinations as possible are recorded, preferably with several instances of each required combination. The recorded reading is processed by a speech recognizer to determine where each phoneme begins and ends. Because the text is known, the position of each phoneme and phoneme combination is also known, and extracting the correct stretch of recorded speech to provide a required speech unit, whether it is a phoneme, diphone, triphone or some other element, is relatively easy. When several samples of a given phoneme or phoneme combination are available, the best among them is selected. The selected speech unit recordings are compressed and stored in a database.
Figure 1 shows a system operating a known encoding and decoding technique. The set of selected speech unit recordings is provided as original unit sample signals OSi, which are compressed and encoded by an encoder 10. Each signal is segmented into frames Fi by a signal segmenter 12 within the encoder 10. The individual frames Fi are downsampled and encoded by a downsampler and encoder 14. The downsampling process consists of keeping every second sample of each frame. The reduced frames are then encoded by a code-excited linear prediction (CELP) scheme 20. The encoded frames are output from the downsampler and encoder 14, and from the encoder 10, as single compressed and encoded unit bitstreams CUi, and are stored as such in an inventory 30.
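The two encoder-side steps just described, segmentation into frames followed by 2:1 downsampling by keeping every second sample, can be sketched as follows. This is a minimal illustration assuming frames are plain sample lists; the CELP encoding itself is omitted.

```python
def segment(signal, frame_len):
    """Split a speech-unit signal into fixed-length frames Fi."""
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]

def downsample_2to1(frame):
    """Keep every second sample of a frame, halving it before encoding."""
    return frame[::2]

# Toy signal: 12 samples, 4-sample frames -> three frames, each reduced to 2 samples.
frames = segment(list(range(12)), 4)
reduced = [downsample_2to1(f) for f in frames]
```

In the real system the reduced frames would then be fed to the CELP scheme 20; here they are simply left as raw sample lists.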
For any given corpus of speech units, the compression, encoding and storage described above are carried out only once. Thereafter the inventory of speech units is fixed and can be accessed repeatedly.
During TTS synthesis, the inventory is accessed on the basis of the indices of the speech units Uj to be synthesized. Each index is input to a decoder 40, where it is received by a selector 42. Based on the input index, the selector 42 selects and extracts the appropriate speech unit UCUj from the inventory 30.
The extracted speech unit stream UCUj is input to a decoder and upsampler 44, where the downsampling and encoding performed by the downsampler and encoder 14 are reversed. The extracted speech unit UCUj is decoded using the same CELP scheme 20 that was used in the encoding process. The decoded speech unit is then interpolated to provide an upsampled speech unit. These upsampled units are then output from the decoder 40 as synthesized samples of the speech unit Uj.
Although this and other parametric narrowband speech coding methods based on a speech production model can achieve low-bit-rate speech coding, the quality of the reconstructed speech is not as good as might be hoped, and parts of the synthesized speech lack clarity.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
In the accompanying drawings, like numerals in different figures denote identical elements. Referring to Fig. 2, there is shown an encoding/decoding system for the sound inventory of a concatenative TTS synthesizer according to an embodiment of the invention.
As described above, a set of selected speech unit recordings is provided by processing a recorded reading. This set of selected speech unit recordings is input to the encoder 100 of the embodiment of the invention as a wideband speech signal of original unit samples OSi. The signal is segmented into speech frames Fi by the signal segmenter 12 within the encoder 100, to provide a frame stream.
In this embodiment, a frame is a fixed-length segment of the speech signal, for example 30 milliseconds long, and each speech unit in the speech signal comprises 2 to 20 frames. The definition of a segment can vary, for instance to encompass two (or more) frames, or can be determined in other ways. The definition of the speech units used can also vary according to phonetic and statistical considerations; a unit may be a phoneme, a diphone, a syllable, or a phone string longer than a syllable.
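For concreteness: assuming a 16 kHz wideband sampling rate (the text does not state the rate), a 30-millisecond frame works out to 480 samples, which is consistent with the 480-sample frames discussed later.

```python
SAMPLE_RATE_HZ = 16000     # assumed wideband rate; not specified in the text
FRAME_MS = 30              # frame length given in the text
FRAMES_PER_UNIT = (2, 20)  # a speech unit spans 2 to 20 frames

frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000  # samples per frame
```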
A detector and splitter 102 detects two frame types in the frame stream and splits the frame stream into two sub-streams, each containing a different frame type: the first frame type goes into the first sub-stream and the second frame type into the second sub-stream.
More particularly, the detector and splitter 102 detects voiced and unvoiced frames, distinguishing the normal voiced frames NFi from the unvoiced frames UFi, which have a high noise content.
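The text does not specify how the detector 102 classifies frames. A common heuristic, shown here purely as an illustrative stand-in, treats frames with a high zero-crossing rate and low energy as unvoiced; the thresholds below are arbitrary.

```python
import math

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return crossings / max(len(frame) - 1, 1)

def energy(frame):
    """Mean squared sample value."""
    return sum(s * s for s in frame) / len(frame)

def is_unvoiced(frame, zcr_threshold=0.3, energy_threshold=0.05):
    """Heuristic stand-in for the voiced/unvoiced detector 102."""
    return zero_crossing_rate(frame) > zcr_threshold and energy(frame) < energy_threshold

# A low-frequency tone looks voiced; weak, rapidly alternating noise looks unvoiced.
voiced_like = [math.sin(2 * math.pi * 5 * i / 200) for i in range(200)]
unvoiced_like = [0.01 * (1 if i % 2 == 0 else -1) for i in range(200)]
```

Real systems typically combine several such features (pitch detection, spectral tilt, LSF measurements as mentioned later in the text) rather than relying on two fixed thresholds.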
A downsampler and encoder 104 receives the voiced frame stream and downsamples and encodes the individual voiced frames NFi. In this exemplary embodiment, downsampling consists of keeping every second sample of each voiced frame, providing a narrowband signal of downsampled frames. The reduced frames are encoded by a low-bit-rate code-excited linear prediction (CELP) scheme 120. The compressed and encoded voiced frames are output as voiced output parameter arrays PNFi.
A splitter and encoder 106 receives the frame stream of unvoiced frames UFi. Each unvoiced frame UFi is divided into two parts, forming two pseudo-narrowband frames. These two pseudo-narrowband frames are each encoded by the low-bit-rate code-excited linear prediction (CELP) scheme 120. The split and encoded frames are output as unvoiced output parameter arrays PUFia and PUFib.
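The split applied by the splitter and encoder 106 can be sketched as follows, again treating frames as plain sample lists (an assumed representation): each 2N-sample unvoiced frame yields two N-sample pseudo-narrowband frames.

```python
def split_unvoiced(frame):
    """Split one unvoiced frame UFi into two equal pseudo-frames."""
    half = len(frame) // 2
    return frame[:half], frame[half:]

def join_unvoiced(a, b):
    """Inverse operation, used on the decoding side."""
    return a + b

pair = split_unvoiced([1, 2, 3, 4, 5, 6])   # -> ([1, 2, 3], [4, 5, 6])
restored = join_unvoiced(*pair)             # -> the original frame
```

Unlike downsampling, the split discards nothing, which is why the scheme preserves the full bandwidth of the unvoiced material.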
The encoded parameters PNFi, PUFia and PUFib are all input to a packer 108, where they are quantized and packed into a single bitstream BUi for each speech unit. The parameters are packed in the same order as the original frames appeared in the frame stream.
All the speech unit bitstreams BUi of the concatenative TTS system are stored in a parameter inventory 130, where they can be accessed for synthesis.
During TTS synthesis, the parameter inventory 130 is accessed on the basis of the indices of the speech units Uj to be synthesized. The extracted speech units are then processed in a manner opposite to the encoding that was carried out for storage in the inventory. The speech unit index Uj is input to a decoder 140, where it is received by a selector 42. Based on the input index, the selector 42 selects and extracts the appropriate speech unit UBUj from the parameter inventory 130.
A type discriminator and splitter 142 distinguishes the two different frame types and separates the frame stream into two sub-streams, with the first frame type in the first sub-stream and the second frame type in the second sub-stream. More particularly, the type discriminator determines from its type information whether an extracted frame of the unit UBUj is voiced or unvoiced, and separates the extracted bitstream accordingly into voiced output parameter arrays PNFj and unvoiced output parameter arrays PUFj.
A decoder and upsampler 144 receives the voiced output parameter arrays PNFj, and there the downsampling and encoding performed by the downsampler and encoder 104 are reversed. The frames of the voiced output parameter arrays PNFj are decoded according to the same CELP scheme 120 used in the encoding process. The corresponding speech segments are reconstructed and then upsampled into a wideband signal; in particular, the decoded speech units are interpolated to provide upsampled speech units. These are then output as reconstructed voiced frames SNFj.
A decoder and combiner 146 receives the unvoiced output parameter arrays PUFj, and there the splitting and encoding performed by the splitter and encoder 106 are reversed. The frames of the unvoiced output parameter arrays PUFj are decoded according to the same CELP scheme 120 used in the encoding process. The corresponding speech segments are reconstructed and then joined together. They are then output as reconstructed unvoiced frames SUFj.
A frame combiner 148 concatenates the reconstructed voiced and unvoiced frames SNFj and SUFj in sequence, in the same order in which the frames from which they were derived were extracted from the inventory, to form the synthesized speech signal of the corresponding speech unit. The concatenated frames are then output from the decoder 140 as synthesized samples of the speech unit Uj.
Although additions or other alterations to the inventory are possible, for any given corpus of speech units the compression, encoding and storage described above need only be carried out once. The inventory can then be accessed repeatedly. Although the embodiment described above shows the encoder 100 and the decoder 140 together, in most equipment they are not found together; a PDA or mobile phone, for example, usually has a preloaded inventory (produced by an encoder elsewhere) and itself contains only the inventory and the decoder (together with the other components or code used for TTS synthesis).
In the exemplary embodiment described above, compression means in the form of a downsampler and a splitter reduce the size of voiced frames by downsampling and reduce the size of unvoiced frames by splitting. The speech unit segments could also be compressed by other means.
In the exemplary embodiment described above, decompression means in the form of an upsampler and a combiner increase the size of voiced frames by upsampling and increase the frame size of unvoiced frames by joining. The speech unit segments could also be decompressed by other means.
Fig. 3 is a flowchart of the operation of the encoder of an exemplary embodiment of the invention.
At step S202, the speech signal OSi of a speech unit is segmented into frames Fi. Step S204 detects whether an incoming frame is unvoiced. If the incoming frame is not unvoiced, it is downsampled at step S206, the downsampled frame NFi is encoded at step S208, and the encoded frame PNFi is packed into the bitstream at step S210. If the incoming frame Fi is detected at step S204 to be an unvoiced frame UFi, it is split into two pseudo-frames at step S212, and the split unvoiced frames are encoded in succession at step S214. At step S210, the encoded frames PUFi are packed into the bitstream in the same order as the other encoded frames, that is, in the order in which the frames from which the encoded frames were derived appeared in the input speech signal.
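The encoder loop of Fig. 3 (steps S202 through S214) might be sketched as follows, with the CELP encoding replaced by a tagging stub and the frame classifier passed in as a parameter, since neither is given in code form in the text.

```python
def encode_unit(signal, frame_len, is_unvoiced_frame):
    """Sketch of the Fig. 3 loop: segment (S202), branch on frame type (S204),
    downsample (S206) or split (S212), and pack in original order (S210).
    Tags ('V'/'U') stand in for the CELP-encoded parameter arrays."""
    packed = []
    for start in range(0, len(signal), frame_len):
        frame = signal[start:start + frame_len]
        if is_unvoiced_frame(frame):
            half = len(frame) // 2
            packed.append(('U', frame[:half]))   # pseudo-frame PUFia
            packed.append(('U', frame[half:]))   # pseudo-frame PUFib
        else:
            packed.append(('V', frame[::2]))     # downsampled frame PNFi
    return packed
```

With a toy two-frame signal and a classifier that calls the second frame unvoiced, the unit packs into three entries, in the order the frames occurred.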
After the processing of Fig. 3, the bitstream is recorded in the inventory.
Fig. 4 is a flowchart of the operation of the decoder of an exemplary embodiment of the invention.
At step S252, the indices of the speech units Uj to be synthesized are input. In the same order as the corresponding speech units appear in the indices, the encoded frames UBUj appropriate to the indexed speech units are selected from the inventory at step S254 and extracted at step S256. At step S258, it is detected whether an incoming encoded frame is an unvoiced encoded frame PUFj. If it is not unvoiced, it is decoded at step S260, the decoded frame is upsampled at step S262, and the upsampled frame SNFj is concatenated into the output stream at step S264. If the incoming frame UBUj is detected at step S258 to be unvoiced, it is decoded at step S266. Such frames generally occur in pairs, and the two decoded frames of each pair are joined together. At step S264, the joined frames are concatenated into the output stream. The concatenation at step S264 is carried out in the same order relative to the other decoded frames, that is, in the order in which the encoded frames from which the decoded frames were derived appear in the indices of the units to be synthesized.
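The decoding loop of Fig. 4 might be sketched as follows, with linear interpolation standing in for the upsampling step (S262) and tagged tuples standing in for the encoded frames; the CELP decoding itself is omitted.

```python
def upsample_2to1(half):
    """Crude linear-interpolation upsampling back to full frame length (S262)."""
    out = []
    for i, s in enumerate(half):
        nxt = half[i + 1] if i + 1 < len(half) else s
        out.extend([s, (s + nxt) / 2])
    return out

def decode_unit(packed):
    """Voiced entries are upsampled; unvoiced entries arrive in pairs and are
    joined; all frames are concatenated in their original order (S264)."""
    frames, pending = [], None
    for kind, samples in packed:
        if kind == 'V':
            frames.append(upsample_2to1(samples))
        elif pending is None:
            pending = samples                 # first pseudo-frame of a pair
        else:
            frames.append(pending + samples)  # join the unvoiced pair
            pending = None
    return [s for frame in frames for s in frame]
```

Note that the unvoiced pair comes back exactly, whereas the voiced frame is only an interpolated approximation of the original, which is the trade-off the scheme is built around.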
In the exemplary embodiment described above, the distinction between the frames that are downsampled and the frames that are split (and the reversal of that processing) is made on the basis of whether a frame is voiced or unvoiced. In other embodiments, the distinction can be based on other criteria. For example, the two relevant frame types might be noisy unvoiced and non-noisy unvoiced; that is, the distinction between them would depend on how noisy the unvoiced frames are. In this way, only the noisier unvoiced frames are split, saving some storage capacity. This distinction can be based on measurements of the line spectral frequency (LSF) parameters of the frames.
The exemplary embodiment described above uses a CELP scheme that encodes frames as 240-sample vectors. The incoming frames Fi start out as 480-sample vectors, so the voiced frames are downsampled at a ratio of 2:1. Where the composition of the incoming frames Fi differs, and/or where the coding scheme used differs, whether it is still a CELP scheme or some other coding scheme, other downsampling ratios will arise.
In the exemplary embodiment described above, each unvoiced frame is split into two pseudo-frames. This again is because the incoming frames Fi start out as 480-sample vectors while the present CELP scheme encodes frames as 240-sample vectors. Depending on changes in the composition of the incoming frames Fi and/or in the coding scheme used, the number of pseudo-frames can vary. If the number of samples in an incoming frame is not an exact multiple of the number of samples in the required pseudo-frames, a degree of downsampling (or upsampling) can also be applied.
The exemplary embodiment described above shows a hybrid scheme in which there are only two different treatments: downsampling or splitting on the encoding side, and upsampling or joining on the decoding side. Other treatments are also possible, for instance where the number of samples in an incoming frame is not twice the number of samples in the required pseudo-frames (for example, where the incoming frames Fi start out as 720-sample vectors while the CELP scheme encodes frames as 240-sample vectors). In that case, an incoming frame can be judged to be one of three different types, for example voiced, noisy unvoiced and non-noisy unvoiced. Before encoding, voiced frames can be downsampled at a ratio of 3:1, and noisy unvoiced frames can be split into three. Non-noisy unvoiced frames, on the other hand, can be downsampled at a ratio of 3:2 before encoding (to 480 samples each) and then split into two. The decoding side carries out the opposite of the above processing.
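The size arithmetic of this three-way variant (using the 720- and 240-sample figures from the text) can be checked directly; this is just the bookkeeping, not the coding itself.

```python
FRAME = 720        # incoming wideband frame, in samples
CELP_FRAME = 240   # frame size the assumed CELP scheme encodes

# Voiced: 3:1 downsampling -> one CELP-sized frame.
voiced = FRAME // 3

# Noisy unvoiced: split into three CELP-sized pseudo-frames, no downsampling.
noisy_unvoiced = [FRAME // 3] * 3

# Non-noisy unvoiced: 3:2 downsampling to 480 samples, then split into two.
downsampled = FRAME * 2 // 3
clean_unvoiced = [downsampled // 2] * 2
```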
The exemplary embodiment described above uses a speech coding technique that takes the CELP model as its basic framework, built on a low-bit-rate, mixed-bandwidth CELP coding technique, as a new tool for compressing the TTS inventory. Each speech unit of the TTS inventory is encoded with the mixed-bandwidth CELP scheme when the TTS inventory is built, and during synthesis the unit bitstreams are decoded by the mixed-bandwidth CELP scheme.
For the CELP scheme in the above embodiment, the coding method can be customized to the characteristics of TTS or speech synthesis. To this end, a low-bit-rate narrowband CELP scheme is established as the speech coding basis for the sound inventory of the concatenative TTS system. Because only one (or two) specific, known speakers record the speech corpus for the TTS system, the characteristics of those speakers can be fully exploited to customize the coding scheme. In particular, a line spectral pair (LSP) vector codebook representing the speaker's pronunciation characteristics can be obtained by training. These measures can reduce the coding rate to a low bit rate while still achieving high-quality speech reconstruction.
In the exemplary embodiments of Fig. 2 and Fig. 3, the encoding of the frames is carried out separately, with the voiced frames encoded in the downsampler and encoder 104 and the unvoiced frames encoded in the splitter and encoder 106. In another exemplary embodiment, all the frames can be encoded in the same encoding device, for example after the frames have been suitably downsampled or split and before they are packed by the packer 108 for storage in the inventory.
In the exemplary embodiments of Fig. 2 and Fig. 4, the decoding of the frames is carried out separately, with the voiced frames decoded in the decoder and upsampler 144 and the unvoiced frames decoded in the decoder and combiner 146. In another exemplary embodiment, all the frames can be decoded in the same decoding device, for example after the frames have been extracted and identified as voiced or unvoiced by the type discriminator 142.
In the exemplary embodiment described above, before the speech signals of the concatenative speech synthesis inventory are encoded, the voiced frames of the speech signals are downsampled while the unvoiced frames are split into pseudo-downsampled frames of the same size as the downsampled voiced frames. Voiced and unvoiced frames are thus treated differently in the compression process.
In the exemplary embodiment described above, after the frames selected from the inventory have been decoded and before the concatenative speech synthesis is carried out, the decoded voiced frames derived from the inventory are upsampled, while two (or some other number of) decoded unvoiced frames derived from the inventory are joined into frames of the same size as the upsampled voiced frames. Voiced and unvoiced frames are thus treated differently in the decompression process.
Referring to Fig. 5, there is shown a radio telephone 300 embodying the invention. The radio telephone 300 has a radio frequency communications unit 302 communicatively coupled to a processor 304. Input interfaces in the form of a screen 306 and a keypad 308 are also communicatively coupled to the processor 304.
The processor 304 includes an encoder/decoder 310 with an associated read-only memory (ROM) 312 storing data for encoding and decoding voice or other signals that may be transmitted or received by the radio telephone 300. The processor 304 also includes: a microprocessor 314, coupled by a common data and address bus 316 to the encoder/decoder 310 and to an associated character read-only memory (ROM) 318; a speech unit inventory read-only memory (ROM) 320 (serving as the inventory 130 of the exemplary embodiment of Fig. 2); a random access memory (RAM) 322; a static programmable memory 324; and a removable SIM module 326. The static programmable memory 324 and the SIM module 326 can each store, among other things, selected input text messages and a phonebook database of telephone numbers.
The microprocessor 314 has ports for coupling to the keypad 308, to an alert module 328 containing a vibrator motor and associated drives, to a microphone 330 and to a speaker 332.
The character ROM 318 stores code for decoding or encoding text messages that may be received by the communications unit 302 or input via the keypad 308. The character ROM 318 and the inventory ROM 320 also store operating code (OC) for the microprocessor 314, and the OC associated with the inventory ROM 320 is used for TTS synthesis. In particular, it includes OC enabling the microprocessor 314 to function as the decoder 140 in the system of Fig. 2.
The radio frequency communications unit 302 is a combined receiver and transmitter having a common antenna 334. The communications unit 302 has a transceiver 336 coupled to the antenna 334 via a radio frequency amplifier 338. The transceiver 336 is also coupled to a combined modulator/demodulator 340, which couples the communications unit 302 to the processor 304.
The exemplary embodiment described above uses a low-bit-rate, mixed-bandwidth speech coding method to improve the clarity of the unvoiced speech segments in the reconstructed speech units, by retaining a wideband signal for the unvoiced parts of each speech unit. This helps to achieve the high-quality, high-compression-rate coding required for the speech unit inventory of a TTS system, while demanding reasonably little storage capacity and computation.
Embodiments of the invention can be used in various TTS systems based on concatenative speech synthesis. They are particularly useful when a TTS system with low storage and computational requirements is to be embedded in a device such as a mobile phone or PDA. The exemplary embodiment provides a means of achieving high-quality, efficient compressed encoding of a TTS inventory at a very low bit rate, which is helpful in embedded, natural-sounding concatenative TTS systems.
The exemplary embodiments described above and the variations mentioned comprise various steps that can be realized in one of several forms, for example as dedicated hardware or as machine-executable instructions executed by a general-purpose or special-purpose processor or by logic circuits. Exemplary embodiments of the invention also cover steps carried out by a combination of software and hardware.
An additional embodiment may take the form of a computer program product, for example a computer program stored on the Internet or another network, or a machine-readable medium with instructions stored on it. Such instructions can be used to program a mobile phone, another portable or non-portable device, or a computer. Typical machine-readable media include disks, cards, memory sticks and other storage devices, whether optical or magnetic, and whether read-only or rewritable.
The detailed description above presents only preferred exemplary embodiments and is not intended to limit the scope, applicability or configuration of the invention. Rather, the detailed description of the preferred exemplary embodiments provides those skilled in the art with an enabling description for implementing a preferred embodiment of the invention. It should be understood that various changes can be made in the function and arrangement of elements without departing from the spirit and scope of the invention as defined by the claims.