CN1779779B - Method and apparatus for providing phonetical databank - Google Patents

Method and apparatus for providing phonetical databank

Info

Publication number
CN1779779B
CN1779779B CN200410095710A
Authority
CN
China
Prior art keywords
frame
unvoiced
coding
voice
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN200410095710A
Other languages
Chinese (zh)
Other versions
CN1779779A (en)
Inventor
岳东剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuance Communications Inc
Original Assignee
Motorola Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Motorola Inc filed Critical Motorola Inc
Priority to CN200410095710A priority Critical patent/CN1779779B/en
Publication of CN1779779A publication Critical patent/CN1779779A/en
Application granted granted Critical
Publication of CN1779779B publication Critical patent/CN1779779B/en
Expired - Fee Related, Current
Anticipated expiration

Landscapes

  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

A method for providing a voice databank includes: selecting an acoustic unit from a first voice databank, the acoustic unit having an identification mark and multiple associated frames representing a sampled voice waveform; determining whether each frame is a voiced frame or an unvoiced frame and classifying it accordingly; encoding each frame into a coded frame according to at least four CELP parameters, where the encoding depends on the classification; and finally storing the coded frames to form the voice databank.

Description

Method and related apparatus for providing a voice corpus
Technical field
The present invention relates generally to text-to-speech (TTS) synthesis. More specifically, but not exclusively, it relates to a method for providing an encoded voice corpus, and to equipment using this corpus for letter-to-sound conversion when synthesizing the pronunciation of a text segment.
Background art
Text-to-speech (TTS) conversion, commonly referred to as continuous text-to-speech synthesis, allows an electronic device to receive an input text string and provide a converted representation of that string in the form of synthetic speech. However, a device that must synthesize speech from a received text string of undetermined content may have difficulty providing high-quality, natural-sounding synthetic speech. This is because the pronunciation of each word or syllable (or each character, in the case of Chinese) to be synthesized is both context dependent and position dependent. For example, a word at the end of a sentence (of the input text string) may be drawn out or lengthened, whereas the same word appearing within a sentence may be lengthened only when emphasis is required.
In most languages, the pronunciation of a word depends on acoustic prosodic parameters, including pitch (fundamental tone), volume (power or amplitude), and duration. When a word occurs in a phrase or sentence, the values of its prosodic parameters are context dependent. One TTS method attempts to identify, in a corpus of text strings, a stored utterance that matches a sufficiently long part of the input text string. However, this method is computationally expensive, requires an unacceptably large corpus for most applications, and does not guarantee that a suitable matching utterance will be found in the corpus.
Another method uses a corpus of acoustic units (phonemes) with encoded prosodic parameters. One such coding technique uses Codebook Excited Linear Prediction (CELP). However, to obtain a relatively high-quality TTS conversion, the corpus may be unacceptably large, especially when the TTS conversion is to run on an electronic device with limited storage space.
In this specification, including the claims, the terms "comprises" and "comprising", or similar terms, are intended to mean a non-exclusive inclusion, so that a method or apparatus that comprises a list of elements does not include only those elements, but may well include other elements not listed.
Summary of the invention
According to one aspect of the present invention, there is provided a method for providing an encoded voice corpus, the method comprising:
selecting an acoustic unit from a first voice corpus, the acoustic unit having an identifier and a plurality of associated frames representing a sampled speech waveform;
determining whether each frame is an unvoiced frame or a voiced frame;
classifying the frames into at least voiced frames and unvoiced frames;
encoding the frames into coded frames according to at least four Codebook Excited Linear Prediction parameters, wherein the encoding depends on the classification; and
storing the coded frames in a second voice corpus.
Suitably, the determining comprises comparing the frame energy of the frame with a frame energy threshold and comparing a Mel-frequency cepstral coefficient of the frame with an MFCC threshold.
Suitably, the classifying is characterized in that the voiced frames are classified into three subclasses, these classes being: a voiced frame transition class, representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The classifying may be further characterized in that the two classes of stable voiced frames are distinguished by comparing the line spectrum pair of a stable voiced frame with the line spectrum pair of the previous frame.
Suitably, the encoding of the frames is characterized in that the encoding depends on the classification into the three subclasses. Suitably, the encoding is characterized in that the number of bits required to encode each subclass differs. Furthermore, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, a coded frame comprises a field identifying its classification.
According to a further aspect of the invention, there is provided an electronic device, the device comprising:
a device processor;
a speech synthesizer coupled to the device processor; and
a voice corpus of coded frames of acoustic units, the coded frames being encoded according to their classification as unvoiced frames or voiced frames.
Suitably, the voiced frames in the voice corpus are encoded according to three subclasses, these classes being: a voiced frame transition class, representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
The two classes of stable voiced frames may be distinguished by comparing the line spectrum pair of a stable voiced frame with the line spectrum pair of the previous frame.
The coded frames in the corpus may be characterized in that the number of bits required to encode each subclass differs. Furthermore, the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass. Suitably, a coded frame comprises a field identifying its classification.
Brief description of the drawings
In order that the invention may be readily understood and put into practical effect, reference will now be made to exemplary embodiments illustrated in the accompanying drawings, in which:
Fig. 1 is a schematic block diagram of an electronic device in accordance with the present invention;
Fig. 2 is a schematic block diagram of a coding structure; and
Fig. 3 is a flowchart illustrating a method for providing a second, encoded voice corpus from a first voice corpus.
Detailed description of embodiments
Referring to Fig. 1, there is shown an electronic device 100, in the form of a wireless telephone, comprising a device processor 102 operatively coupled by a bus 103 to a user interface 104, which is typically a touch screen or a display screen with a keypad. The electronic device 100 also has a speech corpus 106 (the second, encoded voice corpus), a speech synthesizer 110, a non-volatile memory 120, a read-only memory 118, and a wireless communication module 116, all operatively coupled to the processor 102 by the bus 103. The speech synthesizer 110 has an output coupled to drive a loudspeaker 112. The construction of the speech corpus 106 is described later in this specification. In use, the non-volatile memory 120 (a memory module) stores text for text-to-speech (TTS) synthesis, where the text may be received by the module 116 or otherwise.
As will be apparent to persons skilled in the art, the radio frequency communications unit 116 is typically a combined receiver and transmitter having a common antenna. The radio frequency communications unit 116 has a transceiver coupled to the antenna via a radio frequency amplifier. The transceiver also has a combined modulator/demodulator that couples the communications unit 116 to the processor 102 via the bus 103. Furthermore, in this embodiment, the non-volatile memory 120 (memory module) stores a user-programmable phonebook database Db, and the read-only memory 118 stores operating code (OC) for the device processor 102.
Fig. 2 shows a schematic block diagram of a coding structure 200. The structure 200 comprises a first voice corpus 210, a frame processing and coding unit 220, and a second, encoded voice corpus 230. The first voice corpus 210 and the second voice corpus 230 are operatively coupled to the frame processing and coding unit 220. A user interface 240 is also operatively coupled to the frame processing and coding unit 220. In use, the frame processing and coding unit 220 encodes acoustic units AU from the first voice corpus 210 into the second, encoded voice corpus 230, as described below with reference to Fig. 3. The user interface 240 comprises input buttons for entering commands and a display screen for providing visual feedback to the user; the user interface 240 allows the user to start and interrupt the coding.
Referring to Fig. 3, there is shown a method 300 for providing the second, encoded voice corpus 230 from the first voice corpus 210; the method is performed essentially by the frame processing and coding unit 220. After the method 300 is invoked at a start block 305, an initialization process is performed at block 310 by suitably actuating the user interface 240. The initialization process comprises setting an integer counter i to 1 (i := 1) and setting a voice corpus size integer value N (voice corpus size := N), where N is the number of acoustic units AU in the first voice corpus 210. Then, at block 315, the method 300 selects the i-th acoustic unit AU in the first voice corpus 210. Since the counter i is initialized to 1 at block 310, the first acoustic unit AU in the first voice corpus 210 is selected. Here, as is known to persons skilled in the art, the first acoustic unit AU is the acoustic unit AU at the first memory location of the first voice corpus 210, and it has an identifier and a plurality of associated frames representing a sampled speech waveform.
In this embodiment, each frame typically comprises 240 samples, with each sample represented by 16 bits. Also, an acoustic unit AU typically has 8 to 20 frames of sampled speech. The frames are ordered and arranged in a frame sequence (FRS), as known to persons skilled in the art. Thus, at block 315, the frame number (FN) for the i-th acoustic unit AU is determined simply by counting the frames of the sampled speech waveform representing the i-th acoustic unit AU.
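The frame layout can be made concrete with a short sketch (Python with NumPy is assumed here; the patent prescribes no implementation language). How a trailing partial frame is handled is not specified, so dropping it is an assumption of this sketch.

```python
import numpy as np

FRAME_LEN = 240  # samples per frame in this embodiment (30 ms at 8 kHz)

def split_into_frames(waveform: np.ndarray) -> np.ndarray:
    """Arrange a sampled speech waveform (16-bit samples) into the frame
    sequence FRS of consecutive 240-sample frames; the row count is FN."""
    n_frames = len(waveform) // FRAME_LEN
    return waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
```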
After block 315, at block 320 the method 300 calculates a frame energy (FE) and Mel-frequency cepstral coefficient (MFCC) parameters for each of frames 1 to FN of the i-th acoustic unit (the 1st acoustic unit in this example). The frame energy FE is based on the squares of the frame's sample values; both the frame energy FE and the MFCC parameters are readily computed by techniques well known to persons skilled in the art.
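As an illustration of block 320, a minimal sketch follows, continuing the one above. The log-energy normalisation and the MFCC settings (librosa, n_mfcc=13) are assumptions; the patent leaves both computations to techniques known in the art.

```python
import numpy as np
import librosa  # assumed here for MFCC extraction; any MFCC routine would do

def frame_energy_db(frame: np.ndarray) -> float:
    """Log frame energy FE based on the squares of the sample values."""
    energy = float(np.sum(frame.astype(np.float64) ** 2))
    return 10.0 * np.log10(energy + 1e-10)  # small offset avoids log(0)

def frame_mfcc(frame: np.ndarray, sr: int = 8000) -> np.ndarray:
    """MFCC parameter vector of one 240-sample frame."""
    return librosa.feature.mfcc(y=frame.astype(np.float32), sr=sr,
                                n_mfcc=13, n_fft=len(frame),
                                hop_length=len(frame))[:, 0]
```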
Then, at block 325, a variable k is set to 1; this variable k is used to select frames 1 to FN in order of the frame sequence FRS. Thereafter, the frame energy FE and the MFCC parameters of each frame are used at a test block 330 to determine whether the k-th frame is an unvoiced frame (UF) or a voiced frame (VF). An unvoiced frame UF represents an essentially noise-like or aperiodic signal, produced when the vocal cords do not vibrate during that part of speech (for example, fricative phonemes). In contrast, a voiced frame VF represents a periodic signal, produced when the vocal cords vibrate at a fundamental frequency during that part of speech. For example, the waveforms of most vowel phonemes are voiced frames VF.
At the test block 330, the frame energy (signal energy) of the frame in question is compared with a frame energy threshold ETV, and its MFCC coefficient with an MFCC threshold MFTV. In this embodiment, the energy threshold ETV is set to 45 dB and the MFCC threshold MFTV is set to -10.00. If the frame energy of a frame is less than ETV, or its MFCC coefficient is less than MFTV, the frame is classified as an unvoiced frame UF. The thresholds ETV and MFTV are determined by off-line analysis of acoustic units AU selected from the first voice corpus 210. This analysis comprises a manual, perception-based test and subsequent classification of each frame of each selected acoustic unit AU. In addition, the associated frame energy FE and MFCC parameter values are calculated for every frame. Every frame is thus classified as unvoiced or voiced and has associated with it a calculated signal frame energy value and MFCC parameter value. The thresholds are then determined by statistical analysis of the means and corresponding deviations over this small set of acoustic units AU.
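A minimal sketch of the test of block 330, continuing the sketches above; the patent does not say which single MFCC coefficient is compared with MFTV, so the use of the first coefficient here is an assumption.

```python
ENERGY_THRESHOLD_DB = 45.0   # ETV of this embodiment
MFCC_THRESHOLD = -10.00      # MFTV of this embodiment

def is_unvoiced(frame: np.ndarray) -> bool:
    """Block 330: a frame whose energy falls below ETV, or whose MFCC
    coefficient falls below MFTV, is classified as unvoiced (UF)."""
    return (frame_energy_db(frame) < ENERGY_THRESHOLD_DB
            or frame_mfcc(frame)[0] < MFCC_THRESHOLD)
```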
If the test block 330 determines that the k-th frame is a voiced frame VF, the method 300 calculates CELP parameters for the k-th frame at block 335. These CELP parameters are: Line Spectrum Pair (LSP) parameters and adaptive codebook index (Ac) parameters. Note that the CELP gain index (Gf & Ga) parameters and the fixed codebook index (Fc) parameters need not be calculated at block 335. These CELP parameters, and CELP generally, are well known to persons skilled in the art and, for brevity, are not explained in detail here.
Then, at block 340, the method 300 classifies voiced frames VF into three classes, these classes being: a voiced frame transition class, representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames. These three classes are referred to as voiced frame transition (VFT), voiced frame stable type 1 (VFS1), and voiced frame stable type 2 (VFS2). The classification is made according to a distance measure between the adaptive codebook Ac parameter of the current frame and the adaptive codebook Ac parameter of the previous ((k-1)-th) frame, so that a voiced frame VF is determined to be either a voiced frame transition VFT or one of the two stable types VFS1, VFS2. However, if the k-th frame is the first frame (k = 1), there is no previous frame for the i-th acoustic unit AU, and the frame is classified as a voiced frame transition VFT by default. If the frame is not the first frame, the difference between the adaptive codebook vectors of two consecutive adjacent frames can be used to describe the degree of periodicity of the speech. Accordingly, the following test determines the classification of the frame:
IF DistAc(Ac_CurFrame, Ac_PreFrame) < threshold 1
THEN the current frame k is identified as one of the stable types VFS1, VFS2;
ELSE the current frame k is identified as a voiced frame transition VFT;
where Ac_CurFrame is the adaptive codebook Ac parameter of the current frame k, Ac_PreFrame is the adaptive codebook Ac parameter of the previous frame k-1, and DistAc is the distance measure calculated by subtracting Ac_CurFrame from Ac_PreFrame.
The adaptive codebook Ac parameter corresponds roughly to the fundamental frequency (pitch) of the speech segment. In this exemplary embodiment, its range is 20 to 147 samples at an 8 kHz speech sampling rate. If the current frame k has the same or a very similar adaptive codebook Ac parameter as the previous frame k-1, it can be regarded as a stable frame. After experimentation, the value of threshold 1 was set to 8.
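The pitch-stability test can be sketched as follows; reading DistAc as the absolute difference of the two adaptive codebook lags is an interpretation of the subtraction described above, not a formula given in the patent.

```python
from typing import Optional

AC_THRESHOLD = 8  # "threshold 1", set after experimentation

def is_voiced_transition(ac_cur: int, ac_prev: Optional[int]) -> bool:
    """Block 340: a first frame (no predecessor), or a frame whose
    adaptive-codebook lag moved by at least the threshold, is a voiced
    frame transition VFT; otherwise the frame is a stable voiced frame."""
    if ac_prev is None:  # k = 1: classified as VFT by default
        return True
    return abs(ac_cur - ac_prev) >= AC_THRESHOLD
```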
If the current frame k is not identified as a voiced frame transition, it can be identified as one of the stable types VFS1, VFS2. These two stable voiced types VFS1, VFS2 are associated with frames that have almost the same pitch as, and even a similar timbre to, the preceding VF frame. The following classification test classifies a stable voiced frame as VFS1 or VFS2:
IF DistLSP(LSP_CurFrame, LSP_PreFrame) < threshold 2
THEN the current frame k is classified as VFS2
ELSE the current frame k is classified as VFS1
where LSP_CurFrame is the LSP vector of the current frame k, LSP_PreFrame is the LSP vector of the previous frame k-1, and DistLSP is the distance measure calculated as a conventional Euclidean distance. In this exemplary embodiment, the value of threshold 2 is chosen to be 100. The two types of stable voiced frame are thus distinguished by comparing the LSP of the stable voiced frame with the LSP of the previous frame.
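The stable-frame subdivision is then a plain Euclidean distance test on the LSP vectors, sketched below under the same assumptions.

```python
LSP_THRESHOLD = 100.0  # "threshold 2" of this embodiment

def classify_stable_voiced(lsp_cur: np.ndarray, lsp_prev: np.ndarray) -> str:
    """Classify a stable voiced frame as VFS2 (LSP very close to the
    previous frame, hence most redundant) or VFS1."""
    dist = float(np.linalg.norm(lsp_cur - lsp_prev))  # Euclidean DistLSP
    return "VFS2" if dist < LSP_THRESHOLD else "VFS1"
```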
Returning to the test block 330, if it is determined that the k-th frame is an unvoiced frame UF, the k-th frame is classified as unvoiced UF at block 345.
After the k-th frame has been classified as voiced VF or unvoiced UF at block 340 or 345, the method 300 encodes the frame into a coded frame (EncF) at block 350 according to at least four CELP parameters, these parameters being: the Line Spectrum Pair (LSP) parameters; the adaptive codebook index (Ac) parameters; the gain index (Gf & Ga) parameters; and the fixed codebook index (Fc) parameters. The encoding depends on the classification performed at blocks 340 and 345. Thus, if the k-th frame is classified as a voiced frame VF, the encoding depends on the three classes (subclasses), and the number of bits required to encode each subclass differs. The number of bits used to encode an unvoiced frame UF also differs from the number of bits used to encode each subclass.
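As a rough picture of what a coded frame EncF carries, the container below groups the class field with the four CELP parameter sets; the field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass

@dataclass
class EncodedFrame:
    """One coded frame EncF of block 350. The bit width of each field
    depends on the frame class; see Tables 1 to 4."""
    frame_type: str         # Fid: "VFT", "VFS1", "VFS2" or "UF" (2 bits)
    lsp_index: int          # LSP / LPC index
    adaptive_cb: list[int]  # Ac: adaptive-codebook (pitch) indices per subframe
    gains: list[int]        # Gf & Ga: gain indices per subframe
    fixed_cb: list[int]     # Fc: fixed-codebook indices per subframe
```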
Tables 1 to 4 show the coding of voiced and unvoiced frames.
Table 1: Bit allocation for a VFT frame

Parameter                Subframe 1  Subframe 2  Subframe 3  Subframe 4  Total bits per frame
Frame type (Fid)                  -           -           -           -                     2
LSP index (LPC)                   -           -           -           -                     8
Adaptive codebook (Ac)            7           2           7           2                    18
Gain (Gf & Ga)                   12          12          12          12                    48
Fixed codebook (Fc)              11          11          11          11                    44
Total bits                                                                                120

Table 2: Bit allocation for a VFS1 frame

Parameter                Subframe 1  Subframe 2  Subframe 3  Subframe 4  Total bits per frame
Frame type (Fid)                  -           -           -           -                     2
LSP index (LPC)                   -           -           -           -                     8
Adaptive codebook (Ac)            7           2           5           2                    16
Gain (Gf & Ga)                   12           8          12           8                    40
Fixed codebook (Fc)              11           -          11           -                    22
Total bits                                                                                 88

Table 3: Bit allocation for a VFS2 frame

Parameter                Subframe 1  Subframe 2  Subframe 3  Subframe 4  Total bits per frame
Frame type (Fid)                  -           -           -           -                     2
LSP index (LPC)                   -           -           -           -                     0
Adaptive codebook (Ac)            7           2           5           2                    16
Gain (Gf & Ga)                  7.5         7.5         7.5         7.5                    30
Fixed codebook (Fc)               -           -           -           -                     0
Total bits                                                                                 48

Table 4: Bit allocation for a UF frame

Parameter                Subframe 1  Subframe 2  Subframe 3  Subframe 4  Total bits per frame
Frame type (Fid)                  -           -           -           -                     2
LSP index (LPC)                   -           -           -           -                     6
Adaptive codebook (Ac)            -           -           -           -                     0
Gain (Gf & Ga)                    4           4           4           4                    16
Fixed codebook (Fc)               -           -           -           -                     0
Total bits                                                                                 24
As shown in Tables 1 to 4, each coded frame includes a 2-bit field identifying its classification. After the perceptual testing and analysis that classifies frames as unvoiced UF or as one of the three voiced VF subclasses, a CELP coding of a different size is chosen for each class. The CELP coding takes account of decoding performance and real-time synthesis operation. Tables 1 to 4 clearly show that, for the gain (Gf & Ga) field, a voiced frame transition VFT is allocated 48 bits per frame; a voiced frame stable type 1 VFS1 is allocated 40 bits per frame; a voiced frame stable type 2 VFS2 is allocated 30 bits per frame; and an unvoiced frame UF is allocated 16 bits per frame. Furthermore, after CELP coding, the total number of bits allocated per frame is: 120 bits for a voiced frame transition VFT; 88 bits for voiced frame stable type 1 VFS1; 48 bits for voiced frame stable type 2 VFS2; and 24 bits for an unvoiced frame UF.
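For a quick check of the storage saving, the per-class budgets of Tables 1 to 4 can be tabulated and compared with the flat 120 bits per frame of standard CELP; the example frame mix below is invented for illustration.

```python
BITS_PER_FRAME = {"VFT": 120, "VFS1": 88, "VFS2": 48, "UF": 24}

def corpus_size_bits(frame_classes: list[str]) -> int:
    """Coded size of a frame sequence under the class-dependent allocation."""
    return sum(BITS_PER_FRAME[c] for c in frame_classes)

# Example: a 10-frame unit with 2 transitions, 5 stable voiced, 3 unvoiced
classes = ["VFT", "VFS1", "VFS2", "VFS2", "VFT",
           "VFS1", "VFS2", "UF", "UF", "UF"]
print(corpus_size_bits(classes), "bits vs", 120 * len(classes), "for flat CELP")
# -> 632 bits vs 1200 for flat CELP
```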
After the encoding at block 350, a test is performed at test block 355 to determine whether all FN frames of the i-th acoustic unit AU have been classified. If not all frames have yet been classified, the method 300 increments k at block 360 and then repeats blocks 330 to 355 as appropriate.
When all FN frames of the i-th acoustic unit AU have been classified (k = FN), the coded frames EncF of the i-th acoustic unit AU are stored in the second voice corpus 230 at block 365.
A test is then performed at test block 370 to determine whether any acoustic units AU remain to be selected in the first voice corpus 210. If i < N, then i is incremented at block 375 and blocks 320 to 370 are repeated as appropriate. If, at test block 370, i = N, all acoustic units AU have been selected and the method 300 ends at block 380.
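Putting the pieces together, the loop structure of method 300 might look like the sketch below. classify_frame and encode_frame stand in for the classification of blocks 330 to 345 and the class-dependent CELP encoder of block 350; both are hypothetical helpers, not routines named in the patent.

```python
def encode_corpus(first_corpus):
    """Method 300 over all N acoustic units (blocks 305-380): frame,
    classify and encode each unit, then store its coded frames in the
    second voice corpus 230."""
    second_corpus = {}
    for unit in first_corpus:                   # blocks 315, 370, 375
        frames = split_into_frames(unit.waveform)
        coded, prev = [], None
        for frame in frames:                    # blocks 325-360
            cls = classify_frame(frame, prev)   # UF / VFT / VFS1 / VFS2
            coded.append(encode_frame(frame, cls))
            prev = frame
        second_corpus[unit.identifier] = coded  # block 365
    return second_corpus
```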
Advantageously, the invention provides a voice corpus (the second, encoded voice corpus) that is selectively encoded according to the classification of its frames. Voiced frames are further classified into voiced frame transition (VFT), voiced frame stable type 1 (VFS1), and voiced frame stable type 2 (VFS2). In contrast to the standard CELP coding format, frames can be selectively encoded with different numbers of bits based on their classification, whereas the standard CELP coding format requires 120 bits for every frame because it does not use the voiced and unvoiced classification described here. Selectively encoding with different numbers of bits according to classification, as described in this specification, therefore improves decoding efficiency, reduces the required corpus size, and eliminates the need for the device processor to handle high-bit-rate data during TTS.
The detailed description provides exemplary embodiments only and is not intended to limit the scope, applicability, or configuration of the invention. Rather, the detailed description of the exemplary embodiments provides those skilled in the art with an enabling description for implementing an exemplary embodiment of the invention. It should be understood that various changes may be made in the function and arrangement of elements without departing from the spirit and scope of the invention as set forth in the appended claims.

Claims (14)

1. A method for providing an encoded voice corpus, the method comprising:
selecting an acoustic unit from a first voice corpus, the acoustic unit having an identifier and a plurality of associated frames representing a sampled speech waveform;
determining whether each frame is an unvoiced frame or a voiced frame;
classifying the frames into at least voiced frames and unvoiced frames;
encoding the frames into coded frames according to at least four Codebook Excited Linear Prediction parameters, wherein the encoding depends on the classification; and
storing the coded frames in a second voice corpus.
2. The method of claim 1, wherein the determining comprises comparing the frame energy of the frame with a frame energy threshold and comparing a Mel-frequency cepstral coefficient of the frame with an MFCC threshold.
3. The method of claim 1, wherein the classifying is further characterized in that the voiced frames are classified into three subclasses, these classes being: a voiced frame transition class, representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
4. The method of claim 3, wherein the classifying is further characterized in that the two classes of stable voiced frames are distinguished by comparing the line spectrum pair of a stable voiced frame with the line spectrum pair of the previous frame.
5. The method of claim 3, wherein the encoding of the frames is further characterized in that the encoding depends on the classification into the three subclasses.
6. The method of claim 3, wherein the encoding is characterized in that the number of bits required to encode each subclass is different.
7. The method of claim 6, wherein the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass.
8. The method of claim 1, wherein the coded frame comprises a field identifying its classification.
9. An electronic device, the device comprising:
a device processor;
a speech synthesizer coupled to the device processor; and
a voice corpus of coded frames of acoustic units, the coded frames being encoded according to their classification as unvoiced frames or voiced frames.
10. The electronic device of claim 9, wherein the voiced frames in the voice corpus are encoded according to three subclasses, these classes being: a voiced frame transition class, representing a speech signal associated with a frame transitioning from unvoiced to voiced, and two classes of stable voiced frames.
11. The electronic device of claim 10, wherein the two classes of stable voiced frames are distinguished by comparing the line spectrum pair of a stable voiced frame with the line spectrum pair of the previous frame.
12. The electronic device of claim 11, wherein the coded frames in the corpus are characterized in that the number of bits required to encode each subclass is different.
13. The electronic device of claim 9, wherein the number of bits used to encode unvoiced frames differs from the number of bits used to encode each subclass.
14. The electronic device of claim 13, wherein the coded frame comprises a field identifying its classification.
CN200410095710A 2004-11-24 2004-11-24 Method and apparatus for providing phonetical databank Expired - Fee Related CN1779779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200410095710A CN1779779B (en) 2004-11-24 2004-11-24 Method and apparatus for providing phonetical databank

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200410095710A CN1779779B (en) 2004-11-24 2004-11-24 Method and apparatus for providing phonetical databank

Publications (2)

Publication Number Publication Date
CN1779779A CN1779779A (en) 2006-05-31
CN1779779B true CN1779779B (en) 2010-05-26

Family

ID=36770082

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200410095710A Expired - Fee Related CN1779779B (en) 2004-11-24 2004-11-24 Method and apparatus for providing phonetical databank

Country Status (1)

Country Link
CN (1) CN1779779B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101510424B (en) * 2009-03-12 2012-07-04 孟智平 Method and system for encoding and synthesizing speech based on speech primitive

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1173690A (en) * 1996-04-15 1998-02-18 索尼公司 Method and apparatus fro judging voiced/unvoiced sound and method for encoding the speech
CN1274456A (en) * 1998-05-21 2000-11-22 萨里大学 Vocoder


Also Published As

Publication number Publication date
CN1779779A (en) 2006-05-31

Similar Documents

Publication Publication Date Title
CN103971685B (en) Method and system for recognizing voice commands
CN110827805B (en) Speech recognition model training method, speech recognition method and device
TWI427620B (en) A speech recognition result correction device and a speech recognition result correction method, and a speech recognition result correction system
US20110093261A1 (en) System and method for voice recognition
EP1139332A9 (en) Spelling speech recognition apparatus
US20020178004A1 (en) Method and apparatus for voice recognition
CN108053823A (en) A kind of speech recognition system and method
CN111105785B (en) Text prosody boundary recognition method and device
JPH09507105A (en) Distributed speech recognition system
JPS62231997A (en) Voice recognition system and method
EP1619661A3 (en) System and method for spelled text input recognition using speech and non-speech input
CN110503956B (en) Voice recognition method, device, medium and electronic equipment
Zenkel et al. Subword and crossword units for CTC acoustic models
CN108922521A (en) A kind of voice keyword retrieval method, apparatus, equipment and storage medium
CN112580335B (en) Method and device for disambiguating polyphone
US20050216272A1 (en) System and method for speech-to-text conversion using constrained dictation in a speak-and-spell mode
WO2022126969A1 (en) Service voice quality inspection method, apparatus and device, and storage medium
JP2016062069A (en) Speech recognition method and speech recognition apparatus
CN101312038B (en) Method for synthesizing voice
Kurian et al. Continuous speech recognition system for Malayalam language using PLP cepstral coefficient
CN103474067B (en) speech signal transmission method and system
KR100769032B1 (en) Letter to sound conversion for synthesized pronounciation of a text segment
EP1136983A1 (en) Client-server distributed speech recognition
CN1779779B (en) Method and apparatus for providing phonetical databank
Chou et al. Variable dimension vector quantization of linear predictive coefficients of speech

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: NIUANSI COMMUNICATION CO., LTD.

Free format text: FORMER OWNER: MOTOROLA INC.

Effective date: 20101008

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: ILLINOIS STATE, USA TO: DELAWARE STATE, USA

TR01 Transfer of patent right

Effective date of registration: 20101008

Address after: Delaware

Patentee after: NUANCE COMMUNICATIONS, Inc.

Address before: Illinois, USA

Patentee before: Motorola, Inc.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20100526

CF01 Termination of patent right due to non-payment of annual fee