WO2001039179A1 - System and method for speech recognition using tonal modeling - Google Patents

System and method for speech recognition using tonal modeling

Info

Publication number
WO2001039179A1
WO2001039179A1 (PCT/US2000/032230)
Authority
WO
WIPO (PCT)
Prior art keywords
spectral
tonal
fundamental frequency
speech waveform
syllables
Prior art date
Application number
PCT/US2000/032230
Other languages
English (en)
Inventor
Grace Chung
Hong Chung Leung
Suk Hing Wong
Original Assignee
Infotalk Corporation Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Infotalk Corporation Limited filed Critical Infotalk Corporation Limited
Priority to AU19280/01A priority Critical patent/AU1928001A/en
Priority to US10/130,490 priority patent/US7043430B1/en
Publication of WO2001039179A1 publication Critical patent/WO2001039179A1/fr

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/02Feature extraction for speech recognition; Selection of recognition unit
    • G10L2015/025Phonemes, fenemes or fenones being the recognition units
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/15Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information

Definitions

  • The present invention is directed to the field of speech recognition. More specifically, the invention provides a speaker-independent speech recognition system and method for tonal languages in which a spectral score is combined, sequentially, with a tonal score to arrive at a best prediction for a spoken syllable.
  • a system and method for speaker-independent speech recognition is provided that integrates spectral and tonal analysis in a sequential architecture.
  • the system analyzes the spectral content of a spoken syllable (or group of syllables) and generates a spectral score for each of a plurality of predicted syllables.
  • Time alignment information for the predicted syllable(s) is then sequentially passed to a tonal modeling block, which performs an iterative fundamental frequency (F0) contour estimation for the spoken syllable(s).
  • The tones of adjacent syllables, as well as the rate of change of the tonal information, are then used to generate a tonal score for each of the plurality of predicted syllables.
  • the tonal score is then arithmetically combined with the spectral score in order to generate an output prediction.
  • An aspect of the present invention provides a speech recognition method that may include the following steps: (a) receiving a speech waveform; (b) performing a spectral analysis of the speech waveform and generating a set of syllabic predictions, each syllabic prediction including one or more predicted syllables, wherein the set of syllabic predictions includes a spectral score and timing alignment information of the one or more predicted syllables; (c) sequentially performing a tonal analysis of the input speech waveform using the timing alignment information and generating tonal scores for each of the syllabic predictions; and (d) combining the spectral score with the tonal score for each of the syllabic predictions in order to generate an output prediction.
  • a speech recognition system that includes several software and/or hardware implemented blocks, including: (a) a spectral modeling block that analyzes a speech waveform and generates a plurality of predicted syllables based upon the spectral content of the speech waveform, wherein each of the predicted syllables includes an associated spectral score and timing alignment information indicating the duration of the syllable; and (b) a tonal modeling block that sequentially analyzes the speech waveform using the timing alignment information from the spectral modeling block and generates a plurality of tone scores based upon the tonal content of the speech waveform for each of the predicted syllables.
  • Still another aspect of the invention provides a system for analyzing a speech waveform.
  • This system preferably includes a spectral modeling branch for generating a spectral score, and a tonal modeling branch for generating a tonal score.
  • the spectral modeling branch generates timing alignment information that indicates the beginning and ending points for a plurality of syllables in the speech waveform and provides this timing alignment information to the tonal modeling branch in order to sequentially analyze the speech waveform.
  • An additional aspect provides a method for analyzing a speech waveform carrying a plurality of syllables.
  • This method preferably includes the following steps: (a) performing a spectral analysis on the speech waveform and generating one or more spectral scores for each syllable; (b) performing a tonal analysis on the speech waveform and generating one or more tonal scores for each syllable, wherein the tonal scores are generated by comparing the fundamental frequencies of two or more adjacent syllables; and (c) combining the spectral scores with the tonal scores to produce an output prediction.
  • Still another, more specific method according to the invention provides a method of recognizing tonal information in a speech waveform.
  • This method preferably includes the following steps: (a) generating timing alignment information for a plurality of syllables in the speech waveform; (b) determining a center point within each syllable of the speech waveform using a beginning and ending point specified by the timing alignment information; (c) determining the energy of the syllable at the center point; (d) generating an analysis window for each syllable, wherein the analysis window is centered at the center point and is bounded on either side of the center point by calculating the points at which the energy of the syllable has decreased to a first predetermined percentage of the energy at the center point; (e) computing a fundamental frequency contour within the analysis window; (f) extracting one or more tonal features from the fundamental frequency contour; and (g) generating a plurality of tonal scores for each syllable based on the one or more extracted tonal features.
  • FIG. 1 is a block diagram of a speaker-independent speech recognition system according to the present invention;
  • FIG. 2 is a flowchart depicting a series of steps for F0 contour estimation according to the present invention;
  • FIG. 3 is an example F0 contour plot generated by the methodology of the present invention depicting three spoken syllables;
  • FIG. 4 is a timing diagram depicting three spoken syllables including tonal information.
  • FIG. 1 is a block diagram of a speaker-independent speech recognition system according to the present invention.
  • This system includes two branches (or paths), an upper branch 12, which performs spectral modeling of an input waveform and produces a spectral score 32, and a lower branch 14, which performs tonal modeling based on the input waveform and also based upon information received from the upper branch 12, and produces a tonal score 34.
  • a combination block then combines the spectral score 32 with the tonal score 34 in order to generate a best output prediction 42 for the spoken syllable(s).
  • the present invention provides a sequential architecture for speech recognition in which information from the spectral analysis is used in the tonal analysis to provide a more robust result.
  • The system 10 may also include front-end hardware for generating the input waveform 16, and back-end hardware (or software) for using the output prediction 42.
  • This front-end hardware may include a microphone, an analog-to-digital converter and a digital signal processor (DSP), depending upon the application of the system.
  • the system 10 could be integrated into a variety of applications, such as a general-purpose speech recognition program, a telephone, cellular phone, or other type of electronic appliance, or any other type of software application or electronic device that may require speaker-independent speech recognition capability.
  • the input waveform 16 is a digital waveform.
  • The spectral modeling branch 12 includes a spectral analysis block 18, a feature extraction block 20, a model scoring block 22, and an N-best search block 24.
  • The model scoring block 22 receives information from a model database 46.
  • the N-best search block 24 receives information from a vocabulary database 48.
  • The spectral analysis block 18 receives the input waveform 16 and performs a frequency-domain spectral analysis of the spoken syllable(s).
  • Example spectral analyses could include a fast Fourier transform (FFT), a mel-frequency cepstral coefficient (MFCC) analysis, or a linear prediction coefficient (LPC) analysis. Regardless of the exact type of spectral analysis performed, the spectral analysis block 18 generates a sequence of frames, each containing a multi-dimensional vector that describes the spectral content of the input waveform 16.
  • The sequence of frames from the spectral analysis block 18 is then provided to the feature extraction block 20.
  • The feature extraction block analyzes the multi-dimensional vector data in the sequence of frames and generates additional dimensionality data that further describes certain features of the input waveform 16.
  • The feature extraction block 20 may compute a differential between two adjacent frames for each of the dimensions in the vector, and it may then compute a differential of the computed differential, or it may compute energy, or some other related calculation (a minimal delta-feature sketch appears after this description). These calculations relate to certain features of the spoken syllables that can be further utilized by the model scoring block 22 in order to properly predict the actual speech.
  • the multi-dimensional vector data from the spectral analysis block 18 and the additional computations from the feature extraction block 20 are then provided to the model scoring block 22.
  • the model scoring block may use a Gaussian distribution function in order to compute a probability result that the feature vector corresponds to a particular spectral model of some syllable (or syllables).
  • the system described herein could be configured at a variety of levels of granularity.
  • the system may be configured to analyze one letter at a time, or one syllable at a time, or a group of syllables at a time, or entire words at a time. Regardless of the granularity of the analysis, however, the basic steps and functions set forth would be the same.
  • the model scoring block 22 utilizes data from a model database 46 in computing its probabilities for a particular set of input data (feature vector).
  • the model database preferably includes a Hidden Markov Model (HMM), although other types of models could also be utilized. For more information on the HMM, see Robustness in Automatic Speech Recognition, by Hisashi Wakita, pp. 90-102.
  • Using the input data from the spectral analysis block 18 and the feature extraction block 20, the model scoring block develops a prediction (or score) for each entry in the model database. Higher scores are associated with more likely spectral models, and lower scores with less likely models.
  • the scores for each of the models from the model scoring block 22 are then passed to the N-Best search block 24, which compares these scores to data stored within a vocabulary database in order to derive a set of predictions for the most likely spoken syllables (or letters, or words depending on the application).
  • The vocabulary database is typically organized into a series of words that include syllables and tones associated with those syllables, although other semantic organizations for the vocabulary are certainly possible. If the vocabulary is on a word level, then the scores at the frame level (or syllable level) may be combined by the N-best search block 24 prior to comparison to the data in the vocabulary database 48.
  • the N-Best search block 24 provides two outputs 32, 36.
  • the first output is a set of spectral scores 32 for the most likely syllables (or words or sentences) as determined by comparing the model scoring information to the data stored in the vocabulary database 48. These spectral scores 32 are preferably described in terms of a probability value, and are then provided to the combination block 40 for combination with the tonal scores 34.
  • For each of the set of most likely syllables, the N-Best search block 24 also provides time alignment information 36, which is provided to the F0 estimation block 26 of the tonal analysis branch 14.
  • the time alignment information 36 includes information as to where (in time) a particular syllable begins and ends. This information 36 also includes the identity of the predicted syllables (and their associated tone) as determined by the N-Best search block 24.
  • The time alignment information 36 passed to the F0 estimation block 26 would include beginning and ending timing information for each of the three syllables, the identity of the syllable, and its tone.
  • The input speech waveform 16 undergoes analysis by an F0 estimation block 26, a feature extraction block 28, and a model scoring block 30.
  • The F0 estimation block 26 uses the input waveform and the time alignment information in order to output an F0 contour 44, as further described below.
  • The F0 contour 44 determination is preferably based on the Average Magnitude Difference Function (AMDF) algorithm.
  • The system then extracts numerous features from the F0 contour of the input waveform using a feature extraction block 28, such as the ratio of the average F0 frequencies of adjacent syllable pairs and the slope of the first-order least-squares regression line of the F0 contour.
  • These features are then provided to a statistical model 30 that preferably uses a two-dimensional full-covariance Gaussian distribution to generate a plurality of tone scores 34 for each of the predicted syllables from the N-Best search block 24.
  • the tone score 34 is combined, preferably linearly, with the spectral score 32 from the spectral analysis branch 12 for each of the predicted syllables in order to arrive at a set of final scores that correspond to an output prediction 42.
  • The tonal modeling section 14 is now described in more detail.
  • FIG. 2 is a flowchart depicting a series of steps for F0 contour estimation 26 according to the present invention.
  • The F0 estimation algorithm involves an initial second-order lowpass filtering operation 110, followed by a methodology based on the AMDF algorithm. The basic description is as follows.
  • The second-order lowpass filter step 110 applied to the input waveform 16 is preferably implemented with a fixed second-order lowpass transfer function.
  • The F0 estimation block 26 receives the time alignment information 36 at step 112 from the N-Best search block 24 of the spectral modeling branch 12.
  • This information 36 includes beginning and ending timing information for each of the predicted syllables from the spectral analysis, and also includes the identity of the predicted syllables and their corresponding tones.
  • The primary purpose of the tonal modeling block is to predict which of these spectral analysis predictions is most likely, given an analysis of the tonal information in the actual input waveform 16. A center point for each syllable can then be identified by determining the point of maximum energy within the syllable.
  • The F0 estimation block 26 computes the fundamental frequency contour for the entire frame at step 114 using the AMDF algorithm, the frame corresponding to a particular prediction (which could be a letter, syllable, word, or sentence as discussed above). This step also computes the average fundamental frequency for the entire frame of data.
  • the AMDF algorithm produces an estimate of the fundamental frequency using an N data point length window of the lowpass filtered waveform 16 that corresponds to the type of prediction.
  • A difference function is computed at each frame where a value of the fundamental frequency is required; a standard form of this AMDF difference function is sketched after this description.
  • The actual F0 contour estimation set forth in FIG. 2 consists of several passes through the entire spoken utterance (i.e., all the data present in the input waveform 16). This is done in order to reduce the number of halving or doubling errors. The estimate is most susceptible to these errors at the edges of the vowel, that is, at the consonant-vowel transition boundaries. Also, if voicing is absent, the estimation of F0 is meaningless and the value of F0 should be ignored. In the absence of an accurate alignment of the vowel-consonant boundary, it is necessary to incorporate automatic voicing detection into the F0 estimation algorithm.
  • The present invention introduces the concept of "islands of reliability." These islands of reliability are first computed in step 116 of the preferred methodology, utilizing the time alignment information 36 received at step 112. The point of maximum energy near the center of each syllable has been previously obtained in step 112 from the alignment provided by the spectral analysis branch. The speech segment in which the energy remains above P percent of this maximum is then marked as an island of reliability in step 116 (a minimal sketch of this windowing appears after this description).
  • the value of "P" is a predetermined amount and may vary from application to application.
  • The concept of the island of reliability is to provide a speech segment over which the basic F0 estimator or AMDF algorithm produces very reliable results.
  • FIG. 3 sets forth a portion of the F0 contour 200 for three spoken syllables 202, 204, 206, in which the initial island of reliability for each syllable is shown as 208.
  • the difference function set forth above is computed whenever the frame falls within an island of reliability.
  • The fundamental period pertaining to that frame is chosen as the lag of the global minimum of the difference function. Any local minima are ignored at this stage of the process.
  • An overall average F0 is then computed from all such values. This forms an initial estimate of the average pitch, F_AV, of the speaker's voice, and the final fundamental frequency contour should reside around this vicinity.
  • The F0 contour is then established within the islands of reliability, but this time both global and local minima are considered. Again the difference function is computed for all frames that lie within these islands.
  • Two sources are utilized to make each estimate from the difference function, y_n(k), as defined above.
  • The algorithm searches for (i) the global minimum K_G of the difference function and (ii) the local minimum K_L that is closest to the period of the average fundamental, F_AV, as computed in the first pass above.
  • The global minimum K_G in (i) is always chosen if its value is less than the value at the other local minimum (ii) by some predetermined scaled threshold; otherwise K_L in (ii) is chosen as the fundamental period (a minimal version of this selection rule is sketched after this description).
  • The F0 contour is predicted from left to right of the utterance at the marked islands of reliability.
  • The reason that K_L is chosen over K_G, unless K_G is much less than the other local minimum, is that a typical speaker's tone cannot change very rapidly; thus it is more likely that the correct F0 value is based on the local minimum that is closest to the average fundamental frequency for the entire data frame.
  • The next pass through the speech data involves the determination of the F0 contour from each boundary of the initial islands of reliability to points on either side of the islands at which the energy of the waveform drops below R percent of the maximum energy within the island.
  • In this pass, the boundary at which voicing in the vowel terminates is determined. This is done by examining the data frames to the left or right of the initial island boundaries and assuming that, once the energy in the frame data drops below R percent of its maximum value at the vowel center in the initial island of reliability, the F0 estimate is no longer reliable. This is due to the absence of voicing, and so the F0 values are ignored beyond this cutoff point. In this manner, the initial islands of reliability are expanded to the right and left of the initial boundaries.
  • the fundamental frequency contour F0 is then recomputed over the expanded island of reliability.
  • To the right of each island, the contour is estimated from left to right, and vice versa for the F0 contour to the left of each island. Again, each time the difference function is computed, two particular locations are marked.
  • The method searches for (i) the global minimum K_G and (ii) the local minimum K_L whose occurrence is most proximate to the fundamental period value immediately to the left of the current estimated value.
  • The global minimum K_G in (i) is always chosen if its value is less than the value at the other local minimum (ii) by some predetermined threshold value; otherwise K_L in (ii) is chosen as the fundamental period.
  • Steps 120 and 122 are very similar to the F0 estimation within the islands of reliability in step 118.
  • the procedure continues from right to left to estimate the fundamental frequency values to the left of the islands of reliability, beginning at the left boundary of each of these islands and terminating when energy falls below R percent of the maximum energy within the syllable.
  • This method uses the lag of the global minimum of the difference function, y_n(k), as an estimate of the fundamental period if that value is not very far from previous estimates of the pitch contour. In many cases the minima found in (i) and (ii) will coincide at the same point, and there is no question of where the fundamental period occurs.
  • The aim is to produce a fundamental frequency contour that is as smooth as possible; a contour with a minimum number of discontinuities and sudden changes is likely to be closer to the true contour.
  • 1.5 Median Filtering. As an additional measure to produce a smoother contour, a five-point median filter is applied in step 124. This operation smooths the contour data and produces the F0 contour output 44, which is then supplied to the feature extraction block 28 of the tonal analysis branch 14 (a minimal median-filter sketch appears after this description).
  • 2. Tone Feature Extraction and Modeling Algorithm. After the F0 contour has been computed, features pertaining to tone information are extracted for generating a tonal score, which will eventually be combined with the spectral score in order to arrive at a final output prediction 42. These steps are carried out by the feature extraction block 28 and the model scoring block 30.
  • The tone model is preferably based on a two-dimensional full-covariance Gaussian model, although other tonal models could also be used. During training of this type of model, a separate sub-model is built for each unique pair of tones. Each syllable in the vocabulary database 48 is associated with a tone of its own. Therefore, for a vocabulary of N syllables, there is a total of N squared sub-models.
  • The tone model preferably consists of two dimensions: (1) a ratio of the average tone frequency of a syllable to the average tone frequency of the following syllable (in order to compare tone pairs); and (2) a slope of the fundamental frequency F0 as estimated by a regression line over one of the syllables (both features are sketched after this description).
  • The tone frequency is estimated by averaging the F0 frequencies within each syllable, and the ratio between adjacent syllables is then taken.
  • The slope of the contour at the syllable is estimated by a first-order least-squares linear regression line.
  • The present invention thereby overcomes a primary disadvantage of known systems, which derive tonal information based only on the absolute value of the fundamental frequency F0 contour and do not take adjacent tones into account. This advantage of the present invention enables use in a speaker-independent environment.
  • Having computed the spectral score 32 for a particular set of predicted syllables from the spectral branch 12, and having computed the corresponding tonal score 34 for the same set of predicted syllables from the tonal branch 14, the system shown in FIG. 1 then combines these scores in a combination block 40, as further discussed below, in order to derive a final output prediction 42 (a minimal sketch of this sequential rescoring appears after this description).
  • Hyp 1: a1 a2 a3 a4 a5
  • Hyp 2: a6 a7 a3 a4 a5
  • This final score is then used to reorder the hypotheses to produce a new N-best list as a final output prediction 42.
  • FIG. 4 is a timing diagram depicting three spoken syllables including tonal information.
  • This figure illustrates a sequence of three syllables: x(3), y(1), and z(2), where x, y, and z denote the syllables and the digits inside the parentheses, (3), (1), and (2), denote the tones of the respective syllables.
  • When the tone-recognition component of the present invention computes the probability of having Tone 3 between t1 and t2 and having Tone 1 between t2 and t3, it utilizes the pitch information between t1 and t3.
  • This strategy has two advantages: (1) it reduces the sensitivity of the recognition software to different speaking characteristics of the speakers; and (2) it captures co-articulatory effects of two adjacent syllables and tones.
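The following sketches illustrate several of the steps described above. None of them reproduces the patent's actual equations or code; all function names, thresholds, and data layouts introduced here are assumptions for illustration only.

First, the delta-style computation described for the feature extraction block 20 (a differential between adjacent frames, the differential of that differential, and an energy term) could look like the sketch below; the exact differencing scheme and the stacking layout are assumptions.

```python
import numpy as np

def add_delta_features(frames):
    """Append first and second differentials (delta and delta-delta) and an
    energy term to a sequence of spectral feature vectors, one per frame.

    `frames` is an assumed (num_frames, num_dims) array; the simple
    adjacent-frame difference used here is an illustrative choice.
    """
    frames = np.asarray(frames, dtype=float)
    delta = np.diff(frames, axis=0, prepend=frames[:1])   # differential between adjacent frames
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])    # differential of the computed differential
    energy = np.sum(frames ** 2, axis=1, keepdims=True)   # per-frame energy
    return np.hstack([frames, delta, delta2, energy])
```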
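The difference function referenced at step 114 is not reproduced in the text above. For orientation, the standard textbook form of the Average Magnitude Difference Function over an N-point window of the lowpass-filtered waveform x is given below; the patent's exact formulation and normalization may differ.

```latex
y_n(k) \;=\; \frac{1}{N}\sum_{m=0}^{N-1}\bigl|\,x(n+m) - x(n+m-k)\,\bigr|,
\qquad k_{\min} \le k \le k_{\max}
```

Here the estimated fundamental period for the frame starting at sample n is the lag k at which y_n(k) attains its minimum, and the fundamental frequency follows as the sampling rate divided by that lag.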
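Next, a minimal sketch of the "islands of reliability" windowing of steps 116 through 122, assuming a per-frame energy envelope and syllable boundaries taken from the spectral branch's time alignment 36. The function names and the default values standing in for the patent's P and R percentages are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def initial_island(energy, start, end, p=0.5):
    """Mark the initial island of reliability for one syllable (step 116).

    `energy` is an assumed per-frame energy envelope; `start`/`end` are the
    syllable boundaries from the time alignment; `p` stands in for the
    patent's "P percent" threshold.
    """
    center = start + int(np.argmax(energy[start:end]))  # point of maximum energy
    peak = energy[center]
    left, right = center, center
    # Grow outward while the energy stays above p * peak.
    while left > start and energy[left - 1] >= p * peak:
        left -= 1
    while right < end - 1 and energy[right + 1] >= p * peak:
        right += 1
    return center, left, right

def expand_island(energy, left, right, peak, r=0.2):
    """Expand the island (steps 120, 122) until the energy falls below
    r * peak ("R percent"), beyond which voicing is assumed absent and the
    F0 values are ignored."""
    while left > 0 and energy[left - 1] >= r * peak:
        left -= 1
    while right < len(energy) - 1 and energy[right + 1] >= r * peak:
        right += 1
    return left, right
```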
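The choice between the global minimum K_G of the difference function and the local minimum K_L nearest a reference period (the period of the average pitch F_AV in the second pass, or the neighboring frame's estimate when expanding an island) might look like the following. The threshold value and the local-minimum search are assumptions; the patent only states that K_G wins when it is smaller than the other minimum by a predetermined scaled threshold.

```python
import numpy as np

def choose_fundamental_period(y, reference_period, theta=0.7):
    """Pick the fundamental period lag from one difference function y_n(k).

    `reference_period` and `theta` are assumed parameters for this sketch.
    """
    k_g = int(np.argmin(y))  # (i) global minimum
    # (ii) local minima: lags whose value is below both neighbors
    local = [k for k in range(1, len(y) - 1) if y[k] < y[k - 1] and y[k] < y[k + 1]]
    if not local:
        return k_g
    k_l = min(local, key=lambda k: abs(k - reference_period))
    # Prefer the local minimum nearest the reference unless the global
    # minimum is markedly smaller (scaled by theta): tones change slowly,
    # so a period near the reference is usually the correct one.
    return k_g if y[k_g] < theta * y[k_l] else k_l
```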
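The five-point median filtering of step 124 can be realized with an ordinary sliding median; this is one common way to implement it, not necessarily the patent's.

```python
import numpy as np

def median_filter_5(f0_contour):
    """Smooth an F0 contour with a five-point median filter (step 124)."""
    f0 = np.asarray(f0_contour, dtype=float)
    padded = np.pad(f0, 2, mode="edge")  # replicate the edges so the output length matches
    return np.array([np.median(padded[i:i + 5]) for i in range(len(f0))])
```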
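The two tone-model dimensions described above, the ratio of average F0 between adjacent syllables and the slope of a first-order least-squares regression line over a syllable's F0 contour, together with scoring against a two-dimensional full-covariance Gaussian sub-model for a tone pair, could be sketched as follows. The input layout and the parameter names (`mean`, `cov`) are assumptions.

```python
import numpy as np

def tone_pair_features(f0_syllable, f0_next_syllable, times_syllable):
    """Features for one adjacent-syllable pair (feature extraction block 28).

    The F0 values are assumed to come from each syllable's island of
    reliability, with `times_syllable` giving the corresponding frame times.
    """
    ratio = np.mean(f0_syllable) / np.mean(f0_next_syllable)  # dimension 1: adjacent F0 ratio
    slope = np.polyfit(times_syllable, f0_syllable, 1)[0]     # dimension 2: regression-line slope
    return np.array([ratio, slope])

def tone_pair_log_score(x, mean, cov):
    """Log-likelihood of a 2-D feature vector under one full-covariance
    Gaussian tone-pair sub-model (model scoring block 30)."""
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.inv(cov) @ diff + logdet + 2 * np.log(2 * np.pi))
```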
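Finally, the sequential combination of the two branches, in which spectral scores and time alignments from an N-best pass feed a tonal pass whose scores are linearly combined with the spectral scores to re-rank the hypotheses, is summarized below. The data layout, the `tone_scorer` callable, and the weight `w` are illustrative assumptions; the patent only states that the combination is preferably linear.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Hypothesis:
    syllables: List[str]               # predicted syllable identities
    tones: List[int]                   # predicted tone per syllable
    boundaries: List[Tuple[int, int]]  # (start, end) frame indices per syllable
    spectral_score: float              # score from the spectral branch 12
    tonal_score: float = 0.0           # filled in by the tonal branch 14
    final_score: float = 0.0

def rescore_with_tones(waveform, hypotheses: List[Hypothesis],
                       tone_scorer: Callable, w: float = 0.5) -> List[Hypothesis]:
    """Sequentially apply tonal modeling to spectrally derived hypotheses
    and re-rank them by a linear combination of the two scores."""
    for hyp in hypotheses:
        hyp.tonal_score = tone_scorer(waveform, hyp.boundaries, hyp.tones)
        hyp.final_score = hyp.spectral_score + w * hyp.tonal_score
    return sorted(hypotheses, key=lambda h: h.final_score, reverse=True)
```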

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a speaker-independent speech recognition system and method that integrates spectral and tonal analyses in a sequential architecture. The system analyzes the spectral content of a spoken syllable or group of syllables (18) and generates a spectral score for each predicted syllable (46, 22). Time alignment information (36) for the predicted syllables is then sequentially passed to a tonal modeling block (14), which performs an iterative fundamental frequency contour estimation for the spoken syllable(s). The tones of adjacent syllables, as well as the rate of change of the tonal information, are then used to generate a tonal score for each of the predicted syllables. The tonal scores (34) are then arithmetically combined (40) with the spectral score (32) in order to generate an output prediction.
PCT/US2000/032230 1999-11-23 2000-11-22 System and method for speech recognition using tonal modeling WO2001039179A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
AU19280/01A AU1928001A (en) 1999-11-23 2000-11-22 System and method for speech recognition using tonal modeling
US10/130,490 US7043430B1 (en) 1999-11-23 2000-11-22 System and method for speech recognition using tonal modeling

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US16717299P 1999-11-23 1999-11-23
US60/167,172 1999-11-23

Publications (1)

Publication Number Publication Date
WO2001039179A1 true WO2001039179A1 (fr) 2001-05-31

Family

ID=22606249

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2000/032230 WO2001039179A1 (fr) System and method for speech recognition using tonal modeling

Country Status (3)

Country Link
CN (1) CN1209743C (fr)
AU (1) AU1928001A (fr)
WO (1) WO2001039179A1 (fr)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4264841B2 (ja) * 2006-12-01 2009-05-20 Sony Corporation Speech recognition apparatus, speech recognition method, and program


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4937870A (en) * 1988-11-14 1990-06-26 American Telephone And Telegraph Company Speech recognition arrangement
US5327521A (en) * 1992-03-02 1994-07-05 The Walt Disney Company Speech transformation system
US5884253A (en) * 1992-04-09 1999-03-16 Lucent Technologies, Inc. Prototype waveform speech coding with interpolation of pitch, pitch-period waveforms, and synthesis filter
US5583961A (en) * 1993-03-25 1996-12-10 British Telecommunications Public Limited Company Speaker recognition using spectral coefficients normalized with respect to unequal frequency bands

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8249873B2 (en) 2005-08-12 2012-08-21 Avaya Inc. Tonal correction of speech
CN110675845A (zh) * 2019-09-25 2020-01-10 杨岱锦 Accurate vocal humming recognition algorithm and digital music notation method
CN111599347A (zh) * 2020-05-27 2020-08-28 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological speech MFCC features for artificial intelligence analysis
CN111599347B (zh) * 2020-05-27 2024-04-16 广州科慧健远医疗科技有限公司 Standardized sampling method for extracting pathological speech MFCC features for artificial intelligence analysis

Also Published As

Publication number Publication date
CN1425176A (zh) 2003-06-18
AU1928001A (en) 2001-06-04
CN1209743C (zh) 2005-07-06

Similar Documents

Publication Publication Date Title
US6278970B1 (en) Speech transformation using log energy and orthogonal matrix
AU685788B2 (en) A method and apparatus for speaker recognition
US6195634B1 (en) Selection of decoys for non-vocabulary utterances rejection
US5625749A (en) Segment-based apparatus and method for speech recognition by analyzing multiple speech unit frames and modeling both temporal and spatial correlation
KR100631786B1 (ko) 프레임의 신뢰도를 측정하여 음성을 인식하는 방법 및 장치
US9123347B2 (en) Apparatus and method for eliminating noise
WO2001035389A1 (fr) Caracteristiques tonales pour reconnaissance de la parole
US7043430B1 (en) System and method for speech recognition using tonal modeling
Kourd et al. Arabic isolated word speaker dependent recognition system
WO1994022132A1 (fr) Procede et dispositif d'identification de locuteur
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
KR100930587B1 (ko) 혼동 행렬 기반 발화 검증 방법 및 장치
WO2001039179A1 (fr) System and method for speech recognition using tonal modeling
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
WO2002029785A1 (fr) Procede, appareil et systeme permettant la verification du locuteur s'inspirant d'un modele de melanges de gaussiennes (gmm)
Doss et al. Using pitch frequency information in speech recognition
KR100551953B1 (ko) 피치와 엠.에프.씨.씨를 이용한 성별식별 장치 및 방법
Shah et al. Phone Aware Nearest Neighbor Technique Using Spectral Transition Measure for Non-Parallel Voice Conversion.
Cevik et al. Detection of repetitions in spontaneous speech in dialogue sessions.
Pawar et al. Analysis of FFSR, VFSR, MFSR techniques for feature extraction in speaker recognition: a review
Hao et al. A data-driven speech enhancement method based on A* longest segment searching technique
Mayora-Ibarra et al. Time-domain segmentation and labelling of speech with fuzzy-logic post-correction rules
Fotinea et al. Emotion in speech: Towards an integration of linguistic, paralinguistic, and psychological analysis
Morales-Cordovilla et al. A robust pitch extractor based on dtw lines and casa with application in noisy speech recognition
Pattanayak et al. Significance of single frequency filter for the development of children's KWS system.

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AL AM AT AU AZ BA BB BG BR BY CA CH CN CR CU CZ DE DK DM EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 008185468

Country of ref document: CN

REG Reference to national code

Ref country code: DE

Ref legal event code: 8642

122 Ep: pct application non-entry in european phase
WWE Wipo information: entry into national phase

Ref document number: 10130490

Country of ref document: US