CN104376850A - Estimation method for fundamental frequency of Chinese whispered speech - Google Patents


Info

Publication number
CN104376850A
CN104376850A (application CN201410705012.4A)
Authority
CN
China
Prior art keywords
voice
speech
fundamental frequency
normal
whispering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410705012.4A
Other languages
Chinese (zh)
Other versions
CN104376850B (en)
Inventor
陈雪勤 (Chen Xueqin)
刘正 (Liu Zheng)
赵鹤鸣 (Zhao Heming)
俞一彪 (Yu Yibiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou University
Original Assignee
Suzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou University
Priority to CN201410705012.4A
Publication of CN104376850A
Application granted
Publication of CN104376850B
Legal status: Active


Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for estimating the fundamental frequency of Chinese whispered speech. The method comprises the following steps: a whispered-speech and normal-speech database with identical linguistic content is built; the LPCC parameters Lw of the whispered speech, the LPCC parameters Ln of the normal speech, and the fundamental-frequency parameters F0 are extracted, and DTW alignment is carried out according to Lw and Ln; the F0 range of the normal speech from 100 Hz to 300 Hz is divided at 5 Hz intervals, yielding forty intervals in total; the aligned vectors are assigned to the intervals according to the normal-speech F0 values; within each interval, all whispered-speech LPCC vectors are trained into one GMM, and the joint vectors formed by those LPCC vectors and the corresponding normal-speech F0 parameters are trained into another GMM, from which an estimation function is derived, giving forty estimation functions in total; finally, the LPCC parameters of a whispered utterance are extracted and matched against all the GMMs to find the best-matching model, and the F0 values of the whispered speech are estimated with that model's estimation function. The method makes it possible to estimate the fundamental frequency of whispered speech and effectively overcomes the difficulty caused by the loss of fundamental-frequency information in Chinese whispered speech.

Description

A fundamental frequency estimation method for Chinese whispered speech
Technical field
The present invention relates to speech processing technology, and specifically to a fundamental frequency estimation method for Chinese whispered speech.
Background art
Chinese is a tone language: the semantics and emotion of a speaker are conveyed largely through tone. During whispered phonation, however, the vocal cords do not vibrate, so the most important carrier of tone, the fundamental frequency, is lost. Whether whispered Chinese carries tone, and how its tone is perceived, therefore became a focus of research; the study of whispered tone perception is significant for whispered-speech processing tasks such as enhancement and recognition. In 1972, Abramson summarized two opposing views on whispered tone. The first, represented by Panconcelli-Calzia, holds that for a tone language continuous whispered speech can be understood from context, whereas isolated words cannot. The second, represented by Giet, holds that whispered tone information is replaced by other, non-fundamental-frequency cues, such as increases or decreases in airflow, so that tone information still remains in whispered speech. To show that whispered tone can be perceived, supporters of the second view carried out whispered-tone perception studies by means of subjective listening and objective tests, and demonstrated both subjectively and objectively that tones in whispered Chinese are perceivable.
Traditional speech-analysis systems often assume that the excitation and the vocal-tract system of speech are mutually independent, but Assmann pointed out in his research that the excitation and the vocal-tract information are mutually constrained: only when the two are in harmony is a naturally melodious timbre produced. His experiment was designed as follows: the fundamental frequency and the formant parameters of natural speech were extracted separately, one group of parameters was varied while the other was held fixed, and participating listeners rated the synthesized speech and selected the version that sounded most natural. The results showed that the sound people judged most natural was the one whose combination of fundamental frequency and formants was closest to the original speech, confirming that a constraint relation does exist between excitation and vocal tract. This suggests that the tone of whispered Chinese, which would otherwise be inexplicable because the fundamental-frequency information is missing, can be explained through the vocal-tract parameters: the tone information can be hidden in the channel parameters.
Summary of the invention
The object of the present invention is to provide a fundamental frequency estimation method for Chinese whispered speech that overcomes the difficulty caused by the loss of fundamental-frequency information in Chinese whispered speech.
To achieve the above object, the technical solution adopted by the present invention is a fundamental frequency estimation method for Chinese whispered speech comprising the following steps:
(1) building a whispered-speech and normal-speech database with identical linguistic content, such that the speakers, speech content, and word order of the whispered speech and the normal speech in the database are completely consistent;
(2) extracting the linear prediction cepstrum coefficient parameters Lw of the whispered speech, the linear prediction cepstrum coefficient parameters Ln of the normal speech, and the fundamental-frequency parameters F0, and performing dynamic time warping alignment according to Lw and Ln;
(3) dividing the F0 range of the normal speech from 100 Hz to 300 Hz at 5 Hz intervals, producing 40 intervals in total;
(4) assigning all aligned vectors to the intervals according to the normal-speech F0 value, training all whispered-speech linear prediction cepstrum coefficient vectors within each interval into one Gaussian mixture model, and at the same time training the joint vectors formed by all whispered-speech linear prediction cepstrum coefficient vectors in the interval and the normal-speech F0 parameters into another Gaussian mixture model to obtain an estimation function, giving 40 estimation functions in total;
(5) extracting the linear prediction cepstrum coefficient parameters of the whispered speech, matching them against each Gaussian mixture model to find the best-matching model, and then estimating the F0 values of the whispered speech with the estimation function of that model.
Owing to the above technical solution, the present invention has the following advantages over the prior art:
The present invention builds a whispered-speech and normal-speech database; extracts the LPCC parameters of the whispered speech together with the LPCC and F0 parameters of the normal speech; aligns the whispered-speech LPCC parameters with the normal-speech LPCC parameters; divides the normal-speech F0 range into equal intervals; and assigns the aligned vectors to the intervals according to the normal-speech F0 value. Within each interval, it trains all whispered-speech LPCC vectors into one Gaussian mixture model and, at the same time, trains the joint vectors formed by those LPCC vectors and the normal-speech F0 parameters into another Gaussian mixture model to obtain an estimation function, 40 estimation functions in total. Finally, it extracts the LPCC parameters of a whispered utterance, matches them against the Gaussian mixture models, finds the best-matching model, and uses its estimation function to estimate the F0 values of the whispered speech. In this way the difficulty caused by the loss of fundamental-frequency information in Chinese whispered speech is effectively overcome.
Brief description of the drawings
Fig. 1 is a flow chart of the fundamental frequency estimation method of the present invention in Embodiment One.
Fig. 2 shows the pitch contour estimated with the speaker-dependent model in Embodiment Two together with the target pitch contour.
Fig. 3 shows the pitch contour estimated with the speaker-independent model in Embodiment Two together with the target pitch contour.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments:
Embodiment One: as shown in Fig. 1, a fundamental frequency estimation method for Chinese whispered speech comprises the following steps (a code sketch of the whole pipeline follows this list):
(1) building a whispered-speech and normal-speech database with identical linguistic content, such that the speakers, speech content, and word order of the whispered speech and the normal speech in the database are completely consistent;
(2) extracting the linear prediction cepstrum coefficient parameters Lw of the whispered speech, the linear prediction cepstrum coefficient parameters Ln of the normal speech, and the fundamental-frequency parameters F0, and performing dynamic time warping (DTW) alignment according to Lw and Ln;
(3) dividing the F0 range of the normal speech from 100 Hz to 300 Hz at 5 Hz intervals, producing 40 intervals in total;
(4) assigning all aligned vectors to the intervals according to the normal-speech F0 value, training all whispered-speech linear prediction cepstrum coefficient vectors within each interval into one Gaussian mixture model, and at the same time training the joint vectors formed by all whispered-speech linear prediction cepstrum coefficient vectors in the interval and the normal-speech F0 parameters into another Gaussian mixture model to obtain an estimation function, giving 40 estimation functions in total;
(5) extracting the linear prediction cepstrum coefficient parameters of the whispered speech, matching them against each Gaussian mixture model to find the best-matching model, and then estimating the F0 values of the whispered speech with the estimation function of that model.
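For concreteness, the following is a minimal Python sketch of the training and estimation stages above. The helper gmm_regress is the conditional-expectation estimator given with formulas (3)-(5) below; the feature extraction and DTW alignment of step (2) are assumed to have produced aligned_pairs, and the component count of 4 per mixture is an illustrative assumption, not a value from the patent.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

N_INTERVALS, F0_MIN, F0_STEP = 40, 100.0, 5.0   # steps (3)-(4): 100-300 Hz in 5 Hz bins

def interval_index(f0):
    """Map a normal-speech F0 value (Hz) to one of the 40 intervals."""
    return int(np.clip((f0 - F0_MIN) // F0_STEP, 0, N_INTERVALS - 1))

def train(aligned_pairs):
    """aligned_pairs: (whisper LPCC vector, normal-speech F0) frame pairs from
    step (2). Returns, per interval, a matching GMM on whisper LPCC alone and a
    joint GMM on [LPCC, F0] that serves as the estimation function."""
    buckets = [[] for _ in range(N_INTERVALS)]
    for lpcc, f0 in aligned_pairs:
        buckets[interval_index(f0)].append((lpcc, f0))
    match_gmms, joint_gmms = [], []
    for bucket in buckets:                       # empty intervals need guarding in practice
        X = np.array([lpcc for lpcc, _ in bucket])
        Z = np.array([np.append(lpcc, f0) for lpcc, f0 in bucket])
        match_gmms.append(GaussianMixture(n_components=4).fit(X))
        joint_gmms.append(GaussianMixture(n_components=4, covariance_type='full').fit(Z))
    return match_gmms, joint_gmms

def estimate_f0(lpcc_frames, match_gmms, joint_gmms):
    """Step (5): find the best-matching interval model by average log-likelihood,
    then estimate F0 frame by frame with that interval's estimation function."""
    best = int(np.argmax([g.score(lpcc_frames) for g in match_gmms]))
    return [gmm_regress(joint_gmms[best], x) for x in lpcc_frames]
```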
Embodiment Two: 80 speakers took part in the recording, 40 male and 40 female, ranging in age from children to the elderly with a fairly balanced distribution. The recording environment was quiet, the microphone was handheld, the sampling rate was 16 kHz, and the quantization was 16 bits. To ensure that children could complete the recording, the recording texts were taken from Chinese primary-school textbooks; they contain all the tonal Chinese syllables formed by combining the 21 initials and 35 finals of Mandarin, and the material was screened to guarantee a balanced phoneme distribution.
Each speaker read the same material once in whispered speech and once in normal speech. Owing to the peculiarity of whispered phonation, incorrect articulation inevitably occurs; the corpus data of all whispered speech were therefore checked by subjective inspection of the spectrum to confirm the absence of any pitch contour. Nonconforming passages were marked out, re-recorded, and then inserted back into the corpus.
The STRAIGHT toolkit was used to extract the fundamental frequency and the linear prediction cepstrum coefficients (LPCC) of the speech, with LPCC order p = 24, a frame length of 25 ms, and a frame shift of 10 ms.
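STRAIGHT itself is a MATLAB toolkit; purely as an illustration of the parameterization, here is a minimal Python sketch of the standard LPC-to-cepstrum recursion with the framing values from the text. Using librosa for LPC analysis and framing is an assumption of convenience, not the toolchain of the patent.

```python
import numpy as np
import librosa

P, FRAME_S, HOP_S = 24, 0.025, 0.010            # order p = 24, 25 ms frames, 10 ms shift

def lpcc_frame(frame, p=P):
    """One frame: LPC analysis, then the standard LPC-to-LPCC recursion
    c_n = a_n + sum_{k=1}^{n-1} (k/n) c_k a_{n-k}."""
    a = -librosa.lpc(frame, order=p)[1:]        # predictor coefficients a_1..a_p
    c = np.zeros(p)
    c[0] = a[0]
    for n in range(1, p):                        # c[n] holds c_{n+1}
        k = np.arange(1, n + 1)
        c[n] = a[n] + np.sum((k / (n + 1)) * c[k - 1] * a[n - k])
    return c

def lpcc_sequence(y, sr=16000):
    """Frame a 16 kHz signal and compute LPCC per frame."""
    flen, hop = int(FRAME_S * sr), int(HOP_S * sr)
    frames = librosa.util.frame(y, frame_length=flen, hop_length=hop).T
    return np.array([lpcc_frame(np.ascontiguousarray(f)) for f in frames])
```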
Fundamental-frequency information exists only in the voiced portions, so the LPCC and fundamental-frequency parameters of the voiced segments of the normal speech were extracted, together with the LPCC feature vectors of the corresponding segments of the whispered speech. Because whispered speech is spoken more slowly than normal speech, DTW alignment was performed between the LPCC parameters of the normal and the whispered speech; after alignment, the normal-speech F0 and the whispered-speech LPCC parameters were retained and combined into joint vectors.
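A sketch of this alignment step; librosa.sequence.dtw is used here as a stand-in for the DTW of the text (an assumed choice, any standard implementation will do). Each whispered LPCC frame is paired with the F0 of the normal-speech frame it warps onto, yielding the joint vectors.

```python
import numpy as np
import librosa

def align_and_join(lpcc_w, lpcc_n, f0_n):
    """lpcc_w: (Tw, p) whispered LPCC; lpcc_n: (Tn, p) normal-speech LPCC over
    voiced frames; f0_n: (Tn,) the corresponding F0 track.
    Returns (whispered LPCC, normal F0) pairs along the optimal warping path."""
    _, wp = librosa.sequence.dtw(X=lpcc_w.T, Y=lpcc_n.T)   # features are column-wise
    wp = wp[::-1]                                          # path is returned end-first
    return [(lpcc_w[i], f0_n[j]) for i, j in wp]
```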
The parameters of a Gaussian mixture model consist of mean vectors, covariance matrices, and mixture weights, written $\lambda = \{w_i, \mu_i, \Sigma_i\},\ i = 1, \ldots, M$. An $M$-order Gaussian mixture model (GMM) can be expressed as formula (1):

$$p(x) = \sum_{i=1}^{M} w_i\, N(x; \mu_i, \Sigma_i) \qquad (1)$$

where $N(x; \mu_i, \Sigma_i)$ is the Gaussian probability density function, $x$ is a $d$-dimensional feature vector, $\mu_i$ is the mean vector, $\Sigma_i$ is the covariance matrix, and $w_i$ is the mixture weight of each Gaussian density, with $\sum_{i=1}^{M} w_i = 1$.
First, the whispered-speech LPCC parameters $x$ in each interval and the normal-speech F0 parameters $y$ are combined into joint vectors $z = [x^{\mathrm{T}}, y]^{\mathrm{T}}$. The joint-vector data are then processed with the expectation-maximization algorithm to estimate the model parameters $\lambda$, i.e. $w_i$, $\mu_i$, and $\Sigma_i$. With $\lambda$ known, the posterior probability that a feature vector $x$ belongs to the $i$-th component is:

$$p(c_i \mid x) = \frac{w_i\, N(x; \mu_i^{x}, \Sigma_i^{xx})}{\sum_{j=1}^{M} w_j\, N(x; \mu_j^{x}, \Sigma_j^{xx})} \qquad (2)$$

where the superscripts denote the $x$-blocks of the joint means and covariances.
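A sketch of this per-interval training step: sklearn.mixture.GaussianMixture (an assumed library choice) fits formula (1) to the joint vectors by expectation-maximization, and the fitted weights, means, and covariances are exactly the $\lambda$ used in formulas (2)-(5).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(lpcc_w, f0_n, n_components=4):
    """lpcc_w: (T, p) whispered LPCC frames of one F0 interval; f0_n: (T,)
    the aligned normal-speech F0. n_components is an illustrative choice."""
    Z = np.hstack([lpcc_w, f0_n[:, None]])      # joint vectors z = [x; y]
    return GaussianMixture(n_components=n_components,
                           covariance_type='full').fit(Z)   # EM estimation of lambda
```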
In the conversion stage, the transfer function is estimated with the conditional-expectation method of joint Gaussian prediction. According to the Gaussian mixture model, its general expression can be written as formula (3):

$$F(x) = \sum_{q=1}^{Q} p(c_q \mid x)\,\left( W_q\, x + b_q \right) \qquad (3)$$

where $p(c_q \mid x)$ is the posterior probability, $Q$ is the order of the GMM, $W_q$ is the transition matrix, and $b_q$ is the bias vector of the $q$-th class, as given by formulas (4) and (5):

$$W_q = \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} \qquad (4)$$

$$b_q = \mu_q^{y} - \Sigma_q^{yx} \left( \Sigma_q^{xx} \right)^{-1} \mu_q^{x} \qquad (5)$$
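A sketch of the conversion function of formulas (3)-(5) on a joint GMM fitted as above. The partitioning of the means and covariances into $x$ and $y$ blocks follows the formulas, and the responsibilities are computed from the $x$-marginal of each component as in formula (2); variable names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_regress(gmm, x):
    """Conditional expectation E[y | x] under a joint GMM over z = [x; y].
    gmm: fitted sklearn GaussianMixture on joint vectors; x: (p,) LPCC frame."""
    p = x.shape[0]
    post = np.zeros(gmm.n_components)
    terms = np.zeros(gmm.n_components)
    for q in range(gmm.n_components):
        mu, S = gmm.means_[q], gmm.covariances_[q]
        mu_x, mu_y = mu[:p], mu[p:]
        S_xx, S_yx = S[:p, :p], S[p:, :p]
        # unnormalized posterior of formula (2): weight times x-marginal density
        post[q] = gmm.weights_[q] * multivariate_normal.pdf(x, mean=mu_x, cov=S_xx)
        W = S_yx @ np.linalg.inv(S_xx)          # transition matrix, formula (4)
        b = mu_y - W @ mu_x                     # bias vector, formula (5)
        terms[q] = (W @ x + b)[0]               # per-component estimate of F0
    post /= post.sum()                          # normalize the posteriors
    return float(post @ terms)                  # weighted sum of formula (3)
```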
Unlike the usual GMM-based conversion model, the conversion established here is a many-to-one mapping: the $p$-dimensional LPCC feature vector is converted to the one-dimensional F0 parameter. To address this problem, existing experience is drawn upon: compared with the higher orders, the first six orders of the LPCC contribute more to the tone information. A weighting operation, formula (6), is therefore applied to the first six LPCC parameters.
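Purely as an illustration of such a weighting (the factor and form here are assumptions, not the actual formula (6) of the patent), the emphasis of the first six cepstral orders might look like this:

```python
import numpy as np

def emphasize_low_orders(lpcc, w=2.0, k=6):
    """Illustrative only: scale the first k LPCC orders by a factor w > 1 so the
    tone-bearing low orders dominate the GMM statistics. The actual weights are
    those of formula (6); the values here are placeholders."""
    out = np.asarray(lpcc, dtype=float).copy()
    out[:k] *= w
    return out
```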
To analyze the effectiveness of the method, two groups of verification experiments were designed. In the first group, the material of a single speaker was partitioned into intervals as described above to build a speaker-dependent set of models. Because this model is determined by one speaker and the material is relatively scarce, the number of Gaussian components of the model is correspondingly small; the predictions of this model focus on the tone trajectory. In the second group, all the corpus material was partitioned into intervals as described above to build a speaker-independent set of models. Because the number of speakers and the amount of material are large, the selected number of Gaussian components is also larger.
Let $f_t$ denote the target fundamental-frequency value of a frame and $f_e$ the estimated value, and define $e = |f_e - f_t| / f_t$ as the pitch error of that frame. A frame whose error $e$ exceeds the threshold is counted as a gross pitch error (GPE); GPE is usually given as a percentage, as in formula (7):

$$\mathrm{GPE} = \frac{N_{\mathrm{err}}}{N} \times 100\% \qquad (7)$$

where $N_{\mathrm{err}}$ is the number of gross-error frames and $N$ the total number of frames.
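A sketch of the metric of formula (7); the 20 % threshold on the per-frame relative error is the usual convention for GPE and is an assumption here, as the specific threshold is not restated above.

```python
import numpy as np

def gross_pitch_error(f0_est, f0_target, threshold=0.20):
    """Percentage of frames whose relative F0 error exceeds the threshold.
    Only frames with a defined (positive) target F0 are counted."""
    f0_est, f0_target = np.asarray(f0_est, float), np.asarray(f0_target, float)
    voiced = f0_target > 0
    err = np.abs(f0_est[voiced] - f0_target[voiced]) / f0_target[voiced]
    return 100.0 * np.mean(err > threshold)
```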
Table 1 shows the GPE values of the fundamental frequencies estimated with the two models. The numbers 1 to 9 in the first row denote nine different speakers whose data were selected at random from the database described above. The second row gives the results of the speaker-dependent model, i.e., each speaker's own data were partitioned into intervals and modeled to form the estimation functions; the third row gives the results of the speaker-independent model, i.e., the data of all speakers were jointly partitioned into intervals and modeled to form the estimation functions. As Table 1 shows, the fundamental-frequency estimation error of the speaker-dependent model lies between 5% and 10%, while that of the speaker-independent model is near 15%.
Table 1. GPE values of the fundamental frequencies estimated with the speaker-dependent and the speaker-independent model
Fig. 2 and Fig. 3 show examples of the pitch contours estimated with the two models together with the target pitch contours. Fig. 2 uses the speaker-dependent model and shows the fundamental-frequency estimate for a specific speaker; Fig. 3 uses the speaker-independent model. Although the fundamental-frequency estimate of Fig. 3 shows an obvious error, it agrees well with the target in the trend of the pitch contour, and this trend is the crucial information in whispered-speech fundamental-frequency estimation.

Claims (1)

1. A fundamental frequency estimation method for Chinese whispered speech, characterized by comprising the following steps:
(1) building a whispered-speech and normal-speech database with identical linguistic content, such that the speakers, speech content, and word order of the whispered speech and the normal speech in the database are completely consistent;
(2) extracting the linear prediction cepstrum coefficient parameters Lw of the whispered speech, the linear prediction cepstrum coefficient parameters Ln of the normal speech, and the fundamental-frequency parameters F0, and performing dynamic time warping alignment according to Lw and Ln;
(3) dividing the F0 range of the normal speech from 100 Hz to 300 Hz at 5 Hz intervals, producing 40 intervals in total;
(4) assigning all aligned vectors to the intervals according to the normal-speech F0 value, training all whispered-speech linear prediction cepstrum coefficient vectors within each interval into one Gaussian mixture model, and at the same time training the joint vectors formed by all whispered-speech linear prediction cepstrum coefficient vectors in the interval and the normal-speech F0 parameters into another Gaussian mixture model to obtain an estimation function, giving 40 estimation functions in total;
(5) extracting the linear prediction cepstrum coefficient parameters of the whispered speech, matching them against each Gaussian mixture model to find the best-matching model, and then estimating the F0 values of the whispered speech with the estimation function of that model.
CN201410705012.4A 2014-11-28 2014-11-28 Fundamental frequency estimation method for Chinese whispered speech Active CN104376850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410705012.4A CN104376850B (en) 2014-11-28 2014-11-28 Fundamental frequency estimation method for Chinese whispered speech


Publications (2)

Publication Number Publication Date
CN104376850A (en) 2015-02-25
CN104376850B CN104376850B (en) 2017-07-21

Family

ID=52555722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410705012.4A Active CN104376850B (en) 2014-11-28 2014-11-28 A kind of fundamental frequency estimation method of Chinese ear voice

Country Status (1)

Country Link
CN (1) CN104376850B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0990993A * 1995-09-22 1997-04-04 Fujitsu Ltd Speech processor
JP2006119647A * 2005-09-16 2006-05-11 Yasuto Takeuchi System for pseudo-conversion of whispered voice into ordinary voiced speech
CN101281747A * 2008-05-30 2008-10-08 Suzhou University Method for recognizing tones of Chinese whispered speech based on vocal-tract parameters
CN101441868A * 2008-11-11 2009-05-27 Suzhou University Real-time method for converting Chinese whispered speech into natural speech based on feature conversion rules

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XUEQIN CHEN ET AL.: "F0 Prediction from Linear Predictive Cepstral Coefficient", 2014 Sixth International Conference on Wireless Communications and Signal Processing (WCSP) *
CHEN Xueqin et al.: "Research on vocal-tract system conversion of whispered speech under a universal background model with effective Gaussian components" (有效高斯分量通用背景模型下耳语音声道系统转换研究), Acta Acustica (声学学报) *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106328123A * 2016-08-25 2017-01-11 Suzhou University Method for recognizing whispered speech within a normal speech stream with a small database
CN106328123B * 2016-08-25 2020-03-20 Suzhou University Method for recognizing whispered speech within a normal speech stream with a small database
CN106571135A * 2016-10-27 2017-04-19 Suzhou University Whispered-speech feature extraction method and system
CN106571135B * 2016-10-27 2020-06-09 Suzhou University Whispered-speech feature extraction method and system
CN109903752A * 2018-05-28 2019-06-18 Huawei Technologies Co., Ltd. Method and apparatus for aligning speech
CN109903752B * 2018-05-28 2021-04-20 Huawei Technologies Co., Ltd. Method and apparatus for aligning speech
US11631397B2 2018-05-28 2023-04-18 Huawei Technologies Co., Ltd. Voice alignment method and apparatus
CN109101581A * 2018-07-20 2018-12-28 Anhui Toycloud Technology Co., Ltd. Text corpus screening method and device

Also Published As

Publication number Publication date
CN104376850B (en) 2017-07-21

Similar Documents

Publication Publication Date Title
CN102779508B Sound-library generation apparatus and method, and speech synthesis system and method
Liu et al. Voice Conversion Across Arbitrary Speakers Based on a Single Target-Speaker Utterance.
Tran et al. Improvement to a NAM-captured whisper-to-speech system
CN107369440A Training method and device of a speaker recognition model for short utterances
Aryal et al. Can voice conversion be used to reduce non-native accents?
CN109671442A Many-to-many voice conversion method based on StarGAN and x-vectors
Qian et al. Improved prosody generation by maximizing joint probability of state and longer units
CN104240706B Speaker recognition method based on GMM Token matching-similarity corrected scores
Urbain et al. Arousal-driven synthesis of laughter
Urbain et al. Evaluation of HMM-based laughter synthesis
Xie et al. A KL divergence and DNN approach to cross-lingual TTS
CN106653002A Text live-broadcasting method and platform
CN104376850A (en) Estimation method for fundamental frequency of Chinese whispered speech
Aryal et al. Articulatory inversion and synthesis: towards articulatory-based modification of speech
CN101178895A Model adaptation method based on minimizing the perceptual error of generated parameters
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Yan et al. Analysis and synthesis of formant spaces of British, Australian, and American accents
Csapó et al. Modeling irregular voice in statistical parametric speech synthesis with residual codebook based excitation
Tao et al. Realistic visual speech synthesis based on hybrid concatenation method
Liao et al. Speaker adaptation of SR-HPM for speaking rate-controlled Mandarin TTS
Fatima et al. Vowel-category based short utterance speaker recognition
Moungsri et al. HMM-based Thai speech synthesis using unsupervised stress context labeling
Li et al. Diverse and Expressive Speech Prosody Prediction with Denoising Diffusion Probabilistic Model
Wenjing et al. A hybrid speech emotion perception method of VQ-based feature processing and ANN recognition
Kobayashi et al. An investigation of acoustic features for singing voice conversion based on perceptual age.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant