WO2017098307A1 - Speech analysis and synthesis method based on harmonic model and source-vocal tract decomposition - Google Patents
Speech analysis and synthesis method based on harmonic model and source-vocal tract decomposition
- Publication number
- WO2017098307A1 (PCT/IB2015/059495)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- harmonic
- sound source
- phase
- response
- model
- Prior art date
Links
- 238000004458 analytical method Methods 0.000 title claims abstract description 69
- 238000001308 synthesis method Methods 0.000 title claims abstract description 23
- 238000000354 decomposition reaction Methods 0.000 title claims 5
- 238000000034 method Methods 0.000 claims abstract description 62
- 230000004044 response Effects 0.000 claims description 150
- 239000013598 vector Substances 0.000 claims description 72
- 238000001914 filtration Methods 0.000 claims description 9
- 230000003044 adaptive effect Effects 0.000 claims 1
- 230000015572 biosynthetic process Effects 0.000 abstract description 28
- 238000003786 synthesis reaction Methods 0.000 abstract description 28
- 230000001755 vocal effect Effects 0.000 abstract description 4
- 230000003595 spectral effect Effects 0.000 description 11
- 238000012545 processing Methods 0.000 description 10
- 230000005855 radiation Effects 0.000 description 8
- 238000001228 spectrum Methods 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 5
- 238000000605 extraction Methods 0.000 description 5
- 210000004704 glottis Anatomy 0.000 description 5
- 230000004048 modification Effects 0.000 description 5
- 238000012986 modification Methods 0.000 description 5
- 230000006870 function Effects 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 238000004891 communication Methods 0.000 description 2
- 230000010363 phase shift Effects 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000001514 detection method Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000005284 excitation Effects 0.000 description 1
- 230000000737 periodic effect Effects 0.000 description 1
- 238000000819 phase cycle Methods 0.000 description 1
- 230000035807 sensation Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 230000001360 synchronised effect Effects 0.000 description 1
- 238000010189 synthetic method Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/45—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of analysis window
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/75—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 for modelling vocal tract parameters
Definitions
- The present invention relates to the field of speech synthesis, and in particular to the subfields of speech analysis/synthesis and speech coding.
- Speech analysis/synthesis technology analyzes a speech signal to obtain an intermediate representation and then re-synthesizes speech from the analysis result.
- By modifying the intermediate data obtained from analysis, characteristics of the speech such as fundamental frequency, duration, and timbre can be changed.
- In speech synthesis and audio processing applications, speech analysis/synthesis systems are an important component. To modify speech parameters flexibly, such applications often require a parametric, high-quality speech analysis/synthesis method.
- A commonly used speech analysis/synthesis approach is based on the source-filter model.
- The model describes the human voice production system as a pulse train signal driving a series of cascaded filters, including a glottal flow filter, a vocal tract filter, and a lip radiation filter.
- The pulse train signal is a series of unit impulses separated by the fundamental period.
- A simplified form of the source-filter model is widely used in speech analysis/synthesis techniques. This simplified form merges the glottal flow filter and the lip radiation filter into the vocal tract filter.
- Speech analysis/synthesis methods designed on this simplified model include PSOLA (pitch-synchronous overlap-add), STRAIGHT, and MLSA (Mel log spectrum approximation filter).
- The simplified form of the source-filter model exposes certain defects when the speech fundamental frequency is modified.
- The glottal flow is the velocity of the air flowing through the glottis and reflects the degree to which the glottis is open. Because the fundamental frequency determines how often the glottis opens and closes, the duration of the unit impulse response of the glottal flow filter should equal the fundamental period.
- The shape of the glottal flow pulse is approximately the same at different fundamental frequencies, but its period length changes with the fundamental frequency.
- In the simplified form, however, the glottal flow filter is merged into the vocal tract filter, so the frequency response of the glottal flow filter is assumed to be independent of the fundamental frequency. This assumption is inconsistent with the mechanism of voice production; therefore, after the fundamental frequency is modified, speech analysis/synthesis methods based on the simplified model often cannot produce natural speech.
- To overcome these drawbacks, methods such as SVLN and GSS model the glottal flow and the vocal tract separately. Since the lip radiation filter behaves approximately as a differentiator, it is merged into the glottal flow to form the glottal flow derivative, which is represented by the Liljencrants-Fant source model. During analysis, the parameters of the source model are first estimated; the amplitude spectrum of the speech is then divided by the amplitude response of the source model, and spectral envelope estimation is performed to obtain the amplitude response of the vocal tract. Finally, under the minimum-phase assumption, the frequency response of the vocal tract is computed from its amplitude response.
- The synthesis process is the reverse of the analysis process and is not described here.
- The SVLN and GSS methods make pitch-modified speech sound more natural to some extent, but they also have drawbacks.
- First, the quality of the synthesized speech is easily affected by the accuracy of the source parameters.
- When the source parameters are estimated inaccurately, the synthesized speech will sound different from the input speech.
- In particular, when the recording environment and equipment of the input speech are not ideal, the source parameter estimates tend to have large errors, making the output of these methods less stable.
- Second, the glottal flow signal generated by the Liljencrants-Fant source model differs from the actual glottal flow signal, so these methods cannot accurately reconstruct the input speech, and speech synthesized with them sounds slightly sharp.
- The recently proposed HMPD method needs no source-model parameter estimation step and is therefore somewhat more robust. Based on the harmonic model, it first predicts the phase response of the vocal tract under the minimum-phase assumption, subtracts the vocal tract component from the harmonic phase vector to obtain the source phase of each harmonic, and finally computes the phase distortion of the source, a quantity whose computation and properties resemble group delay. When the fundamental frequency is modified, the phase distortion is first unwrapped and then re-interpolated at the new fundamental frequency.
- The drawback of this method is that phase unwrapping is error-prone; especially for high-pitched speech, the operation has a considerable chance of producing speech parameter sequences that are incoherent across frames.
- In addition, the method assumes that the amplitude response of the source is constant, so it cannot model the influence of the fundamental frequency on the glottal flow amplitude response.
- The present invention is based on the harmonic model and decomposes the harmonic model parameters to obtain source and vocal tract parameters. By exploiting the shape invariance of the glottal flow and retaining, for each harmonic, the difference between the source phase and the phase produced by the source model, the influence of source parameter estimation accuracy on synthesis quality is effectively reduced.
- The present invention proposes a speech analysis/synthesis method and a simplified form of the method.
- The method is based on the harmonic model: in the analysis stage the harmonic model parameters are decomposed into source features and vocal tract features, and in the synthesis stage the source and vocal tract features are recombined to generate harmonic model parameters.
- In the first analysis step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding the fundamental frequency and the amplitude and phase vectors of the harmonics at each analysis instant; the relative phase shift of each harmonic is computed from the harmonic phase vector.
- In the second step, the source characteristics of the input speech signal at each analysis instant are estimated, yielding the parameters of the source model.
- In the third step, the harmonic amplitude vector is divided by the amplitude response of the source and the lip radiation amplitude response to obtain the amplitude response of the vocal tract.
- In the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic.
- In the sixth step, the difference between the source phase vector corresponding to each harmonic obtained in the fifth step and the phase response of the source model obtained in the second step is computed, yielding the phase difference vector corresponding to each harmonic.
- In the synthesis stage, the frequency response of the source model is computed from the source model parameters and the fundamental frequency, comprising the amplitude response of the source model and the phase response of the source model.
- The phase response of the source model and the source phase difference vector corresponding to each harmonic are added to obtain the source phase vector corresponding to each harmonic.
- The amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic; the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic.
- The speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
- In the simplified form, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding the fundamental frequency and the amplitude and phase vectors of the harmonics at each analysis instant, and the relative phase shift of each harmonic is computed; in the second step, optionally, the source characteristics of the input speech signal at each analysis instant are estimated and the amplitude response of the source is computed.
- The frequency response of the source is then obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic.
- The amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic; the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic.
- The speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
- The present invention proposes a speech analysis/synthesis method and a simplified form of the method.
- The method is based on the harmonic model: in the analysis stage the harmonic model parameters are decomposed into source features and vocal tract features, and in the synthesis stage the source and vocal tract features are recombined to generate harmonic model parameters.
- The basic form of the proposed speech analysis/synthesis method is described below, starting from the analysis stage; its flow is shown in FIG. 1.
- In the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding, at each analysis instant, the fundamental frequency f0, the amplitude a_k of each harmonic, and the phase vector.
- The relative phase shift of each harmonic is computed from the harmonic phase vector (see Degottex, Gilles, and Daniel Erro. "A uniform phase representation for the harmonic model in speech synthesis applications." EURASIP Journal on Audio, Speech, and Music Processing 2014.1 (2014): 1-16.).
- The inventive point of the present invention lies in the processing of the harmonic model parameters; the specific fundamental frequency extraction and harmonic analysis methods employed are therefore not limited.
- Commonly used fundamental frequency extraction methods include YIN (de Cheveigné, Alain, and Hideki Kawahara. "YIN, a fundamental frequency estimator for speech and music." The Journal of the Acoustical Society of America 111.4 (2002): 1917-1930.)
- and SRH (Drugman, Thomas, and Abeer Alwan. "Joint robust voicing detection and pitch estimation based on residual harmonics." Interspeech. 2011.).
- Commonly used harmonic analysis methods include the peak-picking method (see McAulay, Robert J., and Thomas F. Quatieri. "Speech analysis/synthesis based on a sinusoidal representation." Acoustics, Speech and Signal Processing, IEEE Transactions on 34.4 (1986): 744-754.) and the least squares method (Stylianou, Ioannis. Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification. Diss. École Nationale Supérieure des Télécommunications, 1996.), among others.
- In the second step, the source characteristics of the input speech signal at each analysis instant are estimated, yielding the parameters of the source model.
- The frequency response of the source model is computed from its parameters, comprising the amplitude response of the source model and the phase response of the source model.
- The present invention is applicable to a variety of source models, so the source model and its parameter estimation method are not specifically limited.
- Here the widely used Liljencrants-Fant source model (hereafter the LF model)
- and the MSP estimation method (Degottex, Gilles, Axel Roebel, and Xavier Rodet. "Phase minimization for glottal model estimation." Audio, Speech, and Language Processing, IEEE Transactions on 19.5 (2011): 1080-1090.)
- are taken as an example.
- The specific parameter estimation steps are as follows:
- a. Generate a series of candidate LF model parameters.
- Taking the Rd parameter as an example, a candidate Rd parameter sequence stepping from 0.3 to 2.5 at intervals of 0.1 is generated, and the following operations are performed for each candidate Rd parameter:
- f. Generate a series of candidate offset phases.
- Here a candidate offset phase sequence stepping from −π to π at intervals of 0.1 is taken as an example.
- i. Optionally, the time-varying Rd parameter sequence obtained in the above steps is median-filtered.
- After the source model parameters are obtained, the source frequency response G_LF(ω_k) at each harmonic frequency is computed.
- In the third step, the amplitude response of the vocal tract is obtained: the harmonic amplitude vector is divided by the amplitude response of the source and the lip radiation amplitude response.
- The lip radiation frequency response is assumed to be L(ω) = jω, equivalent to a differentiator.
- Since the lip radiation frequency response is independent of the source and vocal tract characteristics, it can be merged into the source frequency response. Therefore, when computing the source frequency response in the third step, G_LF(ω_k) can be replaced by the frequency response of the glottal flow derivative, in which case the operation of this step simplifies accordingly.
- Optionally, spectral envelope estimation is first performed on the harmonic amplitude vector to obtain the spectral envelope |S(ω)| of the input speech,
- and the source amplitude responses at the harmonic frequencies are interpolated; the former spectral envelope is then divided by the latter.
- In the fourth step, the phase response of the vocal tract is computed from its amplitude response. Because the frequency response of the vocal tract can be roughly modeled as an all-pole filter, it can be assumed to have the minimum-phase property. Under this assumption, the phase response arg(V(ω_k)) of the vocal tract can be computed by homomorphic filtering; for the specific method see Lim, Jae S., and Alan V. Oppenheim. Advanced topics in signal processing. Prentice-Hall, Inc., 1987.
- In the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic.
- The source amplitude vector reuses |G_LF(ω_k)| obtained in the second step;
- the source phase vector is computed by spectral division, from the offset-removed harmonic phase vector and the phase response of the vocal tract.
- In the sixth step, the difference between the source phase vector corresponding to each harmonic obtained in the fifth step and the phase response of the source model obtained in the second step is computed, yielding the phase difference vector corresponding to each harmonic.
- In the synthesis stage, the phase response of the vocal tract, arg(V(ω_k)) or arg(V(ω)), is first computed from its amplitude response
- (the specific computation is the same as in the fourth step of the analysis stage). If the phase response arg(V(ω)) is computed from an amplitude response spectrum defined at arbitrary frequencies, it must be sampled at each harmonic frequency to obtain arg(V(ω_k)).
- In the second step, the frequency response G_LF(ω_k) of the source model is computed from the source model parameters and the fundamental frequency, comprising the amplitude response of the source model and the phase response of the source model.
- The specific method is the same as step b in the second step of the analysis stage.
- In the third step, the phase response arg(G_LF(ω_k)) of the source model and the source phase difference vector Δφ_k corresponding to each harmonic are added, yielding the source phase vector arg(G(ω_k)) corresponding to each harmonic.
- In the fourth step, the amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic; the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic.
- In the fifth step, the speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
- The present invention does not specifically limit the harmonic model synthesis method used. Common synthesis methods can be found in McAulay, Robert J., and Thomas F. Quatieri. "Speech analysis/synthesis based on a sinusoidal representation." Acoustics, Speech and Signal Processing, IEEE Transactions on 34.4 (1986): 744-754.
- When the source parameters need not be modified, the analysis/synthesis method of the present invention has a simplified form. This simplified form does not depend on a particular source model, so the source model parameter estimation step can be omitted. As shown in FIG. 2, the analysis stage of the simplified form proceeds as follows:
- in the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding, at each analysis instant, the fundamental frequency f0, the amplitude a_k of each harmonic, and the phase vector, and the relative phase shift of each harmonic is computed from the harmonic phase vector;
- in the second step, optionally, the source characteristics of the input speech signal at each analysis instant are estimated and the amplitude response of the source is computed.
- The estimation of the source characteristics in this step is not necessarily based on a specific source model; any technique capable of estimating the source amplitude response may be used.
- The method used to estimate the source amplitude response is not specifically limited in the present invention.
- Taking a linear prediction method based on an all-pole model as an example, the speech at each analysis instant is windowed, and the coefficients of a second-order all-pole filter are computed by linear prediction.
- The amplitude response is computed from the coefficients of the all-pole filter.
- The amplitude response obtained by this method is approximately the product of the amplitude response of the source and the amplitude response of the lip radiation. Since the lip radiation frequency response is independent of the source and vocal tract characteristics, this amplitude response can be merged into the source amplitude response.
- In the third step, if the amplitude response of the source is unknown, it is assumed to be constant (i.e., |G(ω)| = 1)
- and the amplitude response of the vocal tract is defined as the harmonic amplitude vector; if the amplitude response of the source is known, the harmonic amplitude vector is divided by the amplitude response of the source to obtain the amplitude response of the vocal tract.
- Optionally, spectral envelope estimation is performed on the harmonic amplitude vector to obtain the spectral envelope of the input speech, which is then divided by the amplitude response of the source.
- The vocal tract amplitude response obtained in this way is a function defined at arbitrary frequencies, not only the amplitude responses at the harmonic frequencies.
- In the fourth step, the phase response arg(V(ω)) of the vocal tract is computed from its amplitude response.
- The specific method is the same as step b in the second step of the analysis stage of the basic form of the method.
- In the fifth step, the frequency response of the source is obtained: the source amplitude vector has already been obtained in the second step, and the source phase vector is obtained by subtracting the phase response of the vocal tract from the harmonic phase vector.
- In the synthesis stage of the simplified form, the phase response of the vocal tract, arg(V(ω_k)) or arg(V(ω)), is computed from its amplitude response.
- The specific computation is the same as step b in the second step of the analysis stage of the basic form of the method. If the phase response arg(V(ω)) is computed from a continuous amplitude response spectrum, it must be interpolated to obtain arg(V(ω_k)).
- The amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic; the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic.
- The speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
- The present invention does not specifically limit the harmonic model synthesis method used.
- The basic form of the speech analysis/synthesis technique of the present invention is suitable for applications that include modification of the source parameters; the simplified form of the technique is suitable for applications that do not.
- The basic form records the difference between the phase response of the source model and the source phase obtained by frequency-domain inverse filtering, and associates this phase difference with each harmonic, so that the phase characteristics of the input speech are better preserved and the effect of source model parameter estimation errors on synthesized speech quality is mitigated.
- The simplified form of the technique is based on the shape-invariance assumption of the glottal flow and associates the source characteristics with the individual harmonics, without explicit source model parameters or a parameter estimation step.
- The simplified form completely avoids the problem of source model parameter estimation errors, greatly simplifies the analysis and synthesis steps, and improves runtime efficiency.
- The speech analysis/synthesis technique of the present invention can also be applied to the sinusoidal model, the harmonic plus noise model (Harmonic+Noise Model), the harmonic plus stochastic model (Harmonic+Stochastic Model), and the like.
- Adapting the method of the present invention to the above models is common knowledge to those skilled in the art and is not specifically described.
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Circuit For Audible Band Transducer (AREA)
- Measurement Of Mechanical Vibrations Or Ultrasonic Waves (AREA)
Abstract
The present invention proposes a speech analysis/synthesis method and a simplified form of the method. The method is based on the harmonic model: in the analysis stage the parameters of the harmonic model are decomposed into source features and vocal tract features, and in the synthesis stage the source and vocal tract features are recombined to generate harmonic model parameters.
Description
Technical Field
[0001] The present invention relates to the field of speech synthesis, and in particular to the subfields of speech analysis/synthesis and speech coding.
Background Art
[0002] Speech analysis/synthesis technology analyzes a speech signal to obtain an intermediate representation and then re-synthesizes speech from the analysis result. By modifying the intermediate data obtained from analysis, characteristics of the speech such as fundamental frequency, duration, and timbre can be changed.
[0003] In speech synthesis and audio processing applications, speech analysis/synthesis systems are an important component. To modify speech parameters flexibly, such applications often require a parametric, high-quality speech analysis/synthesis method.
[0004] A commonly used speech analysis/synthesis approach is based on the source-filter model. The model describes the human voice production system as a pulse train signal driving a series of cascaded filters, including a glottal flow filter, a vocal tract filter, and a lip radiation filter. The pulse train signal is a series of unit impulses separated by the fundamental period.
[0005] A simplified form of the source-filter model is widely used in speech analysis/synthesis techniques. This simplified form merges the glottal flow filter and the lip radiation filter into the vocal tract filter. Speech analysis/synthesis methods designed on this simplified model include PSOLA (pitch-synchronous overlap-add), STRAIGHT, and MLSA (Mel log spectrum approximation filter).
[0006] When the speech fundamental frequency is modified, this simplified form of the source-filter model exposes certain defects. The glottal flow is the velocity of the air flowing through the glottis and reflects the degree to which the glottis is open. Because the fundamental frequency determines how often the glottis opens and closes, the duration of the unit impulse response of the glottal flow filter should equal the fundamental period; the shape of the glottal flow pulse is approximately the same at different fundamental frequencies, but its period length changes with the fundamental frequency. In the simplified form of the source-filter model, however, the glottal flow filter is merged into the vocal tract filter, so the frequency response of the glottal flow filter is assumed to be independent of the fundamental frequency. This assumption is inconsistent with the mechanism of voice production, and therefore, after the fundamental frequency parameters are modified, speech analysis/synthesis methods based on the simplified model often cannot produce natural speech.
[0007] To overcome the above drawbacks, several new speech analysis/synthesis techniques have been proposed in recent years, such as the SVLN (Degottex, Gilles, et al. "Mixed source model and its adapted vocal tract filter estimate for voice transformation and synthesis." Speech Communication 55.2 (2013): 278-294.) and GSS (Cabral, Joao P., et al. "Glottal spectral separation for speech synthesis." Selected Topics in Signal Processing, IEEE Journal of 8.2 (2014): 195-208.) methods. These methods model the glottal flow and the vocal tract separately. Since the characteristics of the lip radiation filter are close to a differentiator, that filter is merged into the glottal flow, forming the glottal flow derivative, which is represented by the Liljencrants-Fant source model. During analysis, the parameters of the source model are first estimated; the amplitude spectrum of the speech is then divided by the amplitude response of the source model, and spectral envelope estimation is performed to obtain the amplitude response of the vocal tract. Finally, under the minimum-phase assumption, the frequency response of the vocal tract is computed from its amplitude response. The synthesis process is the reverse of the analysis process and is not described here.
[0008] The SVLN and GSS methods make pitch-modified speech sound more natural to some extent, but they also have drawbacks. First, the quality of the synthesized speech is easily affected by the accuracy of the source parameters; when the source parameters are estimated inaccurately, the synthesized speech will sound different from the input speech. In particular, when the recording environment and equipment of the input speech are not ideal, the source parameter estimates tend to have large errors, making the output of these methods less stable. Second, the glottal flow signal generated by the Liljencrants-Fant source model differs from the actual glottal flow signal, so these methods cannot accurately reconstruct the input speech, and speech synthesized with them sounds slightly sharp.
[0009] The recently proposed HMPD (Degottex, Gilles, and Daniel Erro. "A uniform phase representation for the harmonic model in speech synthesis applications." EURASIP Journal on Audio, Speech, and Music Processing 2014.1 (2014): 1-16.) speech analysis/synthesis method needs no source-model parameter estimation step and is therefore somewhat more robust. Based on the harmonic model, it first predicts the phase response of the vocal tract under the minimum-phase assumption in the analysis stage, then subtracts the vocal tract component from the harmonic phase vector to obtain the source phase of each harmonic, and finally computes the phase distortion of the source, a quantity whose computation and properties resemble group delay. When the fundamental frequency is modified, the phase distortion is first unwrapped and then the harmonics' phase distortion is re-interpolated at the new fundamental frequency. The drawback of this method is that phase unwrapping is error-prone; especially for high-pitched speech, the operation has a considerable chance of producing speech parameter sequences that are incoherent across frames. In addition, the method assumes that the amplitude response of the source is constant, so it cannot model the influence of the fundamental frequency on the glottal flow amplitude response.
[0010] The present invention is based on the harmonic model and decomposes the harmonic model parameters to obtain source and vocal tract parameters. By exploiting the shape invariance of the glottal flow and retaining, for each harmonic, the difference between the source phase and the phase produced by the source model, the influence of the accuracy of source parameter estimation on synthesis quality is effectively reduced. A simplified form of the method models the source characteristics implicitly, without relying on a specific parametric source model, thereby simplifying the speech analysis and synthesis steps. The method and its variants require no phase unwrapping operation and thus avoid the problem of incoherent speech parameters. When the speech parameters are not modified, the method and its simplified form introduce no harmonic amplitude or phase errors and can accurately restore the harmonic model parameters.
Summary of the Invention
[0011] The present invention proposes a speech analysis/synthesis method and a simplified form of the method. The method is based on the harmonic model: in the analysis stage the parameters of the harmonic model are decomposed into source features and vocal tract features, and in the synthesis stage the source and vocal tract features are recombined to generate harmonic model parameters.
[0012] In the basic form of the proposed speech analysis/synthesis method, the analysis stage proceeds as follows:
[0013] In the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding the fundamental frequency and the amplitude and phase vectors of the harmonics at each analysis instant. The relative phase shift of each harmonic is computed from the harmonic phase vector;
[0014] in the second step, the source characteristics of the input speech signal at each analysis instant are estimated, yielding the parameters of the source model. The frequency response of the source model is computed from its parameters, comprising the amplitude response of the source model and the phase response of the source model;
[0015] in the third step, the harmonic amplitude vector is divided by the amplitude response of the source and the lip radiation amplitude response to obtain the amplitude response of the vocal tract;
[0016] in the fourth step, the phase response of the vocal tract is computed from its amplitude response;
[0017] in the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic;
[0018] in the sixth step, the difference between the source phase vector corresponding to each harmonic obtained in the fifth step and the phase response of the source model obtained in the second step is computed, yielding the phase difference vector corresponding to each harmonic.
[0019] In the basic form of the proposed speech analysis/synthesis method, the synthesis stage proceeds as follows:
[0020] In the first step, the phase response of the vocal tract is computed from its amplitude response;
[0021] in the second step, the frequency response of the source model is computed from the source model parameters and the fundamental frequency, comprising the amplitude response of the source model and the phase response of the source model;
[0022] in the third step, the phase response of the source model and the source phase difference vector corresponding to each harmonic are added, yielding the source phase vector corresponding to each harmonic;
[0023] in the fourth step, the amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic, and the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic;
[0024] in the fifth step, the speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
[0025] In the simplified form of the proposed speech analysis/synthesis method, the analysis stage proceeds as follows:
[0026] In the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding the fundamental frequency and the amplitude and phase vectors of the harmonics at each analysis instant. The relative phase shift of each harmonic is computed from the harmonic phase vector;
[0027] in the second step, optionally, the source characteristics of the input speech signal at each analysis instant are estimated, and the amplitude response of the source is computed;
[0028] in the third step, the amplitude response of the vocal tract is computed from the harmonic amplitude vector and the optional source amplitude response;
[0029] in the fourth step, the phase response of the vocal tract is computed from its amplitude response;
[0030] in the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic.
[0031] In the simplified form of the proposed speech analysis/synthesis method, the synthesis stage proceeds as follows:
[0032] In the first step, the phase response of the vocal tract is computed from its amplitude response;
[0033] in the second step, the amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic, and the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic;
[0034] in the third step, the speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics.
Detailed Description
[0035] The present invention proposes a speech analysis/synthesis method and a simplified form of the method. The method is based on the harmonic model: in the analysis stage the parameters of the harmonic model are decomposed into source features and vocal tract features, and in the synthesis stage the source and vocal tract features are recombined to generate harmonic model parameters. The basic form of the proposed speech analysis/synthesis method is described below, starting from the analysis stage; its flow is shown in FIG. 1.
[0036] In the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding, at each analysis instant, the fundamental frequency f0, the amplitude a_k of each harmonic, and the phase vector θ_k. The relative phase shift of each harmonic is computed from the harmonic phase vector (see Degottex, Gilles, and Daniel Erro. "A uniform phase representation for the harmonic model in speech synthesis applications." EURASIP Journal on Audio, Speech, and Music Processing 2014.1 (2014): 1-16.):

φ_k = θ_k − (k+1)θ_0
[0037] The inventive point of the present invention lies in the processing of the harmonic model parameters; the specific fundamental frequency extraction and harmonic analysis methods employed are therefore not limited. Commonly used fundamental frequency extraction methods include YIN (de Cheveigné, Alain, and Hideki Kawahara. "YIN, a fundamental frequency estimator for speech and music." The Journal of the Acoustical Society of America 111.4 (2002): 1917-1930.) and SRH (Drugman, Thomas, and Abeer Alwan. "Joint robust voicing detection and pitch estimation based on residual harmonics." Interspeech. 2011.). Commonly used harmonic analysis methods include the peak-picking method (see McAulay, Robert J., and Thomas F. Quatieri. "Speech analysis/synthesis based on a sinusoidal representation." Acoustics, Speech and Signal Processing, IEEE Transactions on 34.4 (1986): 744-754.) and the least squares method (Stylianou, Ioannis. Harmonic plus noise models for speech, combined with statistical methods, for speech and speaker modification. Diss. École Nationale Supérieure des Télécommunications, 1996.), among others.
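For illustration, a minimal numpy sketch of this first-step computation, assuming the harmonic index k starts at 0 at the fundamental (so φ_0 = 0 by construction):

```python
import numpy as np

def relative_phase_shift(theta):
    """Relative phase shift phi_k = theta_k - (k+1)*theta_0, wrapped to (-pi, pi].
    theta: phases of harmonics 0..K-1 at one analysis instant; theta[0] is the
    phase of the fundamental and plays the role of theta_0 in the formula above."""
    k = np.arange(len(theta))
    phi = theta - (k + 1) * theta[0]
    return np.angle(np.exp(1j * phi))  # wrap to (-pi, pi]
```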
[0038] In the second step, the source characteristics of the input speech signal at each analysis instant are estimated, yielding the parameters of the source model. The frequency response of the source model is computed from its parameters, comprising the amplitude response of the source model and the phase response of the source model. The present invention is applicable to a variety of source models, so the source model employed and its parameter estimation method are not specifically limited. Here the widely used Liljencrants-Fant source model (hereafter the LF model) and the MSP (Degottex, Gilles, Axel Roebel, and Xavier Rodet. "Phase minimization for glottal model estimation." Audio, Speech, and Language Processing, IEEE Transactions on 19.5 (2011): 1080-1090.) parameter estimation method are taken as an example. The specific parameter estimation steps are as follows:
[0039] a. Generate a series of candidate LF model parameters. Taking the Rd parameter as an example, a candidate Rd parameter sequence stepping from 0.3 to 2.5 at intervals of 0.1 is generated. For each candidate Rd parameter, perform the following operations:
[0040] b. Generate the te, tp, and ta parameters of the LF model from the Rd parameter, and compute the frequency response G_Rd(ω_k) of the LF model at each harmonic frequency from the fundamental frequency and the te, tp, ta parameters (for the specific method see Fant, Gunnar, Johan Liljencrants, and Qi-guang Lin. "A four-parameter model of glottal flow." STL-QPSR 4.1985 (1985): 1-13, and Doval, Boris, Christophe d'Alessandro, and Nathalie Henrich. "The spectrum of glottal flow models." Acta acustica united with acustica 92.6 (2006): 1026-1046.);
[0041] c. Multiply the frequency response G_Rd(ω_k) of the LF model at each harmonic frequency by a linear phase function so that, according to the te parameter, it is time-aligned to the instant of maximum excitation;
[0042] d. Divide the complex amplitude of each harmonic by the time-aligned source frequency response to obtain the vocal tract frequency response V(ω_k) at each harmonic frequency;
[0043] e. From the amplitude component |V(ω_k)| of the vocal tract frequency response at each harmonic frequency, compute the minimum-phase frequency response V_min(ω_k) of the vocal tract using homomorphic filtering; for the specific method see Lim, Jae S., and Alan V. Oppenheim. Advanced topics in signal processing. Prentice-Hall, Inc., 1987;
[0044] f. Generate a series of candidate offset phases. Here a candidate offset phase sequence stepping from −π to π at intervals of 0.1 is taken as an example.
[0045] g. For each candidate offset phase, compute the Euclidean distance between the phase components of the phase-shifted V(ω_k) and of V_min(ω_k):

E = (1/K) Σ_{k=0..K−1} wrap(Δθ·(k+1) + arg(V(ω_k)) − arg(V_min(ω_k)))²

where wrap(·) is the phase wrapping function, K is the number of harmonics, and Δθ is the offset phase.
[0046] h. Select the Rd parameter that minimizes E as the LF model parameter at this analysis instant;
[0047] i. Optionally, to obtain a smooth Rd parameter curve, apply median filtering to the time-varying Rd parameter sequence obtained in the above steps.
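The nested grid search of steps a through h can be sketched as follows; `lf_response` and `min_phase_response` are assumed helper functions standing in for steps b-c and step e respectively (they are not specified here), and the Rd and offset-phase grids follow the example values above:

```python
import numpy as np

def wrap(x):
    """Wrap phase to (-pi, pi]."""
    return np.angle(np.exp(1j * x))

def estimate_rd(harm_spec, omega, f0, lf_response, min_phase_response):
    """Grid search over candidate Rd values and offset phases (steps a-h).
    harm_spec: complex harmonic amplitudes a_k * exp(j*theta_k)."""
    K = len(omega)
    k = np.arange(K)
    best_rd, best_err = None, np.inf
    for rd in np.arange(0.3, 2.5 + 1e-9, 0.1):          # step a
        G = lf_response(rd, omega, f0)                  # steps b-c (assumed helper)
        V = harm_spec / G                               # step d
        Vmin = min_phase_response(np.abs(V))            # step e (assumed helper)
        for dtheta in np.arange(-np.pi, np.pi, 0.1):    # step f
            E = np.mean(wrap(dtheta * (k + 1)           # step g: distance E
                             + np.angle(V) - np.angle(Vmin)) ** 2)
            if E < best_err:                            # step h
                best_rd, best_err = rd, E
    return best_rd
```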
[0048] After the source model parameters are obtained, the source frequency response G_LF(ω_k) at each harmonic frequency is computed.
[0049] Here the lip radiation frequency response is assumed to be L(ω) = jω, equivalent to a differentiator.
[0050] Since the lip radiation frequency response is independent of the source and vocal tract characteristics, it can be merged into the source frequency response. Therefore, when computing the source frequency response in the third step, G_LF(ω_k) can be replaced by the frequency response of the glottal flow derivative, in which case the operation of this step simplifies accordingly.
[0051] Optionally, spectral envelope estimation is first performed on the harmonic amplitude vector to obtain the spectral envelope |S(ω)| of the input speech, and the source amplitude responses |G_LF(ω_k)| at the harmonic frequencies are interpolated; the former spectral envelope is then divided by the latter. The vocal tract amplitude response obtained in this way is a function defined at arbitrary frequencies, not only the amplitude responses at the harmonic frequencies: |V(ω)| = |S(ω)| / |G_LF(ω)|.
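In code, this optional step is one interpolation and an element-wise division; a sketch assuming `freq_grid` is the uniform frequency grid of the envelope |S(ω)| (`env_S`) and `mag_Glf` holds |G_LF(ω_k)| at the harmonic frequencies `omega_k`:

```python
import numpy as np

mag_Glf_env = np.interp(freq_grid, omega_k, mag_Glf)  # interpolate |G_LF| over the grid
mag_V_env = env_S / np.maximum(mag_Glf_env, 1e-10)    # |V(w)| = |S(w)| / |G_LF(w)|
```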
[0052] In the fourth step, the phase response of the vocal tract is computed from its amplitude response. Because the frequency response of the vocal tract can be roughly modeled as an all-pole filter, it can be assumed to have the minimum-phase property. Under this assumption, the phase response arg(V(ω_k)) of the vocal tract can be computed using homomorphic filtering; for the specific method see Lim, Jae S., and Alan V. Oppenheim. Advanced topics in signal processing. Prentice-Hall, Inc., 1987;
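A sketch of the homomorphic (real-cepstrum) computation of a minimum-phase phase response from an amplitude response, assuming the response has been sampled on a uniform grid of n_fft//2 + 1 frequencies from 0 to Nyquist:

```python
import numpy as np

def minimum_phase_response(mag, n_fft=2048):
    """Phase response of the minimum-phase system with amplitude response `mag`
    (homomorphic filtering). mag: n_fft//2 + 1 points from 0 to Nyquist."""
    log_mag = np.log(np.maximum(mag, 1e-10))
    cep = np.fft.irfft(log_mag, n_fft)        # real cepstrum of the log amplitude
    fold = np.zeros(n_fft)                    # fold onto positive quefrencies
    fold[0] = cep[0]
    fold[1:n_fft // 2] = 2.0 * cep[1:n_fft // 2]
    fold[n_fft // 2] = cep[n_fft // 2]
    return np.imag(np.fft.rfft(fold))         # arg V_min on the same grid
```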
[0053] In the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic. The source amplitude vector reuses |G_LF(ω_k)| obtained in the second step; the source phase vector is computed by spectral division, from the offset-removed harmonic phase vector and the phase response of the vocal tract;
[0054] In the sixth step, the difference between the source phase vector corresponding to each harmonic obtained in the fifth step and the phase response of the source model obtained in the second step is computed, yielding the phase difference vector corresponding to each harmonic:

Δφ_k = arg(G(ω_k)) − arg(G_LF(ω_k))
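In array form, the fifth and sixth steps are two wrapped subtractions; a sketch assuming `phi` is the offset-removed harmonic phase vector, `arg_V` and `arg_Glf` the vocal tract and LF-model phase responses sampled at the harmonic frequencies, and `wrap` as defined in the sketch above:

```python
arg_G = wrap(phi - arg_V)       # fifth step: source phase vector by spectral division
dphi = wrap(arg_G - arg_Glf)    # sixth step: per-harmonic phase difference vector
```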
[0055] In the basic form of the proposed speech analysis/synthesis method, as shown in FIG. 3, the synthesis stage proceeds as follows:
[0056] In the first step, the phase response arg(V(ω_k)) or arg(V(ω)) of the vocal tract is computed from its amplitude response |V(ω_k)| or |V(ω)|. The specific computation is the same as in the fourth step of the analysis stage. If the phase response arg(V(ω)) is computed from an amplitude response spectrum |V(ω)| defined at arbitrary frequencies, it must be sampled at each harmonic frequency to obtain arg(V(ω_k));
[0057] in the second step, the frequency response G_LF(ω_k) of the source model is computed from the source model parameters and the fundamental frequency, comprising the amplitude response of the source model and the phase response of the source model. The specific method is the same as step b in the second step of the analysis stage;
[0058] in the third step, the phase response arg(G_LF(ω_k)) of the source model and the source phase difference vector Δφ_k corresponding to each harmonic are added, yielding the source phase vector arg(G(ω_k)) corresponding to each harmonic:

arg(G(ω_k)) = arg(G_LF(ω_k)) + Δφ_k

[0059] In the fourth step, the amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic, and the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic;
[0060] in the fifth step, the speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics. The present invention does not specifically limit the harmonic model synthesis method used; common synthesis methods can be found in McAulay, Robert J., and Thomas F. Quatieri. "Speech analysis/synthesis based on a sinusoidal representation." Acoustics, Speech and Signal Processing, IEEE Transactions on 34.4 (1986): 744-754.
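For reference, a minimal bank-of-oscillators rendering of one voiced frame; a sketch only, since practical systems interpolate parameters between analysis instants or use overlap-add as in the McAulay-Quatieri method cited above:

```python
import numpy as np

def synthesize_frame(f0, amp, phase, fs, n):
    """Synthesize n samples from the fundamental frequency f0 (Hz) and the
    per-harmonic amplitudes and phases; harmonic k has frequency (k+1)*f0."""
    t = np.arange(n) / fs
    y = np.zeros(n)
    for k, (a, ph) in enumerate(zip(amp, phase)):
        y += a * np.cos(2.0 * np.pi * (k + 1) * f0 * t + ph)
    return y
```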
[0061] To modify the fundamental frequency of speech using the above analysis/synthesis method, it suffices to resample the analyzed vocal tract amplitude response at the new harmonic frequency intervals, or to build a spectral envelope with a spectral envelope estimation algorithm and resample it at the new harmonic frequency intervals, and then to recompute the vocal tract phase response at each harmonic frequency under the minimum-phase assumption; the source phase difference vector need not be changed.
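A sketch of this pitch-modification step, reusing the `minimum_phase_response` sketch above; `mag_env` is the vocal tract amplitude envelope on a uniform frequency grid `freq_grid` (Hz), and the stored phase difference vector Δφ_k is reused unchanged:

```python
import numpy as np

def resample_vocal_tract(freq_grid, mag_env, f0_new, fs):
    """Resample |V| at the new harmonic frequencies and recompute its
    phase response under the minimum-phase assumption."""
    n_harm = int(fs / 2.0 / f0_new) - 1                 # harmonics below Nyquist
    harm_freqs = f0_new * np.arange(1, n_harm + 1)
    mag_k = np.interp(harm_freqs, freq_grid, mag_env)
    arg_env = minimum_phase_response(mag_env, 2 * (len(mag_env) - 1))
    arg_k = np.interp(harm_freqs, freq_grid, arg_env)
    return mag_k, arg_k
```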
[0062] Since the overall shape of the glottal flow remains unchanged when the fundamental frequency changes, the analysis/synthesis method of the present invention has a simplified form for the case where the source parameters need not be modified. This simplified form does not depend on a particular source model, so the source model parameter estimation step can be omitted. As shown in FIG. 2, the analysis stage of the simplified form proceeds as follows:
[0063] In the first step, fundamental frequency extraction and harmonic analysis are performed on the input speech signal, yielding, at each analysis instant, the fundamental frequency f0, the amplitude a_k of each harmonic, and the phase vector θ_k. The relative phase shift of each harmonic is computed from the harmonic phase vector:

φ_k = θ_k − (k+1)θ_0
[0064] In the second step, optionally, the source characteristics of the input speech signal at each analysis instant are estimated, and the amplitude response |G(ω)| of the source is computed;
[0065] the estimation of the source characteristics in this step is not necessarily based on a specific source model; any technique capable of estimating the source amplitude response may be used. The method used to estimate the source amplitude response is not specifically limited in the present invention.
[0066] Taking a linear prediction method based on an all-pole model as an example, the speech at each analysis instant is windowed, and the coefficients of a second-order all-pole filter are computed by linear prediction. The amplitude response is computed from the coefficients of the all-pole filter.
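A sketch of this optional second step, fitting the second-order all-pole filter by the autocorrelation method (Levinson-Durbin) and evaluating its amplitude response at the normalized harmonic frequencies `omega` (radians per sample):

```python
import numpy as np

def source_amplitude_response(frame, omega):
    """|G| from a 2nd-order all-pole (LPC) fit to one windowed analysis frame;
    approximates the product of the glottal source and lip radiation responses."""
    x = frame * np.hanning(len(frame))
    full = np.correlate(x, x, mode="full")
    r0, r1, r2 = full[len(x) - 1], full[len(x)], full[len(x) + 1]
    k1 = r1 / r0                            # Levinson-Durbin, order 1
    e1 = r0 * (1.0 - k1 ** 2)
    k2 = (r2 - k1 * r1) / e1                # order-2 reflection coefficient
    a1, a2 = k1 * (1.0 - k2), k2            # predictor coefficients
    gain = np.sqrt(e1 * (1.0 - k2 ** 2))    # residual gain
    A = 1.0 - a1 * np.exp(-1j * omega) - a2 * np.exp(-2j * omega)
    return gain / np.abs(A)
```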
[0067] The amplitude response obtained by the above method is approximately the product of the amplitude response of the source and the amplitude response of the lip radiation. Since the lip radiation frequency response is independent of the source and vocal tract characteristics, this amplitude response can be merged into the source amplitude response.
[0068] In the third step, the amplitude response |V(ω_k)| or |V(ω)| of the vocal tract is obtained;
[0069] if the amplitude response of the source is unknown, it is assumed to be constant (i.e., |G(ω)| = 1), and the amplitude response of the vocal tract is defined as the harmonic amplitude vector; if the amplitude response of the source is known, the harmonic amplitude vector is divided by the amplitude response of the source to obtain the amplitude response of the vocal tract;
[0070] optionally, spectral envelope estimation is first performed on the harmonic amplitude vector to obtain the spectral envelope |S(ω)| of the input speech, which is then divided by the amplitude response of the source. The vocal tract amplitude response obtained in this way is a function defined at arbitrary frequencies, not only the amplitude responses at the harmonic frequencies: |V(ω)| = |S(ω)| / |G(ω)|.
[0071] In the fourth step, the phase response arg(V(ω)) of the vocal tract is computed from its amplitude response. The specific method is the same as step b in the second step of the analysis stage of the basic form of the method;
[0072] in the fifth step, the frequency response of the source is obtained, comprising the source amplitude vector and the source phase vector corresponding to each harmonic. Specifically, the source amplitude vector has already been obtained in the second step, and the source phase vector is obtained by subtracting the phase response of the vocal tract from the harmonic phase vector.
[0073] In the simplified form of the speech analysis/synthesis technique of the present invention, as shown in FIG. 4, the synthesis stage proceeds as follows:
[0074] In the first step, the phase response arg(V(ω_k)) or arg(V(ω)) of the vocal tract is computed from its amplitude response |V(ω_k)| or |V(ω)|. The specific computation is the same as step b in the second step of the analysis stage of the basic form of the method. If the phase response arg(V(ω)) is computed from a continuous amplitude response spectrum |V(ω)|, the phase response must be interpolated to obtain arg(V(ω_k));
[0075] in the second step, the amplitude response of the vocal tract and the source amplitude vector corresponding to each harmonic are multiplied to obtain the amplitude of each harmonic, and the phase response of the vocal tract and the source phase vector corresponding to each harmonic are added to obtain the phase of each harmonic:

φ_k = arg(V(ω_k)) + arg(G(ω_k))
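As an array-level sketch of this recombination (`mag_V`, `arg_V` are the vocal tract amplitude and phase responses sampled at the harmonic frequencies; `mag_G`, `arg_G` the per-harmonic source vectors from the analysis stage; `wrap` as in the earlier sketch):

```python
amp = mag_V * mag_G            # harmonic amplitudes a_k = |V(w_k)| * |G(w_k)|
phase = wrap(arg_V + arg_G)    # harmonic phases, as in the formula above
```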
[0076] In the third step, the speech signal is synthesized from the fundamental frequency and the amplitudes and phases of the harmonics. The present invention does not specifically limit the harmonic model synthesis method used.
[0077] The basic form of the speech analysis/synthesis technique of the present invention is suitable for applications that include modification of the source parameters; the simplified form of the technique is suitable for applications that do not.
[0078] The basic form of the speech analysis/synthesis technique of the present invention records the difference between the phase of the source model and the source phase obtained by frequency-domain inverse filtering, and associates this phase difference with each harmonic, so that the phase characteristics of the input speech are better preserved and the effect of source model parameter estimation errors on the quality of the synthesized speech is mitigated. The simplified form of the technique is based on the shape-invariance assumption of the glottal flow and associates the source characteristics with the individual harmonics, without explicit source model parameters or a parameter estimation step. The simplified form completely avoids the problem of source model parameter estimation errors, greatly simplifies the analysis and synthesis steps, and improves runtime efficiency.
[0079] The speech analysis/synthesis technique of the present invention is also applicable to the sinusoidal model (Sinusoidal Model), the harmonic plus noise model (Harmonic+Noise Model), the harmonic plus stochastic model (Harmonic+Stochastic Model), and the like. Adapting the method of the present invention to the above models is common knowledge to those skilled in the art and is not specifically described.
Claims
1. A speech analysis method based on the harmonic model (Harmonic Model), characterized in that the parameters of the harmonic model are decomposed into source and vocal tract features, wherein the source features comprise the source model parameters and the phase difference corresponding to each harmonic. The analysis method comprises the following steps:
a) performing harmonic analysis on the input speech signal to obtain, at each analysis instant, the fundamental frequency, the harmonic amplitude vector, and the harmonic phase vector;
b) estimating the source characteristics of the input speech signal at each analysis instant to obtain the parameters of the source model, and computing the frequency response of the source model from its parameters, comprising the amplitude response of the source model and the phase response of the source model;
c) obtaining the amplitude response of the vocal tract: dividing the harmonic amplitude vector by the amplitude response of the source to obtain the amplitude response of the vocal tract;
d) computing the phase response of the vocal tract from its amplitude response, by methods including but not limited to homomorphic filtering under the minimum-phase assumption;
e) obtaining the frequency response of the source, comprising the source amplitude vector and the source phase vector corresponding to each harmonic, wherein the source amplitude vector has been obtained in step b, and the source phase vector is obtained by subtracting the phase response of the vocal tract from the harmonic phase vector;
f) computing the difference between the source phase vector corresponding to each harmonic obtained in step e and the phase response of the source model obtained in step b, yielding the phase difference vector corresponding to each harmonic.
2. A speech analysis method based on the harmonic model, characterized in that the parameters of the harmonic model are decomposed into source and vocal tract features, wherein the source features comprise the amplitude vector and the phase vector corresponding to each harmonic. The analysis method comprises the following steps:
a) performing harmonic analysis on the input speech signal to obtain, at each analysis instant, the fundamental frequency, the harmonic amplitude vector, and the harmonic phase vector;
b) estimating the source characteristics of the input speech signal at each analysis instant and computing the amplitude response of the source;
c) obtaining the amplitude response of the vocal tract: dividing the harmonic amplitude vector by the amplitude response of the source to obtain the amplitude response of the vocal tract;
d) computing the phase response of the vocal tract from its amplitude response, by methods including but not limited to homomorphic filtering under the minimum-phase assumption;
e) obtaining the frequency response of the source, comprising the source amplitude vector and the source phase vector corresponding to each harmonic, wherein the source amplitude vector has been obtained in step b, and the source phase vector is obtained by subtracting the phase response of the vocal tract from the harmonic phase vector.
3. A speech synthesis method based on the harmonic model, characterized in that the decomposed source and vocal tract features are recombined and converted into parameters suitable for the harmonic model, wherein the source features comprise the source model parameters and the source phase difference vector corresponding to each harmonic, and the vocal tract features comprise the amplitude response of the vocal tract. The synthesis method comprises the following steps:
a) computing the phase response of the vocal tract from its amplitude response, by methods including but not limited to homomorphic filtering under the minimum-phase assumption;
b) computing the frequency response of the source model from the source model parameters, comprising the amplitude response of the source model and the phase response of the source model;
c) adding the phase response of the source model and the source phase difference vector corresponding to each harmonic to obtain the source phase vector corresponding to each harmonic;
d) multiplying the amplitude response of the vocal tract at each harmonic frequency by the source amplitude response to obtain the amplitude of each harmonic, and adding the phase response of the vocal tract at each harmonic frequency and the source phase vector corresponding to each harmonic to obtain the phase of each harmonic;
e) synthesizing the speech signal from the fundamental frequency and the amplitudes and phases of the harmonics.
4. A speech synthesis method based on the harmonic model, characterized in that the decomposed source and vocal tract features are recombined and converted into parameters suitable for the harmonic model, wherein the source features comprise the source amplitude vector and the source phase vector corresponding to each harmonic, and the vocal tract features comprise the amplitude response of the vocal tract. The synthesis method comprises the following steps:
a) computing the phase response of the vocal tract from its amplitude response, by methods including but not limited to homomorphic filtering under the minimum-phase assumption;
b) multiplying the amplitude response of the vocal tract at each harmonic frequency by the source amplitude vector corresponding to each harmonic to obtain the amplitude of each harmonic, and adding the phase response of the vocal tract at each harmonic frequency and the source phase vector corresponding to each harmonic to obtain the phase of each harmonic;
c) synthesizing the speech signal from the fundamental frequency and the amplitudes and phases of the harmonics.
5. The source model according to claims 1 and 4 includes, but is not limited to, the Liljencrants-Fant model, the KLGLOTT88 model, the Rosenberg model, the R++ model, and the like.
6. The source characteristic estimation method according to claim 1 includes, but is not limited to, MSP (least-squares phase difference), IAIF (iterative adaptive inverse filtering), and ZZT (zeros of the Z-transform) methods.
7. The harmonic model according to claims 1, 2, 3, and 4 includes, but is not limited to, models containing sinusoidal or harmonic components, such as the sinusoidal model (Sinusoidal Model), the harmonic plus noise model (Harmonic+Noise Model), and the harmonic plus stochastic model (Harmonic+Stochastic Model).
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201580080885.3A CN107851433B (zh) | 2015-12-10 | 2015-12-10 | 基于谐波模型和声源-声道特征分解的语音分析合成方法 |
PCT/IB2015/059495 WO2017098307A1 (zh) | 2015-12-10 | 2015-12-10 | 基于谐波模型和声源-声道特征分解的语音分析合成方法 |
JP2017567786A JP6637082B2 (ja) | 2015-12-10 | 2015-12-10 | 調波モデルと音源−声道特徴分解に基づく音声分析合成方法 |
US15/745,307 US10586526B2 (en) | 2015-12-10 | 2015-12-10 | Speech analysis and synthesis method based on harmonic model and source-vocal tract decomposition |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/IB2015/059495 WO2017098307A1 (zh) | 2015-12-10 | 2015-12-10 | 基于谐波模型和声源-声道特征分解的语音分析合成方法 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2017098307A1 (zh) | 2017-06-15 |
Family
ID=59013771
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2015/059495 WO2017098307A1 (zh) | 2015-12-10 | 2015-12-10 | 基于谐波模型和声源-声道特征分解的语音分析合成方法 |
Country Status (4)
Country | Link |
---|---|
US (1) | US10586526B2 (zh) |
JP (1) | JP6637082B2 (zh) |
CN (1) | CN107851433B (zh) |
WO (1) | WO2017098307A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3857541B1 (en) * | 2018-09-30 | 2023-07-19 | Microsoft Technology Licensing, LLC | Speech waveform generation |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1669074A (zh) * | 2002-10-31 | 2005-09-14 | 富士通株式会社 | 话音增强装置 |
EP1619666A1 (en) * | 2003-05-01 | 2006-01-25 | Fujitsu Limited | Speech decoder, speech decoding method, program, recording medium |
CN101981612A (zh) * | 2008-09-26 | 2011-02-23 | 松下电器产业株式会社 | 声音分析装置以及声音分析方法 |
CN103544949A (zh) * | 2012-07-12 | 2014-01-29 | 哈曼贝克自动系统股份有限公司 | 发动机声音合成 |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5023910A (en) * | 1988-04-08 | 1991-06-11 | At&T Bell Laboratories | Vector quantization in a harmonic speech coding arrangement |
AU3702497A (en) * | 1996-07-30 | 1998-02-20 | British Telecommunications Public Limited Company | Speech coding |
JPH11219200A (ja) * | 1998-01-30 | 1999-08-10 | Sony Corp | 遅延検出装置及び方法、並びに音声符号化装置及び方法 |
US9254383B2 (en) * | 2009-03-20 | 2016-02-09 | ElectroCore, LLC | Devices and methods for monitoring non-invasive vagus nerve stimulation |
CN101552006B (zh) * | 2009-05-12 | 2011-12-28 | 武汉大学 | 加窗信号mdct域的能量及相位调整方法及其装置 |
JP5085700B2 (ja) * | 2010-08-30 | 2012-11-28 | 株式会社東芝 | 音声合成装置、音声合成方法およびプログラム |
US9865247B2 (en) * | 2014-07-03 | 2018-01-09 | Google Inc. | Devices and methods for use of phase information in speech synthesis systems |
2015
- 2015-12-10 JP JP2017567786A patent/JP6637082B2/ja active Active
- 2015-12-10 US US15/745,307 patent/US10586526B2/en not_active Expired - Fee Related
- 2015-12-10 CN CN201580080885.3A patent/CN107851433B/zh active Active
- 2015-12-10 WO PCT/IB2015/059495 patent/WO2017098307A1/zh active Application Filing
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1669074A (zh) * | 2002-10-31 | 2005-09-14 | 富士通株式会社 | 话音增强装置 |
EP1619666A1 (en) * | 2003-05-01 | 2006-01-25 | Fujitsu Limited | Speech decoder, speech decoding method, program, recording medium |
CN101981612A (zh) * | 2008-09-26 | 2011-02-23 | 松下电器产业株式会社 | 声音分析装置以及声音分析方法 |
CN103544949A (zh) * | 2012-07-12 | 2014-01-29 | 哈曼贝克自动系统股份有限公司 | 发动机声音合成 |
Also Published As
Publication number | Publication date |
---|---|
JP6637082B2 (ja) | 2020-01-29 |
US20190013005A1 (en) | 2019-01-10 |
CN107851433B (zh) | 2021-06-29 |
US10586526B2 (en) | 2020-03-10 |
JP2018532131A (ja) | 2018-11-01 |
CN107851433A (zh) | 2018-03-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5925742B2 (ja) | 通信システムにおける隠蔽フレームの生成方法 | |
JP5275612B2 (ja) | 周期信号処理方法、周期信号変換方法および周期信号処理装置ならびに周期信号の分析方法 | |
JP4705203B2 (ja) | 声質変換装置、音高変換装置および声質変換方法 | |
CN107924686B (zh) | 语音处理装置、语音处理方法以及存储介质 | |
JP2014130359A (ja) | 音声信号符号器、符号化されたマルチチャンネル音声信号表現の生成方法およびコンピュータプログラム | |
Mittal et al. | Study of characteristics of aperiodicity in Noh voices | |
WO2009034167A1 (en) | Audio signal transforming | |
JP2018136430A (ja) | 音声変換モデル学習装置、音声変換装置、方法、及びプログラム | |
WO2010032405A1 (ja) | 音声分析装置、音声分析合成装置、補正規則情報生成装置、音声分析システム、音声分析方法、補正規則情報生成方法、およびプログラム | |
JP2019144404A (ja) | 音声変換学習装置、音声変換装置、方法、及びプログラム | |
US9466285B2 (en) | Speech processing system | |
EP1905009B1 (en) | Audio signal synthesis | |
WO2017098307A1 (zh) | 基于谐波模型和声源-声道特征分解的语音分析合成方法 | |
Mathur et al. | Vocal-tract modeling: Fractional elongation of segment lengths in a waveguide model with half-sample delays | |
Zheng et al. | Bandwidth extension WaveNet for bone-conducted speech enhancement | |
JPH1020887A (ja) | 音声処理装置のピッチ抽出方法 | |
Lin et al. | High quality and low complexity pitch modification of acoustic signals | |
JPH07261798A (ja) | 音声分析合成装置 | |
Liu et al. | LPCSE: Neural Speech Enhancement through Linear Predictive Coding | |
JP2019070775A (ja) | 信号解析装置、方法、及びプログラム | |
Airaksinen et al. | Glottal inverse filtering based on quadratic programming. | |
Xu et al. | Voice conversion with a strategy for separating speaker individuality using state-space model | |
Aczél et al. | Sound separation of polyphonic music using instrument prints | |
CN116978346A (zh) | 一种音频变速变调方法、装置及存储介质 | |
JP5679451B2 (ja) | 音声処理装置およびそのプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 15910154 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2017567786 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 15910154 Country of ref document: EP Kind code of ref document: A1 |