WO2017028003A1 - Hidden markov model-based voice unit concatenation method - Google Patents

Hidden markov model-based voice unit concatenation method

Info

Publication number
WO2017028003A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
state
voice
splicing
duration
Prior art date
Application number
PCT/CN2015/086931
Other languages
French (fr)
Chinese (zh)
Inventor
华侃如
Original Assignee
华侃如
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华侃如 filed Critical 华侃如
Priority to PCT/CN2015/086931 priority Critical patent/WO2017028003A1/en
Publication of WO2017028003A1 publication Critical patent/WO2017028003A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers

Definitions

  • the invention relates to the field of speech synthesis, in particular to spliced speech synthesis and statistical parameter speech synthesis based on hidden Markov model.
  • Speech synthesis technology is a technology that allows a machine or program to generate human-intelligible speech from text information.
  • Applications related to speech synthesis technology include text-to-speech (TTS) and singing voice synthesis (SVS).
  • An improved speech unit splicing method is to interpolate the joint portion of the speech unit to smoothly transition from one unit to the next unit.
  • the interpolated objects may be speech synthesis parameters such as time domain waveforms, line spectrum pair (LSP) parameters, and spectral envelopes.
  • the problem with interpolation-based splicing of speech audio units is that when the acoustic characteristics of the two spliced speech units differ greatly, the speech in the joint portion tends to become overly smooth, blurring the synthesized speech and reducing its recognizability.
  • the present invention introduces the HMM commonly used in statistical parametric speech synthesis into a spliced speech synthesis system and proposes a new speech unit splicing method: the corpus data is first used to train HMMs and obtain a state-level time alignment between the corpus text and the speech.
  • at splicing time, the start and end times of the interpolation are determined from the most similar model state in the corresponding audio units, and the speech synthesis parameters are then interpolated and spliced.
  • a spliced speech synthesis system based on the invention can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
  • the technical field to which the present invention pertains is spliced speech synthesis.
  • the technical problem solved by the present invention is the perceptual blurring and discontinuity caused by improper splicing and interpolation methods when a spliced speech synthesis system joins speech audio segments.
  • the present invention introduces an HMM model in a conventional spliced speech synthesis system. Before splicing a speech audio segment using the techniques proposed by the present invention, it is necessary to pre-compute and store the context-dependent model and the state-level temporal alignment of the training speech and text.
  • the method adopted by the present invention comprises the following steps:
  • the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
  • the speech synthesis parameters included in the two sets of speech segments are spliced and interpolated.
  • FIG. 1 is a schematic diagram of the speech audio segment splicing problem solved by the present invention
  • FIG. 2 is a schematic diagram of the time allocation used by the technique of the present invention when splicing and interpolating speech synthesis parameters
  • FIG. 3 is a flow chart of the training phase when the present invention is applied to a complete speech synthesis system
  • the invention proposes a speech audio unit splicing technique applied to spliced speech synthesis.
  • the technique is based on a context-dependent HMM, and its parameter acquisition method is similar to the general HMM-based statistical parameter speech synthesis system, which will be specifically described below in the embodiments of the present invention.
  • the speech units to be spliced generally comprise two diphone speech segments from the corpus, but may also be multi-phone or multi-syllable speech segments.
  • taking spliced speech synthesis with diphone units as an example, the two speech segments to be spliced are shown as unit 1 and unit 2 in FIG. 1, where the part labelled with the pinyin "a" is the joint portion of the two speech segments.
  • the method used includes the following steps:
  • the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
  • the similarity of the two corresponding states can be computed in various ways, for example using the following steps:
  • Σ' is the diagonal covariance matrix constructed from σ'
  • K is the dimensionality of the speech acoustic parameters modelled by the HMM.
  • the speech acoustic parameters are feature parameters that reflect the auditory characteristics of speech, such as cepstral parameters like MFCCs, line spectrum pair (LSP) parameters, and Mel-generalized cepstral (MGC) parameters.
  • MFCC: Mel-frequency cepstral coefficient
  • LSP: line spectrum pair
  • MGC: Mel-generalized cepstral coefficient
  • the values of det(Σ') obtained in step c are compared.
  • the state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
  • the L-N distance between the means of the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Mahalanobis distance between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Kullback-Leibler divergence between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
  • let the most-similar state index obtained in the first step be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'.
  • ta corresponds to the earlier speech unit
  • tb corresponds to the later speech unit.
  • the durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb.
  • optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
  • the speech synthesis parameters included in the two sets of speech segments are spliced and interpolated.
  • the speech synthesis parameter is data capable of expressing a speech feature and causing the vocoder to generate a speech waveform.
  • the speech acoustic parameters can be used as speech synthesis parameters at the same time.
  • speech synthesis parameters can also reflect the auditory characteristics of speech.
  • the time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing.
  • the speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
  • the interpolation methods used for the speech data transition and time stretching include linear interpolation.
  • the unit selection speech synthesis technology based on the voice audio unit splicing technology proposed by the present invention includes two stages of training and operation.
  • the specific implementation of the training phase (shown in Figure 3) is as follows:
  • the speech waveform data and the phoneme-level time segmentation in the corpus are obtained and speech analysis is performed: the speech waveform data is converted into speech acoustic parameter data and stored, together with the phoneme-level time segmentation, in the speech database (hereinafter the database); a context information sequence is generated from the text corresponding to the speech in the corpus and is also stored in the database.
  • the speech synthesis parameters need to be additionally calculated from the speech waveform data in the corpus and stored in the database.
  • the speech acoustic parameter data and the phoneme level time segmentation in the database are acquired, the state transition probability distribution and the output distribution of the HMM are initialized, and the context-independent model is trained.
  • the training of the context-independent model can adopt the Baum-Welch algorithm or the Viterbi Training algorithm.
  • optionally, a hidden semi-Markov model (HSMM) is used instead of the HMM.
  • optionally, syllables are used as the speech unit.
  • in the third step, state-level and phoneme-level time alignment of the database is performed using the context-independent model, and the new phoneme-level alignment results overwrite the original time segmentation in the database, so that the time alignment of the speech units in the database remains consistent with the time alignment of the model states.
  • the state tying of the context-independent model is released, making it a context-dependent model
  • the context-dependent model is trained and the model parameters are stored in the database.
  • the training of the context-dependent model can adopt the Baum-Welch algorithm or the Viterbi Training algorithm.
  • the first step is to obtain a text to be synthesized, and generate a sequence of context information corresponding to the text to be synthesized;
  • according to the context information sequence, the context information of each speech unit in the database is compared with the context information of the text to be synthesized, and a set of candidate speech units is selected, based on context similarity, for each phoneme (or other specified phonetic unit) contained in the text to be synthesized;
  • the splicing distances between consecutive speech units are computed, and the Viterbi algorithm is used to find the speech unit sequence that simultaneously minimizes the splicing distance and the context error.
  • the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments.
  • the similarity of the two corresponding states can be computed in various ways, for example using the following steps:
  • Σ' is the diagonal covariance matrix constructed from σ'
  • K is the dimensionality of the acoustic parameters modelled by the HMM.
  • the Mahalanobis distance between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Kullback-Leibler divergence between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • ta corresponds to the earlier speech unit
  • tb corresponds to the later speech unit
  • the durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb.
  • optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
  • the time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing.
  • the speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
  • the interpolation methods used for the speech data transition and time stretching include linear interpolation.
  • the voice waveform is generated by using a vocoder according to the sequence of speech synthesis parameters generated in the sixth step.
  • the synthesis method is determined by a specific vocoder algorithm, which is not specifically limited in the present invention.
  • the present invention automatically selects the time span of the interpolation transition according to the similarity and the trend of change of the acoustic parameters in different regions of the joint portion, thereby avoiding the situation in which the speech parameters of the corresponding regions differ too much during the transition and the synthesized speech becomes discontinuous or blurred.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice audio unit concatenation method mainly used for concatenative voice synthesis, specifically comprising the following steps: according to context-related HMM state sequences respectively corresponding to concatenated partial phonemes in two adjacent groups of voice segments, searching for the most approximate corresponding state; obtaining a pre-calculated and stored state-level time slice of the concatenated partial phonemes, and calculating the duration of various states after concatenation; and according to voice synthesis parameter data and the duration of various states in a database, performing concatenation and interpolation transitioning on voice synthesis parameters included in the two groups of voice segments. A concatenative voice synthesis system of the concatenation method can automatically choose portions, with the minimum acoustic feature difference and a stable change trend, between two adjacent groups of units to perform interpolation transitioning when voice units are concatenated, thereby effectively improving the intelligibility and degree of distinguishability of a synthesized voice.

Description

Speech unit splicing method based on a hidden Markov model
Technical field
The invention relates to the field of speech synthesis, and in particular to spliced (concatenative) speech synthesis and to statistical parametric speech synthesis based on hidden Markov models.
Background art
Speech synthesis technology enables a machine or program to generate human-intelligible speech from text information. Applications related to speech synthesis include text-to-speech (TTS) and singing voice synthesis (SVS).
Current mainstream speech synthesis technologies include unit-selection-based spliced speech synthesis and statistical parametric speech synthesis based on the hidden Markov model (hereinafter HMM).
Unit-selection spliced speech synthesis searches a pre-recorded, annotated corpus for a sequence of speech units that best matches the context of the text to be synthesized, and splices the audio of the selected units to produce the speech corresponding to that text. This method can produce relatively clear, high-quality speech, but the result is often less coherent than that of HMM-based speech synthesis.
One of the main factors affecting the quality of speech produced by spliced speech synthesis is the way speech units are joined (see Chappell, David T., et al., "A comparison of spectral smoothing methods for segment concatenation based speech synthesis," Speech Communication 36.3 (2002): 343-373). The simplest approach is to splice the waveform segments of the speech units directly, but discontinuities at the splicing boundaries severely degrade the naturalness and recognizability of the synthesized speech.
An improved speech unit splicing method interpolates across the joint portion of the speech units so that one unit transitions smoothly into the next. The interpolated quantities may be speech synthesis parameters such as time-domain waveforms, line spectrum pair (LSP) parameters, or spectral envelopes.
The problem with interpolation-based splicing of speech audio units is that when the acoustic characteristics of the two spliced speech units differ greatly, the speech in the joint portion tends to become overly smooth, blurring the synthesized speech and reducing its recognizability.
To solve this problem, the present invention introduces the HMM commonly used in statistical parametric speech synthesis into a spliced speech synthesis system and proposes a new speech unit splicing method: the corpus data is first used to train HMMs and to obtain a state-level time alignment between the corpus text and the speech; at splicing time, the start and end times of the interpolation are determined from the most similar model state in the corresponding audio units, and the speech synthesis parameters are then interpolated and spliced. A spliced speech synthesis system based on the invention can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
Summary of the invention
The technical field to which the present invention pertains is spliced speech synthesis. The technical problem solved by the present invention is the perceptual blurring and discontinuity caused by improper splicing and interpolation methods when a spliced speech synthesis system joins speech audio segments.
To solve this problem, the present invention introduces an HMM into a conventional spliced speech synthesis system. Before speech audio segments are spliced with the technique proposed by the present invention, the context-dependent models and the state-level time alignment between the training speech and text must be pre-computed and stored.
The method adopted by the present invention comprises the following steps:
In the first step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
In the second step, using the state index obtained in the first step, the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
In the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the second step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
Brief description of the drawings
FIG. 1 is a schematic diagram of the speech audio segment splicing problem solved by the present invention;
FIG. 2 is a schematic diagram of the time allocation used by the technique of the present invention when splicing and interpolating speech synthesis parameters;
FIG. 3 is a flow chart of the training phase when the present invention is applied to a complete speech synthesis system;
FIG. 4 is a flow chart of the synthesis phase when the present invention is applied to a complete speech synthesis system.
Detailed description
The invention proposes a speech audio unit splicing technique for spliced speech synthesis. The technique is based on context-dependent HMMs, whose parameters are obtained in much the same way as in a typical HMM-based statistical parametric speech synthesis system, as described in the embodiments below.
When the speech audio unit splicing technique proposed by the present invention is applied to spliced speech synthesis, the speech units to be spliced generally comprise two diphone speech segments from the corpus, but they may also be multi-phone or multi-syllable speech segments.
Taking spliced speech synthesis with diphone units as an example, the two speech segments to be spliced are shown as unit 1 and unit 2 in FIG. 1, where the part labelled with the pinyin "a" is the joint portion of the two speech segments.
To splice the two speech segments well, the method used by the invention comprises the following steps:
In the first step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
When continuous speech is modelled with an HMM, a single phoneme generally corresponds to a fixed number of model states, so in the context-dependent models of the two phonemes the states with the same index can be compared one by one in order.
The similarity of two corresponding states can be computed in various ways, for example with the following steps:
a. Obtain the mean vectors μa, μb and the diagonal covariance vectors σa, σb of the output distributions of the corresponding states;
b. Compute the mean vector μ' and the diagonal covariance vector σ' of the merged output distributions:
[The expressions for μ' and σ' are given as equation images in the original publication: PCTCN2015086931-appb-000001 and PCTCN2015086931-appb-000002.]
c. From the merged diagonal covariance vector σ' obtained in step b, compute the determinant:
[The expression for det(Σ') is given as an equation image in the original publication: PCTCN2015086931-appb-000003.]
where Σ' is the diagonal covariance matrix constructed from σ', and K is the dimensionality of the speech acoustic parameters modelled by the HMM.
The speech acoustic parameters are feature parameters that reflect the auditory characteristics of speech, for example cepstral parameters such as MFCCs, line spectrum pair (LSP) parameters, and Mel-generalized cepstral (MGC) parameters.
Finally, the values of det(Σ') obtained in step c are compared. The state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
Optionally, the L-N distance between the means of the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Mahalanobis distance between the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Kullback-Leibler divergence between the output distributions of the corresponding states is used to reflect the similarity between states.
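The exact expressions for the merged mean μ' and merged diagonal covariance σ' are given only as equation images in the published application, so the following Python sketch should be read as an illustration rather than the patent's exact formulas: it assumes an equal-weight, moment-matched merge of the two diagonal-Gaussian output distributions and compares the log-determinant of the merged covariance (a monotone proxy for det(Σ')), taking the smallest value as marking the most similar state.
```python
import numpy as np

def merged_log_det(mu_a, var_a, mu_b, var_b):
    """Log-determinant of the merged diagonal covariance of two diagonal Gaussians,
    assuming an equal-weight, moment-matched merge (an assumption, not the patent's
    published formula). mu_* are mean vectors, var_* are per-dimension variances."""
    mu_m = 0.5 * (mu_a + mu_b)
    var_m = 0.5 * (var_a + var_b) + 0.5 * ((mu_a - mu_m) ** 2 + (mu_b - mu_m) ** 2)
    return float(np.sum(np.log(var_m)))  # log det of a diagonal covariance matrix

def most_similar_state(states_a, states_b):
    """states_a, states_b: per-state (mean, diagonal-variance) pairs of the
    joint-portion phoneme in the earlier and later speech segment.
    Returns the index n of the state pair with the smallest merged determinant."""
    scores = [merged_log_det(mu_a, var_a, mu_b, var_b)
              for (mu_a, var_a), (mu_b, var_b) in zip(states_a, states_b)]
    return int(np.argmin(scores))
```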
In the second step, using the most-similar state index obtained in the first step, the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
Let the most-similar state index obtained in the first step be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'. Here ta corresponds to the earlier speech unit and tb to the later speech unit.
The durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb:
t'_i = t_{a,i}, for 0 ≤ i < n
t'_n = (t_{a,n} + t_{b,n}) / 2
t'_i = t_{b,i}, for n < i < N
Optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
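The duration rule of the second step follows directly from the prose above; the short sketch below keeps the earlier unit's durations before state n, the later unit's durations after it, averages the two durations at n, and applies the optional minimum duration to the transition state. Variable names are illustrative.
```python
import numpy as np

def merged_state_durations(t_a, t_b, n, t_min=0.0):
    """t_a, t_b: per-state durations of the joint-portion phoneme in the earlier
    and later speech segment (both of length N); n: index of the most similar state."""
    t_a = np.asarray(t_a, dtype=float)
    t_b = np.asarray(t_b, dtype=float)
    t_merged = np.empty_like(t_a)
    t_merged[:n] = t_a[:n]                   # states before n keep the earlier unit's timing
    t_merged[n] = 0.5 * (t_a[n] + t_b[n])    # state n: average of the two durations
    t_merged[n + 1:] = t_b[n + 1:]           # states after n keep the later unit's timing
    t_merged[n] = max(t_merged[n], t_min)    # optional floor on the transition duration
    return t_merged
```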
In the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the second step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
The speech synthesis parameters are data that describe speech features and from which a vocoder can generate a speech waveform. When LSP or MGC parameters are used, the speech acoustic parameters can also serve directly as the speech synthesis parameters. To some extent, speech synthesis parameters likewise reflect the auditory characteristics of speech.
The time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing. The speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
FIG. 2 shows an example of this process. The joint portion of the two speech segments to be spliced is the phoneme "a", which contains three states in each segment; the corresponding time spans are denoted A, B, C and D, E, F respectively. Suppose the second state is the most similar state selected in the first step. Because time span A precedes the second state and lies within the diphone unit "t a", the speech in span A is copied directly into span A of unit 3 in FIG. 2; because time span F follows the second state and lies within the diphone unit "a o", the speech in span F is copied directly into span F of unit 3 in FIG. 2; and because time spans B and E correspond to the most similar state, the speech data in spans B and E is interpolated and time-stretched to the duration t'1 of the second state computed in the second step, and is then written into span B->E of unit 3 in FIG. 2.
In the above steps, the interpolation methods used for the speech data transition and time stretching include linear interpolation.
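As a concrete illustration of the interpolation transition, the sketch below linearly time-stretches the parameter trajectories of the two transition spans (B and E in FIG. 2) to the merged duration of state n and cross-fades between them; the frame-based parameter layout and helper names are assumptions made for the sake of the example, and the remaining spans (A and F) would simply be copied into the output unchanged.
```python
import numpy as np

def interpolate_transition(params_a, params_b, target_frames):
    """params_a, params_b: (frames, dims) synthesis-parameter trajectories of the
    transition span in the earlier and later unit. Returns a trajectory of
    target_frames frames, time-stretched and cross-faded with linear interpolation."""
    def stretch(p, n_frames):
        src = np.linspace(0.0, 1.0, len(p))
        dst = np.linspace(0.0, 1.0, n_frames)
        return np.stack([np.interp(dst, src, p[:, d]) for d in range(p.shape[1])], axis=1)

    a = stretch(np.asarray(params_a, dtype=float), target_frames)
    b = stretch(np.asarray(params_b, dtype=float), target_frames)
    w = np.linspace(0.0, 1.0, target_frames)[:, None]  # fade weight from unit 1 to unit 2
    return (1.0 - w) * a + w * b
```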
A unit-selection speech synthesis system based on the speech audio unit splicing technique proposed by the present invention comprises a training phase and a run (synthesis) phase. The training phase (shown in FIG. 3) is implemented as follows:
In the first step, the speech waveform data and the phoneme-level time segmentation in the corpus are obtained and speech analysis is performed: the speech waveform data is converted into speech acoustic parameter data and stored, together with the phoneme-level time segmentation, in the speech database (hereinafter the database); a context information sequence is generated from the text corresponding to the speech in the corpus and is also stored in the database.
If speech synthesis parameters different from the speech acoustic parameters are used, the speech synthesis parameters must additionally be computed from the speech waveform data in the corpus and stored in the database.
In the second step, the speech acoustic parameter data and the phoneme-level time segmentation in the database are obtained, the state transition probability distributions and output distributions of the HMMs are initialized, and context-independent models are trained.
The context-independent models can be trained with the Baum-Welch algorithm or with Viterbi training.
Optionally, a hidden semi-Markov model (HSMM) is used instead of the HMM.
Optionally, syllables are used as the speech unit.
In the third step, state-level and phoneme-level time alignment of the database is performed using the context-independent models, and the new phoneme-level alignment results overwrite the original time segmentation in the database, so that the time alignment of the speech units in the database remains consistent with the time alignment of the model states.
In the fourth step, the state tying of the context-independent models is released, turning them into context-dependent models;
In the fifth step, the context-dependent models are trained and the model parameters are stored in the database.
The context-dependent models can be trained with the Baum-Welch algorithm or with Viterbi training.
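As a rough prototype of the second and third training steps above (training a context-independent model and re-aligning the database with it), the sketch below uses hmmlearn's diagonal-covariance Gaussian HMM on synthetic features. This is purely illustrative: the patent does not prescribe a toolkit, a real system would train one model per phoneme (or syllable) with an explicit left-to-right topology, and the subsequent untying into context-dependent models is not shown.
```python
import numpy as np
from hmmlearn import hmm

# synthetic stand-in for the acoustic parameters of one phoneme's occurrences in the corpus
rng = np.random.default_rng(0)
occurrences = [rng.normal(size=(40, 13)) for _ in range(20)]  # 20 tokens, 13-dim features
X = np.concatenate(occurrences)
lengths = [len(o) for o in occurrences]

# second step (sketch): initialize and train a context-independent model with diagonal covariances
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# third step (sketch): state-level alignment of each occurrence via Viterbi decoding;
# in the patent, this alignment overwrites the phoneme-level segmentation stored in the database
state_alignments = [model.predict(o) for o in occurrences]
```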
In the unit-selection speech synthesis system based on the speech audio unit splicing technique proposed by the present invention, the run (synthesis) phase (shown in FIG. 4) is implemented as follows:
In the first step, the text to be synthesized is obtained, and the context information sequence corresponding to the text to be synthesized is generated;
In the second step, according to the context information sequence, the context information of each speech unit in the database is compared with the context information of the text to be synthesized, and a set of candidate speech units is selected, based on context similarity, for each phoneme (or other specified phonetic unit) contained in the text to be synthesized;
In the third step, using the candidate speech units obtained in the second step, the speech acoustic parameters in the database, and the phoneme-level or state-level time segmentation in the database, the splicing distances between consecutive speech units are computed, and the Viterbi algorithm is used to find the speech unit sequence that simultaneously minimizes the splicing distance and the context error.
A more detailed implementation of this step can be found in A. Black, et al., "Optimising selection of units from speech databases for concatenative synthesis," EUROSPEECH 95, pages 581-584, Madrid, Spain, 1995.
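The unit search in this step is a standard dynamic-programming (Viterbi) selection over the candidate lists. The following sketch minimizes the sum of a context (target) cost and a splicing (concatenation) cost; the two cost functions are passed in as parameters because their exact form is not fixed here.
```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """candidates: list over positions, each a list of candidate units;
    target_cost(unit, position) and concat_cost(prev_unit, unit) return floats.
    Returns the unit sequence minimizing total target + concatenation cost."""
    T = len(candidates)
    best = [[target_cost(u, 0) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        cur, ptr = [], []
        for u in candidates[t]:
            costs = [best[t - 1][j] + concat_cost(candidates[t - 1][j], u)
                     for j in range(len(candidates[t - 1]))]
            j = int(np.argmin(costs))
            cur.append(costs[j] + target_cost(u, t))
            ptr.append(j)
        best.append(cur)
        back.append(ptr)
    # backtrack the lowest-cost path through the candidate lattice
    j = int(np.argmin(best[-1]))
    path = [candidates[-1][j]]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(candidates[t - 1][j])
    return path[::-1]
```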
In the fourth step, for each pair of adjacent speech units in the unit sequence generated in the third step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
The similarity of two corresponding states can be computed in various ways, for example with the following steps:
a. Obtain the mean vectors μa, μb and the diagonal covariance vectors σa, σb of the output distributions of the corresponding states;
b. Compute the mean vector μ' and the diagonal covariance vector σ' of the merged output distributions:
[The expressions for μ' and σ' are given as equation images in the original publication: PCTCN2015086931-appb-000007 and PCTCN2015086931-appb-000008.]
c. From the merged diagonal covariance vector σ' obtained in step b, compute the determinant:
[The expression for det(Σ') is given as an equation image in the original publication: PCTCN2015086931-appb-000009.]
where Σ' is the diagonal covariance matrix constructed from σ', and K is the dimensionality of the acoustic parameters modelled by the HMM.
Finally, the values of det(Σ') obtained in step c are compared. The state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
Optionally, the L-N distance between the means of the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Mahalanobis distance between the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Kullback-Leibler divergence between the output distributions of the corresponding states is used to reflect the similarity between states.
In the fifth step, for each pair of adjacent speech units in the unit sequence generated in the third step, using the most-similar state index obtained in the fourth step, the state-level time segmentation of the joint-portion phonemes of the two speech segments computed and stored during the training phase is retrieved, and the duration of each state after splicing is calculated;
a. Let the most-similar state index obtained above be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'. Here ta corresponds to the earlier speech unit and tb to the later speech unit.
b. The durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb:
t'_i = t_{a,i}, for 0 ≤ i < n
t'_n = (t_{a,n} + t_{b,n}) / 2
t'_i = t_{b,i}, for n < i < N
c. Optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
In the sixth step, for each pair of adjacent speech units in the unit sequence generated in the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the fifth step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
The time span that the most similar model state determined in the fourth step occupies in the database speech is the time span over which the interpolation transition is performed during splicing. The speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
In the above steps, the interpolation methods used for the speech data transition and time stretching include linear interpolation.
In the seventh step, the speech waveform is generated with a vocoder from the speech synthesis parameter sequence generated in the sixth step. The synthesis method is determined by the specific vocoder algorithm used and is not specifically limited by the present invention.
Compared with the traditional way of splicing speech segments in spliced speech synthesis, the present invention automatically selects the time span of the interpolation transition according to the similarity and the trend of change of the acoustic parameters in different regions of the joint portion, thereby avoiding the situation in which the speech parameters of the corresponding regions differ too much during the transition and the synthesized speech becomes discontinuous or blurred.

Claims (1)

  1. A speech audio unit splicing method mainly used for spliced speech synthesis, comprising the following steps: finding the most similar corresponding state according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of two adjacent speech segments; retrieving the pre-computed and stored state-level time segmentation of the joint-portion phonemes and calculating the duration of each state after splicing; and, according to the speech synthesis parameter data in the database and the state durations, splicing and interpolating the speech synthesis parameters contained in the two speech segments. A spliced speech synthesis system based on the splicing method can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
PCT/CN2015/086931 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method WO2017028003A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Publications (1)

Publication Number Publication Date
WO2017028003A1 true WO2017028003A1 (en) 2017-02-23

Family

ID=58050590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Country Status (1)

Country Link
WO (1) WO2017028003A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
CN101178896A (en) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HU , KE ET AL.: "HMM-based Mandarin Speech Synthesis System", COMMUNICATIONS TECHNOLOGY, vol. 45, no. 8, 31 August 2012 (2012-08-31), pages 101 - 103 , 108, ISSN: 1002-0802 *
TOSHIO HIRAI ET AL.: "USING 5 ms SEGMENTS IN CONCATENATIVE SPEECH SYNTHESIS", 5TH ISCA SPEECH SYNTHESIS WORKSHOP, 16 June 2004 (2004-06-16), pages 37 - 42, XP055365714 *
YIN, YONG ET AL.: "Smoothing algorithm for contextual phone concatenation in speech synthesis", JOURNAL OF TSINGHUA UNIVERSITY( SCIENCE AND TECHNOLOGY, vol. 48, no. Sl, 31 December 2008 (2008-12-31), pages 640 - 644, ISSN: 1000-0054 *
ZHANG, PENG ET AL.: "On transitional algorithm of waveform concatenation in speech synthesis system", JOURNAL OF NATURAL SCIENCE OF HEILONGJIANG UNIVERSITY, vol. 28, no. 6, 31 December 2011 (2011-12-31), pages 867 - 870, ISSN: 1001-7011 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US10347238B2 (en) Text-based insertion and replacement in audio narration
JP5665780B2 (en) Speech synthesis apparatus, method and program
US10741169B1 (en) Text-to-speech (TTS) processing
JP4469883B2 (en) Speech synthesis method and apparatus
EP2140447B1 (en) System and method for hybrid speech synthesis
US10497362B2 (en) System and method for outlier identification to remove poor alignments in speech synthesis
JP4406440B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP3910628B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP6293912B2 (en) Speech synthesis apparatus, speech synthesis method and program
Khan et al. Concatenative speech synthesis: A review
CN101131818A (en) Speech synthesis apparatus and method
Bellur et al. Prosody modeling for syllable-based concatenative speech synthesis of Hindi and Tamil
JP4639932B2 (en) Speech synthesizer
JP4225128B2 (en) Regular speech synthesis apparatus and regular speech synthesis method
WO2017028003A1 (en) Hidden markov model-based voice unit concatenation method
JP2009133890A (en) Voice synthesizing device and method
JP4247289B1 (en) Speech synthesis apparatus, speech synthesis method and program thereof
JP5874639B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
Bunnell et al. The ModelTalker system
JP5328703B2 (en) Prosody pattern generator
Latsch et al. Pitch-synchronous time alignment of speech signals for prosody transplantation
JPWO2008139919A1 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
JP5275470B2 (en) Speech synthesis apparatus and program
JP4034751B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
Carvalho et al. Concatenative speech synthesis for European Portuguese

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15901209

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15901209

Country of ref document: EP

Kind code of ref document: A1