WO2017028003A1 - Hidden markov model-based voice unit concatenation method - Google Patents

Hidden markov model-based voice unit concatenation method

Info

Publication number
WO2017028003A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
state
voice
splicing
duration
Prior art date
Application number
PCT/CN2015/086931
Other languages
French (fr)
Chinese (zh)
Inventor
华侃如
Original Assignee
华侃如
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华侃如 filed Critical 华侃如
Priority to PCT/CN2015/086931 priority Critical patent/WO2017028003A1/en
Publication of WO2017028003A1 publication Critical patent/WO2017028003A1/en

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers

Definitions

  • the invention relates to the field of speech synthesis, in particular to spliced speech synthesis and statistical parameter speech synthesis based on hidden Markov model.
  • Speech synthesis technology is a technology that allows a machine or program to generate human-intelligible speech from text information.
  • Applications related to speech synthesis technology include text-to-speech (TTS) and singing voice synthesis (SVS).
  • An improved speech unit splicing method is to interpolate the joint portion of the speech unit to smoothly transition from one unit to the next unit.
  • the interpolated objects may be speech synthesis parameters such as time domain waveforms, line spectrum pair (LSP) parameters, and spectral envelopes.
  • the problem with interpolation-based splicing of speech audio units is that when the acoustic characteristics of the two spliced speech units differ greatly, the speech in the joint portion tends to become overly smooth, blurring the synthesized speech and reducing its recognizability.
  • the present invention introduces the HMM commonly used in statistical parametric speech synthesis into a spliced speech synthesis system and proposes a new speech unit splicing method: the corpus data is first used to train HMMs and obtain a state-level time alignment between the corpus text and the speech.
  • at splicing time, the start and end times of the interpolation are determined from the most similar model state in the corresponding audio units, and the speech synthesis parameters are then interpolated and spliced.
  • a spliced speech synthesis system based on the invention can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
  • the technical field to which the present invention pertains is spliced speech synthesis.
  • the technical problem solved by the present invention is the perceptual blurring and discontinuity caused by improper splicing and interpolation methods when a spliced speech synthesis system joins speech audio segments.
  • the present invention introduces an HMM model in a conventional spliced speech synthesis system. Before splicing a speech audio segment using the techniques proposed by the present invention, it is necessary to pre-compute and store the context-dependent model and the state-level temporal alignment of the training speech and text.
  • the method adopted by the present invention comprises the following steps:
  • the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
  • the speech synthesis parameters included in the two sets of speech segments are spliced and interpolated.
  • FIG. 1 is a schematic diagram of the speech audio segment splicing problem solved by the present invention
  • FIG. 2 is a schematic diagram of the time allocation used by the technique of the present invention when splicing and interpolating speech synthesis parameters
  • FIG. 3 is a flow chart of the training phase when the present invention is applied to a complete speech synthesis system
  • the invention proposes a speech audio unit splicing technique applied to spliced speech synthesis.
  • the technique is based on a context-dependent HMM, and its parameter acquisition method is similar to the general HMM-based statistical parameter speech synthesis system, which will be specifically described below in the embodiments of the present invention.
  • the speech units to be spliced generally comprise two diphone speech segments from the corpus, but may also be multi-phone or multi-syllable speech segments.
  • taking spliced speech synthesis with diphone units as an example, the two speech segments to be spliced are shown as unit 1 and unit 2 in FIG. 1, where the part labelled with the pinyin "a" is the joint portion of the two speech segments.
  • the method used includes the following steps:
  • the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
  • the similarity of the two corresponding states can be computed in various ways, for example using the following steps:
  • Σ' is the diagonal covariance matrix constructed from σ'
  • K is the dimensionality of the speech acoustic parameters modelled by the HMM.
  • the speech acoustic parameters are feature parameters that reflect the auditory characteristics of speech, such as cepstral parameters like MFCCs, line spectrum pair (LSP) parameters, and Mel-generalized cepstral (MGC) parameters.
  • MFCC: Mel-frequency cepstral coefficient
  • LSP: line spectrum pair
  • MGC: Mel-generalized cepstral coefficient
  • the values of det(Σ') obtained in step c are compared.
  • the state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
  • the L-N distance between the means of the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Mahalanobis distance between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Kullback-Leibler divergence between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
  • let the most-similar state index obtained in the first step be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'.
  • ta corresponds to the earlier speech unit
  • tb corresponds to the later speech unit.
  • the durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb.
  • optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
  • the speech synthesis parameters included in the two sets of speech segments are spliced and interpolated.
  • the speech synthesis parameter is data capable of expressing a speech feature and causing the vocoder to generate a speech waveform.
  • the speech acoustic parameters can be used as speech synthesis parameters at the same time.
  • speech synthesis parameters can also reflect the auditory characteristics of speech.
  • the time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing.
  • the speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
  • the interpolation methods used for the speech data transition and time stretching include linear interpolation.
  • the unit selection speech synthesis technology based on the voice audio unit splicing technology proposed by the present invention includes two stages of training and operation.
  • the specific implementation of the training phase (shown in Figure 3) is as follows:
  • the speech waveform data and the phoneme-level time segmentation in the corpus are obtained and speech analysis is performed: the speech waveform data is converted into speech acoustic parameter data and stored, together with the phoneme-level time segmentation, in the speech database (hereinafter the database); a context information sequence is generated from the text corresponding to the speech in the corpus and is also stored in the database.
  • the speech synthesis parameters need to be additionally calculated from the speech waveform data in the corpus and stored in the database.
  • the speech acoustic parameter data and the phoneme level time segmentation in the database are acquired, the state transition probability distribution and the output distribution of the HMM are initialized, and the context-independent model is trained.
  • the training of the context-independent model can adopt the Baum-Welch algorithm or the Viterbi Training algorithm.
  • optionally, a hidden semi-Markov model (HSMM) is used instead of the HMM.
  • optionally, syllables are used as the speech unit.
  • in the third step, state-level and phoneme-level time alignment of the database is performed using the context-independent model, and the new phoneme-level alignment results overwrite the original time segmentation in the database, so that the time alignment of the speech units in the database remains consistent with the time alignment of the model states.
  • the state tying of the context-independent model is released, making it a context-dependent model
  • the context-dependent model is trained and the model parameters are stored in the database.
  • the training of the context-dependent model can adopt the Baum-Welch algorithm or the Viterbi Training algorithm.
  • the first step is to obtain a text to be synthesized, and generate a sequence of context information corresponding to the text to be synthesized;
  • according to the context information sequence, the context information of each speech unit in the database is compared with the context information of the text to be synthesized, and a set of candidate speech units is selected, based on context similarity, for each phoneme (or other specified phonetic unit) contained in the text to be synthesized;
  • the splicing distances between consecutive speech units are computed, and the Viterbi algorithm is used to find the speech unit sequence that simultaneously minimizes the splicing distance and the context error.
  • the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments.
  • the similarity of the two corresponding states can be computed in various ways, for example using the following steps:
  • Σ' is the diagonal covariance matrix constructed from σ'
  • K is the dimensionality of the acoustic parameters modelled by the HMM.
  • the Mahalanobis distance between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • the Kullback-Leibler divergence between the output distributions of the corresponding states may be used to reflect the similarity between states.
  • ta corresponds to the earlier speech unit
  • tb corresponds to the later speech unit
  • the durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb.
  • optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
  • the time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing.
  • the speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
  • the interpolation methods used for the speech data transition and time stretching include linear interpolation.
  • the voice waveform is generated by using a vocoder according to the sequence of speech synthesis parameters generated in the sixth step.
  • the synthesis method is determined by a specific vocoder algorithm, which is not specifically limited in the present invention.
  • the present invention automatically selects the time span of the interpolation transition according to the similarity and the trend of change of the acoustic parameters in different regions of the joint portion, thereby avoiding the situation in which the speech parameters of the corresponding regions differ too much during the transition and the synthesized speech becomes discontinuous or blurred.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice audio unit concatenation method mainly used for concatenative voice synthesis, specifically comprising the following steps: according to context-related HMM state sequences respectively corresponding to concatenated partial phonemes in two adjacent groups of voice segments, searching for the most approximate corresponding state; obtaining a pre-calculated and stored state-level time slice of the concatenated partial phonemes, and calculating the duration of various states after concatenation; and according to voice synthesis parameter data and the duration of various states in a database, performing concatenation and interpolation transitioning on voice synthesis parameters included in the two groups of voice segments. A concatenative voice synthesis system of the concatenation method can automatically choose portions, with the minimum acoustic feature difference and a stable change trend, between two adjacent groups of units to perform interpolation transitioning when voice units are concatenated, thereby effectively improving the intelligibility and degree of distinguishability of a synthesized voice.

Description

Speech unit splicing method based on a hidden Markov model
Technical field
The invention relates to the field of speech synthesis, and in particular to spliced (concatenative) speech synthesis and to statistical parametric speech synthesis based on hidden Markov models.
Background art
Speech synthesis technology enables a machine or program to generate human-intelligible speech from text information. Applications related to speech synthesis include text-to-speech (TTS) and singing voice synthesis (SVS).
Current mainstream speech synthesis technologies include unit-selection-based spliced speech synthesis and statistical parametric speech synthesis based on the hidden Markov model (hereinafter HMM).
Unit-selection spliced speech synthesis searches a pre-recorded, annotated corpus for a sequence of speech units that best matches the context of the text to be synthesized, and splices the audio of the selected units to produce the speech corresponding to that text. This method can produce relatively clear, high-quality speech, but the result is often less coherent than that of HMM-based speech synthesis.
One of the main factors affecting the quality of speech produced by spliced speech synthesis is the way speech units are joined (see Chappell, David T., et al., "A comparison of spectral smoothing methods for segment concatenation based speech synthesis," Speech Communication 36.3 (2002): 343-373). The simplest approach is to splice the waveform segments of the speech units directly, but discontinuities at the splicing boundaries severely degrade the naturalness and recognizability of the synthesized speech.
An improved speech unit splicing method interpolates across the joint portion of the speech units so that one unit transitions smoothly into the next. The interpolated quantities may be speech synthesis parameters such as time-domain waveforms, line spectrum pair (LSP) parameters, or spectral envelopes.
The problem with interpolation-based splicing of speech audio units is that when the acoustic characteristics of the two spliced speech units differ greatly, the speech in the joint portion tends to become overly smooth, blurring the synthesized speech and reducing its recognizability.
To solve this problem, the present invention introduces the HMM commonly used in statistical parametric speech synthesis into a spliced speech synthesis system and proposes a new speech unit splicing method: the corpus data is first used to train HMMs and to obtain a state-level time alignment between the corpus text and the speech; at splicing time, the start and end times of the interpolation are determined from the most similar model state in the corresponding audio units, and the speech synthesis parameters are then interpolated and spliced. A spliced speech synthesis system based on the invention can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
Summary of the invention
The technical field to which the present invention pertains is spliced speech synthesis. The technical problem solved by the present invention is the perceptual blurring and discontinuity caused by improper splicing and interpolation methods when a spliced speech synthesis system joins speech audio segments.
To solve this problem, the present invention introduces an HMM into a conventional spliced speech synthesis system. Before speech audio segments are spliced with the technique proposed by the present invention, the context-dependent models and the state-level time alignment between the training speech and text must be pre-computed and stored.
The method adopted by the present invention comprises the following steps:
In the first step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
In the second step, using the state index obtained in the first step, the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
In the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the second step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
Brief description of the drawings
FIG. 1 is a schematic diagram of the speech audio segment splicing problem solved by the present invention;
FIG. 2 is a schematic diagram of the time allocation used by the technique of the present invention when splicing and interpolating speech synthesis parameters;
FIG. 3 is a flow chart of the training phase when the present invention is applied to a complete speech synthesis system;
FIG. 4 is a flow chart of the synthesis phase when the present invention is applied to a complete speech synthesis system.
Detailed description
The invention proposes a speech audio unit splicing technique for spliced speech synthesis. The technique is based on context-dependent HMMs, whose parameters are obtained in much the same way as in a typical HMM-based statistical parametric speech synthesis system, as described in the embodiments below.
When the speech audio unit splicing technique proposed by the present invention is applied to spliced speech synthesis, the speech units to be spliced generally comprise two diphone speech segments from the corpus, but they may also be multi-phone or multi-syllable speech segments.
Taking spliced speech synthesis with diphone units as an example, the two speech segments to be spliced are shown as unit 1 and unit 2 in FIG. 1, where the part labelled with the pinyin "a" is the joint portion of the two speech segments.
To splice the two speech segments well, the method used by the invention comprises the following steps:
In the first step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
When continuous speech is modelled with an HMM, a single phoneme generally corresponds to a fixed number of model states, so in the context-dependent models of the two phonemes the states with the same index can be compared one by one in order.
The similarity of two corresponding states can be computed in various ways, for example with the following steps:
a. Obtain the mean vectors μa, μb and the diagonal covariance vectors σa, σb of the output distributions of the corresponding states;
b. Compute the mean vector μ' and the diagonal covariance vector σ' of the merged output distributions:
[The expressions for μ' and σ' are given as equation images in the original publication: PCTCN2015086931-appb-000001 and PCTCN2015086931-appb-000002.]
c. From the merged diagonal covariance vector σ' obtained in step b, compute the determinant:
[The expression for det(Σ') is given as an equation image in the original publication: PCTCN2015086931-appb-000003.]
where Σ' is the diagonal covariance matrix constructed from σ', and K is the dimensionality of the speech acoustic parameters modelled by the HMM.
The speech acoustic parameters are feature parameters that reflect the auditory characteristics of speech, for example cepstral parameters such as MFCCs, line spectrum pair (LSP) parameters, and Mel-generalized cepstral (MGC) parameters.
Finally, the values of det(Σ') obtained in step c are compared. The state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
Optionally, the L-N distance between the means of the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Mahalanobis distance between the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Kullback-Leibler divergence between the output distributions of the corresponding states is used to reflect the similarity between states.
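The exact expressions for the merged mean μ' and merged diagonal covariance σ' are given only as equation images in the published application, so the following Python sketch should be read as an illustration rather than the patent's exact formulas: it assumes an equal-weight, moment-matched merge of the two diagonal-Gaussian output distributions and compares the log-determinant of the merged covariance (a monotone proxy for det(Σ')), taking the smallest value as marking the most similar state.
```python
import numpy as np

def merged_log_det(mu_a, var_a, mu_b, var_b):
    """Log-determinant of the merged diagonal covariance of two diagonal Gaussians,
    assuming an equal-weight, moment-matched merge (an assumption, not the patent's
    published formula). mu_* are mean vectors, var_* are per-dimension variances."""
    mu_m = 0.5 * (mu_a + mu_b)
    var_m = 0.5 * (var_a + var_b) + 0.5 * ((mu_a - mu_m) ** 2 + (mu_b - mu_m) ** 2)
    return float(np.sum(np.log(var_m)))  # log det of a diagonal covariance matrix

def most_similar_state(states_a, states_b):
    """states_a, states_b: per-state (mean, diagonal-variance) pairs of the
    joint-portion phoneme in the earlier and later speech segment.
    Returns the index n of the state pair with the smallest merged determinant."""
    scores = [merged_log_det(mu_a, var_a, mu_b, var_b)
              for (mu_a, var_a), (mu_b, var_b) in zip(states_a, states_b)]
    return int(np.argmin(scores))
```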
In the second step, using the most-similar state index obtained in the first step, the pre-computed and stored state-level time segmentation of the joint-portion phonemes in the two speech segments is retrieved, and the duration of each state after splicing is calculated;
Let the most-similar state index obtained in the first step be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'. Here ta corresponds to the earlier speech unit and tb to the later speech unit.
The durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb:
t'_i = t_{a,i}, for 0 ≤ i < n
t'_n = (t_{a,n} + t_{b,n}) / 2
t'_i = t_{b,i}, for n < i < N
Optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
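The duration rule of the second step follows directly from the prose above; the short sketch below keeps the earlier unit's durations before state n, the later unit's durations after it, averages the two durations at n, and applies the optional minimum duration to the transition state. Variable names are illustrative.
```python
import numpy as np

def merged_state_durations(t_a, t_b, n, t_min=0.0):
    """t_a, t_b: per-state durations of the joint-portion phoneme in the earlier
    and later speech segment (both of length N); n: index of the most similar state."""
    t_a = np.asarray(t_a, dtype=float)
    t_b = np.asarray(t_b, dtype=float)
    t_merged = np.empty_like(t_a)
    t_merged[:n] = t_a[:n]                   # states before n keep the earlier unit's timing
    t_merged[n] = 0.5 * (t_a[n] + t_b[n])    # state n: average of the two durations
    t_merged[n + 1:] = t_b[n + 1:]           # states after n keep the later unit's timing
    t_merged[n] = max(t_merged[n], t_min)    # optional floor on the transition duration
    return t_merged
```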
In the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the second step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
The speech synthesis parameters are data that describe speech features and from which a vocoder can generate a speech waveform. When LSP or MGC parameters are used, the speech acoustic parameters can also serve directly as the speech synthesis parameters. To some extent, speech synthesis parameters likewise reflect the auditory characteristics of speech.
The time span that the most similar model state determined in the first step occupies in the database speech is the time span over which the interpolation transition is performed during splicing. The speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
FIG. 2 shows an example of this process. The joint portion of the two speech segments to be spliced is the phoneme "a", which contains three states in each segment; the corresponding time spans are denoted A, B, C and D, E, F respectively. Suppose the second state is the most similar state selected in the first step. Because time span A precedes the second state and lies within the diphone unit "t a", the speech in span A is copied directly into span A of unit 3 in FIG. 2; because time span F follows the second state and lies within the diphone unit "a o", the speech in span F is copied directly into span F of unit 3 in FIG. 2; and because time spans B and E correspond to the most similar state, the speech data in spans B and E is interpolated and time-stretched to the duration t'1 of the second state computed in the second step, and is then written into span B->E of unit 3 in FIG. 2.
In the above steps, the interpolation methods used for the speech data transition and time stretching include linear interpolation.
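As a concrete illustration of the interpolation transition, the sketch below linearly time-stretches the parameter trajectories of the two transition spans (B and E in FIG. 2) to the merged duration of state n and cross-fades between them; the frame-based parameter layout and helper names are assumptions made for the sake of the example, and the remaining spans (A and F) would simply be copied into the output unchanged.
```python
import numpy as np

def interpolate_transition(params_a, params_b, target_frames):
    """params_a, params_b: (frames, dims) synthesis-parameter trajectories of the
    transition span in the earlier and later unit. Returns a trajectory of
    target_frames frames, time-stretched and cross-faded with linear interpolation."""
    def stretch(p, n_frames):
        src = np.linspace(0.0, 1.0, len(p))
        dst = np.linspace(0.0, 1.0, n_frames)
        return np.stack([np.interp(dst, src, p[:, d]) for d in range(p.shape[1])], axis=1)

    a = stretch(np.asarray(params_a, dtype=float), target_frames)
    b = stretch(np.asarray(params_b, dtype=float), target_frames)
    w = np.linspace(0.0, 1.0, target_frames)[:, None]  # fade weight from unit 1 to unit 2
    return (1.0 - w) * a + w * b
```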
A unit-selection speech synthesis system based on the speech audio unit splicing technique proposed by the present invention comprises a training phase and a run (synthesis) phase. The training phase (shown in FIG. 3) is implemented as follows:
In the first step, the speech waveform data and the phoneme-level time segmentation in the corpus are obtained and speech analysis is performed: the speech waveform data is converted into speech acoustic parameter data and stored, together with the phoneme-level time segmentation, in the speech database (hereinafter the database); a context information sequence is generated from the text corresponding to the speech in the corpus and is also stored in the database.
If speech synthesis parameters different from the speech acoustic parameters are used, the speech synthesis parameters must additionally be computed from the speech waveform data in the corpus and stored in the database.
In the second step, the speech acoustic parameter data and the phoneme-level time segmentation in the database are obtained, the state transition probability distributions and output distributions of the HMMs are initialized, and context-independent models are trained.
The context-independent models can be trained with the Baum-Welch algorithm or with Viterbi training.
Optionally, a hidden semi-Markov model (HSMM) is used instead of the HMM.
Optionally, syllables are used as the speech unit.
In the third step, state-level and phoneme-level time alignment of the database is performed using the context-independent models, and the new phoneme-level alignment results overwrite the original time segmentation in the database, so that the time alignment of the speech units in the database remains consistent with the time alignment of the model states.
In the fourth step, the state tying of the context-independent models is released, turning them into context-dependent models;
In the fifth step, the context-dependent models are trained and the model parameters are stored in the database.
The context-dependent models can be trained with the Baum-Welch algorithm or with Viterbi training.
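As a rough prototype of the second and third training steps above (training a context-independent model and re-aligning the database with it), the sketch below uses hmmlearn's diagonal-covariance Gaussian HMM on synthetic features. This is purely illustrative: the patent does not prescribe a toolkit, a real system would train one model per phoneme (or syllable) with an explicit left-to-right topology, and the subsequent untying into context-dependent models is not shown.
```python
import numpy as np
from hmmlearn import hmm

# synthetic stand-in for the acoustic parameters of one phoneme's occurrences in the corpus
rng = np.random.default_rng(0)
occurrences = [rng.normal(size=(40, 13)) for _ in range(20)]  # 20 tokens, 13-dim features
X = np.concatenate(occurrences)
lengths = [len(o) for o in occurrences]

# second step (sketch): initialize and train a context-independent model with diagonal covariances
model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(X, lengths)

# third step (sketch): state-level alignment of each occurrence via Viterbi decoding;
# in the patent, this alignment overwrites the phoneme-level segmentation stored in the database
state_alignments = [model.predict(o) for o in occurrences]
```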
In the unit-selection speech synthesis system based on the speech audio unit splicing technique proposed by the present invention, the run (synthesis) phase (shown in FIG. 4) is implemented as follows:
In the first step, the text to be synthesized is obtained, and the context information sequence corresponding to the text to be synthesized is generated;
In the second step, according to the context information sequence, the context information of each speech unit in the database is compared with the context information of the text to be synthesized, and a set of candidate speech units is selected, based on context similarity, for each phoneme (or other specified phonetic unit) contained in the text to be synthesized;
In the third step, using the candidate speech units obtained in the second step, the speech acoustic parameters in the database, and the phoneme-level or state-level time segmentation in the database, the splicing distances between consecutive speech units are computed, and the Viterbi algorithm is used to find the speech unit sequence that simultaneously minimizes the splicing distance and the context error.
A more detailed implementation of this step can be found in A. Black, et al., "Optimising selection of units from speech databases for concatenative synthesis," EUROSPEECH 95, pages 581-584, Madrid, Spain, 1995.
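The unit search in this step is a standard dynamic-programming (Viterbi) selection over the candidate lists. The following sketch minimizes the sum of a context (target) cost and a splicing (concatenation) cost; the two cost functions are passed in as parameters because their exact form is not fixed here.
```python
import numpy as np

def select_units(candidates, target_cost, concat_cost):
    """candidates: list over positions, each a list of candidate units;
    target_cost(unit, position) and concat_cost(prev_unit, unit) return floats.
    Returns the unit sequence minimizing total target + concatenation cost."""
    T = len(candidates)
    best = [[target_cost(u, 0) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for t in range(1, T):
        cur, ptr = [], []
        for u in candidates[t]:
            costs = [best[t - 1][j] + concat_cost(candidates[t - 1][j], u)
                     for j in range(len(candidates[t - 1]))]
            j = int(np.argmin(costs))
            cur.append(costs[j] + target_cost(u, t))
            ptr.append(j)
        best.append(cur)
        back.append(ptr)
    # backtrack the lowest-cost path through the candidate lattice
    j = int(np.argmin(best[-1]))
    path = [candidates[-1][j]]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(candidates[t - 1][j])
    return path[::-1]
```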
In the fourth step, for each pair of adjacent speech units in the unit sequence generated in the third step, the most similar corresponding state is found according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of the two speech segments;
The similarity of two corresponding states can be computed in various ways, for example with the following steps:
a. Obtain the mean vectors μa, μb and the diagonal covariance vectors σa, σb of the output distributions of the corresponding states;
b. Compute the mean vector μ' and the diagonal covariance vector σ' of the merged output distributions:
[The expressions for μ' and σ' are given as equation images in the original publication: PCTCN2015086931-appb-000007 and PCTCN2015086931-appb-000008.]
c. From the merged diagonal covariance vector σ' obtained in step b, compute the determinant:
[The expression for det(Σ') is given as an equation image in the original publication: PCTCN2015086931-appb-000009.]
where Σ' is the diagonal covariance matrix constructed from σ', and K is the dimensionality of the acoustic parameters modelled by the HMM.
Finally, the values of det(Σ') obtained in step c are compared. The state with the smallest det(Σ') is the most similar model state; the index of that state is recorded.
Optionally, the L-N distance between the means of the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Mahalanobis distance between the output distributions of the corresponding states is used to reflect the similarity between states.
Optionally, the Kullback-Leibler divergence between the output distributions of the corresponding states is used to reflect the similarity between states.
In the fifth step, for each pair of adjacent speech units in the unit sequence generated in the third step, using the most-similar state index obtained in the fourth step, the state-level time segmentation of the joint-portion phonemes of the two speech segments computed and stored during the training phase is retrieved, and the duration of each state after splicing is calculated;
a. Let the most-similar state index obtained above be n (state indices start from 0), let N be the number of states per phoneme, and let the durations of the states in the time segmentation of the joint-portion phonemes of the two speech segments be represented by the vectors ta and tb respectively; the durations of the states after splicing are represented by the vector t'. Here ta corresponds to the earlier speech unit and tb to the later speech unit.
b. The durations t' of the states after splicing are calculated as follows: the states before n keep their durations in ta; the states after n keep their durations in tb; the duration of state n is the average of the durations that state has in ta and tb:
t'_i = t_{a,i}, for 0 ≤ i < n
t'_n = (t_{a,n} + t_{b,n}) / 2
t'_i = t_{b,i}, for n < i < N
c. Optionally, a minimum post-splicing state duration tmin is set, such that t'n ≥ tmin, to prevent the transition segment from being too short and harming the continuity of the speech.
In the sixth step, for each pair of adjacent speech units in the unit sequence generated in the third step, according to the speech synthesis parameter data in the database and the state durations obtained in the fifth step, the speech synthesis parameters contained in the two speech segments are spliced and interpolated.
The time span that the most similar model state determined in the fourth step occupies in the database speech is the time span over which the interpolation transition is performed during splicing. The speech synthesis parameters in the time spans of the remaining states are copied directly, without processing, into the target speech synthesis parameter sequence.
In the above steps, the interpolation methods used for the speech data transition and time stretching include linear interpolation.
In the seventh step, the speech waveform is generated with a vocoder from the speech synthesis parameter sequence generated in the sixth step. The synthesis method is determined by the specific vocoder algorithm used and is not specifically limited by the present invention.
Compared with the traditional way of splicing speech segments in spliced speech synthesis, the present invention automatically selects the time span of the interpolation transition according to the similarity and the trend of change of the acoustic parameters in different regions of the joint portion, thereby avoiding the situation in which the speech parameters of the corresponding regions differ too much during the transition and the synthesized speech becomes discontinuous or blurred.

Claims (1)

  1. A speech audio unit splicing method mainly used for spliced speech synthesis, comprising the following steps: finding the most similar corresponding state according to the context-dependent HMM state sequences corresponding to the joint-portion phonemes of two adjacent speech segments; retrieving the pre-computed and stored state-level time segmentation of the joint-portion phonemes and calculating the duration of each state after splicing; and, according to the speech synthesis parameter data in the database and the state durations, splicing and interpolating the speech synthesis parameters contained in the two speech segments. A spliced speech synthesis system based on the splicing method can, when splicing speech units, automatically select the portion of the two adjacent units where the acoustic characteristics differ least and change most smoothly for the interpolation transition, thereby effectively improving the clarity and recognizability of the synthesized speech.
PCT/CN2015/086931 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method WO2017028003A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Publications (1)

Publication Number Publication Date
WO2017028003A1 true WO2017028003A1 (en) 2017-02-23

Family

ID=58050590

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086931 WO2017028003A1 (en) 2015-08-14 2015-08-14 Hidden markov model-based voice unit concatenation method

Country Status (1)

Country Link
WO (1) WO2017028003A1 (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059190A1 (en) * 2006-08-22 2008-03-06 Microsoft Corporation Speech unit selection using HMM acoustic models
CN101178896A (en) * 2007-12-06 2008-05-14 安徽科大讯飞信息科技股份有限公司 Unit selection voice synthetic method based on acoustics statistical model
CN101471071A (en) * 2007-12-26 2009-07-01 中国科学院自动化研究所 Speech synthesis system based on mixed hidden Markov model
CN103531196A (en) * 2013-10-15 2014-01-22 中国科学院自动化研究所 Sound selection method for waveform concatenation speech synthesis

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
HU , KE ET AL.: "HMM-based Mandarin Speech Synthesis System", COMMUNICATIONS TECHNOLOGY, vol. 45, no. 8, 31 August 2012 (2012-08-31), pages 101 - 103 , 108, ISSN: 1002-0802 *
TOSHIO HIRAI ET AL.: "USING 5 ms SEGMENTS IN CONCATENATIVE SPEECH SYNTHESIS", 5TH ISCA SPEECH SYNTHESIS WORKSHOP, 16 June 2004 (2004-06-16), pages 37 - 42, XP055365714 *
YIN, YONG ET AL.: "Smoothing algorithm for contextual phone concatenation in speech synthesis", JOURNAL OF TSINGHUA UNIVERSITY( SCIENCE AND TECHNOLOGY, vol. 48, no. Sl, 31 December 2008 (2008-12-31), pages 640 - 644, ISSN: 1000-0054 *
ZHANG, PENG ET AL.: "On transitional algorithm of waveform concatenation in speech synthesis system", JOURNAL OF NATURAL SCIENCE OF HEILONGJIANG UNIVERSITY, vol. 28, no. 6, 31 December 2011 (2011-12-31), pages 867 - 870, ISSN: 1001-7011 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112185338A (en) * 2020-09-30 2021-01-05 北京大米科技有限公司 Audio processing method and device, readable storage medium and electronic equipment
CN112185338B (en) * 2020-09-30 2024-01-23 北京大米科技有限公司 Audio processing method, device, readable storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US10347238B2 (en) Text-based insertion and replacement in audio narration
JP5665780B2 (en) Speech synthesis apparatus, method and program
US10741169B1 (en) Text-to-speech (TTS) processing
JP4469883B2 (en) Speech synthesis method and apparatus
EP2140447B1 (en) System and method for hybrid speech synthesis
US10497362B2 (en) System and method for outlier identification to remove poor alignments in speech synthesis
JP4406440B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP3910628B2 (en) Speech synthesis apparatus, speech synthesis method and program
JP6293912B2 (en) Speech synthesis apparatus, speech synthesis method and program
Khan et al. Concatenative speech synthesis: A review
CN101131818A (en) Speech synthesis apparatus and method
Bellur et al. Prosody modeling for syllable-based concatenative speech synthesis of Hindi and Tamil
JP4639932B2 (en) Speech synthesizer
JP4225128B2 (en) Regular speech synthesis apparatus and regular speech synthesis method
WO2017028003A1 (en) Hidden markov model-based voice unit concatenation method
JP2009133890A (en) Voice synthesizing device and method
JP4247289B1 (en) Speech synthesis apparatus, speech synthesis method and program thereof
JP5874639B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
Bunnell et al. The ModelTalker system
JP5328703B2 (en) Prosody pattern generator
Latsch et al. Pitch-synchronous time alignment of speech signals for prosody transplantation
JPWO2008139919A1 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
JP5275470B2 (en) Speech synthesis apparatus and program
JP4034751B2 (en) Speech synthesis apparatus, speech synthesis method, and speech synthesis program
Carvalho et al. Concatenative speech synthesis for European Portuguese

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15901209

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15901209

Country of ref document: EP

Kind code of ref document: A1