CN107103900B - Cross-language emotion voice synthesis method and system - Google Patents

Cross-language emotion voice synthesis method and system

Info

Publication number
CN107103900B
CN107103900B (application CN201710415814.5A)
Authority
CN
China
Prior art keywords
language
labeling
file
neutral
target emotion
Prior art date
Legal status
Active
Application number
CN201710415814.5A
Other languages
Chinese (zh)
Other versions
CN107103900A (en)
Inventor
杨鸿武
吴沛文
Current Assignee
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date
Filing date
Publication date
Application filed by Northwest Normal University
Priority to CN201710415814.5A
Publication of CN107103900A
Application granted
Publication of CN107103900B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L 13/10 — Prosody rules derived from text; stress or intonation determination
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/1807 — Speech classification or search using natural language modelling using prosody or stress
    • G10L 15/187 — Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/19 — Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 2015/0631 — Creating reference templates; clustering

Abstract

The invention discloses a cross-language emotion voice synthesis method and system. First, a context-dependent labeling format and a context-dependent clustering problem set are established. Second, a first language labeling file, a second language labeling file, a target emotion Mandarin labeling file, a labeling file to be synthesized, first language acoustic parameters, second language acoustic parameters and target emotion acoustic parameters are determined. A multi-speaker target emotion average acoustic model is then determined from the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters. Finally, the labeling file to be synthesized is input into the multi-speaker target emotion average acoustic model to obtain a target emotion speech synthesis file in the first or/and second language, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized.

Description

Cross-language emotion voice synthesis method and system
Technical Field
The invention relates to the technical field of multilingual emotion voice synthesis, in particular to a cross-language emotion voice synthesis method and a cross-language emotion voice synthesis system.
Background
Current speech synthesis technology can synthesize natural neutral speech, but when human-computer interaction applications such as robots and virtual assistants need to simulate human behavior, plain neutral speech synthesis can no longer meet users' needs. Speech synthesis that can express human emotion and speaking style has therefore become the future direction of speech synthesis.
For emotional speech synthesis in widely spoken languages such as Chinese and English, research investment is large and the technology is well developed; for emotional speech synthesis in languages with fewer users, such as Tibetan, Russian and Spanish, development has been slow, and there is currently no recognized high-standard, high-quality emotional speech corpus for speech synthesis in these languages, so their emotional speech synthesis remains a blank area in the field of speech synthesis.
Current research on emotional speech synthesis at home and abroad covers the waveform concatenation method, the prosodic unit selection method and the statistical parametric method. The waveform concatenation method requires building a huge emotional speech database covering every emotion; the input text is analysed for text and prosody to obtain the basic unit information of the speech to be synthesized, suitable speech units are then selected from the pre-labeled database according to that information, and the selected units are modified and concatenated to obtain the synthesized speech with the target emotion. The synthesized speech has good emotional similarity, but a large speech-unit database covering many emotions must be built in advance, which makes the system very difficult to implement and hard to extend to emotional speech of different speakers and different languages. The prosodic unit selection method integrates prosodic or phonological strategies into unit selection and builds a small or mixed emotional speech database using rules that modify the target F0 contour and duration to obtain emotional speech; because the speech signal must be modified, the quality of the synthesized speech is poor, and emotional speech of different speakers and different languages cannot be synthesized. Owing to these limitations, neither of these two methods is mainstream today. The statistical parametric method has become the mainstream speech synthesis method, but it can only synthesize emotional speech in one language; to synthesize emotional speech in different languages, several emotional speech synthesis systems must be trained, and each system needs an emotional speech training corpus in its own language.
How to overcome these shortcomings of existing emotional speech synthesis methods is therefore an urgent technical problem in the field of multilingual emotional speech synthesis.
Disclosure of Invention
The invention aims to provide a cross-language emotion voice synthesis method and system that train a multi-speaker target emotion average acoustic model with a multi-speaker target emotion Mandarin training corpus, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized simply by changing the file to be synthesized.
In order to achieve the purpose, the invention provides a cross-language emotion voice synthesis method, which comprises the following steps:
establishing a context-related labeling format and a context-related clustering problem set, and respectively carrying out context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
determining a target emotion average acoustic model of the multiple speakers according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters;
carrying out context-dependent text labeling on the file to be synthesized in the first language or/and the second language to obtain a labeled file to be synthesized;
and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language or/and second language target emotion voice synthesis file.
Optionally, the establishing a context-dependent labeling format and a context-dependent clustering problem set, and performing context-dependent text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus, where the specific steps include:
establishing a first language marking rule and a second language marking rule;
determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and respectively performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and establishing a context-dependent clustering problem set according to the similarity of the first language and the second language.
Optionally, the establishing the first language annotation rule and the second language annotation rule specifically includes:
the establishing of the first language marking rule specifically comprises the following steps:
taking SAMPA-SC mandarin machine-readable phonetic symbols as the first language marking rules;
the establishing of the second language marking rule specifically comprises the following steps:
taking the international phonetic symbols as reference, and obtaining the international phonetic symbols for inputting the pinyin of the second language based on the SAMPA-SC mandarin machine-readable phonetic symbols;
judging whether the international phonetic symbol of the second language pinyin is consistent with the international phonetic symbol of the first language pinyin; if the two languages are consistent, the machine-readable phonetic symbols of the SAMPA-SC mandarin Chinese are directly adopted to mark the pinyin of the second language; otherwise, according to the simplification principle, marking by using the user-defined unused keyboard symbol.
Optionally, the determining a context-dependent markup format according to the first language markup rule and the second language markup rule includes:
according to the grammar rule knowledge bases and grammar dictionaries of the first language and the second language, carrying out text normalization, syntactic analysis and prosodic structure analysis on non-normalized input texts in the first language and the second language to obtain the standard text, the length information of prosodic words and phrases, prosodic boundary information, word-related information and tone information;
substituting the standard text into the first language labeling rule to obtain a single-phone labeling file of a first language; or bringing the standard text into the second language labeling rule to obtain a single-phone labeling file of a second language;
and determining a context-related labeling format according to the length information of the prosodic words and phrases, prosodic boundary information, word related information, tone information and the single-phone labeling file.
Optionally, the determining a target emotion average acoustic model of multiple speakers according to the first language markup file, the second language markup file, the target emotion mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter includes:
taking a first language markup file, a second language markup file, a first language acoustic parameter and a second language acoustic parameter as a training set, and obtaining a neutral average acoustic model of a mixed language through adaptive training of a speaker based on an adaptive model;
according to a neutral average acoustic model of mixed languages, a target emotion Mandarin marking file and target emotion acoustic parameters are used as a test set, and a multi-speaker target emotion average acoustic model is obtained through speaker self-adaptive transformation.
Optionally, the specific steps of obtaining the multi-speaker target emotion average acoustic model, by taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set and carrying out speaker adaptive transformation according to the neutral average acoustic model of the mixed language, are as follows:
adopting a constrained maximum likelihood linear regression algorithm to calculate a covariance matrix and a mean vector of state duration probability distribution and state output probability distribution of the speaker, and transforming the covariance matrix and the mean vector of the neutral average acoustic model into a target speaker model by using a group of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formula is as follows:
p_i(d) = N(d; α·m_i − β, α·σ_i²·α) = |α⁻¹|·N(α·ψ; m_i, σ_i²)    (7)
b_i(o) = N(o; A·μ_i − b, A·Σ_i·Aᵀ) = |A⁻¹|·N(W·ξ; μ_i, Σ_i)    (8)
where i is the state index, d is the state duration, N(·) denotes a Gaussian distribution, p_i(d) is the transformed state-duration distribution, m_i is the mean of the duration distribution and σ_i² its variance, ψ = [d, 1]ᵀ, o is the observation (feature) vector, ξ = [oᵀ, 1]ᵀ, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state-duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the target speaker's state-output probability density distribution;
through an adaptive transformation algorithm based on MSD-HSMM, the fundamental frequency, frequency spectrum and time length parameters of voice data can be transformed and normalized; for adaptive data O of length T, the maximum likelihood estimation can be performed by transforming Λ ═ W, X:
Λ̂ = (Ŵ, X̂) = argmax_Λ p(O | λ, Λ)    (9)
where λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̂ = (Ŵ, X̂) is the maximum likelihood estimate of the transforms;
carrying out maximum likelihood estimation on the converted and normalized time length, frequency spectrum and fundamental frequency parameters, and updating and correcting the speaker correlation model by adopting a maximum posterior probability algorithm, wherein the specific formula is as follows:
γ_t^d(i) = (1/P(O|λ))·α_{t−d}(i)·p_i(d)·( ∏_{s=t−d+1}^{t} b_i(o_s) )·β_t(i)    (10)
MAP estimation:
m̂_i = ( τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d ) / ( τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) )    (11)
μ̂_i = ( ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·Σ_{s=t−d+1}^{t} o_s ) / ( ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d·γ_t^d(i) )    (12)
where λ is the given MSD-HSMM parameter set, T is the length of the adaptation data O, o_t is the observation at time t, d is the state duration, i is the state index, γ_t^d(i) is the probability of the consecutive observation sequence o_{t−d+1} … o_t occupying state i, α_t(i) is the forward probability, β_t(i) is the backward probability, m̄_i and μ̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the duration distribution, and m̂_i and μ̂_i are the weighted-average MAP estimates of the adapted duration and output mean vectors, respectively.
The invention also provides a cross-language emotion voice synthesis system, which comprises the following components:
the language corpus text labeling and parameter extracting module is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of single speakers to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of a first language training corpus and a second language training corpus to obtain first language acoustic parameters corresponding to the first language training corpus and second language acoustic parameters corresponding to the second language training corpus;
the target emotion corpus text labeling and parameter extracting module is used for carrying out context-related text labeling on a target emotion Mandarin training corpus of multiple speakers according to a context-related labeling format and a context-related clustering problem set to obtain a target emotion Mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module is used for determining a target emotion average acoustic model of the multiple speakers according to the first language markup file, the second language markup file, the target emotion Mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter;
the device comprises a to-be-synthesized labeled file determining module, a labeling module and a labeling module, wherein the to-be-synthesized labeled file determining module is used for performing context-related text labeling on a to-be-synthesized file in a first language or/and a second language to obtain a to-be-synthesized labeled file;
and the voice synthesis file determining module is used for inputting the to-be-synthesized marking file into the multi-speaker target emotion average acoustic model to obtain a first language or/and second language target emotion voice synthesis file.
Optionally, the language corpus text labeling module specifically includes:
the marking rule establishing submodule is used for establishing a first language marking rule and a second language marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and the phonetic transcription system and the question set establishing submodule are used for establishing a context-related clustering question set according to the similarity of the first language and the second language.
Optionally, the target emotion average acoustic model determining module specifically includes:
the neutral average acoustic model determining submodule of the mixed language is used for taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a neutral average acoustic model of the mixed language through speaker adaptive training based on an adaptive model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin marking file and the target emotion acoustic parameters as a test set according to the neutral average acoustic model of the mixed language and obtaining the target emotion average acoustic model of the multiple speakers through adaptive transformation of the speakers.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
1) A multi-speaker target emotion average acoustic model can be trained with the multi-speaker target emotion Mandarin training corpus, and emotional speech in another language or in several languages can be synthesized simply by changing the file to be synthesized, which widens the range of speech synthesis.
2) Because the multi-speaker target emotion average acoustic model is trained with the multi-speaker target emotion Mandarin training corpus, emotional speech of the same speaker in different languages can be synthesized, and emotional speech of different speakers in different languages can be synthesized as well.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a cross-language emotion speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the Tibetan annotation rule according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart illustrating the process of creating a context-dependent annotation format according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of acoustic parameter extraction according to an embodiment of the present invention;
FIG. 5 is a block diagram of a cross-language emotion speech synthesis system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention aims to provide a cross-language emotion voice synthesis method and system that train a multi-speaker target emotion average acoustic model with a multi-speaker target emotion Mandarin training corpus, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized simply by changing the file to be synthesized.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention discloses a first language and a second language, wherein the first language is any one of Chinese, English, German and French, and the second language is any one of Tibetan, Spanish, Japanese, Arabic, Korean and Portuguese. In the embodiment of the present invention, Chinese is taken as the first language and Tibetan as the second language for discussion. FIG. 1 is a flowchart of the cross-language emotion voice synthesis method in the embodiment of the present invention, as shown in FIG. 1.
The invention specifically provides a cross-language emotion voice synthesis method, which specifically comprises the following steps:
step 100: establishing a context-related labeling format and a context-related clustering problem set which are common to Chinese and Tibetan, and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
Step 200: performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; and extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters.
Step 300: and determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters.
Step 400: and carrying out context-related text labeling on the file to be synthesized of the Chinese or/and the Tibetan to obtain the file to be synthesized.
Step 500: and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese or/and Tibetan target emotion voice synthesis file.
The following describes the steps in detail:
step 100: establishing a context-related labeling format and a context-related clustering problem set which are common to Chinese and Tibetan, and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
Step 101: and establishing Chinese marking rules and Tibetan marking rules.
Step 1011: and taking SAMPA-SC Mandarin Chinese machine-readable phonetic symbols as the Chinese marking rules.
Step 1012: the establishing of the Tibetan language labeling rule specifically comprises the following steps:
at present, the Chinese mandarin machine-readable phonetic symbol SAMPA-SC tends to be mature and widely applied, and the Tibetan language and the Chinese language have many similarities in pronunciation, for example, in a Chinese-Tibetan language system, the Chinese language and the Tibetan language have commonalities and differences in pronunciation, the Tibetan language Lassa dialect and the Chinese mandarin are both composed of syllables, each syllable comprises 1 final and 1 initial, the Tibetan language Lassa dialect comprises 45 final and 36 initial, the mandarin comprises 39 final and 22 initial, the two syllables share 13 final and 20 initial, and 4 tones are different only in tone value. Therefore, the invention designs a set of Tibetan language computer readable phonetic symbol SAMPA-T, namely Tibetan language labeling rule, based on SAMPA-SC according to the pronunciation characteristics of the Tibetan language. See figure 2 for details.
And obtaining the international phonetic symbols of the Pinyin of the input Tibetan language based on the SAMPA-SC mandarin machine-readable phonetic symbols by taking the international phonetic symbols as reference.
And judging whether the international phonetic symbols of the Tibetan pinyin are consistent with the international phonetic symbols of the Chinese pinyin; if so, the Tibetan pinyin is labeled directly with the SAMPA-SC Mandarin machine-readable phonetic symbols; otherwise, following the simplification principle, it is labeled with self-defined, otherwise unused keyboard symbols.
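As an illustration of this labeling decision, the following Python sketch maps the IPA form of one Tibetan phone either to an existing SAMPA-SC symbol or to a self-defined keyboard symbol. The mapping tables are illustrative placeholders only; the actual SAMPA-T inventory is the one designed in FIG. 2 and is not reproduced here.

```python
# Minimal sketch of the SAMPA-T labeling decision described above.
# The IPA-to-SAMPA-SC table and the custom keyboard symbols below are
# illustrative placeholders, not the patent's actual SAMPA-T inventory.

# IPA symbols that Lhasa Tibetan shares with Mandarin map straight to the
# existing SAMPA-SC machine-readable symbol.
SHARED_IPA_TO_SAMPA_SC = {
    "pʰ": "ph",   # aspirated bilabial stop (shared initial, placeholder)
    "a":  "a",    # open vowel (shared final, placeholder)
}

# Tibetan-only sounds get a self-defined, otherwise unused keyboard symbol.
TIBETAN_ONLY_TO_CUSTOM = {
    "ɲ": "J",     # palatal nasal without a Mandarin counterpart (placeholder)
}

def sampa_t_symbol(ipa_symbol: str) -> str:
    """Return the machine-readable label for one Tibetan phone given its IPA form."""
    if ipa_symbol in SHARED_IPA_TO_SAMPA_SC:
        # Consistent with Mandarin: reuse the SAMPA-SC symbol directly.
        return SHARED_IPA_TO_SAMPA_SC[ipa_symbol]
    if ipa_symbol in TIBETAN_ONLY_TO_CUSTOM:
        # Not in Mandarin: fall back to a custom unused keyboard symbol.
        return TIBETAN_ONLY_TO_CUSTOM[ipa_symbol]
    raise KeyError(f"no SAMPA-T rule defined for IPA symbol {ipa_symbol!r}")

if __name__ == "__main__":
    print([sampa_t_symbol(s) for s in ["pʰ", "a", "ɲ"]])
```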
Step 102: determining a context-related labeling format which is common to Chinese and Tibetan according to a Chinese labeling rule and a Tibetan labeling rule, respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of a single speaker according to the context-related labeling format, and respectively obtaining a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus, which are specifically shown in FIG. 3.
Step 1021: according to a grammar rule knowledge base and a grammar dictionary of Chinese and Tibetan, text normalization, grammar analysis and prosodic structure analysis are carried out on input texts with irregular Chinese and Tibetan to obtain normalized texts, length information of prosodic words and phrases, prosodic boundary information, word related information and tone information.
Step 1022: substituting the standard text into the Chinese labeling rule to obtain a single-phone labeling file of Chinese; or bringing the standard text into the Tibetan language labeling rule to obtain a single-phone labeling file of the Tibetan language.
Step 1023: and determining a context-related labeling format commonly used by Chinese and Tibetan according to the length information, prosodic boundary information, word related information, tone information and a single-phoneme labeling file of the prosodic words and phrases.
The context-dependent labeling format is used to label the context information of the pronunciation units (initials and finals). It contains six layers: initials/finals, syllables, words, prosodic words, prosodic phrases and sentences, and represents the pronunciation units (initials and finals) and their context-related information in different contexts.
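The sketch below shows how one context-dependent label line could be assembled from such layered context information. The exact field order of the patent's labeling format is not given in the text, so the HTS-like layout used here is an assumption for illustration only.

```python
# Illustrative construction of one context-dependent label line for a phone.
# The field layout below (phone context plus syllable/word/prosodic fields)
# is an assumed, HTS-like format used only for demonstration.
from dataclasses import dataclass

@dataclass
class PhoneContext:
    prev_phone: str
    phone: str
    next_phone: str
    pos_in_syllable: int      # position of the phone in its syllable
    tone: int                 # lexical tone of the current syllable
    word_len: int             # number of syllables in the current word
    prosodic_word_len: int    # number of syllables in the prosodic word
    prosodic_phrase_len: int  # number of prosodic words in the phrase
    pos_in_sentence: int      # index of the syllable in the sentence

def to_label(c: PhoneContext) -> str:
    """Serialize one phone's context into a single full-context label line."""
    return (f"{c.prev_phone}-{c.phone}+{c.next_phone}"
            f"/A:{c.pos_in_syllable}_{c.tone}"
            f"/B:{c.word_len}"
            f"/C:{c.prosodic_word_len}_{c.prosodic_phrase_len}"
            f"/D:{c.pos_in_sentence}")

if __name__ == "__main__":
    ctx = PhoneContext("n", "i", "h", 2, 3, 2, 2, 3, 1)
    print(to_label(ctx))   # e.g. n-i+h/A:2_3/B:2/C:2_3/D:1
```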
Step 1024: and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of a single speaker according to a context-related labeling format to respectively obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus.
Step 103: and establishing a general context related clustering problem set of the Chinese and the Tibetan according to the similarity of the Chinese and the Tibetan.
Step 104: and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus, which are specifically shown in FIG. 4.
During acoustic parameter extraction, the speech signal is analysed to obtain acoustic features such as the fundamental frequency and the spectrum. In the present invention, the mel-generalized cepstral coefficients (mgc) are used as the spectral feature representing the spectral envelope, i.e. the filter part of the source-filter model, and the logarithmic fundamental frequency logF0 is used as the fundamental-frequency feature. Because the speech signal is not a purely stable periodic signal, errors in the fundamental frequency directly affect the extraction of the spectral envelope, so the extraction of the spectral envelope (the mel-generalized cepstral coefficients mgc) is accompanied by the extraction of the fundamental-frequency feature (the logarithmic fundamental frequency logF0).
The acoustic parameter extraction includes extraction of the mel-generalized cepstral coefficients mgc, extraction of the logarithmic fundamental frequency logF0, and extraction of the aperiodic component bap.
The spectral model used to extract the mel-generalized cepstral coefficients mgc is as follows:
H(z) = s_γ⁻¹( Σ_{m=0}^{M} c_{α,γ}(m)·z̃⁻ᵐ )    (1)
where z̃⁻ᵐ (with z̃⁻¹ = (z⁻¹ − α)/(1 − α·z⁻¹), |α| < 1) is an m-th order all-pass function, γ determines the character of the system function, c_{α,γ}(m) are the filter coefficients, M is the total number of filter coefficients, m is the coefficient index, and z is the z-transform variable of the discrete signal.
When −1 < γ < 0, the coefficients c_{α,γ}(m) give the mel-generalized cepstral (mgc) model; when γ = −1, the model is an autoregressive (all-pole) model; and when γ = 0, it is an exponential model.
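As a rough illustration of frame-level mgc extraction, the sketch below windows the signal and calls a mel-generalized cepstral analysis routine. It assumes the third-party libraries numpy and pysptk are available (pysptk.mgcep implements mel-generalized cepstral analysis); the frame length, order, alpha and gamma values are illustrative and not the patent's settings.

```python
# Sketch of per-frame mel-generalized cepstral (MGC) extraction.
# Assumes numpy and pysptk are installed; all analysis settings are examples.
import numpy as np
import pysptk

def extract_mgc(x, frame_len=1024, hop=256, order=24, alpha=0.42, gamma=-1.0 / 3):
    """Return an (n_frames, order+1) array of MGC coefficients for waveform x."""
    window = np.blackman(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * window
        frames.append(pysptk.mgcep(frame, order=order, alpha=alpha, gamma=gamma))
    return np.asarray(frames)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # One second of a synthetic tone plus a little noise as a stand-in signal.
    x = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.default_rng(0).standard_normal(sr)
    print(extract_mgc(x).shape)
```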
Logarithmic fundamental frequency logF0:
the method adopts a normalized autocorrelation function method to extract fundamental frequency characteristics, and comprises the following specific steps:
for speech signal s (N), N ≦ N, N ∈ N+The autocorrelation function is:
Figure BDA0001313631070000113
where K is the delay time and should be set to be an integer multiple of the pitch period, s (N + K) is s (N) adjacent speech signals, N is an integer, and K is the maximum number of delay times.
Normalizing the autocorrelation function ACF(k) gives the normalized autocorrelation function:
NACF(k) = ACF(k) / √(e₀·e_k)    (3)
where e_k = Σ_{n=0}^{N−1−k} s²(n + k) is the short-time energy at lag k, and e₀ is e_k at k = 0.
When the maximum value of the autocorrelation function is found, the delay value k of the function is the pitch period. The reciprocal of the pitch period is the fundamental frequency, and the logarithm of the fundamental frequency is the logarithmic fundamental frequency logF0 to be extracted.
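The following numpy sketch follows equations (2)-(3) directly for a single frame: the autocorrelation is normalized by √(e₀·e_k), the lag with the largest normalized value is taken as the pitch period, and the logarithm of its reciprocal gives logF0. The search range and frame handling are simplified for illustration; a voiced/unvoiced decision would be needed for real speech.

```python
# Normalized-autocorrelation logF0 estimate for one voiced frame,
# following equations (2)-(3). Search range and framing are simplified.
import numpy as np

def frame_logf0(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate log F0 of one voiced frame via the normalized autocorrelation."""
    frame = frame - frame.mean()
    kmin, kmax = int(sr / fmax), int(sr / fmin)
    n = len(frame) - kmax                       # samples compared at every lag
    e0 = np.sum(frame[:n] ** 2)
    best_k, best_nacf = kmin, -np.inf
    for k in range(kmin, kmax):
        acf = np.sum(frame[:n] * frame[k:k + n])          # equation (2)
        ek = np.sum(frame[k:k + n] ** 2)
        nacf = acf / np.sqrt(e0 * ek + 1e-12)             # equation (3)
        if nacf > best_nacf:
            best_k, best_nacf = k, nacf
    f0 = sr / best_k                        # reciprocal of the pitch period
    return np.log(f0)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(0, 0.04, 1 / sr)                        # one 40 ms frame
    frame = np.sin(2 * np.pi * 200 * t)                   # 200 Hz test tone
    print(np.exp(frame_logf0(frame, sr)))                 # ~200 Hz
```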
Non-periodic component bap extraction:
the non-periodic components of the speech signal are defined as the relative energy levels of the non-periodic components in a frequency domain, the non-periodic component value ap of a linear domain is calculated through the ratio of the energy of the non-harmonic components to the total energy of a spectrum with a fixed fundamental frequency value structure, namely the non-periodic component value ap of the linear domain can be determined by subtracting the upper and lower spectrum envelopes, and the specific formula is as follows:
P_AP(ω′) = ∫ w_ERB(λ′; ω′)·[S_L(λ′) − S_U(λ′)] dλ′    (4)
where P_AP(ω′) is the log-domain aperiodicity value, S_L(λ′) and S_U(λ′) are the spectral energies of the lower and upper envelopes of the spectrum S(λ′), w_ERB(λ′; ω′) is a smoothing auditory (ERB) filter, λ′ is the fundamental-frequency axis variable, and ω′ is frequency.
The band aperiodicity bap is obtained by averaging ap within each frequency band of every frame:
bap(ω′) = (1/N_b)·Σ_{λ′∈band(ω′)} P_AP(λ′)
where bap(ω′) is the band aperiodicity component and N_b is the number of aperiodicity values in the band.
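A minimal numpy sketch of the band averaging described above is given below; the frequency band edges are placeholders and are not values taken from the patent.

```python
# Band averaging of per-frequency aperiodicity values into bap features.
# The band edges are placeholders chosen only for illustration.
import numpy as np

def band_average_ap(ap, sr):
    """ap: (n_frames, n_bins) per-frequency aperiodicity -> (n_frames, n_bands) bap."""
    n_frames, n_bins = ap.shape
    freqs = np.linspace(0, sr / 2, n_bins)
    band_edges = [0, 1000, 2000, 4000, 6000, sr / 2]      # placeholder bands (Hz)
    bap = np.zeros((n_frames, len(band_edges) - 1))
    for b in range(len(band_edges) - 1):
        mask = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
        bap[:, b] = ap[:, mask].mean(axis=1)              # average ap inside the band
    return bap

if __name__ == "__main__":
    dummy_ap = np.random.rand(10, 513)    # 10 frames, 513 FFT bins (placeholder)
    print(band_average_ap(dummy_ap, sr=16000).shape)      # (10, 5)
```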
Step 200: performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to a context-dependent labeling format and a context-dependent clustering problem set to obtain a target emotion mandarin labeling file; and extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters.
The acoustic parameter extraction of the target emotion Mandarin training corpus is the same as the acoustic parameter extraction of the neutral Chinese training corpus and the neutral Tibetan training corpus. See formulas (1) - (4) for details.
Step 300: and determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters.
Step 301: the Chinese labeling file, the Tibetan labeling file, the Chinese acoustic parameters and the Tibetan acoustic parameters are used as training sets, and a neutral average acoustic model of the mixed language is obtained through adaptive training of speakers based on an adaptive model. The self-adaptive model is any one of a deep learning model, a long-time and short-time memory model and a hidden Markov model. The invention adopts a semi-hidden Markov model for analysis.
The invention adopts a constrained maximum likelihood linear regression (CMLLR) algorithm: the difference between the average acoustic model and each training speaker's speech data is expressed by a linear regression function, the differences among the training speakers are normalized by a set of linear regression formulas for the state duration distribution and the state output distribution, and a context-dependent multi-space probability distribution hidden semi-Markov model (MSD-HSMM) is obtained by training. Speaker-adaptive training based on the MSD-HSMM improves the quality of the synthesized speech and reduces the influence of inter-speaker differences on it. The linear regression formulas for the state duration distribution and the state output distribution are as follows:
m_i^(s) = X^(s)·φ = α^(s)·d_i + β^(s),  φ = [d_i, 1]ᵀ    (5)
μ_i^(s) = W^(s)·ξ = A^(s)·o_i + b^(s),  ξ = [o_iᵀ, 1]ᵀ    (6)
Formula (5) is the state-duration distribution transformation equation: i is the state index (a subscript i denotes state i), s is a training speaker's speech data model (a superscript s denotes quantities belonging to speech data model s), m_i^(s) is the mean vector of the state duration of speech data model s, X^(s) = [α^(s), β^(s)] is the transformation matrix for the difference between the state-duration distribution of speech data model s and the average voice model, and d_i is its average duration. Formula (6) is the state-output distribution transformation equation: μ_i^(s) is the mean vector of the state output of speech data model s, W^(s) = [A^(s), b^(s)] is the transformation matrix for the difference between the state-output distribution of speech data model s and the average voice model, and o_i is its average observation vector.
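As an illustration of equations (5) and (6), the numpy sketch below applies a speaker's duration transform X = [α, β] and output transform W = [A, b] to the average-voice statistics; all numerical values are toy examples.

```python
# Applying the speaker-dependent affine transforms of equations (5)-(6)
# to average-voice duration and output means. Values are toy examples.
import numpy as np

def transform_duration_mean(alpha, beta, d_i):
    """Equation (5): speaker-s state-duration mean  m_i^(s) = alpha * d_i + beta."""
    return alpha * d_i + beta

def transform_output_mean(A, b, o_i):
    """Equation (6): speaker-s state-output mean  mu_i^(s) = A @ o_i + b."""
    return A @ o_i + b

if __name__ == "__main__":
    d_i = 7.0                            # average state duration (frames)
    o_i = np.array([1.0, -0.5, 0.25])    # average observation vector (3-dim toy)
    A = np.eye(3) * 1.1                  # toy regression matrix
    b = np.array([0.05, 0.0, -0.02])     # toy bias
    print(transform_duration_mean(1.2, 0.3, d_i))
    print(transform_output_mean(A, b, o_i))
```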
Step 302: according to a neutral average acoustic model of mixed languages, a target emotion Mandarin marking file and target emotion acoustic parameters are used as a test set, and a multi-speaker target emotion average acoustic model is obtained through speaker self-adaptive transformation; the method comprises the following specific steps:
step 3021: adopting a constrained maximum likelihood linear regression algorithm to calculate a covariance matrix and a mean vector of state duration probability distribution and state output probability distribution of the speaker, and transforming the covariance matrix and the mean vector of the neutral average acoustic model into a target speaker model by using a group of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formula is as follows:
p_i(d) = N(d; α·m_i − β, α·σ_i²·α) = |α⁻¹|·N(α·ψ; m_i, σ_i²)    (7)
b_i(o) = N(o; A·μ_i − b, A·Σ_i·Aᵀ) = |A⁻¹|·N(W·ξ; μ_i, Σ_i)    (8)
where i is the state index, d is the state duration, N(·) denotes a Gaussian distribution, p_i(d) is the transformed state-duration distribution, m_i is the mean of the duration distribution and σ_i² its variance, ψ = [d, 1]ᵀ, o is the observation (feature) vector, ξ = [oᵀ, 1]ᵀ, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state-duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the target speaker's state-output probability density distribution;
step 3022: through an adaptive transformation algorithm based on MSD-HSMM, the fundamental frequency, frequency spectrum and time length parameters of voice data can be transformed and normalized; for adaptive data O of length T, the maximum likelihood estimation can be performed by transforming Λ ═ W, X:
Λ̂ = (Ŵ, X̂) = argmax_Λ p(O | λ, Λ)    (9)
where λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̂ = (Ŵ, X̂) is the maximum likelihood estimate of the transforms.
Step 3023: carrying out maximum likelihood estimation on the converted and normalized time length, frequency spectrum and fundamental frequency parameters, and updating and correcting the speaker correlation model by adopting a maximum posterior probability algorithm, wherein the specific formula is as follows:
γ_t^d(i) = (1/P(O|λ))·α_{t−d}(i)·p_i(d)·( ∏_{s=t−d+1}^{t} b_i(o_s) )·β_t(i)    (10)
MAP estimation:
m̂_i = ( τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d ) / ( τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) )    (11)
μ̂_i = ( ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·Σ_{s=t−d+1}^{t} o_s ) / ( ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d·γ_t^d(i) )    (12)
where λ is the given MSD-HSMM parameter set, T is the length of the adaptation data O, o_t is the observation at time t, d is the state duration, i is the state index, γ_t^d(i) is the probability of the consecutive observation sequence o_{t−d+1} … o_t occupying state i, α_t(i) is the forward probability, β_t(i) is the backward probability, m̄_i and μ̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the duration distribution, and m̂_i and μ̂_i are the weighted-average MAP estimates of the adapted duration and output mean vectors, respectively.
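The numpy sketch below illustrates the MAP mean updates of equations (11) and (12): the adapted mean is a weighted combination of the transformed prior mean and the occupancy-weighted adaptation statistics, with τ and ω acting as prior weights. The occupancies and observations are toy values.

```python
# MAP mean updates in the spirit of equations (11)-(12): prior mean plus
# occupancy-weighted adaptation statistics. All inputs are toy values.
import numpy as np

def map_duration_mean(tau, m_bar, gamma, d):
    """MAP estimate of a state-duration mean from occupancies gamma and durations d."""
    return (tau * m_bar + np.sum(gamma * d)) / (tau + np.sum(gamma))

def map_output_mean(omega, mu_bar, gamma, obs_sums, durations):
    """MAP estimate of a state-output mean; obs_sums[k] is the summed observation
    vector of the k-th occupied segment and durations[k] its length in frames."""
    num = omega * mu_bar + np.sum(gamma[:, None] * obs_sums, axis=0)
    den = omega + np.sum(gamma * durations)
    return num / den

if __name__ == "__main__":
    gamma = np.array([0.9, 0.6, 0.3])                 # toy segment occupancies
    d = np.array([5.0, 6.0, 4.0])                     # toy segment durations
    obs_sums = np.array([[5.0, 2.5], [6.6, 3.0], [4.4, 1.8]])
    print(map_duration_mean(tau=10.0, m_bar=5.5, gamma=gamma, d=d))
    print(map_output_mean(omega=10.0, mu_bar=np.array([1.0, 0.5]),
                          gamma=gamma, obs_sums=obs_sums, durations=d))
```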
Step 400: and carrying out context-related text labeling on the file to be synthesized of the Chinese or/and the Tibetan to obtain the file to be synthesized.
The file to be synthesized is a Chinese and/or Tibetan file to be synthesized and can be any of a character, a word, a phrase or a sentence; context-related text labeling is performed on the Chinese and/or Tibetan file to be synthesized according to the context-related text labeling format to obtain the labeling file to be synthesized.
That is, when the text to be synthesized is a Tibetan text, context-related text labeling is performed according to the context-related text labeling format to obtain a Tibetan labeling file to be synthesized; when the text to be synthesized is a Chinese text, context-related text labeling is performed to obtain a Chinese labeling file to be synthesized; and when the text to be synthesized contains both Tibetan and Chinese, context-related text labeling is performed to obtain a mixed Tibetan-Chinese labeling file to be synthesized.
Step 500: and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a target emotion voice synthesis file.
For the labeling file of the text to be synthesized, the question set is used to obtain, from the context-related information of each pronunciation unit (initial or final), the speaker-dependent target emotion average acoustic model of that unit; the speaker-dependent target emotion average acoustic model of the whole sentence to be synthesized is then determined by clustering; an acoustic parameter file of the target emotion in Mandarin and/or Tibetan is generated from the speaker-dependent target emotion average acoustic model; and finally the acoustic parameter file is passed through a speech waveform generator to synthesize the Tibetan and/or Chinese target emotion speech synthesis file.
Namely, inputting the Tibetan language to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Tibetan language target emotion voice synthesis file; inputting the Chinese to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese target emotion voice synthesis file; and inputting the Chinese and Tibetan language to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese and Tibetan mixed target emotion voice synthesis file.
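To summarize the synthesis stage, the runnable Python sketch below strings the steps together using dummy stand-ins: the "acoustic model" returns constant parameter tracks and the "vocoder" writes a simple tone. Only the order of operations mirrors the description above; none of these functions are the patent's actual implementation or an existing library API.

```python
# Runnable end-to-end sketch of the synthesis stage with dummy components.
# Every function here is a placeholder; only the step order follows the text.
import wave
import numpy as np

def labels_to_contexts(label_lines):
    """Placeholder: one 'context' record per non-empty label line."""
    return [line.strip() for line in label_lines if line.strip()]

def predict_parameters(contexts):
    """Placeholder for the speaker-dependent target-emotion average acoustic model:
    returns dummy mgc / logF0 / bap trajectories, 5 frames per label."""
    n = 5 * len(contexts)
    return np.zeros((n, 25)), np.full(n, np.log(200.0)), np.zeros((n, 5))

def vocoder(mgc, logf0, bap, sr=16000, hop=80):
    """Placeholder waveform generator: a sinusoid at exp(logF0), ignoring mgc/bap."""
    out = np.zeros(len(logf0) * hop)
    phase = 0.0
    for i, lf0 in enumerate(logf0):
        t = np.arange(hop) / sr
        out[i * hop:(i + 1) * hop] = 0.1 * np.sin(2 * np.pi * np.exp(lf0) * t + phase)
        phase += 2 * np.pi * np.exp(lf0) * hop / sr
    return out

def synthesize(label_lines, out_path, sr=16000):
    contexts = labels_to_contexts(label_lines)        # labeling file to be synthesized
    mgc, logf0, bap = predict_parameters(contexts)    # acoustic parameter file
    x = vocoder(mgc, logf0, bap, sr=sr)               # speech waveform generator
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes((x * 32767).astype(np.int16).tobytes())

if __name__ == "__main__":
    synthesize(["n-i+h/A:2_3/B:2/C:2_3/D:1"], "demo.wav")
```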
In order to achieve the purpose, the invention also provides a cross-language emotion voice synthesis system.
Fig. 5 is a block diagram of a cross-language emotion speech synthesis system according to an embodiment of the present invention, and as shown in fig. 5, the system includes: the system comprises a language corpus text labeling and parameter extracting module 1, a target emotion corpus text labeling and parameter extracting module 2, a target emotion average acoustic model determining module 3, a to-be-synthesized labeled file determining module 4 and a voice synthesis file determining module 5.
The language corpus text labeling and parameter extracting module 1 is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; the method is used for extracting acoustic parameters of a neutral Chinese training corpus and a neutral Tibetan training corpus respectively to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
The language corpus text labeling and parameter extracting module 1 specifically comprises: the language corpus text labeling module and the language corpus parameter extracting module.
The language corpus text labeling module specifically comprises: a marking rule establishing submodule, a language corpus text marking submodule, a phonetic system and a question set establishing submodule.
The marking rule establishing submodule is used for establishing a Chinese marking rule and a Tibetan marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a Chinese labeling rule and a Tibetan labeling rule, and performing context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers respectively to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus;
the phonetic transcription system and the question set establishing submodule are used for establishing a context-related clustering question set which is universal for Chinese and Tibetan according to the similarity of the Chinese and the Tibetan.
The language corpus parameter extraction module is used for respectively extracting acoustic parameters of a neutral Chinese training corpus and a neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
The target emotion corpus text labeling and parameter extracting module 2 is used for carrying out context-related text labeling on a target emotion mandarin training corpus of multiple speakers to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module 3 is used for determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters;
the target emotion average acoustic model determining module 3 specifically includes: a neutral average acoustic model determining submodule of the mixed language and a target emotion average acoustic model determining submodule.
The neutral average acoustic model determining submodule of the mixed language is used for taking the Chinese labeling file, the Tibetan labeling file, the Chinese acoustic parameters and the Tibetan acoustic parameters as a training set and obtaining a neutral average acoustic model of the mixed language through speaker adaptive training based on an adaptive model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin marking file and the target emotion acoustic parameters as a test set according to the neutral average acoustic model of the mixed language and obtaining the target emotion average acoustic model of the multiple speakers through adaptive transformation of the speakers.
And the to-be-synthesized labeling file determining module 4 is used for performing context-related text labeling on the to-be-synthesized file of the Chinese language or/and the Tibetan language to obtain the to-be-synthesized labeling file.
And the voice synthesis file determining module 5 is used for inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese or/and Tibetan target emotion voice synthesis file.
Specific examples are:
the method records 800 sentences of a female Tibetan speaker as a neutral Tibetan training corpus of a single speaker, a Chinese-English bilingual speech database as a neutral Chinese training corpus of multiple speakers, and records 11 emotions of 9 female speakers, namely 9900 sentences of 11 emotions, as a target emotion mandarin training corpus of the multiple speakers, wherein the emotions in 11 include sadness, relaxation, anger, anxiety, surprise, fear, slight, gentle, joy, disgust and neutrality. Experiments prove that the emotion similarity evaluation score (EMOS) of Tibetan or Chinese speech of the synthesized target emotion is gradually improved along with the increase of the Mandarin target emotion training corpus.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (9)

1. A cross-language emotion voice synthesis method is characterized by comprising the following steps:
establishing a context-related labeling format and a context-related clustering problem set, and respectively carrying out context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
determining a target emotion average acoustic model of the multiple speakers according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters;
carrying out context-dependent text labeling on the file to be synthesized in the first language and/or the second language to obtain a to-be-synthesized labeling file;
and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language and/or second language target emotion voice synthesis file.
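To show the order of the five claimed steps in an executable form, the following Python sketch strings them together with stand-in functions. Every function name and body here is a placeholder invented for illustration; none of them implements the actual labeling, parameter extraction, HSMM training, adaptation or vocoding.

    # Schematic of the claimed method; all helpers are stand-ins, not real implementations.
    def label_context(text, label_format="full-context"):
        # Stand-in for context-dependent text labeling.
        return [f"{label_format}:{unit}" for unit in text.split()]

    def extract_acoustic_params(corpus):
        # Stand-in for spectral / F0 / duration parameter extraction.
        return [{"utt": utt, "params": None} for utt in corpus]

    def train_average_model(labels, params, question_set):
        # Stand-in for speaker-adaptive training of the mixed-language neutral average model.
        return {"type": "neutral-average", "questions": question_set}

    def adapt_to_target_emotion(average_model, emo_labels, emo_params):
        # Stand-in for speaker-adaptive transformation toward the target emotion.
        return {"type": "emotion-average", "base": average_model["type"]}

    def synthesize(model, labels):
        # Stand-in for parameter generation plus vocoding.
        return f"waveform for {len(labels)} labels from {model['type']} model"

    if __name__ == "__main__":
        l1_corpus = ["ni hao"]                      # neutral first-language (Mandarin) corpus
        l2_corpus = ["bkra shis bde legs"]          # neutral second-language (Tibetan) corpus
        emo_corpus = ["jin tian tian qi zhen hao"]  # target-emotion Mandarin corpus
        avg = train_average_model(label_context(" ".join(l1_corpus + l2_corpus)),
                                  extract_acoustic_params(l1_corpus + l2_corpus),
                                  question_set=["C-Vowel?"])
        emo_model = adapt_to_target_emotion(avg, label_context(" ".join(emo_corpus)),
                                            extract_acoustic_params(emo_corpus))
        print(synthesize(emo_model, label_context("bkra shis bde legs")))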
2. The method for synthesizing cross-language emotion speech according to claim 1, wherein the step of establishing a context-dependent labeling format and a context-dependent clustering problem set and respectively performing context-dependent text labeling on the neutral first language training corpus of multiple speakers and the neutral second language training corpus of a single speaker to obtain the first language labeling file corresponding to the neutral first language training corpus and the second language labeling file corresponding to the neutral second language training corpus specifically comprises:
establishing a first language marking rule and a second language marking rule;
determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and respectively performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and establishing a context-dependent clustering problem set according to the similarity of the first language and the second language.
3. The method for synthesizing cross-language emotion speech according to claim 2, wherein the step of establishing the first language labeling rule and the second language labeling rule comprises the following steps:
the establishing of the first language marking rule specifically comprises the following steps:
taking the SAMPA-SC Mandarin machine-readable phonetic symbols as the first language marking rule;
the establishing of the second language marking rule specifically comprises the following steps:
taking the International Phonetic Alphabet as a reference, and obtaining, on the basis of the SAMPA-SC Mandarin machine-readable phonetic symbols, the international phonetic symbols for the pinyin of the second language;
judging whether the international phonetic symbol of the second language pinyin is consistent with the international phonetic symbol of the first language pinyin; if so, directly adopting the SAMPA-SC Mandarin machine-readable phonetic symbols to label the second language pinyin; otherwise, labeling it with a self-defined, otherwise unused keyboard symbol according to the simplification principle.
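The branching described in this claim can be pictured as a small table lookup. In the sketch below, the IPA keys, the SAMPA-SC values and the custom symbol are all invented examples chosen only to show the decision rule, not the actual SAMPA-SC or Tibetan phone inventories.

    # Hypothetical tables: SAMPA-SC symbols keyed by IPA, plus custom symbols for phones
    # that exist only in the second language (values are illustrative, not the real inventory).
    SAMPA_SC_BY_IPA = {"a": "a", "i": "i", "ts\u02b0": "tsh"}
    CUSTOM_BY_IPA = {"\u0272": "J"}   # second-language-only phone -> unused keyboard symbol

    def label_phone(second_language_ipa):
        # If the phone shares its IPA with a Mandarin phone, reuse the SAMPA-SC symbol;
        # otherwise fall back to a self-defined, otherwise-unused keyboard symbol.
        if second_language_ipa in SAMPA_SC_BY_IPA:
            return SAMPA_SC_BY_IPA[second_language_ipa]
        return CUSTOM_BY_IPA.get(second_language_ipa, "?")   # "?" marks an unmapped phone

    print([label_phone(p) for p in ["a", "\u0272", "i"]])    # ['a', 'J', 'i']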
4. The method for synthesizing cross-language emotion speech according to claim 3, wherein the step of determining the context-dependent labeling format according to the first language labeling rule and the second language labeling rule specifically comprises:
according to grammar rule knowledge bases and grammar dictionaries of the first language and the second language, carrying out text normalization, grammatical analysis and prosodic structure analysis on non-standard input texts in the first language and the second language to obtain standard texts, length information of prosodic words and phrases, prosodic boundary information, word related information and tone information;
substituting the standard text into the first language labeling rule to obtain a single-phone labeling file of the first language, or substituting the standard text into the second language labeling rule to obtain a single-phone labeling file of the second language;
and determining a context-related labeling format according to the length information of the prosodic words and phrases, prosodic boundary information, word related information, tone information and the single-phone labeling file.
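To make the final combination step concrete, the sketch below assembles one HTS-style full-context label line from a monophone and a few of the context fields named in this claim. The field layout, separators and example values are invented for illustration and are not the patent's actual labeling format.

    # Assemble a full-context label from a monophone and its context (layout is illustrative only).
    def full_context_label(prev_ph, cur_ph, next_ph, tone, word_len, prosodic_word_len,
                           phrase_len, boundary):
        return (f"{prev_ph}-{cur_ph}+{next_ph}"
                f"/T:{tone}"                # lexical tone of the current syllable
                f"/W:{word_len}"            # length of the current word in syllables
                f"/PW:{prosodic_word_len}"  # length of the current prosodic word
                f"/P:{phrase_len}"          # length of the current prosodic phrase
                f"/B:{boundary}")           # prosodic boundary type after this syllable

    # One label for a syllable inside a two-syllable word (all values hypothetical).
    print(full_context_label("n", "i", "h", tone=3, word_len=2, prosodic_word_len=2,
                             phrase_len=5, boundary="B1"))
    # n-i+h/T:3/W:2/PW:2/P:5/B:B1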
5. The method for synthesizing cross-language emotion speech according to claim 1, wherein the step of determining the multi-speaker target emotion average acoustic model according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters specifically comprises:
taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a mixed-language neutral average acoustic model through speaker adaptive training based on an adaptation model;
and according to the mixed-language neutral average acoustic model, taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set, and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation.
6. The method for synthesizing cross-language emotion voice according to claim 5, wherein the specific steps of taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation according to the mixed-language neutral average acoustic model are as follows:
adopting a constrained maximum likelihood linear regression (CMLLR) algorithm to calculate the mean vectors and covariance matrices of the speaker's state duration probability distributions and state output probability distributions, and transforming the mean vectors and covariance matrices of the neutral average acoustic model into the target speaker model by using a set of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formulas are as follows:

p_i(d) = N(d; αm_i − β, ασ_i²α) = |α⁻¹| N(Xψ; m_i, σ_i²)    (7)

b_i(o) = N(o; Aμ_i − b, AΣ_i A^T) = |A⁻¹| N(Wξ; μ_i, Σ_i)    (8)

wherein i is the state, d is the state duration, N(·) is the Gaussian distribution function, p_i(d) is the transformed state duration distribution, m_i is the mean of the duration distribution, σ_i² is its variance, ψ = [d, 1]^T, o is the observation feature vector, ξ = [o^T, 1]^T, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the state output probability density distribution of the target speaker;
through an adaptive transformation algorithm based on the MSD-HSMM, the fundamental frequency, spectrum and duration parameters of the speech data can be transformed and normalized; for adaptation data O of length T, the transforms Λ = (W, X) are estimated by maximum likelihood:

Λ̃ = (W̃, X̃) = argmax_Λ P(O | λ, Λ)

wherein λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̃ is the maximum likelihood estimate of the transforms;
and carrying out maximum likelihood estimation on the transformed and normalized duration, spectrum and fundamental frequency parameters, and updating and correcting the speaker-dependent model by adopting a maximum a posteriori (MAP) algorithm, wherein the specific formulas are as follows:

γ_t^d(i) = (1 / P(O|λ)) · α_{t−d}(i) · p_i(d) · ∏_{s=t−d+1}^{t} b_i(o_s) · β_t(i)

and the MAP estimates:

μ̂_i = (ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) Σ_{s=t−d+1}^{t} o_s) / (ω + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d)

m̂_i = (τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d) / (τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i))

wherein λ is the given MSD-HSMM parameter set, O is the adaptation data of length T, i is the state, d is the state duration, t is the time index, γ_t^d(i) is the probability of the continuous observation sequence o_{t−d+1} … o_t being generated in state i, α_t(i) is the forward probability, β_t(i) is the backward probability, μ̄_i and m̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the state duration distribution, and μ̂_i and m̂_i are respectively the weighted-average MAP estimates of the adaptation vectors μ̄_i and m̄_i.
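As a numeric illustration of how the duration transform of equation (7) and a weighted-average MAP update act on a single state, the sketch below uses scalar values throughout. All numbers, and the reduction to scalars, are assumptions made for clarity; they are not values from the patent or from trained models.

    # Scalar view of the CMLLR-style duration transform: the adapted duration model
    # has mean alpha*m - beta and variance alpha*sigma^2*alpha (equation (7), scalar case).
    m_i, var_i = 10.0, 4.0      # average-voice duration mean and variance in frames (assumed)
    alpha, beta = 1.2, -0.5     # adaptation transform parameters (assumed)
    m_adapted = alpha * m_i - beta
    var_adapted = alpha * var_i * alpha

    # MAP-style weighted average: blend the transformed mean with an occupancy-weighted
    # statistic from the adaptation data, with tau acting as the prior weight.
    tau = 10.0                  # MAP weight for the duration distribution (assumed)
    occupancy = 25.0            # total state occupancy over the adaptation data (assumed)
    data_mean = 14.0            # occupancy-weighted mean duration in the adaptation data (assumed)
    m_map = (tau * m_adapted + occupancy * data_mean) / (tau + occupancy)

    print(round(m_adapted, 2), round(var_adapted, 2), round(m_map, 3))  # 12.5 5.76 13.571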
7. A cross-language emotion speech synthesis system, comprising:
the language corpus text labeling and parameter extracting module is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; the module is further used for respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
the target emotion corpus text labeling and parameter extracting module is used for carrying out context-related text labeling on a target emotion Mandarin training corpus of multiple speakers according to the context-related labeling format and the context-related clustering problem set to obtain a target emotion Mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module is used for determining a target emotion average acoustic model of the multiple speakers according to the first language markup file, the second language markup file, the target emotion Mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter;
the device comprises a to-be-synthesized labeled file determining module, a labeling module and a labeling module, wherein the to-be-synthesized labeled file determining module is used for performing context-related text labeling on a to-be-synthesized file in a first language or/and a second language to obtain a to-be-synthesized labeled file;
and the voice synthesis file determining module is used for inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language and/or second language target emotion voice synthesis file.
8. The system according to claim 7, wherein the language corpus text labeling and parameter extracting module specifically comprises:
the marking rule establishing submodule is used for establishing a first language marking rule and a second language marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and the phonetic transcription system and problem set establishing submodule is used for establishing a context-related clustering problem set according to the similarity between the first language and the second language.
9. The system of claim 7, wherein the module for determining the target emotion average acoustic model specifically comprises:
the mixed-language neutral average acoustic model determining submodule is used for taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a mixed-language neutral average acoustic model through speaker adaptive training based on an adaptation model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set according to the mixed-language neutral average acoustic model, and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation.
CN201710415814.5A 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system Active CN107103900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710415814.5A CN107103900B (en) 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system


Publications (2)

Publication Number Publication Date
CN107103900A CN107103900A (en) 2017-08-29
CN107103900B true CN107103900B (en) 2020-03-31

Family

ID=59660516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710415814.5A Active CN107103900B (en) 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system

Country Status (1)

Country Link
CN (1) CN107103900B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN108831435B (en) * 2018-06-06 2020-10-16 安徽继远软件有限公司 Emotional voice synthesis method based on multi-emotion speaker self-adaption
CN109192225B (en) * 2018-09-28 2021-07-09 清华大学 Method and device for recognizing and marking speech emotion
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111954903B (en) * 2018-12-11 2024-03-15 微软技术许可有限责任公司 Multi-speaker neuro-text-to-speech synthesis
CN109949791A (en) * 2019-03-22 2019-06-28 平安科技(深圳)有限公司 Emotional speech synthesizing method, device and storage medium based on HMM
CN110534089B (en) * 2019-07-10 2022-04-22 西安交通大学 Chinese speech synthesis method based on phoneme and prosodic structure
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN112233648A (en) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 Data processing method, device, equipment and storage medium combining RPA and AI
CN111145719B (en) * 2019-12-31 2022-04-05 北京太极华保科技股份有限公司 Data labeling method and device for Chinese-English mixing and tone labeling
CN112151008B (en) * 2020-09-22 2022-07-15 中用科技有限公司 Voice synthesis method, system and computer equipment
CN112270168B (en) * 2020-10-14 2023-11-24 北京百度网讯科技有限公司 Method and device for predicting emotion style of dialogue, electronic equipment and storage medium
CN112634858B (en) * 2020-12-16 2024-01-23 平安科技(深圳)有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113611286B (en) * 2021-10-08 2022-01-18 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006178063A (en) * 2004-12-21 2006-07-06 Toyota Central Res & Dev Lab Inc Interactive processing device
CN101064104B (en) * 2006-04-24 2011-02-02 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN102005205B (en) * 2009-09-03 2012-10-03 株式会社东芝 Emotional speech synthesizing method and device
CN102385858B (en) * 2010-08-31 2013-06-05 国际商业机器公司 Emotional voice synthesis method and system
KR101203188B1 (en) * 2011-04-14 2012-11-22 한국과학기술원 Method and system of synthesizing emotional speech based on personal prosody model and recording medium
CN102184731A (en) * 2011-05-12 2011-09-14 北京航空航天大学 Method for converting emotional speech by combining rhythm parameters with tone parameters
TWI471854B (en) * 2012-10-19 2015-02-01 Ind Tech Res Inst Guided speaker adaptive speech synthesis system and method and computer program product
US9177549B2 (en) * 2013-11-01 2015-11-03 Google Inc. Method and system for cross-lingual voice conversion
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN104538025A (en) * 2014-12-23 2015-04-22 西北师范大学 Method and device for converting gestures to Chinese and Tibetan bilingual voices
US9665567B2 (en) * 2015-09-21 2017-05-30 International Business Machines Corporation Suggesting emoji characters based on current contextual emotional state of user
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN106128450A (en) * 2016-08-31 2016-11-16 西北师范大学 The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN106531150B (en) * 2016-12-23 2020-02-07 云知声(上海)智能科技有限公司 Emotion synthesis method based on deep neural network model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant