CN107103900B - Cross-language emotion voice synthesis method and system - Google Patents

Cross-language emotion voice synthesis method and system

Info

Publication number
CN107103900B
CN107103900B (application CN201710415814.5A)
Authority
CN
China
Prior art keywords
language
labeling
file
neutral
target emotion
Prior art date
Legal status
Active
Application number
CN201710415814.5A
Other languages
Chinese (zh)
Other versions
CN107103900A (en)
Inventor
杨鸿武
吴沛文
Current Assignee
Northwest Normal University
Original Assignee
Northwest Normal University
Priority date
Filing date
Publication date
Application filed by Northwest Normal University
Priority to CN201710415814.5A
Publication of CN107103900A
Application granted
Publication of CN107103900B
Legal status: Active

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L 13/10 — Prosody rules derived from text; stress or intonation determination
    • G10L 15/063 — Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/1807 — Speech classification or search using natural language modelling using prosody or stress
    • G10L 15/187 — Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L 15/19 — Grammatical context, e.g. disambiguation of the recognition hypotheses based on word sequence rules
    • G10L 2015/0631 — Creating reference templates; clustering

Abstract

The invention discloses a cross-language emotion voice synthesis method and system. First, a context-dependent labeling format and a context-dependent clustering problem set are established. Second, a first language labeling file, a second language labeling file, a target emotion Mandarin labeling file, a labeling file to be synthesized, first language acoustic parameters, second language acoustic parameters and target emotion acoustic parameters are determined. A multi-speaker target emotion average acoustic model is then determined from the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters. Finally, the labeling file to be synthesized is input into the multi-speaker target emotion average acoustic model to obtain a target emotion speech synthesis file in the first or/and second language, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized.

Description

Cross-language emotion voice synthesis method and system
Technical Field
The invention relates to the technical field of multilingual emotion voice synthesis, in particular to a cross-language emotion voice synthesis method and a cross-language emotion voice synthesis system.
Background
Current speech synthesis technology can synthesize natural neutral speech, but when human-computer interaction applications such as robots and virtual assistants need to simulate human behavior, plain neutral speech synthesis can no longer meet users' needs. Speech synthesis that can express human emotion and speaking style has therefore become the future direction of speech synthesis.
For emotional speech synthesis in widely spoken languages such as Chinese and English, research investment is large and the technology is well developed; for emotional speech synthesis in languages with fewer users, such as Tibetan, Russian and Spanish, development has been slow, and there is currently no recognized high-standard, high-quality emotional speech corpus for speech synthesis in these languages, so their emotional speech synthesis remains a blank area in the field of speech synthesis.
Current research on emotional speech synthesis at home and abroad covers the waveform concatenation method, the prosodic unit selection method and the statistical parametric method. The waveform concatenation method requires building a huge emotional speech database covering every emotion; the input text is analysed for text and prosody to obtain the basic unit information of the speech to be synthesized, suitable speech units are then selected from the pre-labeled database according to that information, and the selected units are modified and concatenated to obtain the synthesized speech with the target emotion. The synthesized speech has good emotional similarity, but a large speech-unit database covering many emotions must be built in advance, which makes the system very difficult to implement and hard to extend to emotional speech of different speakers and different languages. The prosodic unit selection method integrates prosodic or phonological strategies into unit selection and builds a small or mixed emotional speech database using rules that modify the target F0 contour and duration to obtain emotional speech; because the speech signal must be modified, the quality of the synthesized speech is poor, and emotional speech of different speakers and different languages cannot be synthesized. Owing to these limitations, neither of these two methods is mainstream today. The statistical parametric method has become the mainstream speech synthesis method, but it can only synthesize emotional speech in one language; to synthesize emotional speech in different languages, several emotional speech synthesis systems must be trained, and each system needs an emotional speech training corpus in its own language.
How to overcome these shortcomings of existing emotional speech synthesis methods is therefore an urgent technical problem in the field of multilingual emotional speech synthesis.
Disclosure of Invention
The invention aims to provide a cross-language emotion voice synthesis method and system that train a multi-speaker target emotion average acoustic model with a multi-speaker target emotion Mandarin training corpus, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized simply by changing the file to be synthesized.
In order to achieve the purpose, the invention provides a cross-language emotion voice synthesis method, which comprises the following steps:
establishing a context-related labeling format and a context-related clustering problem set, and respectively carrying out context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
determining a target emotion average acoustic model of the multiple speakers according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters;
carrying out context-dependent text labeling on the file to be synthesized in the first language or/and the second language to obtain a labeled file to be synthesized;
and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language or/and second language target emotion voice synthesis file.
Optionally, the establishing a context-dependent labeling format and a context-dependent clustering problem set, and performing context-dependent text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus, where the specific steps include:
establishing a first language marking rule and a second language marking rule;
determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and respectively performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and establishing a context-dependent clustering problem set according to the similarity of the first language and the second language.
Optionally, the establishing the first language annotation rule and the second language annotation rule specifically includes:
the establishing of the first language marking rule specifically comprises the following steps:
taking SAMPA-SC mandarin machine-readable phonetic symbols as the first language marking rules;
the establishing of the second language marking rule specifically comprises the following steps:
taking the international phonetic symbols as reference, and obtaining the international phonetic symbols for inputting the pinyin of the second language based on the SAMPA-SC mandarin machine-readable phonetic symbols;
judging whether the international phonetic symbol of the second language pinyin is consistent with the international phonetic symbol of the first language pinyin; if the two languages are consistent, the machine-readable phonetic symbols of the SAMPA-SC mandarin Chinese are directly adopted to mark the pinyin of the second language; otherwise, according to the simplification principle, marking by using the user-defined unused keyboard symbol.
Optionally, the determining a context-dependent markup format according to the first language markup rule and the second language markup rule includes:
according to the grammar rule knowledge bases and grammar dictionaries of the first language and the second language, carrying out text normalization, syntactic analysis and prosodic structure analysis on non-normalized input texts in the first language and the second language to obtain the standard text, the length information of prosodic words and phrases, prosodic boundary information, word-related information and tone information;
substituting the standard text into the first language labeling rule to obtain a single-phone labeling file of a first language; or bringing the standard text into the second language labeling rule to obtain a single-phone labeling file of a second language;
and determining a context-related labeling format according to the length information of the prosodic words and phrases, prosodic boundary information, word related information, tone information and the single-phone labeling file.
Optionally, the determining a target emotion average acoustic model of multiple speakers according to the first language markup file, the second language markup file, the target emotion mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter includes:
taking a first language markup file, a second language markup file, a first language acoustic parameter and a second language acoustic parameter as a training set, and obtaining a neutral average acoustic model of a mixed language through adaptive training of a speaker based on an adaptive model;
according to a neutral average acoustic model of mixed languages, a target emotion Mandarin marking file and target emotion acoustic parameters are used as a test set, and a multi-speaker target emotion average acoustic model is obtained through speaker self-adaptive transformation.
Optionally, the specific steps of obtaining the multi-speaker target emotion average acoustic model, by taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set and carrying out speaker adaptive transformation according to the neutral average acoustic model of the mixed language, are as follows:
adopting a constrained maximum likelihood linear regression algorithm to calculate a covariance matrix and a mean vector of state duration probability distribution and state output probability distribution of the speaker, and transforming the covariance matrix and the mean vector of the neutral average acoustic model into a target speaker model by using a group of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formula is as follows:
p_i(d) = N(d; α·m_i − β, α·σ_i²·α) = |α⁻¹|·N(α·ψ; m_i, σ_i²)    (7)
b_i(o) = N(o; A·μ_i − b, A·Σ_i·Aᵀ) = |A⁻¹|·N(W·ξ; μ_i, Σ_i)    (8)
where i is the state index, d is the state duration, N(·) denotes a Gaussian distribution, p_i(d) is the transformed state-duration distribution, m_i is the mean of the duration distribution and σ_i² its variance, ψ = [d, 1]ᵀ, o is the observation (feature) vector, ξ = [oᵀ, 1]ᵀ, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state-duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the target speaker's state-output probability density distribution;
through an adaptive transformation algorithm based on MSD-HSMM, the fundamental frequency, frequency spectrum and time length parameters of voice data can be transformed and normalized; for adaptive data O of length T, the maximum likelihood estimation can be performed by transforming Λ ═ W, X:
Λ̂ = (Ŵ, X̂) = argmax_Λ p(O | λ, Λ)    (9)
where λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̂ = (Ŵ, X̂) is the maximum likelihood estimate of the transforms;
carrying out maximum likelihood estimation on the converted and normalized time length, frequency spectrum and fundamental frequency parameters, and updating and correcting the speaker correlation model by adopting a maximum posterior probability algorithm, wherein the specific formula is as follows:
γ_t^d(i) = (1/P(O|λ))·α_{t−d}(i)·p_i(d)·( ∏_{s=t−d+1}^{t} b_i(o_s) )·β_t(i)    (10)
MAP estimation:
m̂_i = ( τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d ) / ( τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) )    (11)
μ̂_i = ( ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·Σ_{s=t−d+1}^{t} o_s ) / ( ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d·γ_t^d(i) )    (12)
where λ is the given MSD-HSMM parameter set, T is the length of the adaptation data O, o_t is the observation at time t, d is the state duration, i is the state index, γ_t^d(i) is the probability of the consecutive observation sequence o_{t−d+1} … o_t occupying state i, α_t(i) is the forward probability, β_t(i) is the backward probability, m̄_i and μ̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the duration distribution, and m̂_i and μ̂_i are the weighted-average MAP estimates of the adapted duration and output mean vectors, respectively.
The invention also provides a cross-language emotion voice synthesis system, which comprises the following components:
the language corpus text labeling and parameter extracting module is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of single speakers to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of a first language training corpus and a second language training corpus to obtain first language acoustic parameters corresponding to the first language training corpus and second language acoustic parameters corresponding to the second language training corpus;
the target emotion corpus text labeling and parameter extracting module is used for carrying out context-related text labeling on a target emotion Mandarin training corpus of multiple speakers according to a context-related labeling format and a context-related clustering problem set to obtain a target emotion Mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module is used for determining a target emotion average acoustic model of the multiple speakers according to the first language markup file, the second language markup file, the target emotion Mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter;
the device comprises a to-be-synthesized labeled file determining module, a labeling module and a labeling module, wherein the to-be-synthesized labeled file determining module is used for performing context-related text labeling on a to-be-synthesized file in a first language or/and a second language to obtain a to-be-synthesized labeled file;
and the voice synthesis file determining module is used for inputting the to-be-synthesized marking file into the multi-speaker target emotion average acoustic model to obtain a first language or/and second language target emotion voice synthesis file.
Optionally, the language corpus text labeling module specifically includes:
the marking rule establishing submodule is used for establishing a first language marking rule and a second language marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and the phonetic transcription system and the question set establishing submodule are used for establishing a context-related clustering question set according to the similarity of the first language and the second language.
Optionally, the target emotion average acoustic model determining module specifically includes:
the neutral average acoustic model determining submodule of the mixed language is used for taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a neutral average acoustic model of the mixed language through speaker adaptive training based on an adaptive model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin marking file and the target emotion acoustic parameters as a test set according to the neutral average acoustic model of the mixed language and obtaining the target emotion average acoustic model of the multiple speakers through adaptive transformation of the speakers.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
1) A multi-speaker target emotion average acoustic model can be trained with the multi-speaker target emotion Mandarin training corpus, and emotional speech in another language or in several languages can be synthesized simply by changing the file to be synthesized, which widens the range of speech synthesis.
2) Because the multi-speaker target emotion average acoustic model is trained with the multi-speaker target emotion Mandarin training corpus, emotional speech of the same speaker in different languages can be synthesized, and emotional speech of different speakers in different languages can be synthesized as well.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention, and other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flowchart of a cross-language emotion speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the Tibetan annotation rule according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart illustrating the process of creating a context-dependent annotation format according to an embodiment of the present invention;
FIG. 4 is a detailed flowchart of acoustic parameter extraction according to an embodiment of the present invention;
FIG. 5 is a block diagram of a cross-language emotion speech synthesis system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
The invention aims to provide a cross-language emotion voice synthesis method and system that train a multi-speaker target emotion average acoustic model with a multi-speaker target emotion Mandarin training corpus, so that cross-language emotional speech of the same speaker or of different speakers can be synthesized simply by changing the file to be synthesized.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
The invention discloses a first language and a second language, wherein the first language is any one of Chinese, English, German and French, and the second language is any one of Tibetan, Spanish, Japanese, Arabic, Korean and Portuguese. In the embodiment of the present invention, Chinese is taken as the first language and Tibetan as the second language for discussion. FIG. 1 is a flowchart of the cross-language emotion voice synthesis method in the embodiment of the present invention, as shown in FIG. 1.
The invention specifically provides a cross-language emotion voice synthesis method, which specifically comprises the following steps:
step 100: establishing a context-related labeling format and a context-related clustering problem set which are common to Chinese and Tibetan, and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
Step 200: performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; and extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters.
Step 300: and determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters.
Step 400: and carrying out context-related text labeling on the file to be synthesized of the Chinese or/and the Tibetan to obtain the file to be synthesized.
Step 500: and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese or/and Tibetan target emotion voice synthesis file.
The following describes the steps in detail:
step 100: establishing a context-related labeling format and a context-related clustering problem set which are common to Chinese and Tibetan, and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
Step 101: and establishing Chinese marking rules and Tibetan marking rules.
Step 1011: and taking SAMPA-SC Mandarin Chinese machine-readable phonetic symbols as the Chinese marking rules.
Step 1012: the establishing of the Tibetan language labeling rule specifically comprises the following steps:
at present, the Chinese mandarin machine-readable phonetic symbol SAMPA-SC tends to be mature and widely applied, and the Tibetan language and the Chinese language have many similarities in pronunciation, for example, in a Chinese-Tibetan language system, the Chinese language and the Tibetan language have commonalities and differences in pronunciation, the Tibetan language Lassa dialect and the Chinese mandarin are both composed of syllables, each syllable comprises 1 final and 1 initial, the Tibetan language Lassa dialect comprises 45 final and 36 initial, the mandarin comprises 39 final and 22 initial, the two syllables share 13 final and 20 initial, and 4 tones are different only in tone value. Therefore, the invention designs a set of Tibetan language computer readable phonetic symbol SAMPA-T, namely Tibetan language labeling rule, based on SAMPA-SC according to the pronunciation characteristics of the Tibetan language. See figure 2 for details.
And obtaining the international phonetic symbols of the Pinyin of the input Tibetan language based on the SAMPA-SC mandarin machine-readable phonetic symbols by taking the international phonetic symbols as reference.
And judging whether the international phonetic symbols of the Tibetan pinyin are consistent with the international phonetic symbols of the Chinese pinyin; if so, the Tibetan pinyin is labeled directly with the SAMPA-SC Mandarin machine-readable phonetic symbols; otherwise, following the simplification principle, it is labeled with self-defined, otherwise unused keyboard symbols.
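As an illustration of this labeling decision, the following Python sketch maps the IPA form of one Tibetan phone either to an existing SAMPA-SC symbol or to a self-defined keyboard symbol. The mapping tables are illustrative placeholders only; the actual SAMPA-T inventory is the one designed in FIG. 2 and is not reproduced here.

```python
# Minimal sketch of the SAMPA-T labeling decision described above.
# The IPA-to-SAMPA-SC table and the custom keyboard symbols below are
# illustrative placeholders, not the patent's actual SAMPA-T inventory.

# IPA symbols that Lhasa Tibetan shares with Mandarin map straight to the
# existing SAMPA-SC machine-readable symbol.
SHARED_IPA_TO_SAMPA_SC = {
    "pʰ": "ph",   # aspirated bilabial stop (shared initial, placeholder)
    "a":  "a",    # open vowel (shared final, placeholder)
}

# Tibetan-only sounds get a self-defined, otherwise unused keyboard symbol.
TIBETAN_ONLY_TO_CUSTOM = {
    "ɲ": "J",     # palatal nasal without a Mandarin counterpart (placeholder)
}

def sampa_t_symbol(ipa_symbol: str) -> str:
    """Return the machine-readable label for one Tibetan phone given its IPA form."""
    if ipa_symbol in SHARED_IPA_TO_SAMPA_SC:
        # Consistent with Mandarin: reuse the SAMPA-SC symbol directly.
        return SHARED_IPA_TO_SAMPA_SC[ipa_symbol]
    if ipa_symbol in TIBETAN_ONLY_TO_CUSTOM:
        # Not in Mandarin: fall back to a custom unused keyboard symbol.
        return TIBETAN_ONLY_TO_CUSTOM[ipa_symbol]
    raise KeyError(f"no SAMPA-T rule defined for IPA symbol {ipa_symbol!r}")

if __name__ == "__main__":
    print([sampa_t_symbol(s) for s in ["pʰ", "a", "ɲ"]])
```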
Step 102: determining a context-related labeling format which is common to Chinese and Tibetan according to a Chinese labeling rule and a Tibetan labeling rule, respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of a single speaker according to the context-related labeling format, and respectively obtaining a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus, which are specifically shown in FIG. 3.
Step 1021: according to a grammar rule knowledge base and a grammar dictionary of Chinese and Tibetan, text normalization, grammar analysis and prosodic structure analysis are carried out on input texts with irregular Chinese and Tibetan to obtain normalized texts, length information of prosodic words and phrases, prosodic boundary information, word related information and tone information.
Step 1022: substituting the standard text into the Chinese labeling rule to obtain a single-phone labeling file of Chinese; or bringing the standard text into the Tibetan language labeling rule to obtain a single-phone labeling file of the Tibetan language.
Step 1023: and determining a context-related labeling format commonly used by Chinese and Tibetan according to the length information, prosodic boundary information, word related information, tone information and a single-phoneme labeling file of the prosodic words and phrases.
The context-dependent labeling format is used to label the context information of the pronunciation units (initials and finals). It contains six layers: initials/finals, syllables, words, prosodic words, prosodic phrases and sentences, and represents the pronunciation units (initials and finals) and their context-related information in different contexts.
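The sketch below shows how one context-dependent label line could be assembled from such layered context information. The exact field order of the patent's labeling format is not given in the text, so the HTS-like layout used here is an assumption for illustration only.

```python
# Illustrative construction of one context-dependent label line for a phone.
# The field layout below (phone context plus syllable/word/prosodic fields)
# is an assumed, HTS-like format used only for demonstration.
from dataclasses import dataclass

@dataclass
class PhoneContext:
    prev_phone: str
    phone: str
    next_phone: str
    pos_in_syllable: int      # position of the phone in its syllable
    tone: int                 # lexical tone of the current syllable
    word_len: int             # number of syllables in the current word
    prosodic_word_len: int    # number of syllables in the prosodic word
    prosodic_phrase_len: int  # number of prosodic words in the phrase
    pos_in_sentence: int      # index of the syllable in the sentence

def to_label(c: PhoneContext) -> str:
    """Serialize one phone's context into a single full-context label line."""
    return (f"{c.prev_phone}-{c.phone}+{c.next_phone}"
            f"/A:{c.pos_in_syllable}_{c.tone}"
            f"/B:{c.word_len}"
            f"/C:{c.prosodic_word_len}_{c.prosodic_phrase_len}"
            f"/D:{c.pos_in_sentence}")

if __name__ == "__main__":
    ctx = PhoneContext("n", "i", "h", 2, 3, 2, 2, 3, 1)
    print(to_label(ctx))   # e.g. n-i+h/A:2_3/B:2/C:2_3/D:1
```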
Step 1024: and respectively carrying out context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of a single speaker according to a context-related labeling format to respectively obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus.
Step 103: and establishing a general context related clustering problem set of the Chinese and the Tibetan according to the similarity of the Chinese and the Tibetan.
Step 104: and respectively extracting acoustic parameters of the neutral Chinese training corpus and the neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus, which are specifically shown in FIG. 4.
During acoustic parameter extraction, the speech signal is analysed to obtain acoustic features such as the fundamental frequency and the spectrum. In the present invention, the mel-generalized cepstral coefficients (mgc) are used as the spectral feature representing the spectral envelope, i.e. the filter part of the source-filter model, and the logarithmic fundamental frequency logF0 is used as the fundamental-frequency feature. Because the speech signal is not a purely stable periodic signal, errors in the fundamental frequency directly affect the extraction of the spectral envelope, so the extraction of the spectral envelope (the mel-generalized cepstral coefficients mgc) is accompanied by the extraction of the fundamental-frequency feature (the logarithmic fundamental frequency logF0).
The acoustic parameter extraction includes extraction of the mel-generalized cepstral coefficients mgc, extraction of the logarithmic fundamental frequency logF0, and extraction of the aperiodic component bap.
The spectral model used to extract the mel-generalized cepstral coefficients mgc is as follows:
H(z) = s_γ⁻¹( Σ_{m=0}^{M} c_{α,γ}(m)·z̃⁻ᵐ )    (1)
where z̃⁻ᵐ (with z̃⁻¹ = (z⁻¹ − α)/(1 − α·z⁻¹), |α| < 1) is an m-th order all-pass function, γ determines the character of the system function, c_{α,γ}(m) are the filter coefficients, M is the total number of filter coefficients, m is the coefficient index, and z is the z-transform variable of the discrete signal.
When −1 < γ < 0, the coefficients c_{α,γ}(m) give the mel-generalized cepstral (mgc) model; when γ = −1, the model is an autoregressive (all-pole) model; and when γ = 0, it is an exponential model.
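As a rough illustration of frame-level mgc extraction, the sketch below windows the signal and calls a mel-generalized cepstral analysis routine. It assumes the third-party libraries numpy and pysptk are available (pysptk.mgcep implements mel-generalized cepstral analysis); the frame length, order, alpha and gamma values are illustrative and not the patent's settings.

```python
# Sketch of per-frame mel-generalized cepstral (MGC) extraction.
# Assumes numpy and pysptk are installed; all analysis settings are examples.
import numpy as np
import pysptk

def extract_mgc(x, frame_len=1024, hop=256, order=24, alpha=0.42, gamma=-1.0 / 3):
    """Return an (n_frames, order+1) array of MGC coefficients for waveform x."""
    window = np.blackman(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * window
        frames.append(pysptk.mgcep(frame, order=order, alpha=alpha, gamma=gamma))
    return np.asarray(frames)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    # One second of a synthetic tone plus a little noise as a stand-in signal.
    x = 0.5 * np.sin(2 * np.pi * 220 * t) + 0.01 * np.random.default_rng(0).standard_normal(sr)
    print(extract_mgc(x).shape)
```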
Logarithmic fundamental frequency logF0:
the method adopts a normalized autocorrelation function method to extract fundamental frequency characteristics, and comprises the following specific steps:
for speech signal s (N), N ≦ N, N ∈ N+The autocorrelation function is:
Figure BDA0001313631070000113
where K is the delay time and should be set to be an integer multiple of the pitch period, s (N + K) is s (N) adjacent speech signals, N is an integer, and K is the maximum number of delay times.
Normalizing the autocorrelation function ACF(k) gives the normalized autocorrelation function:
NACF(k) = ACF(k) / √(e₀·e_k)    (3)
where e_k = Σ_{n=0}^{N−1−k} s²(n + k) is the short-time energy at lag k, and e₀ is e_k at k = 0.
When the maximum value of the autocorrelation function is found, the delay value k of the function is the pitch period. The reciprocal of the pitch period is the fundamental frequency, and the logarithm of the fundamental frequency is the logarithmic fundamental frequency logF0 to be extracted.
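The following numpy sketch follows equations (2)-(3) directly for a single frame: the autocorrelation is normalized by √(e₀·e_k), the lag with the largest normalized value is taken as the pitch period, and the logarithm of its reciprocal gives logF0. The search range and frame handling are simplified for illustration; a voiced/unvoiced decision would be needed for real speech.

```python
# Normalized-autocorrelation logF0 estimate for one voiced frame,
# following equations (2)-(3). Search range and framing are simplified.
import numpy as np

def frame_logf0(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate log F0 of one voiced frame via the normalized autocorrelation."""
    frame = frame - frame.mean()
    kmin, kmax = int(sr / fmax), int(sr / fmin)
    n = len(frame) - kmax                       # samples compared at every lag
    e0 = np.sum(frame[:n] ** 2)
    best_k, best_nacf = kmin, -np.inf
    for k in range(kmin, kmax):
        acf = np.sum(frame[:n] * frame[k:k + n])          # equation (2)
        ek = np.sum(frame[k:k + n] ** 2)
        nacf = acf / np.sqrt(e0 * ek + 1e-12)             # equation (3)
        if nacf > best_nacf:
            best_k, best_nacf = k, nacf
    f0 = sr / best_k                        # reciprocal of the pitch period
    return np.log(f0)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(0, 0.04, 1 / sr)                        # one 40 ms frame
    frame = np.sin(2 * np.pi * 200 * t)                   # 200 Hz test tone
    print(np.exp(frame_logf0(frame, sr)))                 # ~200 Hz
```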
Non-periodic component bap extraction:
the non-periodic components of the speech signal are defined as the relative energy levels of the non-periodic components in a frequency domain, the non-periodic component value ap of a linear domain is calculated through the ratio of the energy of the non-harmonic components to the total energy of a spectrum with a fixed fundamental frequency value structure, namely the non-periodic component value ap of the linear domain can be determined by subtracting the upper and lower spectrum envelopes, and the specific formula is as follows:
P_AP(ω′) = ∫ w_ERB(λ′; ω′)·[S_L(λ′) − S_U(λ′)] dλ′    (4)
where P_AP(ω′) is the log-domain aperiodicity value, S_L(λ′) and S_U(λ′) are the spectral energies of the lower and upper envelopes of the spectrum S(λ′), w_ERB(λ′; ω′) is a smoothing auditory (ERB) filter, λ′ is the fundamental-frequency axis variable, and ω′ is frequency.
The band aperiodicity bap is obtained by averaging ap within each frequency band of every frame:
bap(ω′) = (1/N_b)·Σ_{λ′∈band(ω′)} P_AP(λ′)
where bap(ω′) is the band aperiodicity component and N_b is the number of aperiodicity values in the band.
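A minimal numpy sketch of the band averaging described above is given below; the frequency band edges are placeholders and are not values taken from the patent.

```python
# Band averaging of per-frequency aperiodicity values into bap features.
# The band edges are placeholders chosen only for illustration.
import numpy as np

def band_average_ap(ap, sr):
    """ap: (n_frames, n_bins) per-frequency aperiodicity -> (n_frames, n_bands) bap."""
    n_frames, n_bins = ap.shape
    freqs = np.linspace(0, sr / 2, n_bins)
    band_edges = [0, 1000, 2000, 4000, 6000, sr / 2]      # placeholder bands (Hz)
    bap = np.zeros((n_frames, len(band_edges) - 1))
    for b in range(len(band_edges) - 1):
        mask = (freqs >= band_edges[b]) & (freqs < band_edges[b + 1])
        bap[:, b] = ap[:, mask].mean(axis=1)              # average ap inside the band
    return bap

if __name__ == "__main__":
    dummy_ap = np.random.rand(10, 513)    # 10 frames, 513 FFT bins (placeholder)
    print(band_average_ap(dummy_ap, sr=16000).shape)      # (10, 5)
```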
Step 200: performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to a context-dependent labeling format and a context-dependent clustering problem set to obtain a target emotion mandarin labeling file; and extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters.
The acoustic parameter extraction of the target emotion Mandarin training corpus is the same as the acoustic parameter extraction of the neutral Chinese training corpus and the neutral Tibetan training corpus. See formulas (1) - (4) for details.
Step 300: and determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters.
Step 301: the Chinese labeling file, the Tibetan labeling file, the Chinese acoustic parameters and the Tibetan acoustic parameters are used as training sets, and a neutral average acoustic model of the mixed language is obtained through adaptive training of speakers based on an adaptive model. The self-adaptive model is any one of a deep learning model, a long-time and short-time memory model and a hidden Markov model. The invention adopts a semi-hidden Markov model for analysis.
The invention adopts a constrained maximum likelihood linear regression (CMLLR) algorithm: the difference between the average acoustic model and each training speaker's speech data is expressed by a linear regression function, the differences among the training speakers are normalized by a set of linear regression formulas for the state duration distribution and the state output distribution, and a context-dependent multi-space probability distribution hidden semi-Markov model (MSD-HSMM) is obtained by training. Speaker-adaptive training based on the MSD-HSMM improves the quality of the synthesized speech and reduces the influence of inter-speaker differences on it. The linear regression formulas for the state duration distribution and the state output distribution are as follows:
m_i^(s) = X^(s)·φ = α^(s)·d_i + β^(s),  φ = [d_i, 1]ᵀ    (5)
μ_i^(s) = W^(s)·ξ = A^(s)·o_i + b^(s),  ξ = [o_iᵀ, 1]ᵀ    (6)
Formula (5) is the state-duration distribution transformation equation: i is the state index (a subscript i denotes state i), s is a training speaker's speech data model (a superscript s denotes quantities belonging to speech data model s), m_i^(s) is the mean vector of the state duration of speech data model s, X^(s) = [α^(s), β^(s)] is the transformation matrix for the difference between the state-duration distribution of speech data model s and the average voice model, and d_i is its average duration. Formula (6) is the state-output distribution transformation equation: μ_i^(s) is the mean vector of the state output of speech data model s, W^(s) = [A^(s), b^(s)] is the transformation matrix for the difference between the state-output distribution of speech data model s and the average voice model, and o_i is its average observation vector.
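As an illustration of equations (5) and (6), the numpy sketch below applies a speaker's duration transform X = [α, β] and output transform W = [A, b] to the average-voice statistics; all numerical values are toy examples.

```python
# Applying the speaker-dependent affine transforms of equations (5)-(6)
# to average-voice duration and output means. Values are toy examples.
import numpy as np

def transform_duration_mean(alpha, beta, d_i):
    """Equation (5): speaker-s state-duration mean  m_i^(s) = alpha * d_i + beta."""
    return alpha * d_i + beta

def transform_output_mean(A, b, o_i):
    """Equation (6): speaker-s state-output mean  mu_i^(s) = A @ o_i + b."""
    return A @ o_i + b

if __name__ == "__main__":
    d_i = 7.0                            # average state duration (frames)
    o_i = np.array([1.0, -0.5, 0.25])    # average observation vector (3-dim toy)
    A = np.eye(3) * 1.1                  # toy regression matrix
    b = np.array([0.05, 0.0, -0.02])     # toy bias
    print(transform_duration_mean(1.2, 0.3, d_i))
    print(transform_output_mean(A, b, o_i))
```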
Step 302: according to a neutral average acoustic model of mixed languages, a target emotion Mandarin marking file and target emotion acoustic parameters are used as a test set, and a multi-speaker target emotion average acoustic model is obtained through speaker self-adaptive transformation; the method comprises the following specific steps:
step 3021: adopting a constrained maximum likelihood linear regression algorithm to calculate a covariance matrix and a mean vector of state duration probability distribution and state output probability distribution of the speaker, and transforming the covariance matrix and the mean vector of the neutral average acoustic model into a target speaker model by using a group of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formula is as follows:
p_i(d) = N(d; α·m_i − β, α·σ_i²·α) = |α⁻¹|·N(α·ψ; m_i, σ_i²)    (7)
b_i(o) = N(o; A·μ_i − b, A·Σ_i·Aᵀ) = |A⁻¹|·N(W·ξ; μ_i, Σ_i)    (8)
where i is the state index, d is the state duration, N(·) denotes a Gaussian distribution, p_i(d) is the transformed state-duration distribution, m_i is the mean of the duration distribution and σ_i² its variance, ψ = [d, 1]ᵀ, o is the observation (feature) vector, ξ = [oᵀ, 1]ᵀ, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state-duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the target speaker's state-output probability density distribution;
step 3022: through an adaptive transformation algorithm based on MSD-HSMM, the fundamental frequency, frequency spectrum and time length parameters of voice data can be transformed and normalized; for adaptive data O of length T, the maximum likelihood estimation can be performed by transforming Λ ═ W, X:
Λ̂ = (Ŵ, X̂) = argmax_Λ p(O | λ, Λ)    (9)
where λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̂ = (Ŵ, X̂) is the maximum likelihood estimate of the transforms.
Step 3023: carrying out maximum likelihood estimation on the converted and normalized time length, frequency spectrum and fundamental frequency parameters, and updating and correcting the speaker correlation model by adopting a maximum posterior probability algorithm, wherein the specific formula is as follows:
γ_t^d(i) = (1/P(O|λ))·α_{t−d}(i)·p_i(d)·( ∏_{s=t−d+1}^{t} b_i(o_s) )·β_t(i)    (10)
MAP estimation:
m̂_i = ( τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d ) / ( τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) )    (11)
μ̂_i = ( ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·Σ_{s=t−d+1}^{t} o_s ) / ( ω + Σ_{t=1}^{T} Σ_{d=1}^{t} d·γ_t^d(i) )    (12)
where λ is the given MSD-HSMM parameter set, T is the length of the adaptation data O, o_t is the observation at time t, d is the state duration, i is the state index, γ_t^d(i) is the probability of the consecutive observation sequence o_{t−d+1} … o_t occupying state i, α_t(i) is the forward probability, β_t(i) is the backward probability, m̄_i and μ̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the duration distribution, and m̂_i and μ̂_i are the weighted-average MAP estimates of the adapted duration and output mean vectors, respectively.
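The numpy sketch below illustrates the MAP mean updates of equations (11) and (12): the adapted mean is a weighted combination of the transformed prior mean and the occupancy-weighted adaptation statistics, with τ and ω acting as prior weights. The occupancies and observations are toy values.

```python
# MAP mean updates in the spirit of equations (11)-(12): prior mean plus
# occupancy-weighted adaptation statistics. All inputs are toy values.
import numpy as np

def map_duration_mean(tau, m_bar, gamma, d):
    """MAP estimate of a state-duration mean from occupancies gamma and durations d."""
    return (tau * m_bar + np.sum(gamma * d)) / (tau + np.sum(gamma))

def map_output_mean(omega, mu_bar, gamma, obs_sums, durations):
    """MAP estimate of a state-output mean; obs_sums[k] is the summed observation
    vector of the k-th occupied segment and durations[k] its length in frames."""
    num = omega * mu_bar + np.sum(gamma[:, None] * obs_sums, axis=0)
    den = omega + np.sum(gamma * durations)
    return num / den

if __name__ == "__main__":
    gamma = np.array([0.9, 0.6, 0.3])                 # toy segment occupancies
    d = np.array([5.0, 6.0, 4.0])                     # toy segment durations
    obs_sums = np.array([[5.0, 2.5], [6.6, 3.0], [4.4, 1.8]])
    print(map_duration_mean(tau=10.0, m_bar=5.5, gamma=gamma, d=d))
    print(map_output_mean(omega=10.0, mu_bar=np.array([1.0, 0.5]),
                          gamma=gamma, obs_sums=obs_sums, durations=d))
```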
Step 400: and carrying out context-related text labeling on the file to be synthesized of the Chinese or/and the Tibetan to obtain the file to be synthesized.
The file to be synthesized is a Chinese and/or Tibetan file to be synthesized and can be any of a character, a word, a phrase or a sentence; context-related text labeling is performed on the Chinese and/or Tibetan file to be synthesized according to the context-related text labeling format to obtain the labeling file to be synthesized.
That is, when the text to be synthesized is a Tibetan text, context-related text labeling is performed according to the context-related text labeling format to obtain a Tibetan labeling file to be synthesized; when the text to be synthesized is a Chinese text, context-related text labeling is performed to obtain a Chinese labeling file to be synthesized; and when the text to be synthesized contains both Tibetan and Chinese, context-related text labeling is performed to obtain a mixed Tibetan-Chinese labeling file to be synthesized.
Step 500: and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a target emotion voice synthesis file.
For the labeling file of the text to be synthesized, the question set is used to obtain, from the context-related information of each pronunciation unit (initial or final), the speaker-dependent target emotion average acoustic model of that unit; the speaker-dependent target emotion average acoustic model of the whole sentence to be synthesized is then determined by clustering; an acoustic parameter file of the target emotion in Mandarin and/or Tibetan is generated from the speaker-dependent target emotion average acoustic model; and finally the acoustic parameter file is passed through a speech waveform generator to synthesize the Tibetan and/or Chinese target emotion speech synthesis file.
Namely, inputting the Tibetan language to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Tibetan language target emotion voice synthesis file; inputting the Chinese to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese target emotion voice synthesis file; and inputting the Chinese and Tibetan language to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese and Tibetan mixed target emotion voice synthesis file.
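To summarize the synthesis stage, the runnable Python sketch below strings the steps together using dummy stand-ins: the "acoustic model" returns constant parameter tracks and the "vocoder" writes a simple tone. Only the order of operations mirrors the description above; none of these functions are the patent's actual implementation or an existing library API.

```python
# Runnable end-to-end sketch of the synthesis stage with dummy components.
# Every function here is a placeholder; only the step order follows the text.
import wave
import numpy as np

def labels_to_contexts(label_lines):
    """Placeholder: one 'context' record per non-empty label line."""
    return [line.strip() for line in label_lines if line.strip()]

def predict_parameters(contexts):
    """Placeholder for the speaker-dependent target-emotion average acoustic model:
    returns dummy mgc / logF0 / bap trajectories, 5 frames per label."""
    n = 5 * len(contexts)
    return np.zeros((n, 25)), np.full(n, np.log(200.0)), np.zeros((n, 5))

def vocoder(mgc, logf0, bap, sr=16000, hop=80):
    """Placeholder waveform generator: a sinusoid at exp(logF0), ignoring mgc/bap."""
    out = np.zeros(len(logf0) * hop)
    phase = 0.0
    for i, lf0 in enumerate(logf0):
        t = np.arange(hop) / sr
        out[i * hop:(i + 1) * hop] = 0.1 * np.sin(2 * np.pi * np.exp(lf0) * t + phase)
        phase += 2 * np.pi * np.exp(lf0) * hop / sr
    return out

def synthesize(label_lines, out_path, sr=16000):
    contexts = labels_to_contexts(label_lines)        # labeling file to be synthesized
    mgc, logf0, bap = predict_parameters(contexts)    # acoustic parameter file
    x = vocoder(mgc, logf0, bap, sr=sr)               # speech waveform generator
    with wave.open(out_path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sr)
        w.writeframes((x * 32767).astype(np.int16).tobytes())

if __name__ == "__main__":
    synthesize(["n-i+h/A:2_3/B:2/C:2_3/D:1"], "demo.wav")
```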
In order to achieve the purpose, the invention also provides a cross-language emotion voice synthesis system.
Fig. 5 is a block diagram of a cross-language emotion speech synthesis system according to an embodiment of the present invention, and as shown in fig. 5, the system includes: the system comprises a language corpus text labeling and parameter extracting module 1, a target emotion corpus text labeling and parameter extracting module 2, a target emotion average acoustic model determining module 3, a to-be-synthesized labeled file determining module 4 and a voice synthesis file determining module 5.
The language corpus text labeling and parameter extracting module 1 is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus; the method is used for extracting acoustic parameters of a neutral Chinese training corpus and a neutral Tibetan training corpus respectively to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
The language corpus text labeling and parameter extracting module 1 specifically comprises: the language corpus text labeling module and the language corpus parameter extracting module.
The language corpus text labeling module specifically comprises: a marking rule establishing submodule, a language corpus text marking submodule, a phonetic system and a question set establishing submodule.
The marking rule establishing submodule is used for establishing a Chinese marking rule and a Tibetan marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a Chinese labeling rule and a Tibetan labeling rule, and performing context-related text labeling on a neutral Chinese training corpus of multiple speakers and a neutral Tibetan training corpus of single speakers respectively to obtain a Chinese labeling file corresponding to the neutral Chinese training corpus and a Tibetan labeling file corresponding to the neutral Tibetan training corpus;
the phonetic transcription system and the question set establishing submodule are used for establishing a context-related clustering question set which is universal for Chinese and Tibetan according to the similarity of the Chinese and the Tibetan.
The language corpus parameter extraction module is used for respectively extracting acoustic parameters of a neutral Chinese training corpus and a neutral Tibetan training corpus to obtain Chinese acoustic parameters corresponding to the neutral Chinese training corpus and Tibetan acoustic parameters corresponding to the neutral Tibetan training corpus.
The target emotion corpus text labeling and parameter extracting module 2 is used for carrying out context-related text labeling on a target emotion mandarin training corpus of multiple speakers to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module 3 is used for determining a target emotion average acoustic model of multiple speakers according to the Chinese labeling file, the Tibetan labeling file, the target emotion Mandarin labeling file, the Chinese acoustic parameters, the Tibetan acoustic parameters and the target emotion acoustic parameters;
the target emotion average acoustic model determining module 3 specifically includes: a neutral average acoustic model determining submodule of the mixed language and a target emotion average acoustic model determining submodule.
The neutral average acoustic model determining submodule of the mixed language is used for taking the Chinese labeling file, the Tibetan labeling file, the Chinese acoustic parameters and the Tibetan acoustic parameters as a training set and obtaining a neutral average acoustic model of the mixed language through speaker adaptive training based on an adaptive model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin marking file and the target emotion acoustic parameters as a test set according to the neutral average acoustic model of the mixed language and obtaining the target emotion average acoustic model of the multiple speakers through adaptive transformation of the speakers.
And the to-be-synthesized labeling file determining module 4 is used for performing context-related text labeling on the to-be-synthesized file of the Chinese language or/and the Tibetan language to obtain the to-be-synthesized labeling file.
And the voice synthesis file determining module 5 is used for inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a Chinese or/and Tibetan target emotion voice synthesis file.
Specific examples are:
the method records 800 sentences of a female Tibetan speaker as a neutral Tibetan training corpus of a single speaker, a Chinese-English bilingual speech database as a neutral Chinese training corpus of multiple speakers, and records 11 emotions of 9 female speakers, namely 9900 sentences of 11 emotions, as a target emotion mandarin training corpus of the multiple speakers, wherein the emotions in 11 include sadness, relaxation, anger, anxiety, surprise, fear, slight, gentle, joy, disgust and neutrality. Experiments prove that the emotion similarity evaluation score (EMOS) of Tibetan or Chinese speech of the synthesized target emotion is gradually improved along with the increase of the Mandarin target emotion training corpus.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (9)

1. A cross-language emotion voice synthesis method is characterized by comprising the following steps:
establishing a context-related labeling format and a context-related clustering problem set, and respectively carrying out context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
performing context-dependent text labeling on a target emotion mandarin training corpus of multiple speakers according to the context-dependent labeling format and the context-dependent clustering problem set to obtain a target emotion mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
determining a target emotion average acoustic model of the multiple speakers according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters;
carrying out context-dependent text labeling on the file to be synthesized in the first language and/or the second language to obtain a to-be-synthesized labeling file;
and inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language and/or second language target emotion voice synthesis file.
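To show the order of the five claimed steps in an executable form, the following Python sketch strings them together with stand-in functions. Every function name and body here is a placeholder invented for illustration; none of them implements the actual labeling, parameter extraction, HSMM training, adaptation or vocoding.

    # Schematic of the claimed method; all helpers are stand-ins, not real implementations.
    def label_context(text, label_format="full-context"):
        # Stand-in for context-dependent text labeling.
        return [f"{label_format}:{unit}" for unit in text.split()]

    def extract_acoustic_params(corpus):
        # Stand-in for spectral / F0 / duration parameter extraction.
        return [{"utt": utt, "params": None} for utt in corpus]

    def train_average_model(labels, params, question_set):
        # Stand-in for speaker-adaptive training of the mixed-language neutral average model.
        return {"type": "neutral-average", "questions": question_set}

    def adapt_to_target_emotion(average_model, emo_labels, emo_params):
        # Stand-in for speaker-adaptive transformation toward the target emotion.
        return {"type": "emotion-average", "base": average_model["type"]}

    def synthesize(model, labels):
        # Stand-in for parameter generation plus vocoding.
        return f"waveform for {len(labels)} labels from {model['type']} model"

    if __name__ == "__main__":
        l1_corpus = ["ni hao"]                      # neutral first-language (Mandarin) corpus
        l2_corpus = ["bkra shis bde legs"]          # neutral second-language (Tibetan) corpus
        emo_corpus = ["jin tian tian qi zhen hao"]  # target-emotion Mandarin corpus
        avg = train_average_model(label_context(" ".join(l1_corpus + l2_corpus)),
                                  extract_acoustic_params(l1_corpus + l2_corpus),
                                  question_set=["C-Vowel?"])
        emo_model = adapt_to_target_emotion(avg, label_context(" ".join(emo_corpus)),
                                            extract_acoustic_params(emo_corpus))
        print(synthesize(emo_model, label_context("bkra shis bde legs")))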
2. The method for synthesizing cross-language emotion speech according to claim 1, wherein the step of establishing a context-dependent labeling format and a context-dependent clustering problem set and respectively performing context-dependent text labeling on the neutral first language training corpus of multiple speakers and the neutral second language training corpus of a single speaker to obtain the first language labeling file corresponding to the neutral first language training corpus and the second language labeling file corresponding to the neutral second language training corpus specifically comprises:
establishing a first language marking rule and a second language marking rule;
determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and respectively performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and establishing a context-dependent clustering problem set according to the similarity of the first language and the second language.
3. The method for synthesizing cross-language emotion speech according to claim 2, wherein the step of establishing the first language labeling rule and the second language labeling rule comprises the following steps:
the establishing of the first language marking rule specifically comprises the following steps:
taking the SAMPA-SC Mandarin machine-readable phonetic symbols as the first language marking rule;
the establishing of the second language marking rule specifically comprises the following steps:
taking the International Phonetic Alphabet as a reference, and obtaining, on the basis of the SAMPA-SC Mandarin machine-readable phonetic symbols, the international phonetic symbols for the pinyin of the second language;
judging whether the international phonetic symbol of the second language pinyin is consistent with the international phonetic symbol of the first language pinyin; if so, directly adopting the SAMPA-SC Mandarin machine-readable phonetic symbols to label the second language pinyin; otherwise, labeling it with a self-defined, otherwise unused keyboard symbol according to the simplification principle.
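The branching described in this claim can be pictured as a small table lookup. In the sketch below, the IPA keys, the SAMPA-SC values and the custom symbol are all invented examples chosen only to show the decision rule, not the actual SAMPA-SC or Tibetan phone inventories.

    # Hypothetical tables: SAMPA-SC symbols keyed by IPA, plus custom symbols for phones
    # that exist only in the second language (values are illustrative, not the real inventory).
    SAMPA_SC_BY_IPA = {"a": "a", "i": "i", "ts\u02b0": "tsh"}
    CUSTOM_BY_IPA = {"\u0272": "J"}   # second-language-only phone -> unused keyboard symbol

    def label_phone(second_language_ipa):
        # If the phone shares its IPA with a Mandarin phone, reuse the SAMPA-SC symbol;
        # otherwise fall back to a self-defined, otherwise-unused keyboard symbol.
        if second_language_ipa in SAMPA_SC_BY_IPA:
            return SAMPA_SC_BY_IPA[second_language_ipa]
        return CUSTOM_BY_IPA.get(second_language_ipa, "?")   # "?" marks an unmapped phone

    print([label_phone(p) for p in ["a", "\u0272", "i"]])    # ['a', 'J', 'i']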
4. The method for synthesizing cross-language emotion speech according to claim 3, wherein the step of determining the context-dependent labeling format according to the first language labeling rule and the second language labeling rule specifically comprises:
according to grammar rule knowledge bases and grammar dictionaries of the first language and the second language, carrying out text normalization, grammatical analysis and prosodic structure analysis on non-standard input texts in the first language and the second language to obtain standard texts, length information of prosodic words and phrases, prosodic boundary information, word related information and tone information;
substituting the standard text into the first language labeling rule to obtain a single-phone labeling file of the first language, or substituting the standard text into the second language labeling rule to obtain a single-phone labeling file of the second language;
and determining a context-related labeling format according to the length information of the prosodic words and phrases, prosodic boundary information, word related information, tone information and the single-phone labeling file.
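To make the final combination step concrete, the sketch below assembles one HTS-style full-context label line from a monophone and a few of the context fields named in this claim. The field layout, separators and example values are invented for illustration and are not the patent's actual labeling format.

    # Assemble a full-context label from a monophone and its context (layout is illustrative only).
    def full_context_label(prev_ph, cur_ph, next_ph, tone, word_len, prosodic_word_len,
                           phrase_len, boundary):
        return (f"{prev_ph}-{cur_ph}+{next_ph}"
                f"/T:{tone}"                # lexical tone of the current syllable
                f"/W:{word_len}"            # length of the current word in syllables
                f"/PW:{prosodic_word_len}"  # length of the current prosodic word
                f"/P:{phrase_len}"          # length of the current prosodic phrase
                f"/B:{boundary}")           # prosodic boundary type after this syllable

    # One label for a syllable inside a two-syllable word (all values hypothetical).
    print(full_context_label("n", "i", "h", tone=3, word_len=2, prosodic_word_len=2,
                             phrase_len=5, boundary="B1"))
    # n-i+h/T:3/W:2/PW:2/P:5/B:B1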
5. The method for synthesizing cross-language emotion speech according to claim 1, wherein the step of determining the multi-speaker target emotion average acoustic model according to the first language labeling file, the second language labeling file, the target emotion Mandarin labeling file, the first language acoustic parameters, the second language acoustic parameters and the target emotion acoustic parameters specifically comprises:
taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a mixed-language neutral average acoustic model through speaker adaptive training based on an adaptation model;
and according to the mixed-language neutral average acoustic model, taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set, and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation.
6. The method for synthesizing cross-language emotion voice according to claim 5, wherein the specific steps of taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation according to the mixed-language neutral average acoustic model are as follows:
adopting a constrained maximum likelihood linear regression (CMLLR) algorithm to calculate the mean vectors and covariance matrices of the speaker's state duration probability distributions and state output probability distributions, and transforming the mean vectors and covariance matrices of the neutral average acoustic model into the target speaker model by using a set of transformation matrices of the state duration distribution and the state output distribution, wherein the specific formulas are as follows:

p_i(d) = N(d; αm_i − β, ασ_i²α) = |α⁻¹| N(Xψ; m_i, σ_i²)    (7)

b_i(o) = N(o; Aμ_i − b, AΣ_i A^T) = |A⁻¹| N(Wξ; μ_i, Σ_i)    (8)

wherein i is the state, d is the state duration, N(·) is the Gaussian distribution function, p_i(d) is the transformed state duration distribution, m_i is the mean of the duration distribution, σ_i² is its variance, ψ = [d, 1]^T, o is the observation feature vector, ξ = [o^T, 1]^T, μ_i is the mean of the state output distribution, Σ_i is its diagonal covariance matrix, X = [α⁻¹, β⁻¹] is the transformation matrix of the state duration probability density distribution, and W = [A⁻¹, b⁻¹] is the linear transformation matrix of the state output probability density distribution of the target speaker;
through an adaptive transformation algorithm based on the MSD-HSMM, the fundamental frequency, spectrum and duration parameters of the speech data can be transformed and normalized; for adaptation data O of length T, the transforms Λ = (W, X) are estimated by maximum likelihood:

Λ̃ = (W̃, X̃) = argmax_Λ P(O | λ, Λ)

wherein λ is the parameter set of the MSD-HSMM, O is the adaptation data of length T, and Λ̃ is the maximum likelihood estimate of the transforms;
and carrying out maximum likelihood estimation on the transformed and normalized duration, spectrum and fundamental frequency parameters, and updating and correcting the speaker-dependent model by adopting a maximum a posteriori (MAP) algorithm, wherein the specific formulas are as follows:

γ_t^d(i) = (1 / P(O|λ)) · α_{t−d}(i) · p_i(d) · ∏_{s=t−d+1}^{t} b_i(o_s) · β_t(i)

and the MAP estimates:

μ̂_i = (ω·μ̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i) Σ_{s=t−d+1}^{t} o_s) / (ω + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d)

m̂_i = (τ·m̄_i + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i)·d) / (τ + Σ_{t=1}^{T} Σ_{d=1}^{t} γ_t^d(i))

wherein λ is the given MSD-HSMM parameter set, O is the adaptation data of length T, i is the state, d is the state duration, t is the time index, γ_t^d(i) is the probability of the continuous observation sequence o_{t−d+1} … o_t being generated in state i, α_t(i) is the forward probability, β_t(i) is the backward probability, μ̄_i and m̄_i are the mean vectors after the linear regression transformation, ω is the MAP estimation parameter of the state output distribution, τ is the MAP estimation parameter of the state duration distribution, and μ̂_i and m̂_i are respectively the weighted-average MAP estimates of the adaptation vectors μ̄_i and m̄_i.
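As a numeric illustration of how the duration transform of equation (7) and a weighted-average MAP update act on a single state, the sketch below uses scalar values throughout. All numbers, and the reduction to scalars, are assumptions made for clarity; they are not values from the patent or from trained models.

    # Scalar view of the CMLLR-style duration transform: the adapted duration model
    # has mean alpha*m - beta and variance alpha*sigma^2*alpha (equation (7), scalar case).
    m_i, var_i = 10.0, 4.0      # average-voice duration mean and variance in frames (assumed)
    alpha, beta = 1.2, -0.5     # adaptation transform parameters (assumed)
    m_adapted = alpha * m_i - beta
    var_adapted = alpha * var_i * alpha

    # MAP-style weighted average: blend the transformed mean with an occupancy-weighted
    # statistic from the adaptation data, with tau acting as the prior weight.
    tau = 10.0                  # MAP weight for the duration distribution (assumed)
    occupancy = 25.0            # total state occupancy over the adaptation data (assumed)
    data_mean = 14.0            # occupancy-weighted mean duration in the adaptation data (assumed)
    m_map = (tau * m_adapted + occupancy * data_mean) / (tau + occupancy)

    print(round(m_adapted, 2), round(var_adapted, 2), round(m_map, 3))  # 12.5 5.76 13.571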
7. A cross-language emotion speech synthesis system, comprising:
the language corpus text labeling and parameter extracting module is used for establishing a context-related labeling format and a context-related clustering problem set, and respectively performing context-related text labeling on a neutral first language training corpus of multiple speakers and a neutral second language training corpus of a single speaker to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus; the module is further used for respectively extracting acoustic parameters of the neutral first language training corpus and the neutral second language training corpus to obtain first language acoustic parameters corresponding to the neutral first language training corpus and second language acoustic parameters corresponding to the neutral second language training corpus;
the target emotion corpus text labeling and parameter extracting module is used for carrying out context-related text labeling on a target emotion Mandarin training corpus of multiple speakers according to the context-related labeling format and the context-related clustering problem set to obtain a target emotion Mandarin labeling file; extracting acoustic parameters of the target emotion Mandarin Chinese training corpus to obtain target emotion acoustic parameters;
the target emotion average acoustic model determining module is used for determining a target emotion average acoustic model of the multiple speakers according to the first language markup file, the second language markup file, the target emotion Mandarin markup file, the first language acoustic parameter, the second language acoustic parameter and the target emotion acoustic parameter;
the device comprises a to-be-synthesized labeled file determining module, a labeling module and a labeling module, wherein the to-be-synthesized labeled file determining module is used for performing context-related text labeling on a to-be-synthesized file in a first language or/and a second language to obtain a to-be-synthesized labeled file;
and the voice synthesis file determining module is used for inputting the to-be-synthesized labeling file into the multi-speaker target emotion average acoustic model to obtain a first language and/or second language target emotion voice synthesis file.
8. The system according to claim 7, wherein the language corpus text labeling and parameter extracting module specifically comprises:
the marking rule establishing submodule is used for establishing a first language marking rule and a second language marking rule;
the language corpus text labeling submodule is used for determining a context-related labeling format according to a first language labeling rule and a second language labeling rule, and performing context-related text labeling on a neutral first language training corpus of a plurality of speakers and a neutral second language training corpus of a single speaker respectively to obtain a first language labeling file corresponding to the neutral first language training corpus and a second language labeling file corresponding to the neutral second language training corpus;
and the phonetic transcription system and problem set establishing submodule is used for establishing a context-related clustering problem set according to the similarity between the first language and the second language.
9. The system of claim 7, wherein the module for determining the target emotion average acoustic model specifically comprises:
the mixed-language neutral average acoustic model determining submodule is used for taking the first language labeling file, the second language labeling file, the first language acoustic parameters and the second language acoustic parameters as a training set, and obtaining a mixed-language neutral average acoustic model through speaker adaptive training based on an adaptation model;
and the target emotion average acoustic model determining submodule is used for taking the target emotion Mandarin labeling file and the target emotion acoustic parameters as a test set according to the mixed-language neutral average acoustic model, and obtaining the multi-speaker target emotion average acoustic model through speaker adaptive transformation.
CN201710415814.5A 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system Active CN107103900B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710415814.5A CN107103900B (en) 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system


Publications (2)

Publication Number Publication Date
CN107103900A CN107103900A (en) 2017-08-29
CN107103900B true CN107103900B (en) 2020-03-31

Family

ID=59660516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710415814.5A Active CN107103900B (en) 2017-06-06 2017-06-06 Cross-language emotion voice synthesis method and system

Country Status (1)

Country Link
CN (1) CN107103900B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109036370B (en) * 2018-06-06 2021-07-20 安徽继远软件有限公司 Adaptive training method for speaker voice
CN108831435B (en) * 2018-06-06 2020-10-16 安徽继远软件有限公司 Emotional voice synthesis method based on multi-emotion speaker self-adaption
CN109192225B (en) * 2018-09-28 2021-07-09 清华大学 Method and device for recognizing and marking speech emotion
CN111192568B (en) * 2018-11-15 2022-12-13 华为技术有限公司 Speech synthesis method and speech synthesis device
CN111954903B (en) * 2018-12-11 2024-03-15 微软技术许可有限责任公司 Multi-speaker neuro-text-to-speech synthesis
CN109949791A (en) * 2019-03-22 2019-06-28 平安科技(深圳)有限公司 Emotional speech synthesizing method, device and storage medium based on HMM
CN110534089B (en) * 2019-07-10 2022-04-22 西安交通大学 Chinese speech synthesis method based on phoneme and prosodic structure
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN112233648A (en) * 2019-12-09 2021-01-15 北京来也网络科技有限公司 Data processing method, device, equipment and storage medium combining RPA and AI
CN111145719B (en) * 2019-12-31 2022-04-05 北京太极华保科技股份有限公司 Data labeling method and device for Chinese-English mixing and tone labeling
CN112151008B (en) * 2020-09-22 2022-07-15 中用科技有限公司 Voice synthesis method, system and computer equipment
CN112270168B (en) * 2020-10-14 2023-11-24 北京百度网讯科技有限公司 Method and device for predicting emotion style of dialogue, electronic equipment and storage medium
CN112634858B (en) * 2020-12-16 2024-01-23 平安科技(深圳)有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113539268A (en) * 2021-01-29 2021-10-22 南京迪港科技有限责任公司 End-to-end voice-to-text rare word optimization method
CN113345431A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Cross-language voice conversion method, device, equipment and medium
CN113611286B (en) * 2021-10-08 2022-01-18 之江实验室 Cross-language speech emotion recognition method and system based on common feature extraction
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006178063A (en) * 2004-12-21 2006-07-06 Toyota Central Res & Dev Lab Inc Interactive processing device
CN101064104B (en) * 2006-04-24 2011-02-02 中国科学院自动化研究所 Emotion voice creating method based on voice conversion
CN102005205B (en) * 2009-09-03 2012-10-03 株式会社东芝 Emotional speech synthesizing method and device
CN102385858B (en) * 2010-08-31 2013-06-05 国际商业机器公司 Emotional voice synthesis method and system
KR101203188B1 (en) * 2011-04-14 2012-11-22 한국과학기술원 Method and system of synthesizing emotional speech based on personal prosody model and recording medium
CN102184731A (en) * 2011-05-12 2011-09-14 北京航空航天大学 Method for converting emotional speech by combining rhythm parameters with tone parameters
TWI471854B (en) * 2012-10-19 2015-02-01 Ind Tech Res Inst Guided speaker adaptive speech synthesis system and method and computer program product
US9177549B2 (en) * 2013-11-01 2015-11-03 Google Inc. Method and system for cross-lingual voice conversion
CN104217713A (en) * 2014-07-15 2014-12-17 西北师范大学 Tibetan-Chinese speech synthesis method and device
CN104538025A (en) * 2014-12-23 2015-04-22 西北师范大学 Method and device for converting gestures to Chinese and Tibetan bilingual voices
US9665567B2 (en) * 2015-09-21 2017-05-30 International Business Machines Corporation Suggesting emoji characters based on current contextual emotional state of user
CN105654942A (en) * 2016-01-04 2016-06-08 北京时代瑞朗科技有限公司 Speech synthesis method of interrogative sentence and exclamatory sentence based on statistical parameter
CN106128450A (en) * 2016-08-31 2016-11-16 西北师范大学 The bilingual method across language voice conversion and system thereof hidden in a kind of Chinese
CN106531150B (en) * 2016-12-23 2020-02-07 云知声(上海)智能科技有限公司 Emotion synthesis method based on deep neural network model


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant