CN112863483A - Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm - Google Patents

Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Info

Publication number
CN112863483A
CN112863483A
Authority
CN
China
Prior art keywords
speaker
text
predicted
duration
coding information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110008049.1A
Other languages
Chinese (zh)
Other versions
CN112863483B (en)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202110008049.1A priority Critical patent/CN112863483B/en
Publication of CN112863483A publication Critical patent/CN112863483A/en
Application granted granted Critical
Publication of CN112863483B publication Critical patent/CN112863483B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/26 - Pre-filtering or post-filtering
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention discloses a speech synthesis device supporting multi-speaker style and language switching with controllable prosody, belonging to the field of speech synthesis. The device comprises: a text preprocessing unit for preprocessing the acquired text data; a language switching unit for storing and displaying the speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized; a style switching unit for specifying a speech synthesis style according to the language type; a speaker switching unit for designating a speaker; an encoding-decoding unit for obtaining a predicted Mel spectrum; a training unit for training the encoding-decoding unit; and a speech synthesis unit for generating the predicted Mel spectrum and converting it into a sound signal for voice playback. The invention can generate speech with richer prosodic variation while controlling the speaker and the speaking style independently.

Description

Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
Technical Field
The invention belongs to the field of speech synthesis, and particularly relates to a speech synthesis device that supports multi-speaker style and language switching and has controllable prosody.
Background
With the development of deep learning in recent years, speech synthesis technology has improved greatly. Speech synthesis has moved from the traditional parametric and concatenative approaches towards end-to-end approaches, which typically first generate a Mel spectrum from text features and then synthesize the speech from that Mel spectrum with a vocoder. These end-to-end methods can be divided by structure into autoregressive and non-autoregressive models. Autoregressive models usually generate speech autoregressively with an Encoder-Attention-Decoder mechanism: to generate the current data point, all previous data points in the time series must be generated as model inputs, as in Tacotron, Tacotron 2, Deep Voice 3, ClariNet and Transformer TTS. Although autoregressive models can produce satisfactory results, the attention alignment may fail, leading to repeated or missing words in the synthesized speech. Non-autoregressive models such as ParaNet, FastSpeech, AlignTTS and FastSpeech 2 can generate the Mel spectrum from text features in parallel, much faster than autoregressive models.
Existing speech synthesis methods offer only limited control over the synthesized speech, cannot synthesize mixed speech in multiple languages, and cannot decouple the styles of multiple speakers so that a style can be applied to other speakers.
Therefore, how to make a speech synthesis system support multiple speakers while keeping prosody controllable, and how to decouple speaker styles so that they can be applied to other speakers, remains an unsolved problem in the field of computer intelligent speech synthesis.
Disclosure of Invention
The invention aims to solve the above problems in the prior art. On one hand, the prosody of the synthesized speech is controlled through four features: prosody-labeled text, duration, energy and pitch. On the other hand, the invention supports language transfer, giving a speaker who has data in only one language the ability to speak a second language, and it can also decouple the style of multiple speakers from the speaker characteristics and apply that style to other speakers (speaker style transfer). By optimizing the speech synthesis model, the invention overcomes the restriction of language and speaking style to a particular speaker, separates the multilingual styles of multiple speakers, and makes prosody comprehensively controllable.
In order to achieve the purpose, the invention adopts the following specific technical scheme:
a speech synthesis apparatus supporting multi-speaker style, language switching and controllable prosody, comprising:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
Further, the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries. The prosodic tags are added by adopting a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosodic phrase boundary prediction model, and the text to be synthesized with the prosodic tags is output.
Further, the multiple speaker tag refers to a natural number sequence that distinguishes each speaker.
Further, the decoder consists of a bi-directional LSTM and a linear affine transform.
Furthermore, the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
The invention can utilize the coding-decoding unit, the duration/energy/pitch control unit and the neural network decoder to realize the control of the rhythm in the synthesized voice and further realize the support of the separation of the multi-speaker and the multi-speaker style. Compared with the prior art, the invention has the beneficial effects that:
(1) the encoding-decoding unit of the invention comprises an encoder, a prosody control unit and a decoder, wherein the encoder is used for encoding a phoneme sequence with a prosody label, a first speaker label and a second speaker label; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the prosody control unit, decoding the combined coding information and obtaining the predicted Mel frequency spectrum. By the method, the voice synthesis system supporting multiple speakers and multiple languages is realized, the model function is richer, the control of pitch, energy, duration and rhythm of voice synthesis can be supported, the speaker tag and the speaking style tag can be designated in the voice synthesis, and the migration of the language and the speaking style of the speaker is supported, for example, the speaker who only speaks in Chinese can be migrated to the same speaker who speaks in English in customer service style.
(2) Compared with the traditional method for separately constructing various models, the method adopts a mode of directly converting text to acoustic characteristics, avoids the influence of single model prediction error on the effect of the whole model, thereby improving the fault-tolerant capability of the model and reducing the deployment cost of the model; and the CBHG module is adopted to effectively model the current and context information, extract the features of higher level and extract the context features of the sequence, and the method can learn more text pronunciation features which are difficult to define by people through data, thereby effectively improving the pronunciation effect of the voice.
(3) The text of the invention also comprises a preprocessing process before being input into the speech synthesis model, namely a process of adding prosodic tags to the text in a prosodic phrase boundary prediction mode, thereby ensuring controllable text prosody and solving the defect of reduced naturalness of long sentence synthesis caused by unnatural prosodic pause of synthesized speech in the traditional speech synthesis method; through a supervision mode, the text, the duration, the energy and the pitch with rhythm labeling are finely controlled, so that the rhythm of the synthesized voice is more comprehensively controlled, and the voice with richer and more natural rhythm change is synthesized;
(4) the invention simplifies the complexity of the speech synthesis model training by introducing the time length control unit, and because the traditional process of dynamically aligning the text and the audio by adopting the attention module in the end-to-end speech synthesis model needs a large amount of computing resource consumption and time consumption, the invention avoids the process of aligning the text and the audio in the form of autoregressive attention, thereby reducing the requirements on the computing resources and saving the computing cost of the model. For example, the traditional Chinese and English two independent models can be replaced by a Chinese-English mixed model, and the method can be used for not only Chinese-English mixed speech synthesis, but also other mixed languages.
Drawings
FIG. 1 is an overall schematic diagram of a speech synthesis apparatus of the present invention;
FIG. 2 is a schematic diagram of a pitch/energy/duration control unit for use in the present invention;
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
Compared with the traditional scheme, the invention better controls the rhythm pause information in the synthesized voice by using a skip neural network encoder, and simultaneously controls the rhythm pronunciation information of each frame in the synthesized voice by using the predicted duration, energy and pitch, so that the characteristic of the synthesized voice can be more accurately controlled, and the voice with richer rhythm change can be generated. On the other hand, the support to multiple speakers is realized, the whole solution is completely completed by one model, the language of the text does not need to be distinguished, and the complexity of the model is reduced.
As shown in fig. 1, a speech synthesizer supporting multi-speaker style, language switching and controllable prosody of the present invention comprises the following units:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
In an embodiment of the present invention, the text preprocessing unit converts the text into a phoneme sequence with a prosodic tag, specifically:
respectively converting the different language types in the text into corresponding pronunciation phonemes to construct a mixed phoneme dictionary; the phonemes with prosodic labels are mapped to serialized data using the mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text.
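As an illustration of this mapping step (not part of the patent; the phoneme inventories shown, the reserved padding index and the function names are assumptions), a minimal Python sketch could be:

# Minimal sketch: build a mixed phoneme dictionary and map a prosody-tagged
# phoneme sequence to integer IDs. The phoneme inventories are toy examples.
PROSODY_TAGS = ["#1", "#2", "#3", "#4", "#S"]  # prosodic word/phrase, intonation phrase, sentence end, character boundary

def build_mixed_dictionary(chinese_phonemes, english_phonemes):
    """Merge the two phoneme inventories plus the prosody tags into one symbol table."""
    symbols = sorted(set(chinese_phonemes) | set(english_phonemes)) + PROSODY_TAGS
    return {sym: idx for idx, sym in enumerate(symbols, start=1)}  # 0 reserved for padding

def text_to_sequence(tagged_phonemes, sym2id):
    """Map the prosody-tagged phonemes w_1 ... w_U to serialized integer data."""
    return [sym2id[p] for p in tagged_phonemes]

sym2id = build_mixed_dictionary(["sh", "e4", "zh", "ong1"], ["HH", "AH0", "L", "OW1"])
ids = text_to_sequence(["sh", "e4", "#1", "HH", "AH0", "L", "OW1", "#4"], sym2id)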
In one embodiment of the present invention, the prosody control unit is described.
The prosody control unit includes:
the time length control unit is used for predicting the time length of the text coding information and the first speaker coding information output by the CBHG module, outputting the predicted time length and adjusting the predicted time length;
the alignment unit is used for aligning the text coding information which is output by the encoder and does not contain the prosodic tags according to the duration information output by the duration control unit, the length of the text coding information needs to be consistent with the length of a real Mel frequency spectrum in the training mode, the predicted duration of each phoneme is output by the trained duration control unit in the prediction mode, the length of each phoneme is expanded according to the predicted duration, and the text coding information after the duration adjustment is output after the expansion;
the energy control unit is used for reading the text coding information and the first speaker coding information which are output by the alignment unit and have the adjusted time length, generating predicted energy and adjusting the predicted energy;
and the pitch control unit is used for reading the duration-adjusted text coding information output by the alignment unit and the second speaker coding information, generating a predicted pitch and performing pitch adjustment on the predicted pitch.
In one embodiment of the present invention, the alignment unit is described.
The operation steps of the alignment unit are as follows: the skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration information output by the duration control unit, the standard of the length expansion being: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration; the duration-adjusted text coding information t1, t2, …, tT is obtained after the expansion, where T is the number of frames of the extracted real Mel spectrum.
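For illustration only (the tensor shapes and function name are assumptions, not the patent's implementation), the expansion performed by the alignment unit can be sketched in Python/PyTorch as:

import torch

def length_regulate(text_encodings, durations):
    """Expand phoneme-level encodings t_1..t_U' to frame level t_1..t_T by repeating
    each phoneme encoding according to its (ground-truth or predicted) duration.
    text_encodings: (U', D) tensor; durations: (U',) tensor of integer frame counts.
    Returns a (T, D) tensor with T = sum(durations)."""
    expanded = [enc.repeat(int(d), 1) for enc, d in zip(text_encodings, durations) if int(d) > 0]
    return torch.cat(expanded, dim=0)

# usage: three phonemes lasting 2, 3 and 1 frames give 6 frames
frames = length_regulate(torch.randn(3, 8), torch.tensor([2, 3, 1]))
assert frames.shape == (6, 8)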
In this embodiment, the encoder is provided with a phoneme Embedding layer, a speaker Embedding layer, a CBHG module, and a skip module;
for phoneme sequence w with rhythm label1,w2,…,wUConverted into phoneme vector sequence x through phoneme Embedding layer1,x2,…,xU
Speaker tag s for inputiI 1,2,3, which is converted into a speaker vector sequence S by a speaker Embedding layeri
Using the converted phoneme vector sequence as the input of the CBHG module to generate text coding information t1,t2,…,tU
Encoding information t from text1,t2,…,tUAnd speaker vector sequence SiPredicting the duration;
encoding text into information t1,t2,…,tUText coding information t without rhythm label position after generating jump coding by jump module1,t2,…,tU′Of which is U'<U, wherein U' is the text length after the prosodic tag is removed.
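The skip module itself can be illustrated with the following sketch (the tag IDs and the function name are assumptions); it simply drops the encoder outputs at prosodic-tag positions, after those tags have already influenced the neighbouring positions through the CBHG context:

import torch

def skip_state(encodings, phoneme_ids, prosody_tag_ids):
    """Drop encoder outputs at prosodic-tag positions: t_1..t_U -> t_1..t_U' (U' < U)."""
    keep = ~torch.isin(phoneme_ids, prosody_tag_ids)  # boolean mask over the U positions
    return encodings[keep]

# usage: positions holding the tag IDs 101 and 102 are removed
out = skip_state(torch.randn(6, 8),
                 torch.tensor([3, 7, 101, 9, 4, 102]),
                 torch.tensor([101, 102]))
assert out.shape == (4, 8)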
In this embodiment, the training of the encoding-decoding unit by the training unit specifically includes:
processing the phoneme sequence with the prosodic tag by a phoneme Embedding layer and a CBHG module in sequence to obtain text coding information, and removing the prosodic tag from the text coding information by a hopping module; respectively processing a first speaker tag and a second speaker tag through a speaker Embedding layer to obtain first speaker coding information and second speaker coding information;
the text coding information and the first speaker coding information are subjected to duration control to obtain the predicted duration with the speaker characteristics, and the predicted duration is multiplied by a duration adjustment factor, wherein the duration adjustment factor is 1;
aligning the text coding information without the prosodic tags according to the predicted duration after the duration adjustment to obtain the text coding information after the duration adjustment;
using the text coding information and the second speaker coding information after the duration adjustment as the input of a pitch control unit to obtain a predicted pitch with speaker characteristics, and multiplying the predicted pitch by a pitch adjustment factor, wherein the pitch adjustment factor is 1;
using the text coding information and the first speaker coding information after the duration adjustment as the input of an energy control unit to obtain predicted energy with speaker characteristics, and multiplying the predicted energy by an energy adjustment factor, wherein the energy adjustment factor is 1;
combining the predicted pitch, the predicted energy, the text coding information subjected to time length adjustment and the first speaker coding information to be used as the input of a decoder to obtain a predicted Mel frequency spectrum;
calculating a duration loss according to the predicted duration and the real duration, calculating a pitch loss according to the predicted pitch and the real pitch, calculating an energy loss according to the predicted energy and the real energy, and calculating a mel-frequency spectrum loss according to the predicted mel-frequency spectrum and the real mel-frequency spectrum; and combining various loss values to carry out end-to-end training on the encoder, the prosody control unit and the decoder.
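A minimal sketch of the combined objective described above (the equal weighting and the choice of L1 for the Mel loss and MSE for the others are assumptions; the patent only states that the loss values are combined):

import torch.nn.functional as F

def total_loss(pred, target):
    """Sum of the four losses used for end-to-end training of the encoder,
    prosody control unit and decoder."""
    mel_loss      = F.l1_loss(pred["mel"], target["mel"])
    duration_loss = F.mse_loss(pred["duration"], target["duration"])
    pitch_loss    = F.mse_loss(pred["pitch"], target["pitch"])
    energy_loss   = F.mse_loss(pred["energy"], target["energy"])
    return mel_loss + duration_loss + pitch_loss + energy_loss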
In this embodiment, the prosodic tags include prosodic words, prosodic phrases, intonation phrases, sentence ends, and character boundaries.
In this embodiment, after the voice synthesis unit reads the text to be synthesized in the text acquisition unit, a prosody tag needs to be added to the text to be synthesized, the addition of the prosody tag is implemented by using a pre-trained prosody phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosody phrase boundary prediction model, and the text to be synthesized with the prosody tag is output.
The speech synthesis system must complete training before use. The training process calculates a duration loss from the predicted and real durations, a pitch loss from the predicted and real pitch, an energy loss from the predicted and real energy, and a Mel spectrum loss from the predicted and real Mel spectra; the mixed speech synthesis model is then trained end-to-end by combining these loss values.
In a specific Chinese-English implementation of the invention, the main functions of the text preprocessing module (front end) are to receive text data, normalize the text, parse XML tags, and map the prosody-labeled Chinese and English phonemes to serialized data using a Chinese-English mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text. The prosody labeling process is as follows: the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends and character boundaries; the prosodic tags are added with a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained model, and the text to be synthesized with prosodic tags is output. The training samples used in the training stage may be data with prosodic tags from an open-source database.
Specifically, the main function of the encoder is to learn, during training, the text features and speaker information of the phoneme sequence of the current sample, so that the phoneme sequence can be converted into a fixed-dimension vector representing the text and speaker features. Compared with traditional parametric speech synthesis algorithms, the encoder plays a role similar to the manual feature extraction step of the parametric method; however, the encoder learns representative feature vectors from data, whereas manual feature extraction and statistical design consume a large amount of manpower and greatly increase labor cost. Moreover, compared with the possibly incomplete feature information obtained by manual feature extraction, the learned feature vectors can capture sufficient feature information provided the data coverage is comprehensive. Compared with traditional decoders, the decoder of the invention has a simple structure, consisting of only a bidirectional LSTM and a linear affine transformation, which greatly improves decoding speed.
Specifically, the time length control unit and the alignment unit are used for carrying out length expansion on the coding information output by the coder, the introduction of the time length control unit simplifies the complexity of the training of the speech synthesis model, and the traditional end-to-end speech synthesis model adopts an attention module to dynamically align texts and audios, which requires a large amount of computing resource consumption and time consumption.
In addition, the introduction of the time length control unit, the pitch control unit and the energy control unit enables prosody to be adjusted in three aspects of time length, pitch and energy, and the specific adjustment mode can be realized by adding an adjustable parameter after the output value of each module and multiplying the output result by a coefficient.
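As a sketch of this adjustment mechanism (the factor values are illustrative), the three predictions are simply scaled before being used downstream; a factor of 1 reproduces the prediction unchanged, as in the training configuration described above:

def apply_prosody_factors(pred_duration, pred_pitch, pred_energy,
                          duration_factor=1.0, pitch_factor=1.0, energy_factor=1.0):
    """Multiply each predicted prosody value by its user-set adjustment factor."""
    return (pred_duration * duration_factor,   # > 1.0 slows the speech down
            pred_pitch * pitch_factor,         # < 1.0 lowers the pitch
            pred_energy * energy_factor)       # > 1.0 makes the speech louder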
In one embodiment of the present invention, the multi-speaker tags refer to a sequence of natural numbers that distinguishes the speakers, wherein the first speaker tag serves as the speech synthesis style and the second speaker tag designates the speaker. In the training mode, the first speaker tag and the second speaker tag are both speaker tags annotated in the mixed training samples; in the prediction mode, the first speaker tag and the second speaker tag are specified by the user through the style switching unit and the speaker switching unit respectively. For example, after training, to have a second speaker synthesize speech in the style of a first speaker, one only needs to set the first speaker tag as the speech synthesis style through the style switching unit and set the second speaker tag as the designated speaker through the speaker switching unit. In this way, a speaker who only has Chinese reading-style recordings can be transferred to speak English in a customer-service style.
The following describes a specific implementation method of the speech synthesis apparatus of the present invention.
Step one, preprocess the input Chinese-English text data sequence with prosody labels as the input of the skip neural network encoder; the skip neural network encoder consists of a phoneme Embedding layer, a CBHG module and a skip module;
step two, for the output of the skip neural network encoder, combine the output of the CBHG module with the output of the speaker Embedding layer and obtain the duration-adjusted text coding information through the duration adjustment;
step three, the text coding information after the duration adjustment and the output of the speaker Embedding layer are used as the input of a pitch control unit and an energy control unit together to obtain the predicted pitch and the predicted energy; combining the predicted pitch, the predicted energy and the text coding information subjected to time length adjustment, and using the combined information and the output of the speaker Embedding layer as the input of a decoder to obtain a predicted Mel frequency spectrum; the vocoder outputs the synthesized speech.
In one embodiment of the present invention (taking chinese and english as an example), the transmission and processing process of the input text in the speech synthesis device is as follows:
1) For the Chinese and the English in the text, convert each into its corresponding pronunciation phonemes and construct a Chinese-English mixed phoneme dictionary; use the Chinese-English mixed phoneme dictionary to map the prosody-labeled Chinese and English phonemes to serialized data, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text and wi denotes the phoneme information corresponding to the i-th word in the text.
2) Construct a multi-speaker dictionary for the multiple speakers and derive the speaker tags s1, s2, …, sk, where k is the number of speakers. A speaker tag is converted into a speaker vector sequence Si through the speaker Embedding layer.
3) The serialized text data (the phoneme sequence w1, w2, …, wU) is converted into a phoneme vector sequence x1, x2, …, xU through the phoneme Embedding layer; the prosodic tags comprise prosodic words (#1), prosodic phrases (#2), intonation phrases (#3), sentence ends (#4) and character boundaries (#S).
x1, x2, …, xU = Embedding(w1, w2, …, wU);
where xi denotes the phoneme vector corresponding to the i-th word in the text and Embedding(·) denotes the embedding operation. For example, the text "She holds her shoes in her hands and, with bare feet, deliberately steps into the puddle." becomes, after prosody labeling, "she holds #1 shoe #1 in #1 hand #3 bare #1 foot #2 and steps on #1 puddle #4 deliberately #1."
4) The converted phoneme vector sequence x1, x2, …, xU is input to the CBHG module; together with the speaker vector sequence Si, the predicted duration is generated through the duration control unit, and the skip module then generates the skip-coded text coding information that does not contain prosodic-tag positions. The CBHG module employed in this embodiment comprises a bank of one-dimensional convolutional filters, which effectively models current and context information, followed by a multi-layer highway network that extracts higher-level features; finally, a bidirectional gated recurrent unit (GRU) recurrent neural network (RNN) extracts the contextual features of the sequence.
Expressed by the formula:
t1, t2, …, tU = CBHG(x1, x2, …, xU)
where ti is the coding information of the i-th word in the text;
5) Since prosodic tags are added to the input serialized text data but the tags have no explicit pronunciation duration, skip coding is needed to remove them and generate t1, t2, …, tU′, where U′ < U and U′ is the text length after the prosodic tags are removed.
The text coding information with the prosodic tags removed is:
t1, t2, …, tU′ = Skip_state(t1, t2, …, tU)
6) The skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration control unit. The standard of the length expansion is: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration. After expansion, the duration-adjusted text coding information t1, t2, …, tT is obtained, where T is the number of frames of the extracted real Mel spectrum.
The duration control unit and the energy and pitch control units share the same network structure: three one-dimensional convolution layers with regularization layers are used for feature separation; a bidirectional GRU learns the relationship between preceding and following phoneme features; finally, the duration/energy/pitch is predicted through a linear affine transformation.
t1, t2, …, tT = State_Expand(t1, t2, …, tU′)
7) Predict pitch and energy: the duration-adjusted text coding information with prosodic tags removed and the predicted duration information, together with the speaker vector sequence Si, are used as inputs to the pitch control unit and the energy control unit, yielding the predicted pitch and predicted energy used to control the pitch and energy of the generated audio.
8) Combine the predicted pitch and energy with the text coding information t1, t2, …, tT into the combined text encoding features E1, E2, …, ET. From the text coding information t1, t2, …, tT, the energy and pitch control units give, respectively:
e1, e2, …, eT = Energy_Predictor(t1, t2, …, tT; Si)
p1, p2, …, pT = Pitch_Predictor(t1, t2, …, tT; Si)
E1, E2, …, ET = (e1, e2, …, eT + p1, p2, …, pT) * t1, t2, …, tT + Si
where E1, E2, …, ET is the combined text coding information, e1, e2, …, eT is the output of the energy control unit, p1, p2, …, pT is the output of the pitch control unit, t1, t2, …, tT is the duration-adjusted text coding information, and Si denotes the speaker vector sequence (here the second speaker's vector sequence).
9) Decode the text encoding features E1, E2, …, ET to generate the predicted Mel spectrum.
In an embodiment of the present invention, the decoder consists of a bidirectional LSTM and a linear affine transformation, which can be expressed as follows.
Encoding by the BLSTM (a forward and a backward pass over the combined features):
h_t(fwd) = LSTM_fwd(E_t, h_{t-1}(fwd))
h_t(bwd) = LSTM_bwd(E_t, h_{t+1}(bwd))
The two final hidden states of the bidirectional pass are combined to obtain h*:
h* = [h(fwd); h(bwd)]
For the obtained h*, a predicted Mel spectrum is generated through a linear affine transformation:
M1, M2, …, MT = Linear(h*)
finally, the generated Mel frequency spectrum is synthesized into the voice with controllable rhythm by a common vocoder.
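The combination step of 8) and the decoder of 9) can be sketched as follows (hidden sizes are assumptions; the predicted energy and pitch are assumed to be already projected or broadcast to the dimension of the text coding information):

import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder: combine E = (e + p) * t + S_i, then a bidirectional
    LSTM and a linear affine transformation produce the 80-dim predicted Mel spectrum."""

    def __init__(self, enc_dim=256, hidden=256, n_mels=80):
        super().__init__()
        self.blstm = nn.LSTM(enc_dim, hidden, batch_first=True, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, n_mels)

    def forward(self, text_enc, energy, pitch, speaker_emb):
        # text_enc, energy, pitch: (B, T, D); speaker_emb: (B, D)
        combined = (energy + pitch) * text_enc + speaker_emb.unsqueeze(1)
        h, _ = self.blstm(combined)       # (B, T, 2 * hidden)
        return self.linear(h)             # (B, T, n_mels) predicted Mel spectrum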
In one embodiment of the present invention, as shown in fig. 2, the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
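A Python/PyTorch sketch of this shared structure (the kernel size, dropout and hidden width are assumptions not specified in the patent):

import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Shared structure of the duration, pitch and energy control units:
    three 1-D convolutions, each followed by layer normalization, then a
    bidirectional GRU and a linear affine transformation giving one value per position."""

    def __init__(self, in_dim=256, hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(in_dim if i == 0 else hidden, hidden,
                      kernel_size, padding=kernel_size // 2)
            for i in range(3)])
        self.norms = nn.ModuleList([nn.LayerNorm(hidden) for _ in range(3)])
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, x, speaker_emb):
        # x: (B, L, D) text encodings; speaker_emb: (B, D), broadcast over positions
        x = x + speaker_emb.unsqueeze(1)
        for conv, norm in zip(self.convs, self.norms):
            x = self.act(conv(x.transpose(1, 2))).transpose(1, 2)
            x = self.dropout(norm(x))
        x, _ = self.gru(x)
        return self.proj(x).squeeze(-1)    # (B, L) predicted duration, pitch or energy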
Prosodic tags are added to the text to be synthesized using a pre-trained prosodic phrase boundary prediction model: the text to be synthesized is input into the pre-trained model and the text to be synthesized with prosodic tags is output. The pre-trained prosodic phrase boundary prediction model uses a decision tree or a BLSTM-CRF to predict phrase boundaries and insert the prosodic tags.
During synthesis, the speaker and the style of any speaker can each be designated as model inputs.
Compared with the traditional approach of building several separate models, the invention builds directly from text to acoustic features and trains end to end: a duration loss is calculated from the predicted and real durations, a pitch loss from the predicted and real pitch, an energy loss from the predicted and real energy, and a Mel spectrum loss from the predicted and real Mel spectra; the mixed speech synthesis model is trained end-to-end by combining these loss values. This avoids a single model's prediction error degrading the whole system and therefore improves the fault tolerance of the model.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
The invention was tested on a multi-speaker text dataset containing 32,500 utterances (30,000 Chinese, 2,000 English and 500 Chinese-English mixed) with corresponding prosody labels. The dataset is preprocessed as follows:
1) extracting Chinese and English phoneme files and corresponding audio, and extracting pronunciation duration of the phoneme by using an open source tool Montreal-forced-aligner.
2) A Mel spectrum is extracted for each audio, with a window size of 50 milliseconds, a frame shift of 12.5 milliseconds and 80 Mel dimensions.
3) For each audio, the pitch of the audio is extracted using the World vocoder.
4) The energy of the mel spectrum is obtained by summing the mel spectrum extracted from the audio in dimension.
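For reference only, the Mel spectrum and energy extraction described in 2) and 4) could be implemented as follows (the sampling rate, the use of librosa and the log compression are assumptions; the patent does not name a specific toolkit for these two steps):

import numpy as np
import librosa

def extract_mel_and_energy(wav_path, sr=22050):
    """80-dim Mel spectrum with a 50 ms window and 12.5 ms frame shift;
    energy is the per-frame sum of the Mel spectrum over the Mel dimension."""
    y, _ = librosa.load(wav_path, sr=sr)
    win = int(0.050 * sr)                  # 50 ms window
    hop = int(0.0125 * sr)                 # 12.5 ms frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=win,
                                         win_length=win, hop_length=hop, n_mels=80)
    energy = mel.sum(axis=0)               # frame-level energy, as in step 4)
    log_mel = np.log(np.clip(mel, 1e-5, None))
    return log_mel, energy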
The mixed voice synthesis system realizes the controllable operation of four dimensions of text rhythm, energy, duration and pitch in the voice synthesis process, and realizes the support of multiple languages; the support of multiple speakers is realized; the support of speaker style migration is realized, and the wide application of a voice synthesis system in an industrial scene is facilitated.
Various technical features of the above embodiments can be combined arbitrarily, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described in detail, however, as long as there is no contradiction between the combinations of the technical features, the scope of the present description should be considered as being described in the present specification.

Claims (10)

1. A speech synthesis apparatus supporting multi-speaker style, language switching and prosody control, comprising:
the text acquisition unit is used for acquiring different text data according to the mode of the voice synthesis device, acquiring a mixed training text with a rhythm label and corresponding standard voice audio in the training mode, and marking a speaker label of each standard voice audio; acquiring a text to be synthesized in a prediction mode;
the text preprocessing unit is used for converting the text into a phoneme sequence with a prosodic tag, and outputting a real Mel frequency spectrum, a real energy, a real pitch, a real duration and a corresponding speaker tag according to a standard voice audio corresponding to the text during a training mode;
the language switching unit is used for storing and displaying speaker labels corresponding to training data of different language types and automatically identifying the language type of the text to be synthesized;
the style switching unit is used for reading the language type of the text displayed by the language switching unit and setting a first speaker label as a voice synthesis style according to the language type;
a speaker switching unit for setting a second speaker tag as a designated speaker;
in the training mode, the first speaker label and the second speaker label are both speaker labels marked in the mixed training sample; in the prediction mode, the first speaker tag and the second speaker tag are respectively specified by a user through a style switching unit and a speaker switching unit;
an encoding-decoding unit including an encoder for encoding a phoneme sequence with a prosody tag, a first speaker tag, and a second speaker tag; the prosody control unit is used for predicting and adjusting the duration, pitch and energy of voice synthesis; the decoder is used for combining the first speaker coding information, the second speaker coding information, the pitch and the energy which are regulated by the rhythm control unit, and decoding the combined coding information to obtain a predicted Mel frequency spectrum;
the training unit is used for training the coding-decoding unit and saving the coding-decoding unit as a model file after the training is finished;
and the voice synthesis unit is used for loading the model file generated by the training unit, reading the text to be synthesized in the text acquisition unit, the first speaker label set by the style switching unit and the second speaker label set by the speaker switching unit as the input of the model, generating a predicted Mel frequency spectrum, and converting the predicted Mel frequency spectrum into a voice signal for voice playing.
2. The apparatus according to claim 1, wherein the text preprocessing unit converts the text into a phoneme sequence with prosodic tags, and specifically comprises:
respectively converting the different language types in the text into corresponding pronunciation phonemes to construct a mixed phoneme dictionary; the phonemes with prosodic labels are mapped to serialized data using the mixed phoneme dictionary, obtaining a phoneme sequence w1, w2, …, wU, where U is the length of the text.
3. The apparatus of claim 1, wherein the prosody control unit comprises:
the time length control unit is used for predicting the time length of the text coding information and the first speaker coding information output by the CBHG module, outputting the predicted time length and adjusting the predicted time length;
the alignment unit is used for aligning the text coding information which is output by the encoder and does not contain the prosodic tags according to the duration information output by the duration control unit, the length of the text coding information needs to be consistent with the length of a real Mel frequency spectrum in the training mode, the predicted duration of each phoneme is output by the trained duration control unit in the prediction mode, the length of each phoneme is expanded according to the predicted duration, and the text coding information after the duration adjustment is output after the expansion;
the energy control unit is used for reading the text coding information and the first speaker coding information which are output by the alignment unit and have the adjusted time length, generating predicted energy and adjusting the predicted energy;
and the pitch control unit is used for reading the duration-adjusted text coding information output by the alignment unit and the second speaker coding information, generating a predicted pitch and performing pitch adjustment on the predicted pitch.
4. The apparatus according to claim 3, wherein the alignment unit operates as follows: the skip-coded text coding information t1, t2, …, tU′ without prosodic-tag positions is length-expanded in combination with the duration information output by the duration control unit, the standard of the length expansion being: in the training stage, the length must be consistent with the length of the real Mel spectrum; in the prediction stage, the predicted duration of each phoneme output by the trained duration control unit is used and each phoneme is expanded according to its predicted duration; the duration-adjusted text coding information t1, t2, …, tT is obtained after the expansion, where T is the number of frames of the extracted real Mel spectrum.
5. The speech synthesis device of claim 3, wherein the encoder comprises a phoneme Embedding layer, a speaker Embedding layer, a CBHG module and a skip module;
the phoneme sequence with prosodic tags w1, w2, …, wU is converted by the phoneme Embedding layer into a phoneme vector sequence x1, x2, …, xU;
the input speaker tag si (i = 1, 2, 3, …) is converted by the speaker Embedding layer into a speaker vector sequence Si;
the converted phoneme vector sequence is used as the input of the CBHG module to generate the text coding information t1, t2, …, tU;
the duration is predicted from the text coding information t1, t2, …, tU and the speaker vector sequence Si;
the skip module generates, from the text coding information t1, t2, …, tU, the skip-coded text coding information t1, t2, …, tU′ that does not contain prosodic-tag positions, where U′ < U and U′ is the text length after the prosodic tags are removed.
6. The apparatus according to claim 3 or 5, wherein the encoding-decoding unit is trained by a training unit, specifically:
processing the phoneme sequence with the prosodic tag by a phoneme Embedding layer and a CBHG module in sequence to obtain text coding information, and removing the prosodic tag from the text coding information by a hopping module; respectively processing a first speaker tag and a second speaker tag through a speaker Embedding layer to obtain first speaker coding information and second speaker coding information;
the text coding information and the first speaker coding information are subjected to duration control to obtain the predicted duration with the speaker characteristics, and the predicted duration is multiplied by a duration adjustment factor, wherein the duration adjustment factor is 1;
aligning the text coding information without the prosodic tags according to the predicted duration after the duration adjustment to obtain the text coding information after the duration adjustment;
using the text coding information and the second speaker coding information after the duration adjustment as the input of a pitch control unit to obtain a predicted pitch with speaker characteristics, and multiplying the predicted pitch by a pitch adjustment factor, wherein the pitch adjustment factor is 1;
using the text coding information and the first speaker coding information after the duration adjustment as the input of an energy control unit to obtain predicted energy with speaker characteristics, and multiplying the predicted energy by an energy adjustment factor, wherein the energy adjustment factor is 1;
combining the predicted pitch, the predicted energy, the text coding information subjected to time length adjustment and the first speaker coding information to be used as the input of a decoder to obtain a predicted Mel frequency spectrum;
calculating a duration loss according to the predicted duration and the real duration, calculating a pitch loss according to the predicted pitch and the real pitch, calculating an energy loss according to the predicted energy and the real energy, and calculating a mel-frequency spectrum loss according to the predicted mel-frequency spectrum and the real mel-frequency spectrum; and combining various loss values to carry out end-to-end training on the encoder, the prosody control unit and the decoder.
7. The device of claim 1, wherein the prosodic tags comprise prosodic words, prosodic phrases, intonation phrases, sentence ends, and character boundaries.
8. The device of claim 1, wherein after the voice synthesis unit reads the text to be synthesized in the text acquisition unit, a prosodic tag needs to be added to the text to be synthesized, the prosodic tag is added by using a pre-trained prosodic phrase boundary prediction model, the text to be synthesized is input into the pre-trained prosodic phrase boundary prediction model, and the text to be synthesized with the prosodic tag is output.
9. The apparatus of claim 1, wherein the decoder is configured to combine the first speaker encoding information, the second speaker encoding information, and the pitch and energy adjusted by the prosody control unit, and the combination formula is:
E1, E2, …, ET = (e1, e2, …, eT + p1, p2, …, pT) * t1, t2, …, tT + Si
where E1, E2, …, ET is the combined text coding information, e1, e2, …, eT is the output of the energy control unit, p1, p2, …, pT is the output of the pitch control unit, t1, t2, …, tT is the duration-adjusted text coding information, and Si denotes the speaker vector sequence (here the second speaker's vector sequence).
10. The device for speech synthesis supporting multiple speaker styles, language switching and prosody control according to claim 1, wherein the decoder comprises a bidirectional LSTM and a linear affine transformation; the duration control unit, the pitch control unit and the energy control unit each consist of three one-dimensional convolution layers with regularization layers, a bidirectional gated recurrent unit GRU and a linear affine transformation.
CN202110008049.1A 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm Active CN112863483B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110008049.1A CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110008049.1A CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Publications (2)

Publication Number Publication Date
CN112863483A true CN112863483A (en) 2021-05-28
CN112863483B CN112863483B (en) 2022-11-08

Family

ID=76003829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110008049.1A Active CN112863483B (en) 2021-01-05 2021-01-05 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm

Country Status (1)

Country Link
CN (1) CN112863483B (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113223494A (en) * 2021-05-31 2021-08-06 平安科技(深圳)有限公司 Prediction method, device, equipment and storage medium of Mel frequency spectrum
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345407A (en) * 2021-06-03 2021-09-03 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113393829A (en) * 2021-06-16 2021-09-14 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113421571A (en) * 2021-06-22 2021-09-21 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113488021A (en) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 Method for improving naturalness of speech synthesis
CN113658577A (en) * 2021-08-16 2021-11-16 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis model training method, audio generation method, device and medium
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114708876A (en) * 2022-05-11 2022-07-05 北京百度网讯科技有限公司 Audio processing method and device, electronic equipment and storage medium
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device
CN116895273A (en) * 2023-09-11 2023-10-17 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN117476027A (en) * 2023-12-28 2024-01-30 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002108383A (en) * 2000-09-29 2002-04-10 Pioneer Electronic Corp Speech recognition system
JP2010128103A (en) * 2008-11-26 2010-06-10 Nippon Telegr & Teleph Corp <Ntt> Speech synthesizer, speech synthesis method and speech synthesis program
US20200082806A1 (en) * 2018-01-11 2020-03-12 Neosapience, Inc. Multilingual text-to-speech synthesis
TW202016921A (en) * 2018-10-24 2020-05-01 中華電信股份有限公司 Method for speech synthesis and system thereof
CN109326283A (en) * 2018-11-23 2019-02-12 南京邮电大学 Multi-to-multi phonetics transfer method under non-parallel text condition based on text decoder
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
US20200402497A1 (en) * 2019-06-24 2020-12-24 Replicant Solutions, Inc. Systems and Methods for Speech Generation
CN111292720A (en) * 2020-02-07 2020-06-16 北京字节跳动网络技术有限公司 Speech synthesis method, speech synthesis device, computer readable medium and electronic equipment
CN111862934A (en) * 2020-07-24 2020-10-30 苏州思必驰信息科技有限公司 Method for improving speech synthesis model and speech synthesis method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Xiaofeng, Xie Jun, Luo Jianxin, Yu Lu: "Research on Deep Learning Speech Synthesis Technology", Computer Era *
Zhang Yaxin, Zhang Lianhai: "A Voice Cloning Method Based on x-vector Speaker Features", Journal of Information Engineering University *

Cited By (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580955B1 (en) * 2021-03-31 2023-02-14 Amazon Technologies, Inc. Synthetic speech processing
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113327575A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113223494B (en) * 2021-05-31 2024-01-30 平安科技(深圳)有限公司 Method, device, equipment and storage medium for predicting mel frequency spectrum
CN113223494A (en) * 2021-05-31 2021-08-06 平安科技(深圳)有限公司 Prediction method, device, equipment and storage medium of Mel frequency spectrum
CN113327574B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN113345407A (en) * 2021-06-03 2021-09-03 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113345407B (en) * 2021-06-03 2023-05-26 广州虎牙信息科技有限公司 Style speech synthesis method and device, electronic equipment and storage medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113409764B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
CN113393829A (en) * 2021-06-16 2021-09-14 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113393829B (en) * 2021-06-16 2023-08-29 哈尔滨工业大学(深圳) Chinese speech synthesis method integrating rhythm and personal information
CN113421571A (en) * 2021-06-22 2021-09-21 云知声智能科技股份有限公司 Voice conversion method and device, electronic equipment and storage medium
CN113488021A (en) * 2021-08-09 2021-10-08 杭州小影创新科技股份有限公司 Method for improving naturalness of speech synthesis
CN113658577A (en) * 2021-08-16 2021-11-16 腾讯音乐娱乐科技(深圳)有限公司 Speech synthesis model training method, audio generation method, device and medium
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN113707125B (en) * 2021-08-30 2024-02-27 中国科学院声学研究所 Training method and device for multi-language speech synthesis model
WO2023051155A1 (en) * 2021-09-30 2023-04-06 华为技术有限公司 Voice processing and training methods and electronic device
CN113763924B (en) * 2021-11-08 2022-02-15 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN113763924A (en) * 2021-11-08 2021-12-07 北京优幕科技有限责任公司 Acoustic deep learning model training method, and voice generation method and device
CN114708876B (en) * 2022-05-11 2023-10-03 北京百度网讯科技有限公司 Audio processing method, device, electronic equipment and storage medium
CN114708876A (en) * 2022-05-11 2022-07-05 北京百度网讯科技有限公司 Audio processing method and device, electronic equipment and storage medium
CN116895273B (en) * 2023-09-11 2023-12-26 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN116895273A (en) * 2023-09-11 2023-10-17 南京硅基智能科技有限公司 Output method and device for synthesized audio, storage medium and electronic device
CN117476027A (en) * 2023-12-28 2024-01-30 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117476027B (en) * 2023-12-28 2024-04-23 南京硅基智能科技有限公司 Voice conversion method and device, storage medium and electronic device
CN117496944A (en) * 2024-01-03 2024-02-02 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117496944B (en) * 2024-01-03 2024-03-22 广东技术师范大学 Multi-emotion multi-speaker voice synthesis method and system
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method

Also Published As

Publication number Publication date
CN112863483B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
CN112802450B (en) Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN112802448B (en) Speech synthesis method and system for generating new tone
CN108899009B (en) Chinese speech synthesis system based on phoneme
Mu et al. Review of end-to-end speech synthesis technology based on deep learning
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN111681641B (en) Phrase-based end-to-end text-to-speech (TTS) synthesis
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
KR20200088263A (en) Method and system of text to multiple speech
Black et al. The Festival speech synthesis system, version 1.4.2
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
Lorenzo-Trueba et al. Simple4All proposals for the Albayzin evaluations in speech synthesis
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
CN113539268A (en) End-to-end voice-to-text rare word optimization method
Kang et al. Connectionist temporal classification loss for vector quantized variational autoencoder in zero-shot voice conversion
CN115620699A (en) Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
Xia et al. HMM-based unit selection speech synthesis using log likelihood ratios derived from perceptual data
Krug et al. Articulatory synthesis for data augmentation in phoneme recognition
CN113628609A (en) Automatic audio content generation
CN116403562B (en) Speech synthesis method and system based on semantic information automatic prediction pause
CN113763924B (en) Acoustic deep learning model training method, and voice generation method and device
Zhang et al. Chinese speech synthesis system based on end to end
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
CN113362803B (en) ARM side offline speech synthesis method, ARM side offline speech synthesis device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant