CN110299131A - Speech synthesis method, device, and storage medium with controllable prosodic emotion - Google Patents

Speech synthesis method, device, and storage medium with controllable prosodic emotion

Info

Publication number
CN110299131A
CN110299131A (application CN201910706204.XA)
Authority
CN
China
Prior art keywords
rhythm
vector
attention
emotion
controllable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910706204.XA
Other languages
Chinese (zh)
Other versions
CN110299131B (en)
Inventor
王欢良
王飞
张李
沈文武
代大明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qdreamer Network Science And Technology Co Ltd
Original Assignee
Suzhou Qdreamer Network Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qdreamer Network Science And Technology Co Ltd filed Critical Suzhou Qdreamer Network Science And Technology Co Ltd
Priority to CN201910706204.XA
Publication of CN110299131A
Application granted
Publication of CN110299131B
Legal status: Active (Current)
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; Stress or intonation
    • G10L2013/105 — Duration

Abstract

The present invention provides a speech synthesis method, device, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control its prosodic rhythm. The method comprises the following steps: converting the characters of the text to be synthesized into character representation vectors; concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors; concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism; concatenating the spectrum frame predicted at the previous time step with the attention vector, feeding it into a decoder, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation; and converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.

Description

Speech synthesis method, device, and storage medium with controllable prosodic emotion
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to a speech synthesis method, device, and storage medium with controllable prosodic emotion.
Background technique
Speech synthesis, also known as text-to-speech (TTS), is a technology that can convert any input text into the corresponding speech.
A traditional speech synthesis system generally includes two modules: a front end and a back end. The front-end module mainly analyzes the input text and extracts the linguistic information required by the back-end module; for a Chinese synthesis system, the front end generally comprises submodules such as text normalization, word segmentation, part-of-speech prediction, polyphone disambiguation, and prosody prediction. The back-end module generates the speech waveform from the front-end analysis results; back-end systems generally fall into speech synthesis based on statistical parametric modeling (parametric synthesis) and speech synthesis based on unit selection and waveform concatenation (concatenative synthesis).
Current end-to-end synthesis models not only produce audio with higher fidelity and naturalness, but also have a simple modeling process that requires no linguistic information; they have therefore become the mainstream speech synthesis technique. However, classical end-to-end synthesis has its technical weaknesses: unpredictable, uncontrollable synthesis defects may appear, and the prosodic rhythm of the synthesized speech — such as phoneme duration, stress, and intonation — cannot be explicitly controlled. This is mainly because the input of end-to-end synthesis relies only on shallow text content, such as letter sequences, syllable sequences, or phoneme sequences, and cannot exploit deep linguistic information such as part of speech, intonation, or syntactic structure.
Summary of the invention
In view of the above problems, the present invention provides a speech synthesis method, device, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control its prosodic rhythm.
The technical solution is as follows: a speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.
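The data flow of steps S1-S4 can be sketched as follows. All weights here are random stand-ins for the trained encoder, attention, and decoder parameters, and every dimension (32-dim character embeddings, 80 mel bins, a single decoder step, etc.) is an illustrative assumption rather than a value from the patent — the point is only where the prosody vector gets concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_char, d_pros, n_mel = 6, 32, 5, 80   # hypothetical sizes, not from the patent

# Step S1: character representation vectors for the text to be synthesized.
char_emb = rng.normal(size=(T, d_char))
# 5-dim prosody code, repeated for every character (here: an all-zero "neutral" code).
prosody = np.zeros((T, d_pros))

# Step S2: concatenate the prosody vector onto each character vector, then encode.
enc_in = np.concatenate([char_emb, prosody], axis=1)         # (T, 37)
W_enc = rng.normal(size=(enc_in.shape[1], 64))
enc_out = np.tanh(enc_in @ W_enc)                            # stand-in encoder, (T, 64)

# Step S3: concatenate the prosody vector again, then form an attention vector.
att_in = np.concatenate([enc_out, prosody], axis=1)          # (T, 69)
scores = att_in @ rng.normal(size=att_in.shape[1])
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()  # attention weights
context = alpha @ att_in                                     # attention vector, (69,)

# Step S4: previous predicted spectrum frame + attention vector -> decoder step,
# then decoder output + attention vector -> projection -> next frame + stop token.
prev_frame = np.zeros(n_mel)
dec_in = np.concatenate([prev_frame, context])
W_dec = rng.normal(size=(dec_in.shape[0], 128))
dec_out = np.tanh(dec_in @ W_dec)                            # stand-in decoder step
proj_in = np.concatenate([dec_out, context])
W_proj = rng.normal(size=(proj_in.shape[0], n_mel + 1))
out = proj_in @ W_proj
next_frame, stop_logit = out[:n_mel], out[n_mel]             # spectrum frame + end point
```

In a real system this decoder step would run in a loop until the stop prediction fires; a single step suffices to show the concatenation points.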
Further, in step S4, after decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality.
Further, the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information. Speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character. Speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone.
Further, the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
Further, in step S3, a location-sensitive attention mechanism is used.
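The patent does not spell out the attention computation, but a location-sensitive attention step (in the style popularized by Tacotron 2) can be sketched roughly as below. The function name, the shapes, and the 1-D convolution over the previous alignment are illustrative assumptions; `W`, `V`, `U`, `v`, and `conv_kernel` stand in for learned parameters:

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_align, conv_kernel, W, V, U, v):
    """One step of location-sensitive attention: the score for each encoder
    position also depends on the previous alignment, filtered by a 1-D conv."""
    f = np.convolve(prev_align, conv_kernel, mode="same")   # location features, (T,)
    # energy e_t = v^T tanh(W q + V k_t + U f_t), computed for all t at once
    e = np.tanh(query @ W + keys @ V + np.outer(f, U)) @ v  # (T,)
    a = np.exp(e - e.max())
    return a / a.sum()                                      # new alignment, sums to 1
```

Because the previous alignment feeds back into the score, the mechanism encourages the alignment to move monotonically along the encoder outputs, which is why it is a common choice for TTS.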
Further, the predicted speech spectrum carrying the prosodic rhythm is input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN.
Further, the predicted speech spectrum carrying the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
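Griffin-Lim reconstructs a waveform from a magnitude spectrogram by alternately enforcing the known magnitudes and the phase implied by the current waveform estimate. A toy NumPy sketch of the idea follows — it is not the patent's implementation, and the frame size, hop, window, and iteration count are arbitrary choices:

```python
import numpy as np

N_FFT, HOP = 256, 64
WIN = np.hanning(N_FFT)

def stft(x):
    """Windowed short-time Fourier transform, one row per frame."""
    return np.array([np.fft.rfft(x[i:i + N_FFT] * WIN)
                     for i in range(0, len(x) - N_FFT + 1, HOP)])

def istft(spec):
    """Overlap-add inverse STFT with window-power normalization."""
    n = HOP * (len(spec) - 1) + N_FFT
    out, norm = np.zeros(n), np.zeros(n)
    for i, frame in enumerate(spec):
        out[i * HOP:i * HOP + N_FFT] += np.fft.irfft(frame) * WIN
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=50):
    """Iteratively re-estimate phase so that |STFT(x)| approaches mag."""
    angles = np.exp(2j * np.pi * np.random.default_rng(0).random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles)
        angles = np.exp(1j * np.angle(stft(x)))
    return istft(mag * angles)
```

Production systems would instead use a library routine (or a neural vocoder as in the preceding paragraph); this sketch only shows why the method needs no phase information in the predicted spectrum.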
A speech synthesis device with controllable prosodic emotion, characterized by comprising:
a representation space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
A speech synthesis device with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
The speech synthesis method, device, and storage medium with controllable prosodic emotion of the present invention improve on the classical end-to-end synthesis method. By inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information. The prosodic rhythm vector, which contains speed information, stress information, and intonation information, defines additional prosodic rhythm cues with which the end-to-end synthesis model can be better trained. By adding the prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and modified, thereby controlling the emotional prosody of the synthesized speech.
Detailed description of the invention
Fig. 1 is a flow diagram of a speech synthesis method with controllable prosodic emotion according to the present invention;
Fig. 2 is a block diagram of a speech synthesis device with controllable prosodic emotion according to the present invention.
Specific embodiment
Referring to Fig. 1, a speech synthesis method with controllable prosodic emotion according to the present invention comprises the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors; the encoder is generally modeled with a CNN+LSTM network;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through a location-sensitive attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation. After decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality; the decoder is generally modeled with LSTM + CNN + linear projection;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm. The predicted speech spectrum can be input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN. Alternatively, the predicted speech spectrum can be passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
Specifically, in this embodiment, the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information. Speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character.
Speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone. In the speed information, "normal" denotes the normal speaking rate, "slow" denotes 0.5 times the normal rate, "fast" denotes 1.5 times the normal rate, and "ultra-fast" denotes 2 times the normal rate.
In this embodiment, the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
In this embodiment, the specific encodings of the speed, stress, and intonation information are as follows:
Speed — normal: 00
Speed — slow: 01
Speed — fast: 10
Speed — ultra-fast: 11
Stress — stressed: 1
Stress — unstressed: 0
Intonation — high flat tone: 00
Intonation — rising tone: 01
Intonation — falling tone: 10
Intonation — low flat tone: 11
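Under the bit layout above, packing the three fields into the 5-dimensional code could look like the following sketch; the English key names are invented labels for the patent's categories:

```python
# Bit tables for the 5-dim prosody code: 2 speed bits + 1 stress bit + 2 tone bits.
SPEED = {"normal": "00", "slow": "01", "fast": "10", "ultrafast": "11"}
STRESS = {"stressed": "1", "unstressed": "0"}
TONE = {"high_flat": "00", "rising": "01", "falling": "10", "low_flat": "11"}

def prosody_vector(speed, stress, tone):
    """Pack (speed, stress, tone) into the 5-dimensional prosody encoding vector."""
    bits = SPEED[speed] + STRESS[stress] + TONE[tone]
    return [int(b) for b in bits]
```

For example, the neutral default described below (normal speed, unstressed, high flat tone) encodes as `[0, 0, 0, 0, 0]`.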
During speech synthesis, if the text to be synthesized is neutral and no obvious emotion is needed, the default prosodic rhythm control information fed into the synthesizer may be: normal speed, unstressed, high flat tone. When an obvious emotional prosody is required, the prosodic rhythm information can be set accordingly.
Referring to Fig. 2, a speech synthesis device with controllable prosodic emotion according to the present invention comprises:
a representation space conversion module 1, for converting the characters of the text to be synthesized into character representation vectors;
an encoder 2, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module 3, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder 4, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
A speech synthesis device with controllable prosodic emotion comprises a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
In the implementation of the above speech synthesis device with controllable prosodic emotion, the memory and the processor are electrically connected, directly or indirectly, to realize data transmission or interaction. For example, these elements may be electrically connected to each other through one or more communication buses or signal lines, such as a bus. The memory stores computer-executable instructions implementing the data access control method, including at least one software functional module that may be stored in the memory in the form of software or firmware; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc. The memory is used to store the program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment of the present invention, a computer-readable storage medium is also provided; the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks, including instructions for causing a device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in certain parts of the embodiments.
The input of a classical end-to-end synthesis system is the character sequence corresponding to the text to be synthesized, so for the same text its prosodic rhythm cannot be controlled independently. As a result, the prosodic rhythm that the synthesized speech can exhibit is very limited, giving it a noticeably mechanical feel.
To this end, this patent improves the classical end-to-end synthesis method: by inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information.
Prosodic rhythm information is typically a supra-segmental feature, whereas end-to-end synthesis generally uses characters or phonemes as modeling units. Therefore, during modeling, the segment-level prosodic information is evenly distributed to the corresponding characters or phonemes. The prosodic rhythm vector, containing speed information, stress information, and intonation information, defines additional prosodic rhythm cues with which the end-to-end synthesis model can be better trained, so that the prosodic rhythm of the synthesized speech can be effectively controlled. By adding the duration, stress, and intonation prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and modified, thereby controlling the emotional prosody of the synthesized speech.
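Evenly distributing a word-level (supra-segmental) prosody code over the characters of its word, as described above, might look like this minimal sketch (a hypothetical helper, not code from the patent):

```python
def spread_prosody(words, word_codes):
    """Copy each word-level prosody code to every character of that word,
    since the end-to-end model consumes one vector per character/phoneme."""
    per_char = []
    for word, code in zip(words, word_codes):
        per_char.extend([code] * len(word))
    return per_char
```

A two-word input with distinct codes thus yields one identical code per character within each word, which is what gets concatenated at the encoder and attention stages.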

Claims (10)

1. A speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.
2. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S4, after decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality.
3. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information; speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character; speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone.
4. The speech synthesis method with controllable prosodic emotion according to claim 3, characterized in that: the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
5. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S3, a location-sensitive attention mechanism is used.
6. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum carrying the prosodic rhythm is input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN.
7. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum carrying the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
8. A speech synthesis device with controllable prosodic emotion, characterized by comprising:
a representation space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
9. A speech synthesis device with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the speech synthesis method with controllable prosodic emotion described above.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the speech synthesis method with controllable prosodic emotion described above.
CN201910706204.XA 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium Active CN110299131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910706204.XA CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910706204.XA CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Publications (2)

Publication Number Publication Date
CN110299131A true CN110299131A (en) 2019-10-01
CN110299131B CN110299131B (en) 2021-12-10

Family ID: 68032457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706204.XA Active CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Country Status (1)

Country Link
CN (1) CN110299131B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system
CN111583902A (en) * 2020-05-14 2020-08-25 携程计算机技术(上海)有限公司 Speech synthesis system, method, electronic device, and medium
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN112767969A (en) * 2021-01-29 2021-05-07 苏州思必驰信息科技有限公司 Method and system for determining emotion tendentiousness of voice information
WO2021127979A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device, and computer readable storage medium
WO2021134591A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN113096636A (en) * 2021-06-08 2021-07-09 北京世纪好未来教育科技有限公司 Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium
WO2021179910A1 (en) * 2020-03-09 2021-09-16 百果园技术(新加坡)有限公司 Text voice front-end conversion method and apparatus, and device and storage medium
CN113643717A (en) * 2021-07-07 2021-11-12 深圳市联洲国际技术有限公司 Music rhythm detection method, device, equipment and storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN114420086A (en) * 2022-03-30 2022-04-29 北京沃丰时代数据科技有限公司 Speech synthesis method and device
WO2022095754A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
WO2022105545A1 (en) * 2020-11-20 2022-05-27 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and readable medium and electronic device
WO2023061259A1 (en) * 2021-10-14 2023-04-20 北京字跳网络技术有限公司 Speech speed adjustment method and apparatus, electronic device, and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765A (en) * 2007-01-09 2007-07-18 黑龙江大学 Speech synthetic method based on rhythm character
US7454348B1 (en) * 2004-01-08 2008-11-18 At&T Intellectual Property Ii, L.P. System and method for blending synthetic voices
CN103077705A (en) * 2012-12-30 2013-05-01 安徽科大讯飞信息科技股份有限公司 Method for optimizing local synthesis based on distributed natural rhythm
US20160203815A1 (en) * 2008-06-06 2016-07-14 At&T Intellectual Property I, Lp System and method for synthetically generated speech describing media content
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109543722A (en) * 2018-11-05 2019-03-29 中山大学 A kind of emotion trend forecasting method based on sentiment analysis model
CN109616093A (en) * 2018-12-05 2019-04-12 平安科技(深圳)有限公司 End-to-end phoneme synthesizing method, device, equipment and storage medium
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zeng Biqing et al.: "Sentiment Analysis Based on a Dual-Attention Convolutional Neural Network Model", Journal of Guangdong University of Technology *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, and news broadcasting method and system
WO2021127979A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device, and computer readable storage medium
WO2021134591A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
WO2021179910A1 (en) * 2020-03-09 2021-09-16 百果园技术(新加坡)有限公司 Text-to-speech front-end conversion method, apparatus, device, and storage medium
CN111583902A (en) * 2020-05-14 2020-08-25 携程计算机技术(上海)有限公司 Speech synthesis system, method, electronic device, and medium
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English speech synthesis method and system, electronic device, and storage medium
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English speech synthesis method and system, electronic device, and storage medium
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN111724765B (en) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 Text-to-speech method and device and computer equipment
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN112185363B (en) * 2020-10-21 2024-02-13 北京猿力未来科技有限公司 Audio processing method and device
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
WO2022095754A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
WO2022105545A1 (en) * 2020-11-20 2022-05-27 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and readable medium and electronic device
CN112767969A (en) * 2021-01-29 2021-05-07 苏州思必驰信息科技有限公司 Method and system for determining emotion tendentiousness of voice information
CN113096636A (en) * 2021-06-08 2021-07-09 北京世纪好未来教育科技有限公司 Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium
CN113643717A (en) * 2021-07-07 2021-11-12 深圳市联洲国际技术有限公司 Music rhythm detection method, device, equipment and storage medium
WO2023061259A1 (en) * 2021-10-14 2023-04-20 北京字跳网络技术有限公司 Speech speed adjustment method and apparatus, electronic device, and readable storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN114420086A (en) * 2022-03-30 2022-04-29 北京沃丰时代数据科技有限公司 Speech synthesis method and device
CN114420086B (en) * 2022-03-30 2022-06-17 北京沃丰时代数据科技有限公司 Speech synthesis method and device

Also Published As

Publication number Publication date
CN110299131B (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110299131A (en) Speech synthesis method, apparatus, and storage medium with controllable prosody and emotion
Zhang et al. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning
US11295721B2 (en) Generating expressive speech audio from text data
CN112687259B (en) Speech synthesis method, device and readable storage medium
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN108630203A (en) Interactive voice equipment and its processing method and program
US6212501B1 (en) Speech synthesis apparatus and method
KR20220054655A (en) Speech synthesis method and apparatus, storage medium
King A beginners’ guide to statistical parametric speech synthesis
KR102294639B1 (en) Deep neural network based non-autoregressive speech synthesizer method and system using multiple decoder
CN113327627B (en) Multi-factor controllable voice conversion method and system based on feature decoupling
JP5398295B2 (en) Audio processing apparatus, audio processing method, and audio processing program
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN111681641B (en) Phrase-based end-to-end text-to-speech (TTS) synthesis
CN113838448A (en) Voice synthesis method, device, equipment and computer readable storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
JP4008607B2 (en) Speech encoding / decoding method
US7089187B2 (en) Voice synthesizing system, segment generation apparatus for generating segments for voice synthesis, voice synthesizing method and storage medium storing program therefor
JP2007086309A (en) Voice synthesizer, voice synthesizing method, and program
JP5376643B2 (en) Speech synthesis apparatus, method and program
CN114495896A (en) Voice playing method and computer equipment
JP2010224418A (en) Voice synthesizer, method, and program
US7092878B1 (en) Speech synthesis using multi-mode coding with a speech segment dictionary
CN114495898B (en) Unified speech synthesis and speech conversion training method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant