CN108172209A - Build voice idol method - Google Patents
- Publication number
- CN108172209A CN108172209A CN201810017849.8A CN201810017849A CN108172209A CN 108172209 A CN108172209 A CN 108172209A CN 201810017849 A CN201810017849 A CN 201810017849A CN 108172209 A CN108172209 A CN 108172209A
- Authority
- CN
- China
- Prior art keywords
- idol
- voice
- text
- model
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 22
- 230000015572 biosynthetic process Effects 0.000 claims abstract description 45
- 238000003786 synthesis reaction Methods 0.000 claims abstract description 45
- 238000012549 training Methods 0.000 claims abstract description 33
- 239000000463 material Substances 0.000 claims abstract description 19
- 230000002996 emotional effect Effects 0.000 claims abstract description 17
- 238000005516 engineering process Methods 0.000 claims abstract description 12
- 238000013528 artificial neural network Methods 0.000 claims abstract description 10
- 230000006403 short-term memory Effects 0.000 claims abstract description 8
- 238000011017 operating method Methods 0.000 claims description 4
- 238000013459 approach Methods 0.000 claims description 3
- 230000008451 emotion Effects 0.000 claims description 3
- 239000000284 extract Substances 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000013135 deep learning Methods 0.000 description 2
- 230000015654 memory Effects 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 239000006185 dispersion Substances 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 210000005036 nerve Anatomy 0.000 description 1
- 238000003062 neural network model Methods 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
- 230000002194 synthesizing effect Effects 0.000 description 1
- 230000001755 vocal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to a method for building a voice idol. The method uses speech technologies such as speech recognition and emotional speech synthesis together with deep learning so that the voice idol answers fans' questions by voice, in a style and timbre the same as or similar to the idol's own. The main operating steps are: a. collect a large amount of text material about the idol; b. apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors; c. use the results of step b as input to an RNN training model to train a style learning model; d. learn the idol's speaking style through training on a large amount of data; e. collect a large number of voice recordings of the idol; f. use a bidirectional long short-term memory prosody hierarchy model on the recordings collected in step e to obtain an emotional speech synthesis model; g. use the text answer produced by the style model as the text input to speech synthesis, and use the emotional speech synthesis model of step f for speech synthesis.
Description
Technical field
The present invention relates to a method for building a voice idol, and in particular to speech technologies such as speech recognition and semantic understanding and to deep-learning techniques, especially technology for smart speakers with voice capability.
Background technology
Modern society is highly developed and material goods are abundant; material wealth has largely satisfied humanity's pursuit of material life. At the same time, modern people increasingly pursue cultural life and have come to hold it to a very high standard.
In pursuing cultural life, many people come to worship film and television performers and stage performers as idols. Out of this admiration, they long for an idol's autograph or a photo together, hope the idol will see their messages, and wish the idol would reply to them.
With the development of live-streaming and microblogging technology, an idol can answer some fan questions online during a relatively concentrated period on a live-streaming platform, but such opportunities are rare. An idol can also answer fans' questions sporadically on a microblog. However, because the number of fans is often enormous, an idol cannot possibly answer fan questions online in real time, whether on a live stream or on a microblog. And when the idol is not streaming or not logged into the microblog, fans have no way to talk to or question the idol they admire.
Invention content
The object of the present invention is to satisfy fans' admiration for their idols by providing a method for building a voice idol. Fans can ask questions by voice, and the voice idol answers the questions fans raise in real time. The beneficial effect of the present invention is that it largely fulfills fans' wish to communicate with their idols; the voice idol also performs stably, which is of great benefit to modern people's pursuit of cultural life.
To achieve this object, the idea of the invention is to use deep-learning technology: through training on massive data, neural network models are trained into an idol style learning model that imitates the idol's way of thinking and an emotional speech synthesis model for synthesizing speech, finally producing answers whose sound and manner of speaking are close or identical to the idol's.
According to the above design, the present invention adopts the following technical scheme:
A method for building a voice idol, characterized in that the operating steps are as follows:
(1) Voice-idol text extraction:
a) Collect a large amount of text material about the idol;
b) Apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
c) Use the results of step b as input to an RNN training model to train a style learning model;
d) Learn the idol's speaking style through training on a large amount of data.
(2) Voice-idol speech synthesis:
e) Collect a large number of voice recordings of the idol;
f) Use a bidirectional long short-term memory prosody hierarchy model on the recordings collected in step e to obtain an emotional speech synthesis model;
g) Use the result of step d as the text input to speech synthesis, and use the emotional speech synthesis model of step f for speech synthesis.
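The two-stage method above can be sketched as a small pipeline: a style model (steps a-d) turns a fan's question into a text answer, and an emotional speech synthesis model (steps e-g) turns that answer into audio. The class and function names below are illustrative assumptions, with stubs standing in for the trained models:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VoiceIdolPipeline:
    # Stage 1: style model maps a fan's question to a text answer in the idol's style.
    style_model: Callable[[str], str]
    # Stage 2: emotional TTS model maps the text answer to waveform samples.
    tts_model: Callable[[str], list]

    def answer(self, fan_question: str) -> list:
        text_answer = self.style_model(fan_question)   # steps a-d
        return self.tts_model(text_answer)             # steps e-g

# Stub models stand in for the trained RNN style model and the
# bidirectional-LSTM-based emotional speech synthesis model.
pipeline = VoiceIdolPipeline(
    style_model=lambda q: f"[idol-style answer to: {q}]",
    tts_model=lambda text: [0.0] * len(text),  # silent placeholder "audio"
)
audio = pipeline.answer("What is your favorite song?")
```

In a real system the two stubs would be replaced by the trained style learning model of step c and the emotional speech synthesis model of step f.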
Compared with the prior art, the present invention has the following obvious and substantive distinguishing features and notable technical progress:
(1) Step a collects a large amount of text material about the idol, drawn mainly from the following sources:
i. Text obtained by speech recognition from the idol's public interviews or interview-type video material, used as input to the style training model;
ii. The body text of the idol's microblog posts and the text of the idol's replies to fan messages, used as input to the training model.
(2) In step b, the text is fed to an LSTM neural network to obtain text vectors for use in subsequent training steps.
(3) In step c, the results of step b are used as input to an RNN training model to train the style learning model.
(4) Step e collects a large number of voice recordings of the idol, drawn mainly from all of the idol's publicly available video and audio material.
(5) In step f, the voice recordings collected in step e are used with a bidirectional long short-term memory prosody hierarchy model to obtain the emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic.
(6) In step g, the result of step d is used as the text input to speech synthesis, and the emotional speech synthesis model of step f is used for speech synthesis. This step completes speech synthesis with the idol's timbre and emotion; thanks to the preceding steps, the synthesized speech closely matches the idol's original voice.
(7) The answers and voice synthesized by the method of the invention imitate the idol with very high fidelity, giving fans a sense of being in the idol's presence, which is an unrivaled advantage for enriching people's spiritual and cultural life.
Description of the drawings
Figure 1 is a block diagram of the operating process of the invention.
Figure 2 is a schematic diagram of training the idol style learning model of the invention.
Figure 3 is a schematic diagram of training the emotional speech synthesis model of the invention.
Specific embodiment
The preferred embodiments of the present invention are described in detail below with reference to the drawings:
Embodiment one:
Referring to Fig. 1, this method for building a voice idol is characterized in that the operating steps are as follows:
(1) Voice-idol text extraction:
a) Collect a large amount of text material about the idol;
b) Apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
c) Use the results of step b as input to an RNN training model to train a style learning model;
d) Learn the idol's speaking style through training on a large amount of data.
(2) Voice-idol speech synthesis:
e) Collect a large number of voice recordings of the idol;
f) Use a bidirectional long short-term memory prosody hierarchy model on the recordings collected in step e to obtain an emotional speech synthesis model;
g) Use the result of step d as the text input to speech synthesis, and use the emotional speech synthesis model of step f for speech synthesis.
Embodiment two:
This embodiment is essentially the same as embodiment one; its distinguishing features are as follows:
(1) Step a collects a large amount of text material about the idol, drawn mainly from the following sources:
i. Text obtained by speech recognition from the idol's public interviews or interview-type video material, used as input to the style training model;
ii. The body text of the idol's microblog posts and the text of the idol's replies to fan messages, used as input to the training model.
(2) In step b, the text is fed to an LSTM neural network to obtain text vectors for use in subsequent training steps.
(3) In step c, the results of step b are used as input to an RNN training model to train the style learning model.
(4) Step e collects a large number of voice recordings of the idol, drawn mainly from all of the idol's publicly available video and audio material.
(5) In step f, the voice recordings collected in step e are used with a bidirectional long short-term memory prosody hierarchy model to obtain the emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic.
(6) In step g, the result of step d is used as the text input to speech synthesis, and the emotional speech synthesis model of step f is used for speech synthesis. This step completes speech synthesis with the idol's timbre and emotion; thanks to the preceding steps, the synthesized speech closely matches the idol's original voice.
Embodiment three:
(1) Referring to Fig. 2, an idol text answer is obtained from an idol text question; the operating steps are as follows:
A. a large amount of text materials collected about idol;
B. word vector is converted by big section text for the text material application LSTM neural networks that step a is collected;
C., step b results are used as to the input of RNN training patterns to training style learning model;
D. the style spoken by the training study of a large amount of data to idol.
(2) Referring to Fig. 3, an idol voice answer is obtained from the idol text answer:
E. a large amount of voice documents collected about idol;
F. voice document step a collected obtains emotional speech synthesis using two-way long short-term memory prosody hierarchy model
Model;
G., the result of embodiment one is used as to the text input of phonetic synthesis, the emotional speech synthesis model of step b is used
In phonetic synthesis.
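The bidirectional long short-term memory model of step f reads each utterance's feature sequence in both directions and combines the two hidden states at every frame, so prosody decisions can draw on both left and right context. A toy sketch, with a simple tanh recurrence standing in for the full LSTM cell and made-up frame values:

```python
import math

def run_rnn(xs, w_x=0.5, w_h=0.5):
    """Simple recurrent scan: h_t = tanh(w_x * x_t + w_h * h_{t-1})."""
    h, out = 0.0, []
    for x in xs:
        h = math.tanh(w_x * x + w_h * h)
        out.append(h)
    return out

def bidirectional(xs):
    """Concatenate forward and backward hidden states per time step."""
    fwd = run_rnn(xs)
    bwd = list(reversed(run_rnn(list(reversed(xs)))))
    return list(zip(fwd, bwd))

frames = [0.1, 0.4, -0.2, 0.9]   # stand-in for per-frame acoustic features
states = bidirectional(frames)
# states[t] = (forward context up to t, backward context from t onward)
```

In the full model each pair of states would feed the prosody-hierarchy predictor; the point of the sketch is only the two-direction scan and per-frame combination.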
The calculation of the long short-term memory (LSTM) neural network in part (1) proceeds as follows:
Forward calculation (the original formulas appear as images in the patent; the equations below are standard peephole-LSTM formulas reconstructed to match the surrounding prose):
Input gate: i_t = sigmoid(W_xi x_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)
Forget gate: f_t = sigmoid(W_xf x_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)
As the two formulas above show, the forget gate's input comes from the external input at time t, the hidden-unit output at time t-1, and the cell state at time t-1.
Cell: c_t = f_t * c_{t-1} + i_t * tanh(W_xc x_t + W_hc h_{t-1} + b_c)
That is, the cell state at time t is the forget gate's output times the cell state at t-1, plus the input gate's output times the activation of the external input at time t and the hidden-unit output at t-1.
Output gate: o_t = sigmoid(W_xo x_t + W_ho h_{t-1} + W_co c_t + b_o)
The output gate's input is the external input at time t, the hidden-unit output at t-1, and the cell state at time t.
Unit output: h_t = o_t * tanh(c_t)
The module's output is the output gate's output times the squashed cell state at time t.
Backward calculation: the error gradients are propagated through time in the reverse order of the forward pass, through the unit output, the output gate, the cell, the forget gate, and finally the input gate.
This method replaces the hidden layer of a recurrent neural network with long short-term memory modules, which effectively mitigates the recurrent network's limited access to contextual information.
Claims (7)
- 1. A method for building a voice idol, characterized in that the operating steps are as follows: (1) voice-idol text extraction: a) collect a large amount of text material about the idol; b) apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors; c) use the results of step b as input to an RNN training model to train a style learning model; d) learn the idol's speaking style through training on a large amount of data; (2) voice-idol speech synthesis: e) collect a large number of voice recordings of the idol; f) use a bidirectional long short-term memory prosody hierarchy model on the recordings collected in step e to obtain an emotional speech synthesis model; g) use the result of step d as the text input to speech synthesis, and use the emotional speech synthesis model of step f for speech synthesis.
- 2. The method for building a voice idol according to claim 1, characterized in that: step a collects a large amount of text material about the idol, drawn mainly from the following sources: (1) text obtained by speech recognition from the idol's public interviews or interview-type video material, used as input to the style training model; (2) the body text of the idol's microblog posts and the text of the idol's replies to fan messages, used as input to the training model.
- 3. The method for building a voice idol according to claim 1, characterized in that: in step b, the text is fed to an LSTM neural network to obtain text vectors for use in subsequent training steps.
- 4. The method for building a voice idol according to claim 1, characterized in that: in step c, the results of step b are used as input to an RNN training model to train the style learning model.
- 5. The method for building a voice idol according to claim 1, characterized in that: step e collects a large number of voice recordings of the idol, drawn mainly from all of the idol's publicly available video and audio material.
- 6. The method for building a voice idol according to claim 1, characterized in that: in step f, the voice recordings collected in step e are used with a bidirectional long short-term memory prosody hierarchy model to obtain the emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic.
- 7. The method for building a voice idol according to claim 1, characterized in that: in step g, the result of step d is used as the text input to speech synthesis, and the emotional speech synthesis model of step f is used for speech synthesis; this step completes speech synthesis with the idol's timbre and emotion, and thanks to the preceding steps the synthesized speech closely matches the idol's original voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810017849.8A CN108172209A (en) | 2018-01-09 | 2018-01-09 | Build voice idol method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810017849.8A CN108172209A (en) | 2018-01-09 | 2018-01-09 | Build voice idol method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108172209A true CN108172209A (en) | 2018-06-15 |
Family
ID=62517657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810017849.8A Pending CN108172209A (en) | 2018-01-09 | 2018-01-09 | Build voice idol method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108172209A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105185372A (en) * | 2015-10-20 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
CN105206258A (en) * | 2015-10-19 | 2015-12-30 | 百度在线网络技术(北京)有限公司 | Generation method and device of acoustic model as well as voice synthetic method and device |
CN105244020A (en) * | 2015-09-24 | 2016-01-13 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN106372058A (en) * | 2016-08-29 | 2017-02-01 | 中译语通科技(北京)有限公司 | Short text emotion factor extraction method and device based on deep learning |
CN106448670A (en) * | 2016-10-21 | 2017-02-22 | 竹间智能科技(上海)有限公司 | Dialogue automatic reply system based on deep learning and reinforcement learning |
CN106997370A (en) * | 2015-08-07 | 2017-08-01 | 谷歌公司 | Text classification and conversion based on author |
CN107145483A (en) * | 2017-04-24 | 2017-09-08 | 北京邮电大学 | A kind of adaptive Chinese word cutting method based on embedded expression |
- 2018
- 2018-01-09 CN CN201810017849.8A patent/CN108172209A/en active Pending
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109285562A (en) * | 2018-09-28 | 2019-01-29 | 东南大学 | Speech-emotion recognition method based on attention mechanism |
CN109285562B (en) * | 2018-09-28 | 2022-09-23 | 东南大学 | Voice emotion recognition method based on attention mechanism |
WO2021047233A1 (en) * | 2019-09-10 | 2021-03-18 | 苏宁易购集团股份有限公司 | Deep learning-based emotional speech synthesis method and device |
CN110853616A (en) * | 2019-10-22 | 2020-02-28 | 武汉水象电子科技有限公司 | Speech synthesis method, system and storage medium based on neural network |
CN112820265A (en) * | 2020-09-14 | 2021-05-18 | 腾讯科技(深圳)有限公司 | Speech synthesis model training method and related device |
CN112820265B (en) * | 2020-09-14 | 2023-12-08 | 腾讯科技(深圳)有限公司 | Speech synthesis model training method and related device |
WO2022227188A1 (en) * | 2021-04-27 | 2022-11-03 | 平安科技(深圳)有限公司 | Intelligent customer service staff answering method and apparatus for speech, and computer device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108172209A (en) | Build voice idol method | |
CN101064104B (en) | Emotion voice creating method based on voice conversion | |
CN107958433A (en) | A kind of online education man-machine interaction method and system based on artificial intelligence | |
CN106294726A (en) | Based on the processing method and processing device that robot role is mutual | |
Suzuki et al. | Effects of echoic mimicry using hummed sounds on human–computer interaction | |
Zhou et al. | Speech synthesis with mixed emotions | |
KR102505927B1 (en) | Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation | |
Wang et al. | Multi-source domain adaptation for text-independent forensic speaker recognition | |
CN108470188A (en) | Exchange method based on image analysis and electronic equipment | |
Hu et al. | Exploiting cross domain acoustic-to-articulatory inverted features for disordered speech recognition | |
CN116524791A (en) | Lip language learning auxiliary training system based on meta universe and application thereof | |
Filntisis et al. | Video-realistic expressive audio-visual speech synthesis for the Greek language | |
CN108417198A (en) | A kind of men and women's phonetics transfer method based on spectrum envelope and pitch period | |
Gjaci et al. | Towards culture-aware co-speech gestures for social robots | |
Barbulescu et al. | Audio-visual speaker conversion using prosody features | |
Kirkland et al. | Perception of smiling voice in spontaneous speech synthesis | |
Chen et al. | Speaker-independent emotional voice conversion via disentangled representations | |
Li et al. | Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN. | |
Riviello et al. | On the perception of dynamic emotional expressions: A cross-cultural comparison | |
Zhou et al. | Multimodal voice conversion under adverse environment using a deep convolutional neural network | |
Matsui et al. | Music recommendation system driven by interaction between user and personified agent using speech recognition, synthesized voice and facial expression | |
Kelleher | Narrative iteration and place in a Johannesburg tavern | |
Ramachandra et al. | Human centered computing in digital persona generation | |
Demenko et al. | Annotation specifications of a dialogue corpus for modelling phonetic convergence in technical systems | |
Minami et al. | The world of mushrooms: human-computer interaction prototype systems for ambient intelligence |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180615 |