CN108172209A - Method for building a voice idol - Google Patents

Method for building a voice idol

Info

Publication number
CN108172209A
Authority
CN
China
Prior art keywords: idol, voice, text, model, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810017849.8A
Other languages
Chinese (zh)
Inventor
武星
张南
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN201810017849.8A
Publication of CN108172209A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L15/18 - Speech classification or search using natural language modelling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a method for building a voice idol. The method uses speech technologies such as speech recognition and emotional speech synthesis, together with deep learning, so that the voice idol can answer fans' questions by voice in a style and timbre the same as, or similar to, the idol's. The main operating steps are: a. collect a large amount of text material about the idol; b. apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors; c. use the results of step b as the input of an RNN training model to train a style learning model; d. learn the idol's speaking style through training on a large amount of data; e. collect a large number of voice files of the idol; f. use the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model; g. use the resulting text answer as the text input for speech synthesis, applying the emotional speech synthesis model of step f.

Description

Method for building a voice idol
Technical field
The present invention relates to a method for building a voice idol, and in particular to speech technologies such as speech recognition and semantic understanding and to deep learning techniques, especially technology for smart speakers with voice functions.
Background art
Modern society is highly developed and material goods are abundant; material wealth has largely satisfied humanity's pursuit of material life. At the same time, modern people increasingly pursue cultural life and have come to place high demands on it.
In pursuing cultural life, many people come to worship film and television actors and stage performers as idols. Out of this worship, fans long for an idol's autograph or a photo together, hope the idol will see their messages, and wish the idol would reply to them.
With the development of live streaming and microblogging, an idol can answer some fans' questions online during a relatively concentrated period on a live-streaming platform, but such opportunities are rare; an idol can also answer scattered fan questions on a microblog in real time. However, because the number of fans is usually enormous, the idol cannot answer fans' questions online in real time, whether in a live stream or on a microblog. And when the idol is not streaming or is not logged in to the microblog, fans have no way to talk to or question the idol they worship.
Summary of the invention
The object of the present invention is to satisfy fans' devotion to their idols by providing a method for building a voice idol. Fans can ask questions by voice, and the voice idol answers those questions in real time. The beneficial effect of the invention is that it can largely satisfy fans' desire to communicate with their idols, and the voice idol performs stably, which is of great benefit to modern people's pursuit of cultural life.
To achieve this object, the idea of the invention is to use deep learning: through training on a large amount of data, neural network models are trained into an idol style learning model, which simulates the idol's way of thinking, and an emotional speech synthesis model for synthesizing speech, finally producing answers whose sound and speaking style are close to, or the same as, the idol's.
According to the above concept, the present invention adopts the following technical scheme:
A method for building a voice idol, characterized in that the operating steps are as follows:
(1) Voice idol text extraction:
a) collect a large amount of text material about the idol;
b) apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
c) use the results of step b as the input of an RNN training model to train a style learning model;
d) learn the idol's speaking style through training on a large amount of data.
(2) Voice idol speech synthesis:
e) collect a large number of voice files of the idol;
f) use the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model;
g) use the text answer produced in part (1) as the text input for speech synthesis, applying the emotional speech synthesis model of step f (a code sketch of this pipeline follows the steps).
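To make steps a-g concrete, the following is a minimal sketch of the two-model pipeline in Python with PyTorch. The class names, layer sizes, and random stand-in data are illustrative assumptions rather than the patent's specification; answer decoding and the vocoder that turns acoustic features into audio are omitted.

    # Sketch of the pipeline: LSTM word vectors -> RNN style model (steps b-d),
    # bidirectional LSTM emotional synthesis model (step f), chained in step g.
    import torch
    import torch.nn as nn

    class StyleModel(nn.Module):
        """Steps b-d: an LSTM converts text into word vectors; an RNN layer
        on top is trained to produce answers in the idol's speaking style."""
        def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)          # step b
            self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.decoder = nn.RNN(hidden_dim, hidden_dim, batch_first=True)  # step c
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, question_ids):
            vecs, _ = self.encoder(self.embed(question_ids))          # word vectors
            hidden, _ = self.decoder(vecs)
            return self.out(hidden)        # token logits for the styled answer

    class EmotionalTTS(nn.Module):
        """Step f: a bidirectional LSTM maps linguistic features of the answer
        text to acoustic features (e.g. mel-spectrogram frames) for a vocoder."""
        def __init__(self, in_dim=128, hidden_dim=256, n_mels=80):
            super().__init__()
            self.blstm = nn.LSTM(in_dim, hidden_dim, num_layers=2,
                                 batch_first=True, bidirectional=True)
            self.proj = nn.Linear(2 * hidden_dim, n_mels)

        def forward(self, linguistic_feats):
            h, _ = self.blstm(linguistic_feats)
            return self.proj(h)

    # Step g: the style model's text answer becomes the synthesis model's input.
    style = StyleModel(vocab_size=8000)
    tts = EmotionalTTS()
    question = torch.randint(0, 8000, (1, 12))     # a fan's tokenized question
    answer_logits = style(question)                # answer in the idol's style
    mel = tts(torch.randn(1, 40, 128))             # stand-in linguistic features
    print(answer_logits.shape, mel.shape)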
Compared with the prior art, the present invention has the following obvious and substantive distinguishing features and notable technical progress:
(1) Step a collects a large amount of text material about the idol, mainly from the following sources:
i. text obtained by speech recognition from the idol's public interviews and interview videos, used as input to the style training model;
ii. the body text of the idol's microblog posts and the idol's replies to fan messages, used as input to the training model.
(2) Step b feeds the text to an LSTM neural network to obtain text vectors for the subsequent training steps.
(3) Step c uses the results of step b as the input of an RNN training model to train the style learning model.
(4) Step e collects a large number of voice files of the idol, mainly from all of the idol's publicly available video and audio material.
(5) Step f uses the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic (see the prosody-model sketch after this list).
(6) Step g uses the text answer produced in part (1) as the text input for speech synthesis and applies the emotional speech synthesis model of step f; this step completes speech synthesis with the idol's timbre and emotion, so that the synthesized voice closely matches the idol's original voice.
(7) The answers and voice synthesized by the method of the invention imitate the idol with very high similarity, giving fans a sense of the idol's presence, an unrivaled advantage for enriching people's spiritual life.
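The bidirectional long short-term memory prosody hierarchy model named in feature (5) is commonly realized as a sequence tagger that predicts a prosodic-boundary level for each input character, and these boundaries then condition the acoustic model. A minimal sketch follows; the label set, names, and sizes are assumptions for illustration, not taken from the patent.

    # Bidirectional LSTM prosody-hierarchy tagger (cf. feature (5)).
    import torch
    import torch.nn as nn

    # Assumed labels: none / prosodic word / prosodic phrase / intonational phrase
    NUM_LABELS = 4

    class ProsodyHierarchyModel(nn.Module):
        def __init__(self, vocab_size=6000, embed_dim=128, hidden_dim=192):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.blstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
            self.tag = nn.Linear(2 * hidden_dim, NUM_LABELS)

        def forward(self, char_ids):
            h, _ = self.blstm(self.embed(char_ids))
            return self.tag(h)             # per-character boundary logits

    model = ProsodyHierarchyModel()
    chars = torch.randint(0, 6000, (2, 20))           # two 20-character sentences
    labels = torch.randint(0, NUM_LABELS, (2, 20))    # stand-in boundary labels
    loss = nn.CrossEntropyLoss()(model(chars).reshape(-1, NUM_LABELS),
                                 labels.reshape(-1))
    loss.backward()    # in step f this would be trained on the idol's recordings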
Description of the drawings
Fig. 1 is a block diagram of the operating process of the present invention.
Fig. 2 is a schematic diagram of training the idol style learning model of the present invention.
Fig. 3 is a schematic diagram of training the emotional speech synthesis model of the present invention.
Specific embodiments
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings:
Embodiment one:
Referring to Fig. 1, this method for building a voice idol is characterized in that the operating steps are as follows:
(1) Voice idol text extraction:
a) collect a large amount of text material about the idol;
b) apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
c) use the results of step b as the input of an RNN training model to train a style learning model;
d) learn the idol's speaking style through training on a large amount of data.
(2) Voice idol speech synthesis:
e) collect a large number of voice files of the idol;
f) use the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model;
g) use the text answer produced in part (1) as the text input for speech synthesis, applying the emotional speech synthesis model of step f.
Embodiment two:
This embodiment is essentially identical to embodiment one; its distinguishing features are as follows:
(1) Step a collects a large amount of text material about the idol, mainly from the following sources:
i. text obtained by speech recognition from the idol's public interviews and interview videos, used as input to the style training model;
ii. the body text of the idol's microblog posts and the idol's replies to fan messages, used as input to the training model.
(2) Step b feeds the text to an LSTM neural network to obtain text vectors for the subsequent training steps.
(3) Step c uses the results of step b as the input of an RNN training model to train the style learning model.
(4) Step e collects a large number of voice files of the idol, mainly from all of the idol's publicly available video and audio material.
(5) Step f uses the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic.
(6) Step g uses the text answer produced in part (1) as the text input for speech synthesis and applies the emotional speech synthesis model of step f; this step completes speech synthesis with the idol's timbre and emotion, so that the synthesized voice closely matches the idol's original voice.
Embodiment three:
(1) Referring to Fig. 2, a text answer in the idol's style is obtained from a text question; the operating steps are as follows:
a. collect a large amount of text material about the idol;
b. apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
c. use the results of step b as the input of an RNN training model to train a style learning model;
d. learn the idol's speaking style through training on a large amount of data.
(2) Referring to Fig. 3, a voice answer is obtained from the text answer:
e. collect a large number of voice files of the idol;
f. use the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model;
g. use the text answer from part (1) as the text input for speech synthesis, applying the emotional speech synthesis model of step f.
The long short-term memory (LSTM) neural network used in step (1) is computed as follows; the gate formulas, given as figures in the original, are the standard peephole LSTM equations that the accompanying description defines:
Forward computation:
Input gate: i_t = σ(W_xi·x_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i)
Forget gate: f_t = σ(W_xf·x_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f)
As the two formulas above show, the forget gate's input consists of the external input at time t, the hidden unit's output at time t-1, and the cell output from time t-1.
Cell: c_t = f_t * c_{t-1} + i_t * tanh(W_xc·x_t + W_hc·h_{t-1} + b_c)
That is, the cell's input is the forget gate's output at time t times the cell output at time t-1, plus the input gate's output at time t times the activation of (the external input at time t plus the hidden unit's output at time t-1).
Output gate: o_t = σ(W_xo·x_t + W_ho·h_{t-1} + W_co·c_t + b_o)
The output gate's input consists of the external input at time t, the hidden unit's output at time t-1, and the cell output at time t.
Cell output: h_t = o_t * tanh(c_t)
The module's output is the output gate's output at time t times the squashed cell output at time t.
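For concreteness, here is the forward step above as runnable NumPy code; the weight names mirror the formulas, and the random initialization is purely illustrative.

    # One forward step of a peephole LSTM, matching the formulas above.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        # Input gate: external input, previous hidden output, previous cell (peephole).
        i = sigmoid(W["xi"] @ x_t + W["hi"] @ h_prev + W["ci"] * c_prev + b["i"])
        # Forget gate: the same three sources, as stated above.
        f = sigmoid(W["xf"] @ x_t + W["hf"] @ h_prev + W["cf"] * c_prev + b["f"])
        # Cell: forget gate * old cell + input gate * activation(input + h_{t-1}).
        c = f * c_prev + i * np.tanh(W["xc"] @ x_t + W["hc"] @ h_prev + b["c"])
        # Output gate: peephole from the *current* cell state c_t.
        o = sigmoid(W["xo"] @ x_t + W["ho"] @ h_prev + W["co"] * c + b["o"])
        # Unit output: output gate times the squashed cell state.
        return o * np.tanh(c), c

    n_in, n_hid = 4, 3
    rng = np.random.default_rng(0)
    W = {k: rng.standard_normal((n_hid, n_in)) for k in ("xi", "xf", "xc", "xo")}
    W.update({k: rng.standard_normal((n_hid, n_hid)) for k in ("hi", "hf", "hc", "ho")})
    W.update({k: rng.standard_normal(n_hid) for k in ("ci", "cf", "co")})  # peepholes
    b = {k: np.zeros(n_hid) for k in ("i", "f", "c", "o")}
    h, c = lstm_step(rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
    print(h, c)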
Backward computation:
The gradients are computed in the reverse order of the forward pass, first for the cell outputs, then the output gate, the cell, the forget gate, and finally the input gate, by applying the chain rule (backpropagation through time) to the forward formulas above; the individual gradient formulas, given as figures in the original, are not reproduced here.
With this method, the hidden layer of the recurrent neural network is replaced by long short-term memory modules, which effectively relieves the limitation that ordinary recurrent neural networks can access only a narrow range of context.
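Since the backward formulas follow mechanically from the forward ones by the chain rule, in practice one usually lets an automatic-differentiation framework derive them. A minimal sketch using PyTorch, which is an assumption (the patent names no framework, and note that torch.nn.LSTM omits the peephole connections used above):

    # Backward pass obtained by autodiff instead of hand-derived formulas.
    import torch

    lstm = torch.nn.LSTM(input_size=4, hidden_size=3, batch_first=True)
    x = torch.randn(1, 5, 4, requires_grad=True)   # 5 time steps of input
    h, _ = lstm(x)
    h.sum().backward()       # backpropagation through time over all gates
    print(x.grad.shape)      # d(loss)/d(input) at every time step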

Claims (7)

  1. A method for building a voice idol, characterized in that the operating steps are as follows:
    (1) Voice idol text extraction:
    a) collect a large amount of text material about the idol;
    b) apply an LSTM neural network to the text material collected in step a to convert long passages of text into word vectors;
    c) use the results of step b as the input of an RNN training model to train a style learning model;
    d) learn the idol's speaking style through training on a large amount of data;
    (2) Voice idol speech synthesis:
    e) collect a large number of voice files of the idol;
    f) use the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model;
    g) use the text answer produced in part (1) as the text input for speech synthesis, applying the emotional speech synthesis model of step f.
  2. The method for building a voice idol according to claim 1, characterized in that step a collects a large amount of text material about the idol, mainly from the following sources:
    (1) text obtained by speech recognition from the idol's public interviews and interview videos, used as input to the style training model;
    (2) the body text of the idol's microblog posts and the idol's replies to fan messages, used as input to the training model.
  3. The method for building a voice idol according to claim 1, characterized in that step b feeds the text to an LSTM neural network to obtain text vectors for the subsequent training steps.
  4. The method for building a voice idol according to claim 1, characterized in that step c uses the results of step b as the input of an RNN training model to train the style learning model.
  5. The method for building a voice idol according to claim 1, characterized in that step e collects a large number of voice files of the idol, mainly from all of the idol's publicly available video and audio material.
  6. The method for building a voice idol according to claim 1, characterized in that step f uses the voice files collected in step e with a bidirectional long short-term memory prosody hierarchy model to obtain an emotional speech synthesis model; this model exploits deep learning's automatic feature learning, making the synthesized audio more natural and realistic.
  7. The method for building a voice idol according to claim 1, characterized in that step g uses the text answer produced in part (1) as the text input for speech synthesis and applies the emotional speech synthesis model of step f; this step completes speech synthesis with the idol's timbre and emotion, so that the synthesized voice closely matches the idol's original voice.
CN201810017849.8A 2018-01-09 2018-01-09 Method for building a voice idol Pending CN108172209A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810017849.8A CN108172209A (en) 2018-01-09 2018-01-09 Method for building a voice idol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810017849.8A CN108172209A (en) 2018-01-09 2018-01-09 Method for building a voice idol

Publications (1)

Publication Number Publication Date
CN108172209A true CN108172209A (en) 2018-06-15

Family

ID=62517657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810017849.8A Pending CN108172209A (en) Method for building a voice idol

Country Status (1)

Country Link
CN (1) CN108172209A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106997370A (en) * 2015-08-07 2017-08-01 谷歌公司 Text classification and conversion based on author
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN105206258A (en) * 2015-10-19 2015-12-30 百度在线网络技术(北京)有限公司 Generation method and device of acoustic model as well as voice synthetic method and device
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN106372058A (en) * 2016-08-29 2017-02-01 中译语通科技(北京)有限公司 Short text emotion factor extraction method and device based on deep learning
CN106448670A (en) * 2016-10-21 2017-02-22 竹间智能科技(上海)有限公司 Dialogue automatic reply system based on deep learning and reinforcement learning
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109285562A (en) * 2018-09-28 2019-01-29 东南大学 Speech-emotion recognition method based on attention mechanism
CN109285562B (en) * 2018-09-28 2022-09-23 东南大学 Voice emotion recognition method based on attention mechanism
WO2021047233A1 (en) * 2019-09-10 2021-03-18 苏宁易购集团股份有限公司 Deep learning-based emotional speech synthesis method and device
CN110853616A (en) * 2019-10-22 2020-02-28 武汉水象电子科技有限公司 Speech synthesis method, system and storage medium based on neural network
CN112820265A (en) * 2020-09-14 2021-05-18 腾讯科技(深圳)有限公司 Speech synthesis model training method and related device
CN112820265B (en) * 2020-09-14 2023-12-08 腾讯科技(深圳)有限公司 Speech synthesis model training method and related device
WO2022227188A1 (en) * 2021-04-27 2022-11-03 平安科技(深圳)有限公司 Intelligent customer service staff answering method and apparatus for speech, and computer device

Similar Documents

Publication Publication Date Title
CN108172209A (en) Method for building a voice idol
CN101064104B (en) Emotion voice creating method based on voice conversion
CN107958433A (en) A kind of online education man-machine interaction method and system based on artificial intelligence
CN106294726A (en) Based on the processing method and processing device that robot role is mutual
Suzuki et al. Effects of echoic mimicry using hummed sounds on human–computer interaction
Zhou et al. Speech synthesis with mixed emotions
KR102505927B1 (en) Deep learning-based emotional text-to-speech apparatus and method using generative model-based data augmentation
Wang et al. Multi-source domain adaptation for text-independent forensic speaker recognition
CN108470188A (en) Exchange method based on image analysis and electronic equipment
Hu et al. Exploiting cross domain acoustic-to-articulatory inverted features for disordered speech recognition
CN116524791A (en) Lip language learning auxiliary training system based on meta universe and application thereof
Filntisis et al. Video-realistic expressive audio-visual speech synthesis for the Greek language
CN108417198A (en) A kind of men and women's phonetics transfer method based on spectrum envelope and pitch period
Gjaci et al. Towards culture-aware co-speech gestures for social robots
Barbulescu et al. Audio-visual speaker conversion using prosody features
Kirkland et al. Perception of smiling voice in spontaneous speech synthesis
Chen et al. Speaker-independent emotional voice conversion via disentangled representations
Li et al. Non-Parallel Many-to-Many Voice Conversion with PSR-StarGAN.
Riviello et al. On the perception of dynamic emotional expressions: A cross-cultural comparison
Zhou et al. Multimodal voice conversion under adverse environment using a deep convolutional neural network
Matsui et al. Music recommendation system driven by interaction between user and personified agent using speech recognition, synthesized voice and facial expression
Kelleher Narrative iteration and place in a Johannesburg tavern
Ramachandra et al. Human centered computing in digital persona generation
Demenko et al. Annotation specifications of a dialogue corpus for modelling phonetic convergence in technical systems
Minami et al. The world of mushrooms: human-computer interaction prototype systems for ambient intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20180615)