CN110299131A - Speech synthesis method, device, and storage medium with controllable prosodic emotion - Google Patents

Speech synthesis method, device, and storage medium with controllable prosodic emotion

Info

Publication number
CN110299131A
CN110299131A (application CN201910706204.XA)
Authority
CN
China
Prior art keywords
rhythm
vector
attention
emotion
controllable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910706204.XA
Other languages
Chinese (zh)
Other versions
CN110299131B (en)
Inventor
王欢良
王飞
张李
沈文武
代大明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Qdreamer Network Science And Technology Co Ltd
Original Assignee
Suzhou Qdreamer Network Science And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Qdreamer Network Science And Technology Co Ltd filed Critical Suzhou Qdreamer Network Science And Technology Co Ltd
Priority to CN201910706204.XA
Publication of CN110299131A
Application granted
Publication of CN110299131B
Legal status: Active (Current)
Anticipated expiration


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 — Speech synthesis; Text to speech systems
    • G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 — Prosody rules derived from text; Stress or intonation
    • G10L2013/105 — Duration

Abstract

The present invention provides a speech synthesis method, device, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control its prosodic rhythm. The method comprises the following steps: converting the characters of the text to be synthesized into character representation vectors; concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors; concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism; concatenating the spectrum frame predicted at the previous time step with the attention vector, feeding it into a decoder, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation; and converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.

Description

Speech synthesis method, device, and storage medium with controllable prosodic emotion
Technical field
The present invention relates to the field of speech synthesis technology, and in particular to a speech synthesis method, device, and storage medium with controllable prosodic emotion.
Background technique
Speech synthesis, also known as text-to-speech (TTS), is a technology that can convert any input text into the corresponding speech.
A traditional speech synthesis system generally includes two modules: a front end and a back end. The front-end module mainly analyzes the input text and extracts the linguistic information required by the back-end module; for a Chinese synthesis system, the front end generally comprises submodules such as text normalization, word segmentation, part-of-speech prediction, polyphone disambiguation, and prosody prediction. The back-end module generates the speech waveform from the front-end analysis results; back-end systems generally fall into speech synthesis based on statistical parametric modeling (parametric synthesis) and speech synthesis based on unit selection and waveform concatenation (concatenative synthesis).
Current end-to-end synthesis models not only produce audio with higher fidelity and naturalness, but also have a simple modeling process that requires no linguistic information; they have therefore become the mainstream speech synthesis technique. However, classical end-to-end synthesis has its technical weaknesses: unpredictable, uncontrollable synthesis defects may appear, and the prosodic rhythm of the synthesized speech — such as phoneme duration, stress, and intonation — cannot be explicitly controlled. This is mainly because the input of end-to-end synthesis relies only on shallow text content, such as letter sequences, syllable sequences, or phoneme sequences, and cannot exploit deep linguistic information such as part of speech, intonation, or syntactic structure.
Summary of the invention
In view of the above problems, the present invention provides a speech synthesis method, device, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control its prosodic rhythm.
The technical solution is as follows: a speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.
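The data flow of steps S1-S4 can be sketched as follows. All weights here are random stand-ins for the trained encoder, attention, and decoder parameters, and every dimension (32-dim character embeddings, 80 mel bins, a single decoder step, etc.) is an illustrative assumption rather than a value from the patent — the point is only where the prosody vector gets concatenated:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_char, d_pros, n_mel = 6, 32, 5, 80   # hypothetical sizes, not from the patent

# Step S1: character representation vectors for the text to be synthesized.
char_emb = rng.normal(size=(T, d_char))
# 5-dim prosody code, repeated for every character (here: an all-zero "neutral" code).
prosody = np.zeros((T, d_pros))

# Step S2: concatenate the prosody vector onto each character vector, then encode.
enc_in = np.concatenate([char_emb, prosody], axis=1)         # (T, 37)
W_enc = rng.normal(size=(enc_in.shape[1], 64))
enc_out = np.tanh(enc_in @ W_enc)                            # stand-in encoder, (T, 64)

# Step S3: concatenate the prosody vector again, then form an attention vector.
att_in = np.concatenate([enc_out, prosody], axis=1)          # (T, 69)
scores = att_in @ rng.normal(size=att_in.shape[1])
alpha = np.exp(scores - scores.max()); alpha /= alpha.sum()  # attention weights
context = alpha @ att_in                                     # attention vector, (69,)

# Step S4: previous predicted spectrum frame + attention vector -> decoder step,
# then decoder output + attention vector -> projection -> next frame + stop token.
prev_frame = np.zeros(n_mel)
dec_in = np.concatenate([prev_frame, context])
W_dec = rng.normal(size=(dec_in.shape[0], 128))
dec_out = np.tanh(dec_in @ W_dec)                            # stand-in decoder step
proj_in = np.concatenate([dec_out, context])
W_proj = rng.normal(size=(proj_in.shape[0], n_mel + 1))
out = proj_in @ W_proj
next_frame, stop_logit = out[:n_mel], out[n_mel]             # spectrum frame + end point
```

In a real system this decoder step would run in a loop until the stop prediction fires; a single step suffices to show the concatenation points.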
Further, in step S4, after decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality.
Further, the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information. Speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character. Speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone.
Further, the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
Further, in step S3, a location-sensitive attention mechanism is used.
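The patent does not spell out the attention computation, but a location-sensitive attention step (in the style popularized by Tacotron 2) can be sketched roughly as below. The function name, the shapes, and the 1-D convolution over the previous alignment are illustrative assumptions; `W`, `V`, `U`, `v`, and `conv_kernel` stand in for learned parameters:

```python
import numpy as np

def location_sensitive_attention(query, keys, prev_align, conv_kernel, W, V, U, v):
    """One step of location-sensitive attention: the score for each encoder
    position also depends on the previous alignment, filtered by a 1-D conv."""
    f = np.convolve(prev_align, conv_kernel, mode="same")   # location features, (T,)
    # energy e_t = v^T tanh(W q + V k_t + U f_t), computed for all t at once
    e = np.tanh(query @ W + keys @ V + np.outer(f, U)) @ v  # (T,)
    a = np.exp(e - e.max())
    return a / a.sum()                                      # new alignment, sums to 1
```

Because the previous alignment feeds back into the score, the mechanism encourages the alignment to move monotonically along the encoder outputs, which is why it is a common choice for TTS.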
Further, the predicted speech spectrum carrying the prosodic rhythm is input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN.
Further, the predicted speech spectrum carrying the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
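Griffin-Lim reconstructs a waveform from a magnitude spectrogram by alternately enforcing the known magnitudes and the phase implied by the current waveform estimate. A toy NumPy sketch of the idea follows — it is not the patent's implementation, and the frame size, hop, window, and iteration count are arbitrary choices:

```python
import numpy as np

N_FFT, HOP = 256, 64
WIN = np.hanning(N_FFT)

def stft(x):
    """Windowed short-time Fourier transform, one row per frame."""
    return np.array([np.fft.rfft(x[i:i + N_FFT] * WIN)
                     for i in range(0, len(x) - N_FFT + 1, HOP)])

def istft(spec):
    """Overlap-add inverse STFT with window-power normalization."""
    n = HOP * (len(spec) - 1) + N_FFT
    out, norm = np.zeros(n), np.zeros(n)
    for i, frame in enumerate(spec):
        out[i * HOP:i * HOP + N_FFT] += np.fft.irfft(frame) * WIN
        norm[i * HOP:i * HOP + N_FFT] += WIN ** 2
    return out / np.maximum(norm, 1e-8)

def griffin_lim(mag, n_iter=50):
    """Iteratively re-estimate phase so that |STFT(x)| approaches mag."""
    angles = np.exp(2j * np.pi * np.random.default_rng(0).random(mag.shape))
    for _ in range(n_iter):
        x = istft(mag * angles)
        angles = np.exp(1j * np.angle(stft(x)))
    return istft(mag * angles)
```

Production systems would instead use a library routine (or a neural vocoder as in the preceding paragraph); this sketch only shows why the method needs no phase information in the predicted spectrum.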
A speech synthesis device with controllable prosodic emotion, characterized by comprising:
a representation space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
A speech synthesis device with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
The speech synthesis method, device, and storage medium with controllable prosodic emotion of the present invention improve on the classical end-to-end synthesis method. By inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information. The prosodic rhythm vector, which contains speed information, stress information, and intonation information, defines additional prosodic rhythm cues with which the end-to-end synthesis model can be better trained. By adding the prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and modified, thereby controlling the emotional prosody of the synthesized speech.
Detailed description of the invention
Fig. 1 is a flow diagram of a speech synthesis method with controllable prosodic emotion according to the present invention;
Fig. 2 is a block diagram of a speech synthesis device with controllable prosodic emotion according to the present invention.
Specific embodiment
Referring to Fig. 1, a speech synthesis method with controllable prosodic emotion according to the present invention comprises the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors; the encoder is generally modeled with a CNN+LSTM network;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through a location-sensitive attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation. After decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality; the decoder is generally modeled with LSTM + CNN + linear projection;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm. The predicted speech spectrum can be input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN. Alternatively, the predicted speech spectrum can be passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
Specifically, in this embodiment, the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information. Speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character.
Speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone. In the speed information, "normal" denotes the normal speaking rate, "slow" denotes 0.5 times the normal rate, "fast" denotes 1.5 times the normal rate, and "ultra-fast" denotes 2 times the normal rate.
In this embodiment, the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
In this embodiment, the specific encodings of the speed, stress, and intonation information are as follows:
Speed — normal: 00
Speed — slow: 01
Speed — fast: 10
Speed — ultra-fast: 11
Stress — stressed: 1
Stress — unstressed: 0
Intonation — high flat tone: 00
Intonation — rising tone: 01
Intonation — falling tone: 10
Intonation — low flat tone: 11
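Under the bit layout above, packing the three fields into the 5-dimensional code could look like the following sketch; the English key names are invented labels for the patent's categories:

```python
# Bit tables for the 5-dim prosody code: 2 speed bits + 1 stress bit + 2 tone bits.
SPEED = {"normal": "00", "slow": "01", "fast": "10", "ultrafast": "11"}
STRESS = {"stressed": "1", "unstressed": "0"}
TONE = {"high_flat": "00", "rising": "01", "falling": "10", "low_flat": "11"}

def prosody_vector(speed, stress, tone):
    """Pack (speed, stress, tone) into the 5-dimensional prosody encoding vector."""
    bits = SPEED[speed] + STRESS[stress] + TONE[tone]
    return [int(b) for b in bits]
```

For example, the neutral default described below (normal speed, unstressed, high flat tone) encodes as `[0, 0, 0, 0, 0]`.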
During speech synthesis, if the text to be synthesized is neutral and no obvious emotion is needed, the default prosodic rhythm control information fed into the synthesizer may be: normal speed, unstressed, high flat tone. When an obvious emotional prosody is required, the prosodic rhythm information can be set accordingly.
Referring to Fig. 2, a speech synthesis device with controllable prosodic emotion according to the present invention comprises:
a representation space conversion module 1, for converting the characters of the text to be synthesized into character representation vectors;
an encoder 2, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module 3, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder 4, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
A speech synthesis device with controllable prosodic emotion comprises a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
In the implementation of the above speech synthesis device with controllable prosodic emotion, the memory and the processor are electrically connected, directly or indirectly, to realize data transmission or interaction. For example, these elements may be electrically connected to each other through one or more communication buses or signal lines, such as a bus. The memory stores computer-executable instructions implementing the data access control method, including at least one software functional module that may be stored in the memory in the form of software or firmware; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc. The memory is used to store the program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc., and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment of the present invention, a computer-readable storage medium is also provided; the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks, including instructions for causing a device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments or in certain parts of the embodiments.
The input of a classical end-to-end synthesis system is the character sequence corresponding to the text to be synthesized, so for the same text its prosodic rhythm cannot be controlled independently. As a result, the prosodic rhythm that the synthesized speech can exhibit is very limited, giving it a noticeably mechanical feel.
To this end, this patent improves the classical end-to-end synthesis method: by inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information.
Prosodic rhythm information is typically a supra-segmental feature, whereas end-to-end synthesis generally uses characters or phonemes as modeling units. Therefore, during modeling, the segment-level prosodic information is evenly distributed to the corresponding characters or phonemes. The prosodic rhythm vector, containing speed information, stress information, and intonation information, defines additional prosodic rhythm cues with which the end-to-end synthesis model can be better trained, so that the prosodic rhythm of the synthesized speech can be effectively controlled. By adding the duration, stress, and intonation prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and modified, thereby controlling the emotional prosody of the synthesized speech.
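Evenly distributing a word-level (supra-segmental) prosody code over the characters of its word, as described above, might look like this minimal sketch (a hypothetical helper, not code from the patent):

```python
def spread_prosody(words, word_codes):
    """Copy each word-level prosody code to every character of that word,
    since the end-to-end model consumes one vector per character/phoneme."""
    per_char = []
    for word, code in zip(words, word_codes):
        per_char.extend([code] * len(word))
    return per_char
```

A two-word input with distinct codes thus yields one identical code per character within each word, which is what gets concatenated at the encoder and attention stages.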

Claims (10)

1. A speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding the result into an encoder, which outputs encoded feature vectors;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output; concatenating the newly computed attention vector with the decoder output and feeding it into a projection layer, which outputs a predicted speech spectrum carrying the prosodic rhythm while also predicting the end point of spectrum generation;
Step S5: converting the predicted speech spectrum carrying the prosodic rhythm into speech output with the prosodic rhythm.
2. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S4, after decoding is completed, the predicted speech spectrum carrying the prosodic rhythm is fed into convolutional layers to improve generation quality.
3. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the prosodic information contained in the prosodic rhythm vector includes speed information, stress information, and intonation information; speed information refers to the speaking rate of the syllable or word containing the current character; stress information indicates whether the word or syllable containing the current character is stressed; intonation information refers to the tone type of the word or syllable containing the current character; speed information includes: normal, slow, fast, and ultra-fast; stress information includes stressed and unstressed; intonation information includes: low flat tone, high flat tone, rising tone, and falling tone.
4. The speech synthesis method with controllable prosodic emotion according to claim 3, characterized in that: the prosodic rhythm vector is expressed as a 5-dimensional prosody encoding vector, in which speed information is encoded with 2 binary bits, stress with 1 binary bit, and intonation with 2 binary bits.
5. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S3, a location-sensitive attention mechanism is used.
6. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum carrying the prosodic rhythm is input into a vocoder, which outputs speech with the prosodic rhythm; the vocoder is any one of WaveNet and WaveRNN.
7. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum carrying the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
8. A speech synthesis device with controllable prosodic emotion, characterized by comprising:
a representation space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding it into a projection layer that outputs a predicted speech spectrum carrying the prosodic rhythm.
9. A speech synthesis device with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program;
the program is stored in the memory, and the processor calls the program stored in the memory to execute the speech synthesis method with controllable prosodic emotion described above.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the speech synthesis method with controllable prosodic emotion described above.
CN201910706204.XA 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium Active CN110299131B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910706204.XA CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910706204.XA CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Publications (2)

Publication Number Publication Date
CN110299131A true CN110299131A (en) 2019-10-01
CN110299131B CN110299131B (en) 2021-12-10

Family ID: 68032457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910706204.XA Active CN110299131B (en) 2019-08-01 2019-08-01 Voice synthesis method and device capable of controlling prosodic emotion and storage medium

Country Status (1)

Country Link
CN (1) CN110299131B (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Voice synthesis method and device and news broadcasting method and system
CN111583902A (en) * 2020-05-14 2020-08-25 携程计算机技术(上海)有限公司 Speech synthesis system, method, electronic device, and medium
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English voice synthesis method and system, electronic equipment and storage medium
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN112767969A (en) * 2021-01-29 2021-05-07 苏州思必驰信息科技有限公司 Method and system for determining emotion tendentiousness of voice information
WO2021127979A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device, and computer readable storage medium
WO2021134591A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
CN113096636A (en) * 2021-06-08 2021-07-09 北京世纪好未来教育科技有限公司 Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium
WO2021179910A1 (en) * 2020-03-09 2021-09-16 百果园技术(新加坡)有限公司 Text voice front-end conversion method and apparatus, and device and storage medium
CN113643717A (en) * 2021-07-07 2021-11-12 深圳市联洲国际技术有限公司 Music rhythm detection method, device, equipment and storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN114420086A (en) * 2022-03-30 2022-04-29 北京沃丰时代数据科技有限公司 Speech synthesis method and device
WO2022095754A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
WO2022105545A1 (en) * 2020-11-20 2022-05-27 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and readable medium and electronic device
WO2023061259A1 (en) * 2021-10-14 2023-04-20 北京字跳网络技术有限公司 Speech speed adjustment method and apparatus, electronic device, and readable storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101000765A (en) * 2007-01-09 2007-07-18 黑龙江大学 Speech synthetic method based on rhythm character
US7454348B1 (en) * 2004-01-08 2008-11-18 At&T Intellectual Property Ii, L.P. System and method for blending synthetic voices
CN103077705A (en) * 2012-12-30 2013-05-01 安徽科大讯飞信息科技股份有限公司 Method for optimizing local synthesis based on distributed natural rhythm
US20160203815A1 (en) * 2008-06-06 2016-07-14 At&T Intellectual Property I, Lp System and method for synthetically generated speech describing media content
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109543722A (en) * 2018-11-05 2019-03-29 中山大学 A kind of emotion trend forecasting method based on sentiment analysis model
CN109616093A (en) * 2018-12-05 2019-04-12 平安科技(深圳)有限公司 End-to-end phoneme synthesizing method, device, equipment and storage medium
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN109979429A (en) * 2019-05-29 2019-07-05 南京硅基智能科技有限公司 A kind of method and system of TTS


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zeng Biqing et al.: "Sentiment Analysis Based on a Dual-Attention Convolutional Neural Network Model", Journal of Guangdong University of Technology *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110808027A (en) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 Speech synthesis method and apparatus, and news broadcasting method and system
WO2021127979A1 (en) * 2019-12-24 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis method and apparatus, computer device, and computer readable storage medium
WO2021134591A1 (en) * 2019-12-31 2021-07-08 深圳市优必选科技股份有限公司 Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium
WO2021179910A1 (en) * 2020-03-09 2021-09-16 百果园技术(新加坡)有限公司 Text-to-speech front-end conversion method, apparatus, device, and storage medium
CN111583902A (en) * 2020-05-14 2020-08-25 携程计算机技术(上海)有限公司 Speech synthesis system, method, electronic device, and medium
CN111710326A (en) * 2020-06-12 2020-09-25 携程计算机技术(上海)有限公司 English speech synthesis method and system, electronic device, and storage medium
CN111710326B (en) * 2020-06-12 2024-01-23 携程计算机技术(上海)有限公司 English speech synthesis method and system, electronic device, and storage medium
CN111724765A (en) * 2020-06-30 2020-09-29 上海优扬新媒信息技术有限公司 Method and device for converting text into voice and computer equipment
CN111724765B (en) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 Text-to-speech method and device and computer equipment
CN112185363A (en) * 2020-10-21 2021-01-05 北京猿力未来科技有限公司 Audio processing method and device
CN112185363B (en) * 2020-10-21 2024-02-13 北京猿力未来科技有限公司 Audio processing method and device
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
WO2022095754A1 (en) * 2020-11-03 2022-05-12 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, storage medium, and electronic device
WO2022105545A1 (en) * 2020-11-20 2022-05-27 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, and readable medium and electronic device
CN112767969A (en) * 2021-01-29 2021-05-07 苏州思必驰信息科技有限公司 Method and system for determining emotion tendentiousness of voice information
CN113096636A (en) * 2021-06-08 2021-07-09 北京世纪好未来教育科技有限公司 Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium
CN113643717A (en) * 2021-07-07 2021-11-12 深圳市联洲国际技术有限公司 Music rhythm detection method, device, equipment and storage medium
WO2023061259A1 (en) * 2021-10-14 2023-04-20 北京字跳网络技术有限公司 Speech speed adjustment method and apparatus, electronic device, and readable storage medium
CN113808579A (en) * 2021-11-22 2021-12-17 中国科学院自动化研究所 Detection method and device for generated voice, electronic equipment and storage medium
CN114420086A (en) * 2022-03-30 2022-04-29 北京沃丰时代数据科技有限公司 Speech synthesis method and device
CN114420086B (en) * 2022-03-30 2022-06-17 北京沃丰时代数据科技有限公司 Speech synthesis method and device

Also Published As

Publication number Publication date
CN110299131B (en) 2021-12-10

Similar Documents

Publication Publication Date Title
CN110299131A (en) Speech synthesis method, apparatus, and storage medium with controllable prosody and emotion
Zhang et al. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning
US11295721B2 (en) Generating expressive speech audio from text data
CN112687259B (en) Speech synthesis method, device and readable storage medium
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN108630203A (en) Interactive voice equipment and its processing method and program
US6212501B1 (en) Speech synthesis apparatus and method
KR20220054655A (en) Speech synthesis method and apparatus, storage medium
King A beginners’ guide to statistical parametric speech synthesis
KR102294639B1 (en) Deep neural network based non-autoregressive speech synthesizer method and system using multiple decoder
CN113327627B (en) Multi-factor controllable voice conversion method and system based on feature decoupling
JP5398295B2 (en) Audio processing apparatus, audio processing method, and audio processing program
KR20230084229A (en) Parallel tacotron: non-autoregressive and controllable TTS
CN111681641B (en) Phrase-based end-to-end text-to-speech (TTS) synthesis
CN113838448A (en) Voice synthesis method, device, equipment and computer readable storage medium
CN113450758B (en) Speech synthesis method, apparatus, device and medium
JP4008607B2 (en) Speech encoding / decoding method
US7089187B2 (en) Voice synthesizing system, segment generation apparatus for generating segments for voice synthesis, voice synthesizing method and storage medium storing program therefor
JP2007086309A (en) Voice synthesizer, voice synthesizing method, and program
JP5376643B2 (en) Speech synthesis apparatus, method and program
CN114495896A (en) Voice playing method and computer equipment
JP2010224418A (en) Voice synthesizer, method, and program
US7092878B1 (en) Speech synthesis using multi-mode coding with a speech segment dictionary
CN114495898B (en) Unified speech synthesis and speech conversion training method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant