CN110299131A - Speech synthesis method, apparatus, and storage medium with controllable prosodic emotion - Google Patents
Speech synthesis method, apparatus, and storage medium with controllable prosodic emotion
- Publication number
- CN110299131A (application number CN201910706204.XA)
- Authority
- CN
- China
- Prior art keywords
- rhythm
- vector
- attention
- emotion
- controllable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
- G10L2013/105—Duration
Abstract
The present invention provides a speech synthesis method, apparatus, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control its prosodic rhythm. The method comprises the following steps: convert the characters of the text to be synthesized into character representation vectors; concatenate the character representation vectors with a prosodic rhythm vector and feed them into an encoder, which outputs encoded feature vectors; concatenate the encoded feature vectors with the prosodic rhythm vector and generate an attention vector through an attention mechanism; concatenate the spectrum frame predicted at the previous time step with the attention vector, feed it into a decoder, update the attention vector from the decoder output, concatenate the newly computed attention vector with the decoder output, and feed the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm, while also predicting the stop point of spectrum generation; finally, convert the predicted speech spectrum with the prosodic rhythm into speech output with the prosodic rhythm.
Description
Technical field
The present invention relates to the technical field of speech synthesis, and in particular to a speech synthesis method, apparatus, and storage medium with controllable prosodic emotion.
Background art
Speech synthesis, also known as text-to-speech (TTS), is a technology that converts arbitrary input text into the corresponding speech.
A traditional speech synthesis system generally consists of two modules: a front end and a back end. The front-end module mainly analyzes the input text and extracts the linguistic information required by the back-end module; for a Chinese synthesis system, the front end typically includes submodules such as text normalization, word segmentation, part-of-speech prediction, polyphone disambiguation, and prosody prediction. The back-end module generates the speech waveform from the front-end analysis results by some method; back-end systems generally fall into speech synthesis based on statistical parametric modeling (parametric synthesis) and speech synthesis based on unit selection and waveform concatenation (concatenative synthesis).
Current end-to-end synthesis models not only produce audio of higher fidelity and naturalness, but also have a simple modeling process that requires no linguistic information; they have therefore become the mainstream speech synthesis technology. However, classical end-to-end synthesis has its technical weaknesses: unpredictable and uncontrollable synthesis artifacts may appear, and the prosodic rhythm of the synthesized speech, such as phoneme duration, stress, and intonation, cannot be explicitly controlled. This is mainly because the input of end-to-end synthesis relies only on shallow text content, such as letter sequences, syllable sequences, and phoneme sequences, and cannot exploit deep linguistic information such as part of speech, intonation, and syntactic structure.
Summary of the invention
In view of the above problems, the present invention provides a speech synthesis method, apparatus, and storage medium with controllable prosodic emotion, which can add prosodic emotion to synthesized speech and effectively control the prosodic rhythm of the synthesized speech.
The technical solution is as follows: a speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: convert the characters of the text to be synthesized into character representation vectors;
Step S2: concatenate the character representation vectors with a prosodic rhythm vector and feed them into an encoder, which outputs encoded feature vectors;
Step S3: concatenate the encoded feature vectors with the prosodic rhythm vector and generate an attention vector through an attention mechanism;
Step S4: concatenate the spectrum frame predicted at the previous time step with the attention vector and feed it into a decoder; update the attention vector from the decoder output, concatenate the newly computed attention vector with the decoder output, and feed the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm, while also predicting the stop point of spectrum generation;
Step S5: convert the predicted speech spectrum with the prosodic rhythm into speech output with the prosodic rhythm.
Further, in step S4, after decoding is completed, the predicted speech spectrum with the prosodic rhythm obtained from prediction is fed into a convolutional layer to improve generation quality.
Further, the prosodic information contained in the prosodic rhythm vector includes speech-rate information, stress information, and intonation information. The speech-rate information refers to the speech rate of the syllable or word where the current character is located; the stress information refers to whether the word or syllable where the current character is located is stressed; the intonation information refers to the tone type of the word or syllable where the current character is located. The speech-rate information includes: normal, slow, fast, and ultra-fast; the stress information includes stressed and unstressed; the intonation information includes: low level tone, high level tone, rising tone, and falling tone.
Further, the prosodic rhythm vector is expressed as a 5-dimensional prosodic rhythm code vector, in which the speech-rate information is encoded with 2 binary bits, the stress information with 1 binary bit, and the intonation information with 2 binary bits.
Further, in step S3, a location-sensitive attention mechanism is used.
Further, the predicted speech spectrum with the prosodic rhythm is input into a speech vocoder, which outputs speech with the prosodic rhythm; the vocoder includes any one of WaveNet and WaveRNN.
Further, the predicted speech spectrum with the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
A speech synthesis apparatus with controllable prosodic emotion, characterized by comprising:
a representation-space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm.
A speech synthesis apparatus with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
The speech synthesis method, apparatus, and storage medium with controllable prosodic emotion of the present invention improve the classical end-to-end synthesis method. By inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information. The prosodic rhythm vector containing speech-rate, stress, and intonation information defines additional prosodic rhythm information to better train the end-to-end synthesis model, and by adding the prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and changed, thereby controlling the emotional rhythm of the synthesized speech.
Brief description of the drawings
Fig. 1 is a flow diagram of a speech synthesis method with controllable prosodic emotion of the present invention;
Fig. 2 is a block diagram of a speech synthesis apparatus with controllable prosodic emotion of the present invention.
Specific embodiment
Referring to Fig. 1, a speech synthesis method with controllable prosodic emotion of the present invention comprises the following steps:
Step S1: convert the characters of the text to be synthesized into character representation vectors;
Step S2: concatenate the character representation vectors with a prosodic rhythm vector and feed them into an encoder, which outputs encoded feature vectors; the encoder is generally modeled with a CNN+LSTM network;
Step S3: concatenate the encoded feature vectors with the prosodic rhythm vector and generate an attention vector through a location-sensitive attention mechanism;
Step S4: concatenate the spectrum frame predicted at the previous time step with the attention vector and feed it into a decoder; update the attention vector from the decoder output, concatenate the newly computed attention vector with the decoder output, and feed the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm, while also predicting the stop point of spectrum generation; after decoding is completed, the predicted speech spectrum with the prosodic rhythm is fed into a convolutional layer to improve generation quality; the decoder is generally modeled with LSTM+CNN+linear projection;
Step S5: convert the predicted speech spectrum with the prosodic rhythm into speech output with the prosodic rhythm. The predicted speech spectrum may be input into a speech vocoder, which includes any one of WaveNet and WaveRNN, to output speech with the prosodic rhythm; alternatively, the predicted speech spectrum may be passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
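The Griffin-Lim option in step S5 can be illustrated with a minimal sketch. This is not the patent's implementation but a generic Griffin-Lim phase-reconstruction loop built on SciPy's STFT; the iteration count, window length, and overlap are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=32, nperseg=256, noverlap=192):
    """Recover a waveform from a magnitude spectrogram `mag` of shape
    (freqs, frames) by iterative phase estimation (Griffin-Lim)."""
    rng = np.random.default_rng(0)
    phase = np.exp(2j * np.pi * rng.random(mag.shape))  # random initial phase
    spec = mag * phase
    for _ in range(n_iter):
        # invert with the current phase estimate, then re-analyze
        _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
        _, _, re = stft(x, nperseg=nperseg, noverlap=noverlap)
        # guard against frame-count drift between analysis and synthesis
        if re.shape[1] < mag.shape[1]:
            re = np.pad(re, ((0, 0), (0, mag.shape[1] - re.shape[1])))
        re = re[:, :mag.shape[1]]
        # keep the re-analyzed phase, restore the target magnitude
        spec = mag * np.exp(1j * np.angle(re))
    _, x = istft(spec, nperseg=nperseg, noverlap=noverlap)
    return x

# reconstruct a 440 Hz test tone (8 kHz sample rate) from its magnitude spectrogram
t = np.arange(8000) / 8000.0
sig = np.sin(2 * np.pi * 440 * t)
_, _, S = stft(sig, nperseg=256, noverlap=192)
y = griffin_lim(np.abs(S))
```

In the patent's pipeline the predicted speech spectrum from the projection layer would take the place of `np.abs(S)`; a neural vocoder such as WaveNet or WaveRNN would generally replace this loop when higher quality is needed.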
Specifically, in this embodiment, the prosodic information contained in the prosodic rhythm vector includes speech-rate information, stress information, and intonation information. The speech-rate information refers to the speech rate of the syllable or word where the current character is located; the stress information refers to whether the word or syllable where the current character is located is stressed; the intonation information refers to the tone type of the word or syllable where the current character is located.
The speech-rate information includes: normal, slow, fast, and ultra-fast; the stress information includes stressed and unstressed; the intonation information includes: low level tone, high level tone, rising tone, and falling tone. "Normal" denotes the normal speech rate; "slow" denotes 0.5 times the normal rate; "fast" denotes 1.5 times the normal rate; "ultra-fast" denotes 2 times the normal rate.
In this embodiment, the prosodic rhythm vector is expressed as a 5-dimensional prosodic rhythm code vector, in which the speech-rate information is encoded with 2 binary bits, the stress information with 1 binary bit, and the intonation information with 2 binary bits.
In this embodiment, the specific codes for the speech-rate, stress, and intonation information are as follows:
Speech rate - normal: 00
Speech rate - slow: 01
Speech rate - fast: 10
Speech rate - ultra-fast: 11
Stress - stressed: 1
Stress - unstressed: 0
Intonation - high level tone: 00
Intonation - rising tone: 01
Intonation - falling tone: 10
Intonation - low level tone: 11
At synthesis time, if the text to be synthesized is neutral and no obvious emotion is needed, the default prosodic rhythm control information fed into the synthesizer may be: normal speech rate, unstressed, high level tone. Where an obvious emotional rhythm is needed, the prosodic rhythm information can be set accordingly.
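The coding table above maps directly to a small lookup function. The sketch below is an illustration of the described 5-bit encoding, not code from the patent; the English category names are translations chosen here.

```python
# 5-dimensional prosody code vector: 2 bits speech rate, 1 bit stress, 2 bits intonation
RATE = {"normal": (0, 0), "slow": (0, 1), "fast": (1, 0), "ultrafast": (1, 1)}
STRESS = {"stressed": (1,), "unstressed": (0,)}
TONE = {"high_level": (0, 0), "rising": (0, 1), "falling": (1, 0), "low_level": (1, 1)}

def prosody_vector(rate="normal", stress="unstressed", tone="high_level"):
    """Build the per-character 5-dim prosody code vector described above.
    The defaults correspond to the neutral setting: normal speech rate,
    unstressed, high level tone."""
    return list(RATE[rate] + STRESS[stress] + TONE[tone])

neutral = prosody_vector()                          # neutral text: [0, 0, 0, 0, 0]
excited = prosody_vector("fast", "stressed", "rising")
```

Packing the three fields into fixed bit positions keeps the control vector the same length for every character, which is what allows it to be concatenated uniformly with the character embeddings.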
Referring to Fig. 2, a speech synthesis apparatus with controllable prosodic emotion of the present invention comprises:
a representation-space conversion module 1, for converting the characters of the text to be synthesized into character representation vectors;
an encoder 2, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module 3, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder 4, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm.
A speech synthesis apparatus with controllable prosodic emotion comprises a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
In an implementation of the above speech synthesis apparatus with controllable prosodic emotion, the memory and the processor are electrically connected, directly or indirectly, to realize the transmission or interaction of data. For example, these elements may be electrically connected to each other through one or more communication buses or signal lines, such as a bus. The memory stores computer-executable instructions that implement the data access control method, including at least one software function module that can be stored in the memory in the form of software or firmware; the processor executes various functional applications and data processing by running the software programs and modules stored in the memory.
The memory may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and the like. The memory is used to store the program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip with signal processing capability. The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like, and may implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment of the present invention, a computer-readable storage medium is also provided; the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
Those of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments can be completed by hardware instructed by a program. The aforementioned program can be stored in a computer-readable storage medium; when executed by a processor, the program performs the steps of the above method embodiments. The aforementioned computer-readable storage medium includes various media that can store program code, such as ROM, RAM, and magnetic or optical disks, together with instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments.
The input of a classical end-to-end synthesis system is the character sequence corresponding to the text to be synthesized, so for the same text, the prosodic rhythm cannot be controlled individually. As a result, the prosodic rhythm the synthesized speech can exhibit is very limited, giving it a distinctly mechanical feel.
For this purpose, this patent improves the classical end-to-end synthesis method. By inputting rich prosodic control information, the synthesized speech not only keeps a prosodic rhythm as close as possible to that of the original voice, sounding more natural, lifelike, and emotionally expressive, but its prosodic rhythm can also be changed through the control information.
Prosodic rhythm information is typically a suprasegmental feature, whereas end-to-end synthesis generally uses characters or phonemes as modeling units. Therefore, during modeling, the segment-level prosodic information is evenly assigned to each corresponding character or phoneme. Through the prosodic rhythm vector containing speech-rate, stress, and intonation information, additional prosodic rhythm information is defined to better train the end-to-end synthesis model, so that the prosodic rhythm of the synthesized speech can be effectively controlled. By adding the duration, stress, and intonation prosodic rhythm information at the encoder and attention stages, the speech spectrum output by the decoder can be conveniently and efficiently controlled and changed, thereby controlling the emotional rhythm of the synthesized speech.
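The two injection points described above, the encoder input and the attention input, both amount to concatenating the per-character prosody code onto the corresponding feature vectors. A minimal NumPy sketch, with dimensions invented for illustration:

```python
import numpy as np

def attach_prosody(features, prosody_codes):
    """Concatenate a (T, 5) per-character prosody code matrix onto a
    (T, D) feature matrix, yielding (T, D + 5). The same operation is
    applied to the character embeddings fed to the encoder and to the
    encoder outputs fed to the attention mechanism."""
    codes = np.asarray(prosody_codes, dtype=features.dtype)
    return np.concatenate([features, codes], axis=-1)

T, D = 7, 256                                # 7 characters, 256-dim embeddings (illustrative)
char_embed = np.random.default_rng(1).normal(size=(T, D)).astype(np.float32)
codes = np.tile([0, 0, 0, 0, 0], (T, 1))     # neutral prosody for every character
enc_in = attach_prosody(char_embed, codes)   # shape (7, 261), ready for the encoder
```

Because the suprasegmental code is broadcast to every character of the affected segment, the model sees a constant control signal across the segment, which is what makes the decoder's output spectrum steerable at inference time.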
Claims (10)
1. A speech synthesis method with controllable prosodic emotion, characterized by comprising the following steps:
Step S1: converting the characters of the text to be synthesized into character representation vectors;
Step S2: concatenating the character representation vectors with a prosodic rhythm vector and feeding them into an encoder, which outputs encoded feature vectors;
Step S3: concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
Step S4: concatenating the spectrum frame predicted at the previous time step with the attention vector and feeding it into a decoder; updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm, while also predicting the stop point of spectrum generation;
Step S5: converting the predicted speech spectrum with the prosodic rhythm into speech output with the prosodic rhythm.
2. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S4, after decoding is completed, the predicted speech spectrum with the prosodic rhythm obtained from prediction is fed into a convolutional layer to improve generation quality.
3. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the prosodic information contained in the prosodic rhythm vector includes speech-rate information, stress information, and intonation information; the speech-rate information refers to the speech rate of the syllable or word where the current character is located; the stress information refers to whether the word or syllable where the current character is located is stressed; the intonation information refers to the tone type of the word or syllable where the current character is located; the speech-rate information includes: normal, slow, fast, and ultra-fast; the stress information includes stressed and unstressed; the intonation information includes: low level tone, high level tone, rising tone, and falling tone.
4. The speech synthesis method with controllable prosodic emotion according to claim 3, characterized in that: the prosodic rhythm vector is expressed as a 5-dimensional prosodic rhythm code vector, in which the speech-rate information is encoded with 2 binary bits, the stress information with 1 binary bit, and the intonation information with 2 binary bits.
5. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: in step S3, a location-sensitive attention mechanism is used.
6. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum with the prosodic rhythm is input into a speech vocoder to output speech with the prosodic rhythm, the vocoder including any one of WaveNet and WaveRNN.
7. The speech synthesis method with controllable prosodic emotion according to claim 1, characterized in that: the predicted speech spectrum with the prosodic rhythm is passed through the Griffin-Lim algorithm to output speech with the prosodic rhythm.
8. A speech synthesis apparatus with controllable prosodic emotion, characterized by comprising:
a representation-space conversion module, for converting the characters of the text to be synthesized into character representation vectors;
an encoder, for converting the input character representation vectors and prosodic rhythm vector into encoded feature vectors;
an attention module, for concatenating the encoded feature vectors with the prosodic rhythm vector and generating an attention vector through an attention mechanism;
a decoder, for concatenating the spectrum frame predicted at the previous time step with the attention vector, updating the attention vector from the decoder output, concatenating the newly computed attention vector with the decoder output, and feeding the result into a projection layer to output a predicted speech spectrum with the prosodic rhythm.
9. A speech synthesis apparatus with controllable prosodic emotion, characterized by comprising a processor, a memory, and a program; the program is stored in the memory, and the processor calls the program stored in the memory to execute the above speech synthesis method with controllable prosodic emotion.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a program, and the program is configured to execute the above speech synthesis method with controllable prosodic emotion.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910706204.XA CN110299131B (en) | 2019-08-01 | 2019-08-01 | Voice synthesis method and device capable of controlling prosodic emotion and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910706204.XA CN110299131B (en) | 2019-08-01 | 2019-08-01 | Voice synthesis method and device capable of controlling prosodic emotion and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110299131A true CN110299131A (en) | 2019-10-01 |
CN110299131B CN110299131B (en) | 2021-12-10 |
Family
ID=68032457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910706204.XA Active CN110299131B (en) | 2019-08-01 | 2019-08-01 | Voice synthesis method and device capable of controlling prosodic emotion and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110299131B (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110808027A (en) * | 2019-11-05 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Voice synthesis method and device and news broadcasting method and system |
CN111583902A (en) * | 2020-05-14 | 2020-08-25 | 携程计算机技术(上海)有限公司 | Speech synthesis system, method, electronic device, and medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and device for converting text into voice and computer equipment |
CN112086086A (en) * | 2020-10-22 | 2020-12-15 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
CN112185363A (en) * | 2020-10-21 | 2021-01-05 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112767969A (en) * | 2021-01-29 | 2021-05-07 | 苏州思必驰信息科技有限公司 | Method and system for determining emotion tendentiousness of voice information |
WO2021127979A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, computer device, and computer readable storage medium |
WO2021134591A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium |
CN113096636A (en) * | 2021-06-08 | 2021-07-09 | 北京世纪好未来教育科技有限公司 | Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium |
WO2021179910A1 (en) * | 2020-03-09 | 2021-09-16 | 百果园技术(新加坡)有限公司 | Text voice front-end conversion method and apparatus, and device and storage medium |
CN113643717A (en) * | 2021-07-07 | 2021-11-12 | 深圳市联洲国际技术有限公司 | Music rhythm detection method, device, equipment and storage medium |
CN113808579A (en) * | 2021-11-22 | 2021-12-17 | 中国科学院自动化研究所 | Detection method and device for generated voice, electronic equipment and storage medium |
CN114420086A (en) * | 2022-03-30 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Speech synthesis method and device |
WO2022095754A1 (en) * | 2020-11-03 | 2022-05-12 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
WO2022105545A1 (en) * | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, and readable medium and electronic device |
WO2023061259A1 (en) * | 2021-10-14 | 2023-04-20 | 北京字跳网络技术有限公司 | Speech speed adjustment method and apparatus, electronic device, and readable storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101000765A (en) * | 2007-01-09 | 2007-07-18 | 黑龙江大学 | Speech synthetic method based on rhythm character |
US7454348B1 (en) * | 2004-01-08 | 2008-11-18 | At&T Intellectual Property Ii, L.P. | System and method for blending synthetic voices |
CN103077705A (en) * | 2012-12-30 | 2013-05-01 | 安徽科大讯飞信息科技股份有限公司 | Method for optimizing local synthesis based on distributed natural rhythm |
US20160203815A1 (en) * | 2008-06-06 | 2016-07-14 | At&T Intellectual Property I, Lp | System and method for synthetically generated speech describing media content |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Phoneme synthesizing method, model training method, device and computer equipment |
CN109543722A (en) * | 2018-11-05 | 2019-03-29 | 中山大学 | A kind of emotion trend forecasting method based on sentiment analysis model |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | 平安科技(深圳)有限公司 | End-to-end phoneme synthesizing method, device, equipment and storage medium |
CN109754779A (en) * | 2019-01-14 | 2019-05-14 | 出门问问信息科技有限公司 | Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN109979429A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A kind of method and system of TTS |
- 2019-08-01: CN CN201910706204.XA patent/CN110299131B/en active Active
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7454348B1 (en) * | 2004-01-08 | 2008-11-18 | At&T Intellectual Property Ii, L.P. | System and method for blending synthetic voices |
CN101000765A (en) * | 2007-01-09 | 2007-07-18 | 黑龙江大学 | Speech synthetic method based on rhythm character |
US20160203815A1 (en) * | 2008-06-06 | 2016-07-14 | At&T Intellectual Property I, Lp | System and method for synthetically generated speech describing media content |
CN103077705A (en) * | 2012-12-30 | 2013-05-01 | 安徽科大讯飞信息科技股份有限公司 | Method for optimizing local synthesis based on distributed natural rhythm |
CN109036375A (en) * | 2018-07-25 | 2018-12-18 | 腾讯科技(深圳)有限公司 | Phoneme synthesizing method, model training method, device and computer equipment |
CN109543722A (en) * | 2018-11-05 | 2019-03-29 | 中山大学 | A kind of emotion trend forecasting method based on sentiment analysis model |
CN109616093A (en) * | 2018-12-05 | 2019-04-12 | 平安科技(深圳)有限公司 | End-to-end phoneme synthesizing method, device, equipment and storage medium |
CN109754779A (en) * | 2019-01-14 | 2019-05-14 | 出门问问信息科技有限公司 | Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing |
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
CN109979429A (en) * | 2019-05-29 | 2019-07-05 | 南京硅基智能科技有限公司 | A kind of method and system of TTS |
Non-Patent Citations (1)
Title |
---|
Zeng Biqing et al.: "Sentiment Analysis Based on a Dual-Attention Convolutional Neural Network Model", Journal of Guangdong University of Technology (广东工业大学学报) *
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110808027A (en) * | 2019-11-05 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Speech synthesis method and apparatus, and news broadcasting method and system |
WO2021127979A1 (en) * | 2019-12-24 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis method and apparatus, computer device, and computer readable storage medium |
WO2021134591A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Speech synthesis method, speech synthesis apparatus, smart terminal and storage medium |
WO2021179910A1 (en) * | 2020-03-09 | 2021-09-16 | 百果园技术(新加坡)有限公司 | Text voice front-end conversion method and apparatus, and device and storage medium |
CN111583902A (en) * | 2020-05-14 | 2020-08-25 | 携程计算机技术(上海)有限公司 | Speech synthesis system, method, electronic device, and medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111710326B (en) * | 2020-06-12 | 2024-01-23 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and apparatus for converting text into speech, and computer device |
CN111724765B (en) * | 2020-06-30 | 2023-07-25 | 度小满科技(北京)有限公司 | Text-to-speech method and apparatus, and computer device |
CN112185363A (en) * | 2020-10-21 | 2021-01-05 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112185363B (en) * | 2020-10-21 | 2024-02-13 | 北京猿力未来科技有限公司 | Audio processing method and device |
CN112086086A (en) * | 2020-10-22 | 2020-12-15 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and computer readable storage medium |
WO2022095754A1 (en) * | 2020-11-03 | 2022-05-12 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, storage medium, and electronic device |
WO2022105545A1 (en) * | 2020-11-20 | 2022-05-27 | 北京有竹居网络技术有限公司 | Speech synthesis method and apparatus, and readable medium and electronic device |
CN112767969A (en) * | 2021-01-29 | 2021-05-07 | 苏州思必驰信息科技有限公司 | Method and system for determining emotion tendentiousness of voice information |
CN113096636A (en) * | 2021-06-08 | 2021-07-09 | 北京世纪好未来教育科技有限公司 | Speech synthesis apparatus, speech synthesis method, electronic device, and storage medium |
CN113643717A (en) * | 2021-07-07 | 2021-11-12 | 深圳市联洲国际技术有限公司 | Music rhythm detection method, device, equipment and storage medium |
WO2023061259A1 (en) * | 2021-10-14 | 2023-04-20 | 北京字跳网络技术有限公司 | Speech speed adjustment method and apparatus, electronic device, and readable storage medium |
CN113808579A (en) * | 2021-11-22 | 2021-12-17 | 中国科学院自动化研究所 | Detection method and device for generated voice, electronic equipment and storage medium |
CN114420086A (en) * | 2022-03-30 | 2022-04-29 | 北京沃丰时代数据科技有限公司 | Speech synthesis method and device |
CN114420086B (en) * | 2022-03-30 | 2022-06-17 | 北京沃丰时代数据科技有限公司 | Speech synthesis method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110299131B (en) | 2021-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110299131A (en) | Speech synthesis method, apparatus, and storage medium with controllable prosody and emotion | |
Zhang et al. | Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning | |
US11295721B2 (en) | Generating expressive speech audio from text data | |
CN112687259B (en) | Speech synthesis method, device and readable storage medium | |
EP4029010B1 (en) | Neural text-to-speech synthesis with multi-level context features | |
CN116364055B (en) | Speech generation method, device, equipment and medium based on pre-training language model | |
CN108630203A (en) | Interactive voice equipment and its processing method and program | |
US6212501B1 (en) | Speech synthesis apparatus and method | |
KR20220054655A (en) | Speech synthesis method and apparatus, storage medium | |
King | A beginners’ guide to statistical parametric speech synthesis | |
KR102294639B1 (en) | Deep neural network based non-autoregressive speech synthesizer method and system using multiple decoder | |
CN113327627B (en) | Multi-factor controllable voice conversion method and system based on feature decoupling | |
JP5398295B2 (en) | Audio processing apparatus, audio processing method, and audio processing program | |
KR20230084229A (en) | Parallel tacotron: non-autoregressive and controllable TTS | |
CN111681641B (en) | Phrase-based end-to-end text-to-speech (TTS) synthesis | |
CN113838448A (en) | Voice synthesis method, device, equipment and computer readable storage medium | |
CN113450758B (en) | Speech synthesis method, apparatus, device and medium | |
JP4008607B2 (en) | Speech encoding / decoding method | |
US7089187B2 (en) | Voice synthesizing system, segment generation apparatus for generating segments for voice synthesis, voice synthesizing method and storage medium storing program therefor | |
JP2007086309A (en) | Voice synthesizer, voice synthesizing method, and program | |
JP5376643B2 (en) | Speech synthesis apparatus, method and program | |
CN114495896A (en) | Voice playing method and computer equipment | |
JP2010224418A (en) | Voice synthesizer, method, and program | |
US7092878B1 (en) | Speech synthesis using multi-mode coding with a speech segment dictionary | |
CN114495898B (en) | Unified speech synthesis and speech conversion training method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||