CN113488021A - Method for improving naturalness of speech synthesis - Google Patents
- Publication number
- CN113488021A (application CN202110906779.3A)
- Authority
- CN
- China
- Prior art keywords
- duration
- phonemes
- phoneme
- text
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
The invention discloses a method for improving the naturalness of speech synthesis. It comprises the following steps: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, forming a phoneme dictionary from all the phonemes, representing the phonemes of the text with an embedding layer whose input dimension is the size of the phoneme dictionary, and encoding the represented features with a CBHG module; taking the text encoding result as input, predicting the duration of each phoneme, comparing the prediction with the ground-truth labels, and optimizing the duration model; decoding the features expanded by the duration model, combining the decoded results into complex-valued features, and restoring the decoded complex features to a speech waveform by the inverse of the short-time Fourier transform applied to the original audio. The beneficial effects of the invention are: the complexity of the model is reduced, the amount of computation is reduced, and computation and deployment costs are saved; the naturalness of the synthesized speech is improved, and the pronunciation sounds more like a real person.
Description
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a method for improving the naturalness of speech synthesis.
Background
Thanks to the development of deep learning and its application in many fields, speech synthesis has also benefited greatly. Speech synthesis can be roughly divided into two stages of development. 1. Concatenative and parametric methods. The concatenative method searches a relatively large corpus for speech segments matching the text to be synthesized and splices the corresponding segments together. Although the synthesized speech is the voice of a real person, the expression of global characteristics such as tone and prosody is limited. The concatenative method also requires a large corpus and places high demands on the data set. The parametric method builds a statistical mapping model between text parameters and acoustic parameters. Its drawbacks are that the synthesized speech sounds unnaturally mechanical and that parameter tuning is troublesome. 2. Methods based on deep learning. Deep-learning-based speech synthesis has evolved in the end-to-end direction, and synthesis quality keeps improving; however, truly end-to-end models are still rare, and most systems bridge text and speech through a mel spectrogram. This causes a loss of naturalness in the synthesized speech.
In existing speech synthesis technology, a text normalization module first processes the text into phonemes as input; the text or phonemes are then represented by an embedding layer, and the represented features are encoded by feature extraction networks. The length of the encoded features is the same as the length of the input phoneme sequence; only the dimensionality increases from one dimension to a high dimension. The pronunciation duration of each text unit or phoneme is predicted from the text encoding result and rounded, so the number of durations matches the number of phonemes. The encoded features are then expanded according to the rounded durations, finally yielding a text encoding whose length matches the mel spectrogram extracted from the real speech. The features adjusted by the duration model are decoded by a deep learning network, and a loss is computed against the mel spectrogram extracted from the real speech. Taking the mel spectrogram extracted from real speech as input, neural network models such as WaveNet, Parallel WaveNet, or HiFi-GAN predict the real speech waveform. At synthesis time, however, the input is the decoded mel spectrogram, not the real one. The prior-art pipeline thus predicts a mel spectrogram from text and then predicts the speech waveform from the predicted mel spectrogram with a vocoder, and the objective functions of these two processes are not consistent.
Disclosure of Invention
The invention provides a method for improving the naturalness of speech synthesis which reduces the amount of computation and overcomes the defects of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for improving the naturalness of speech synthesis specifically comprises the following steps:
(1) text encoding: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, then forming a phoneme dictionary from all the phonemes, where the size of the phoneme dictionary is used as the input dimension of the embedding layer; the phonemes of the text are represented, i.e. mapped to feature vectors, by the embedding layer as in deep learning;
(2) the CBHG module encodes the represented features; the represented features are the feature vectors from the embedding layer, and encoding means mapping them to another feature vector through the CBHG module;
(3) a duration model: taking the text encoding result as input, the duration of each phoneme is predicted by a 3-layer convolutional neural network followed by one fully connected layer; the duration is the value the network predicts for each phoneme;
(4) comparing the prediction with the ground-truth labels and optimizing the duration model; the prediction is the network's estimate of the durations, the ground-truth label is the real duration of each phoneme, and the error between the predicted durations and the real durations of the phonemes in the training set is computed and then continuously reduced, which optimizes the duration model;
(5) speech decoding: the features expanded by the duration model are decoded by a two-layer bidirectional long short-term memory network, and the decoded results are combined into complex-valued features corresponding to the complex spectrum obtained by the short-time Fourier transform of the original audio;
(6) the decoded complex features are restored to a speech waveform by the inverse short-time Fourier transform.
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme durations, the speaking characteristics of the speaker, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the defects of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps, and synthesizes more natural speech. The advantage of the invention is that it provides an end-to-end speech synthesis system which, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computation and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Preferably, in step (2), the CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
Preferably, step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, the encoded phonemes are expanded according to the numerical values of the durations.
The beneficial effects of the invention are: the complexity of the model is reduced, the amount of computation is reduced, and computation and deployment costs are saved; the naturalness of the synthesized speech is improved, and the pronunciation sounds more like a real person.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
In the embodiment shown in fig. 1, a method for improving naturalness of speech synthesis specifically includes the following steps:
(1) text encoding: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, then forming a phoneme dictionary from all the phonemes, where the size of the phoneme dictionary is used as the input dimension of the embedding layer; the phonemes of the text are represented, i.e. mapped to feature vectors, by the embedding layer as in deep learning;
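Step (1) amounts to a table lookup: each phoneme's index in the dictionary selects a row of the embedding matrix. The following is a minimal sketch; the phoneme inventory, vector size and random values are hypothetical placeholders, since the patent does not specify them:

```python
import numpy as np

# Hypothetical phoneme inventory; a real system would build this from the
# grapheme-to-phoneme tool's output over the whole corpus.
phoneme_dict = {p: i for i, p in enumerate(["sil", "n", "i", "h", "ao"])}

# Embedding table: one row per dictionary entry (the dictionary size is the
# input dimension of the embedding layer), here with 8-dimensional vectors.
rng = np.random.default_rng(0)
embedding = rng.standard_normal((len(phoneme_dict), 8))

def embed(phonemes):
    """Map a phoneme sequence to its sequence of feature vectors."""
    ids = [phoneme_dict[p] for p in phonemes]
    return embedding[ids]

feats = embed(["n", "i", "h", "ao"])  # shape (4, 8): one vector per phoneme
```

In a trained system the embedding table would be a learned parameter rather than random values; the lookup itself is identical.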
(2) the CBHG module encodes the represented features; the represented features are the feature vectors from the embedding layer, and encoding means mapping them to another feature vector through the CBHG module. The CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
(3) A duration model: taking the text encoding result as input, the duration of each phoneme is predicted by a 3-layer convolutional neural network followed by one fully connected layer; the duration is the value the network predicts for each phoneme;
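A numpy sketch of such a duration predictor follows. The channel count, kernel width and random weights are hypothetical (the patent specifies only "3 convolutional layers and 1 fully connected layer"), and the ReLU activation is an assumption folded into the convolution helper for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_same(x, w):
    """1-D convolution over the time axis with 'same' padding, then ReLU.
    x: (T, C_in), w: (k, C_in, C_out)."""
    k, c_in, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((x.shape[0], c_out))
    for t in range(x.shape[0]):
        out[t] = np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

# Hypothetical sizes: 8-dim encoder output, 16 conv channels, kernel width 3.
T, d = 4, 8
encoded = rng.standard_normal((T, d))           # text encoding result
convs = [rng.standard_normal((3, d, 16)) * 0.1] + \
        [rng.standard_normal((3, 16, 16)) * 0.1 for _ in range(2)]
fc = rng.standard_normal((16, 1)) * 0.1

h = encoded
for w in convs:                                  # 3 convolutional layers
    h = conv1d_same(h, w)
durations = (h @ fc).squeeze(-1)                 # one duration per phoneme
```

In training, these predictions would be compared with the ground-truth durations (step (4)) and the weights updated; at synthesis time the predictions are rounded before expansion.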
(4) comparing the prediction with the ground-truth labels and optimizing the duration model; the prediction is the network's estimate of the durations, the ground-truth label is the real duration of each phoneme, and the error between the predicted durations and the real durations of the phonemes in the training set is computed and then continuously reduced, which optimizes the duration model. Specifically: after the pronunciation duration of each phoneme is obtained, the encoded phonemes are expanded according to the numerical values of the durations. Considering the input and output before and after the length regulator as in FIG. 1: if there are three phonemes a, b, c with predicted durations 2, 3 and 4 respectively, the expanded sequence is aabbbcccc.
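The length regulation described in this step can be written directly; note that durations of 2, 3 and 4 repeat the second phoneme three times:

```python
def length_regulate(encoded, durations):
    """Repeat each phoneme's encoded feature by its (rounded) duration,
    so the expanded sequence matches the frame length of the target audio."""
    expanded = []
    for feat, d in zip(encoded, durations):
        expanded.extend([feat] * int(round(d)))
    return expanded

# Three phonemes a, b, c with predicted durations 2, 3, 4.
print("".join(length_regulate(["a", "b", "c"], [2, 3, 4])))  # prints aabbbcccc
```

In the real model each element would be a feature vector rather than a character, but the repetition logic is the same.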
(5) Speech decoding: the features expanded by the duration model are decoded by a two-layer bidirectional long short-term memory network (bidirectional LSTM), and the decoded results are combined into complex-valued features corresponding to the complex spectrum obtained by the short-time Fourier transform of the original audio. Complex features differ from ordinary features, which generally lie in the real-number domain: each complex feature consists of two parts, a real part and an imaginary part. The short-time Fourier transform is a standard mathematical operation (STFT) and can also be implemented by a neural network;
(6) the decoded complex features are restored to a speech waveform by the inverse short-time Fourier transform.
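Steps (5)-(6) rely on the complex STFT being invertible: if the decoder predicts both real and imaginary parts, the waveform follows by a deterministic inverse transform, with no separate vocoder. A sketch using SciPy's `stft`/`istft` on a stand-in signal (a pure tone, since the decoder itself is not reproduced here):

```python
import numpy as np
from scipy.signal import stft, istft

# Stand-in for the decoder output: the true complex STFT of a test tone.
fs = 16000
t = np.arange(fs) / fs
wave = 0.5 * np.sin(2 * np.pi * 440 * t)

_, _, Z = stft(wave, fs=fs, nperseg=512)     # complex features (real + imaginary)
_, recovered = istft(Z, fs=fs, nperseg=512)  # inverse short-time Fourier transform

# With matching analysis/synthesis parameters the waveform is recovered
# almost exactly (istft may append padding frames at the end).
err = np.max(np.abs(recovered[: len(wave)] - wave))
```

The near-zero reconstruction error illustrates why predicting complex spectra lets the loss be computed against the waveform itself rather than against an intermediate mel spectrogram.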
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme durations, the speaking characteristics of the speaker, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the defects of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps, and synthesizes more natural speech. The advantage of the invention is that it provides an end-to-end speech synthesis system which, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computation and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Claims (3)
1. A method for improving the naturalness of speech synthesis, characterized by comprising the following steps:
(1) text encoding: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, then forming a phoneme dictionary from all the phonemes, where the size of the phoneme dictionary is used as the input dimension of the embedding layer; the phonemes of the text are represented, i.e. mapped to feature vectors, by the embedding layer as in deep learning;
(2) the CBHG module encodes the represented features; the represented features are the feature vectors from the embedding layer, and encoding means mapping them to another feature vector through the CBHG module;
(3) a duration model: taking the text encoding result as input, the duration of each phoneme is predicted by a 3-layer convolutional neural network followed by one fully connected layer; the duration is the value the network predicts for each phoneme;
(4) comparing the prediction with the ground-truth labels and optimizing the duration model; the prediction is the network's estimate of the durations, the ground-truth label is the real duration of each phoneme, and the error between the predicted durations and the real durations of the phonemes in the training set is computed and then continuously reduced, which optimizes the duration model;
(5) speech decoding: the features expanded by the duration model are decoded by a two-layer bidirectional long short-term memory network, and the decoded results are combined into complex-valued features corresponding to the complex spectrum obtained by the short-time Fourier transform of the original audio;
(6) the decoded complex features are restored to a speech waveform by the inverse short-time Fourier transform.
2. The method of claim 1, wherein in step (2) the CBHG module comprises a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
3. The method of claim 1, wherein step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, the encoded phonemes are expanded according to the numerical values of the durations.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110906779.3A CN113488021A (en) | 2021-08-09 | 2021-08-09 | Method for improving naturalness of speech synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110906779.3A CN113488021A (en) | 2021-08-09 | 2021-08-09 | Method for improving naturalness of speech synthesis |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113488021A true CN113488021A (en) | 2021-10-08 |
Family
ID=77946052
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110906779.3A Pending CN113488021A (en) | 2021-08-09 | 2021-08-09 | Method for improving naturalness of speech synthesis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113488021A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111739508A (en) * | 2020-08-07 | 2020-10-02 | 浙江大学 | End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network |
CN112802450A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof |
CN112802448A (en) * | 2021-01-05 | 2021-05-14 | 杭州一知智能科技有限公司 | Speech synthesis method and system for generating new tone |
CN112863483A (en) * | 2021-01-05 | 2021-05-28 | 杭州一知智能科技有限公司 | Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm |
WO2021127821A1 (en) * | 2019-12-23 | 2021-07-01 | 深圳市优必选科技股份有限公司 | Speech synthesis model training method, apparatus, computer device, and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||