CN113488021A - Method for improving naturalness of speech synthesis - Google Patents

Method for improving naturalness of speech synthesis

Info

Publication number
CN113488021A
Authority
CN
China
Prior art keywords
duration
phonemes
phoneme
text
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110906779.3A
Other languages
Chinese (zh)
Inventor
盛乐园
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Xiaoying Innovation Technology Co ltd
Original Assignee
Hangzhou Xiaoying Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Xiaoying Innovation Technology Co ltd filed Critical Hangzhou Xiaoying Innovation Technology Co ltd
Priority to CN202110906779.3A
Publication of CN113488021A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention discloses a method for improving the naturalness of speech synthesis. It comprises the following steps: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, forming a phoneme dictionary from all the phonemes, representing the phonemes of the text with an embedding layer whose input dimension is the size of the phoneme dictionary, and encoding the represented features with a CBHG module; taking the text encoding result as input, predicting the duration of each phoneme, comparing the predictions with the real labels, and optimizing the duration model; decoding the features expanded by the duration model, combining the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform of the original audio, and restoring the decoded complex feature to a speech waveform through an inverse short-time Fourier transform. The invention has the beneficial effects that: it reduces model complexity and the amount of computation, saving computing and deployment costs; and it improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.

Description

Method for improving naturalness of speech synthesis
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a method for improving the naturalness of speech synthesis.
Background
Speech synthesis has benefited greatly from the development of deep learning and its application across many fields. Its development can be roughly divided into two stages. 1. Concatenative and parametric methods. The concatenative (splicing) method searches a relatively large corpus for speech segments matching the text to be synthesized and joins them together. Although the synthesized speech is a real person's voice, the expression of global characteristics such as intonation and prosody is limited; the method also requires a large corpus and places high demands on the dataset. The parametric method builds a statistical mapping model between text parameters and acoustic parameters; its drawback is that the synthesized speech sounds mechanically unnatural and parameter tuning is troublesome. 2. Deep-learning-based methods. Deep-learning-based speech synthesis has evolved toward end-to-end systems, and synthesis quality keeps improving, but truly end-to-end models remain rare: most systems bridge text and speech through a mel spectrogram, which costs the synthesized speech some naturalness.
In existing speech synthesis technology, a text normalization module first processes the text into phonemes as input, an embedding layer represents the text or phonemes as features, and feature-extraction networks then encode the represented features. The encoded features have the same length as the input phoneme sequence; only the dimension increases from one to many. The pronunciation duration of each text unit or phoneme is then predicted from the text encoding result. The predicted durations are rounded, and their count matches the phoneme sequence length. The encoded features are expanded according to the rounded durations, finally yielding a text encoding whose length matches the mel spectrogram extracted from real speech. The expanded features are decoded by a deep learning network, and a loss is computed against the mel spectrogram extracted from real speech. Separately, taking the mel spectrogram extracted from real speech as input, a neural vocoder such as WaveNet, Parallel WaveNet, or HiFi-GAN is used to predict the real speech waveform. At synthesis time, however, the vocoder's input is the decoded mel spectrogram, not a real one. The prior art thus predicts a mel spectrogram from text and then predicts the speech waveform from the predicted mel spectrogram with a vocoder, and the objective functions of these two processes are not consistent.
Disclosure of Invention
The invention provides a method for improving the naturalness of speech synthesis that reduces the amount of computation and overcomes the shortcomings of the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for improving the naturalness of speech synthesis specifically comprises the following steps:
(1) text encoding: obtain the phonemes corresponding to the text with a grapheme-to-phoneme tool, then form a phoneme dictionary from all the phonemes, and use the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., map each phoneme to a feature vector through an embedding in deep learning;
(2) a CBHG module encodes the represented features; the represented features are the feature vectors from deep learning, and encoding means mapping them to another feature vector through the CBHG module;
(3) duration model: taking the text encoding result as input, predict the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer; the duration here is the single value the network predicts for each phoneme;
(4) compare the predictions with the real labels and optimize the duration model; the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model;
(5) speech decoding: decode the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combine the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio;
(6) the decoded complex features are restored to a speech waveform through an inverse short-time Fourier transform.
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme pronunciation durations, the speaker's characteristics, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the shortcomings of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps and synthesizes more natural speech. The invention provides an end-to-end speech synthesis system that, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computing and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Preferably, in step (2), the CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
Preferably, step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, expanding the encoded phoneme features according to the duration values.
The invention has the beneficial effects that: it reduces model complexity and the amount of computation, saving computing and deployment costs; and it improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The invention is further described with reference to the following figures and detailed description.
In the embodiment shown in FIG. 1, a method for improving naturalness of speech synthesis specifically includes the following steps:
(1) Text encoding: obtain the phonemes corresponding to the text with a grapheme-to-phoneme tool, then form a phoneme dictionary from all the phonemes, and use the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., map each phoneme to a feature vector through an embedding in deep learning, as sketched below.
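A minimal sketch of this embedding step, assuming PyTorch; the toy phoneme dictionary and the embedding dimension of 256 are illustrative assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

# Toy phoneme dictionary; in practice it holds every phoneme the
# grapheme-to-phoneme tool can produce.
phoneme_dict = {"sil": 0, "n": 1, "i3": 2, "h": 3, "ao3": 4}

# The dictionary size is the input dimension (vocabulary size) of the
# embedding layer; the output dimension 256 is an assumption.
embedding = nn.Embedding(num_embeddings=len(phoneme_dict), embedding_dim=256)

phoneme_ids = torch.tensor([[phoneme_dict[p] for p in ("n", "i3", "h", "ao3")]])
features = embedding(phoneme_ids)  # shape (1, 4, 256): one vector per phoneme
```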
(2) A CBHG module encodes the represented features; the represented features are the feature vectors from deep learning, and encoding means mapping them to another feature vector through the CBHG module. The CBHG module consists of a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
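A compact sketch of such a CBHG encoder, assuming PyTorch; the kernel-size range, number of highway layers, and feature width are illustrative assumptions (the patent names only the three components):

```python
import torch
import torch.nn as nn

class CBHG(nn.Module):
    def __init__(self, dim=256, bank_size=8, highway_layers=4):
        super().__init__()
        # Bank of 1-D convolutional filters with kernel sizes 1..bank_size
        self.conv_bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2)
             for k in range(1, bank_size + 1)])
        self.projection = nn.Conv1d(bank_size * dim, dim, kernel_size=3, padding=1)
        # Highway network: gated mixing of a transformed signal and its input
        self.highways = nn.ModuleList(
            [nn.Linear(dim, 2 * dim) for _ in range(highway_layers)])
        # Bidirectional GRU; hidden size halved so the output width stays `dim`
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, x):                      # x: (batch, time, dim)
        t = x.size(1)
        y = x.transpose(1, 2)                  # (batch, dim, time) for Conv1d
        bank = torch.cat([conv(y)[:, :, :t] for conv in self.conv_bank], dim=1)
        y = self.projection(bank).transpose(1, 2) + x   # residual connection
        for layer in self.highways:
            h, gate = layer(y).chunk(2, dim=-1)
            g = torch.sigmoid(gate)
            y = g * torch.relu(h) + (1 - g) * y
        out, _ = self.gru(y)                   # (batch, time, dim)
        return out
```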
(3) Duration model: taking the text encoding result as input, predict the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer; the duration here is the single value the network predicts for each phoneme. A sketch follows.
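A sketch of such a duration model under stated assumptions: PyTorch, kernel size 3, and channel width 256 (the patent specifies only the 3 convolution layers and 1 fully connected layer):

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(            # 3-layer convolutional network
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU())
        self.fc = nn.Linear(dim, 1)            # 1-layer fully connected head

    def forward(self, x):                      # x: (batch, phonemes, dim)
        y = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.fc(y).squeeze(-1)          # one duration per phoneme
```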
(4) Compare the predictions with the real labels and optimize the duration model; the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model. Specifically, after the pronunciation duration of each phoneme is obtained, the encoded phoneme features are expanded according to the duration values. Consider the input and output before and after the length adjuster in FIG. 1: if there are three phonemes a, b, c with predicted durations 2, 3 and 4 respectively, the expanded sequence is aabbbcccc.
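A minimal sketch of this step, assuming PyTorch and a mean-squared-error duration loss (the patent does not name a specific loss function); the expansion reproduces the aabbbcccc example above:

```python
import torch
import torch.nn.functional as F

# Duration loss: network-predicted vs. real per-phoneme durations (in frames).
predicted = torch.tensor([2.3, 2.8, 4.1])
target = torch.tensor([2.0, 3.0, 4.0])
duration_loss = F.mse_loss(predicted, target)  # reduced during training

# Length adjustment: expand encoded phoneme features by the rounded durations.
encoded = torch.randn(3, 256)                  # features for phonemes a, b, c
durations = torch.tensor([2, 3, 4])            # rounded predicted durations
expanded = torch.repeat_interleave(encoded, durations, dim=0)
print(expanded.shape)                          # torch.Size([9, 256]), i.e. aabbbcccc
```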
(5) Speech decoding: decode the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combine the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio. The 2-layer bidirectional long short-term memory network is a bidirectional LSTM. The complex feature differs from ordinary features, which generally live in the real domain: the complex domain has one more component than the real domain, i.e., the feature consists of a real part and an imaginary part. The short-time Fourier transform (STFT) is a standard mathematical operation and can also be implemented by a neural network. A sketch follows.
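A sketch of such a decoder under stated assumptions: PyTorch, an assumed FFT size of 1024, and two linear projections producing the real and imaginary parts (the patent gives no concrete STFT parameters):

```python
import torch
import torch.nn as nn

class ComplexDecoder(nn.Module):
    def __init__(self, dim=256, n_fft=1024):
        super().__init__()
        # 2-layer bidirectional LSTM over the duration-expanded features
        self.lstm = nn.LSTM(dim, dim, num_layers=2,
                            batch_first=True, bidirectional=True)
        n_bins = n_fft // 2 + 1
        self.real_proj = nn.Linear(2 * dim, n_bins)  # real part per frame
        self.imag_proj = nn.Linear(2 * dim, n_bins)  # imaginary part per frame

    def forward(self, x):                   # x: (batch, frames, dim)
        y, _ = self.lstm(x)
        # Combine the two decoded parts into one complex-valued feature
        return torch.complex(self.real_proj(y), self.imag_proj(y))
```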
(6) The decoded complex features are restored to a speech waveform through an inverse short-time Fourier transform, as sketched below.
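A minimal sketch of the reconstruction, assuming PyTorch's built-in inverse STFT; the FFT size, hop length, and window are assumptions and must match the analysis settings used on the original audio:

```python
import torch

n_fft, hop = 1024, 256
# A decoded complex-valued spectrogram: (frequency bins, frames).
spec = torch.randn(n_fft // 2 + 1, 80, dtype=torch.complex64)

waveform = torch.istft(spec, n_fft=n_fft, hop_length=hop,
                       window=torch.hann_window(n_fft))  # 1-D waveform samples
```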
Because the objective optimization functions of the invention target the synthesized speech waveform and the predicted phoneme pronunciation durations, the speaker's characteristics, including tone, pauses and speaking style, can be learned directly from the original audio. The synthesized speech is therefore more natural than that of other speech synthesis systems. The invention avoids the shortcomings of the prior art: it predicts the waveform directly from the text, reduces the intermediate steps and synthesizes more natural speech. The invention provides an end-to-end speech synthesis system that, compared with other speech synthesis systems, reduces model complexity and the amount of computation, saving computing and deployment costs, and improves the naturalness of the synthesized speech so that the pronunciation sounds more like a real person.

Claims (3)

1. A method for improving the naturalness of speech synthesis, characterized by comprising the following steps:
(1) text encoding: obtaining the phonemes corresponding to the text with a grapheme-to-phoneme tool, then forming a phoneme dictionary from all the phonemes, and using the size of the phoneme dictionary as the input dimension of an embedding layer to represent the phonemes of the text, i.e., mapping each phoneme to a feature vector through an embedding in deep learning;
(2) encoding the represented features with a CBHG module, wherein the represented features are the feature vectors from deep learning and encoding means mapping them to another feature vector through the CBHG module;
(3) duration model: taking the text encoding result as input and predicting the duration of each phoneme through a 3-layer convolutional neural network followed by a 1-layer fully connected layer, wherein the duration is the single value the network predicts for each phoneme;
(4) comparing the predictions with the real labels and optimizing the duration model, wherein the predictions are the network's duration estimates, the real labels are the true durations of the phonemes, the error between the predicted and true durations of the phonemes in the training set is computed, and continually reducing this error optimizes the duration model;
(5) speech decoding: decoding the features expanded by the duration model through a 2-layer bidirectional long short-term memory network and combining the decoded outputs into a complex-valued feature corresponding to the short-time Fourier transform extracted from the original audio;
(6) restoring the decoded complex features to a speech waveform through an inverse short-time Fourier transform.
2. The method of claim 1, wherein in step (2) the CBHG module comprises a bank of one-dimensional convolutional filters, a highway network, and a recurrent neural network of bidirectional gated recurrent units.
3. The method of claim 1, wherein step (4) specifically comprises: after the pronunciation duration of each phoneme is obtained, expanding the encoded phoneme features according to the duration values.
CN202110906779.3A 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis Pending CN113488021A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110906779.3A CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110906779.3A CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Publications (1)

Publication Number Publication Date
CN113488021A true CN113488021A (en) 2021-10-08

Family

ID=77946052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110906779.3A Pending CN113488021A (en) 2021-08-09 2021-08-09 Method for improving naturalness of speech synthesis

Country Status (1)

Country Link
CN (1) CN113488021A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111739508A (en) * 2020-08-07 2020-10-02 浙江大学 End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN112802450A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Rhythm-controllable Chinese and English mixed speech synthesis method and system thereof
CN112802448A (en) * 2021-01-05 2021-05-14 杭州一知智能科技有限公司 Speech synthesis method and system for generating new tone
CN112863483A (en) * 2021-01-05 2021-05-28 杭州一知智能科技有限公司 Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
WO2021127821A1 (en) * 2019-12-23 2021-07-01 深圳市优必选科技股份有限公司 Speech synthesis model training method, apparatus, computer device, and storage medium


Similar Documents

Publication Publication Date Title
Yu et al. DurIAN: Duration Informed Attention Network for Speech Synthesis.
Kleijn et al. Wavenet based low rate speech coding
CN108899009B (en) Chinese speech synthesis system based on phoneme
CN111179905A (en) Rapid dubbing generation method and device
CN110767210A (en) Method and device for generating personalized voice
CN113112995B (en) Word acoustic feature system, and training method and system of word acoustic feature system
CN112489629A (en) Voice transcription model, method, medium, and electronic device
CN114023300A (en) Chinese speech synthesis method based on diffusion probability model
KR102272554B1 (en) Method and system of text to multiple speech
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN111724809A (en) Vocoder implementation method and device based on variational self-encoder
US20240127832A1 (en) Decoder
Oura et al. Deep neural network based real-time speech vocoder with periodic and aperiodic inputs
CN113782042A (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN116092475B (en) Stuttering voice editing method and system based on context-aware diffusion model
Zhao et al. Research on voice cloning with a few samples
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN113436607B (en) Quick voice cloning method
KR20230075340A (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN114974218A (en) Voice conversion model training method and device and voice conversion method and device
CN114203151A (en) Method, device and equipment for training speech synthesis model
CN113488021A (en) Method for improving naturalness of speech synthesis
CN113327578A (en) Acoustic model training method and device, terminal device and storage medium
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination