CN112489616A - Speech synthesis method - Google Patents

Speech synthesis method

Info

Publication number
CN112489616A
Authority
CN
China
Prior art keywords
output
encoder
attention
network
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011374257.5A
Other languages
Chinese (zh)
Inventor
邓努波
陈丽娟
张丽娟
张建华
黄嫄
向洪伟
郭强
程洁
张流畅
巫俊洁
邓燕晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Chongqing Electric Power Co Ltd
Materials Branch of State Grid Chongqing Electric Power Co Ltd
Original Assignee
Materials Branch of State Grid Chongqing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Materials Branch of State Grid Chongqing Electric Power Co Ltd
Priority to CN202011374257.5A
Publication of CN112489616A
Legal status: Pending (current)

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Abstract

The invention relates to a speech synthesis method which comprises the following steps: first, text features and acoustic features are extracted; second, an encoder is constructed and outputs an encoded sequence; third, a position-sensitive attention mechanism is introduced and the attention weights are calculated; fourth, a decoder is constructed, its output is concatenated with the attention context vector, projected to a scalar and passed to an activation function to determine whether prediction has finished; finally, the decoder output is converted into a linear spectrogram by a post-processing network and restored to a speech waveform by the Griffin-Lim algorithm. The invention concatenates the attention context vector with the encoded sequence output by the encoder and uses the attention weights accumulated over previous decoding steps as an additional feature, so that the model stays consistent as it advances along the input sequence, problems such as omitted or repeated subsequences that may occur during decoding are reduced, and the accuracy of the final synthesized speech is improved.

Description

Speech synthesis method
Technical Field
The present invention relates to speech synthesis technology, and more particularly, to a speech synthesis method.
Background
Against the background of the State Grid Corporation's drive to streamline staffing and raise efficiency, the contradiction between the shortage of specialist staff and the continuously growing number of suppliers served by the power materials company has become increasingly prominent, and the existing demand for large-volume bidding and contract-performance information interaction is difficult to meet. With the arrival of the artificial intelligence era, speech recognition technology is developing continuously. Artificial-intelligence voice technology can take over most manual call-handling work, freeing up manpower and improving efficiency. Therefore, during the upgrading of its intelligent supplier service hall, the Materials Branch of State Grid Chongqing Electric Power Co. sorted through the requirements of existing voice-call interaction scenarios, identified the key points that a materials specialist must confirm when delivering information notifications, built dialogue flows for common business notifications, and, relying on technologies such as speech recognition, semantic understanding, speech synthesis and big-data analysis, created an AI intelligent outbound-call system with a high degree of service fit and intelligence.
The AI intelligent outbound-call system frees staff from tedious repetitive work and improves efficiency; at the same time, because its voice carries no emotion, disputes can be effectively avoided. Speech synthesis is a key technology in the AI intelligent outbound-call system, and how to synthesize speech accurately is the technical problem the system must solve.
Disclosure of Invention
In view of the problems in the prior art, the technical problem to be solved by the invention is how to synthesize speech accurately.
In order to solve the technical problems, the invention adopts the following technical scheme: a method of speech synthesis comprising the steps of:
S10: extracting text features and acoustic features;
the text feature extraction module first performs character embedding on the input text data, i.e. represents each text character with a vector of fixed dimensionality, and then passes the result through the Pre-Net and CBHG sub-networks in turn to obtain the text feature data;
acoustic feature extraction: Mel spectra and linear spectra are used; the speech data are first pre-emphasized by passing the original audio signal through a high-pass filter, and a short-time Fourier transform is then applied to obtain the linear spectrum;
s20, fusing the extracted text feature data and the acoustic features, which comprises the following steps:
a) constructing an encoder: the encoder is the encoder of the Tacotron framework; the text feature data obtained in S10 are input to the encoder, which outputs an encoded sequence;
b) constructing a position-sensitive attention mechanism: its location features are obtained by convolution with 32 one-dimensional convolution kernels of length 31; after the encoded sequence output by step a) and the location features are projected onto a 128-dimensional hidden representation, the attention weights and hence the attention context vector are obtained;
c) constructing a decoder: the decoder is an autoregressive recurrent neural network that predicts a spectrogram from the encoded sequence output by the encoder, one frame at a time; the spectrum frame predicted at the previous step is first passed into a two-layer fully connected preprocessing network (pre-net) with 256 hidden ReLU units per layer;
the pre-net output is concatenated with the attention context vector and fed into a two-layer stack of unidirectional recurrent layers with 1024 units each; the output of this network is concatenated with the attention context vector again, and the target spectrum frame is predicted through a linear projection;
the predicted target spectrum frame is passed through a 5-layer convolutional network that predicts a residual, which is added back onto the spectrum frame before convolution; each layer of the network consists of 512 convolution kernels of size 5 x 1 followed by batch normalization, and every layer except the last convolution layer is followed by a tanh activation function after the batch normalization;
in parallel with spectrum frame prediction, the decoder output is concatenated with the attention context vector, projected to a scalar and passed to a sigmoid activation function to predict the probability that the output sequence has finished;
when this probability is greater than or equal to a preset stop threshold, prediction is considered finished and the next step is carried out;
d) post-processing network and waveform synthesis: the post-processing network consists of a CBHG module and a fully connected layer; the decoder output is converted into a linear spectrogram by the post-processing network, and the linear spectrogram is restored to a speech waveform by the Griffin-Lim algorithm and output.
Preferably, the specific method for extracting the acoustic features in S10 is as follows:
1) the original audio signal is passed through a high-pass filter to obtain pre-emphasized speech data, using formula (1):
H(z) = 1 - μ·z^(-1)   (1)
where H(z) is the pre-emphasis filter applied to the speech samples, the term 1 corresponds to the sample at the current instant, z^(-1) corresponds to the sample at the previous instant, and μ is the pre-emphasis coefficient;
2) a short-time Fourier transform is then applied to the speech data obtained from formula (1) to obtain the linear spectrum, as in formula (2):
X(τ, f) = ∫ z(t)·g(t - τ)·e^(-j2πft) dt   (2)
where z(t) is the source signal, i.e. the pre-emphasized speech output by the filter H(z), g(t) is a window function, and f is the frequency of the linear spectrum;
3) the linear spectrum is processed with a Mel filter bank to obtain the Mel spectrum, as in formula (3):
mel(f) = 2595·log10(1 + f/700)   (3)
where f is the frequency of the linear spectrum.
Preferably, the encoder in S20 consists of a Pre-net preprocessing network and a CBHG module, and the CBHG module consists, in order, of a one-dimensional convolution filter bank, a residual connection, a multi-layer highway network, and a bidirectional gated recurrent unit (GRU) network.
Preferably, after the position-sensitive attention mechanism is constructed in S20, the decoder output is calculated as follows:
the energy calculation of the position sensitive attention mechanism is as in formula (4):
e_{i,j} = v_a^T·tanh(W·s_i + V·h_j + U·f_{i,j} + b)   (4)
where s_i is the hidden state of the decoder recurrent neural network at time i, h_j is the j-th output of the encoder, f_{i,j} is the convolution output of the attention weights accumulated before time i, b is a bias value initialized to the zero vector, W, V and U denote the weight matrices of the different network layers, and v_a^T denotes the transpose of v_a;
the convolution output f_{i,j} is obtained by convolving the accumulated attention weights cα_{i-1} with the convolution kernel F, as in formulas (5) and (6):
f_i = F * cα_{i-1}   (5)
cα_{i-1} = Σ_{k=1}^{i-1} α_k   (6)
where α_k denotes the attention weights at decoding step k.
compared with the prior art, the invention has at least the following advantages:
in the invention, a decoder is constructed by using an autoregressive recurrent neural network, and a position sensitive attention mechanism is introduced in the encoding process, so that an attention context vector and an encoder output encoding sequence are spliced together when the decoder is used, and because the sensitive attention mechanism can simultaneously consider the content and the position of an input phoneme, the accumulated attention weight after the previous decoding process can be used as an additional feature, so that the model keeps consistency when advancing along an input sequence, the problems of subsequence omission or repetition and the like which possibly occur in the decoding process are reduced, and the accuracy of the final synthesized speech is improved.
Detailed Description
The present invention is described in further detail below.
A method of speech synthesis comprising the steps of: s10: extracting text features and acoustic features;
the text feature extraction module firstly embeds characters into input text data, namely uses vectors with fixed dimensionality to represent text characters, and then sequentially passes through two sub-networks of Pre-Net and CBHG to obtain text feature data;
acoustic feature extraction is carried out using the Mel spectrum and the linear spectrum: the speech data are first pre-emphasized by passing the original audio signal through a high-pass filter, and a short-time Fourier transform is then applied to obtain the linear spectrum.
As an improvement, the specific method for extracting the acoustic features in S10 is as follows:
1) the original audio signal is passed through a high-pass filter to obtain pre-emphasized speech data, using formula (1):
H(z) = 1 - μ·z^(-1)   (1)
where H(z) is the pre-emphasis filter applied to the speech samples, the term 1 corresponds to the sample at the current instant, z^(-1) corresponds to the sample at the previous instant, and μ is the pre-emphasis coefficient, typically between 0.9 and 1.0;
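A minimal numpy sketch of this pre-emphasis step, applied in the time domain as y[n] = x[n] - μ·x[n-1]; the choice μ = 0.97 is an assumed value within the 0.9 to 1.0 range stated above, and the test tone only stands in for real speech data.

```python
import numpy as np

def pre_emphasis(signal: np.ndarray, mu: float = 0.97) -> np.ndarray:
    """Time-domain form of the high-pass filter H(z) = 1 - mu * z^(-1):
    y[n] = x[n] - mu * x[n-1]."""
    return np.append(signal[0], signal[1:] - mu * signal[:-1])

# Example: pre-emphasize one second of a 16 kHz test tone.
sr = 16000
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 220.0 * t)
emphasized = pre_emphasis(audio, mu=0.97)
```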
2) a short-time Fourier transform is then applied to the speech data obtained from formula (1) to obtain the linear spectrum, as in formula (2):
X(τ, f) = ∫ z(t)·g(t - τ)·e^(-j2πft) dt   (2)
where z(t) is the source signal, i.e. the pre-emphasized speech output by the filter H(z), g(t) is a window function, and f is the frequency of the linear spectrum;
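The short-time Fourier transform of formula (2) can be sketched with scipy as follows; the frame length and hop size are assumptions (the text does not fix them), and the generated tone stands in for the pre-emphasized speech.

```python
import numpy as np
from scipy.signal import stft

sr = 16000
audio = np.sin(2 * np.pi * 220.0 * np.arange(sr) / sr)   # stand-in for pre-emphasized speech
frame_len, hop = 1024, 256                                # assumed analysis settings

# Windowed short-time Fourier transform; the magnitude gives the linear spectrum.
_, _, Z = stft(audio, fs=sr, window='hann',
               nperseg=frame_len, noverlap=frame_len - hop)
linear_spectrum = np.abs(Z)                               # shape: (frame_len // 2 + 1, n_frames)
```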
3) the linear spectrum is processed with a Mel filter bank to obtain the Mel spectrum, as in formula (3):
mel(f) = 2595·log10(1 + f/700)   (3)
where f is the frequency of the linear spectrum.
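The Mel filter-bank step of formula (3) can be sketched with librosa; the 80 Mel bands are an assumption, and the filter bank realizes the mel(f) = 2595·log10(1 + f/700) mapping.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 16000, 1024, 80
# Placeholder linear-magnitude spectrum with n_fft // 2 + 1 frequency bins and 120 frames.
linear_spectrum = np.abs(np.random.randn(n_fft // 2 + 1, 120))

# Triangular Mel filter bank mapping linear frequencies to the Mel scale.
mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)   # (n_mels, n_fft // 2 + 1)
mel_spectrum = mel_basis @ linear_spectrum                            # (n_mels, n_frames)
```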
S20, fusing the extracted text feature data and the acoustic features, which comprises the following steps:
a) The encoder is constructed using the encoder of the Tacotron framework, which is prior art.
The text feature data obtained in S10 is input to the encoder, and the encoder outputs the encoded sequence.
The encoder consists of a Pre-net preprocessing network and a CBHG module; the Pre-net preprocesses the input text. The CBHG module consists, in order, of a bank of one-dimensional convolution filters, a residual connection, a multi-layer highway network, and a bidirectional gated recurrent unit (GRU) network. The one-dimensional convolution filter bank is a convolution layer made up of m one-dimensional filters of different sizes, with kernel widths 1, 2, 3, ..., m. The residual connection alleviates the vanishing-gradient problem caused by very deep networks and ensures that not too much of the earlier input information is lost after many convolution layers. The highway network mitigates the overfitting caused by deepening the network and reduces the difficulty of training deeper networks. Finally, the GRU produces a bidirectionally extracted feature sequence.
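A compact PyTorch sketch of this CBHG structure is given below; the 128-dimensional channel width, the kernel-size range 1..K with K = 8, and the number of highway layers are assumptions, and the max pooling and batch normalization of the full CBHG are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Highway(nn.Module):
    """One highway layer: y = t * H(x) + (1 - t) * x, with a learned gate t."""
    def __init__(self, dim):
        super().__init__()
        self.H = nn.Linear(dim, dim)
        self.T = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.T(x))
        return t * F.relu(self.H(x)) + (1.0 - t) * x

class CBHG(nn.Module):
    """1-D convolution bank (kernel sizes 1..K) -> projection -> residual connection
    -> highway layers -> bidirectional GRU, following the order described above."""
    def __init__(self, dim=128, K=8, num_highway=4):
        super().__init__()
        self.bank = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=k, padding=k // 2) for k in range(1, K + 1)])
        self.proj = nn.Conv1d(K * dim, dim, kernel_size=3, padding=1)
        self.highways = nn.ModuleList([Highway(dim) for _ in range(num_highway)])
        self.gru = nn.GRU(dim, dim, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, dim)
        y = x.transpose(1, 2)                    # (batch, dim, time) for Conv1d
        bank = torch.cat([F.relu(c(y))[:, :, :y.size(2)] for c in self.bank], dim=1)
        y = self.proj(bank).transpose(1, 2)      # back to (batch, time, dim)
        y = y + x                                # residual connection
        for hw in self.highways:
            y = hw(y)
        out, _ = self.gru(y)                     # bidirectionally extracted feature sequence
        return out                               # (batch, time, 2 * dim)

encoded = CBHG()(torch.randn(1, 32, 128))       # e.g. Pre-Net output for 32 characters
```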
b) Constructing a position-sensitive attention mechanism. The position-sensitive attention mechanism considers both the content and the position of the input phonemes and allows the attention weights accumulated over previous decoding steps to be used as an additional feature, so that the model stays consistent as it advances along the input sequence and problems such as omitted or repeated subsequences during decoding are reduced.
The location features of the position-sensitive attention mechanism are obtained by convolution with 32 one-dimensional convolution kernels of length 31; after the encoded sequence output by step a) and the location features are projected onto a 128-dimensional hidden representation, the attention weights and hence the attention context vector are obtained.
c) Constructing a decoder: the decoder is an autoregressive recurrent neural network that predicts a spectrogram from the encoded sequence output by the encoder, one frame at a time; the spectrum frame predicted at the previous step is first passed into a two-layer fully connected preprocessing network (pre-net) with 256 hidden ReLU units per layer.
The pre-net output is concatenated with the attention context vector and fed into a two-layer stack of unidirectional recurrent layers with 1024 units each; the output of this network is concatenated with the attention context vector again, and the target spectrum frame is then predicted through a linear projection.
The predicted target spectrum frame is then passed through a 5-layer convolutional network that predicts a residual, which is added back onto the spectrum frame before convolution so as to improve the overall spectrum reconstruction. Each layer of the network consists of 512 convolution kernels of size 5 x 1 followed by batch normalization, and every batch normalization except that of the last convolution layer is followed by a tanh activation function.
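The 5-layer residual convolution network described in the preceding paragraph can be sketched as follows; the 512 kernels of size 5 x 1, the per-layer batch normalization and the tanh activations (omitted after the last layer) follow the text, while the 80-band Mel dimension is an assumption.

```python
import torch
import torch.nn as nn

class PostConvNet(nn.Module):
    """Five 1-D convolution layers (512 channels, kernel size 5) with batch normalization;
    tanh after every layer except the last. The output is a residual added back onto the
    spectrum frames that entered the network."""
    def __init__(self, n_mels=80, channels=512, kernel=5, layers=5):
        super().__init__()
        blocks = []
        for i in range(layers):
            in_ch = n_mels if i == 0 else channels
            out_ch = n_mels if i == layers - 1 else channels
            blocks += [nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
                       nn.BatchNorm1d(out_ch)]
            if i != layers - 1:
                blocks.append(nn.Tanh())
        self.net = nn.Sequential(*blocks)

    def forward(self, frames):                   # frames: (batch, n_mels, time)
        return frames + self.net(frames)         # residual superimposed on the input frames

refined = PostConvNet()(torch.randn(2, 80, 100))
```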
In parallel with spectrum frame prediction, the decoder output is concatenated with the attention context vector, projected to a scalar and passed to the sigmoid activation function to predict the probability that the output sequence has finished.
When this probability is greater than or equal to a preset stop threshold, prediction is considered finished and the next step is carried out.
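The stop-token branch just described can be sketched as follows; the layer dimensions and the 0.5 threshold are assumptions (the text only requires a preset threshold).

```python
import torch
import torch.nn as nn

rnn_dim, context_dim, stop_threshold = 1024, 256, 0.5     # assumed dimensions and threshold

# Decoder output and attention context are concatenated, projected to a scalar and
# squashed by a sigmoid to give the probability that the output sequence has finished.
stop_proj = nn.Linear(rnn_dim + context_dim, 1)
decoder_out = torch.randn(1, rnn_dim)
context = torch.randn(1, context_dim)
p_stop = torch.sigmoid(stop_proj(torch.cat([decoder_out, context], dim=-1)))
finished = p_stop.item() >= stop_threshold                 # True once prediction should stop
```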
convolutional layers in the network are regularized using dropout with a probability of 0.5, and LSTM layers are regularized using zoneout with a probability of 0.1. To bring some variation to the output result at the time of inference, dropout with a probability of 0.5 is applied only to pre-net of the autoregressive decoder.
The model of the invention uses more compact building blocks, namely ordinary LSTM and convolutional layers, and outputs only a single spectrum frame per decoding step.
After the position-sensitive attention mechanism is constructed in S20, the decoder output is calculated as follows:
the energy calculation of the position sensitive attention mechanism is as in formula (4):
e_{i,j} = v_a^T·tanh(W·s_i + V·h_j + U·f_{i,j} + b)   (4)
where s_i is the hidden state of the decoder recurrent neural network at time i, h_j is the j-th output of the encoder, f_{i,j} is the convolution output of the attention weights accumulated before time i, b is a bias value initialized to the zero vector, W, V and U denote the weight matrices of the different network layers, and v_a^T denotes the transpose of v_a;
the convolution output f_{i,j} is obtained by convolving the accumulated attention weights cα_{i-1} with the convolution kernel F, as in formulas (5) and (6):
f_i = F * cα_{i-1}   (5)
cα_{i-1} = Σ_{k=1}^{i-1} α_k   (6)
where α_k denotes the attention weights at decoding step k.
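A PyTorch sketch of formulas (4) to (6) follows: the accumulated attention weights are convolved with 32 kernels of length 31, all terms are projected into a 128-dimensional space, and the resulting energies are normalized into attention weights from which the context vector is formed. The encoder and decoder-state dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocationSensitiveAttention(nn.Module):
    """e_{i,j} = v_a^T tanh(W s_i + V h_j + U f_{i,j} + b), with f_i the convolution of the
    accumulated attention weights (32 kernels of length 31, projected to 128 dimensions)."""
    def __init__(self, query_dim=1024, enc_dim=256, attn_dim=128, n_filters=32, kernel=31):
        super().__init__()
        self.W = nn.Linear(query_dim, attn_dim, bias=False)
        self.V = nn.Linear(enc_dim, attn_dim, bias=False)
        self.U = nn.Linear(n_filters, attn_dim, bias=True)   # its bias plays the role of b
        self.loc_conv = nn.Conv1d(1, n_filters, kernel, padding=kernel // 2, bias=False)
        self.v_a = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, query, enc_out, cum_weights):
        # query: (batch, query_dim); enc_out: (batch, T, enc_dim); cum_weights: (batch, T)
        f = self.loc_conv(cum_weights.unsqueeze(1)).transpose(1, 2)       # (batch, T, n_filters)
        energy = self.v_a(torch.tanh(
            self.W(query).unsqueeze(1) + self.V(enc_out) + self.U(f))).squeeze(-1)
        alpha = F.softmax(energy, dim=-1)                                  # attention weights
        context = torch.bmm(alpha.unsqueeze(1), enc_out).squeeze(1)        # attention context vector
        return context, alpha

# The weights are accumulated across decoding steps, as in formula (6).
attn = LocationSensitiveAttention()
enc_out, cum = torch.randn(1, 32, 256), torch.zeros(1, 32)
context, alpha = attn(torch.randn(1, 1024), enc_out, cum)
cum = cum + alpha
```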
d) Post-processing network and waveform synthesis: the post-processing network consists of a CBHG module and a fully connected layer; the decoder output is converted into a linear spectrogram by the post-processing network, and the linear spectrogram is restored to a speech waveform by the Griffin-Lim algorithm and output.
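Restoring the waveform from the linear spectrogram with the Griffin-Lim algorithm can be sketched with librosa; the FFT size, hop length and iteration count are assumed settings, and the random spectrogram merely stands in for the post-processing network's output.

```python
import numpy as np
import librosa

n_fft, hop = 1024, 256                                               # assumed analysis settings
linear_spectrogram = np.abs(np.random.randn(n_fft // 2 + 1, 200))    # stand-in for the network output

# Griffin-Lim iteratively estimates the phase and inverts the magnitude spectrogram.
waveform = librosa.griffinlim(linear_spectrogram, n_iter=60,
                              hop_length=hop, win_length=n_fft)
```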
Finally, the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications or equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (4)

1. A speech synthesis method, comprising the steps of:
S10: extracting text features and acoustic features;
the text feature extraction module first performs character embedding on the input text data, i.e. represents each text character with a vector of fixed dimensionality, and then passes the result through the Pre-Net and CBHG sub-networks in turn to obtain the text feature data;
acoustic feature extraction: Mel spectra and linear spectra are used; the speech data are first pre-emphasized by passing the original audio signal through a high-pass filter, and a short-time Fourier transform is then applied to obtain the linear spectrum;
s20, fusing the extracted text feature data and the acoustic features, which comprises the following steps:
a) constructing an encoder: the encoder is the encoder of the Tacotron framework; the text feature data obtained in S10 are input to the encoder, which outputs an encoded sequence;
b) constructing a position-sensitive attention mechanism: its location features are obtained by convolution with 32 one-dimensional convolution kernels of length 31; after the encoded sequence output by step a) and the location features are projected onto a 128-dimensional hidden representation, the attention weights and hence the attention context vector are obtained;
c) constructing a decoder: the decoder is an autoregressive recurrent neural network that predicts a spectrogram from the encoded sequence output by the encoder, one frame at a time; the spectrum frame predicted at the previous step is first passed into a two-layer fully connected preprocessing network (pre-net) with 256 hidden ReLU units per layer;
the pre-net output is concatenated with the attention context vector and fed into a two-layer stack of unidirectional recurrent layers with 1024 units each; the output of this network is concatenated with the attention context vector again, and the target spectrum frame is predicted through a linear projection;
the predicted target spectrum frame is passed through a 5-layer convolutional network that predicts a residual, which is added back onto the spectrum frame before convolution; each layer of the network consists of 512 convolution kernels of size 5 x 1 followed by batch normalization, and every layer except the last convolution layer is followed by a tanh activation function after the batch normalization;
in parallel with spectrum frame prediction, the decoder output is concatenated with the attention context vector, projected to a scalar and passed to a sigmoid activation function to predict the probability that the output sequence has finished;
when this probability is greater than or equal to a preset stop threshold, prediction is considered finished and the next step is carried out;
d) post-processing network and waveform synthesis: the post-processing network consists of a CBHG module and a fully connected layer; the decoder output is converted into a linear spectrogram by the post-processing network, and the linear spectrogram is restored to a speech waveform by the Griffin-Lim algorithm and output.
2. The speech synthesis method of claim 1, wherein: the specific method for extracting the acoustic features in the step S10 is as follows:
1) the original audio signal is passed through a high-pass filter to obtain pre-emphasized speech data, using formula (1):
H(z) = 1 - μ·z^(-1)   (1)
where H(z) is the pre-emphasis filter applied to the speech samples, the term 1 corresponds to the sample at the current instant, z^(-1) corresponds to the sample at the previous instant, and μ is the pre-emphasis coefficient;
2) a short-time Fourier transform is then applied to the speech data obtained from formula (1) to obtain the linear spectrum, as in formula (2):
X(τ, f) = ∫ z(t)·g(t - τ)·e^(-j2πft) dt   (2)
where z(t) is the source signal, i.e. the pre-emphasized speech output by the filter H(z), g(t) is a window function, and f is the frequency of the linear spectrum;
3) the linear spectrum is processed with a Mel filter bank to obtain the Mel spectrum, as in formula (3):
mel(f) = 2595·log10(1 + f/700)   (3)
where f is the frequency of the linear spectrum.
3. The speech synthesis method of claim 1, wherein: the encoder in S20 consists of a Pre-net preprocessing network and a CBHG module, and the CBHG module consists, in order, of a one-dimensional convolution filter bank, a residual connection, a multi-layer highway network and a bidirectional gated recurrent unit (GRU) network.
4. The speech synthesis method of claim 1, wherein: after the position-sensitive attention mechanism is constructed in S20, the decoder output is calculated as follows:
the energy calculation of the position sensitive attention mechanism is as in formula (4):
e_{i,j} = v_a^T·tanh(W·s_i + V·h_j + U·f_{i,j} + b)   (4)
where s_i is the hidden state of the decoder recurrent neural network at time i, h_j is the j-th output of the encoder, f_{i,j} is the convolution output of the attention weights accumulated before time i, b is a bias value initialized to the zero vector, W, V and U denote the weight matrices of the different network layers, and v_a^T denotes the transpose of v_a;
the convolution output f_{i,j} is obtained by convolving the accumulated attention weights cα_{i-1} with the convolution kernel F, as in formulas (5) and (6):
f_i = F * cα_{i-1}   (5)
cα_{i-1} = Σ_{k=1}^{i-1} α_k   (6)
where α_k denotes the attention weights at decoding step k.
CN202011374257.5A 2020-11-30 2020-11-30 Speech synthesis method Pending CN112489616A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011374257.5A CN112489616A (en) 2020-11-30 2020-11-30 Speech synthesis method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011374257.5A CN112489616A (en) 2020-11-30 2020-11-30 Speech synthesis method

Publications (1)

Publication Number Publication Date
CN112489616A 2021-03-12

Family

ID=74937322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011374257.5A Pending CN112489616A (en) 2020-11-30 2020-11-30 Speech synthesis method

Country Status (1)

Country Link
CN (1) CN112489616A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113270086A (en) * 2021-07-19 2021-08-17 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method
CN113806543A (en) * 2021-09-22 2021-12-17 三峡大学 Residual jump connection-based text classification method for gated cyclic unit
CN115588437A (en) * 2022-12-13 2023-01-10 南方电网数字电网研究院有限公司 Speech enhancement method, apparatus, device and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020072759A1 (en) * 2018-10-03 2020-04-09 Visteon Global Technologies, Inc. A voice assistant system for a vehicle cockpit system
CN111028824A (en) * 2019-12-13 2020-04-17 厦门大学 Method and device for synthesizing Minnan
CN111489754A (en) * 2019-01-28 2020-08-04 国家电网有限公司客户服务中心 Telephone traffic data analysis method based on intelligent voice technology

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020072759A1 (en) * 2018-10-03 2020-04-09 Visteon Global Technologies, Inc. A voice assistant system for a vehicle cockpit system
CN111489754A (en) * 2019-01-28 2020-08-04 国家电网有限公司客户服务中心 Telephone traffic data analysis method based on intelligent voice technology
CN111028824A (en) * 2019-12-13 2020-04-17 厦门大学 Method and device for synthesizing Minnan

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JONATHAN SHEN ET AL: "NATURAL TTS SYNTHESIS BY CONDITIONING WAVENET ON MEL SPECTROGRAM PREDICTIONS", 《ICASSP 2018》 *
YUXUAN WANG ET AL: "TACOTRON: A FULLY END-TO-END TEXT-TO-SPEECH SYNTHESIS MODEL", 《ARXIV.ORG》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205793A (en) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113205793B (en) * 2021-04-30 2022-05-31 北京有竹居网络技术有限公司 Audio generation method and device, storage medium and electronic equipment
CN113409759A (en) * 2021-07-07 2021-09-17 浙江工业大学 End-to-end real-time speech synthesis method
CN113270086A (en) * 2021-07-19 2021-08-17 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance
CN113270086B (en) * 2021-07-19 2021-10-15 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance
US11488586B1 (en) 2021-07-19 2022-11-01 Institute Of Automation, Chinese Academy Of Sciences System for speech recognition text enhancement fusing multi-modal semantic invariance
CN113806543A (en) * 2021-09-22 2021-12-17 三峡大学 Residual jump connection-based text classification method for gated cyclic unit
CN113806543B (en) * 2021-09-22 2023-05-30 三峡大学 Text classification method of gate control circulation unit based on residual jump connection
CN115588437A (en) * 2022-12-13 2023-01-10 南方电网数字电网研究院有限公司 Speech enhancement method, apparatus, device and storage medium

Similar Documents

Publication Publication Date Title
CN112489616A (en) Speech synthesis method
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN109671442B (en) Many-to-many speaker conversion method based on STARGAN and x vectors
CN110189749A (en) Voice keyword automatic identifying method
CN109979429A (en) A kind of method and system of TTS
CN111312245B (en) Voice response method, device and storage medium
CN111312228A (en) End-to-end-based voice navigation method applied to electric power enterprise customer service
CN113450765A (en) Speech synthesis method, apparatus, device and storage medium
CN116364055A (en) Speech generation method, device, equipment and medium based on pre-training language model
CN112489623A (en) Language identification model training method, language identification method and related equipment
Hwang et al. Improving lpcnet-based text-to-speech with linear prediction-structured mixture density network
CN113362804B (en) Method, device, terminal and storage medium for synthesizing voice
CN111583965A (en) Voice emotion recognition method, device, equipment and storage medium
CN113488029A (en) Non-autoregressive speech recognition training decoding method and system based on parameter sharing
CN113611281A (en) Voice synthesis method and device, electronic equipment and storage medium
CN116863920A (en) Voice recognition method, device, equipment and medium based on double-flow self-supervision network
CN114626424B (en) Data enhancement-based silent speech recognition method and device
CN115206284B (en) Model training method, device, server and medium
CN115273829A (en) Vietnamese-to-English voice-to-text translation method based on multi-feature fusion
Zhao et al. Research on voice cloning with a few samples
CN116580694A (en) Audio challenge sample generation method, device, equipment and storage medium
CN114974206A (en) Unconstrained lip language-to-speech synthesis method, system and storage medium
CN115019785A (en) Streaming voice recognition method and device, electronic equipment and storage medium
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113838449A (en) Novel Mongolian speech synthesis method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20210420

Address after: 100031 Xicheng District West Chang'an Avenue, No. 86, Beijing

Applicant after: STATE GRID CORPORATION OF CHINA

Applicant after: STATE GRID CHONGQING ELECTRIC POWER Co.

Applicant after: MATERIALS BRANCH OF STATE GRID CHONGQING ELECTRIC POWER Co.

Address before: No.20 Qingfeng North Road, Yubei District, Chongqing

Applicant before: MATERIALS BRANCH OF STATE GRID CHONGQING ELECTRIC POWER Co.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20210312