CN110232907B - Voice synthesis method and device, readable storage medium and computing equipment - Google Patents
- Publication number
- CN110232907B (application CN201910670564.9A)
- Authority
- CN
- China
- Prior art keywords
- speaker
- embedded vector
- text
- voice
- inputting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/012—Comfort noise or silence coding
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/16—Vocoder architecture
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Abstract
Embodiments of the disclosure provide a speech synthesis method and device, a readable storage medium, and a computing device, which can realize speech synthesis using general-quality Chinese speech. The method comprises the following steps: acquiring a voice sequence and a text sequence for speech synthesis; inputting the voice sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter; and synthesizing speech from the first spectrum parameter.
Description
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech synthesis method, an apparatus, a readable storage medium, and a computing device.
Background
To train a high-quality speech synthesis model, existing Chinese speech synthesis systems need high-quality speech recorded in a professional recording studio. Two schemes are mainly adopted:
recording 10-20 hours of high-quality data from a single speaker, manually annotating it (pinyin, prosody, and segmentation), and then training a model to obtain a complete Text-To-Speech (TTS) system;
recording 10-20 hours of high-quality data from each of 10-20 speakers, manually annotating it in the same way, and training a multi-speaker model to obtain a complete TTS system.
Both schemes depend on high-grade recording equipment, a high-standard recording environment, and fine-grained manual annotation, and a problem in any of these links degrades the model. If the recording equipment or environment is flawed, the generated speech is very noisy; if the annotation is flawed, model training does not converge and the acceptance standard is not met. In addition, recording high-quality speech takes a long time and a large amount of money, so building a high-quality TTS model with either scheme has clear disadvantages.
Disclosure of Invention
To this end, the present disclosure provides a speech synthesis method, apparatus, readable storage medium and computing device in an attempt to solve or at least alleviate at least one of the problems identified above.
According to an aspect of an embodiment of the present disclosure, there is provided a speech synthesis method including:
acquiring a voice sequence and a text sequence for voice synthesis;
inputting the voice sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
speech is synthesized from the first spectral parameters.
Optionally, inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectral parameter, including:
according to a pre-trained neural network, inputting a voice sequence into a speaker encoder to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting a text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter.
Optionally, synthesizing speech from the first spectral parameters comprises:
and synthesizing speech from the first spectrum parameters by using a pre-trained neural vocoder (Neural Vocoder) model.
Optionally, training the neural network comprises:
acquiring training voice sequences and training text sequences which correspond to each other one by one;
inputting the training voice sequence into a speaker encoder to obtain a second speaker embedded vector;
inputting the training voice sequence into a residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into a text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
Optionally, training the neural network further comprises:
and embedding the second speaker into the vector input speaker recognition classifier for classification.
Optionally, training the neural network further comprises:
and performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, training the neural network further comprises:
and inputting the second speaker embedded vector into a speech background classifier for classification.
Optionally, training the neural network further comprises:
and calculating the posterior probability of the voice background for correcting the neural network parameters.
According to still another aspect of an embodiment of the present disclosure, there is provided a speech synthesis apparatus including:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
the neural network processing unit is used for inputting the voice sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
a speech synthesis unit for synthesizing speech according to the first spectral parameters.
Optionally, the neural network processing unit is specifically configured to:
inputting a voice sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting a text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter.
optionally, the speech synthesis unit is specifically configured to:
and synthesizing speech from the first spectrum parameters by using a pre-trained neural vocoder (Neural Vocoder) model.
Optionally, the method further comprises a neural network training unit, configured to:
acquiring training voice sequences and training text sequences which correspond to each other one by one;
inputting the training voice sequence into a speaker encoder to obtain a second speaker embedded vector;
inputting the training voice sequence into a residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into a text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit is further configured to:
and embedding the second speaker into the vector input speaker recognition classifier for classification.
Optionally, the neural network training unit is further configured to:
and performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit is further configured to:
and inputting the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit is further configured to:
and calculating the posterior probability of the voice background for correcting the neural network parameters.
According to yet another aspect of an embodiment of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the operations included in the above-mentioned method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform operations included in the above-described methods.
According to the technical scheme provided by the embodiments of the disclosure, a voice sequence and a text sequence for speech synthesis are acquired, the voice sequence and the text sequence are input into a pre-trained neural network to obtain a first spectrum parameter, and speech is synthesized according to the first spectrum parameter. The scheme needs only ordinary-quality Chinese audio, such as speech recorded on a mobile phone or the voice of customer service personnel; it can automatically separate speaker information from background noise and still synthesize high-quality speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flow diagram of a method of speech synthesis according to an embodiment of the present disclosure;
FIG. 3 is a flow chart diagram of a neural network training method in accordance with an embodiment of the present disclosure;
FIG. 4 is yet another flow chart diagram of a neural network training method in accordance with an embodiment of the present disclosure;
fig. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 is a block diagram of an example computing device 100 arranged to implement a speech synthesis method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the program 122 can be configured to execute instructions on an operating system by one or more processors 104 using program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. The example output device 142 includes a graphics processing unit 148 and an audio processing unit 150. They may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more a/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
Among other things, one or more programs 122 of computing device 100 include instructions for performing a method of speech synthesis according to the present disclosure.
Fig. 2 illustrates a flow diagram of a speech synthesis method 200 according to an embodiment of the present disclosure, the speech synthesis method 200 starting at step S210.
S210, acquiring a voice sequence and a text sequence for voice synthesis;
S220, inputting the voice sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
S230, synthesizing speech according to the first spectrum parameter.
In step S210, the speech sequence used for speech synthesis may be a newly recorded speech sequence of a certain speaker, or may be selected from the training speech sequences used to train the neural network. The text sequence used for speech synthesis may be any text sequence entered by the user.
In step S220, the specific neural network data processing process includes:
according to a pre-trained neural network, inputting a voice sequence into a speaker encoder to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting a text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter.
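To make the data flow concrete, here is a minimal PyTorch sketch of this forward pass. The module choices (GRU encoders, an embedding table for the text, the 256-unit width) are illustrative assumptions rather than the patent's actual implementation, and the text-to-frame alignment mechanism (attention or duration prediction) is omitted for brevity.

```python
import torch
import torch.nn as nn

class SpeechSynthesisNet(nn.Module):
    """Sketch of the described network: two speech encoders, a text encoder,
    and a decoder that consumes all three embeddings."""
    def __init__(self, n_phonemes=100, n_mels=80, d_model=256):
        super().__init__()
        self.speaker_encoder = nn.GRU(n_mels, d_model, batch_first=True)   # -> speaker embedded vector
        self.residual_encoder = nn.GRU(n_mels, d_model, batch_first=True)  # -> residual embedded vector
        self.text_encoder = nn.Embedding(n_phonemes, d_model)              # -> text embedded vector
        self.decoder = nn.GRU(3 * d_model, d_model, batch_first=True)
        self.mel_proj = nn.Linear(d_model, n_mels)                         # -> spectrum parameters

    def forward(self, speech, text):
        _, spk = self.speaker_encoder(speech)   # speech: (B, frames, n_mels); spk: (1, B, d)
        _, res = self.residual_encoder(speech)
        txt = self.text_encoder(text)           # text: (B, T) phoneme IDs
        t = txt.size(1)
        # broadcast the two utterance-level embeddings over the text time axis
        cond = torch.cat([txt,
                          spk.transpose(0, 1).expand(-1, t, -1),
                          res.transpose(0, 1).expand(-1, t, -1)], dim=-1)
        out, _ = self.decoder(cond)
        return self.mel_proj(out)               # first spectrum parameters
```

For example, `SpeechSynthesisNet()(torch.randn(2, 120, 80), torch.randint(0, 100, (2, 30)))` returns a `(2, 30, 80)` tensor of spectrum frames; a real system would decode many more frames per phoneme.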
In step S220, the neural network is trained in advance, as shown in fig. 3, the training process includes the following steps:
S310, acquiring training voice sequences and training text sequences which correspond to one another one by one;
S320, inputting the training voice sequence into a speaker encoder to obtain a second speaker embedded vector;
S330, inputting the training voice sequence into a residual encoder to obtain a second residual embedded vector;
S340, inputting the training text sequence into a text encoder to obtain a second text embedded vector;
S350, inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
The training data may be low quality voice data, including voice data of multiple speakers, and may include noisy data and clean (non-noisy) data.
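A hedged sketch of one training iteration over such paired data follows, reusing the SpeechSynthesisNet sketch above. The L1 spectrum loss, the Adam optimizer, and the assumption that the target spectrum is already aligned with the text length are illustrative choices not specified by the patent.

```python
import torch
import torch.nn.functional as F

model = SpeechSynthesisNet()                 # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(train_speech, train_text, target_mel):
    # steps S320-S350: encode speech and text, decode to spectrum parameters
    pred_mel = model(train_speech, train_text)   # (B, T, n_mels), assumed aligned with target_mel
    # pull the second spectrum parameters toward the ground-truth spectrum
    loss = F.l1_loss(pred_mel, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```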
Optionally, in step S230, speech is synthesized from the first spectrum parameters using a pre-trained neural vocoder (Neural Vocoder) model. The neural vocoder is a separately trained convolutional neural network model; it abandons the traditional vocoder scheme and directly models waveform sampling points, which gives the synthesized sound high fidelity.
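The patent does not disclose a concrete vocoder architecture. Purely to illustrate the interface of a model that maps spectrum parameters directly to sampling points, the following toy sketch assumes a small convolutional network producing a fixed block of samples per mel frame; real neural vocoders are far more elaborate.

```python
import torch
import torch.nn as nn

class NeuralVocoderSketch(nn.Module):
    """Toy stand-in for a neural vocoder: mel frames in, waveform samples out.
    Only the interface is meaningful here, not the architecture."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 512, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(512, hop, kernel_size=3, padding=1),  # hop samples per frame
        )

    def forward(self, mel):                        # mel: (B, T, n_mels)
        x = self.net(mel.transpose(1, 2))          # (B, hop, T)
        return x.transpose(1, 2).reshape(mel.size(0), -1)  # (B, T*hop) samples

wave = NeuralVocoderSketch()(torch.randn(1, 100, 80))  # 100 frames -> 25,600 samples
```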
Optionally, in the neural network training process, the performance of the neural network is improved by adversarial factor decomposition, which specifically includes:
inputting the second speaker embedded vector into a speaker recognition classifier for classification, thereby distinguishing the information of different speakers.
Optionally, the method further comprises:
and performing gradient inversion processing on the second speaker embedded vector, so that the speaker embedded vector only contains speaker information, such as gender, identification ID and the like.
Optionally, the method further comprises:
the second speaker is embedded into the vector input speech background classifier for classification, thereby distinguishing clean and noisy speech.
Optionally, the method further comprises:
after the speech background is classified, the posterior probability of the speech background is calculated and used for correcting the neural network parameters.
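Putting the optional pieces together, here is a hedged sketch of the combined training objective: spectrum reconstruction plus the speaker recognition classifier on the speaker embedding, plus the speech background classifier behind the gradient reversal layer sketched above. The classifier shapes, the assumed 20 training speakers, and the 0.1 loss weights are illustrative, not values from the patent.

```python
import torch
import torch.nn.functional as F

# in real training these classifier parameters would join the optimizer too
speaker_clf = torch.nn.Linear(256, 20)     # assume 20 training speakers
background_clf = torch.nn.Linear(256, 2)   # clean vs. noisy background

def total_loss(pred_mel, target_mel, spk_emb, spk_id, bg_label):
    # spk_emb: (B, 256) second speaker embedded vector
    recon = F.l1_loss(pred_mel, target_mel)
    # speaker recognition classifier: keeps speaker identity in the embedding
    spk_loss = F.cross_entropy(speaker_clf(spk_emb), spk_id)
    # background classifier behind the GRL: the posterior over clean/noisy
    # (softmax of these logits) corrects the encoder parameters so the
    # embedding drops background cues
    bg_loss = F.cross_entropy(background_clf(grad_reverse(spk_emb)), bg_label)
    return recon + 0.1 * spk_loss + 0.1 * bg_loss
```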
To further illustrate the concepts of the present disclosure, a specific embodiment of the present disclosure is provided below in conjunction with FIG. 4.
In a specific embodiment of the present disclosure, an end-to-end neural network architecture is employed.
1. A data preparation phase.
About 50 hours of general speech recognition data (i.e., speech data of ordinary quality rather than studio quality) are prepared, including speech data of multiple speakers, with both noisy and noise-free data.
2. The training stage.
a. Text and speech pairs are input one-to-one into the following modules and used for training.
b. A voice coding module.
i. The speech sequence is processed by a speaker Encoder (Encoder) to obtain an Embedding vector (Embedding) of the speaker.
ii. The voice sequence is processed by a residual Encoder to obtain a residual Embedding that carries no speaker information but contains acoustic characteristics such as environmental noise and emotion.
c. An adversarial factorization module.
i. The speaker Embedding is processed by a speaker classifier, so that information of different speakers is distinguished.
ii. The speaker Embedding is also processed by a speech background classifier, which aims to distinguish clean speech from noisy speech. Gradient inversion is applied here, so that the speaker Embedding contains only speaker information, such as gender and speaker ID.
d. A synthesizer module.
i. The text sequence passes through a text Encoder to obtain a text Embedding containing text information.
ii. The text Embedding, speaker Embedding, and residual Embedding are input together into a Decoder to obtain the spectrum parameters.
3. The generation stage.
a. Given arbitrary text and the voice of a reference speaker.
b. The reference speaker's voice passes through the speaker Encoder to obtain the speaker Embedding.
c. The reference speaker's voice passes through the residual Encoder to obtain the residual Embedding.
d. The input text passes through the text Encoder to obtain the text Embedding.
e. The text Embedding, the speaker Embedding, and the residual Embedding are input together into the Decoder to obtain the final spectrum parameters.
f. Finally, the spectrum parameters are processed by the Neural Vocoder to obtain speech.
The resulting speech matches the reference speaker's voice, and its emotion and background sounds also match the reference speech: if the reference speech is noisy, the synthesized speech is noisy; if the reference speech is clean, the synthesized speech is clean.
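Under the interfaces assumed in the earlier sketches, the generation stage (steps a-f) reduces to a few lines; the reference speech and text below are random placeholders.

```python
import torch

model = SpeechSynthesisNet().eval()        # synthesizer from the earlier sketch
vocoder = NeuralVocoderSketch().eval()     # vocoder stand-in from the earlier sketch

reference_speech = torch.randn(1, 120, 80)     # mel frames of the reference speaker
text = torch.randint(0, 100, (1, 30))          # arbitrary phoneme sequence

with torch.no_grad():
    mel = model(reference_speech, text)        # steps b-e: embeddings -> spectrum parameters
    wave = vocoder(mel)                        # step f: spectrum parameters -> waveform
```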
The specific embodiment of the disclosure achieves the following technical effects:
1. a scheme for carrying out voice synthesis by using general Chinese voice is provided, so that the available data volume is greatly increased, and the model quality is improved.
2. The use of universal Chinese speech recognition data, rather than high quality recordings recorded by professional recording equipment, greatly reduces the cost of recording and labeling.
3. A gradient inversion module is added to the model, which effectively separates speaker information from other speech characteristics.
4. The Neural Vocoder is introduced into the model, so that the quality of the synthesized voice can be improved.
As shown in fig. 5, the present disclosure also provides a speech synthesis apparatus, including:
a data acquisition unit 510 for acquiring a speech sequence and a text sequence for speech synthesis;
the neural network processing unit 520 is configured to input the voice sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
a speech synthesis unit 530 for synthesizing speech according to the first spectral parameters.
Optionally, the neural network processing unit 520 is specifically configured to:
inputting a voice sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting a text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter.
optionally, the speech synthesis unit 530 is specifically configured to:
and synthesizing speech from the first spectrum parameters by using a pre-trained neural vocoder (Neural Vocoder) model.
Optionally, a neural network training unit 540 is further included, configured to:
acquiring training voice sequences and training text sequences which correspond to each other one by one;
inputting the training voice sequence into a speaker encoder to obtain a second speaker embedded vector;
inputting the training voice sequence into a residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into a text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit 540 is further configured to:
and embedding the second speaker into the vector input speaker recognition classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
and performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit 540 is further configured to:
and inputting the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
and calculating the posterior probability of the voice background for correcting the neural network parameters.
For the specific limitations of the speech synthesis apparatus, reference may be made to the above limitations of the speech synthesis method, which are not described herein again.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that is, the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.
Claims (9)
1. A method of speech synthesis, comprising:
acquiring a voice sequence and a text sequence for voice synthesis;
inputting the voice sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting the text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter; the first residual embedding vector at least comprises the acoustic characteristics of emotion and environmental noise;
and synthesizing voice according to the first spectrum parameter.
2. The method of claim 1, wherein synthesizing speech from the first spectral parameters comprises:
and synthesizing speech from the first spectrum parameter by using a pre-trained neural vocoder (Neural Vocoder) model.
3. The method of claim 1, wherein training the neural network comprises:
acquiring training voice sequences and training text sequences which correspond to each other one by one;
inputting the training voice sequence into the speaker encoder to obtain a second speaker embedded vector;
inputting the training voice sequence into the residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into the text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
4. The method of claim 1, wherein training the neural network further comprises:
and embedding the second speaker into the vector input speaker recognition classifier for classification.
5. The method of claim 1, wherein training the neural network further comprises:
and performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
6. The method of claim 1, wherein training the neural network further comprises:
and inputting the second speaker embedded vector into a speech background classifier for classification.
7. A speech synthesis apparatus, comprising:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
the neural network processing unit is used for inputting the voice sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, inputting the voice sequence into a residual encoder to obtain a first residual embedded vector, inputting the text sequence into a text encoder to obtain a first text embedded vector, and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter; the first residual embedding vector at least comprises the acoustic characteristics of emotion and environmental noise;
and the voice synthesis unit is used for synthesizing voice according to the first spectrum parameter.
8. A readable storage medium having executable instructions thereon that, when executed, cause a computer to perform the operations included in any one of claims 1-6.
9. A computing device, comprising:
a processor; and
a memory storing executable instructions that, when executed, cause the processor to perform the operations included in any one of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910670564.9A CN110232907B (en) | 2019-07-24 | 2019-07-24 | Voice synthesis method and device, readable storage medium and computing equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910670564.9A CN110232907B (en) | 2019-07-24 | 2019-07-24 | Voice synthesis method and device, readable storage medium and computing equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110232907A CN110232907A (en) | 2019-09-13 |
CN110232907B true CN110232907B (en) | 2021-11-02 |
Family
ID=67856002
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910670564.9A Active CN110232907B (en) | 2019-07-24 | 2019-07-24 | Voice synthesis method and device, readable storage medium and computing equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110232907B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110473516B (en) * | 2019-09-19 | 2020-11-27 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and device and electronic equipment |
CN112802443B (en) * | 2019-11-14 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Speech synthesis method and device, electronic equipment and computer readable storage medium |
CN113066472B (en) * | 2019-12-13 | 2024-05-31 | 科大讯飞股份有限公司 | Synthetic voice processing method and related device |
CN111369968B (en) * | 2020-03-19 | 2023-10-13 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device, readable medium and electronic equipment |
CN111862931B (en) * | 2020-05-08 | 2024-09-24 | 北京嘀嘀无限科技发展有限公司 | Voice generation method and device |
CN112750419B (en) * | 2020-12-31 | 2024-02-13 | 科大讯飞股份有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN112786012B (en) * | 2020-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN117174074A (en) * | 2023-11-01 | 2023-12-05 | 浙江同花顺智能科技有限公司 | Speech synthesis method, device, equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102063899A (en) * | 2010-10-27 | 2011-05-18 | 南京邮电大学 | Method for voice conversion under unparallel text condition |
CN103377651A (en) * | 2012-04-28 | 2013-10-30 | 北京三星通信技术研究有限公司 | Device and method for automatic voice synthesis |
CN104252861A (en) * | 2014-09-11 | 2014-12-31 | 百度在线网络技术(北京)有限公司 | Video voice conversion method, video voice conversion device and server |
CN105096932A (en) * | 2015-07-14 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice synthesis method and apparatus of talking book |
CN105304079A (en) * | 2015-09-14 | 2016-02-03 | 上海可言信息技术有限公司 | Multi-party call multi-mode speech synthesis method and system |
CN105355193A (en) * | 2015-10-30 | 2016-02-24 | 百度在线网络技术(北京)有限公司 | Speech synthesis method and device |
CN109461435A (en) * | 2018-11-19 | 2019-03-12 | 北京光年无限科技有限公司 | A kind of phoneme synthesizing method and device towards intelligent robot |
CN109754779A (en) * | 2019-01-14 | 2019-05-14 | 出门问问信息科技有限公司 | Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2003255993A (en) * | 2002-03-04 | 2003-09-10 | Ntt Docomo Inc | System, method, and program for speech recognition, and system, method, and program for speech synthesis |
CN101281744B (en) * | 2007-04-04 | 2011-07-06 | 纽昂斯通讯公司 | Method and apparatus for analyzing and synthesizing voice |
US8639502B1 (en) * | 2009-02-16 | 2014-01-28 | Arrowhead Center, Inc. | Speaker model-based speech enhancement system |
CN107564543B (en) * | 2017-09-13 | 2020-06-26 | 苏州大学 | Voice feature extraction method with high emotion distinguishing degree |
CN108831435B (en) * | 2018-06-06 | 2020-10-16 | 安徽继远软件有限公司 | Emotional voice synthesis method based on multi-emotion speaker self-adaption |
2019-07-24: Application CN201910670564.9A filed in China; granted as patent CN110232907B (status: Active).
Also Published As
Publication number | Publication date |
---|---|
CN110232907A (en) | 2019-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110232907B (en) | Voice synthesis method and device, readable storage medium and computing equipment | |
CN110379407B (en) | Adaptive speech synthesis method, device, readable storage medium and computing equipment | |
CN106486130B (en) | Noise elimination and voice recognition method and device | |
US8386256B2 (en) | Method, apparatus and computer program product for providing real glottal pulses in HMM-based text-to-speech synthesis | |
CN108492818B (en) | Text-to-speech conversion method and device and computer equipment | |
US8131550B2 (en) | Method, apparatus and computer program product for providing improved voice conversion | |
CN110379415B (en) | Training method of domain adaptive acoustic model | |
CN107240401B (en) | Tone conversion method and computing device | |
CN104205215B (en) | Automatic real-time verbal therapy | |
CN113053357B (en) | Speech synthesis method, apparatus, device and computer readable storage medium | |
CN107705782B (en) | Method and device for determining phoneme pronunciation duration | |
CN111435592B (en) | Voice recognition method and device and terminal equipment | |
CN112185363B (en) | Audio processing method and device | |
CN110379414B (en) | Acoustic model enhancement training method and device, readable storage medium and computing equipment | |
US20130332171A1 (en) | Bandwidth Extension via Constrained Synthesis | |
CN111681639B (en) | Multi-speaker voice synthesis method, device and computing equipment | |
CN113724683B (en) | Audio generation method, computer device and computer readable storage medium | |
US20170364516A1 (en) | Linguistic model selection for adaptive automatic speech recognition | |
US11574622B2 (en) | Joint automatic speech recognition and text to speech conversion using adversarial neural networks | |
US20120109654A1 (en) | Methods and apparatuses for facilitating speech synthesis | |
CN114025235A (en) | Video generation method and device, electronic equipment and storage medium | |
WO2002059877A1 (en) | Data processing device | |
CN117012177A (en) | Speech synthesis method, electronic device, and storage medium | |
CN113205797B (en) | Virtual anchor generation method, device, computer equipment and readable storage medium | |
CN110728137B (en) | Method and device for word segmentation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||