CN110232907B - Voice synthesis method and device, readable storage medium and computing equipment


Info

Publication number
CN110232907B
Authority
CN
China
Prior art keywords
speaker
embedded vector
text
voice
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670564.9A
Other languages
Chinese (zh)
Other versions
CN110232907A (en)
Inventor
陈云琳 (Chen Yunlin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910670564.9A
Publication of CN110232907A
Application granted
Publication of CN110232907B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/012 - Comfort noise or silence coding
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks


Abstract

Embodiments of the present disclosure provide a speech synthesis method and apparatus, a readable storage medium and a computing device that can perform speech synthesis using general-quality Chinese speech. The method comprises the following steps: acquiring a speech sequence and a text sequence for speech synthesis; inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter; and synthesizing speech from the first spectrum parameter.

Description

Voice synthesis method and device, readable storage medium and computing equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular to a speech synthesis method and apparatus, a readable storage medium, and a computing device.
Background
To train a high-quality speech synthesis model, existing Chinese speech synthesis systems need high-quality speech recorded in a professional recording studio. Two schemes are mainly used:
recording 10-20 hours of high-quality data from a single speaker, labeling it manually (including pinyin annotation, prosody annotation and segmentation), and then training a model to obtain a complete Text-To-Speech (TTS) system;
recording 10-20 hours of high-quality data from each of 10-20 speakers, labeling it manually in the same way, and training a multi-speaker model to obtain a complete TTS system.
Both schemes depend on perfect recording equipment, a high-standard recording environment and fine manual labeling; a problem in any one of these links degrades the model. If the recording equipment or environment is faulty, the generated speech is very noisy; if the labels are wrong, model training does not converge and the acceptance standard is not reached. In addition, recording high-quality speech takes a long time and a large amount of money, so building a high-quality TTS model with either scheme has clear disadvantages.
Disclosure of Invention
To this end, the present disclosure provides a speech synthesis method, apparatus, readable storage medium and computing device in an attempt to solve or at least alleviate at least one of the problems identified above.
According to an aspect of an embodiment of the present disclosure, there is provided a speech synthesis method including:
acquiring a speech sequence and a text sequence for speech synthesis;
inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and synthesizing speech from the first spectrum parameter.
Optionally, inputting the speech sequence and the text sequence into the pre-trained neural network to obtain the first spectrum parameter includes:
according to the pre-trained neural network, inputting the speech sequence into a speaker encoder to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
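As a concrete illustration of this optional data flow, the following minimal PyTorch sketch wires the four modules together; all module architectures, dimensions and names (SynthesisNetwork and its submodules) are assumptions made for illustration, not the patent's actual implementation.

# Minimal sketch of the data flow described above (PyTorch).
# All architectures, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn

class SynthesisNetwork(nn.Module):
    def __init__(self, n_mels=80, d_spk=128, d_res=32, d_txt=256, vocab=100):
        super().__init__()
        self.speaker_encoder = nn.GRU(n_mels, d_spk, batch_first=True)
        self.residual_encoder = nn.GRU(n_mels, d_res, batch_first=True)
        self.text_encoder = nn.Embedding(vocab, d_txt)
        self.decoder = nn.GRU(d_txt + d_spk + d_res, 256, batch_first=True)
        self.to_spectrum = nn.Linear(256, n_mels)

    def forward(self, speech, text):
        _, spk = self.speaker_encoder(speech)   # first speaker embedded vector
        _, res = self.residual_encoder(speech)  # first residual embedded vector
        txt = self.text_encoder(text)           # first text embedded vector
        cond = torch.cat([spk, res], dim=-1).transpose(0, 1)  # (B, 1, d_spk+d_res)
        cond = cond.expand(-1, txt.size(1), -1)               # repeat over text steps
        out, _ = self.decoder(torch.cat([txt, cond], dim=-1))
        return self.to_spectrum(out)            # first spectrum parameter

net = SynthesisNetwork()
spectrum = net(torch.randn(1, 200, 80), torch.randint(0, 100, (1, 30)))
print(spectrum.shape)  # torch.Size([1, 30, 80])

A real synthesizer would align text steps with spectrum frames using attention; the fixed one-output-per-token decoder above only keeps the sketch short.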
Optionally, synthesizing speech from the first spectrum parameter comprises:
synthesizing speech from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model.
Optionally, training the neural network comprises:
acquiring training speech sequences and training text sequences in one-to-one correspondence;
inputting the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
inputting the training speech sequence into the residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into the text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, training the neural network further comprises:
inputting the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, training the neural network further comprises:
performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, training the neural network further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification.
Optionally, training the neural network further comprises:
calculating the posterior probability of the speech background, which is used to correct the neural network parameters.
According to still another aspect of an embodiment of the present disclosure, there is provided a speech synthesis apparatus including:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
a neural network processing unit configured to input the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and a speech synthesis unit configured to synthesize speech from the first spectrum parameter.
Optionally, the neural network processing unit is specifically configured to:
input the speech sequence into a speaker encoder according to the pre-trained neural network to obtain a first speaker embedded vector; input the speech sequence into a residual encoder to obtain a first residual embedded vector; input the text sequence into a text encoder to obtain a first text embedded vector; and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
optionally, the speech synthesis unit is specifically configured to:
and synthesizing voice by adopting a Neural Vocoder model of a pre-trained Neural network Vocoder according to the first spectrum parameters.
Optionally, the apparatus further comprises a neural network training unit configured to:
acquire training speech sequences and training text sequences in one-to-one correspondence;
input the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
input the training speech sequence into the residual encoder to obtain a second residual embedded vector;
input the training text sequence into the text encoder to obtain a second text embedded vector;
and input the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit is further configured to:
input the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, the neural network training unit is further configured to:
perform gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit is further configured to:
input the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit is further configured to:
calculate the posterior probability of the speech background, which is used to correct the neural network parameters.
According to yet another aspect of an embodiment of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the operations included in the above-mentioned method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform operations included in the above-described methods.
According to the technical solution provided by the embodiments of the present disclosure, a speech sequence and a text sequence for speech synthesis are acquired, the speech sequence and the text sequence are input into a pre-trained neural network to obtain a first spectrum parameter, and speech is synthesized from the first spectrum parameter. This scheme needs only Chinese audio of ordinary quality, such as speech recorded on a mobile phone or the speech of customer service staff; it can automatically separate speaker information from background noise and still synthesize high-quality speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a neural network training method according to an embodiment of the present disclosure;
FIG. 4 is a further flowchart of a neural network training method according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 is a block diagram of an example computing device 100 arranged to implement a speech synthesis method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM) and non-volatile memory (such as ROM or flash memory), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be arranged to be executed by the one or more processors 104 on the operating system using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated wired connection, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
Among other things, one or more programs 122 of computing device 100 include instructions for performing a method of speech synthesis according to the present disclosure.
FIG. 2 illustrates a flowchart of a speech synthesis method 200 according to an embodiment of the present disclosure. The speech synthesis method 200 starts at step S210.
S210, acquiring a speech sequence and a text sequence for speech synthesis;
S220, inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and S230, synthesizing speech from the first spectrum parameter.
In step S210, the speech sequence used for speech synthesis may be a newly recorded speech sequence of a certain speaker, or may be selected from the training speech sequences used to train the neural network. The text sequence used for speech synthesis may be any text sequence entered by the user.
In step S220, the specific data processing by the neural network includes:
according to the pre-trained neural network, inputting the speech sequence into a speaker encoder to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
In step S220, the neural network is trained in advance. As shown in FIG. 3, the training process includes the following steps:
S310, acquiring training speech sequences and training text sequences in one-to-one correspondence;
S320, inputting the training speech sequence into a speaker encoder to obtain a second speaker embedded vector;
S330, inputting the training speech sequence into a residual encoder to obtain a second residual embedded vector;
S340, inputting the training text sequence into a text encoder to obtain a second text embedded vector;
and S350, inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
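For illustration only, steps S310 to S350 can be wired into a single training update as sketched below; the module interface (the SynthesisNetwork sketch above) and the L1 reconstruction loss are assumptions, since the patent does not fix a particular loss or architecture.

# Sketch of steps S310-S350 as one training update (PyTorch).
# The L1 loss and the module interface are assumptions.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, speech, text, target_spectrum):
    _, spk = model.speaker_encoder(speech)   # S320: second speaker embedded vector
    _, res = model.residual_encoder(speech)  # S330: second residual embedded vector
    txt = model.text_encoder(text)           # S340: second text embedded vector
    cond = torch.cat([spk, res], dim=-1).transpose(0, 1).expand(-1, txt.size(1), -1)
    out, _ = model.decoder(torch.cat([txt, cond], dim=-1))
    pred = model.to_spectrum(out)            # S350: second spectrum parameter
    # assumed reconstruction objective (attention-based alignment is omitted)
    loss = F.l1_loss(pred, target_spectrum)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()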
The training data may be low-quality speech data from multiple speakers, and may include both noisy data and clean (noise-free) data.
Optionally, in step S230, speech is synthesized from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model. The neural vocoder is a separately trained convolutional neural network model; it abandons the traditional vocoder scheme and models the sampling points directly, which gives the synthesized sound high fidelity.
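To illustrate what "modeling the sampling points directly" means, here is a toy autoregressive vocoder sketch that emits one quantized sample at a time. It is an assumption-laden miniature: the patent describes a convolutional model, whereas this sketch uses a recurrent cell and one sample per spectrum frame purely to keep the loop short.

# Toy sample-level autoregressive vocoder (PyTorch). Real neural vocoders are
# convolutional and far larger; every size and name here is an assumption.
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_classes=256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels + 1, hidden)  # condition + previous sample
        self.proj = nn.Linear(hidden, n_classes)   # logits over quantized samples

    @torch.no_grad()
    def generate(self, frames):                    # frames: (T, n_mels)
        h = torch.zeros(1, self.rnn.hidden_size)
        sample = torch.zeros(1, 1)
        out = []
        for frame in frames:                       # autoregressive sampling loop
            h = self.rnn(torch.cat([frame.unsqueeze(0), sample], dim=-1), h)
            probs = self.proj(h).softmax(dim=-1)
            idx = torch.multinomial(probs, 1)      # draw the next quantized sample
            sample = idx.float() / 255.0 * 2.0 - 1.0  # map back to [-1, 1]
            out.append(sample)
        return torch.cat(out, dim=0)               # (T, 1) waveform samples

wave = ToyVocoder().generate(torch.randn(100, 80))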
Optionally, in the neural network training process, the performance of the neural network is improved by adversarial factor decomposition, which specifically includes:
inputting the second speaker embedded vector into a speaker recognition classifier for classification, thereby distinguishing the information of different speakers.
Optionally, the method further comprises:
performing gradient inversion processing on the second speaker embedded vector, so that the speaker embedded vector contains only speaker information, such as gender and speaker ID.
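Gradient inversion is commonly realized as a layer that is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass; a minimal PyTorch sketch follows, with the scale lambd as an assumed hyperparameter.

# Minimal gradient reversal layer (PyTorch): identity forward, negated
# gradient backward. The scale lambd is an assumed hyperparameter.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # reversed gradient to the encoder

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

spk_embedding = torch.randn(4, 128, requires_grad=True)
reversed_emb = grad_reverse(spk_embedding)     # feed this to a classifier

A classifier placed behind this layer still trains normally, but the gradient reaching the speaker encoder is reversed, pushing the speaker embedded vector to discard whatever that classifier can detect.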
Optionally, the method further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification, thereby distinguishing clean speech from noisy speech.
Optionally, the method further comprises:
after the speech background is classified, calculating the posterior probability of the speech background, which is used to correct the neural network parameters.
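Read together with the previous steps, one plausible implementation, sketched under that assumption, is a two-class (clean/noisy) softmax posterior whose cross-entropy term backpropagates, through the gradient reversal above, into the network parameters.

# Sketch of the speech-background posterior as a correction signal (PyTorch).
# The two-class clean/noisy setup and the cross-entropy form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

background_classifier = nn.Linear(128, 2)   # clean vs. noisy, illustrative size

spk_embedding = torch.randn(4, 128, requires_grad=True)
bg_label = torch.tensor([0, 1, 0, 1])       # 0 = clean, 1 = noisy

logits = background_classifier(spk_embedding)
posterior = logits.softmax(dim=-1)          # posterior probability of the background
loss = F.cross_entropy(logits, bg_label)    # term used to correct network parameters
loss.backward()                             # gradients flow back to the encoder side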
To further illustrate the concepts of the present disclosure, a specific embodiment of the present disclosure is provided below in conjunction with FIG. 4.
In a specific embodiment of the present disclosure, an end-to-end neural network architecture is employed.
1. A data preparation phase.
Prepare about 50 hours of general speech recognition data (that is, speech data of ordinary rather than studio quality), containing speech from multiple speakers and including both noisy and noise-free data.
2. A training stage.
a. The text and speech pairs are fed one-to-one into the following modules and used for training.
b. A speech encoding module.
i. The speech sequence is processed by a speaker Encoder to obtain the speaker's Embedding vector (Embedding).
ii. The speech sequence is processed by a residual Encoder to obtain a residual Embedding carrying no speaker information; this residual Embedding contains acoustic feature information such as environmental noise and emotion.
c. An adversarial factorization module.
i. The speaker Embedding is processed by a speaker classifier, so that the information of different speakers is distinguished.
ii. The speaker Embedding is processed by a speech background classifier, which aims to distinguish clean speech from noisy speech; gradient inversion is applied so that the speaker Embedding contains only speaker information, such as gender and speaker ID.
d. A synthesizer module.
i. The text sequence passes through a text Encoder to obtain a text Embedding containing the text information.
ii. The text Embedding, speaker Embedding and residual Embedding are input together into a Decoder to obtain the spectrum parameters.
3. A generation stage.
a. Given arbitrary text and the reference speaker's speech.
b. The reference speaker's speech passes through the speaker Encoder to obtain the speaker Embedding.
c. The reference speaker's speech passes through the residual Encoder to obtain the residual Embedding.
d. The input text passes through the text Encoder to obtain the text Embedding.
e. The text Embedding, the speaker Embedding and the residual Embedding are input together into the Decoder to obtain the final spectrum parameters.
f. Finally, the spectrum parameters are processed by the Neural Vocoder to obtain speech.
The resulting speech is consistent with the reference speaker's voice, and the acoustic emotion and background sounds are also consistent with the reference speech: if the reference speech is noisy, the synthesized speech is noisy; if the reference speech is noise-free, the synthesized speech is noise-free.
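Chaining steps a to f, the generation stage reduces to a short usage sketch; it reuses the illustrative SynthesisNetwork and ToyVocoder classes sketched earlier, both of which are assumptions rather than the patent's actual modules.

# End-to-end generation sketch (steps a-f), reusing the illustrative classes
# defined in the earlier sketches; all names are assumptions.
import torch

net, vocoder = SynthesisNetwork(), ToyVocoder()

reference_speech = torch.randn(1, 200, 80)        # a. the reference speaker's audio
text = torch.randint(0, 100, (1, 30))             # a. arbitrary input text (token ids)

spectrum = net(reference_speech, text)            # b-e. embeddings -> decoder -> spectrum
waveform = vocoder.generate(spectrum.squeeze(0))  # f. spectrum -> speech samples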
This specific embodiment of the disclosure achieves the following technical effects:
1. A scheme for speech synthesis using general-quality Chinese speech is provided, which greatly increases the amount of usable data and improves model quality.
2. General Chinese speech recognition data is used instead of high-quality recordings made with professional recording equipment, which greatly reduces the cost of recording and labeling.
3. A gradient inversion module is added to the model, which can effectively separate speaker information from the other speech characteristics.
4. A Neural Vocoder is introduced into the model, which improves the quality of the synthesized speech.
As shown in fig. 5, the present disclosure also provides a speech synthesis apparatus, including:
a data acquisition unit 510 for acquiring a speech sequence and a text sequence for speech synthesis;
a neural network processing unit 520 configured to input the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and a speech synthesis unit 530 configured to synthesize speech from the first spectrum parameter.
Optionally, the neural network processing unit 520 is specifically configured to:
input the speech sequence into a speaker encoder according to the pre-trained neural network to obtain a first speaker embedded vector; input the speech sequence into a residual encoder to obtain a first residual embedded vector; input the text sequence into a text encoder to obtain a first text embedded vector; and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
optionally, the speech synthesis unit 530 is specifically configured to:
and synthesizing voice by adopting a Neural Vocoder model of a pre-trained Neural network Vocoder according to the first spectrum parameters.
Optionally, the apparatus further comprises a neural network training unit 540 configured to:
acquire training speech sequences and training text sequences in one-to-one correspondence;
input the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
input the training speech sequence into the residual encoder to obtain a second residual embedded vector;
input the training text sequence into the text encoder to obtain a second text embedded vector;
and input the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit 540 is further configured to:
input the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
perform gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit 540 is further configured to:
input the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
calculate the posterior probability of the speech background, which is used to correct the neural network parameters.
For the specific limitations of the speech synthesis apparatus, reference may be made to the limitations of the speech synthesis method above, which are not repeated here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.

Claims (9)

1. A method of speech synthesis, comprising:
acquiring a speech sequence and a text sequence for speech synthesis;
inputting the speech sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter, wherein the first residual embedded vector at least comprises acoustic features of emotion and environmental noise;
and synthesizing speech according to the first spectrum parameter.
2. The method of claim 1, wherein synthesizing speech according to the first spectrum parameter comprises:
synthesizing speech from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model.
3. The method of claim 1, wherein training the neural network comprises:
acquiring training speech sequences and training text sequences in one-to-one correspondence;
inputting the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
inputting the training speech sequence into the residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into the text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
4. The method of claim 1, wherein training the neural network further comprises:
inputting the second speaker embedded vector into a speaker recognition classifier for classification.
5. The method of claim 1, wherein training the neural network further comprises:
performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
6. The method of claim 1, wherein training the neural network further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification.
7. A speech synthesis apparatus, comprising:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
a neural network processing unit configured to input the speech sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, input the speech sequence into a residual encoder to obtain a first residual embedded vector, input the text sequence into a text encoder to obtain a first text embedded vector, and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter, wherein the first residual embedded vector at least comprises acoustic features of emotion and environmental noise;
and a speech synthesis unit configured to synthesize speech according to the first spectrum parameter.
8. A readable storage medium having executable instructions thereon which, when executed, cause a computer to perform the operations included in the method of any one of claims 1-6.
9. A computing device, comprising:
a processor; and
a memory storing executable instructions which, when executed, cause the processor to perform the operations included in the method of any one of claims 1-6.
CN201910670564.9A 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment Active CN110232907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670564.9A CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670564.9A CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN110232907A CN110232907A (en) 2019-09-13
CN110232907B (en) 2021-11-02

Family

ID=67856002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670564.9A Active CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110232907B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473516B (en) * 2019-09-19 2020-11-27 百度在线网络技术(北京)有限公司 Voice synthesis method and device and electronic equipment
CN112802443B (en) * 2019-11-14 2024-04-02 腾讯科技(深圳)有限公司 Speech synthesis method and device, electronic equipment and computer readable storage medium
CN113066472B (en) * 2019-12-13 2024-05-31 科大讯飞股份有限公司 Synthetic voice processing method and related device
CN111369968B (en) * 2020-03-19 2023-10-13 北京字节跳动网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
CN111862931B (en) * 2020-05-08 2024-09-24 北京嘀嘀无限科技发展有限公司 Voice generation method and device
CN112750419B (en) * 2020-12-31 2024-02-13 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN112786012B (en) * 2020-12-31 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN117174074A (en) * 2023-11-01 2023-12-05 浙江同花顺智能科技有限公司 Speech synthesis method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103377651A (en) * 2012-04-28 2013-10-30 北京三星通信技术研究有限公司 Device and method for automatic voice synthesis
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN105304079A (en) * 2015-09-14 2016-02-03 上海可言信息技术有限公司 Multi-party call multi-mode speech synthesis method and system
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN109461435A (en) * 2018-11-19 2019-03-12 北京光年无限科技有限公司 A kind of phoneme synthesizing method and device towards intelligent robot
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003255993A (en) * 2002-03-04 2003-09-10 Ntt Docomo Inc System, method, and program for speech recognition, and system, method, and program for speech synthesis
CN101281744B (en) * 2007-04-04 2011-07-06 纽昂斯通讯公司 Method and apparatus for analyzing and synthesizing voice
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
CN107564543B (en) * 2017-09-13 2020-06-26 苏州大学 Voice feature extraction method with high emotion distinguishing degree
CN108831435B (en) * 2018-06-06 2020-10-16 安徽继远软件有限公司 Emotional voice synthesis method based on multi-emotion speaker self-adaption

Also Published As

Publication number Publication date
CN110232907A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN106486130B (en) Noise elimination and voice recognition method and device
US8386256B2 (en) Method, apparatus and computer program product for providing real glottal pulses in HMM-based text-to-speech synthesis
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
US8131550B2 (en) Method, apparatus and computer program product for providing improved voice conversion
CN110379415B (en) Training method of domain adaptive acoustic model
CN107240401B (en) Tone conversion method and computing device
CN104205215B (en) Automatic real-time verbal therapy
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111435592B (en) Voice recognition method and device and terminal equipment
CN112185363B (en) Audio processing method and device
CN110379414B (en) Acoustic model enhancement training method and device, readable storage medium and computing equipment
US20130332171A1 (en) Bandwidth Extension via Constrained Synthesis
CN111681639B (en) Multi-speaker voice synthesis method, device and computing equipment
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
US20170364516A1 (en) Linguistic model selection for adaptive automatic speech recognition
US11574622B2 (en) Joint automatic speech recognition and text to speech conversion using adversarial neural networks
US20120109654A1 (en) Methods and apparatuses for facilitating speech synthesis
CN114025235A (en) Video generation method and device, electronic equipment and storage medium
WO2002059877A1 (en) Data processing device
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN113205797B (en) Virtual anchor generation method, device, computer equipment and readable storage medium
CN110728137B (en) Method and device for word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant