CN110232907B - Voice synthesis method and device, readable storage medium and computing equipment


Info

Publication number
CN110232907B
Authority
CN
China
Prior art keywords
speaker
embedded vector
text
voice
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670564.9A
Other languages
Chinese (zh)
Other versions
CN110232907A (en)
Inventor
陈云琳 (Chen Yunlin)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Go Out And Ask Suzhou Information Technology Co ltd
Original Assignee
Go Out And Ask Suzhou Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Go Out And Ask Suzhou Information Technology Co ltd
Priority to CN201910670564.9A
Publication of CN110232907A
Application granted
Publication of CN110232907B
Active legal-status Current
Anticipated expiration legal-status

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/012 - Comfort noise or silence coding
    • G10L19/04 - Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L19/16 - Vocoder architecture
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks


Abstract

Embodiments of the present disclosure provide a speech synthesis method and apparatus, a readable storage medium and a computing device that can perform speech synthesis using general-quality Chinese speech. The method comprises the following steps: acquiring a speech sequence and a text sequence for speech synthesis; inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter; and synthesizing speech from the first spectrum parameter.

Description

Voice synthesis method and device, readable storage medium and computing equipment
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular to a speech synthesis method and apparatus, a readable storage medium, and a computing device.
Background
To train a high-quality speech synthesis model, existing Chinese speech synthesis systems need high-quality speech recorded in a professional recording studio. Two schemes are mainly used:
recording 10-20 hours of high-quality data from a single speaker, labeling it manually (including pinyin annotation, prosody annotation and segmentation), and then training a model to obtain a complete Text-To-Speech (TTS) system;
recording 10-20 hours of high-quality data from each of 10-20 speakers, labeling it manually in the same way, and training a multi-speaker model to obtain a complete TTS system.
Both schemes depend on perfect recording equipment, a high-standard recording environment and fine manual labeling; a problem in any one of these links degrades the model. If the recording equipment or environment is faulty, the generated speech is very noisy; if the labels are wrong, model training does not converge and the acceptance standard is not reached. In addition, recording high-quality speech takes a long time and a large amount of money, so building a high-quality TTS model with either scheme has clear disadvantages.
Disclosure of Invention
To this end, the present disclosure provides a speech synthesis method, apparatus, readable storage medium and computing device in an attempt to solve or at least alleviate at least one of the problems identified above.
According to an aspect of an embodiment of the present disclosure, there is provided a speech synthesis method including:
acquiring a speech sequence and a text sequence for speech synthesis;
inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and synthesizing speech from the first spectrum parameter.
Optionally, inputting the speech sequence and the text sequence into the pre-trained neural network to obtain the first spectrum parameter includes:
according to the pre-trained neural network, inputting the speech sequence into a speaker encoder to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
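As a concrete illustration of this optional data flow, the following minimal PyTorch sketch wires the four modules together; all module architectures, dimensions and names (SynthesisNetwork and its submodules) are assumptions made for illustration, not the patent's actual implementation.

# Minimal sketch of the data flow described above (PyTorch).
# All architectures, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn

class SynthesisNetwork(nn.Module):
    def __init__(self, n_mels=80, d_spk=128, d_res=32, d_txt=256, vocab=100):
        super().__init__()
        self.speaker_encoder = nn.GRU(n_mels, d_spk, batch_first=True)
        self.residual_encoder = nn.GRU(n_mels, d_res, batch_first=True)
        self.text_encoder = nn.Embedding(vocab, d_txt)
        self.decoder = nn.GRU(d_txt + d_spk + d_res, 256, batch_first=True)
        self.to_spectrum = nn.Linear(256, n_mels)

    def forward(self, speech, text):
        _, spk = self.speaker_encoder(speech)   # first speaker embedded vector
        _, res = self.residual_encoder(speech)  # first residual embedded vector
        txt = self.text_encoder(text)           # first text embedded vector
        cond = torch.cat([spk, res], dim=-1).transpose(0, 1)  # (B, 1, d_spk+d_res)
        cond = cond.expand(-1, txt.size(1), -1)               # repeat over text steps
        out, _ = self.decoder(torch.cat([txt, cond], dim=-1))
        return self.to_spectrum(out)            # first spectrum parameter

net = SynthesisNetwork()
spectrum = net(torch.randn(1, 200, 80), torch.randint(0, 100, (1, 30)))
print(spectrum.shape)  # torch.Size([1, 30, 80])

A real synthesizer would align text steps with spectrum frames using attention; the fixed one-output-per-token decoder above only keeps the sketch short.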
Optionally, synthesizing speech from the first spectrum parameter comprises:
synthesizing speech from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model.
Optionally, training the neural network comprises:
acquiring training speech sequences and training text sequences in one-to-one correspondence;
inputting the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
inputting the training speech sequence into the residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into the text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, training the neural network further comprises:
inputting the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, training the neural network further comprises:
performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, training the neural network further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification.
Optionally, training the neural network further comprises:
calculating the posterior probability of the speech background, which is used to correct the neural network parameters.
According to still another aspect of an embodiment of the present disclosure, there is provided a speech synthesis apparatus including:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
a neural network processing unit configured to input the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and a speech synthesis unit configured to synthesize speech from the first spectrum parameter.
Optionally, the neural network processing unit is specifically configured to:
input the speech sequence into a speaker encoder according to the pre-trained neural network to obtain a first speaker embedded vector; input the speech sequence into a residual encoder to obtain a first residual embedded vector; input the text sequence into a text encoder to obtain a first text embedded vector; and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
optionally, the speech synthesis unit is specifically configured to:
and synthesizing voice by adopting a Neural Vocoder model of a pre-trained Neural network Vocoder according to the first spectrum parameters.
Optionally, the apparatus further comprises a neural network training unit configured to:
acquire training speech sequences and training text sequences in one-to-one correspondence;
input the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
input the training speech sequence into the residual encoder to obtain a second residual embedded vector;
input the training text sequence into the text encoder to obtain a second text embedded vector;
and input the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit is further configured to:
input the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, the neural network training unit is further configured to:
perform gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit is further configured to:
input the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit is further configured to:
calculate the posterior probability of the speech background, which is used to correct the neural network parameters.
According to yet another aspect of an embodiment of the present disclosure, there is provided a readable storage medium having executable instructions thereon, which when executed, cause a computer to perform the operations included in the above-mentioned method.
According to yet another aspect of embodiments of the present disclosure, there is provided a computing device comprising: a processor; and a memory storing executable instructions that, when executed, cause the processor to perform operations included in the above-described methods.
According to the technical solution provided by the embodiments of the present disclosure, a speech sequence and a text sequence for speech synthesis are acquired, the speech sequence and the text sequence are input into a pre-trained neural network to obtain a first spectrum parameter, and speech is synthesized from the first spectrum parameter. This scheme needs only Chinese audio of ordinary quality, such as speech recorded on a mobile phone or the speech of customer service staff; it can automatically separate speaker information from background noise and still synthesize high-quality speech.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
FIG. 1 is a block diagram of an exemplary computing device;
FIG. 2 is a flowchart of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a neural network training method according to an embodiment of the present disclosure;
FIG. 4 is a further flowchart of a neural network training method according to an embodiment of the present disclosure;
FIG. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 is a block diagram of an example computing device 100 arranged to implement a speech synthesis method according to the present disclosure. In a basic configuration 102, computing device 100 typically includes system memory 106 and one or more processors 104. A memory bus 108 may be used for communication between the processor 104 and the system memory 106.
Depending on the desired configuration, the processor 104 may be any type of processor. The processor 104 may include one or more levels of cache, such as a level one cache 110 and a level two cache 112, a processor core 114, and registers 116. The example processor core 114 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
Depending on the desired configuration, system memory 106 may be any type of memory, including but not limited to volatile memory (such as RAM) and non-volatile memory (such as ROM or flash memory), or any combination thereof. System memory 106 may include an operating system 120, one or more programs 122, and program data 124. In some implementations, the programs 122 may be arranged to be executed by the one or more processors 104 on the operating system using the program data 124.
Computing device 100 may also include an interface bus 140 that facilitates communication from various interface devices (e.g., output devices 142, peripheral interfaces 144, and communication devices 146) to the basic configuration 102 via the bus/interface controller 130. Example output devices 142 include a graphics processing unit 148 and an audio processing unit 150, which may be configured to facilitate communication with various external devices, such as a display terminal or speakers, via one or more A/V ports 152. Example peripheral interfaces 144 may include a serial interface controller 154 and a parallel interface controller 156, which may be configured to facilitate communication with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner) via one or more I/O ports 158. An example communication device 146 may include a network controller 160, which may be arranged to facilitate communications with one or more other computing devices 162 over a network communication link via one or more communication ports 164.
A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, or program modules in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or a dedicated wired connection, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.
Computing device 100 may be implemented as part of a small-form-factor portable (or mobile) electronic device, such as a cellular telephone, a personal digital assistant (PDA), a personal media player device, a wireless web-watch device, a personal headset device, an application-specific device, or a hybrid device that includes any of the above functions. Computing device 100 may also be implemented as a personal computer, including both desktop and notebook computer configurations.
Among other things, one or more programs 122 of computing device 100 include instructions for performing a method of speech synthesis according to the present disclosure.
FIG. 2 illustrates a flowchart of a speech synthesis method 200 according to an embodiment of the present disclosure. The speech synthesis method 200 starts at step S210.
S210, acquiring a speech sequence and a text sequence for speech synthesis;
S220, inputting the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and S230, synthesizing speech from the first spectrum parameter.
In step S210, the speech sequence used for speech synthesis may be a newly recorded speech sequence of a certain speaker, or may be selected from the training speech sequences used to train the neural network. The text sequence used for speech synthesis may be any text sequence entered by the user.
In step S220, the specific data processing by the neural network includes:
according to the pre-trained neural network, inputting the speech sequence into a speaker encoder to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
In step S220, the neural network is trained in advance. As shown in FIG. 3, the training process includes the following steps:
S310, acquiring training speech sequences and training text sequences in one-to-one correspondence;
S320, inputting the training speech sequence into a speaker encoder to obtain a second speaker embedded vector;
S330, inputting the training speech sequence into a residual encoder to obtain a second residual embedded vector;
S340, inputting the training text sequence into a text encoder to obtain a second text embedded vector;
and S350, inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into a decoder to obtain a second spectrum parameter.
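For illustration only, steps S310 to S350 can be wired into a single training update as sketched below; the module interface (the SynthesisNetwork sketch above) and the L1 reconstruction loss are assumptions, since the patent does not fix a particular loss or architecture.

# Sketch of steps S310-S350 as one training update (PyTorch).
# The L1 loss and the module interface are assumptions.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, speech, text, target_spectrum):
    _, spk = model.speaker_encoder(speech)   # S320: second speaker embedded vector
    _, res = model.residual_encoder(speech)  # S330: second residual embedded vector
    txt = model.text_encoder(text)           # S340: second text embedded vector
    cond = torch.cat([spk, res], dim=-1).transpose(0, 1).expand(-1, txt.size(1), -1)
    out, _ = model.decoder(torch.cat([txt, cond], dim=-1))
    pred = model.to_spectrum(out)            # S350: second spectrum parameter
    # assumed reconstruction objective (attention-based alignment is omitted)
    loss = F.l1_loss(pred, target_spectrum)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()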
The training data may be low-quality speech data from multiple speakers, and may include both noisy data and clean (noise-free) data.
Optionally, in step S230, speech is synthesized from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model. The neural vocoder is a separately trained convolutional neural network model; it abandons the traditional vocoder scheme and models the sampling points directly, which gives the synthesized sound high fidelity.
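To illustrate what "modeling the sampling points directly" means, here is a toy autoregressive vocoder sketch that emits one quantized sample at a time. It is an assumption-laden miniature: the patent describes a convolutional model, whereas this sketch uses a recurrent cell and one sample per spectrum frame purely to keep the loop short.

# Toy sample-level autoregressive vocoder (PyTorch). Real neural vocoders are
# convolutional and far larger; every size and name here is an assumption.
import torch
import torch.nn as nn

class ToyVocoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_classes=256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels + 1, hidden)  # condition + previous sample
        self.proj = nn.Linear(hidden, n_classes)   # logits over quantized samples

    @torch.no_grad()
    def generate(self, frames):                    # frames: (T, n_mels)
        h = torch.zeros(1, self.rnn.hidden_size)
        sample = torch.zeros(1, 1)
        out = []
        for frame in frames:                       # autoregressive sampling loop
            h = self.rnn(torch.cat([frame.unsqueeze(0), sample], dim=-1), h)
            probs = self.proj(h).softmax(dim=-1)
            idx = torch.multinomial(probs, 1)      # draw the next quantized sample
            sample = idx.float() / 255.0 * 2.0 - 1.0  # map back to [-1, 1]
            out.append(sample)
        return torch.cat(out, dim=0)               # (T, 1) waveform samples

wave = ToyVocoder().generate(torch.randn(100, 80))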
Optionally, in the neural network training process, the performance of the neural network is improved by adversarial factor decomposition, which specifically includes:
inputting the second speaker embedded vector into a speaker recognition classifier for classification, thereby distinguishing the information of different speakers.
Optionally, the method further comprises:
performing gradient inversion processing on the second speaker embedded vector, so that the speaker embedded vector contains only speaker information, such as gender and speaker ID.
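Gradient inversion is commonly realized as a layer that is the identity in the forward pass and negates (and optionally scales) the gradient in the backward pass; a minimal PyTorch sketch follows, with the scale lambd as an assumed hyperparameter.

# Minimal gradient reversal layer (PyTorch): identity forward, negated
# gradient backward. The scale lambd is an assumed hyperparameter.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)                    # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None  # reversed gradient to the encoder

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

spk_embedding = torch.randn(4, 128, requires_grad=True)
reversed_emb = grad_reverse(spk_embedding)     # feed this to a classifier

A classifier placed behind this layer still trains normally, but the gradient reaching the speaker encoder is reversed, pushing the speaker embedded vector to discard whatever that classifier can detect.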
Optionally, the method further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification, thereby distinguishing clean speech from noisy speech.
Optionally, the method further comprises:
after the speech background is classified, calculating the posterior probability of the speech background, which is used to correct the neural network parameters.
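Read together with the previous steps, one plausible implementation, sketched under that assumption, is a two-class (clean/noisy) softmax posterior whose cross-entropy term backpropagates, through the gradient reversal above, into the network parameters.

# Sketch of the speech-background posterior as a correction signal (PyTorch).
# The two-class clean/noisy setup and the cross-entropy form are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

background_classifier = nn.Linear(128, 2)   # clean vs. noisy, illustrative size

spk_embedding = torch.randn(4, 128, requires_grad=True)
bg_label = torch.tensor([0, 1, 0, 1])       # 0 = clean, 1 = noisy

logits = background_classifier(spk_embedding)
posterior = logits.softmax(dim=-1)          # posterior probability of the background
loss = F.cross_entropy(logits, bg_label)    # term used to correct network parameters
loss.backward()                             # gradients flow back to the encoder side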
To further illustrate the concepts of the present disclosure, a specific embodiment of the present disclosure is provided below in conjunction with FIG. 4.
In a specific embodiment of the present disclosure, an end-to-end neural network architecture is employed.
1. A data preparation phase.
Prepare about 50 hours of general speech recognition data (that is, speech data of ordinary rather than studio quality), containing speech from multiple speakers and including both noisy and noise-free data.
2. A training stage.
a. The text and speech pairs are fed one-to-one into the following modules and used for training.
b. A speech encoding module.
i. The speech sequence is processed by a speaker Encoder to obtain the speaker's Embedding vector (Embedding).
ii. The speech sequence is processed by a residual Encoder to obtain a residual Embedding carrying no speaker information; this residual Embedding contains acoustic feature information such as environmental noise and emotion.
c. An adversarial factorization module.
i. The speaker Embedding is processed by a speaker classifier, so that the information of different speakers is distinguished.
ii. The speaker Embedding is processed by a speech background classifier, which aims to distinguish clean speech from noisy speech; gradient inversion is applied so that the speaker Embedding contains only speaker information, such as gender and speaker ID.
d. A synthesizer module.
i. The text sequence passes through a text Encoder to obtain a text Embedding containing the text information.
ii. The text Embedding, speaker Embedding and residual Embedding are input together into a Decoder to obtain the spectrum parameters.
3. A generation stage.
a. Given arbitrary text and the reference speaker's speech.
b. The reference speaker's speech passes through the speaker Encoder to obtain the speaker Embedding.
c. The reference speaker's speech passes through the residual Encoder to obtain the residual Embedding.
d. The input text passes through the text Encoder to obtain the text Embedding.
e. The text Embedding, the speaker Embedding and the residual Embedding are input together into the Decoder to obtain the final spectrum parameters.
f. Finally, the spectrum parameters are processed by the Neural Vocoder to obtain speech.
The resulting speech is consistent with the reference speaker's voice, and the acoustic emotion and background sounds are also consistent with the reference speech: if the reference speech is noisy, the synthesized speech is noisy; if the reference speech is noise-free, the synthesized speech is noise-free.
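Chaining steps a to f, the generation stage reduces to a short usage sketch; it reuses the illustrative SynthesisNetwork and ToyVocoder classes sketched earlier, both of which are assumptions rather than the patent's actual modules.

# End-to-end generation sketch (steps a-f), reusing the illustrative classes
# defined in the earlier sketches; all names are assumptions.
import torch

net, vocoder = SynthesisNetwork(), ToyVocoder()

reference_speech = torch.randn(1, 200, 80)        # a. the reference speaker's audio
text = torch.randint(0, 100, (1, 30))             # a. arbitrary input text (token ids)

spectrum = net(reference_speech, text)            # b-e. embeddings -> decoder -> spectrum
waveform = vocoder.generate(spectrum.squeeze(0))  # f. spectrum -> speech samples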
This specific embodiment of the disclosure achieves the following technical effects:
1. A scheme for speech synthesis using general-quality Chinese speech is provided, which greatly increases the amount of usable data and improves model quality.
2. General Chinese speech recognition data is used instead of high-quality recordings made with professional recording equipment, which greatly reduces the cost of recording and labeling.
3. A gradient inversion module is added to the model, which can effectively separate speaker information from the other speech characteristics.
4. A Neural Vocoder is introduced into the model, which improves the quality of the synthesized speech.
As shown in fig. 5, the present disclosure also provides a speech synthesis apparatus, including:
a data acquisition unit 510 for acquiring a speech sequence and a text sequence for speech synthesis;
a neural network processing unit 520 configured to input the speech sequence and the text sequence into a pre-trained neural network to obtain a first spectrum parameter;
and a speech synthesis unit 530 configured to synthesize speech from the first spectrum parameter.
Optionally, the neural network processing unit 520 is specifically configured to:
input the speech sequence into a speaker encoder according to the pre-trained neural network to obtain a first speaker embedded vector; input the speech sequence into a residual encoder to obtain a first residual embedded vector; input the text sequence into a text encoder to obtain a first text embedded vector; and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain the first spectrum parameter.
optionally, the speech synthesis unit 530 is specifically configured to:
and synthesizing voice by adopting a Neural Vocoder model of a pre-trained Neural network Vocoder according to the first spectrum parameters.
Optionally, the apparatus further comprises a neural network training unit 540 configured to:
acquire training speech sequences and training text sequences in one-to-one correspondence;
input the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
input the training speech sequence into the residual encoder to obtain a second residual embedded vector;
input the training text sequence into the text encoder to obtain a second text embedded vector;
and input the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
Optionally, the neural network training unit 540 is further configured to:
input the second speaker embedded vector into a speaker recognition classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
perform gradient inversion processing on the second speaker embedded vector to obtain speaker information.
Optionally, the neural network training unit 540 is further configured to:
input the second speaker embedded vector into a speech background classifier for classification.
Optionally, the neural network training unit 540 is further configured to:
calculate the posterior probability of the speech background, which is used to correct the neural network parameters.
For the specific limitations of the speech synthesis apparatus, reference may be made to the limitations of the speech synthesis method above, which are not repeated here.
It should be understood that the various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present disclosure, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the disclosure.
In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to perform the various methods of the present disclosure according to instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules or other data. Communication media typically embody computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that in the foregoing description of exemplary embodiments of the disclosure, various features of the disclosure are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various disclosed aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed disclosure requires more features than are expressly recited in each claim. Rather, as the following claims reflect, disclosed aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this disclosure.
Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Moreover, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the described embodiments are described herein as a method or combination of method elements that can be performed by a processor of a computer system or by other means of performing the described functions. A processor having the necessary instructions for carrying out the method or method elements thus forms a means for carrying out the method or method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purposes of this disclosure.
As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.
While the disclosure has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the disclosure as described herein. Moreover, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the disclosed subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present disclosure is intended to be illustrative, but not limiting, of the scope of the disclosure, which is set forth in the following claims.

Claims (9)

1. A method of speech synthesis, comprising:
acquiring a speech sequence and a text sequence for speech synthesis;
inputting the speech sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector; inputting the speech sequence into a residual encoder to obtain a first residual embedded vector; inputting the text sequence into a text encoder to obtain a first text embedded vector; and inputting the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter, wherein the first residual embedded vector at least comprises acoustic features of emotion and environmental noise;
and synthesizing speech according to the first spectrum parameter.
2. The method of claim 1, wherein synthesizing speech according to the first spectrum parameter comprises:
synthesizing speech from the first spectrum parameter using a pre-trained neural network vocoder (Neural Vocoder) model.
3. The method of claim 1, wherein training the neural network comprises:
acquiring training speech sequences and training text sequences in one-to-one correspondence;
inputting the training speech sequence into the speaker encoder to obtain a second speaker embedded vector;
inputting the training speech sequence into the residual encoder to obtain a second residual embedded vector;
inputting the training text sequence into the text encoder to obtain a second text embedded vector;
and inputting the second speaker embedded vector, the second residual embedded vector and the second text embedded vector into the decoder to obtain a second spectrum parameter.
4. The method of claim 1, wherein training the neural network further comprises:
inputting the second speaker embedded vector into a speaker recognition classifier for classification.
5. The method of claim 1, wherein training the neural network further comprises:
performing gradient inversion processing on the second speaker embedded vector to obtain speaker information.
6. The method of claim 1, wherein training the neural network further comprises:
inputting the second speaker embedded vector into a speech background classifier for classification.
7. A speech synthesis apparatus, comprising:
a data acquisition unit configured to acquire a speech sequence and a text sequence for speech synthesis;
a neural network processing unit configured to input the speech sequence into a speaker encoder according to a pre-trained neural network to obtain a first speaker embedded vector, input the speech sequence into a residual encoder to obtain a first residual embedded vector, input the text sequence into a text encoder to obtain a first text embedded vector, and input the first speaker embedded vector, the first residual embedded vector and the first text embedded vector into a decoder to obtain a first spectrum parameter, wherein the first residual embedded vector at least comprises acoustic features of emotion and environmental noise;
and a speech synthesis unit configured to synthesize speech according to the first spectrum parameter.
8. A readable storage medium having executable instructions thereon which, when executed, cause a computer to perform the operations included in the method of any one of claims 1-6.
9. A computing device, comprising:
a processor; and
a memory storing executable instructions which, when executed, cause the processor to perform the operations included in the method of any one of claims 1-6.
CN201910670564.9A 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment Active CN110232907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670564.9A CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670564.9A CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN110232907A CN110232907A (en) 2019-09-13
CN110232907B (en) 2021-11-02

Family

ID=67856002

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670564.9A Active CN110232907B (en) 2019-07-24 2019-07-24 Voice synthesis method and device, readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN110232907B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110473516B (en) * 2019-09-19 2020-11-27 百度在线网络技术(北京)有限公司 Voice synthesis method and device and electronic equipment
CN112802443B (en) * 2019-11-14 2024-04-02 腾讯科技(深圳)有限公司 Speech synthesis method and device, electronic equipment and computer readable storage medium
CN113066472B (en) * 2019-12-13 2024-05-31 科大讯飞股份有限公司 Synthetic voice processing method and related device
CN111369968B (en) * 2020-03-19 2023-10-13 北京字节跳动网络技术有限公司 Speech synthesis method and device, readable medium and electronic equipment
CN111862931B (en) * 2020-05-08 2024-09-24 北京嘀嘀无限科技发展有限公司 Voice generation method and device
CN112750419B (en) * 2020-12-31 2024-02-13 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN112786012B (en) * 2020-12-31 2024-05-31 科大讯飞股份有限公司 Speech synthesis method, device, electronic equipment and storage medium
CN117174074A (en) * 2023-11-01 2023-12-05 浙江同花顺智能科技有限公司 Speech synthesis method, device, equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063899A (en) * 2010-10-27 2011-05-18 南京邮电大学 Method for voice conversion under unparallel text condition
CN103377651A (en) * 2012-04-28 2013-10-30 北京三星通信技术研究有限公司 Device and method for automatic voice synthesis
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
CN105096932A (en) * 2015-07-14 2015-11-25 百度在线网络技术(北京)有限公司 Voice synthesis method and apparatus of talking book
CN105304079A (en) * 2015-09-14 2016-02-03 上海可言信息技术有限公司 Multi-party call multi-mode speech synthesis method and system
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN109461435A (en) * 2018-11-19 2019-03-12 北京光年无限科技有限公司 A kind of phoneme synthesizing method and device towards intelligent robot
CN109754779A (en) * 2019-01-14 2019-05-14 出门问问信息科技有限公司 Controllable emotional speech synthesizing method, device, electronic equipment and readable storage medium storing program for executing

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003255993A (en) * 2002-03-04 2003-09-10 Ntt Docomo Inc System, method, and program for speech recognition, and system, method, and program for speech synthesis
CN101281744B (en) * 2007-04-04 2011-07-06 纽昂斯通讯公司 Method and apparatus for analyzing and synthesizing voice
US8639502B1 (en) * 2009-02-16 2014-01-28 Arrowhead Center, Inc. Speaker model-based speech enhancement system
CN107564543B (en) * 2017-09-13 2020-06-26 苏州大学 Voice feature extraction method with high emotion distinguishing degree
CN108831435B (en) * 2018-06-06 2020-10-16 安徽继远软件有限公司 Emotional voice synthesis method based on multi-emotion speaker self-adaption

Also Published As

Publication number Publication date
CN110232907A (en) 2019-09-13

Similar Documents

Publication Publication Date Title
CN110232907B (en) Voice synthesis method and device, readable storage medium and computing equipment
CN110379407B (en) Adaptive speech synthesis method, device, readable storage medium and computing equipment
CN106486130B (en) Noise elimination and voice recognition method and device
US8386256B2 (en) Method, apparatus and computer program product for providing real glottal pulses in HMM-based text-to-speech synthesis
CN108492818B (en) Text-to-speech conversion method and device and computer equipment
US8131550B2 (en) Method, apparatus and computer program product for providing improved voice conversion
CN110379415B (en) Training method of domain adaptive acoustic model
CN107240401B (en) Tone conversion method and computing device
CN104205215B (en) Automatic real-time verbal therapy
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN111435592B (en) Voice recognition method and device and terminal equipment
CN112185363B (en) Audio processing method and device
CN110379414B (en) Acoustic model enhancement training method and device, readable storage medium and computing equipment
US20130332171A1 (en) Bandwidth Extension via Constrained Synthesis
CN111681639B (en) Multi-speaker voice synthesis method, device and computing equipment
CN113724683B (en) Audio generation method, computer device and computer readable storage medium
US20170364516A1 (en) Linguistic model selection for adaptive automatic speech recognition
US11574622B2 (en) Joint automatic speech recognition and text to speech conversion using adversarial neural networks
US20120109654A1 (en) Methods and apparatuses for facilitating speech synthesis
CN114025235A (en) Video generation method and device, electronic equipment and storage medium
WO2002059877A1 (en) Data processing device
CN117012177A (en) Speech synthesis method, electronic device, and storage medium
CN113205797B (en) Virtual anchor generation method, device, computer equipment and readable storage medium
CN110728137B (en) Method and device for word segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant