CN111599338B - Stable and controllable end-to-end speech synthesis method and device - Google Patents
Stable and controllable end-to-end speech synthesis method and device
- Publication number: CN111599338B (application CN202010275510.5A)
- Authority
- CN
- China
- Prior art keywords
- preset
- model
- phoneme
- text data
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L13/02 — Methods for producing synthetic speech; speech synthesisers
- G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G06N3/02 — Neural networks; G06N3/08 — Learning methods
Abstract
The invention discloses a stable and controllable end-to-end speech synthesis method and device, comprising the following steps: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a speech output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech. This solves the prior-art problem that adding a duration control module can improve the stability of synthesized speech to some extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, and it greatly improves the user experience.
Description
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a stable and controllable end-to-end speech synthesis method and device.
Background
In recent years, as voice technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech brings convenience and richness to social life and has potentially wide use value. However, existing approaches have the following disadvantage: adding a duration control module can improve the stability of synthesized speech to some extent, but it introduces poor synthesis quality caused by insufficient duration prediction accuracy, which degrades the user experience.
Disclosure of Invention
To address the problems described above, the method optimizes the speech synthesis effect by adding a phoneme duration model to a preset neural network model: the preset neural network model is trained using a preset recording and the text data corresponding to that recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, comprising the steps of:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice.
Preferably, before training a preset neural network model by using a preset sound recording and text data corresponding to the preset sound recording, where the preset neural network model includes a phoneme duration model, a spectral parameter prediction model, and a speech output model, the method further includes:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components in the preset number of preset voices, and removing mute components in the preset number of preset voices;
checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing a first number of first text data with the defects in the preset number of text data and a first preset voice corresponding to the first text data;
and determining a second number of second text data without the defect in the preset number of text data and a second preset voice corresponding to the second number of second text data as the preset recording and the text data corresponding to the preset recording.
Preferably, the training of the preset neural network model by using the preset sound recording and the text data corresponding to the preset sound recording, where the preset neural network model includes a phoneme duration model, a spectrum parameter prediction model and a speech output model, includes:
acquiring a characterization phoneme sequence and a first phoneme duration for each text data in the second number of text data;
training the phoneme duration model using a second number of the sequence of characterizing phonemes as an input to the phoneme duration model and the second number of the first phoneme durations as an output of the phoneme duration model;
acquiring a second phoneme duration of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of characterizing phoneme sequences according to a second number of second phoneme durations;
extracting the frequency spectrum parameters of the second number of preset sound recordings;
taking a second number of the first frame-expanded characterizing phoneme sequences as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, to train the spectrum parameter prediction model;
training the voice output model by taking the second number of spectral parameters as input of the voice output model and taking the second number of preset recordings as output of the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
Preferably, before obtaining the text to be synthesized, the method further includes:
acquiring n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters have polyphone characters;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming the first tone when the first Chinese character is in the first tone, and the second letter sequence is a second sequence of letters forming the second tone when the first Chinese character is in the second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
Preferably, the inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice includes:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and inputting the predicted spectrum parameters into a trained speech output model to obtain the target synthesized speech.
A stable and controllable end-to-end speech device, the device comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, and the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice.
Preferably, the apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of voices, filtering noise components in the preset number of voices and removing mute components in the preset number of voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing first text data with the defects in the preset number of text data and first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data and a second preset voice corresponding to the second text data as the preset recording and the text data corresponding to the preset recording.
Preferably, the training module includes:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second quantity of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterizing phoneme sequences as an input of the phoneme duration model and using the second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of the second number of preset sound recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of the second number of preset sound recordings;
the second training submodule is used for taking a second number of the first frame-expanded characterizing phoneme sequences as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model to train the spectrum parameter prediction model;
a third training submodule, configured to train the speech output model by using the second number of spectral parameters as inputs of the speech output model and using the second number of preset sound recordings as outputs of the speech output model;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
Preferably, the apparatus further comprises:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
a fourth obtaining module, configured to obtain a first letter sequence and a second letter sequence of a first chinese character that is a polyphone in the n chinese characters if the confirming module confirms that the n chinese characters have the polyphone, where the first letter sequence is a first sequence of letters that form the first tone when the first chinese character is a first tone, and the second letter sequence is a second sequence of letters that form the second tone when the first chinese character is a second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
a fifth obtaining module, configured to obtain a third letter sequence of a second Chinese character, where the second Chinese character is a Chinese character of the n Chinese characters except the first Chinese character;
and the marking module is used for marking the target letter sequence and the third letter sequence on the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
Preferably, the obtaining module includes:
the analysis module is used for analyzing the text to be synthesized to obtain a target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
the second expansion submodule is used for carrying out second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and the fourth obtaining submodule is used for inputting the predicted spectrum parameters into the trained speech output model to obtain the target synthesized speech.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 2 is another flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 3 is a block diagram of a stable and controllable end-to-end speech synthesizer according to the present invention;
fig. 4 is another structural diagram of a stable and controllable end-to-end speech synthesis apparatus provided in the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years, as voice technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech brings convenience and richness to social life and has potentially wide use value. However, existing approaches have the following disadvantage: adding a duration control module can improve the stability of synthesized speech to some extent, but it introduces poor synthesis quality caused by insufficient duration prediction accuracy, which degrades the user experience. In order to solve the above problems, this embodiment discloses a method that optimizes the speech synthesis effect by adding a phoneme duration model to a preset neural network model: the preset neural network model is trained using a preset recording and the text data corresponding to that recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, as shown in fig. 1, comprising the following steps:
step S101, training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a frequency spectrum parameter prediction model and a voice output model;
s102, acquiring a text to be synthesized;
step S103, inputting a text to be synthesized into a trained preset neural network model to obtain a target synthesized voice;
in this embodiment, a sufficient number of high-quality preset recordings and the corresponding text data are used in advance to train the preset neural network, so that the speech synthesized by the converged model is of high quality. At the same time, a phoneme duration model is added to the preset neural network, which ensures that the duration of the synthesized speech is accurate and that phonemes are not confused. After the model converges, the user inputs the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
The working principle of the technical scheme is as follows: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice.
The beneficial effects of the above technical scheme are: training the preset neural network model with the preset recording and the corresponding text data ensures that the speech synthesized by the trained model is high-quality and stable. In addition, adding the phoneme duration model to the preset neural network model allows the duration of the synthesized speech to be controlled accurately and guarantees correct phonemes, so that the synthesized speech is closer to what the user needs. This solves the prior-art problem that adding a duration control module yields poor synthesis quality because of insufficient duration prediction accuracy, and it greatly improves the user experience.
In one embodiment, before training the preset neural network model by using the preset audio recording and the text data corresponding to the preset audio recording, the preset neural network model includes a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further includes:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing a preset number of preset voices, filtering noise components in the preset number of preset voices, and removing mute components in the preset number of preset voices;
checking whether the text contents of a preset number of text data are defective, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and eliminating a first number of first text data with defects in the preset number of text data and a first preset voice corresponding to the first text data;
and determining a second number of second text data without defects in the preset number of text data and a second preset voice corresponding to the second number of second text data as a preset recording and text data corresponding to the preset recording.
The beneficial effects of the above technical scheme are: the quality of the preset voices and the corresponding text data is further guaranteed, which avoids the situation in which the preset neural network model is trained on incomplete preset recordings and corresponding text data, leaving the converged model imperfect and the synthesized speech defective.
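As an illustration only, the corpus-screening steps above can be sketched in a few lines of Python. The silence trim is a simple amplitude threshold, the defect check is a placeholder for the text review described above, and all function names and data are invented for this sketch, not taken from the embodiment:

```python
# Hedged sketch of corpus cleaning: trim leading/trailing silence from each
# recording and drop (text, recording) pairs whose text is flagged defective.
def trim_silence(samples, threshold=0.01):
    """Strip leading/trailing samples whose magnitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def is_defective(text):
    """Placeholder defect check: empty or visibly truncated text is rejected."""
    return len(text) == 0 or text.endswith("...")

def clean_corpus(pairs):
    """Keep only non-defective (text, recording) pairs, with silence trimmed."""
    return [(t, trim_silence(r)) for t, r in pairs if not is_defective(t)]

corpus = [("hello", [0.0, 0.5, -0.4, 0.0]), ("truncated...", [0.2])]
print(clean_corpus(corpus))  # [('hello', [0.5, -0.4])]
```

A real system would replace `is_defective` with the manual or automatic content review the embodiment describes, and `trim_silence` with proper voice-activity detection.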
In one embodiment, the preset neural network model is trained by using the preset recording and text data corresponding to the preset recording, and the preset neural network model includes a phoneme duration model, a spectral parameter prediction model and a speech output model, and includes:
acquiring a representation phoneme sequence and a first phoneme duration in each text data in a second number of text data;
training a phoneme duration model by taking a second number of the characterizing phoneme sequences as input of the phoneme duration model and taking a second number of the first phoneme durations as output of the phoneme duration model;
acquiring second phoneme durations of a second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on a second number of the characterising phoneme sequences according to a second number of second phoneme durations;
extracting frequency spectrum parameters of a second number of preset sound recordings;
taking the second number of the characterizing phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, to train the spectrum parameter prediction model;
taking a second number of spectral parameters as the input of the voice output model, and taking a second number of preset recordings as the output of the voice output model to train the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
The beneficial effects of the above technical scheme are: performing the first frame expansion on the second number of characterizing phoneme sequences according to the second number of second phoneme durations expands each sequence so that the characterizing phoneme sequences of the text data and the preset voices correspond to each other frame by frame, making the trained model more accurate.
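The frame-expansion step can be illustrated with a minimal sketch (the function name and toy data are invented for illustration): each phoneme label is repeated once per frame of its predicted duration, so the expanded sequence lines up one-to-one with the spectrum frames.

```python
# Minimal sketch of the "first frame expansion" described above.
def expand_frames(phonemes, durations):
    """Repeat each phoneme label once per frame of its predicted duration."""
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames

# toy phoneme sequence with per-phoneme frame counts
expanded = expand_frames(["n", "i", "h", "ao"], [2, 4, 1, 5])
print(len(expanded))  # 2 + 4 + 1 + 5 = 12 frames
```

The expanded sequence has exactly as many entries as the recording has spectrum frames, which is what lets the spectrum parameter prediction model be trained frame by frame.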
In one embodiment, before obtaining the text to be synthesized, the method further includes:
acquiring n Chinese characters in the text content to be synthesized;
confirming whether n Chinese characters have polyphone characters;
if yes, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming a first tone when the first Chinese character is in a first tone, and the second letter sequence is a second sequence of letters forming a second tone when the first Chinese character is in a second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
The beneficial effects of the above technical scheme are: the pronunciation of each character in the text content to be synthesized can be accurately determined, and the accuracy of the speech synthesis result is improved.
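A toy sketch of the polyphone labeling described above, assuming a hypothetical lookup table of candidate pinyin letter sequences and a crude one-character context rule; the table, the rule, and the letter-sequence notation are all invented for illustration and are not specified by the embodiment:

```python
# Hypothetical polyphone table: a polyphonic character maps to its candidate
# pinyin letter sequences (first tone reading, second reading).
POLYPHONES = {
    # 行: "xing2" (to walk) vs "hang2" (row / institution, as in 银行 "bank")
    "行": ["x-i-ng2", "h-a-ng2"],
}
MONOPHONES = {"银": "y-i-n2", "走": "z-ou3"}  # toy single-reading characters

def label_pinyin(text):
    """Attach a letter sequence to every character, disambiguating polyphones."""
    labels = []
    for i, ch in enumerate(text):
        if ch in POLYPHONES:
            first_seq, second_seq = POLYPHONES[ch]
            # toy context rule: after 银 (bank context), 行 reads hang2
            chosen = second_seq if i > 0 and text[i - 1] == "银" else first_seq
            labels.append((ch, chosen))
        else:
            labels.append((ch, MONOPHONES.get(ch, "?")))
    return labels

print(label_pinyin("银行"))  # [('银', 'y-i-n2'), ('行', 'h-a-ng2')]
```

A production front-end would select the target letter sequence from the full text context (e.g. with a trained classifier or dictionary of multi-character words) rather than a single-neighbor rule.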
In one embodiment, as shown in fig. 2, inputting a text to be synthesized into a trained preset neural network model to obtain a target synthesized speech, including:
step S201, analyzing a text to be synthesized to obtain a target representation phoneme sequence;
step S202, inputting the target representation phoneme sequence into a trained phoneme duration model to obtain a target phoneme duration;
step S203, performing second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
step S204, inputting the target characterization phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and S205, inputting the predicted spectrum parameters into the trained speech output model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the target synthesized speech is accurately synthesized from the predicted spectrum parameters, which further ensures its stability; at the same time, because the whole synthesis process is carried out step by step, the integrity of the target synthesized speech is guaranteed and the final synthesized speech is more complete.
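Steps S201 to S205 can be wired together as in the following sketch. Every stage is a trivial placeholder standing in for a trained neural network, and all names and the dummy outputs are invented for illustration:

```python
# Placeholder pipeline for the five inference steps described above.
def analyze_text(text):                      # S201: text -> phoneme sequence
    return list(text)                        # toy front-end: one "phoneme" per char

def duration_model(phonemes):                # S202: phonemes -> frame counts
    return [3] * len(phonemes)               # dummy model: 3 frames each

def expand(phonemes, durations):             # S203: second frame expansion
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

def spectrum_model(frames):                  # S204: frames -> spectral params
    return [[0.0, 1.0] for _ in frames]      # dummy 2-dim "spectrum" per frame

def vocoder(spectra):                        # S205: spectra -> waveform samples
    return [sum(frame) for frame in spectra] # dummy: one sample per frame

def synthesize(text):
    phonemes = analyze_text(text)
    durations = duration_model(phonemes)
    frames = expand(phonemes, durations)
    spectra = spectrum_model(frames)
    return vocoder(spectra)

wave = synthesize("ab")
print(len(wave))  # 2 phonemes x 3 frames each = 6 samples
```

The point of the sketch is the data flow: duration prediction fixes the frame count before spectrum prediction, which is what makes the length of the synthesized speech controllable.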
The embodiment also discloses a stable and controllable end-to-end voice device, as shown in fig. 3, the device includes:
the training module 301 is configured to train a preset neural network model by using a preset recording and text data corresponding to the preset recording, to obtain the trained preset neural network model, where the preset neural network model includes a phoneme duration model, a spectrum parameter prediction model, and a speech output model;
a first obtaining module 302, configured to obtain a text to be synthesized;
an obtaining module 303, configured to input the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In one embodiment, the above apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing a preset number of preset voices, filtering noise components in the preset number of preset voices and removing mute components in the preset number of preset voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for eliminating a first number of first text data with defects in the preset number of text data and a first preset voice corresponding to the first text data;
the determining module is used for determining a second number of second text data without defects in the preset number of text data and a second preset voice corresponding to the second text data as a preset recording and text data corresponding to the preset recording.
In one embodiment, a training module, comprising:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second number of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterized phoneme sequences as an input of the phoneme duration model and using a second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of a second number of preset recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on a second number of the characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of a second number of preset sound recordings;
the second training submodule is used for taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
the third training submodule is used for taking a second number of frequency spectrum parameters as the input of the voice output model and taking a second number of preset records as the output of the voice output model to train the voice output model;
and the model obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
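The frame expansion performed by the first expansion submodule can be sketched as a length regulator: each phoneme's feature vector is repeated by its duration in frames, so that the expanded phoneme sequence lines up one-to-one with the spectrogram frames the spectrum parameter prediction model must predict. This is an illustrative sketch; the function and variable names are assumptions.

```python
import numpy as np

def expand_frames(phoneme_features, durations):
    """'Frame expansion': repeat each phoneme's feature vector by its
    duration (in frames) so the phoneme sequence aligns with the
    spectrogram frames it must predict."""
    phoneme_features = np.asarray(phoneme_features)
    durations = np.asarray(durations, dtype=int)
    assert len(phoneme_features) == len(durations)
    # np.repeat with a per-row repeat count duplicates row i durations[i] times.
    return np.repeat(phoneme_features, durations, axis=0)
```

Because the output length is fixed in advance by the duration model, the later spectrum prediction cannot skip or repeat phonemes, which is what makes this scheme more stable than attention-based alignment.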
In one embodiment, the above apparatus further comprises:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
the fourth obtaining module is used for obtaining a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone among the n Chinese characters if the confirming module confirms that a polyphone exists among the n Chinese characters, where the first letter sequence is the sequence of letters forming the pronunciation of the first Chinese character in its first tone, and the second letter sequence is the sequence of letters forming the pronunciation of the first Chinese character in its second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and the marking module is used for marking the target letter sequence and the third letter sequence onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
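The polyphone handling of the fourth obtaining and selection modules can be illustrated with a toy lexicon lookup: a polyphonic character's candidate letter sequences (pinyin readings) are disambiguated by the surrounding text, while other characters take their single reading. The lexicon contents and function name below are hypothetical; a real system would use a far larger dictionary or a trained polyphone classifier.

```python
# Hypothetical mini-lexicon: each polyphonic character maps context words
# to the letter sequence (pinyin with tone digit) chosen in that context.
POLYPHONE_LEXICON = {
    "\u884c": {"\u94f6\u884c": "hang2", "\u884c\u8d70": "xing2", "default": "xing2"},  # 行
    "\u4e50": {"\u97f3\u4e50": "yue4", "\u5feb\u4e50": "le4", "default": "le4"},       # 乐
}
MONOPHONE_LEXICON = {"\u94f6": "yin2", "\u97f3": "yin1"}  # 银, 音 (single readings)

def annotate_pinyin(text):
    """Select the target letter sequence per character: polyphones are
    disambiguated by context words found in the text; other characters
    take their single reading.  A toy sketch, not the claimed method."""
    labels = []
    for ch in text:
        if ch in POLYPHONE_LEXICON:
            readings = POLYPHONE_LEXICON[ch]
            chosen = readings["default"]
            for context_word, pinyin in readings.items():
                if context_word != "default" and context_word in text:
                    chosen = pinyin
                    break
            labels.append(chosen)
        else:
            labels.append(MONOPHONE_LEXICON.get(ch, "?"))
    return labels
```

For example, the character 行 is labeled "hang2" inside 银行 (bank) but "xing2" inside 行走 (walk), matching the target-letter-sequence selection the modules describe.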
In one embodiment, as shown in FIG. 4, the obtaining module comprises:
the parsing module 3031 is configured to parse the text to be synthesized to obtain a target representation phoneme sequence;
a second obtaining submodule 3032, configured to input the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
a second extension submodule 3033, configured to perform second frame extension on the target characterization phoneme sequence according to the target phoneme duration;
a third obtaining submodule 3034, configured to input the target representation phoneme sequence after the second frame extension into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and a fourth obtaining submodule 3035, configured to input the predicted spectrum parameter into the trained speech output model to obtain the target synthesized speech.
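The chain of submodules 3031 through 3035 can be sketched as a single inference function in which each stage's output feeds the next: text is parsed to phonemes, durations fix the frame count, the expanded sequence is mapped to spectrum parameters, and a vocoder produces the waveform. The four model callables below are placeholders (assumptions), standing in for the trained submodels.

```python
import numpy as np

def synthesize(text, g2p, duration_model, spectrum_model, vocoder):
    """Step-by-step inference chain (illustrative sketch).  Because the
    duration model fixes the number of frames up front, the later stages
    cannot skip or repeat phonemes, which is the stability property the
    scheme targets."""
    phonemes = g2p(text)                      # text -> representation phoneme sequence
    durations = duration_model(phonemes)      # frames per phoneme
    expanded = np.repeat(np.asarray(phonemes),
                         np.asarray(durations, dtype=int), axis=0)
    spectrum = spectrum_model(expanded)       # predicted spectrum parameters
    return vocoder(spectrum)                  # waveform samples
```

The same function works with stub models for testing, since every stage has a plain array interface rather than an internal attention alignment.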
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention are used only to distinguish elements belonging to different stages of application, and do not imply any order or importance.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (8)
1. A stable and controllable end-to-end speech synthesis method is characterized by comprising the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
acquiring a text to be synthesized;
inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice;
the method for training the preset neural network model by using the preset recording and the text data corresponding to the preset recording comprises the following steps:
acquiring a characterization phoneme sequence and a first phoneme duration in each text data in a second number of text data;
training the phone duration model using a second number of the characterized phone sequences as an input to the phone duration model and the second number of the first phone durations as an output from the phone duration model;
acquiring a second phoneme duration of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
extracting the frequency spectrum parameters of the second number of preset sound recordings;
taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
training the voice output model by taking the second number of spectral parameters as input of the voice output model and taking the second number of preset recordings as output of the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
2. The method of claim 1, wherein before training a predetermined neural network model using a predetermined audio recording and text data corresponding to the predetermined audio recording, the predetermined neural network model comprising a phoneme duration model, a spectral parameter prediction model, and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of the preset number of preset voices, and removing silent segments from the preset number of preset voices;
checking whether the text content of the preset number of text data has defects, wherein the defects include: the text content is incomplete, the text content cannot be read smoothly, or the text content has logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defect in the preset number of text data and a second preset voice corresponding to the second number of second text data as the preset recording and the text data corresponding to the preset recording.
3. The method of claim 1, wherein prior to obtaining the text to be synthesized, the method further comprises:
acquiring n Chinese characters in text content to be synthesized;
confirming whether the n Chinese characters have polyphone characters;
if yes, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming a first tone when the first Chinese character is in a first tone, and the second letter sequence is a second sequence of letters forming a second tone when the first Chinese character is in a second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
4. The stable and controllable end-to-end speech synthesis method according to claim 1, wherein the inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and inputting the predicted spectrum parameters into a trained speech output model to obtain the target synthesized speech.
5. A stable and controllable end-to-end speech synthesis device, comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, and the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
the first acquisition module is used for acquiring a text to be synthesized;
the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice;
the training module comprises:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second number of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterized phoneme sequences as an input of the phoneme duration model and using the second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of the second number of preset sound recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of the second number of preset sound recordings;
the second training submodule is used for taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
a third training sub-module, configured to train the speech output model by using the second number of spectral parameters as input of the speech output model and using the second number of preset sound recordings as output of the speech output model;
and the model obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
6. The stable and controllable end-to-end speech synthesis device of claim 5, further comprising:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of the preset number of preset voices and removing silent segments from the preset number of preset voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, where the defects include: the text content is incomplete, the text content cannot be read smoothly, or the text content has logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data and a second preset voice corresponding to the second text data as the preset recording and the text data corresponding to the preset recording.
7. The stable and controllable end-to-end speech synthesis device of claim 5, further comprising:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
a fourth obtaining module, configured to obtain a first letter sequence and a second letter sequence of a first chinese character that is a polyphone in the n chinese characters if the confirming module confirms that the n chinese characters have the polyphone, where the first letter sequence is a first sequence of letters that form the first tone when the first chinese character is a first tone, and the second letter sequence is a second sequence of letters that form the second tone when the first chinese character is a second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
a fifth obtaining module, configured to obtain a third letter sequence of a second Chinese character, where the second Chinese character is a Chinese character of the n Chinese characters except the first Chinese character;
and the marking module is used for marking the target letter sequence and the third letter sequence onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
8. The stable and controllable end-to-end speech synthesis device of claim 5, wherein the obtaining module comprises:
the analysis module is used for analyzing the text to be synthesized to obtain a target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
the second expansion submodule is used for carrying out second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and the fourth obtaining submodule is used for inputting the predicted spectrum parameters into the trained voice output model to obtain the target synthesized voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010275510.5A CN111599338B (en) | 2020-04-09 | 2020-04-09 | Stable and controllable end-to-end speech synthesis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599338A CN111599338A (en) | 2020-08-28 |
CN111599338B true CN111599338B (en) | 2023-04-18 |
Family
ID=72183478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010275510.5A Active CN111599338B (en) | 2020-04-09 | 2020-04-09 | Stable and controllable end-to-end speech synthesis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599338B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112133282B (en) * | 2020-10-26 | 2022-07-08 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112836276A (en) * | 2021-02-20 | 2021-05-25 | 广东三维家信息科技有限公司 | Method, device, equipment and storage medium for file broadcasting based on household design |
CN113066511B (en) * | 2021-03-16 | 2023-01-24 | 云知声智能科技股份有限公司 | Voice conversion method and device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104538024B (en) * | 2014-12-01 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and equipment |
US10540957B2 (en) * | 2014-12-15 | 2020-01-21 | Baidu Usa Llc | Systems and methods for speech transcription |
CN106531150B (en) * | 2016-12-23 | 2020-02-07 | 云知声(上海)智能科技有限公司 | Emotion synthesis method based on deep neural network model |
CN108806665A (en) * | 2018-09-12 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device |
CN110782870B (en) * | 2019-09-06 | 2023-06-16 | 腾讯科技(深圳)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN110556129B (en) * | 2019-09-09 | 2022-04-19 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
- 2020-04-09: CN202010275510.5A filed (patent CN111599338B), status Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111599338A (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111599338B (en) | Stable and controllable end-to-end speech synthesis method and device | |
CN110148427B (en) | Audio processing method, device, system, storage medium, terminal and server | |
CN107516509B (en) | Voice database construction method and system for news broadcast voice synthesis | |
CN105245917B (en) | A kind of system and method for multi-media voice subtitle generation | |
CN110166816B (en) | Video editing method and system based on voice recognition for artificial intelligence education | |
CN108847215B (en) | Method and device for voice synthesis based on user timbre | |
CN111613224A (en) | Personalized voice synthesis method and device | |
CN109285537B (en) | Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium | |
KR100659212B1 (en) | Language learning system and voice data providing method for language learning | |
CN106328146A (en) | Video subtitle generating method and device | |
CN105448289A (en) | Speech synthesis method, speech synthesis device, speech deletion method, speech deletion device and speech deletion and synthesis method | |
CN112420015B (en) | Audio synthesis method, device, equipment and computer readable storage medium | |
CN110390928B (en) | Method and system for training speech synthesis model of automatic expansion corpus | |
CN110111778A (en) | A kind of method of speech processing, device, storage medium and electronic equipment | |
CN110738981A (en) | interaction method based on intelligent voice call answering | |
CN112542158A (en) | Voice analysis method, system, electronic device and storage medium | |
CN110853627B (en) | Method and system for voice annotation | |
CN114927122A (en) | Emotional voice synthesis method and synthesis device | |
CN111383627B (en) | Voice data processing method, device, equipment and medium | |
CN111785236A (en) | Automatic composition method based on motivational extraction model and neural network | |
CN111354325A (en) | Automatic word and song creation system and method thereof | |
CN111429878B (en) | Self-adaptive voice synthesis method and device | |
CN113572977B (en) | Video production method and device | |
CN115472185A (en) | Voice generation method, device, equipment and storage medium | |
CN114267326A (en) | Training method and device of voice synthesis system and voice synthesis method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||