CN111599338A - Stable and controllable end-to-end speech synthesis method and device

Info

Publication number
CN111599338A
CN111599338A
Authority
CN
China
Prior art keywords
preset
model
phoneme
text data
sequence
Legal status
Granted
Application number
CN202010275510.5A
Other languages
Chinese (zh)
Other versions
CN111599338B (en)
Inventor
孙见青 (Sun Jianqing)
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010275510.5A
Publication of CN111599338A
Application granted
Publication of CN111599338B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 13/00: Speech synthesis; text-to-speech systems
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention discloses a stable and controllable end-to-end speech synthesis method and device, comprising the following steps: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech. This solves the prior-art problem that adding a duration control module improves the stability of synthesized speech to a certain extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, greatly improving the user's experience.

Description

Stable and controllable end-to-end speech synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a stable and controllable end-to-end speech synthesis method and device.
Background
In recent years, as voice technology has matured, speech synthesis has gradually been applied to voice signal processing systems such as voice interaction, voice broadcasting and personalized voice production. In the social and commercial fields, synthesized speech gives these systems a voice, brings convenience and richness to social life, and has potentially wide practical value. However, the existing approach of adding a duration control module has the following disadvantage: although it can improve the stability of the synthesized speech to a certain extent, insufficient duration prediction accuracy leads to poor synthesis quality and reduces the user's experience.
Disclosure of Invention
In view of the above problems, the invention optimizes the speech synthesis effect by adding a phoneme duration model to the preset neural network model: the preset neural network model is trained by using a preset recording and the text data corresponding to the preset recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method comprises the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
Preferably, before the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
Preferably, the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as the output of the phoneme duration model;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings;
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as the output of the speech output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
Preferably, before obtaining the text to be synthesized, the method further includes:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
Preferably, the inputting of the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
A stable and controllable end-to-end speech synthesis device, the device comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
Preferably, the apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
Preferably, the training module includes:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
Preferably, the apparatus further comprises:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
Preferably, the obtaining module includes:
the parsing submodule is used for parsing the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule is used for performing the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule is used for inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 2 is another flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 3 is a structural diagram of a stable and controllable end-to-end speech synthesis device provided by the present invention;
FIG. 4 is another structural diagram of a stable and controllable end-to-end speech synthesis device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years, as voice technology has matured, speech synthesis has gradually been applied to voice signal processing systems such as voice interaction, voice broadcasting and personalized voice production. In the social and commercial fields, synthesized speech gives these systems a voice, brings convenience and richness to social life, and has potentially wide practical value. However, the existing approach of adding a duration control module has the following disadvantage: although it can improve the stability of the synthesized speech to a certain extent, insufficient duration prediction accuracy leads to poor synthesis quality and reduces the user's experience. In order to solve the above problems, this embodiment discloses a method that optimizes the speech synthesis effect by adding a phoneme duration model to the preset neural network model: the preset neural network model is trained by using a preset recording and the text data corresponding to the preset recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, as shown in fig. 1, comprises the following steps:
step S101, training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
step S102, acquiring a text to be synthesized;
step S103, inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In this embodiment, a sufficient number of high-quality preset recordings and their corresponding text data are used in advance to train the preset neural network, so that the speech synthesized by the converged model is of high quality. At the same time, a phoneme duration model is added to the preset neural network, which helps ensure that the duration of the synthesized speech is controllable and that phonemes are neither skipped nor repeated. After the model converges, the user inputs a text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
The working principle of the technical scheme is as follows: a preset neural network model is trained by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model; a text to be synthesized is acquired; and the text to be synthesized is input into the trained preset neural network model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the preset neural network model is trained with the preset recording and its corresponding text data, which ensures that the speech synthesized by the trained model is of high quality and stable. In addition, the phoneme duration model added to the preset neural network model allows the duration of the synthesized speech to be controlled accurately and guarantees its phoneme content, so that the synthesized speech is closer to what the user needs. This solves the prior-art problem that adding a duration control module improves stability to a certain extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, greatly improving the user's experience.
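The patent does not specify the internal architecture of the phoneme duration model. As a minimal illustrative sketch only, assuming a PyTorch implementation with hypothetical layer sizes, such a model can map a representation phoneme sequence to one positive frame count per phoneme:

```python
import torch
import torch.nn as nn

class PhonemeDurationModel(nn.Module):
    """Illustrative sketch (not from the patent): predicts a frame count
    for each phoneme in a representation phoneme sequence."""

    def __init__(self, num_phonemes: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)  # one duration value per phoneme

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len) integer phoneme indices
        x = self.embed(phoneme_ids)
        x, _ = self.rnn(x)
        # Softplus keeps the predicted per-phoneme frame counts positive.
        return nn.functional.softplus(self.proj(x)).squeeze(-1)
```

During training, the representation phoneme sequences are the input and the first phoneme durations are the regression target; an explicit model of this kind is what makes the duration of the synthesized speech controllable.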
In one embodiment, before the preset neural network model is trained by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
The beneficial effects of the above technical scheme are: the quality of the preset voices and the corresponding text data is further guaranteed, avoiding the situation where training the preset neural network model on incomplete preset recordings and their corresponding text data leaves the converged model imperfect and the synthesized speech defective.
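A minimal sketch of this preprocessing step, assuming librosa and scipy with hypothetical cutoff and threshold values (the patent names no specific tools or parameters):

```python
import librosa
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_recording(path: str, sr: int = 16000) -> np.ndarray:
    """Illustrative cleanup of one preset voice: high-pass filtering as a
    stand-in for noise filtering, then trimming of silent segments."""
    y, _ = librosa.load(path, sr=sr)
    # Suppress low-frequency noise components with a 4th-order high-pass filter.
    b, a = butter(4, 60.0 / (sr / 2), btype="highpass")
    y = filtfilt(b, a, y)
    # Remove leading and trailing silence below a 30 dB threshold.
    y, _ = librosa.effects.trim(y, top_db=30)
    return y
```

A full implementation would also remove internal silent segments (librosa.effects.split can locate them); the values above are illustrative defaults, not requirements of the patent.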
In one embodiment, the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings (an illustrative sketch of this extraction is given after these steps);
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
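The patent does not name a specific spectral parameterization. A minimal sketch, assuming an 80-band log-mel spectrogram computed with librosa (the frame length, hop size and band count below are assumptions):

```python
import librosa
import numpy as np

def extract_spectral_parameters(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Illustrative spectral-parameter extraction for one preset recording."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    # Log compression; transpose to (num_frames, n_mels) so each row is one frame.
    return librosa.power_to_db(mel).T
```

With a 256-sample hop at 16 kHz, one frame covers 16 ms, which is also the natural unit in which the phoneme durations are counted.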
The beneficial effects of the above technical scheme are: performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations compensates the representation phoneme sequences, so that the representation phoneme sequences of the text data correspond frame-by-frame to the preset voices, making the trained model more complete.
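Concretely, the frame expansion amounts to repeating each phoneme's feature vector for its predicted number of frames. A minimal numpy sketch (the shapes are assumptions for illustration):

```python
import numpy as np

def expand_frames(phoneme_feats: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's feature vector for its predicted number of frames
    so the sequence aligns frame-by-frame with the extracted spectral parameters.

    phoneme_feats: (num_phonemes, feat_dim) representation phoneme sequence
    durations:     (num_phonemes,) integer frame counts from the duration model
    """
    return np.repeat(phoneme_feats, durations, axis=0)

# Example: 3 phonemes lasting 2, 4 and 3 frames expand to 9 feature frames.
feats = np.random.randn(3, 64)
frames = expand_frames(feats, np.array([2, 4, 3]))
assert frames.shape == (9, 64)
```

The same routine serves for both the first frame expansion during training and the second frame expansion during synthesis.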
In one embodiment, before the text to be synthesized is obtained, the method further comprises:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
The beneficial effects of the above technical scheme are: the pronunciation of each character in the text content to be synthesized can be determined accurately, which improves the accuracy of the speech synthesis result.
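A minimal sketch of the polyphone labeling idea; the lookup tables and the selection callback below are hypothetical placeholders, since the patent does not specify how the target letter sequence is chosen from context:

```python
# Hypothetical pronunciation tables for illustration only; a real system
# would use a full lexicon and a context-aware disambiguation model.
POLYPHONE_READINGS = {"行": ["xing2", "hang2"]}   # candidate letter sequences
DEFAULT_READING = {"银": "yin2", "走": "zou3"}     # single-reading characters

def label_pinyin(text, pick_reading):
    """Attach a letter (pinyin) sequence to every character; for polyphonic
    characters, pick_reading(char, text) selects the target letter sequence."""
    labels = []
    for ch in text:
        if ch in POLYPHONE_READINGS:
            labels.append(pick_reading(ch, text))       # target letter sequence
        else:
            labels.append(DEFAULT_READING.get(ch, ch))  # third letter sequence
    return labels

# In "银行" (bank), "行" should read "hang2"; a real selector would use context.
print(label_pinyin("银行", lambda ch, t: "hang2" if "银" in t else "xing2"))
# -> ['yin2', 'hang2']
```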
In one embodiment, as shown in fig. 2, inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
step S201, analyzing the text to be synthesized to obtain a target representation phoneme sequence;
step S202, inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
step S203, performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
step S204, inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and step S205, inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the target synthesized speech is synthesized accurately from the predicted spectral parameters, which further ensures its stability; meanwhile, carrying out the whole synthesis process step by step guarantees the integrity of the target synthesized speech, so that the finally synthesized speech is more complete.
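Putting steps S201 to S205 together, a minimal sketch of the inference flow; every model here is an assumed callable, not an interface defined by the patent:

```python
import numpy as np

def synthesize(text, text_to_phonemes, duration_model, spectrum_model, vocoder):
    """Illustrative end-to-end inference corresponding to steps S201-S205."""
    phonemes = text_to_phonemes(text)                 # S201: target representation phoneme sequence
    durations = duration_model(phonemes)              # S202: target phoneme durations (in frames)
    durations = np.maximum(1, np.round(durations)).astype(int)
    frames = np.repeat(phonemes, durations, axis=0)   # S203: second frame expansion
    spectrum = spectrum_model(frames)                 # S204: predicted spectral parameters
    return vocoder(spectrum)                          # S205: target synthesized speech waveform
```

Because the durations are predicted explicitly and the frames are expanded deterministically, the output length is fixed before the spectral parameters are generated, which is the source of the stability and controllability the patent claims.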
This embodiment also discloses a stable and controllable end-to-end speech synthesis device; as shown in fig. 3, the device comprises:
the training module 301 is configured to train a preset neural network model by using a preset recording and text data corresponding to the preset recording, to obtain the trained preset neural network model, where the preset neural network model includes a phoneme duration model, a spectral parameter prediction model, and a speech output model;
a first obtaining module 302, configured to obtain a text to be synthesized;
an obtaining module 303, configured to input the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In one embodiment, the above apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
In one embodiment, the training module comprises:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
In one embodiment, the above apparatus further comprises:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
In one embodiment, as shown in FIG. 4, the obtaining module comprises:
the parsing submodule 3031, configured to parse the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule 3032, configured to input the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule 3033, configured to perform the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule 3034, configured to input the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule 3035, configured to input the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely refer to different stages of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A stable and controllable end-to-end speech synthesis method, characterized by comprising the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
2. The method according to claim 1, wherein before the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
3. The method according to claim 1, wherein the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as the output of the phoneme duration model;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings;
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as the output of the speech output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
4. The method according to claim 1, wherein before the text to be synthesized is obtained, the method further comprises:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
5. The stable and controllable end-to-end speech synthesis method according to claim 1, wherein the inputting of the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
6. A stable and controllable end-to-end speech synthesis device, characterized by comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
7. The apparatus according to claim 6, further comprising:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
8. The apparatus according to claim 6, wherein the training module comprises:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
9. The apparatus according to claim 6, further comprising:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
10. The apparatus according to claim 6, wherein the obtaining module comprises:
the parsing submodule is used for parsing the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule is used for performing the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule is used for inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
CN202010275510.5A (filed 2020-04-09, priority date 2020-04-09) - Stable and controllable end-to-end speech synthesis method and device - Active - granted as CN111599338B

Priority Applications (1)

Application Number: CN202010275510.5A - Priority Date: 2020-04-09 - Filing Date: 2020-04-09 - Title: Stable and controllable end-to-end speech synthesis method and device


Publications (2)

CN111599338A - published 2020-08-28
CN111599338B (granted) - published 2023-04-18

Family

Family ID: 72183478

Family Applications (1)

CN202010275510.5A (Active, granted as CN111599338B) - Priority Date: 2020-04-09 - Filing Date: 2020-04-09 - Title: Stable and controllable end-to-end speech synthesis method and device

Country Status (1)

CN: CN111599338B

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538024A * 2014-12-01 2015-04-22 Baidu Online Network Technology (Beijing) Co Ltd Speech synthesis method, apparatus and equipment
US20160171974A1 2014-12-15 2016-06-16 Baidu USA LLC Systems and methods for speech transcription
CN106531150A * 2016-12-23 2017-03-22 Shanghai Yuzhiyi Information Technology Co Ltd Emotion synthesis method based on deep neural network model
CN108806665A * 2018-09-12 2018-11-13 Baidu Online Network Technology (Beijing) Co Ltd Speech synthesis method and device
CN109767755A * 2019-03-01 2019-05-17 Guangzhou Duoyi Network Co Ltd Speech synthesis method and system
CN110782870A * 2019-09-06 2020-02-11 Tencent Technology (Shenzhen) Co Ltd Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110556129A * 2019-09-09 2019-12-10 Peking University Shenzhen Graduate School Bimodal emotion recognition model training method and bimodal emotion recognition method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133282A * 2020-10-26 2020-12-25 Xiamen University Lightweight multi-speaker speech synthesis system and electronic equipment
CN112133282B * 2020-10-26 2022-07-08 Xiamen University Lightweight multi-speaker speech synthesis system and electronic equipment
CN113066511A * 2021-03-16 2021-07-02 Unisound Intelligent Technology Co Ltd Voice conversion method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111599338B - published 2023-04-18


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant