CN111599338A - Stable and controllable end-to-end speech synthesis method and device

Info

Publication number
CN111599338A
CN111599338A
Authority
CN
China
Prior art keywords
preset
model
phoneme
text data
sequence
Legal status
Granted
Application number
CN202010275510.5A
Other languages
Chinese (zh)
Other versions
CN111599338B (en)
Inventor
孙见青 (Sun Jianqing)
Current Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Application filed by Unisound Intelligent Technology Co Ltd and Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority to CN202010275510.5A
Publication of CN111599338A
Application granted
Publication of CN111599338B
Legal status: Active

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06N: Computing arrangements based on specific computational models
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 13/00: Speech synthesis; text-to-speech systems
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention discloses a stable and controllable end-to-end speech synthesis method and device, comprising the following steps: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech. This solves the prior-art problem that adding a duration control module improves the stability of synthesized speech to a certain extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, greatly improving the user's experience.

Description

Stable and controllable end-to-end speech synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, in particular to a stable and controllable end-to-end speech synthesis method and device.
Background
In recent years, as voice technology has matured, speech synthesis has gradually been applied to voice signal processing systems such as voice interaction, voice broadcasting and personalized voice production. In the social and commercial fields, synthesized speech gives these systems a voice, brings convenience and richness to social life, and has potentially wide practical value. However, the existing approach of adding a duration control module has the following disadvantage: although it can improve the stability of the synthesized speech to a certain extent, insufficient duration prediction accuracy leads to poor synthesis quality and reduces the user's experience.
Disclosure of Invention
In view of the above problems, the invention optimizes the speech synthesis effect by adding a phoneme duration model to the preset neural network model: the preset neural network model is trained by using a preset recording and the text data corresponding to the preset recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method comprises the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
Preferably, before the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
Preferably, the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as the output of the phoneme duration model;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings;
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as the output of the speech output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
Preferably, before obtaining the text to be synthesized, the method further includes:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
Preferably, the inputting of the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
A stable and controllable end-to-end speech synthesis device, the device comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
Preferably, the apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
Preferably, the training module includes:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
Preferably, the apparatus further comprises:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
Preferably, the obtaining module includes:
the parsing submodule is used for parsing the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule is used for performing the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule is used for inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 2 is another flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 3 is a structural diagram of a stable and controllable end-to-end speech synthesis device provided by the present invention;
FIG. 4 is another structural diagram of a stable and controllable end-to-end speech synthesis device provided by the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years, as voice technology has matured, speech synthesis has gradually been applied to voice signal processing systems such as voice interaction, voice broadcasting and personalized voice production. In the social and commercial fields, synthesized speech gives these systems a voice, brings convenience and richness to social life, and has potentially wide practical value. However, the existing approach of adding a duration control module has the following disadvantage: although it can improve the stability of the synthesized speech to a certain extent, insufficient duration prediction accuracy leads to poor synthesis quality and reduces the user's experience. In order to solve the above problems, this embodiment discloses a method that optimizes the speech synthesis effect by adding a phoneme duration model to the preset neural network model: the preset neural network model is trained by using a preset recording and the text data corresponding to the preset recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, as shown in fig. 1, comprises the following steps:
step S101, training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
step S102, acquiring a text to be synthesized;
step S103, inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In this embodiment, a sufficient number of high-quality preset recordings and their corresponding text data are used in advance to train the preset neural network, so that the speech synthesized by the converged model is of high quality. At the same time, a phoneme duration model is added to the preset neural network, which helps ensure that the duration of the synthesized speech is controllable and that phonemes are neither skipped nor repeated. After the model converges, the user inputs a text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
The working principle of the technical scheme is as follows: a preset neural network model is trained by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model; a text to be synthesized is acquired; and the text to be synthesized is input into the trained preset neural network model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the preset neural network model is trained with the preset recording and its corresponding text data, which ensures that the speech synthesized by the trained model is of high quality and stable. In addition, the phoneme duration model added to the preset neural network model allows the duration of the synthesized speech to be controlled accurately and guarantees its phoneme content, so that the synthesized speech is closer to what the user needs. This solves the prior-art problem that adding a duration control module improves stability to a certain extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, greatly improving the user's experience.
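The patent does not specify the internal architecture of the phoneme duration model. As a minimal illustrative sketch only, assuming a PyTorch implementation with hypothetical layer sizes, such a model can map a representation phoneme sequence to one positive frame count per phoneme:

```python
import torch
import torch.nn as nn

class PhonemeDurationModel(nn.Module):
    """Illustrative sketch (not from the patent): predicts a frame count
    for each phoneme in a representation phoneme sequence."""

    def __init__(self, num_phonemes: int, embed_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(num_phonemes, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)  # one duration value per phoneme

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len) integer phoneme indices
        x = self.embed(phoneme_ids)
        x, _ = self.rnn(x)
        # Softplus keeps the predicted per-phoneme frame counts positive.
        return nn.functional.softplus(self.proj(x)).squeeze(-1)
```

During training, the representation phoneme sequences are the input and the first phoneme durations are the regression target; an explicit model of this kind is what makes the duration of the synthesized speech controllable.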
In one embodiment, before the preset neural network model is trained by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
The beneficial effects of the above technical scheme are: the quality of the preset voices and the corresponding text data is further guaranteed, avoiding the situation where training the preset neural network model on incomplete preset recordings and their corresponding text data leaves the converged model imperfect and the synthesized speech defective.
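A minimal sketch of this preprocessing step, assuming librosa and scipy with hypothetical cutoff and threshold values (the patent names no specific tools or parameters):

```python
import librosa
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_recording(path: str, sr: int = 16000) -> np.ndarray:
    """Illustrative cleanup of one preset voice: high-pass filtering as a
    stand-in for noise filtering, then trimming of silent segments."""
    y, _ = librosa.load(path, sr=sr)
    # Suppress low-frequency noise components with a 4th-order high-pass filter.
    b, a = butter(4, 60.0 / (sr / 2), btype="highpass")
    y = filtfilt(b, a, y)
    # Remove leading and trailing silence below a 30 dB threshold.
    y, _ = librosa.effects.trim(y, top_db=30)
    return y
```

A full implementation would also remove internal silent segments (librosa.effects.split can locate them); the values above are illustrative defaults, not requirements of the patent.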
In one embodiment, the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings (an illustrative sketch of this extraction is given after these steps);
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
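The patent does not name a specific spectral parameterization. A minimal sketch, assuming an 80-band log-mel spectrogram computed with librosa (the frame length, hop size and band count below are assumptions):

```python
import librosa
import numpy as np

def extract_spectral_parameters(y: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Illustrative spectral-parameter extraction for one preset recording."""
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
    # Log compression; transpose to (num_frames, n_mels) so each row is one frame.
    return librosa.power_to_db(mel).T
```

With a 256-sample hop at 16 kHz, one frame covers 16 ms, which is also the natural unit in which the phoneme durations are counted.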
The beneficial effects of the above technical scheme are: performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations compensates the representation phoneme sequences, so that the representation phoneme sequences of the text data correspond frame-by-frame to the preset voices, making the trained model more complete.
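Concretely, the frame expansion amounts to repeating each phoneme's feature vector for its predicted number of frames. A minimal numpy sketch (the shapes are assumptions for illustration):

```python
import numpy as np

def expand_frames(phoneme_feats: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's feature vector for its predicted number of frames
    so the sequence aligns frame-by-frame with the extracted spectral parameters.

    phoneme_feats: (num_phonemes, feat_dim) representation phoneme sequence
    durations:     (num_phonemes,) integer frame counts from the duration model
    """
    return np.repeat(phoneme_feats, durations, axis=0)

# Example: 3 phonemes lasting 2, 4 and 3 frames expand to 9 feature frames.
feats = np.random.randn(3, 64)
frames = expand_frames(feats, np.array([2, 4, 3]))
assert frames.shape == (9, 64)
```

The same routine serves for both the first frame expansion during training and the second frame expansion during synthesis.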
In one embodiment, before the text to be synthesized is obtained, the method further comprises:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
The beneficial effects of the above technical scheme are: the pronunciation of each character in the text content to be synthesized can be determined accurately, which improves the accuracy of the speech synthesis result.
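A minimal sketch of the polyphone labeling idea; the lookup tables and the selection callback below are hypothetical placeholders, since the patent does not specify how the target letter sequence is chosen from context:

```python
# Hypothetical pronunciation tables for illustration only; a real system
# would use a full lexicon and a context-aware disambiguation model.
POLYPHONE_READINGS = {"行": ["xing2", "hang2"]}   # candidate letter sequences
DEFAULT_READING = {"银": "yin2", "走": "zou3"}     # single-reading characters

def label_pinyin(text, pick_reading):
    """Attach a letter (pinyin) sequence to every character; for polyphonic
    characters, pick_reading(char, text) selects the target letter sequence."""
    labels = []
    for ch in text:
        if ch in POLYPHONE_READINGS:
            labels.append(pick_reading(ch, text))       # target letter sequence
        else:
            labels.append(DEFAULT_READING.get(ch, ch))  # third letter sequence
    return labels

# In "银行" (bank), "行" should read "hang2"; a real selector would use context.
print(label_pinyin("银行", lambda ch, t: "hang2" if "银" in t else "xing2"))
# -> ['yin2', 'hang2']
```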
In one embodiment, as shown in fig. 2, inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
step S201, analyzing the text to be synthesized to obtain a target representation phoneme sequence;
step S202, inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
step S203, performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
step S204, inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and step S205, inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the target synthesized speech is synthesized accurately from the predicted spectral parameters, which further ensures its stability; meanwhile, carrying out the whole synthesis process step by step guarantees the integrity of the target synthesized speech, so that the finally synthesized speech is more complete.
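Putting steps S201 to S205 together, a minimal sketch of the inference flow; every model here is an assumed callable, not an interface defined by the patent:

```python
import numpy as np

def synthesize(text, text_to_phonemes, duration_model, spectrum_model, vocoder):
    """Illustrative end-to-end inference corresponding to steps S201-S205."""
    phonemes = text_to_phonemes(text)                 # S201: target representation phoneme sequence
    durations = duration_model(phonemes)              # S202: target phoneme durations (in frames)
    durations = np.maximum(1, np.round(durations)).astype(int)
    frames = np.repeat(phonemes, durations, axis=0)   # S203: second frame expansion
    spectrum = spectrum_model(frames)                 # S204: predicted spectral parameters
    return vocoder(spectrum)                          # S205: target synthesized speech waveform
```

Because the durations are predicted explicitly and the frames are expanded deterministically, the output length is fixed before the spectral parameters are generated, which is the source of the stability and controllability the patent claims.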
This embodiment also discloses a stable and controllable end-to-end speech synthesis device; as shown in fig. 3, the device comprises:
the training module 301 is configured to train a preset neural network model by using a preset recording and text data corresponding to the preset recording, to obtain the trained preset neural network model, where the preset neural network model includes a phoneme duration model, a spectral parameter prediction model, and a speech output model;
a first obtaining module 302, configured to obtain a text to be synthesized;
an obtaining module 303, configured to input the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In one embodiment, the above apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
In one embodiment, the training module comprises:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
In one embodiment, the above apparatus further comprises:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
In one embodiment, as shown in FIG. 4, the obtaining module comprises:
the parsing submodule 3031, configured to parse the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule 3032, configured to input the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule 3033, configured to perform the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule 3034, configured to input the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule 3035, configured to input the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention merely refer to different stages of application.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A stable and controllable end-to-end speech synthesis method, characterized by comprising the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
2. The method according to claim 1, wherein before the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of them, and removing silent segments from them;
checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
3. The method according to claim 1, wherein the training of the preset neural network model by using the preset recording and the text data corresponding to the preset recording, where the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model, comprises:
acquiring a representation phoneme sequence and a first phoneme duration from each text data in the second number of text data;
training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as the output of the phoneme duration model;
acquiring second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
extracting spectral parameters of the second number of preset recordings;
training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as the output of the speech output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
4. The method according to claim 1, wherein before the text to be synthesized is obtained, the method further comprises:
acquiring the n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters contain a polyphonic character;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
5. The stable and controllable end-to-end speech synthesis method according to claim 1, wherein the inputting of the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing a second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain predicted spectral parameters;
and inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
6. A stable and controllable end-to-end speech synthesis device, characterized by comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectral parameter prediction model and a speech output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
7. The apparatus according to claim 6, further comprising:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of them and removing silent segments from them;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects comprise: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data, and the second preset voices corresponding to the second number of second text data, as the preset recording and the text data corresponding to the preset recording.
8. The apparatus according to claim 6, wherein the training module comprises:
the first acquiring submodule is used for acquiring the representation phoneme sequence and the first phoneme duration from each text data in the second number of text data;
the first training submodule is used for training the phoneme duration model by taking the second number of representation phoneme sequences as the input of the phoneme duration model and the second number of first phoneme durations as its output;
the second acquiring submodule is used for acquiring the second phoneme durations of the second number of preset recordings by using the trained phoneme duration model;
the first expansion submodule is used for performing the first frame expansion on the second number of representation phoneme sequences according to the second number of second phoneme durations;
the extraction submodule is used for extracting the spectral parameters of the second number of preset recordings;
the second training submodule is used for training the spectral parameter prediction model by taking the second number of first-frame-expanded representation phoneme sequences as the input of the spectral parameter prediction model and the second number of spectral parameters as its output;
the third training submodule is used for training the speech output model by taking the second number of spectral parameters as the input of the speech output model and the second number of preset recordings as its output;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectral parameter prediction model and the speech output model are all trained.
9. The apparatus according to claim 6, further comprising:
the third acquisition module is used for acquiring the n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters contain a polyphonic character;
the fourth acquisition module is used for acquiring a first letter sequence and a second letter sequence of a first Chinese character that is a polyphonic character among the n Chinese characters if the confirming module confirms that the n Chinese characters contain a polyphonic character, wherein the first letter sequence is the sequence of letters spelling the first pronunciation of the first Chinese character, and the second letter sequence is the sequence of letters spelling the second pronunciation of the first Chinese character;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of each second Chinese character, wherein a second Chinese character is any of the n Chinese characters other than the first Chinese character;
and the labeling module is used for labeling the target letter sequence and the third letter sequences onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
10. The apparatus according to claim 6, wherein the obtaining module comprises:
the parsing submodule is used for parsing the text to be synthesized to obtain the target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain the target phoneme duration;
the second expansion submodule is used for performing the second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectral parameter prediction model to obtain the predicted spectral parameters;
and the fourth obtaining submodule is used for inputting the predicted spectral parameters into the trained speech output model to obtain the target synthesized speech.
CN202010275510.5A (filed 2020-04-09, priority date 2020-04-09) - Stable and controllable end-to-end speech synthesis method and device - Active - granted as CN111599338B

Priority Applications (1)

Application Number: CN202010275510.5A - Priority Date: 2020-04-09 - Filing Date: 2020-04-09 - Title: Stable and controllable end-to-end speech synthesis method and device


Publications (2)

CN111599338A - published 2020-08-28
CN111599338B (granted) - published 2023-04-18

Family

Family ID: 72183478

Family Applications (1)

CN202010275510.5A (Active, granted as CN111599338B) - Priority Date: 2020-04-09 - Filing Date: 2020-04-09 - Title: Stable and controllable end-to-end speech synthesis method and device

Country Status (1)

CN: CN111599338B

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104538024A * 2014-12-01 2015-04-22 Baidu Online Network Technology (Beijing) Co Ltd Speech synthesis method, apparatus and equipment
US20160171974A1 2014-12-15 2016-06-16 Baidu USA LLC Systems and methods for speech transcription
CN106531150A * 2016-12-23 2017-03-22 Shanghai Yuzhiyi Information Technology Co Ltd Emotion synthesis method based on deep neural network model
CN108806665A * 2018-09-12 2018-11-13 Baidu Online Network Technology (Beijing) Co Ltd Speech synthesis method and device
CN109767755A * 2019-03-01 2019-05-17 Guangzhou Duoyi Network Co Ltd Speech synthesis method and system
CN110782870A * 2019-09-06 2020-02-11 Tencent Technology (Shenzhen) Co Ltd Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110556129A * 2019-09-09 2019-12-10 Peking University Shenzhen Graduate School Bimodal emotion recognition model training method and bimodal emotion recognition method

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133282A * 2020-10-26 2020-12-25 Xiamen University Lightweight multi-speaker speech synthesis system and electronic equipment
CN112133282B * 2020-10-26 2022-07-08 Xiamen University Lightweight multi-speaker speech synthesis system and electronic equipment
CN113066511A * 2021-03-16 2021-07-02 Unisound Intelligent Technology Co Ltd Voice conversion method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111599338B - published 2023-04-18


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant