CN111599338B - Stable and controllable end-to-end speech synthesis method and device - Google Patents
Stable and controllable end-to-end speech synthesis method and device
- Publication number: CN111599338B (application CN202010275510.5A)
- Authority
- CN
- China
- Prior art keywords
- preset
- model
- phoneme
- text data
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G10L13/02 — Methods for producing synthetic speech; speech synthesisers
- G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G06N3/02 — Neural networks; G06N3/08 — Learning methods
Abstract
The invention discloses a stable and controllable end-to-end speech synthesis method and device, comprising the following steps: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a speech output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech. This solves the prior-art problem that adding a duration control module can improve the stability of synthesized speech to some extent but introduces poor synthesis quality caused by insufficient duration prediction accuracy, and it greatly improves the user experience.
Description
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a stable and controllable end-to-end speech synthesis method and device.
Background
In recent years, as voice technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech brings convenience and richness to social life and has potentially wide use value. However, existing approaches have the following disadvantage: adding a duration control module can improve the stability of synthesized speech to some extent, but it introduces poor synthesis quality caused by insufficient duration prediction accuracy, which degrades the user experience.
Disclosure of Invention
To address the problems described above, the method optimizes the speech synthesis effect by adding a phoneme duration model to a preset neural network model: the preset neural network model is trained using a preset recording and the text data corresponding to that recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, comprising the steps of:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
acquiring a text to be synthesized;
and inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice.
Preferably, before training a preset neural network model by using a preset sound recording and text data corresponding to the preset sound recording, where the preset neural network model includes a phoneme duration model, a spectral parameter prediction model, and a speech output model, the method further includes:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components in the preset number of preset voices, and removing mute components in the preset number of preset voices;
checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and removing a first number of first text data with the defects in the preset number of text data and a first preset voice corresponding to the first text data;
and determining a second number of second text data without the defect in the preset number of text data and a second preset voice corresponding to the second number of second text data as the preset recording and the text data corresponding to the preset recording.
Preferably, the training of the preset neural network model by using the preset sound recording and the text data corresponding to the preset sound recording, where the preset neural network model includes a phoneme duration model, a spectrum parameter prediction model and a speech output model, includes:
acquiring a characterization phoneme sequence and a first phoneme duration for each text data in the second number of text data;
training the phoneme duration model using a second number of the sequence of characterizing phonemes as an input to the phoneme duration model and the second number of the first phoneme durations as an output of the phoneme duration model;
acquiring a second phoneme duration of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of characterizing phoneme sequences according to a second number of second phoneme durations;
extracting the frequency spectrum parameters of the second number of preset sound recordings;
taking a second number of the first frame-expanded characterizing phoneme sequences as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, to train the spectrum parameter prediction model;
training the voice output model by taking the second number of spectral parameters as input of the voice output model and taking the second number of preset recordings as output of the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
Preferably, before obtaining the text to be synthesized, the method further includes:
acquiring n Chinese characters in the text content to be synthesized;
confirming whether the n Chinese characters have polyphone characters;
if so, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming the first tone when the first Chinese character is in the first tone, and the second letter sequence is a second sequence of letters forming the second tone when the first Chinese character is in the second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
Preferably, the inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice includes:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and inputting the predicted spectrum parameters into a trained speech output model to obtain the target synthesized speech.
A stable and controllable end-to-end speech device, the device comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, and the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
the first acquisition module is used for acquiring a text to be synthesized;
and the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice.
Preferably, the apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of voices, filtering noise components in the preset number of voices and removing mute components in the preset number of voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for removing first text data with the defects in the preset number of text data and first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data and a second preset voice corresponding to the second text data as the preset recording and the text data corresponding to the preset recording.
Preferably, the training module includes:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second quantity of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterizing phoneme sequences as an input of the phoneme duration model and using the second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of the second number of preset sound recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of the second number of preset sound recordings;
the second training submodule is used for taking a second number of the first frame-expanded characterizing phoneme sequences as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model to train the spectrum parameter prediction model;
a third training submodule, configured to train the speech output model by using the second number of spectral parameters as inputs of the speech output model and using the second number of preset sound recordings as outputs of the speech output model;
and the first obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
Preferably, the apparatus further comprises:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
a fourth obtaining module, configured to obtain a first letter sequence and a second letter sequence of a first chinese character that is a polyphone in the n chinese characters if the confirming module confirms that the n chinese characters have the polyphone, where the first letter sequence is a first sequence of letters that form the first tone when the first chinese character is a first tone, and the second letter sequence is a second sequence of letters that form the second tone when the first chinese character is a second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
a fifth obtaining module, configured to obtain a third letter sequence of a second Chinese character, where the second Chinese character is a Chinese character of the n Chinese characters except the first Chinese character;
and the marking module is used for marking the target letter sequence and the third letter sequence on the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
Preferably, the obtaining module includes:
the analysis module is used for analyzing the text to be synthesized to obtain a target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
the second expansion submodule is used for carrying out second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and the fourth obtaining submodule is used for inputting the predicted spectrum parameters into the trained speech output model to obtain the target synthesized speech.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is a flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 2 is another flowchart of a stable and controllable end-to-end speech synthesis method provided by the present invention;
FIG. 3 is a block diagram of a stable and controllable end-to-end speech synthesizer according to the present invention;
fig. 4 is another structural diagram of a stable and controllable end-to-end speech synthesis apparatus provided in the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In recent years, as voice technology has matured, speech synthesis has gradually been applied to speech signal processing systems such as voice interaction, voice broadcasting, and personalized voice production. In the social and commercial fields, synthesized speech brings convenience and richness to social life and has potentially wide use value. However, existing approaches have the following disadvantage: adding a duration control module can improve the stability of synthesized speech to some extent, but it introduces poor synthesis quality caused by insufficient duration prediction accuracy, which degrades the user experience. In order to solve the above problems, this embodiment discloses a method that optimizes the speech synthesis effect by adding a phoneme duration model to a preset neural network model: the preset neural network model is trained using a preset recording and the text data corresponding to that recording, and the trained preset neural network model is then used to synthesize speech from the text to be synthesized.
A stable and controllable end-to-end speech synthesis method, as shown in fig. 1, comprising the following steps:
step S101, training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a frequency spectrum parameter prediction model and a voice output model;
s102, acquiring a text to be synthesized;
step S103, inputting a text to be synthesized into a trained preset neural network model to obtain a target synthesized voice;
in this embodiment, a sufficient number of high-quality preset recordings and the corresponding text data are used in advance to train the preset neural network, so that the speech synthesized by the converged model is of high quality. At the same time, a phoneme duration model is added to the preset neural network, which ensures that the duration of the synthesized speech is accurate and that phonemes are not confused. After the model converges, the user inputs the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
The working principle of the technical scheme is as follows: training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model; acquiring a text to be synthesized; and inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized voice.
The beneficial effects of the above technical scheme are: training the preset neural network model with the preset recording and the corresponding text data ensures that the speech synthesized by the trained model is high-quality and stable. In addition, adding the phoneme duration model to the preset neural network model allows the duration of the synthesized speech to be controlled accurately and guarantees correct phonemes, so that the synthesized speech is closer to what the user needs. This solves the prior-art problem that adding a duration control module yields poor synthesis quality because of insufficient duration prediction accuracy, and it greatly improves the user experience.
In one embodiment, before training the preset neural network model by using the preset audio recording and the text data corresponding to the preset audio recording, the preset neural network model includes a phoneme duration model, a spectral parameter prediction model and a speech output model, the method further includes:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing a preset number of preset voices, filtering noise components in the preset number of preset voices, and removing mute components in the preset number of preset voices;
checking whether the text contents of a preset number of text data are defective, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and eliminating a first number of first text data with defects in the preset number of text data and a first preset voice corresponding to the first text data;
and determining a second number of second text data without defects in the preset number of text data and a second preset voice corresponding to the second number of second text data as a preset recording and text data corresponding to the preset recording.
The beneficial effects of the above technical scheme are: the quality of the preset voices and the corresponding text data is further guaranteed, which avoids the situation in which the preset neural network model is trained on incomplete preset recordings and corresponding text data, leaving the converged model imperfect and the synthesized speech defective.
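As an illustration only, the corpus-screening steps above can be sketched in a few lines of Python. The silence trim is a simple amplitude threshold, the defect check is a placeholder for the text review described above, and all function names and data are invented for this sketch, not taken from the embodiment:

```python
# Hedged sketch of corpus cleaning: trim leading/trailing silence from each
# recording and drop (text, recording) pairs whose text is flagged defective.
def trim_silence(samples, threshold=0.01):
    """Strip leading/trailing samples whose magnitude is below threshold."""
    start, end = 0, len(samples)
    while start < end and abs(samples[start]) < threshold:
        start += 1
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]

def is_defective(text):
    """Placeholder defect check: empty or visibly truncated text is rejected."""
    return len(text) == 0 or text.endswith("...")

def clean_corpus(pairs):
    """Keep only non-defective (text, recording) pairs, with silence trimmed."""
    return [(t, trim_silence(r)) for t, r in pairs if not is_defective(t)]

corpus = [("hello", [0.0, 0.5, -0.4, 0.0]), ("truncated...", [0.2])]
print(clean_corpus(corpus))  # [('hello', [0.5, -0.4])]
```

A real system would replace `is_defective` with the manual or automatic content review the embodiment describes, and `trim_silence` with proper voice-activity detection.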
In one embodiment, the preset neural network model is trained by using the preset recording and text data corresponding to the preset recording, and the preset neural network model includes a phoneme duration model, a spectral parameter prediction model and a speech output model, and includes:
acquiring a representation phoneme sequence and a first phoneme duration in each text data in a second number of text data;
training a phoneme duration model by taking a second number of the characterizing phoneme sequences as input of the phoneme duration model and taking a second number of the first phoneme durations as output of the phoneme duration model;
acquiring second phoneme durations of a second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on a second number of the characterising phoneme sequences according to a second number of second phoneme durations;
extracting frequency spectrum parameters of a second number of preset sound recordings;
taking the second number of the characterizing phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, to train the spectrum parameter prediction model;
taking a second number of spectral parameters as the input of the voice output model, and taking a second number of preset recordings as the output of the voice output model to train the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
The beneficial effects of the above technical scheme are: performing the first frame expansion on the second number of characterizing phoneme sequences according to the second number of second phoneme durations expands each sequence so that the characterizing phoneme sequences of the text data and the preset voices correspond to each other frame by frame, making the trained model more accurate.
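The frame-expansion step can be illustrated with a minimal sketch (the function name and toy data are invented for illustration): each phoneme label is repeated once per frame of its predicted duration, so the expanded sequence lines up one-to-one with the spectrum frames.

```python
# Minimal sketch of the "first frame expansion" described above.
def expand_frames(phonemes, durations):
    """Repeat each phoneme label once per frame of its predicted duration."""
    frames = []
    for phoneme, n_frames in zip(phonemes, durations):
        frames.extend([phoneme] * n_frames)
    return frames

# toy phoneme sequence with per-phoneme frame counts
expanded = expand_frames(["n", "i", "h", "ao"], [2, 4, 1, 5])
print(len(expanded))  # 2 + 4 + 1 + 5 = 12 frames
```

The expanded sequence has exactly as many entries as the recording has spectrum frames, which is what lets the spectrum parameter prediction model be trained frame by frame.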
In one embodiment, before obtaining the text to be synthesized, the method further includes:
acquiring n Chinese characters in the text content to be synthesized;
confirming whether n Chinese characters have polyphone characters;
if yes, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming a first tone when the first Chinese character is in a first tone, and the second letter sequence is a second sequence of letters forming a second tone when the first Chinese character is in a second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
The beneficial effects of the above technical scheme are: the pronunciation of each character in the text content to be synthesized can be accurately determined, and the accuracy of the speech synthesis result is improved.
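A toy sketch of the polyphone labeling described above, assuming a hypothetical lookup table of candidate pinyin letter sequences and a crude one-character context rule; the table, the rule, and the letter-sequence notation are all invented for illustration and are not specified by the embodiment:

```python
# Hypothetical polyphone table: a polyphonic character maps to its candidate
# pinyin letter sequences (first tone reading, second reading).
POLYPHONES = {
    # 行: "xing2" (to walk) vs "hang2" (row / institution, as in 银行 "bank")
    "行": ["x-i-ng2", "h-a-ng2"],
}
MONOPHONES = {"银": "y-i-n2", "走": "z-ou3"}  # toy single-reading characters

def label_pinyin(text):
    """Attach a letter sequence to every character, disambiguating polyphones."""
    labels = []
    for i, ch in enumerate(text):
        if ch in POLYPHONES:
            first_seq, second_seq = POLYPHONES[ch]
            # toy context rule: after 银 (bank context), 行 reads hang2
            chosen = second_seq if i > 0 and text[i - 1] == "银" else first_seq
            labels.append((ch, chosen))
        else:
            labels.append((ch, MONOPHONES.get(ch, "?")))
    return labels

print(label_pinyin("银行"))  # [('银', 'y-i-n2'), ('行', 'h-a-ng2')]
```

A production front-end would select the target letter sequence from the full text context (e.g. with a trained classifier or dictionary of multi-character words) rather than a single-neighbor rule.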
In one embodiment, as shown in fig. 2, inputting a text to be synthesized into a trained preset neural network model to obtain a target synthesized speech, including:
step S201, analyzing a text to be synthesized to obtain a target representation phoneme sequence;
step S202, inputting the target representation phoneme sequence into a trained phoneme duration model to obtain a target phoneme duration;
step S203, performing second frame expansion on the target representation phoneme sequence according to the target phoneme duration;
step S204, inputting the target characterization phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and S205, inputting the predicted spectrum parameters into the trained speech output model to obtain the target synthesized speech.
The beneficial effects of the above technical scheme are: the target synthesized speech is accurately synthesized from the predicted spectrum parameters, which further ensures its stability; at the same time, because the whole synthesis process is carried out step by step, the integrity of the target synthesized speech is guaranteed and the final synthesized speech is more complete.
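Steps S201 to S205 can be wired together as in the following sketch. Every stage is a trivial placeholder standing in for a trained neural network, and all names and the dummy outputs are invented for illustration:

```python
# Placeholder pipeline for the five inference steps described above.
def analyze_text(text):                      # S201: text -> phoneme sequence
    return list(text)                        # toy front-end: one "phoneme" per char

def duration_model(phonemes):                # S202: phonemes -> frame counts
    return [3] * len(phonemes)               # dummy model: 3 frames each

def expand(phonemes, durations):             # S203: second frame expansion
    return [p for p, d in zip(phonemes, durations) for _ in range(d)]

def spectrum_model(frames):                  # S204: frames -> spectral params
    return [[0.0, 1.0] for _ in frames]      # dummy 2-dim "spectrum" per frame

def vocoder(spectra):                        # S205: spectra -> waveform samples
    return [sum(frame) for frame in spectra] # dummy: one sample per frame

def synthesize(text):
    phonemes = analyze_text(text)
    durations = duration_model(phonemes)
    frames = expand(phonemes, durations)
    spectra = spectrum_model(frames)
    return vocoder(spectra)

wave = synthesize("ab")
print(len(wave))  # 2 phonemes x 3 frames each = 6 samples
```

The point of the sketch is the data flow: duration prediction fixes the frame count before spectrum prediction, which is what makes the length of the synthesized speech controllable.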
The embodiment also discloses a stable and controllable end-to-end voice device, as shown in fig. 3, the device includes:
the training module 301 is configured to train a preset neural network model by using a preset recording and text data corresponding to the preset recording, to obtain the trained preset neural network model, where the preset neural network model includes a phoneme duration model, a spectrum parameter prediction model, and a speech output model;
a first obtaining module 302, configured to obtain a text to be synthesized;
an obtaining module 303, configured to input the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech.
In one embodiment, the above apparatus further comprises:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing a preset number of preset voices, filtering noise components in the preset number of preset voices and removing mute components in the preset number of preset voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, wherein the defects include: incomplete text content, text content that cannot be read smoothly, and text content with logic problems; and for eliminating a first number of first text data with defects in the preset number of text data and a first preset voice corresponding to the first text data;
the determining module is used for determining a second number of second text data without defects in the preset number of text data and a second preset voice corresponding to the second text data as a preset recording and text data corresponding to the preset recording.
In one embodiment, a training module, comprising:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second number of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterized phoneme sequences as an input of the phoneme duration model and using a second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of a second number of preset recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on a second number of the characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of a second number of preset sound recordings;
the second training submodule is used for taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
the third training submodule is used for taking a second number of frequency spectrum parameters as the input of the voice output model and taking a second number of preset records as the output of the voice output model to train the voice output model;
and the model obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
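The frame expansion performed by the first expansion submodule can be sketched as a length regulator: each phoneme's feature vector is repeated by its duration in frames, so that the expanded phoneme sequence lines up one-to-one with the spectrogram frames the spectrum parameter prediction model must predict. This is an illustrative sketch; the function and variable names are assumptions.

```python
import numpy as np

def expand_frames(phoneme_features, durations):
    """'Frame expansion': repeat each phoneme's feature vector by its
    duration (in frames) so the phoneme sequence aligns with the
    spectrogram frames it must predict."""
    phoneme_features = np.asarray(phoneme_features)
    durations = np.asarray(durations, dtype=int)
    assert len(phoneme_features) == len(durations)
    # np.repeat with a per-row repeat count duplicates row i durations[i] times.
    return np.repeat(phoneme_features, durations, axis=0)
```

Because the output length is fixed in advance by the duration model, the later spectrum prediction cannot skip or repeat phonemes, which is what makes this scheme more stable than attention-based alignment.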
In one embodiment, the above apparatus further comprises:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
the fourth obtaining module is used for obtaining a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone among the n Chinese characters if the confirming module confirms that a polyphone exists among the n Chinese characters, where the first letter sequence is the sequence of letters forming the pronunciation of the first Chinese character in its first tone, and the second letter sequence is the sequence of letters forming the pronunciation of the first Chinese character in its second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
the fifth acquisition module is used for acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and the marking module is used for marking the target letter sequence and the third letter sequence onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
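The polyphone handling of the fourth obtaining and selection modules can be illustrated with a toy lexicon lookup: a polyphonic character's candidate letter sequences (pinyin readings) are disambiguated by the surrounding text, while other characters take their single reading. The lexicon contents and function name below are hypothetical; a real system would use a far larger dictionary or a trained polyphone classifier.

```python
# Hypothetical mini-lexicon: each polyphonic character maps context words
# to the letter sequence (pinyin with tone digit) chosen in that context.
POLYPHONE_LEXICON = {
    "\u884c": {"\u94f6\u884c": "hang2", "\u884c\u8d70": "xing2", "default": "xing2"},  # 行
    "\u4e50": {"\u97f3\u4e50": "yue4", "\u5feb\u4e50": "le4", "default": "le4"},       # 乐
}
MONOPHONE_LEXICON = {"\u94f6": "yin2", "\u97f3": "yin1"}  # 银, 音 (single readings)

def annotate_pinyin(text):
    """Select the target letter sequence per character: polyphones are
    disambiguated by context words found in the text; other characters
    take their single reading.  A toy sketch, not the claimed method."""
    labels = []
    for ch in text:
        if ch in POLYPHONE_LEXICON:
            readings = POLYPHONE_LEXICON[ch]
            chosen = readings["default"]
            for context_word, pinyin in readings.items():
                if context_word != "default" and context_word in text:
                    chosen = pinyin
                    break
            labels.append(chosen)
        else:
            labels.append(MONOPHONE_LEXICON.get(ch, "?"))
    return labels
```

For example, the character 行 is labeled "hang2" inside 银行 (bank) but "xing2" inside 行走 (walk), matching the target-letter-sequence selection the modules describe.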
In one embodiment, as shown in FIG. 4, the obtaining module comprises:
the parsing module 3031 is configured to parse the text to be synthesized to obtain a target representation phoneme sequence;
a second obtaining submodule 3032, configured to input the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
a second extension submodule 3033, configured to perform second frame extension on the target characterization phoneme sequence according to the target phoneme duration;
a third obtaining submodule 3034, configured to input the target representation phoneme sequence after the second frame extension into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and a fourth obtaining submodule 3035, configured to input the predicted spectrum parameter into the trained speech output model to obtain the target synthesized speech.
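The chain of submodules 3031 through 3035 can be sketched as a single inference function in which each stage's output feeds the next: text is parsed to phonemes, durations fix the frame count, the expanded sequence is mapped to spectrum parameters, and a vocoder produces the waveform. The four model callables below are placeholders (assumptions), standing in for the trained submodels.

```python
import numpy as np

def synthesize(text, g2p, duration_model, spectrum_model, vocoder):
    """Step-by-step inference chain (illustrative sketch).  Because the
    duration model fixes the number of frames up front, the later stages
    cannot skip or repeat phonemes, which is the stability property the
    scheme targets."""
    phonemes = g2p(text)                      # text -> representation phoneme sequence
    durations = duration_model(phonemes)      # frames per phoneme
    expanded = np.repeat(np.asarray(phonemes),
                         np.asarray(durations, dtype=int), axis=0)
    spectrum = spectrum_model(expanded)       # predicted spectrum parameters
    return vocoder(spectrum)                  # waveform samples
```

The same function works with stub models for testing, since every stage has a plain array interface rather than an internal attention alignment.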
It will be understood by those skilled in the art that the terms "first" and "second" in the present invention are used only to distinguish elements belonging to different stages of application, and do not imply any order or importance.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
Claims (8)
1. A stable and controllable end-to-end speech synthesis method is characterized by comprising the following steps:
training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, wherein the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
acquiring a text to be synthesized;
inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice;
the method for training the preset neural network model by using the preset recording and the text data corresponding to the preset recording comprises the following steps:
acquiring a characterization phoneme sequence and a first phoneme duration in each text data in a second number of text data;
training the phone duration model using a second number of the characterized phone sequences as an input to the phone duration model and the second number of the first phone durations as an output from the phone duration model;
acquiring a second phoneme duration of the second number of preset recordings by using the trained phoneme duration model;
performing a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
extracting the frequency spectrum parameters of the second number of preset sound recordings;
taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model, and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
training the voice output model by taking the second number of spectral parameters as input of the voice output model and taking the second number of preset recordings as output of the voice output model;
and obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the speech output model are trained.
2. The method of claim 1, wherein before training a predetermined neural network model using a predetermined audio recording and text data corresponding to the predetermined audio recording, the predetermined neural network model comprising a phoneme duration model, a spectral parameter prediction model, and a speech output model, the method further comprises:
acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
preprocessing the preset number of preset voices, filtering noise components out of the preset number of preset voices, and removing silent segments from the preset number of preset voices;
checking whether the text content of the preset number of text data has defects, wherein the defects include: the text content is incomplete, the text content cannot be read smoothly, or the text content has logic problems; and removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and determining a second number of second text data without the defect in the preset number of text data and a second preset voice corresponding to the second number of second text data as the preset recording and the text data corresponding to the preset recording.
3. The method of claim 1, wherein prior to obtaining the text to be synthesized, the method further comprises:
acquiring n Chinese characters in text content to be synthesized;
confirming whether the n Chinese characters have polyphone characters;
if yes, acquiring a first letter sequence and a second letter sequence of a first Chinese character which is a polyphone in the n Chinese characters, wherein the first letter sequence is a first sequence of letters forming a first tone when the first Chinese character is in a first tone, and the second letter sequence is a second sequence of letters forming a second tone when the first Chinese character is in a second tone;
selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
acquiring a third letter sequence of a second Chinese character, wherein the second Chinese character is a Chinese character except the first Chinese character in the n Chinese characters;
and labeling the target letter sequence and the third letter sequence to the Chinese characters corresponding to the n Chinese characters in the text content to be synthesized.
4. The stable and controllable end-to-end speech synthesis method according to claim 1, wherein the inputting the text to be synthesized into the trained preset neural network model to obtain the target synthesized speech comprises:
analyzing the text to be synthesized to obtain a target representation phoneme sequence;
inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
performing second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and inputting the predicted spectrum parameters into a trained speech output model to obtain the target synthesized speech.
5. A stable and controllable end-to-end speech synthesis device, comprising:
the training module is used for training a preset neural network model by using a preset recording and text data corresponding to the preset recording to obtain the trained preset neural network model, and the preset neural network model comprises a phoneme duration model, a spectrum parameter prediction model and a voice output model;
the first acquisition module is used for acquiring a text to be synthesized;
the obtaining module is used for inputting the text to be synthesized into the trained preset neural network model to obtain target synthesized voice;
the training module comprises:
the first obtaining submodule is used for obtaining the representation phoneme sequence and the first phoneme duration in each text data in the second number of text data;
a first training submodule, configured to train the phoneme duration model by using a second number of the characterized phoneme sequences as an input of the phoneme duration model and using the second number of the first phoneme durations as an output of the phoneme duration model;
the second obtaining submodule is used for obtaining second phoneme durations of the second number of preset sound recordings by using the trained phoneme duration model;
a first expansion submodule, configured to perform a first frame expansion on the second number of characterized phoneme sequences according to a second number of second phoneme durations;
the extraction submodule is used for extracting the frequency spectrum parameters of the second number of preset sound recordings;
the second training submodule is used for taking the second number of the characterized phoneme sequences after the first frame expansion as the input of the spectrum parameter prediction model and taking the second number of the spectrum parameters as the output of the spectrum parameter prediction model, so as to train the spectrum parameter prediction model;
a third training sub-module, configured to train the speech output model by using the second number of spectral parameters as input of the speech output model and using the second number of preset sound recordings as output of the speech output model;
and the model obtaining submodule is used for obtaining the trained preset neural network model after the phoneme duration model, the spectrum parameter prediction model and the voice output model are trained.
6. The stable and controllable end-to-end speech synthesis device of claim 5, further comprising:
the second acquisition module is used for acquiring a preset number of preset voices and a preset number of text data corresponding to the preset number of preset voices;
the preprocessing module is used for preprocessing the preset number of preset voices, filtering noise components out of the preset number of preset voices and removing silent segments from the preset number of preset voices;
the checking module is used for checking whether the text content of the preset number of text data has defects, where the defects include: the text content is incomplete, the text content cannot be read smoothly, or the text content has logic problems; and for removing, from the preset number of text data, a first number of first text data having the defects together with the first preset voices corresponding to the first text data;
and the determining module is used for determining a second number of second text data without the defects in the preset number of text data and a second preset voice corresponding to the second text data as the preset recording and the text data corresponding to the preset recording.
7. The stable and controllable end-to-end speech synthesis device of claim 5, further comprising:
the third acquisition module is used for acquiring n Chinese characters in the text content to be synthesized;
the confirming module is used for confirming whether the n Chinese characters have polyphone characters;
a fourth obtaining module, configured to obtain a first letter sequence and a second letter sequence of a first chinese character that is a polyphone in the n chinese characters if the confirming module confirms that the n chinese characters have the polyphone, where the first letter sequence is a first sequence of letters that form the first tone when the first chinese character is a first tone, and the second letter sequence is a second sequence of letters that form the second tone when the first chinese character is a second tone;
the selection module is used for selecting a target letter sequence from the first letter sequence and the second letter sequence according to the text content to be synthesized;
a fifth obtaining module, configured to obtain a third letter sequence of a second Chinese character, where the second Chinese character is a Chinese character of the n Chinese characters except the first Chinese character;
and the marking module is used for marking the target letter sequence and the third letter sequence onto the corresponding Chinese characters among the n Chinese characters in the text content to be synthesized.
8. The stable and controllable end-to-end speech synthesis device of claim 5, wherein the obtaining module comprises:
the analysis module is used for analyzing the text to be synthesized to obtain a target representation phoneme sequence;
the second obtaining submodule is used for inputting the target representation phoneme sequence into the trained phoneme duration model to obtain a target phoneme duration;
the second expansion submodule is used for carrying out second frame expansion on the target characterization phoneme sequence according to the target phoneme duration;
the third obtaining submodule is used for inputting the target representation phoneme sequence after the second frame expansion into the trained spectrum parameter prediction model to obtain a predicted spectrum parameter;
and the fourth obtaining submodule is used for inputting the predicted spectrum parameters into the trained voice output model to obtain the target synthesized voice.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010275510.5A CN111599338B (en) | 2020-04-09 | 2020-04-09 | Stable and controllable end-to-end speech synthesis method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111599338A CN111599338A (en) | 2020-08-28 |
CN111599338B true CN111599338B (en) | 2023-04-18 |
Family
ID=72183478
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010275510.5A Active CN111599338B (en) | 2020-04-09 | 2020-04-09 | Stable and controllable end-to-end speech synthesis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111599338B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112133282B (en) * | 2020-10-26 | 2022-07-08 | 厦门大学 | Lightweight multi-speaker speech synthesis system and electronic equipment |
CN112836276A (en) * | 2021-02-20 | 2021-05-25 | 广东三维家信息科技有限公司 | Method, device, equipment and storage medium for file broadcasting based on household design |
CN113066511B (en) * | 2021-03-16 | 2023-01-24 | 云知声智能科技股份有限公司 | Voice conversion method and device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104538024B (en) * | 2014-12-01 | 2019-03-08 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method, device and equipment |
US10540957B2 (en) * | 2014-12-15 | 2020-01-21 | Baidu Usa Llc | Systems and methods for speech transcription |
CN106531150B (en) * | 2016-12-23 | 2020-02-07 | 云知声(上海)智能科技有限公司 | Emotion synthesis method based on deep neural network model |
CN108806665A (en) * | 2018-09-12 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device |
CN110782870B (en) * | 2019-09-06 | 2023-06-16 | 腾讯科技(深圳)有限公司 | Speech synthesis method, device, electronic equipment and storage medium |
CN110556129B (en) * | 2019-09-09 | 2022-04-19 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
- 2020-04-09: CN202010275510.5A filed (patent CN111599338B), status Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109767755A (en) * | 2019-03-01 | 2019-05-17 | 广州多益网络股份有限公司 | A kind of phoneme synthesizing method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111599338A (en) | 2020-08-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111599338B (en) | Stable and controllable end-to-end speech synthesis method and device | |
CN110148427B (en) | Audio processing method, device, system, storage medium, terminal and server | |
CN107516509B (en) | Voice database construction method and system for news broadcast voice synthesis | |
CN105245917B (en) | A kind of system and method for multi-media voice subtitle generation | |
CN110166816B (en) | Video editing method and system based on voice recognition for artificial intelligence education | |
CN108847215B (en) | Method and device for voice synthesis based on user timbre | |
CN111613224A (en) | Personalized voice synthesis method and device | |
CN109285537B (en) | Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium | |
KR100659212B1 (en) | Language learning system and voice data providing method for language learning | |
CN106328146A (en) | Video subtitle generating method and device | |
CN105448289A (en) | Speech synthesis method, speech synthesis device, speech deletion method, speech deletion device and speech deletion and synthesis method | |
CN112420015B (en) | Audio synthesis method, device, equipment and computer readable storage medium | |
CN110390928B (en) | Method and system for training speech synthesis model of automatic expansion corpus | |
CN110111778A (en) | A kind of method of speech processing, device, storage medium and electronic equipment | |
CN110738981A (en) | interaction method based on intelligent voice call answering | |
CN112542158A (en) | Voice analysis method, system, electronic device and storage medium | |
CN110853627B (en) | Method and system for voice annotation | |
CN114927122A (en) | Emotional voice synthesis method and synthesis device | |
CN111383627B (en) | Voice data processing method, device, equipment and medium | |
CN111785236A (en) | Automatic composition method based on motivational extraction model and neural network | |
CN111354325A (en) | Automatic word and song creation system and method thereof | |
CN111429878B (en) | Self-adaptive voice synthesis method and device | |
CN113572977B (en) | Video production method and device | |
CN115472185A (en) | Voice generation method, device, equipment and storage medium | |
CN114267326A (en) | Training method and device of voice synthesis system and voice synthesis method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||