CN110136691B - Speech synthesis model training method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN110136691B
CN110136691B (application CN201910455396.1A)
Authority
CN
China
Prior art keywords
voice
text
corpus
synthesis model
speech
Prior art date
Legal status
Active
Application number
CN201910455396.1A
Other languages
Chinese (zh)
Other versions
CN110136691A
Inventor
徐波 (Xu Bo)
Current Assignee
Duoyi Network Co ltd
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Guangzhou Duoyi Network Co ltd
Original Assignee
Duoyi Network Co ltd
GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD
Guangzhou Duoyi Network Co ltd
Priority date
Filing date
Publication date
Application filed by Duoyi Network Co ltd, GUANGDONG LIWEI NETWORK TECHNOLOGY CO LTD, and Guangzhou Duoyi Network Co ltd
Priority to CN201910455396.1A
Publication of CN110136691A
Application granted
Publication of CN110136691B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 - Architecture of speech synthesisers
    • G10L 15/00 - Speech recognition
    • G10L 15/26 - Speech to text systems

Abstract

The invention discloses a speech synthesis model training method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: constructing an original speech synthesis model based on a deep learning method; acquiring a pre-constructed basic text-speech parallel corpus and training the original speech synthesis model on it to obtain a basic speech synthesis model; and acquiring a pre-generated target text-speech parallel corpus and performing optimization training of the basic speech synthesis model on it to obtain a target speech synthesis model, where the target text-speech parallel corpus is a corpus meeting a preset speech synthesis requirement. The method can train the target speech synthesis model using only a small-scale target text-speech parallel corpus, effectively shortening the development cycle of speech synthesis technology and reducing development cost.

Description

Speech synthesis model training method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a method and an apparatus for training a speech synthesis model, an electronic device, and a storage medium.
Background
Speech synthesis is the technology of converting text information into speech information. Currently, the widely used approaches are parameter-based (statistical parametric) speech synthesis and deep-learning-based speech synthesis.
In statistical parametric speech synthesis, the text is first abstracted into phonetic features, speech parameters are generated by a statistical model, acoustic features are predicted, and a vocoder synthesizes and outputs the speech. The statistical model must be trained on a large amount of high-quality corpus data to learn the correspondence between phonetic features and acoustic features, so that its acoustic-feature predictions and speech parameters are accurate.
In deep-learning-based speech synthesis, a neural network can directly learn the correspondence between text and acoustic features and produce natural, accurate synthesized speech for different texts. However, such a speech synthesis model has a very large number of parameters, and obtaining a model that stably outputs high-quality speech requires a large amount of high-quality corpus data as training material.
Thus, in the prior art, a large amount of high-quality corpus data must be used for training in order to achieve high-quality speech synthesis. Because a high-quality corpus takes a long time and a high cost to acquire, the prior-art training schemes lead to long development cycles and high development costs for speech synthesis technology.
Disclosure of Invention
The technical problem to be solved by the embodiments of the present invention is to provide a speech synthesis model training method and apparatus, an electronic device, and a storage medium that can train a target speech synthesis model satisfying a higher speech synthesis requirement using only a small amount of target text-speech parallel corpus, thereby effectively shortening the development cycle of speech synthesis technology and reducing development cost.
In a first aspect, an embodiment of the present invention provides a method for training a speech synthesis model, where the method includes:
constructing an original speech synthesis model based on a deep learning method;
acquiring a pre-constructed basic text voice parallel corpus, and training the original voice synthesis model according to the basic text voice parallel corpus to obtain a basic voice synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
acquiring a pre-generated target text voice parallel corpus, and optimally training the basic voice synthesis model according to the target text voice parallel corpus to obtain a target voice synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
Further, the basic text-to-speech parallel corpus is constructed and generated through the following steps:
determining a second word voice corresponding to each second word text in the text corpus according to the voice database;
arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus;
and constructing and generating the basic text voice parallel corpus according to the text corpus and the text voice corpus.
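The splicing step above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the voice database is a simple word-to-waveform mapping and that a waveform is a plain list of samples; neither representation is specified by the patent.

```python
def build_base_corpus(text_corpus, voice_db):
    """Splice per-word recordings into sentence-level speech, paired with the text.

    text_corpus: list of sentences whose words are whitespace-separated
                 (a simplification; real tokenization is more involved).
    voice_db:    hypothetical mapping from a word text to its recorded
                 waveform, represented here as a list of samples.
    """
    corpus = []
    for sentence in text_corpus:
        audio = []
        for word in sentence.split():
            # arrange and splice the word speech in the text's word order
            audio.extend(voice_db[word])
        corpus.append((sentence, audio))  # strictly aligned (text, speech) pair
    return corpus
```

Because every pair is produced mechanically from the database, a large basic corpus can be generated at low cost, which is the point of this construction.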
Further, the method generates the target text-to-speech parallel corpus by the following steps:
acquiring a preset recording text;
collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements;
and generating the target text voice parallel corpus according to the recording text and the recording voice.
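The recording-based construction above can be sketched as a pairing step with a quality gate. The `meets_quality` predicate is a hypothetical stand-in for the patent's "preset voice quality requirement" (for example an SNR check); the patent does not specify how the requirement is tested.

```python
def make_recorded_corpus(recording_texts, recordings, meets_quality):
    """Pair each preset recording text with its collected recording,
    keeping only recordings that satisfy the voice-quality requirement.

    meets_quality: hypothetical predicate over a recording, standing in
    for the 'preset voice quality requirement'.
    """
    corpus = []
    for text, audio in zip(recording_texts, recordings):
        if meets_quality(audio):
            corpus.append((text, audio))  # aligned (text, speech) pair
    return corpus
```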
Further, the method generates the target text-to-speech parallel corpus by the following steps:
collecting audio corpora meeting preset requirements;
performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus;
after the recognition text is corrected, acquiring the corrected recognition text;
and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
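The recognition-based pipeline above can be sketched as follows. Both `recognize` and `correct` are hypothetical stand-ins: `recognize` is any ASR function mapping audio to text, and `correct` represents the manual or automatic correction step; the patent names neither a concrete ASR system nor a correction tool.

```python
def corpus_from_audio(audio_clips, recognize, correct):
    """Build the target parallel corpus from collected audio:
    recognize the speech, correct the recognition text, then pair
    the corrected text with the original audio.
    """
    corpus = []
    for clip in audio_clips:
        raw = recognize(clip)        # speech recognition on the audio corpus
        fixed = correct(raw)         # correct recognition errors
        corpus.append((fixed, clip)) # parallel (text, speech) pair
    return corpus
```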
Further, the original speech synthesis model is any one of a Tacotron model, a GST model, or a Deep Voice model.
In a second aspect, an embodiment of the present invention further provides a speech synthesis model training apparatus, where the apparatus includes:
the original speech synthesis model building module is used for building an original speech synthesis model based on a deep learning method;
the basic speech synthesis model obtaining module is used for obtaining a pre-constructed basic text speech parallel corpus and training the original speech synthesis model according to the basic text speech parallel corpus to obtain a basic speech synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
the target speech synthesis model obtaining module is used for obtaining a pre-generated target text speech parallel corpus, and optimally training the basic speech synthesis model according to the target text speech parallel corpus to obtain a target speech synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
Further, the device further includes the basic text-to-speech parallel corpus building module, where the basic text-to-speech parallel corpus building module is specifically configured to:
determining a second word voice corresponding to each second word text in the text corpus according to the voice database;
arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus;
and constructing and generating the basic text voice parallel corpus according to the text corpus and the text voice corpus.
Further, the device further includes the target text-to-speech parallel corpus establishing module, where the target text-to-speech parallel corpus establishing module is specifically configured to:
acquiring a preset recording text;
collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements;
generating the target text voice parallel corpus according to the recording text and the recording voice; or, alternatively,
the target text voice parallel corpus building module is specifically configured to:
collecting audio corpora meeting preset requirements;
performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus;
after the recognition text is corrected, acquiring the corrected recognition text;
and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, and when the processor executes the computer program, the processor implements the speech synthesis model training method according to any one of the above-mentioned first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where, when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the speech synthesis model training method according to any one of the above-mentioned methods provided in the first aspect.
With the speech synthesis model training method and apparatus, electronic device, and computer-readable storage medium described above, a target speech synthesis model meeting a higher speech synthesis requirement can be obtained from only a small amount of target text-speech parallel corpus, without building the model from a large amount of high-quality corpus data, effectively shortening the development cycle of speech synthesis technology and reducing development cost.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of a speech synthesis model training method provided by the present invention;
FIG. 2 is a block diagram of a preferred embodiment of a speech synthesis model training apparatus according to the present invention;
fig. 3 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a flowchart illustrating a method for training a speech synthesis model according to a preferred embodiment of the method for training a speech synthesis model provided by the present invention; specifically, the method for training the speech synthesis model includes:
s1, constructing an original speech synthesis model based on a deep learning method;
s2, acquiring a pre-constructed basic text voice parallel corpus, and training the original voice synthesis model according to the basic text voice parallel corpus to obtain a basic voice synthesis model; the basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
s3, obtaining a pre-generated target text voice parallel corpus, and optimally training the basic voice synthesis model according to the target text voice parallel corpus to obtain a target voice synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
Specifically, an original speech synthesis model is constructed based on a deep learning method. This original model has deep learning capability, but its parameters have not yet been adjusted on any training data; that is, it has not learned the basic correspondences between features and cannot yet perform speech synthesis. To obtain a model with speech synthesis capability, a pre-constructed basic text-speech parallel corpus is acquired, built from a preset voice database and a text corpus, and the original speech synthesis model is trained on it. During training, the model continuously learns the correspondences between the texts and speech parameters such as tone and rhythm, learns the correspondence between linguistic context and acoustic features, and iteratively updates all of its parameters until convergence, yielding the basic speech synthesis model. The resulting basic speech synthesis model has learned the text-speech correspondences in the basic corpus and possesses basic speech synthesis capability.
After the basic speech synthesis model is obtained, to make its output speech of higher quality, it is trained further: a pre-generated target text-speech parallel corpus is acquired, and the basic speech synthesis model undergoes optimization training on it. During this training, the basic model learns the correspondence between text and speech in the target corpus, and its parameters are further updated iteratively until convergence, yielding the target speech synthesis model. Because the target text-speech parallel corpus meets the speech synthesis requirement, the target model trained on it can satisfy a higher speech synthesis requirement; in subsequent use, given an input text, it outputs speech meeting that requirement.
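The two-stage scheme just described, train on a large cheap corpus and then continue training the same parameters on a small high-quality one, can be illustrated with a deliberately tiny stand-in: a one-parameter least-squares model fitted by gradient descent. This is not an actual neural speech synthesizer; the data, learning rates, and step counts are all invented for illustration.

```python
def sgd_fit(w, data, lr, steps):
    """Iteratively update the parameter until approximate convergence,
    standing in for the iterative parameter updates described above."""
    for _ in range(steps):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Stage 1: large, cheap "basic corpus" following a slightly off-target
# relation y = 1.1 x (analogous to ordinary, less natural speech).
base_data = [(float(x), 1.1 * x) for x in range(1, 20)]
w_base = sgd_fit(0.0, base_data, lr=0.001, steps=50)

# Stage 2: small, high-quality "target corpus" with the desired relation
# y = x; optimization training starts from the basic model's parameter.
target_data = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
w_target = sgd_fit(w_base, target_data, lr=0.01, steps=50)
```

The fine-tuned parameter tracks the small target data closely even though most of the training signal came from the large basic data, which mirrors the patent's claim that only a small target corpus is needed once a basic model exists.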
It should be noted that the basic speech synthesis model only needs basic speech synthesis capability, so the training method provided by the invention requires only a large-scale basic text-speech parallel corpus and a small-scale target text-speech parallel corpus. Both corpora consist of word texts and word speech strictly aligned with those texts. The speech synthesis requirement is that the synthesized speech be correct, clear, and natural, with tone and rhythm matching the linguistic context.
In the speech synthesis model training method provided by this embodiment, the target speech synthesis model is obtained by training the basic speech synthesis model on the target text-speech parallel corpus. The basic text-speech parallel corpus therefore only needs to be ordinary, even with slightly poor speech naturalness, which lowers the quality requirements on the texts and speech in the voice database. A target speech synthesis model meeting a higher speech synthesis requirement can be obtained merely by fine-tuning the basic model through small-scale training on the target corpus, and the construction cost of the basic corpus is far lower than that of the target corpus. Consequently, a target model meeting a higher speech synthesis requirement can be obtained from only a small amount of target text-speech parallel corpus, without building the model from a large amount of high-quality corpus data, effectively shortening the development cycle of speech synthesis technology and reducing development cost.
Optionally, the target text-speech parallel corpus is formed from a preset number of word texts and, for each, word speech meeting the speech synthesis requirement. With the training method provided by this embodiment, only about 2,000 target text-speech parallel pairs, roughly 2.8 hours of speech, are needed; the target corpus is obtained by collecting about 1,500 word texts and their corresponding word speech. Compared with the prior art, which must collect a high-quality corpus of at least 50,000 sentences and dozens of hours of speech, taking at least 3 months, the method effectively shortens the development cycle of speech synthesis technology, improves development efficiency, and reduces development cost.
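As a sanity check on the figures quoted above, 2,000 utterances totaling about 2.8 hours implies an average utterance length of roughly five seconds:

```python
hours = 2.8
utterances = 2000
avg_seconds = hours * 3600 / utterances  # average length of one utterance
```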
Preferably, the basic text-to-speech parallel corpus is constructed and generated through the following steps:
determining a second word voice corresponding to each second word text in the text corpus according to the voice database;
arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus;
and constructing and generating the basic text voice parallel corpus according to the text corpus and the text voice corpus.
Specifically, the second word speech corresponding to each second word text in the text corpus is determined according to the voice database; all second word speeches are arranged and spliced according to the order of the second word texts in the text corpus, so that each second word speech is strictly aligned with its corresponding second word text, yielding the text-speech corpus; the basic text-speech parallel corpus is then constructed from the text corpus and the text-speech corpus.
According to the speech synthesis model training method provided by the embodiment of the invention, the speech database is used for determining the second character speech corresponding to the second character text, a large number of basic text speech parallel corpora can be automatically spliced and generated, the construction cost is low, the original speech synthesis model can learn the corresponding relation between the basic text and the speech, and the basic speech synthesis model is updated and obtained.
Preferably, the method generates the target text-to-speech parallel corpus by:
acquiring a preset recording text;
collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements;
and generating the target text voice parallel corpus according to the recording text and the recording voice.
To obtain a high-quality corpus so that the target text-speech parallel corpus meets the speech synthesis requirement, speech recorded by a professional voice recorder is used as the high-quality speech data. Specifically, a preset recording text is acquired; the recorded speech corresponding to the recording text is collected; and the recording text and recorded speech are strictly aligned to generate the target text-speech parallel corpus.
The method of this embodiment directly uses the professional pronunciation of a voice recorder to obtain speech meeting the preset voice quality requirement; such speech better matches the tone and rhythm of natural human speech, further shortening the development cycle and improving development efficiency.
Preferably, the method generates the target text-to-speech parallel corpus by:
collecting audio corpora meeting preset requirements;
performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus;
after the recognition text is corrected, acquiring the corrected recognition text;
and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
To obtain a high-quality corpus so that the target text-speech parallel corpus meets the speech synthesis requirement, collected audio corpora are used as the speech data and combined with the corrected recognition texts to obtain the target text-speech parallel corpus.
It should be noted that the modification process may be a manual modification process, or the recognized text may be modified by using the existing modification technology, as long as the accuracy of the recognized text can be improved, and the method is suitable for generating the target text-to-speech parallel corpus meeting the speech synthesis requirement.
In the method of this embodiment, the audio corpus is used directly, and existing speech recognition technology quickly yields a preliminary text for each recording. After correction, the recognition text corresponds to the audio corpus more accurately, so the target text-speech parallel corpus generated from the corrected text and the audio has an accurate text-speech correspondence. The target corpus can thus be generated quickly and accurately, improving the construction efficiency and prediction accuracy of the target speech synthesis model, allowing a target model meeting a higher speech synthesis requirement to be trained, and effectively shortening the development cycle of speech synthesis technology.
Preferably, the original speech synthesis model is any one of a Tacotron model, a GST model, or a Deep Voice model.
It should be noted that the Tacotron model is an end-to-end deep learning speech synthesis model; once trained, it maps input text directly to the corresponding audio.
The GST (Global Style Tokens) model embeds a prosody encoder on top of the Tacotron model, enabling synthesis with a specific prosodic style.
The Deep Voice model is a high-quality speech synthesis system constructed entirely from deep neural networks, realizing true end-to-end speech synthesis.
Fig. 2 is a block diagram of a preferred embodiment of a speech synthesis model training apparatus according to the present invention; specifically, the speech synthesis model training device includes:
an original speech synthesis model construction module 11, configured to construct an original speech synthesis model based on a deep learning method;
a basic speech synthesis model obtaining module 12, configured to obtain a pre-constructed basic text-speech parallel corpus, and train the original speech synthesis model according to the basic text-speech parallel corpus to obtain a basic speech synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
a target speech synthesis model obtaining module 13, configured to obtain a pre-generated target text-speech parallel corpus, and optimally train the basic speech synthesis model according to the target text-speech parallel corpus to obtain a target speech synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
According to the speech synthesis model training device provided by the embodiment of the invention, an original speech synthesis model based on a deep learning method is constructed through an original speech synthesis model construction module 11; acquiring a pre-constructed basic text voice parallel corpus through a basic voice synthesis model acquisition module 12, and training the original voice synthesis model according to the basic text voice parallel corpus to acquire a basic voice synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts; a target speech synthesis model obtaining module 13 obtains a pre-generated target text speech parallel corpus, and optimally trains the basic speech synthesis model according to the target text speech parallel corpus to obtain a target speech synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
In the speech synthesis model training apparatus provided by this embodiment, the target speech synthesis model is obtained by training the basic speech synthesis model on the target text-speech parallel corpus. The basic text-speech parallel corpus therefore only needs to be ordinary, even with slightly poor speech naturalness, which lowers the quality requirements on the texts and speech in the voice database. A target speech synthesis model meeting a higher speech synthesis requirement can be obtained merely by fine-tuning the basic model through small-scale training on the target corpus, and the construction cost of the basic corpus is far lower than that of the target corpus. Consequently, a target model meeting a higher speech synthesis requirement can be obtained from only a small amount of target text-speech parallel corpus, without building the model from a large amount of high-quality corpus data, effectively shortening the development cycle of speech synthesis technology and reducing development cost.
Preferably, the device further includes the basic text-to-speech parallel corpus building module, where the basic text-to-speech parallel corpus building module is specifically configured to:
determining a second word voice corresponding to each second word text in the text corpus according to the voice database;
arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus;
and constructing and generating the basic text voice parallel corpus according to the text corpus and the text voice corpus.
Preferably, the apparatus further includes the target text-to-speech parallel corpus building module, where the target text-to-speech parallel corpus building module is specifically configured to:
acquiring a preset recording text;
collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements;
generating the target text voice parallel corpus according to the recording text and the recording voice; or, alternatively,
the target text voice parallel corpus building module is specifically configured to:
collecting audio corpora meeting preset requirements;
performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus;
after the recognition text is corrected, acquiring the corrected recognition text;
and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
Preferably, the original speech synthesis model is any one of a Tacotron model, a GST model, or a Deep Voice model.
It should be noted that, the speech synthesis model training apparatus provided in the embodiment of the present invention is used for executing the steps of the speech synthesis model training method described in the above embodiment, and the working principles and beneficial effects of the two are in one-to-one correspondence, and thus are not described again.
It will be understood by those skilled in the art that the schematic diagram of the speech synthesis model training apparatus is merely an example of the speech synthesis model training apparatus, and does not constitute a limitation of the speech synthesis model training apparatus, and may include more or less components than those shown, or combine some components, or different components, for example, the speech synthesis model training apparatus may further include an input-output device, a network access device, a bus, etc.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. In addition, in the drawings of the embodiment of the apparatus provided by the present invention, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
An embodiment of the present invention further provides an electronic device, please refer to fig. 3, which is a schematic structural diagram of an electronic device according to the present invention; specifically, the electronic device includes a processor 10, a memory 20, and a computer program stored in the memory and configured to be executed by the processor, and the processor implements the speech synthesis model training method according to any one of the embodiments when executing the computer program.
Specifically, the processor and the memory in the electronic device can be one or more, and the electronic device can be a computer, a server, a cloud device and the like.
The electronic device of the embodiment includes: a processor, a memory, and a computer program stored in the memory and executable on the processor. The processor, when executing the computer program, implements the steps in the speech synthesis model training method provided by the above embodiment, for example, step S1 shown in fig. 1: constructing an original speech synthesis model based on a deep learning method. Alternatively, the processor, when executing the computer program, implements the functions of the modules in the above apparatus embodiments, for example, the original speech synthesis model building module 11, which is used for building an original speech synthesis model based on a deep learning method.
Illustratively, the computer program may be divided into one or more modules/units (e.g., computer program 1, computer program 2 … … shown in fig. 3) that are stored in the memory and executed by the processor to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the electronic device. For example, the computer program may be divided into an original speech synthesis model building module 11, a basic speech synthesis model obtaining module 12, and a target speech synthesis model obtaining module 13, where the specific functions of each module are as follows:
an original speech synthesis model construction module 11, configured to construct an original speech synthesis model based on a deep learning method;
a basic speech synthesis model obtaining module 12, configured to obtain a pre-constructed basic text-speech parallel corpus, and train the original speech synthesis model according to the basic text-speech parallel corpus to obtain a basic speech synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
a target speech synthesis model obtaining module 13, configured to obtain a pre-generated target text-speech parallel corpus, and optimally train the basic speech synthesis model according to the target text-speech parallel corpus to obtain a target speech synthesis model; and the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement.
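The three-module decomposition above (build the original model, train it on the basic corpus, then tune it on the target corpus) can be mirrored in a toy orchestration. All class names and the dictionary "model" representation below are hypothetical placeholders, not the patented code:

```python
# Hypothetical decomposition mirroring modules 11, 12 and 13 described above.
class OriginalModelBuilder:              # module 11: build the original model
    def build(self):
        return {"weights": None}         # placeholder for a deep-learning model

class BasicModelTrainer:                 # module 12: train on the basic corpus
    def train(self, model, basic_corpus):
        model["weights"] = f"trained on {len(basic_corpus)} basic pairs"
        return model

class TargetModelTuner:                  # module 13: optimize on the target corpus
    def tune(self, model, target_corpus):
        model["weights"] += f", tuned on {len(target_corpus)} target pairs"
        return model

model = OriginalModelBuilder().build()
model = BasicModelTrainer().train(model, basic_corpus=[("text", "speech")] * 3)
model = TargetModelTuner().tune(model, target_corpus=[("text", "speech")] * 2)
print(model["weights"])  # "trained on 3 basic pairs, tuned on 2 target pairs"
```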
The Processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the electronic device and connects the various parts of the overall electronic device using various interfaces and wires.
The memory may be used to store the computer programs and/or modules, and the processor implements various functions of the electronic device by running or executing the computer programs and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
If the integrated modules/units of the electronic device are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on such understanding, all or part of the processes in the speech synthesis model training method provided in the foregoing embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and used by a processor to implement the steps in the speech synthesis model training method provided in any of the foregoing embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunications signals.
It should be noted that the above-mentioned electronic device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural diagram of fig. 3 is only an example of the electronic device and does not limit it; the electronic device may include more or fewer components than those shown in the drawings, combine some components, or use different components.
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, a device on which the computer-readable storage medium is located is controlled to execute the speech synthesis model training method according to any one of the above embodiments.
In summary, the present invention provides a method for training a speech synthesis model, an apparatus for training a speech synthesis model, an electronic device and a computer-readable storage medium, which have the following advantages:
The target speech synthesis model is obtained by first training on a large-scale, low-cost basic text voice parallel corpus and then fine-tuning on a small-scale, high-quality target text voice parallel corpus. A large amount of high-quality corpora is therefore not needed to construct the model, which effectively shortens the research and development period of the speech synthesis technology and reduces the research and development cost.
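The two-stage strategy summarized above — pre-train on a large low-cost corpus, then fine-tune on a small high-quality one — can be illustrated with a deliberately tiny stand-in: a one-parameter linear model trained by gradient descent, not an actual speech synthesizer. The data and hyperparameters are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 data: large, low-cost "basic" corpus with noisy supervision.
x_basic = rng.uniform(-1, 1, 1000)
y_basic = 2.0 * x_basic + rng.normal(0, 0.5, 1000)

# Stage 2 data: small, high-quality "target" corpus with clean supervision.
x_target = rng.uniform(-1, 1, 20)
y_target = 2.0 * x_target

def sgd_fit(w, x, y, lr, steps):
    """Minimize mean squared error of the model y ≈ w * x by gradient descent."""
    for _ in range(steps):
        grad = np.mean(2 * (w * x - y) * x)
        w -= lr * grad
    return w

w = 0.0
w = sgd_fit(w, x_basic, y_basic, lr=0.5, steps=50)    # pre-training stage
loss_before = np.mean((w * x_target - y_target) ** 2)
w = sgd_fit(w, x_target, y_target, lr=0.5, steps=50)  # fine-tuning stage
loss_after = np.mean((w * x_target - y_target) ** 2)
assert loss_after <= loss_before  # tuning on clean data improves target fit
```

The pre-training stage gets the parameter close to the true value cheaply; the short fine-tuning stage on the small clean set closes the remaining gap, which is the economic argument the paragraph above makes.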
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (5)

1. A method for training a speech synthesis model, the method comprising:
constructing an original speech synthesis model based on a deep learning method;
acquiring a pre-constructed basic text voice parallel corpus, and training the original voice synthesis model according to the basic text voice parallel corpus to obtain a basic voice synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
acquiring a pre-generated target text voice parallel corpus, and optimally training the basic voice synthesis model according to the target text voice parallel corpus to obtain a target voice synthesis model; the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement;
the basic text voice parallel corpus is constructed and generated through the following steps:
determining a second word voice corresponding to each second word text in the text corpus according to the voice database;
arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus;
according to the text corpus and the text voice corpus, constructing and generating the basic text voice parallel corpus;
generating the target text voice parallel corpus by the following steps:
acquiring a preset recording text;
collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements;
generating the target text voice parallel corpus according to the recording text and the recording voice;
or, generating the target text-to-speech parallel corpus by the following steps:
collecting audio corpora meeting preset requirements;
performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus;
correcting the recognition text, and acquiring the corrected recognition text;
and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
2. The method of claim 1, wherein the original speech synthesis model is any one of a Tacotron model, a GST model, or a Deep Voice model.
3. A speech synthesis model training apparatus, characterized in that the apparatus comprises:
the original speech synthesis model building module is used for building an original speech synthesis model based on a deep learning method;
the basic speech synthesis model obtaining module is used for obtaining a pre-constructed basic text speech parallel corpus and training the original speech synthesis model according to the basic text speech parallel corpus to obtain a basic speech synthesis model; each basic text voice parallel corpus is constructed and generated according to a preset voice database and a text corpus, wherein the voice database comprises a plurality of first word texts and first word voices corresponding to the first word texts;
the target speech synthesis model obtaining module is used for obtaining a pre-generated target text speech parallel corpus, and optimally training the basic speech synthesis model according to the target text speech parallel corpus to obtain a target speech synthesis model; the target text voice parallel corpus is a corpus meeting a preset voice synthesis requirement;
the basic text voice parallel corpus building module is specifically used for: determining a second word voice corresponding to each second word text in the text corpus according to the voice database; arranging and splicing all second word voices according to the arrangement sequence of the second word texts in the text corpus to obtain a text voice corpus; according to the text corpus and the text voice corpus, constructing and generating the basic text voice parallel corpus;
the target text voice parallel corpus building module is specifically configured to: acquiring a preset recording text; collecting a recording voice corresponding to the recording text; the recorded voice is voice meeting preset voice quality requirements; generating the target text voice parallel corpus according to the recording text and the recording voice; or, the target text-to-speech parallel corpus constructing module is specifically configured to: collecting audio corpora meeting preset requirements; performing voice recognition on the audio corpus to obtain a recognition text corresponding to the audio corpus; correcting the recognition text, and acquiring the corrected recognition text; and generating the target text voice parallel corpus according to the corrected recognition text and the audio corpus.
4. An electronic device comprising a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, the processor implementing the speech synthesis model training method according to any one of claims 1 to 2 when executing the computer program.
5. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium is located to perform the speech synthesis model training method according to any one of claims 1-2.
CN201910455396.1A 2019-05-28 2019-05-28 Speech synthesis model training method and device, electronic equipment and storage medium Active CN110136691B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910455396.1A CN110136691B (en) 2019-05-28 2019-05-28 Speech synthesis model training method and device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110136691A CN110136691A (en) 2019-08-16
CN110136691B true CN110136691B (en) 2021-09-28

Family

ID=67582515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910455396.1A Active CN110136691B (en) 2019-05-28 2019-05-28 Speech synthesis model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110136691B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110600000B (en) * 2019-09-29 2022-04-15 阿波罗智联(北京)科技有限公司 Voice broadcasting method and device, electronic equipment and storage medium
CN113299272B (en) * 2020-02-06 2023-10-31 菜鸟智能物流控股有限公司 Speech synthesis model training and speech synthesis method, equipment and storage medium
CN111326136B (en) * 2020-02-13 2022-10-14 腾讯科技(深圳)有限公司 Voice processing method and device, electronic equipment and storage medium
CN111210803B (en) * 2020-04-21 2021-08-03 南京硅基智能科技有限公司 System and method for training clone timbre and rhythm based on Bottle sock characteristics
CN111508470B (en) * 2020-04-26 2024-04-12 北京声智科技有限公司 Training method and device for speech synthesis model
CN111540345B (en) * 2020-05-09 2022-06-24 北京大牛儿科技发展有限公司 Weakly supervised speech recognition model training method and device
CN111986646B (en) * 2020-08-17 2023-12-15 云知声智能科技股份有限公司 Dialect synthesis method and system based on small corpus
CN111968616A (en) * 2020-08-19 2020-11-20 浙江同花顺智能科技有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN111968617B (en) * 2020-08-25 2024-03-15 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112102811B (en) * 2020-11-04 2021-03-02 北京淇瑀信息科技有限公司 Optimization method and device for synthesized voice and electronic equipment
CN113421547B (en) * 2021-06-03 2023-03-17 华为技术有限公司 Voice processing method and related equipment
CN114708849A (en) * 2022-04-27 2022-07-05 网易(杭州)网络有限公司 Voice processing method and device, computer equipment and computer readable storage medium

Citations (1)

Publication number Priority date Publication date Assignee Title
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US7266497B2 (en) * 2002-03-29 2007-09-04 At&T Corp. Automatic segmentation in speech synthesis
US7983919B2 (en) * 2007-08-09 2011-07-19 At&T Intellectual Property Ii, L.P. System and method for performing speech synthesis with a cache of phoneme sequences

Patent Citations (1)

Publication number Priority date Publication date Assignee Title
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model

Non-Patent Citations (1)

Title
Speech Synthesis System Based on the Xinwen Lianbo News Corpus; Tang Shengliang, Zhang Shili, Zhang Zhiping, Wu Xihong, Chi Huisheng; Proceedings of the 8th National Conference on Man-Machine Speech Communication; 31 October 2005; page 2, column 1, paragraphs 2 and 4, and column 2, paragraphs 1-2 *

Also Published As

Publication number Publication date
CN110136691A (en) 2019-08-16

Similar Documents

Publication Publication Date Title
CN110136691B (en) Speech synthesis model training method and device, electronic equipment and storage medium
US10410621B2 (en) Training method for multiple personalized acoustic models, and voice synthesis method and device
CN107464554B (en) Method and device for generating speech synthesis model
CN105845125B (en) Phoneme synthesizing method and speech synthetic device
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
US20190088253A1 (en) Method and apparatus for converting english speech information into text
CN107481715B (en) Method and apparatus for generating information
CN107705782B (en) Method and device for determining phoneme pronunciation duration
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
CN111128116B (en) Voice processing method and device, computing equipment and storage medium
CN111433847A (en) Speech conversion method and training method, intelligent device and storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN113822017A (en) Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113380222A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111613224A (en) Personalized voice synthesis method and device
WO2021169825A1 (en) Speech synthesis method and apparatus, device and storage medium
CN112185341A (en) Dubbing method, apparatus, device and storage medium based on speech synthesis
CN111383627B (en) Voice data processing method, device, equipment and medium
CN112242134A (en) Speech synthesis method and device
TW201331930A (en) Speech synthesis method and apparatus for electronic system
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
WO2021114617A1 (en) Voice synthesis method and apparatus, computer device, and computer readable storage medium
CN114783424A (en) Text corpus screening method, device, equipment and storage medium
CN114495896A (en) Voice playing method and computer equipment
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant