CN112037755A - Voice synthesis method and device based on timbre clone and electronic equipment


Info

Publication number: CN112037755A (granted publication: CN112037755B)
Application number: CN202011211468.7A
Authority: CN (China)
Prior art keywords: voice, variable, corpus, fine tuning, model
Other languages: Chinese (zh)
Inventor: 张彤彤
Current assignee: Beijing Qiyu Information Technology Co Ltd
Original assignee: Beijing Qiyu Information Technology Co Ltd
Priority and filing date: 2020-11-03
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to its accuracy)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management

Landscapes

  • Engineering & Computer Science
  • Computational Linguistics
  • Health & Medical Sciences
  • Audiology, Speech & Language Pathology
  • Human Computer Interaction
  • Physics & Mathematics
  • Acoustics & Sound
  • Multimedia
  • Electrophonic Musical Instruments

Abstract

The invention discloses a voice synthesis method and device based on timbre cloning, and an electronic device, wherein the method comprises the following steps: training a TTS basic model through open-source corpora; training the basic model in a fine-tuning (finetune) manner through a target timbre corpus to obtain a fine-tuning model; generating variable speech of the target timbre according to a variable corpus and the fine-tuning model; and synthesizing target timbre speech according to the variable speech and fixed speech. The method first uses open-source corpora and then trains a fine-tuning model of the TTS network through only a small amount of target timbre corpus. Compared with exhaustive manual recording or the traditional approach of synthesizing from a large high-quality TTS corpus, the method effectively reduces the recording time of the target timbre corpus and greatly lowers the recording cost.

Description

Voice synthesis method and device based on timbre clone and electronic equipment
Technical Field
The invention relates to the technical field of intelligent voice, and in particular to a voice synthesis method and device based on timbre cloning, an electronic device, and a computer-readable medium.
Background
In intelligent voice interaction, a voice robot usually interacts with the user through preset scripts. A preset script is generally synthesized from fixed speech and variable speech. Fixed speech is speech that is the same for all users, while variable speech is speech that must change for each individual user. For example, in the preset script "Hello, Mr. XX!", "Hello" and "Mr." apply to all male users and belong to the fixed speech, whereas "XX" must change with each male user's name and therefore belongs to the variable speech.
In the prior art, the fixed speech is recorded in advance by a professional voice actor, while the variable speech is produced by first narrowing the variable content down to a finite range according to the product, and then having the voice actor record every variable utterance exhaustively. This approach requires that the variable speech be reducible to an exhaustible range, which is itself a business compromise; moreover, recording a large amount of variable speech carries high time and economic costs. Another way to generate variable speech is to synthesize audio in the voice actor's timbre via Text-To-Speech (TTS) and then concatenate the variable speech with the fixed speech. However, the TTS methods currently in use are based on end-to-end networks, such as Tacotron 2, Transformer TTS, and FastSpeech. They generally require more than 10 hours of high-quality corpus recorded by a professional voice actor, on which a TTS network and a vocoder network are then trained for synthesis. Such a 10-hour high-quality corpus requires professional recording equipment and professional monitoring, and still consumes a great deal of time and money.
Disclosure of Invention
The invention aims to solve the technical problem that synthesizing variable speech in a target timbre is time-consuming and economically costly.
In order to solve the above technical problem, a first aspect of the present invention provides a voice synthesis method based on timbre cloning, the method comprising:
training a TTS basic model through open-source corpora;
training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
generating variable speech of the target timbre according to the variable corpus and the fine-tuning model;
and synthesizing target timbre speech according to the variable speech and the fixed speech.
According to a preferred embodiment of the present invention, before the training of the basic model through open-source corpora, the method further includes:
acquiring an open-source corpus of a first gender;
and before the training of the basic model through the target timbre corpus in a fine-tuning manner, the method further includes:
acquiring a target timbre corpus of the first gender.
According to a preferred embodiment of the present invention, the generating of the variable speech of the target timbre according to the variable corpus and the fine-tuning model includes:
inputting the variable corpus into the fine-tuning model to obtain acoustic features of the sound spectrum;
and generating variable audio based on the acoustic features through a preset vocoder to obtain the variable speech of the target timbre.
According to a preferred embodiment of the present invention, before the generating of the variable audio based on the acoustic features by the preset vocoder, the method further comprises:
training the preset vocoder through the target timbre corpus.
According to a preferred embodiment of the present invention, the TTS network is a Transformer TTS network.
According to a preferred embodiment of the present invention, the preset vocoder is a WaveGAN vocoder.
According to a preferred embodiment of the invention, the acoustic feature is an Fbank feature.
In order to solve the above technical problem, a second aspect of the present invention provides a voice synthesis apparatus based on timbre cloning, the apparatus comprising:
a first training module for training a TTS basic model through open-source corpora;
a second training module for training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
a generating module for generating variable speech of the target timbre according to the variable corpus and the fine-tuning model;
and a synthesis module for synthesizing target timbre speech according to the variable speech and the fixed speech.
According to a preferred embodiment of the invention, the apparatus further comprises:
a first obtaining module for obtaining the open-source corpus of a first gender;
and a second obtaining module for obtaining the target timbre corpus of the first gender.
According to a preferred embodiment of the present invention, the generating module includes:
an input module for inputting the variable corpus into the fine-tuning model to obtain acoustic features of the sound spectrum;
and a sub-generating module for generating variable audio through a preset vocoder based on the acoustic features to obtain the variable speech of the target timbre.
According to a preferred embodiment of the invention, the apparatus further comprises:
a third training module for training the preset vocoder through the target timbre corpus.
According to a preferred embodiment of the present invention, the TTS network is a Transformer TTS network.
According to a preferred embodiment of the present invention, the preset vocoder is a WaveGAN vocoder.
According to a preferred embodiment of the invention, the acoustic feature is an Fbank feature.
To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:
a processor; and
a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.
In order to solve the above technical problem, a fourth aspect of the present invention proposes a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs that, when executed by a processor, implement the above method.
Firstly, a TTS basic model is trained through open-source corpora; the basic model is then trained in a fine-tuning (finetune) manner through a small amount of target timbre corpus, and the resulting fine-tuning model performs well in variable speech synthesis; finally, variable speech of the target timbre is generated according to the variable corpus and the fine-tuning model, and the variable speech is synthesized with the fixed speech to obtain the target timbre speech. The method first uses open-source corpora and then trains a fine-tuning model of the TTS network through only a small amount of target timbre corpus. Compared with exhaustive manual recording or the traditional approach of synthesizing from a large high-quality TTS corpus, the method effectively reduces the recording time of the target timbre corpus and greatly lowers the recording cost.
Drawings
In order to make the technical problems solved, the technical means adopted, and the technical effects achieved by the present invention clearer, embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below are merely illustrations of exemplary embodiments of the invention, from which those skilled in the art can derive other embodiments without inventive effort.
FIG. 1 is a flow chart of a method for synthesizing speech based on timbre cloning according to the present invention;
FIG. 2 is a schematic diagram of the present invention for training a neural network-based TTS model;
FIG. 3 is a schematic diagram of the network framework of the Transformer TTS network model and the WaveNet vocoder of the present invention;
FIG. 4 is a schematic diagram of a structural framework of a voice synthesizer based on timbre cloning according to the present invention;
FIG. 5 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 6 is a diagrammatic representation of one embodiment of a computer-readable medium of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the invention to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, so repeated descriptions of them may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by these terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
In the invention, timbre cloning refers to the technology of training a speech model with a target timbre corpus and synthesizing speech with the target timbre through the trained speech model.
It should be understood that the present invention can be applied to human-computer dialogue scenarios. Human-computer dialogue is a sub-direction in the field of artificial intelligence; in popular terms, it lets a person interact with a computer through human language (i.e., natural language). As one of the ultimate problems of artificial intelligence, a complete human-computer dialogue system involves a very wide range of technologies, such as speech technology, natural language processing, machine learning, planning and reasoning, and knowledge engineering in computer science, and even many theories from linguistics and cognitive science. In general, human-computer dialogue can be divided into four sub-problems: open-domain chat, task-driven multi-turn dialogue, question answering, and recommendation.
In existing human-computer dialogue devices, open-domain chat mainly serves to close the distance with the user, establish a trust relationship, provide emotional companionship, smooth the dialogue process (for example, when a task-oriented dialogue cannot meet the user's needs), and improve user stickiness.
In task-driven multi-turn dialogue, the user has a definite goal and wants information or services that satisfy certain constraints, such as ordering a meal, booking a ticket, or looking for music, a movie, or a certain product. Because the user's needs may be complex and may have to be stated over multiple turns, the user may also continuously modify or refine those needs during the dialogue. In addition, when the user's stated needs are not specific or clear enough, the machine can help the user find a satisfactory result by asking, clarifying, or confirming. Task-driven multi-turn dialogue is therefore not simply natural language understanding plus information retrieval, but a decision-making process: during the dialogue, the machine must continually decide the optimal next action according to the current state.
Question answering focuses on one question and one answer, i.e., directly giving an accurate answer to the user's question. It more closely resembles information retrieval, although it may still involve simple context handling such as coreference resolution and query completion. The most fundamental difference between a question-answering system and task-driven multi-turn dialogue is whether the system needs to maintain a representation of the user's goal state and whether a decision-making process is required to complete the task.
The recommendation is to actively recommend information or services that may be of interest to the user based on current user queries and historical user profiles.
Referring to fig. 1, fig. 1 is a flowchart of the voice synthesis method based on timbre cloning according to the present invention. As shown in fig. 1, the method includes:
S1, training a TTS basic model through open-source corpora;
TTS is a technology for converting text into speech, and mainly comprises front-end processing, building a TTS model, and a vocoder. Front-end processing operates on a corpus in text form and converts arbitrary text into linguistic features; it generally comprises sub-modules such as text normalization, word segmentation, part-of-speech prediction, grapheme-to-phoneme (G2P) conversion, polyphone disambiguation, and prosody prediction. Text normalization converts written expressions into spoken expressions, for example "1%" into "one percent" and "1kg" into "one kilogram". Word segmentation and part-of-speech prediction are the basis of prosody prediction. G2P conversion maps written words to phoneme sequences, for example the word "speech" to the phonemes "s p iy ch". Prosodic words and prosodic phrases are generated from the segmentation and part-of-speech information. A neural-network-based TTS basic model then extracts speech parameter features (such as fundamental frequency, formant frequencies, and Mel spectrogram) from the front-end pronunciation or linguistic information. Common TTS models include Tacotron 1/2, Deep Voice 1/2/3, Transformer TTS, FastSpeech, and LightTTS. The vocoder converts the acoustic features into speech waveforms. Common vocoders include the phase-recovery algorithm Griffin-Lim, the conventional vocoders WORLD and STRAIGHT, and the neural vocoders WaveNet, WaveRNN, SampleRNN, and WaveGlow.
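By way of illustration only, the text normalization sub-module described above can be sketched in a few lines of Python; the rule table, digit map, and function names below are illustrative assumptions, not the patent's actual implementation:

```python
import re

# Illustrative spoken forms for single digits (a real normalizer would also
# handle multi-digit numbers, dates, currencies, etc.).
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def spell_number(num: str) -> str:
    # Naive digit-by-digit spelling, sufficient for the examples in the text.
    return " ".join(DIGIT_WORDS[d] for d in num)

def normalize_text(text: str) -> str:
    """Rewrite written expressions into spoken ones, e.g. '1%' -> 'one percent'."""
    text = re.sub(r"(\d+)%", lambda m: spell_number(m.group(1)) + " percent", text)
    text = re.sub(r"(\d+)\s*kg", lambda m: spell_number(m.group(1)) + " kilogram", text)
    return text

assert normalize_text("1% of 1kg") == "one percent of one kilogram"
```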
In the invention, the open-source corpus can be in text form or in speech form. An open-source corpus in text form must first be converted into linguistic features by the front-end processing described above before the TTS basic model is trained. An open-source corpus in speech form needs no front-end processing; dozens of hours of such material can be obtained directly from open-source corpus collections. Preferably, the open-source corpus is chosen to have the same gender as the target timbre corpus: if the target timbre corpus is male speech, male speech is likewise selected from the open-source corpus. This helps ensure that the subsequently trained fine-tuning model sounds closer to the target timbre.
The method of the invention trains a neural-network-based TTS model through a large amount of open-source corpus (for example, dozens of hours of open-source speech) to extract speech parameter features (including fundamental frequency, formant frequencies, and Mel spectrogram). A vocoder is then trained to convert these speech parameters into speech waveforms, thereby generating the variable speech of the invention. As shown in fig. 2, the neural-network-based TTS model is obtained by first training a TTS basic model on a large amount of open-source corpus and then training a fine-tuning model on a small amount (less than 1 hour) of customized target timbre corpus. Compared with the traditional approach of training a TTS model on a large amount (more than 10 hours) of high-quality customized corpus, this effectively reduces corpus recording time and cost and improves the efficiency of variable speech generation.
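As a concrete illustration of the speech parameter features named above, the following sketch extracts a log-Mel spectrogram and a fundamental frequency track with the librosa library; the sampling rate, frame sizes, and filter-bank settings are assumed values rather than those used by the patent:

```python
import librosa
import numpy as np

def extract_features(wav_path: str, sr: int = 22050, n_mels: int = 80):
    """Return a log-Mel spectrogram and an F0 track for one recording."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels)
    log_mel = np.log(np.clip(mel, 1e-5, None))   # shape: (n_mels, frames)
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
    return log_mel, f0
```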
In one example, to increase the speed of variable speech generation, a Transformer TTS network model is selected, and a WaveNet-based vocoder is trained in a WaveGAN framework. As shown in fig. 3, when generating variable speech, the Transformer TTS network model converts text into acoustic features of the sound spectrum, such as Fbank features, and the WaveGAN-trained vocoder generates the actual audio from the Fbank features.
S2, training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
In the invention, the target timbre corpus is a pre-recorded corpus of speech uttered by the target speaker. Specifically, less than 1 hour of target timbre corpus recorded by the target speaker can be used.
Fine-tuning (finetune) means taking an already-trained model and continuing to train it on one's own data to obtain a new model. The finetune of the invention is equivalent to using the first few layers of the trained basic model to extract shallow features, and then fine-tuning on the target timbre corpus to obtain a more accurate model. Normally the accuracy of a freshly trained model climbs slowly from a very low value, whereas finetune achieves good results after relatively few iterations; it does not require retraining the model from scratch, and so improves efficiency.
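The finetune idea can be illustrated with the PyTorch sketch below. TinyTTS is a deliberately minimal stand-in for the pretrained Transformer TTS basic model, and the random batch stands in for the less-than-1-hour target timbre corpus; the layer names and hyperparameters are assumptions, not the patent's:

```python
import torch
from torch import nn, optim

# Minimal hypothetical stand-in for a Transformer TTS acoustic model:
# an encoder producing shallow features and a decoder predicting Mel frames.
class TinyTTS(nn.Module):
    def __init__(self, vocab=64, d=128, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), 2)
        self.decoder = nn.Linear(d, n_mels)
    def forward(self, tokens):
        return self.decoder(self.encoder(self.embed(tokens)))

model = TinyTTS()  # in practice: load the basic model pretrained on open-source corpora

# Finetune as described: keep the shallow layers (embedding + encoder) frozen
# as a feature extractor, and update only the remaining layers.
for p in model.embed.parameters():
    p.requires_grad = False
for p in model.encoder.parameters():
    p.requires_grad = False

opt = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.L1Loss()           # e.g., L1 loss on predicted Mel frames

# Dummy target-timbre batch (stand-in for the <1 hour recorded corpus).
tokens = torch.randint(0, 64, (8, 32))
mels = torch.randn(8, 32, 80)
for step in range(5):             # finetune converges in relatively few iterations
    opt.zero_grad()
    loss = criterion(model(tokens), mels)
    loss.backward()
    opt.step()
```

The key point is that the shallow layers learned from the open-source corpora are reused unchanged, so only a small number of parameters must be adapted to the target timbre.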
S3, generating variable speech of the target timbre according to the variable corpus and the fine-tuning model;
The variable corpus is the corpus corresponding to the variable speech. For example, in "Hello, Mr. XX!", "XX" is the variable speech, and its corresponding corpus is the variable corpus. Specifically, the method comprises the following steps:
s31, inputting the variable linguistic data into the fine tuning model to obtain the acoustic characteristics of the sound frequency spectrum;
wherein the acoustic features may be: MFCC characteristics, Fbank characteristics, and the like. The Fbank feature is preferred in the present invention.
S32, generating variable audio through a preset vocoder based on the acoustic features to obtain the variable speech of the target timbre.
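The following sketch shows how steps S31 and S32 might be glued together at inference time; fine_tuned_model, vocoder, and text_to_ids are hypothetical interfaces, since the patent does not fix a particular toolkit:

```python
import torch
import soundfile as sf

def synthesize_variable_speech(text: str, fine_tuned_model, vocoder,
                               text_to_ids, sr: int = 22050) -> str:
    """Hypothetical inference glue: variable corpus text -> Fbank -> waveform."""
    with torch.no_grad():
        tokens = text_to_ids(text)           # variable corpus -> token ids
        fbank = fine_tuned_model(tokens)     # S31: acoustic features (Fbank)
        waveform = vocoder(fbank)            # S32: vocoder renders raw audio
    out_path = "variable_speech.wav"
    sf.write(out_path, waveform.squeeze().cpu().numpy(), sr)
    return out_path
```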
The preset vocoder is preferably a WaveNet-based vocoder. Before this step, the WaveNet-based vocoder may be trained on the target timbre corpus within the WaveGAN framework.
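For illustration, a single WaveGAN-style adversarial training step on the target timbre corpus might look as follows; the generator and discriminator here are trivial stand-ins (a real WaveNet-based generator is far larger), so the sketch only shows the update logic, not the patent's actual vocoder:

```python
import torch
from torch import nn, optim

gen = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))   # stand-in generator
disc = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 1))     # stand-in discriminator
g_opt = optim.Adam(gen.parameters(), lr=2e-4)
d_opt = optim.Adam(disc.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

fbank = torch.randn(16, 80)    # Fbank frames from the target-timbre corpus
real = torch.randn(16, 256)    # matching real waveform segments

# Discriminator update: real audio -> 1, generated audio -> 0.
d_opt.zero_grad()
fake = gen(fbank).detach()
d_loss = bce(disc(real), torch.ones(16, 1)) + bce(disc(fake), torch.zeros(16, 1))
d_loss.backward()
d_opt.step()

# Generator update: produce audio the discriminator labels as real.
g_opt.zero_grad()
g_loss = bce(disc(gen(fbank)), torch.ones(16, 1))
g_loss.backward()
g_opt.step()
```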
S4, synthesizing the target timbre speech according to the variable speech and the fixed speech.
The fixed speech is the portion of the target timbre speech, pre-recorded by the target speaker, that does not change across users; the variable speech is the portion, generated by the variable corpus and the fine-tuning model, that must change for each individual user. For example, in the target timbre speech "Hello, Mr. XX!", "Hello" and "Mr." apply to all male users and are fixed speech pre-recorded by the target speaker, while "XX" changes with each male user's name and is generated by the variable corpus and the fine-tuning model.
Illustratively, the variable speech and the fixed speech may be synthesized by speech concatenation: word slots are preset in the fixed speech, and the variable speech generated in real time is embedded into the preset word slots.
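A minimal splicing sketch is given below, assuming mono WAV clips that share one sample rate; the file names and the 50 ms gap are illustrative:

```python
import numpy as np
import soundfile as sf

def fill_word_slot(prefix_wav: str, variable_wav: str, suffix_wav: str,
                   out_wav: str) -> None:
    """Embed real-time variable speech into the word slot between two
    fixed-speech segments (all clips assumed mono, same sample rate)."""
    prefix, sr = sf.read(prefix_wav)         # fixed speech before the slot
    variable, sr_v = sf.read(variable_wav)   # generated variable speech
    suffix, sr_s = sf.read(suffix_wav)       # fixed speech after the slot
    assert sr == sr_v == sr_s, "sample rates must match before splicing"
    gap = np.zeros(int(0.05 * sr))           # short pause to smooth the joins
    sf.write(out_wav, np.concatenate([prefix, gap, variable, gap, suffix]), sr)
```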
Fig. 4 is a schematic diagram of the architecture of a voice synthesis apparatus based on timbre cloning according to the present invention. As shown in fig. 4, the apparatus includes:
a first training module 41 configured to train a TTS basic model through open-source corpora;
a second training module 42 configured to train the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
a generating module 43 configured to generate variable speech of the target timbre according to the variable corpus and the fine-tuning model;
and a synthesis module 44 configured to synthesize target timbre speech according to the variable speech and the fixed speech.
Further, the apparatus further comprises:
a first obtaining module 410 configured to obtain an open-source corpus of a first gender;
and a second obtaining module 420 configured to obtain a target timbre corpus of the first gender.
In a specific embodiment, the generating module 43 includes:
an input module 431 configured to input the variable corpus into the fine-tuning model to obtain acoustic features of the sound spectrum;
and a sub-generating module 432 configured to generate variable audio through a preset vocoder based on the acoustic features, obtaining the variable speech of the target timbre.
Further, the apparatus further comprises:
a third training module 45, configured to train the preset vocoder through the target timbre corpus.
In the invention, the TTS network is preferably a Transformer TTS network. The preset vocoder is preferably a WaveGAN vocoder. The acoustic feature is preferably an Fbank feature.
Those skilled in the art will appreciate that the modules in the above apparatus embodiments may be distributed in the apparatus as described, or may be modified accordingly and placed in one or more apparatuses different from those of the above embodiments. The modules of the above embodiments may be combined into one module or further split into multiple sub-modules.
In the following, embodiments of the electronic device of the present invention are described; these may be regarded as physical implementations of the method and apparatus embodiments described above. Details described in the electronic device embodiments should be considered supplementary to the method or apparatus embodiments above; for details not disclosed in the electronic device embodiments, reference may be made to those embodiments.
Fig. 5 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the electronic device 500 of the exemplary embodiment is represented in the form of a general-purpose data processing device. The components of the electronic device 500 may include, but are not limited to: at least one processing unit 510, at least one memory unit 520, a bus 530 connecting different electronic device components (including the memory unit 520 and the processing unit 510), a display unit 540, and the like.
The storage unit 520 stores a computer-readable program, which may be source code or object code. The program may be executed by the processing unit 510, so that the processing unit 510 performs the steps of various embodiments of the present invention, for example the steps shown in fig. 1.
The memory unit 520 may include a readable medium in the form of a volatile memory unit, such as a random access memory unit (RAM) 5201 and/or a cache memory unit 5202, and may further include a read-only memory unit (ROM) 5203. The memory unit 520 may also include a program/utility 5204 having a set (at least one) of program modules 5205, such program modules 5205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
Bus 530 may be one or more of any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 500 may also communicate with one or more external devices (e.g., a keyboard, a display, a network device, a Bluetooth device, etc.), enabling a user to interact with the electronic device 500 via these external devices, and/or enabling the electronic device 500 to communicate with one or more other data processing devices (e.g., a router, a modem, etc.). Such communication can occur via input/output (I/O) interfaces 550, and can also occur via a network adapter 560 connected to one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet. The network adapter 560 may communicate with other modules of the electronic device 500 via the bus 530. It should be appreciated that, although not shown in fig. 5, other hardware and/or software modules may be used with the electronic device 500, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
FIG. 6 is a schematic diagram of one computer-readable-medium embodiment of the present invention. As shown in fig. 6, the computer program may be stored on one or more computer-readable media. The computer-readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. The computer program, when executed by one or more data processing devices, enables the computer-readable medium to implement the above-described method of the invention, namely: training a TTS basic model through open-source corpora; training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model; generating variable speech of the target timbre according to the variable corpus and the fine-tuning model; and synthesizing target timbre speech according to the variable speech and the fixed speech.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. The technical solution according to the embodiments of the present invention can therefore be embodied in the form of a software product, which can be stored in a computer-readable storage medium (for example a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and which includes several instructions that cause a data processing device (for example a personal computer, a server, or a network device) to execute the above-mentioned method according to the present invention.
The computer-readable storage medium may include a propagated data signal with readable program code embodied therein, for example in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic or optical forms, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing embodiments have described the objects, aspects, and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine, or electronic device; various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments disclosed; rather, it is intended to cover all modifications, changes, and equivalents that come within the spirit and scope of the invention.

Claims (10)

1. A method for voice synthesis based on timbre cloning, the method comprising:
training a TTS basic model through open-source corpora;
training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
generating variable speech of the target timbre according to the variable corpus and the fine-tuning model;
and synthesizing target timbre speech according to the variable speech and the fixed speech.
2. The method according to claim 1, wherein before the training of the basic model through open-source corpora, the method further comprises:
acquiring an open-source corpus of a first gender;
and before the training of the basic model through the target timbre corpus in a fine-tuning manner, the method further comprises:
acquiring a target timbre corpus of the first gender.
3. The method of claim 1, wherein generating variable speech of the target timbre according to the variable corpus and the fine-tuning model comprises:
inputting the variable corpus into the fine-tuning model to obtain acoustic features of the sound spectrum;
and generating variable audio based on the acoustic features through a preset vocoder to obtain the variable speech of the target timbre.
4. The method of claim 3, wherein before the generating of the variable audio based on the acoustic features by the preset vocoder, the method further comprises:
training the preset vocoder through the target timbre corpus.
5. The method of claim 1, wherein the TTS network is a Transformer TTS network.
6. The method of claim 3, wherein the preset vocoder is a WaveGAN vocoder.
7. The method of claim 3, wherein the acoustic feature is an Fbank feature.
8. A voice synthesis apparatus based on timbre cloning, the apparatus comprising:
a first training module for training a TTS basic model through open-source corpora;
a second training module for training the basic model in a fine-tuning (finetune) manner through the target timbre corpus to obtain a fine-tuning model;
a generating module for generating variable speech of the target timbre according to the variable corpus and the fine-tuning model;
and a synthesis module for synthesizing target timbre speech according to the variable speech and the fixed speech.
9. An electronic device, comprising:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
CN202011211468.7A 2020-11-03 2020-11-03 Voice synthesis method and device based on timbre clone and electronic equipment Active CN112037755B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011211468.7A CN112037755B (en) 2020-11-03 2020-11-03 Voice synthesis method and device based on timbre clone and electronic equipment


Publications (2)

Publication Number Publication Date
CN112037755A (en) 2020-12-04
CN112037755B CN112037755B (en) 2021-02-02

Family

Family ID: 73573573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011211468.7A Active CN112037755B (en) 2020-11-03 2020-11-03 Voice synthesis method and device based on timbre clone and electronic equipment

Country Status (1)

Country Link
CN (1) CN112037755B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113539237A (en) * 2021-07-15 2021-10-22 思必驰科技股份有限公司 Speech synthesis method, electronic device, and storage medium
CN114566143A (en) * 2022-03-31 2022-05-31 北京帝派智能科技有限公司 Speech synthesis method and speech synthesis system capable of locally modifying content

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112992118B (en) * 2021-05-22 2021-07-23 成都启英泰伦科技有限公司 Speech model training and synthesizing method with few linguistic data


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997032299A1 (en) * 1996-02-27 1997-09-04 Philips Electronics N.V. Method and apparatus for automatic speech segmentation into phoneme-like units
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN110569960A (en) * 2018-06-06 2019-12-13 耐能有限公司 self-fine-tuning model compression method and device for reconstructing deep neural network
CN110110041A (en) * 2019-03-15 2019-08-09 平安科技(深圳)有限公司 Wrong word correcting method, device, computer installation and storage medium
CN111210803A (en) * 2020-04-21 2020-05-29 南京硅基智能科技有限公司 System and method for training clone timbre and rhythm based on Bottleneck characteristics


Also Published As

Publication number Publication date
CN112037755B (en) 2021-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant