CN112102811A - Optimization method and device for synthesized voice and electronic equipment


Info

Publication number
CN112102811A
CN112102811A (application CN202011217838.8A)
Authority
CN
China
Prior art keywords
voice
variable
variable voice
speech
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011217838.8A
Other languages
Chinese (zh)
Other versions
CN112102811B (en)
Inventor
张彤彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qilu Information Technology Co Ltd
Original Assignee
Beijing Qilu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qilu Information Technology Co Ltd filed Critical Beijing Qilu Information Technology Co Ltd
Priority to CN202011217838.8A
Publication of CN112102811A
Application granted
Publication of CN112102811B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management

Abstract

The invention discloses a method, a device and electronic equipment for optimizing synthesized voice, wherein the method comprises the following steps: generating acoustic features of variable voice through TTS; acquiring real variable voice corresponding to the parallel corpus of the variable voice; extracting acoustic features of the real variable voice; training a preset neural network on the acoustic features of the variable voice and of the real variable voice; and inputting the acoustic features of the variable voice to be optimized into the trained preset neural network for optimization. The optimization method can generate variable voice highly similar to the fixed voice recorded by a sound engineer, effectively reduce the timbre difference between the variable voice and the fixed voice, and improve the overall timbre of the spliced synthesized speech.

Description

Optimization method and device for synthesized voice and electronic equipment
Technical Field
The invention relates to the technical field of intelligent speech, in particular to a method and device for optimizing synthesized voice, an electronic device, and a computer-readable medium.
Background
In intelligent voice interaction, a voice robot usually interacts with the user through preset scripts. A preset script is generally synthesized from fixed voice and variable voice. Fixed voice is speech that is common to all users, while variable voice is speech that must change for each individual user. For example, in the preset script "Hello! Mr. xx", "Hello" and "Mr." are common to all male users and belong to fixed voice, whereas "xx" changes with each male user's name and thus belongs to variable voice.
In the prior art, the fixed voice is recorded in advance by a professional sound engineer, the variable voice is synthesized in the sound engineer's timbre by Text-To-Speech (TTS), and the two are then spliced together. However, the TTS methods currently in use are based on end-to-end networks, such as Tacotron 2, Transformer TTS, and FastSpeech. They generally require more than 10 hours of high-quality corpus recorded by a sound engineer, on which a TTS model and a vocoder network are trained for synthesis. Such a 10-hour high-quality corpus requires professional recording equipment and the supervision of professionals, and consumes a great deal of time and money. In addition, the timbre of variable voice generated this way differs from that of the fixed voice recorded by the sound engineer, so the splicing of variable and fixed voice sounds unnatural and exhibits a timbre difference.
Disclosure of Invention
The invention aims to solve the technical problem that spliced synthesized speech does not join naturally because of the timbre difference between the variable voice synthesized in the target timbre and the fixed voice recorded by the sound engineer.
In order to solve the above technical problem, a first aspect of the present invention provides an optimization method for synthesized speech, including:
generating acoustic features of variable speech through TTS;
acquiring real variable voice corresponding to the parallel corpus of the variable voice; the real variable voice corresponding to the parallel corpus of the variable voice is the parallel corpus of the variable voice as recorded by the sound engineer who recorded the fixed voice;
extracting acoustic features of the real variable voice;
training a preset neural network through acoustic features of the variable voice and the real variable voice;
and inputting the acoustic characteristics of the variable voice to be optimized into the trained preset neural network for optimization.
According to a preferred embodiment of the present invention, the generating of the acoustic feature of the variable speech by TTS includes:
training a basic model of TTS through open source linguistic data;
training the basic model in a fine-tune mode through the target timbre corpus to obtain a fine-tuning model;
and generating acoustic characteristics of the variable voice according to the variable linguistic data and the fine tuning model.
According to a preferred embodiment of the present invention, the generating acoustic features of the variable speech according to the variable corpus and the fine tuning model includes:
and inputting the variable corpus into the fine-tuning model to obtain the acoustic features of the sound spectrum of the variable voice.
According to a preferred embodiment of the invention, the method further comprises:
generating optimized variable voice based on the acoustic characteristics of the optimized variable voice through a preset vocoder;
and synthesizing the target timbre voice according to the optimized variable voice and the fixed voice.
According to a preferred embodiment of the invention, the acoustic feature is an Fbank feature.
According to a preferred embodiment of the present invention, the predetermined neural network is a recurrent neural network RNN.
According to a preferred embodiment of the present invention, the predetermined vocoder is a WaveGAN vocoder.
In order to solve the above technical problem, a second aspect of the present invention provides an apparatus for optimizing synthesized speech, the apparatus comprising:
the generating module is used for generating acoustic characteristics of variable voice through TTS;
the acquisition module is used for acquiring real variable voice corresponding to the parallel corpus of the variable voice; the real variable voice corresponding to the parallel corpus of the variable voice is the parallel corpus of the variable voice as recorded by the sound engineer who recorded the fixed voice;
the extraction module is used for extracting the acoustic features of the real variable voice;
the training module is used for training a preset neural network through the acoustic characteristics of the variable voice and the real variable voice;
and the optimization module is used for inputting the acoustic characteristics of the variable voice to be optimized into the trained preset neural network for optimization.
According to a preferred embodiment of the present invention, the generating module includes:
the first training module is used for training a basic model of TTS through open source linguistic data;
the second training module is used for training the basic model in a fine-tune mode through the target timbre corpus to obtain a fine-tuning model;
and the first generation module is used for generating acoustic characteristics of variable voice according to the variable linguistic data and the fine tuning model.
According to a preferred embodiment of the present invention, the first generating module is specifically configured to input the variable corpus into the fine-tuning model to obtain the acoustic features of the sound spectrum of the variable voice.
According to a preferred embodiment of the invention, the device further comprises:
the second generation module is used for generating optimized variable voice based on the acoustic characteristics of the optimized variable voice through a preset vocoder;
and the synthesis module is used for synthesizing the target timbre voice according to the optimized variable voice and the fixed voice.
According to a preferred embodiment of the invention, the acoustic feature is an Fbank feature.
According to a preferred embodiment of the present invention, the predetermined neural network is a recurrent neural network RNN.
According to a preferred embodiment of the present invention, the predetermined vocoder is a WaveGAN vocoder.
To solve the above technical problem, a third aspect of the present invention provides an electronic device, comprising:
a processor; and
a memory storing computer executable instructions that, when executed, cause the processor to perform the method described above.
In order to solve the above technical problem, a fourth aspect of the present invention proposes a computer-readable storage medium, wherein the computer-readable storage medium stores one or more programs that, when executed by a processor, implement the above method.
The method is based on Voice Conversion (VC) technology and adopts parallel-data supervised learning: a preset neural network is trained on the acoustic features of the variable voice and of the real variable voice corresponding to its parallel corpus, and the acoustic features of the variable voice to be optimized are then input into the trained preset neural network for optimization. This generates variable voice highly similar to the fixed voice recorded by the sound engineer, effectively reduces the timbre difference between variable voice and fixed voice, and improves the overall timbre of the spliced synthesized speech.
Drawings
In order to make the technical problems solved, the technical means adopted and the technical effects achieved by the present invention clearer, embodiments of the invention are described in detail below with reference to the accompanying drawings. It should be noted, however, that the drawings described below illustrate only exemplary embodiments of the invention, from which those skilled in the art can derive other embodiments without inventive effort.
FIG. 1 is a schematic flow chart of an optimization method for synthesized speech according to the present invention;
FIG. 2 is a schematic diagram of the present invention for training a neural network-based TTS model;
FIG. 3 is a schematic diagram of the network framework of the Transformer TTS model and the WaveNet vocoder of the present invention;
FIG. 4 is a schematic diagram of the present invention for optimizing variable speech;
FIG. 5 is a schematic diagram of a structural framework of an apparatus for optimizing synthesized speech according to the present invention;
FIG. 6 is a block diagram of an exemplary embodiment of an electronic device in accordance with the present invention;
FIG. 7 is a diagrammatic representation of one embodiment of a computer-readable medium of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described more fully with reference to the accompanying drawings. The invention may, however, be embodied in many specific forms and should not be construed as limited to the embodiments set forth herein; rather, these exemplary embodiments are provided so that this disclosure will be thorough and complete and will fully convey the concept of the invention to those skilled in the art.
The structures, properties, effects or other characteristics described in a certain embodiment may be combined in any suitable manner in one or more other embodiments, while still complying with the technical idea of the invention.
In describing particular embodiments, specific details of structures, properties, effects, or other features are set forth in order to provide a thorough understanding of the embodiments by one skilled in the art. However, it is not excluded that a person skilled in the art may implement the invention in a specific case without the above-described structures, performances, effects or other features.
The flow chart in the drawings is only an exemplary flow demonstration, and does not represent that all the contents, operations and steps in the flow chart are necessarily included in the scheme of the invention, nor does it represent that the execution is necessarily performed in the order shown in the drawings. For example, some operations/steps in the flowcharts may be divided, some operations/steps may be combined or partially combined, and the like, and the execution order shown in the flowcharts may be changed according to actual situations without departing from the gist of the present invention.
The block diagrams in the figures generally represent functional entities and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The same reference numerals denote the same or similar elements, components, or parts throughout the drawings, so repeated descriptions of them may be omitted hereinafter. It will be further understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, or sections, these elements, components, or sections should not be limited by those terms; the terms are used only to distinguish one from another. For example, a first device may also be referred to as a second device without departing from the spirit of the present invention. Furthermore, the term "and/or" is intended to include all combinations of any one or more of the listed items.
The method is based on Voice Conversion (VC) technology and adopts parallel-data supervised learning to optimize the variable voice. The aim of voice conversion is to change the non-linguistic features of speech while preserving its linguistic content; non-linguistic features include accent, timbre, speaking style, and the like. The object of the invention is timbre conversion.
It should be understood that the present invention can be applied to human-machine dialogue scenarios. Human-machine dialogue is a sub-direction of artificial intelligence; in plain terms, it lets a person interact with a computer through human language (i.e., natural language). As one of the ultimate problems of artificial intelligence, a complete human-machine dialogue system involves a very wide range of technologies, such as speech technology, natural language processing, machine learning, planning and reasoning, and knowledge engineering in computer science; many theories from linguistics and cognitive science are also applied. In general, human-machine dialogue can be divided into four sub-problems: open-domain chat, task-driven multi-turn dialogue, question answering, and recommendation.
In existing human-machine dialogue devices, open-domain chat mainly serves to shorten distance, build trust, provide emotional companionship, smooth the flow of conversation (for example, when a task-oriented dialogue cannot meet the user's needs), and improve user stickiness.
In task-driven multi-turn dialogue, the user has a definite goal and wants information or services that satisfy certain constraints, such as ordering a meal, booking a ticket, or finding music, a movie, or some product. Because the user's needs may be complex, they may have to be stated over multiple turns, and the user may continuously modify or refine them during the conversation. In addition, when the user's stated needs are not specific or clear enough, the machine can help the user find a satisfactory result by asking, clarifying, or confirming. Task-driven multi-turn dialogue is therefore not simply natural language understanding plus information retrieval, but a decision process: during the dialogue the machine must continually decide the optimal next action based on the current state.
Question answering focuses on one question and one answer, directly giving an accurate answer to the user's question. It is closer to an information retrieval process, although simple context handling such as coreference resolution and query completion may also be involved. The most fundamental differences between a question answering system and task-driven multi-turn dialogue are whether the system needs to maintain a representation of the user's goal state and whether a decision process is required to complete the task.
Recommendation actively recommends information or services likely to interest the user based on the current user query and the user's historical profile.
Referring to fig. 1, which is a flowchart of the optimization method for synthesized speech according to the present invention, the method includes:
s1, generating acoustic characteristics of variable voice through TTS;
the method firstly adopts open source linguistic data, then trains a fine tuning model of TTS in a fine tune mode through a small amount of target tone linguistic data, and generates acoustic characteristics of variable voice through the fine tuning model. Compared with exhaustive manual recording or a traditional TTS high-quality corpus synthesis mode, the method can effectively reduce the recording time of the target tone corpus and greatly save the recording cost. Specifically, the method comprises the following steps:
s11, training a TTS basic model through open source corpora;
TTS is a technology for converting text into speech; it mainly comprises front-end processing, building a TTS model, and a vocoder. Front-end processing targets corpora in text form and converts arbitrary text into linguistic features; it generally comprises sub-modules such as text regularization, word segmentation, part-of-speech prediction, grapheme-to-phoneme conversion, polyphone disambiguation, and prosody prediction. Text regularization converts written expressions into spoken ones, such as "1%" into "one percent" and "1kg" into "one kilogram". Word segmentation and part-of-speech prediction are the basis of prosody prediction: prosodic words and prosodic phrases are generated from the segmented words and the part-of-speech information. Grapheme-to-phoneme conversion maps, for example, the word "speech" to the phoneme sequence "s p iy ch". A neural-network-based TTS base model then extracts speech parameter features (such as fundamental frequency, formant frequencies, and Mel spectrograms) from the pronunciation or linguistic information produced by the front end. Common TTS models include Tacotron 1/2, Deep Voice 1/2/3, Transformer TTS, FastSpeech, and LightTTS. The vocoder converts the acoustic features into speech waveforms. Common vocoders include the phase-recovery algorithm Griffin-Lim, the conventional vocoders WORLD and STRAIGHT, and the neural vocoders WaveNet, WaveRNN, SampleRNN, and WaveGlow.
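For illustration only, the following Python sketch shows the flavor of the text regularization sub-module described above. The rules and function names are assumptions of this sketch, not part of the patent; a production front end would use far larger rule sets or trained models.

```python
import re

# A minimal sketch of text regularization: a few hand-written rules that
# rewrite written-form expressions as spoken-form ones. The rules below
# are illustrative assumptions only.
_RULES = [
    (re.compile(r"(\d+)%"), lambda m: f"{m.group(1)} percent"),
    (re.compile(r"(\d+)kg"), lambda m: f"{m.group(1)} kilograms"),
    (re.compile(r"\$(\d+)"), lambda m: f"{m.group(1)} dollars"),
]

def normalize(text: str) -> str:
    """Apply each rewrite rule in order and return the spoken form."""
    for pattern, repl in _RULES:
        text = pattern.sub(repl, text)
    return text

print(normalize("Add 1kg of flour; 1% of $50 is 0.5"))
# -> "Add 1 kilograms of flour; 1 percent of 50 dollars is 0.5"
```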
In the invention, the open-source corpus can be in text form or in speech form. For open-source corpora in text form, the front-end processing described above is required to convert the text into linguistic features before training the TTS base model. For open-source corpora in speech form, no front-end processing is needed, and dozens of hours of material can be obtained directly from open-source corpus libraries. Preferably, the open-source corpus is chosen to have the same speaker gender as the target timbre corpus; for example, if the target timbre corpus is male speech, male speech is likewise selected from the open-source library. This helps ensure that the subsequently trained fine-tuning model is closer to the target timbre.
The method of the invention trains a neural-network-based TTS model on a large amount of open-source corpus (for example, dozens of hours of open-source speech corpus) to extract speech parameter features (including fundamental frequency, formant frequencies, Mel spectrograms, and so on), and then trains a vocoder to convert these speech parameters into speech waveforms, thereby generating the variable voice of the invention. As shown in fig. 2, the neural-network-based TTS model is obtained by first training a TTS base model on a large amount of open-source corpus and then training a fine-tuning model on a small amount (less than one hour) of customized target timbre corpus. Compared with the traditional approach of training a TTS model on a large amount (more than 10 hours) of high-quality customized corpus, this effectively reduces corpus recording time and cost and improves the efficiency of variable voice generation.
In one example, to increase the speed of variable voice generation, a Transformer TTS model is selected and a WaveNet-based vocoder is trained with WaveGAN. As shown in fig. 3, when generating variable voice, the Transformer TTS model converts text into acoustic features of the sound spectrum, such as Fbank features, and the WaveGAN-trained vocoder is responsible for generating the actual audio from the Fbank features.
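The division of labor in fig. 3 can be sketched as below. Both modules are randomly initialized stand-ins (toy layers, not the actual Transformer TTS or WaveGAN networks), and the length handling that real TTS performs with attention or duration prediction is omitted; the sketch only shows the data flow from phoneme IDs to 80-dimensional Fbank frames and on to waveform samples.

```python
import torch
import torch.nn as nn

class ToyAcousticModel(nn.Module):
    """Stand-in for the acoustic model: phoneme IDs -> 80-dim Fbank frames.
    Real systems also expand text length to frame length; omitted here."""
    def __init__(self, vocab: int = 100, mel_bins: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab, 256)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
            num_layers=2)
        self.proj = nn.Linear(256, mel_bins)

    def forward(self, phoneme_ids):                # (B, T)
        h = self.encoder(self.embed(phoneme_ids))  # (B, T, 256)
        return self.proj(h)                        # (B, T, 80)

class ToyVocoder(nn.Module):
    """Stand-in for the WaveGAN-trained vocoder: Fbank frames -> waveform."""
    def __init__(self, mel_bins: int = 80, hop: int = 160):
        super().__init__()
        self.upsample = nn.Linear(mel_bins, hop)   # crude 160x upsampling

    def forward(self, fbank):                      # (B, T, 80)
        return self.upsample(fbank).flatten(1)     # (B, T * 160) samples

acoustic, vocoder = ToyAcousticModel(), ToyVocoder()
ids = torch.randint(0, 100, (1, 12))               # fake phoneme sequence
wave = vocoder(acoustic(ids))                      # hop 160 at 16 kHz
print(wave.shape)                                  # torch.Size([1, 1920])
```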
S12, training the basic model in a fine-tune mode through the target timbre corpus to obtain a fine-tuning model;
in the invention, the target timbre corpus is a pre-recorded corpus in speech form uttered by the target speaker. Specifically, less than one hour of target timbre corpus recorded by the target speaker can be used.
Fine-tuning means training a new model starting from an already trained model plus one's own data. In the invention, fine-tuning amounts to using the first few layers of the trained base model to extract shallow features and then adjusting the model on the target timbre corpus to obtain a more accurate result. Generally, the accuracy of a model trained from scratch rises slowly from a very low value, whereas fine-tuning achieves a good result after relatively few iterations; it does not require completely retraining the model, thereby improving efficiency.
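A minimal PyTorch sketch of this fine-tune idea, under stated assumptions: a toy network plays the role of the base model, its first few layers are frozen as the shallow feature extractor, and only the remaining layers are updated, with random tensors standing in for the small target timbre corpus.

```python
import torch
import torch.nn as nn

# Toy stand-in for the base model trained on dozens of hours of
# open-source corpus; the architecture is an illustrative assumption.
base_model = nn.Sequential(
    nn.Linear(80, 256), nn.ReLU(),   # "first few layers": shallow features
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 80),              # upper layers to be fine-tuned
)

# Freeze the shallow layers so they keep what was learned from the
# large open-source corpus; only the upper layers adapt to the target timbre.
for layer in list(base_model.children())[:3]:
    for p in layer.parameters():
        p.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in base_model.parameters() if p.requires_grad), lr=1e-4)
loss_fn = nn.MSELoss()

# Random tensors stand in for the (< 1 hour) target-timbre corpus.
for step in range(100):                  # relatively few iterations suffice
    x = torch.randn(16, 80)
    y = torch.randn(16, 80)
    optimizer.zero_grad()
    loss = loss_fn(base_model(x), y)
    loss.backward()
    optimizer.step()
```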
S13, generating acoustic features of the variable voice in the target timbre according to the variable corpus and the fine-tuning model;
wherein the variable corpus is the corpus corresponding to the variable voice. For example, in "Hello! Mr. xx", "xx" is variable voice, and its corresponding text is the variable corpus. Specifically, the variable corpus is input into the fine-tuning model to obtain the acoustic features of the sound spectrum;
wherein the acoustic features may be Mel-frequency cepstral coefficients (MFCC), Fbank features, and the like. The Fbank feature is preferred in the present invention.
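As a concrete illustration of Fbank extraction (the toolkit, frame sizes, and file name below are assumptions of this sketch; the patent does not specify them), Fbank features are log Mel filterbank energies computed frame by frame:

```python
import numpy as np
import librosa

def fbank(path: str, sr: int = 16000, n_mels: int = 80) -> np.ndarray:
    """Return log Mel filterbank (Fbank) features, shape (frames, n_mels)."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)  # 25 ms / 10 ms
    return np.log(mel + 1e-6).T   # log energies, time-major

feats = fbank("variable_speech.wav")   # hypothetical file name
print(feats.shape)                     # e.g. (num_frames, 80)
```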
S2, acquiring real variable voice corresponding to the parallel corpus of the variable voice;
the parallel corpus of the variable voice is a corpus consisting of a text which is the same as the variable voice and a translated text which corresponds to the variable voice in parallel. The real variable voice corresponding to the parallel corpus of the variable voice refers to the voice of the parallel corpus of the variable voice recorded by a sound recorder with fixed voice.
The real variable voice can be recorded and stored in advance and is retrieved as needed in this step.
S3, extracting acoustic features of the real variable voice;
wherein the acoustic features may be MFCC features, Fbank features, and the like; the Fbank feature is preferred in the present invention. As shown in fig. 4, the Fbank features of the real variable voice frames frame_1, frame_2, …, frame_i are extracted respectively, where i is a natural number.
S4, training a preset neural network through the acoustic features of the variable voice and the real variable voice;
in the present invention, the preset neural network is preferably a recurrent neural network (RNN). An RNN is an artificial neural network whose nodes are connected directionally in a loop; its internal state can exhibit dynamic temporal behavior. An RNN can use its internal memory to process input sequences of arbitrary length, which makes tasks such as unsegmented handwriting recognition and speech recognition easier to handle. As shown in fig. 4, the RNN is trained by inputting the Fbank features of the variable voice frame_0 and the Fbank features of each real variable voice frame frame_1, frame_2, …, frame_i into the RNN.
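A minimal sketch of this training step under stated assumptions: a GRU maps the TTS-generated Fbank frames toward the sound engineer's Fbank frames with an MSE loss. Random tensors stand in for the real, time-aligned feature pairs, and the frame alignment itself is taken as given.

```python
import torch
import torch.nn as nn

class FbankMapper(nn.Module):
    """GRU that converts TTS Fbank frames toward the target speaker's Fbank."""
    def __init__(self, mel_bins: int = 80, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(mel_bins, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, mel_bins)

    def forward(self, x):            # x: (batch, frames, mel_bins)
        h, _ = self.rnn(x)
        return self.out(h)           # predicted "real" Fbank frames

model = FbankMapper()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    tts_fbank = torch.randn(8, 120, 80)    # stand-in: features from step S1
    real_fbank = torch.randn(8, 120, 80)   # stand-in: features from step S3
    optimizer.zero_grad()
    loss = loss_fn(model(tts_fbank), real_fbank)
    loss.backward()
    optimizer.step()
```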
And S5, inputting the acoustic characteristics of the variable voice to be optimized into the trained preset neural network for optimization.
The acoustic features of the variable voice to be optimized can be obtained by the method of step S1. As shown in fig. 4, these features are input into the trained preset neural network for optimization, yielding the optimized acoustic features of the variable voice.
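Continuing the sketch above, the optimization of step S5 then reduces to a single forward pass through the trained network:

```python
import torch

# Step S5 as a forward pass: `model` is the trained FbankMapper from the
# sketch above; the random tensor stands in for Fbank features of the
# variable voice to be optimized (obtained as in step S1).
model.eval()
with torch.no_grad():
    fbank_to_optimize = torch.randn(1, 120, 80)   # (batch, frames, mel bins)
    optimized_fbank = model(fbank_to_optimize)    # passed to the vocoder next
```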
Further, as shown in fig. 4, the optimized variable voice may be generated from the optimized acoustic features through a preset vocoder, preferably a WaveNet-based vocoder. Before this step, the WaveNet-based vocoder may be trained with WaveGAN on the target timbre corpus. The target timbre voice is then synthesized from the optimized variable voice and the fixed voice.
The fixed voice is the part of the target timbre voice, prerecorded by the target speaker, that does not change across users; the optimized variable voice is the part that must change for each individual user, obtained after RNN optimization. For example, in the target timbre voice "Hello! Mr. xx", "Hello" and "Mr." are common to all male users and are fixed voice prerecorded by the target speaker, while "xx" changes with each male user's name and is obtained by generating it from the variable corpus and the fine-tuning model and then optimizing it with the RNN.
Illustratively, the optimized variable voice and the fixed voice may be synthesized by means of voice splicing. Specifically, word slots can be preset in the fixed voice, and the optimized variable voice generated in real time is embedded into the preset word slots.
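A minimal numpy sketch of this splicing step. The two-part fixed recording, the short crossfade at each joint, and all durations are assumptions of the sketch; the patent itself only specifies embedding the variable voice into a preset word slot.

```python
import numpy as np

def splice(prefix: np.ndarray, variable: np.ndarray, suffix: np.ndarray,
           sr: int = 16000, fade_ms: int = 10) -> np.ndarray:
    """Embed the variable voice between the fixed prefix and suffix,
    crossfading fade_ms milliseconds at each joint to avoid clicks."""
    n = int(sr * fade_ms / 1000)
    fade_in, fade_out = np.linspace(0, 1, n), np.linspace(1, 0, n)

    def join(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        overlap = a[-n:] * fade_out + b[:n] * fade_in
        return np.concatenate([a[:-n], overlap, b[n:]])

    return join(join(prefix, variable), suffix)

# Stand-ins for real audio: fixed "Hello!", variable name, fixed "Mr."
hello = np.random.randn(16000) * 0.1    # 1 s of fixed voice
name = np.random.randn(8000) * 0.1      # 0.5 s of optimized variable voice
mister = np.random.randn(16000) * 0.1   # 1 s of fixed voice
out = splice(hello, name, mister)
print(out.shape)
```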
Fig. 5 is a schematic diagram of an architecture of an apparatus for optimizing synthesized speech according to the present invention, as shown in fig. 5, the apparatus includes:
a generating module 51, configured to generate an acoustic feature of a variable speech through TTS;
an obtaining module 52, configured to obtain a real variable voice corresponding to the parallel corpus of the variable voice;
an extraction module 53, configured to extract an acoustic feature of the real variable speech;
a training module 54, configured to train a preset neural network through acoustic features of the variable speech and the real variable speech;
and the optimization module 55 is configured to input the acoustic features of the variable voice to be optimized into the trained preset neural network for optimization.
In a specific embodiment, the generating module 51 includes:
the first training module 511 is used for training a basic model of the TTS through the open source corpus;
a second training module 512, configured to train the basic model in a fine-tune mode on the target timbre corpus to obtain a fine-tuning model;
and a first generating module 513, configured to generate an acoustic feature of the variable speech according to the variable corpus and the fine tuning model.
Further, the first generating module 513 is specifically configured to input the variable corpus into the fine-tuning model to obtain the acoustic features of the sound spectrum of the variable voice.
The device further comprises:
a second generating module 56, configured to generate, by using a preset vocoder, an optimized variable voice based on the acoustic feature of the optimized variable voice;
and a synthesis module 57, configured to synthesize the target timbre voice from the optimized variable voice and the fixed voice.
In the invention, the acoustic feature is preferably an Fbank feature; the preset neural network is preferably a recurrent neural network RNN. The preset vocoder is preferably a WaveGAN vocoder.
Those skilled in the art will appreciate that the modules in the above apparatus embodiment may be distributed in the apparatus as described, or located, with corresponding changes, in one or more apparatuses different from the above embodiment. The modules of the above embodiment may be combined into one module or further split into multiple sub-modules.
In the following, embodiments of the electronic device of the present invention are described, which may be regarded as an implementation in physical form for the above-described embodiments of the method and apparatus of the present invention. Details described in the embodiments of the electronic device of the invention should be considered supplementary to the embodiments of the method or apparatus described above; for details which are not disclosed in embodiments of the electronic device of the invention, reference may be made to the above-described embodiments of the method or the apparatus.
Fig. 6 is a block diagram of an exemplary embodiment of an electronic device according to the present invention. The electronic device shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 6, the electronic device 600 of the exemplary embodiment is represented in the form of a general-purpose data processing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different electronic device components (including the storage unit 620 and the processing unit 610), a display unit 640, and the like.
The storage unit 620 stores a computer-readable program, which may be source code or object code. The program may be executed by the processing unit 610, such that the processing unit 610 performs the steps of the various embodiments of the present invention. For example, the processing unit 610 may perform the steps shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 6201 and/or a cache memory unit 6202, and may further include a read-only memory unit (ROM) 6203. The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may comprise an implementation of a network environment.
Bus 630 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 300 (e.g., a keyboard, a display, a network device, a Bluetooth device, etc.), enabling a user to interact with the electronic device 600 via these external devices, and/or enabling the electronic device 600 to communicate with one or more other data processing devices (e.g., routers, modems, etc.). Such communication can occur via input/output (I/O) interfaces 650, and can also occur via a network adapter 660 with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet). The network adapter 660 may communicate with other modules of the electronic device 600 via the bus 630. It should be appreciated that although not shown in FIG. 6, other hardware and/or software modules may be used with the electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
FIG. 7 is a schematic diagram of one computer-readable medium embodiment of the present invention. As shown in fig. 7, the computer program may be stored on one or more computer-readable media. The computer-readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. When executed by one or more data processing devices, the computer program enables the computer-readable medium to implement the above-described method of the invention, namely: generating acoustic features of variable voice through TTS; acquiring real variable voice corresponding to the parallel corpus of the variable voice; extracting acoustic features of the real variable voice; training a preset neural network through the acoustic features of the variable voice and the real variable voice; and inputting the acoustic features of the variable voice to be optimized into the trained preset neural network for optimization.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments of the present invention described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiment of the present invention can be embodied in the form of a software product, which can be stored in a computer-readable storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to make a data processing device (which can be a personal computer, a server, or a network device, etc.) execute the above-mentioned method according to the present invention.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction-executing system, apparatus, or device. Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
In summary, the present invention can be implemented as a method, an apparatus, an electronic device, or a computer-readable medium executing a computer program. Some or all of the functions of the present invention may be implemented in practice using a general purpose data processing device such as a microprocessor or a Digital Signal Processor (DSP).
While the foregoing embodiments have described the objects, aspects and advantages of the present invention in further detail, it should be understood that the present invention is not inherently tied to any particular computer, virtual machine or electronic device, and various general-purpose machines may be used to implement it. The invention is not limited to the specific embodiments described; all changes and equivalents that come within the spirit and scope of the invention are intended to be embraced.

Claims (10)

1. A method for optimizing synthesized speech, the method comprising:
generating acoustic features of variable speech through TTS;
acquiring real variable voice corresponding to the parallel corpus of the variable voice; the real variable voice corresponding to the parallel corpus of the variable voice is the parallel corpus of the variable voice as recorded by the sound engineer who recorded the fixed voice;
extracting acoustic features of the real variable voice;
training a preset neural network through acoustic features of the variable voice and the real variable voice;
and inputting the acoustic characteristics of the variable voice to be optimized into the trained preset neural network for optimization.
2. The method of claim 1, wherein generating acoustic features of variable speech by TTS comprises:
training a basic model of TTS through open source linguistic data;
training the basic model in a fine-tune mode through the target timbre corpus to obtain a fine-tuning model;
and generating acoustic characteristics of the variable voice according to the variable linguistic data and the fine tuning model.
3. The method of claim 2, wherein generating acoustic features of variable speech from variable corpora and the fine-tuning model comprises:
and inputting the variable corpus into the fine-tuning model to obtain the acoustic features of the sound spectrum of the variable voice.
4. The method of claim 1, further comprising:
generating optimized variable voice based on the acoustic characteristics of the optimized variable voice through a preset vocoder;
and synthesizing the target timbre voice according to the optimized variable voice and the fixed voice.
5. The method of claim 1, wherein the acoustic feature is an Fbank feature.
6. The method of claim 1, wherein the predetermined neural network is a Recurrent Neural Network (RNN).
7. The method of claim 4, wherein the preset vocoder is a WaveGAN vocoder.
8. An apparatus for optimizing synthesized speech, the apparatus comprising:
the generating module is used for generating acoustic characteristics of variable voice through TTS;
the acquisition module is used for acquiring real variable voice corresponding to the parallel corpus of the variable voice; the real variable voice corresponding to the parallel corpus of the variable voice is the parallel corpus of the variable voice as recorded by the sound engineer who recorded the fixed voice;
the extraction module is used for extracting the acoustic features of the real variable voice;
the training module is used for training a preset neural network through the acoustic characteristics of the variable voice and the real variable voice;
and the optimization module is used for inputting the acoustic characteristics of the variable voice to be optimized into the trained preset neural network for optimization.
9. An electronic device, comprising:
a processor; and
a memory storing computer-executable instructions that, when executed, cause the processor to perform the method of any of claims 1-7.
10. A computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
CN202011217838.8A 2020-11-04 2020-11-04 Optimization method and device for synthesized voice and electronic equipment Active CN112102811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011217838.8A CN112102811B (en) 2020-11-04 2020-11-04 Optimization method and device for synthesized voice and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011217838.8A CN112102811B (en) 2020-11-04 2020-11-04 Optimization method and device for synthesized voice and electronic equipment

Publications (2)

Publication Number Publication Date
CN112102811A (en) 2020-12-18
CN112102811B (en) 2021-03-02

Family

ID=73784576

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011217838.8A Active CN112102811B (en) 2020-11-04 2020-11-04 Optimization method and device for synthesized voice and electronic equipment

Country Status (1)

Country Link
CN (1) CN112102811B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820266A (en) * 2020-12-29 2021-05-18 中山大学 Parallel end-to-end speech synthesis method based on skip coder
CN114120973A (en) * 2022-01-29 2022-03-01 成都启英泰伦科技有限公司 Training method for voice corpus generation system
TWI788760B (en) * 2021-01-19 2023-01-01 優肯數位媒體有限公司 Voice marketing information system and method


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9263028B2 (en) * 2013-09-30 2016-02-16 Google Inc. Methods and systems for automated generation of nativized multi-lingual lexicons
CN108711420A (en) * 2017-04-10 2018-10-26 北京猎户星空科技有限公司 Multilingual hybrid model foundation, data capture method and device, electronic equipment
CN110136691A (en) * 2019-05-28 2019-08-16 广州多益网络股份有限公司 A kind of speech synthesis model training method, device, electronic equipment and storage medium
CN110390928A (en) * 2019-08-07 2019-10-29 广州多益网络股份有限公司 It is a kind of to open up the speech synthesis model training method and system for increasing corpus automatically
CN111508466A (en) * 2019-09-12 2020-08-07 马上消费金融股份有限公司 Text processing method, device and equipment and computer readable storage medium

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112820266A (en) * 2020-12-29 2021-05-18 中山大学 Parallel end-to-end speech synthesis method based on skip coder
CN112820266B (en) * 2020-12-29 2023-11-14 中山大学 Parallel end-to-end speech synthesis method based on skip encoder
TWI788760B (en) * 2021-01-19 2023-01-01 優肯數位媒體有限公司 Voice marketing information system and method
CN114120973A (en) * 2022-01-29 2022-03-01 成都启英泰伦科技有限公司 Training method for voice corpus generation system
CN114120973B (en) * 2022-01-29 2022-04-08 成都启英泰伦科技有限公司 Training method for voice corpus generation system

Also Published As

Publication number Publication date
CN112102811B (en) 2021-03-02

Similar Documents

Publication Publication Date Title
US20230178068A1 (en) Multilingual speech synthesis and cross-language voice cloning
CN112102811B (en) Optimization method and device for synthesized voice and electronic equipment
EP4118641A1 (en) Speech recognition using unspoken text and speech synthesis
CN106486121B (en) Voice optimization method and device applied to intelligent robot
CN112037755B (en) Voice synthesis method and device based on timbre clone and electronic equipment
KR20230034423A (en) 2-level speech rhyme transmission
US11763797B2 (en) Text-to-speech (TTS) processing
JP2007249212A (en) Method, computer program and processor for text speech synthesis
JP2002530703A (en) Speech synthesis using concatenation of speech waveforms
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
Bulyko et al. Efficient integrated response generation from multiple targets using weighted finite state transducers
O'Shaughnessy Modern methods of speech synthesis
Stan et al. Generating the Voice of the Interactive Virtual Assistant
Kamble et al. Audio Visual Speech Synthesis and Speech Recognition for Hindi Language
CN112037757A (en) Singing voice synthesis method and device and computer readable storage medium
EP1589524B1 (en) Method and device for speech synthesis
WO2023197206A1 (en) Personalized and dynamic text to speech voice cloning using incompletely trained text to speech models
US20230018384A1 (en) Two-Level Text-To-Speech Systems Using Synthetic Training Data
KR20180103273A (en) Voice synthetic apparatus and voice synthetic method
EP1640968A1 (en) Method and device for speech synthesis
Belonozhko et al. Features of the Implementation of Real-Time Text-to-Speech Systems With Data Variability
Hirose Use of generation process model for improved control of fundamental frequency contours in HMM-based speech synthesis
Mna et al. Exploring the Impact of Speech AI: A Comparative Analysis of ML Models on Arabic Dataset
Bibra et al. Synergizing Voice Cloning and ChatGPT for Multimodal Conversational Interfaces

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant