CN111223474A - Voice cloning method and system based on multi-neural network - Google Patents

Voice cloning method and system based on multi-neural network

Info

Publication number
CN111223474A
CN111223474A (application CN202010041207.9A)
Authority
CN
China
Prior art keywords
neural network
voice
speaker
target speaker
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010041207.9A
Other languages
Chinese (zh)
Inventor
柳慧芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Shuixiang Electronic Technology Co ltd
Original Assignee
Wuhan Shuixiang Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Shuixiang Electronic Technology Co ltd filed Critical Wuhan Shuixiang Electronic Technology Co ltd
Priority to CN202010041207.9A
Publication of CN111223474A
Legal status: Pending (current)

Classifications

    • G — PHYSICS
    • G06N — Computing arrangements based on specific computational models
        • G06N3/02 — Neural networks
        • G06N3/045 — Combinations of networks
        • G06N3/08 — Learning methods
    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
        • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
        • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
        • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

A voice cloning method and system based on a multi-neural network. The audio data in a sample library, the text-independent speaker acoustic feature vectors corresponding to that audio data, and the text to be synthesized are used to obtain a multi-neural-network model for voice cloning. Target speaker audio data are acquired and used as the input of a first neural network model to obtain the target speaker's acoustic feature vector; the target speaker's audio, the text to be synthesized, and the target speaker's acoustic feature vector are used as the input of a second neural network model to generate the target speaker's primary cloned voice; and the primary cloned voice is used as the input of a third neural network model to obtain the target speaker's final cloned voice. The multi-neural-network approach places a low demand on the amount of target speaker data, trains quickly, and has a short customization period, and correcting the primary cloned voice with the third neural network model for voice conversion improves the quality of the final cloned voice.

Description

Voice cloning method and system based on multi-neural network
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a voice cloning method and system based on a multi-neural network.
Background
Speech synthesis, also known as text to speech (TTS), is a technique that produces fluent spoken language (spoken Chinese, for example) by mechanical and electronic means. Speech synthesis effectively equips a computer with a human-like "mouth" and plays a vital role in intelligent computer systems that can both "listen" and "talk". Voice cloning belongs to speech synthesis technology: it amounts to selecting the voice of a particular speaker and using that voice to speak content that comes from another speaker.
Compared with conventional text-to-speech synthesis, the requirements on voice cloning are broader: less customization time, lower data cost, and a wider range of synthesis targets. Like speech synthesis, voice cloning is a multidisciplinary field that draws on signal and information processing, information theory, stochastic processes, probability theory, acoustics, linguistics, psychology, computer science, artificial intelligence, and other specialties.
Patent application CN201910420416.1 discloses a method for cloning accent and prosody based on speech training. It designs a set of typical classified texts with different intonations, and the target speaker records readings that follow the intonation prescribed by each text; these recordings serve as the training material. After training, audio units with different intonations for the same phonetic symbol of the target speaker are obtained; these units preserve the target speaker's original accent and prosody and form a target speaker sound library. During synthesis, the intonation of the text to be synthesized is analyzed, matching voice units are retrieved from the library, and the units are assembled into smooth, natural audio through prosodic correction and alignment-error correction. The method requires the target speaker to cooperate in recording, which for general-purpose cloning entails a heavy workload and large costs in time, labor, and materials.
Patent application CN201910066489.5 discloses a system and method for neural voice cloning from a small number of samples. The method comprises three stages: training, cloning, and audio generation. A multi-speaker generative model that can adapt to speaker embeddings is first trained. During cloning, the new speaker's cloning audio and text are fed to the multi-speaker generative model and the new speaker's embedding is fine-tuned. Finally, the fine-tuned speaker embedding and the input text are fed to the multi-speaker model to generate audio. In that patent a single neural network is used: speaker identity is trained jointly with text-audio pairs in a speaker-adaptive network that outputs a speaker embedding containing the speaker's voice characteristics. That embedded representation can depend on the training text set; in other words, the speaker embedding does not purely represent acoustic characteristics but may also absorb text characteristics. This requires a training set in which every speaker's corpus covers the text set to a large extent and the corpora of different speakers are nearly parallel. In the cloning stage, where cloning text and audio are used to fine-tune the speaker embedding, cloning text that differs greatly from the training text causes unpredictable deviations in the speaker embedding, so the acoustic similarity between the final cloned audio and the speaker is very unstable.
Disclosure of Invention
In view of the above, a voice cloning method and system based on a multi-neural network are provided that overcome, or at least partially solve, the above-mentioned problems.
The invention discloses a speech cloning method based on a multi-neural network, which is characterized by comprising the following steps:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data;
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized;
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice;
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker;
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker;
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker.
Further, the acoustic feature vector in S100 includes: fundamental frequency, aperiodic features, and mel-spectrum data.
Further, in S100, when the same speaker has more than one piece of voice data, the acoustic feature vectors corresponding to those recordings are averaged and used as the speaker's acoustic feature vector.
Further, in S200, the acoustic feature vector and the text corresponding to the audio data are used as input, and the audio data is used as a label, so as to perform multiple rounds of iterative training.
Further, in S200, for long text and long audio data, the acoustic features are extracted segment by segment using the concatenation (splicing) method, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure.
Further, in S300, the original voice of the target speaker and the cloned voice generated in S200 are input, and loop iteration is performed to generate a mapping model from the cloned voice to the real voice, where the mapping model may be a mapping of audio data or a mapping of spectral feature data.
Further, the third neural network model for voice conversion in S300 is a GMM-based voice conversion model or a CycleGAN voice conversion model.
Further, when different recordings of the target speaker are input to the first neural network in S400, the target speaker's audio is classified manually or automatically by an algorithm; recordings with similar emotion and mood are placed in the same class, and a common acoustic feature vector is used for each class.
Further, the target speaker audio data in step S400 may also participate in the training of the first neural network model in step S100 to obtain a model optimal for the target speaker's audio and acoustics, and the acoustic feature vector corresponding to the target speaker's audio is output by this optimal model.
The invention also discloses a voice cloning system based on a multi-neural network, comprising: a sample library module, a first neural network module, a second neural network module and a third neural network module, wherein
the sample library module is used for storing sample data for training the first neural network module, the second neural network module and the third neural network module, the sample data comprising at least audio data, the text to be synthesized and the acoustic feature vectors corresponding to the audio data;
the first neural network module is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module obtains the target speaker's acoustic feature vector from the target speaker's audio data;
the second neural network module is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector;
the third neural network module generates a third neural network model for voice conversion from the primary cloned voice generated by the second neural network and the original voice; and the third neural network module obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
The invention has the beneficial effects that:
The method realizes voice cloning based on a multi-neural network. Using the audio data in a sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a voice cloning method based on a multi-neural network according to the first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the use of the multi-neural-network models for voice cloning according to the first embodiment of the present invention;
FIG. 3 is a structural diagram of a voice cloning system based on a multi-neural network according to the first embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To address the problems in the prior art that voice cloning requires the target speaker to cooperate in recording, that the demand for target speaker audio data is high, and that the acoustic similarity between the cloned voice and the speaker is unstable, embodiments of the invention provide a voice cloning method and system based on a multi-neural network.
Example one
This embodiment discloses a voice cloning method based on a multi-neural network, as shown in Fig. 1, including:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data. Specifically, a text-independent voiceprint authentication network is preferably used, and feature vectors are extracted from the layer immediately before the final output layer of the network to serve as the acoustic feature vectors that uniquely identify the speaker.
Preferably, the fundamental frequency, aperiodic features and mel spectrum are extracted as the acoustic features of each training speaker; these features are used as the input of the network model, the speaker's unique identity ID is used as the label, and multiple rounds of iterative training are carried out.
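As a concrete illustration of this preferred feature set, the sketch below extracts the fundamental frequency, aperiodicity and log-mel spectrum for a single utterance. The choice of the pyworld WORLD analysis routines and librosa, as well as the sample rate and number of mel bands, are assumptions made only for illustration; the patent does not name specific tools.

```python
# Sketch only: pyworld/librosa, the sample rate and the mel-band count are assumptions.
import numpy as np
import librosa
import pyworld


def extract_acoustic_features(wav_path, sr=16000, n_mels=80):
    """Return fundamental frequency, aperiodicity and log-mel spectrum for one utterance."""
    x, _ = librosa.load(wav_path, sr=sr)
    x64 = x.astype(np.float64)                      # WORLD analysis expects float64

    f0, t = pyworld.harvest(x64, sr)                # fundamental frequency (Hz) per frame
    ap = pyworld.d4c(x64, f0, t, sr)                # aperiodic component per frame
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)              # log-mel spectrum

    return f0, ap, log_mel
```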
It will be understood that when the same speaker has more than one piece of audio data, the acoustic features corresponding to the different recordings may differ slightly; the acoustic feature vector of each recording can be retained, the speaker's acoustic feature vectors can be averaged, or any one of them can be taken at random as the speaker's unified acoustic feature vector.
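One possible shape for the first neural network described above is sketched below: a voiceprint classifier trained with speaker identity IDs as labels, whose layer before the output layer provides the text-independent speaker embedding, with per-speaker averaging when a speaker has several utterances. PyTorch, the layer sizes and the mean pooling over frames are assumptions for illustration, not details taken from the patent.

```python
# Sketch only: framework, layer sizes and pooling are assumptions about the S100 network.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Voiceprint classifier whose penultimate layer doubles as the speaker embedding."""

    def __init__(self, feat_dim=80, emb_dim=256, n_speakers=1000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, num_layers=2, batch_first=True)
        self.embedding_layer = nn.Linear(256, emb_dim)    # layer before the output layer
        self.classifier = nn.Linear(emb_dim, n_speakers)  # trained with speaker-ID labels

    def embed(self, feats):                   # feats: (batch, frames, feat_dim)
        out, _ = self.rnn(feats)
        pooled = out.mean(dim=1)              # average over frames -> text-independent
        return self.embedding_layer(pooled)   # speaker acoustic feature vector

    def forward(self, feats):
        return self.classifier(self.embed(feats))


def speaker_vector(encoder, utterance_feats):
    """Average the embeddings when a speaker has more than one utterance (as above)."""
    with torch.no_grad():
        embs = [encoder.embed(f.unsqueeze(0)) for f in utterance_feats]
    return torch.stack(embs).mean(dim=0).squeeze(0)
```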
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized. Specifically, apart from its input and output, the backbone of this network model is a speech synthesis network, preferably an end-to-end text-to-speech (TTS) network;
the network is trained by taking, as input, the acoustic feature vector that the first neural network produces for each audio clip together with the text corresponding to that audio, using the audio data as labels, and performing multiple rounds of iterative training;
the feature vector and the audio are input together, and the data may be organized by concatenation (splicing). In some embodiments, for long text and long audio, the acoustic features are extracted segment by segment, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure. This neural network model can perform text-to-speech conversion for the target speaker; experimental results show that the synthesized speech clones the target speaker's timbre well, and the output is used as the primary voice cloning result in this embodiment.
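Two of the data-organization choices just mentioned can be sketched as follows: broadcasting the speaker acoustic feature vector and splicing (concatenating) it onto the text encodings at every step, and cutting long feature/audio pairs into fixed-length segments that are interleaved for input. The tensor shapes, the concatenation point and the segment length are assumptions made only for illustration.

```python
# Sketch only: shapes, the splice point and the segment length are assumptions about S200.
import torch


def condition_on_speaker(text_encodings, speaker_vec):
    """Broadcast the speaker acoustic feature vector and concatenate (splice) it onto
    every encoder step so the TTS decoder sees text and speaker identity jointly."""
    # text_encodings: (batch, steps, enc_dim); speaker_vec: (batch, spk_dim)
    spk = speaker_vec.unsqueeze(1).expand(-1, text_encodings.size(1), -1)
    return torch.cat([text_encodings, spk], dim=-1)


def interleave_fixed_length(features, audio, seg_len=400):
    """Cut long feature and audio sequences into fixed-length segments and interleave
    them (feature segment, audio segment, ...); frame/sample alignment is glossed over."""
    pairs = []
    for start in range(0, min(len(features), len(audio)), seg_len):
        pairs.append(("feat", features[start:start + seg_len]))
        pairs.append(("audio", audio[start:start + seg_len]))
    return pairs
```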
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice. Specifically, in some embodiments the voice conversion network model may be a GMM-based voice conversion model or a CycleGAN voice conversion model.
The audio in the target speaker's data set and the corresponding parallel cloned audio generated in S200 are input to the third neural network model, and loop iteration is performed to generate a mapping model from cloned data to real data. Preferably, during network training the feature parameters extracted from the audio are chosen from fundamental frequency, aperiodic features and mel spectrum data. In some embodiments the conversion model may be a mapping of audio data or a mapping of spectral feature data.
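The patent names GMM-based and CycleGAN voice conversion models for this third network. The simplified sketch below stands in for either of them: it trains a plain feed-forward mapping from the cloned speech's spectral features to the target speaker's real features with an L1 loss over repeated iterations, purely to illustrate the clone-to-real mapping idea. The network shape, optimizer and loss are assumptions, and this is not the GMM or CycleGAN formulation itself.

```python
# Simplified sketch: a direct feature-mapping network standing in for the GMM/CycleGAN
# conversion models named in the patent; sizes, optimizer and loss are assumptions.
import torch
import torch.nn as nn


def train_conversion_model(clone_feats, real_feats, feat_dim=80, epochs=200, lr=1e-3):
    """clone_feats, real_feats: parallel tensors of shape (frames, feat_dim)."""
    model = nn.Sequential(
        nn.Linear(feat_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, feat_dim),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                        # loop iteration over the parallel pairs
        pred = model(clone_feats)
        loss = (pred - real_feats).abs().mean()    # L1 distance between mapped and real
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                                   # maps cloned features toward real ones
```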
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker. Specifically, as shown in Fig. 2, because the acoustic features produced when different recordings of the target speaker are input to the first neural network model differ slightly, the individual feature values may be retained, averaged, or one of them taken as a representative value. Here, the target speaker's audio is preferably classified manually or automatically by an algorithm, so that recordings with similar emotion and mood fall into the same class and a common acoustic feature vector is used for each class.
In some embodiments, one piece of the target speaker's audio may be selected as reference audio and input to the first acoustic neural network model trained in step S100 to extract the corresponding reference acoustic features. When selecting the reference audio, a manual or algorithmic classification method may be used to offer the user a visual selection, and the user chooses, as the reference audio, audio whose acoustic characteristics match the desired rendering of the target speaker. For example, audio from a "pleasant speech" class in the data set is selected as reference audio, fed into the acoustic feature network (network 1), and the target speaker's acoustic feature vector is output. In some preferred embodiments, a feature vector database is built for the classified reference speech, so that classified reference speech can be queried and matched directly.
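The automatic classification and the feature-vector database mentioned above could, for example, be realized by clustering the per-utterance speaker embeddings and storing one averaged vector per class. K-means, scikit-learn and the number of classes are assumptions for illustration; the patent equally allows purely manual grouping.

```python
# Sketch only: K-means and the class count are assumptions for the "algorithmic
# classification" described above; the patent also allows manual grouping.
import numpy as np
from sklearn.cluster import KMeans


def build_reference_vector_db(utterance_embeddings, n_classes=3):
    """Group utterances with similar emotion/mood and keep one shared vector per group."""
    X = np.stack(utterance_embeddings)              # (n_utterances, emb_dim)
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    db = {c: X[labels == c].mean(axis=0) for c in range(n_classes)}
    return db, labels

# Usage idea: pick the class whose recordings match the desired style (e.g. "pleasant
# speech") and read its averaged vector from `db` instead of re-running the first network.
```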
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker. As shown in Fig. 2, the acoustic feature vectors corresponding to all of the target speaker's audio generated in S400, together with the text corresponding to the audio in the target speaker data set, are input to the second neural network model, in which acoustic features control text-to-speech synthesis, and cloned data parallel to the target speaker data set are generated. In some preferred embodiments, the output of the model may instead be spectral feature data of the audio.
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker. Referring to Fig. 2, the cloned audio generated in S500 and the target speaker's original audio are input to the voice conversion network for training, and the trained voice conversion network further improves the cloning quality.
This embodiment realizes voice cloning based on a multi-neural network. Using the audio data in the sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
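Putting steps S400 to S600 above together, a single inference pass can be sketched as below. The three trained networks are passed in as plain callables because the patent fixes their roles but not their concrete interfaces; the function and argument names are hypothetical.

```python
# Sketch only: the models are plain callables because the patent specifies their roles
# (S400-S600) but not their concrete interfaces; all names here are hypothetical.
def clone_speech(target_audio_feats, text, speaker_encoder, tts_model, conversion_model):
    """Run the three-network pipeline once for one text to be synthesized."""
    speaker_vec = speaker_encoder(target_audio_feats)     # S400: acoustic feature vector
    primary_clone = tts_model(text, speaker_vec)          # S500: primary cloned voice
    final_clone = conversion_model(primary_clone)         # S600: final cloned voice
    return final_clone
```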
Example two
This embodiment discloses a voice cloning system based on a multi-neural network, comprising: a sample library module 1, a first neural network module 2, a second neural network module 3 and a third neural network module 4, wherein
the sample library module 1 is used for storing sample data for training the first neural network module 2, the second neural network module 3 and the third neural network module 4, the sample data comprising at least audio data, the text to be synthesized and the acoustic feature vectors corresponding to the audio data; the fundamental frequency, aperiodic features and mel spectrum are preferably extracted as the acoustic feature vectors of the training speakers.
The first neural network module 2 is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module 2 obtains the target speaker's acoustic feature vector from the target speaker's audio data.
In some embodiments, when the same speaker has more than one piece of audio data, the acoustic features corresponding to the different recordings may differ slightly; the acoustic feature vector of each recording can be retained, the speaker's acoustic feature vectors can be averaged, or any one of them can be taken at random as the speaker's unified acoustic feature vector.
The second neural network module 3 is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module 3 generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector.
Specifically, apart from its input and output, the backbone of the network model is a speech synthesis network, preferably an end-to-end text-to-speech (TTS) network;
the network is trained by taking, as input, the acoustic feature vector that the first neural network produces for each audio clip together with the text corresponding to that audio, using the audio data as labels, and performing multiple rounds of iterative training;
the feature vector and the audio are input together, and the data may be organized by concatenation (splicing). In some embodiments, for long text and long audio, the acoustic features are extracted segment by segment, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure. This neural network model can perform text-to-speech conversion for the target speaker; experimental results show that the synthesized speech clones the target speaker's timbre well, and the output is used as the primary voice cloning result in this embodiment.
The third neural network module 4 generates a third neural network model for voice conversion using the primary cloned voice generated by the second neural network and the original voice; the third neural network module 4 obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
In some embodiments, the voice conversion model may be a GMM-based voice conversion model or a CycleGAN voice conversion model.
The audio in the target speaker's data set and the corresponding parallel cloned audio generated in S200 are input to the third neural network model, and loop iteration is performed to generate a mapping model from cloned data to real data. Preferably, during network training the feature parameters extracted from the audio are chosen from fundamental frequency, aperiodic features and mel spectrum data. In some embodiments the conversion model may be a mapping of audio data or a mapping of spectral feature data.
This embodiment realizes voice cloning based on a multi-neural network. Using the audio data in the sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (10)

1. A speech cloning method based on a multi-neural network is characterized by comprising the following steps:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data;
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized;
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice;
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker;
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker;
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker.
2. The method of claim 1, wherein the acoustic feature vectors in S100 comprise: fundamental frequency, aperiodic features, and mel-spectrum data.
3. The method of claim 1, wherein in S100, when a speaker has more than one piece of audio data, the speaker's acoustic feature vector is obtained by averaging the acoustic feature vectors corresponding to the audio.
4. The method as claimed in claim 1, wherein in S200, the acoustic feature vectors and the text corresponding to the audio data are used as input, and the audio data are used as tags, and multiple rounds of iterative training are performed.
5. The method of claim 1, wherein in S200, for long text and long audio data, the acoustic features are extracted segment by segment using the concatenation method, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure.
6. The method for cloning voices based on the multi-neural network as claimed in claim 1, wherein in S300, the original voice of the target speaker and the cloned voice generated in S200 are input and iterated in a loop to generate a mapping model from the cloned voice to the real voice, wherein the mapping model may be a mapping of audio data or a mapping of spectral feature data.
7. The method of claim 1, wherein the third neural network model for speech conversion in S300 is a GMM speech conversion model or a CycleGAN speech conversion model.
8. The method for cloning voices based on the multi-neural network as claimed in claim 1, wherein when different recordings of the target speaker are input to the first neural network in S400, the target speaker's audio is classified manually or algorithmically, recordings with similar emotion and mood are placed in the same class, and a common acoustic feature vector is used.
9. The method of claim 1, wherein the target speaker audio data in S400 participates in the training of the first neural network model in S100 to obtain a model optimal for the target speaker's audio and acoustics, and the acoustic feature vector corresponding to the target speaker's audio is output through this optimal model.
10. A voice cloning system based on a multi-neural network, comprising: a sample library module, a first neural network module, a second neural network module and a third neural network module, wherein
the sample library module is used for storing sample data for training the first neural network module, the second neural network module and the third neural network module, the sample data comprising at least audio data, the text to be synthesized and the speaker acoustic feature vectors corresponding to the audio data;
the first neural network module is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module obtains the target speaker's acoustic feature vector from the target speaker's audio data;
the second neural network module is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector;
the third neural network module generates a third neural network model for voice conversion from the primary cloned voice generated by the second neural network and the original voice; and the third neural network module obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
CN202010041207.9A 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network Pending CN111223474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041207.9A CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010041207.9A CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Publications (1)

Publication Number Publication Date
CN111223474A 2020-06-02

Family

ID=70832279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041207.9A Pending CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Country Status (1)

Country Link
CN (1) CN111223474A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968617A (en) * 2020-08-25 2020-11-20 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112102808A (en) * 2020-08-25 2020-12-18 上海红阵信息科技有限公司 Method and system for constructing deep neural network for voice forgery
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN112383721A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040230410A1 (en) * 2003-05-13 2004-11-18 Harless William G. Method and system for simulated interactive conversation
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
CN109523989A (en) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment
US20190096385A1 (en) * 2017-09-28 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating speech synthesis model
CN110136693A (en) * 2018-02-09 2019-08-16 百度(美国)有限责任公司 System and method for using a small amount of sample to carry out neural speech clone

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040230410A1 (en) * 2003-05-13 2004-11-18 Harless William G. Method and system for simulated interactive conversation
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
US20190096385A1 (en) * 2017-09-28 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating speech synthesis model
CN110136693A (en) * 2018-02-09 2019-08-16 百度(美国)有限责任公司 System and method for using a small amount of sample to carry out neural speech clone
CN109523989A (en) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MERLIJN BLAAUW: "Data Efficient Voice Cloning for Neural Singing Synthesis", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
YU ZHANG: "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning", arXiv *
张君腾: "Research on speech synthesis methods based on deep neural networks" (基于深度神经网络的语音合成方法研究), China Masters' Theses Full-text Database *
胡亚军: "Research on statistical parametric speech synthesis methods based on neural networks" (基于神经网络的统计参数语音合成方法研究), China Doctoral Dissertations Full-text Database *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968617A (en) * 2020-08-25 2020-11-20 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112102808A (en) * 2020-08-25 2020-12-18 上海红阵信息科技有限公司 Method and system for constructing deep neural network for voice forgery
CN111968617B (en) * 2020-08-25 2024-03-15 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network
CN112233646B (en) * 2020-10-20 2024-05-31 携程计算机技术(上海)有限公司 Voice cloning method, system, equipment and storage medium based on neural network
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN112383721A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112383721B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating video
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Similar Documents

Publication Publication Date Title
CN111223474A (en) Voice cloning method and system based on multi-neural network
US11410684B1 (en) Text-to-speech (TTS) processing with transfer of vocal characteristics
CN101578659B (en) Voice tone converting device and voice tone converting method
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
JP4176169B2 (en) Runtime acoustic unit selection method and apparatus for language synthesis
US9830904B2 (en) Text-to-speech device, text-to-speech method, and computer program product
CN108899009B (en) Chinese speech synthesis system based on phoneme
JP3588302B2 (en) Method of identifying unit overlap region for concatenated speech synthesis and concatenated speech synthesis method
US11763797B2 (en) Text-to-speech (TTS) processing
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
Liu et al. High quality voice conversion through phoneme-based linear mapping functions with straight for mandarin
US10008216B2 (en) Method and apparatus for exemplary morphing computer system background
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
US9798653B1 (en) Methods, apparatus and data structure for cross-language speech adaptation
US20070294082A1 (en) Voice Recognition Method and System Adapted to the Characteristics of Non-Native Speakers
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
US6546369B1 (en) Text-based speech synthesis method containing synthetic speech comparisons and updates
CN115101046A (en) Method and device for synthesizing voice of specific speaker
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
JP6330069B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Lee et al. A comparative study of spectral transformation techniques for singing voice synthesis.
KR102277205B1 (en) Apparatus for converting audio and method thereof
KR20220070979A (en) Style speech synthesis apparatus and speech synthesis method using style encoding network
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Hinterleitner et al. Speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2023-07-28)