CN111223474A - Voice cloning method and system based on multi-neural network - Google Patents

Voice cloning method and system based on multi-neural network

Info

Publication number
CN111223474A
CN111223474A (application CN202010041207.9A)
Authority
CN
China
Prior art keywords
neural network
voice
speaker
target speaker
audio data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010041207.9A
Other languages
Chinese (zh)
Inventor
柳慧芬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Shuixiang Electronic Technology Co ltd
Original Assignee
Wuhan Shuixiang Electronic Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Shuixiang Electronic Technology Co ltd filed Critical Wuhan Shuixiang Electronic Technology Co ltd
Priority to CN202010041207.9A
Publication of CN111223474A
Legal status: Pending (current)

Classifications

    • G — PHYSICS
    • G06N — Computing arrangements based on specific computational models
        • G06N3/02 — Neural networks
        • G06N3/045 — Combinations of networks
        • G06N3/08 — Learning methods
    • G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
        • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
        • G10L13/04 — Details of speech synthesis systems, e.g. synthesiser structure or memory management
        • G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks


Abstract

A voice cloning method and system based on a multi-neural network. The audio data in a sample library, the text-independent speaker acoustic feature vectors corresponding to that audio data, and the text to be synthesized are used to obtain a multi-neural-network model for voice cloning. Target speaker audio data are acquired and used as the input of a first neural network model to obtain the target speaker's acoustic feature vector; the target speaker's audio, the text to be synthesized, and the target speaker's acoustic feature vector are used as the input of a second neural network model to generate the target speaker's primary cloned voice; and the primary cloned voice is used as the input of a third neural network model to obtain the target speaker's final cloned voice. The multi-neural-network approach places a low demand on the amount of target speaker data, trains quickly, and has a short customization period, and correcting the primary cloned voice with the third neural network model for voice conversion improves the quality of the final cloned voice.

Description

Voice cloning method and system based on multi-neural network
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a voice cloning method and system based on a multi-neural network.
Background
Speech synthesis, also known as text to speech (TTS), is a technique that produces fluent spoken language (spoken Chinese, for example) by mechanical and electronic means. Speech synthesis effectively equips a computer with a human-like "mouth" and plays a vital role in intelligent computer systems that can both "listen" and "talk". Voice cloning belongs to speech synthesis technology: it amounts to selecting the voice of a particular speaker and using that voice to speak content that comes from another speaker.
Compared with conventional text-to-speech synthesis, the requirements on voice cloning are broader: less customization time, lower data cost, and a wider range of synthesis targets. Like speech synthesis, voice cloning is a multidisciplinary field that draws on signal and information processing, information theory, stochastic processes, probability theory, acoustics, linguistics, psychology, computer science, artificial intelligence, and other specialties.
Patent application CN201910420416.1 discloses a method for cloning accent and prosody based on speech training. It designs a set of typical classified texts with different intonations, and the target speaker records readings that follow the intonation prescribed by each text; these recordings serve as the training material. After training, audio units with different intonations for the same phonetic symbol of the target speaker are obtained; these units preserve the target speaker's original accent and prosody and form a target speaker sound library. During synthesis, the intonation of the text to be synthesized is analyzed, matching voice units are retrieved from the library, and the units are assembled into smooth, natural audio through prosodic correction and alignment-error correction. The method requires the target speaker to cooperate in recording, which for general-purpose cloning entails a heavy workload and large costs in time, labor, and materials.
Patent application CN201910066489.5 discloses a system and method for neural voice cloning from a small number of samples. The method comprises three stages: training, cloning, and audio generation. A multi-speaker generative model that can adapt to speaker embeddings is first trained. During cloning, the new speaker's cloning audio and text are fed to the multi-speaker generative model and the new speaker's embedding is fine-tuned. Finally, the fine-tuned speaker embedding and the input text are fed to the multi-speaker model to generate audio. In that patent a single neural network is used: speaker identity is trained jointly with text-audio pairs in a speaker-adaptive network that outputs a speaker embedding containing the speaker's voice characteristics. That embedded representation can depend on the training text set; in other words, the speaker embedding does not purely represent acoustic characteristics but may also absorb text characteristics. This requires a training set in which every speaker's corpus covers the text set to a large extent and the corpora of different speakers are nearly parallel. In the cloning stage, where cloning text and audio are used to fine-tune the speaker embedding, cloning text that differs greatly from the training text causes unpredictable deviations in the speaker embedding, so the acoustic similarity between the final cloned audio and the speaker is very unstable.
Disclosure of Invention
In view of the above, a voice cloning method and system based on a multi-neural network are provided that overcome, or at least partially solve, the above-mentioned problems.
The invention discloses a speech cloning method based on a multi-neural network, which is characterized by comprising the following steps:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data;
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized;
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice;
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker;
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker;
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker.
Further, the acoustic feature vector in S100 includes: fundamental frequency, aperiodic features, and mel-spectrum data.
Further, in S100, when the same speaker has more than one piece of voice data, the acoustic feature vectors corresponding to those recordings are averaged and used as the speaker's acoustic feature vector.
Further, in S200, the acoustic feature vector and the text corresponding to the audio data are used as input, and the audio data is used as a label, so as to perform multiple rounds of iterative training.
Further, in S200, for long text and long audio data, the acoustic features are extracted segment by segment using the concatenation (splicing) method, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure.
Further, in S300, the original voice of the target speaker and the cloned voice generated in S200 are input, and loop iteration is performed to generate a mapping model from the cloned voice to the real voice, where the mapping model may be a mapping of audio data or a mapping of spectral feature data.
Further, the third neural network model for voice conversion in S300 is a GMM-based voice conversion model or a CycleGAN voice conversion model.
Further, when different recordings of the target speaker are input to the first neural network in S400, the target speaker's audio is classified manually or automatically by an algorithm; recordings with similar emotion and mood are placed in the same class, and a common acoustic feature vector is used for each class.
Further, the target speaker audio data in step S400 may also participate in the training of the first neural network model in step S100 to obtain a model optimal for the target speaker's audio and acoustics, and the acoustic feature vector corresponding to the target speaker's audio is output by this optimal model.
The invention also discloses a voice cloning system based on a multi-neural network, comprising: a sample library module, a first neural network module, a second neural network module and a third neural network module, wherein
the sample library module is used for storing sample data for training the first neural network module, the second neural network module and the third neural network module, the sample data comprising at least audio data, the text to be synthesized and the acoustic feature vectors corresponding to the audio data;
the first neural network module is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module obtains the target speaker's acoustic feature vector from the target speaker's audio data;
the second neural network module is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector;
the third neural network module generates a third neural network model for voice conversion from the primary cloned voice generated by the second neural network and the original voice; and the third neural network module obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
The invention has the beneficial effects that:
The method realizes voice cloning based on a multi-neural network. Using the audio data in a sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a flowchart of a voice cloning method based on a multi-neural network according to the first embodiment of the present invention;
FIG. 2 is a flowchart illustrating the use of the multi-neural-network models for voice cloning according to the first embodiment of the present invention;
FIG. 3 is a structural diagram of a voice cloning system based on a multi-neural network according to the first embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
To address the problems in the prior art that voice cloning requires the target speaker to cooperate in recording, that the demand for target speaker audio data is high, and that the acoustic similarity between the cloned voice and the speaker is unstable, embodiments of the invention provide a voice cloning method and system based on a multi-neural network.
Example one
This embodiment discloses a voice cloning method based on a multi-neural network, as shown in Fig. 1, including:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data. Specifically, a text-independent voiceprint authentication network is preferably used, and feature vectors are extracted from the layer immediately before the final output layer of the network to serve as the acoustic feature vectors that uniquely identify the speaker.
Preferably, the fundamental frequency, aperiodic features and mel spectrum are extracted as the acoustic features of each training speaker; these features are used as the input of the network model, the speaker's unique identity ID is used as the label, and multiple rounds of iterative training are carried out.
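As a concrete illustration of this preferred feature set, the sketch below extracts the fundamental frequency, aperiodicity and log-mel spectrum for a single utterance. The choice of the pyworld WORLD analysis routines and librosa, as well as the sample rate and number of mel bands, are assumptions made only for illustration; the patent does not name specific tools.

```python
# Sketch only: pyworld/librosa, the sample rate and the mel-band count are assumptions.
import numpy as np
import librosa
import pyworld


def extract_acoustic_features(wav_path, sr=16000, n_mels=80):
    """Return fundamental frequency, aperiodicity and log-mel spectrum for one utterance."""
    x, _ = librosa.load(wav_path, sr=sr)
    x64 = x.astype(np.float64)                      # WORLD analysis expects float64

    f0, t = pyworld.harvest(x64, sr)                # fundamental frequency (Hz) per frame
    ap = pyworld.d4c(x64, f0, t, sr)                # aperiodic component per frame
    mel = librosa.feature.melspectrogram(y=x, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)              # log-mel spectrum

    return f0, ap, log_mel
```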
It will be understood that when the same speaker has more than one piece of audio data, the acoustic features corresponding to the different recordings may differ slightly; the acoustic feature vector of each recording can be retained, the speaker's acoustic feature vectors can be averaged, or any one of them can be taken at random as the speaker's unified acoustic feature vector.
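One possible shape for the first neural network described above is sketched below: a voiceprint classifier trained with speaker identity IDs as labels, whose layer before the output layer provides the text-independent speaker embedding, with per-speaker averaging when a speaker has several utterances. PyTorch, the layer sizes and the mean pooling over frames are assumptions for illustration, not details taken from the patent.

```python
# Sketch only: framework, layer sizes and pooling are assumptions about the S100 network.
import torch
import torch.nn as nn


class SpeakerEncoder(nn.Module):
    """Voiceprint classifier whose penultimate layer doubles as the speaker embedding."""

    def __init__(self, feat_dim=80, emb_dim=256, n_speakers=1000):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, 256, num_layers=2, batch_first=True)
        self.embedding_layer = nn.Linear(256, emb_dim)    # layer before the output layer
        self.classifier = nn.Linear(emb_dim, n_speakers)  # trained with speaker-ID labels

    def embed(self, feats):                   # feats: (batch, frames, feat_dim)
        out, _ = self.rnn(feats)
        pooled = out.mean(dim=1)              # average over frames -> text-independent
        return self.embedding_layer(pooled)   # speaker acoustic feature vector

    def forward(self, feats):
        return self.classifier(self.embed(feats))


def speaker_vector(encoder, utterance_feats):
    """Average the embeddings when a speaker has more than one utterance (as above)."""
    with torch.no_grad():
        embs = [encoder.embed(f.unsqueeze(0)) for f in utterance_feats]
    return torch.stack(embs).mean(dim=0).squeeze(0)
```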
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized. Specifically, apart from its input and output, the backbone of this network model is a speech synthesis network, preferably an end-to-end text-to-speech (TTS) network;
the network is trained by taking, as input, the acoustic feature vector that the first neural network produces for each audio clip together with the text corresponding to that audio, using the audio data as labels, and performing multiple rounds of iterative training;
the feature vector and the audio are input together, and the data may be organized by concatenation (splicing). In some embodiments, for long text and long audio, the acoustic features are extracted segment by segment, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure. This neural network model can perform text-to-speech conversion for the target speaker; experimental results show that the synthesized speech clones the target speaker's timbre well, and the output is used as the primary voice cloning result in this embodiment.
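Two of the data-organization choices just mentioned can be sketched as follows: broadcasting the speaker acoustic feature vector and splicing (concatenating) it onto the text encodings at every step, and cutting long feature/audio pairs into fixed-length segments that are interleaved for input. The tensor shapes, the concatenation point and the segment length are assumptions made only for illustration.

```python
# Sketch only: shapes, the splice point and the segment length are assumptions about S200.
import torch


def condition_on_speaker(text_encodings, speaker_vec):
    """Broadcast the speaker acoustic feature vector and concatenate (splice) it onto
    every encoder step so the TTS decoder sees text and speaker identity jointly."""
    # text_encodings: (batch, steps, enc_dim); speaker_vec: (batch, spk_dim)
    spk = speaker_vec.unsqueeze(1).expand(-1, text_encodings.size(1), -1)
    return torch.cat([text_encodings, spk], dim=-1)


def interleave_fixed_length(features, audio, seg_len=400):
    """Cut long feature and audio sequences into fixed-length segments and interleave
    them (feature segment, audio segment, ...); frame/sample alignment is glossed over."""
    pairs = []
    for start in range(0, min(len(features), len(audio)), seg_len):
        pairs.append(("feat", features[start:start + seg_len]))
        pairs.append(("audio", audio[start:start + seg_len]))
    return pairs
```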
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice. Specifically, in some embodiments the voice conversion network model may be a GMM-based voice conversion model or a CycleGAN voice conversion model.
The audio in the target speaker's data set and the corresponding parallel cloned audio generated in S200 are input to the third neural network model, and loop iteration is performed to generate a mapping model from cloned data to real data. Preferably, during network training the feature parameters extracted from the audio are chosen from fundamental frequency, aperiodic features and mel spectrum data. In some embodiments the conversion model may be a mapping of audio data or a mapping of spectral feature data.
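The patent names GMM-based and CycleGAN voice conversion models for this third network. The simplified sketch below stands in for either of them: it trains a plain feed-forward mapping from the cloned speech's spectral features to the target speaker's real features with an L1 loss over repeated iterations, purely to illustrate the clone-to-real mapping idea. The network shape, optimizer and loss are assumptions, and this is not the GMM or CycleGAN formulation itself.

```python
# Simplified sketch: a direct feature-mapping network standing in for the GMM/CycleGAN
# conversion models named in the patent; sizes, optimizer and loss are assumptions.
import torch
import torch.nn as nn


def train_conversion_model(clone_feats, real_feats, feat_dim=80, epochs=200, lr=1e-3):
    """clone_feats, real_feats: parallel tensors of shape (frames, feat_dim)."""
    model = nn.Sequential(
        nn.Linear(feat_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, feat_dim),
    )
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):                        # loop iteration over the parallel pairs
        pred = model(clone_feats)
        loss = (pred - real_feats).abs().mean()    # L1 distance between mapped and real
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model                                   # maps cloned features toward real ones
```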
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker. Specifically, as shown in Fig. 2, because the acoustic features produced when different recordings of the target speaker are input to the first neural network model differ slightly, the individual feature values may be retained, averaged, or one of them taken as a representative value. Here, the target speaker's audio is preferably classified manually or automatically by an algorithm, so that recordings with similar emotion and mood fall into the same class and a common acoustic feature vector is used for each class.
In some embodiments, one piece of the target speaker's audio may be selected as reference audio and input to the first acoustic neural network model trained in step S100 to extract the corresponding reference acoustic features. When selecting the reference audio, a manual or algorithmic classification method may be used to offer the user a visual selection, and the user chooses, as the reference audio, audio whose acoustic characteristics match the desired rendering of the target speaker. For example, audio from a "pleasant speech" class in the data set is selected as reference audio, fed into the acoustic feature network (network 1), and the target speaker's acoustic feature vector is output. In some preferred embodiments, a feature vector database is built for the classified reference speech, so that classified reference speech can be queried and matched directly.
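The automatic classification and the feature-vector database mentioned above could, for example, be realized by clustering the per-utterance speaker embeddings and storing one averaged vector per class. K-means, scikit-learn and the number of classes are assumptions for illustration; the patent equally allows purely manual grouping.

```python
# Sketch only: K-means and the class count are assumptions for the "algorithmic
# classification" described above; the patent also allows manual grouping.
import numpy as np
from sklearn.cluster import KMeans


def build_reference_vector_db(utterance_embeddings, n_classes=3):
    """Group utterances with similar emotion/mood and keep one shared vector per group."""
    X = np.stack(utterance_embeddings)              # (n_utterances, emb_dim)
    labels = KMeans(n_clusters=n_classes, n_init=10).fit_predict(X)
    db = {c: X[labels == c].mean(axis=0) for c in range(n_classes)}
    return db, labels

# Usage idea: pick the class whose recordings match the desired style (e.g. "pleasant
# speech") and read its averaged vector from `db` instead of re-running the first network.
```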
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker. As shown in Fig. 2, the acoustic feature vectors corresponding to all of the target speaker's audio generated in S400, together with the text corresponding to the audio in the target speaker data set, are input to the second neural network model, in which acoustic features control text-to-speech synthesis, and cloned data parallel to the target speaker data set are generated. In some preferred embodiments, the output of the model may instead be spectral feature data of the audio.
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker. Referring to Fig. 2, the cloned audio generated in S500 and the target speaker's original audio are input to the voice conversion network for training, and the trained voice conversion network further improves the cloning quality.
This embodiment realizes voice cloning based on a multi-neural network. Using the audio data in the sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
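Putting steps S400 to S600 above together, a single inference pass can be sketched as below. The three trained networks are passed in as plain callables because the patent fixes their roles but not their concrete interfaces; the function and argument names are hypothetical.

```python
# Sketch only: the models are plain callables because the patent specifies their roles
# (S400-S600) but not their concrete interfaces; all names here are hypothetical.
def clone_speech(target_audio_feats, text, speaker_encoder, tts_model, conversion_model):
    """Run the three-network pipeline once for one text to be synthesized."""
    speaker_vec = speaker_encoder(target_audio_feats)     # S400: acoustic feature vector
    primary_clone = tts_model(text, speaker_vec)          # S500: primary cloned voice
    final_clone = conversion_model(primary_clone)         # S600: final cloned voice
    return final_clone
```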
Example two
This embodiment discloses a voice cloning system based on a multi-neural network, comprising: a sample library module 1, a first neural network module 2, a second neural network module 3 and a third neural network module 4, wherein
the sample library module 1 is used for storing sample data for training the first neural network module 2, the second neural network module 3 and the third neural network module 4, the sample data comprising at least audio data, the text to be synthesized and the acoustic feature vectors corresponding to the audio data; the fundamental frequency, aperiodic features and mel spectrum are preferably extracted as the acoustic feature vectors of the training speakers.
The first neural network module 2 is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module 2 obtains the target speaker's acoustic feature vector from the target speaker's audio data.
In some embodiments, when the same speaker has more than one piece of audio data, the acoustic features corresponding to the different recordings may differ slightly; the acoustic feature vector of each recording can be retained, the speaker's acoustic feature vectors can be averaged, or any one of them can be taken at random as the speaker's unified acoustic feature vector.
The second neural network module 3 is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module 3 generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector.
Specifically, apart from its input and output, the backbone of the network model is a speech synthesis network, preferably an end-to-end text-to-speech (TTS) network;
the network is trained by taking, as input, the acoustic feature vector that the first neural network produces for each audio clip together with the text corresponding to that audio, using the audio data as labels, and performing multiple rounds of iterative training;
the feature vector and the audio are input together, and the data may be organized by concatenation (splicing). In some embodiments, for long text and long audio, the acoustic features are extracted segment by segment, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure. This neural network model can perform text-to-speech conversion for the target speaker; experimental results show that the synthesized speech clones the target speaker's timbre well, and the output is used as the primary voice cloning result in this embodiment.
The third neural network module 4 generates a third neural network model for voice conversion using the primary cloned voice generated by the second neural network and the original voice; the third neural network module 4 obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
In some embodiments, the voice conversion model may be a GMM-based voice conversion model or a CycleGAN voice conversion model.
The audio in the target speaker's data set and the corresponding parallel cloned audio generated in S200 are input to the third neural network model, and loop iteration is performed to generate a mapping model from cloned data to real data. Preferably, during network training the feature parameters extracted from the audio are chosen from fundamental frequency, aperiodic features and mel spectrum data. In some embodiments the conversion model may be a mapping of audio data or a mapping of spectral feature data.
This embodiment realizes voice cloning based on a multi-neural network. Using the audio data in the sample library and the speaker identity labels corresponding to that audio data, a first neural network model for extracting text-independent speaker acoustic feature vectors is trained, and core parameters inside the network are extracted as the speaker acoustic feature vectors; because the speaker feature vectors used in synthesis are not influenced by the text, any text to be synthesized can be rendered with stable, high acoustic fidelity. A second neural network model, in which acoustic features control text-to-speech synthesis, is generated from the acoustic feature vectors produced by the first neural network, the audio data and the text to be synthesized, so that the speaker's acoustic characteristics and the phonetic units of the text are both expressed in the output audio data. A third, voice conversion neural network model is generated from the primary cloned voice produced by the second neural network and the original voice: the target speaker's cloned audio and original audio are fed into this third model for training, and the trained third neural network model further improves the cloned voice. In the invention, only the third neural network model for voice conversion requires the target speaker's data for training, so the requirement on the amount of target speaker data is low, training is fast, and the customization period is short. Correcting the primary cloned voice with the third, voice conversion neural network model increases the similarity of the final cloned voice to the real voice and improves the cloning result.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Of course, the processor and the storage medium may reside as discrete components in a user terminal.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in memory units and executed by processors. The memory unit may be implemented within the processor or external to the processor, in which case it can be communicatively coupled to the processor via various means as is known in the art.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification of the claims is intended to mean a "non-exclusive or".

Claims (10)

1. A speech cloning method based on a multi-neural network is characterized by comprising the following steps:
S100, training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors by using audio data in a sample library and speaker identity labels corresponding to the audio data;
S200, generating a second neural network model, in which acoustic features control text-to-speech synthesis, by using the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized;
S300, generating a third neural network model for voice conversion by using the primary cloned voice generated by the second neural network and the original voice;
S400, acquiring target speaker audio data, and taking the target speaker audio data as the input of the first neural network model to obtain the acoustic feature vector of the target speaker;
S500, using the target speaker's audio, the text to be synthesized and the target speaker's acoustic feature vector as the input of the second neural network model to generate a primary cloned voice of the target speaker;
S600, using the primary cloned voice of the target speaker as the input of the third neural network model to obtain the final cloned voice of the target speaker.
2. The method of claim 1, wherein the acoustic feature vectors in S100 comprise: fundamental frequency, aperiodic features, and mel-spectrum data.
3. The method of claim 1, wherein in S100, when a speaker has more than one piece of audio data, the speaker's acoustic feature vector is obtained by averaging the acoustic feature vectors corresponding to the audio.
4. The method as claimed in claim 1, wherein in S200, the acoustic feature vectors and the text corresponding to the audio data are used as input, and the audio data are used as tags, and multiple rounds of iterative training are performed.
5. The method of claim 1, wherein in S200, for long text and long audio data, the acoustic features are extracted segment by segment using the concatenation method, and the segmented acoustic features and segmented audio are input using a fixed-length interleaved structure.
6. The method for cloning voices based on the multi-neural network as claimed in claim 1, wherein in S300, the original voice of the target speaker and the cloned voice generated in S200 are input and iterated in a loop to generate a mapping model from the cloned voice to the real voice, wherein the mapping model may be a mapping of audio data or a mapping of spectral feature data.
7. The method of claim 1, wherein the third neural network model for speech conversion in S300 is a GMM speech conversion model or a CycleGAN speech conversion model.
8. The method for cloning voices based on the multi-neural network as claimed in claim 1, wherein when different recordings of the target speaker are input to the first neural network in S400, the target speaker's audio is classified manually or algorithmically, recordings with similar emotion and mood are placed in the same class, and a common acoustic feature vector is used.
9. The method of claim 1, wherein the target speaker audio data in S400 participates in the training of the first neural network model in S100 to obtain a model optimal for the target speaker's audio and acoustics, and the acoustic feature vector corresponding to the target speaker's audio is output through this optimal model.
10. A voice cloning system based on a multi-neural network, comprising: a sample library module, a first neural network module, a second neural network module and a third neural network module, wherein
the sample library module is used for storing sample data for training the first neural network module, the second neural network module and the third neural network module, the sample data comprising at least audio data, the text to be synthesized and the speaker acoustic feature vectors corresponding to the audio data;
the first neural network module is used for training and generating a first neural network model for extracting text-independent speaker acoustic feature vectors from the audio data in the sample library and the speaker identity labels corresponding to the audio data; the first neural network module obtains the target speaker's acoustic feature vector from the target speaker's audio data;
the second neural network module is used for generating a second neural network model, in which acoustic features control text-to-speech synthesis, from the acoustic feature vectors generated by the first neural network, the audio data in the sample library and the text to be synthesized; the second neural network module generates the target speaker's primary cloned voice from the target speaker's audio and text and the target speaker's acoustic feature vector;
the third neural network module generates a third neural network model for voice conversion from the primary cloned voice generated by the second neural network and the original voice; and the third neural network module obtains the target speaker's final cloned voice from the target speaker's primary cloned voice.
CN202010041207.9A 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network Pending CN111223474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010041207.9A CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010041207.9A CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Publications (1)

Publication Number Publication Date
CN111223474A 2020-06-02

Family

ID=70832279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010041207.9A Pending CN111223474A (en) 2020-01-15 2020-01-15 Voice cloning method and system based on multi-neural network

Country Status (1)

Country Link
CN (1) CN111223474A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968617A (en) * 2020-08-25 2020-11-20 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112102808A (en) * 2020-08-25 2020-12-18 上海红阵信息科技有限公司 Method and system for constructing deep neural network for voice forgery
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN112383721A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040230410A1 (en) * 2003-05-13 2004-11-18 Harless William G. Method and system for simulated interactive conversation
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
CN109523989A (en) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment
US20190096385A1 (en) * 2017-09-28 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating speech synthesis model
CN110136693A (en) * 2018-02-09 2019-08-16 百度(美国)有限责任公司 System and method for using a small amount of sample to carry out neural speech clone

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040230410A1 (en) * 2003-05-13 2004-11-18 Harless William G. Method and system for simulated interactive conversation
US20160343366A1 (en) * 2015-05-19 2016-11-24 Google Inc. Speech synthesis model selection
CN105118498A (en) * 2015-09-06 2015-12-02 百度在线网络技术(北京)有限公司 Training method and apparatus of speech synthesis model
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN107481713A (en) * 2017-07-17 2017-12-15 清华大学 A kind of hybrid language phoneme synthesizing method and device
CN107452369A (en) * 2017-09-28 2017-12-08 百度在线网络技术(北京)有限公司 Phonetic synthesis model generating method and device
US20190096385A1 (en) * 2017-09-28 2019-03-28 Baidu Online Network Technology (Beijing) Co., Ltd Method and apparatus for generating speech synthesis model
CN110136693A (en) * 2018-02-09 2019-08-16 百度(美国)有限责任公司 System and method for using a small amount of sample to carry out neural speech clone
CN109523989A (en) * 2019-01-29 2019-03-26 网易有道信息技术(北京)有限公司 Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
MERLIJN BLAAUW: "Data Efficient Voice Cloning for Neural Singing Synthesis", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) *
YU ZHANG: "Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning", arXiv *
张君腾: "Research on speech synthesis methods based on deep neural networks" (基于深度神经网络的语音合成方法研究), China Masters' Theses Full-text Database *
胡亚军: "Research on statistical parametric speech synthesis methods based on neural networks" (基于神经网络的统计参数语音合成方法研究), China Doctoral Dissertations Full-text Database *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968617A (en) * 2020-08-25 2020-11-20 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112102808A (en) * 2020-08-25 2020-12-18 上海红阵信息科技有限公司 Method and system for constructing deep neural network for voice forgery
CN111968617B (en) * 2020-08-25 2024-03-15 云知声智能科技股份有限公司 Voice conversion method and system for non-parallel data
CN112259072A (en) * 2020-09-25 2021-01-22 北京百度网讯科技有限公司 Voice conversion method and device and electronic equipment
CN112233646A (en) * 2020-10-20 2021-01-15 携程计算机技术(上海)有限公司 Voice cloning method, system, device and storage medium based on neural network
CN112233646B (en) * 2020-10-20 2024-05-31 携程计算机技术(上海)有限公司 Voice cloning method, system, equipment and storage medium based on neural network
CN112382308A (en) * 2020-11-02 2021-02-19 天津大学 Zero-order voice conversion system and method based on deep learning and simple acoustic features
CN112383721A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method and apparatus for generating video
CN112383721B (en) * 2020-11-13 2023-04-07 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating video
CN112735434A (en) * 2020-12-09 2021-04-30 中国人民解放军陆军工程大学 Voice communication method and system with voiceprint cloning function

Similar Documents

Publication Publication Date Title
CN111223474A (en) Voice cloning method and system based on multi-neural network
US11410684B1 (en) Text-to-speech (TTS) processing with transfer of vocal characteristics
CN101578659B (en) Voice tone converting device and voice tone converting method
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
JP4176169B2 (en) Runtime acoustic unit selection method and apparatus for language synthesis
US9830904B2 (en) Text-to-speech device, text-to-speech method, and computer program product
CN108899009B (en) Chinese speech synthesis system based on phoneme
JP3588302B2 (en) Method of identifying unit overlap region for concatenated speech synthesis and concatenated speech synthesis method
US11763797B2 (en) Text-to-speech (TTS) processing
Qian et al. A cross-language state sharing and mapping approach to bilingual (Mandarin–English) TTS
Liu et al. High quality voice conversion through phoneme-based linear mapping functions with straight for mandarin
US10008216B2 (en) Method and apparatus for exemplary morphing computer system background
JP4829477B2 (en) Voice quality conversion device, voice quality conversion method, and voice quality conversion program
US9798653B1 (en) Methods, apparatus and data structure for cross-language speech adaptation
US20070294082A1 (en) Voice Recognition Method and System Adapted to the Characteristics of Non-Native Speakers
CN110853616A (en) Speech synthesis method, system and storage medium based on neural network
US6546369B1 (en) Text-based speech synthesis method containing synthetic speech comparisons and updates
CN115101046A (en) Method and device for synthesizing voice of specific speaker
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
JP6330069B2 (en) Multi-stream spectral representation for statistical parametric speech synthesis
Lee et al. A comparative study of spectral transformation techniques for singing voice synthesis.
KR102277205B1 (en) Apparatus for converting audio and method thereof
KR20220070979A (en) Style speech synthesis apparatus and speech synthesis method using style encoding network
Nthite et al. End-to-End Text-To-Speech synthesis for under resourced South African languages
Hinterleitner et al. Speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned (effective date of abandoning: 2023-07-28)