
Voice cloning method, device, equipment and storage medium

Info

Publication number
CN115394285A
Authority
CN
China
Prior art keywords
voice
phoneme
features
duration
clone
Legal status
Pending
Application number
CN202211045094.5A
Other languages
Chinese (zh)
Inventor
吴东海
黄杰雄
轩晓光
张超钢
孙洪文
Current Assignee
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques in which the extracted parameters are the cepstrum
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a voice cloning method, apparatus, device and storage medium, relating to the field of audio processing. The method includes the following steps: acquiring phoneme information of a text to be cloned; performing feature extraction on a reference voice to obtain voice features; and synthesizing a cloned voice of the phoneme information according to the voice features. The voice features comprise recording environment features and tone features, or comprise recording environment features, tone features and prosody duration features. The method can improve the realism of the cloned voice.

Description

Voice cloning method, device, equipment and storage medium
Technical Field
The present application relates to the field of audio processing, and in particular, to a method, an apparatus, a device, and a storage medium for voice cloning.
Background
With the development of deep learning, speech synthesis technology has advanced by leaps and bounds. Vivid, natural synthesized speech is already used in voice interaction systems such as mobile phone voice assistants, smart speakers and in-vehicle computers. At the same time, users' demands on speech synthesis keep growing: they expect synthesized speech not only to rival real pronunciation, but also to offer a variety of timbres, even the timbres of family and friends.
Voice cloning technology emerged to satisfy this demand for diverse synthesized timbres. A voice cloning system is mainly divided into three parts: a front-end system (which converts characters and symbols into phonemes); an acoustic model system (which converts phonemes, the smallest units of speech divided according to its natural properties, into acoustic features); and a vocoder system (which converts acoustic features into audio). The acoustic model takes a reference voice of person A and the phonemes of the text to be cloned as input, synthesizes the acoustic features of a cloned voice that reads the text in person A's voice from features extracted from the reference voice, and is trained using the acoustic features of a real recording of person A reading the text as training labels.
Such a method can only generate cloned speech by rigidly imitating the real voice; the generated speech does not conform to a person's actual vocal characteristics, and its realism is poor.
Disclosure of Invention
The embodiments of the present application provide a voice cloning method, apparatus, device and storage medium that can improve the realism of cloned speech. The technical solution is as follows.
According to an aspect of the present application, there is provided a voice cloning method, the method including:
acquiring phoneme information of a text to be cloned;
performing feature extraction on the reference voice to obtain voice features;
synthesizing clone voice of the phoneme information according to the voice characteristics;
the voice features comprise recording environment features and tone features, or the voice features comprise the recording environment features, the tone features and prosody duration features.
According to another aspect of the present application, there is provided a voice cloning apparatus, the apparatus including:
the phoneme module is used for acquiring phoneme information of the text to be cloned;
the characteristic extraction module is used for extracting the characteristics of the reference voice to obtain voice characteristics;
the synthesis module is used for synthesizing the clone voice of the phoneme information according to the voice characteristics;
the voice features comprise recording environment features and tone features, or the voice features comprise the recording environment features, the tone features and prosody duration features.
According to another aspect of the present application, there is provided a computer device comprising: a processor and a memory, the memory having stored therein at least one instruction, at least one program, set of codes, or set of instructions, which is loaded and executed by the processor to implement a voice cloning method as described above.
According to another aspect of the present application, there is provided a computer readable storage medium having stored therein at least one instruction, at least one program, set of codes, or set of instructions that is loaded and executed by a processor to implement a voice cloning method as described above.
According to another aspect of the embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the voice cloning method provided in the above alternative implementations.
The technical solutions provided in the embodiments of the present application bring at least the following beneficial effects:
By training the acoustic model, acoustic features of the reference speech can be extracted along multiple real dimensions such as timbre, recording environment and prosody duration, and the acoustic model generates the cloned speech from these actual acoustic features. Compared with methods that rigidly imitate real speech, the sound output by the acoustic model better matches the actual vocal characteristics of the clone object, improving the realism of the cloned speech.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of a computer device provided by an exemplary embodiment of the present application;
FIG. 2 is a schematic diagram of a voice cloning model provided by another exemplary embodiment of the present application;
FIG. 3 is a schematic diagram of a voice cloning model provided by another exemplary embodiment of the present application;
FIG. 4 is a flowchart of a voice cloning method provided by another exemplary embodiment of the present application;
FIG. 5 is a flowchart of a voice cloning method provided by another exemplary embodiment of the present application;
FIG. 6 is a flowchart of a voice cloning method provided by another exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of a voice cloning model training method provided by another exemplary embodiment of the present application;
FIG. 8 is a flowchart of a voice cloning method provided by another exemplary embodiment of the present application;
FIG. 9 is a block diagram of a voice cloning apparatus provided by another exemplary embodiment of the present application;
FIG. 10 is a block diagram of a server provided by another exemplary embodiment of the present application;
FIG. 11 is a block diagram of a terminal provided by another exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of a computer device 101 provided in an exemplary embodiment of the present application, where the computer device 101 may be a terminal or a server.
The terminal may include at least one of a digital camera, a smart phone, a notebook computer, a desktop computer, a tablet computer, a smart speaker, and a smart robot. Optionally, the terminal may also be a device with a camera, for example, a face payment device, a monitoring device, an access control device, and the like. In an alternative implementation manner, the voice cloning method provided by the present application may be applied to an application program with a voice cloning function, where the application program may be: audio processing applications, voice cloning applications, video processing applications, audio publishing applications, video publishing applications, social applications, shopping applications, live applications, forum applications, information applications, life applications, office applications, and the like. Optionally, the terminal is provided with a client of the application program.
Illustratively, the terminal has a voice clone model 102 stored thereon, and when the client needs to use the voice clone function, the client can call the voice clone model to complete voice cloning. Illustratively, the voice cloning process can be completed by the terminal or the server.
The terminal and the server are connected with each other through a wired or wireless network.
The terminal includes a first memory and a first processor. The first memory stores a voice cloning model; the voice cloning model is called and executed by the first processor to implement the voice cloning method provided by the present application. The first memory may include, but is not limited to: Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
The first processor may consist of one or more integrated circuit chips. Optionally, the first processor may be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP). Optionally, the first processor implements the voice cloning method provided by the present application by running a program or code.
The server includes a second memory and a second processor. The second memory stores a voice clone model; the voice clone model is called by the second processor to implement the voice clone method provided by the present application. Optionally, the second memory may include, but is not limited to, the following: RAM, ROM, PROM, EPROM, EEPROM. Alternatively, the second processor may be a general purpose processor, such as a CPU or NP.
The computer device 101 stores a voice cloning model 102. When the computer device 101 needs to perform voice cloning, it calls the voice cloning model 102 to perform voice cloning according to the reference voice and the text to be cloned, obtaining the cloned voice.
Optionally, as shown in fig. 2, the voice cloning model 102 includes a feature extraction layer 103 and an acoustic model 104. The computer device inputs the reference speech into the feature extraction layer 103 to extract reference speech features. The computer device obtains the phoneme information of the text to be cloned and inputs the phoneme information together with the reference speech features into the acoustic model 104, which outputs the cloned speech. In this way, a cloned voice that reads the text to be cloned in the voice of the reference speech can be obtained.
In an alternative implementation, as shown in fig. 3, the feature extraction layer 103 includes an environmental feature extraction layer, a tone color feature extraction layer, and a prosodic duration feature extraction layer. The acoustic model 104 includes an encoder, a prosodic duration estimation layer, and a decoder. The computer equipment inputs the reference voice into the environment feature extraction layer to obtain the recording environment feature, inputs the reference voice into the tone feature extraction layer to obtain the tone feature, and inputs the reference voice into the rhythm duration feature extraction layer to obtain the rhythm duration feature. The computer equipment acquires the phoneme information of the text to be cloned, inputs the phoneme information into the encoder to obtain phoneme coding information, and inputs the phoneme coding information and the prosody duration characteristics into the prosody duration estimation layer to obtain the phoneme duration of each phoneme. And inputting the phoneme coding information, the phoneme duration, the recording environment characteristics, the tone characteristics and the prosody duration characteristics into a decoder to obtain the cloned speech.
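For ease of understanding, the following is a minimal sketch of the data flow described above, written in the style of a Python deep-learning framework. All module names and interfaces are illustrative assumptions for exposition, not the exact implementation of the present application.

```python
# Illustrative sketch of the Fig. 3 data flow; module internals are
# assumptions, not the patent's exact implementation.
import torch

def clone_speech(model, reference_wav: torch.Tensor, phonemes: torch.Tensor):
    # Multi-dimensional feature extraction from the reference recording.
    env = model.env_extractor(reference_wav)         # recording environment feature
    timbre = model.timbre_extractor(reference_wav)   # tone (timbre) feature
    prosody = model.prosody_extractor(reference_wav) # prosody duration feature

    # Acoustic model: encode phonemes, estimate per-phoneme durations, decode.
    enc = model.encoder(phonemes)                    # phoneme coding information
    durations = model.duration_estimator(enc, prosody)
    mel = model.decoder(enc, durations, env, timbre, prosody)  # acoustic features

    # A separately trained vocoder converts acoustic features into audio.
    return model.vocoder(mel)
```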
In one application scenario, the voice cloning method provided by the application is used in a voice application to clone text input by a user and play or send the resulting cloned voice. For example, in a social application, a user first records a piece of speech audio and then inputs a piece of text to be cloned; the terminal or server inputs the recorded audio and the text into the voice cloning model to obtain the cloned voice, which the user can send to a chat partner.
In another application scenario, the voice cloning method provided by the application is used in an audio or video editing application to clone text provided by a user. For example, a user submits a voice recording and a text; the terminal or server inputs both into the voice cloning model to obtain the cloned voice, so that the user can quickly obtain a piece of audio reading the target text aloud, convenient for audio or video editing.
Fig. 4 shows a flowchart of a voice cloning method provided by an exemplary embodiment of the present application. The method may be performed by a computer device, e.g. a terminal or a server as shown in fig. 1. The method comprises the following steps.
Step 210, obtaining phoneme information of the text to be cloned.
And the computer equipment converts the text to be cloned into phonemes to obtain phoneme information. The phoneme information comprises at least one phoneme corresponding to the text to be cloned.
A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech; phonemes are analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
The text to be cloned comprises phrases, sentences, paragraphs and chapters which are composed of characters and/or symbols and the like. The language of the text to be cloned is not limited.
The computer equipment receives the text to be cloned input by a user, or the computer equipment acquires the text to be cloned through a specified path. For example, the computer device obtains the text to be cloned from a text library.
Step 220, performing feature extraction on the reference voice to obtain voice features.
The reference voice is the voice of the clone object (the speaker whose voice is to be cloned). Optionally, the reference voice is a recording of the clone object reading a text, or any piece of audio containing the clone object's voice. For example, the reference voice is a recording of person A reading a passage.
The computer device extracts speech features of the reference speech from multiple dimensions, which may characterize the speaking characteristics, reading characteristics, or reference speech recording environment of the cloned object, etc. For example, the computer device may extract speech features of the reference speech from the recording environment, timbre, prosody duration dimensions.
Optionally, the voice features include recording environment features and tone features, or the voice features include recording environment features, tone features and prosody duration features.
Optionally, as shown in fig. 2, the computer device invokes the feature extraction layer to perform feature extraction on the reference speech to obtain a recording environment feature, a tone feature, and a prosody duration feature. Optionally, the computer device calls an environmental feature extraction layer to perform feature extraction on the reference voice to obtain a recording environmental feature; computer equipment calls a tone characteristic extraction layer to perform characteristic extraction on the reference voice to obtain tone characteristics; and calling a rhythm duration feature extraction layer by the computer equipment to perform feature extraction on the reference voice to obtain rhythm duration features.
The recording environment feature characterizes the environment in which the reference voice was recorded, the tone feature characterizes the timbre of the clone object, and the prosody duration feature characterizes the prosody of the reference voice. Illustratively, the prosody duration feature characterizes the pitch, duration, stress and so on of the clone object's speech, that is, the cadence and the rise and fall of the voice when the clone object speaks.
With feature extraction along the three dimensions of recording environment, timbre and prosody duration, the computer device can fully learn the acoustic characteristics of the reference speech and synthesize the cloned speech from the extracted features. For example, compared with a method that extracts only timbre and prosody duration features, the embodiment of the present application additionally extracts recording environment features: if the recording environment of the reference voice is noisy or reverberant, this can be captured, and the synthesized cloned speech will exhibit a similar recording environment effect. If recording environment features were not extracted separately, that information would leak into the timbre features, making them inaccurate, so that the timbre of the synthesized cloned speech sounds false and the recording environment effect is poor.
And step 230, synthesizing clone voice of the phoneme information according to the voice characteristics.
The computer device synthesizes a clone speech of phoneme information based on the recording environment feature and the tone feature extracted from the reference speech.
Or, the computer device synthesizes a clone speech of phoneme information based on the recording environment feature, the tone feature and the prosody duration feature extracted from the reference speech.
Alternatively, the computer device synthesizes acoustic features of the phoneme information based on the recording environment features, the timbre features and the prosody duration features extracted from the reference speech, and converts the acoustic features into a clone speech (clone audio) through a vocoder.
Optionally, the computer device inputs the recording environment feature, the tone feature, the prosody duration feature and the phoneme information into the acoustic model to obtain an acoustic feature, and inputs the acoustic feature into the vocoder to obtain a clone audio.
Illustratively, the acoustic features may be a Mel-frequency spectrogram (Mel spectrogram) or Mel-Frequency Cepstral Coefficients (MFCC) of the audio.
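For example, such acoustic features can be computed with common audio tooling. The following sketch (assuming the torchaudio library; all parameter values are illustrative) computes a Mel spectrogram and MFCCs:

```python
# Minimal example of the two acoustic features named above; parameter
# values (sample rate, FFT size, mel bins) are illustrative assumptions.
import torch
import torchaudio

wav = torch.randn(1, 16000)  # stand-in for one second of 16 kHz audio

mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80)(wav)
mfcc = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=13)(wav)

print(mel.shape)   # (1, 80, frames) - Mel spectrogram
print(mfcc.shape)  # (1, 13, frames) - MFCCs
```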
The cloned speech is audio that renders the text to be cloned in the voice of the clone object, that is, audio that utters the text to be cloned while imitating the clone object's voice (the human voice in the reference speech).
In summary, in the method provided by this embodiment, a trained acoustic model extracts acoustic features of the reference speech along the timbre and recording environment dimensions, or along the timbre, recording environment and prosody duration dimensions, and generates the cloned speech from these features.
Illustratively, a voice cloning model is stored in the computer device, and the computer device executes the voice cloning method provided by the embodiments of the application by calling this model.
Optionally, as shown in fig. 2, the voice cloning model includes a feature extraction layer and an acoustic model.
Optionally, as shown in fig. 3, the feature extraction layer includes an environmental feature extraction layer, a tone feature extraction layer, and a prosody duration feature extraction layer.
Optionally, the acoustic model comprises an encoder and a decoder. Alternatively, as shown in fig. 3, the acoustic model includes an encoder, a decoder, and a prosody duration estimation layer.
Illustratively, a method of voice cloning is presented that is implemented based on the voice cloning model shown in FIG. 3.
Fig. 5 shows a flowchart of a voice cloning method provided by an exemplary embodiment of the present application. The method may be performed by a computer device, e.g. a terminal or server as shown in fig. 1. The method comprises the following steps.
Step 210, obtaining phoneme information of the text to be cloned.
For example, the computer device acquires the input text to be cloned in response to an editing operation by the user.
Or the computer equipment acquires the text audio, performs voice recognition on the text audio to obtain a text to be cloned, or performs phoneme extraction on the text audio to obtain phoneme information. For example, the computer device records a section of speech spoken by the user through a microphone, performs speech recognition on the speech spoken by the user to obtain a text to be cloned, and converts the text to be cloned into phonemes to obtain phoneme information.
Step 221, inputting the reference voice into the environmental feature extraction layer to obtain the recording environmental feature.
Optionally, the environmental feature extraction layer includes x convolution layers, where x is a positive integer smaller than the first threshold. For example, the environmental feature extraction layer includes three convolutional layers, each including an arbitrary number of convolutional kernels. Of course, the structure of the environmental feature extraction layer is not limited to this implementation.
Illustratively, the environment feature extraction layer is a simple convolutional structure that extracts shallow features of the reference speech, that is, recording environment characteristics that are present throughout the reference speech from beginning to end.
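A minimal sketch of such a simple convolutional environment feature extraction layer follows; the layer sizes and the time-averaging at the end are our assumptions, chosen so that the output summarizes conditions present throughout the recording.

```python
# Hedged sketch of the environment feature extraction layer: three Conv1d
# layers over the mel spectrogram, averaged over time. Sizes are assumed.
import torch
import torch.nn as nn

class EnvFeatureExtractor(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=3, padding=1),
        )

    def forward(self, mel):           # mel: (batch, n_mels, frames)
        h = self.convs(mel)           # shallow, frame-level features
        return h.mean(dim=2)          # (batch, dim) global environment vector
```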
Step 222, inputting the reference voice into the tone feature extraction layer to obtain the tone features.
Optionally, the tone feature extraction layer includes convolutional layers and a recurrent neural network. For example, the tone feature extraction layer includes three convolutional layers and one recurrent neural network. Of course, the structure of the tone feature extraction layer is not limited to this implementation.
The tone feature extraction layer extracts deep features of the reference voice through this more complex network structure. During the training stage it is trained to output the same features for voices of the same clone object, so that it can accurately extract the clone object's timbre.
Step 223, inputting the reference voice into the prosody duration feature extraction layer to obtain prosody duration features.
Optionally, the prosody duration feature extraction layer includes convolutional layers and a recurrent neural network. For example, it includes three convolutional layers and one recurrent neural network. Of course, its structure is not limited to this implementation.
The prosody duration feature extraction layer extracts deep features of the reference voice, namely the prosodic characteristics of the phonemes as the clone object speaks, to represent the clone object's speaking style.
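Since both the tone feature extraction layer and the prosody duration feature extraction layer are described as convolutional layers followed by a recurrent neural network, a single hedged sketch can cover both; all sizes here are illustrative assumptions.

```python
# Hedged sketch of a conv + recurrent extractor; instantiated once for
# timbre and once for prosody duration. Exact sizes are assumptions.
import torch
import torch.nn as nn

class ConvGRUExtractor(nn.Module):
    """Deep utterance-level feature: conv stack, then a GRU summary."""
    def __init__(self, n_mels=80, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
        )
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, mel):                  # (batch, n_mels, frames)
        h = self.convs(mel).transpose(1, 2)  # (batch, frames, dim)
        _, last = self.rnn(h)                # final hidden state summarizes the clip
        return last[-1]                      # (batch, dim)

timbre_extractor = ConvGRUExtractor()        # tone (timbre) features
prosody_extractor = ConvGRUExtractor()       # prosody duration features
```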
Optionally, the method may include step 221 and step 222; alternatively, the method may include step 221, step 222, and step 223.
Step 231, inputting the phoneme information into the encoder to obtain the phoneme encoding information.
Optionally, the computer device inputs the phoneme information into an encoder, and the encoder performs phoneme encoding processing on the phoneme information to obtain the phoneme encoding information.
The encoder may be implemented with a Convolutional Neural Network (CNN), a Transformer, or a convolution-augmented Transformer (Conformer), among others. For example, the encoder may be implemented as a four-layer Transformer, although the structure of the encoder is not limited to this implementation.
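A minimal sketch of such a phoneme encoder under the four-layer Transformer reading follows; the vocabulary size and model width are assumptions.

```python
# Hedged sketch of the phoneme encoder: embedding + 4 Transformer layers.
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes=100, dim=256, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phoneme_ids):          # (batch, seq_len) integer ids
        return self.encoder(self.embed(phoneme_ids))  # phoneme coding information
```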
Step 232, inputting the phoneme coding information, the recording environment characteristics, the tone characteristics and the prosody duration characteristics into a decoder to obtain the cloned speech.
The computer device inputs the phoneme coding information and the speech features into a decoder to obtain the cloned speech.
Optionally, the computer device concatenates or superimposes the phoneme coding information, the recording environment feature, and the tone feature, and then inputs the concatenated or superimposed phoneme coding information, recording environment feature, and tone feature into a decoder, and performs decoding processing to obtain the cloned speech.
Or, the computer device concatenates or superimposes the phoneme coding information, the recording environment feature, the tone feature and the prosody duration feature, and inputs the result into the decoder for decoding to obtain the cloned speech.
That is, the computer device concatenates or superimposes the phoneme coding information, the recording environment feature, the tone feature and the prosody duration feature, inputs the result into the decoder to obtain the acoustic features, and inputs the acoustic features into the vocoder to output the cloned speech.
The decoder may likewise be implemented with a Convolutional Neural Network (CNN), a Transformer, or a convolution-augmented Transformer (Conformer). For example, the decoder may be implemented as a six-layer Transformer, although the structure of the decoder is not limited to this implementation.
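A minimal sketch of such a decoder interface follows: the phoneme coding information is concatenated with the broadcast speech features and mapped to Mel frames by a six-layer Transformer stack, as in the example above. The projection layers, and the assumption that all three speech features share the model width, are ours.

```python
# Hedged sketch of the decoder: condition every phoneme position on the
# environment, timbre and prosody vectors, then predict mel frames.
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    def __init__(self, dim=256, n_mels=80, n_layers=6, n_heads=4):
        super().__init__()
        self.proj_in = nn.Linear(4 * dim, dim)  # enc + env + timbre + prosody
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads,
                                           batch_first=True)
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(dim, n_mels)

    def forward(self, enc, env, timbre, prosody):  # enc: (batch, seq, dim)
        n = enc.size(1)
        cond = torch.cat([env, timbre, prosody], dim=-1)   # (batch, 3*dim)
        cond = cond.unsqueeze(1).expand(-1, n, -1)         # broadcast per step
        x = self.proj_in(torch.cat([enc, cond], dim=-1))
        return self.proj_out(self.layers(x))               # (batch, seq, n_mels)
```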
Optionally, as shown in fig. 6, step 231 may further include step 233, and step 232 may be replaced with step 234.
Step 233, inputting the phoneme coding information and the prosody duration feature information into the prosody duration estimation layer to obtain the first phoneme duration of each phoneme in the phoneme information.
The first phoneme duration refers to the phoneme durations output according to the prosody duration features extracted from the reference speech.
The computer device concatenates or superimposes the phoneme coding information and the prosody duration feature information, inputs the result into the prosody duration estimation layer for prosody duration estimation, and outputs the duration of each phoneme (the phoneme duration).
The computer device outputs a phoneme duration for each phoneme in the phoneme coding information according to the prosodic duration characteristics of the reference speech. For example, if ten phonemes are included in the phoneme coding information, the prosody duration estimation layer outputs ten phoneme durations, which correspond one-to-one to the ten phonemes. The phoneme duration is the duration of each phoneme, e.g., the duration of the first phoneme is 1 second, or the duration of the second phoneme is 3 frames.
Optionally, the prosody duration estimation layer is composed of convolutional layers. For example, the prosody duration estimation layer is implemented as a three-layer convolutional neural network, although its structure is not limited to this implementation.
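A minimal sketch of such a convolutional prosody duration estimation layer follows; it conditions each phoneme encoding on the prosody duration feature and predicts one duration per phoneme. Layer sizes are assumptions.

```python
# Hedged sketch of the prosody duration estimation layer: three conv
# layers mapping conditioned phoneme encodings to one duration each.
import torch
import torch.nn as nn

class DurationEstimator(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(2 * dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, 1, 1),
        )

    def forward(self, enc, prosody):
        # enc: (batch, seq, dim); prosody: (batch, dim), broadcast per phoneme
        p = prosody.unsqueeze(1).expand(-1, enc.size(1), -1)
        x = torch.cat([enc, p], dim=-1).transpose(1, 2)   # (batch, 2*dim, seq)
        return self.convs(x).squeeze(1)                   # (batch, seq) durations
```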
Step 234, inputting the phoneme coding information, the duration of the first phoneme, the recording environment feature, the timbre feature and the prosody duration feature into a decoder to obtain the cloned speech.
The computer device concatenates or superimposes the phoneme coding information, the first phoneme duration of each phoneme, the recording environment feature, the tone feature and the prosody duration feature, inputs the result into the decoder, and performs decoding to obtain the cloned speech.
That is, the computer device concatenates or superimposes the phoneme coding information, the first phoneme duration of each phoneme, the recording environment feature, the tone feature and the prosody duration feature, inputs the result into the decoder to obtain the acoustic features, and inputs the acoustic features into the vocoder to output the cloned speech.
Wherein the first phoneme duration comprises a phoneme duration of each phoneme in the text to be cloned. For example, if the text to be cloned comprises ten phonemes, the first phoneme duration comprises 10 phoneme durations.
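The embodiment feeds the phoneme durations to the decoder; one common way to use them (FastSpeech-style length regulation, an assumption here rather than the stated design) is to repeat each phoneme encoding by its duration in frames:

```python
# Hedged sketch: expand phoneme encodings to frame rate by duration.
import torch

def length_regulate(enc, durations):
    """Repeat each phoneme encoding `durations[i]` times along time.

    enc: (seq, dim) phoneme coding information for one utterance
    durations: (seq,) integer phoneme durations in frames
    """
    return torch.repeat_interleave(enc, durations, dim=0)  # (total_frames, dim)

enc = torch.randn(10, 256)                       # ten phonemes, as in the example
durations = torch.tensor([3, 5, 4, 2, 6, 3, 3, 4, 5, 2])
frames = length_regulate(enc, durations)
print(frames.shape)                              # torch.Size([37, 256])
```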
In summary, the method provided by this embodiment adopts a multi-dimensional voice feature extraction layer that efficiently extracts the speaker's recording environment features, timbre features and prosody duration features. This information helps the acoustic model accurately restore the acoustic characteristics of the voice, improving the voice similarity of the clone, speeding up fine-tuning of the acoustic model, and increasing the robustness of the whole system. The method improves the synthesis quality while lowering the requirements on the recording environment and reducing model training cost.
The method of this embodiment provides a voice cloning acoustic model based on multi-dimensional sound feature extraction, which can accurately restore the speaker's timbre and reading rhythm as well as the background noise and reverberation of the recording environment. Thanks to the multi-dimensional sound feature extraction module, the reading rhythm and background noise of the synthesized audio can also be conveniently adjusted.
Optionally, a discriminator is used to adversarially train the voice cloning model; that is, the feature extraction layer and the acoustic model are trained adversarially against a discriminator.
FIG. 7 illustrates a voice cloning model training method provided by an exemplary embodiment of the present application. The method may be performed by a computer device, e.g. a terminal or a server as shown in fig. 1. The method comprises the following steps:
step one, calling a feature extraction layer 103 to perform feature extraction on the sample reference voice to obtain sample recording environment features, sample tone features and sample prosody duration features, wherein the sample reference voice is the recording of a second clone object.
And step two, acquiring sample phoneme information of the sample text to be cloned.
And step three, calling the acoustic model 104 to perform acoustic feature generation on the sample phoneme information and the sample reference speech features to obtain sample clone speech.
Step four, inputting the sample clone voice into the discriminator 105 to obtain a first discrimination result.
The discriminator is used to judge whether its input is model-generated (a false label) or real (a true label). It outputs a false label (e.g., 0) when it judges the input voice to be a model-generated clone, and a true label (e.g., 1) when it judges the input voice to be a real recording.
Optionally, the acoustic model outputs an acoustic feature of the sample clone speech, and the acoustic feature of the sample clone speech is input to the discriminator 105 to obtain a first discrimination result.
In the adversarial training, the discriminator is trained with the goal of outputting the first discrimination result as a false label, while the voice cloning model is trained with the goal of making the first discrimination result a true label.
Optionally, the discriminator is composed of convolutional layers; for example, the discriminator is implemented using a five-layer two-dimensional convolutional neural network, although its structure is not limited to this implementation.
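A minimal sketch of such a five-layer two-dimensional convolutional discriminator follows; it scores a Mel spectrogram, treated as a one-channel image, as real or generated. Channel counts are assumptions.

```python
# Hedged sketch of the discriminator: five strided Conv2d layers over the
# mel spectrogram, reduced to one real/fake probability per example.
import torch
import torch.nn as nn

class MelDiscriminator(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        layers, c_in = [], 1
        for _ in range(5):                       # five 2-D conv layers
            layers += [nn.Conv2d(c_in, ch, 3, stride=2, padding=1),
                       nn.LeakyReLU(0.2)]
            c_in = ch
        self.convs = nn.Sequential(*layers)
        self.head = nn.Conv2d(ch, 1, 1)

    def forward(self, mel):                      # (batch, n_mels, frames)
        h = self.convs(mel.unsqueeze(1))         # add a channel dimension
        return torch.sigmoid(self.head(h)).mean(dim=(1, 2, 3))  # (batch,)
```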
And step five, inputting the sample real voice into the discriminator 105 to obtain a second discrimination result, where the sample real voice is speech in which the second clone object reads the sample text to be cloned.
The sample real voice is the voice recorded by the second clone object for reading the sample text to be cloned.
Optionally, the acoustic feature of the sample real voice is input into the discriminator to obtain a second discrimination result.
In the adversarial training, the discriminator is trained with the goal of outputting the second discrimination result as a true label.
And step six, calculating a first loss between the sample clone voice and the sample real voice.
A loss function is called to calculate the loss between the sample clone voice and the sample real voice.
And step seven, calculating a second loss between the first discrimination result and the false label.
A loss function is called to calculate the loss between the first discrimination result and the false label.
And step eight, calculating a third loss between the second discrimination result and the true label.
A loss function is called to calculate the loss between the second discrimination result and the true label.
And step nine, adversarially training the feature extraction layer, the acoustic model and the discriminator according to the first loss, the second loss and the third loss.
Illustratively, the model parameters of the feature extraction layer and the acoustic model are fixed, and the model parameters of the discriminator are trained by back propagation according to the second loss and the third loss, so that the discriminator can accurately distinguish model-cloned speech from real recorded speech.
Illustratively, the model parameters of the discriminator are then fixed, a fourth loss between the first discrimination result and the true label is calculated, and the voice cloning model (including the feature extraction layer and the acoustic model) is trained by back propagation according to the first loss, the third loss and the fourth loss, so that the cloned speech output by the voice cloning model approaches real speech and the discriminator judges the output cloned speech to be real (a true label).
Illustratively, the computer device repeats the above nine steps until the voice cloning model converges. Experiments show that about five hundred repetitions are enough for good convergence; once converged, the trained voice cloning model can be used to perform the voice cloning method provided by the embodiments of the present application.
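A hedged sketch of one adversarial training iteration covering steps one to nine follows; the choice of L1 reconstruction and binary cross-entropy losses, and the optimizer handling, are assumptions.

```python
# Hedged sketch of one adversarial iteration: the discriminator learns to
# tell cloned mels from real ones, then the clone model learns to fool it.
import torch
import torch.nn.functional as F

def train_step(model, disc, opt_g, opt_d, ref_wav, phonemes, real_mel):
    fake_mel = model(ref_wav, phonemes)              # sample clone speech (mel)

    # Discriminator update (steps four, five, seven, eight): fake -> 0, real -> 1.
    d_fake = disc(fake_mel.detach())
    d_real = disc(real_mel)
    loss_d = (F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))   # 2nd loss
              + F.binary_cross_entropy(d_real, torch.ones_like(d_real))) # 3rd loss
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Clone-model update (steps six and nine): match the real recording and
    # push the discriminator toward a true label; only opt_g steps here.
    d_fake = disc(fake_mel)
    loss_g = (F.l1_loss(fake_mel, real_mel)                              # 1st loss
              + F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))) # 4th loss
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()
```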
Optionally, when the feature extraction layer includes a tone feature extraction layer, the tone feature extraction layer is trained so that the similarity between the tone features it outputs for reference speech of the same clone object is higher than a threshold.
The threshold may be any value, for example 90%. Illustratively, the goal of training the tone feature extraction layer is to make it output the same tone features for reference speech of the same clone object, although the features actually output may not be exactly identical.
While the voice cloning model is trained with the adversarial method above, the tone feature extraction layer is additionally trained as follows.
The tone feature extraction layer is called to perform feature extraction on a first sample reference voice, corresponding to a third clone object, to obtain a first sample tone feature; the tone feature extraction layer is called to perform feature extraction on a second sample reference voice, also corresponding to the third clone object, to obtain a second sample tone feature; a fourth loss between the first sample tone feature and the second sample tone feature is calculated; and the tone feature extraction layer is trained according to this loss.
In other words, different sample reference voices of the same clone object are input into the tone feature extraction layer to obtain at least two sample tone features, and the layer is trained on the loss between them so that it outputs the same tone feature for voices of the same clone object.
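A minimal sketch of this timbre-consistency objective follows, under the assumption that a simple mean-squared-error loss is used between the two sample tone features.

```python
# Hedged sketch: two recordings of the same speaker should map to
# (nearly) the same tone feature.
import torch
import torch.nn.functional as F

def timbre_consistency_loss(timbre_extractor, mel_a, mel_b):
    """mel_a, mel_b: two sample reference voices of the same clone object."""
    t1 = timbre_extractor(mel_a)   # first sample tone feature
    t2 = timbre_extractor(mel_b)   # second sample tone feature
    return F.mse_loss(t1, t2)      # minimized alongside the adversarial losses
```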
In summary, the method provided by this embodiment trains the voice cloning model adversarially, so that it learns the characteristics of real speech until the discriminator cannot tell cloned speech from real speech. This effectively alleviates the over-smoothing problem of acoustic features and improves the synthesized sound quality.
Optionally, when the reading rhythm of the reference voice's speaker (the clone object) is poor, the cloned voice can be generated using the prosody duration features of another speaker, improving the reading quality of the cloned speech.
Fig. 8 shows a flowchart of a voice cloning method provided by an exemplary embodiment of the present application. The method may be performed by a computer device, e.g. a terminal or a server as shown in fig. 1. The method comprises the following steps.
Step 210, obtaining phoneme information of the text to be cloned.
Step 220, extracting the features of the reference voice to obtain the recording environment features, the tone features and the prosody duration features.
Optionally, the reference voice is a voice recorded by the first cloned object.
Step 235, synthesizing the clone voice of the phoneme information according to the recording environment feature, the tone feature, and the prosody duration feature of the second clone object.
Optionally, the feature extraction is performed on the reference speech of the second cloned object to obtain the prosody duration feature of the second cloned object, for example, the reference speech of the second cloned object is input into the prosody duration feature extraction layer to obtain the prosody duration feature of the second cloned object.
Then, the prosody duration feature of the first clone object in step 220 is replaced with the prosody duration feature of the second clone object; that is, the cloned speech of the phoneme information is synthesized using the tone feature and recording environment feature of the first clone object together with the prosody duration feature of the second clone object. The cloned speech generated in this way has the timbre of the first clone object and the prosody of the second clone object.
Optionally, inputting the phoneme information into an encoder to obtain phoneme encoding information; inputting the phoneme coding information and the prosody duration characteristics of the second clone object into a prosody duration estimation layer to obtain a second phoneme duration of each phoneme in the phoneme information; and inputting the phoneme coding information, the second phoneme duration, the recording environment characteristic, the tone characteristic and the prosody duration characteristic of the second clone object into a decoder to obtain clone voice.
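A minimal sketch of this prosody-swapping inference follows, reusing the illustrative module names from the earlier sketches.

```python
# Hedged sketch of the Fig. 8 flow: keep the first clone object's
# environment and timbre, borrow the second clone object's prosody.
import torch

def clone_with_borrowed_prosody(model, ref_wav_a, ref_wav_b, phonemes):
    env = model.env_extractor(ref_wav_a)          # first clone object
    timbre = model.timbre_extractor(ref_wav_a)    # first clone object
    prosody = model.prosody_extractor(ref_wav_b)  # second clone object (donor)

    enc = model.encoder(phonemes)                 # phoneme coding information
    durations = model.duration_estimator(enc, prosody)  # second phoneme duration
    return model.decoder(enc, durations, env, timbre, prosody)
```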
In summary, in the method provided by this embodiment, because the feature extraction layer extracts speech features along each acoustic dimension separately, when a speaker performs poorly in some dimension, the speaker's features in that dimension can be replaced with another speaker's, improving the quality of the final cloned speech. For example, when a speaker's reading is halting, the prosody duration features of a high-quality speaker are substituted so that the generated cloned speech is fluent. Likewise, using an ordinary recording user's timbre features combined with the reading-rhythm (prosody duration) features of a professional anchor, speech can be synthesized that has a timbre similar to the user's and a natural reading rhythm.
The following are embodiments of the apparatus of the present application, and for details that are not described in detail in the embodiments of the apparatus, reference may be made to corresponding descriptions in the above method embodiments, and details are not described herein again.
Fig. 9 shows a schematic structural diagram of a voice cloning apparatus according to an exemplary embodiment of the present application. The apparatus may be implemented as all or part of a computer device by software, hardware or a combination of both, and includes:
a phoneme module 401, configured to obtain phoneme information of a text to be cloned;
a feature extraction module 402, configured to perform feature extraction on the reference speech to obtain a speech feature;
a synthesizing module 403, configured to synthesize a clone voice of the phoneme information according to the voice features;
the voice features comprise recording environment features and tone features, or the voice features comprise the recording environment features, the tone features and prosody duration features.
In an optional embodiment, the feature extraction module 402 is configured to input the reference voice into an environmental feature extraction layer to obtain a recording environmental feature; inputting the reference voice into a tone characteristic extraction layer to obtain tone characteristics;
or inputting the reference voice into the environmental feature extraction layer to obtain the recording environmental feature; inputting the reference voice into the tone characteristic extraction layer to obtain the tone characteristics; and inputting the reference voice into a rhythm duration feature extraction layer to obtain rhythm duration features.
In an optional embodiment, the environmental feature extraction layer includes x convolution layers, where x is a positive integer smaller than a first threshold;
the tone feature extraction layer comprises convolutional layers and a recurrent neural network;
the prosody duration feature extraction layer comprises convolutional layers and a recurrent neural network.
In an optional embodiment, the synthesizing module 403 is configured to input the phoneme information into an encoder to obtain phoneme encoding information;
the synthesizing module 403 is configured to input the phoneme coding information and the speech features into a decoder to obtain the cloned speech.
In an alternative embodiment, the speech feature comprises the prosodic duration feature;
the synthesizing module 403 is configured to input the phoneme coding information and the prosody duration feature information into a prosody duration estimation layer to obtain a first phoneme duration of each phoneme in the phoneme information;
the synthesizing module 403 is configured to input the phoneme coding information, the duration of the first phoneme, and the speech feature into a decoder to obtain the cloned speech.
In an optional embodiment, the feature extraction module 402 is configured to invoke a feature extraction layer to perform feature extraction on the reference speech to obtain the speech feature;
the synthesis module 403 is configured to invoke an acoustic model to synthesize the clone speech of the phoneme information according to the speech features;
the device further comprises:
a training module 404 for training the feature extraction layer and the acoustic model using a discriminator countermeasure.
In an alternative embodiment, the feature extraction layer comprises a tone feature extraction layer;
the training module 404 is configured to train that the similarity of the tone features output by the tone feature extraction layer for the reference speech of the same clone object is higher than a threshold.
In an alternative embodiment, the reference speech is the speech of the first cloned object;
the synthesizing module 403 is configured to synthesize the clone voice of the phoneme information according to the recording environment feature, the timbre feature, and a prosody duration feature of the second clone object.
In an optional embodiment, the synthesizing module 403 is configured to input the phoneme information into an encoder to obtain phoneme encoding information;
the synthesizing module 403 is configured to input the phoneme coding information and the prosody duration characteristics of the second clone object into a prosody duration estimating layer to obtain a second phoneme duration of each phoneme in the phoneme information;
the synthesizing module 403 is configured to input the phoneme coding information, the second phoneme duration, the recording environment feature, the timbre feature, and the prosody duration feature of the second clone object into a decoder to obtain the clone speech.
Fig. 10 is a schematic structural diagram of a server according to an embodiment of the present application. Specifically: the server 800 includes a Central Processing Unit (CPU) 801, a system memory 804 including a Random Access Memory (RAM) 802 and a Read-Only Memory (ROM) 803, and a system bus 805 connecting the system memory 804 and the CPU 801. The server 800 also includes a basic input/output system (I/O system) 806, which facilitates the transfer of information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or keyboard, for the user to input information. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or another type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or a Compact Disc Read-Only Memory (CD-ROM) drive.
Without loss of generality, computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other solid-state memory technology, CD-ROM, Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 804 and the mass storage device 807 described above may be collectively referred to as memory.
According to various embodiments of the application, the server 800 may also run as a remote computer connected through a network such as the Internet. That is, the server 800 may connect to the network 812 through a network interface unit 811 coupled to the system bus 805; the network interface unit 811 may also be used to connect to other types of networks or remote computer systems (not shown).
The application also provides a terminal, which comprises a processor and a memory, wherein at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the voice cloning method provided by the above method embodiments. It should be noted that the terminal may be a terminal as provided in fig. 11 below.
Fig. 11 shows a block diagram of a terminal 900 according to an exemplary embodiment of the present application. The terminal 900 may be: a smartphone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer or a desktop computer. The terminal 900 may also be called user equipment, a portable terminal, a laptop terminal, a desktop terminal, or other names.
In general, terminal 900 includes: a processor 901 and a memory 902.
Processor 901 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 901 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 901 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, the processor 901 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In some embodiments, a non-transitory computer-readable storage medium in the memory 902 is used to store at least one instruction, which is executed by the processor 901 to implement the voice cloning method provided by the method embodiments of the present application.
In some embodiments, terminal 900 can also optionally include: a peripheral interface 903 and at least one peripheral. The processor 901, memory 902, and peripheral interface 903 may be connected by buses or signal lines. Each peripheral may be connected to the peripheral interface 903 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, a positioning assembly 908, and a power supply 909.
The peripheral interface 903 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 901 and the memory 902. In some embodiments, the processor 901, memory 902, and peripheral interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 901, the memory 902 and the peripheral interface 903 may be implemented on a separate chip or circuit board, which is not limited by this embodiment.
The radio frequency circuit 904 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 904 communicates with communication networks and other communication devices via electromagnetic signals, converting an electrical signal into an electromagnetic signal for transmission, or converting a received electromagnetic signal into an electrical signal. Illustratively, the radio frequency circuit 904 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module (SIM) card, and so forth. The radio frequency circuit 904 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, the generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 904 may also include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 905 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 905 is a touch display screen, it can also capture touch signals on or over its surface. The touch signal may be input to the processor 901 as a control signal for processing. In this case, the display screen 905 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display screen 905, disposed on the front panel of the terminal 900; in other embodiments, there may be at least two display screens 905, each disposed on a different surface of the terminal 900 or in a foldable design; in still other embodiments, the display screen 905 may be a flexible display disposed on a curved surface or a folded surface of the terminal 900. The display screen 905 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly-shaped screen. The display screen 905 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 906 is used to capture images or video. Illustratively, the camera assembly 906 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being any one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions. In some embodiments, the camera assembly 906 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash combines a warm-light flash with a cold-light flash and can be used for light compensation at different color temperatures.
The audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 901 for processing, or to the radio frequency circuit 904 for voice communication. For stereo acquisition or noise reduction purposes, there may be multiple microphones, disposed at different locations of the terminal 900. The microphone may also be an array microphone or an omnidirectional pickup microphone. The speaker is used to convert electrical signals from the processor 901 or the radio frequency circuit 904 into sound waves. The speaker may be a traditional membrane speaker or a piezoelectric ceramic speaker. A piezoelectric ceramic speaker can not only convert an electrical signal into sound waves audible to humans, but also convert an electrical signal into sound waves inaudible to humans for purposes such as distance measurement. In some embodiments, the audio circuit 907 may also include a headphone jack.
The positioning component 908 is used to locate the current geographic location of the terminal 900 for navigation or LBS (Location Based Service). The positioning component 908 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
The power supply 909 is used to supply power to the various components in the terminal 900. The power supply 909 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 909 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery, charged through a wired line, or a wireless rechargeable battery, charged through a wireless coil. The rechargeable battery may also support fast-charge technology.
In some embodiments, terminal 900 can also include one or more sensors 910. The one or more sensors 910 include, but are not limited to: an acceleration sensor 911, a gyro sensor 912, a pressure sensor 913, a fingerprint sensor 914, an optical sensor 915, and a proximity sensor 916.
The acceleration sensor 911 can detect the magnitude of acceleration along the three coordinate axes of the coordinate system established with the terminal 900. For example, the acceleration sensor 911 may be used to detect the components of gravitational acceleration along the three coordinate axes. The processor 901 can control the display screen 905 to display the user interface in a landscape or portrait view according to the gravitational acceleration signal collected by the acceleration sensor 911. The acceleration sensor 911 may also be used to collect motion data for a game or for the user.
The gyroscope sensor 912 can detect the body orientation and rotation angle of the terminal 900, and can cooperate with the acceleration sensor 911 to capture the user's 3D motion on the terminal 900. From the data collected by the gyroscope sensor 912, the processor 901 can implement functions such as motion sensing (for example, changing the UI according to the user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
The pressure sensor 913 may be disposed on a side frame of the terminal 900 and/or beneath the display screen 905. When disposed on the side frame, it can detect the user's grip on the terminal 900, and the processor 901 can perform left/right-hand recognition or shortcut operations based on the grip signal collected by the pressure sensor 913. When disposed beneath the display screen 905, the processor 901 controls an operable control on the UI according to the user's pressure operation on the display screen 905. The operable control includes at least one of a button control, a scroll-bar control, an icon control, and a menu control.
The fingerprint sensor 914 is used to collect the user's fingerprint, and either the processor 901 identifies the user according to the fingerprint collected by the fingerprint sensor 914, or the fingerprint sensor 914 identifies the user according to the collected fingerprint. When the user is identified as trusted, the processor 901 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 914 may be disposed on the front, back, or side of the terminal 900. When a physical key or vendor logo is provided on the terminal 900, the fingerprint sensor 914 may be integrated with the physical key or vendor logo.
The optical sensor 915 is used to collect ambient light intensity. In one embodiment, the processor 901 may control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 915. Specifically, when the ambient light intensity is high, the display brightness of the display screen 905 is increased; when the ambient light intensity is low, the display brightness of the display screen 905 is reduced. In another embodiment, the processor 901 may also dynamically adjust the shooting parameters of the camera assembly 906 according to the ambient light intensity collected by the optical sensor 915.
The proximity sensor 916, also known as a distance sensor, is typically disposed on the front panel of the terminal 900 and is used to measure the distance between the user and the front face of the terminal 900. In one embodiment, when the proximity sensor 916 detects that this distance is gradually decreasing, the processor 901 controls the display screen 905 to switch from the screen-on state to the screen-off state; when the proximity sensor 916 detects that the distance is gradually increasing, the processor 901 controls the display screen 905 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not limit the terminal 900, which may include more or fewer components than shown, combine some components, or use a different arrangement of components.
The memory further includes one or more programs, which are stored in the memory and include instructions for performing the voice cloning method provided by the embodiments of the application.
The present application also provides a computer device, comprising a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by the processor to implement the voice cloning method provided by the above-described method embodiments.
The present application further provides a computer-readable storage medium, in which at least one instruction, at least one program, a set of codes, or a set of instructions is stored, and the at least one instruction, the at least one program, the set of codes, or the set of instructions is loaded and executed by a processor to implement the voice cloning method provided by the above-mentioned method embodiments.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the voice cloning method provided in the above-mentioned alternative implementation.
It should be understood that reference herein to "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The present application is intended to cover various modifications, equivalent arrangements, improvements, etc. without departing from the spirit and scope of the present application.

Claims (12)

1. A method of voice cloning, the method comprising:
acquiring phoneme information of a text to be cloned;
performing feature extraction on a reference voice to obtain voice features;
synthesizing a cloned voice of the phoneme information according to the voice features;
wherein the voice features comprise recording environment features and timbre features, or the voice features comprise the recording environment features, the timbre features, and prosodic duration features.
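Read end to end, claim 1 describes a three-step pipeline. The following is a minimal sketch of that flow in PyTorch-style Python; the module names and interfaces (`frontend`, `feature_extractor`, `acoustic_model`, `vocoder`) are illustrative assumptions, not the patent's actual implementation.

```python
import torch

def clone_voice(text: str, reference_wav: torch.Tensor,
                frontend, feature_extractor, acoustic_model, vocoder):
    """Hypothetical end-to-end flow for the three steps of claim 1."""
    # Step 1: acquire phoneme information of the text to be cloned.
    phoneme_ids = frontend.text_to_phonemes(text)        # (1, T_phonemes)

    # Step 2: perform feature extraction on the reference voice.
    # Per claim 1, the result bundles recording-environment and timbre
    # features, optionally plus prosodic duration features.
    voice_features = feature_extractor(reference_wav)    # feature tensor(s)

    # Step 3: synthesize the cloned voice conditioned on those features.
    mel = acoustic_model(phoneme_ids, voice_features)    # (1, T_frames, 80)
    return vocoder(mel)                                  # (1, T_samples)
```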
2. The method of claim 1, wherein the performing feature extraction on the reference voice to obtain the voice features comprises:
inputting the reference voice into an environment feature extraction layer to obtain the recording environment features; and inputting the reference voice into a timbre feature extraction layer to obtain the timbre features;
or, alternatively,
inputting the reference voice into the environment feature extraction layer to obtain the recording environment features; inputting the reference voice into the timbre feature extraction layer to obtain the timbre features; and inputting the reference voice into a prosodic duration feature extraction layer to obtain the prosodic duration features.
3. The method of claim 2, wherein the environment feature extraction layer comprises x convolutional layers, x being a positive integer less than a first threshold;
the timbre feature extraction layer comprises a convolutional layer and a recurrent neural network; and
the prosodic duration feature extraction layer comprises a convolutional layer and a recurrent neural network.
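Claim 3 fixes only the coarse architecture of the three extraction layers: a small stack of convolutions for the recording environment, and a convolution followed by a recurrent network for both timbre and prosodic duration. A minimal PyTorch sketch under those constraints is shown below; the channel widths, kernel sizes, and the pooling of frame sequences into fixed-length embeddings are all assumptions.

```python
import torch
import torch.nn as nn

class EnvironmentExtractor(nn.Module):
    """x convolutional layers (x below a first threshold), per claim 3."""
    def __init__(self, n_mels=80, channels=128, x=3):
        super().__init__()
        layers, in_ch = [], n_mels
        for _ in range(x):
            layers += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.ReLU()]
            in_ch = channels
        self.net = nn.Sequential(*layers)

    def forward(self, mel):               # mel: (B, n_mels, T)
        return self.net(mel).mean(dim=2)  # (B, channels), pooled over time

class ConvRNNExtractor(nn.Module):
    """Convolutional layer plus recurrent network, the shape claim 3
    gives for both the timbre and prosodic-duration layers."""
    def __init__(self, n_mels=80, channels=128, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, channels, kernel_size=5, padding=2)
        self.rnn = nn.GRU(channels, hidden, batch_first=True)

    def forward(self, mel):                              # (B, n_mels, T)
        h = torch.relu(self.conv(mel)).transpose(1, 2)   # (B, T, channels)
        _, state = self.rnn(h)                           # (1, B, hidden)
        return state.squeeze(0)                          # (B, hidden)
```

Under this reading, the timbre and prosodic-duration layers would simply be two separately trained instances of `ConvRNNExtractor`.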
4. The method according to any one of claims 1 to 3, wherein the synthesizing the cloned voice of the phoneme information according to the voice features comprises:
inputting the phoneme information into an encoder to obtain phoneme coding information; and
inputting the phoneme coding information and the voice features into a decoder to obtain the cloned voice.
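One plausible realization of claim 4's encoder/decoder split is an embedding-plus-RNN encoder over phoneme IDs and a decoder that consumes the phoneme encoding concatenated with the voice features at every step. The sketch below is an assumption about structure, not the patent's disclosed network; all dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Maps phoneme IDs to phoneme coding information (claim 4)."""
    def __init__(self, n_phonemes=100, dim=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, phoneme_ids):                 # (B, T_phonemes)
        out, _ = self.rnn(self.embed(phoneme_ids))
        return out                                  # (B, T_phonemes, dim)

class MelDecoder(nn.Module):
    """Consumes phoneme coding plus voice features, emits mel frames."""
    def __init__(self, dim=256, feat_dim=256, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(dim + feat_dim, dim, batch_first=True)
        self.proj = nn.Linear(dim, n_mels)

    def forward(self, encoding, voice_features):
        # Broadcast the fixed-length voice features to every time step.
        feats = voice_features.unsqueeze(1).expand(-1, encoding.size(1), -1)
        out, _ = self.rnn(torch.cat([encoding, feats], dim=-1))
        return self.proj(out)                        # (B, T, n_mels)
```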
5. The method of claim 4, wherein the voice features comprise the prosodic duration features, and the method further comprises:
inputting the phoneme coding information and the prosodic duration features into a prosodic duration estimation layer to obtain a first phoneme duration of each phoneme in the phoneme information;
wherein the inputting the phoneme coding information and the voice features into the decoder to obtain the cloned voice comprises:
inputting the phoneme coding information, the first phoneme duration, and the voice features into the decoder to obtain the cloned voice.
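Claim 5 adds a prosodic-duration estimation layer between encoder and decoder: from the phoneme coding and the prosodic duration features it predicts a frame count per phoneme, and the decoder then works on the expanded sequence. A FastSpeech-style length regulator is one natural reading, though the patent does not name one; the sketch below assumes it.

```python
import torch
import torch.nn as nn

class DurationEstimator(nn.Module):
    """Predicts a per-phoneme frame count (the 'first phoneme duration')
    from the phoneme coding and the prosodic duration features."""
    def __init__(self, dim=256, feat_dim=128):
        super().__init__()
        self.proj = nn.Linear(dim + feat_dim, 1)

    def forward(self, encoding, prosody_feat):  # (B, T, dim), (B, feat_dim)
        feats = prosody_feat.unsqueeze(1).expand(-1, encoding.size(1), -1)
        log_dur = self.proj(torch.cat([encoding, feats], dim=-1)).squeeze(-1)
        return torch.clamp(torch.exp(log_dur).round(), min=1).long()  # (B, T)

def length_regulate(encoding, durations):
    """Repeats each phoneme's encoding by its predicted duration, so the
    decoder sees one vector per output frame (a length-regulator assumption)."""
    expanded = [torch.repeat_interleave(e, d, dim=0)
                for e, d in zip(encoding, durations)]
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)
```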
6. The method according to any one of claims 1 to 3, wherein
the performing feature extraction on the reference voice to obtain the voice features comprises:
calling a feature extraction layer to perform feature extraction on the reference voice to obtain the voice features;
the synthesizing the cloned voice of the phoneme information according to the voice features comprises:
calling an acoustic model to synthesize the cloned voice of the phoneme information according to the voice features;
and the method further comprises:
adversarially training the feature extraction layer and the acoustic model using a discriminator.
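Claim 6 trains the feature extraction layer and the acoustic model adversarially against a discriminator. A minimal GAN-style alternating update, with an assumed discriminator over mel spectrograms and an added L1 reconstruction term, might look like this; none of the loss weights or shapes come from the patent.

```python
import torch
import torch.nn.functional as F

def adversarial_step(mel_real, phoneme_ids, extractor, acoustic_model,
                     discriminator, opt_g, opt_d):
    # Discriminator step: real reference mels vs. synthesized mels.
    with torch.no_grad():
        mel_fake = acoustic_model(phoneme_ids, extractor(mel_real))
    real_logits = discriminator(mel_real)
    fake_logits = discriminator(mel_fake)
    d_loss = (F.binary_cross_entropy_with_logits(
                  real_logits, torch.ones_like(real_logits))
              + F.binary_cross_entropy_with_logits(
                  fake_logits, torch.zeros_like(fake_logits)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: extractor + acoustic model try to fool the
    # discriminator while staying close to the reference mel.
    mel_fake = acoustic_model(phoneme_ids, extractor(mel_real))
    fake_logits = discriminator(mel_fake)
    g_loss = (F.l1_loss(mel_fake, mel_real)
              + F.binary_cross_entropy_with_logits(
                  fake_logits, torch.ones_like(fake_logits)))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```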
7. The method of claim 6, wherein the feature extraction layer comprises a timbre feature extraction layer, and the method further comprises:
training the timbre feature extraction layer such that the similarity between timbre features output for reference voices of the same clone object is higher than a threshold.
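A straightforward way to impose the claim-7 constraint is a hinge loss on the cosine similarity between timbre embeddings extracted from two different reference utterances of the same clone object; the threshold value and the pairing scheme are assumptions.

```python
import torch
import torch.nn.functional as F

def timbre_similarity_loss(timbre_a, timbre_b, threshold=0.8):
    """Penalizes same-speaker timbre embedding pairs whose cosine
    similarity falls below the threshold; zero loss once above it."""
    sim = F.cosine_similarity(timbre_a, timbre_b, dim=-1)   # (B,)
    return torch.clamp(threshold - sim, min=0.0).mean()
```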
8. The method according to any one of claims 1 to 3, wherein the reference voice is the voice of a first clone object;
and the synthesizing the cloned voice of the phoneme information according to the voice features comprises:
synthesizing the cloned voice of the phoneme information according to the recording environment features, the timbre features, and prosodic duration features of a second clone object.
9. The method according to claim 8, wherein the synthesizing the cloned voice of the phoneme information according to the recording environment features, the timbre features, and the prosodic duration features of the second clone object comprises:
inputting the phoneme information into an encoder to obtain phoneme coding information;
inputting the phoneme coding information and the prosodic duration features of the second clone object into a prosodic duration estimation layer to obtain a second phoneme duration of each phoneme in the phoneme information; and
inputting the phoneme coding information, the second phoneme duration, the recording environment features, the timbre features, and the prosodic duration features of the second clone object into a decoder to obtain the cloned voice.
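Claims 8 and 9 mix sources: the recording environment and timbre come from the first clone object's reference voice, while prosody is driven by the second clone object's duration features. Reusing the hypothetical modules sketched above, the flow could be:

```python
import torch

def cross_clone(phoneme_ids, ref_wav_a, prosody_feat_b, encoder,
                env_extractor, timbre_extractor, duration_estimator, decoder):
    """Sketch of claim 9: clone object A's environment and timbre,
    clone object B's prosodic duration features (all modules assumed)."""
    encoding = encoder(phoneme_ids)                  # (B, T, dim)

    env = env_extractor(ref_wav_a)                   # from clone object A
    timbre = timbre_extractor(ref_wav_a)             # from clone object A

    # 'Second phoneme duration': driven by clone object B's prosody.
    durations = duration_estimator(encoding, prosody_feat_b)
    frames = length_regulate(encoding, durations)    # length_regulate as above

    features = torch.cat([env, timbre, prosody_feat_b], dim=-1)
    return decoder(frames, features)                 # mel of the cloned voice
```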
10. A voice cloning device, the device comprising:
a phoneme module, configured to acquire phoneme information of a text to be cloned;
a feature extraction module, configured to perform feature extraction on a reference voice to obtain voice features;
a synthesis module, configured to synthesize a cloned voice of the phoneme information according to the voice features;
wherein the voice features comprise recording environment features and timbre features, or the voice features comprise the recording environment features, the timbre features, and prosodic duration features.
11. A computer device, characterized in that the computer device comprises: a processor and a memory, the memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the voice cloning method of any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, which is loaded and executed by a processor to implement the method of voice cloning of any of claims 1 to 9.
CN202211045094.5A 2022-08-30 2022-08-30 Voice cloning method, device, equipment and storage medium Pending CN115394285A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211045094.5A CN115394285A (en) 2022-08-30 2022-08-30 Voice cloning method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115394285A true CN115394285A (en) 2022-11-25

Family

ID=84121776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211045094.5A Pending CN115394285A (en) 2022-08-30 2022-08-30 Voice cloning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115394285A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination