CN116206592A - Voice cloning method, device, equipment and storage medium - Google Patents
Voice cloning method, device, equipment and storage medium
- Publication number: CN116206592A
- Application number: CN202310066406.9A
- Authority: CN (China)
- Prior art keywords: target, acoustic model, synthesized, adaptive, training
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L15/063—Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L19/173—Transcoding, i.e. converting between two coded representations avoiding cascaded coding-decoding
- G10L21/01—Correction of time axis (changing voice quality, e.g. pitch or formants, characterised by the process used)
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the extracted parameters being the cepstrum
- G10L2015/0631—Creating reference templates; Clustering
Abstract
The invention provides a voice cloning method, apparatus, device, and storage medium. The method comprises the following steps: receiving a text to be synthesized; converting the text to be synthesized into a phoneme sequence to be synthesized; inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into the acoustic features to be synthesized of the target object by using the adaptive acoustic model, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target voice sample data of the target object, and when the basic acoustic model is adaptively trained, the parameters of a conditional normalization layer of a decoder in the basic acoustic model are updated; and synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder. The invention reduces the adaptive-model training time during voice cloning and improves the efficiency and quality of voice cloning.
Description
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for cloning speech.
Background
With the advancement of technology, intelligent voice technologies continue to develop. Voice cloning is one of them: it uses a smart device to synthesize specified content into audio data spoken in a specified person's voice; for example, a smart speaker in the home can speak in its owner's voice.
A typical voice cloning pipeline uses recordings of the target speaker made in the actual scene to extract the target speaker's acoustic features and voiceprint features, performs model training, and finally decodes the target speaker's acoustic features with a vocoder to obtain the target speaker's cloned voice. However, during model training a large number of model parameters need to be updated, which requires a long model training time and degrades both the quality and the naturalness of the synthesized audio.
Disclosure of Invention
In view of the foregoing, embodiments of the present invention provide a voice cloning method, apparatus, device, and storage medium that obviate or mitigate one or more disadvantages in the prior art.
One aspect of the present invention provides a voice cloning method comprising the steps of:
receiving a text to be synthesized;
converting the text to be synthesized into a phoneme sequence to be synthesized;
inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into the acoustic features to be synthesized of the target object by using the adaptive acoustic model, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target voice sample data of the target object, and when the basic acoustic model is adaptively trained, the parameters of a conditional normalization layer of a decoder in the basic acoustic model are updated;
and synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
In some embodiments of the invention, the training method of the adaptive acoustic model comprises:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and a target acoustic feature in the target voice sample data;
converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
and adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, and updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until a preset requirement is met.
In some embodiments of the invention, the method further comprises:
when the basic acoustic model is adaptively trained based on the target voice sample data of the target object, a neural network algorithm is used to denoise the target voice sample data, and the adaptive acoustic model is obtained by adaptively training the basic acoustic model with the denoised target voice sample data.
In some embodiments of the invention, the method further comprises:
and adopting a noise reduction model to perform secondary noise reduction on the target synthesized audio data, so as to obtain and output the noise-reduced target synthesized audio data.
In some embodiments of the invention, the sampling frequency of the target voice sample data is the same as the sampling frequency of the training data of the basic acoustic model.
In some embodiments of the invention, the basic acoustic model further includes a reference encoder, which is used to learn the regularities of the acoustic features in the audio data.
Another aspect of the present invention provides a voice cloning apparatus, the apparatus comprising:
the text receiving module is used for receiving the text to be synthesized;
the phoneme conversion module is used for converting the text to be synthesized into a phoneme sequence to be synthesized;
the acoustic feature prediction module is used for inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into the acoustic features to be synthesized of a target object by using the adaptive acoustic model, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target voice sample data of the target object, and when the basic acoustic model is adaptively trained, the parameters of a conditional normalization layer of a decoder in the basic acoustic model are updated;
and the speech synthesis module is used for synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
In some embodiments of the invention, the apparatus further comprises a model training module for training the adaptive acoustic model using the following method:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and a target acoustic feature in the target voice sample data;
converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
And adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, and updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until a preset requirement is met.
Another aspect of the present invention provides a voice cloning device comprising a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the above-described voice cloning method.
Yet another aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described speech cloning method.
The voice cloning method, device, equipment, and storage medium provided by the invention can obtain the adaptive acoustic model of the target object by training on the basis of the basic acoustic model with only a small amount of the target object's audio data. The adaptive acoustic model can convert specified content into the acoustic features of the target object, and the vocoder can then clone the target object's audio data. Because the conditional normalization layer is introduced into the decoder of the acoustic model for adaptive training, the time required for adaptive training can be reduced without losing the naturalness of the synthesized audio, which reduces the adaptive-model training time during voice cloning, increases the speed of voice cloning, and guarantees the audio quality of voice cloning.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the above-described specific ones, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate and together with the description serve to explain the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. Corresponding parts in the drawings may be exaggerated, i.e. made larger relative to other parts in an exemplary device actually manufactured according to the present invention, for convenience in showing and describing some parts of the present invention. In the drawings:
FIG. 1 is a flow chart of a method for voice cloning provided in one embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a training process for a basic acoustic model in one embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of speech cloning in another embodiment of the present disclosure;
FIG. 4 is a schematic block diagram of an embodiment of a speech cloning apparatus provided in the present specification;
FIG. 5 is a block diagram of the hardware structure of a voice cloning server in one embodiment of the present specification.
Detailed Description
The present invention will be described in further detail with reference to the following embodiments and the accompanying drawings, in order to make the objects, technical solutions and advantages of the present invention more apparent. The exemplary embodiments of the present invention and the descriptions thereof are used herein to explain the present invention, but are not intended to limit the invention.
It should be noted here that, in order to avoid obscuring the present invention due to unnecessary details, only structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, while other details not greatly related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted herein that the term "coupled" may refer to not only a direct connection, but also an indirect connection in which an intermediate is present, unless otherwise specified.
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar components, or the same or similar steps.
A typical voice cloning system is built on a speech synthesis model consisting of an acoustic model and a vocoder. First, a basic acoustic model is trained on data from multiple speakers. Then, recordings of the target speaker from the actual scene are used to extract the target speaker's acoustic features and voiceprint features, the basic acoustic model is adaptively trained (updating a large number of network model parameters), and the text to be cloned is inferred into the corresponding acoustic features by the adaptive acoustic model. Finally, the target speaker's acoustic features are decoded by the vocoder to obtain the target speaker's cloned voice. Voice cloning can synthesize the target speaker's voice; for example, applied to the dubbing industry, it can quickly synthesize the voice of a target voice actor, reducing dubbing cost and time. Of course, voice cloning can also be applied to other scenarios according to actual needs; the embodiments of this specification are not specifically limited.
However, in real usage scenarios of voice cloning, obtaining a good cloning effect often imposes many requirements on the target speaker's recordings, for example on the number of recordings and their signal-to-noise ratio; moreover, adaptive training on the basic model takes a long time and updates many acoustic-model parameters. These restrictions greatly reduce the user experience. To improve the naturalness and expressiveness of synthesized audio, a basic acoustic model is usually obtained under the premise of high-quality speech synthesis data and a complex network structure, and in real voice cloning scenarios both the sound quality of the cloned audio and the duration of adaptive training have long been challenges in this field. To address the long adaptive training, practitioners often simply reduce the adaptation time on the basic acoustic model, but this directly reduces the naturalness of the cloned audio.
In the voice cloning method provided by this specification, an adaptive acoustic model of a target object can be trained in advance on the basis of a basic acoustic model to which a conditional normalization layer has been added; only the parameters of the conditional normalization layer of the decoder in the basic acoustic model need to be updated when training the adaptive acoustic model, and during voice cloning the trained adaptive acoustic model synthesizes the text to be synthesized into audio data in the target object's voice. The whole process updates only a small number of designated model parameters, so the designated text can be synthesized into the target object's audio quickly and accurately, the model training time of voice cloning is greatly reduced, the efficiency of voice cloning is improved without affecting the audio quality, and the user experience is improved.
Fig. 1 is a schematic flow chart of a voice cloning method provided in an embodiment of the present disclosure, as shown in fig. 1, in one embodiment of the voice cloning method provided in the present disclosure, the method may be applied to a terminal device such as a computer, a tablet computer, a server, a smart phone, a smart wearable device, and the like, and the method may include the following steps:
Step 102: receive a text to be synthesized.
In a specific implementation, voice cloning uses a smart device to synthesize specified content into audio data spoken in the target object's voice. The text to be synthesized can be understood as the content of the audio to be cloned in the target object's voice; it may originate from text, from a picture, or from speech that is converted into the corresponding text through speech recognition. For example, if user A wants the smart device to synthesize the text "today's weather is good" into audio spoken in user B's voice, the text "today's weather is good" can be understood as the text to be synthesized. Of course, the text to be synthesized may be in different languages, such as Chinese or English, determined according to actual needs; the embodiments of this specification are not specifically limited.
Step 104: convert the text to be synthesized into a phoneme sequence to be synthesized.
In a specific implementation, after the text to be synthesized is received, phoneme conversion may be performed. A phoneme sequence in the embodiments of this specification may be understood as a combination of the smallest speech units, and a TTS front end (text-to-speech front-end software) may be used to convert the text to be synthesized into the corresponding phoneme sequence to be synthesized.
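As an illustration only, here is a minimal grapheme-to-phoneme sketch for Chinese text. The patent does not name a specific front end, so the choice of the open-source pypinyin library and tone-numbered pinyin units below is an assumption:

```python
from pypinyin import Style, lazy_pinyin

def text_to_phonemes(text: str) -> list:
    # Convert Chinese characters to tone-numbered pinyin syllables, a common
    # phoneme-like unit for Mandarin TTS front ends.
    return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)

print(text_to_phonemes("今天天气真好"))
# e.g. ['jin1', 'tian1', 'tian1', 'qi4', 'zhen1', 'hao3']
```

A production front end would additionally handle polyphones, text normalization, and prosodic boundaries; the sketch covers only the basic conversion.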
Step 106: input the phoneme sequence to be synthesized into an adaptive acoustic model, and convert the phoneme sequence to be synthesized into the acoustic features to be synthesized of the target object by using the adaptive acoustic model.
In a specific implementation, the target object may be understood as the object whose voice is to be cloned. For example, if a user wishes to synthesize audio in user B's voice with the voice cloning device, user B may be referred to as the target object. The adaptive acoustic model can be built in advance by training on target voice sample data of the target object; when audio data of the target object needs to be cloned, the phoneme sequence to be synthesized, converted from the text to be synthesized, is input into the adaptive acoustic model, and the audio data of the target object can then be synthesized. Referring to the description of the above embodiments, the adaptive acoustic model may be obtained by adaptive training on the basis of a basic acoustic model, which may be an acoustic model trained with a large amount of speech sample data; the basic acoustic model can learn the acoustic characteristics of different objects and then synthesize specified content into the acoustic features of a specified object. According to actual needs, the basic acoustic model may be an autoregressive speech synthesis model, such as Tacotron 2, Transformer TTS, or Deep Voice 3, or a non-autoregressive speech synthesis model, such as FastSpeech or FastSpeech 2; the embodiments of this specification are not specifically limited.
In some embodiments of this specification, the basic acoustic model may adopt FastSpeech 2, with a conditional normalization layer (Conditional Layer Normalization, CLN) added to it. When a target object, for example a certain user's voice, needs to be cloned, audio data of the target object, i.e., the target voice sample data, can be collected and used to adaptively train the basic acoustic model; during this adaptive training, only the parameters of the conditional normalization layer in the decoder of the basic acoustic model need to be updated, yielding the adaptive acoustic model. The conditional normalization layer is introduced to control the number of model parameters updated during training of the adaptive acoustic model, reducing the model training time of voice cloning, improving training efficiency, and also enabling easier deployment later on.
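For illustration, a minimal PyTorch sketch of such a conditional layer normalization module, with scale and bias predicted from a speaker embedding; the layer sizes and names are assumptions, not taken from the patent:

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    # Layer norm whose scale and bias are predicted from a speaker embedding.
    # During speaker adaptation only these two small projections are updated,
    # while the rest of the decoder stays frozen.
    def __init__(self, hidden_dim: int, speaker_dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.scale_proj = nn.Linear(speaker_dim, hidden_dim)
        self.bias_proj = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); spk: (batch, speaker_dim)
        mean = x.mean(dim=-1, keepdim=True)
        std = x.std(dim=-1, keepdim=True)
        gamma = self.scale_proj(spk).unsqueeze(1)  # (batch, 1, hidden_dim)
        beta = self.bias_proj(spk).unsqueeze(1)
        return gamma * (x - mean) / (std + self.eps) + beta
```

Because the speaker-dependent behavior is concentrated in `scale_proj` and `bias_proj`, adapting to a new voice touches only a tiny fraction of the decoder's weights.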
Fig. 2 is a schematic diagram of the training process of the basic acoustic model in an embodiment of this specification. As shown in fig. 2, the basic acoustic model in an embodiment of this specification may further include a reference encoder, which may be used to model the acoustic features extracted from audio data and learn their regularities, enabling the basic acoustic model to capture a user's pronunciation characteristics, such as stress, pausing, and language style, thereby improving the prosody and emotional expressiveness of the finally synthesized speech, so that the cloned voice is closer to the way the real person speaks.
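A GST-style reference encoder is one common realization: stacked convolutions followed by a GRU whose final state summarizes the reference mel spectrogram. The sketch below is an assumption for illustration; the patent does not specify the reference encoder's architecture:

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    # Compresses a reference mel spectrogram into a fixed-size prosody embedding.
    def __init__(self, n_mels: int = 320, out_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Two stride-2 convolutions shrink the mel axis by a factor of 4.
        self.gru = nn.GRU(64 * ((n_mels + 3) // 4), out_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels) -> prosody embedding (batch, out_dim)
        x = self.conv(mel.unsqueeze(1))             # (batch, 64, time/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # (batch, time/4, 64 * n_mels/4)
        _, h = self.gru(x)
        return h.squeeze(0)
```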
As shown in fig. 2, the training process of the basic acoustic model may include:
collecting sample training audio data, and extracting training voiceprint vectors and training acoustic features in the sample training audio data;
converting the sample training audio data into training text;
converting the training text into a training phoneme sequence, and aligning the sample training audio data with the training text to obtain a training duration sequence;
setting model parameters of the basic acoustic model, wherein the basic acoustic model comprises an encoder, a duration predictor, a reference encoder, and a decoder, and the decoder comprises a conditional normalization layer;
inputting the training phoneme sequence into the encoder, the training duration sequence into the duration predictor, and the training acoustic features into the reference encoder;
and inputting the outputs of the encoder, the duration predictor, and the reference encoder, together with the training voiceprint vector, into the decoder, and performing model training on the basic acoustic model.
As shown in fig. 2, the basic acoustic model may be understood as a learning model that takes audio and text data as input and outputs predicted acoustic features. Sample training audio data for training the basic acoustic model, along with the training texts corresponding to the sample audio data, can be collected from an audio sample database in advance; features are extracted from the sample training audio data and the training texts, and the basic acoustic model is trained with the extracted features. The specific process is as follows:
Step 1: extract features from the audio and the text. First, a 256-dimensional voiceprint vector (x-vector) is extracted from the audio, i.e., the sample training audio data, using an already-trained voiceprint recognition model, and acoustic features (a 320-dimensional mel spectrogram) are also extracted from the audio. Second, the recording transcript, i.e., the training text corresponding to the sample training audio data, is converted into the corresponding phoneme sequence (phone). Finally, a trained MFA (Montreal Forced Aligner, a speech alignment tool) model is used to align the phoneme pronunciation durations between the audio and the text, obtaining a duration sequence (duration).
Step 2: pass the phoneme sequence through the encoder, the duration sequence through the duration predictor, and the mel spectrogram through the reference encoder, and finally feed the output of each part, together with the voiceprint vector, into the decoder containing the conditional normalization layer to predict the corresponding acoustic features.
Step 3: compute the MSE loss between the decoder output and the real mel spectrogram, learning the distribution and regularities of the training sample data.
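Condensing steps 2 and 3, here is a hedged sketch of one training step; the model wrapper and its argument order are hypothetical, and only the MSE objective between the predicted and real mel spectrograms comes from the description:

```python
import torch.nn.functional as F

def train_step(model, optimizer, phonemes, durations, mel_target, xvector):
    # Forward pass: encoder, duration predictor, reference encoder, CLN decoder.
    mel_pred = model(phonemes, durations, mel_target, xvector)
    # MSE loss between the decoder output and the real mel spectrogram.
    loss = F.mse_loss(mel_pred, mel_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```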
After the basic acoustic model is trained with a large amount of training data, it can be stored. When voice cloning is needed for a specified object, the corresponding audio data is collected to adaptively train the basic acoustic model, and an adaptive acoustic model capable of cloning the specified object's voice can then be obtained quickly.
In some embodiments of the present disclosure, the training method of the adaptive acoustic model includes:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and a target acoustic feature in the target voice sample data;
converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
and adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, and updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until a preset requirement is met.
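To make the restricted parameter update concrete, here is a hedged PyTorch sketch of adaptation that freezes everything except the decoder's conditional-normalization projections; the module names (`scale_proj`, `bias_proj`) follow the CLN sketch above and are assumptions:

```python
import torch

def prepare_for_adaptation(model: torch.nn.Module):
    # Freeze all weights of the pretrained basic acoustic model...
    for p in model.parameters():
        p.requires_grad = False
    # ...then re-enable gradients only for the decoder's conditional
    # layer-norm projections (on the order of 70k parameters here).
    for name, p in model.named_parameters():
        if "decoder" in name and ("scale_proj" in name or "bias_proj" in name):
            p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Adaptation could then run for e.g. 500 steps on the few target recordings:
# optimizer = torch.optim.Adam(prepare_for_adaptation(model), lr=1e-4)
# for step in range(500):
#     train_step(model, optimizer, *next(batches))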
In a specific implementation, fig. 3 is a schematic flow chart of voice cloning in another embodiment of this specification. As shown in fig. 3, recordings made by the target speaker, i.e., the target object, can be collected as target voice sample data. In general, unlike the training of the basic acoustic model, this does not require a large number of samples; a small amount of audio data, for example 10 recordings, is enough to train an adaptive acoustic model capable of cloning the target object's voice characteristics.
Furthermore, in some embodiments of this specification, when the target voice sample data of the target object is collected, its sampling frequency is kept the same as the sampling frequency of the training data of the basic acoustic model. Since the basic acoustic model is obtained from a large amount of training data, collecting the target object's audio at the same sampling frequency allows the model to learn the target object's voice characteristics quickly and accurately, improving the audio quality of voice cloning.
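For instance, the rate match could be enforced when loading the recordings; the use of torchaudio and the 16 kHz target value below are assumptions for illustration:

```python
import torch
import torchaudio

BASE_MODEL_SR = 16000  # assumed sampling rate of the basic model's training data

def load_matched(path: str) -> torch.Tensor:
    # Load a target-speaker recording and resample it to the base rate if needed.
    wav, sr = torchaudio.load(path)
    if sr != BASE_MODEL_SR:
        wav = torchaudio.transforms.Resample(orig_freq=sr, new_freq=BASE_MODEL_SR)(wav)
    return wav
```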
After the target voice sample data of the target object is collected, ASR (Automatic Speech Recognition) may be performed on it to convert the target voice sample data into target text. Features are then extracted from the target voice sample data and the target text: the target voiceprint vector and the target acoustic features are extracted from the target voice sample data, the target text is converted into a target phoneme sequence, and the target voice sample data is aligned with the target text to obtain a target duration sequence. The feature extraction process may refer to the feature extraction for the basic acoustic model in fig. 2 and is not repeated here.
As shown in fig. 3, after feature extraction is completed, the basic acoustic model is adaptively trained based on the extracted target voiceprint vector, target acoustic features, target duration sequence, and target phoneme sequence, with parameter updates applied only to the CLN layer in the decoder, until a preset requirement is met, for example 500 update steps, yielding the adaptive acoustic model of the target speaker.
When the acoustic model is adapted to the target object, only the conditional normalization layer in the decoder is updated, about 70k parameters in total, and experiments show that a fast and good voice re-creation effect can be obtained with 500 adaptation steps. Table 1 reports tests performed with the flow of fig. 3 on a collected 10-person test set (10 utterances per person), giving the cloning time, naturalness, and similarity, where the similarity and naturalness of the cloned voices are rated on a 5-point scale. As shown in table 1, starting from training the adaptive acoustic model of the target object, the whole voice cloning process takes only 6.2 minutes, while the similarity and naturalness of the cloned voices remain good.
TABLE 1

| Time required for the voice cloning process | Naturalness of cloned speech | Similarity of cloned speech |
| 6.2 minutes | 3.92 | 3.87 |
According to the embodiments of this specification, on the basis of the basic acoustic model, only a small amount of audio data of the target object is needed to quickly train an adaptive acoustic model capable of synthesizing the target object's acoustic features. In addition, only a small number of parameters are updated during adaptation, which greatly reduces the adaptive-model training time in voice cloning and improves training speed, so that adaptive acoustic models for different target objects can be obtained quickly, laying a foundation for popularizing voice cloning technology.
Step 108: synthesize the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
In a specific implementation, after the adaptive acoustic model is obtained by training with the target object's audio data, it can convert a text to be synthesized into the acoustic features of the designated target object. Once the text to be synthesized has been converted into the target object's acoustic features to be synthesized by the adaptive acoustic model, a vocoder can be used to synthesize those features into target synthesized audio data in the target object's voice.
For example, user A wants to use the smart device at home to clone and synthesize user B's voice. User B's audio data can be collected in advance, and the basic acoustic model can be adaptively trained as described in the above embodiments to obtain user B's adaptive acoustic model. The trained adaptive acoustic model is loaded onto the smart device, and user A can input the content that the smart device should speak, for example "happy birthday". The adaptive acoustic model in the smart device converts "happy birthday" into user B's acoustic features, and the vocoder then synthesizes the acoustic features output by the adaptive acoustic model into the corresponding audio data, whose voice is user B's voice.
In addition, after the target object's adaptive acoustic model has been trained, whenever audio data of the target object needs to be cloned, the trained adaptive acoustic model is used directly to predict the acoustic features of the phoneme sequence to be synthesized, without retraining the target object's adaptive acoustic model.
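Putting the inference path together, here is a hedged end-to-end sketch; the `infer` method and the vocoder's callable interface are hypothetical, since the patent does not mandate a specific vocoder:

```python
def clone_speech(text, g2p, adaptive_model, vocoder, xvector):
    phonemes = g2p(text)                           # text -> phoneme sequence
    mel = adaptive_model.infer(phonemes, xvector)  # phonemes -> target-speaker mel
    wav = vocoder(mel)                             # mel -> waveform in the target voice
    return wav
```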
According to the voice cloning method provided by the embodiments of this specification, the adaptive acoustic model of the target object can be obtained by training on the basis of the basic acoustic model with only a small amount of the target object's audio data. The adaptive acoustic model can convert specified content into the acoustic features of the target object, and the vocoder can then clone the target object's audio data. Because the conditional normalization layer is introduced into the decoder of the acoustic model for adaptive training, the time required for adaptive training can be reduced without losing the naturalness of the synthesized audio, which reduces the adaptive-model training time during voice cloning, increases the speed of voice cloning, and guarantees the audio quality of voice cloning.
As shown in fig. 3, in some embodiments of the present disclosure, the method further includes:
when the basic acoustic model is adaptively trained based on the target voice sample data of the target object, a neural network algorithm is used to denoise the target voice sample data, and the adaptive acoustic model is obtained by adaptively training the basic acoustic model with the denoised target voice sample data.
In a specific implementation, when the basic acoustic model is to be adaptively trained, after the target voice sample data of the target object is collected, a neural network algorithm such as an RNN (Recurrent Neural Network) is used to denoise the target voice sample data; features are then extracted from the denoised data, and the extracted features are used to adaptively train the basic acoustic model to obtain the adaptive acoustic model. In real usage scenarios the target speaker's audio usually contains some noise and the recorded audio of the target object has a low signal-to-noise ratio; if the basic acoustic model were adaptively trained on such data, with a large number of model parameters being updated, the quality and naturalness of the synthesized audio would be reduced.
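The patent names RNN-based denoising but no architecture; the mask-predicting GRU below is purely an illustrative assumption of how such a denoiser can be structured:

```python
import torch
import torch.nn as nn

class RNNDenoiser(nn.Module):
    # Predicts a [0, 1] mask over spectrogram frames to suppress noise energy.
    def __init__(self, n_bins: int = 257, hidden: int = 128):
        super().__init__()
        self.gru = nn.GRU(n_bins, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_bins), nn.Sigmoid())

    def forward(self, mag: torch.Tensor) -> torch.Tensor:
        # mag: (batch, time, n_bins) magnitude spectrogram of the noisy recording
        h, _ = self.gru(mag)
        return mag * self.mask(h)  # masked, i.e. denoised, magnitudes
```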
According to the embodiments of this specification, denoising with a neural network before adaptive acoustic model training improves the quality of the target object's audio data, imposes no special requirements on the recording environment or recording quality, lowers the difficulty of sample collection, and improves the quality of voice cloning.
Furthermore, as shown in fig. 3, in some embodiments of the present disclosure, the method further includes:
and adopting a noise reduction model to perform secondary noise reduction on the target synthesized audio data, so as to obtain and output the noise-reduced target synthesized audio data.
In a specific implementation, after the acoustic features output by the adaptive acoustic model are synthesized into the corresponding audio data, a noise reduction model, for example SpeexDSP, may be used to denoise the synthesized audio data, obtaining and outputting the noise-reduced target synthesized audio data. SpeexDSP additionally provides functions such as echo suppression and noise cancellation. Experiments show that denoising the final synthesized audio improves its sound quality.
According to the embodiments of this specification, by performing noise reduction twice, applying a dedicated method to the recorded audio of the target object and to the cloned audio respectively, the sound quality of the cloned audio can be improved, further improving the quality of voice cloning.
As shown in fig. 3, in the voice cloning method provided in the embodiment of the present disclosure, 10 sentences of audio recorded by a target speaker are utilized in a real usage scenario, and on the basis of a trained basic acoustic model, a designated network layer is updated, so that the adaptation time of the model is shortened, and in addition, the tone quality and naturalness of the cloned audio are ensured through two different noise reduction processes.
First, in real usage scenarios users' environments differ, and some users' recordings inevitably contain noise, which tends to affect the quality of the final synthesized audio to some extent. In the embodiments of this specification, the early preprocessing mainly comprises sampling rate conversion, loudness normalization, and noise reduction: the sampling rate of the user's recordings is made consistent with the sampling rate of the basic acoustic model's training data, loudness normalization is performed with the audio tool sox, and noise reduction is based on a neural network.
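A hedged preprocessing sketch follows; sox is named in the description for loudness normalization, but the exact flags and the 16 kHz target rate are assumptions:

```python
import subprocess

def preprocess(in_wav: str, out_wav: str, sr: int = 16000) -> None:
    # Loudness-normalize with sox and resample to the basic model's rate;
    # neural noise reduction would follow as a separate step.
    subprocess.run(["sox", "--norm=-3", in_wav, "-r", str(sr), out_wav], check=True)
```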
Second, feature extraction. This step mainly comprises converting speech into text and text into a phoneme sequence, the alignment operation (MFA alignment), voiceprint vector extraction, and acoustic feature extraction. First, the user records 10 fluent utterances with a mobile phone in a quiet environment, and an ASR tool performs speech transcription to convert the audio into the corresponding text. Second, the recognized text is converted into the corresponding phoneme sequence using a Chinese dictionary. Then the alignment tool MFA aligns the phoneme sequence with the corresponding audio in time, the result being the pronunciation duration of each phoneme. Third, the trained voiceprint recognition model extracts the speaker information in the audio and outputs a 256-dimensional vector (x-vector). Finally, acoustic features are extracted from the audio; a 320-dimensional mel spectrogram is used here.
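For the acoustic-feature step, here is a sketch of 320-dimensional log-mel extraction with librosa; the FFT and hop sizes are assumptions, and duration alignment would be done externally with the MFA tool:

```python
import librosa
import numpy as np

def extract_mel(path: str, sr: int = 16000) -> np.ndarray:
    # 320-dimensional mel spectrogram as in the description (hop/FFT sizes assumed).
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=256, n_mels=320)
    return np.log(mel + 1e-6).T  # (time, 320) log-mel frames

# Phoneme durations come from forced alignment done outside Python, e.g.:
#   mfa align <corpus_dir> <pronunciation_dict> <acoustic_model> <output_dir>
```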
Finally, the model is adaptively trained. The embodiments of this specification modify FastSpeech 2, the main changes being the addition of a conditional normalization layer and a reference encoder. The conditional normalization layer controls the number of model parameters updated during adaptation and also facilitates later deployment, while the reference encoder extracts the user's pronunciation characteristics from the user's recordings.
According to the voice cloning method provided by the embodiments of this specification, with the improved acoustic model, only the designated parameters in the acoustic model need to be updated during the adaptive learning of voice cloning, so a fast and good voice re-creation effect can be obtained, improving the speed and effect of voice cloning; the neural network is used for noise reduction before adaptation, and SpeexDSP is used for another round of noise reduction after the audio is synthesized, which improves the sound quality of the audio synthesized by voice cloning.
In the present specification, each embodiment of the method is described in a progressive manner, and the same and similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments. Reference is made to the description of parts of the method embodiments where relevant.
Based on the above voice cloning method, one or more embodiments of this specification further provide a voice cloning apparatus. The apparatus may include systems (including distributed systems), software (applications), modules, components, servers, clients, etc. that use the methods described in the embodiments of this specification together with the necessary hardware. Based on the same innovative concept, the apparatus provided in one or more embodiments is described in the following embodiments. Since the implementation of the apparatus for solving the problem is similar to that of the method, the implementation of the apparatus in the embodiments of this specification may refer to the implementation of the foregoing method, and repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
Specifically, fig. 4 is a schematic block diagram of an embodiment of a voice cloning apparatus provided in the present specification, and as shown in fig. 4, the apparatus provided in the present specification may include:
A text receiving module 41, configured to receive a text to be synthesized;
a phoneme conversion module 42, configured to convert the text to be synthesized into a phoneme sequence to be synthesized;
the acoustic feature prediction module 43 is configured to input the phoneme sequence to be synthesized into an adaptive acoustic model, and convert the phoneme sequence to be synthesized into the acoustic features to be synthesized of a target object by using the adaptive acoustic model, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target voice sample data of the target object, and when the basic acoustic model is adaptively trained, the parameters of a conditional normalization layer of a decoder in the basic acoustic model are updated;
a speech synthesis module 44, configured to synthesize the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
In some embodiments of the present description, the apparatus further comprises a model training module for training the adaptive acoustic model using the following method:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and a target acoustic feature in the target voice sample data;
Converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
and adaptively training the basic acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features and the target voiceprint vector, and updating the parameters of the conditional normalization layer of the decoder in the basic acoustic model until a preset requirement is met.
According to the voice cloning device provided by the embodiments of this specification, the adaptive acoustic model of the target object can be obtained by training on the basis of the basic acoustic model with only a small amount of the target object's audio data. The adaptive acoustic model can convert specified content into the acoustic features of the target object, and the vocoder can then clone the target object's audio data. Because the conditional normalization layer is introduced into the decoder of the acoustic model for adaptive learning, the time required for adaptive training can be reduced without losing the naturalness of the synthesized audio, which reduces the adaptive-model training time during voice cloning, increases the speed of voice cloning, and guarantees the audio quality of voice cloning.
In some embodiments of this specification, there is further provided a voice cloning device, including a processor and a memory, where the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the voice cloning method described in the foregoing embodiments, the voice cloning method including:
receiving a text to be synthesized;
converting the text to be synthesized into a phoneme sequence to be synthesized;
inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into the acoustic features to be synthesized of the target object by using the adaptive acoustic model, wherein the adaptive acoustic model is obtained by adaptively training a basic acoustic model based on target voice sample data of the target object, and when the basic acoustic model is adaptively trained, the parameters of a conditional normalization layer of a decoder in the basic acoustic model are updated;
and synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
It should be noted that the descriptions of the apparatus and the device according to the method embodiments may further include other implementations. Specific implementation may refer to descriptions of related method embodiments, which are not described herein in detail.
The method embodiments provided in the embodiments of the present specification may be performed in a mobile terminal, a computer terminal, a server, or similar computing device. Taking the example of running on a server, fig. 5 is a block diagram of the hardware structure of a voice cloning server in one embodiment of the present specification, and the computer terminal may be the voice cloning server or the voice cloning apparatus in the above embodiment. The server 10 as shown in fig. 5 may include one or more (only one is shown in the figure) processors 100 (the processors 100 may include, but are not limited to, a microprocessor MCU or a processing device such as a programmable logic device FPGA), a nonvolatile memory 200 for storing data, and a transmission module 300 for communication functions. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 5 is merely illustrative and is not intended to limit the configuration of the electronic device described above. For example, server 10 may also include more or fewer components than shown in FIG. 5, for example, may also include other processing hardware such as a database or multi-level cache, a GPU, or have a different configuration than that shown in FIG. 5.
The nonvolatile memory 200 may be used to store software programs and modules of application software, such as program instructions/modules corresponding to the voice cloning method in the embodiment of the present disclosure, and the processor 100 executes the software programs and modules stored in the nonvolatile memory 200, thereby executing various functional applications and resource data updates. The non-volatile memory 200 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory. In some examples, the non-volatile memory 200 may further include memory located remotely from the processor 100, which may be connected to the computer terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission module 300 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a computer terminal. In one example, the transmission module 300 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission module 300 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
Correspondingly, the invention also provides a device comprising a computer apparatus, the computer apparatus comprising a processor and a memory, the memory having stored therein computer instructions for executing the computer instructions stored in the memory, the device implementing the steps of the method as described above when the computer instructions are executed by the processor.
The embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voice cloning method described above. The computer-readable storage medium may be a tangible storage medium such as random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a floppy disk, a hard disk, a removable memory disk, a CD-ROM, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein can be implemented as hardware, software, or a combination of both. Whether a particular implementation is hardware or software depends on the specific application and the design constraints of the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an application-specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave.
It should be understood that the invention is not limited to the particular arrangements and instrumentalities described above and shown in the drawings. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown; those skilled in the art can make various changes, modifications, and additions, or change the order of the steps, after appreciating the spirit of the present invention.
In this disclosure, features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments.
The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art can make various modifications and variations to the embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A method of voice cloning, the method comprising:
receiving a text to be synthesized;
converting the text to be synthesized into a phoneme sequence to be synthesized;
inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into acoustic features to be synthesized of a target object by using the adaptive acoustic model; wherein the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target object, and during the adaptive training, the parameters of the conditional normalization layer of the decoder in the base acoustic model are updated; and
synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
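By way of illustration, the claimed inference flow maps onto a short Python sketch; `g2p`, `acoustic_model`, and `vocoder` below are hypothetical placeholders for the claimed components, not an implementation disclosed in this application.

```python
# Minimal sketch of the inference flow in claim 1 (assumed interfaces).
import torch

def clone_voice(text: str, g2p, acoustic_model, vocoder) -> torch.Tensor:
    # Step 1: convert the text to be synthesized into a phoneme sequence.
    phoneme_ids = torch.tensor([g2p(text)])    # (1, n_phonemes)
    # Step 2: predict the target object's acoustic features (e.g. a mel
    # spectrogram) with the adaptively trained acoustic model.
    mel = acoustic_model(phoneme_ids)           # (1, n_frames, n_mels)
    # Step 3: synthesize the target audio from the features with a vocoder.
    return vocoder(mel)                         # (1, n_samples)
```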
2. The method of claim 1, wherein the training method of the adaptive acoustic model comprises:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and target acoustic features from the target voice sample data;
converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
and adaptively training the base acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features, and the target voiceprint vector, updating the parameters of the conditional normalization layer of the decoder in the base acoustic model until a preset requirement is met.
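One common reading of the conditional normalization layer, e.g. as in AdaSpeech-style adaptation, is a layer norm whose scale and bias are predicted from the target voiceprint vector, with only those parameters unfrozen during adaptive training. The sketch below assumes that reading; the module and freezing logic are illustrative, not the application's disclosed code.

```python
# Hedged sketch of claim 2's parameter update: freeze everything except
# the decoder's conditional layer-norm parameters (assumed design).
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """LayerNorm whose scale/bias are predicted from a voiceprint vector."""
    def __init__(self, hidden_dim: int, speaker_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, hidden); spk: (batch, speaker_dim)
        scale = self.to_scale(spk).unsqueeze(1)   # (batch, 1, hidden)
        bias = self.to_bias(spk).unsqueeze(1)
        return self.norm(x) * scale + bias

def freeze_all_but_conditional_norm(model: nn.Module) -> None:
    """Restrict adaptive training to the conditional-norm parameters."""
    for p in model.parameters():
        p.requires_grad = False
    for m in model.modules():
        if isinstance(m, ConditionalLayerNorm):
            for p in m.parameters():
                p.requires_grad = True
```

An optimizer built over only the parameters that remain trainable would then be stepped until the preset requirement, e.g. a loss threshold or a fixed step budget, is met.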
3. The method according to claim 1, wherein the method further comprises:
when the base acoustic model is adaptively trained based on the target voice sample data of the target object, denoising the target voice sample data using a neural network algorithm, wherein the adaptive acoustic model is obtained by adaptively training the base acoustic model on the denoised target voice sample data.
4. The method according to claim 3, wherein the method further comprises:
performing secondary noise reduction on the target synthesized audio data using a noise reduction model, to obtain and output denoised target synthesized audio data.
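Read together, claims 3 and 4 place one noise-reduction pass before adaptive training and a second one after synthesis. A minimal sketch of that ordering, where `sample_denoiser`, `output_denoiser`, `adapt`, and `synthesize` are assumed callables rather than anything specified by the application:

```python
# Illustrative ordering of the two denoising passes (claims 3 and 4);
# all callables are assumed stand-ins, not disclosed components.
def clone_with_denoising(raw_samples, base_model, text,
                         sample_denoiser, output_denoiser,
                         adapt, synthesize):
    # Claim 3: denoise the target voice samples, then adaptively train.
    clean_samples = [sample_denoiser(wav) for wav in raw_samples]
    adaptive_model = adapt(base_model, clean_samples)
    # Claim 4: secondary denoising of the synthesized audio before output.
    audio = synthesize(adaptive_model, text)
    return output_denoiser(audio)
```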
5. The method according to claim 2, wherein the sampling frequency of the target voice sample data is the same as the sampling frequency of the training data of the base acoustic model.
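This constraint is straightforward to enforce by resampling the collected samples. A minimal sketch with torchaudio, assuming, purely for illustration, a 22,050 Hz base-model training rate:

```python
# Sketch of claim 5's sampling-rate match; the 22,050 Hz rate of the
# base model's training data is an assumption for illustration.
import torchaudio
import torchaudio.functional as F

BASE_MODEL_SR = 22050  # assumed sampling rate of the base model's training data

def load_matched(path: str):
    wav, sr = torchaudio.load(path)
    if sr != BASE_MODEL_SR:
        # Bring the target voice sample to the base model's training rate.
        wav = F.resample(wav, orig_freq=sr, new_freq=BASE_MODEL_SR)
    return wav
```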
6. The method of claim 1, wherein the base acoustic model further comprises a reference encoder, the reference encoder being configured to learn the regularities of acoustic features in audio data.
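Claim 6 leaves the reference encoder's architecture open; one common reading is a GST/Tacotron-style encoder that compresses a reference mel spectrogram into a fixed vector capturing its acoustic regularities. The layer sizes below are assumptions for illustration, not the application's disclosed design.

```python
# Hedged sketch of a reference encoder (claim 6): conv stack + GRU,
# loosely following the GST style; all sizes are assumed.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Compress a mel spectrogram into one vector of acoustic regularities."""
    def __init__(self, n_mels: int = 80, ref_dim: int = 128):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # Two stride-2 convs reduce the mel axis by a factor of 4.
        self.gru = nn.GRU(64 * (n_mels // 4), ref_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> add a channel axis for Conv2d
        x = self.convs(mel.unsqueeze(1))            # (B, 64, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                          # h: (1, B, ref_dim)
        return h.squeeze(0)                         # (B, ref_dim)
```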
7. A voice cloning apparatus, the apparatus comprising:
the text receiving module is used for receiving the text to be synthesized;
the phoneme conversion module is used for converting the text to be synthesized into a phoneme sequence to be synthesized;
the acoustic feature prediction module is used for inputting the phoneme sequence to be synthesized into an adaptive acoustic model, and converting the phoneme sequence to be synthesized into acoustic features to be synthesized of a target object by using the adaptive acoustic model; wherein the adaptive acoustic model is obtained by adaptively training a base acoustic model on target voice sample data of the target object, and during the adaptive training, the parameters of the conditional normalization layer of the decoder in the base acoustic model are updated; and
the voice synthesis module is used for synthesizing the acoustic features to be synthesized into target synthesized audio data of the target object by using a vocoder.
8. The apparatus of claim 7, further comprising a model training module configured to train the adaptive acoustic model by:
collecting target voice sample data of a target object;
converting the target voice sample data into target text;
extracting a target voiceprint vector and target acoustic features from the target voice sample data;
converting the target text into a target phoneme sequence, and aligning the target voice sample data with the target text to obtain a target duration sequence;
and adaptively training the base acoustic model based on the target phoneme sequence, the target duration sequence, the target acoustic features, and the target voiceprint vector, updating the parameters of the conditional normalization layer of the decoder in the base acoustic model until a preset requirement is met.
9. A voice cloning device, comprising a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory, and wherein the device implements the steps of the method according to any one of claims 1 to 6 when the computer instructions are executed by the processor.
10. A computer readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310066406.9A | 2023-01-17 | 2023-01-17 | Voice cloning method, device, equipment and storage medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310066406.9A | 2023-01-17 | 2023-01-17 | Voice cloning method, device, equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN116206592A | 2023-06-02 |
Family

ID=86516645

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310066406.9A | Voice cloning method, device, equipment and storage medium | 2023-01-17 | 2023-01-17 |

Country Status (1)

| Country | Link |
|---|---|
| CN | CN116206592A |
Cited By (2)

| Publication | Priority Date | Publication Date | Assignee | Title |
|---|---|---|---|---|
| CN116825081A * | 2023-08-25 | 2023-09-29 | Moore Threads Intelligent Technology (Beijing) Co., Ltd. | Speech synthesis method, device and storage medium based on small sample learning |
| CN116825081B | 2023-08-25 | 2023-11-21 | Moore Threads Intelligent Technology (Beijing) Co., Ltd. | Speech synthesis method, device and storage medium based on small sample learning |
Similar Documents

| Publication | Title |
|---|---|
| JP7427723B2 | Text-to-speech synthesis in target speaker's voice using neural networks |
| CN109817213B | Method, device and equipment for performing voice recognition on self-adaptive language |
| US8983844B1 | Transmission of noise parameters for improving automatic speech recognition |
| Kelly et al. | Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors |
| CN112735482B | Endpoint detection method and system based on joint deep neural network |
| Pascual et al. | Towards generalized speech enhancement with generative adversarial networks |
| CN110767210A | Method and device for generating personalized voice |
| CN110008481B | Translated voice generating method, device, computer equipment and storage medium |
| CN111261145B | Voice processing device, equipment and training method thereof |
| CN113571047B | Audio data processing method, device and equipment |
| CN111667834B | Hearing-aid equipment and hearing-aid method |
| CN112562648A | Adaptive speech recognition method, apparatus, device and medium based on meta learning |
| CN114708857A | Speech recognition model training method, speech recognition method and corresponding device |
| KR20180012639A | Voice recognition method, voice recognition device, apparatus comprising voice recognition device, storage medium storing a program for performing the voice recognition method, and method for making transformation model |
| CN112712789A | Cross-language audio conversion method and device, computer equipment and storage medium |
| CN114267372A | Voice noise reduction method, system, electronic device and storage medium |
| CN110930975A | Method and apparatus for outputting information |
| CN116206592A | Voice cloning method, device, equipment and storage medium |
| CN115171713A | Voice noise reduction method, device and equipment and computer readable storage medium |
| CN114005428A | Speech synthesis method, apparatus, electronic device, storage medium, and program product |
| CN115700871A | Model training and speech synthesis method, device, equipment and medium |
| CN117392972B | Speech synthesis model training method and device based on contrast learning and synthesis method |
| CN113948062A | Data conversion method and computer storage medium |
| CN112116909A | Voice recognition method, device and system |
| CN112002307B | Voice recognition method and device |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |