CN116092466A - Speech model processing method, device, computer equipment and storage medium


Info

Publication number: CN116092466A
Application number: CN202211502645.6A
Authority: CN (China)
Prior art keywords: voice, tone, target, information, user
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 詹皓粤, 张旸, 林悦
Original and current assignee: Netease Hangzhou Network Co Ltd
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN202211502645.6A
Publication of CN116092466A


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/003: Changing voice quality, e.g. pitch or formants
    • G10L21/007: Changing voice quality, e.g. pitch or formants, characterised by the process used
    • G10L21/013: Adapting to target pitch
    • G10L2021/0135: Voice conversion or morphing
    • Y02T10/40: Engine management systems (under Y02T: climate change mitigation technologies related to transportation)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a processing method and device for a voice model, computer equipment and a storage medium. After converted voice is generated from a plurality of preset sample pairs by a non-parallel voice conversion model, a preset parallel voice conversion model is pre-trained with the designated tone voice and the converted voice to obtain a universal basic parallel voice conversion model. The basic parallel voice conversion model can then be fine-tuned with the user voice input by different users to obtain a target parallel voice conversion model corresponding to each user voice, so that an input target text to be synthesized is turned into synthesized voice whose tone is the same as the tone of the user voice, thereby realizing voice cloning. The method and the device simplify the training steps of the voice model and improve voice cloning efficiency; in addition, pre-training the parallel voice conversion model effectively improves its robustness and improves the tone quality and pronunciation accuracy of the synthesized voice.

Description

Speech model processing method, device, computer equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of information processing, in particular to a processing method and device of a voice model, computer equipment and a storage medium.
Background
Voice cloning refers to a technology in which a machine extracts tone information from speech provided by a user and then synthesizes speech in that user's tone. Voice cloning is an extension of speech synthesis: traditional speech synthesis converts text into speech for a fixed speaker, while voice cloning further lets the speaker's tone be specified. Voice cloning already has many practical scenarios, such as voice navigation and audiobook (voiced novel) applications, where a user can customize a voice package by uploading speech and then navigate or have novels read aloud in that voice, making the applications more engaging.
At present, when a user performs personalized customization with voice cloning technology, the user generally has to provide a segment of his or her own speech together with the corresponding text. Because the speech the user records may not match the text he or she was asked to read aloud, cleaning and correction operations are required before the voice model can be trained. As a result, training the voice model takes a long time and voice cloning is inefficient.
Disclosure of Invention
The embodiment of the application provides a processing method and device for a voice model, computer equipment and a storage medium. Converted voice is generated from a plurality of preset sample pairs by a non-parallel voice conversion model, and a preset parallel voice conversion model is pre-trained with the designated tone voice and the converted voice to obtain a universal basic parallel voice conversion model. The basic parallel voice conversion model is subsequently fine-tuned with the voices of different users to obtain a target parallel voice conversion model corresponding to each user. This simplifies the training steps of the voice model and improves voice cloning efficiency; in addition, pre-training the parallel voice conversion model effectively improves its robustness and improves the tone quality and pronunciation accuracy of the synthesized voice.
The embodiment of the application provides a processing method of a voice model, which comprises the following steps:
acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample is different from the tone information of the reference tone voice sample;
converting the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of a preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample;
acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone;
training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample;
acquiring the user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice;
and training the basic parallel voice conversion model based on the user voice, the voice sample under the specified tone and the specified tone information to obtain a target parallel voice conversion model corresponding to the target user (a minimal sketch of this two-stage flow follows below).
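For orientation, the two training stages above can be outlined in a few lines of Python. This is a minimal sketch under stated assumptions: `non_parallel_vc`, `parallel_vc`, `extract_features`, `train_step` and the timbre ids are hypothetical stand-ins for the modules described in this application, not the claimed implementation itself.

```python
# Minimal sketch of the two-stage flow. All names (non_parallel_vc,
# parallel_vc, extract_features, train_step, timbre ids) are
# hypothetical stand-ins for the modules described in the text.

def pretrain_parallel_vc(non_parallel_vc, parallel_vc, sample_pairs):
    """Stage 1: obtain the universal basic parallel voice conversion model."""
    for ref_sample, spec_sample in sample_pairs:
        # Manufacture parallel data: same text and prosody as the
        # designated-tone sample, re-rendered in the reference tone.
        converted = non_parallel_vc.convert(spec_sample.wave, ref_sample.timbre_id)
        src = extract_features(spec_sample.wave)   # designated-tone features
        tgt = extract_features(converted)          # reference-tone features
        parallel_vc.train_step(src, tgt, ref_sample.timbre_id)
    return parallel_vc                             # basic parallel VC model

def finetune_for_user(non_parallel_vc, basic_vc, user_wave, spec_timbre_id):
    """Stage 2: fine-tune the basic model into the user's target model."""
    # Convert the user's recording into the specified tone; text and
    # prosody are preserved, only the tone changes.
    spec_wave = non_parallel_vc.convert(user_wave, spec_timbre_id)
    src = extract_features(spec_wave)              # specified-tone features
    tgt = extract_features(user_wave)              # user-tone features
    basic_vc.train_step(src, tgt, spec_timbre_id)
    return basic_vc                                # target parallel VC model
```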
Correspondingly, the embodiment of the application also provides a processing device of the voice model, which comprises:
a first acquisition unit, used for acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample is different from the tone information of the reference tone voice sample;
a conversion unit, used for converting the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of a preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample;
a second acquisition unit, used for acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone;
a first training unit, used for training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample;
a third acquisition unit, used for acquiring the user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice;
and a second training unit, used for training the basic parallel voice conversion model based on the user voice, the voice sample under the specified tone and the specified tone information to obtain a target parallel voice conversion model corresponding to the target user.
Accordingly, the embodiments of the present application further provide a computer device, including a processor, a memory, and a computer program stored on the memory and capable of running on the processor, where the computer program when executed by the processor implements the steps of any one of the processing methods of the speech model.
Accordingly, embodiments of the present application also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the processing methods of the speech model.
The embodiment of the application provides a processing method and device for a voice model, computer equipment and a storage medium. After converted voice is generated from a plurality of preset sample pairs by a non-parallel voice conversion model, a preset parallel voice conversion model is pre-trained with the designated tone voice and the converted voice to obtain a universal basic parallel voice conversion model. The basic parallel voice conversion model can then be fine-tuned with the user voice input by different users to obtain a target parallel voice conversion model corresponding to each user voice, so that an input target text to be synthesized is turned into synthesized voice whose tone is the same as the tone of the user voice, thereby realizing voice cloning. The method and the device simplify the training steps of the voice model and improve voice cloning efficiency; in addition, pre-training the parallel voice conversion model effectively improves its robustness and improves the tone quality and pronunciation accuracy of the synthesized voice.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application; other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic view of a scenario of a processing system of a speech model according to an embodiment of the present application.
Fig. 2 is a schematic flow chart of a processing method of a speech model according to an embodiment of the present application.
Fig. 3 is a training schematic diagram of a non-parallel speech conversion model according to an embodiment of the present application.
Fig. 4 is a schematic diagram of a non-parallel speech conversion model according to an embodiment of the present application.
Fig. 5 is a pre-training schematic diagram of a pre-set parallel speech conversion model according to an embodiment of the present application.
Fig. 6 is an application schematic diagram of a non-parallel speech conversion model according to an embodiment of the present application.
Fig. 7 is a training schematic diagram of a basic parallel speech conversion model according to an embodiment of the present application.
Fig. 8 is a training schematic diagram of a preset speech synthesis model according to an embodiment of the present application.
Fig. 9 is an application scenario schematic diagram of a method for processing a speech model according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a processing device for a speech model according to an embodiment of the present application.
Fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The embodiment of the application provides a processing method and device for a voice model, computer equipment and a storage medium. Specifically, the processing method of the voice model in the embodiment of the application may be performed by a computer device, and the computer device may be a terminal. The terminal may be a device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game machine, a personal computer (PC) or a personal digital assistant (PDA), and the terminal may further include a client, which may be a video application client, a music application client, a game application client, a browser client carrying a game program, an instant messaging client, or the like.
Referring to fig. 1, fig. 1 is a schematic view of a scenario of a speech model processing system provided in an embodiment of the present application, including computer devices; the system may include at least one terminal, at least one server and a network. The terminal held by the user can connect to the servers of different games through the network. A terminal is any device having computing hardware capable of supporting and executing a software product corresponding to a game. In addition, the terminal has one or more multi-touch-sensitive screens for sensing and obtaining user input through touch or slide operations performed at multiple points of the one or more touch-sensitive display screens. When the system includes multiple terminals, multiple servers and multiple networks, different terminals may be connected to each other through different networks and different servers. The network may be a wireless network or a wired network, such as a wireless local area network (WLAN), a local area network (LAN), a cellular network, a 2G network, a 3G network, a 4G network or a 5G network. In addition, different terminals may also connect to other terminals or to a server using their own Bluetooth network or hotspot network.
After generating converted voice from a plurality of preset sample pairs through the non-parallel voice conversion model, the computer device pre-trains the preset parallel voice conversion model with the designated tone voice and the converted voice to obtain a universal basic parallel voice conversion model. The basic parallel voice conversion model can then be fine-tuned with the user voice input by different users to obtain a target parallel voice conversion model corresponding to each user voice, so that an input target text to be synthesized is turned into synthesized voice whose tone is the same as the tone of the user voice, thereby realizing voice cloning.
Further, the computer device may also obtain language content features and prosodic features from the user voice of the target user; perform voice conversion processing based on the language content features, the prosodic features and the specified tone information to obtain the voice to be processed under the specified tone; and perform voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate the target synthesized voice matching the tone of the target user.
It should be noted that the scenario of the speech model processing system shown in fig. 1 is only an example; the processing system and scenario described in the embodiments of the present application are intended to explain the technical solutions of the embodiments more clearly and do not limit them. As a person of ordinary skill in the art can appreciate, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems as the processing system of the speech model evolves and new service scenarios appear.
The embodiment of the invention provides a processing method and device for a voice model, computer equipment and a storage medium. The processing method of the voice model can be executed on a terminal, such as a smart phone, a tablet computer, a notebook computer or a personal computer. The processing method, apparatus, computer device and storage medium of the speech model are described in detail below. The order of description of the following embodiments is not intended to limit the preferred order of the embodiments.
Referring to fig. 2, fig. 2 is a flow chart of a processing method of a speech model according to an embodiment of the present application, and the specific flow may be as follows:
101, obtaining a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and tone information of the designated tone voice sample is different from tone information of the reference tone voice sample.
In order to obtain the plurality of preset sample pairs, before the step of "acquiring a plurality of preset sample pairs", the method may include:
determining a target voice sample from a plurality of preset voice samples in a preset voice library as the designated tone voice sample, wherein the target voice sample is the voice sample with the largest tone data amount among the plurality of preset voice samples;
and generating the plurality of preset sample pairs based on the designated tone voice sample and each reference tone voice sample in a plurality of reference tone voice samples.
In the embodiment of the present application, the designated tone voice sample may be a preset voice sample randomly selected from the plurality of preset voice samples in the preset voice library. A fixed tone may also be used; for example, the preset voice sample corresponding to the tone with the largest data amount among the plurality of preset voice samples in the preset voice library may be used as the designated tone voice sample.
The reference tone voice sample may carry a tone mark, and the tone mark may be the code of a person or of a certain voice library; for example, tone mark A denotes tone information whose tone code is A, tone mark B denotes tone information whose tone code is B, and so on.
For example, the plurality of preset sample pairs may include a first preset sample pair consisting of the designated tone voice sample and tone mark A, a second preset sample pair consisting of the designated tone voice sample and tone mark B, and a third preset sample pair consisting of the designated tone voice sample and tone mark C. A minimal sketch of this pairing step follows.
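The sketch below illustrates the pairing step under an assumed data layout: `voice_library` as a list of `(tone_id, wave)` tuples is a hypothetical format, not the one used by the application.

```python
# Minimal sketch of assembling the preset sample pairs (step 101).
# `voice_library` is a hypothetical list of (tone_id, wave) samples.

from collections import defaultdict

def build_sample_pairs(voice_library):
    # Group samples by tone and pick the tone with the most data as
    # the "designated tone" (the fixed source side of every pair).
    by_tone = defaultdict(list)
    for tone_id, wave in voice_library:
        by_tone[tone_id].append(wave)
    designated_id = max(by_tone, key=lambda t: len(by_tone[t]))

    # Pair each designated-tone sample with every other (reference) tone.
    pairs = []
    for spec_wave in by_tone[designated_id]:
        for ref_id in by_tone:
            if ref_id != designated_id:
                pairs.append((ref_id, spec_wave))  # (reference tone mark, sample)
    return designated_id, pairs
```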
102, converting the appointed tone voice sample into converted voice under a reference tone based on a reference tone voice sample of a preset sample pair through a non-parallel voice conversion model, wherein text information of the converted voice under the reference tone is consistent with text information of the appointed tone voice sample.
Referring to fig. 3, fig. 3 is a training schematic diagram of the non-parallel speech conversion model. When the non-parallel speech conversion model is trained, a trained language feature extraction module may be used to extract a language-related feature representation of the original speech, and a prosodic feature module may be used to extract a prosody-related feature representation of the original speech; the language-related feature representation and the prosody-related feature representation, together with a tone mark and the output speech, are input to the non-parallel speech conversion model for training.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating the use of the non-parallel speech conversion model: the reference tone voice sample of a preset sample pair is processed by the non-parallel speech conversion model, and the converted speech under the corresponding reference tone of the preset sample pair is output. The non-parallel voice conversion model is provided in the embodiment of the application in order to generate converted voice with the corresponding tone and semantic content from the language-related feature representation, the prosody-related feature representation and a designated tone mark extracted from the original voice, thereby constructing parallel voice data for training the parallel conversion model.
The non-parallel voice conversion model has a training stage and an application stage. In the training stage, the language feature representation extracted by the language feature extraction model, the prosodic feature representation obtained by the prosodic feature module, the tone mark and the corresponding output voice are input into the neural network model for training. In the application stage, the language feature representation extracted by the language feature extraction model, the prosodic feature representation obtained by the prosodic feature module and the tone mark are input into the trained model to obtain the converted voice. In technical implementations a deep neural network model is generally adopted, and the specific model structure can take many forms, such as a convolutional network, a recurrent neural network, a Transformer, or any combination thereof.
The language feature extraction module is used to obtain a language feature representation that is independent of tone from the input voice. By removing information unrelated to language content in the voice, it extracts only the language information and converts it into a fixed-length vector representation, where the extracted language information accurately reflects the spoken content of the original voice without error. It should be noted that the language feature extraction module needs a neural network model to be implemented, and specific implementations can vary. One way is to train a speech recognition model on a large amount of speech and text and select the output of a specific hidden layer of the model as the language feature representation. Another way is to compress and quantize the speech into a representation over several speech units through unsupervised training, for example with a VQ-VAE model, and restore these speech units to the original speech; in this self-reconstruction training process, the quantization units gradually learn to be speech units independent of tone, and these speech units serve as the language feature representation. Other implementations are also possible; the module is not limited to the above two. In one embodiment, the prosodic feature extraction module is directed at deriving a prosodic feature representation from the input speech and converting it into a vector representation. It ensures that the converted voice is consistent with the original voice in prosodic style, so that the data before and after conversion are completely parallel except for the tone, which facilitates modeling of the parallel conversion model. Technically there are many possible implementations, mainly extraction by signal processing tools and algorithms, for example using common speech features such as fundamental frequency and energy, or features related to speech emotion classification. A possible realization is sketched below.
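As one hedged example of such a signal-processing realization, the following sketch computes frame-level fundamental frequency and energy with librosa; the frequency range, sample rate and hop size are arbitrary illustrative choices, and a production prosody module could use quite different features.

```python
# One possible realization of the prosodic feature module: frame-level
# fundamental frequency (F0) and energy extracted with librosa. This is
# a generic signal-processing sketch, not the patent's exact module.

import numpy as np
import librosa

def prosody_features(path, sr=16000, hop_length=256):
    wave, _ = librosa.load(path, sr=sr)
    # F0 via probabilistic YIN; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(wave, fmin=60.0, fmax=500.0,
                            sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)                         # zero out unvoiced frames
    # Frame energy as RMS over the same hop grid.
    energy = librosa.feature.rms(y=wave, hop_length=hop_length)[0]
    n = min(len(f0), len(energy))
    return np.stack([f0[:n], energy[:n]], axis=-1)  # (frames, 2) prosody vector
```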
103, acquiring the designated tone color phonetic feature of the designated tone color phonetic sample and the corresponding reference tone color phonetic feature of the converted voice under the reference tone color.
In the embodiment of the application, the computer equipment can extract the designated tone color phonetic feature from the designated tone color phonetic sample, and can extract the reference tone color phonetic feature from the converted voice under the reference tone color corresponding to the designated tone color phonetic sample.
104, training a preset parallel voice conversion model based on the appointed tone voice characteristics, the reference tone voice characteristics and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is tone information of the reference tone voice sample.
For example, referring to fig. 5, fig. 5 is a pre-training schematic diagram of the preset parallel voice conversion model: the preset parallel voice conversion model is trained with the designated tone phonetic features, a plurality of reference tone phonetic features and the corresponding reference tone information to obtain the basic parallel voice conversion model. Specifically, from the tone phonetic features of speaker A and tone mark B, the tone phonetic features of speaker B are obtained through the parallel voice conversion model, and the reference tone phonetic features B are used to verify them, i.e. to determine whether the training effect of the model meets expectations. Likewise, from the tone phonetic features of speaker A and tone mark C, the tone phonetic features of speaker C can be obtained through the parallel voice conversion model and verified with the reference tone phonetic features C; and from the tone phonetic features of speaker A and tone mark D, the tone phonetic features of speaker D are obtained and verified with the reference tone phonetic features D. This continues until the training state of the preset parallel voice conversion model meets the training end condition, at which point the trained preset parallel voice conversion model is obtained and used as the basic parallel voice conversion model. One such training step is sketched below.
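One pre-training step of fig. 5 might look as follows in PyTorch. The model interface, the tone indices and the L1 loss are illustrative assumptions, since the application does not fix a concrete network or loss.

```python
# Hedged sketch of one pre-training step (fig. 5) in PyTorch. The model,
# its inputs and the L1 loss are assumptions for illustration only.

import torch
import torch.nn.functional as F

def pretrain_step(parallel_vc, optimizer, spec_feat, ref_feat, ref_tone_id):
    """spec_feat: (T, D) designated-tone features; ref_feat: (T, D) target.

    The reference-tone features produced by the non-parallel model act
    as the supervision signal that "verifies" the model's output.
    """
    optimizer.zero_grad()
    tone = torch.tensor([ref_tone_id])              # tone mark as an index
    pred = parallel_vc(spec_feat.unsqueeze(0), tone)
    loss = F.l1_loss(pred, ref_feat.unsqueeze(0))   # compare with reference features
    loss.backward()
    optimizer.step()
    return loss.item()
```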
105, acquiring the user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating an application of the non-parallel voice conversion model. After the user voice of the target user who requires voice cloning is determined, the non-parallel voice conversion model trained in the pre-training stage may be used to convert the user voice into a voice sample under the specified tone, while the text content and prosody information of the user voice remain unchanged; that is, after conversion, the text content and prosody information of the voice sample under the specified tone are the same as those of the user voice, and only the tone is changed, so as to construct parallel voice data.
106, training the basic parallel voice conversion model based on the user voice, the voice sample under the appointed tone and the appointed tone information to obtain a target parallel voice conversion model corresponding to the target user.
In an embodiment, before the step of training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method may include:
performing feature extraction processing on the user voice through a first feature extraction module to obtain user phonetic features;
and carrying out feature extraction processing on the voice sample under the appointed tone through a second feature extraction module to obtain the appointed tone phonetic feature.
Further, the training of the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user may include:
and training the basic parallel voice conversion model based on the user phonetic features, the appointed tone color phonetic features and the appointed tone color information to obtain a target parallel voice conversion model corresponding to the target user.
Referring to fig. 7, fig. 7 is a schematic diagram of training the basic parallel voice conversion model. After the voice sample under the specified tone is obtained, a voice pair may be formed from the voice sample under the specified tone and the user voice, so as to obtain the user phonetic features of the user voice, the specified tone phonetic features of the voice sample under the specified tone, and the specified tone information, and the basic parallel voice conversion model is trained with them to obtain the target parallel voice conversion model corresponding to the target user. The parallel voice conversion model may use a simple neural network model, such as a single-layer recurrent neural network, or any other model structure that satisfies the above conditions; a minimal example follows.
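As an illustration of such a "simple" model, here is a single-layer recurrent converter with a tone-mark embedding; all dimensions are arbitrary placeholder values, not figures from this application. The `pretrain_step` sketch above would drive exactly this interface.

```python
# A deliberately small parallel conversion model of the kind the text
# mentions (a single recurrent layer plus a tone embedding). Sizes
# and layout are illustrative assumptions.

import torch
import torch.nn as nn

class TinyParallelVC(nn.Module):
    def __init__(self, feat_dim=80, tone_count=128, tone_dim=64, hidden=256):
        super().__init__()
        self.tone_emb = nn.Embedding(tone_count, tone_dim)
        self.rnn = nn.GRU(feat_dim + tone_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, feat_dim)

    def forward(self, feats, tone_id):
        # feats: (B, T, feat_dim); tone_id: (B,)
        emb = self.tone_emb(tone_id)                        # (B, tone_dim)
        emb = emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        out, _ = self.rnn(torch.cat([feats, emb], dim=-1))
        return self.proj(out)                               # converted features
```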
In order to reduce the difficulty of fine tuning the basic parallel voice conversion model, before the step of training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method may include:
acquiring user tone information corresponding to user voice of the target user;
and screening target tone voices from a plurality of preset tone voices in a preset voice library based on the user tone information, and taking the tone information of the target tone voices as appointed tone information, wherein the target tone voices are preset tone voices with highest similarity between the tone information and the user tone information.
Further, the step of "screening a target tone color voice from a plurality of preset tone color voices in a preset voice library based on the tone color information of the user, and taking the tone color information of the target tone color voice as the designated tone color information" may include:
and screening target tone color voices from a plurality of preset tone color voices in the preset voice library based on the tone color information of the user through a speaker recognition model, and taking the tone color information of the target tone color voices as appointed tone color information.
In the embodiment of the application, a speaker recognition (SRE) model can be used to screen out, from the preset voice library, the tone mark closest to the user's tone, thereby reducing the difficulty of subsequent model fine-tuning. For example, if the user's tone is male, the male voice closest to the user's tone is screened out of the preset voice library and used as the specified tone information; if the user's tone is female, the female voice closest to the user's tone is screened out and used as the specified tone information. A speaker recognition (SRE) model is generally used to extract a speaker vector for a specific speaker from that speaker's voice data, and speaker retrieval is then performed on these vectors; a sketch of this screening step follows.
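The screening step amounts to a nearest-neighbor search in speaker-embedding space. In the sketch below, `sre_embed` is a placeholder for any speaker-recognition embedding model, and cosine similarity is one common distance choice assumed here for illustration.

```python
# Sketch of the tone-screening step: pick the library tone whose
# speaker embedding is closest (cosine similarity) to the user's.
# `sre_embed` stands in for any speaker-recognition embedding model.

import numpy as np

def pick_designated_tone(user_wave, library, sre_embed):
    """library: list of (tone_id, wave); returns the closest tone_id."""
    u = sre_embed(user_wave)
    u = u / np.linalg.norm(u)
    best_id, best_sim = None, -1.0
    for tone_id, wave in library:
        v = sre_embed(wave)
        sim = float(np.dot(u, v / np.linalg.norm(v)))  # cosine similarity
        if sim > best_sim:
            best_id, best_sim = tone_id, sim
    return best_id
```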
In an embodiment, after the training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method may include:
acquiring a target text of the voice to be synthesized, and converting the target text into text characters to be processed based on a plurality of preset characters through a text processing module, wherein the plurality of preset characters have a mapping relation with the target text;
inputting the text characters to be processed and the appointed tone information into a voice synthesis model to generate voice to be processed under the appointed tone;
and performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matched with the tone of the target user.
Further, the step of performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate a target synthesized voice matching the tone of the target user may include:
performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to obtain synthesized phonetic features under the user tone;
and processing the synthesized phonetic features through a vocoder module to generate the target synthesized voice matching the tone of the target user (one possible vocoder realization is sketched below).
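The application does not prescribe a particular vocoder. As a dependency-light stand-in, the following sketch inverts a log-mel spectrogram with Griffin-Lim via librosa; real deployments would more likely use a neural vocoder, and the FFT/hop settings here are illustrative assumptions.

```python
# The "vocoder module" turns converted features back into a waveform.
# Griffin-Lim is used here only as a minimal stand-in for a real
# (typically neural) vocoder.

import librosa

def mel_to_wave(mel_db, sr=16000, n_fft=1024, hop_length=256):
    mel = librosa.db_to_power(mel_db)                  # undo log compression
    linear = librosa.feature.inverse.mel_to_stft(mel, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)
```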
In order to convert text into speech of a specified tone, the embodiment of the application also provides a speech synthesis model. When model training is performed on the preset speech synthesis model, it can be trained with existing multi-speaker voices in a preset database, the text data corresponding to those voices, and preset tones; the trained preset speech synthesis model is then used as the target speech synthesis model and stored for use in the voice cloning application stage. Specifically, the pre-training stage of the speech synthesis model inputs a large amount of text, speech and tone mark data into the neural network model for training, and the application stage synthesizes any input text into speech of the specified tone. There are many options for the specific model structure, generally based on end-to-end deep neural network models, including but not limited to the popular Tacotron, FastSpeech, and the like.
In order to give the model cross-language pronunciation capability, the embodiment of the application further provides a text processing module. The text processing module mainly handles special characters such as digits in texts of different languages, and converts texts of different languages into a unified character representation to realize cross-language speech synthesis. The text processing module can convert the text into pinyin or English phonemes, and then convert the pinyin and the English phonemes into International Phonetic Alphabet (IPA) symbols for subsequent processing, as sketched below.
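A toy sketch of this text front-end is shown below. Both mapping tables are tiny hypothetical stand-ins (the IPA values are approximate), meant only to show the two-step shape of the module: normalize special characters such as digits, then map everything onto one shared unit inventory.

```python
# Toy sketch of the text processing module: expand special characters
# such as digits, then map language-specific units (pinyin, English
# phonemes) onto a shared IPA-style inventory. Both tables below are
# tiny illustrative stand-ins for real G2P resources.

DIGIT_WORDS = {"1": "yi", "2": "er", "3": "san"}     # digits read in Chinese (toy)
TO_IPA = {"ni": "ni", "hao": "xau", "yi": "i"}       # approximate IPA (toy)

def text_to_units(tokens):
    """tokens: pre-romanized syllables/phonemes, e.g. ['ni', 'hao', '1']."""
    units = []
    for tok in tokens:
        tok = DIGIT_WORDS.get(tok, tok)              # normalize digits first
        units.append(TO_IPA.get(tok, tok))           # fall back to the token itself
    return units

print(text_to_units(["ni", "hao", "1"]))             # ['ni', 'xau', 'i']
```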
Referring to fig. 8, fig. 8 is a training schematic diagram of the preset speech synthesis model. A text is obtained and input into the text processing module, which processes it to obtain a cross-language text representation; at the same time, phonetic features are extracted from the speech corresponding to the text, and then the phonetic features, the cross-language text representation and the designated tone information are input into the preset speech synthesis model for model training. Other texts, the voices corresponding to those texts and the designated tone information are continuously acquired and input into the preset speech synthesis model to train it, until the training state of the preset speech synthesis model meets the training end condition, and the trained preset speech synthesis model is obtained as the target speech synthesis model.
In order to further explain the processing method of the voice model provided in the embodiment of the present application, an application of the processing method of the voice model in a specific implementation scenario will be described below, with reference to fig. 9, where a specific application scenario is as follows:
(1) Obtain the target text of the voice to be synthesized, input the target text into the text processing module, and process it through the text processing module to obtain a cross-language text representation; then input the cross-language text representation and the specified tone information into the target speech synthesis model to generate the specified tone phonetic features.
(2) Input the specified tone phonetic features into the target parallel voice conversion model, which converts them into user-tone synthesized phonetic features in the user tone corresponding to the target parallel voice conversion model. Then process the user-tone synthesized phonetic features through the vocoder module to generate the target synthesized voice matching the user voice of the target user. The full pipeline is sketched below.
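Putting the pieces together, the application stage of fig. 9 reduces to a short pipeline. Every argument below is a placeholder for the corresponding trained module from the preceding sections, not a concrete API from this application.

```python
# End-to-end voice cloning inference, mirroring fig. 9. All component
# names are placeholders for the trained modules described above.

def clone_speech(text, spec_tone_id, text_frontend, tts_model,
                 target_parallel_vc, vocoder):
    units = text_frontend(text)                       # cross-language units
    spec_feat = tts_model(units, spec_tone_id)        # specified-tone features
    user_feat = target_parallel_vc(spec_feat, spec_tone_id)  # to user tone
    return vocoder(user_feat)                         # waveform in the user's voice
```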
In summary, the embodiment of the present application provides a processing method for a voice model: acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample is different from the tone information of the reference tone voice sample; converting the designated tone voice sample into converted voice under the reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of a preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample; acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone; training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample; acquiring the user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice; and training the basic parallel voice conversion model based on the user voice, the voice sample under the specified tone and the specified tone information to obtain a target parallel voice conversion model corresponding to the target user. The embodiment of the application thus simplifies the training steps of the voice model, improves voice cloning efficiency, and, by pre-training the parallel voice conversion model, effectively improves its robustness and improves the tone quality and pronunciation accuracy of the synthesized voice.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a processing device for a speech model according to an embodiment of the present application, where the device includes:
a first obtaining unit 201, configured to obtain a plurality of preset sample pairs, where each preset sample pair includes a reference tone voice sample and a designated tone voice sample, and tone information of the designated tone voice sample is different from tone information of the reference tone voice sample;
a conversion unit 202, configured to convert, by using a non-parallel speech conversion model, the specified tone speech sample into converted speech in a reference tone based on a reference tone speech sample of a preset sample pair, where text information of the converted speech in the reference tone is consistent with text information of the specified tone speech sample;
a second obtaining unit 203, configured to obtain a designated timbre phonetic feature of the designated timbre voice sample and a reference timbre phonetic feature of the converted voice under the corresponding reference timbre;
a first training unit 204, configured to train a preset parallel voice conversion model based on the specified timbre phonetic feature, the reference timbre phonetic feature and the reference timbre information, to obtain a basic parallel voice conversion model, where the reference timbre information is timbre information of the reference timbre voice sample;
a third obtaining unit 205, configured to obtain a user voice of a target user, input the user voice and preset tone information into the non-parallel voice conversion model, and generate a voice sample under a specified tone, where the text information of the voice sample under the specified tone is consistent with the text information of the user voice;
the second training unit 206 is configured to train the basic parallel speech conversion model based on the user speech, the speech sample under the specified tone and the specified tone information, so as to obtain a target parallel speech conversion model corresponding to the target user.
In some embodiments, the processing means of the speech model comprises:
the first acquisition subunit is used for acquiring the user tone information corresponding to the user voice of the target user;
and the first screening subunit is used for screening target tone voice from a plurality of preset tone voices in a preset voice library based on the tone information of the user, and taking the tone information of the target tone voice as appointed tone information, wherein the target tone voice is the preset tone voice with the highest similarity between the tone information and the tone information of the user.
In some embodiments, the processing means of the speech model comprises:
and the second screening subunit is used for screening target tone color voices from a plurality of preset tone color voices in the preset voice library based on the tone color information of the user through a speaker recognition model, and taking the tone color information of the target tone color voices as appointed tone color information.
In some embodiments, the processing means of the speech model comprises:
the first processing subunit is used for carrying out feature extraction processing on the user voice through the first feature extraction module to obtain user phonetic features;
and the second processing subunit is used for carrying out feature extraction processing on the voice sample under the appointed tone through the second feature extraction module to obtain the appointed tone phonetic feature.
In some embodiments, the processing means of the speech model comprises:
and the training subunit is used for training the basic parallel voice conversion model based on the user phonetic features, the appointed tone phonetic features and the appointed tone information to obtain a target parallel voice conversion model corresponding to the target user.
In some embodiments, the processing means of the speech model comprises:
a determining subunit, configured to determine a target voice sample from a plurality of preset voice samples in a preset voice library as the designated tone voice sample, where the target voice sample is the voice sample with the largest tone data amount among the plurality of preset voice samples;
and the first generation subunit is used for generating a plurality of preset sample pairs based on the appointed tone voice sample and each reference tone voice sample in the plurality of reference tone voice samples.
In some embodiments, the processing means of the speech model comprises:
the second acquisition subunit is used for acquiring the target text of the voice to be synthesized, and converting the target text into text characters to be processed based on a plurality of preset characters through the text processing module, wherein the plurality of preset characters have a mapping relation with the target text;
the second generation subunit is used for inputting the text characters to be processed and the appointed tone information into a voice synthesis model to generate voice to be processed under appointed tone;
and the third generation subunit is used for carrying out voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matched with the tone of the target user.
In some embodiments, the processing means of the speech model comprises:
the third processing subunit is used for performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to obtain the synthetic phonetic feature of the user under the tone;
and the fourth processing subunit is used for processing the synthesized phonetic features through the vocoder module to generate target synthesized voice matched with the tone of the target user.
The embodiment of the application provides a processing device for a voice model. The first obtaining unit 201 acquires a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample is different from the tone information of the reference tone voice sample; the conversion unit 202 converts the designated tone voice sample into converted voice under the reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of a preset sample pair, the text information of the converted voice under the reference tone being consistent with the text information of the designated tone voice sample; the second obtaining unit 203 acquires the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone; the first training unit 204 trains a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample; the third obtaining unit 205 acquires the user voice of a target user, inputs the user voice and preset tone information into the non-parallel voice conversion model, and generates a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice; and the second training unit 206 trains the basic parallel voice conversion model based on the user voice, the voice sample under the specified tone and the specified tone information to obtain a target parallel voice conversion model corresponding to the target user. The embodiment of the application thus simplifies the training steps of the voice model, improves voice cloning efficiency, and, by pre-training the parallel voice conversion model, effectively improves its robustness and improves the tone quality and pronunciation accuracy of the synthesized voice.
Correspondingly, the embodiment of the application also provides a computer device, which can be a terminal or a server; the terminal can be a terminal device such as a smart phone, a tablet computer, a notebook computer, a touch screen, a game console, a personal computer (PC) or a personal digital assistant (PDA). As shown in fig. 11, fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device 300 includes a processor 301 having one or more processing cores, a memory 302 having one or more computer-readable storage media, and a computer program stored on the memory 302 and executable on the processor. The processor 301 is electrically connected to the memory 302. It will be appreciated by those skilled in the art that the computer device structure shown in the figures does not limit the computer device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
Processor 301 is a control center of computer device 300 and utilizes various interfaces and lines to connect various portions of the overall computer device 300, and to perform various functions of computer device 300 and process data by running or loading software programs and/or modules stored in memory 302 and invoking data stored in memory 302, thereby performing overall monitoring of computer device 300.
In the embodiment of the present application, the processor 301 in the computer device 300 loads the instructions corresponding to the processes of one or more application programs into the memory 302 according to the following steps, and the processor 301 executes the application programs stored in the memory 302, so as to implement various functions:
acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and tone information of the designated tone voice sample is different from tone information of the reference tone voice sample;
converting the appointed tone voice sample into converted voice under a reference tone based on a reference tone voice sample of a preset sample pair through a non-parallel voice conversion model, wherein text information of the converted voice under the reference tone is consistent with text information of the appointed tone voice sample;
acquiring the designated tone phonetic feature of the designated tone voice sample and the corresponding reference tone phonetic feature of the converted voice under the reference tone;
training a preset parallel voice conversion model based on the appointed tone voice characteristics, the reference tone voice characteristics and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is tone information of the reference tone voice sample;
acquiring user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a specified tone, wherein the text information of the voice sample under the specified tone is consistent with the text information of the user voice;
and training the basic parallel voice conversion model based on the user voice, the voice sample under the appointed tone and the appointed tone information to obtain a target parallel voice conversion model corresponding to the target user.
In an embodiment, before training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
acquiring user tone information corresponding to user voice of the target user;
and screening a target tone voice from a plurality of preset tone voices in a preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information, wherein the target tone voice is the preset tone voice whose tone information has the highest similarity with the user tone information.
In an embodiment, screening the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information, includes:
and screening, through a speaker recognition model, the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information (illustrated by the sketch below).
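As an illustration of this screening step, the sketch below selects the preset tone whose speaker embedding is closest to the user's by cosine similarity. Representing tone information as a fixed-size speaker embedding (here 192-dimensional) is an assumption; the application only requires a speaker recognition model that can rank similarity.

import numpy as np

def pick_designated_tone(user_emb, preset_embs):
    """Return the id of the preset voice whose tone embedding has the
    highest cosine similarity to the user's embedding."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(preset_embs, key=lambda vid: cos(user_emb, preset_embs[vid]))

# Hypothetical embeddings; a real system would obtain them from the
# speaker recognition model.
presets = {"voice_a": np.random.randn(192), "voice_b": np.random.randn(192)}
print(pick_designated_tone(np.random.randn(192), presets))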
In an embodiment, before training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
performing feature extraction processing on the user voice through a first feature extraction module to obtain user phonetic features;
and performing feature extraction processing on the voice sample under the designated tone through a second feature extraction module to obtain the designated tone phonetic features.
In an embodiment, training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user includes:
and training the basic parallel voice conversion model based on the user phonetic features, the designated tone phonetic features and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user (a feature extraction sketch follows).
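A possible realization of the two feature extraction modules is sketched below, assuming log-mel spectrograms serve as the phonetic features; the application leaves the concrete feature type open, so the librosa-based extractor and the file paths are illustrative only.

import numpy as np
import librosa

def extract_phonetic_features(wav_path, sr=16000, n_mels=80):
    """Load a waveform and return a (frames, n_mels) log-mel matrix."""
    y, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return np.log(mel + 1e-6).T

# First module: the user voice; second module: the designated tone sample
# produced by the non-parallel model. The paths are hypothetical.
# user_feats = extract_phonetic_features("user.wav")
# designated_feats = extract_phonetic_features("designated.wav")

In the fine-tuning pairs, the designated tone features act as the source side and the user phonetic features as the target side, matching the inference direction described later.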
In an embodiment, before acquiring the plurality of preset sample pairs, the method further includes:
determining a target voice sample from a plurality of preset voice samples in a preset voice library as a designated tone voice sample, wherein the target voice sample is a voice sample with the largest tone data amount in the plurality of preset voice samples;
and generating the plurality of preset sample pairs based on the designated tone voice sample and each reference tone voice sample in the plurality of reference tone voice samples (see the pairing sketch below).
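The pair construction just described can be as simple as the sketch below, where the preset voice with the most data supplies the designated tone voice sample and is paired with every reference tone voice sample; the dictionary layout is an assumption for illustration.

def build_preset_pairs(preset_samples, reference_samples):
    """preset_samples: speaker id -> list of utterances. The speaker with
    the largest amount of data becomes the designated tone sample source."""
    designated_id = max(preset_samples, key=lambda s: len(preset_samples[s]))
    designated = preset_samples[designated_id]
    return [(ref, designated) for ref in reference_samples]

presets = {"spk1": ["u1", "u2", "u3"], "spk2": ["u1"]}
print(build_preset_pairs(presets, ["ref_a", "ref_b"]))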
In an embodiment, after training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
acquiring a target text of the voice to be synthesized, and converting the target text into text characters to be processed through a text processing module based on a plurality of preset characters, wherein the plurality of preset characters have a mapping relation with the target text;
inputting the text characters to be processed and the designated tone information into a voice synthesis model to generate voice to be processed under the designated tone;
and performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user.
In an embodiment, performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user includes:
performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to obtain synthesized phonetic features under the user tone;
and processing the synthesized phonetic features through a vocoder module to generate target synthesized voice matching the tone of the target user (a sketch of this inference path follows).
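Putting the synthesis path together, the sketch below chains the four modules named above: text processing, speech synthesis under the designated tone, the target parallel conversion model, and the vocoder. Every function here is a stub standing in for a component the application does not concretely specify.

def text_to_units(text, charset):
    """Text processing module stub: map the target text onto the preset
    character set (here, simply keep covered characters)."""
    return [c for c in text if c in charset]

def synthesize(units, designated_tone):
    """Speech synthesis model stub: units + designated tone information
    -> to-be-processed features under the designated tone."""
    return [(u, designated_tone) for u in units]

def parallel_convert(feats, user_tone):
    """Target parallel conversion model stub: re-timbre the features
    toward the target user's tone."""
    return [(u, user_tone) for u, _ in feats]

def vocode(feats):
    """Vocoder module stub: synthesized phonetic features -> waveform."""
    return f"waveform[{len(feats)} frames]"

charset = set("abcdefghijklmnopqrstuvwxyz ")
pending = synthesize(text_to_units("hello world", charset), "designated")
print(vocode(parallel_convert(pending, "target_user")))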
The specific implementation of each of the above operations may refer to the previous embodiments and is not repeated here.
Optionally, as shown in fig. 11, the computer device 300 further includes: a touch display 303, a radio frequency circuit 304, an audio circuit 305, an input unit 306, and a power supply 307. The processor 301 is electrically connected to the touch display 303, the radio frequency circuit 304, the audio circuit 305, the input unit 306, and the power supply 307, respectively. Those skilled in the art will appreciate that the computer device structure shown in fig. 11 does not limit the computer device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The touch display 303 may be used to display a graphical user interface and to receive operation instructions generated by a user acting on the graphical user interface. The touch display 303 may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to the user, as well as various graphical user interfaces of the computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, or the like. The touch panel may be used to collect the user's touch operations on or near it (such as operations performed on or near the touch panel with a finger, a stylus, or any other suitable object or accessory) and to generate corresponding operation instructions that trigger the corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position touched by the user, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 301, and can receive and execute commands sent by the processor 301. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it passes the operation to the processor 301 to determine the type of the touch event, and the processor 301 then provides a corresponding visual output on the display panel according to the type of the touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the touch display 303 to implement the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions; that is, the touch display 303 may also implement an input function as part of the input unit 306.
In the embodiment of the present application, the processor 301 executes an application program to generate a graphical interface on the touch display screen 303. The touch display 303 is used for presenting a graphical interface and receiving an operation instruction generated by a user acting on the graphical interface.
The radio frequency circuit 304 may be used to transmit and receive radio frequency signals, so as to establish wireless communication with a network device or another computer device.
The audio circuit 305 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On one hand, the audio circuit 305 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts a collected sound signal into an electrical signal, which is received by the audio circuit 305 and converted into audio data; the audio data is then processed by the processor 301 and sent, for example, to another computer device via the radio frequency circuit 304, or output to the memory 302 for further processing. The audio circuit 305 may also include an earphone jack to provide communication between a peripheral earphone and the computer device.
The input unit 306 may be used to receive input numbers, character information, or user characteristic information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
The power supply 307 supplies power to the various components of the computer device 300. Optionally, the power supply 307 may be logically connected to the processor 301 through a power management system, so that charging, discharging and power consumption are managed through the power management system. The power supply 307 may also include any one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 11, the computer device 300 may further include a camera, a sensor, a wireless fidelity module, a Bluetooth module, and the like, which are not described herein.
The descriptions of the foregoing embodiments each have their own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
As can be seen from the above, the computer device provided in this embodiment acquires a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample differs from the tone information of the reference tone voice sample; converts the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of the preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample; acquires the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone; trains a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample; acquires user voice of a target user, inputs the user voice and preset tone information into the non-parallel voice conversion model, and generates a voice sample under a designated tone, wherein the text information of the voice sample under the designated tone is consistent with the text information of the user voice; and trains the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain a target parallel voice conversion model corresponding to the target user. In this way, the training steps of the voice model can be simplified and the voice cloning efficiency can be improved; moreover, pre-training the parallel voice conversion model can effectively improve its robustness as well as the tone quality and pronunciation accuracy of the synthesized voice.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments may be completed by instructions, or by instructions controlling related hardware, and the instructions may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer-readable storage medium in which a plurality of computer programs are stored, where the computer programs can be loaded by a processor to perform the steps in any of the speech model processing methods provided by the embodiments of the present application. For example, the computer program may perform the following steps:
acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample differs from the tone information of the reference tone voice sample;
converting the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of the preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample;
acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone;
training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample;
acquiring user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a designated tone, wherein the text information of the voice sample under the designated tone is consistent with the text information of the user voice;
and training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain a target parallel voice conversion model corresponding to the target user.
In an embodiment, before training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
acquiring user tone information corresponding to the user voice of the target user;
and screening a target tone voice from a plurality of preset tone voices in a preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information, wherein the target tone voice is the preset tone voice whose tone information has the highest similarity with the user tone information.
In an embodiment, screening the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information, includes:
and screening, through a speaker recognition model, the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information.
In an embodiment, before training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
performing feature extraction processing on the user voice through a first feature extraction module to obtain user phonetic features;
and performing feature extraction processing on the voice sample under the designated tone through a second feature extraction module to obtain the designated tone phonetic features.
In an embodiment, training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user includes:
and training the basic parallel voice conversion model based on the user phonetic features, the designated tone phonetic features and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user.
In an embodiment, before acquiring the plurality of preset sample pairs, the method further includes:
determining a target voice sample from a plurality of preset voice samples in a preset voice library as a designated tone voice sample, wherein the target voice sample is a voice sample with the largest tone data amount in the plurality of preset voice samples;
and generating the plurality of preset sample pairs based on the designated tone voice sample and each reference tone voice sample in the plurality of reference tone voice samples.
In an embodiment, after training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user, the method further includes:
acquiring a target text of the voice to be synthesized, and converting the target text into text characters to be processed through a text processing module based on a plurality of preset characters, wherein the plurality of preset characters have a mapping relation with the target text;
inputting the text characters to be processed and the designated tone information into a voice synthesis model to generate voice to be processed under the designated tone;
and performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user.
In an embodiment, performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user includes:
performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to obtain synthesized phonetic features under the user tone;
and processing the synthesized phonetic features through a vocoder module to generate target synthesized voice matching the tone of the target user.
The specific implementation of each of the above operations may refer to the previous embodiments and is not repeated here.
The storage medium may include: a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
Since the computer program stored in the storage medium can perform the steps in any of the speech model processing methods provided in the embodiments of the present application, it can realize the following: acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample differs from the tone information of the reference tone voice sample; converting the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of the preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample; acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone; training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample; acquiring user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a designated tone, wherein the text information of the voice sample under the designated tone is consistent with the text information of the user voice; and training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain a target parallel voice conversion model corresponding to the target user. In this way, the training steps of the voice model can be simplified and the voice cloning efficiency can be improved; moreover, pre-training the parallel voice conversion model can effectively improve its robustness as well as the tone quality and pronunciation accuracy of the synthesized voice.
The descriptions of the foregoing embodiments each have their own emphasis; for parts of one embodiment that are not described in detail, reference may be made to the related descriptions of other embodiments.
The foregoing describes in detail the speech model processing method, apparatus, computer device and storage medium provided in the embodiments of the present application. Specific examples are used herein to explain the principles and implementations of the present application, and the descriptions of the above embodiments are only intended to help understand the technical solutions and core ideas of the present application. Those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and replacements do not cause the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method for processing a speech model, comprising:
acquiring a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample differs from the tone information of the reference tone voice sample;
converting the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of the preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample;
acquiring the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone;
training a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample;
acquiring user voice of a target user, inputting the user voice and preset tone information into the non-parallel voice conversion model, and generating a voice sample under a designated tone, wherein the text information of the voice sample under the designated tone is consistent with the text information of the user voice;
and training the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain a target parallel voice conversion model corresponding to the target user.
2. The method according to claim 1, further comprising, before training the basic parallel speech conversion model based on the user speech, the speech sample under the designated tone, and the designated tone information to obtain a target parallel speech conversion model corresponding to the target user:
acquiring user tone information corresponding to user voice of the target user;
and screening a target tone voice from a plurality of preset tone voices in a preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information, wherein the target tone voice is the preset tone voice whose tone information has the highest similarity with the user tone information.
3. The method for processing a voice model according to claim 2, wherein screening the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information and taking the tone information of the target tone voice as the designated tone information comprises:
and screening, through a speaker recognition model, the target tone voice from the plurality of preset tone voices in the preset voice library based on the user tone information, and taking the tone information of the target tone voice as the designated tone information.
4. The method according to claim 1, further comprising, before training the basic parallel speech conversion model based on the user speech, the speech sample under the designated tone, and the designated tone information to obtain a target parallel speech conversion model corresponding to the target user:
performing feature extraction processing on the user voice through a first feature extraction module to obtain user phonetic features;
and performing feature extraction processing on the voice sample under the designated tone through a second feature extraction module to obtain the designated tone phonetic features.
5. The method for processing a speech model according to claim 4, wherein training the basic parallel speech conversion model based on the user speech, the speech sample under the designated tone and the designated tone information to obtain the target parallel speech conversion model corresponding to the target user comprises:
and training the basic parallel voice conversion model based on the user phonetic features, the designated tone phonetic features and the designated tone information to obtain the target parallel voice conversion model corresponding to the target user.
6. The method for processing a speech model according to claim 1, further comprising, before acquiring the plurality of preset pairs of samples:
determining a target voice sample from a plurality of preset voice samples in a preset voice library as a designated tone voice sample, wherein the target voice sample is a voice sample with the largest tone data amount in the plurality of preset voice samples;
and generating the plurality of preset sample pairs based on the designated tone voice sample and each reference tone voice sample in the plurality of reference tone voice samples.
7. The method according to claim 1, wherein after training the basic parallel speech conversion model based on the user speech, the speech sample under the designated tone, and the designated tone information to obtain a target parallel speech conversion model corresponding to the target user, further comprising:
acquiring a target text of a voice to be synthesized, and converting the target text into text characters to be processed based on a plurality of preset characters through a text processing module, wherein the preset characters and the target text have a mapping relation;
inputting the text characters to be processed and the designated tone information into a voice synthesis model to generate voice to be processed under the designated tone;
and performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user.
8. The method for processing a voice model according to claim 7, wherein performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to generate target synthesized voice matching the tone of the target user comprises:
performing voice conversion processing on the voice to be processed through the target parallel voice conversion model to obtain synthesized phonetic features under the user tone;
and processing the synthesized phonetic features through a vocoder module to generate target synthesized voice matched with the tone of the target user.
9. A processing apparatus for a speech model, comprising:
a first acquiring unit, configured to acquire a plurality of preset sample pairs, wherein each preset sample pair comprises a reference tone voice sample and a designated tone voice sample, and the tone information of the designated tone voice sample differs from the tone information of the reference tone voice sample;
a conversion unit, configured to convert the designated tone voice sample into converted voice under a reference tone through a non-parallel voice conversion model, based on the reference tone voice sample of the preset sample pair, wherein the text information of the converted voice under the reference tone is consistent with the text information of the designated tone voice sample;
a second acquiring unit, configured to acquire the designated tone phonetic features of the designated tone voice sample and the corresponding reference tone phonetic features of the converted voice under the reference tone;
a first training unit, configured to train a preset parallel voice conversion model based on the designated tone phonetic features, the reference tone phonetic features and the reference tone information to obtain a basic parallel voice conversion model, wherein the reference tone information is the tone information of the reference tone voice sample;
a third acquiring unit, configured to acquire user voice of a target user, input the user voice and preset tone information into the non-parallel voice conversion model, and generate a voice sample under a designated tone, wherein the text information of the voice sample under the designated tone is consistent with the text information of the user voice;
and a second training unit, configured to train the basic parallel voice conversion model based on the user voice, the voice sample under the designated tone and the designated tone information to obtain a target parallel voice conversion model corresponding to the target user.
10. A computer device, characterized in that it comprises a memory in which a computer program is stored and a processor which performs the steps in the method of processing a speech model according to any of claims 1 to 8 by calling the computer program stored in the memory.
11. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, which is adapted to be loaded by a processor for performing the steps in the method of processing a speech model according to any of claims 1 to 8.
CN202211502645.6A 2022-11-28 2022-11-28 Speech model processing method, device, computer equipment and storage medium Pending CN116092466A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211502645.6A CN116092466A (en) 2022-11-28 2022-11-28 Speech model processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116092466A (en) 2023-05-09

Family

ID=86212684

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211502645.6A Pending CN116092466A (en) 2022-11-28 2022-11-28 Speech model processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116092466A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination