WO2024001307A1 - Voice cloning method and apparatus, and related device - Google Patents

Voice cloning method and apparatus, and related device

Info

Publication number
WO2024001307A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
scene
text
corpus
scenes
Application number
PCT/CN2023/081526
Other languages
French (fr)
Chinese (zh)
Inventor
陈飞扬
王喆锋
段新宇
怀宝兴
Original Assignee
华为云计算技术有限公司
Priority claimed from CN202211071940.0A external-priority patent/CN117373432A/en
Application filed by 华为云计算技术有限公司
Publication of WO2024001307A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present application provides a voice cloning method, comprising: determining a target scene and, according to the target scene, determining a target corpus text belonging to the target scene; then determining audio of a target subject according to the target corpus text, where the speech content of the audio matches the content of the target corpus text; and using the target corpus text and the audio of the target subject to train a voice cloning model corresponding to the target scene, where the voice cloning model is used to output audio that simulates the pronunciation of the target subject in the target scene. Because the voice cloning model is trained on the target subject's pronunciation of corpus text from the target scene, the new speech the model outputs for a given text better matches the target subject's real pronunciation in that scene in timbre, prosody, pronunciation style and the like, effectively improving the voice cloning effect. In addition, the present application further provides a corresponding apparatus and related devices.

Description

A voice cloning method, apparatus and related device
This application claims priority to Chinese patent application No. 202210778187.2, entitled "A voice cloning method, apparatus and related device", filed with the China National Intellectual Property Administration on June 29, 2022, and to Chinese patent application No. 202211071940.0, entitled "A voice cloning method, apparatus and related device", filed with the China National Intellectual Property Administration on September 2, 2022, both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a voice cloning method, apparatus and related device.
Background
Voice cloning is a technique that, given original speech of a target subject (for example, a human to be cloned), generates new speech similar to the original in timbre and other pronunciation characteristics, so as to reproduce the target subject's pronunciation. It is widely used in scenarios such as virtual humans, audiobooks and video creation.
However, current voice cloning technology can only reproduce the timbre of the target subject's voice in the generated speech; it struggles to match the subject's actual pronunciation in real scenes, resulting in a poor cloning effect.
Summary
In view of this, embodiments of this application provide a voice cloning method to improve the voice cloning effect for a target subject. This application also provides a corresponding apparatus, computing device cluster, computer-readable storage medium and computer program product.
In a first aspect, embodiments of this application provide a voice cloning method that may be performed by a voice cloning apparatus. Specifically, the voice cloning apparatus determines a target scene, for example by taking a story scene specified by a user as the target scene, and determines, according to the target scene, a target corpus text belonging to that scene. It then determines audio of a target subject according to the target corpus text, the speech content of the audio matching the content of the target corpus text. The voice cloning apparatus uses the target corpus text and the audio of the target subject to train a voice cloning model corresponding to the target scene, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
Because the voice cloning model is trained on the target subject's recordings of corpus text from the target scene, the new speech the model outputs for a given text better matches the subject's real pronunciation in that scene in timbre, prosody and pronunciation style, which effectively improves the voice cloning effect.
In practice, the above approach can be used to generate voice cloning models that simulate the prosody and style of various subjects in various scenes, so that these models improve the realism and diversity of voice cloning.
Further, after training the voice cloning model, the voice cloning apparatus may use it to output audio corresponding to a piece of text, thereby cloning the target subject's voice.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene. For example, when the target scene is a story scene, the target corpus text may be corpus text of story content. Illustratively, the target scene may be any one of a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene may be a scene obtained by classification according to emotion type, such as a sad scene or a happy scene. In practice, the target scene may also be any other applicable scene.
In a possible implementation, when determining the corpus text belonging to the target scene, the voice cloning apparatus may first obtain the pinyin distribution of multiple corpus texts belonging to the target scene, for example the count distribution of each pinyin syllable across those texts. The apparatus may then select, according to that pinyin distribution, target corpus texts from the multiple corpus texts, where the number of target corpus texts is smaller than the number of the multiple corpus texts, and where the pinyin distribution of the target corpus texts and that of the multiple corpus texts satisfy a preset condition, for example that the variance or standard deviation between the two distributions is below a threshold. Because the pinyin distributions of corpus texts usually differ between scenes, the pinyin distribution of each scene can serve as a representative feature of that scene. Selecting the target corpus text on the basis of pinyin distribution therefore ensures that it matches the corpus characteristics of the scene, and training the voice cloning model on such text improves the model's cloning effect.
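The application does not fix a concrete selection algorithm, so the following is only a minimal Python sketch of one way the step above could be realized. It assumes each corpus text has already been converted to a list of pinyin syllables (a real system might use a grapheme-to-pinyin tool for this), and it uses the root-mean-square difference between normalized syllable frequencies as the "preset condition":

```python
from collections import Counter
import math

def pinyin_distribution(texts):
    """Normalized frequency of each pinyin syllable over a list of texts,
    where each text is a list of pinyin syllables."""
    counts = Counter(syl for text in texts for syl in text)
    total = sum(counts.values())
    return {syl: n / total for syl, n in counts.items()} if total else {}

def distribution_distance(p, q):
    """Root-mean-square difference over the union of syllables in p and q."""
    syllables = set(p) | set(q)
    return math.sqrt(
        sum((p.get(s, 0.0) - q.get(s, 0.0)) ** 2 for s in syllables) / len(syllables)
    )

def select_target_corpus(texts, k, threshold=0.05):
    """Greedily pick k texts whose joint pinyin distribution stays close to
    the distribution of the full corpus, and report whether the preset
    condition (distance below the threshold) is met."""
    target = pinyin_distribution(texts)
    chosen, remaining = [], list(texts)
    for _ in range(k):
        best = min(
            remaining,
            key=lambda t: distribution_distance(pinyin_distribution(chosen + [t]), target),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen, distribution_distance(pinyin_distribution(chosen), target) <= threshold
```

The greedy loop is one of many possible strategies; the claim only requires that the selected subset's distribution satisfy the preset condition, not any particular search procedure.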
In a possible implementation, when determining the corpus text belonging to the target scene, the voice cloning apparatus may select, from multiple corpus texts belonging to the target scene, a target corpus text in which the proportion of professional terminology exceeds a ratio threshold. After the voice cloning model is trained on such text, the model's pronunciation of professional terms in its output audio is more fluent and closer to the target subject's real pronunciation of those terms, which improves the voice cloning effect.
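As an illustration only (again, the application does not specify an algorithm), the term-ratio filter above could be sketched as follows; the domain-term lexicon and the tokenization of each text are assumed inputs, since a real system would need a domain dictionary and a proper Chinese tokenizer:

```python
def term_ratio(tokens, term_set):
    """Fraction of tokens in one text that are domain-specific terms."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in term_set) / len(tokens)

def select_by_term_ratio(tokenized_texts, term_set, ratio_threshold=0.1):
    """Keep only the texts whose professional-term ratio exceeds the threshold."""
    return [t for t in tokenized_texts if term_ratio(t, term_set) > ratio_threshold]
```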
In a possible implementation, when determining, according to the target corpus text, the audio of the target subject belonging to the target scene, the voice cloning apparatus may generate a recording interface that presents the target corpus text to the target subject, so that the subject can read the presented text aloud. Correspondingly, the voice cloning apparatus records the subject's pronunciation to obtain the subject's audio. In this way, the apparatus obtains the audio by capturing the subject's pronunciation, and the voice cloning model can subsequently be trained on the captured audio.
In a possible implementation, when determining, according to the target corpus text, the audio of the target subject belonging to the target scene, the voice cloning apparatus may obtain multiple audio recordings of the target subject speaking in the target scene and determine, from among them, the audio whose speech content matches the content of the target corpus text. For example, the apparatus may obtain from the network multiple recordings of the target subject speaking in public (and belonging to the target scene) and determine, by content matching, the audio that matches the target corpus text in content. In this way, after the user indicates the target scene, the target subject no longer needs to interact with the voice cloning apparatus by making recordings, which simplifies the interaction required for voice cloning and improves the user experience.
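One hypothetical way to implement the content-matching step above, assuming transcripts of the candidate recordings are available (for example from an ASR system, which is assumed rather than implemented here), is to rank clips by string similarity between transcript and target corpus text:

```python
import difflib

def best_matching_audio(clips, target_text, min_similarity=0.8):
    """clips: list of (audio_id, transcript) pairs.
    Returns the audio_id whose transcript best matches target_text,
    or None if no transcript is similar enough."""
    best_id, best_score = None, 0.0
    for audio_id, transcript in clips:
        score = difflib.SequenceMatcher(None, transcript, target_text).ratio()
        if score > best_score:
            best_id, best_score = audio_id, score
    return best_id if best_score >= min_similarity else None
```

A production system would more likely compare phone or character sequences with an edit-distance tolerance tuned to ASR error rates; the stdlib `SequenceMatcher` is used here only to keep the sketch self-contained.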
In a possible implementation, when determining the target scene, the voice cloning apparatus may generate a scene configuration interface that presents multiple candidate scenes to the user for selection, so that the apparatus can determine the target scene the user selects from among the candidates. The voice cloning apparatus can thus determine the pronunciation scene for voice cloning based on the user's choice, which widens the range of selectable cloning scenes and improves the user experience.
In a possible implementation, when determining the target scene, the voice cloning apparatus may generate a scene configuration interface that prompts the user to input an identifier (such as a name) of a user-defined target scene together with corpus text belonging to that scene, so that, in response to the user's operations on the interface, the apparatus obtains the identifier of the user-defined target scene and the corresponding corpus text. The voice cloning apparatus can thus support user-defined pronunciation scenes, which increases the flexibility of voice cloning and improves the user experience.
In a possible implementation, the voice cloning apparatus may also generate a test interface that prompts the user to input text. In response to the user's operations on the test interface, the apparatus obtains the target text input by the user, feeds it to the voice cloning model and obtains the audio the model outputs. The user can then judge, from that audio, how well the model clones the target subject's pronunciation in the target scene, and if the cloning effect is poor it can be further improved, for example by retraining the model.
In a second aspect, embodiments of this application further provide a voice cloning method that may be performed by a voice cloning apparatus. Specifically, the voice cloning apparatus receives a target scene and target text input by a user, for example a story scene together with the story text. It then determines, according to the target scene, the voice cloning model corresponding to that scene and, based on that model, outputs target audio corresponding to the target text, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
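The dispatch described in this aspect can be pictured as a registry keyed by (subject, scene). The sketch below is an assumption about structure rather than the application's actual implementation, and it stands in an arbitrary callable for the trained cloning model:

```python
class VoiceCloneRegistry:
    """Maps each (subject, scene) pair to its trained cloning model and
    dispatches synthesis requests to the matching model."""

    def __init__(self):
        self._models = {}

    def register(self, subject, scene, model):
        # model: any callable taking text and returning audio (placeholder).
        self._models[(subject, scene)] = model

    def synthesize(self, subject, scene, text):
        model = self._models.get((subject, scene))
        if model is None:
            raise KeyError(f"no cloning model for subject={subject!r}, scene={scene!r}")
        return model(text)
```

Keying on the scene as well as the subject is the point of the method: the same subject may have several models, one per scene, each reflecting that scene's prosody and style.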
In this way, the new speech the voice cloning model outputs for the target text better matches the target subject's real pronunciation in the target scene in timbre, prosody and pronunciation style, which effectively improves the voice cloning effect.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene. For example, when the target scene is a story scene, the target corpus text may be corpus text of story content. Illustratively, the target scene may be any one of a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene may be a scene obtained by classification according to emotion type, such as a sad scene or a happy scene. In practice, the target scene may also be any other applicable scene.
In a possible implementation, when receiving the target scene and target text input by the user, the voice cloning apparatus may generate a speech synthesis interface that presents multiple candidate scenes to the user, determine the target scene the user selects from among them, and receive the target text the user inputs on the interface. The apparatus can thus support user selection of both scene and text.
In a possible implementation, the speech synthesis interface presented by the voice cloning apparatus may also present multiple candidate subjects to the user, so that the user can select one of them as the target subject. The apparatus then clones the voice of the subject the user selects, which increases the flexibility and selectability of voice cloning and improves the user experience.
In a third aspect, embodiments of this application further provide a voice cloning apparatus, comprising: a data acquisition module configured to determine a target scene, determine, according to the target scene, a target corpus text belonging to that scene, and determine, according to the target corpus text, audio of a target subject, the speech content of the audio matching the content of the target corpus text; and a model training module configured to train, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene, and the target scene includes any one of the following: a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene is a scene obtained by classification according to emotion type.
In a possible implementation, the data acquisition module is configured to: obtain the pinyin distribution of multiple corpus texts belonging to the target scene; and select, according to that distribution, the target corpus texts from the multiple corpus texts, where the number of target corpus texts is smaller than the number of the multiple corpus texts and the pinyin distribution of the target corpus texts and that of the multiple corpus texts satisfy a preset condition.
In a possible implementation, the data acquisition module is configured to select the target corpus text from multiple corpus texts belonging to the target scene, the proportion of professional terminology in the target corpus text exceeding a ratio threshold.
In a possible implementation, the data acquisition module is configured to: generate a recording interface that presents the target corpus text to the target subject; and record the target subject's reading of the target corpus text to obtain the target subject's audio.
In a possible implementation, the data acquisition module is configured to: obtain multiple audio recordings of the target subject speaking in the target scene; and determine, from among them, the audio whose speech content matches the content of the target corpus text.
In a possible implementation, the data acquisition module is configured to: generate a scene configuration interface that presents multiple candidate scenes to a user; and determine the target scene the user selects from among the candidates.
In a possible implementation, the data acquisition module is configured to: generate a scene configuration interface that prompts for input of an identifier of a user-defined target scene and corpus text belonging to that scene; and, in response to the user's operations on the interface, obtain the identifier of the user-defined target scene and the corpus text belonging to it.
In a possible implementation, the voice cloning apparatus further comprises a voice cloning module configured to: generate a test interface that prompts the user to input text; obtain, in response to the user's operations on the test interface, the target text input by the user; and input the target text to the voice cloning model to obtain the audio output by the model.
It should be noted that the voice cloning apparatus provided in the third aspect corresponds to the voice cloning method provided in the first aspect; therefore, for the technical effects of the third aspect and any of its implementations, reference may be made to the first aspect and its corresponding implementations.
In a fourth aspect, embodiments of this application further provide a voice cloning apparatus comprising: a data acquisition module configured to receive a target scene and target text input by a user; and a voice cloning module configured to determine, according to the target scene, the voice cloning model corresponding to that scene and, based on the model, output target audio corresponding to the target text, the model being used to output audio that simulates a target subject's pronunciation in the target scene.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene, and the target scene includes any one of the following: a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene is a scene obtained by classification according to emotion type.
In a possible implementation, the data acquisition module is configured to: generate a speech synthesis interface that presents multiple candidate scenes to the user; determine the target scene the user selects from among the candidates; and receive the target text the user inputs on the speech synthesis interface.
In a possible implementation, the speech synthesis interface is further used to present multiple candidate subjects to the user, and the data acquisition module is further configured to determine, from among the candidates, the target subject the user selects.
It should be noted that the voice cloning apparatus provided in the fourth aspect corresponds to the voice cloning method provided in the second aspect; therefore, for the technical effects of the fourth aspect and any of its implementations, reference may be made to the second aspect and its corresponding implementations.
In a fifth aspect, this application provides a computing device comprising a processor and a memory. The memory is configured to store instructions, and the processor executes the instructions stored in the memory so that the computing device performs the voice cloning method of the first aspect or any possible implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof. It should be noted that the memory may be integrated in the processor or independent of it. The computing device may further include a bus, through which the processor is connected to the memory, and the memory may include readable memory and random access memory.
In a sixth aspect, this application provides a computing device cluster comprising at least one computing device, the at least one computing device comprising at least one processor and at least one memory. The at least one memory is configured to store instructions, and the at least one processor executes the instructions stored in the at least one memory so that the computing device cluster performs the voice cloning method of the first aspect or any possible implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof. It should be noted that the memory may be integrated in the processor or independent of it. The at least one computing device may further include a bus, through which the processor is connected to the memory, and the memory may include readable memory and random access memory.
In a seventh aspect, this application provides a computer-readable storage medium storing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method of the first aspect or any implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof.
In an eighth aspect, this application provides a computer program product containing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method of the first aspect or any implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof.
On the basis of the implementations provided in the above aspects, this application may further combine them to provide additional implementations.
附图说明Description of drawings
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present application; those of ordinary skill in the art can also obtain other drawings based on these drawings.
图1为本申请提供的一示例性应用场景的示意图;Figure 1 is a schematic diagram of an exemplary application scenario provided by this application;
图2为本申请提供的另一示例性应用场景的示意图;Figure 2 is a schematic diagram of another exemplary application scenario provided by this application;
图3为本申请提供的一种语音克隆方法的流程示意图;Figure 3 is a schematic flow chart of a voice cloning method provided by this application;
图4为本申请提供的一种场景配置界面的示意图; Figure 4 is a schematic diagram of a scene configuration interface provided by this application;
图5为本申请提供的另一种场景配置界面的示意图;Figure 5 is a schematic diagram of another scene configuration interface provided by this application;
图6为本申请提供的新闻场景以及财经场景下的语料文本对应的拼音分布示意图;Figure 6 is a schematic diagram of the pinyin distribution corresponding to the corpus text in news scenarios and financial scenarios provided by this application;
图7为本申请提供的一种录音界面的示意图;Figure 7 is a schematic diagram of a recording interface provided by this application;
图8为本申请提供的一种测试界面的示意图;Figure 8 is a schematic diagram of a test interface provided by this application;
图9为本申请提供的一种计算设备的结构示意图;Figure 9 is a schematic structural diagram of a computing device provided by this application;
图10为本申请提供的一种计算设备集群的结构示意图。Figure 10 is a schematic structural diagram of a computing device cluster provided by this application.
具体实施方式Detailed Description
下面将结合本申请中的附图,对本申请提供的实施例中的方案进行描述。The solutions in the embodiments provided in this application will be described below with reference to the drawings in this application.
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes in describing the embodiments of the present application.
目前,在进行语音克隆时,会利用通用的语料文本以及目标对象针对该语料文本的录音音频,训练语音克隆模型。这样,语音克隆模型可以学习目标对象的发音的音色,根据新提供的文本,生成与该目标对象发音的音色相符的语音并输出,实现目标对象的语音克隆。其中,目标对象,是指具有能够发音的对象,如人类等。Currently, when performing voice cloning, the general corpus text and the target object's audio recording of the corpus text are used to train the speech cloning model. In this way, the speech cloning model can learn the timbre of the target object's pronunciation, and based on the newly provided text, generate and output a voice that matches the timbre of the target object's pronunciation, thereby realizing the speech cloning of the target object. Among them, the target object refers to an object that can pronounce words, such as human beings.
实际应用场景中,目标对象在不同场景中发音的韵律、风格等,通常存在差异。其中,韵律以及风格可以反映目标对象发音时的特点。韵律,可以包括发音的语调、时域分布和重音等方面的特征。风格,可以包括目标对象说话的语速等特征。In actual application scenarios, there are usually differences in the rhythm and style of the target object's pronunciation in different scenarios. Among them, rhythm and style can reflect the characteristics of the target object's pronunciation. Rhythm can include features such as pronunciation intonation, temporal distribution, and stress. Style can include characteristics such as the speaking speed of the target object.
以故事场景和新闻播报场景为例,人在故事场景(如在讲述故事内容等)中的发音,通常语速较为平缓(如每分钟说120字)、音量变化较大,而在新闻播报场景中的发音,通常语速较快(如每分钟说200字)、音量变化较小等。但是,基于通用文本的语料以及相应的录音音频所训练出的语音克隆模型,仅能克隆目标对象发音的音色,难以克隆出目标对象在不同场景下发音的不同韵律和风格,从而影响语音克隆效果。Take story scenes and news broadcast scenes as examples. A person's pronunciation in a story scene (such as when telling a story) usually has a relatively gentle speaking speed (for example, 120 words per minute) and large volume variation, while pronunciation in a news broadcast scene is usually faster (for example, 200 words per minute) with smaller volume variation. However, a voice cloning model trained on a general-purpose text corpus and the corresponding recorded audio can only clone the timbre of the target object's pronunciation; it is difficult to clone the different rhythms and styles of the target object's pronunciation in different scenes, which affects the voice cloning effect.
基于此,本申请实施例提供了一种语音克隆方法,该方法可以由语音克隆装置执行,用于提升针对目标对象的语音克隆效果。具体实现时,语音克隆装置先确定所要克隆的目标对象发音时的目标场景,并根据该目标场景,获取属于该目标场景的目标语料文本,并进一步根据该目标语料,确定目标对象的音频,该目标对象的音频的语音内容与该目标语料文本的内容相匹配,如该音频可以是对目标对象根据该目标语料文本的发音进行录音得到的音频等,从而语音克隆装置利用该目标语料文本以及该音频,训练得到用于输出模拟所述目标对象在目标场景下发音的音频的语音克隆模型,实现针对目标对象在目标场景下发音的语音克隆。Based on this, embodiments of the present application provide a voice cloning method, which can be executed by a voice cloning device and is used to improve the voice cloning effect for a target object. In specific implementation, the voice cloning device first determines the target scene in which the target object to be cloned pronounces, obtains the target corpus text belonging to that scene, and further determines the audio of the target object according to the target corpus text. The speech content of the target object's audio matches the content of the target corpus text; for example, the audio may be obtained by recording the target object reading the target corpus text. The voice cloning device then uses the target corpus text and the audio to train a voice cloning model for outputting audio that simulates the target object's pronunciation in the target scene, thereby realizing voice cloning of the target object's pronunciation in the target scene.
由于语音克隆模型,基于目标对象针对目标场景下的语料文本的发音音频进行训练得到,这使得语音克隆模型根据文本所输出的新的语音,在音色、韵律和发音风格等方面的特征,能够更加符合目标对象在目标场景下的真实发音情况,以此可以有效提高语音克隆效果。Since the voice cloning model is trained on the target object's pronunciation audio for corpus text in the target scene, the new speech that the model outputs from text can better match the target object's real pronunciation in the target scene in terms of timbre, rhythm, pronunciation style and other characteristics, which can effectively improve the voice cloning effect.
实际应用时,针对每种场景,均可以利用上述方式克隆目标对象在该场景下的语音,从而可以实现克隆目标对象在不同场景下发音所具有的不同韵律和风格,提高语音克隆的真实性和多样性。进一步地,语音克隆装置还可以针对多个对象中的每个对象,均采用上述方式克隆该对象在各个场景下的语音,从而可以提高语音克隆的灵活性以及丰富性。In practical applications, for each scene, the above method can be used to clone the target object's speech in that scene, so that the different rhythms and styles of the target object's pronunciation in different scenes can be cloned, improving the authenticity and diversity of voice cloning. Furthermore, the voice cloning device can also clone, for each of multiple objects, that object's speech in each scene in the above manner, thereby improving the flexibility and richness of voice cloning.
作为一种示例,上述语音克隆装置可以被部署于云端,用于为用户提供语音克隆的云服务。例如,在图1所示的应用场景中,语音克隆装置100可以部署于云端,例如可以是由云端的计算设备或者计算设备集群实现。并且,语音克隆装置100可以对外提供客户端200,用于实现与用户300的交互,如接收用户300输入的场景信息、文本或者音频数据,或者向用户300反馈克隆的音频等。实际应用时,客户端200例如可以是运行在用户侧设备上的应用程序,或者可以是语音克隆装置100对外提供的网络浏览器等。语音克隆装置100可以包括数据获取模块101、模型训练模块102。As an example, the above voice cloning device can be deployed in the cloud to provide users with a voice cloning cloud service. For example, in the application scenario shown in Figure 1, the voice cloning device 100 can be deployed in the cloud, for example implemented by a computing device or a computing device cluster in the cloud. In addition, the voice cloning device 100 can provide a client 200 for interaction with the user 300, such as receiving scene information, text or audio data input by the user 300, or feeding back cloned audio to the user 300. In actual application, the client 200 may be, for example, an application program running on a user-side device, or a web browser provided externally by the voice cloning device 100. The voice cloning device 100 may include a data acquisition module 101 and a model training module 102.
Among them, the data acquisition module 101 is used to determine the target scene (for example, a scene selected by the user 300 or a scene customized by the user 300 may be determined as the target scene), obtain the target corpus text belonging to the target scene and the audio of the target object, and provide the target corpus text and audio to the model training module 102. The model training module 102 is used to train a voice cloning model corresponding to the target scene using the target corpus text and the audio of the target object. Further, the voice cloning device 100 may also include a voice cloning module 103; in that case, the model training module 102 may provide the voice cloning model to the voice cloning module 103. The voice cloning module 103 is used to output, with the voice cloning model, the audio corresponding to a target text, i.e., audio that simulates the target object's pronunciation in the target scene, where the target text may be preconfigured text or text newly provided by the user 300. Further, the voice cloning module 103 may also send the audio corresponding to the target text to the client 200, so that the client 200 plays the audio to the user 300.
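As a rough illustration of the module split described above (data acquisition 101, model training 102, voice cloning 103), the pipeline could be skeletonized as follows. Every class and method name here is a hypothetical stand-in, and the actual model training and speech synthesis are reduced to placeholders:

```python
from dataclasses import dataclass

@dataclass
class TrainingData:
    scene: str           # target scene, e.g. "story" or "news"
    corpus_texts: list   # target corpus texts belonging to the scene
    audio_clips: list    # target object's recordings matching those texts

class VoiceCloningDevice:
    """Toy skeleton of modules 101/102/103; real training and TTS are stubbed."""

    def acquire(self, scene, corpus_texts, audio_clips):
        # Module 101: bundle scene-specific texts with matching recordings.
        return TrainingData(scene, corpus_texts, audio_clips)

    def train(self, data):
        # Module 102: would fit a scene-specific voice cloning model here;
        # this stub just records what it was trained on.
        return {"scene": data.scene, "samples": len(data.corpus_texts)}

    def synthesize(self, model, target_text):
        # Module 103: would synthesize cloned speech; stub returns a marker.
        return f"[{model['scene']}] audio for: {target_text}"
```

A per-scene model is the key design point: one `TrainingData` bundle, and hence one trained model, exists per (object, scene) pair.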
作为另一种示例,上述语音克隆装置可以被部署于本地,从而可以为用户提供本地的语音克隆服务。例如,在图2所示的应用场景中,上述语音克隆装置具体可以是本地的终端400,从而用户300可以向终端400输入目标场景、目标语料文本以及目标对象的音频,终端400利用该目标语料文本以及音频,训练目标场景对应的语音克隆模型,并利用该语音克隆模型输出目标文本对应的音频,并向用户300播放该音频。As another example, the above voice cloning device can be deployed locally, so that local voice cloning services can be provided for users. For example, in the application scenario shown in Figure 2, the above-mentioned voice cloning device can be a local terminal 400, so that the user 300 can input the target scene, the target corpus text, and the audio of the target object to the terminal 400, and the terminal 400 uses the target corpus. text and audio, train a speech cloning model corresponding to the target scene, and use the speech cloning model to output the audio corresponding to the target text, and play the audio to the user 300.
实际应用时,上述语音克隆装置可以通过软件实现,或者可以通过硬件实现。In actual application, the above voice cloning device can be implemented by software or can be implemented by hardware.
语音克隆装置作为软件功能单元的一种举例,可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,语音克隆装置可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。As an example of a software functional unit, the voice cloning device may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, there may be one or more computing instances. For example, the voice cloning device may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. Furthermore, they can be distributed in the same availability zone (AZ) or in different AZs; each AZ includes one data center or multiple geographically close data centers. Usually, one region may include multiple AZs.
同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。Likewise, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or across multiple VPCs. Usually, one VPC is set up within one region. Cross-region communication between two VPCs in the same region, or between VPCs in different regions, requires a communication gateway to be set up in each VPC; the interconnection between VPCs is realized through the communication gateways.
语音克隆装置作为硬件功能单元的一种举例,语音克隆装置可以包括至少一个计算设备,如服务器等。或者,语音克隆装置也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。As an example of a hardware functional unit, the voice cloning device may include at least one computing device, such as a server. Alternatively, the voice cloning device may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The above-mentioned PLD can be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
语音克隆装置包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。语音克隆装置包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,语音克隆装置包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。Multiple computing devices included in the voice cloning device may be distributed in the same region or in different regions. Multiple computing devices included in the voice cloning device may be distributed in the same AZ or in different AZs. Similarly, multiple computing devices included in the voice cloning device may be distributed in the same VPC or in multiple VPCs. The plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
接下来,对语音克隆过程的各种非限定性的具体实施方式进行详细描述。Next, various non-limiting specific implementations of the voice cloning process are described in detail.
参阅图3,为本申请实施例中一种语音克隆方法的流程示意图。该方法可以应用于上述图1或者图2所示的应用场景中,或者也可以是应用于其它可适用的应用场景中。下面以应用于图1所示的应用场景为例进行说明。在图1所示的应用场景中,语音克隆装置100中的数据获取模块101、模型训练模块102以及语音克隆模块103的功能,具体参见下述实施例的相关描述。并且,语音克隆装置100能够用于生成克隆一个或者多个对象在各个场景下发音的语音克隆模型,为便于说明,图3所示实施例中以生成用于模拟输出一个对象(即下述目标对象)在一个场景(即下述目标场景)下发音的音频的语音克隆模型为例进行说明,关于语音克隆装置100生成用于模拟其他对象在其它各个场景下发音的语音克隆模型的实现过程,可参照图3所示实施例进行理解。Referring to Figure 3, it is a schematic flowchart of a voice cloning method in an embodiment of the present application. This method can be applied to the application scenarios shown in Figure 1 or Figure 2 above, or to other applicable application scenarios. The following description takes the application scenario shown in Figure 1 as an example. In that scenario, for the functions of the data acquisition module 101, the model training module 102 and the voice cloning module 103 in the voice cloning device 100, refer to the relevant descriptions of the following embodiments. Moreover, the voice cloning device 100 can be used to generate voice cloning models that clone the pronunciation of one or more objects in various scenes. For ease of explanation, the embodiment shown in Figure 3 takes as an example the generation of a voice cloning model used to output audio simulating the pronunciation of one object (i.e., the target object below) in one scene (i.e., the target scene below); the process by which the voice cloning device 100 generates voice cloning models simulating the pronunciation of other objects in other scenes can be understood with reference to the embodiment shown in Figure 3.
图3所示的语音克隆方法具体可以包括:The voice cloning method shown in Figure 3 may specifically include:
S301:数据获取模块101确定目标场景。S301: The data acquisition module 101 determines the target scene.
通常情况下,目标对象在多种不同的场景下发音的韵律、风格可能存在差异,因此,在克隆目标对象的发音时,可以先确定所要克隆的目标对象的发音所属场景,以下称之为目标场景。Usually, the rhythm and style of the target object's pronunciation may differ across different scenes. Therefore, when cloning the target object's pronunciation, the scene to which the pronunciation to be cloned belongs can first be determined; this scene is hereinafter referred to as the target scene.
其中,目标对象发音所属的场景,可以根据实际应用场景中的发音环境进行划分,如可以划分为对话场景、新闻场景、财经场景、直播场景、故事场景、教育场景等多个场景,目标场景即为其中一种场景。或者,目标对象发音所属的场景,也可以是根据人物情绪的类型进行划分,如可以按照人物情绪的不同,划分为高兴场景、悲伤场景、崇拜场景、冷静场景、平淡场景等。实际应用时,也可以是采用其它方式划分得到多种不同的场景,本实施例对此并不进行限定。The scenes to which the target object's pronunciation belongs can be divided according to the pronunciation environment in actual applications, for example into dialogue scenes, news scenes, financial scenes, live broadcast scenes, story scenes, education scenes, etc.; the target scene is one of these scenes. Alternatively, the scenes can also be divided according to the type of the character's emotion, for example into happy scenes, sad scenes, worship scenes, calm scenes, plain scenes, etc. In actual application, other division methods may also be used to obtain multiple different scenes, which is not limited in this embodiment.
进一步地,目标场景,还可以是用户自定义的场景,如用户可以自定义睡前故事场景、演讲场景等。Furthermore, the target scene can also be a user-defined scene, for example, the user can customize a bedtime story scene, a speech scene, etc.
在一种确定目标场景的实现方式中,数据获取模块101可以生成场景配置界面,并将该场景配置界面发送给客户端200,以便客户端200将其呈现给用户300。其中,客户端200所呈现的场景配置界面中可以包括多个候选场景,例如可以是图4所示的对话场景、新闻场景、财经场景、直播场景、故事场景、教育场景、演讲场景等,该多个候选场景可以预先由技术人员进行配置。这样,用户300可以在客户端200上,从呈现的多个候选场景中选择一种场景,如选择故事场景等,以便指定语音克隆装置100基于该场景进行语音克隆。相应地,客户端200可以将用户所选择的场景反馈给数据获取模块101,以便数据获取模块101将其确定为目标场景。In one implementation of determining the target scene, the data acquisition module 101 can generate a scene configuration interface and send it to the client 200 so that the client 200 can present it to the user 300. The scene configuration interface presented by the client 200 may include multiple candidate scenes, such as the dialogue scene, news scene, financial scene, live broadcast scene, story scene, education scene and speech scene shown in Figure 4; these candidate scenes can be configured in advance by technicians. In this way, the user 300 can select one of the presented candidate scenes on the client 200, such as a story scene, so as to instruct the voice cloning device 100 to perform voice cloning based on that scene. Correspondingly, the client 200 can feed back the scene selected by the user to the data acquisition module 101, so that the data acquisition module 101 determines it as the target scene.
另外,语音克隆装置100还可以支持用户300自定义场景。比如,在图4所示的场景配置界面中,当用户300选择“自定义”场景,数据获取模块101还可以生成图5所示的场景配置界面,并将该场景配置界面通过客户端200呈现给用户300。此时,用户300可以在该场景配置界面中输入自定义的场景的名称(或者其它用于标识该场景的信息);相应地,数据获取模块101可以根据用户输入的场景的名称,创建新的场景,并将其确定为目标场景。In addition, the voice cloning device 100 can also support the user 300 in customizing scenes. For example, in the scene configuration interface shown in Figure 4, when the user 300 selects the "custom" scene, the data acquisition module 101 can generate the scene configuration interface shown in Figure 5 and present it to the user 300 through the client 200. At this time, the user 300 can enter the name of the customized scene (or other information used to identify the scene) in that interface; correspondingly, the data acquisition module 101 can create a new scene according to the scene name entered by the user and determine it as the target scene.
需要说明的是,上述数据获取模块101确定目标场景的实现方式仅作为示例性说明,实际应用时,数据获取模块101也可以通过其它方式确定目标场景,本实施例对此并不进行限定。It should be noted that the above-mentioned implementation method for the data acquisition module 101 to determine the target scene is only for illustrative purposes. In actual application, the data acquisition module 101 can also determine the target scene through other methods, which is not limited in this embodiment.
S302:数据获取模块101根据目标场景,确定获取属于该目标场景的目标语料文本。S302: The data acquisition module 101 obtains, according to the target scene, the target corpus text belonging to the target scene.
在确定目标场景后,数据获取模块101可以进一步获取实现语音克隆所需的目标语料文本。After determining the target scene, the data acquisition module 101 can further acquire the target corpus text required to implement speech cloning.
在一种获取目标语料文本的实施方式中,数据获取模块101可以在进行语音克隆之前,预先针对多种候选场景分别配置有相应的语料库,每个语料库用于存储属于同一候选场景的多个语料文本,不同语料库中存储的语料文本属于不同候选场景。其中,每个语料库中存储的语料文本的内容的语境与该候选场景所指示的语境相匹配。比如,在演讲场景对应的语料库中,所存储的语料文本例如可以是多份不同的演讲稿等。当目标场景为多个候选场景的其中一种候选场景时,数据获取模块101可以访问该目标场景对应的语料库,并从该语料库中筛选出部分语料文本作为用于训练语音克隆模型的目标语料文本。In one implementation of obtaining the target corpus text, before performing voice cloning, the data acquisition module 101 may be preconfigured with a corresponding corpus for each of multiple candidate scenes. Each corpus is used to store multiple corpus texts belonging to the same candidate scene, and corpus texts stored in different corpora belong to different candidate scenes. The context of the content of the corpus texts stored in each corpus matches the context indicated by that candidate scene. For example, in the corpus corresponding to a speech scene, the stored corpus texts may be multiple different speech scripts. When the target scene is one of the multiple candidate scenes, the data acquisition module 101 can access the corpus corresponding to the target scene and filter out some corpus texts from that corpus as the target corpus texts for training the voice cloning model.
本实施例中,提供了以下几种从语料库筛选出目标语料文本的实现示例。In this embodiment, the following implementation examples are provided for filtering target corpus text from the corpus.
在第一种实现示例中,数据获取模块101可以根据拼音分布从语料库中筛选出目标语料文本。In the first implementation example, the data acquisition module 101 can filter out the target corpus text from the corpus according to pinyin distribution.
具体地,以语料文本为中文文本为例,目标场景对应的语料库在存储多个语料文本时,还存储该多个语料文本包括的各个中文字符的拼音分布,如各个中文字符对应的拼音在语料库中出现次数的分布,以下称之为第一拼音分布。然后,数据获取模块101可以从该语料库中筛选出预设数量(如30、或50、或100等)的语料文本,将其添加至语料文本集合中,并统计该语料文本集合中的多个语料文本对应的拼音分布,以下称之为第二拼音分布。Specifically, taking Chinese corpus texts as an example, when the corpus corresponding to the target scene stores multiple corpus texts, it also stores the pinyin distribution of the Chinese characters included in those corpus texts, such as the distribution of the number of occurrences, in the corpus, of the pinyin corresponding to each Chinese character; this is hereinafter referred to as the first pinyin distribution. Then, the data acquisition module 101 can select a preset number (such as 30, 50, or 100) of corpus texts from the corpus, add them to a corpus text set, and compute the pinyin distribution corresponding to the multiple corpus texts in the set, hereinafter referred to as the second pinyin distribution.
接着,数据获取模块101可以计算第一拼音分布与第二拼音分布之间的方差(或者标准差等)。通常情况下,不同场景下的语料文本对应的拼音分布通常存在较大差异。比如,对于新闻场景下的500条语料文本以及财经场景下的500条语料文本,其拼音分布中数量最多的前10个拼音的分布可以如图6所示。因此,每个场景下的语料文本对应的拼音分布,可以作为指示该场景下语料文本特性的特征。相应地,在选取用于训练该场景下的目标语料文本时,可以选取拼音分布与语料库的拼音分布相同的多个语料文本作为目标语料文本,以保留该场景下的文本内容特征。Next, the data acquisition module 101 can calculate the variance (or standard deviation, etc.) between the first Pinyin distribution and the second Pinyin distribution. Usually, there are big differences in the pinyin distribution corresponding to corpus texts in different scenarios. For example, for 500 corpus texts in the news scenario and 500 corpus texts in the financial scenario, the distribution of the top 10 pinyins with the largest number in the pinyin distribution can be shown in Figure 6. Therefore, the pinyin distribution corresponding to the corpus text in each scene can be used as a feature indicating the characteristics of the corpus text in that scene. Correspondingly, when selecting target corpus texts for training in this scenario, multiple corpus texts whose pinyin distribution is the same as that of the corpus can be selected as target corpus texts to retain the text content characteristics in this scenario.
并且,当第一拼音分布与第二拼音分布之间的方差小于或者等于预设阈值时,数据获取模块101可以将语料文本集合中的多个语料文本确定为用于训练语音克隆模型的目标语料文本。而当第一拼音分布与第二拼音分布之间的方差大于预设阈值时,数据获取模块101可以根据第一拼音分布,确定第二拼音分布中拼音占比过大的目标拼音,并从语料文本集合中删除该目标拼音重复率相对较高的一个或者多个语料文本,然后随机从数据库中剩余的语料文本中选择一个或者多个语料文本,并将其添加至语料文本集合中。Moreover, when the variance between the first pinyin distribution and the second pinyin distribution is less than or equal to a preset threshold, the data acquisition module 101 can determine the multiple corpus texts in the corpus text set as the target corpus texts for training the voice cloning model. When the variance between the first pinyin distribution and the second pinyin distribution is greater than the preset threshold, the data acquisition module 101 can determine, according to the first pinyin distribution, the target pinyin whose proportion in the second pinyin distribution is too large, delete from the corpus text set one or more corpus texts in which that target pinyin is repeated relatively often, and then randomly select one or more corpus texts from the remaining corpus texts in the database and add them to the corpus text set.
然后,数据获取模块101可以重新计算该语料文本集合对应的拼音分布与第一拼音分布之间的方差(或者标准差等)是否小于预设阈值。如果是,则将当前语料文本集合中的多个语料文本确定为目标语料文本;如果不是,则可以重复上述步骤更新语料文本集合,直至语料文本集合对应的拼音分布与第一拼音分布之间的方差(或者标准差等)小于预设阈值。Then, the data acquisition module 101 can recalculate whether the variance (or standard deviation, etc.) between the pinyin distribution corresponding to the corpus text set and the first pinyin distribution is less than the preset threshold. If so, the multiple corpus texts in the current corpus text set are determined as the target corpus texts; if not, the above steps can be repeated to update the corpus text set until the variance (or standard deviation, etc.) between the pinyin distribution corresponding to the set and the first pinyin distribution is less than the preset threshold.
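The iterative procedure of this first example (draw a candidate set, compare its pinyin distribution with the corpus-wide one, and swap out over-represented texts until the variance falls below the threshold) can be sketched in Python. This is a minimal illustration only: the toy character-to-pinyin mapping, the mean-squared-difference variance, and all function names are assumptions of this sketch, not part of the application.

```python
import random
from collections import Counter

def pinyin_distribution(texts, char2py):
    """Normalized frequency of each pinyin over all characters in `texts`.
    `char2py` is a toy character-to-pinyin map standing in for a real
    grapheme-to-pinyin tool."""
    counts = Counter(char2py[c] for t in texts for c in t if c in char2py)
    total = sum(counts.values()) or 1
    return {py: n / total for py, n in counts.items()}

def distribution_variance(ref, cand):
    """Mean squared difference between two pinyin frequency maps (the value
    compared against the preset threshold)."""
    keys = set(ref) | set(cand)
    return sum((ref.get(k, 0.0) - cand.get(k, 0.0)) ** 2 for k in keys) / (len(keys) or 1)

def select_corpus(corpus, char2py, n, threshold, max_iters=100, seed=0):
    """Iteratively pick `n` texts whose pinyin distribution tracks the whole
    corpus, swapping out texts that over-represent some pinyin."""
    rng = random.Random(seed)
    ref = pinyin_distribution(corpus, char2py)
    chosen = rng.sample(corpus, n)
    for _ in range(max_iters):
        cand = pinyin_distribution(chosen, char2py)
        if not cand or distribution_variance(ref, cand) <= threshold:
            break
        # Pinyin most over-represented in the current selection.
        over = max(cand, key=lambda k: cand[k] - ref.get(k, 0.0))
        # Drop the chosen text that repeats it most often...
        worst = max(chosen, key=lambda t: sum(char2py.get(c) == over for c in t))
        chosen.remove(worst)
        # ...and draw a random replacement from the rest of the database.
        rest = [t for t in corpus if t not in chosen]
        chosen.append(rng.choice(rest))
    return chosen
```

In practice, the reference distribution would be the stored first pinyin distribution of the scene's corpus rather than one recomputed on the fly.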
在第二种实现示例中,数据获取模块101可以根据专业术语的占比,从语料库中筛选出目标语料文本。其中,专业术语,是指在特定领域中对一些特定事物的统一称谓,如计算机领域中的复杂程序逻辑器件(CPLD)等。In the second implementation example, the data acquisition module 101 can filter out the target corpus texts from the corpus according to the proportion of professional terms. Professional terms refer to unified names for specific things in a particular field, such as the complex programmable logic device (CPLD) in the computer field.
具体地,目标场景对应的语料库所存储多个语料文本中,可以携带有各个语料文本分别包括的专业术语的标识(或者称之为标签)。这样,数据获取模块101可以先从语料库中随机筛选预设数量的语料文本,将其添加至语料文本集合中,并根据这些语料文本携带的专业术语的标识,确定语料文本集合中专业术语的数量相对于该语料文本集合包括的所有词汇的数量的占比。当占比大于或者等于预设的比例阈值时,数据获取模块101可以将语料文本集合中的多个语料文本确定为目标语料文本;而当占比小于预设的比例阈值时,数据获取模块101可以删除语料文本集合中专业术语数量较少的部分语料文本,或者删除语料集合中重复率较高的部分语料文本,然后随机从数据库中剩余的语料文本中选择一个或者多个语料文本,并将其添加至语料文本集合中。Specifically, the multiple corpus texts stored in the corpus corresponding to the target scene may carry identifiers (or labels) of the professional terms included in each corpus text. In this way, the data acquisition module 101 can first randomly select a preset number of corpus texts from the corpus, add them to a corpus text set, and determine, based on the professional-term identifiers carried by these corpus texts, the proportion of the number of professional terms in the corpus text set relative to the number of all words included in the set. When the proportion is greater than or equal to a preset proportion threshold, the data acquisition module 101 can determine the multiple corpus texts in the corpus text set as the target corpus texts; when the proportion is less than the preset proportion threshold, the data acquisition module 101 can delete some corpus texts with few professional terms from the set, or delete some corpus texts with a high repetition rate, and then randomly select one or more corpus texts from the remaining corpus texts in the database and add them to the corpus text set.
然后,数据获取模块101可以重新计算语料文本集合中专业术语的数量相对于该语料文本集合中所有词汇的数量的占比。当占比大于或者等于预设的比例阈值时,数据获取模块101可以将当前语料文本集合中的多个语料文本确定为目标语料文本;而当占比小于预设的比例阈值时,数据获取模块101可以重复上述步骤更新语料文本集合,直至语料文本集合中专业术语的数量相对于该语料文本集合中所有词汇的数量的占比大于或者等于预设的比例阈值。Then, the data acquisition module 101 can recalculate the proportion of the number of professional terms in the corpus text set relative to the number of all words in the set. When the proportion is greater than or equal to the preset proportion threshold, the data acquisition module 101 can determine the multiple corpus texts in the current corpus text set as the target corpus texts; when the proportion is less than the preset proportion threshold, the data acquisition module 101 can repeat the above steps to update the corpus text set until the proportion of the number of professional terms in the set relative to the number of all words in the set is greater than or equal to the preset proportion threshold.
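The second example's loop (keep swapping out the selected text that contributes the fewest professional terms until the term ratio reaches the threshold) can likewise be sketched. Here each corpus text is represented as a pre-tokenized list of words and the term labels as a simple set; both are stand-in assumptions for the identifiers actually carried by the corpus.

```python
import random

def term_ratio(token_lists, terms):
    """Fraction of all tokens that are labeled professional terms."""
    total = sum(len(toks) for toks in token_lists)
    hits = sum(tok in terms for toks in token_lists for tok in toks)
    return hits / total if total else 0.0

def select_by_terms(corpus, terms, n, min_ratio, max_iters=100, seed=0):
    """Iteratively swap out the selected text with the fewest term hits
    until the professional-term ratio reaches `min_ratio`."""
    rng = random.Random(seed)
    chosen = rng.sample(corpus, n)
    for _ in range(max_iters):
        if term_ratio(chosen, terms) >= min_ratio:
            return chosen
        # Drop the selected text contributing the fewest professional terms...
        worst = min(chosen, key=lambda toks: sum(t in terms for t in toks))
        chosen.remove(worst)
        # ...and draw a random replacement from the texts not currently selected.
        rest = [t for t in corpus if t not in chosen]
        chosen.append(rng.choice(rest))
    return chosen
```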
在第三种实现示例中,数据获取模块101可以综合上述拼音分布以及专业术语的占比,从数据库中筛选出目标语料文本,即所筛选出的目标语料文本中,不仅拼音分布与数据库对应的拼音分布之间的方差小于或者等于预设阈值,而且,专业术语的数量相对于目标语料文本中所有词汇的数量的占比大于或者等于比例阈值。In the third implementation example, the data acquisition module 101 can combine the above pinyin distribution and the proportion of professional terms to filter out the target corpus texts from the database; that is, in the filtered target corpus texts, not only is the variance between their pinyin distribution and the database's pinyin distribution less than or equal to the preset threshold, but the proportion of the number of professional terms relative to the number of all words in the target corpus texts is also greater than or equal to the proportion threshold.
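The third example simply requires both acceptance conditions to hold at once. A compact standalone predicate, under the same toy representations as the criteria themselves (character-to-pinyin map, tokenized texts, term set — all illustrative assumptions):

```python
from collections import Counter

def accept(selection, ref_dist, char2py, terms, var_max, ratio_min):
    """Joint acceptance test: the selection's pinyin distribution must stay
    within `var_max` of the corpus-wide reference AND its professional-term
    ratio must reach `ratio_min`. `selection` is a list of tokenized texts."""
    # Pinyin distribution over every character of the selection.
    counts = Counter(char2py[c] for toks in selection for w in toks for c in w
                     if c in char2py)
    total = sum(counts.values()) or 1
    dist = {p: n / total for p, n in counts.items()}
    keys = set(ref_dist) | set(dist)
    var = sum((ref_dist.get(k, 0.0) - dist.get(k, 0.0)) ** 2
              for k in keys) / (len(keys) or 1)
    # Professional-term ratio over all tokens.
    n_tokens = sum(len(toks) for toks in selection)
    ratio = sum(w in terms for toks in selection for w in toks) / (n_tokens or 1)
    return var <= var_max and ratio >= ratio_min
```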
上述数据获取模块101从语料库筛选出目标语料文本仅作为一些示例性说明,实际应用时,数据获取模块101也可以通过其它方式从语料库筛选出目标语料文本,本实施例对此并不进行限定。The above ways in which the data acquisition module 101 filters out the target corpus text from the corpus are only exemplary; in actual application, the data acquisition module 101 may also filter out the target corpus text from the corpus in other ways, which is not limited in this embodiment.
在另一种获取目标语料文本的实施方式中,当目标场景为用户300自定义的场景时,数据获取模块101可以从用户300上传的语料文本中确定出用于训练适用于该场景下的语音克隆模型的目标语料文本。In another implementation of obtaining the target corpus text, when the target scene is a scene customized by the user 300, the data acquisition module 101 can determine, from the corpus texts uploaded by the user 300, the target corpus texts for training a voice cloning model suitable for that scene.
具体地,数据获取模块101在通过客户端200呈现场景配置界面时,除了可以提示用户300输入自定义的场景的名称,还可以在该场景配置界面上提示用户300上传语料文本,如图5所示。其中,用户300可以在该场景配置界面上导入语料文本,或者在该场景配置界面上输入语料文本的路径、文件名或者网络地址等,以便数据获取模块101根据用户300输入的信息访问得到语料文本等。进一步地,图5所示的场景配置界面还可以提示用户300输入自定义场景下的专业术语。Specifically, when presenting the scene configuration interface through the client 200, the data acquisition module 101 may not only prompt the user 300 to enter the name of the customized scene, but also prompt the user 300 on that interface to upload corpus texts, as shown in Figure 5. The user 300 can import corpus texts on the scene configuration interface, or enter the path, file name, or network address of the corpus texts on the interface, so that the data acquisition module 101 can access the corpus texts according to the information entered by the user 300. Furthermore, the scene configuration interface shown in Figure 5 can also prompt the user 300 to enter professional terms for the customized scene.
Then, the data acquisition module 101 may determine the target corpus text from the corpus text uploaded by the user 300. When the number of corpus texts uploaded by the user 300 is large, the data acquisition module 101 may, with reference to the foregoing implementations, determine the target corpus text from the multiple corpus texts according to the pinyin distribution or the professional terms, which is not repeated here. When the number of corpus texts uploaded by the user 300 is small, for example when it does not exceed the aforementioned preset number, the data acquisition module 101 may determine all the corpus texts uploaded by the user 300 as the target corpus text, which is not limited in this embodiment.
In practice, the data acquisition module 101 may also obtain the target corpus text in other ways, which is not limited in this embodiment.
S303: The data acquisition module 101 determines, according to the target corpus text, audio of the target object whose speech content matches the content of the target corpus text.
The target object may be, for example, the user 300, or may be an object other than the user 300, such as a public figure.
After obtaining the target corpus text, the data acquisition module 101 may further obtain audio of the target object whose speech content matches the content of the target corpus text; for example, the speech content of the audio is identical to the content of the target corpus text.
This embodiment provides the following implementation examples for obtaining the audio of the target object.
In a first implementation example, when the target object is the user 300, the data acquisition module 101 may generate a recording interface that includes the determined target corpus text, and present the recording interface through the client 200. Further, the recording interface may also present the pinyin and tone information corresponding to the target corpus text; this pinyin and tone information may be manually annotated on the target corpus text in advance by technical personnel. For example, in the recording interface shown in Figure 7, the presented target corpus text may be a text belonging to the finance scene, "今年房地产价格走势是涨是落" ("Will real estate prices rise or fall this year?"), and the presented pinyin and tone information is "jin1 nian2 fang2 di4 chan3 jia4 ge2 zou3 shi4 shi4 zhang3 shi4 luo4". Here, "jin" in "jin1" is the pinyin of the character "今" in the target corpus text, and the "1" in "jin1" indicates that "今" is pronounced with the first tone; similarly, "nian" in "nian2" is the pinyin of the character "年", and the "2" in "nian2" indicates that "年" is pronounced with the second tone.
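The annotation format shown above — each pinyin syllable followed by a tone digit — can be parsed mechanically. The following sketch (the function name and error handling are illustrative, not from the patent) splits such a string into (syllable, tone) pairs:

```python
import re

def parse_pinyin_tones(annotation):
    """Split an annotation like 'jin1 nian2 fang2' into (syllable, tone)
    pairs, where the trailing digit 1-5 is the tone mark."""
    pairs = []
    for token in annotation.split():
        m = re.fullmatch(r"([a-z]+)([1-5])", token)
        if m is None:
            raise ValueError(f"malformed pinyin token: {token!r}")
        pairs.append((m.group(1), int(m.group(2))))
    return pairs
```

For example, `parse_pinyin_tones("jin1 nian2")` yields `[("jin", 1), ("nian", 2)]`, matching the reading of the example in the text.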
In this way, the user 300 can pronounce the target corpus text according to the text (and the corresponding pinyin and tones) presented on the recording interface. Correspondingly, the data acquisition module 101 may use the client 200 to record the user's pronunciation, obtaining the audio of the user 300, that is, the audio of the target object.
Further, because recording the target object's pronunciation is easily disturbed by environmental noise, the data acquisition module 101 may also perform noise detection on the recorded audio and calculate its signal-to-noise ratio. When the signal-to-noise ratio is greater than the noise threshold, this indicates that the audio suffers from significant noise interference; in this case, the data acquisition module 101 may delete the recording and prompt the user 300 to record the target corpus text again, until the signal-to-noise ratio of the obtained audio does not exceed the noise threshold. In addition, the data acquisition module 101 may also verify whether the speech content of the recorded audio matches the target corpus text, for example by checking whether the speech content is consistent with the content of the target corpus text, or whether the accuracy of the user's pronunciation reaches a threshold value. If so, the data acquisition module 101 may determine that the speech content of the audio matches the target corpus text; if not, the data acquisition module 101 may prompt the user 300 to record the target corpus text again.
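A quality gate of this kind might be sketched as below. Note this follows the usual convention in which a higher signal-to-noise ratio means a cleaner recording, so a take is kept when its SNR reaches a floor; the noise-floor heuristic (quietest 10% of frames) and both thresholds are assumptions of the sketch, not values from the patent.

```python
import math

def snr_db(frame_energies, noise_fraction=0.1):
    """Estimate SNR in dB, treating the quietest frames as the noise floor."""
    frames = sorted(frame_energies)
    k = max(1, int(len(frames) * noise_fraction))
    noise = sum(frames[:k]) / k            # quietest frames ~ noise floor
    signal = sum(frames) / len(frames)     # overall energy ~ signal level
    return 10 * math.log10(signal / max(noise, 1e-12))

def recording_acceptable(frame_energies, min_snr_db=20.0):
    """Keep a take only if its estimated SNR reaches the floor."""
    return snr_db(frame_energies) >= min_snr_db
```

A real implementation would compute frame energies from the waveform (e.g. short-time energy over 20 ms windows) and could use a dedicated voice-activity detector instead of this heuristic.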
In a second implementation example, when the target object is not the user 300, the data acquisition module 101 may obtain multiple audio segments of the target object in the target scene. For example, when the target scene is a speech scene, the data acquisition module 101 may obtain speech audio recorded by the target object in various public speaking settings. In practice, the target object may be specified in advance by the user 300. For example, the scene configuration interface shown in Figure 4 may present multiple different objects, including object 1 to object 4, so that the user 300 can select one of them as the target object, instructing the voice cloning apparatus 100 to clone the voice of that target object. Correspondingly, the data acquisition module 101 may obtain multiple audio segments of the target object specified by the user 300 from a database or from the network. The data acquisition module 101 may then match the content of the target corpus text against the obtained audio segments of the target object, thereby determining, from the multiple segments, the audio that matches the target corpus text.
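The content-matching step can be illustrated as follows, assuming each candidate segment already has a transcript (in practice produced by a speech-recognition front end, which the patent does not specify); `difflib`'s similarity ratio stands in for whatever matching criterion the implementation actually uses:

```python
import difflib

def best_matching_segment(target_text, segments):
    """segments: (audio_id, transcript) pairs; return the id of the segment
    whose transcript is closest to the target corpus text."""
    return max(
        segments,
        key=lambda seg: difflib.SequenceMatcher(None, target_text, seg[1]).ratio(),
    )[0]
```

A production system would likely also impose a minimum similarity so that a poor best match is rejected rather than used as training data.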
After obtaining the target corpus text and the audio of the target object, the data acquisition module 101 may forward them to the model training module 102.
S304: The model training module 102 trains the voice cloning model corresponding to the target scene using the target corpus text and the audio of the target object, where the voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
In this embodiment, the voice cloning model may be built on, for example, the PortaSpeech model, the Tacotron model, or the FastSpeech model, or on another speech synthesis model, which is not limited in this embodiment.
As one implementation example, after the target corpus text and the audio of the target object are obtained, they may be used as training samples to iteratively train the voice cloning model until it meets a training termination condition, for example until the loss value is less than a threshold. In this way, the voice cloning model can learn the timbre, prosody, and style of the target object's pronunciation in the target scene.
In another implementation example, because the amount of target corpus text and target-object audio is usually small, the model training module 102 may first obtain general corpus text (that is, text not distinguished by scene) and the audio corresponding to that general corpus text, and use them for preliminary training of the voice cloning model. When the termination condition of the preliminary training is met, the voice cloning model can output corresponding audio for input text, that is, it can perform the basic function of speech synthesis. The model training module 102 then uses the target corpus text and the audio of the target object to further train the voice cloning model until it meets the training termination condition. In this way, even when the amount of target corpus text and target-object audio is small (that is, there are few training samples), the voice cloning model finally trained can still clone well the timbre, prosody, and style of the target object's pronunciation in the target scene.
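The pretrain-then-fine-tune scheme can be shown on a deliberately tiny stand-in: a one-parameter least-squares model replaces the real TTS network (PortaSpeech/Tacotron/FastSpeech), a large "general" dataset replaces the general corpus, and a small "scene" dataset replaces the target data. All data and hyperparameters here are invented for the illustration.

```python
def train(weight, data, lr=0.1, epochs=200):
    """Gradient descent on the mean squared error of y ~ weight * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

# Stage 1: plentiful general data (y = 2x) gives a good starting point.
general_data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
# Stage 2: scarce scene data (y = 2.2x) nudges the pretrained model.
target_data = [(1.0, 2.2), (2.0, 4.4)]

w_pretrained = train(0.0, general_data)                    # preliminary training
w_finetuned = train(w_pretrained, target_data, epochs=50)  # scene fine-tuning
```

Starting fine-tuning from the pretrained weight means far fewer scene-specific samples and epochs are needed than training from scratch, which is the point the paragraph makes about scarce target data.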
In a further possible implementation, after the model training module 102 has trained the voice cloning model, it may send the model to the voice cloning module 103, so that the voice cloning module 103 can output audio simulating the pronunciation of the target object, thereby cloning the target object's voice. To this end, this embodiment may further include:
S305: The voice cloning module 103 uses the voice cloning model to output the audio corresponding to the target text.
The target text may be a test text used to present the cloning effect of the voice cloning model to the user 300, or may be a text specified in advance by the user for which speech needs to be synthesized.
As one implementation example, when the target text is a test text, the voice cloning module 103 may input a fixedly configured test text into the voice cloning model, which outputs the corresponding audio according to the test text; this audio simulates the target object pronouncing the test text in the target scene. The voice cloning module 103 may then output the audio, specifically by sending it to the client 200, which plays it to the user 300, so that the user 300 can perceive, from the played audio, how well the voice cloning model clones the target object's pronunciation in the target scene.
As another implementation example, when the target text is a test text, the target text may be provided by the user 300. In that case, the voice cloning module 103 may generate a test interface, for example the test interface shown in Figure 8, and present it to the user 300 through the client to prompt the user 300 to input a test text. Correspondingly, the voice cloning module 103 may, in response to the user's operation on the test interface, obtain the test text input by the user 300, input it into the voice cloning model, and obtain the audio output by the model. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300, so that the user 300 can perceive, from the played audio, how well the voice cloning model clones the target object's pronunciation in the target scene.
As yet another implementation example, the target text is a text specified in advance by the user 300 for which speech needs to be synthesized. For example, when the target scene is a storytelling scene, the user 300 may specify the name or text of a story in advance, so that the voice cloning module 103 can input the text specified by the user 300 (such as the story text) into the voice cloning model and obtain the corresponding audio output by the model. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300 to meet the user's need for voice cloning of that text; for example, the user 300 can hear audio that simulates the target object telling the story.
As a further implementation example, the target text is a text input by the user 300 for which speech needs to be synthesized. Correspondingly, after receiving the voice cloning model, the voice cloning module 103 may generate a speech synthesis interface and present it to the user 300 through the client 200. The voice cloning module 103 may then receive, through the client 200, the target text input by the user 300 that needs to be synthesized, input it into the voice cloning model, and obtain the audio output by the model according to the target text. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300 to meet the user's need for voice cloning of the target text.
It should be noted that the embodiment shown in Figure 3 is illustrated by the process in which the voice cloning apparatus 100 generates audio that clones the target object's pronunciation in the target scene. In practice, the voice cloning apparatus 100 may, in a similar manner, train a voice cloning model for each scene, with the model corresponding to each scene generating audio that clones the target object's pronunciation in that scene. Moreover, for different objects, a voice cloning model corresponding to each object in each scene may be trained in a similar manner. In this way, the voice cloning apparatus 100 can train multiple different voice cloning models for different scenes and different objects, allowing the user to select both the pronunciation scene and the object to be cloned, which improves the flexibility and richness of voice cloning.
Thus, after the user 300 specifies a scene and an object, the voice cloning apparatus 100 can use the voice cloning model corresponding to the specified scene and object to generate the corresponding audio and feed it back to the user 300. For example, the voice cloning module 103 may generate a speech synthesis interface that presents multiple candidate scenes and multiple candidate objects to the user, so that the user can select one of the candidate scenes and one of the candidate objects on that interface. Correspondingly, the voice cloning module 103 may determine the candidate scene selected by the user as the target scene and the candidate object selected by the user as the target object, and further determine the voice cloning model corresponding to that target scene for simulating the target object's pronunciation. The voice cloning module 103 may then use the determined voice cloning model to synthesize, from a preconfigured target text or a target text input by the user on the speech synthesis interface, audio that simulates the target object's pronunciation in the target scene.
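The per-scene, per-object organization described above amounts to a lookup keyed by (scene, speaker) pairs. The sketch below is illustrative only: `VoiceCloneRegistry` is a hypothetical name, and a lambda stands in for a trained model.

```python
class VoiceCloneRegistry:
    """One cloned-voice model per (scene, speaker) pair."""

    def __init__(self):
        self._models = {}

    def register(self, scene, speaker, model):
        self._models[(scene, speaker)] = model

    def synthesize(self, scene, speaker, text):
        model = self._models.get((scene, speaker))
        if model is None:
            raise KeyError(f"no model for scene={scene!r}, speaker={speaker!r}")
        return model(text)

registry = VoiceCloneRegistry()
# A real entry would hold a trained TTS model; a lambda stands in here.
registry.register("finance", "object_1",
                  lambda text: f"<audio finance/object_1: {text}>")
clip = registry.synthesize("finance", "object_1", "今年房地产价格走势")
```

Selecting a scene and an object on the synthesis interface then reduces to choosing the key under which the corresponding model was registered.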
In other implementations, the speech synthesis interface generated by the voice cloning module 103 may also support only selecting one of multiple scenes as the target scene, or only selecting one of multiple candidate objects as the target object, which is not limited in this embodiment.
In the embodiment shown in Figure 3 above, the voice cloning apparatus involved in the voice cloning process (including the data acquisition module 101, the model training module 102, and the voice cloning module 103) may be software configured on a computing device or a cluster of computing devices, and by running this software on the computing device or cluster, the computing device or cluster can implement the functions of the voice cloning apparatus described above. The voice cloning apparatus involved in the voice cloning process is described in detail below from the perspective of hardware implementation.
Figure 9 shows a schematic structural diagram of a computing device on which the voice cloning apparatus may be deployed. The computing device may be a computing device in a cloud environment (such as a server), a computing device in an edge environment, or a terminal device, and may specifically be used to implement the functions of the modules of the voice cloning apparatus in the embodiment shown in Figure 3 above.
As shown in Figure 9, the computing device 900 includes a processor 920, a memory 910, a communication interface 930, and a bus 940. The processor 920, the memory 910, and the communication interface 930 communicate through the bus 940. The bus 940 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in Figure 9, but this does not mean that there is only one bus or one type of bus. The communication interface 930 is used for external communication, for example receiving raw data provided by a user and a feature-extraction network model to be trained.
The processor 920 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits. The processor 920 may also be an integrated circuit chip with signal-processing capability. During implementation, the functions of the modules of the voice cloning apparatus may be completed by integrated logic circuits in hardware or by instructions in software form in the processor 920. The processor 920 may also be a general-purpose processor, a digital signal processor (DSP), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may reside in storage media that are mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium resides in the memory 910; the processor 920 reads the information in the memory 910 and, in combination with its hardware, completes some or all of the functions of the voice cloning apparatus.
The memory 910 may include volatile memory, such as random access memory (RAM). The memory 910 may also include non-volatile memory, such as read-only memory (ROM), flash memory, an HDD, or an SSD.
The memory 910 stores executable code, and the processor 920 executes this code to perform the method performed by the voice cloning apparatus described above.
Specifically, in the case of implementing the embodiment shown in Figure 3, and where the data acquisition module 101, the model training module 102, and the voice cloning module 103 described in that embodiment are implemented in software, the software or program code required to perform the functions of the data acquisition module 101, the model training module 102, and the voice cloning module 103 in Figure 3 is stored in the memory 910; the interaction between the data acquisition module 101 and other devices is implemented through the communication interface 930; and the processor executes the instructions in the memory 910 to implement the method performed by the voice cloning apparatus.
Figure 10 shows a schematic structural diagram of a cluster of computing devices. The computing device cluster 10 shown in Figure 10 includes multiple computing devices, and the voice cloning apparatus described above may be deployed in a distributed manner on the multiple computing devices in the cluster 10. As shown in Figure 10, the computing device cluster 10 includes multiple computing devices 1000, and each computing device 1000 includes a memory 1010, a processor 1020, a communication interface 1030, and a bus 1040, where the memory 1010, the processor 1020, and the communication interface 1030 are communicatively connected to one another through the bus 1040.
The processor 1020 may be a CPU, a GPU, an ASIC, or one or more integrated circuits. The processor 1020 may also be an integrated circuit chip with signal-processing capability. During implementation, some of the functions of the voice cloning apparatus may be completed by integrated logic circuits in hardware or by instructions in software form in the processor 1020. The processor 1020 may also be a DSP, an FPGA, a general-purpose processor, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute some of the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may reside in storage media that are mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium resides in the memory 1010; in each computing device 1000, the processor 1020 reads the information in the memory 1010 and, in combination with its hardware, can complete some of the functions of the voice cloning apparatus.
The memory 1010 may include ROM, RAM, static storage devices, dynamic storage devices, hard disks (such as SSDs or HDDs), and so on. The memory 1010 may store program code, for example some or all of the program code used to implement the data acquisition module 101, some or all of the program code used to implement the model training module 102, and some or all of the program code used to implement the voice cloning module 103. For each computing device 1000, when the program code stored in the memory 1010 is executed by the processor 1020, the processor 1020 performs, based on the communication interface 1030, part of the method performed by the voice cloning apparatus; for example, some of the computing devices 1000 may be used to perform the method performed by the data acquisition module 101, others to perform the method performed by the model training module 102, and still others to perform the method performed by the voice cloning module 103. The memory 1010 may also store data, for example intermediate or result data generated by the processor 1020 during execution, such as the target corpus text, the audio, and the voice cloning model described above.
The communication interface 1030 in each computing device 1000 is used for external communication, for example for interacting with other computing devices 1000.
The bus 1040 may be a peripheral component interconnect standard bus, an extended industry standard architecture bus, or the like. For ease of presentation, the bus 1040 in each computing device 1000 in Figure 10 is represented by only one thick line, but this does not mean that there is only one bus or one type of bus.
Communication paths are established between the multiple computing devices 1000 through a communication network to implement the functions of the voice cloning apparatus. Any of the computing devices may be a computing device in a cloud environment (for example, a server), a computing device in an edge environment, or a terminal device.
In addition, an embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on one or more computing devices, cause the one or more computing devices to perform the methods performed by the modules of the voice cloning apparatus in the above embodiment.
In addition, an embodiment of this application further provides a computer program product. When the computer program product is executed by one or more computing devices, the one or more computing devices perform any one of the foregoing voice cloning methods. The computer program product may be a software installation package; when any of the foregoing voice cloning methods needs to be used, the computer program product can be downloaded and executed on a computer.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, the connection relationships between modules indicate that they have communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is the preferred implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the various embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims (29)

  1. A voice cloning method, characterized in that the method comprises:
    determining a target scene;
    determining, according to the target scene, a target corpus text belonging to the target scene;
    determining, according to the target corpus text, audio of a target object, wherein the speech content of the audio matches the content of the target corpus text;
    training, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, wherein the voice cloning model is used to output audio that simulates the target object speaking in the target scene.
  2. The method according to claim 1, characterized in that the content context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  3. The method according to claim 1 or 2, characterized in that determining, according to the target scene, the corpus text belonging to the target scene comprises:
    obtaining the pinyin distribution of a plurality of corpus texts belonging to the target scene;
    selecting the target corpus text from the plurality of corpus texts according to the pinyin distribution of the plurality of corpus texts, wherein the number of target corpus texts is less than the number of the plurality of corpus texts, and the pinyin distribution of the target corpus text and the pinyin distribution of the plurality of corpus texts satisfy a preset condition.
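One possible reading of the selection step in claim 3 is a greedy coverage procedure: keep adding texts until the chosen subset covers the syllable inventory of the whole scene corpus. The sketch below is illustrative only and not part of the claimed disclosure; the per-character pinyin table, the coverage criterion, and the greedy ordering are all assumptions (a real system would use a pinyin conversion library and the claim's unspecified "preset condition").

```python
# Illustrative sketch (not the claimed implementation): select a small subset
# of scene texts whose pinyin distribution still covers the full corpus.
from collections import Counter

# Hypothetical per-character pinyin lookup; real systems use a pinyin library.
PINYIN = {"你": "ni", "好": "hao", "世": "shi", "界": "jie", "新": "xin", "闻": "wen"}

def pinyin_counts(text):
    """Count the pinyin syllables appearing in a text (unknown chars ignored)."""
    return Counter(PINYIN[c] for c in text if c in PINYIN)

def select_subset(corpus, coverage=1.0):
    """Greedily pick texts until the subset covers the required fraction of
    the distinct syllables found in the whole corpus."""
    target = set()
    for t in corpus:
        target |= set(pinyin_counts(t))
    chosen, covered = [], set()
    # Prefer texts contributing the most distinct syllables first.
    for t in sorted(corpus, key=lambda t: -len(set(pinyin_counts(t)))):
        new = set(pinyin_counts(t)) - covered
        if new:
            chosen.append(t)
            covered |= new
        if len(covered) >= coverage * len(target):
            break
    return chosen

corpus = ["你好", "新闻", "你好世界", "世界新闻"]
subset = select_subset(corpus)  # fewer texts, same syllable coverage
```

With this toy corpus, two of the four texts already cover all six syllables, which is the point of the claim: a smaller recording set whose phonetic distribution still matches the scene.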
  4. The method according to any one of claims 1 to 3, characterized in that determining, according to the target scene, the corpus text belonging to the target scene comprises:
    selecting the target corpus text from a plurality of corpus texts, wherein the proportion of technical terms in the target corpus text is greater than a proportion threshold, and the plurality of corpus texts belong to the target scene.
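The proportion-threshold filter in claim 4 can be sketched as a simple ratio test. This is an illustrative interpretation only; the term list, the whitespace tokenizer, and the 0.3 threshold are assumptions not stated in the claim.

```python
# Illustrative sketch (not the claimed implementation): keep only corpus texts
# in which domain-specific terms exceed a proportion threshold.
TERMS = {"equity", "bond", "dividend", "yield"}  # hypothetical finance terms

def term_ratio(text):
    """Fraction of whitespace-separated words that are domain terms."""
    words = text.lower().split()
    return sum(w in TERMS for w in words) / max(len(words), 1)

def select_texts(corpus, threshold=0.3):
    """Return the texts whose term proportion exceeds the threshold."""
    return [t for t in corpus if term_ratio(t) > threshold]

corpus = [
    "the dividend yield on this bond rose",   # 3 of 7 words are terms
    "we went for a walk in the park today",   # no domain terms
]
picked = select_texts(corpus)  # only the term-dense text survives
```

The effect is that recordings made from the selected texts exercise the vocabulary the cloned voice will actually need in that scene.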
  5. The method according to any one of claims 1 to 4, characterized in that determining, according to the target corpus text, the audio of the target object belonging to the target scene comprises:
    generating a recording interface, the recording interface being used to present the target corpus text to the target object;
    recording the target object's pronunciation of the target corpus text to obtain the audio of the target object.
  6. The method according to any one of claims 1 to 4, characterized in that determining, according to the target corpus text, the audio of the target object belonging to the target scene comprises:
    obtaining a plurality of audio recordings of the target object speaking in the target scene;
    determining, from the plurality of audio recordings, audio whose speech content matches the content of the target corpus text.
  7. The method according to any one of claims 1 to 6, characterized in that determining the target scene comprises:
    generating a scene configuration interface, the scene configuration interface being used to present a plurality of candidate scenes to a user;
    determining, from the plurality of candidate scenes, the target scene selected by the user.
  8. The method according to any one of claims 1 to 6, characterized in that determining the target scene comprises:
    generating a scene configuration interface, the scene configuration interface being used to prompt for input of an identifier of a user-defined target scene and corpus text belonging to the target scene;
    obtaining, in response to the user's operation on the scene configuration interface, the identifier of the user-defined target scene and the corpus text belonging to the target scene.
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
    generating a test interface, the test interface being used to prompt a user to input text;
    obtaining, in response to the user's operation on the test interface, target text input by the user;
    inputting the target text into the voice cloning model to obtain audio output by the voice cloning model.
  10. A voice cloning method, characterized in that the method comprises:
    receiving a target scene and target text input by a user;
    determining, according to the target scene, a voice cloning model corresponding to the target scene;
    outputting, based on the voice cloning model, target audio corresponding to the target text, wherein the voice cloning model is used to output audio that simulates a target object speaking in the target scene.
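The inference-side flow of claims 10 to 13 amounts to scene-keyed dispatch: a registry maps each scene to its scene-specific cloning model, and synthesis routes the user's text to the model for the chosen scene. The sketch below is a structural illustration only; the class name, the registry design, and the stand-in callables are assumptions, not the claimed implementation (a real model would be a trained TTS network, not a lambda).

```python
# Illustrative sketch (not the claimed implementation): route synthesis
# requests to the voice cloning model registered for the chosen scene.
class VoiceCloneService:
    def __init__(self):
        self._models = {}  # scene identifier -> model callable

    def register(self, scene, model):
        """Associate a scene with its trained scene-specific model."""
        self._models[scene] = model

    def synthesize(self, scene, text):
        """Look up the model for the scene and synthesize audio for the text."""
        if scene not in self._models:
            raise KeyError(f"no model trained for scene {scene!r}")
        return self._models[scene](text)

svc = VoiceCloneService()
# Stand-in "models": real ones would return waveform data.
svc.register("news", lambda text: f"<news-style audio for: {text}>")
svc.register("story", lambda text: f"<story-style audio for: {text}>")
audio = svc.synthesize("news", "Good evening.")
```

Training a model per scene (claims 1 to 9) and then dispatching on the scene at synthesis time is what lets the same target voice speak in a news register for one request and a storytelling register for another.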
  11. The method according to claim 10, characterized in that the content context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  12. The method according to claim 10 or 11, characterized in that receiving the target scene and the target text input by the user comprises:
    generating a speech synthesis interface, the speech synthesis interface being used to present a plurality of candidate scenes to the user;
    determining, from the plurality of candidate scenes, the target scene selected by the user;
    receiving the target text input by the user on the speech synthesis interface.
  13. The method according to claim 12, characterized in that the speech synthesis interface is further used to present a plurality of candidate objects to the user;
    the method further comprises:
    determining, from the plurality of candidate objects, the target object selected by the user.
  14. A voice cloning apparatus, characterized in that the voice cloning apparatus comprises:
    a data acquisition module, configured to determine a target scene, determine, according to the target scene, a target corpus text belonging to the target scene, and determine, according to the target corpus text, audio of a target object, wherein the speech content of the audio matches the content of the target corpus text;
    a model training module, configured to train, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, wherein the voice cloning model is used to output audio that simulates the target object speaking in the target scene.
  15. The apparatus according to claim 14, characterized in that the context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  16. The apparatus according to claim 14 or 15, characterized in that the data acquisition module is configured to:
    obtain the pinyin distribution of a plurality of corpus texts belonging to the target scene;
    select the target corpus text from the plurality of corpus texts according to the pinyin distribution of the plurality of corpus texts, wherein the number of target corpus texts is less than the number of the plurality of corpus texts, and the pinyin distribution of the target corpus text and the pinyin distribution of the plurality of corpus texts satisfy a preset condition.
  17. The apparatus according to any one of claims 14 to 16, characterized in that the data acquisition module is configured to:
    select the target corpus text from a plurality of corpus texts, wherein the proportion of technical terms in the target corpus text is greater than a proportion threshold, and the plurality of corpus texts belong to the target scene.
  18. The apparatus according to any one of claims 14 to 17, characterized in that the data acquisition module is configured to:
    generate a recording interface, the recording interface being used to present the target corpus text to the target object;
    record the target object's pronunciation of the target corpus text to obtain the audio of the target object.
  19. The apparatus according to any one of claims 14 to 17, characterized in that the data acquisition module is configured to:
    obtain a plurality of audio recordings of the target object speaking in the target scene;
    determine, from the plurality of audio recordings, audio whose speech content matches the content of the target corpus text.
  20. The apparatus according to any one of claims 14 to 19, characterized in that the data acquisition module is configured to:
    generate a scene configuration interface, the scene configuration interface being used to present a plurality of candidate scenes to a user;
    determine, from the plurality of candidate scenes, the target scene selected by the user.
  21. The apparatus according to any one of claims 14 to 19, characterized in that the data acquisition module is configured to:
    generate a scene configuration interface, the scene configuration interface being used to prompt for input of an identifier of a user-defined target scene and corpus text belonging to the target scene;
    obtain, in response to the user's operation on the scene configuration interface, the identifier of the user-defined target scene and the corpus text belonging to the target scene.
  22. The apparatus according to any one of claims 14 to 21, characterized in that the voice cloning apparatus further comprises a voice cloning module, configured to:
    generate a test interface, the test interface being used to prompt a user to input text;
    obtain, in response to the user's operation on the test interface, target text input by the user;
    input the target text into the voice cloning model to obtain audio output by the voice cloning model.
  23. A voice cloning apparatus, characterized in that the voice cloning apparatus comprises:
    a data acquisition module, configured to receive a target scene and target text input by a user;
    a voice cloning module, configured to determine, according to the target scene, a voice cloning model corresponding to the target scene, and output, based on the voice cloning model, target audio corresponding to the target text, wherein the voice cloning model is used to output audio that simulates a target object speaking in the target scene.
  24. The apparatus according to claim 23, characterized in that the context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  25. The apparatus according to claim 23 or 24, characterized in that the data acquisition module is configured to:
    generate a speech synthesis interface, the speech synthesis interface being used to present a plurality of candidate scenes to the user;
    determine, from the plurality of candidate scenes, the target scene selected by the user;
    receive the target text input by the user on the speech synthesis interface.
  26. The apparatus according to claim 25, characterized in that the speech synthesis interface is further used to present a plurality of candidate objects to the user;
    the data acquisition module is further configured to determine, from the plurality of candidate objects, the target object selected by the user.
  27. A computing device cluster, characterized by comprising at least one computing device, each computing device comprising a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the computing device cluster performs the method according to any one of claims 1 to 7.
  28. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9, or cause the at least one computing device to perform the method according to any one of claims 10 to 13.
  29. A computer program product containing instructions, characterized in that, when run on at least one computing device, it causes the at least one computing device to perform the method according to any one of claims 1 to 9, or causes the at least one computing device to perform the method according to any one of claims 10 to 13.
PCT/CN2023/081526 2022-06-29 2023-03-15 Voice cloning method and apparatus, and related device WO2024001307A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210778187.2 2022-06-29
CN202210778187 2022-06-29
CN202211071940.0A CN117373432A (en) 2022-06-29 2022-09-02 Voice cloning method and device and related equipment
CN202211071940.0 2022-09-02

Publications (1)

Publication Number Publication Date
WO2024001307A1

Family

ID=89382602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081526 WO2024001307A1 (en) 2022-06-29 2023-03-15 Voice cloning method and apparatus, and related device

Country Status (1)

Country Link
WO (1) WO2024001307A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN112885326A (en) * 2019-11-29 2021-06-01 阿里巴巴集团控股有限公司 Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN113241056A (en) * 2021-04-26 2021-08-10 标贝(北京)科技有限公司 Method, device, system and medium for training speech synthesis model and speech synthesis
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
JP6799574B2 (en) Method and device for determining satisfaction with voice dialogue
CN107516510B (en) Automatic voice testing method and device for intelligent equipment
JP6786751B2 (en) Voice connection synthesis processing methods and equipment, computer equipment and computer programs
CN106652997B (en) Audio synthesis method and terminal
CN111741326B (en) Video synthesis method, device, equipment and storage medium
US11882319B2 (en) Virtual live video streaming method and apparatus, device, and readable storage medium
CN110473525B (en) Method and device for acquiring voice training sample
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
US10665218B2 (en) Audio data processing method and device
US11511200B2 (en) Game playing method and system based on a multimedia file
KR20210001859A (en) 3d virtual figure mouth shape control method and device
WO2017059694A1 (en) Speech imitation method and device
TWI731382B (en) Method, device and equipment for speech synthesis
CN109389427A (en) Questionnaire method for pushing, device, computer equipment and storage medium
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN104505103B (en) Voice quality assessment equipment, method and system
WO2021227308A1 (en) Video resource generation method and apparatus
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
CN112614478A (en) Audio training data processing method, device, equipment and storage medium
CN113691909A (en) Digital audio workstation with audio processing recommendations
WO2023241360A1 (en) Online class voice interaction methods and apparatus, device and storage medium
WO2024001307A1 (en) Voice cloning method and apparatus, and related device
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN117373432A (en) Voice cloning method and device and related equipment
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829499

Country of ref document: EP

Kind code of ref document: A1