WO2024001307A1 - Voice cloning method and apparatus, and related device - Google Patents

Voice cloning method and apparatus, and related device

Info

Publication number
WO2024001307A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
scene
text
corpus
scenes
Prior art date
Application number
PCT/CN2023/081526
Other languages
English (en)
Chinese (zh)
Inventor
陈飞扬
王喆锋
段新宇
怀宝兴
Original Assignee
华为云计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202211071940.0A (CN117373432A)
Application filed by 华为云计算技术有限公司
Publication of WO2024001307A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/683 Retrieval characterised by using metadata automatically derived from the content
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a voice cloning method, device and related equipment.
  • Voice cloning is a technology that uses the original voice of a target object (such as a person to be cloned) to generate new speech that is similar to the original voice in timbre and other pronunciation characteristics, achieving the effect of cloning the target object's pronunciation. It is widely used in scenarios such as virtual humans, audiobooks, and video creation.
  • However, current voice cloning technology can only clone the timbre of the target object's pronunciation in the generated new speech; it is difficult to match the target object's pronunciation effect in a real scene, resulting in a poor cloning effect.
  • In view of this, embodiments of the present application provide a voice cloning method to improve the voice cloning effect for the target object.
  • This application also provides corresponding devices, computing device clusters, computer-readable storage media, and computer program products.
  • In a first aspect, embodiments of the present application provide a voice cloning method, which can be executed by a voice cloning device.
  • During specific implementation, the voice cloning device determines the target scene (for example, a story scene specified by the user), determines, based on the target scene, the target corpus text belonging to the target scene, and then determines the audio of the target object based on the target corpus text.
  • The speech content of the audio matches the content of the target corpus text, so the voice cloning device uses the target corpus text and the audio of the target object to train a voice cloning model corresponding to the target scene.
  • The voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
  • Because the voice cloning model is trained on audio of the target object pronouncing corpus text in the target scene, the new speech it outputs from text conforms more closely, in timbre, rhythm, and pronunciation style, to the target object's real pronunciation in the target scene, which can effectively improve the voice cloning effect.
  • Moreover, for multiple objects and multiple scenes, the above method can be used to generate voice cloning models that simulate the pronunciation rhythm and style of each object in each scene, so that these voice cloning models improve the authenticity and diversity of voice cloning.
  • After training, the voice cloning device can use the voice cloning model to output the audio corresponding to a piece of text, realizing voice cloning of the target object.
  • the content context of the target corpus text matches the context indicated by the target scene.
  • the target corpus text may be, for example, the corpus text of the story content.
  • The target scene can be any one of a dialogue scene, news scene, financial scene, live broadcast scene, story scene, educational scene, or speech scene; alternatively, the target scene can be a scene divided according to emotion type, such as a sad scene or a happy scene.
  • the target scenario can also be other applicable scenarios.
  • In one possible implementation, when determining the target corpus text, the voice cloning device may first obtain the pinyin distribution of multiple corpus texts belonging to the target scene.
  • The pinyin distribution may be, for example, the distribution of the number of occurrences of each pinyin in the multiple corpus texts.
  • The voice cloning device can then select the target corpus text from the multiple corpus texts according to that pinyin distribution; the number of target corpus texts is less than the number of the multiple corpus texts, and the pinyin distribution of the target corpus texts and the pinyin distribution of the multiple corpus texts satisfy a preset condition, for example, the variance or standard deviation between the two pinyin distributions is less than a threshold.
  • Since the pinyin distribution in each scene can serve as a representative feature of that scene, selecting the target corpus text by pinyin distribution ensures that it retains the corpus characteristics of the scene; training the voice cloning model on such text can improve the model's voice cloning effect.
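  • The publication describes this selection only in prose; as a rough illustration, the following Python sketch shows one way the preset condition could be computed. The to_pinyin callable (for example, built on the pypinyin package), the frequency normalization, and the threshold value are assumptions for illustration, not part of the published method.

```python
from collections import Counter

def pinyin_distribution(texts, to_pinyin):
    """Normalized frequency of each pinyin across a list of corpus texts.

    to_pinyin is an assumed callable mapping a text to its pinyin syllables.
    """
    counts = Counter()
    for text in texts:
        counts.update(to_pinyin(text))
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()}

def distribution_variance(dist_a, dist_b):
    """Variance of the per-pinyin frequency differences between two distributions."""
    keys = set(dist_a) | set(dist_b)
    diffs = [dist_a.get(k, 0.0) - dist_b.get(k, 0.0) for k in keys]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / len(diffs)

def satisfies_preset_condition(selected, corpus, to_pinyin, threshold=1e-4):
    """Preset condition from the text: the two pinyin distributions differ by less than a threshold."""
    return distribution_variance(
        pinyin_distribution(selected, to_pinyin),
        pinyin_distribution(corpus, to_pinyin),
    ) < threshold
```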
  • In one possible implementation, when the voice cloning device determines the corpus text belonging to the target scene, it may select, from multiple corpus texts belonging to the target scene, target corpus text in which the proportion of professional terms is greater than a proportion threshold.
  • In one possible implementation, when the voice cloning device determines the audio of the target object belonging to the target scene based on the target corpus text, it may generate a recording interface that presents the target corpus text to the target object, so that the target object can pronounce the target corpus text presented in the recording interface.
  • The voice cloning device records the target object's pronunciation to obtain the audio of the target object. In this way, the voice cloning device obtains audio by collecting the target object's pronunciation, so that the voice cloning model can subsequently be trained on the obtained audio.
  • In one possible implementation, when the voice cloning device determines the audio of the target object belonging to the target scene based on the target corpus text, it may obtain multiple audios of the target object pronounced in the target scene, and then determine, from the multiple audios, the audio whose speech content matches the content of the target corpus text.
  • For example, the voice cloning device can obtain from the network multiple audios of the target object recorded in public settings (and belonging to the target scene), and determine, through content matching, the target object's audio that matches the target corpus text.
  • In this way, after the user indicates the target scene, the target object no longer needs to interact with the voice cloning device through recording, which simplifies the interactive operations required to implement voice cloning and improves the user experience.
  • In one possible implementation, when determining the target scene, the voice cloning device may generate a scene configuration interface.
  • The scene configuration interface presents multiple candidate scenes to the user for selection, so that the voice cloning device can determine the target scene the user selects from the multiple candidate scenes. In this way, the voice cloning device determines the pronunciation scene for voice cloning according to the user's specification, improving the selectability of voice cloning scenes and the user experience.
  • In another possible implementation, when determining the target scene, the voice cloning device may generate a scene configuration interface that prompts the user to input the identification (such as the name) of a user-defined target scene and the corpus text belonging to that scene.
  • In response to the user's operations on the scene configuration interface, the voice cloning device obtains the identification of the user-defined target scene and the corpus text belonging to the target scene.
  • In this way, the voice cloning device supports user customization of the pronunciation scene for voice cloning, improving the flexibility of voice cloning and the user experience.
  • In one possible implementation, the voice cloning device can also generate a test interface that prompts the user to input text. In response to the user's operation on the test interface, the voice cloning device obtains the target text input by the user and inputs it into the voice cloning model to obtain the audio output by the model.
  • In this way, the user can judge, from the audio output by the voice cloning model, the model's cloning effect for the target object's pronunciation in the target scene; when the cloning effect is poor, it can be further improved through model retraining and other methods.
  • In a second aspect, embodiments of the present application also provide a voice cloning method, which can be executed by a voice cloning device.
  • During specific implementation, the voice cloning device receives the target scene and target text input by the user, for example a story scene and story text. The voice cloning device then determines, according to the target scene, the voice cloning model corresponding to that scene and, based on the model, outputs the target audio corresponding to the target text; the voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
  • In this way, the new speech the voice cloning model outputs for the target text conforms more closely, in timbre, rhythm, and pronunciation style, to the target object's real pronunciation in the target scene, effectively improving the voice cloning effect.
  • the content context of the target corpus text matches the context indicated by the target scene.
  • the target corpus text may be, for example, the corpus text of the story content.
  • The target scene can be any one of a dialogue scene, news scene, financial scene, live broadcast scene, story scene, educational scene, or speech scene; alternatively, the target scene can be a scene divided according to emotion type, such as a sad scene or a happy scene.
  • the target scenario can also be other applicable scenarios.
  • In one possible implementation, when receiving the target scene and target text input by the user, the voice cloning device can generate a speech synthesis interface.
  • The speech synthesis interface presents multiple candidate scenes to the user, so that the voice cloning device can determine the target scene the user selects from the multiple candidate scenes and receive the target text the user inputs on the speech synthesis interface.
  • In this way, the voice cloning device supports the user's customization of scenes and texts, making scenes and texts selectable.
  • the speech synthesis interface presented by the voice cloning device can also be used to present multiple candidate objects to the user, so that the user can select one of the multiple objects as the target object.
  • the voice cloning device can perform voice cloning on the object selected by the user, thereby improving the flexibility and selectability of voice cloning and improving user experience.
  • In a third aspect, embodiments of the present application also provide a voice cloning device, including: a data acquisition module, configured to determine a target scene, determine, according to the target scene, the target corpus text belonging to the target scene, and determine, according to the target corpus text, the audio of the target object, where the speech content of the audio matches the content of the target corpus text; and a model training module, configured to use the target corpus text and the audio to train a voice cloning model corresponding to the target scene, where the voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
  • the context of the target corpus text matches the context indicated by the target scene;
  • The target scene includes any one of the following: a dialogue scene, news scene, financial scene, live broadcast scene, story scene, educational scene, or speech scene; alternatively, the target scene is a scene divided according to emotion type.
  • In one possible implementation, the data acquisition module is configured to: obtain the pinyin distribution of multiple corpus texts belonging to the target scene; and select the target corpus text from the multiple corpus texts according to that pinyin distribution, where the number of target corpus texts is less than the number of the multiple corpus texts, and the pinyin distribution of the target corpus text and the pinyin distribution of the multiple corpus texts satisfy preset conditions.
  • In one possible implementation, the data acquisition module is configured to: select the target corpus text from multiple corpus texts belonging to the target scene, where the proportion of professional terms in the target corpus text is greater than a proportion threshold.
  • In one possible implementation, the data acquisition module is configured to: generate a recording interface that presents the target corpus text to the target object; and record the target object's pronunciation of the target corpus text to obtain the audio of the target object.
  • In one possible implementation, the data acquisition module is configured to: acquire multiple audios produced by the target object in the target scene; and determine, from the multiple audios, the audio whose speech content matches the content of the target corpus text.
  • In one possible implementation, the data acquisition module is configured to: generate a scene configuration interface that presents multiple candidate scenes to the user; and determine, from the multiple candidate scenes, the target scene selected by the user.
  • In one possible implementation, the data acquisition module is configured to: generate a scene configuration interface that prompts the user to input the identifier of a user-defined target scene and the corpus text belonging to the target scene; and, in response to the user's operation on the scene configuration interface, obtain the identifier of the user-defined target scene and the corpus text belonging to the target scene.
  • In one possible implementation, the voice cloning device further includes a voice cloning module, configured to: generate a test interface that prompts the user to input text; in response to the user's operation on the test interface, obtain the target text input by the user; and input the target text into the voice cloning model to obtain the audio output by the model.
  • Since the voice cloning device provided in the third aspect corresponds to the voice cloning method provided in the first aspect, for the technical effects of the third aspect and any of its implementations, refer to the technical effects of the first aspect or its corresponding implementations.
  • In a fourth aspect, embodiments of the present application also provide a voice cloning device.
  • The voice cloning device includes: a data acquisition module, configured to receive the target scene and target text input by the user; and a voice cloning module, configured to determine, according to the target scene, the voice cloning model corresponding to the target scene, and output, based on the voice cloning model, the target audio corresponding to the target text, where the voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
  • the context of the target corpus text matches the context indicated by the target scene;
  • The target scene includes any one of the following: a dialogue scene, news scene, financial scene, live broadcast scene, story scene, educational scene, or speech scene; alternatively, the target scene is a scene divided according to emotion type.
  • In one possible implementation, the data acquisition module is configured to: generate a speech synthesis interface that presents multiple candidate scenes to the user; determine, from the multiple candidate scenes, the target scene selected by the user; and receive the target text input by the user on the speech synthesis interface.
  • In one possible implementation, the speech synthesis interface is also used to present multiple candidate objects to the user; the data acquisition module is further configured to determine, from the multiple candidate objects, the target object selected by the user.
  • Since the voice cloning device provided in the fourth aspect corresponds to the voice cloning method provided in the second aspect, for the technical effects of the fourth aspect and any of its implementations, refer to the technical effects of the second aspect or its corresponding implementations.
  • In a further aspect, the present application provides a computing device, which includes a processor and a memory; the memory is used to store instructions, and the processor executes the instructions stored in the memory, so that the computing device executes the voice cloning method in the first aspect or any possible implementation of the first aspect, or the voice cloning method in the second aspect or any possible implementation of the second aspect.
  • The memory can be integrated into the processor or be independent of the processor.
  • The computing device may also include a bus, through which the processor is connected to the memory.
  • The memory may include read-only memory and random access memory.
  • In a further aspect, the present application provides a computing device cluster.
  • The computing device cluster includes at least one computing device.
  • The at least one computing device includes at least one processor and at least one memory; the at least one memory is used to store instructions.
  • The at least one processor executes the instructions stored in the at least one memory, so that the computing device cluster executes the voice cloning method in the above first aspect or any possible implementation of the first aspect, or the voice cloning method in the above second aspect or any possible implementation of the second aspect.
  • The memory can be integrated into the processor or be independent of the processor.
  • The at least one computing device may also include a bus, through which the processor is connected to the memory.
  • The memory may include read-only memory and random access memory.
  • In a further aspect, the present application provides a computer-readable storage medium storing instructions that, when run on at least one computing device, cause the at least one computing device to execute the method in the above first aspect or any implementation of the first aspect, or the voice cloning method in the above second aspect or any possible implementation of the second aspect.
  • In a further aspect, the present application provides a computer program product containing instructions that, when run on at least one computing device, cause the at least one computing device to execute the method in the above first aspect or any implementation of the first aspect, or the method in the above second aspect or any implementation of the second aspect.
  • Figure 1 is a schematic diagram of an exemplary application scenario provided by this application.
  • Figure 2 is a schematic diagram of another exemplary application scenario provided by this application.
  • Figure 3 is a schematic flow chart of a voice cloning method provided by this application.
  • Figure 4 is a schematic diagram of a scene configuration interface provided by this application.
  • Figure 5 is a schematic diagram of another scene configuration interface provided by this application.
  • Figure 6 is a schematic diagram of the pinyin distribution corresponding to the corpus text in news scenarios and financial scenarios provided by this application.
  • Figure 7 is a schematic diagram of a recording interface provided by this application.
  • Figure 8 is a schematic diagram of a test interface provided by this application.
  • Figure 9 is a schematic structural diagram of a computing device provided by this application.
  • Figure 10 is a schematic structural diagram of a computing device cluster provided by this application.
  • Currently, general corpus text and the target object's recorded audio of that corpus text are used to train a speech cloning model.
  • In this way, the speech cloning model can learn the timbre of the target object's pronunciation and, based on newly provided text, generate and output speech matching that timbre, realizing speech cloning of the target object.
  • the target object refers to an object that can pronounce words, such as human beings.
  • rhythm and style can reflect the characteristics of the target object's pronunciation.
  • Rhythm can include features such as pronunciation intonation, temporal distribution, and stress.
  • Style can include characteristics such as the speaking speed of the target object.
  • In view of this, embodiments of the present application provide a voice cloning method, which can be executed by a voice cloning device and is used to improve the voice cloning effect for the target object.
  • During specific implementation, the voice cloning device first determines the target scene in which the target object to be cloned pronounces, obtains the target corpus text belonging to the target scene, and further determines the audio of the target object based on the target corpus text.
  • The speech content of the target object's audio matches the content of the target corpus text.
  • For example, the audio may be obtained by recording the target object pronouncing the target corpus text. The voice cloning device then uses the target corpus text and the audio to train a voice cloning model for outputting audio that simulates the target object's pronunciation in the target scene, realizing voice cloning of the target object's pronunciation in the target scene.
  • Because the voice cloning model is trained on audio of the target object pronouncing corpus text in the target scene, the new speech it outputs from text conforms more closely, in timbre, rhythm, pronunciation style, and other characteristics, to the target object's real pronunciation in the target scene, which can effectively improve the voice cloning effect.
  • Moreover, for different scenes, the above method can be used to clone the target object's speech in each scene, so that the different rhythms and styles of the target object's pronunciation in different scenes are captured, improving the authenticity and diversity of voice cloning.
  • The voice cloning device can also apply this method to each of multiple objects, cloning each object's voice in various scenes and thereby improving the flexibility and richness of voice cloning.
  • the above voice cloning device can be deployed in the cloud to provide users with voice cloning cloud services.
  • the voice cloning device 100 can be deployed in the cloud, for example, it can be implemented by a cloud computing device or a computing device cluster.
  • The voice cloning device 100 can provide an external client 200 for interaction with the user 300, such as receiving scene information, text, or audio data input by the user 300, or feeding cloned audio back to the user 300.
  • the client 200 may be, for example, an application program running on the user-side device, or may be a web browser provided externally by the voice cloning device 100, or the like.
  • the voice cloning device 100 may include a data acquisition module 101 and a model training module 102.
  • The data acquisition module 101 is used to determine the target scene (for example, a scene selected by the user 300 or a scene customized by the user 300), obtain the target corpus text and the audio of the target object belonging to the target scene, and provide the target corpus text and audio to the model training module 102.
  • the model training module 102 is used to use the target corpus text and the audio of the target object to train a speech cloning model corresponding to the target scene.
  • Further, the voice cloning device 100 can also include a voice cloning module 103, in which case the model training module 102 can provide the voice cloning model to the voice cloning module 103. The voice cloning module 103 is used to use the voice cloning model to output the audio corresponding to the target text.
  • That audio is the audio that simulates the target object's pronunciation in the target scene, where the target text may be preconfigured text or text newly provided by the user 300.
  • the voice cloning module 103 can also send the audio corresponding to the target text to the client 200, so that the client 200 plays the audio to the user 300.
  • the above voice cloning device can be deployed locally, so that local voice cloning services can be provided for users.
  • For example, the above-mentioned voice cloning device can be a local terminal 400, so that the user 300 can input the target scene, the target corpus text, and the audio of the target object to the terminal 400; the terminal 400 uses the target corpus text and audio to train a voice cloning model corresponding to the target scene, uses the model to output the audio corresponding to the target text, and plays the audio to the user 300.
  • the above voice cloning device can be implemented by software or can be implemented by hardware.
  • the voice cloning device may include code running on a computing instance.
  • the computing instance may include at least one of a physical host (computing device), a virtual machine, and a container.
  • the above computing instance may be one or more.
  • For example, a voice cloning device may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions, and can be distributed in the same availability zone (AZ) or in different AZs. Each AZ includes one data center or multiple geographically close data centers. Usually, a region can include multiple AZs.
  • the multiple hosts/VMs/containers used to run the code can be distributed in the same virtual private cloud (VPC), or across multiple VPCs.
  • Communication between two VPCs in the same region, or between VPCs in different regions, requires a communication gateway configured in each VPC; interconnection between the VPCs is realized through these communication gateways.
  • the voice cloning device is an example of a hardware functional unit.
  • the voice cloning device may include at least one computing device, such as a server.
  • the voice cloning device can also be implemented using an application-specific integrated circuit (ASIC) or a programmable logic device (PLD).
  • The above-mentioned PLD can be implemented by a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
  • Multiple computing devices included in the voice cloning device may be distributed in the same region or in different regions. Multiple computing devices included in the voice cloning device may be distributed in the same AZ or in different AZs. Similarly, multiple computing devices included in the voice cloning device may be distributed in the same VPC or in multiple VPCs.
  • the plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
  • FIG. 3 is a schematic flow chart of a voice cloning method in an embodiment of the present application.
  • This method can be applied to the application scenarios shown in Figure 1 or Figure 2 above, or can also be applied to other applicable application scenarios.
  • the following description takes the application scenario shown in Figure 1 as an example.
  • During actual application, the voice cloning device 100 can be used to generate voice cloning models that clone the pronunciation of one or more objects in various scenes.
  • Below, generating a voice cloning model for the audio produced by one object in one scene is taken as an example to illustrate the implementation process; the process by which the voice cloning device 100 generates voice cloning models simulating the speech of other objects in various other scenes can be understood by analogy with the embodiment shown in Figure 3.
  • the voice cloning method shown in Figure 3 may specifically include:
  • the data acquisition module 101 determines the target scene.
  • Considering that the rhythm and style of the target object's pronunciation may differ across scenes, when cloning the pronunciation of the target object, the scene to which the pronunciation to be cloned belongs can first be determined; it is hereinafter referred to as the target scene.
  • the scenes to which the target object's pronunciation belongs can be divided according to the pronunciation environment in actual application scenarios. For example, it can be divided into multiple scenes such as dialogue scenes, news scenes, financial scenes, live broadcast scenes, story scenes, education scenes, etc.
  • The target scene is one of these scenes.
  • the scene to which the target object's pronunciation belongs can also be divided according to the type of the character's emotion. For example, it can be divided into happy scenes, sad scenes, worship scenes, calm scenes, dull scenes, etc. according to the different emotions of the characters.
  • other methods may be used to divide the scene into multiple different scenarios, which is not limited in this embodiment.
  • the target scene can also be a user-defined scene, for example, the user can customize a bedtime story scene, a speech scene, etc.
  • the data acquisition module 101 can generate a scene configuration interface and send the scene configuration interface to the client 200 so that the client 200 can present it to the user 300 .
  • the scene configuration interface presented by the client 200 may include multiple candidate scenes, for example, it may be a dialogue scene, a news scene, a financial scene, a live broadcast scene, a story scene, an education scene, a speech scene, etc. as shown in Figure 4. Multiple candidate scenarios can be configured in advance by technicians.
  • the user 300 can select a scene from multiple candidate scenes presented on the client 200, such as selecting a story scene, etc., so as to designate the voice cloning device 100 to perform voice cloning based on the scene.
  • the client 200 can feed back the scene selected by the user to the data acquisition module 101, so that the data acquisition module 101 determines it as the target scene.
  • In other embodiments, the voice cloning device 100 can also support scenes customized by the user 300.
  • For example, the data acquisition module 101 can also generate the scene configuration interface shown in Figure 5 and present it to the user 300 through the client 200.
  • The user 300 can enter the name of the customized scene (or other information used to identify the scene) in the scene configuration interface; accordingly, the data acquisition module 101 can create a new scene according to the name input by the user and determine it as the target scene.
  • The above-described ways for the data acquisition module 101 to determine the target scene are only illustrative. In actual application, the data acquisition module 101 can also determine the target scene through other methods, which is not limited in this embodiment.
  • The data acquisition module 101 acquires, according to the target scene, the target corpus text belonging to the target scene.
  • After determining the target scene, the data acquisition module 101 can further acquire the target corpus text required to implement voice cloning.
  • During specific implementation, before performing voice cloning, the data acquisition module 101 can be configured in advance with a corpus for each of multiple candidate scenes, where each corpus stores multiple corpus texts belonging to the same candidate scene, and corpus texts stored in different corpora belong to different candidate scenes.
  • The context of the content of the corpus texts stored in each corpus matches the context indicated by the corresponding candidate scene.
  • For example, for a speech scene, the stored corpus texts may be multiple different speech scripts.
  • In this way, after determining the target scene, the data acquisition module 101 can access the corpus corresponding to the target scene and filter out part of the corpus text from the corpus as the target corpus text for training the voice cloning model.
  • the data acquisition module 101 can filter out the target corpus text from the corpus according to pinyin distribution.
  • During specific implementation, in addition to the multiple corpus texts, the corpus corresponding to the target scene also stores the pinyin distribution of the Chinese characters included in those corpus texts, that is, the distribution of the number of occurrences of each Chinese character's pinyin in the corpus, hereinafter referred to as the first pinyin distribution.
  • The data acquisition module 101 can filter a preset number of corpus texts (such as 30, 50, or 100) from the corpus, add them to a corpus text collection, and count the pinyin distribution corresponding to the corpus texts in that collection, hereinafter referred to as the second pinyin distribution.
  • In this way, the data acquisition module 101 can calculate the variance (or standard deviation, etc.) between the first pinyin distribution and the second pinyin distribution.
  • The pinyin distributions corresponding to corpus texts in different scenes generally differ. For example, for 500 corpus texts in a news scene and 500 corpus texts in a financial scene, the distribution of the 10 most frequent pinyins can be as shown in Figure 6. Therefore, the pinyin distribution corresponding to the corpus text in each scene can be used as a feature indicating the characteristics of that scene's corpus.
  • Correspondingly, multiple corpus texts whose pinyin distribution is close to that of the whole corpus can be selected as target corpus texts to retain the text content characteristics of the scene.
  • When the variance (or standard deviation, etc.) is less than a preset threshold, the data acquisition module 101 may determine the corpus texts in the corpus text collection as the target corpus texts for training the voice cloning model.
  • Otherwise, the data acquisition module 101 can determine, according to the first pinyin distribution, the target pinyins whose proportion in the second pinyin distribution is excessive, delete from the corpus text collection one or more corpus texts in which those target pinyins recur most, and then randomly select one or more of the remaining corpus texts in the corpus and add them to the collection.
  • The data acquisition module 101 then recalculates whether the variance (or standard deviation, etc.) between the pinyin distribution corresponding to the corpus text collection and the first pinyin distribution is less than the preset threshold. If so, the corpus texts in the current collection are determined as the target corpus text; if not, the above steps are repeated to update the collection until the variance (or standard deviation, etc.) between the collection's pinyin distribution and the first pinyin distribution is less than the preset threshold.
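  • Under the same assumptions as the earlier sketch, this iterative procedure might be organized as follows; the initial collection size, the round limit, and the replacement heuristic are illustrative guesses, and pinyin_distribution / distribution_variance are reused from the earlier sketch.

```python
import random

def select_target_corpus(corpus, to_pinyin, preset_count=50, threshold=1e-4,
                         max_rounds=1000):
    """Iteratively refine a corpus text collection until its pinyin
    distribution is close to the first pinyin distribution of the corpus."""
    first_dist = pinyin_distribution(corpus, to_pinyin)   # first pinyin distribution
    selected = random.sample(corpus, preset_count)        # initial corpus text collection

    for _ in range(max_rounds):
        second_dist = pinyin_distribution(selected, to_pinyin)
        if distribution_variance(first_dist, second_dist) < threshold:
            return selected                               # condition met: target corpus text

        # Find the pinyin most over-represented in the collection ...
        over = max(second_dist, key=lambda p: second_dist[p] - first_dist.get(p, 0.0))
        # ... delete the selected text in which it recurs most ...
        worst = max(selected, key=lambda t: to_pinyin(t).count(over))
        selected.remove(worst)
        # ... and add a random replacement from the remaining corpus texts.
        remaining = [t for t in corpus if t not in selected]
        selected.append(random.choice(remaining))

    return selected  # best-effort result if the threshold is never reached
```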
  • the data acquisition module 101 can filter out the target corpus text from the corpus according to the proportion of professional terms.
  • Among them, professional terms refer to the unified names for specific things in specific fields, such as complex programmable logic device (CPLD) in the computer field.
  • multiple corpus texts stored in the corpus corresponding to the target scene may carry identifiers (or labels) of professional terms included in each corpus text.
  • In this way, the data acquisition module 101 can first randomly filter a preset number of corpus texts from the corpus, add them to a corpus text collection, and determine, based on the professional-term identifiers carried by these corpus texts, the proportion of the number of professional terms in the collection relative to the number of all words in the collection.
  • When the proportion is greater than or equal to a preset proportion threshold, the data acquisition module 101 can determine the corpus texts in the collection as the target corpus text; when the proportion is less than the threshold, the data acquisition module 101 can delete from the collection some corpus texts containing few professional terms, or some corpus texts with a high repetition rate, and then randomly select one or more of the remaining corpus texts in the corpus and add them to the collection.
  • the data acquisition module 101 can recalculate the proportion of the number of professional terms in the corpus text collection to the number of all words in the corpus text collection.
  • If the recalculated proportion is greater than or equal to the preset proportion threshold, the data acquisition module 101 can determine the corpus texts in the current collection as the target corpus text; if it is still less than the threshold, the data acquisition module 101 repeats the above steps to update the collection until the proportion of professional terms relative to all words in the collection is greater than or equal to the preset proportion threshold.
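  • As with the pinyin criterion, a minimal sketch of this professional-term criterion might look like the following; whitespace tokenization, the collection size, and the threshold value are assumptions made for illustration (real Chinese text would need a proper word segmenter).

```python
import random

def term_ratio(texts, terms):
    """Share of professional-term tokens among all tokens in a text collection."""
    tokens = [tok for text in texts for tok in text.split()]
    hits = sum(1 for tok in tokens if tok in terms)
    return hits / len(tokens) if tokens else 0.0

def select_by_term_ratio(corpus, terms, preset_count=50, ratio_threshold=0.05,
                         max_rounds=1000):
    """Iteratively refine the collection until professional terms are frequent enough."""
    selected = random.sample(corpus, preset_count)
    for _ in range(max_rounds):
        if term_ratio(selected, terms) >= ratio_threshold:
            return selected
        # Replace the selected text that contributes the fewest professional terms.
        worst = min(selected, key=lambda t: term_ratio([t], terms))
        selected.remove(worst)
        remaining = [t for t in corpus if t not in selected]
        selected.append(random.choice(remaining))
    return selected
```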
  • In a further implementation example, the data acquisition module 101 can combine the above pinyin distribution and professional-term proportion to filter the target corpus text from the corpus: in the filtered target corpus text, the variance between its pinyin distribution and the corpus's pinyin distribution is less than or equal to the preset threshold, and the proportion of professional terms relative to all words in the target corpus text is greater than or equal to the proportion threshold.
  • The above ways in which the data acquisition module 101 filters the target corpus text from the corpus are only exemplary; in actual application, the data acquisition module 101 can also filter the target corpus text through other methods, which is not limited in this embodiment.
  • In other embodiments, such as embodiments in which the target scene is customized by the user 300, the data acquisition module 101 can determine, from the corpus text uploaded by the user 300, the target corpus text suitable for training the voice cloning model for that scene.
  • During specific implementation, when the data acquisition module 101 presents the scene configuration interface through the client 200, in addition to prompting the user 300 to enter the name of the customized scene, it can also prompt the user 300 to upload corpus text on the scene configuration interface, as shown in Figure 5.
  • The user 300 can import corpus text on the scene configuration interface, or enter there the path, file name, or network address of the corpus text, so that the data acquisition module 101 can access and obtain the corpus text according to the information input by the user 300.
  • the scenario configuration interface shown in Figure 5 can also prompt the user 300 to input professional terms in the customized scenario.
  • the data acquisition module 101 can determine the target corpus text from the corpus text uploaded by the user 300 .
  • The data acquisition module 101 can refer to the aforementioned embodiments to determine the target corpus text from multiple corpus texts based on pinyin distribution or professional terminology, which will not be reiterated here. When the number of corpus texts uploaded by the user 300 is small, for example not exceeding the above-mentioned preset number, the data acquisition module 101 can determine all corpus texts uploaded by the user 300 as target corpus texts; this embodiment does not limit this.
  • the data acquisition module 101 may also acquire the target corpus text through other methods, which is not limited in this embodiment.
  • the data acquisition module 101 determines the audio of the target object based on the target corpus text, and the voice content of the audio matches the content of the target corpus text.
  • the target object may be the user 300, for example, or the target object may be other objects besides the user 300, such as public figures.
  • After determining the target corpus text, the data acquisition module 101 can further obtain the audio of the target object.
  • The speech content of the audio of the target object matches the content of the target corpus text; for example, the speech content of the audio is the same as the content of the target corpus text.
  • In one implementation example, the data acquisition module 101 can generate a recording interface that includes the determined target corpus text and present it through the client 200. Furthermore, the recording interface can also present the pinyin and tone information corresponding to the target corpus text, which can be manually annotated on the target corpus text by technical personnel in advance.
  • For example, the presented target corpus text can be a text belonging to a financial scene: "Is the trend of real estate prices this year rising or falling?" (今年房地产价格走势是涨是落), with the presented pinyin and tone information being "jin1 nian2 fang2 di4 chan3 jia4 ge2 zou3 shi4 shi4 zhang3 shi4 luo4". Here, "jin" in "jin1" is the pinyin of the first character of the target corpus text, and the "1" indicates that its pronunciation tone is the first tone; similarly, "nian" in "nian2" is the pinyin of the second character, and the "2" indicates that its pronunciation tone is the second tone.
  • the user 300 can pronounce according to the target corpus text (and the corresponding pinyin and tones) presented in the recording interface.
  • the data acquisition module 101 can use the client 200 to record the pronunciation of the user 300 to obtain the audio of the user 300, that is, the audio of the target object.
  • Further, the data acquisition module 101 can also perform noise detection on the recorded audio and calculate its signal-to-noise ratio.
  • If the signal-to-noise ratio is below a noise threshold, the audio is subject to excessive noise interference.
  • In this case, the data acquisition module 101 can discard the recording and prompt the user 300 to re-record the target corpus text until the signal-to-noise ratio of the obtained audio reaches the threshold.
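  • The publication does not specify how the signal-to-noise ratio is computed; one rough, commonly used estimate is shown below. Treating the leading samples as the noise floor and the 20 dB acceptance threshold are assumptions for illustration.

```python
import numpy as np

def snr_db(samples: np.ndarray, noise: np.ndarray) -> float:
    """Rough SNR estimate in dB for a mono recording."""
    signal_power = float(np.mean(np.square(samples)))
    noise_power = float(np.mean(np.square(noise))) + 1e-12  # avoid divide-by-zero
    return 10.0 * np.log10(signal_power / noise_power)

# Hypothetical acceptance check: take the first 0.1 s (at 16 kHz) as the noise
# floor and prompt the user to re-record if the estimate falls below 20 dB.
def recording_acceptable(recording: np.ndarray, sample_rate: int = 16000) -> bool:
    return snr_db(recording, recording[: sample_rate // 10]) >= 20.0
```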
  • Moreover, the data acquisition module 101 can also verify whether the speech content in the recorded audio matches the target corpus text, for example by checking whether the speech content is consistent with the content of the target corpus text, or whether the user 300's pronunciation accuracy rate reaches a threshold. If so, the data acquisition module 101 determines that the speech content of the audio matches the target corpus text; if not, it can prompt the user 300 to re-record the target corpus text.
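  • A simple way to realize this verification, sketched under assumptions: the transcript is produced by some ASR system (not named in the publication), and the match is scored with Python's standard difflib; the 0.95 threshold is illustrative. The same check could also match publicly obtained audio of the target object against the target corpus text, as described below.

```python
import difflib

def content_matches(transcript: str, corpus_text: str,
                    accuracy_threshold: float = 0.95) -> bool:
    """Compare an ASR transcript of the recording against the corpus text."""
    ratio = difflib.SequenceMatcher(None, transcript, corpus_text).ratio()
    return ratio >= accuracy_threshold
```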
  • the data acquisition module 101 can acquire multiple pieces of audio of the target object in the target scene.
  • the data acquisition module 101 can obtain the speech audio recorded by the target object in various public speech scenes, etc.
  • Among them, the target object can be specified in advance by the user 300.
  • For example, the scene configuration interface can present multiple different objects, including object 1 to object 4, so that the user 300 can select one of them as the target object, instructing the voice cloning device 100 to perform voice cloning on that object.
  • the data acquisition module 101 may acquire multiple audio segments of the target object specified by the user 300 from the database or the network. Then, the data acquisition module 101 can content-match the target corpus text with the acquired multiple audio segments of the target object, thereby determining the audio that matches the target corpus text from the multiple audio segments.
  • After determining the target corpus text and the audio of the target object, the data acquisition module 101 can forward them to the model training module 102.
  • the model training module 102 uses the target corpus text and the audio of the target object to train a speech clone model corresponding to the target scene, where the speech clone model is used to output audio that simulates the target object's pronunciation in the target scene.
  • the speech cloning model can be constructed based on, for example, the PortaSpeech model, or the Tacotron model, or the FastSpeech model, or it can be constructed based on other speech synthesis models, which is not limited in this embodiment.
  • the speech cloning model can learn the timbre, rhythm and style of the target object's pronunciation in the target scene.
  • During specific implementation, the model training module 102 can first obtain general corpus text (that is, text whose scene is not distinguished) and the corresponding audio, and preliminarily train the voice cloning model. When the termination condition of the preliminary training is met, the voice cloning model can output corresponding audio according to input text, realizing the basic function of speech synthesis. Then, the model training module 102 further trains the voice cloning model using the target corpus text and the audio of the target object until the training termination condition is met.
  • In this way, the finally trained voice cloning model can better clone the timbre, rhythm, and style of the target object's pronunciation in the target scene.
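  • The two-stage training described above (preliminary training on general corpus, then continued training on the target object's target-scene data) might be organized as in the following PyTorch-flavored sketch. The model interface, the L1 mel-spectrogram loss, the step counts, and the learning rate are all assumptions; the publication only names PortaSpeech, Tacotron, and FastSpeech as possible backbones.

```python
import torch

def train_stage(model, loader, optimizer, steps):
    """One training stage: predict a mel spectrogram from text and fit the reference."""
    model.train()
    for _, (text_ids, ref_mel) in zip(range(steps), loader):
        optimizer.zero_grad()
        pred_mel = model(text_ids)          # assumed model interface: text -> mel
        loss = torch.nn.functional.l1_loss(pred_mel, ref_mel)
        loss.backward()
        optimizer.step()

def train_voice_clone(model, general_loader, target_loader):
    """Stage 1: general (scene-agnostic) corpus; stage 2: target object in target scene."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    train_stage(model, general_loader, optimizer, steps=100_000)  # basic speech synthesis
    train_stage(model, target_loader, optimizer, steps=5_000)     # scene-specific timbre/rhythm/style
    return model
```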
  • After the voice cloning model is trained, the model training module 102 can send it to the voice cloning module 103, so that the voice cloning module 103 can use it to output audio that simulates the target object's pronunciation, realizing voice cloning of the target object.
  • Accordingly, the method may further include:
  • the speech cloning module 103 uses the speech cloning model to output the audio corresponding to the target text.
  • the target text may be a test text used to present the cloning effect of the speech cloning model to the user 300, or may be a text pre-specified by the user that requires synthesized speech.
  • For example, the voice cloning module 103 can input fixed, preconfigured test text into the voice cloning model, and the model outputs the corresponding audio, that is, audio simulating the target object pronouncing the test text in the target scene.
  • The voice cloning module 103 can then send the audio to the client 200, and the client 200 plays it to the user 300, so that the user 300 can perceive, from the played audio, the voice cloning model's cloning effect for the target object's pronunciation in the target scene.
  • the target text when the target text is a test text, the target text can be provided by the user 300, then the voice cloning module 103 can generate a test interface, for example, the test interface as shown in Figure 8, and The test interface is presented to the user 300 through the client to prompt the user 300 to input test text.
  • the voice cloning module 103 can obtain the test text input by the user 300 in response to the user's operation on the test interface, and input the test text into the voice cloning model to obtain the audio output by the voice cloning model.
  • The voice cloning module 103 can send the audio to the client 200, and the client 200 plays it to the user 300, so that the user 300 can perceive, from the played audio, the voice cloning model's cloning effect for the target object's pronunciation in the target scene.
  • In other embodiments, the target text is text pre-specified by the user 300 that needs to be synthesized into speech.
  • For example, the user 300 can pre-specify the name or text of a story, so that the voice cloning module 103 can input the text specified by the user 300 (such as the story text) into the voice cloning model and obtain the corresponding audio the model outputs for that text.
  • Then, the voice cloning module 103 can send the audio to the client 200, and the client 200 plays it to the user 300 to meet the user 300's need for voice cloning of the text.
  • In this way, the user 300 can hear audio that simulates the target object telling the story.
  • Alternatively, the target text may be text input by the user 300 that needs to be synthesized into speech.
  • During specific implementation, the voice cloning module 103 can generate a speech synthesis interface and present it to the user 300 through the client 200. The voice cloning module 103 can then receive, through the client 200, the target text input by the user 300 that needs to be synthesized, and input it into the voice cloning model to obtain the audio the model outputs for the target text. The voice cloning module 103 can then send the audio to the client 200, and the client 200 plays it to the user 300 to meet the user 300's need for voice cloning of the target text.
  • The above exemplifies the process by which the voice cloning device 100 generates audio that clones the target object's pronunciation in the target scene.
  • In other embodiments, for multiple scenes, the voice cloning device 100 can, in a similar manner, train a voice cloning model corresponding to each scene, and each scene's voice cloning model generates audio for cloning the target object's pronunciation in that scene.
  • Alternatively, for multiple objects and multiple scenes, a voice cloning model corresponding to each object in each scene can be trained in a similar manner. In this way, the voice cloning device 100 can train multiple different voice cloning models for different scenes and different objects, supporting user selection of the pronunciation scene and the object to be cloned and thereby improving the flexibility and richness of voice cloning.
  • Correspondingly, the voice cloning device 100 can use the voice cloning model corresponding to the scene and object specified by the user 300 to generate the corresponding audio and feed it back to the user 300.
  • During specific implementation, the voice cloning module 103 can generate a speech synthesis interface that presents multiple candidate scenes and multiple candidate objects to the user, so that the user can select one of the candidate scenes and one of the candidate objects on the speech synthesis interface.
  • In this way, the voice cloning module 103 can determine the candidate scene selected by the user as the target scene, determine the candidate object selected by the user as the target object, and further determine the voice cloning model corresponding to the target scene that simulates the target object's pronunciation. Then, the voice cloning module 103 can use the determined voice cloning model to synthesize, from preconfigured target text or target text input by the user on the speech synthesis interface, audio that simulates the target object's pronunciation in the target scene.
  • Alternatively, the speech synthesis interface generated by the voice cloning module 103 may only support the user in selecting one scene from multiple candidate scenes as the target scene, or may only support the user in selecting one object from multiple candidate objects as the target object; this is not limited in this embodiment.
  • The voice cloning device involved in the above voice cloning process (including the data acquisition module 101, the model training module 102, and the voice cloning module 103) may be implemented as software configured on a computing device or computing device cluster; by running the software, the computing device or computing device cluster realizes the functions of the voice cloning device.
  • In the following, the computing device and computing device cluster involved in the voice cloning process are introduced in detail.
  • Figure 9 shows a schematic structural diagram of a computing device.
  • the above-mentioned voice cloning device can be deployed on the computing device.
  • The computing device can be a computing device (such as a server) in a cloud environment, a computing device in an edge environment, or a terminal device, and can be specifically used to implement the functions of the data acquisition module 101, the model training module 102, and the voice cloning module 103 in the embodiment shown in Figure 3.
  • computing device 900 includes processor 920 , memory 910 , communication interface 930 , and bus 940 .
  • the processor 920, the memory 910 and the communication interface 930 communicate through the bus 940.
  • the bus 940 may be a peripheral component interconnect (PCI) bus or an extended industry standard architecture (EISA) bus, etc.
  • The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in Figure 9, but this does not mean that there is only one bus or one type of bus.
  • the communication interface 930 is used to communicate with the outside, such as receiving original data provided by the user and the feature extraction network model to be trained.
  • the processor 920 may be a central processing unit (CPU), an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits.
  • the processor 920 may also be an integrated circuit chip with signal processing capabilities.
  • During implementation, the functions of each module in the voice cloning device can be completed by integrated logic circuits in hardware or by instructions in the form of software in the processor 920.
  • The processor 920 may also be a general-purpose processor, a digital signal processor (DSP), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • The general-purpose processor may be a microprocessor, or any conventional processor.
  • The steps of the method disclosed in combination with the embodiments of the present application can be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 910.
  • the processor 920 reads the information in the memory 910 and completes some or all functions of the voice cloning device in combination with its hardware.
  • The memory 910 may include volatile memory, such as random access memory (RAM); the memory 910 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD).
  • The memory 910 stores executable code, and the processor 920 executes the executable code to perform the method performed by the aforementioned voice cloning device (a sketch of this single-device deployment follows).
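  • Purely as an illustration of running all three modules on one computing device, here is a minimal single-process sketch; the class names mirror the module numbering above but are otherwise hypothetical:

        # Hypothetical single-device wiring: data acquisition (cf. module 101)
        # feeds model training (cf. module 102), whose output is used by voice
        # cloning (cf. module 103). All method bodies are placeholders.

        class DataAcquisitionModule:
            def acquire(self, target_scene: str) -> dict:
                return {"texts": [], "audios": []}

        class ModelTrainingModule:
            def train(self, data: dict) -> object:
                return object()          # trained voice cloning model

        class VoiceCloningModule:
            def clone(self, model: object, target_text: str) -> bytes:
                return b""               # synthesized audio

        def run_voice_cloning_device(target_scene: str, target_text: str) -> bytes:
            data = DataAcquisitionModule().acquire(target_scene)
            model = ModelTrainingModule().train(data)
            return VoiceCloningModule().clone(model, target_text)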
  • FIG. 10 shows a schematic structural diagram of a computing device cluster.
  • the computing device cluster 10 shown in FIG. 10 includes multiple computing devices, and the above voice cloning device can be deployed on multiple computing devices in the computing device cluster 10 in a distributed manner.
  • As shown in FIG. 10, the computing device cluster 10 includes multiple computing devices 1000.
  • Each computing device 1000 includes a memory 1010, a processor 1020, a communication interface 1030, and a bus 1040.
  • The memory 1010, the processor 1020, and the communication interface 1030 are communicatively connected to one another through the bus 1040.
  • Processor 1020 may employ a CPU, GPU, ASIC, or one or more integrated circuits.
  • The processor 1020 may also be an integrated circuit chip with signal processing capabilities. During implementation, part of the functions of the voice cloning device can be completed by integrated logic circuits in the hardware of the processor 1020 or by instructions in the form of software.
  • the processor 1020 can also be a DSP, FPGA, general-purpose processor, other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and can implement or execute some of the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • The general-purpose processor may be a microprocessor, or any conventional processor.
  • The steps of the method disclosed in conjunction with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • The software module can be located in a storage medium mature in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1010.
  • the processor 1020 reads the information in the memory 1010, and combined with its hardware, can complete part of the functions of the voice cloning device.
  • the memory 1010 may include ROM, RAM, static storage devices, dynamic storage devices, hard disks (eg, SSD, HDD), etc.
  • The memory 1010 may store program code, for example, part or all of the program code used to implement the data acquisition module 101, part or all of the program code used to implement the model training module 102, part or all of the program code used to implement the voice cloning module 103, and so on.
  • In other words, the processor 1020 of each computing device 1000 can execute part of the method performed by the voice cloning device and interact with other computing devices through the communication interface 1030; that is, each computing device 1000 may be used to execute part of the above method.
  • The memory 1010 can also store data, for example, intermediate data or result data generated by the processor 1020 during execution, such as the above-mentioned target corpus text, audio, and voice cloning model.
  • The communication interface 1030 in each computing device 1000 is used to communicate with the outside, for example, to interact with other computing devices 1000 (an illustrative sketch of such a distributed deployment follows).
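  • As a loose illustration of distributing the modules across the cluster, the following sketch assigns each module's code to a different device; the role assignment and the message-passing helper are assumptions made for this sketch, not part of this application:

        # Hypothetical sketch: each computing device runs the program code of a
        # different module, and devices exchange intermediate results (corpus
        # texts, audio, trained models) through their communication interfaces.

        ROLE_MAP = {
            "device_0": "data_acquisition",   # runs module 101 code
            "device_1": "model_training",     # runs module 102 code
            "device_2": "voice_cloning",      # runs module 103 code
        }

        def send(device: str, payload: dict) -> dict:
            """Placeholder for communication over interface 1030 (e.g., RPC)."""
            return payload

        def run_pipeline(target_scene: str, target_text: str) -> dict:
            data = send("device_0", {"op": "acquire", "scene": target_scene})
            model = send("device_1", {"op": "train", "data": data})
            return send("device_2", {"op": "clone", "model": model,
                                     "text": target_text})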
  • The bus 1040 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • the bus 1040 within each computing device 1000 in FIG. 10 is represented by only one thick line, but this does not mean that there is only one bus or one type of bus.
  • Any computing device may be a computing device (eg, a server) in a cloud environment, a computing device in an edge environment, or a terminal device.
  • Embodiments of the present application also provide a computer-readable storage medium that stores instructions which, when run on one or more computing devices, cause the one or more computing devices to execute the methods performed by each module of the voice cloning device in the above embodiments.
  • embodiments of the present application also provide a computer program product.
  • When the computer program product is executed by one or more computing devices, the one or more computing devices execute any one of the foregoing voice cloning methods.
  • The computer program product can be a software installation package; if any of the foregoing voice cloning methods needs to be used, the computer program product can be downloaded and executed on a computer.
  • the device embodiments described above are only illustrative.
  • The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units.
  • A unit may be located in one place, or it may be distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between modules indicates that there are communication connections between them, which can be specifically implemented as one or more communication buses or signal lines.
  • The present application can be implemented by software plus the necessary general-purpose hardware, and of course it can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structures used to implement the same function can be diverse, such as analog circuits, digital circuits, or dedicated circuits. However, for this application, a software implementation is the better choice in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which can be a personal computer, a training device, a network device, etc.) to execute the methods described in the various embodiments of this application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from a website, computer, training device, or data center to another website, computer, training device, or data center through wired (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (such as infrared, radio, or microwave) means.
  • The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media.
  • The available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid-state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present application provides a voice cloning method comprising: determining a target scene, and determining, according to the target scene, a target corpus text belonging to the target scene; then determining audio of a target object according to the target corpus text, the speech content of the audio matching the content of the target corpus text, and the target corpus text and the audio of the target object being used to train a voice cloning model corresponding to the target scene, the voice cloning model serving to generate audio that simulates the pronunciation of the target object in the target scene. Since the voice cloning model is obtained by training on the target object's pronunciation audio for corpus text in the target scene, the new speech it generates from text can better match the actual pronunciation of the target object in the target scene in terms of tone, rhythm, pronunciation style, and so on, effectively improving the voice cloning effect. The present application further provides a corresponding apparatus and related device.
PCT/CN2023/081526 2022-06-29 2023-03-15 Procédé et appareil de clonage vocal, et dispositif associé WO2024001307A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210778187.2 2022-06-29
CN202210778187 2022-06-29
CN202211071940.0 2022-09-02
CN202211071940.0A CN117373432A (zh) 2022-06-29 2022-09-02 一种语音克隆方法、装置及相关设备

Publications (1)

Publication Number Publication Date
WO2024001307A1 true WO2024001307A1 (fr) 2024-01-04

Family

ID=89382602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081526 WO2024001307A1 (fr) 2022-06-29 2023-03-15 Procédé et appareil de clonage vocal, et dispositif associé

Country Status (1)

Country Link
WO (1) WO2024001307A1 (fr)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN112885326A (zh) * 2019-11-29 2021-06-01 阿里巴巴集团控股有限公司 个性化语音合成模型创建、语音合成和测试方法及装置
CN113241056A (zh) * 2021-04-26 2021-08-10 标贝(北京)科技有限公司 语音合成模型的训练与语音合成方法、装置、系统及介质
CN113327574A (zh) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 一种语音合成方法、装置、计算机设备和存储介质

Similar Documents

Publication Publication Date Title
JP6799574B2 (ja) 音声対話の満足度の確定方法及び装置
CN107516510B (zh) 一种智能设备自动化语音测试方法及装置
JP6786751B2 (ja) 音声接続合成の処理方法及び装置、コンピュータ設備及びコンピュータプログラム
CN106652997B (zh) 一种音频合成的方法及终端
US11882319B2 (en) Virtual live video streaming method and apparatus, device, and readable storage medium
CN111741326B (zh) 视频合成方法、装置、设备及存储介质
CN110473525B (zh) 获取语音训练样本的方法和装置
CN111489424A (zh) 虚拟角色表情生成方法、控制方法、装置和终端设备
US10665218B2 (en) Audio data processing method and device
US11511200B2 (en) Game playing method and system based on a multimedia file
US10971125B2 (en) Music synthesis method, system, terminal and computer-readable storage medium
WO2017059694A1 (fr) Procédé et dispositif d'imitation de parole
KR20210001859A (ko) 3차원 가상 인물 입모양 변화 제어 방법 및 장치
TWI731382B (zh) 語音合成的方法、裝置及設備
CN109389427A (zh) 问卷推送方法、装置、计算机设备和存储介质
JP2023552854A (ja) ヒューマンコンピュータインタラクション方法、装置、システム、電子機器、コンピュータ可読媒体及びプログラム
CN104505103B (zh) 语音质量评价设备、方法和系统
WO2021227308A1 (fr) Procédé et appareil de génération de ressource vidéo
CN115691544A (zh) 虚拟形象口型驱动模型的训练及其驱动方法、装置和设备
CN113691909A (zh) 具有音频处理推荐的数字音频工作站
WO2023241360A1 (fr) Procédés et appareil d'interaction vocale de classe en ligne, dispositif et support de stockage
WO2024001307A1 (fr) Procédé et appareil de clonage vocal, et dispositif associé
JP2020052262A (ja) 修正候補提示方法、修正候補提示プログラムおよび情報処理装置
CN111966803B (zh) 对话模拟方法、装置、存储介质及电子设备
CN113851106A (zh) 音频播放方法、装置、电子设备和可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829499

Country of ref document: EP

Kind code of ref document: A1