WO2024001307A1 - Voice cloning method and apparatus, and related device - Google Patents

Voice cloning method and apparatus, and related device

Info

Publication number
WO2024001307A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
scene
text
corpus
scenes
Application number
PCT/CN2023/081526
Other languages
French (fr)
Chinese (zh)
Inventor
陈飞扬
王喆锋
段新宇
怀宝兴
Original Assignee
华为云计算技术有限公司
Priority claimed from CN202211071940.0A external-priority patent/CN117373432A/en
Application filed by 华为云计算技术有限公司
Publication of WO2024001307A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/683 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present application provides a voice cloning method, comprising: determining a target scene and, according to the target scene, determining a target corpus text belonging to the target scene; then determining audio of a target subject according to the target corpus text, where the speech content of the audio matches the content of the target corpus text; and using the target corpus text and the audio of the target subject to train a voice cloning model corresponding to the target scene, where the voice cloning model is used to output audio that simulates the pronunciation of the target subject in the target scene. Because the voice cloning model is trained on the target subject's pronunciation of corpus text from the target scene, the new speech the model outputs for a given text better matches the target subject's real pronunciation in that scene in timbre, prosody, pronunciation style and the like, effectively improving the voice cloning effect. In addition, the present application further provides a corresponding apparatus and related devices.

Description

A voice cloning method, apparatus and related device
This application claims priority to Chinese patent application No. 202210778187.2, entitled "A voice cloning method, apparatus and related device", filed with the China National Intellectual Property Administration on June 29, 2022, and to Chinese patent application No. 202211071940.0, entitled "A voice cloning method, apparatus and related device", filed with the China National Intellectual Property Administration on September 2, 2022, both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a voice cloning method, apparatus and related device.
Background
Voice cloning is a technique that, given original speech of a target subject (for example, a human to be cloned), generates new speech similar to the original in timbre and other pronunciation characteristics, so as to reproduce the target subject's pronunciation. It is widely used in scenarios such as virtual humans, audiobooks and video creation.
However, current voice cloning technology can only reproduce the timbre of the target subject's voice in the generated speech; it struggles to match the subject's actual pronunciation in real scenes, resulting in a poor cloning effect.
Summary
In view of this, embodiments of this application provide a voice cloning method to improve the voice cloning effect for a target subject. This application also provides a corresponding apparatus, computing device cluster, computer-readable storage medium and computer program product.
In a first aspect, embodiments of this application provide a voice cloning method that may be performed by a voice cloning apparatus. Specifically, the voice cloning apparatus determines a target scene, for example by taking a story scene specified by a user as the target scene, and determines, according to the target scene, a target corpus text belonging to that scene. It then determines audio of a target subject according to the target corpus text, the speech content of the audio matching the content of the target corpus text. The voice cloning apparatus uses the target corpus text and the audio of the target subject to train a voice cloning model corresponding to the target scene, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
Because the voice cloning model is trained on the target subject's recordings of corpus text from the target scene, the new speech the model outputs for a given text better matches the subject's real pronunciation in that scene in timbre, prosody and pronunciation style, which effectively improves the voice cloning effect.
In practice, the above approach can be used to generate voice cloning models that simulate the prosody and style of various subjects in various scenes, so that these models improve the realism and diversity of voice cloning.
Further, after training the voice cloning model, the voice cloning apparatus may use it to output audio corresponding to a piece of text, thereby cloning the target subject's voice.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene. For example, when the target scene is a story scene, the target corpus text may be corpus text of story content. Illustratively, the target scene may be any one of a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene may be a scene obtained by classification according to emotion type, such as a sad scene or a happy scene. In practice, the target scene may also be any other applicable scene.
In a possible implementation, when determining the corpus text belonging to the target scene, the voice cloning apparatus may first obtain the pinyin distribution of multiple corpus texts belonging to the target scene, for example the count distribution of each pinyin syllable across those texts. The apparatus may then select, according to that pinyin distribution, target corpus texts from the multiple corpus texts, where the number of target corpus texts is smaller than the number of the multiple corpus texts, and where the pinyin distribution of the target corpus texts and that of the multiple corpus texts satisfy a preset condition, for example that the variance or standard deviation between the two distributions is below a threshold. Because the pinyin distributions of corpus texts usually differ between scenes, the pinyin distribution of each scene can serve as a representative feature of that scene. Selecting the target corpus text on the basis of pinyin distribution therefore ensures that it matches the corpus characteristics of the scene, and training the voice cloning model on such text improves the model's cloning effect.
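The application does not fix a concrete selection algorithm, so the following is only a minimal Python sketch of one way the step above could be realized. It assumes each corpus text has already been converted to a list of pinyin syllables (a real system might use a grapheme-to-pinyin tool for this), and it uses the root-mean-square difference between normalized syllable frequencies as the "preset condition":

```python
from collections import Counter
import math

def pinyin_distribution(texts):
    """Normalized frequency of each pinyin syllable over a list of texts,
    where each text is a list of pinyin syllables."""
    counts = Counter(syl for text in texts for syl in text)
    total = sum(counts.values())
    return {syl: n / total for syl, n in counts.items()} if total else {}

def distribution_distance(p, q):
    """Root-mean-square difference over the union of syllables in p and q."""
    syllables = set(p) | set(q)
    return math.sqrt(
        sum((p.get(s, 0.0) - q.get(s, 0.0)) ** 2 for s in syllables) / len(syllables)
    )

def select_target_corpus(texts, k, threshold=0.05):
    """Greedily pick k texts whose joint pinyin distribution stays close to
    the distribution of the full corpus, and report whether the preset
    condition (distance below the threshold) is met."""
    target = pinyin_distribution(texts)
    chosen, remaining = [], list(texts)
    for _ in range(k):
        best = min(
            remaining,
            key=lambda t: distribution_distance(pinyin_distribution(chosen + [t]), target),
        )
        chosen.append(best)
        remaining.remove(best)
    return chosen, distribution_distance(pinyin_distribution(chosen), target) <= threshold
```

The greedy loop is one of many possible strategies; the claim only requires that the selected subset's distribution satisfy the preset condition, not any particular search procedure.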
In a possible implementation, when determining the corpus text belonging to the target scene, the voice cloning apparatus may select, from multiple corpus texts belonging to the target scene, a target corpus text in which the proportion of professional terminology exceeds a ratio threshold. After the voice cloning model is trained on such text, the model's pronunciation of professional terms in its output audio is more fluent and closer to the target subject's real pronunciation of those terms, which improves the voice cloning effect.
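As an illustration only (again, the application does not specify an algorithm), the term-ratio filter above could be sketched as follows; the domain-term lexicon and the tokenization of each text are assumed inputs, since a real system would need a domain dictionary and a proper Chinese tokenizer:

```python
def term_ratio(tokens, term_set):
    """Fraction of tokens in one text that are domain-specific terms."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in term_set) / len(tokens)

def select_by_term_ratio(tokenized_texts, term_set, ratio_threshold=0.1):
    """Keep only the texts whose professional-term ratio exceeds the threshold."""
    return [t for t in tokenized_texts if term_ratio(t, term_set) > ratio_threshold]
```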
In a possible implementation, when determining, according to the target corpus text, the audio of the target subject belonging to the target scene, the voice cloning apparatus may generate a recording interface that presents the target corpus text to the target subject, so that the subject can read the presented text aloud. Correspondingly, the voice cloning apparatus records the subject's pronunciation to obtain the subject's audio. In this way, the apparatus obtains the audio by capturing the subject's pronunciation, and the voice cloning model can subsequently be trained on the captured audio.
In a possible implementation, when determining, according to the target corpus text, the audio of the target subject belonging to the target scene, the voice cloning apparatus may obtain multiple audio recordings of the target subject speaking in the target scene and determine, from among them, the audio whose speech content matches the content of the target corpus text. For example, the apparatus may obtain from the network multiple recordings of the target subject speaking in public (and belonging to the target scene) and determine, by content matching, the audio that matches the target corpus text in content. In this way, after the user indicates the target scene, the target subject no longer needs to interact with the voice cloning apparatus by making recordings, which simplifies the interaction required for voice cloning and improves the user experience.
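One hypothetical way to implement the content-matching step above, assuming transcripts of the candidate recordings are available (for example from an ASR system, which is assumed rather than implemented here), is to rank clips by string similarity between transcript and target corpus text:

```python
import difflib

def best_matching_audio(clips, target_text, min_similarity=0.8):
    """clips: list of (audio_id, transcript) pairs.
    Returns the audio_id whose transcript best matches target_text,
    or None if no transcript is similar enough."""
    best_id, best_score = None, 0.0
    for audio_id, transcript in clips:
        score = difflib.SequenceMatcher(None, transcript, target_text).ratio()
        if score > best_score:
            best_id, best_score = audio_id, score
    return best_id if best_score >= min_similarity else None
```

A production system would more likely compare phone or character sequences with an edit-distance tolerance tuned to ASR error rates; the stdlib `SequenceMatcher` is used here only to keep the sketch self-contained.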
In a possible implementation, when determining the target scene, the voice cloning apparatus may generate a scene configuration interface that presents multiple candidate scenes to the user for selection, so that the apparatus can determine the target scene the user selects from among the candidates. The voice cloning apparatus can thus determine the pronunciation scene for voice cloning based on the user's choice, which widens the range of selectable cloning scenes and improves the user experience.
In a possible implementation, when determining the target scene, the voice cloning apparatus may generate a scene configuration interface that prompts the user to input an identifier (such as a name) of a user-defined target scene together with corpus text belonging to that scene, so that, in response to the user's operations on the interface, the apparatus obtains the identifier of the user-defined target scene and the corresponding corpus text. The voice cloning apparatus can thus support user-defined pronunciation scenes, which increases the flexibility of voice cloning and improves the user experience.
In a possible implementation, the voice cloning apparatus may also generate a test interface that prompts the user to input text. In response to the user's operations on the test interface, the apparatus obtains the target text input by the user, feeds it to the voice cloning model and obtains the audio the model outputs. The user can then judge, from that audio, how well the model clones the target subject's pronunciation in the target scene, and if the cloning effect is poor it can be further improved, for example by retraining the model.
In a second aspect, embodiments of this application further provide a voice cloning method that may be performed by a voice cloning apparatus. Specifically, the voice cloning apparatus receives a target scene and target text input by a user, for example a story scene together with the story text. It then determines, according to the target scene, the voice cloning model corresponding to that scene and, based on that model, outputs target audio corresponding to the target text, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
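The dispatch described in this aspect can be pictured as a registry keyed by (subject, scene). The sketch below is an assumption about structure rather than the application's actual implementation, and it stands in an arbitrary callable for the trained cloning model:

```python
class VoiceCloneRegistry:
    """Maps each (subject, scene) pair to its trained cloning model and
    dispatches synthesis requests to the matching model."""

    def __init__(self):
        self._models = {}

    def register(self, subject, scene, model):
        # model: any callable taking text and returning audio (placeholder).
        self._models[(subject, scene)] = model

    def synthesize(self, subject, scene, text):
        model = self._models.get((subject, scene))
        if model is None:
            raise KeyError(f"no cloning model for subject={subject!r}, scene={scene!r}")
        return model(text)
```

Keying on the scene as well as the subject is the point of the method: the same subject may have several models, one per scene, each reflecting that scene's prosody and style.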
In this way, the new speech the voice cloning model outputs for the target text better matches the target subject's real pronunciation in the target scene in timbre, prosody and pronunciation style, which effectively improves the voice cloning effect.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene. For example, when the target scene is a story scene, the target corpus text may be corpus text of story content. Illustratively, the target scene may be any one of a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene may be a scene obtained by classification according to emotion type, such as a sad scene or a happy scene. In practice, the target scene may also be any other applicable scene.
In a possible implementation, when receiving the target scene and target text input by the user, the voice cloning apparatus may generate a speech synthesis interface that presents multiple candidate scenes to the user, determine the target scene the user selects from among them, and receive the target text the user inputs on the interface. The apparatus can thus support user selection of both scene and text.
In a possible implementation, the speech synthesis interface presented by the voice cloning apparatus may also present multiple candidate subjects to the user, so that the user can select one of them as the target subject. The apparatus then clones the voice of the subject the user selects, which increases the flexibility and selectability of voice cloning and improves the user experience.
In a third aspect, embodiments of this application further provide a voice cloning apparatus, comprising: a data acquisition module configured to determine a target scene, determine, according to the target scene, a target corpus text belonging to that scene, and determine, according to the target corpus text, audio of a target subject, the speech content of the audio matching the content of the target corpus text; and a model training module configured to train, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, the model being used to output audio that simulates the target subject's pronunciation in the target scene.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene, and the target scene includes any one of the following: a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene is a scene obtained by classification according to emotion type.
In a possible implementation, the data acquisition module is configured to: obtain the pinyin distribution of multiple corpus texts belonging to the target scene; and select, according to that distribution, the target corpus texts from the multiple corpus texts, where the number of target corpus texts is smaller than the number of the multiple corpus texts and the pinyin distribution of the target corpus texts and that of the multiple corpus texts satisfy a preset condition.
In a possible implementation, the data acquisition module is configured to select the target corpus text from multiple corpus texts belonging to the target scene, the proportion of professional terminology in the target corpus text exceeding a ratio threshold.
In a possible implementation, the data acquisition module is configured to: generate a recording interface that presents the target corpus text to the target subject; and record the target subject's reading of the target corpus text to obtain the target subject's audio.
In a possible implementation, the data acquisition module is configured to: obtain multiple audio recordings of the target subject speaking in the target scene; and determine, from among them, the audio whose speech content matches the content of the target corpus text.
In a possible implementation, the data acquisition module is configured to: generate a scene configuration interface that presents multiple candidate scenes to a user; and determine the target scene the user selects from among the candidates.
In a possible implementation, the data acquisition module is configured to: generate a scene configuration interface that prompts for input of an identifier of a user-defined target scene and corpus text belonging to that scene; and, in response to the user's operations on the interface, obtain the identifier of the user-defined target scene and the corpus text belonging to it.
In a possible implementation, the voice cloning apparatus further comprises a voice cloning module configured to: generate a test interface that prompts the user to input text; obtain, in response to the user's operations on the test interface, the target text input by the user; and input the target text to the voice cloning model to obtain the audio output by the model.
It should be noted that the voice cloning apparatus provided in the third aspect corresponds to the voice cloning method provided in the first aspect; therefore, for the technical effects of the third aspect and any of its implementations, reference may be made to the first aspect and its corresponding implementations.
In a fourth aspect, embodiments of this application further provide a voice cloning apparatus comprising: a data acquisition module configured to receive a target scene and target text input by a user; and a voice cloning module configured to determine, according to the target scene, the voice cloning model corresponding to that scene and, based on the model, output target audio corresponding to the target text, the model being used to output audio that simulates a target subject's pronunciation in the target scene.
In a possible implementation, the context of the target corpus text matches the context indicated by the target scene, and the target scene includes any one of the following: a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene or a speech scene; alternatively, the target scene is a scene obtained by classification according to emotion type.
In a possible implementation, the data acquisition module is configured to: generate a speech synthesis interface that presents multiple candidate scenes to the user; determine the target scene the user selects from among the candidates; and receive the target text the user inputs on the speech synthesis interface.
In a possible implementation, the speech synthesis interface is further used to present multiple candidate subjects to the user, and the data acquisition module is further configured to determine, from among the candidates, the target subject the user selects.
It should be noted that the voice cloning apparatus provided in the fourth aspect corresponds to the voice cloning method provided in the second aspect; therefore, for the technical effects of the fourth aspect and any of its implementations, reference may be made to the second aspect and its corresponding implementations.
In a fifth aspect, this application provides a computing device comprising a processor and a memory. The memory is configured to store instructions, and the processor executes the instructions stored in the memory so that the computing device performs the voice cloning method of the first aspect or any possible implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof. It should be noted that the memory may be integrated in the processor or independent of it. The computing device may further include a bus, through which the processor is connected to the memory, and the memory may include readable memory and random access memory.
In a sixth aspect, this application provides a computing device cluster comprising at least one computing device, the at least one computing device comprising at least one processor and at least one memory. The at least one memory is configured to store instructions, and the at least one processor executes the instructions stored in the at least one memory so that the computing device cluster performs the voice cloning method of the first aspect or any possible implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof. It should be noted that the memory may be integrated in the processor or independent of it. The at least one computing device may further include a bus, through which the processor is connected to the memory, and the memory may include readable memory and random access memory.
In a seventh aspect, this application provides a computer-readable storage medium storing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method of the first aspect or any implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof.
In an eighth aspect, this application provides a computer program product containing instructions that, when run on at least one computing device, cause the at least one computing device to perform the method of the first aspect or any implementation thereof, or the voice cloning method of the second aspect or any possible implementation thereof.
On the basis of the implementations provided in the above aspects, this application may further combine them to provide additional implementations.
附图说明Description of drawings
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请中记载的一些实施例,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present application; those of ordinary skill in the art can also obtain other drawings based on these drawings.
图1为本申请提供的一示例性应用场景的示意图;Figure 1 is a schematic diagram of an exemplary application scenario provided by this application;
图2为本申请提供的另一示例性应用场景的示意图;Figure 2 is a schematic diagram of another exemplary application scenario provided by this application;
图3为本申请提供的一种语音克隆方法的流程示意图;Figure 3 is a schematic flow chart of a voice cloning method provided by this application;
图4为本申请提供的一种场景配置界面的示意图; Figure 4 is a schematic diagram of a scene configuration interface provided by this application;
图5为本申请提供的另一种场景配置界面的示意图;Figure 5 is a schematic diagram of another scene configuration interface provided by this application;
图6为本申请提供的新闻场景以及财经场景下的语料文本对应的拼音分布示意图;Figure 6 is a schematic diagram of the pinyin distribution corresponding to the corpus text in news scenarios and financial scenarios provided by this application;
图7为本申请提供的一种录音界面的示意图;Figure 7 is a schematic diagram of a recording interface provided by this application;
图8为本申请提供的一种测试界面的示意图;Figure 8 is a schematic diagram of a test interface provided by this application;
图9为本申请提供的一种计算设备的结构示意图;Figure 9 is a schematic structural diagram of a computing device provided by this application;
图10为本申请提供的一种计算设备集群的结构示意图。Figure 10 is a schematic structural diagram of a computing device cluster provided by this application.
具体实施方式Detailed Description
下面将结合本申请中的附图,对本申请提供的实施例中的方案进行描述。The solutions in the embodiments provided in this application will be described below with reference to the drawings in this application.
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。The terms "first", "second", etc. in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that the terms so used are interchangeable under appropriate circumstances, and are merely a way of distinguishing objects with the same attributes in describing the embodiments of the present application.
目前,在进行语音克隆时,会利用通用的语料文本以及目标对象针对该语料文本的录音音频,训练语音克隆模型。这样,语音克隆模型可以学习目标对象的发音的音色,根据新提供的文本,生成与该目标对象发音的音色相符的语音并输出,实现目标对象的语音克隆。其中,目标对象,是指具有能够发音的对象,如人类等。Currently, when performing voice cloning, the general corpus text and the target object's audio recording of the corpus text are used to train the speech cloning model. In this way, the speech cloning model can learn the timbre of the target object's pronunciation, and based on the newly provided text, generate and output a voice that matches the timbre of the target object's pronunciation, thereby realizing the speech cloning of the target object. Among them, the target object refers to an object that can pronounce words, such as human beings.
实际应用场景中,目标对象在不同场景中发音的韵律、风格等,通常存在差异。其中,韵律以及风格可以反映目标对象发音时的特点。韵律,可以包括发音的语调、时域分布和重音等方面的特征。风格,可以包括目标对象说话的语速等特征。In actual application scenarios, there are usually differences in the rhythm and style of the target object's pronunciation in different scenarios. Among them, rhythm and style can reflect the characteristics of the target object's pronunciation. Rhythm can include features such as pronunciation intonation, temporal distribution, and stress. Style can include characteristics such as the speaking speed of the target object.
以故事场景和新闻播报场景为例,人在故事场景(如在讲述故事内容等)中的发音,通常语速较为平缓(如每分钟说120字)、音量变化较大,而在新闻播报场景中的发音,通常语速较快(如每分钟说200字)、音量变化较小等。但是,基于通用文本的语料以及相应的录音音频所训练出的语音克隆模型,仅能克隆目标对象发音的音色,难以克隆出目标对象在不同场景下发音的不同韵律和风格,从而影响语音克隆效果。Take story scenes and news broadcast scenes as examples. A person's pronunciation in a story scene (such as when telling a story) usually has a relatively gentle speaking speed (for example, 120 words per minute) and large volume variation, while pronunciation in a news broadcast scene is usually faster (for example, 200 words per minute) with smaller volume variation. However, a voice cloning model trained on a general-purpose text corpus and the corresponding recorded audio can only clone the timbre of the target object's pronunciation; it is difficult to clone the different rhythms and styles of the target object's pronunciation in different scenes, which affects the voice cloning effect.
基于此,本申请实施例提供了一种语音克隆方法,该方法可以由语音克隆装置执行,用于提升针对目标对象的语音克隆效果。具体实现时,语音克隆装置先确定所要克隆的目标对象发音时的目标场景,并根据该目标场景,获取属于该目标场景的目标语料文本,并进一步根据该目标语料,确定目标对象的音频,该目标对象的音频的语音内容与该目标语料文本的内容相匹配,如该音频可以是对目标对象根据该目标语料文本的发音进行录音得到的音频等,从而语音克隆装置利用该目标语料文本以及该音频,训练得到用于输出模拟所述目标对象在目标场景下发音的音频的语音克隆模型,实现针对目标对象在目标场景下发音的语音克隆。Based on this, embodiments of the present application provide a voice cloning method, which can be executed by a voice cloning device and is used to improve the voice cloning effect for a target object. In specific implementation, the voice cloning device first determines the target scene in which the target object to be cloned pronounces, obtains the target corpus text belonging to that scene, and further determines the audio of the target object according to the target corpus text. The speech content of the target object's audio matches the content of the target corpus text; for example, the audio may be obtained by recording the target object reading the target corpus text. The voice cloning device then uses the target corpus text and the audio to train a voice cloning model for outputting audio that simulates the target object's pronunciation in the target scene, thereby realizing voice cloning of the target object's pronunciation in the target scene.
由于语音克隆模型,基于目标对象针对目标场景下的语料文本的发音音频进行训练得到,这使得语音克隆模型根据文本所输出的新的语音,在音色、韵律和发音风格等方面的特征,能够更加符合目标对象在目标场景下的真实发音情况,以此可以有效提高语音克隆效果。Since the voice cloning model is trained on the target object's pronunciation audio for corpus text in the target scene, the new speech that the model outputs from text can better match the target object's real pronunciation in the target scene in terms of timbre, rhythm, pronunciation style and other characteristics, which can effectively improve the voice cloning effect.
实际应用时,针对每种场景,均可以利用上述方式克隆目标对象在该场景下的语音,从而可以实现克隆目标对象在不同场景下发音所具有的不同韵律和风格,提高语音克隆的真实性和多样性。进一步地,语音克隆装置还可以针对多个对象中的每个对象,均采用上述方式克隆该对象在各个场景下的语音,从而可以提高语音克隆的灵活性以及丰富性。In practical applications, for each scene, the above method can be used to clone the target object's speech in that scene, so that the different rhythms and styles of the target object's pronunciation in different scenes can be cloned, improving the authenticity and diversity of voice cloning. Furthermore, the voice cloning device can also clone, for each of multiple objects, that object's speech in each scene in the above manner, thereby improving the flexibility and richness of voice cloning.
作为一种示例,上述语音克隆装置可以被部署于云端,用于为用户提供语音克隆的云服务。例如,在图1所示的应用场景中,语音克隆装置100可以部署于云端,例如可以是由云端的计算设备或者计算设备集群实现。并且,语音克隆装置100可以对外提供客户端200,用于实现与用户300的交互,如接收用户300输入的场景信息、文本或者音频数据,或者向用户300反馈克隆的音频等。实际应用时,客户端200例如可以是运行在用户侧设备上的应用程序,或者可以是语音克隆装置100对外提供的网络浏览器等。语音克隆装置100可以包括数据获取模块101、模型训练模块102。As an example, the above voice cloning device can be deployed in the cloud to provide users with a voice cloning cloud service. For example, in the application scenario shown in Figure 1, the voice cloning device 100 can be deployed in the cloud, for example implemented by a computing device or a computing device cluster in the cloud. In addition, the voice cloning device 100 can provide a client 200 for interaction with the user 300, such as receiving scene information, text or audio data input by the user 300, or feeding back cloned audio to the user 300. In actual application, the client 200 may be, for example, an application program running on a user-side device, or a web browser provided externally by the voice cloning device 100. The voice cloning device 100 may include a data acquisition module 101 and a model training module 102.
Among them, the data acquisition module 101 is used to determine the target scene (for example, a scene selected by the user 300 or a scene customized by the user 300 may be determined as the target scene), obtain the target corpus text belonging to the target scene and the audio of the target object, and provide the target corpus text and audio to the model training module 102. The model training module 102 is used to train a voice cloning model corresponding to the target scene using the target corpus text and the audio of the target object. Further, the voice cloning device 100 may also include a voice cloning module 103; in that case, the model training module 102 may provide the voice cloning model to the voice cloning module 103. The voice cloning module 103 is used to output, with the voice cloning model, the audio corresponding to a target text, i.e., audio that simulates the target object's pronunciation in the target scene, where the target text may be preconfigured text or text newly provided by the user 300. Further, the voice cloning module 103 may also send the audio corresponding to the target text to the client 200, so that the client 200 plays the audio to the user 300.
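As a rough illustration of the module split described above (data acquisition 101, model training 102, voice cloning 103), the pipeline could be skeletonized as follows. Every class and method name here is a hypothetical stand-in, and the actual model training and speech synthesis are reduced to placeholders:

```python
from dataclasses import dataclass

@dataclass
class TrainingData:
    scene: str           # target scene, e.g. "story" or "news"
    corpus_texts: list   # target corpus texts belonging to the scene
    audio_clips: list    # target object's recordings matching those texts

class VoiceCloningDevice:
    """Toy skeleton of modules 101/102/103; real training and TTS are stubbed."""

    def acquire(self, scene, corpus_texts, audio_clips):
        # Module 101: bundle scene-specific texts with matching recordings.
        return TrainingData(scene, corpus_texts, audio_clips)

    def train(self, data):
        # Module 102: would fit a scene-specific voice cloning model here;
        # this stub just records what it was trained on.
        return {"scene": data.scene, "samples": len(data.corpus_texts)}

    def synthesize(self, model, target_text):
        # Module 103: would synthesize cloned speech; stub returns a marker.
        return f"[{model['scene']}] audio for: {target_text}"
```

A per-scene model is the key design point: one `TrainingData` bundle, and hence one trained model, exists per (object, scene) pair.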
作为另一种示例,上述语音克隆装置可以被部署于本地,从而可以为用户提供本地的语音克隆服务。例如,在图2所示的应用场景中,上述语音克隆装置具体可以是本地的终端400,从而用户300可以向终端400输入目标场景、目标语料文本以及目标对象的音频,终端400利用该目标语料文本以及音频,训练目标场景对应的语音克隆模型,并利用该语音克隆模型输出目标文本对应的音频,并向用户300播放该音频。As another example, the above voice cloning device can be deployed locally, so that local voice cloning services can be provided for users. For example, in the application scenario shown in Figure 2, the above-mentioned voice cloning device can be a local terminal 400, so that the user 300 can input the target scene, the target corpus text, and the audio of the target object to the terminal 400, and the terminal 400 uses the target corpus. text and audio, train a speech cloning model corresponding to the target scene, and use the speech cloning model to output the audio corresponding to the target text, and play the audio to the user 300.
实际应用时,上述语音克隆装置可以通过软件实现,或者可以通过硬件实现。In actual application, the above voice cloning device can be implemented by software or can be implemented by hardware.
语音克隆装置作为软件功能单元的一种举例,可以包括运行在计算实例上的代码。其中,计算实例可以包括物理主机(计算设备)、虚拟机、容器中的至少一种。进一步地,上述计算实例可以是一台或者多台。例如,语音克隆装置可以包括运行在多个主机/虚拟机/容器上的代码。需要说明的是,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的区域(region)中,也可以分布在不同的region中。进一步地,用于运行该代码的多个主机/虚拟机/容器可以分布在相同的可用区(availability zone,AZ)中,也可以分布在不同的AZ中,每个AZ包括一个数据中心或多个地理位置相近的数据中心。其中,通常一个region可以包括多个AZ。As an example of a software functional unit, the voice cloning device may include code running on a computing instance. The computing instance may include at least one of a physical host (computing device), a virtual machine, and a container. Furthermore, there may be one or more computing instances. For example, the voice cloning device may include code running on multiple hosts/virtual machines/containers. It should be noted that the multiple hosts/virtual machines/containers used to run the code can be distributed in the same region or in different regions. Furthermore, they can be distributed in the same availability zone (AZ) or in different AZs; each AZ includes one data center or multiple geographically close data centers. Usually, one region may include multiple AZs.
同样,用于运行该代码的多个主机/虚拟机/容器可以分布在同一个虚拟私有云(virtual private cloud,VPC)中,也可以分布在多个VPC中。其中,通常一个VPC设置在一个region内,同一region内两个VPC之间,以及不同region的VPC之间跨区通信需在每个VPC内设置通信网关,经通信网关实现VPC之间的互连。Likewise, the multiple hosts/virtual machines/containers used to run the code can be distributed in the same virtual private cloud (VPC) or across multiple VPCs. Usually, one VPC is set up within one region. Cross-region communication between two VPCs in the same region, or between VPCs in different regions, requires a communication gateway to be set up in each VPC; the interconnection between VPCs is realized through the communication gateways.
语音克隆装置作为硬件功能单元的一种举例,语音克隆装置可以包括至少一个计算设备,如服务器等。或者,语音克隆装置也可以是利用专用集成电路(application-specific integrated circuit,ASIC)实现、或可编程逻辑器件(programmable logic device,PLD)实现的设备等。其中,上述PLD可以是复杂程序逻辑器件(complex programmable logical device,CPLD)、现场可编程门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合实现。As an example of a hardware functional unit, the voice cloning device may include at least one computing device, such as a server. Alternatively, the voice cloning device may be a device implemented with an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The above-mentioned PLD can be implemented as a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL), or any combination thereof.
语音克隆装置包括的多个计算设备可以分布在相同的region中,也可以分布在不同的region中。语音克隆装置包括的多个计算设备可以分布在相同的AZ中,也可以分布在不同的AZ中。同样,语音克隆装置包括的多个计算设备可以分布在同一个VPC中,也可以分布在多个VPC中。其中,所述多个计算设备可以是服务器、ASIC、PLD、CPLD、FPGA和GAL等计算设备的任意组合。Multiple computing devices included in the voice cloning device may be distributed in the same region or in different regions. Multiple computing devices included in the voice cloning device may be distributed in the same AZ or in different AZs. Similarly, multiple computing devices included in the voice cloning device may be distributed in the same VPC or in multiple VPCs. The plurality of computing devices may be any combination of computing devices such as servers, ASICs, PLDs, CPLDs, FPGAs, and GALs.
接下来,对语音克隆过程的各种非限定性的具体实施方式进行详细描述。Next, various non-limiting specific implementations of the voice cloning process are described in detail.
参阅图3,为本申请实施例中一种语音克隆方法的流程示意图。该方法可以应用于上述图1或者图2所示的应用场景中,或者也可以是应用于其它可适用的应用场景中。下面以应用于图1所示的应用场景为例进行说明。在图1所示的应用场景中,语音克隆装置100中的数据获取模块101、模型训练模块102以及语音克隆模块103的功能,具体参见下述实施例的相关描述。并且,语音克隆装置100能够用于生成克隆一个或者多个对象在各个场景下发音的语音克隆模型,为便于说明,图3所示实施例中以生成用于模拟输出一个对象(即下述目标对象)在一个场景(即下述目标场景)下发音的音频的语音克隆模型为例进行说明,关于语音克隆装置100生成用于模拟其他对象在其它各个场景下发音的语音克隆模型的实现过程,可参照图3所示实施例进行理解。Referring to Figure 3, it is a schematic flowchart of a voice cloning method in an embodiment of the present application. This method can be applied to the application scenarios shown in Figure 1 or Figure 2 above, or to other applicable application scenarios. The following description takes the application scenario shown in Figure 1 as an example. In that scenario, for the functions of the data acquisition module 101, the model training module 102 and the voice cloning module 103 in the voice cloning device 100, refer to the relevant descriptions of the following embodiments. Moreover, the voice cloning device 100 can be used to generate voice cloning models that clone the pronunciation of one or more objects in various scenes. For ease of explanation, the embodiment shown in Figure 3 takes as an example the generation of a voice cloning model used to output audio simulating the pronunciation of one object (i.e., the target object below) in one scene (i.e., the target scene below); the process by which the voice cloning device 100 generates voice cloning models simulating the pronunciation of other objects in other scenes can be understood with reference to the embodiment shown in Figure 3.
图3所示的语音克隆方法具体可以包括:The voice cloning method shown in Figure 3 may specifically include:
S301:数据获取模块101确定目标场景。S301: The data acquisition module 101 determines the target scene.
通常情况下,目标对象在多种不同的场景下发音的韵律、风格可能存在差异,因此,在克隆目标对象的发音时,可以先确定所要克隆的目标对象的发音所属场景,以下称之为目标场景。Usually, the rhythm and style of the target object's pronunciation may differ across different scenes. Therefore, when cloning the target object's pronunciation, the scene to which the pronunciation to be cloned belongs can first be determined; this scene is hereinafter referred to as the target scene.
其中,目标对象发音所属的场景,可以根据实际应用场景中的发音环境进行划分,如可以划分为对话场景、新闻场景、财经场景、直播场景、故事场景、教育场景等多个场景,目标场景即为其中一种场景。或者,目标对象发音所属的场景,也可以是根据人物情绪的类型进行划分,如可以按照人物情绪的不同,划分为高兴场景、悲伤场景、崇拜场景、冷静场景、平淡场景等。实际应用时,也可以是采用其它方式划分得到多种不同的场景,本实施例对此并不进行限定。The scenes to which the target object's pronunciation belongs can be divided according to the pronunciation environment in actual applications, for example into dialogue scenes, news scenes, financial scenes, live broadcast scenes, story scenes, education scenes, etc.; the target scene is one of these scenes. Alternatively, the scenes can also be divided according to the type of the character's emotion, for example into happy scenes, sad scenes, worship scenes, calm scenes, plain scenes, etc. In actual application, other division methods may also be used to obtain multiple different scenes, which is not limited in this embodiment.
进一步地,目标场景,还可以是用户自定义的场景,如用户可以自定义睡前故事场景、演讲场景等。Furthermore, the target scene can also be a user-defined scene, for example, the user can customize a bedtime story scene, a speech scene, etc.
在一种确定目标场景的实现方式中,数据获取模块101可以生成场景配置界面,并将该场景配置界面发送给客户端200,以便客户端200将其呈现给用户300。其中,客户端200所呈现的场景配置界面中可以包括多个候选场景,例如可以是图4所示的对话场景、新闻场景、财经场景、直播场景、故事场景、教育场景、演讲场景等,该多个候选场景可以预先由技术人员进行配置。这样,用户300可以在客户端200上,从呈现的多个候选场景中选择一种场景,如选择故事场景等,以便指定语音克隆装置100基于该场景进行语音克隆。相应地,客户端200可以将用户所选择的场景反馈给数据获取模块101,以便数据获取模块101将其确定为目标场景。In one implementation of determining the target scene, the data acquisition module 101 can generate a scene configuration interface and send it to the client 200 so that the client 200 can present it to the user 300. The scene configuration interface presented by the client 200 may include multiple candidate scenes, such as the dialogue scene, news scene, financial scene, live broadcast scene, story scene, education scene and speech scene shown in Figure 4; these candidate scenes can be configured in advance by technicians. In this way, the user 300 can select one of the presented candidate scenes on the client 200, such as a story scene, so as to instruct the voice cloning device 100 to perform voice cloning based on that scene. Correspondingly, the client 200 can feed back the scene selected by the user to the data acquisition module 101, so that the data acquisition module 101 determines it as the target scene.
另外,语音克隆装置100还可以支持用户300自定义场景。比如,在图4所示的场景配置界面中,当用户300选择“自定义”场景,数据获取模块101还可以生成图5所示的场景配置界面,并将该场景配置界面通过客户端200呈现给用户300。此时,用户300可以在该场景配置界面中输入自定义的场景的名称(或者其它用于标识该场景的信息);相应地,数据获取模块101可以根据用户输入的场景的名称,创建新的场景,并将其确定为目标场景。In addition, the voice cloning device 100 can also support the user 300 in customizing scenes. For example, in the scene configuration interface shown in Figure 4, when the user 300 selects the "custom" scene, the data acquisition module 101 can generate the scene configuration interface shown in Figure 5 and present it to the user 300 through the client 200. At this time, the user 300 can enter the name of the customized scene (or other information used to identify the scene) in that interface; correspondingly, the data acquisition module 101 can create a new scene according to the scene name entered by the user and determine it as the target scene.
需要说明的是,上述数据获取模块101确定目标场景的实现方式仅作为示例性说明,实际应用时,数据获取模块101也可以通过其它方式确定目标场景,本实施例对此并不进行限定。It should be noted that the above-mentioned implementation method for the data acquisition module 101 to determine the target scene is only for illustrative purposes. In actual application, the data acquisition module 101 can also determine the target scene through other methods, which is not limited in this embodiment.
S302:数据获取模块101根据目标场景,确定获取属于该目标场景的目标语料文本。S302: The data acquisition module 101 obtains, according to the target scene, the target corpus text belonging to the target scene.
在确定目标场景后,数据获取模块101可以进一步获取实现语音克隆所需的目标语料文本。After determining the target scene, the data acquisition module 101 can further acquire the target corpus text required to implement speech cloning.
在一种获取目标语料文本的实施方式中,数据获取模块101可以在进行语音克隆之前,预先针对多种候选场景分别配置有相应的语料库,每个语料库用于存储属于同一候选场景的多个语料文本,不同语料库中存储的语料文本属于不同候选场景。其中,每个语料库中存储的语料文本的内容的语境与该候选场景所指示的语境相匹配。比如,在演讲场景对应的语料库中,所存储的语料文本例如可以是多份不同的演讲稿等。当目标场景为多个候选场景的其中一种候选场景时,数据获取模块101可以访问该目标场景对应的语料库,并从该语料库中筛选出部分语料文本作为用于训练语音克隆模型的目标语料文本。In one implementation of obtaining the target corpus text, before performing voice cloning, the data acquisition module 101 may be preconfigured with a corresponding corpus for each of multiple candidate scenes. Each corpus is used to store multiple corpus texts belonging to the same candidate scene, and corpus texts stored in different corpora belong to different candidate scenes. The context of the content of the corpus texts stored in each corpus matches the context indicated by that candidate scene. For example, in the corpus corresponding to a speech scene, the stored corpus texts may be multiple different speech scripts. When the target scene is one of the multiple candidate scenes, the data acquisition module 101 can access the corpus corresponding to the target scene and filter out some corpus texts from that corpus as the target corpus texts for training the voice cloning model.
本实施例中,提供了以下几种从语料库筛选出目标语料文本的实现示例。In this embodiment, the following implementation examples are provided for filtering target corpus text from the corpus.
在第一种实现示例中,数据获取模块101可以根据拼音分布从语料库中筛选出目标语料文本。In the first implementation example, the data acquisition module 101 can filter out the target corpus text from the corpus according to pinyin distribution.
具体地,以语料文本为中文文本为例,目标场景对应的语料库在存储多个语料文本时,还存储该多个语料文本包括的各个中文字符的拼音分布,如各个中文字符对应的拼音在语料库中出现次数的分布,以下称之为第一拼音分布。然后,数据获取模块101可以从该语料库中筛选出预设数量(如30、或50、或100等)的语料文本,将其添加至语料文本集合中,并统计该语料文本集合中的多个语料文本对应的拼音分布,以下称之为第二拼音分布。Specifically, taking Chinese corpus texts as an example, when the corpus corresponding to the target scene stores multiple corpus texts, it also stores the pinyin distribution of the Chinese characters included in those corpus texts, such as the distribution of the number of occurrences, in the corpus, of the pinyin corresponding to each Chinese character; this is hereinafter referred to as the first pinyin distribution. Then, the data acquisition module 101 can select a preset number (such as 30, 50, or 100) of corpus texts from the corpus, add them to a corpus text set, and compute the pinyin distribution corresponding to the multiple corpus texts in the set, hereinafter referred to as the second pinyin distribution.
接着,数据获取模块101可以计算第一拼音分布与第二拼音分布之间的方差(或者标准差等)。通常情况下,不同场景下的语料文本对应的拼音分布通常存在较大差异。比如,对于新闻场景下的500条语料文本以及财经场景下的500条语料文本,其拼音分布中数量最多的前10个拼音的分布可以如图6所示。因此,每个场景下的语料文本对应的拼音分布,可以作为指示该场景下语料文本特性的特征。相应地,在选取用于训练该场景下的目标语料文本时,可以选取拼音分布与语料库的拼音分布相同的多个语料文本作为目标语料文本,以保留该场景下的文本内容特征。Next, the data acquisition module 101 can calculate the variance (or standard deviation, etc.) between the first Pinyin distribution and the second Pinyin distribution. Usually, there are big differences in the pinyin distribution corresponding to corpus texts in different scenarios. For example, for 500 corpus texts in the news scenario and 500 corpus texts in the financial scenario, the distribution of the top 10 pinyins with the largest number in the pinyin distribution can be shown in Figure 6. Therefore, the pinyin distribution corresponding to the corpus text in each scene can be used as a feature indicating the characteristics of the corpus text in that scene. Correspondingly, when selecting target corpus texts for training in this scenario, multiple corpus texts whose pinyin distribution is the same as that of the corpus can be selected as target corpus texts to retain the text content characteristics in this scenario.
并且,当第一拼音分布与第二拼音分布之间的方差小于或者等于预设阈值时,数据获取模块101可以将语料文本集合中的多个语料文本确定为用于训练语音克隆模型的目标语料文本。而当第一拼音分布与第二拼音分布之间的方差大于预设阈值时,数据获取模块101可以根据第一拼音分布,确定第二拼音分布中拼音占比过大的目标拼音,并从语料文本集合中删除该目标拼音重复率相对较高的一个或者多个语料文本,然后随机从数据库中剩余的语料文本中选择一个或者多个语料文本,并将其添加至语料文本集合中。Moreover, when the variance between the first pinyin distribution and the second pinyin distribution is less than or equal to a preset threshold, the data acquisition module 101 can determine the multiple corpus texts in the corpus text set as the target corpus texts for training the voice cloning model. When the variance between the first pinyin distribution and the second pinyin distribution is greater than the preset threshold, the data acquisition module 101 can determine, according to the first pinyin distribution, the target pinyin whose proportion in the second pinyin distribution is too large, delete from the corpus text set one or more corpus texts in which that target pinyin is repeated relatively often, and then randomly select one or more corpus texts from the remaining corpus texts in the database and add them to the corpus text set.
然后,数据获取模块101可以重新计算该语料文本集合对应的拼音分布与第一拼音分布之间的方差(或者标准差等)是否小于预设阈值。如果是,则将当前语料文本集合中的多个语料文本确定为目标语料文本;如果不是,则可以重复上述步骤更新语料文本集合,直至语料文本集合对应的拼音分布与第一拼音分布之间的方差(或者标准差等)小于预设阈值。Then, the data acquisition module 101 can recalculate whether the variance (or standard deviation, etc.) between the pinyin distribution corresponding to the corpus text set and the first pinyin distribution is less than the preset threshold. If so, the multiple corpus texts in the current corpus text set are determined as the target corpus texts; if not, the above steps can be repeated to update the corpus text set until the variance (or standard deviation, etc.) between the pinyin distribution corresponding to the set and the first pinyin distribution is less than the preset threshold.
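The iterative procedure of this first example (draw a candidate set, compare its pinyin distribution with the corpus-wide one, and swap out over-represented texts until the variance falls below the threshold) can be sketched in Python. This is a minimal illustration only: the toy character-to-pinyin mapping, the mean-squared-difference variance, and all function names are assumptions of this sketch, not part of the application.

```python
import random
from collections import Counter

def pinyin_distribution(texts, char2py):
    """Normalized frequency of each pinyin over all characters in `texts`.
    `char2py` is a toy character-to-pinyin map standing in for a real
    grapheme-to-pinyin tool."""
    counts = Counter(char2py[c] for t in texts for c in t if c in char2py)
    total = sum(counts.values()) or 1
    return {py: n / total for py, n in counts.items()}

def distribution_variance(ref, cand):
    """Mean squared difference between two pinyin frequency maps (the value
    compared against the preset threshold)."""
    keys = set(ref) | set(cand)
    return sum((ref.get(k, 0.0) - cand.get(k, 0.0)) ** 2 for k in keys) / (len(keys) or 1)

def select_corpus(corpus, char2py, n, threshold, max_iters=100, seed=0):
    """Iteratively pick `n` texts whose pinyin distribution tracks the whole
    corpus, swapping out texts that over-represent some pinyin."""
    rng = random.Random(seed)
    ref = pinyin_distribution(corpus, char2py)
    chosen = rng.sample(corpus, n)
    for _ in range(max_iters):
        cand = pinyin_distribution(chosen, char2py)
        if not cand or distribution_variance(ref, cand) <= threshold:
            break
        # Pinyin most over-represented in the current selection.
        over = max(cand, key=lambda k: cand[k] - ref.get(k, 0.0))
        # Drop the chosen text that repeats it most often...
        worst = max(chosen, key=lambda t: sum(char2py.get(c) == over for c in t))
        chosen.remove(worst)
        # ...and draw a random replacement from the rest of the database.
        rest = [t for t in corpus if t not in chosen]
        chosen.append(rng.choice(rest))
    return chosen
```

In practice, the reference distribution would be the stored first pinyin distribution of the scene's corpus rather than one recomputed on the fly.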
在第二种实现示例中,数据获取模块101可以根据专业术语的占比,从语料库中筛选出目标语料文本。其中,专业术语,是指在特定领域中对一些特定事物的统一称谓,如计算机领域中的复杂程序逻辑器件(CPLD)等。In the second implementation example, the data acquisition module 101 can filter out the target corpus texts from the corpus according to the proportion of professional terms. Professional terms refer to unified names for specific things in a particular field, such as the complex programmable logic device (CPLD) in the computer field.
具体地,目标场景对应的语料库所存储多个语料文本中,可以携带有各个语料文本分别包括的专业术语的标识(或者称之为标签)。这样,数据获取模块101可以先从语料库中随机筛选预设数量的语料文本,将其添加至语料文本集合中,并根据这些语料文本携带的专业术语的标识,确定语料文本集合中专业术语的数量相对于该语料文本集合包括的所有词汇的数量的占比。当占比大于或者等于预设的比例阈值时,数据获取模块101可以将语料文本集合中的多个语料文本确定为目标语料文本;而当占比小于预设的比例阈值时,数据获取模块101可以删除语料文本集合中专业术语数量较少的部分语料文本,或者删除语料集合中重复率较高的部分语料文本,然后随机从数据库中剩余的语料文本中选择一个或者多个语料文本,并将其添加至语料文本集合中。Specifically, the multiple corpus texts stored in the corpus corresponding to the target scene may carry identifiers (or labels) of the professional terms included in each corpus text. In this way, the data acquisition module 101 can first randomly select a preset number of corpus texts from the corpus, add them to a corpus text set, and determine, based on the professional-term identifiers carried by these corpus texts, the proportion of the number of professional terms in the corpus text set relative to the number of all words included in the set. When the proportion is greater than or equal to a preset proportion threshold, the data acquisition module 101 can determine the multiple corpus texts in the corpus text set as the target corpus texts; when the proportion is less than the preset proportion threshold, the data acquisition module 101 can delete some corpus texts with few professional terms from the set, or delete some corpus texts with a high repetition rate, and then randomly select one or more corpus texts from the remaining corpus texts in the database and add them to the corpus text set.
然后,数据获取模块101可以重新计算语料文本集合中专业术语的数量相对于该语料文本集合中所有词汇的数量的占比。当占比大于或者等于预设的比例阈值时,数据获取模块101可以将当前语料文本集合中的多个语料文本确定为目标语料文本;而当占比小于预设的比例阈值时,数据获取模块101可以重复上述步骤更新语料文本集合,直至语料文本集合中专业术语的数量相对于该语料文本集合中所有词汇的数量的占比大于或者等于预设的比例阈值。Then, the data acquisition module 101 can recalculate the proportion of the number of professional terms in the corpus text set relative to the number of all words in the set. When the proportion is greater than or equal to the preset proportion threshold, the data acquisition module 101 can determine the multiple corpus texts in the current corpus text set as the target corpus texts; when the proportion is less than the preset proportion threshold, the data acquisition module 101 can repeat the above steps to update the corpus text set until the proportion of the number of professional terms in the set relative to the number of all words in the set is greater than or equal to the preset proportion threshold.
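The second example's loop (keep swapping out the selected text that contributes the fewest professional terms until the term ratio reaches the threshold) can likewise be sketched. Here each corpus text is represented as a pre-tokenized list of words and the term labels as a simple set; both are stand-in assumptions for the identifiers actually carried by the corpus.

```python
import random

def term_ratio(token_lists, terms):
    """Fraction of all tokens that are labeled professional terms."""
    total = sum(len(toks) for toks in token_lists)
    hits = sum(tok in terms for toks in token_lists for tok in toks)
    return hits / total if total else 0.0

def select_by_terms(corpus, terms, n, min_ratio, max_iters=100, seed=0):
    """Iteratively swap out the selected text with the fewest term hits
    until the professional-term ratio reaches `min_ratio`."""
    rng = random.Random(seed)
    chosen = rng.sample(corpus, n)
    for _ in range(max_iters):
        if term_ratio(chosen, terms) >= min_ratio:
            return chosen
        # Drop the selected text contributing the fewest professional terms...
        worst = min(chosen, key=lambda toks: sum(t in terms for t in toks))
        chosen.remove(worst)
        # ...and draw a random replacement from the texts not currently selected.
        rest = [t for t in corpus if t not in chosen]
        chosen.append(rng.choice(rest))
    return chosen
```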
在第三种实现示例中,数据获取模块101可以综合上述拼音分布以及专业术语的占比,从数据库中筛选出目标语料文本,即所筛选出的目标语料文本中,不仅拼音分布与数据库对应的拼音分布之间的方差小于或者等于预设阈值,而且,专业术语的数量相对于目标语料文本中所有词汇的数量的占比大于或者等于比例阈值。In the third implementation example, the data acquisition module 101 can combine the above pinyin distribution and the proportion of professional terms to filter out the target corpus texts from the database; that is, in the filtered target corpus texts, not only is the variance between their pinyin distribution and the database's pinyin distribution less than or equal to the preset threshold, but the proportion of the number of professional terms relative to the number of all words in the target corpus texts is also greater than or equal to the proportion threshold.
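The third example simply requires both acceptance conditions to hold at once. A compact standalone predicate, under the same toy representations as the criteria themselves (character-to-pinyin map, tokenized texts, term set — all illustrative assumptions):

```python
from collections import Counter

def accept(selection, ref_dist, char2py, terms, var_max, ratio_min):
    """Joint acceptance test: the selection's pinyin distribution must stay
    within `var_max` of the corpus-wide reference AND its professional-term
    ratio must reach `ratio_min`. `selection` is a list of tokenized texts."""
    # Pinyin distribution over every character of the selection.
    counts = Counter(char2py[c] for toks in selection for w in toks for c in w
                     if c in char2py)
    total = sum(counts.values()) or 1
    dist = {p: n / total for p, n in counts.items()}
    keys = set(ref_dist) | set(dist)
    var = sum((ref_dist.get(k, 0.0) - dist.get(k, 0.0)) ** 2
              for k in keys) / (len(keys) or 1)
    # Professional-term ratio over all tokens.
    n_tokens = sum(len(toks) for toks in selection)
    ratio = sum(w in terms for toks in selection for w in toks) / (n_tokens or 1)
    return var <= var_max and ratio >= ratio_min
```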
上述数据获取模块101从语料库筛选出目标语料文本仅作为一些示例性说明,实际应用时,数据获取模块101也可以通过其它方式从语料库筛选出目标语料文本,本实施例对此并不进行限定。The above ways in which the data acquisition module 101 filters out the target corpus text from the corpus are only exemplary; in actual application, the data acquisition module 101 may also filter out the target corpus text from the corpus in other ways, which is not limited in this embodiment.
在另一种获取目标语料文本的实施方式中,当目标场景为用户300自定义的场景时,数据获取模块101可以从用户300上传的语料文本中确定出用于训练适用于该场景下的语音克隆模型的目标语料文本。In another implementation of obtaining the target corpus text, when the target scene is a scene customized by the user 300, the data acquisition module 101 can determine, from the corpus texts uploaded by the user 300, the target corpus texts for training a voice cloning model suitable for that scene.
具体地,数据获取模块101在通过客户端200呈现场景配置界面时,除了可以提示用户300输入自定义的场景的名称,还可以在该场景配置界面上提示用户300上传语料文本,如图5所示。其中,用户300可以在该场景配置界面上导入语料文本,或者在该场景配置界面上输入语料文本的路径、文件名或者网络地址等,以便数据获取模块101根据用户300输入的信息访问得到语料文本等。进一步地,图5所示的场景配置界面还可以提示用户300输入自定义场景下的专业术语。Specifically, when presenting the scene configuration interface through the client 200, the data acquisition module 101 may not only prompt the user 300 to enter the name of the customized scene, but also prompt the user 300 on that interface to upload corpus texts, as shown in Figure 5. The user 300 can import corpus texts on the scene configuration interface, or enter the path, file name, or network address of the corpus texts on the interface, so that the data acquisition module 101 can access the corpus texts according to the information entered by the user 300. Furthermore, the scene configuration interface shown in Figure 5 can also prompt the user 300 to enter professional terms for the customized scene.
Then, the data acquisition module 101 may determine the target corpus text from the corpus text uploaded by the user 300. When the number of corpus texts uploaded by the user 300 is large, the data acquisition module 101 may, with reference to the foregoing implementations, determine the target corpus text from the multiple corpus texts according to the pinyin distribution or the professional terms, which is not repeated here. When the number of corpus texts uploaded by the user 300 is small, for example when it does not exceed the aforementioned preset number, the data acquisition module 101 may determine all the corpus texts uploaded by the user 300 as the target corpus text, which is not limited in this embodiment.
In practice, the data acquisition module 101 may also obtain the target corpus text in other ways, which is not limited in this embodiment.
S303: The data acquisition module 101 determines, according to the target corpus text, audio of the target object whose speech content matches the content of the target corpus text.
The target object may be, for example, the user 300, or may be an object other than the user 300, such as a public figure.
After obtaining the target corpus text, the data acquisition module 101 may further obtain audio of the target object whose speech content matches the content of the target corpus text; for example, the speech content of the audio is identical to the content of the target corpus text.
This embodiment provides the following implementation examples for obtaining the audio of the target object.
In a first implementation example, when the target object is the user 300, the data acquisition module 101 may generate a recording interface that includes the determined target corpus text, and present the recording interface through the client 200. Further, the recording interface may also present the pinyin and tone information corresponding to the target corpus text; this pinyin and tone information may be manually annotated on the target corpus text in advance by technical personnel. For example, in the recording interface shown in Figure 7, the presented target corpus text may be a text belonging to the finance scene, "今年房地产价格走势是涨是落" ("Will real estate prices rise or fall this year?"), and the presented pinyin and tone information is "jin1 nian2 fang2 di4 chan3 jia4 ge2 zou3 shi4 shi4 zhang3 shi4 luo4". Here, "jin" in "jin1" is the pinyin of the character "今" in the target corpus text, and the "1" in "jin1" indicates that "今" is pronounced with the first tone; similarly, "nian" in "nian2" is the pinyin of the character "年", and the "2" in "nian2" indicates that "年" is pronounced with the second tone.
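The annotation format shown above — each pinyin syllable followed by a tone digit — can be parsed mechanically. The following sketch (the function name and error handling are illustrative, not from the patent) splits such a string into (syllable, tone) pairs:

```python
import re

def parse_pinyin_tones(annotation):
    """Split an annotation like 'jin1 nian2 fang2' into (syllable, tone)
    pairs, where the trailing digit 1-5 is the tone mark."""
    pairs = []
    for token in annotation.split():
        m = re.fullmatch(r"([a-z]+)([1-5])", token)
        if m is None:
            raise ValueError(f"malformed pinyin token: {token!r}")
        pairs.append((m.group(1), int(m.group(2))))
    return pairs
```

For example, `parse_pinyin_tones("jin1 nian2")` yields `[("jin", 1), ("nian", 2)]`, matching the reading of the example in the text.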
In this way, the user 300 can pronounce the target corpus text according to the text (and the corresponding pinyin and tones) presented on the recording interface. Correspondingly, the data acquisition module 101 may use the client 200 to record the user's pronunciation, obtaining the audio of the user 300, that is, the audio of the target object.
Further, because recording the target object's pronunciation is easily disturbed by environmental noise, the data acquisition module 101 may also perform noise detection on the recorded audio and calculate its signal-to-noise ratio. When the signal-to-noise ratio is greater than the noise threshold, this indicates that the audio suffers from significant noise interference; in this case, the data acquisition module 101 may delete the recording and prompt the user 300 to record the target corpus text again, until the signal-to-noise ratio of the obtained audio does not exceed the noise threshold. In addition, the data acquisition module 101 may also verify whether the speech content of the recorded audio matches the target corpus text, for example by checking whether the speech content is consistent with the content of the target corpus text, or whether the accuracy of the user's pronunciation reaches a threshold value. If so, the data acquisition module 101 may determine that the speech content of the audio matches the target corpus text; if not, the data acquisition module 101 may prompt the user 300 to record the target corpus text again.
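A quality gate of this kind might be sketched as below. Note this follows the usual convention in which a higher signal-to-noise ratio means a cleaner recording, so a take is kept when its SNR reaches a floor; the noise-floor heuristic (quietest 10% of frames) and both thresholds are assumptions of the sketch, not values from the patent.

```python
import math

def snr_db(frame_energies, noise_fraction=0.1):
    """Estimate SNR in dB, treating the quietest frames as the noise floor."""
    frames = sorted(frame_energies)
    k = max(1, int(len(frames) * noise_fraction))
    noise = sum(frames[:k]) / k            # quietest frames ~ noise floor
    signal = sum(frames) / len(frames)     # overall energy ~ signal level
    return 10 * math.log10(signal / max(noise, 1e-12))

def recording_acceptable(frame_energies, min_snr_db=20.0):
    """Keep a take only if its estimated SNR reaches the floor."""
    return snr_db(frame_energies) >= min_snr_db
```

A real implementation would compute frame energies from the waveform (e.g. short-time energy over 20 ms windows) and could use a dedicated voice-activity detector instead of this heuristic.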
In a second implementation example, when the target object is not the user 300, the data acquisition module 101 may obtain multiple audio segments of the target object in the target scene. For example, when the target scene is a speech scene, the data acquisition module 101 may obtain speech audio recorded by the target object in various public speaking settings. In practice, the target object may be specified in advance by the user 300. For example, the scene configuration interface shown in Figure 4 may present multiple different objects, including object 1 to object 4, so that the user 300 can select one of them as the target object, instructing the voice cloning apparatus 100 to clone the voice of that target object. Correspondingly, the data acquisition module 101 may obtain multiple audio segments of the target object specified by the user 300 from a database or from the network. The data acquisition module 101 may then match the content of the target corpus text against the obtained audio segments of the target object, thereby determining, from the multiple segments, the audio that matches the target corpus text.
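The content-matching step can be illustrated as follows, assuming each candidate segment already has a transcript (in practice produced by a speech-recognition front end, which the patent does not specify); `difflib`'s similarity ratio stands in for whatever matching criterion the implementation actually uses:

```python
import difflib

def best_matching_segment(target_text, segments):
    """segments: (audio_id, transcript) pairs; return the id of the segment
    whose transcript is closest to the target corpus text."""
    return max(
        segments,
        key=lambda seg: difflib.SequenceMatcher(None, target_text, seg[1]).ratio(),
    )[0]
```

A production system would likely also impose a minimum similarity so that a poor best match is rejected rather than used as training data.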
After obtaining the target corpus text and the audio of the target object, the data acquisition module 101 may forward them to the model training module 102.
S304: The model training module 102 trains the voice cloning model corresponding to the target scene using the target corpus text and the audio of the target object, where the voice cloning model is used to output audio that simulates the target object's pronunciation in the target scene.
In this embodiment, the voice cloning model may be built on, for example, the PortaSpeech model, the Tacotron model, or the FastSpeech model, or on another speech synthesis model, which is not limited in this embodiment.
As one implementation example, after the target corpus text and the audio of the target object are obtained, they may be used as training samples to iteratively train the voice cloning model until it meets a training termination condition, for example until the loss value is less than a threshold. In this way, the voice cloning model can learn the timbre, prosody, and style of the target object's pronunciation in the target scene.
In another implementation example, because the amount of target corpus text and target-object audio is usually small, the model training module 102 may first obtain general corpus text (that is, text not distinguished by scene) and the audio corresponding to that general corpus text, and use them for preliminary training of the voice cloning model. When the termination condition of the preliminary training is met, the voice cloning model can output corresponding audio for input text, that is, it can perform the basic function of speech synthesis. The model training module 102 then uses the target corpus text and the audio of the target object to further train the voice cloning model until it meets the training termination condition. In this way, even when the amount of target corpus text and target-object audio is small (that is, there are few training samples), the voice cloning model finally trained can still clone well the timbre, prosody, and style of the target object's pronunciation in the target scene.
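The pretrain-then-fine-tune scheme can be shown on a deliberately tiny stand-in: a one-parameter least-squares model replaces the real TTS network (PortaSpeech/Tacotron/FastSpeech), a large "general" dataset replaces the general corpus, and a small "scene" dataset replaces the target data. All data and hyperparameters here are invented for the illustration.

```python
def train(weight, data, lr=0.1, epochs=200):
    """Gradient descent on the mean squared error of y ~ weight * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight

# Stage 1: plentiful general data (y = 2x) gives a good starting point.
general_data = [(x, 2.0 * x) for x in (1.0, 2.0, 3.0)]
# Stage 2: scarce scene data (y = 2.2x) nudges the pretrained model.
target_data = [(1.0, 2.2), (2.0, 4.4)]

w_pretrained = train(0.0, general_data)                    # preliminary training
w_finetuned = train(w_pretrained, target_data, epochs=50)  # scene fine-tuning
```

Starting fine-tuning from the pretrained weight means far fewer scene-specific samples and epochs are needed than training from scratch, which is the point the paragraph makes about scarce target data.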
In a further possible implementation, after the model training module 102 has trained the voice cloning model, it may send the model to the voice cloning module 103, so that the voice cloning module 103 can output audio simulating the pronunciation of the target object, thereby cloning the target object's voice. To this end, this embodiment may further include:
S305: The voice cloning module 103 uses the voice cloning model to output the audio corresponding to the target text.
The target text may be a test text used to present the cloning effect of the voice cloning model to the user 300, or may be a text specified in advance by the user for which speech needs to be synthesized.
As one implementation example, when the target text is a test text, the voice cloning module 103 may input a fixedly configured test text into the voice cloning model, which outputs the corresponding audio according to the test text; this audio simulates the target object pronouncing the test text in the target scene. The voice cloning module 103 may then output the audio, specifically by sending it to the client 200, which plays it to the user 300, so that the user 300 can perceive, from the played audio, how well the voice cloning model clones the target object's pronunciation in the target scene.
As another implementation example, when the target text is a test text, the target text may be provided by the user 300. In that case, the voice cloning module 103 may generate a test interface, for example the test interface shown in Figure 8, and present it to the user 300 through the client to prompt the user 300 to input a test text. Correspondingly, the voice cloning module 103 may, in response to the user's operation on the test interface, obtain the test text input by the user 300, input it into the voice cloning model, and obtain the audio output by the model. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300, so that the user 300 can perceive, from the played audio, how well the voice cloning model clones the target object's pronunciation in the target scene.
As yet another implementation example, the target text is a text specified in advance by the user 300 for which speech needs to be synthesized. For example, when the target scene is a storytelling scene, the user 300 may specify the name or text of a story in advance, so that the voice cloning module 103 can input the text specified by the user 300 (such as the story text) into the voice cloning model and obtain the corresponding audio output by the model. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300 to meet the user's need for voice cloning of that text; for example, the user 300 can hear audio that simulates the target object telling the story.
As a further implementation example, the target text is a text input by the user 300 for which speech needs to be synthesized. Correspondingly, after receiving the voice cloning model, the voice cloning module 103 may generate a speech synthesis interface and present it to the user 300 through the client 200. The voice cloning module 103 may then receive, through the client 200, the target text input by the user 300 that needs to be synthesized, input it into the voice cloning model, and obtain the audio output by the model according to the target text. The voice cloning module 103 may then send the audio to the client 200, which plays it to the user 300 to meet the user's need for voice cloning of the target text.
It should be noted that the embodiment shown in Figure 3 is illustrated by the process in which the voice cloning apparatus 100 generates audio that clones the target object's pronunciation in the target scene. In practice, the voice cloning apparatus 100 may, in a similar manner, train a voice cloning model for each scene, with the model corresponding to each scene generating audio that clones the target object's pronunciation in that scene. Moreover, for different objects, a voice cloning model corresponding to each object in each scene may be trained in a similar manner. In this way, the voice cloning apparatus 100 can train multiple different voice cloning models for different scenes and different objects, allowing the user to select both the pronunciation scene and the object to be cloned, which improves the flexibility and richness of voice cloning.
Thus, after the user 300 specifies a scene and an object, the voice cloning apparatus 100 can use the voice cloning model corresponding to the specified scene and object to generate the corresponding audio and feed it back to the user 300. For example, the voice cloning module 103 may generate a speech synthesis interface that presents multiple candidate scenes and multiple candidate objects to the user, so that the user can select one of the candidate scenes and one of the candidate objects on that interface. Correspondingly, the voice cloning module 103 may determine the candidate scene selected by the user as the target scene and the candidate object selected by the user as the target object, and further determine the voice cloning model corresponding to that target scene for simulating the target object's pronunciation. The voice cloning module 103 may then use the determined voice cloning model to synthesize, from a preconfigured target text or a target text input by the user on the speech synthesis interface, audio that simulates the target object's pronunciation in the target scene.
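The per-scene, per-object organization described above amounts to a lookup keyed by (scene, speaker) pairs. The sketch below is illustrative only: `VoiceCloneRegistry` is a hypothetical name, and a lambda stands in for a trained model.

```python
class VoiceCloneRegistry:
    """One cloned-voice model per (scene, speaker) pair."""

    def __init__(self):
        self._models = {}

    def register(self, scene, speaker, model):
        self._models[(scene, speaker)] = model

    def synthesize(self, scene, speaker, text):
        model = self._models.get((scene, speaker))
        if model is None:
            raise KeyError(f"no model for scene={scene!r}, speaker={speaker!r}")
        return model(text)

registry = VoiceCloneRegistry()
# A real entry would hold a trained TTS model; a lambda stands in here.
registry.register("finance", "object_1",
                  lambda text: f"<audio finance/object_1: {text}>")
clip = registry.synthesize("finance", "object_1", "今年房地产价格走势")
```

Selecting a scene and an object on the synthesis interface then reduces to choosing the key under which the corresponding model was registered.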
In other implementations, the speech synthesis interface generated by the voice cloning module 103 may also support only selecting one of multiple scenes as the target scene, or only selecting one of multiple candidate objects as the target object, which is not limited in this embodiment.
In the embodiment shown in Figure 3 above, the voice cloning apparatus involved in the voice cloning process (including the data acquisition module 101, the model training module 102, and the voice cloning module 103) may be software configured on a computing device or a cluster of computing devices, and by running this software on the computing device or cluster, the computing device or cluster can implement the functions of the voice cloning apparatus described above. The voice cloning apparatus involved in the voice cloning process is described in detail below from the perspective of hardware implementation.
Figure 9 shows a schematic structural diagram of a computing device on which the voice cloning apparatus may be deployed. The computing device may be a computing device in a cloud environment (such as a server), a computing device in an edge environment, or a terminal device, and may specifically be used to implement the functions of the modules of the voice cloning apparatus in the embodiment shown in Figure 3 above.
As shown in Figure 9, the computing device 900 includes a processor 920, a memory 910, a communication interface 930, and a bus 940. The processor 920, the memory 910, and the communication interface 930 communicate through the bus 940. The bus 940 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of presentation, only one thick line is used in Figure 9, but this does not mean that there is only one bus or one type of bus. The communication interface 930 is used for external communication, for example receiving raw data provided by a user and a feature-extraction network model to be trained.
The processor 920 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits. The processor 920 may also be an integrated circuit chip with signal-processing capability. During implementation, the functions of the modules of the voice cloning apparatus may be completed by integrated logic circuits in hardware or by instructions in software form in the processor 920. The processor 920 may also be a general-purpose processor, a digital signal processor (DSP), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may reside in storage media that are mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium resides in the memory 910; the processor 920 reads the information in the memory 910 and, in combination with its hardware, completes some or all of the functions of the voice cloning apparatus.
The memory 910 may include volatile memory, such as random access memory (RAM). The memory 910 may also include non-volatile memory, such as read-only memory (ROM), flash memory, an HDD, or an SSD.
The memory 910 stores executable code, and the processor 920 executes this code to perform the method performed by the voice cloning apparatus described above.
Specifically, in the case of implementing the embodiment shown in Figure 3, and where the data acquisition module 101, the model training module 102, and the voice cloning module 103 described in that embodiment are implemented in software, the software or program code required to perform the functions of the data acquisition module 101, the model training module 102, and the voice cloning module 103 in Figure 3 is stored in the memory 910; the interaction between the data acquisition module 101 and other devices is implemented through the communication interface 930; and the processor executes the instructions in the memory 910 to implement the method performed by the voice cloning apparatus.
Figure 10 shows a schematic structural diagram of a cluster of computing devices. The computing device cluster 10 shown in Figure 10 includes multiple computing devices, and the voice cloning apparatus described above may be deployed in a distributed manner on the multiple computing devices in the cluster 10. As shown in Figure 10, the computing device cluster 10 includes multiple computing devices 1000, and each computing device 1000 includes a memory 1010, a processor 1020, a communication interface 1030, and a bus 1040, where the memory 1010, the processor 1020, and the communication interface 1030 are communicatively connected to one another through the bus 1040.
The processor 1020 may be a CPU, a GPU, an ASIC, or one or more integrated circuits. The processor 1020 may also be an integrated circuit chip with signal-processing capability. During implementation, some of the functions of the voice cloning apparatus may be completed by integrated logic circuits in hardware or by instructions in software form in the processor 1020. The processor 1020 may also be a DSP, an FPGA, a general-purpose processor, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute some of the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may reside in storage media that are mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium resides in the memory 1010; in each computing device 1000, the processor 1020 reads the information in the memory 1010 and, in combination with its hardware, can complete some of the functions of the voice cloning apparatus.
The memory 1010 may include ROM, RAM, static storage devices, dynamic storage devices, hard disks (such as SSDs or HDDs), and so on. The memory 1010 may store program code, for example some or all of the program code used to implement the data acquisition module 101, some or all of the program code used to implement the model training module 102, and some or all of the program code used to implement the voice cloning module 103. For each computing device 1000, when the program code stored in the memory 1010 is executed by the processor 1020, the processor 1020 performs, based on the communication interface 1030, part of the method performed by the voice cloning apparatus; for example, some of the computing devices 1000 may be used to perform the method performed by the data acquisition module 101, others to perform the method performed by the model training module 102, and still others to perform the method performed by the voice cloning module 103. The memory 1010 may also store data, for example intermediate or result data generated by the processor 1020 during execution, such as the target corpus text, the audio, and the voice cloning model described above.
The communication interface 1030 in each computing device 1000 is used for external communication, for example for interacting with other computing devices 1000.
The bus 1040 may be a peripheral component interconnect standard bus, an extended industry standard architecture bus, or the like. For ease of presentation, the bus 1040 in each computing device 1000 in Figure 10 is represented by only one thick line, but this does not mean that there is only one bus or one type of bus.
Communication paths are established between the multiple computing devices 1000 through a communication network to implement the functions of the voice cloning apparatus. Any of the computing devices may be a computing device in a cloud environment (for example, a server), a computing device in an edge environment, or a terminal device.
In addition, an embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on one or more computing devices, cause the one or more computing devices to perform the methods performed by the modules of the voice cloning apparatus in the above embodiment.
In addition, an embodiment of this application further provides a computer program product. When the computer program product is executed by one or more computing devices, the one or more computing devices perform any one of the foregoing voice cloning methods. The computer program product may be a software installation package; when any of the foregoing voice cloning methods needs to be used, the computer program product can be downloaded and executed on a computer.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, the connection relationships between modules indicate that they have communication connections between them, which may specifically be implemented as one or more communication buses or signal lines.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memory, dedicated components, and the like. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is the preferred implementation in most cases. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the various embodiments of this application.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented using software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can store, or a data storage device such as a training device or data center that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

Claims (29)

  1. A voice cloning method, characterized in that the method comprises:
    determining a target scene;
    determining, according to the target scene, a target corpus text belonging to the target scene;
    determining, according to the target corpus text, audio of a target object, wherein the speech content of the audio matches the content of the target corpus text;
    training, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, wherein the voice cloning model is used to output audio that simulates the target object speaking in the target scene.
  2. The method according to claim 1, characterized in that the content context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  3. The method according to claim 1 or 2, characterized in that determining, according to the target scene, the corpus text belonging to the target scene comprises:
    obtaining the pinyin distribution of a plurality of corpus texts belonging to the target scene;
    selecting the target corpus text from the plurality of corpus texts according to the pinyin distribution of the plurality of corpus texts, wherein the number of target corpus texts is less than the number of the plurality of corpus texts, and the pinyin distribution of the target corpus text and the pinyin distribution of the plurality of corpus texts satisfy a preset condition.
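One possible reading of the selection step in claim 3 is a greedy coverage procedure: keep adding texts until the chosen subset covers the syllable inventory of the whole scene corpus. The sketch below is illustrative only and not part of the claimed disclosure; the per-character pinyin table, the coverage criterion, and the greedy ordering are all assumptions (a real system would use a pinyin conversion library and the claim's unspecified "preset condition").

```python
# Illustrative sketch (not the claimed implementation): select a small subset
# of scene texts whose pinyin distribution still covers the full corpus.
from collections import Counter

# Hypothetical per-character pinyin lookup; real systems use a pinyin library.
PINYIN = {"你": "ni", "好": "hao", "世": "shi", "界": "jie", "新": "xin", "闻": "wen"}

def pinyin_counts(text):
    """Count the pinyin syllables appearing in a text (unknown chars ignored)."""
    return Counter(PINYIN[c] for c in text if c in PINYIN)

def select_subset(corpus, coverage=1.0):
    """Greedily pick texts until the subset covers the required fraction of
    the distinct syllables found in the whole corpus."""
    target = set()
    for t in corpus:
        target |= set(pinyin_counts(t))
    chosen, covered = [], set()
    # Prefer texts contributing the most distinct syllables first.
    for t in sorted(corpus, key=lambda t: -len(set(pinyin_counts(t)))):
        new = set(pinyin_counts(t)) - covered
        if new:
            chosen.append(t)
            covered |= new
        if len(covered) >= coverage * len(target):
            break
    return chosen

corpus = ["你好", "新闻", "你好世界", "世界新闻"]
subset = select_subset(corpus)  # fewer texts, same syllable coverage
```

With this toy corpus, two of the four texts already cover all six syllables, which is the point of the claim: a smaller recording set whose phonetic distribution still matches the scene.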
  4. The method according to any one of claims 1 to 3, characterized in that determining, according to the target scene, the corpus text belonging to the target scene comprises:
    selecting the target corpus text from a plurality of corpus texts, wherein the proportion of technical terms in the target corpus text is greater than a proportion threshold, and the plurality of corpus texts belong to the target scene.
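The proportion-threshold filter in claim 4 can be sketched as a simple ratio test. This is an illustrative interpretation only; the term list, the whitespace tokenizer, and the 0.3 threshold are assumptions not stated in the claim.

```python
# Illustrative sketch (not the claimed implementation): keep only corpus texts
# in which domain-specific terms exceed a proportion threshold.
TERMS = {"equity", "bond", "dividend", "yield"}  # hypothetical finance terms

def term_ratio(text):
    """Fraction of whitespace-separated words that are domain terms."""
    words = text.lower().split()
    return sum(w in TERMS for w in words) / max(len(words), 1)

def select_texts(corpus, threshold=0.3):
    """Return the texts whose term proportion exceeds the threshold."""
    return [t for t in corpus if term_ratio(t) > threshold]

corpus = [
    "the dividend yield on this bond rose",   # 3 of 7 words are terms
    "we went for a walk in the park today",   # no domain terms
]
picked = select_texts(corpus)  # only the term-dense text survives
```

The effect is that recordings made from the selected texts exercise the vocabulary the cloned voice will actually need in that scene.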
  5. The method according to any one of claims 1 to 4, characterized in that determining, according to the target corpus text, the audio of the target object belonging to the target scene comprises:
    generating a recording interface, the recording interface being used to present the target corpus text to the target object;
    recording the target object's pronunciation of the target corpus text to obtain the audio of the target object.
  6. The method according to any one of claims 1 to 4, characterized in that determining, according to the target corpus text, the audio of the target object belonging to the target scene comprises:
    obtaining a plurality of audio recordings of the target object speaking in the target scene;
    determining, from the plurality of audio recordings, audio whose speech content matches the content of the target corpus text.
  7. The method according to any one of claims 1 to 6, characterized in that determining the target scene comprises:
    generating a scene configuration interface, the scene configuration interface being used to present a plurality of candidate scenes to a user;
    determining, from the plurality of candidate scenes, the target scene selected by the user.
  8. The method according to any one of claims 1 to 6, characterized in that determining the target scene comprises:
    generating a scene configuration interface, the scene configuration interface being used to prompt for input of an identifier of a user-defined target scene and corpus text belonging to the target scene;
    obtaining, in response to the user's operation on the scene configuration interface, the identifier of the user-defined target scene and the corpus text belonging to the target scene.
  9. The method according to any one of claims 1 to 8, characterized in that the method further comprises:
    generating a test interface, the test interface being used to prompt a user to input text;
    obtaining, in response to the user's operation on the test interface, target text input by the user;
    inputting the target text into the voice cloning model to obtain audio output by the voice cloning model.
  10. A voice cloning method, characterized in that the method comprises:
    receiving a target scene and target text input by a user;
    determining, according to the target scene, a voice cloning model corresponding to the target scene;
    outputting, based on the voice cloning model, target audio corresponding to the target text, wherein the voice cloning model is used to output audio that simulates a target object speaking in the target scene.
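The inference-side flow of claims 10 to 13 amounts to scene-keyed dispatch: a registry maps each scene to its scene-specific cloning model, and synthesis routes the user's text to the model for the chosen scene. The sketch below is a structural illustration only; the class name, the registry design, and the stand-in callables are assumptions, not the claimed implementation (a real model would be a trained TTS network, not a lambda).

```python
# Illustrative sketch (not the claimed implementation): route synthesis
# requests to the voice cloning model registered for the chosen scene.
class VoiceCloneService:
    def __init__(self):
        self._models = {}  # scene identifier -> model callable

    def register(self, scene, model):
        """Associate a scene with its trained scene-specific model."""
        self._models[scene] = model

    def synthesize(self, scene, text):
        """Look up the model for the scene and synthesize audio for the text."""
        if scene not in self._models:
            raise KeyError(f"no model trained for scene {scene!r}")
        return self._models[scene](text)

svc = VoiceCloneService()
# Stand-in "models": real ones would return waveform data.
svc.register("news", lambda text: f"<news-style audio for: {text}>")
svc.register("story", lambda text: f"<story-style audio for: {text}>")
audio = svc.synthesize("news", "Good evening.")
```

Training a model per scene (claims 1 to 9) and then dispatching on the scene at synthesis time is what lets the same target voice speak in a news register for one request and a storytelling register for another.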
  11. The method according to claim 10, characterized in that the content context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  12. The method according to claim 10 or 11, characterized in that receiving the target scene and the target text input by the user comprises:
    generating a speech synthesis interface, the speech synthesis interface being used to present a plurality of candidate scenes to the user;
    determining, from the plurality of candidate scenes, the target scene selected by the user;
    receiving the target text input by the user on the speech synthesis interface.
  13. The method according to claim 12, characterized in that the speech synthesis interface is further used to present a plurality of candidate objects to the user;
    the method further comprises:
    determining, from the plurality of candidate objects, the target object selected by the user.
  14. A voice cloning apparatus, characterized in that the voice cloning apparatus comprises:
    a data acquisition module, configured to determine a target scene, determine, according to the target scene, a target corpus text belonging to the target scene, and determine, according to the target corpus text, audio of a target object, wherein the speech content of the audio matches the content of the target corpus text;
    a model training module, configured to train, using the target corpus text and the audio, a voice cloning model corresponding to the target scene, wherein the voice cloning model is used to output audio that simulates the target object speaking in the target scene.
  15. The apparatus according to claim 14, characterized in that the context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  16. The apparatus according to claim 14 or 15, characterized in that the data acquisition module is configured to:
    obtain the pinyin distribution of a plurality of corpus texts belonging to the target scene;
    select the target corpus text from the plurality of corpus texts according to the pinyin distribution of the plurality of corpus texts, wherein the number of target corpus texts is less than the number of the plurality of corpus texts, and the pinyin distribution of the target corpus text and the pinyin distribution of the plurality of corpus texts satisfy a preset condition.
  17. The apparatus according to any one of claims 14 to 16, characterized in that the data acquisition module is configured to:
    select the target corpus text from a plurality of corpus texts, wherein the proportion of technical terms in the target corpus text is greater than a proportion threshold, and the plurality of corpus texts belong to the target scene.
  18. The apparatus according to any one of claims 14 to 17, characterized in that the data acquisition module is configured to:
    generate a recording interface, the recording interface being used to present the target corpus text to the target object;
    record the target object's pronunciation of the target corpus text to obtain the audio of the target object.
  19. The apparatus according to any one of claims 14 to 17, characterized in that the data acquisition module is configured to:
    obtain a plurality of audio recordings of the target object speaking in the target scene;
    determine, from the plurality of audio recordings, audio whose speech content matches the content of the target corpus text.
  20. The apparatus according to any one of claims 14 to 19, characterized in that the data acquisition module is configured to:
    generate a scene configuration interface, the scene configuration interface being used to present a plurality of candidate scenes to a user;
    determine, from the plurality of candidate scenes, the target scene selected by the user.
  21. The apparatus according to any one of claims 14 to 19, characterized in that the data acquisition module is configured to:
    generate a scene configuration interface, the scene configuration interface being used to prompt for input of an identifier of a user-defined target scene and corpus text belonging to the target scene;
    obtain, in response to the user's operation on the scene configuration interface, the identifier of the user-defined target scene and the corpus text belonging to the target scene.
  22. The apparatus according to any one of claims 14 to 21, characterized in that the voice cloning apparatus further comprises a voice cloning module, configured to:
    generate a test interface, the test interface being used to prompt a user to input text;
    obtain, in response to the user's operation on the test interface, target text input by the user;
    input the target text into the voice cloning model to obtain audio output by the voice cloning model.
  23. A voice cloning apparatus, characterized in that the voice cloning apparatus comprises:
    a data acquisition module, configured to receive a target scene and target text input by a user;
    a voice cloning module, configured to determine, according to the target scene, a voice cloning model corresponding to the target scene, and output, based on the voice cloning model, target audio corresponding to the target text, wherein the voice cloning model is used to output audio that simulates a target object speaking in the target scene.
  24. The apparatus according to claim 23, characterized in that the context of the target corpus text matches the context indicated by the target scene;
    the target scene comprises any one of the following:
    a dialogue scene, a news scene, a finance scene, a live-streaming scene, a story scene, an education scene, or a speech scene;
    alternatively, the target scene is a scene obtained by division according to emotion type.
  25. The apparatus according to claim 23 or 24, characterized in that the data acquisition module is configured to:
    generate a speech synthesis interface, the speech synthesis interface being used to present a plurality of candidate scenes to the user;
    determine, from the plurality of candidate scenes, the target scene selected by the user;
    receive the target text input by the user on the speech synthesis interface.
  26. The apparatus according to claim 25, characterized in that the speech synthesis interface is further used to present a plurality of candidate objects to the user;
    the data acquisition module is further configured to determine, from the plurality of candidate objects, the target object selected by the user.
  27. A computing device cluster, characterized by comprising at least one computing device, each computing device comprising a processor and a memory;
    the processor is configured to execute instructions stored in the memory, so that the computing device cluster performs the method according to any one of claims 1 to 7.
  28. A computer-readable storage medium, characterized in that the computer-readable storage medium stores instructions that, when run on at least one computing device, cause the at least one computing device to perform the method according to any one of claims 1 to 9, or cause the at least one computing device to perform the method according to any one of claims 10 to 13.
  29. A computer program product containing instructions, characterized in that, when run on at least one computing device, it causes the at least one computing device to perform the method according to any one of claims 1 to 9, or causes the at least one computing device to perform the method according to any one of claims 10 to 13.
PCT/CN2023/081526 2022-06-29 2023-03-15 Voice cloning method and apparatus, and related device WO2024001307A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202210778187.2 2022-06-29
CN202210778187 2022-06-29
CN202211071940.0A CN117373432A (en) 2022-06-29 2022-09-02 Voice cloning method and device and related equipment
CN202211071940.0 2022-09-02

Publications (1)

Publication Number Publication Date
WO2024001307A1

Family

ID=89382602

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/081526 WO2024001307A1 (en) 2022-06-29 2023-03-15 Voice cloning method and apparatus, and related device

Country Status (1)

Country Link
WO (1) WO2024001307A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190251952A1 (en) * 2018-02-09 2019-08-15 Baidu Usa Llc Systems and methods for neural voice cloning with a few samples
CN112885326A (en) * 2019-11-29 2021-06-01 阿里巴巴集团控股有限公司 Method and device for creating personalized speech synthesis model, method and device for synthesizing and testing speech
CN113241056A (en) * 2021-04-26 2021-08-10 标贝(北京)科技有限公司 Method, device, system and medium for training speech synthesis model and speech synthesis
CN113327574A (en) * 2021-05-31 2021-08-31 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium


Similar Documents

Publication Publication Date Title
JP6799574B2 (en) Method and device for determining satisfaction with voice dialogue
CN107516510B (en) Automatic voice testing method and device for intelligent equipment
JP6786751B2 (en) Voice connection synthesis processing methods and equipment, computer equipment and computer programs
CN106652997B (en) Audio synthesis method and terminal
CN111741326B (en) Video synthesis method, device, equipment and storage medium
US11882319B2 (en) Virtual live video streaming method and apparatus, device, and readable storage medium
CN110473525B (en) Method and device for acquiring voice training sample
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
US10665218B2 (en) Audio data processing method and device
US11511200B2 (en) Game playing method and system based on a multimedia file
KR20210001859A (en) 3d virtual figure mouth shape control method and device
WO2017059694A1 (en) Speech imitation method and device
TWI731382B (en) Method, device and equipment for speech synthesis
CN109389427A (en) Questionnaire method for pushing, device, computer equipment and storage medium
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN104505103B (en) Voice quality assessment equipment, method and system
WO2021227308A1 (en) Video resource generation method and apparatus
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
CN112614478A (en) Audio training data processing method, device, equipment and storage medium
CN113691909A (en) Digital audio workstation with audio processing recommendations
WO2023241360A1 (en) Online class voice interaction methods and apparatus, device and storage medium
WO2024001307A1 (en) Voice cloning method and apparatus, and related device
CN111966803B (en) Dialogue simulation method and device, storage medium and electronic equipment
CN117373432A (en) Voice cloning method and device and related equipment
CN114822492B (en) Speech synthesis method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829499

Country of ref document: EP

Kind code of ref document: A1