CN114187892A - Style migration synthesis method and device and electronic equipment - Google Patents

Style migration synthesis method and device and electronic equipment

Info

Publication number
CN114187892A
CN114187892A
Authority
CN
China
Prior art keywords
audio
target
pronunciation
sample
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111491886.0A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111491886.0A priority Critical patent/CN114187892A/en
Publication of CN114187892A publication Critical patent/CN114187892A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 19/04 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a style migration synthesis method and apparatus and an electronic device, relating to the field of artificial intelligence and, in particular, to deep learning, speech synthesis and style migration technologies. The specific implementation scheme is as follows: a target text and a target audio clip are input into a speech synthesis model trained in advance on sample texts and sample audio clips; for each audio unit in the target audio clip, a coarse-grained audio feature and a fine-grained audio feature are superimposed to obtain the superimposed audio feature of the audio unit; the pronunciation feature of each pronunciation unit in the target text is extracted; for each pronunciation unit in the target text, its pronunciation feature is fused with the matched target superimposed audio feature to obtain the fusion feature of the pronunciation unit; and the audio clip is synthesized from the fusion features. An audio clip can thus be synthesized that has the target style both as a whole and in its details.

Description

Style migration synthesis method and device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning, speech synthesis and style migration technologies, and more particularly to a speech style migration and synthesis method, apparatus and electronic device.
Background
Various practical requirements, such as the voice-changing function provided in voice chat software or hiding the true identity of a speaker, make it necessary to synthesize, from a given audio clip and a given text, an audio clip that has the same voice style as the given audio clip and whose speech content is the text. Because this can be regarded as migrating the voice style of the audio clip onto the text, the process is called style migration synthesis.
Disclosure of Invention
The disclosure provides a style migration and synthesis method and device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a style migration and synthesis method, including:
inputting a target text and a target audio clip with a target voice style into a voice synthesis model obtained by training a sample text and a sample audio clip in advance;
through the style extraction sub-model of the speech synthesis model, superimposing, for each audio unit in the target audio clip, a coarse-grained audio feature used for characterizing the target audio clip and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model;
through the content style cross attention sub-model of the speech synthesis model, fusing, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit with a target superimposed audio feature to obtain a fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the pronunciation feature;
and synthesizing an audio segment which has the target voice style and has voice content as the target text according to the fusion characteristics of each pronunciation unit in the target text through the sound spectrum decoding submodel of the voice synthesis model.
According to a second aspect of the present disclosure, there is provided a training method of a speech synthesis model, including:
inputting a sample audio clip and a sample text into an original model, wherein the sample text is the voice content of the sample audio clip;
superposing, by the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit to obtain a superposed audio feature of the audio unit;
extracting pronunciation characteristics of each pronunciation unit in the sample text through the original model;
fusing the pronunciation characteristics of the pronunciation units and target superposition audio characteristics aiming at each pronunciation unit in the sample text through the original model to obtain the fusion characteristics of the pronunciation units, wherein the target superposition audio characteristics are superposition audio characteristics matched with the pronunciation characteristics;
converting the fusion features of each pronunciation unit in the sample text into prediction sound spectrum features through the original model;
adjusting model parameters of the original model according to the difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio clip;
and acquiring a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
According to a third aspect of the present disclosure, there is provided a style migration and synthesis apparatus comprising:
the first input module is used for inputting the target text and the target audio clip with the target voice style into a voice synthesis model which is obtained by training a sample text and a sample audio clip in advance;
the style extraction module is used for superimposing, through the style extraction sub-model of the speech synthesis model and for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
the content coding module is used for extracting the pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the voice synthesis model;
the content style cross attention module is used for fusing pronunciation characteristics of the pronunciation units and target superposed audio characteristics aiming at each pronunciation unit in the target text through a content style cross attention submodel of the voice synthesis model to obtain the fused characteristics of the pronunciation units, wherein the target superposed audio characteristics are superposed audio characteristics matched with the pronunciation characteristics;
and the sound spectrum decoding module is used for synthesizing an audio segment which has the target voice style and has the voice content of the target text according to the fusion characteristics of each pronunciation unit in the target text through a sound spectrum decoding sub-model of the voice synthesis model.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for a speech synthesis model, comprising:
the second input module is used for inputting a sample audio clip and a sample text into the original model, wherein the sample text is the voice content of the sample audio clip;
a first original module, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
the second original module is used for extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model;
a third original module, configured to fuse, by using the original model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the sample text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
a fourth original module, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module, configured to adjust a model parameter of the original model according to a difference between the predicted audio spectrum feature and a true audio spectrum feature of the sample audio segment;
and the obtaining module is used for obtaining a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first or second aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above first or second aspects.
According to a seventh aspect provided by the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first or second aspects above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of a style migration synthesis method provided in accordance with the present disclosure;
FIG. 2 is a schematic diagram of a structure of a speech synthesis model used in a style migration synthesis method provided in accordance with the present disclosure;
FIG. 3a is a schematic structural diagram of the style extraction submodel in a speech synthesis model used in a style migration synthesis method provided in accordance with the present disclosure;
FIG. 3b is a schematic structural diagram of a content coding sub-model in a speech synthesis model used in a style migration synthesis method provided according to the present disclosure;
FIG. 3c is a schematic structural diagram of a content style cross attention submodel in a speech synthesis model used in a style migration synthesis method provided according to the present disclosure;
FIG. 3d is a schematic structural diagram of a sub-model for audio spectrum decoding in a speech synthesis model used in the style migration synthesis method provided in accordance with the present disclosure;
FIG. 4 is a schematic flow chart diagram of a method of training a speech synthesis model provided in accordance with the present disclosure;
FIG. 5 is a schematic diagram of one configuration of a style migration and synthesis apparatus provided in accordance with the present disclosure;
FIG. 6 is a schematic diagram of an architecture of a training apparatus for a speech synthesis model provided according to the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a style migration synthesis method or a training method of a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to more clearly illustrate the style migration and synthesis method provided by the present disclosure, an exemplary application scenario of the method is described below. It should be understood that the following example is only one possible application scenario; in other possible embodiments, the style migration and synthesis method provided by the present disclosure may also be applied to other application scenarios, and the present disclosure is not limited in this respect.
For the purpose of hiding the true identity of a target person, such as a player of an online game or an interviewee in a news interview, the words spoken by the target person can be converted into a text, and another voice style (hereinafter referred to as the target voice style) different from that of the target person can then be migrated onto the text through style migration synthesis.
In the related art, style migration synthesis is often implemented as follows: a target audio clip with the target speech style is encoded by an encoding network to obtain style features, and the text converted from the words spoken by the target person is encoded to obtain content features; the style features and the content features are input into a decoding network trained in advance to obtain the sound spectrum features output by the decoding network; and a vocoder then converts the sound spectrum features into an audio clip, yielding an audio clip that has the target speech style and whose speech content is the text.
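For concreteness, a minimal sketch of this related-art pipeline is given below; the module names, network types and dimensions are illustrative assumptions and are not the networks used in this disclosure.

```python
# Minimal sketch of the related-art pipeline described above (assumed module
# names and shapes, for illustration only).
import torch
import torch.nn as nn

class BaselineStyleTransfer(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.style_encoder = nn.GRU(n_mels, d_model, batch_first=True)   # encodes the target audio into one style vector
        self.content_encoder = nn.Embedding(n_phonemes, d_model)         # encodes the text (phoneme ids) into content features
        self.decoder = nn.Linear(2 * d_model, n_mels)                    # maps style + content to spectrogram frames

    def forward(self, style_mels, phoneme_ids):
        _, style = self.style_encoder(style_mels)                        # (1, B, d_model): a single utterance-level style feature
        content = self.content_encoder(phoneme_ids)                      # (B, T_text, d_model)
        style = style[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, style], dim=-1))         # predicted spectrogram, later sent to a vocoder

out = BaselineStyleTransfer()(torch.randn(1, 200, 80), torch.randint(0, 100, (1, 30)))
```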
However, this solution only makes the synthesized audio clip sound similar to the target audio clip in its overall acoustic features, while details such as speech speed, emotion, pitch, cadence, pauses and accent can differ greatly from the target audio clip. In other words, the synthesized audio clip does not have the target style in its details.
Based on this, the present disclosure provides a style migration and synthesis method, which may be applied to any electronic device with a style migration and synthesis function, including but not limited to a mobile phone, a tablet computer, a personal computer, a server, and the like, and the style migration and synthesis method provided by the present disclosure may be as shown in fig. 1, including:
s101, inputting a target text and a target audio clip with a target voice style into a voice synthesis model obtained by training a sample text and a sample audio clip in advance.
S102, through the style extraction sub-model of the speech synthesis model, superimposing, for each audio unit in the target audio segment, the coarse-grained audio feature used for characterizing the target audio segment and the fine-grained audio feature used for characterizing the audio unit, to obtain the superimposed audio feature of the audio unit.
S103, extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model.
And S104, through the content style cross attention sub-model of the speech synthesis model, fusing, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit with the target superimposed audio feature to obtain the fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the pronunciation feature.
And S105, through the sound spectrum decoding sub-model of the speech synthesis model, synthesizing, from the fusion features of the pronunciation units in the target text, an audio segment that has the target voice style and whose speech content is the target text.
By adopting this embodiment, the superimposed audio features are obtained by superimposing audio features of different granularities, so they reflect both the overall characteristics of the target style and the detail characteristics of the target audio segment. A cross attention mechanism then fuses the pronunciation feature of each pronunciation unit with its matched superimposed audio feature; the resulting fusion features reflect, on the one hand, the speech content of the target text and, on the other hand, the overall and detail characteristics of the target style. Because each pronunciation feature is fused with its matched superimposed audio feature, the detail characteristics reflected by the fusion feature of each pronunciation unit are those of that pronunciation unit as it would be pronounced in the target style. Therefore, in the audio clip synthesized from the fusion features, not only are the overall acoustic features similar to the target style, but the pronunciation of each pronunciation unit is also similar to the target style; that is, an audio clip having the target style both as a whole and in its details can be synthesized.
In order to more clearly describe the foregoing steps of S101-S105, the following will first describe the speech synthesis model provided by the present disclosure, and referring to fig. 2, fig. 2 is a schematic structural diagram of the speech synthesis model provided by the present disclosure, which includes:
a style extraction submodel, a content coding submodel, a content style cross attention submodel and a sound spectrum decoding submodel.
The input of the style extraction submodel is the target audio clip, and its output is the superimposed audio feature of each audio unit in the target audio clip. Each audio unit is composed of M consecutive audio frames in the audio clip, and each audio frame belongs to only one audio unit. M may be set according to the user's actual experience or requirements, such as M = 2, 4, 5 or 9, which is not limited in this disclosure. Each audio frame is N ms of consecutive audio data in the audio clip, and the interval between two adjacent audio frames is Q ms, where Q is not greater than N; for example, N = 25 and Q = 10, N = 20 and Q = 20, or N = 28 and Q = 21, which is likewise not limited in this disclosure.
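The following sketch illustrates how an audio clip could be split into frames and audio units under this definition, using the example values N = 25 ms, Q = 10 ms and M = 4; the sampling rate and the function name are assumptions for illustration only.

```python
# Illustrative framing of an audio clip into frames and audio units.
import numpy as np

def split_into_units(samples, sr=16000, frame_ms=25, hop_ms=10, frames_per_unit=4):
    frame_len = int(sr * frame_ms / 1000)   # N ms of audio data per frame
    hop_len = int(sr * hop_ms / 1000)       # Q ms between adjacent frames
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop_len)]
    # Each audio unit is M consecutive frames; each frame belongs to exactly one unit.
    units = [frames[i:i + frames_per_unit]
             for i in range(0, len(frames), frames_per_unit)]
    return frames, units

frames, units = split_into_units(np.zeros(16000))   # 1 s of audio -> 98 frames, 25 units
```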
The content coding sub-model inputs target texts and outputs pronunciation characteristics of all pronunciation units in the target texts. The content encoding submodel is used to implement the aforementioned step of S103.
Each pronunciation unit in the target text is composed of k consecutive phonemes in the pronunciation of the target text, where k may be set according to the user's actual needs or experience. For example, when k is 1, each pronunciation unit is one phoneme in the pronunciation of the target text; taking the target text "百" (bǎi), read with its Chinese pronunciation, as an example, the pronunciation units include the pronunciation unit "b" and the pronunciation unit "ai".
The input of the content style cross attention submodel is the superimposed audio features of the audio units in the target audio segment and the pronunciation features of the pronunciation units in the target text; that is, the input of the content style cross attention submodel is the output of the style extraction submodel and the output of the content coding submodel. The output of the content style cross attention submodel is the fusion feature of each pronunciation unit in the target text. The content style cross attention submodel is used to implement the aforementioned step S104.
The input of the sound spectrum decoding submodel is the fusion feature of each pronunciation unit in the target text; that is, the input of the sound spectrum decoding submodel is the output of the content style cross attention submodel. The output of the sound spectrum decoding submodel is the synthesized audio clip. The sound spectrum decoding submodel is used to implement the aforementioned step S105.
The implementation of the foregoing S102-S105 will be described below with reference to fig. 3 a-3 d, which are schematic structural diagrams of each sub-model in the speech synthesis model, respectively, in conjunction with the structure of each sub-model in the speech synthesis model:
as shown in fig. 3a, the style extraction submodel includes a wave pair feature vector (Wav2Vec) sub-network, a first branch formed by a Linear (Linear) sub-network, a Long Short-Term Memory (LSTM) sub-network, a pooling sub-network, and a second branch formed by the Long Short-Term Memory sub-network and the pooling sub-network.
The wave pair feature vector sub-network inputs the target audio clip and outputs the audio features of each audio frame in the target audio clip. The number of target audio pieces input to the wave pair feature vector sub-network may be one or more, and the present disclosure does not limit this. Illustratively, in one possible embodiment, the number of the target audio segments is two, wherein one of the target audio segments is used for reflecting the overall characteristics of the target style, and the other target audio segment is used for reflecting the detailed characteristics of the target style, and in another possible embodiment, the number of the target audio segments is four, wherein one of the target audio segments is used for reflecting the overall characteristics of the target style, and the remaining three target audio segments are used for reflecting the detailed characteristics of the target style.
The inputs of the first branch and the second branch are the audio features of each audio frame in the target audio segment, that is, the inputs of the first branch and the second branch are the outputs of the wave pair feature vector sub-network. If the number of the target audio segments is one, the inputs of the first branch and the second branch are the same and are the feature vectors of each audio frame in the target audio segment. If the number of the target audio segments is multiple, the input of the first branch is different from the input of the second branch, the input of the first branch is the audio feature of each audio frame in the target audio segment for reflecting the detailed feature of the target style, and the input of the second branch is the audio feature of each audio frame in the target audio segment for reflecting the overall feature of the target style.
Illustratively, assuming that the target audio segments include a first target audio segment and a second target audio segment, wherein the first target audio segment is used for reflecting the detail features of the target style, and the second target audio segment reflects the overall features of the target style, the input of the first branch is the audio features of the audio frames in the first target audio segment, and the input of the second branch is the audio features of the audio frames in the second target audio segment.
The output of the first branch is the fine-grained audio feature of each audio unit in the target audio segment. The first branch averages the audio features of all audio frames contained in each audio unit to obtain the fine-grained audio feature used for characterizing that audio unit. Because a fine-grained audio feature obtained in this way draws only on the audio features of the frames within its audio unit, it preserves more of the detail characteristics of the target style.
The output of the second branch is the coarse-grained audio feature of the target audio segment. The second branch averages the audio features of all audio frames in the target audio segment to obtain the coarse-grained audio feature used for characterizing the target audio segment. Because the coarse-grained audio feature is obtained by averaging the audio features of all frames, it reflects more of the overall characteristics of the target style.
And the style extraction submodel adds each fine-grained audio feature output by the first branch and the coarse-grained audio feature output by the second branch to obtain a superposed audio feature and outputs the superposed audio feature to the content style cross attention submodel.
Illustratively, assume there are three audio units in total, denoted audio units 1-3, where the fine-grained audio feature of audio unit 1 is x1, that of audio unit 2 is x2, and that of audio unit 3 is x3, and assume the coarse-grained audio feature is x4. Then the superimposed audio feature of audio unit 1 is x1+x4, the superimposed audio feature of audio unit 2 is x2+x4, and the superimposed audio feature of audio unit 3 is x3+x4. Here x1+x4 means adding the values of x1 and x4 in each feature dimension; for example, if there are three feature dimensions in total and x1 = (a1, b1, c1), x4 = (a4, b4, c4), then the superimposed audio feature is (a1+a4, b1+b4, c1+c4). The same holds for x2+x4 and x3+x4, which is not repeated here.
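A simplified sketch of the style extraction sub-model is given below. The frame-level features are assumed to be precomputed (for example by a Wav2Vec-style encoder), the Linear/LSTM layers of FIG. 3a are reduced to a minimum, and all dimensions are illustrative assumptions rather than the disclosure's actual configuration.

```python
# Sketch: fine-grained pooling per audio unit (first branch), coarse-grained
# pooling over the whole clip (second branch), then addition of the two.
import torch
import torch.nn as nn

class StyleExtractor(nn.Module):
    def __init__(self, d_feat=512, d_model=256, frames_per_unit=4):
        super().__init__()
        self.M = frames_per_unit
        self.fine_proj = nn.Linear(d_feat, d_model)
        self.fine_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.coarse_lstm = nn.LSTM(d_feat, d_model, batch_first=True)

    def forward(self, frame_feats):                      # (B, T_frames, d_feat) frame-level features
        B, T, _ = frame_feats.shape
        # First branch: fine-grained feature per audio unit (mean over its M frames).
        fine, _ = self.fine_lstm(self.fine_proj(frame_feats))
        fine = fine[:, :(T // self.M) * self.M].reshape(B, T // self.M, self.M, -1).mean(dim=2)
        # Second branch: one coarse-grained feature per clip (mean over all frames).
        coarse, _ = self.coarse_lstm(frame_feats)
        coarse = coarse.mean(dim=1, keepdim=True)        # (B, 1, d_model)
        return fine + coarse                             # superimposed audio features, broadcast over units

superimposed = StyleExtractor()(torch.randn(2, 96, 512))   # -> (2, 24, 256)
```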
As shown in fig. 3b, the content coding submodel is composed of a Phoneme embedding (phone embedding) subnetwork, a Layer Norm (Layer Norm) subnetwork, and a Position embedding (Position embedding) subnetwork.
The input of the phoneme embedding sub-network is the target text, and its output is the feature vector of each pronunciation unit. The phoneme embedding sub-network is used to encode the pronunciations of the pronunciation units in the target text into feature vectors.
The input of the layer norm sub-network is the feature vector of each pronunciation unit in the target text; that is, the input of the layer norm sub-network is the output of the phoneme embedding sub-network. The layer norm sub-network is used to further extract pronunciation features from the feature vectors.
The input of the position embedding sub network is the target text, and the output is the position code of each pronunciation unit in the target text. And the position embedding sub-network is used for coding the position of each pronunciation unit in the target text to obtain the position code of each pronunciation unit in the target text.
The content coding sub-model combines the pronunciation feature and the position code of each pronunciation unit in the target text, so that the position code expresses the position, within the target text, of the pronunciation unit to which each pronunciation feature belongs, and outputs the combined pronunciation features to the content style cross attention sub-model.
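A minimal sketch of the content coding sub-model follows; the phoneme vocabulary size, feature dimension and maximum length are assumed values, and a learned position embedding stands in for the position embedding sub-network.

```python
# Sketch: phoneme embedding + layer norm + position embedding, combined by addition.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, max_len=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, phoneme_ids):                       # (B, T_text) phoneme indices
        feats = self.layer_norm(self.phoneme_emb(phoneme_ids))
        positions = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        return feats + self.pos_emb(positions)            # pronunciation features carrying position information

pron_feats = ContentEncoder()(torch.randint(0, 100, (2, 30)))   # -> (2, 30, 256)
```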
As shown in fig. 3c, the content style cross attention sub-model includes a text self-attention (Text Self-attention) sub-network (hereinafter simply referred to as the self-attention sub-network), a first addition, norm and mapping (Add & Norm & Projection) sub-network, a multi-head cross-attention (Multi-head cross-attention) sub-network (hereinafter simply referred to as the cross-attention sub-network), and a second addition, norm and mapping sub-network. The first and second addition, norm and mapping sub-networks are each composed of an addition and norm sub-network and a feed-forward (Feed-forward) neural sub-network.
The input of the self-attention sub-network is the pronunciation feature of each pronunciation unit in the target text; that is, the input of the self-attention sub-network is the output of the content coding sub-model. The output of the self-attention sub-network is the adjusted pronunciation features. Through a self-attention mechanism, the self-attention sub-network strengthens the pronunciation features of the relatively important pronunciation units and weakens those of the relatively unimportant ones, so that the pronunciation features better reflect the characteristics of the target text.
The input of the cross attention sub-network is the superposed audio features of each audio unit in the target audio clip and the adjusted pronunciation features of each pronunciation unit in the target text. The output of the cross-attention subnetwork is the fused feature of each pronunciation unit in the target text.
The cross attention subnetwork is used for taking the superposed audio features of the audio units as keys (keys) and values (values), taking the adjusted pronunciation features of the pronunciation units as queries (queries), searching the keys matched with the queries in the keys for each query through a cross attention mechanism, and fusing the values (namely target superposed audio features) corresponding to the keys with the queries to obtain fused features. Namely, the cross attention subnetwork fuses the adjusted pronunciation characteristics of the pronunciation unit and the target superposition audio characteristics for each pronunciation unit to obtain the fusion characteristics of the pronunciation unit.
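The sketch below illustrates this sub-model: text self-attention followed by multi-head cross attention in which the superimposed audio features serve as keys and values and the adjusted pronunciation features as queries. The head count, feed-forward size and residual/norm arrangement are assumptions, not the disclosure's exact configuration.

```python
# Sketch of the content style cross attention sub-model.
import torch
import torch.nn as nn

class ContentStyleCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, pron_feats, superimposed):
        # Self-attention adjusts the pronunciation features of the target text.
        adj, _ = self.self_attn(pron_feats, pron_feats, pron_feats)
        adj = self.norm1(pron_feats + adj)
        # Cross attention: queries = adjusted pronunciation features,
        # keys/values = superimposed audio features of the audio units.
        fused, _ = self.cross_attn(adj, superimposed, superimposed)
        fused = self.norm2(adj + fused)
        return fused + self.ffn(fused)                    # fusion feature per pronunciation unit

fusion = ContentStyleCrossAttention()(torch.randn(2, 30, 256), torch.randn(2, 24, 256))
```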
As shown in fig. 3d, the sub-model for sound spectrum decoding includes a plurality of conversion sub-networks, a Pre-processing (Pre-net) sub-network, a Post-processing (Post-net) sub-network, and a Vocoder (WaveGlow Vocoder).
The input of the preprocessing sub-network is original Mel-spectra (Mel-spectra) as original sound spectrum features, the output of the preprocessing sub-network is the preprocessed original sound spectrum features, and the preprocessing sub-network is used for preprocessing the original sound spectrum features. The preprocessing network is composed of a plurality of linear sub-networks and a linear rectification function (Relu) sub-network.
In one possible embodiment, the inputs of the conversion sub-network are the preprocessed original sound spectrum features and the fusion feature of each pronunciation unit. The output of the conversion sub-network is the sound spectrum feature converted from the fusion feature. In this embodiment, the conversion sub-network is configured to convert the input fusion feature into a sound spectrum feature with reference to the input original sound spectrum feature.
In another possible embodiment, the inputs of the conversion sub-network are the preprocessed original sound spectrum features, the coarse-grained audio feature, and the fusion feature of each pronunciation unit, where the coarse-grained audio feature is the output of the aforementioned second branch of the style extraction sub-model. The output of the conversion sub-network is the sound spectrum feature converted from the fusion feature. In this embodiment, the conversion sub-network is configured to convert the input fusion feature into a sound spectrum feature with reference to both the input original sound spectrum feature and the coarse-grained audio feature.
It is understood that the fusion feature is obtained by fusing the superimposed audio feature and the pronunciation feature, and the superimposed audio feature is obtained by superimposing the coarse-grained audio feature and the fine-grained audio feature, so the fusion feature can reflect the coarse-grained audio feature to some extent. However, as described above, the fusion feature is obtained from the coarse-grained audio feature through a series of calculations, so it is difficult for the fusion feature to reflect the coarse-grained audio feature accurately; consequently, the sound spectrum feature converted by the conversion sub-network has difficulty accurately reflecting the overall characteristics of the target style.
Therefore, by additionally inputting the coarse-grained audio feature into the conversion sub-network, the conversion sub-network can accurately refer to the overall characteristics of the target style when converting the fusion features into sound spectrum features, so that the converted sound spectrum features accurately reflect the overall characteristics of the target style and the subsequently synthesized audio segment has the target style as a whole.
The input of the post-processing sub-network is the sound spectrum characteristic obtained by conversion, and the output of the post-processing sub-network is the sound spectrum characteristic after post-processing. And the post-processing sub-network is used for post-processing the sound spectrum characteristics obtained by conversion. The post-processing sub-network is composed of a plurality of Convolutional Neural Networks (Convolutional Neural Networks).
The input of the vocoder is the post-processed sound spectrum features; that is, the input of the vocoder is the output of the post-processing sub-network. The output of the vocoder is the synthesized audio segment, which has the target style and whose speech content is the target text. The vocoder is used to convert the input sound spectrum features into an audio segment.
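A simplified sketch of the sound spectrum decoding sub-model is given below: a pre-net over the original spectrogram, a conversion stage that turns the fusion features (and, optionally, the coarse-grained feature) into spectrogram frames, and a convolutional post-net. A generic Transformer decoder layer stands in for the conversion sub-networks, the WaveGlow vocoder is assumed to be an external pretrained model and is omitted, and all dimensions are illustrative.

```python
# Sketch of the sound spectrum decoding sub-model (vocoder step omitted).
import torch
import torch.nn as nn

class SpectrumDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.pre_net = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model), nn.ReLU())
        self.convert = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)
        self.post_net = nn.Sequential(nn.Conv1d(n_mels, n_mels, 5, padding=2), nn.Tanh(),
                                      nn.Conv1d(n_mels, n_mels, 5, padding=2))

    def forward(self, mel_prev, fusion, coarse=None):
        memory = fusion if coarse is None else fusion + coarse   # optionally re-inject the coarse-grained feature
        hidden = self.convert(self.pre_net(mel_prev), memory)    # convert fusion features to hidden spectrogram states
        mel = self.to_mel(hidden)
        return mel + self.post_net(mel.transpose(1, 2)).transpose(1, 2)  # post-processed spectrogram for the vocoder

mel = SpectrumDecoder()(torch.randn(2, 100, 80), torch.randn(2, 30, 256), torch.randn(2, 1, 256))
```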
Corresponding to the style migration synthesis method, the present disclosure also provides a speech synthesis model training method, which is used for training the speech synthesis model used in the style migration synthesis method.
The speech synthesis model training method provided by the present disclosure can be applied to any electronic device with speech synthesis model training capability, including but not limited to servers, personal computers, and the like. Moreover, the speech synthesis model provided by the present disclosure and the style migration synthesis method provided by the present disclosure may be applied to the same device, or may be applied to different devices, and the present disclosure does not set any limit to this.
The speech synthesis model training method provided by the present disclosure can be seen in fig. 4, and includes:
s401, inputting the sample audio clip and the sample text into the original model, wherein the sample text is the voice content of the sample audio clip.
That the sample text is the speech content of the sample audio clip means that the text obtained by performing speech recognition on the sample audio clip is the sample text. For example, if a sample audio clip was recorded while Zhang San spoke "AAAABBBB", then the sample text is "AAAABBBB".
S402, superposing the coarse-grained audio features for representing the sample audio clips and the fine-grained audio features for representing the audio units aiming at each audio unit in the sample audio clips through the original model to obtain superposed audio features of the audio units.
The structure and principle of the original model are exactly the same as those of the aforementioned speech synthesis model; the only difference is that the model parameters of the original model differ from those of the speech synthesis model. Therefore, reference may be made to the foregoing description of the style extraction submodel, which is not repeated here.
And S403, extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model.
The principle of this step is the same as that of the content coding submodel, and reference may be made to the above description about the content coding submodel, which is not described herein again.
S404, fusing the pronunciation characteristics of the pronunciation units and the target superposition audio characteristics aiming at each pronunciation unit in the sample text through the original model to obtain the fusion characteristics of the pronunciation units, wherein the target superposition audio characteristics are the superposition audio characteristics matched with the pronunciation characteristics.
The principle of this step is the same as that of the content style cross-attention submodel, and reference may be made to the foregoing description on the content style cross-attention submodel, which is not described herein again.
S405, converting the fusion features of each pronunciation unit in the sample text into prediction sound spectrum features through the original model.
The principle of this step is the same as that of the aforementioned sub-model for sound spectrum decoding, and reference may be made to the aforementioned description about the sub-model for sound spectrum decoding, which is not described herein again.
S406, adjusting model parameters of the original model according to the difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio clip.
It can be understood that, since the sample text input into the original model is the speech content of the sample audio clip, the audio clip obtained by the original model migrating the voice style of the sample audio clip onto the sample text should be the sample audio clip itself. In other words, if the original model could perform style migration synthesis accurately, the predicted sound spectrum feature would be the same as the real sound spectrum feature of the sample audio clip. The difference between the predicted and real sound spectrum features therefore arises because the original model cannot yet perform style migration synthesis accurately.
Therefore, the difference between the predicted sound spectrum characteristic and the real sound spectrum characteristic can be used for guiding the adjustment of the model parameters of the original model, so that the model parameters of the original model are adjusted towards the direction of reducing the difference, and the speech synthesis model capable of accurately performing style migration synthesis is trained.
And S407, acquiring a new sample audio fragment and a new sample text, returning to execute S401 until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
The newly acquired sample text should be the speech content of the newly acquired sample audio clip, and the newly acquired sample audio clip is different from the previous sample audio clip. The first convergence condition may be set according to the user's actual needs or experience; for example, it may be that the convergence of the model parameters of the original model reaches a preset convergence threshold, or that the number of sample audio clips already used reaches a preset number threshold.
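A hedged sketch of the training procedure S401-S407 follows. The `model` object is assumed to bundle the four sub-models sketched above, the data loader yielding (frame features, phoneme ids, real spectrogram) triples is hypothetical, and the L1 loss and fixed step count are illustrative stand-ins for the spectrogram difference and the first convergence condition.

```python
# Sketch of the training loop of steps S401-S407.
import torch
import torch.nn.functional as F

def train(model, loader, steps=10000, lr=1e-4):
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (frame_feats, phoneme_ids, real_mel) in enumerate(loader):
        superimposed = model.style_extractor(frame_feats)          # S402: superimposed audio features
        pron_feats = model.content_encoder(phoneme_ids)            # S403: pronunciation features
        fusion = model.cross_attention(pron_feats, superimposed)   # S404: fusion features
        pred_mel = model.decoder(real_mel, fusion)                 # S405: predicted sound spectrum features
        loss = F.l1_loss(pred_mel, real_mel)                       # S406: difference to the real spectrogram
        optim.zero_grad()
        loss.backward()
        optim.step()
        if step >= steps:                                          # S407: stand-in for the first convergence condition
            break
    return model
```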
By adopting this embodiment, the original model is trained under the supervision of the real sound spectrum features of the sample audio, and the sample text is the speech content of the sample audio clip, so the superimposed audio features extracted by the original model can be well matched with the pronunciation features of the pronunciation units in the sample text. The trained speech synthesis model can therefore better learn the matching relationship between superimposed audio features and pronunciation features, so that audio clips with the target style can be synthesized during style migration synthesis.
It will be appreciated that, since the sample text is the speech content of the sample audio segment, the number of pronunciation units in the sample text should be similar to or even the same as the number of audio units in the sample audio segment. When the style migration synthesis is performed by using the speech synthesis model, the target text is not the speech content of the target audio segment, and therefore the number of pronunciation units of the target text may be different from the number of audio units in the target audio segment.
To enable the speech synthesis model to perform style migration synthesis accurately even when the numbers of pronunciation units and audio units differ greatly, in a possible embodiment the target superimposed audio feature in the aforementioned S404 is a filtered audio feature matched with the pronunciation feature, where the filtered audio features are a subset of superimposed audio features extracted from all the superimposed audio features.
The extraction mode is random extraction, and the number of the extracted superimposed audio features can be set according to the actual needs or experience of the user, for example, 80% of the superimposed audio features are extracted as the filtered audio features, and for example, 90% of the superimposed audio features are extracted as the filtered audio features.
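The random extraction could be sketched as follows; the 80% ratio is the example value from the text, and the function name is an assumption.

```python
# Sketch: randomly keep a fraction of the superimposed audio features as the
# filtered audio features used as keys/values during training.
import torch

def filter_superimposed(superimposed, keep_ratio=0.8):
    n_units = superimposed.shape[1]
    n_keep = max(1, int(n_units * keep_ratio))
    idx = torch.randperm(n_units)[:n_keep].sort().values   # random subset of audio units, original order kept
    return superimposed[:, idx, :]                          # filtered audio features

filtered = filter_superimposed(torch.randn(2, 24, 256))     # -> (2, 19, 256)
```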
By adopting this embodiment, extracting only part of the superimposed audio features during training forces the cross attention mechanism to fuse the pronunciation features with the selected subset of superimposed audio features. The trained speech synthesis model thus learns how to perform style migration synthesis when the number of pronunciation units differs greatly from the number of audio units, i.e., it can perform style migration synthesis accurately in that case.
In the foregoing S407, the obtained new sample audio segment and the previous sample audio segment may have the same style (i.e., audio segments of the same person), or may have different voice styles (i.e., audio segments of different persons) from the previous sample audio segment.
In a possible embodiment, the obtaining of the new sample segment in the foregoing S407 is implemented by:
and if the second convergence condition is not reached, acquiring a new sample audio clip from the first sample data set, and if the second convergence condition is reached, acquiring a new sample audio clip from the second sample data set.
Wherein the first sample data set comprises audio segments of a first sample person and the second sample data set comprises audio segments of a plurality of sample persons. Also in this embodiment, the sample audio piece at the time of executing S401 for the first time is an audio piece of the first sample person.
The second convergence condition is easier to reach than the first convergence condition; that is, when the second convergence condition is reached, the first convergence condition has not yet been reached, and by the time the first convergence condition is reached, the second convergence condition has already been reached.
With this embodiment, the original model is first trained on the audio of the first sample person, so that it learns how to migrate the voice style of the first sample person onto a given text; it is then trained on the respective audio clips of a plurality of sample persons, so that it learns how to migrate different voice styles onto a given text. Because the original model is pre-trained to migrate the voice style of the first sample person before learning to migrate different voice styles, training can be completed with only a small number of audio clips from different persons, which effectively reduces the difficulty of acquiring sample audio clips.
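The two-stage sampling strategy could be sketched as below; the convergence checks are placeholders, and the function names are assumptions.

```python
# Sketch: draw clips from the single-speaker first sample data set until the
# (easier) second convergence condition is met, then switch to the multi-speaker
# second sample data set until the first convergence condition is met.
import random

def next_sample(first_set, second_set, second_converged):
    # first_set: clips of the first sample person; second_set: clips of many sample persons
    return random.choice(second_set if second_converged else first_set)

def train_two_stage(model, first_set, second_set, second_reached, first_reached, train_step):
    while not first_reached(model):
        clip, text = next_sample(first_set, second_set, second_reached(model))
        train_step(model, clip, text)   # one iteration of S401-S406
    return model
```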
Referring to fig. 5, fig. 5 is a schematic structural diagram of a style migration and synthesis apparatus provided by the present disclosure, including:
a first input module 501, configured to input a target text and a target audio segment with a target speech style into a speech synthesis model obtained through training of a sample text and a sample audio segment in advance;
a style extraction module 502, configured to extract a sub-model according to a style of the speech synthesis model, and superimpose, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
a content coding module 503, configured to extract pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the speech synthesis model;
a content style cross attention module 504, configured to fuse, by using a content style cross attention submodel of the speech synthesis model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the target text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and a sound spectrum decoding module 505, configured to synthesize, according to the fusion feature of each pronunciation unit in the target text, an audio segment having the target speech style and speech content as the target text by using a sound spectrum decoding sub-model of the speech synthesis model.
By adopting this embodiment, the superimposed audio features are obtained by superimposing audio features of different granularities, so they reflect both the overall characteristics of the target style and the detail characteristics of the target audio segment. A cross attention mechanism then fuses the pronunciation feature of each pronunciation unit with its matched superimposed audio feature; the resulting fusion features reflect, on the one hand, the speech content of the target text and, on the other hand, the overall and detail characteristics of the target style. Because each pronunciation feature is fused with its matched superimposed audio feature, the detail characteristics reflected by the fusion feature of each pronunciation unit are those of that pronunciation unit as it would be pronounced in the target style. Therefore, in the audio clip synthesized from the fusion features, not only are the overall acoustic features similar to the target style, but the pronunciation of each pronunciation unit is also similar to the target style; that is, an audio clip having the target style both as a whole and in its details can be synthesized.
In a possible embodiment, the style extraction module 502, through a style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, superimposes a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit, including:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and adding the fine-grained audio features and the coarse-grained audio features of the audio units to each audio unit in the target audio segment through the style extraction module to obtain the superimposed audio features of the audio units.
In a possible embodiment, the content style cross attention module 504, through the content style cross attention submodel of the speech synthesis model, for each pronunciation unit in the target text, fuses the pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain a fused feature of the pronunciation unit, including:
inputting the pronunciation feature of each pronunciation unit in the target text into the self-attention sub-network of the content style cross attention sub-model in the speech synthesis model to obtain the adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross attention sub-network of the content style cross attention sub-model and for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit with the target superimposed audio feature to obtain the fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
In a possible embodiment, the voice spectrum decoding module 505 synthesizes, by using a voice spectrum decoding submodel of the speech synthesis model, an audio segment having the target speech style and speech content of the target text according to the fusion feature of each pronunciation unit in the target text, including:
inputting the fusion feature of each pronunciation unit in the target text and the coarse-grained audio feature into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum characteristics into an audio fragment which has the target voice style and the voice content of which is the target text.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a training apparatus for a speech synthesis model provided by the present disclosure, which may include:
a second input module 601, configured to input a sample audio segment and a sample text into an original model, where the sample text is a speech content of the sample audio segment;
a first original module 602, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
a second primitive module 603, configured to extract, through the primitive model, pronunciation features of each pronunciation unit in the sample text;
a third original module 604, configured to fuse, by using the original model, a pronunciation feature of each pronunciation unit in the sample text and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature that matches the pronunciation feature;
a fourth original module 605, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module 606, configured to adjust a model parameter of the original model according to a difference between the predicted audio spectrum feature and a true audio spectrum feature of the sample audio segment;
the obtaining module 607 is configured to obtain a new sample audio segment and a new sample text, return to the step of inputting the sample audio segment and the sample text into the original model until a first convergence condition is reached, and use the adjusted original model as a speech synthesis model.
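The training flow implemented by modules 601-607 can be sketched as a standard supervised loop. The DummyOriginalModel stand-in, the L1 loss, the Adam optimizer and the fixed step count representing the first convergence condition are all assumptions made only to keep the sketch self-contained and runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyOriginalModel(nn.Module):
    """Stand-in for the original model (style extraction, content encoding,
    cross-attention fusion and spectrum decoding); a single projection keeps
    the sketch runnable while only mimicking input/output shapes."""

    def __init__(self, text_dim: int = 32, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(text_dim, n_mels)

    def forward(self, sample_audio: torch.Tensor, encoded_text: torch.Tensor) -> torch.Tensor:
        return self.proj(encoded_text)                     # predicted sound spectrum features

model = DummyOriginalModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                                    # stand-in for the first convergence condition
    sample_audio = torch.randn(1, 16000)                   # hypothetical sample audio clip
    encoded_text = torch.randn(1, 120, 32)                 # hypothetical encoded sample text
    real_mel = torch.randn(1, 120, 80)                     # real sound spectrum features of the clip
    pred_mel = model(sample_audio, encoded_text)
    loss = F.l1_loss(pred_mel, real_mel)                   # difference between predicted and real spectra
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # adjust the model parameters
```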
In a possible embodiment, the apparatus further includes:
a superimposed feature extraction module, configured to extract some of the superimposed audio features from all the superimposed audio features as screened audio features;
wherein the target superimposed audio features are the screened audio features matched with the pronunciation features.
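One way to realize this screening is sketched below under the assumption that a random subset of the superimposed audio features is kept; the disclosure does not fix the selection rule, and the function name and the keep parameter are illustrative.

```python
import torch

def screen_superimposed_features(superimposed: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only part of the superimposed audio features; the retained subset
    is what the pronunciation features are later matched against."""
    idx = torch.randperm(superimposed.size(0))[:keep]
    return superimposed[idx]

# Illustrative usage: keep 2 of 3 superimposed audio features.
print(screen_superimposed_features(torch.randn(3, 256), keep=2).shape)  # torch.Size([2, 256])
```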
In one possible embodiment, the sample audio piece is initially an audio piece of a first sample person;
the obtaining module 607 obtains a new sample audio piece, including:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
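The two-stage sampling rule can be sketched as follows; treating the second convergence condition as a simple step threshold is an assumption, as are the function and data-set names.

```python
import random

def next_training_sample(step: int, first_sample_set: list, second_sample_set: list,
                         second_convergence_step: int = 5000):
    """Before the second convergence condition is reached, draw samples only from
    the first sample person's data set; afterwards, draw (audio, text) pairs from
    the multi-speaker data set."""
    if step < second_convergence_step:
        return random.choice(first_sample_set)             # first sample data set (single speaker)
    return random.choice(second_sample_set)                # second sample data set (many speakers)

# Illustrative usage with placeholder samples.
single = [("clip_a.wav", "text a"), ("clip_b.wav", "text b")]
multi = [("clip_c.wav", "text c"), ("clip_d.wav", "text d")]
print(next_training_sample(step=100, first_sample_set=single, second_sample_set=multi))
```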
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
It should be noted that the sample audio clips in this embodiment are derived from public data sets, such as LJSpeech and VCTK.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the style migration synthesis method or the training method of the speech synthesis model. For example, in some embodiments, the style migration synthesis method or the training method of the speech synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the style migration synthesis method or the training method of the speech synthesis model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (for example, by means of firmware) to perform the style migration synthesis method or the training method of the speech synthesis model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A style migration synthesis method, comprising:
inputting a target text and a target audio clip having a target speech style into a speech synthesis model obtained in advance by training with a sample text and a sample audio clip;
superimposing, through the style extraction sub-model of the speech synthesis model, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model;
fusing, through the content style cross-attention sub-model of the speech synthesis model, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and synthesizing, through the sound spectrum decoding sub-model of the speech synthesis model and according to the fused feature of each pronunciation unit in the target text, an audio segment that has the target speech style and whose speech content is the target text.
2. The method of claim 1, wherein the obtaining, by the style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, a superimposed audio feature of the audio unit by superimposing a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit comprises:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and, for each audio unit in the target audio segment, adding the fine-grained audio feature of the audio unit to the coarse-grained audio feature through the style extraction module to obtain the superimposed audio feature of the audio unit.
3. The method according to claim 1, wherein the fusing the pronunciation features of the pronunciation units and the target superimposed audio features for each pronunciation unit in the target text by the content style cross attention submodel of the speech synthesis model to obtain the fused features of the pronunciation units comprises:
inputting the pronunciation features of each pronunciation unit in the target text into the self-attention sub-network of the content style cross-attention sub-model in the speech synthesis model to obtain adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross-attention sub-network of the content style cross-attention sub-model, for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain the fused feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
4. The method according to claim 1, wherein the synthesizing, by the sound spectrum decoding sub-model of the speech synthesis model and according to the fusion feature of each pronunciation unit in the target text, of the audio segment that has the target speech style and whose speech content is the target text comprises:
inputting the fusion features of each pronunciation unit in the target text, together with the coarse-grained audio features, into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum features into an audio segment that has the target speech style and whose speech content is the target text.
5. A method of training a speech synthesis model, comprising:
inputting a sample audio clip and a sample text into an original model, wherein the sample text is the voice content of the sample audio clip;
superposing, by the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit to obtain a superposed audio feature of the audio unit;
extracting pronunciation characteristics of each pronunciation unit in the sample text through the original model;
fusing, through the original model, for each pronunciation unit in the sample text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
converting, through the original model, the fusion features of each pronunciation unit in the sample text into predicted sound spectrum features;
adjusting model parameters of the original model according to a difference between the predicted sound spectrum features and the real sound spectrum features of the sample audio clip;
and acquiring a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
6. The method of claim 5, further comprising:
extracting some of the superimposed audio features from all the superimposed audio features as screened audio features;
the target superimposed audio features are screened audio features matched with the pronunciation features.
7. The method of claim 6, wherein the sample audio clip is initially an audio clip of a first sample person;
the obtaining of the new sample audio piece comprises:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
8. A style migration synthesis apparatus comprising:
the first input module is used for inputting the target text and the target audio clip with the target voice style into a voice synthesis model which is obtained by training a sample text and a sample audio clip in advance;
the style extraction module is used for superimposing, through the style extraction sub-model of the speech synthesis model, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
the content coding module is used for extracting the pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the voice synthesis model;
the content style cross attention module is used for fusing, through a content style cross-attention sub-model of the speech synthesis model, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and the sound spectrum decoding module is used for synthesizing an audio segment which has the target voice style and has the voice content of the target text according to the fusion characteristics of each pronunciation unit in the target text through a sound spectrum decoding sub-model of the voice synthesis model.
9. The apparatus of claim 8, wherein the style extraction module, via a style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, superimposes a coarse-grained audio feature for characterizing the target audio segment and a fine-grained audio feature for characterizing the audio unit to obtain a superimposed audio feature of the audio unit, includes:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and, for each audio unit in the target audio segment, adding the fine-grained audio feature of the audio unit to the coarse-grained audio feature through the style extraction module to obtain the superimposed audio feature of the audio unit.
10. The apparatus of claim 8, wherein the content style cross attention module fuses, for each pronunciation unit in a target text, a pronunciation feature of the pronunciation unit and a target overlay audio feature to obtain a fused feature of the pronunciation unit through a content style cross attention submodel of the speech synthesis model, comprising:
inputting the pronunciation features of each pronunciation unit in the target text into the self-attention sub-network of the content style cross-attention sub-model in the speech synthesis model to obtain adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross-attention sub-network of the content style cross-attention sub-model, for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain the fused feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
11. The apparatus according to claim 8, wherein the sound spectrum decoding module synthesizes, by a sound spectrum decoding sub-model of the speech synthesis model and according to the fusion feature of each pronunciation unit in the target text, an audio segment that has the target speech style and whose speech content is the target text, including:
inputting the fusion features of each pronunciation unit in the target text, together with the coarse-grained audio features, into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum features into an audio segment that has the target speech style and whose speech content is the target text.
12. An apparatus for training a speech synthesis model, comprising:
the second input module is used for inputting a sample audio clip and a sample text into the original model, wherein the sample text is the voice content of the sample audio clip;
a first original module, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
the second original module is used for extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model;
a third original module, configured to fuse, by using the original model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the sample text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
a fourth original module, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module, configured to adjust the model parameters of the original model according to a difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio segment;
and the obtaining module is used for obtaining a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
13. The apparatus of claim 12, further comprising:
a superimposed feature extraction module, configured to extract some of the superimposed audio features from all the superimposed audio features as screened audio features;
the target superimposed audio features are screened audio features matched with the pronunciation features.
14. The apparatus of claim 12, wherein the sample audio clip is initially an audio clip of a first sample person;
the obtaining module obtains a new sample audio clip, including:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4 or 5-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-4 or 5-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4 or 5-7.
CN202111491886.0A 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment Pending CN114187892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491886.0A CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111491886.0A CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114187892A (en) 2022-03-15

Family

ID=80603825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111491886.0A Pending CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114187892A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112927674A (en) * 2021-01-20 2021-06-08 北京有竹居网络技术有限公司 Voice style migration method and device, readable medium and electronic equipment
CN113450758A (en) * 2021-08-27 2021-09-28 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANG LI et al.: "Towards Multi-Scale Style Control For Expressive Speech Synthesis", INTERSPEECH 2021, 3 September 2021 (2021-09-03), pages 4673-4676 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012505A (en) * 2022-12-29 2023-04-25 上海师范大学天华学院 Pronunciation animation generation method and system based on key point self-detection and style migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination