CN114187892A - Style migration synthesis method and device and electronic equipment - Google Patents

Style migration synthesis method and device and electronic equipment

Info

Publication number
CN114187892A
CN114187892A
Authority
CN
China
Prior art keywords
audio
target
pronunciation
sample
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111491886.0A
Other languages
Chinese (zh)
Inventor
赵情恩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111491886.0A priority Critical patent/CN114187892A/en
Publication of CN114187892A publication Critical patent/CN114187892A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 19/04 Speech or audio signal analysis-synthesis techniques for redundancy reduction, e.g. in vocoders, using predictive techniques
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The disclosure provides a style migration synthesis method and apparatus and an electronic device, relating to the field of artificial intelligence and, in particular, to deep learning, speech synthesis and style migration technologies. The specific implementation scheme is as follows: a target text and a target audio clip are input into a speech synthesis model trained in advance on sample texts and sample audio clips; for each audio unit in the target audio clip, a coarse-grained audio feature and a fine-grained audio feature are superimposed to obtain the superimposed audio feature of the audio unit; the pronunciation feature of each pronunciation unit in the target text is extracted; for each pronunciation unit in the target text, its pronunciation feature is fused with the matched target superimposed audio feature to obtain the fusion feature of the pronunciation unit; and the audio clip is synthesized from the fusion features. An audio clip can thus be synthesized that has the target style both as a whole and in its details.

Description

Style migration synthesis method and device and electronic equipment
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning, speech synthesis and style migration technologies, and more particularly to a speech style migration and synthesis method, apparatus and electronic device.
Background
Various practical requirements, such as the voice-changing function provided in voice chat software or hiding the true identity of a speaker, make it necessary to synthesize, from a given audio clip and a given text, an audio clip that has the same voice style as the given audio clip and whose speech content is the text. Because this can be regarded as migrating the voice style of the audio clip onto the text, the process is called style migration synthesis.
Disclosure of Invention
The disclosure provides a style migration and synthesis method and device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a style migration and synthesis method, including:
inputting a target text and a target audio clip with a target voice style into a voice synthesis model obtained by training a sample text and a sample audio clip in advance;
through the style extraction sub-model of the speech synthesis model, superimposing, for each audio unit in the target audio clip, a coarse-grained audio feature used for characterizing the target audio clip and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model;
through the content style cross attention sub-model of the speech synthesis model, fusing, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit with a target superimposed audio feature to obtain a fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the pronunciation feature;
and synthesizing an audio segment which has the target voice style and has voice content as the target text according to the fusion characteristics of each pronunciation unit in the target text through the sound spectrum decoding submodel of the voice synthesis model.
According to a second aspect of the present disclosure, there is provided a training method of a speech synthesis model, including:
inputting a sample audio clip and a sample text into an original model, wherein the sample text is the voice content of the sample audio clip;
superposing, by the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit to obtain a superposed audio feature of the audio unit;
extracting pronunciation characteristics of each pronunciation unit in the sample text through the original model;
fusing the pronunciation characteristics of the pronunciation units and target superposition audio characteristics aiming at each pronunciation unit in the sample text through the original model to obtain the fusion characteristics of the pronunciation units, wherein the target superposition audio characteristics are superposition audio characteristics matched with the pronunciation characteristics;
converting the fusion features of each pronunciation unit in the sample text into prediction sound spectrum features through the original model;
adjusting model parameters of the original model according to the difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio clip;
and acquiring a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
According to a third aspect of the present disclosure, there is provided a style migration and synthesis apparatus comprising:
the first input module is used for inputting the target text and the target audio clip with the target voice style into a voice synthesis model which is obtained by training a sample text and a sample audio clip in advance;
the style extraction module is used for superimposing, through the style extraction sub-model of the speech synthesis model and for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
the content coding module is used for extracting the pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the voice synthesis model;
the content style cross attention module is used for fusing pronunciation characteristics of the pronunciation units and target superposed audio characteristics aiming at each pronunciation unit in the target text through a content style cross attention submodel of the voice synthesis model to obtain the fused characteristics of the pronunciation units, wherein the target superposed audio characteristics are superposed audio characteristics matched with the pronunciation characteristics;
and the sound spectrum decoding module is used for synthesizing an audio segment which has the target voice style and has the voice content of the target text according to the fusion characteristics of each pronunciation unit in the target text through a sound spectrum decoding sub-model of the voice synthesis model.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for a speech synthesis model, comprising:
the second input module is used for inputting a sample audio clip and a sample text into the original model, wherein the sample text is the voice content of the sample audio clip;
a first original module, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
the second original module is used for extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model;
a third original module, configured to fuse, by using the original model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the sample text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
a fourth original module, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module, configured to adjust a model parameter of the original model according to a difference between the predicted audio spectrum feature and a true audio spectrum feature of the sample audio segment;
and the obtaining module is used for obtaining a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the first or second aspects.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to any one of the above first or second aspects.
According to a seventh aspect provided by the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of the first or second aspects above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic flow diagram of a style migration synthesis method provided in accordance with the present disclosure;
FIG. 2 is a schematic diagram of a structure of a speech synthesis model used in a style migration synthesis method provided in accordance with the present disclosure;
FIG. 3a is a schematic structural diagram of the style extraction submodel in a speech synthesis model used in a style migration synthesis method provided in accordance with the present disclosure;
FIG. 3b is a schematic structural diagram of a content coding sub-model in a speech synthesis model used in a style migration synthesis method provided according to the present disclosure;
FIG. 3c is a schematic structural diagram of a content style cross attention submodel in a speech synthesis model used in a style migration synthesis method provided according to the present disclosure;
FIG. 3d is a schematic structural diagram of a sub-model for audio spectrum decoding in a speech synthesis model used in the style migration synthesis method provided in accordance with the present disclosure;
FIG. 4 is a schematic flow chart diagram of a method of training a speech synthesis model provided in accordance with the present disclosure;
FIG. 5 is a schematic diagram of one configuration of a style migration and synthesis apparatus provided in accordance with the present disclosure;
FIG. 6 is a schematic diagram of an architecture of a training apparatus for a speech synthesis model provided according to the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing a style migration synthesis method or a training method of a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In order to more clearly illustrate the style migration and synthesis method provided by the present disclosure, an exemplary application scenario of the method is described below. It should be understood that the following example is only one possible application scenario; in other possible embodiments, the style migration and synthesis method provided by the present disclosure may also be applied to other application scenarios, and the present disclosure is not limited in this respect.
For the purpose of hiding the true identity of a target person, such as a player of an online game or an interviewee in a news interview, the words spoken by the target person can be converted into a text, and another voice style (hereinafter referred to as the target voice style) different from that of the target person can then be migrated onto the text through style migration synthesis.
In the related art, style migration synthesis is often implemented as follows: a target audio clip with the target speech style is encoded by an encoding network to obtain style features, and the text converted from the words spoken by the target person is encoded to obtain content features; the style features and the content features are input into a decoding network trained in advance to obtain the sound spectrum features output by the decoding network; and a vocoder then converts the sound spectrum features into an audio clip, yielding an audio clip that has the target speech style and whose speech content is the text.
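For concreteness, a minimal sketch of this related-art pipeline is given below; the module names, network types and dimensions are illustrative assumptions and are not the networks used in this disclosure.

```python
# Minimal sketch of the related-art pipeline described above (assumed module
# names and shapes, for illustration only).
import torch
import torch.nn as nn

class BaselineStyleTransfer(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.style_encoder = nn.GRU(n_mels, d_model, batch_first=True)   # encodes the target audio into one style vector
        self.content_encoder = nn.Embedding(n_phonemes, d_model)         # encodes the text (phoneme ids) into content features
        self.decoder = nn.Linear(2 * d_model, n_mels)                    # maps style + content to spectrogram frames

    def forward(self, style_mels, phoneme_ids):
        _, style = self.style_encoder(style_mels)                        # (1, B, d_model): a single utterance-level style feature
        content = self.content_encoder(phoneme_ids)                      # (B, T_text, d_model)
        style = style[-1].unsqueeze(1).expand(-1, content.size(1), -1)
        return self.decoder(torch.cat([content, style], dim=-1))         # predicted spectrogram, later sent to a vocoder

out = BaselineStyleTransfer()(torch.randn(1, 200, 80), torch.randint(0, 100, (1, 30)))
```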
However, this solution only makes the synthesized audio clip sound similar to the target audio clip in its overall acoustic features, while details such as speech speed, emotion, pitch, cadence, pauses and accent can differ greatly from the target audio clip. In other words, the synthesized audio clip does not have the target style in its details.
Based on this, the present disclosure provides a style migration and synthesis method, which may be applied to any electronic device with a style migration and synthesis function, including but not limited to a mobile phone, a tablet computer, a personal computer, a server, and the like, and the style migration and synthesis method provided by the present disclosure may be as shown in fig. 1, including:
s101, inputting a target text and a target audio clip with a target voice style into a voice synthesis model obtained by training a sample text and a sample audio clip in advance.
S102, through the style extraction sub-model of the speech synthesis model, superimposing, for each audio unit in the target audio segment, the coarse-grained audio feature used for characterizing the target audio segment and the fine-grained audio feature used for characterizing the audio unit, to obtain the superimposed audio feature of the audio unit.
S103, extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model.
And S104, through the content style cross attention sub-model of the speech synthesis model, fusing, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit with the target superimposed audio feature to obtain the fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the pronunciation feature.
And S105, through the sound spectrum decoding sub-model of the speech synthesis model, synthesizing, from the fusion features of the pronunciation units in the target text, an audio segment that has the target voice style and whose speech content is the target text.
By adopting this embodiment, the superimposed audio features are obtained by superimposing audio features of different granularities, so they reflect both the overall characteristics of the target style and the detail characteristics of the target audio segment. A cross attention mechanism then fuses the pronunciation feature of each pronunciation unit with its matched superimposed audio feature; the resulting fusion features reflect, on the one hand, the speech content of the target text and, on the other hand, the overall and detail characteristics of the target style. Because each pronunciation feature is fused with its matched superimposed audio feature, the detail characteristics reflected by the fusion feature of each pronunciation unit are those of that pronunciation unit as it would be pronounced in the target style. Therefore, in the audio clip synthesized from the fusion features, not only are the overall acoustic features similar to the target style, but the pronunciation of each pronunciation unit is also similar to the target style; that is, an audio clip having the target style both as a whole and in its details can be synthesized.
In order to more clearly describe the foregoing steps of S101-S105, the following will first describe the speech synthesis model provided by the present disclosure, and referring to fig. 2, fig. 2 is a schematic structural diagram of the speech synthesis model provided by the present disclosure, which includes:
a style extraction submodel, a content coding submodel, a content style cross attention submodel and a sound spectrum decoding submodel.
The input of the style extraction submodel is the target audio clip, and its output is the superimposed audio feature of each audio unit in the target audio clip. Each audio unit is composed of M consecutive audio frames in the audio clip, and each audio frame belongs to only one audio unit. M may be set according to the user's actual experience or requirements, such as M = 2, 4, 5 or 9, which is not limited in this disclosure. Each audio frame is N ms of consecutive audio data in the audio clip, and the interval between two adjacent audio frames is Q ms, where Q is not greater than N; for example, N = 25 and Q = 10, N = 20 and Q = 20, or N = 28 and Q = 21, which is likewise not limited in this disclosure.
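The following sketch illustrates how an audio clip could be split into frames and audio units under this definition, using the example values N = 25 ms, Q = 10 ms and M = 4; the sampling rate and the function name are assumptions for illustration only.

```python
# Illustrative framing of an audio clip into frames and audio units.
import numpy as np

def split_into_units(samples, sr=16000, frame_ms=25, hop_ms=10, frames_per_unit=4):
    frame_len = int(sr * frame_ms / 1000)   # N ms of audio data per frame
    hop_len = int(sr * hop_ms / 1000)       # Q ms between adjacent frames
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, hop_len)]
    # Each audio unit is M consecutive frames; each frame belongs to exactly one unit.
    units = [frames[i:i + frames_per_unit]
             for i in range(0, len(frames), frames_per_unit)]
    return frames, units

frames, units = split_into_units(np.zeros(16000))   # 1 s of audio -> 98 frames, 25 units
```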
The content coding sub-model inputs target texts and outputs pronunciation characteristics of all pronunciation units in the target texts. The content encoding submodel is used to implement the aforementioned step of S103.
Each pronunciation unit in the target text is composed of k consecutive phonemes in the pronunciation of the target text, where k may be set according to the user's actual needs or experience. For example, when k is 1, each pronunciation unit is one phoneme in the pronunciation of the target text; taking the target text "百" (bǎi), read with its Chinese pronunciation, as an example, the pronunciation units include the pronunciation unit "b" and the pronunciation unit "ai".
The input of the content style cross attention submodel is the superimposed audio features of the audio units in the target audio segment and the pronunciation features of the pronunciation units in the target text; that is, the input of the content style cross attention submodel is the output of the style extraction submodel and the output of the content coding submodel. The output of the content style cross attention submodel is the fusion feature of each pronunciation unit in the target text. The content style cross attention submodel is used to implement the aforementioned step S104.
The input of the sound spectrum decoding submodel is the fusion feature of each pronunciation unit in the target text; that is, the input of the sound spectrum decoding submodel is the output of the content style cross attention submodel. The output of the sound spectrum decoding submodel is the synthesized audio clip. The sound spectrum decoding submodel is used to implement the aforementioned step S105.
The implementation of the foregoing S102-S105 will be described below with reference to fig. 3 a-3 d, which are schematic structural diagrams of each sub-model in the speech synthesis model, respectively, in conjunction with the structure of each sub-model in the speech synthesis model:
as shown in fig. 3a, the style extraction submodel includes a wave pair feature vector (Wav2Vec) sub-network, a first branch formed by a Linear (Linear) sub-network, a Long Short-Term Memory (LSTM) sub-network, a pooling sub-network, and a second branch formed by the Long Short-Term Memory sub-network and the pooling sub-network.
The wave pair feature vector sub-network inputs the target audio clip and outputs the audio features of each audio frame in the target audio clip. The number of target audio pieces input to the wave pair feature vector sub-network may be one or more, and the present disclosure does not limit this. Illustratively, in one possible embodiment, the number of the target audio segments is two, wherein one of the target audio segments is used for reflecting the overall characteristics of the target style, and the other target audio segment is used for reflecting the detailed characteristics of the target style, and in another possible embodiment, the number of the target audio segments is four, wherein one of the target audio segments is used for reflecting the overall characteristics of the target style, and the remaining three target audio segments are used for reflecting the detailed characteristics of the target style.
The inputs of the first branch and the second branch are the audio features of each audio frame in the target audio segment, that is, the inputs of the first branch and the second branch are the outputs of the wave pair feature vector sub-network. If the number of the target audio segments is one, the inputs of the first branch and the second branch are the same and are the feature vectors of each audio frame in the target audio segment. If the number of the target audio segments is multiple, the input of the first branch is different from the input of the second branch, the input of the first branch is the audio feature of each audio frame in the target audio segment for reflecting the detailed feature of the target style, and the input of the second branch is the audio feature of each audio frame in the target audio segment for reflecting the overall feature of the target style.
Illustratively, assuming that the target audio segments include a first target audio segment and a second target audio segment, wherein the first target audio segment is used for reflecting the detail features of the target style, and the second target audio segment reflects the overall features of the target style, the input of the first branch is the audio features of the audio frames in the first target audio segment, and the input of the second branch is the audio features of the audio frames in the second target audio segment.
The output of the first branch is the fine-grained audio feature of each audio unit in the target audio segment. The first branch averages the audio features of all audio frames contained in each audio unit to obtain the fine-grained audio feature used for characterizing that audio unit. Because a fine-grained audio feature obtained in this way draws only on the audio features of the frames within its audio unit, it preserves more of the detail characteristics of the target style.
The output of the second branch is the coarse-grained audio feature of the target audio segment. The second branch averages the audio features of all audio frames in the target audio segment to obtain the coarse-grained audio feature used for characterizing the target audio segment. Because the coarse-grained audio feature is obtained by averaging the audio features of all frames, it reflects more of the overall characteristics of the target style.
And the style extraction submodel adds each fine-grained audio feature output by the first branch and the coarse-grained audio feature output by the second branch to obtain a superposed audio feature and outputs the superposed audio feature to the content style cross attention submodel.
Illustratively, assume there are three audio units in total, denoted audio units 1-3, where the fine-grained audio feature of audio unit 1 is x1, that of audio unit 2 is x2, and that of audio unit 3 is x3, and assume the coarse-grained audio feature is x4. Then the superimposed audio feature of audio unit 1 is x1+x4, the superimposed audio feature of audio unit 2 is x2+x4, and the superimposed audio feature of audio unit 3 is x3+x4. Here x1+x4 means adding the values of x1 and x4 in each feature dimension; for example, if there are three feature dimensions in total and x1 = (a1, b1, c1), x4 = (a4, b4, c4), then the superimposed audio feature is (a1+a4, b1+b4, c1+c4). The same holds for x2+x4 and x3+x4, which is not repeated here.
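A simplified sketch of the style extraction sub-model is given below. The frame-level features are assumed to be precomputed (for example by a Wav2Vec-style encoder), the Linear/LSTM layers of FIG. 3a are reduced to a minimum, and all dimensions are illustrative assumptions rather than the disclosure's actual configuration.

```python
# Sketch: fine-grained pooling per audio unit (first branch), coarse-grained
# pooling over the whole clip (second branch), then addition of the two.
import torch
import torch.nn as nn

class StyleExtractor(nn.Module):
    def __init__(self, d_feat=512, d_model=256, frames_per_unit=4):
        super().__init__()
        self.M = frames_per_unit
        self.fine_proj = nn.Linear(d_feat, d_model)
        self.fine_lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.coarse_lstm = nn.LSTM(d_feat, d_model, batch_first=True)

    def forward(self, frame_feats):                      # (B, T_frames, d_feat) frame-level features
        B, T, _ = frame_feats.shape
        # First branch: fine-grained feature per audio unit (mean over its M frames).
        fine, _ = self.fine_lstm(self.fine_proj(frame_feats))
        fine = fine[:, :(T // self.M) * self.M].reshape(B, T // self.M, self.M, -1).mean(dim=2)
        # Second branch: one coarse-grained feature per clip (mean over all frames).
        coarse, _ = self.coarse_lstm(frame_feats)
        coarse = coarse.mean(dim=1, keepdim=True)        # (B, 1, d_model)
        return fine + coarse                             # superimposed audio features, broadcast over units

superimposed = StyleExtractor()(torch.randn(2, 96, 512))   # -> (2, 24, 256)
```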
As shown in fig. 3b, the content coding submodel is composed of a Phoneme embedding (phone embedding) subnetwork, a Layer Norm (Layer Norm) subnetwork, and a Position embedding (Position embedding) subnetwork.
The input of the phoneme embedding sub-network is the target text, and its output is the feature vector of each pronunciation unit. The phoneme embedding sub-network is used to encode the pronunciations of the pronunciation units in the target text into feature vectors.
The input of the layer norm sub-network is the feature vector of each pronunciation unit in the target text; that is, the input of the layer norm sub-network is the output of the phoneme embedding sub-network. The layer norm sub-network is used to further extract pronunciation features from the feature vectors.
The input of the position embedding sub network is the target text, and the output is the position code of each pronunciation unit in the target text. And the position embedding sub-network is used for coding the position of each pronunciation unit in the target text to obtain the position code of each pronunciation unit in the target text.
The content coding sub-model combines the pronunciation feature and the position code of each pronunciation unit in the target text, so that the position code expresses the position, within the target text, of the pronunciation unit to which each pronunciation feature belongs, and outputs the combined pronunciation features to the content style cross attention sub-model.
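A minimal sketch of the content coding sub-model follows; the phoneme vocabulary size, feature dimension and maximum length are assumed values, and a learned position embedding stands in for the position embedding sub-network.

```python
# Sketch: phoneme embedding + layer norm + position embedding, combined by addition.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_phonemes=100, d_model=256, max_len=512):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.layer_norm = nn.LayerNorm(d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, phoneme_ids):                       # (B, T_text) phoneme indices
        feats = self.layer_norm(self.phoneme_emb(phoneme_ids))
        positions = torch.arange(phoneme_ids.size(1), device=phoneme_ids.device)
        return feats + self.pos_emb(positions)            # pronunciation features carrying position information

pron_feats = ContentEncoder()(torch.randint(0, 100, (2, 30)))   # -> (2, 30, 256)
```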
As shown in fig. 3c, the content style cross attention sub-model includes a text self-attention (Text Self-attention) sub-network (hereinafter simply referred to as the self-attention sub-network), a first addition, norm and mapping (Add & Norm & Projection) sub-network, a multi-head cross-attention (Multi-head cross-attention) sub-network (hereinafter simply referred to as the cross-attention sub-network), and a second addition, norm and mapping sub-network. The first and second addition, norm and mapping sub-networks are each composed of an addition and norm sub-network and a feed-forward (Feed-forward) neural sub-network.
The input of the self-attention sub-network is the pronunciation feature of each pronunciation unit in the target text; that is, the input of the self-attention sub-network is the output of the content coding sub-model. The output of the self-attention sub-network is the adjusted pronunciation features. Through a self-attention mechanism, the self-attention sub-network strengthens the pronunciation features of the relatively important pronunciation units and weakens those of the relatively unimportant ones, so that the pronunciation features better reflect the characteristics of the target text.
The input of the cross attention sub-network is the superposed audio features of each audio unit in the target audio clip and the adjusted pronunciation features of each pronunciation unit in the target text. The output of the cross-attention subnetwork is the fused feature of each pronunciation unit in the target text.
The cross attention subnetwork is used for taking the superposed audio features of the audio units as keys (keys) and values (values), taking the adjusted pronunciation features of the pronunciation units as queries (queries), searching the keys matched with the queries in the keys for each query through a cross attention mechanism, and fusing the values (namely target superposed audio features) corresponding to the keys with the queries to obtain fused features. Namely, the cross attention subnetwork fuses the adjusted pronunciation characteristics of the pronunciation unit and the target superposition audio characteristics for each pronunciation unit to obtain the fusion characteristics of the pronunciation unit.
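The sketch below illustrates this sub-model: text self-attention followed by multi-head cross attention in which the superimposed audio features serve as keys and values and the adjusted pronunciation features as queries. The head count, feed-forward size and residual/norm arrangement are assumptions, not the disclosure's exact configuration.

```python
# Sketch of the content style cross attention sub-model.
import torch
import torch.nn as nn

class ContentStyleCrossAttention(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, pron_feats, superimposed):
        # Self-attention adjusts the pronunciation features of the target text.
        adj, _ = self.self_attn(pron_feats, pron_feats, pron_feats)
        adj = self.norm1(pron_feats + adj)
        # Cross attention: queries = adjusted pronunciation features,
        # keys/values = superimposed audio features of the audio units.
        fused, _ = self.cross_attn(adj, superimposed, superimposed)
        fused = self.norm2(adj + fused)
        return fused + self.ffn(fused)                    # fusion feature per pronunciation unit

fusion = ContentStyleCrossAttention()(torch.randn(2, 30, 256), torch.randn(2, 24, 256))
```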
As shown in fig. 3d, the sub-model for sound spectrum decoding includes a plurality of conversion sub-networks, a Pre-processing (Pre-net) sub-network, a Post-processing (Post-net) sub-network, and a Vocoder (WaveGlow Vocoder).
The input of the preprocessing sub-network is original Mel-spectra (Mel-spectra) as original sound spectrum features, the output of the preprocessing sub-network is the preprocessed original sound spectrum features, and the preprocessing sub-network is used for preprocessing the original sound spectrum features. The preprocessing network is composed of a plurality of linear sub-networks and a linear rectification function (Relu) sub-network.
In one possible embodiment, the inputs of the conversion sub-network are the preprocessed original sound spectrum features and the fusion feature of each pronunciation unit. The output of the conversion sub-network is the sound spectrum feature converted from the fusion feature. In this embodiment, the conversion sub-network is configured to convert the input fusion feature into a sound spectrum feature with reference to the input original sound spectrum feature.
In another possible embodiment, the inputs of the conversion sub-network are the preprocessed original sound spectrum features, the coarse-grained audio feature, and the fusion feature of each pronunciation unit, where the coarse-grained audio feature is the output of the aforementioned second branch of the style extraction sub-model. The output of the conversion sub-network is the sound spectrum feature converted from the fusion feature. In this embodiment, the conversion sub-network is configured to convert the input fusion feature into a sound spectrum feature with reference to both the input original sound spectrum feature and the coarse-grained audio feature.
It is understood that the fusion feature is obtained by fusing the superimposed audio feature and the pronunciation feature, and the superimposed audio feature is obtained by superimposing the coarse-grained audio feature and the fine-grained audio feature, so the fusion feature can reflect the coarse-grained audio feature to some extent. However, as described above, the fusion feature is obtained from the coarse-grained audio feature through a series of calculations, so it is difficult for the fusion feature to reflect the coarse-grained audio feature accurately; consequently, the sound spectrum feature converted by the conversion sub-network has difficulty accurately reflecting the overall characteristics of the target style.
Therefore, by additionally inputting the coarse-grained audio feature into the conversion sub-network, the conversion sub-network can accurately refer to the overall characteristics of the target style when converting the fusion features into sound spectrum features, so that the converted sound spectrum features accurately reflect the overall characteristics of the target style and the subsequently synthesized audio segment has the target style as a whole.
The input of the post-processing sub-network is the sound spectrum characteristic obtained by conversion, and the output of the post-processing sub-network is the sound spectrum characteristic after post-processing. And the post-processing sub-network is used for post-processing the sound spectrum characteristics obtained by conversion. The post-processing sub-network is composed of a plurality of Convolutional Neural Networks (Convolutional Neural Networks).
The input of the vocoder is the post-processed sound spectrum features; that is, the input of the vocoder is the output of the post-processing sub-network. The output of the vocoder is the synthesized audio segment, which has the target style and whose speech content is the target text. The vocoder is used to convert the input sound spectrum features into an audio segment.
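A simplified sketch of the sound spectrum decoding sub-model is given below: a pre-net over the original spectrogram, a conversion stage that turns the fusion features (and, optionally, the coarse-grained feature) into spectrogram frames, and a convolutional post-net. A generic Transformer decoder layer stands in for the conversion sub-networks, the WaveGlow vocoder is assumed to be an external pretrained model and is omitted, and all dimensions are illustrative.

```python
# Sketch of the sound spectrum decoding sub-model (vocoder step omitted).
import torch
import torch.nn as nn

class SpectrumDecoder(nn.Module):
    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.pre_net = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU(),
                                     nn.Linear(d_model, d_model), nn.ReLU())
        self.convert = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.to_mel = nn.Linear(d_model, n_mels)
        self.post_net = nn.Sequential(nn.Conv1d(n_mels, n_mels, 5, padding=2), nn.Tanh(),
                                      nn.Conv1d(n_mels, n_mels, 5, padding=2))

    def forward(self, mel_prev, fusion, coarse=None):
        memory = fusion if coarse is None else fusion + coarse   # optionally re-inject the coarse-grained feature
        hidden = self.convert(self.pre_net(mel_prev), memory)    # convert fusion features to hidden spectrogram states
        mel = self.to_mel(hidden)
        return mel + self.post_net(mel.transpose(1, 2)).transpose(1, 2)  # post-processed spectrogram for the vocoder

mel = SpectrumDecoder()(torch.randn(2, 100, 80), torch.randn(2, 30, 256), torch.randn(2, 1, 256))
```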
Corresponding to the style migration synthesis method, the present disclosure also provides a speech synthesis model training method, which is used for training the speech synthesis model used in the style migration synthesis method.
The speech synthesis model training method provided by the present disclosure can be applied to any electronic device with speech synthesis model training capability, including but not limited to servers, personal computers, and the like. Moreover, the speech synthesis model provided by the present disclosure and the style migration synthesis method provided by the present disclosure may be applied to the same device, or may be applied to different devices, and the present disclosure does not set any limit to this.
The speech synthesis model training method provided by the present disclosure can be seen in fig. 4, and includes:
s401, inputting the sample audio clip and the sample text into the original model, wherein the sample text is the voice content of the sample audio clip.
That the sample text is the speech content of the sample audio clip means that the text obtained by performing speech recognition on the sample audio clip is the sample text. For example, if a sample audio clip was recorded while Zhang San spoke "AAAABBBB", then the sample text is "AAAABBBB".
S402, superposing the coarse-grained audio features for representing the sample audio clips and the fine-grained audio features for representing the audio units aiming at each audio unit in the sample audio clips through the original model to obtain superposed audio features of the audio units.
The structure and principle of the original model are exactly the same as those of the aforementioned speech synthesis model; the only difference is that the model parameters of the original model differ from those of the speech synthesis model. Therefore, reference may be made to the foregoing description of the style extraction submodel, which is not repeated here.
And S403, extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model.
The principle of this step is the same as that of the content coding submodel, and reference may be made to the above description about the content coding submodel, which is not described herein again.
S404, fusing the pronunciation characteristics of the pronunciation units and the target superposition audio characteristics aiming at each pronunciation unit in the sample text through the original model to obtain the fusion characteristics of the pronunciation units, wherein the target superposition audio characteristics are the superposition audio characteristics matched with the pronunciation characteristics.
The principle of this step is the same as that of the content style cross-attention submodel, and reference may be made to the foregoing description on the content style cross-attention submodel, which is not described herein again.
S405, converting the fusion features of each pronunciation unit in the sample text into prediction sound spectrum features through the original model.
The principle of this step is the same as that of the aforementioned sub-model for sound spectrum decoding, and reference may be made to the aforementioned description about the sub-model for sound spectrum decoding, which is not described herein again.
S406, adjusting model parameters of the original model according to the difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio clip.
It can be understood that, since the sample text input into the original model is the speech content of the sample audio clip, the audio clip obtained by the original model migrating the voice style of the sample audio clip onto the sample text should be the sample audio clip itself. In other words, if the original model could perform style migration synthesis accurately, the predicted sound spectrum feature would be the same as the real sound spectrum feature of the sample audio clip. The difference between the predicted and real sound spectrum features therefore arises because the original model cannot yet perform style migration synthesis accurately.
Therefore, the difference between the predicted sound spectrum characteristic and the real sound spectrum characteristic can be used for guiding the adjustment of the model parameters of the original model, so that the model parameters of the original model are adjusted towards the direction of reducing the difference, and the speech synthesis model capable of accurately performing style migration synthesis is trained.
And S407, acquiring a new sample audio fragment and a new sample text, returning to execute S401 until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
The newly acquired sample text should be the speech content of the newly acquired sample audio clip, and the newly acquired sample audio clip is different from the previous sample audio clip. The first convergence condition may be set according to the user's actual needs or experience; for example, it may be that the convergence of the model parameters of the original model reaches a preset convergence threshold, or that the number of sample audio clips already used reaches a preset number threshold.
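A hedged sketch of the training procedure S401-S407 follows. The `model` object is assumed to bundle the four sub-models sketched above, the data loader yielding (frame features, phoneme ids, real spectrogram) triples is hypothetical, and the L1 loss and fixed step count are illustrative stand-ins for the spectrogram difference and the first convergence condition.

```python
# Sketch of the training loop of steps S401-S407.
import torch
import torch.nn.functional as F

def train(model, loader, steps=10000, lr=1e-4):
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (frame_feats, phoneme_ids, real_mel) in enumerate(loader):
        superimposed = model.style_extractor(frame_feats)          # S402: superimposed audio features
        pron_feats = model.content_encoder(phoneme_ids)            # S403: pronunciation features
        fusion = model.cross_attention(pron_feats, superimposed)   # S404: fusion features
        pred_mel = model.decoder(real_mel, fusion)                 # S405: predicted sound spectrum features
        loss = F.l1_loss(pred_mel, real_mel)                       # S406: difference to the real spectrogram
        optim.zero_grad()
        loss.backward()
        optim.step()
        if step >= steps:                                          # S407: stand-in for the first convergence condition
            break
    return model
```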
By adopting this embodiment, the original model is trained under the supervision of the real sound spectrum features of the sample audio, and the sample text is the speech content of the sample audio clip, so the superimposed audio features extracted by the original model can be well matched with the pronunciation features of the pronunciation units in the sample text. The trained speech synthesis model can therefore better learn the matching relationship between superimposed audio features and pronunciation features, so that audio clips with the target style can be synthesized during style migration synthesis.
It will be appreciated that, since the sample text is the speech content of the sample audio segment, the number of pronunciation units in the sample text should be similar to or even the same as the number of audio units in the sample audio segment. When the style migration synthesis is performed by using the speech synthesis model, the target text is not the speech content of the target audio segment, and therefore the number of pronunciation units of the target text may be different from the number of audio units in the target audio segment.
To enable the speech synthesis model to perform style migration synthesis accurately even when the numbers of pronunciation units and audio units differ greatly, in a possible embodiment the target superimposed audio feature in the aforementioned S404 is a filtered audio feature matched with the pronunciation feature, where the filtered audio features are a subset of superimposed audio features extracted from all the superimposed audio features.
The extraction mode is random extraction, and the number of the extracted superimposed audio features can be set according to the actual needs or experience of the user, for example, 80% of the superimposed audio features are extracted as the filtered audio features, and for example, 90% of the superimposed audio features are extracted as the filtered audio features.
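The random extraction could be sketched as follows; the 80% ratio is the example value from the text, and the function name is an assumption.

```python
# Sketch: randomly keep a fraction of the superimposed audio features as the
# filtered audio features used as keys/values during training.
import torch

def filter_superimposed(superimposed, keep_ratio=0.8):
    n_units = superimposed.shape[1]
    n_keep = max(1, int(n_units * keep_ratio))
    idx = torch.randperm(n_units)[:n_keep].sort().values   # random subset of audio units, original order kept
    return superimposed[:, idx, :]                          # filtered audio features

filtered = filter_superimposed(torch.randn(2, 24, 256))     # -> (2, 19, 256)
```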
By adopting this embodiment, extracting only part of the superimposed audio features during training forces the cross attention mechanism to fuse the pronunciation features with the selected subset of superimposed audio features. The trained speech synthesis model thus learns how to perform style migration synthesis when the number of pronunciation units differs greatly from the number of audio units, i.e., it can perform style migration synthesis accurately in that case.
In the foregoing S407, the obtained new sample audio segment and the previous sample audio segment may have the same style (i.e., audio segments of the same person), or may have different voice styles (i.e., audio segments of different persons) from the previous sample audio segment.
In a possible embodiment, the obtaining of the new sample segment in the foregoing S407 is implemented by:
and if the second convergence condition is not reached, acquiring a new sample audio clip from the first sample data set, and if the second convergence condition is reached, acquiring a new sample audio clip from the second sample data set.
Wherein the first sample data set comprises audio segments of a first sample person and the second sample data set comprises audio segments of a plurality of sample persons. Also in this embodiment, the sample audio piece at the time of executing S401 for the first time is an audio piece of the first sample person.
The second convergence condition is easier to reach than the first convergence condition; that is, when the second convergence condition is reached, the first convergence condition has not yet been reached, and by the time the first convergence condition is reached, the second convergence condition has already been reached.
With this embodiment, the original model is first trained on the audio of the first sample person, so that it learns how to migrate the voice style of the first sample person onto a given text; it is then trained on the respective audio clips of a plurality of sample persons, so that it learns how to migrate different voice styles onto a given text. Because the original model is pre-trained to migrate the voice style of the first sample person before learning to migrate different voice styles, training can be completed with only a small number of audio clips from different persons, which effectively reduces the difficulty of acquiring sample audio clips.
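The two-stage sampling strategy could be sketched as below; the convergence checks are placeholders, and the function names are assumptions.

```python
# Sketch: draw clips from the single-speaker first sample data set until the
# (easier) second convergence condition is met, then switch to the multi-speaker
# second sample data set until the first convergence condition is met.
import random

def next_sample(first_set, second_set, second_converged):
    # first_set: clips of the first sample person; second_set: clips of many sample persons
    return random.choice(second_set if second_converged else first_set)

def train_two_stage(model, first_set, second_set, second_reached, first_reached, train_step):
    while not first_reached(model):
        clip, text = next_sample(first_set, second_set, second_reached(model))
        train_step(model, clip, text)   # one iteration of S401-S406
    return model
```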
Referring to fig. 5, fig. 5 is a schematic structural diagram of a style migration and synthesis apparatus provided by the present disclosure, including:
a first input module 501, configured to input a target text and a target audio segment with a target speech style into a speech synthesis model obtained through training of a sample text and a sample audio segment in advance;
a style extraction module 502, configured to extract a sub-model according to a style of the speech synthesis model, and superimpose, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
a content coding module 503, configured to extract pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the speech synthesis model;
a content style cross attention module 504, configured to fuse, by using a content style cross attention submodel of the speech synthesis model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the target text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and a sound spectrum decoding module 505, configured to synthesize, according to the fusion feature of each pronunciation unit in the target text, an audio segment having the target speech style and speech content as the target text by using a sound spectrum decoding sub-model of the speech synthesis model.
By adopting this embodiment, the superimposed audio features are obtained by superimposing audio features of different granularities, so they reflect both the overall characteristics of the target style and the detail characteristics of the target audio segment. A cross attention mechanism then fuses the pronunciation feature of each pronunciation unit with its matched superimposed audio feature; the resulting fusion features reflect, on the one hand, the speech content of the target text and, on the other hand, the overall and detail characteristics of the target style. Because each pronunciation feature is fused with its matched superimposed audio feature, the detail characteristics reflected by the fusion feature of each pronunciation unit are those of that pronunciation unit as it would be pronounced in the target style. Therefore, in the audio clip synthesized from the fusion features, not only are the overall acoustic features similar to the target style, but the pronunciation of each pronunciation unit is also similar to the target style; that is, an audio clip having the target style both as a whole and in its details can be synthesized.
In a possible embodiment, the style extraction module 502, through a style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, superimposes a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit, including:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and adding the fine-grained audio features and the coarse-grained audio features of the audio units to each audio unit in the target audio segment through the style extraction module to obtain the superimposed audio features of the audio units.
In a possible embodiment, the content style cross attention module 504, through the content style cross attention submodel of the speech synthesis model, for each pronunciation unit in the target text, fuses the pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain a fused feature of the pronunciation unit, including:
inputting the pronunciation feature of each pronunciation unit in the target text into the self-attention sub-network of the content style cross attention sub-model in the speech synthesis model to obtain the adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross attention sub-network of the content style cross attention sub-model and for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit with the target superimposed audio feature to obtain the fusion feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
In a possible embodiment, the voice spectrum decoding module 505 synthesizes, by using a voice spectrum decoding submodel of the speech synthesis model, an audio segment having the target speech style and speech content of the target text according to the fusion feature of each pronunciation unit in the target text, including:
inputting the fusion feature of each pronunciation unit in the target text and the coarse-grained audio feature into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum characteristics into an audio fragment which has the target voice style and the voice content of which is the target text.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a training apparatus for a speech synthesis model provided by the present disclosure, which may include:
a second input module 601, configured to input a sample audio segment and a sample text into an original model, where the sample text is a speech content of the sample audio segment;
a first original module 602, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
a second primitive module 603, configured to extract, through the primitive model, pronunciation features of each pronunciation unit in the sample text;
a third original module 604, configured to fuse, by using the original model, a pronunciation feature of each pronunciation unit in the sample text and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature that matches the pronunciation feature;
a fourth original module 605, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module 606, configured to adjust a model parameter of the original model according to a difference between the predicted audio spectrum feature and a true audio spectrum feature of the sample audio segment;
the obtaining module 607 is configured to obtain a new sample audio segment and a new sample text, return to the step of inputting the sample audio segment and the sample text into the original model until a first convergence condition is reached, and use the adjusted original model as a speech synthesis model.
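The training flow implemented by modules 601-607 can be sketched as a standard supervised loop. The DummyOriginalModel stand-in, the L1 loss, the Adam optimizer and the fixed step count representing the first convergence condition are all assumptions made only to keep the sketch self-contained and runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyOriginalModel(nn.Module):
    """Stand-in for the original model (style extraction, content encoding,
    cross-attention fusion and spectrum decoding); a single projection keeps
    the sketch runnable while only mimicking input/output shapes."""

    def __init__(self, text_dim: int = 32, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(text_dim, n_mels)

    def forward(self, sample_audio: torch.Tensor, encoded_text: torch.Tensor) -> torch.Tensor:
        return self.proj(encoded_text)                     # predicted sound spectrum features

model = DummyOriginalModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):                                    # stand-in for the first convergence condition
    sample_audio = torch.randn(1, 16000)                   # hypothetical sample audio clip
    encoded_text = torch.randn(1, 120, 32)                 # hypothetical encoded sample text
    real_mel = torch.randn(1, 120, 80)                     # real sound spectrum features of the clip
    pred_mel = model(sample_audio, encoded_text)
    loss = F.l1_loss(pred_mel, real_mel)                   # difference between predicted and real spectra
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                       # adjust the model parameters
```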
In a possible embodiment, the apparatus further includes:
a superimposed feature extraction module, configured to extract some of the superimposed audio features from all the superimposed audio features as screened audio features;
wherein the target superimposed audio features are the screened audio features matched with the pronunciation features.
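One way to realize this screening is sketched below under the assumption that a random subset of the superimposed audio features is kept; the disclosure does not fix the selection rule, and the function name and the keep parameter are illustrative.

```python
import torch

def screen_superimposed_features(superimposed: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep only part of the superimposed audio features; the retained subset
    is what the pronunciation features are later matched against."""
    idx = torch.randperm(superimposed.size(0))[:keep]
    return superimposed[idx]

# Illustrative usage: keep 2 of 3 superimposed audio features.
print(screen_superimposed_features(torch.randn(3, 256), keep=2).shape)  # torch.Size([2, 256])
```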
In one possible embodiment, the sample audio piece is initially an audio piece of a first sample person;
the obtaining module 607 obtains a new sample audio piece, including:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
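The two-stage sampling rule can be sketched as follows; treating the second convergence condition as a simple step threshold is an assumption, as are the function and data-set names.

```python
import random

def next_training_sample(step: int, first_sample_set: list, second_sample_set: list,
                         second_convergence_step: int = 5000):
    """Before the second convergence condition is reached, draw samples only from
    the first sample person's data set; afterwards, draw (audio, text) pairs from
    the multi-speaker data set."""
    if step < second_convergence_step:
        return random.choice(first_sample_set)             # first sample data set (single speaker)
    return random.choice(second_sample_set)                # second sample data set (many speakers)

# Illustrative usage with placeholder samples.
single = [("clip_a.wav", "text a"), ("clip_b.wav", "text b")]
multi = [("clip_c.wav", "text c"), ("clip_d.wav", "text d")]
print(next_training_sample(step=100, first_sample_set=single, second_sample_set=multi))
```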
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, disclosure and other handling of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
It should be noted that the sample audio clips in this embodiment are derived from public data sets, such as LJSpeech and VCTK.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above, such as the style migration synthesis method or the training method of the speech synthesis model. For example, in some embodiments, the style migration synthesis method or the training method of the speech synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the style migration synthesis method or the training method of the speech synthesis model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (for example, by means of firmware) to perform the style migration synthesis method or the training method of the speech synthesis model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A style migration synthesis method, comprising:
inputting a target text and a target audio clip having a target speech style into a speech synthesis model obtained in advance by training with a sample text and a sample audio clip;
superimposing, through the style extraction sub-model of the speech synthesis model, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
extracting the pronunciation characteristics of each pronunciation unit in the target text through the content coding sub-model of the voice synthesis model;
fusing, through the content style cross-attention sub-model of the speech synthesis model, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and synthesizing, through the sound spectrum decoding sub-model of the speech synthesis model and according to the fused feature of each pronunciation unit in the target text, an audio segment that has the target speech style and whose speech content is the target text.
2. The method of claim 1, wherein the obtaining, by the style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, a superimposed audio feature of the audio unit by superimposing a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit comprises:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and, for each audio unit in the target audio segment, adding the fine-grained audio feature of the audio unit to the coarse-grained audio feature through the style extraction module to obtain the superimposed audio feature of the audio unit.
3. The method according to claim 1, wherein the fusing the pronunciation features of the pronunciation units and the target superimposed audio features for each pronunciation unit in the target text by the content style cross attention submodel of the speech synthesis model to obtain the fused features of the pronunciation units comprises:
inputting the pronunciation features of each pronunciation unit in the target text into the self-attention sub-network of the content style cross-attention sub-model in the speech synthesis model to obtain adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross-attention sub-network of the content style cross-attention sub-model, for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain the fused feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
4. The method according to claim 1, wherein the synthesizing, by the sound spectrum decoding sub-model of the speech synthesis model and according to the fusion feature of each pronunciation unit in the target text, of the audio segment that has the target speech style and whose speech content is the target text comprises:
inputting the fusion features of each pronunciation unit in the target text, together with the coarse-grained audio features, into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum features into an audio segment that has the target speech style and whose speech content is the target text.
5. A method of training a speech synthesis model, comprising:
inputting a sample audio clip and a sample text into an original model, wherein the sample text is the voice content of the sample audio clip;
superposing, by the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit to obtain a superposed audio feature of the audio unit;
extracting pronunciation characteristics of each pronunciation unit in the sample text through the original model;
fusing, through the original model, for each pronunciation unit in the sample text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
converting, through the original model, the fusion features of each pronunciation unit in the sample text into predicted sound spectrum features;
adjusting model parameters of the original model according to a difference between the predicted sound spectrum features and the real sound spectrum features of the sample audio clip;
and acquiring a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
6. The method of claim 5, further comprising:
extracting some of the superimposed audio features from all the superimposed audio features as screened audio features;
the target superimposed audio features are screened audio features matched with the pronunciation features.
7. The method of claim 6, wherein the sample audio clip is initially an audio clip of a first sample person;
the obtaining of the new sample audio piece comprises:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
8. A style migration synthesis apparatus comprising:
the first input module is used for inputting the target text and the target audio clip with the target voice style into a voice synthesis model which is obtained by training a sample text and a sample audio clip in advance;
the style extraction module is used for superimposing, through the style extraction sub-model of the speech synthesis model, for each audio unit in the target audio segment, a coarse-grained audio feature used for characterizing the target audio segment and a fine-grained audio feature used for characterizing the audio unit, to obtain a superimposed audio feature of the audio unit;
the content coding module is used for extracting the pronunciation characteristics of each pronunciation unit in the target text through a content coding sub-model of the voice synthesis model;
the content style cross attention module is used for fusing, through a content style cross-attention sub-model of the speech synthesis model, for each pronunciation unit in the target text, the pronunciation feature of the pronunciation unit and a target superimposed audio feature to obtain a fused feature of the pronunciation unit, wherein the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
and the sound spectrum decoding module is used for synthesizing an audio segment which has the target voice style and has the voice content of the target text according to the fusion characteristics of each pronunciation unit in the target text through a sound spectrum decoding sub-model of the voice synthesis model.
9. The apparatus of claim 8, wherein the style extraction module, via a style extraction submodel of the speech synthesis model, for each audio unit in the target audio segment, superimposes a coarse-grained audio feature for characterizing the target audio segment and a fine-grained audio feature for characterizing the audio unit to obtain a superimposed audio feature of the audio unit, includes:
extracting the average audio features of all audio frames in the target audio clip as coarse-grained audio features through a style extraction module of the speech synthesis model;
extracting, by the style extraction module, average audio features of all audio frames in the audio unit as fine-grained audio features of the audio unit for each audio unit in the target audio clip;
and, for each audio unit in the target audio segment, adding the fine-grained audio feature of the audio unit to the coarse-grained audio feature through the style extraction module to obtain the superimposed audio feature of the audio unit.
10. The apparatus of claim 8, wherein the content style cross attention module fuses, for each pronunciation unit in a target text, a pronunciation feature of the pronunciation unit and a target overlay audio feature to obtain a fused feature of the pronunciation unit through a content style cross attention submodel of the speech synthesis model, comprising:
inputting the pronunciation features of each pronunciation unit in the target text into the self-attention sub-network of the content style cross-attention sub-model in the speech synthesis model to obtain adjusted pronunciation features output by the self-attention sub-network;
and fusing, through the cross-attention sub-network of the content style cross-attention sub-model, for each pronunciation unit in the target text, the adjusted pronunciation feature of the pronunciation unit and the target superimposed audio feature to obtain the fused feature of the pronunciation unit, wherein the target superimposed audio feature is the superimposed audio feature matched with the adjusted pronunciation feature.
11. The apparatus according to claim 8, wherein the sound spectrum decoding module synthesizes, by a sound spectrum decoding sub-model of the speech synthesis model and according to the fusion feature of each pronunciation unit in the target text, an audio segment that has the target speech style and whose speech content is the target text, including:
inputting the fusion features of each pronunciation unit in the target text, together with the coarse-grained audio features, into the sound spectrum decoding sub-model of the speech synthesis model to obtain the sound spectrum features output by the sound spectrum decoding sub-model;
and converting the sound spectrum features into an audio segment that has the target speech style and whose speech content is the target text.
12. An apparatus for training a speech synthesis model, comprising:
the second input module is used for inputting a sample audio clip and a sample text into the original model, wherein the sample text is the voice content of the sample audio clip;
a first original module, configured to superimpose, by using the original model, for each audio unit in the sample audio clip, a coarse-grained audio feature used for characterizing the sample audio clip and a fine-grained audio feature used for characterizing the audio unit, so as to obtain a superimposed audio feature of the audio unit;
the second original module is used for extracting the pronunciation characteristics of each pronunciation unit in the sample text through the original model;
a third original module, configured to fuse, by using the original model, a pronunciation feature of the pronunciation unit and a target superimposed audio feature for each pronunciation unit in the sample text to obtain a fused feature of the pronunciation unit, where the target superimposed audio feature is a superimposed audio feature matched with the pronunciation feature;
a fourth original module, configured to convert, through the original model, the fusion feature of each pronunciation unit in the sample text into a predicted sound spectrum feature;
a parameter adjusting module, configured to adjust the model parameters of the original model according to a difference between the predicted sound spectrum feature and the real sound spectrum feature of the sample audio segment;
and the obtaining module is used for obtaining a new sample audio clip and a new sample text, returning to execute the step of inputting the sample audio clip and the sample text into the original model until a first convergence condition is reached, and taking the adjusted original model as a speech synthesis model.
13. The apparatus of claim 12, further comprising:
a superimposed feature extraction module, configured to extract some of the superimposed audio features from all the superimposed audio features as screened audio features;
the target superimposed audio features are screened audio features matched with the pronunciation features.
14. The apparatus of claim 12, wherein the sample audio clip is initially an audio clip of a first sample person;
the obtaining module obtains a new sample audio clip, including:
if the second convergence condition is not met, acquiring a new sample audio clip from a first sample data set, wherein the first sample data set comprises the audio clip of the first sample person;
and if the second convergence condition is reached, acquiring a new sample audio clip and a new sample text from a second sample data set, wherein the second sample data set comprises audio clips of a plurality of sample persons.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4 or 5-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any of claims 1-4 or 5-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-4 or 5-7.
CN202111491886.0A 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment Pending CN114187892A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111491886.0A CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111491886.0A CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN114187892A (en) 2022-03-15

Family

ID=80603825

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111491886.0A Pending CN114187892A (en) 2021-12-08 2021-12-08 Style migration synthesis method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN114187892A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112365877A (en) * 2020-11-27 2021-02-12 北京百度网讯科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN112927674A (en) * 2021-01-20 2021-06-08 北京有竹居网络技术有限公司 Voice style migration method and device, readable medium and electronic equipment
CN113450758A (en) * 2021-08-27 2021-09-28 北京世纪好未来教育科技有限公司 Speech synthesis method, apparatus, device and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANG LI et al.: "Towards Multi-Scale Style Control For Expressive Speech Synthesis", INTERSPEECH 2021, 3 September 2021 (2021-09-03), pages 4673-4676 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012505A (en) * 2022-12-29 2023-04-25 上海师范大学天华学院 Pronunciation animation generation method and system based on key point self-detection and style migration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination