CN113112987B - Speech synthesis method, training method and device of speech synthesis model - Google Patents


Info

Publication number: CN113112987B (granted publication of application CN113112987A)
Authority: CN (China)
Prior art keywords: feature, synthesized, acoustic, text, emotion
Legal status: Active
Application number: CN202110400408.8A
Other languages: Chinese (zh)
Inventor: 胡大盟
Current Assignee: Beijing Horizon Information Technology Co Ltd
Original Assignee: Beijing Horizon Information Technology Co Ltd
Application filed by Beijing Horizon Information Technology Co Ltd, with priority to CN202110400408.8A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

A speech synthesis method, a training method of a speech synthesis model and a device are disclosed. The speech synthesis method in an embodiment of the present disclosure may include: performing text encoding on a first text to be synthesized to obtain a first synthesized feature; performing acoustic encoding on a first acoustic feature to obtain a second synthesized feature; performing alignment processing on the first synthesized feature, the second synthesized feature and a preselected emotion expression parameter to obtain a third synthesized feature; and performing acoustic decoding on the third synthesized feature to obtain a second acoustic feature of the first text. Speech with a specific degree of emotion can be synthesized based on a preset emotion expression parameter, meeting practical application requirements.

Description

Speech synthesis method, training method and device of speech synthesis model
Technical Field
The disclosure relates to the technical field of speech synthesis, and in particular relates to a speech synthesis method, a training method of a speech synthesis model and a device.
Background
With the popularization of intelligent devices and the development of voice recognition technology, people's interaction modes have gradually changed from the traditional text mode to a more humanized voice interaction mode. Speech synthesis technology enables a machine to speak with a human voice and changes the traditional text-based interaction mode.
Disclosure of Invention
Current speech synthesis mainly relies on training sound libraries for emotion synthesis and cannot use parameter-adaptive adjustment to synthesize emotional speech. For example, to synthesize both a happy and an angry timbre for the same text, two independent models need to be trained on two different emotion sound libraries of the same speaker, and each model can synthesize only a single emotion. To solve this technical problem, embodiments of the present disclosure provide a speech synthesis method, a training method and apparatus for a speech synthesis model, an electronic device, and a storage medium, which can synthesize speech with a corresponding degree of emotion by setting parameters.
According to one aspect of the present disclosure, there is provided a speech synthesis method including:
performing text encoding on a first text to be synthesized to obtain a first synthesized feature;
performing acoustic encoding on a first acoustic feature to obtain a second synthesized feature;
performing alignment processing on the first synthesized feature, the second synthesized feature and a preselected emotion expression parameter to obtain a third synthesized feature; and
performing acoustic decoding on the third synthesized feature to obtain a second acoustic feature of the first text.
According to one aspect of the present disclosure, there is provided a training method of a speech synthesis model, including:
setting a speech synthesis parameter in a speech synthesis model to a current value, the speech synthesis parameter comprising at least one of: a text encoding parameter, an acoustic decoding parameter, an emotion expression parameter and a basic weight parameter for refining the granularity of the emotion expression parameter;
performing speech synthesis of the speech synthesis model using a second text and its real acoustic features as a training sample to obtain predicted acoustic features of the second text, the speech synthesis of the speech synthesis model including text encoding, acoustic encoding, alignment processing and acoustic decoding performed in sequence; and
adjusting the value of the speech synthesis parameter according to the alignment training feature generated by the alignment processing and the real and predicted acoustic features of the second text.
According to one aspect of the present disclosure, there is provided a voice synthesis apparatus comprising:
A text encoding unit configured to perform text encoding on a first text to be synthesized so as to obtain a first synthesized feature;
An acoustic encoding unit configured to perform acoustic encoding on the first acoustic feature to obtain a second synthesized feature;
An alignment processing unit configured to perform alignment processing on the first synthesized feature, the second synthesized feature and the preselected emotion expression parameter to obtain a third synthesized feature;
And an acoustic decoding unit configured to perform acoustic decoding on the third synthesized feature to obtain a second acoustic feature of the first text.
According to one aspect of the present disclosure, there is provided an electronic device including: one or more processors; and a memory storing a computer program which, when executed by the processor, causes the processor to execute the above-described speech synthesis method and/or training method of a speech synthesis model.
According to one aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the above-described speech synthesis method and/or training method of a speech synthesis model.
According to the embodiments of the present disclosure, speech synthesis is realized through text encoding, acoustic encoding, alignment processing and acoustic decoding, and the alignment processing step is completed using the preselected emotion expression parameter, so that speech with a specific degree of emotion can be synthesized simply by presetting the emotion expression parameter, meeting practical application requirements.
Drawings
The above and other objects, features and advantages of the present application will become more apparent from the following detailed description of embodiments of the present application with reference to the accompanying drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application together with its embodiments and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
Fig. 1 is a flow chart illustrating a speech synthesis method according to an exemplary embodiment of the present disclosure.
Fig. 2 is an exemplary flowchart of an alignment process in a speech synthesis method provided in an exemplary embodiment of the present disclosure.
Fig. 3 is a flow chart of a training method of a speech synthesis model according to an exemplary embodiment of the present disclosure.
Fig. 4 is an exemplary flow chart for adjusting speech synthesis parameters provided by an exemplary embodiment of the present disclosure.
Fig. 5 is an exemplary flowchart for adjusting speech synthesis parameters provided by another exemplary embodiment of the present disclosure.
Fig. 6 is a schematic block diagram of a speech synthesis model, its training process, and the speech synthesis process it performs, provided in an exemplary embodiment of the present disclosure.
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, exemplary embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
As described above, current speech synthesis mainly relies on training sound libraries for synthesizing emotion-bearing speech and cannot use parameter-adaptive adjustment to synthesize emotional speech. For example, to synthesize both a happy and an angry timbre for the same text, two independent models need to be trained on two different emotion sound libraries of the same speaker, and each model can synthesize only a single emotion. Moreover, emotion categories such as anger, disgust, fear, happiness, neutrality, sadness and surprise are directly encoded during speech synthesis; the granularity of this emotion characterization is too coarse to reflect the differences between the speech of different people expressing the same emotion, and speech with a gradually changing or specific degree of emotion cannot be obtained.
In view of the foregoing technical problems in the related art, a basic concept of an embodiment of the present disclosure is to provide a speech synthesis method and apparatus, an electronic device and a storage medium, in which a first text to be synthesized is first text-encoded to obtain a first synthesized feature, a first acoustic feature is acoustically encoded to obtain a second synthesized feature, the first synthesized feature, the second synthesized feature and a preselected emotion expression parameter are then aligned to obtain a third synthesized feature, and the third synthesized feature is finally acoustically decoded to obtain a second acoustic feature of the first text. Thus, according to the embodiments of the present disclosure, speech synthesis is realized through text encoding, acoustic encoding, alignment processing and acoustic decoding, and the alignment processing step is completed using the preselected emotion expression parameter. Speech with a corresponding degree of emotion can therefore be synthesized simply by presetting the emotion expression parameter; the emotion granularity is finer and closer to the actual situation of emotional speech, speech synthesis with a specific degree of emotion and/or gradually changing emotion is realized, and practical application requirements are met.
In view of the foregoing technical problems in the related art, a basic concept of an embodiment of the present disclosure further includes providing a training method and apparatus for a speech synthesis model, an electronic device and a storage medium. A speech synthesis parameter in the speech synthesis model is set to a current value, where the speech synthesis parameter includes at least one of the following: a text encoding parameter, an acoustic decoding parameter, the emotion expression parameter, and a basic weight parameter for refining the granularity of the emotion expression parameter. Speech synthesis of the speech synthesis model is performed using a second text and its real acoustic features as a training sample to obtain predicted acoustic features of the second text, the speech synthesis of the speech synthesis model including text encoding, acoustic encoding, alignment processing and acoustic decoding performed in sequence. Finally, the value of the speech synthesis parameter is adjusted according to the alignment training feature generated by the alignment processing and the real and predicted acoustic features of the second text. Iterative training with this method yields a speech synthesis model in which the alignment processing part aligns the acoustic features with the text features while fusing the emotion expression parameter into the text features, so that the speech synthesis model supports speech synthesis with different degrees of emotion. When speech synthesis is performed with this speech synthesis model, speech with a corresponding degree of emotion can be synthesized simply by presetting the emotion expression parameter; the emotion granularity is finer and closer to the actual situation of emotional speech, speech synthesis with a specific degree of emotion and/or gradually changing emotion is realized, and practical application requirements are met.
The embodiment of the disclosure can be applied to various scenes needing speech synthesis.
Exemplary method
Fig. 1 is a flow chart of a speech synthesis method according to an embodiment of the disclosure. As shown in fig. 1, the speech synthesis method in the embodiment of the present disclosure may include:
Step S101, performing text encoding on a first text to be synthesized to obtain a first synthesized feature;
Step S102, performing acoustic encoding on a first acoustic feature to obtain a second synthesized feature;
Step S103, performing alignment processing on the first synthesized feature, the second synthesized feature and a preselected emotion expression parameter to obtain a third synthesized feature;
Step S104, performing acoustic decoding on the third synthesized feature to obtain a second acoustic feature of the first text.
According to the above speech synthesis method, speech synthesis is realized through text encoding, acoustic encoding, alignment processing and acoustic decoding, and the alignment processing is completed using the preselected emotion expression parameter, so that speech with a specific degree of emotion can be synthesized simply by presetting the emotion expression parameter, meeting practical application requirements.
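As a rough illustration of how these four steps chain together (not the patented implementation), the following Python sketch assumes hypothetical text_encode, acoustic_encode, align and acoustic_decode callables:

```python
# Hypothetical sketch of steps S101-S104; the callable names and the emotion
# parameter are illustrative assumptions, not the patented code.

def synthesize_one_frame(first_text, first_acoustic_feature, emotion_params,
                         text_encode, acoustic_encode, align, acoustic_decode):
    """Produce one second-acoustic-feature frame for the first text."""
    first_synth = text_encode(first_text)                    # step S101: text encoding
    second_synth = acoustic_encode(first_acoustic_feature)   # step S102: acoustic encoding
    third_synth = align(first_synth, second_synth, emotion_params)  # step S103: alignment
    second_acoustic = acoustic_decode(third_synth)           # step S104: acoustic decoding
    return second_acoustic
```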
In the embodiment of the present disclosure, the emotion expression parameter may have a plurality of predetermined dimensions, and the value in each predetermined dimension lies within a predetermined range. That is, the emotion expression parameter may be represented by tensor data of two, three or more dimensions, with a continuous value in each dimension. Because the emotion expression parameter has multiple dimensions with continuous values in each dimension, it can describe the degree of emotion accurately, and the same emotion of different speakers can also be expressed differentially; the emotion granularity is finer, so emotion in speech synthesized with the emotion expression parameter is also finer. The emotion in the speech can reflect not only the specific degree of emotion but also the differences between different speakers when expressing the same emotion, speech conforming to the actual situation can be synthesized more accurately, and speech with a gradually changing degree of emotion can be synthesized by setting the value of the emotion expression parameter in each dimension, thereby meeting practical application requirements.
In at least some embodiments, the above emotion expression parameter can be constructed using an emotion model. For example, the emotion expression parameter may be constructed with a three-dimensional emotion model, and the corresponding emotion expression parameter may be represented by three-dimensional tensor data. In order to make the emotion granularity of the emotion expression parameter sufficiently fine, the value of the data in each dimension of the emotion expression parameter may be any value between -1 and 1.
In some examples, a PAD model may be employed. The PAD model considers emotion to have three dimensions: pleasure, arousal and dominance. Correspondingly, the emotion expression parameter may have three dimensions, namely a P dimension, an A dimension and a D dimension, where P represents pleasure (pleasure-displeasure), characterizing the positive or negative quality of an individual's emotional state; A represents arousal (arousal-nonarousal), representing the individual's level of neurophysiological activation; and D represents dominance (dominance-submissiveness), representing the individual's control over the scene and others. The value of the data in each dimension of the emotion expression parameter may be a continuous value between -1 and 1, that is, the value range in each dimension is -1 to +1, where +1 indicates a high value in that dimension and -1 indicates a low value. In other words, an emotion can be represented by three-dimensional tensor data: 3 continuous values between -1 and 1 in a three-dimensional space represent one emotion of a speaker; the emotion expression parameters (i.e., PAD values) of different speakers for the same emotion may differ, the emotion granularity is finer, and the description is more accurate. For example, an emotion expression parameter characterizing the anger of a certain speaker may be expressed as (-0.51, 0.59, 0.25), i.e., a P value of -0.51, an A value of 0.59 and a D value of 0.25. Of course, other emotion models can also be used, such as an APA three-dimensional emotion space model, which uses the three emotion attributes of affinity, pleasure and activity, and the corresponding emotion expression parameter may have these three dimensions. In addition, the emotion expression parameter of the embodiments of the present application may also be two-dimensional, four-dimensional, one-dimensional and so on, and the value range of the data in each dimension may also be a continuous interval other than [-1, +1].
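For illustration only, a PAD-style emotion expression parameter with the value range described above could be represented as follows; the class name and field names are assumptions, not part of the patent:

```python
# Illustrative only: a PAD-style emotion expression parameter as three floats,
# each constrained to the continuous range [-1, 1] described above.
from dataclasses import dataclass

@dataclass
class EmotionExpressionParam:
    pleasure: float   # P: pleasure-displeasure
    arousal: float    # A: arousal-nonarousal
    dominance: float  # D: dominance-submissiveness

    def __post_init__(self):
        for name, v in (("P", self.pleasure), ("A", self.arousal), ("D", self.dominance)):
            if not -1.0 <= v <= 1.0:
                raise ValueError(f"{name} value {v} is outside [-1, 1]")

# Example from the text: an anger emotion of one speaker
anger = EmotionExpressionParam(-0.51, 0.59, 0.25)
```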
In step S103 of the embodiment of the present application, basic weight parameters obtained in advance by means such as training may be added to the emotion expression parameters to enhance the capacity of the emotion expression parameters, so as to further refine emotion granularity and meet actual requirements. Here, the basic weight parameter may also be tensor data, and its size in each dimension may be preset, and specific technical details will be described below.
In an embodiment of the present disclosure, the first composite feature may be a text feature in particular, which may be represented by tensor data. The acoustic features may in particular be mel-amplitude spectra or other similar acoustic features, which may also be represented by tensor data.
In an embodiment of the present disclosure, step S103 may further include: selecting the emotion expression parameter. Specifically, the selection of the emotion expression parameter may be achieved in a variety of ways. For example, a human-machine interface with emotion expression parameter options may be provided to the user in advance, through which the emotion expression parameter input by the user is received and used as the above preselected emotion expression parameter. Taking the PAD model as an example, the emotion expression parameter options may include a P-value option, an A-value option and a D-value option; these three options may be sliders, scroll bars, dialog boxes or any other form, and the user may operate them to input the P value, the A value and the D value of the emotion expression parameter and complete its setting. For another example, a preselected set of emotion expression parameters (for example, a set with equal A values, equal D values and P values differing by 0.01, or a set with A values differing by 0.01, equal D values and P values differing by 0.01) may be configured in advance as a default system configuration and loaded automatically when speech synthesis is performed, so that speech with a gradually changing degree of emotion can be synthesized. For another example, an emotion expression parameter from the cloud, a user terminal or any other electronic device may be received and used as the preselected emotion expression parameter for speech synthesis. Of course, the embodiments of the present disclosure may also support any other applicable manner of obtaining the above preselected emotion expression parameter, and are not limited in this regard.
In at least some embodiments, step S103 may include: step a1, performing matrix multiplication and normalized exponential function processing on a first part of the first synthesized feature and the second synthesized feature to obtain an aligned synthesized feature; step a2, embedding the emotion expression parameter into a second part of the first synthesized feature, the second part being the part of the first synthesized feature other than the first part; and step a3, performing matrix multiplication and splicing processing on the second part of the first synthesized feature embedded with the emotion expression parameter, the aligned synthesized feature and the second synthesized feature to obtain the third synthesized feature. In this way, the embedding of the emotion expression parameter is integrated into the attention mechanism, so that alignment between text and acoustic features and fusion of the emotion expression parameter with the text features are achieved at the same time. The finally obtained acoustic features are therefore clearer when synthesizing speech with finer emotion granularity, and cases such as extra words, missing words, phoneme errors and abnormal tones are fewer, i.e., the error rate is lower.
In some examples, step a1 may include: step a11, segmenting the first synthesized feature in a channel dimension to obtain a first sub-feature as a first part and a second sub-feature as a second part; step a12, performing matrix multiplication operation on the first sub-feature and the second synthesized feature to obtain a fourth synthesized feature; and a step a13, performing normalized exponential function processing on the fourth synthesized feature to obtain an aligned synthesized feature.
In the above example, in step a2, the emotion expression parameter may be embedded in the second sub-feature. In this example, step a3 may include: step a31, performing matrix multiplication operation on the second sub-feature embedded with the emotion expression parameter and the alignment synthesis feature to obtain a fifth synthesis feature; step a32, stitching the portion of the fifth composite feature on the predetermined channel with the portion of the second composite feature on the predetermined channel to obtain a third composite feature.
In the above embodiment, step a2 may be implemented in various ways. In some examples, step a2 may specifically include: step a21, generating an emotion description matrix based on a pre-obtained basic weight parameter and the emotion expression parameter input by the user; step a22, splicing the emotion description matrix with the second part of the first synthesized feature.
In the embodiment of the present disclosure, the emotion type corresponding to the emotion expression parameter may be agreed in advance, and the emotion type may be represented by an emotion identification (e.g., an ID). For example, 7 emotion types, i.e., anger, disgust, fear, happiness, neutrality, sadness and surprise, may be agreed in advance, and these 7 emotion types are represented by 7 preset emotion IDs. In this way, the emotion type together with the multidimensional emotion expression parameter can comprehensively represent the emotion the user wants to express, further refining the emotion granularity. Emotion expression in speech synthesized with the emotion expression parameter and its emotion type is thus finer and more accurate: the emotion in the speech can reflect both the specific degree and the type of emotion, the differences between different speakers expressing the same emotion can be reflected, and speech with a gradually changing degree of emotion can be synthesized by setting the values of the emotion expression parameter in each dimension, meeting practical application requirements. Note that the classification of emotion types is not limited to the above example; emotion types may be freely set according to the requirements of the actual application scenario.
In the above embodiment, step a21 may include: step a211, generating a one-hot encoding vector of the emotion identification corresponding to the emotion expression parameter, where the emotion identification represents the preset emotion type corresponding to the emotion expression parameter; step a212, generating an emotion parameter matrix of the emotion expression parameter, the emotion parameter matrix comprising M rows and N columns, where the N columns correspond one-to-one to N preset emotion types and the M rows correspond to the M dimensions of the emotion expression parameter; and step a213, performing matrix multiplication on the emotion parameter matrix, the one-hot encoding vector of the emotion identification and the pre-obtained basic weight parameter to obtain the emotion description matrix. Taking the PAD model as an example, the emotion expression parameter is three-dimensional, including a P value, an A value and a D value. Assuming the predetermined emotion types are the 7 types above, i.e., N=7 and M=3, the emotion parameter matrix has 7 columns and 3 rows. Assuming that the emotion expression parameter of a certain speaker's anger is (-0.51, 0.59, 0.25) and the emotion ID of anger is agreed to be 1, corresponding to the 1st column of the emotion parameter matrix, then the value in the 1st column, 1st row of the emotion parameter matrix is -0.51, the value in the 1st column, 2nd row is 0.59, and the value in the 1st column, 3rd row is 0.25; the values in the other columns may all take a default fixed value, such as 0.
In practical applications, the user may select the emotion type at the same time as selecting the emotion expression parameter. Alternatively, the emotion IDs of the emotion types may be configured together with the value range of the emotion expression parameter corresponding to each emotion ID (the emotion expression parameter corresponding to each emotion type may take values within an agreed range), so that after the emotion expression parameter is selected, the emotion ID (i.e., the emotion type) corresponding to it can be determined automatically from its value.
In the above embodiment, alignment of text features and acoustic features is achieved through a multi-head attention mechanism comprising data splitting, matrix multiplication, normalized exponential function processing and concatenation, and at the same time effective fusion of the emotion features (namely the emotion expression parameter, or the emotion expression parameter plus the emotion identification representing the emotion type) with the text features is achieved. While speech with finer emotion granularity is synthesized, the finally obtained acoustic features are ensured to be clearer, and cases such as extra words, missing words, phoneme errors and abnormal tones are fewer, i.e., the error rate is lower.
Fig. 2 shows an exemplary implementation of step S103. As shown in fig. 2, an exemplary process flow of step S103 may include:
Step S200, acquiring the emotion expression parameter input by the user, and generating an emotion description matrix M using the emotion expression parameter, the emotion identification corresponding to the emotion expression parameter, and the basic weight parameter obtained by pre-training.
Step S201, slicing the tensor data P (i.e., the first synthesized feature) output by the text encoding in the channel dimension to obtain tensor data K (i.e., the first sub-feature serving as the first part above) and tensor data V (i.e., the second sub-feature serving as the second part above). To ensure that the tensor data V can later be spliced with the emotion description matrix M, if the tensor data P has a height of B, a width of N and 480 channels, the sliced tensor data K has a height of B, a width of N and 256 channels, and the tensor data V has a height of B, a width of N and 224 channels.
Step S202, performing matrix multiplication on the tensor data K and the tensor data Q (i.e., the second synthesized feature) output by the acoustic encoding to obtain tensor data A (i.e., the fourth synthesized feature above).
Step S203, performing normalized exponential function (e.g., softmax) processing on the tensor data A to obtain tensor data A' (i.e., the aligned synthesized feature above).
Step S204, splicing the emotion description matrix M onto the tensor data V. For example, when the tensor data V has a height of B, a width of N and 224 channels, N copies of the emotion description matrix M are made so that it becomes an emotion description tensor M'; if the emotion description matrix has a height of B and a width of 32, the emotion description tensor M' has a height of B, a width of N and 32 channels. Splicing this emotion description tensor M' onto the tensor data V yields tensor data V' embedded with the emotion feature.
Step S205, performing matrix multiplication on the tensor data V' and the tensor data A' to obtain tensor data R (i.e., the fifth synthesized feature above).
Step S206, splicing the tensor data R with the tensor data Q, that is, splicing the data of the tensor data R on a preselected channel and the data of the tensor data Q on that channel in the width and height dimensions, to obtain tensor data R', which is the third synthesized feature.
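Purely as an illustration of the data flow in fig. 2, the following numpy sketch walks through steps S200 to S206 with the example sizes given above (480 = 256 + 224 channels, a 32-channel emotion description matrix). The batch size, text length, number of acoustic frames, the orientation of the matrix products and the use of channel-wise concatenation in step S206 are assumptions made so the operations are well defined; they are not taken from the patent.

```python
# Shape-level sketch of steps S200-S206; shapes and axes are assumptions.
import numpy as np

B, N, T = 2, 40, 120                  # batch, text length, number of acoustic frames (assumed)
P = np.random.randn(B, N, 480)        # first synthesized feature (text encoding output)
Q = np.random.randn(B, T, 256)        # second synthesized feature (acoustic encoding output)
M = np.random.randn(B, 32)            # emotion description matrix from step S200

K, V = P[..., :256], P[..., 256:]     # step S201: channel-dimension split (256 / 224)
A = np.einsum("btc,bnc->btn", Q, K)   # step S202: matrix multiplication -> alignment scores
A_shift = A - A.max(axis=-1, keepdims=True)
A_prime = np.exp(A_shift) / np.exp(A_shift).sum(axis=-1, keepdims=True)  # step S203: softmax
M_prime = np.repeat(M[:, None, :], N, axis=1)          # step S204: copy M for each text position
V_prime = np.concatenate([V, M_prime], axis=-1)        # splice emotion feature into V (224 + 32)
R = np.einsum("btn,bnc->btc", A_prime, V_prime)        # step S205: weighted sum -> fifth feature
R_prime = np.concatenate([R, Q], axis=-1)              # step S206: splice with Q -> third feature
```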
Taking the PAD model above as an example, step S200 may be implemented by the following formula (1):
s_pad = s * PAD_weigh * Embedding_weight (1)
where * represents matrix multiplication, s_pad represents the emotion description matrix, s represents the one-hot vector of the emotion ID, PAD_weigh represents the emotion expression matrix obtained from the emotion expression parameter, and Embedding_weight represents the basic weight parameter, which can be obtained through training. For example, PAD_weigh is a 7-row, 3-column emotion expression matrix, and Embedding_weight may be a 3-row, 32-column matrix. During training, the initial value of Embedding_weight may be determined by random initialization.
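A small numpy sketch of equation (1) under the dimensions stated above (a 7-row, 3-column PAD_weigh matrix and a 3-row, 32-column Embedding_weight matrix) follows; the placeholder values, including the randomly initialized Embedding_weight, are for illustration only and are not trained parameters.

```python
# Sketch of equation (1): s_pad = s * PAD_weigh * Embedding_weight
import numpy as np

num_emotions, pad_dims, embed_dims = 7, 3, 32

s = np.zeros((1, num_emotions))            # one-hot vector of the emotion ID
s[0, 0] = 1.0                              # e.g. emotion ID 1 (anger) in the first position

PAD_weigh = np.zeros((num_emotions, pad_dims))
PAD_weigh[0] = [-0.51, 0.59, 0.25]         # PAD values for the selected emotion; other rows default to 0

Embedding_weight = np.random.randn(pad_dims, embed_dims)  # basic weight parameter (randomly initialized here)

s_pad = s @ PAD_weigh @ Embedding_weight   # emotion description matrix, shape (1, 32)
```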
It should be noted that fig. 2 is only an example, and the alignment process in step S103 in the embodiment of the disclosure may be implemented in other manners. In practical applications, step S200 and step S204 may be performed in parallel with step S201 to step S203, or may be performed before step S201 to step S203, and the specific execution order is not limited.
In at least some embodiments, the speech synthesis of the embodiments of the present disclosure may adopt an autoregressive mode, in which whether to stop the speech synthesis of a text is determined by a stop synthesis flag. Specifically, when the stop synthesis flag indicates that the speech synthesis of the first text should stop, an acoustic feature sequence of the first text is generated, the acoustic feature sequence including all second acoustic features of the first text; or, when the stop synthesis flag indicates that the speech synthesis of the first text should continue, the first acoustic feature is reset with the currently obtained second acoustic feature, and the acoustic encoding of step S102, the alignment processing of step S103 and the acoustic decoding of step S104 are repeated to obtain the next second acoustic feature of the first text. That is, speech synthesis can be performed cyclically for a text, with each frame of acoustic features (e.g., a mel amplitude spectrum) depending on the result synthesized for the previous frame, so that end-to-end, sequence-to-sequence speech synthesis is realized. In this way, while speech with finer emotion granularity is synthesized, the finally obtained acoustic features are clearer and cases such as extra words, missing words, phoneme errors and abnormal tones are fewer, that is, the error rate is lower; and the speech with finer emotion granularity can reflect not only the concentration and gradual change of a specific emotion, but also the differences between different speakers when expressing the same emotion in speech.
In the above embodiment, when the autoregressive mode is adopted for speech synthesis of a text, the value of the first acoustic feature in the first synthesis round is a default initial value; in the second and subsequent rounds, the first acoustic feature is the second acoustic feature of the first text obtained in the previous round.
In practical applications, the speech synthesis method of the embodiments of the present disclosure may be implemented by pre-training a speech synthesis model. The training method of the speech synthesis model is described below.
FIG. 3 is a flow chart of a method of training a speech synthesis model in an embodiment of the present disclosure. As shown in fig. 3, a training method of a speech synthesis model in an embodiment of the disclosure may include:
Step S301, setting the speech synthesis parameter in the speech synthesis model to its current value, wherein the speech synthesis parameter comprises at least one of the following:
Step S302, performing speech synthesis of the speech synthesis model by using the second text and the real acoustic features thereof as training samples to obtain predicted acoustic features of the second text, wherein the speech synthesis of the speech synthesis model comprises text coding, acoustic coding, alignment processing and acoustic decoding which are sequentially performed;
Step S303, adjusting the value of the speech synthesis parameter according to the alignment training feature generated by the alignment processing and the real and predicted acoustic features of the second text.
According to the training method of the speech synthesis model, a speech synthesis model can be obtained through training, and the alignment processing part in the speech synthesis model can fuse preselected emotion expression parameters into text features while alignment of acoustic features and text features is achieved, so that speech capable of reflecting emotion of corresponding degree is obtained.
Fig. 4 shows an exemplary flow of step S303. In at least some embodiments, step S303 may include: step S401, determining a first loss value according to the predicted acoustic characteristics and the real acoustic characteristics of the second text; step S402, determining a second loss value according to alignment training features generated by the alignment processing; step S403, determining an updated value of the speech synthesis parameter based on one or both of the first loss value and the second loss value. In this example, a speech synthesis model for synthesizing speech with a corresponding degree of emotion through emotion expression parameters can be trained by two loss values.
In at least some embodiments, the alignment training feature may be obtained from at least a portion of the first training feature and the second training feature via matrix multiplication and normalized exponential function processing. For example, this can be achieved by step a1 in the above speech synthesis method. In this embodiment, step S402 may include: step b1, performing matrix multiplication on a preset modulation matrix and the alignment training feature; step b2, calculating the absolute values of the elements in the result of the matrix multiplication; and step b3, calculating the mean of the absolute values of all elements in the result of the matrix multiplication to obtain the second loss value. The second loss value is used to impose a monotonic constraint on the alignment curve between the first training feature and the second training feature, where the first training feature is obtained by text-encoding the second text and the second training feature is obtained by acoustically encoding the real acoustic features of the second text. In this embodiment, a monotonic constraint loss is introduced to limit the alignment curve generated during training, thereby enhancing the stability of the speech synthesis model and solving the word-skipping problem in speech synthesis.
In at least some embodiments, in order to obtain a speech synthesis model that supports the autoregressive mode, a predicted stop synthesis marker vector may also be obtained at the same time as the predicted acoustic features of the second text. In this embodiment, as shown in fig. 5, step S303 may include: step S401, determining a first loss value according to the predicted acoustic features and the real acoustic features of the second text; step S402, determining a second loss value according to the alignment training feature generated by the alignment processing; step S503, determining a third loss value according to the predicted stop synthesis marker vector and a real voiced/unvoiced marker vector obtained in advance; and step S504, determining an updated value of the speech synthesis parameter based on the first loss value, the second loss value and the third loss value. In this example, a speech synthesis model supporting the autoregressive mode may be trained, so that clearer acoustic features and a lower error rate can be achieved when the above speech synthesis method is implemented based on the speech synthesis model.
In some examples, step S504 may include: firstly, carrying out weighted summation on a first loss value, a second loss value and a third loss value to obtain a total loss value; secondly, determining gradient values of all parameters in the voice synthesis parameters according to the total loss value and the current value of all parameters in the voice synthesis parameters; finally, calculating the updated value of the corresponding parameter in the voice synthesis parameters by using the gradient value of each parameter in the voice synthesis parameters. Therefore, the training of the voice synthesis model can be realized by utilizing gradient feedback, the voice synthesis model with higher precision is obtained, and further, the clearer acoustic characteristics and lower error rate can be obtained when the voice synthesis model is used for voice synthesis, and the voice reflecting the emotion with the degree corresponding to the emotion expression parameter input by the user can be synthesized.
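The following torch sketch illustrates one possible form of such a training step under the description above; the loss weights w1, w2, w3, the use of binary cross-entropy with logits for the stop flag and the plain optimizer interface are assumptions, not details taken from the patent.

```python
# Illustrative training-step sketch: weighted sum of the three losses, gradient
# back-propagation, parameter update. The tensors are assumed to come from the
# model's forward pass so that backward() reaches the speech synthesis parameters.
import torch

def training_step(optimizer, pred_mel, real_mel, alignment, modulation,
                  pred_stop, real_stop, w1=1.0, w2=1.0, w3=1.0):
    loss1 = torch.nn.functional.l1_loss(pred_mel, real_mel)            # first loss value (L1)
    loss2 = torch.abs(alignment @ modulation).mean()                   # second loss value (monotonic constraint)
    loss3 = torch.nn.functional.binary_cross_entropy_with_logits(
        pred_stop, real_stop)                                          # third loss value (stop flag)
    total = w1 * loss1 + w2 * loss2 + w3 * loss3                       # weighted total loss
    optimizer.zero_grad()
    total.backward()      # gradient back-propagation
    optimizer.step()      # update the speech synthesis parameters
    return total.item()
```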
In the embodiment of the present disclosure, step S401 may be implemented by various suitable algorithms. In some examples, step S401 may include: calculating the absolute-value deviation of the predicted acoustic features from the real acoustic features to obtain the first loss value. Specifically, the calculation of the first loss value in step S401 may be implemented with an L1-norm loss function. The L1-norm loss function may also be referred to as least absolute deviation or least absolute error; it minimizes the sum of the absolute differences between the target values and the estimated values.
In the embodiment of the present disclosure, step S503 may be implemented by various suitable algorithms. In some examples, step S503 may include: calculating a two-class cross-entropy loss for the predicted stop synthesis marker vector and the pre-obtained real voiced/unvoiced marker vector to obtain the third loss value.
In the embodiment of the present disclosure, the execution of step S302 is the same as in the speech synthesis method above, except that in the training method the second text and its corresponding real acoustic features are used as the training sample. Specifically, step S302 may include: first, performing text encoding on the second text to obtain a first training feature; second, performing acoustic encoding on the real acoustic features of the second text to obtain a second training feature; third, performing alignment processing on the first training feature, the second training feature and the preselected emotion expression parameter to obtain a third training feature of the second text; and finally, performing acoustic decoding on the third training feature to obtain the predicted acoustic features of the second text and the predicted stop synthesis marker vector. In this way, the speech synthesis model obtained by training can synthesize speech reflecting emotion to the degree desired by the user.
The following exemplarily describes a training speech synthesis model and speech synthesis using the speech synthesis model according to the embodiments of the present disclosure.
Fig. 6 shows a system block diagram of the speech synthesis model when speech synthesis is performed.
As shown in fig. 6, the speech synthesis model of an embodiment of the present disclosure may include a text preprocessing module (not shown), a text encoding network, an audio preprocessing module (not shown), an acoustic encoding network, a constrained attention module, and an acoustic decoding network.
The text preprocessing module may be used to perform pinyin/prosody labeling on a text, extract phonemes and vectorize the phonemes to obtain preliminary text feature data. The text encoding network may be used to encode the text feature data produced by the text preprocessing module to obtain the above first synthesized feature or first training feature (which is also a text feature). The audio preprocessing module may be used to extract acoustic features (e.g., the real mel amplitude spectrum, the initial mel amplitude spectrum and the last-frame mel amplitude spectrum shown in fig. 6) from an audio file. The acoustic encoding network may be used to encode the acoustic features obtained by the audio preprocessing module for the audio file to obtain the above second synthesized feature or second training feature (which is also an acoustic feature). The constrained attention module may be used to perform the alignment processing on the first synthesized feature (or first training feature) and the second synthesized feature (or second training feature) embedded with a speaker ID to obtain the above third synthesized feature (or third training feature). The acoustic decoding network may be used to decode the third synthesized feature or third training feature to obtain acoustic features (e.g., the next-frame mel amplitude spectrum shown in fig. 6).
In some examples, the acoustic encoding network, the text encoding network, and the acoustic decoding network may each be implemented by a convolutional neural network.
In one example, the acoustic encoding network may be a first convolutional neural network comprising 2 fully-connected layers and 4 different convolutional layers, with the corresponding acoustic encoding parameters comprising parameters of the 2 fully-connected layers and the 4 convolutional layers. Specifically, the acoustic encoding parameters may include: a first weight parameter for defining a first fully connected layer, a second weight parameter for defining a second fully connected layer, a third weight parameter for defining a first convolutional layer, a fourth weight parameter for defining a second convolutional layer, a fifth weight parameter for defining a third convolutional layer, and a sixth weight parameter for defining a fourth convolutional layer. Here, the 4 convolution layers in the acoustic coding network may be the same or different, and the same convolution layer means that the weight parameter, the input parameter (e.g., the dimension of the input feature data, etc.), and the output parameter (e.g., the dimension of the output feature data, etc.) of the convolution layer are the same.
In one example, the text encoding network may be a convolutional neural network, which may include 2 fully-connected layers and 4 identical convolutional layers. Accordingly, the text encoding parameters include the parameters of the 2 fully-connected layers and the 4 convolutional layers. Specifically, the text encoding parameters may include: a first weight parameter for defining the first fully connected layer, a second weight parameter for defining the second fully connected layer, and a third weight parameter for defining the convolutional layers. Here, the convolutional layers being identical means that the weight parameters, the input parameters (e.g., the dimensions of the input feature data) and the output parameters (e.g., the dimensions of the output feature data) of the convolutional layers are the same.
In an example, the acoustic decoding network may be a convolutional neural network comprising 2 fully-connected layers and 4 different convolutional layers, with the corresponding acoustic decoding parameters comprising the parameters of the 2 fully-connected layers and the 4 convolutional layers. The specific details are the same as for the acoustic encoding network and will not be repeated.
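As a loose sketch of the "2 fully-connected layers plus 4 convolutional layers" structure described for these networks, the following torch module is illustrative only; the channel sizes, kernel size, layer ordering and the absence of activations are assumptions, since the text only fixes the layer counts.

```python
# Hypothetical encoder/decoder skeleton with 2 fully-connected and 4 convolutional layers.
import torch.nn as nn

class EncoderSketch(nn.Module):
    def __init__(self, in_dim=256, hidden=480):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        # four convolutional layers; identical layers match the text encoding network,
        # the acoustic encoding/decoding networks may instead use four different ones
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden, hidden, kernel_size=3, padding=1) for _ in range(4)]
        )

    def forward(self, x):                            # x: (batch, length, in_dim)
        x = self.fc2(self.fc1(x)).transpose(1, 2)    # -> (batch, hidden, length) for Conv1d
        for conv in self.convs:
            x = conv(x)
        return x.transpose(1, 2)                     # -> (batch, length, hidden)
```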
In some examples, the constrained attention module may employ a multi-head attention mechanism to align the acoustic features output by the acoustic encoding network with the text features output by the text encoding network. In particular, the multi-head attention mechanism may include: channel-dimension splitting, matrix multiplication, softmax processing, matrix multiplication and concatenation. For the specific process, refer to the exemplary flow shown in fig. 2 above, which will not be repeated. The modulation matrix used during training in the constrained attention module may be predetermined from the text length and the mel feature sequence length. In this example, to ensure that the attention module satisfies the monotonic constraint, the modulation matrix is a matrix that has values only on the diagonal.
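One possible way to pre-compute such a modulation matrix from the text length and the mel feature sequence length is sketched below; the band width around the diagonal and the constant value 1.0 are assumptions for illustration.

```python
# Hypothetical construction of a text-length by mel-length modulation matrix with
# non-zero values only on (a narrow band around) its diagonal.
import numpy as np

def modulation_matrix(text_len, mel_len, band=1):
    m = np.zeros((text_len, mel_len))
    for i in range(text_len):
        j = int(round(i * (mel_len - 1) / max(text_len - 1, 1)))  # diagonal position
        m[i, max(0, j - band):j + band + 1] = 1.0
    return m

gts_r = modulation_matrix(text_len=40, mel_len=120)
```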
In the example of fig. 6, an exemplary flow of training a speech synthesis model may include:
Step 1, an emotion audio library is constructed for training; it mainly comprises audio files corresponding to 7 different emotion types.
For example, 7 emotion IDs are configured, each representing one emotion type; audio files corresponding to the 7 emotion IDs are collected, and the emotion expression parameter corresponding to each audio file is known. The emotion expression parameters may be PAD data as described above, each with three dimensions, namely a P value, an A value and a D value, thereby forming the emotion audio library.
Step 2, extracting mel amplitude spectrum coefficients from the audio files in the emotion audio library as acoustic features.
Step 3, generating a voiced/unvoiced marker sequence for calculating the loss value of the stop synthesis flag in a subsequent step.
Specifically, a batch of data needs to be truncated or padded to a uniform length during training. To enable the speech synthesis model to learn when to stop synthesis, voiced frame positions are marked 1 and unvoiced frames (the padding part) are marked 0, thereby generating the voiced/unvoiced marker sequence. Here, the length of each data item, i.e., of the text and the acoustic features, is preset; the purpose of the truncation or padding is to cut or fill the acquired text and acoustic features whenever their lengths do not meet the preset standard, so that the lengths meet the preset requirement and the speech synthesis model can learn the stop synthesis flag.
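A minimal sketch of this length normalization and marking step might look as follows; the fixed target length, the zero padding and the numpy representation are assumptions for illustration.

```python
# Truncate or zero-pad acoustic features to a fixed length and build the 0/1
# voiced/unvoiced marker sequence (1 = real frame, 0 = padding).
import numpy as np

def pad_and_mark(mel_frames, target_len):
    """mel_frames: (num_frames, mel_dim) -> padded features and marker sequence."""
    num_frames, mel_dim = mel_frames.shape
    padded = np.zeros((target_len, mel_dim), dtype=mel_frames.dtype)
    marks = np.zeros(target_len, dtype=np.float32)
    n = min(num_frames, target_len)    # truncate if too long
    padded[:n] = mel_frames[:n]
    marks[:n] = 1.0                    # voiced frames marked 1, padding stays 0
    return padded, marks
```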
Step 4, constructing a text library, wherein the text library comprises the texts corresponding to the audio files in the emotion audio library.
Step 5, performing pinyin and prosody annotation on the texts in the text library.
Step 6, vectorizing the annotated result.
Step 7, the text encoding network encodes the vectorized result to obtain a first training feature H.
Step 8, the acoustic encoding network encodes the acoustic features extracted in step 2 to obtain a second training feature L.
Step 9, the constrained attention module performs alignment processing on the first training feature H and the second training feature L using a multi-head attention mechanism, and during the alignment processing embeds the emotion expression parameter and the emotion ID of the corresponding audio file into the second sub-feature H2 obtained by splitting the first training feature H.
In this step, the specific procedure of the alignment processing performed by the constrained attention module is the same as in the example of fig. 2. Specifically, the constrained attention module first splits the first training feature H in the channel dimension into a first sub-feature H1 and a second sub-feature H2, and embeds the emotion expression parameter and the emotion ID into the second sub-feature H2. It then performs matrix multiplication on the first sub-feature H1 and the second training feature L to obtain a tensor A, applies softmax to the tensor A, multiplies the result with the second sub-feature H2 embedded with the emotion expression parameter and the emotion ID to obtain a tensor T, and finally splices the tensor T with the second training feature L to obtain a decoding vector Y.
Step 10, calculating a loss value (i.e., the second loss value above) for the constrained attention module.
Because the text-to-speech synthesis process is sequential, this step introduces a monotonic constraint loss and limits the alignment curve generated during training, so as to increase the stability of the speech synthesis model and solve the word-skipping problem in speech synthesis. Here, the alignment curve generated by training (also referred to as an attention curve, a curve formed between text features and acoustic features) is a monotonic curve, and therefore a modulation matrix that has values only on the diagonal of the image is pre-calculated based on the text length and the mel feature sequence length.
In this step, the loss value loss_att of the constrained attention module (i.e., the second loss value above) is calculated by a limited attention loss calculation function, which may be defined as the following equation (2):
loss_att=abs(A*gts_r).mean() (2)
where gts_r represents the modulation matrix pre-calculated from the text length and the mel feature sequence length, A is the tensor A above, and mean() averages the preceding abs(A*gts_r) to obtain the second loss value loss_att. During training, loss_att is minimized in order to constrain the attention.
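Read literally, with "*" taken as matrix multiplication as in equation (1), the constrained-attention loss can be sketched as below; the tensor shapes, and whether the product is a matrix or an element-wise product, are assumptions.

```python
# Sketch of equation (2): loss_att = abs(A * gts_r).mean()
import numpy as np

def constrained_attention_loss(A, gts_r):
    """A: alignment tensor, e.g. (batch, mel_len, text_len);
    gts_r: modulation matrix, e.g. (text_len, mel_len)."""
    return np.abs(A @ gts_r).mean()
```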
Step 11, the acoustic decoding network decodes the decoding vector Y output by the constrained attention module, generating a mel amplitude spectrum (i.e., the predicted acoustic features above) and a stop synthesis stop_token vector.
Step 12, calculating a two-class cross-entropy loss using the generated stop_token vector and the voiced/unvoiced marker sequence above, to obtain the third loss value.
Step 13, calculating the L1 loss using the mel amplitude spectrum generated in step 11 and the real mel amplitude spectrum corresponding to the audio file in the emotion audio library, to obtain the first loss value.
Step 14, carrying out weighted summation on the first loss value, the second loss value and the third loss value to obtain a total loss value, performing gradient back-propagation for the three loss values using the total loss value, updating the parameters of the text encoding network, the acoustic encoding network and the acoustic decoding network, and then returning to step 7 to re-execute the flow, iterating until the gradient of the total loss value is close to zero or falls below a preset value.
Here, performing gradient back-propagation for the three loss values using the total loss value may include: determining three gradients by taking partial derivatives of the total loss value with respect to the parameters corresponding to the first loss value, the second loss value and the third loss value respectively, and updating each parameter value along the negative direction of its gradient.
In the example of fig. 6, the exemplary flow of speech synthesis is substantially the same as the flow of training the speech synthesis model described above, except that the speech synthesis is performed in an autoregressive mode. That is, the first acoustic feature is first set to an initial mel amplitude spectrum frame (for example, assigned a value of 0) and processed by the acoustic encoding network to obtain a second synthesized feature; the text sentence to be synthesized is preprocessed by pinyin and prosody labeling, vectorization and the like, and then processed by the text encoding network to obtain the first synthesized feature corresponding to the text sentence. The second synthesized feature output by the acoustic encoding network, the first synthesized feature output by the text encoding network and the preselected emotion expression parameter are fed into the constrained attention module, which aligns the first and second synthesized features and embeds the emotion expression parameter and its corresponding emotion ID into the first synthesized feature to obtain a third synthesized feature. The third synthesized feature is processed by the acoustic decoding network to obtain the mel amplitude spectrum generated in this round. Whether to stop synthesis is judged via the stop synthesis flag stop_token; if synthesis should continue, the next round of synthesis proceeds, in which the mel amplitude spectrum generated in the current round is fed into the acoustic encoding network as the first acoustic feature and the above process is repeated to generate the next frame of mel amplitude spectrum, again judging via stop_token whether to stop. This loop is executed until stop_token indicates that synthesis should stop, and the mel amplitude spectrum frames output from the first round to the last round are combined to obtain the mel amplitude spectrum sequence corresponding to the text sentence. Here, when judging whether to stop synthesis, a probability value may be calculated from the currently generated mel amplitude spectrum and the like; synthesis is stopped if the probability value is greater than a predetermined threshold (e.g., 0.5) and continues if the probability value is less than or equal to the threshold.
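A hedged sketch of this autoregressive loop is given below; the acoustic_encode, align and acoustic_decode callables, the mel dimension, the frame limit and the handling of the 0.5 threshold are illustrative assumptions, not the patented implementation.

```python
# Autoregressive synthesis loop: each round feeds the previous mel frame back in
# and stops when the stop_token probability exceeds the threshold.
import numpy as np

def synthesize(text_feature, emotion_params, acoustic_encode, align, acoustic_decode,
               mel_dim=80, max_frames=1000, stop_threshold=0.5):
    frames = []
    prev_mel = np.zeros(mel_dim)                          # initial mel frame (assigned 0)
    for _ in range(max_frames):
        q = acoustic_encode(prev_mel)                     # second synthesized feature
        aligned = align(text_feature, q, emotion_params)  # third synthesized feature
        mel, stop_prob = acoustic_decode(aligned)         # next mel frame + stop probability
        frames.append(mel)
        if stop_prob > stop_threshold:                    # stop synthesis flag says stop
            break
        prev_mel = mel                                    # feed this frame back (autoregression)
    return np.stack(frames)                               # mel amplitude spectrum sequence
```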
In the above example, the emotion expression parameters and emotion IDs capable of representing emotion features are embedded into the text features, and a speech synthesis model with finer-grained control over emotion expression can be obtained by training in combination with a multi-head attention mechanism. Speech synthesis can therefore be performed with preselected emotion expression parameters using this model, so that the synthesized speech reflects not only the differences between speakers expressing the same utterance but also gradual changes in emotion intensity, without establishing a separate speech synthesis model for each speaker or each emotion type. In addition, a monotone constraint loss is introduced when training the speech synthesis model to limit the alignment curve generated during training, which enhances the stability of the model and mitigates the problem of skipped words in speech synthesis. Furthermore, the training of the speech synthesis model is realized by gradient back-propagation, yielding a model with higher precision, clearer acoustic features and a lower error rate during speech synthesis.
Exemplary apparatus
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present disclosure. As shown in fig. 7, a speech synthesis apparatus in an embodiment of the present disclosure may include:
A text encoding unit 71 configured to perform text encoding on a first text to be synthesized to obtain a first synthesized feature;
An acoustic encoding unit 72 configured to acoustically encode the first acoustic feature to obtain a second synthesized feature;
An alignment processing unit 73 configured to perform alignment processing on the first synthesized feature, the second synthesized feature, and the preselected emotion expression parameter to obtain a third synthesized feature;
An acoustic decoding unit 74 is configured to acoustically decode the third synthesized feature to obtain a second acoustic feature of the first text.
In some examples, the emotion expression parameter may have a plurality of predetermined dimensions, and the value in each of the predetermined dimensions is within a predetermined range.
In some examples, the alignment processing unit 73 may include: a first operation module 731 configured to perform matrix multiplication operation and normalized exponential function processing on the first portion of the first synthesized feature and the second synthesized feature to obtain an aligned synthesized feature; an embedding module 732 configured to embed the emotion expression parameter in a second portion of the first synthesized feature, the second portion being a portion of the first synthesized feature other than the first portion; and a second operation module 733 configured to perform matrix multiplication operation and splicing processing on the second portion of the first synthesized feature embedded with the emotion expression parameter, the aligned synthesized feature, and the second synthesized feature, so as to obtain the third synthesized feature.
In the above example, the first operation module 731 may include: a splitting sub-module configured to split the first synthesized feature in a channel dimension to obtain a first sub-feature as the first portion and a second sub-feature as the second portion; a first matrix multiplication sub-module configured to perform matrix multiplication operation on the first sub-feature and the second synthesized feature to obtain a fourth synthesized feature; and a normalization processing sub-module configured to perform normalized exponential function processing on the fourth synthesized feature to obtain the aligned synthesized feature.
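A rough sketch of the first operation module follows, with assumed tensor shapes (the patent does not fix the dimension layout):

```python
import torch
import torch.nn.functional as F

def first_operation(first_feat, second_feat):
    """first_feat: (B, T_text, 2C) first synthesized feature; second_feat: (B, T_acoustic, C).
    Returns softmax-normalized alignment weights and the second sub-feature."""
    first_sub, second_sub = first_feat.chunk(2, dim=-1)                  # split along the channel dimension
    fourth_feat = torch.matmul(second_feat, first_sub.transpose(1, 2))   # (B, T_acoustic, T_text)
    aligned = F.softmax(fourth_feat, dim=-1)                             # normalized exponential function
    return aligned, second_sub
```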
In the above example, the embedding module 732 may be specifically configured to embed the emotion expression parameter in the second sub-feature.
In the above example, embedding module 732 may comprise: the generation sub-module is configured to generate an emotion description matrix based on the pre-obtained basic weight parameters and emotion expression parameters input by a user; and the first splicing sub-module is configured to splice the emotion description matrix with the second part of the first synthesized feature.
In the above example, the generating sub-module may be specifically configured to generate the emotion description matrix by: generating a one-hot encoding vector corresponding to the emotion identification of the emotion expression parameter, wherein the emotion identification corresponding to the emotion expression parameter represents a preset emotion type corresponding to the emotion expression parameter; generating an emotion parameter matrix of the emotion expression parameters, wherein the emotion parameter matrix comprises M rows and N columns, the N columns are in one-to-one correspondence with N preset emotion types, and the M rows correspond to the M dimensions of the emotion expression parameters; and performing matrix multiplication operation on the emotion parameter matrix, the one-hot encoding vector of the emotion identification and the pre-obtained basic weight parameter to obtain the emotion description matrix.
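Under the assumption that the emotion expression parameter is an M-dimensional vector, the emotion identification an integer in [0, N), and the basic weight parameter an M×D matrix (D being a hypothetical embedding size), the generating sub-module might look like this sketch:

```python
import torch
import torch.nn.functional as F

def emotion_description_matrix(emotion_params, emotion_id, base_weight, num_emotions):
    """emotion_params: (M,) user-selected expression degrees; emotion_id: int in [0, num_emotions).
    base_weight: (M, D) pre-obtained basic weight parameters. Returns a (1, D) description matrix."""
    one_hot = F.one_hot(torch.tensor(emotion_id), num_classes=num_emotions).float()  # (N,)
    # Emotion parameter matrix with M rows and N columns: only the column of emotion_id is non-zero.
    param_matrix = emotion_params.unsqueeze(1) * one_hot.unsqueeze(0)                # (M, N)
    selected = torch.matmul(param_matrix, one_hot)                                   # (M,) column of the chosen emotion
    return selected.unsqueeze(0) @ base_weight                                       # (1, D) emotion description matrix
```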
In the above example, the second operation module 733 may include: a second matrix multiplication sub-module configured to perform matrix multiplication operation on the second sub-feature embedded with the emotion expression parameter and the aligned synthesized feature so as to obtain a fifth synthesized feature; and a second splicing sub-module configured to splice a portion of the fifth synthesized feature on a predetermined channel and a portion of the second synthesized feature on the predetermined channel to obtain the third synthesized feature.
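Continuing with the same hypothetical shapes, the second operation module could be sketched as:

```python
import torch

def second_operation(second_sub_with_emotion, aligned, second_feat):
    """second_sub_with_emotion: (B, T_text, C) second sub-feature with the emotion description embedded;
    aligned: (B, T_acoustic, T_text) alignment weights; second_feat: (B, T_acoustic, C)."""
    fifth_feat = torch.matmul(aligned, second_sub_with_emotion)   # (B, T_acoustic, C)
    # Splice the two features on the (assumed) channel dimension to form the third synthesized feature.
    return torch.cat([fifth_feat, second_feat], dim=-1)           # (B, T_acoustic, 2C)
```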
In some examples, the acoustic decoding unit 74 may be further configured to generate an acoustic feature sequence of the first text, including all second acoustic features of the first text, when the stop synthesis flag indicates that speech synthesis of the first text is stopped. Alternatively, the acoustic decoding unit 74 may be further configured to reset the first acoustic feature with the currently obtained second acoustic feature when the stop synthesis flag indicates that speech synthesis of the first text is to continue, so that the steps of acoustic encoding, alignment processing and acoustic decoding are repeated to obtain the next second acoustic feature of the first text.
In at least some embodiments, the above-mentioned speech synthesis apparatus may further include: a parameter setting unit 75 configured to set a speech synthesis parameter in the speech synthesis model to a current value, the speech synthesis parameter comprising at least one of a text encoding parameter, an acoustic decoding parameter, the emotion expression parameter and a basic weight parameter for refining the granularity of the emotion expression parameter; and a parameter adjustment unit 76 configured to adjust the value of the speech synthesis parameter according to the alignment training feature generated by the alignment processing performed by the alignment processing unit 73, the real acoustic feature of the second text, and the predicted acoustic feature. In these embodiments, the text encoding unit 71, the acoustic encoding unit 72, the alignment processing unit 73 and the acoustic decoding unit 74 are further configured to perform speech synthesis of the speech synthesis model, which includes text encoding, acoustic encoding, alignment processing and acoustic decoding performed in order, using a second text and its real acoustic features as training samples to obtain the predicted acoustic features of the second text.
In some examples, parameter adjustment unit 76 may include: a first determining module configured to determine a first loss value based on the predicted acoustic features and the actual acoustic features of the second text; a second determining module configured to determine a second loss value based on alignment training features generated by the alignment process; and a third determination module configured to determine an updated value of the speech synthesis parameter based at least on the first loss value and the second loss value.
In some examples, the alignment training features used by the parameter adjustment unit 76 are obtained by the alignment processing unit 73 performing matrix multiplication operation and normalized exponential function processing on at least a portion of the first training features and the second training features. In this example, the second determination module may include: a first calculation sub-module configured to perform matrix multiplication operation on a preset modulation matrix and the alignment training features; a second calculation sub-module configured to calculate the absolute value of each element in the result of the matrix multiplication operation of the first calculation sub-module; and a third calculation sub-module configured to calculate the average of the absolute values of the elements obtained by the second calculation sub-module to obtain the second loss value. The second loss value is used for imposing a monotone constraint restriction on the alignment curve of a first training feature and a second training feature, where the first training feature is obtained by performing text encoding on the second text, and the second training feature is obtained by performing acoustic encoding on the real acoustic feature of the second text.
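An illustrative rendering of this monotone constraint loss follows; the shape of the preset modulation matrix is an assumption, since the patent only states that a preset modulation matrix is multiplied with the alignment training features:

```python
import torch

def monotonic_constraint_loss(alignment, modulation):
    """alignment: (T_acoustic, T_text) alignment training feature; modulation: a preset
    matrix shaped so that modulation @ alignment is defined (e.g., (K, T_acoustic))."""
    product = torch.matmul(modulation, alignment)   # first calculation sub-module: matrix multiplication
    return product.abs().mean()                     # second and third sub-modules: absolute values, then their mean
```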
In some examples, the acoustic decoding unit 74 obtains a predicted stop synthesis flag vector along with the predicted acoustic features of the second text. In this example, the parameter adjustment unit 76 may include, in addition to the first determining module and the second determining module described above, a fourth determining module configured to determine a third loss value based on the predicted stop synthesis flag vector and a pre-obtained real stop synthesis flag vector; the third determining module may be further configured to determine the updated value of the speech synthesis parameter based on the first loss value, the second loss value and the third loss value.
In the above example, the third determining module may include: a weighted sum sub-module configured to weight sum the first, second, and third loss values to obtain a total loss value; the gradient submodule is configured to determine gradient values of all parameters in the voice synthesis parameters according to the total loss value and the current value of all parameters in the voice synthesis parameters; and the updating sub-module is configured to calculate an updated value of a corresponding parameter in the voice synthesis parameters by utilizing the gradient value of each parameter in the voice synthesis parameters.
In the above example, the first determining module may be specifically configured to calculate the absolute-value deviation (L1 loss) between the predicted acoustic features and the real acoustic features to obtain the first loss value.
In the above example, the fourth determining module may be specifically configured to calculate a binary cross-entropy loss between the predicted stop synthesis flag vector and the pre-obtained real stop synthesis flag vector to obtain the third loss value.
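For concreteness, the first and third loss values might be computed as in the following sketch; the tensor names are placeholders, not identifiers from the patent:

```python
import torch.nn.functional as F

def mel_and_stop_losses(pred_mel, true_mel, pred_stop_logits, true_stop):
    """pred_mel / true_mel: predicted and real Mel amplitude spectra;
    pred_stop_logits / true_stop: predicted stop-synthesis logits and the real stop flag vector (float 0/1)."""
    first_loss = F.l1_loss(pred_mel, true_mel)                                     # absolute-value deviation
    third_loss = F.binary_cross_entropy_with_logits(pred_stop_logits, true_stop)   # two-class cross entropy
    return first_loss, third_loss
```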
In the above example, the text encoding unit 71 may be further configured to perform text encoding on the second text to obtain a first training feature; the acoustic encoding unit 72 may be further configured to acoustically encode the real acoustic features of the second text to obtain second training features; the alignment processing unit 73 may be further configured to perform alignment processing on the first training feature, the second training feature and the preselected emotion expression parameter to obtain a third training feature of the second text; and the acoustic decoding unit 74 may be further configured to acoustically decode the third training feature to obtain the predicted acoustic features of the second text and the predicted stop synthesis flag vector.
Exemplary electronic device
Fig. 8 illustrates a block diagram of an electronic device according to an embodiment of the disclosure.
As shown in fig. 8, the electronic device 80 includes one or more processors 81 and memory 82.
Processor 81 may be a Central Processing Unit (CPU) or other form of processing unit having data processing and/or instruction execution capabilities and may control other components in electronic device 80 to perform desired functions.
Memory 82 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory (cache). The non-volatile memory may include, for example, Read-Only Memory (ROM), hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 81 to implement the speech synthesis method and/or training method of the speech synthesis model of the various embodiments of the disclosure described above and/or other desired functions.
In one example, the electronic device 80 may further include: an input device 83 and an output device 84, which are interconnected by a bus system and/or other forms of connection mechanisms (not shown). The input device 83 may be, for example, a microphone or an array of microphones. In addition, the input device 83 may include, for example, a keyboard, a mouse, and the like. The output device 84 can output various information to the outside. The output device 84 may include, for example, a display, speakers, a printer, and a communication network and remote output devices connected thereto, etc.
Of course, only some of the components of the electronic device 80 relevant to the present disclosure are shown in fig. 8, with components such as buses, input/output interfaces, etc. omitted for simplicity. In addition, the electronic device 80 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the methods and apparatus described above, embodiments of the application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the speech synthesis method and/or the training method of the speech synthesis model according to the various embodiments of the application described in the "exemplary methods" section of this specification.
The computer program product may include program code for performing operations of embodiments of the present application, written in any combination of one or more programming languages, including object-oriented programming languages such as Java or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium, having stored thereon computer program instructions, which when executed by a processor, cause the processor to perform the steps in the speech synthesis method and/or the training method of the speech synthesis model according to the various embodiments of the present application described in the "exemplary methods" section above of the present specification.
The computer readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may include, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The basic principles of the present application have been described above in connection with specific embodiments, but it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not intended to be limiting, and these advantages, benefits, effects, etc. are not to be construed as necessarily possessed by the various embodiments of the application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not necessarily limited to practice with the above described specific details.
The block diagrams of the devices, apparatuses, equipment and systems referred to in the present application are only illustrative examples and are not intended to require or imply that the connections, arrangements and configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment and systems may be connected, arranged or configured in any manner. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including but not limited to" and are used interchangeably therewith. The terms "or" and "and" as used herein refer to, and are used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as, but not limited to".
It is also noted that in the apparatus, devices and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered as equivalent aspects of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (8)

1. A method of speech synthesis, comprising:
Performing text encoding on a first text to be synthesized to obtain a first synthesized feature;
Acoustically encoding the first acoustic feature to obtain a second composite feature;
aligning the first synthesized feature, the second synthesized feature and the preselected emotion expression parameter to obtain a third synthesized feature; and
Acoustically decoding the third synthesized feature to obtain a second acoustic feature of the first text,
Wherein the aligning the first synthesized feature, the second synthesized feature and the preselected emotion expression parameter to obtain a third synthesized feature includes:
performing matrix multiplication operation and normalized exponential function processing on the first part of the first synthesized feature and the second synthesized feature to obtain an aligned synthesized feature;
embedding the emotion expression parameter in a second portion of the first composite feature, the second portion being a portion of the first composite feature other than the first portion;
and performing matrix multiplication operation and splicing processing on the second part of the first synthesized feature embedded with the emotion expression parameter, the aligned synthesized feature and the second synthesized feature to obtain the third synthesized feature.
2. The method of claim 1, wherein matrix multiplying the first portion of the first composite feature with the second composite feature and normalizing an exponential function comprises:
Splitting the first composite feature in a channel dimension to obtain a first sub-feature as the first portion and a second sub-feature as the second portion;
Performing matrix multiplication operation on the first sub-feature and the second synthesized feature to obtain a fourth synthesized feature;
And carrying out normalized exponential function processing on the fourth synthesized feature to obtain the aligned synthesized feature.
3. The method of claim 1, wherein embedding the emotion expression parameter in the second portion of the first composite feature comprises:
generating an emotion description matrix based on the pre-obtained basic weight parameters and emotion expression parameters input by a user;
and splicing the emotion description matrix with the second part of the first synthesized feature.
4. A method of training a speech synthesis model, comprising:
Setting a speech synthesis parameter in a speech synthesis model to a current value, the speech synthesis parameter comprising at least one of: text encoding parameters, acoustic decoding parameters, emotion expression parameters, and basic weight parameters for refining granularity of the emotion expression parameters;
performing speech synthesis of the speech synthesis model using the second text and its actual acoustic features as training samples to obtain predicted acoustic features of the second text, the speech synthesis of the speech synthesis model including text encoding, acoustic encoding, alignment processing, and acoustic decoding performed in sequence; and
Adjusting the value of the speech synthesis parameter according to the alignment training feature, the real acoustic feature of the second text and the predicted acoustic feature generated by the alignment process,
The alignment training features are obtained by performing matrix multiplication operation and normalization exponential function processing on at least one part of the first training features and the second training features;
determining a second loss value based on alignment training features generated by the alignment process, comprising:
Performing matrix multiplication operation on a preset modulation matrix and the alignment training features;
calculating absolute values of elements in the result of the matrix multiplication operation;
Calculating the average value of the absolute values of all elements in the result of the matrix multiplication operation to obtain the second loss value;
The second loss value is used for performing monotone constraint restriction on an alignment curve of the first training feature and the second training feature, the first training feature is obtained by performing text coding on the second text, and the second training feature is obtained by performing acoustic coding on a true acoustic feature of the second text.
5. The method of claim 4, wherein adjusting the value of the speech synthesis parameter based on the alignment training feature, the true acoustic feature of the second text, and the predicted acoustic feature generated by the alignment process comprises:
Determining a first loss value according to the predicted acoustic characteristics and the real acoustic characteristics of the second text;
determining a second loss value according to alignment training features generated by the alignment process; and
An updated value of the speech synthesis parameter is determined based at least on the first loss value and the second loss value.
6. A speech synthesis apparatus comprising:
A text encoding unit configured to perform text encoding on a first text to be synthesized to obtain a first synthesized feature;
An acoustic encoding unit configured to perform acoustic encoding on the first acoustic feature to obtain a second synthesized feature;
An alignment processing unit configured to perform alignment processing on the first synthesized feature, the second synthesized feature and the pre-selected emotion expression parameter to obtain a third synthesized feature;
An acoustic decoding unit configured to acoustically decode the third synthesized feature to obtain a second acoustic feature of the first text,
Wherein the alignment processing unit includes: the first operation module is configured to perform matrix multiplication operation and normalization exponential function processing on the first part of the first synthesized feature and the second synthesized feature so as to obtain an aligned synthesized feature; an embedding module configured to embed the emotion expression parameter in a second portion of the first composite feature, the second portion being a portion of the first composite feature other than the first portion; and the second operation module is configured to perform matrix multiplication operation and splicing processing on the second part of the first synthesized feature embedded with the emotion expression parameter, the aligned synthesized feature and the second synthesized feature so as to obtain the third synthesized feature.
7. An electronic device, comprising:
one or more processors; and
A memory storing a computer program which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 5.
8. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the method of any of claims 1 to 5.
CN202110400408.8A 2021-04-14 2021-04-14 Speech synthesis method, training method and device of speech synthesis model Active CN113112987B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110400408.8A CN113112987B (en) 2021-04-14 2021-04-14 Speech synthesis method, training method and device of speech synthesis model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110400408.8A CN113112987B (en) 2021-04-14 2021-04-14 Speech synthesis method, training method and device of speech synthesis model

Publications (2)

Publication Number Publication Date
CN113112987A CN113112987A (en) 2021-07-13
CN113112987B true CN113112987B (en) 2024-05-03

Family

ID=76716778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110400408.8A Active CN113112987B (en) 2021-04-14 2021-04-14 Speech synthesis method, training method and device of speech synthesis model

Country Status (1)

Country Link
CN (1) CN113112987B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113270086B (en) 2021-07-19 2021-10-15 中国科学院自动化研究所 Voice recognition text enhancement system fusing multi-mode semantic invariance

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050228663A1 (en) * 2004-03-31 2005-10-13 Robert Boman Media production system using time alignment to scripts
US10565994B2 (en) * 2017-11-30 2020-02-18 General Electric Company Intelligent human-machine conversation framework with speech-to-text and text-to-speech

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101174272A (en) * 2007-10-26 2008-05-07 北京航空航天大学 Organization and extracting method for affection data in Chinese language text
CN102103856A (en) * 2009-12-21 2011-06-22 盛大计算机(上海)有限公司 Voice synthesis method and system
GB201212783D0 (en) * 2012-07-18 2012-08-29 Toshiba Res Europ Ltd A speech processing system
GB201405255D0 (en) * 2014-03-24 2014-05-07 Toshiba Res Europ Ltd Voice conversion
WO2016040209A1 (en) * 2014-09-11 2016-03-17 Microsoft Technology Licensing, Llc Text-to-speech with emotional content
CN106782615A (en) * 2016-12-20 2017-05-31 科大讯飞股份有限公司 Speech data emotion detection method and apparatus and system
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN111048062A (en) * 2018-10-10 2020-04-21 华为技术有限公司 Speech synthesis method and apparatus
WO2020173134A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Attention mechanism-based speech synthesis method and device
CN111128118A (en) * 2019-12-30 2020-05-08 科大讯飞股份有限公司 Speech synthesis method, related device and readable storage medium
CN112489621A (en) * 2020-11-20 2021-03-12 北京有竹居网络技术有限公司 Speech synthesis method, device, readable medium and electronic equipment
CN112489635A (en) * 2020-12-03 2021-03-12 杭州电子科技大学 Multi-mode emotion recognition method based on attention enhancement mechanism
CN112562700A (en) * 2020-12-10 2021-03-26 平安科技(深圳)有限公司 Emotional voice synthesis method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Rule-Based Chinese Emotional Speech System; 曾一鸣; 朱杰; Electronic Measurement Technology (No. 11); full text *
Research on Automatic Alignment of Long Mongolian Speech Audio and Text; 牛米佳; 飞龙; 高光来; Journal of Chinese Information Processing (No. 01); full text *

Also Published As

Publication number Publication date
CN113112987A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
JP4296231B2 (en) Voice quality editing apparatus and voice quality editing method
CN111048062A (en) Speech synthesis method and apparatus
JP2006084715A (en) Method and device for element piece set generation
CN113920977A (en) Speech synthesis model, model training method and speech synthesis method
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN111161695B (en) Song generation method and device
US20240087558A1 (en) Methods and systems for modifying speech generated by a text-to-speech synthesiser
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
CN114207706A (en) Generating acoustic sequences via neural networks using combined prosodic information
CN112309367B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112837669A (en) Voice synthesis method and device and server
CN113112987B (en) Speech synthesis method, training method and device of speech synthesis model
US10157608B2 (en) Device for predicting voice conversion model, method of predicting voice conversion model, and computer program product
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN113053353B (en) Training method and device of speech synthesis model
CN112927677B (en) Speech synthesis method and device
CN112908293A (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
KR20210035042A (en) Emotional speech synthesis method and apparatus for controlling the emotion between emotions
CN114255737B (en) Voice generation method and device and electronic equipment
CN113192482B (en) Speech synthesis method and training method, device and equipment of speech synthesis model
CN115273802A (en) Speech synthesis method, apparatus, device and storage medium
CN113744713A (en) Speech synthesis method and training method of speech synthesis model
KR102277205B1 (en) Apparatus for converting audio and method thereof
King A reading list of recent advances in speech synthesis
KR102426020B1 (en) Method and apparatus for Speech Synthesis Containing Emotional Rhymes with Scarce Speech Data of a Single Speaker

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant