CN115547289A - Voice synthesis method and device, electronic equipment and storage medium - Google Patents

Voice synthesis method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN115547289A
Authority
CN
China
Prior art keywords
text
dialogue
style
speech synthesis
conversation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211142609.3A
Other languages
Chinese (zh)
Inventor
冯小琴
迟文江
陈云琳
叶顺平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mobvoi Information Technology Co Ltd
Original Assignee
Mobvoi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mobvoi Information Technology Co Ltd filed Critical Mobvoi Information Technology Co Ltd
Priority to CN202211142609.3A priority Critical patent/CN115547289A/en
Publication of CN115547289A publication Critical patent/CN115547289A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure provides a speech synthesis method and apparatus, an electronic device and a storage medium. The method comprises: determining an original corpus for speech synthesis; extracting text dialogue expressive force from the original corpus, wherein the text dialogue expressive force comprises a dialogue intention and a dialogue style; determining a text dialogue style feature based on the text dialogue expressive force; and inputting the text dialogue style feature into a speech synthesis model and determining the dialogue speech corresponding to the original corpus based on a Mel spectrum output by the speech synthesis model.

Description

Voice synthesis method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech processing technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a storage medium.
Background
Speech synthesis technology has developed rapidly. In particular, text-to-speech (TTS) systems, which convert arbitrary text into speech signals, now produce speech whose sound quality is substantially close to a real person's pronunciation, and they are widely applied in human-machine dialogue, remote voice information services, machine reading, telecommunications, entertainment and other fields.
However, the speech output by a TTS system is still less expressive than a real person's pronunciation in aspects such as tone and speaking rate. Providing the synthesized speech with more expressive factors, and thereby improving its naturalness, remains a continuing goal.
Disclosure of Invention
The present disclosure provides a speech synthesis method, apparatus, electronic device and storage medium to at least solve the above technical problems in the prior art.
According to a first aspect of the present disclosure, there is provided a speech synthesis method, the method comprising: determining an original corpus for speech synthesis; extracting text dialogue expressive force from the original corpus, wherein the text dialogue expressive force comprises a dialogue intention and a dialogue style; determining a text dialogue style feature based on the text dialogue expressive force; and inputting the text dialogue style feature into a speech synthesis model, and determining the dialogue speech corresponding to the original corpus based on a Mel spectrum output by the speech synthesis model.
In one embodiment, the determining the original corpus for speech synthesis comprises: acquiring text information for voice synthesis; separating the dialogue and the voice-over in the text information to obtain text dialogue information and voice-over text information; and determining the text dialogue information and the voice-over text information as the original corpus.
In one embodiment, the extracting of the text dialog expressive force from the original corpus comprises: mapping the text information in the original corpus to a word vector space to obtain word vector information corresponding to the original corpus; fitting the word vector information to obtain a fitting result; extracting the text dialog expressive force from the fitting result based on a conditional random field model.
In one embodiment, the determining text conversation style characteristics based on the text conversation expressiveness comprises: determining reference dialogue style characteristics based on the reference voice corresponding to the original corpus; inputting the text dialogue expressive force into a dialogue characteristic training model to obtain text dialogue style characteristics output by the dialogue characteristic training model; adjusting parameters of the dialog feature training model based on differences between the reference dialog style feature and the text dialog style feature.
In an embodiment, the inputting the text dialogue style feature to a speech synthesis model, and determining the dialogue speech corresponding to the original corpus based on a mel frequency spectrum output by the speech synthesis model includes: taking the text information corresponding to the original corpus as the input of the voice synthesis model, converting the text information into audio information, and performing voice length alignment on the audio information to obtain a linear frequency spectrum corresponding to the text information; obtaining a Mel frequency spectrum output by the speech synthesis model based on the text dialogue style characteristics and the linear frequency spectrum; and determining the dialogue voice corresponding to the original corpus based on the Mel frequency spectrum.
According to a second aspect of the present disclosure, there is provided a speech synthesis model training method, the method comprising: generating an original corpus sample set for speech synthesis; extracting a plurality of text dialogue expressive forces from the original corpus sample set, wherein the text dialogue expressive forces comprise dialogue intention and dialogue style; determining a plurality of text dialogue style features based on the plurality of text dialogue expressive forces; and training a speech synthesis model with the plurality of text dialogue style features, so that the speech synthesis model can use the text dialogue style features for synthesizing dialogue speech.
In one embodiment, the determining a plurality of text dialog style features based on a plurality of the text dialog expressiveness comprises: acquiring multiple text dialogue expressive forces corresponding to the original corpus sample set; training a dialog feature training model with a plurality of the text dialog expressiveness, so that the dialog feature training model can use the text dialog expressiveness for obtaining text dialog style features.
According to a third aspect of the present disclosure, there is provided a speech synthesis apparatus, the apparatus comprising: the corpus determining module is used for determining an original corpus used for voice synthesis; the extraction module is used for extracting text conversation expressive force from the original corpus, and the text conversation expressive force comprises a conversation intention and a conversation style; the characteristic determining module is used for determining the text conversation style characteristic based on the text conversation expressive force; and the input module is used for inputting the text dialogue style characteristics into a speech synthesis model and determining dialogue speech corresponding to the original corpus based on a Mel frequency spectrum output by the speech synthesis model.
In an implementation manner, the corpus determining module is specifically configured to obtain text information for speech synthesis; separating the dialogue and the voice-over in the text information to obtain text dialogue information and voice-over text information; and determining the text dialogue information and the voice-over text information as the original corpus.
In an implementation manner, the extraction module is specifically configured to map text information in the original corpus to a word vector space, so as to obtain word vector information corresponding to the original corpus; fitting the word vector information to obtain a fitting result; extracting the text dialog expressive force from the fitting result based on a conditional random field model.
In an implementation manner, the feature determining module is specifically configured to determine a reference dialog style feature based on a reference voice corresponding to the original corpus; inputting the text dialogue expressive force into a dialogue characteristic training model to obtain text dialogue style characteristics output by the dialogue characteristic training model; adjusting parameters of the dialog feature training model based on differences between the reference dialog style feature and the text dialog style feature.
In an implementation manner, the feature determination module is specifically configured to map the text dialog expressive force to a sentence vector space, so as to obtain sentence vector information corresponding to the text dialog expressive force; and determining the text dialogue style characteristics based on the sentence vector information.
In an implementation manner, the input module is specifically configured to use text information corresponding to the original corpus as input of the speech synthesis model, convert the text information into audio information, and perform speech length alignment on the audio information to obtain a linear spectrum corresponding to the text information; obtaining a Mel frequency spectrum output by the speech synthesis model based on the text dialogue style characteristics and the linear frequency spectrum; and determining the dialogue voice corresponding to the original corpus based on the Mel frequency spectrum.
According to a fourth aspect of the present disclosure, there is provided a speech synthesis model training apparatus, the apparatus comprising: the generating module is used for generating an original corpus sample set for voice synthesis; the sample extraction module is used for extracting a plurality of text conversation expressive forces from the original corpus sample set, and the text conversation expressive forces comprise conversation intention and conversation style; the sample characteristic determining module is used for determining a plurality of text conversation style characteristics based on a plurality of text conversation expressive forces; and the training module is used for training a speech synthesis model by using the plurality of text dialogue style characteristics so that the speech model can use the text dialogue style characteristics for synthesizing dialogue speech.
In an implementation manner, the sample extraction module is specifically configured to obtain a plurality of text dialogue expressive forces corresponding to the original corpus sample set; training a dialog feature training model with a plurality of the text dialog expressiveness, so that the dialog feature training model can use the text dialog expressiveness for obtaining text dialog style features.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods of the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of the present disclosure.
According to the speech synthesis method and apparatus, the electronic device and the storage medium of the present disclosure, the dialogue emotion, dialogue style, dialogue intention and the like contained in the original corpus are extracted, so that the features of the original corpus are fully utilized to obtain text dialogue style features, and these text dialogue style features are then used to improve the naturalness of the synthesized speech.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
FIG. 1 is a process flow diagram of a speech synthesis method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a process flow of determining original corpus used for speech synthesis in a speech synthesis method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram illustrating a process flow for extracting text dialog expressive force from original corpus in a speech synthesis method according to an embodiment of the present disclosure;
FIG. 4 is a schematic overall flow chart diagram illustrating a speech synthesis method according to an embodiment of the present disclosure;
FIG. 5 is a process flow diagram of a method for training a speech synthesis model according to an embodiment of the present disclosure;
fig. 6 is a schematic diagram illustrating a component structure of a speech synthesis apparatus according to an embodiment of the present disclosure;
fig. 7 shows a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, features and advantages of the present disclosure more apparent and understandable, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
To improve the naturalness of synthesized speech, attention must be paid not only to how accurately the speech conveys the linguistic content but also to giving the synthesized speech individual emotional color and language style, so that it expresses not only the textual information but also emotion, intention, attitude, speaker characteristics and other information. Dialogue expressiveness refers to all the expressive characteristics an interlocutor displays during a conversation; emotional color, language style, dialogue intention, attitude and the like are all reflected in it.
In the related art, a limited set of dialogue emotion categories is defined and emotion labels such as happiness, sadness and fear are added to the TTS system, so that the synthesized speech becomes more expressive. However, dialogue expressiveness also includes other factors such as the intention and the style of the dialogue, and distinguishing expressiveness by emotion alone is not sufficient to improve the naturalness of the synthesized speech.
Therefore, the embodiments of the present disclosure provide a speech synthesis method that extracts the emotion, style and intention of the dialogue contained in text information and makes full use of these features to obtain text dialogue style features, which are then used to improve the naturalness of the synthesized speech.
Fig. 1 is a schematic processing flow diagram of a speech synthesis method according to an embodiment of the present disclosure.
Referring to fig. 1, a processing flow of a speech synthesis method according to an embodiment of the present disclosure at least includes the following steps:
step S101, determining an original corpus used for speech synthesis.
In some embodiments, the text information for speech synthesis may be pre-processed to obtain the original corpus. Wherein preprocessing may refer to changing the format and content of the text information.
In some embodiments, determining a specific implementation of the original corpus for speech synthesis, as shown in fig. 2, comprises at least the following steps:
in step S101a, text information for speech synthesis is acquired.
In some embodiments, the text information may be any text segment used for speech synthesis; for example, a chapter of a novel may be excerpted, or a text segment may be selected from a corpus as the text information.
And step S101b, separating the dialogue and the voice-over in the text information to obtain the text dialogue information and the voice-over text information.
In some embodiments, the dialogue and the voice-over in the text information can be separated by a rule matching method. The separation rules rely mainly on regular expressions, which are used to filter the text information so as to distinguish the text dialogue information from the voice-over text information; a minimal illustration is sketched below.
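The embodiment does not fix a concrete rule set, so the following Python sketch only illustrates one possible regular-expression separation, assuming dialogue is enclosed in Chinese quotation marks and everything outside the quotes is voice-over. The pattern, the function name and the splitting granularity are illustrative assumptions.

```python
import re

# Minimal sketch of rule-based separation, assuming dialogue is enclosed in
# Chinese quotation marks ("...") and the remainder is voice-over narration.
# The quotation pattern is an assumption, not the rule set of the patent.
DIALOGUE_PATTERN = re.compile(r'“[^”]*”')

def separate_dialogue_and_voiceover(text: str):
    """Return (dialogue_segments, voiceover_segments) for one passage."""
    dialogue = DIALOGUE_PATTERN.findall(text)
    # Whatever remains after removing quoted spans is treated as voice-over.
    voiceover = [seg.strip() for seg in DIALOGUE_PATTERN.split(text) if seg.strip()]
    return dialogue, voiceover

# Example usage on a short excerpt.
passage = '他轻声说：“明天见。”说完便转身离开了。'
dialogue, voiceover = separate_dialogue_and_voiceover(passage)
print(dialogue)   # ['“明天见。”']
print(voiceover)  # ['他轻声说：', '说完便转身离开了。']
```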
Step S101c, determining the text dialogue information and the voice-over text information as original language materials.
In some embodiments, the voice-over text information can serve as context for the text dialogue information: the intention of the dialogue, the intonation of the dialogue and the like can be reasonably inferred from the logical relationships in the voice-over text, and the synthesized speech can then be modified in terms of text dialogue expressive force, so that it comes closer to natural dialogue and its naturalness is improved.
And step S102, extracting text dialogue expressive force from the original corpus, wherein the text dialogue expressive force comprises dialogue intention and dialogue style.
In some embodiments, the text dialog expressiveness may further include: dialog intonation, dialog emotion and semantic information. It should be understood that text dialog expressiveness may be any factor that affects speech prosody during a dialog.
In some embodiments, a specific implementation of extracting the text dialog expressive force from the original corpus, as shown in fig. 3, includes at least the following steps:
step S102a, mapping the text information in the original corpus to a word vector space to obtain word vector information corresponding to the original corpus.
In some embodiments, the text information in the original corpus may be mapped to a word vector space through a BERT model, so as to obtain the word vector information corresponding to the original corpus. BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained language representation model that can convert text into a vector representation.
In some embodiments, the word vector information is a vector representing words in the text information.
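As an illustration of this mapping step, the sketch below obtains contextual word vectors through the Hugging Face transformers library. The specific checkpoint (bert-base-chinese) and the use of the last hidden layer are assumptions; the embodiment only requires that a BERT model produce the word vector information.

```python
import torch
from transformers import BertTokenizer, BertModel

# Sketch of mapping corpus text into a word (token) vector space with BERT.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
bert = BertModel.from_pretrained("bert-base-chinese")

def word_vectors(text: str) -> torch.Tensor:
    """Return a [num_tokens, hidden_size] matrix of contextual word vectors."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = bert(**inputs)
    return outputs.last_hidden_state.squeeze(0)

vectors = word_vectors("明天的天气怎么样")
print(vectors.shape)  # e.g. torch.Size([10, 768]) including [CLS]/[SEP] tokens
```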
And step S102b, fitting the word vector information to obtain a fitting result.
In some embodiments, the word vector information may be fitted through a Long Short Term Memory (LSTM) network to obtain a fitting result about the word vector information.
And step S102c, extracting the text dialogue expression force from the fitting result based on the conditional random field model.
In some embodiments, a Conditional Random Field (CRF) is a conditional probability distribution model of one set of output random variables given another set of input random variables. Adding the CRF model to the process of obtaining the text dialogue expressive force strengthens the association between words within the same piece of text dialogue information and between neighboring pieces of text, so that the extracted text dialogue expressive force is as consistent and continuous as possible and conforms to the basic rules of dialogue expressiveness in text.
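A minimal sketch of the LSTM fitting and CRF extraction of steps S102b and S102c is given below. It assumes the third-party pytorch-crf package and an illustrative BIO-style tag inventory for the expressiveness labels; the tag set, dimensions and loss handling are not specified by the embodiment and are assumptions here.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf; assumed third-party package

# Sketch of LSTM fitting plus CRF decoding over BERT word vectors.
NUM_TAGS = 9          # e.g. B/I tags for intention, style, emotion, plus "O" (assumed)
HIDDEN = 256

class ExpressivenessTagger(nn.Module):
    def __init__(self, bert_dim: int = 768):
        super().__init__()
        self.lstm = nn.LSTM(bert_dim, HIDDEN, batch_first=True, bidirectional=True)
        self.emit = nn.Linear(2 * HIDDEN, NUM_TAGS)
        self.crf = CRF(NUM_TAGS, batch_first=True)

    def forward(self, word_vecs, tags=None):
        feats, _ = self.lstm(word_vecs)          # fit the word vector sequence
        emissions = self.emit(feats)             # per-token tag scores
        if tags is not None:
            return -self.crf(emissions, tags)    # negative log-likelihood for training
        return self.crf.decode(emissions)        # most likely tag sequence at inference

tagger = ExpressivenessTagger()
fake_batch = torch.randn(1, 10, 768)             # stand-in for BERT outputs
print(tagger(fake_batch))                        # decoded tag ids, one list per sentence
```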
Step S103, determining the text dialogue style characteristics based on the text dialogue expressive force.
In some embodiments, the extracted text dialogue expressive force needs to be mapped into a vector so that, when dialogue speech is synthesized, it can have the same effect as the original dialogue speech. Training is therefore required on how to map the text dialogue expressive force into the vector space.
In some embodiments, a specific implementation process for determining text dialog style characteristics based on text dialog expressiveness includes at least the following steps:
step S103a, determining reference dialogue style characteristics based on the reference voice corresponding to the original corpus.
In some embodiments, reference speech corresponding to the original corpus may be input to the dialog feature training model, so as to obtain a reference dialog style feature output by the dialog feature training model. The reference voice corresponding to the original corpus can be obtained by recording a real person.
And step S103b, inputting the text dialogue expressive force into the dialogue characteristic training model to obtain the text dialogue style characteristics output by the dialogue characteristic training model.
In some embodiments, inputting the text dialogue expressive force into the dialogue feature training model to obtain the text dialogue style features output by the dialogue feature training model comprises at least the following steps:
Step A, mapping the text dialogue expressive force to a sentence vector space to obtain sentence vector information corresponding to the text dialogue expressive force.
In some embodiments, the extracted text dialogue expressive force may be converted into a sentence vector by a dialogue style encoder in the dialogue feature training model, thereby obtaining the sentence vector information corresponding to the text dialogue expressive force.
In some embodiments, in the sentence vector space the text dialogue expressive force learns, as far as possible, the features contained in the reference speech of the text, so that similar text dialogue expressive forces point in similar directions in the vector space while markedly different ones point in different directions, from which the text dialogue style features are obtained.
And step B, determining the text dialogue style characteristics based on the sentence vector information.
In some embodiments, sentence vector information may be converted to textual dialog style features by an adaptation layer in a dialog feature training model.
In some embodiments, the text conversation style feature is a feature containing text conversation expressive force information, which may include: dialog intention features, dialog style features, dialog emotion features, and the like.
And step S103c, adjusting parameters of the dialogue feature training model based on the difference between the reference dialogue style feature and the text dialogue style feature.
In some embodiments, the text dialog style feature is compared to a reference dialog style feature, and a parameter of the dialog feature training model may be adjusted if a difference between the text dialog style feature and the reference dialog style feature is greater than a set threshold. The threshold value can be flexibly set, and the specific numerical value of the threshold value represents the tolerance degree of the difference between the text conversation style characteristic and the reference conversation style characteristic.
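The sketch below ties steps S103a to S103c together: a sentence vector is produced for the style tag text, an adaptation layer converts it into a text dialogue style feature, and the parameters are updated when the difference to the reference dialogue style feature exceeds a threshold. The SBERT checkpoint, the layer sizes, the MSE objective and the threshold value are illustrative assumptions, not values fixed by the embodiment.

```python
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Sketch of one dialogue-feature training step under the stated assumptions.
sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed checkpoint
adaptation = nn.Sequential(nn.Linear(384, 256), nn.Tanh(), nn.Linear(256, 128))
optimizer = torch.optim.Adam(adaptation.parameters(), lr=1e-4)

def training_step(style_tag_text: str, reference_style: torch.Tensor) -> float:
    sentence_vec = torch.tensor(sbert.encode(style_tag_text))   # sentence vector space
    text_style = adaptation(sentence_vec)                       # text dialogue style feature
    loss = nn.functional.mse_loss(text_style, reference_style)  # difference to reference feature
    if loss.item() > 0.01:            # tolerance threshold, an assumed value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```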
And step S104, inputting the text dialogue style characteristics into a voice synthesis model, and determining dialogue voices corresponding to the original corpus based on the Mel frequency spectrum output by the voice synthesis model.
In some embodiments, the reference dialogue style feature may be used as the input of the speech synthesis model to train it, and the parameters of the speech synthesis model may be adjusted based on the difference between the Mel spectrum corresponding to the reference dialogue style feature and the Mel spectrum output by the speech synthesis model, so as to optimize the dialogue style feature and the Mel feature. Once the speech synthesis model can accurately synthesize dialogue speech from the dialogue style feature, the text dialogue style feature is used as the input of the speech synthesis model, so that the dialogue speech is determined.
In some embodiments, inputting the text dialogue style characteristics into a speech synthesis model, and determining a specific implementation process of dialogue speech corresponding to the original corpus based on a mel frequency spectrum output by the speech synthesis model, at least includes the following steps:
step S104a, using the text information corresponding to the original corpus as the input of the speech synthesis model, converting the text information into audio information, and performing speech length alignment on the audio information to obtain a linear frequency spectrum corresponding to the text information.
In some embodiments, the audio may be transformed from a time-domain representation to a frequency-domain representation by a Short-Time Fourier Transform (STFT), resulting in a linear spectrum, where the value at each frequency represents the energy of the speech frame at that frequency.
And step S104b, obtaining a Mel frequency spectrum output by the speech synthesis model based on the text dialogue style characteristics and the linear frequency spectrum.
In some embodiments, to simulate the human ear's reduced sensitivity to high-frequency signals, the linear spectrum obtained from the STFT is processed by a set of triangular filters to obtain the Mel spectrum; at the same time, the text dialogue style features are combined with the Mel spectrum.
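A minimal sketch of this spectral front end, using librosa for the STFT and the triangular Mel filter bank, is shown below; the sampling rate, frame sizes, 80 Mel bands and log compression are common defaults assumed for illustration rather than values specified by the embodiment.

```python
import numpy as np
import librosa

# Sketch of the spectral front end: STFT gives the linear spectrum, and a bank
# of triangular Mel filters compresses it into a Mel spectrum.
def mel_spectrum(wav_path: str, sr: int = 22050, n_fft: int = 1024,
                 hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    audio, _ = librosa.load(wav_path, sr=sr)
    linear = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length))  # linear spectrum
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)        # triangular filters
    mel = mel_basis @ linear                                                  # Mel spectrum
    return np.log(np.clip(mel, 1e-5, None))                                   # log compression
```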
And step S104c, determining the dialogue voice corresponding to the original corpus based on the Mel frequency spectrum.
In some embodiments, an image of the interlocutor corresponding to the dialogue speech may further be synthesized by a facial key point synthesis network and a video synthesis network: the facial key point synthesis network derives facial key point features from the Mel spectrum, and the video synthesis network converts the facial key point features into frames showing the interlocutor.
According to the embodiments of the present disclosure, text information is added to speech synthesis as a feature, which effectively improves the naturalness of speech synthesized for different dialogue contents; the feature extraction method is improved on the basis of a neural network so that similar text dialogue style features lie closer together in the vector space. As a result, the speech synthesized for a given dialogue content is no longer a cold, rigid machine voice: its dialogue expressive force is stronger, the forms of spoken expression are richer, and the synthesis effect is more natural. At the same time, compared with the prior art, this technical solution is easier to implement and its effect is more stable.
Fig. 4 is a schematic overall flow chart of a speech synthesis method according to an embodiment of the present disclosure.
Referring to fig. 4, in one aspect, the dialog and the voice-over of the text information are separated by using a rule matching method, so as to obtain an original corpus including the dialog text information and the voice-over text information.
The text dialogue expressive force (Style tag) in the original corpus is extracted by the BERT_CRF model, where BERT_CRF is a model structure composed of a BERT model and a CRF model.
The reference speech corresponding to the original corpus is input into a reference encoder to obtain a reference dialogue style embedding (style embedding). The style embedding is a feature containing the Style tag information and may be referred to as a dialogue style feature.
Meanwhile, the text dialogue expressive force is input into a dialogue style encoder (Style tag encoder): a Sentence-BERT (SBERT) siamese network in the dialogue style encoder maps the text dialogue expressive force to a sentence vector space and converts it into a sentence vector, and adaptation layers transform the sentence vector to obtain the text dialogue style embedding. In this way, the text dialogue expressive force learns, as far as possible, the features contained in the reference speech within the sentence vector space.
On the other hand, the text information is input into a text encoder and converted into audio information, and the audio information is aligned in speech length to obtain the linear spectrum corresponding to the text information. The linear spectrum and the text dialogue style embedding are then input together into a Mel decoder to obtain a Mel spectrum, and the Mel spectrum actually obtained is compared with the reference Mel spectrum to adjust the parameters of the speech synthesis model; a minimal sketch of how the style embedding can be combined with the encoder output before Mel decoding is given below.
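The embodiment states only that the linear spectrum and the text dialogue style embedding are fed to the Mel decoder together; one common realization, sketched below, broadcasts the style embedding over all frames and concatenates it with the encoder output. The module, dimensions and concatenation strategy are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Sketch of combining the text dialogue style embedding with the encoder output
# before Mel decoding, under the stated assumptions.
class MelDecoderInput(nn.Module):
    def __init__(self, encoder_dim: int = 256, style_dim: int = 128, n_mels: int = 80):
        super().__init__()
        self.proj = nn.Linear(encoder_dim + style_dim, n_mels)  # stand-in for the Mel decoder

    def forward(self, encoder_out: torch.Tensor, style_emb: torch.Tensor) -> torch.Tensor:
        # encoder_out: [batch, frames, encoder_dim]; style_emb: [batch, style_dim]
        style = style_emb.unsqueeze(1).expand(-1, encoder_out.size(1), -1)
        return self.proj(torch.cat([encoder_out, style], dim=-1))  # predicted Mel frames

decoder_in = MelDecoderInput()
mel = decoder_in(torch.randn(2, 120, 256), torch.randn(2, 128))
print(mel.shape)  # torch.Size([2, 120, 80])
```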
Finally, the dialogue speech is synthesized according to the Mel frequency spectrum added with the text dialogue style characteristics.
By using text information as the driving signal for dialogue expressive force, the embodiments of the present disclosure avoid the monotony and stiffness that result from performing speech synthesis directly from audio information alone. At the same time, the method meets the growing demands of the information society: it gives the synthesized speech individual emotional color and language style, reduces users' aversion to machine-sounding voices, and is more appealing to users.
FIG. 5 is a process flow diagram of a method for training a speech synthesis model according to an embodiment of the present disclosure.
Referring to fig. 5, a processing flow of a method for training a speech synthesis model according to an embodiment of the present disclosure includes at least the following steps:
step S201, an original corpus sample set for speech synthesis is generated.
In some embodiments, the original corpus in a plurality of the foregoing embodiments may be integrated to generate an original corpus sample set for speech synthesis.
Step S202, extracting a plurality of text dialogue expressive forces from the original corpus sample set, wherein the text dialogue expressive forces comprise dialogue intentions and dialogue styles.
Step S203, determining a plurality of text dialogue style characteristics based on the plurality of text dialogue expressive forces.
The extracted text dialogue expressive force needs to be mapped into a vector so that, when dialogue speech is synthesized, it can have the same effect as the original dialogue speech. Training is therefore required on how to map the text dialogue expressive force into the vector space.
In some embodiments, determining a particular implementation of a plurality of text dialog style features based on a plurality of text dialog expressiveness comprises at least the steps of:
step a, obtaining a plurality of text dialogue expressive forces corresponding to the original corpus sample set.
In some embodiments, the text dialogue expressive force may include the style of the dialogue and the intention of the dialogue. Dialogue styles may be divided into categories such as relaxed, lively and deep, and dialogue intentions may be divided into continuing the dialogue and not continuing the dialogue, with the degree of intention subdivided further if needed. A plurality of text dialogue expressive forces can therefore be extracted from the original corpus sample set, and a single piece of text information may exhibit several of them, for example a dialogue style that shifts from lively to relaxed; further examples are not listed here.
And b, training the dialogue characteristic training model by using the multiple text dialogue expressive forces so that the dialogue characteristic training model can use the text dialogue expressive forces to acquire the text dialogue style characteristics.
Step S204, training a speech synthesis model with the plurality of text dialogue style features, so that the speech synthesis model can use the text dialogue style features for synthesizing dialogue speech.
In some embodiments, the text conversation style feature is a feature containing text conversation expressive force information, which may include: dialog intention features, dialog style features, dialog emotion features, and the like. Accordingly, a variety of text dialog style features may be derived from a variety of text dialog expressiveness.
In some embodiments, a plurality of reference dialogue style features may be extracted from a reference speech sample set by the dialogue feature training model, and these reference dialogue style features may be input into the speech synthesis model as text dialogue style features to train it; the parameters of the speech synthesis model are then adjusted based on the differences between the Mel spectra corresponding to the plurality of reference dialogue style features and the Mel spectra output by the speech synthesis model, so as to optimize the dialogue style features and the Mel features. A minimal training-loop sketch is given below.
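This sketch only illustrates the parameter-adjustment loop implied above; the L1 objective over Mel spectra, the optimizer settings and the generic model interface are illustrative assumptions rather than parts of the claimed method.

```python
import torch
import torch.nn as nn

# Sketch of training the synthesis model: the predicted Mel spectrum is compared
# with the reference Mel spectrum and the parameters are adjusted accordingly.
def train_synthesis_model(model: nn.Module, batches, epochs: int = 10, lr: float = 1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    l1 = nn.L1Loss()
    for _ in range(epochs):
        for text_ids, style_feature, reference_mel in batches:
            predicted_mel = model(text_ids, style_feature)
            loss = l1(predicted_mel, reference_mel)   # difference between Mel spectra
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```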
Fig. 6 is a schematic diagram showing a configuration of a speech synthesis apparatus according to an embodiment.
Referring to fig. 6, a speech synthesis apparatus 60 according to an embodiment of the present disclosure includes: a corpus determining module 601, configured to determine an original corpus used for speech synthesis; an extracting module 602, configured to extract a text dialogue expressive force from the original corpus, where the text dialogue expressive force includes a dialogue intention and a dialogue style; a feature determining module 603, configured to determine a text dialogue style feature based on the text dialogue expressive force; and an input module 604, configured to input the text dialogue style feature into the speech synthesis model and determine the dialogue speech corresponding to the original corpus based on a Mel spectrum output by the speech synthesis model.
In some embodiments, the corpus determining module 601 is specifically configured to obtain text information for speech synthesis; separating the dialogue and the voice-over in the text information to obtain text dialogue information and voice-over text information; and determining the text dialogue information and the voice-over text information as original language materials.
In some embodiments, the extracting module 602 is specifically configured to map text information in an original corpus to a word vector space, so as to obtain word vector information corresponding to the original corpus; fitting word vector information to obtain a fitting result; text dialogue expressiveness is extracted from the fitting result based on a conditional random field model.
In some embodiments, the feature determining module 603 is specifically configured to determine a reference dialog style feature based on a reference speech corresponding to the original corpus; inputting the text dialogue expressive force into a dialogue characteristic training model to obtain text dialogue style characteristics output by the dialogue characteristic training model; parameters of the dialog feature training model are adjusted based on differences between the reference dialog style feature and the text dialog style feature.
In some embodiments, the feature determining module 603 is specifically configured to map the text dialog expressive force to a sentence vector space, so as to obtain sentence vector information corresponding to the text dialog expressive force; and determining the text dialogue style characteristics based on the sentence vector information.
In some embodiments, the input module 604 is specifically configured to use text information corresponding to an original corpus as input of a speech synthesis model, convert the text information into audio information, and perform speech length alignment on the audio information to obtain a linear spectrum corresponding to the text information; obtaining a Mel frequency spectrum output by a speech synthesis model based on the text dialogue style characteristics and the linear frequency spectrum; and determining the dialogue voice corresponding to the original corpus based on the Mel frequency spectrum.
The present disclosure also provides an electronic device and a readable storage medium according to an embodiment of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable electronic devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the electronic device 700 can also be stored. The computing unit 701, the ROM702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A plurality of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other electronic devices via a computer network such as the internet and/or various telecommunication networks.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 701 performs the respective methods and processes described above, such as a speech synthesis method. For example, in some embodiments, a speech synthesis method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM702 and/or the communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of a speech synthesis method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform a speech synthesis method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server with a combined blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "a plurality" means two or more unless specifically limited otherwise.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
determining an original corpus for speech synthesis;
extracting text conversation expressive force from the original corpus, wherein the text conversation expressive force comprises a conversation intention and a conversation style;
determining a text conversation style feature based on the text conversation expressiveness;
inputting the text dialogue style characteristics into a speech synthesis model, and determining dialogue speech corresponding to the original corpus based on a Mel frequency spectrum output by the speech synthesis model.
2. The method of claim 1, wherein determining the original corpus for speech synthesis comprises:
acquiring text information for voice synthesis;
separating the dialogue and the voice-over in the text information to obtain text dialogue information and voice-over text information;
and determining the text dialogue information and the voice-over text information as the original corpus.
3. The method of claim 1, wherein extracting the textual conversational expressiveness from the original corpus comprises:
mapping the text information in the original corpus to a word vector space to obtain word vector information corresponding to the original corpus;
fitting the word vector information to obtain a fitting result;
extracting the text dialog expressive force from the fitting result based on a conditional random field model.
4. The method of claim 1, wherein determining text conversation style features based on the text conversation expressiveness comprises:
determining reference dialogue style characteristics based on the reference voice corresponding to the original corpus;
inputting the text dialogue expressive force into a dialogue characteristic training model to obtain text dialogue style characteristics output by the dialogue characteristic training model;
adjusting parameters of the dialog feature training model based on differences between the reference dialog style feature and the text dialog style feature.
5. The method of claim 4, wherein the inputting the text dialogue expression into a dialogue feature training model to obtain a text dialogue style feature output by the dialogue feature training model comprises:
mapping the text conversation expressive force to a sentence vector space to obtain sentence vector information corresponding to the text conversation expressive force;
and determining the text dialogue style characteristics based on the sentence vector information.
6. The method according to claim 1, wherein the inputting the text dialogue style feature into a speech synthesis model, and determining the dialogue speech corresponding to the original corpus based on a mel spectrum output by the speech synthesis model comprises:
taking the text information corresponding to the original corpus as the input of the voice synthesis model, converting the text information into audio information, and performing voice length alignment on the audio information to obtain a linear frequency spectrum corresponding to the text information;
obtaining a Mel frequency spectrum output by the speech synthesis model based on the text dialogue style characteristics and the linear frequency spectrum;
and determining the dialogue voice corresponding to the original corpus based on the Mel frequency spectrum.
7. A method for training a speech synthesis model, the method comprising:
generating an original corpus sample set for speech synthesis;
extracting a plurality of text dialogue expressive forces from the original corpus sample set, wherein the text dialogue expressive forces comprise dialogue intention and dialogue style;
determining a plurality of text dialogue style characteristics based on a plurality of text dialogue expressive forces;
training a speech synthesis model with a plurality of the text dialog style features to enable the speech model to use the text dialog style features for synthesizing a dialog speech.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the corpus determining module is used for determining an original corpus used for voice synthesis;
the extraction module is used for extracting text conversation expressive force from the original corpus, and the text conversation expressive force comprises a conversation intention and a conversation style;
the characteristic determining module is used for determining the text conversation style characteristic based on the text conversation expressive force;
and the input module is used for inputting the text dialogue style characteristics into a speech synthesis model and determining dialogue speech corresponding to the original corpus based on a Mel frequency spectrum output by the speech synthesis model.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech synthesis method of any of claims 1-6;
alternatively, the at least one processor may be capable of performing the speech synthesis model training method of claim 7.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the speech synthesis method according to any one of claims 1-6; alternatively, the speech synthesis model training method of claim 7 is performed.
CN202211142609.3A 2022-09-20 2022-09-20 Voice synthesis method and device, electronic equipment and storage medium Pending CN115547289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211142609.3A CN115547289A (en) 2022-09-20 2022-09-20 Voice synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211142609.3A CN115547289A (en) 2022-09-20 2022-09-20 Voice synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115547289A 2022-12-30

Family

ID=84727956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211142609.3A Pending CN115547289A (en) 2022-09-20 2022-09-20 Voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115547289A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination