CN115578995A - Speech synthesis method, system and storage medium for speech dialogue scene

Speech synthesis method, system and storage medium for speech dialogue scene

Info

Publication number
CN115578995A
Authority
CN
China
Prior art keywords: text, vector, embedding, level, historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211563513.4A
Other languages
Chinese (zh)
Other versions
CN115578995B (en)
Inventor
李雅
薛锦隆
邓雅月
高迎明
王风平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202211563513.4A
Publication of CN115578995A
Application granted
Publication of CN115578995B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a speech synthesis method, system and storage medium oriented to a speech dialogue scene, comprising the following steps: determining a text embedding sequence based on the speech text data to be synthesized, and obtaining a current speaker embedding vector and a historical speaker information embedding vector; determining a first context feature and a second context feature based on the sentence-level text embedding vector, the sentence-level speech embedding vector and the historical speaker information embedding vector; determining a first prosodic style feature from the text angle based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; determining a second prosodic style feature from the speech angle based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; and obtaining a predicted Mel spectrum based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio based on the Mel spectrum.

Description

Speech synthesis method, system and storage medium for speech dialogue scene
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a system, and a storage medium for speech synthesis oriented to a speech dialogue scene.
Background
In a current speech synthesis system, once the model has been trained on the text and speech in a database, speech can be generated for any input text, realizing the basic text-to-speech function. Current dialogue-oriented speech synthesis systems typically comprise a speech synthesis module, a context encoder and an auxiliary encoder. The speech synthesis module generates the corresponding audio from the text and includes a text encoder, a Mel spectrum decoder, a variational adapter, a vocoder and the like; the context encoder models the dialogue history at the sentence level from the text angle; and the auxiliary encoder extracts useful statistical text features, such as semantic and syntactic features.
In the prior art, when speech is synthesized for a speech dialogue scene, only the text content of the historical dialogue is typically given a sentence-level context embedding representation by a context encoder; that is, only global, sentence-level prosodic style information is considered when synthesizing speech for the dialogue scene, and information at other scales in the dialogue is ignored. The inventors found that, in a real conversation, word-level local information such as keywords, stress emphasis and prosodic changes also contributes to understanding the whole dialogue, and people react to specific words or phrases spoken by others. Therefore, although existing speech synthesis methods for speech dialogue scenes can synthesize speech, the synthesized speech fits the historical dialogue speech poorly; how to improve the fit between the synthesized speech and the historical dialogue is thus a technical problem to be urgently solved.
Disclosure of Invention
In view of the above, the present invention provides a speech synthesis method, system and storage medium for a speech dialogue scene, so as to solve one or more problems in the prior art.
According to one aspect of the invention, a speech synthesis method oriented to a speech dialogue scene is disclosed, which comprises the following steps:
acquiring speech text data to be synthesized and current speaker information data corresponding to the speech text data to be synthesized, determining a phoneme sequence based on the speech text data to be synthesized, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data;
acquiring historical text data, historical speech data and historical speaker information data of a historical dialogue, and determining a historical speaker information embedding vector based on the historical speaker information data;
obtaining a sentence-level text embedding vector based on the historical text data, and determining a first context feature at the text angle sentence level based on the sentence-level text embedding vector and the historical speaker information embedding vector;
obtaining a sentence-level speech embedding vector based on the historical speech data, and determining a second context feature at the speech angle sentence level based on the sentence-level speech embedding vector and the historical speaker information embedding vector;
obtaining a word-level text embedding vector based on the historical text data, and determining a first prosodic style feature of the text angle based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence;
obtaining a word-level speech embedding vector based on the historical speech data, and determining a second prosodic style feature of the speech angle based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence;
and obtaining a predicted Mel spectrum based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio corresponding to the speech text data to be synthesized based on the Mel spectrum.
In some embodiments of the present invention, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data, includes:
inputting the phoneme sequence into a first encoder to obtain a text embedding sequence;
and inputting the current speaker information data into a second encoder to obtain the current speaker embedding vector.
In some embodiments of the invention, a sentence-level text embedding vector is derived based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; this comprises the following steps:
inputting the historical text data into a first pre-training model to obtain the sentence-level text embedding vector;
concatenating the sentence-level text embedding vector with the historical speaker information embedding vector to obtain a first concatenated vector;
and inputting the first concatenated vector into a third encoder to obtain the first context feature at the text angle sentence level.
In some embodiments of the present invention, a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector; this comprises the following steps:
inputting the historical speech data into a second pre-training model to obtain the sentence-level speech embedding vector;
concatenating the sentence-level speech embedding vector with the historical speaker information embedding vector to obtain a second concatenated vector;
and inputting the second concatenated vector into a fourth encoder to obtain the second context feature at the speech angle sentence level.
In some embodiments of the invention, a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; this comprises the following steps:
inputting the historical text data into a third pre-training model to obtain the word-level text embedding vector;
concatenating the word-level text embedding vector with the historical speaker information embedding vector to obtain a third concatenated vector;
and inputting the third concatenated vector and the text embedding sequence into a fifth encoder to obtain the first prosodic style feature of the text angle.
In some embodiments of the invention, a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; this comprises the following steps:
inputting the historical speech data into a fourth pre-training model to obtain the word-level speech embedding vector;
concatenating the word-level speech embedding vector with the historical speaker information embedding vector to obtain a fourth concatenated vector;
and inputting the fourth concatenated vector and the text embedding sequence into a sixth encoder to obtain the second prosodic style feature of the speech angle.
In some embodiments of the present invention, a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum; this comprises the following steps:
adding the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector to obtain a fused embedding sequence;
inputting the fused embedding sequence into a variational adapter and adding pitch and energy features to obtain a Mel embedding vector sequence;
inputting the Mel embedding vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain the predicted Mel spectrum;
and inputting the predicted Mel spectrum into a vocoder to obtain the audio corresponding to the speech text data to be synthesized.
In some embodiments of the invention, the first pre-training model is a Sentence BERT model, the second pre-training model is a Fine-tuned Wav2vec model, the third pre-training model is a BERT model, and the fourth pre-training model is a Wav2vec model.
According to another aspect of the present invention, a speech synthesis system for a speech dialogue scene is also disclosed. The system comprises a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any of the embodiments described above.
According to yet another aspect of the invention, a computer-readable storage medium is also disclosed, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any of the embodiments above.
The speech synthesis method, system and storage medium for a speech dialogue scene disclosed in the embodiments of the invention can simulate a real human speech conversation by extracting sentence-level and word-level text and speech features of the historical dialogue from multi-modal and multi-granularity angles, thereby deepening the understanding of the historical dialogue, generating speech that better fits the historical dialogue, and giving the synthesized speech rich prosody, emotion, style and other characteristics.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts of the drawings may be exaggerated, i.e., may be larger, relative to other components in an exemplary apparatus actually manufactured according to the present invention. In the drawings:
fig. 1 is a flowchart of a speech synthesis method for a speech dialogue scene according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a speech synthesis method for a speech dialogue scene according to another embodiment of the present invention.
Fig. 3 is a schematic diagram of an architecture of a text angle sentence level module according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a structure of a speech angle sentence level module according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of an architecture of a text angle word level module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an architecture of a speech angle word level module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising/having", when used herein, specifies the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
In a voice conversation, people speak the same content in the same context with different tones, so paralinguistic information (such as prosody) in the conversation also plays an important role in context understanding; that is, in a real voice conversation, the speech itself carries additional dialogue information, such as emotional information, which affects the understanding of the historical dialogue. Therefore, to improve how well the synthesized speech fits the historical dialogue in a speech synthesis system for a multi-person dialogue scene, it is necessary to synthesize speech whose intonation, emotion, accent and other stylistic aspects suit the current text, based on historical dialogue information such as the intonation, emotion, accent and utterance content of the historical speakers, thereby improving the naturalness of the synthesized speech and its fit with the historical dialogue.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same reference numerals denote the same or similar components, or the same or similar steps.
Fig. 1 is a flowchart illustrating a speech synthesis method for a speech dialog scene according to an embodiment of the present invention, as shown in fig. 1, the speech synthesis method for the speech dialog scene at least includes steps S10 to S70.
Step S10: acquiring speech text data to be synthesized and current speaker information data corresponding to the speech text data to be synthesized, determining a phoneme sequence based on the speech text data to be synthesized, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data.
In this step, the speech text data to be synthesized serves as the current text and is converted into a corresponding phoneme sequence by a phoneme converter; referring to fig. 2, the current text is converted into a phoneme sequence by the phoneme converter, and the phoneme sequence is then input into a text encoder to obtain the text embedding sequence. In addition, the current speaker information data is input, as a condition, into a speaker encoder to obtain the current speaker embedding vector.
Exemplarily, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence and obtaining a current speaker embedding vector based on the current speaker information data specifically includes the following steps: inputting the phoneme sequence into a first encoder to obtain the text embedding sequence; and inputting the current speaker information data into a second encoder to obtain the current speaker embedding vector. The first encoder is the text encoder and the second encoder is the speaker encoder.
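As a rough illustration of this front end, the following PyTorch sketch pairs a minimal text encoder (phoneme IDs to a text embedding sequence) with a lookup-table speaker encoder. The Transformer layers, dimensions and the ID-based speaker lookup are illustrative assumptions; the patent does not specify these internals.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """First encoder: phoneme IDs -> text embedding sequence (assumed Transformer layers)."""
    def __init__(self, n_phonemes=100, d_model=256, n_layers=4, n_heads=2):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phoneme_ids):                # (B, T_phone)
        return self.encoder(self.phoneme_emb(phoneme_ids))   # (B, T_phone, d_model)

class SpeakerEncoder(nn.Module):
    """Second encoder: speaker ID -> speaker embedding vector (assumed lookup table)."""
    def __init__(self, n_speakers=10, d_model=256):
        super().__init__()
        self.table = nn.Embedding(n_speakers, d_model)

    def forward(self, speaker_id):                 # (B,)
        return self.table(speaker_id)              # (B, d_model)

if __name__ == "__main__":
    phonemes = torch.randint(0, 100, (1, 32))      # a 32-phoneme current utterance
    speaker = torch.tensor([3])                    # current speaker index
    text_embedding_seq = TextEncoder()(phonemes)
    speaker_embedding = SpeakerEncoder()(speaker)
    print(text_embedding_seq.shape, speaker_embedding.shape)
```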
Step S20: historical text data, historical speech data and historical speaker information data of the historical dialogue are obtained, and a historical speaker information embedding vector is determined based on the historical speaker information data.
In this step, the historical speaker information embedding vector may be determined with the speaker encoder, i.e., the historical speaker information data is input into the speaker encoder to obtain a historical speaker information embedding vector for each historical speaker. It is understood that the speaker encoder in this step and the second encoder in step S10 may be the same encoder or different encoders.
Step S30: a sentence-level text embedding vector is obtained based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector.
In this step, sentence-level text features are determined based on the historical text data of the historical dialogue. Referring to fig. 2, the historical text data is fed into the text angle sentence level module, and the output of this module is the first context feature at the text angle sentence level.
In one embodiment, a sentence-level text embedding vector is obtained based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; this specifically comprises the following steps: inputting the historical text data into a first pre-training model to obtain the sentence-level text embedding vector; concatenating the sentence-level text embedding vector with the historical speaker information embedding vector to obtain a first concatenated vector; and inputting the first concatenated vector into a third encoder to obtain the first context feature at the text angle sentence level.
Fig. 3 is a schematic diagram of the architecture of the text angle sentence level module according to an embodiment of the present invention. Referring to fig. 3, the first pre-training model may be a Sentence BERT model, i.e., the historical text data is passed through the Sentence BERT model to obtain the sentence-level text embedding vector, which is then passed through a fully connected layer and concatenated with the historical speaker information embedding vector to obtain the first concatenated vector. The third encoder may specifically include a GRU layer, a fully connected layer and a self-attention module; the first concatenated vector is taken as input, and the third encoder produces the sentence-level context feature from the text angle, thereby realizing understanding and feature extraction of the historical dialogue at the sentence level from the text angle. The third encoder is a sentence-level context encoder for the text angle.
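A minimal sketch of this sentence-level context path is given below, assuming one 768-dimensional Sentence-BERT embedding per historical utterance and mean pooling after self-attention; the dimensions, pooling choice and single attention head are assumptions made for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class SentenceLevelContextEncoder(nn.Module):
    """Third encoder (text angle): GRU -> fully connected -> self-attention over history."""
    def __init__(self, d_sent=768, d_spk=256, d_model=256):
        super().__init__()
        self.proj = nn.Linear(d_sent + d_spk, d_model)   # fuse utterance + speaker embedding
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, sent_emb, spk_emb):
        # sent_emb: (B, N_hist, d_sent); spk_emb: (B, N_hist, d_spk)
        x = self.proj(torch.cat([sent_emb, spk_emb], dim=-1))   # first concatenated vector
        x, _ = self.gru(x)
        x = self.fc(x)
        x, _ = self.attn(x, x, x)
        return x.mean(dim=1)          # (B, d_model): sentence-level context feature

if __name__ == "__main__":
    hist_text = torch.randn(1, 5, 768)   # Sentence-BERT embeddings of 5 historical utterances
    hist_spk = torch.randn(1, 5, 256)    # speaker embeddings of the same utterances
    ctx = SentenceLevelContextEncoder()(hist_text, hist_spk)
    print(ctx.shape)                     # torch.Size([1, 256])
```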
Step S40: a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector.
In this step, sentence-level speech features are determined based on the historical speech data of the historical dialogue. Referring to fig. 2, the historical speech data is fed into the speech angle sentence level module, and the output of this module is the second context feature at the speech angle sentence level.
In one embodiment, a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector; this specifically comprises the following steps: inputting the historical speech data into a second pre-training model to obtain the sentence-level speech embedding vector; concatenating the sentence-level speech embedding vector with the historical speaker information embedding vector to obtain a second concatenated vector; and inputting the second concatenated vector into a fourth encoder to obtain the second context feature at the speech angle sentence level.
Fig. 4 is a schematic diagram of the architecture of the speech angle sentence level module according to an embodiment of the present invention. Referring to fig. 4, the second pre-training model may be a Fine-tuned Wav2vec model, i.e., the historical speech data is passed through the Fine-tuned Wav2vec model to obtain the sentence-level speech embedding vector, which is then passed through a fully connected layer and concatenated with the historical speaker information embedding vector to obtain the second concatenated vector. The fourth encoder may specifically include a GRU layer, a fully connected layer and a self-attention module; the second concatenated vector is taken as input, and the fourth encoder produces the sentence-level context feature from the speech angle, thereby realizing understanding and feature extraction of the historical dialogue at the sentence level from the speech angle. The fourth encoder is a sentence-level context encoder for the speech angle.
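For the speech side, the sentence-level speech embedding could be extracted roughly as sketched below, using the Hugging Face transformers library with a generic wav2vec 2.0 checkpoint as a stand-in; the patent's fine-tuned Wav2vec model and its pooling strategy are not specified, so the checkpoint name and the mean pooling here are assumptions.

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# Stand-in checkpoint; the patent's fine-tuned Wav2vec model is not specified.
name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
wav2vec = Wav2Vec2Model.from_pretrained(name).eval()

def sentence_level_speech_embedding(waveform, sample_rate=16000):
    """Mean-pool frame-level wav2vec 2.0 features into one utterance-level vector."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = wav2vec(inputs.input_values).last_hidden_state  # (1, T_frames, 768)
    return frames.mean(dim=1)                                    # (1, 768)

if __name__ == "__main__":
    fake_utterance = torch.randn(16000).numpy()   # 1 s of dummy audio at 16 kHz
    print(sentence_level_speech_embedding(fake_utterance).shape)
```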
Step S50: a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence.
In this step, the text embedding sequence is the text embedding sequence corresponding to the speech text data to be synthesized; that is, prosodic style features are further extracted from the text angle at the word level based on the historical text data, the historical speaker information embedding vector and the text embedding sequence. Referring to fig. 2, the first prosodic style feature of the text angle may be obtained by the text angle word level module, i.e., the historical text data is fed into the text angle word level module, and the output of this module is the first prosodic style feature of the text angle.
In one embodiment, a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; this specifically comprises the following steps: inputting the historical text data into a third pre-training model to obtain the word-level text embedding vector; concatenating the word-level text embedding vector with the historical speaker information embedding vector to obtain a third concatenated vector; and inputting the third concatenated vector and the text embedding sequence into a fifth encoder to obtain the first prosodic style feature of the text angle.
Fig. 5 is a schematic diagram of the architecture of the text angle word level module according to an embodiment of the present invention. Referring to fig. 5, the third pre-training model may be a BERT model, i.e., the historical text data is passed through the BERT model to obtain a word-level text embedding vector sequence, which is then concatenated with the historical speaker information embedding vector to obtain the third concatenated vector. The fifth encoder may specifically include convolutional layers and a multi-head attention module; the third concatenated vector and the text embedding sequence corresponding to the speech text data to be synthesized are input together into the fifth encoder, yielding a prosodic style feature of the text angle with the same length as the text embedding sequence. The fifth encoder is a word-level context encoder for the text angle.
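One plausible reading of this word-level module is sketched below: the current text embedding sequence acts as the attention query over the word-level history (keys and values), so the output prosodic style feature has the same length as the text embedding sequence. The convolution stack, head count and dimensions are illustrative assumptions, and the same sketch also stands in for the sixth encoder of step S60.

```python
import torch
import torch.nn as nn

class WordLevelProsodyEncoder(nn.Module):
    """Fifth/sixth encoder sketch: Conv1d stack + multi-head attention.
    Query: current text embedding sequence; Key/Value: word-level history embeddings."""
    def __init__(self, d_hist=768, d_spk=256, d_model=256, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_hist + d_spk, d_model)   # third/fourth concatenated vector
        self.convs = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_embedding_seq, word_emb, spk_emb):
        # text_embedding_seq: (B, T_phone, d_model)
        # word_emb: (B, N_words, d_hist); spk_emb: (B, N_words, d_spk)
        h = self.proj(torch.cat([word_emb, spk_emb], dim=-1))
        h = self.convs(h.transpose(1, 2)).transpose(1, 2)
        style, _ = self.attn(query=text_embedding_seq, key=h, value=h)
        return style              # (B, T_phone, d_model): same length as the text sequence

if __name__ == "__main__":
    text_seq = torch.randn(1, 32, 256)      # current utterance, 32 phonemes
    word_hist = torch.randn(1, 40, 768)     # 40 history words (BERT or Wav2vec features)
    spk_hist = torch.randn(1, 40, 256)      # speaker embedding repeated per word
    out = WordLevelProsodyEncoder()(text_seq, word_hist, spk_hist)
    print(out.shape)                        # torch.Size([1, 32, 256])
```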
Step S60: a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence.
In this step, prosodic style features are further extracted from the speech angle at the word level based on the historical speech data, the historical speaker information embedding vector and the text embedding sequence. Referring to fig. 2, the second prosodic style feature of the speech angle may be obtained by the speech angle word level module, i.e., the historical speech data is fed into the speech angle word level module, and the output of this module is the second prosodic style feature of the speech angle.
In one embodiment, a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; this specifically comprises the following steps: inputting the historical speech data into a fourth pre-training model to obtain the word-level speech embedding vector; concatenating the word-level speech embedding vector with the historical speaker information embedding vector to obtain a fourth concatenated vector; and inputting the fourth concatenated vector and the text embedding sequence into a sixth encoder to obtain the second prosodic style feature of the speech angle.
Fig. 6 is a schematic diagram of the architecture of the speech angle word level module according to an embodiment of the present invention. Referring to fig. 6, the fourth pre-training model may be a Wav2vec model, i.e., the historical speech data is passed through the Wav2vec model to obtain a word-level speech embedding vector sequence, which is then concatenated with the historical speaker information embedding vector to obtain the fourth concatenated vector. The sixth encoder may specifically include convolutional layers and a multi-head attention module; the fourth concatenated vector and the text embedding sequence corresponding to the speech text data to be synthesized are input together into the sixth encoder, yielding a prosodic style feature of the speech angle with the same length as the text embedding sequence. The sixth encoder is a word-level context encoder for the speech angle.
Step S70: a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum.
In this step, the text embedding sequence corresponding to the speech text to be synthesized is fused with the sentence-level first and second context features and with the word-level first and second prosodic style features, so that prosody, emotion, style and other characteristics of the historical dialogue are taken into account from both the speech side and the text side, yielding more natural audio that closely fits the historical dialogue.
In an embodiment, a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum; this comprises the following steps: adding the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector to obtain a fused embedding sequence; inputting the fused embedding sequence into a variational adapter and adding pitch and energy features to obtain a Mel embedding vector sequence; inputting the Mel embedding vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain the predicted Mel spectrum; and inputting the predicted Mel spectrum into a vocoder to obtain the audio corresponding to the speech text data to be synthesized.
Referring to fig. 2, the text embedding sequence, to which the text angle and speech angle sentence-level and word-level historical dialogue features have been added, passes through the variational adapter, where pitch and energy feature information is added, and then through a length adapter, which expands the text embedding sequence fused with the multi-modal features into a Mel embedding vector sequence with the same length as the target Mel spectrum. The Mel embedding vector sequence, the first context feature and the second context feature are then input into the Mel decoder to obtain the predicted Mel spectrum; finally, the predicted Mel spectrum is input into the vocoder to obtain the finally predicted audio.
In this embodiment, after the word-level and sentence-level information of the speech angle and the text angle has been added, the text embedding sequence corresponding to the speech text data to be synthesized further passes through the variational adapter, the Mel decoder, the vocoder and the like to generate the audio corresponding to the speech text data to be synthesized. The speech synthesis method for a speech dialogue scene therefore considers not only the features of the two modalities of speech and text, but also historical information at different levels and granularities, including the sentence level and the word level, realizing end-to-end multi-scale multi-modal dialogue speech synthesis.
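A compact sketch of this fusion and decoding flow is shown below, assuming a FastSpeech 2-style variance/length adaptation. The pitch, energy and duration predictors are reduced to single linear layers, the Mel decoder is a stand-in linear layer that ignores the extra context conditioning described above, and a neural vocoder (e.g. HiFi-GAN) would follow in practice; all of these simplifications are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class VariationalAdapter(nn.Module):
    """Toy variance adaptor: add pitch/energy terms, then length-regulate to frame level."""
    def __init__(self, d_model=256):
        super().__init__()
        self.pitch = nn.Linear(d_model, d_model)    # stand-in for a pitch predictor + embedding
        self.energy = nn.Linear(d_model, d_model)   # stand-in for an energy predictor + embedding
        self.duration = nn.Linear(d_model, 1)       # stand-in for a duration predictor

    def forward(self, fused_seq):                   # (B, T_phone, d_model)
        x = fused_seq + self.pitch(fused_seq) + self.energy(fused_seq)
        dur = torch.clamp(torch.round(torch.relu(self.duration(x))).long(), min=1)
        # length regulation: repeat each phoneme embedding by its predicted duration
        frames = [xi.repeat_interleave(di.squeeze(-1), dim=0) for xi, di in zip(x, dur)]
        return nn.utils.rnn.pad_sequence(frames, batch_first=True)   # (B, T_frames, d_model)

def fuse_and_decode(text_seq, spk_emb, ctx1, ctx2, style1, style2, adapter, mel_decoder):
    """Additive fusion of all features, variance adaptation, then Mel decoding."""
    fused = text_seq + style1 + style2 + (spk_emb + ctx1 + ctx2).unsqueeze(1)
    mel_emb = adapter(fused)
    return mel_decoder(mel_emb)                     # predicted Mel spectrum

if __name__ == "__main__":
    B, T, D = 1, 32, 256
    text_seq, style1, style2 = (torch.randn(B, T, D) for _ in range(3))
    spk, ctx1, ctx2 = (torch.randn(B, D) for _ in range(3))
    mel_decoder = nn.Linear(D, 80)                  # stand-in for the Mel decoder (80 Mel bins)
    mel = fuse_and_decode(text_seq, spk, ctx1, ctx2, style1, style2,
                          VariationalAdapter(), mel_decoder)
    print(mel.shape)                                # (1, T_frames, 80)
```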
In another embodiment, the speech synthesis method for a speech dialogue scene further comprises the following steps: obtaining a sentence-level context encoding vector of the current utterance from the speech angle based on the predicted Mel spectrum; inputting the first context feature at the text angle sentence level and the second context feature at the speech angle sentence level corresponding to the historical dialogue into a style predictor to obtain a predicted context encoding embedding vector; calculating a loss between the predicted context encoding embedding vector and the current speech angle sentence-level context encoding vector; and updating the parameters of the models, encoders and decoder based on the loss.
In the above embodiment, the sentence-level context encoding embedding vector considered from the text angle and the sentence-level context encoding embedding vector considered from the audio angle are input into the style predictor, which outputs a predicted context encoding embedding vector; this prediction is compared with the sentence-level context encoding embedding vector obtained from the audio corresponding to the current utterance, so that the predicted sentence-level style stays as close as possible to the true style.
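One way this training-time constraint could be realized is sketched below, assuming a small MLP style predictor and an MSE loss between the predicted context encoding and the sentence-level encoding extracted from the current utterance's reference Mel spectrum; the predictor architecture and the choice of loss are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class StylePredictor(nn.Module):
    """Predicts the current utterance's sentence-level (speech angle) context encoding
    from the historical text-angle and speech-angle sentence-level context features."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, text_ctx, speech_ctx):        # each (B, d_model)
        return self.net(torch.cat([text_ctx, speech_ctx], dim=-1))

if __name__ == "__main__":
    predictor = StylePredictor()
    text_ctx, speech_ctx = torch.randn(1, 256), torch.randn(1, 256)
    target_ctx = torch.randn(1, 256)   # encoding extracted from the reference Mel spectrum
    style_loss = nn.functional.mse_loss(predictor(text_ctx, speech_ctx), target_ctx)
    # Backward pass updates the predictor parameters here; in the full model this loss
    # would also back-propagate into the sentence-level context encoders.
    style_loss.backward()
```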
Through the above embodiments, the speech synthesis method for a speech dialogue scene disclosed by the invention can simulate a real human speech conversation and perform all-round feature extraction on the historical dialogue from multi-modal and multi-granularity angles, thereby deepening the understanding of the historical dialogue, generating speech that better fits the historical dialogue, and giving the synthesized speech rich prosody, emotion, style and other characteristics.
Correspondingly, the present invention also provides a speech synthesis system for a speech dialogue scene, which includes a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any one of the above embodiments.
The speech synthesis system for a speech dialogue scene considers both the text information and the paralinguistic information (such as speech prosody) in real voice conversations, and models the emotion, prosody and the like of the historical dialogue from the speech side, so that the information of the historical dialogue is extracted more comprehensively and fully. In addition, the system attends to word-level local information in the historical context on top of the sentence-level global style information; the method and system can therefore pay additional attention to keywords, accents and prosodic changes, making the synthesized speech more natural and better suited to the historical dialogue.
In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer data. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for speech synthesis for a speech dialog scenario, the method comprising:
acquiring voice text data to be synthesized and current speaker information data corresponding to the voice text data to be synthesized, determining a phoneme sequence based on the voice text data to be synthesized, determining a text embedding sequence of the voice text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data;
acquiring historical text data, historical voice data and historical dialog person information data of historical dialog, and determining a historical dialog person information embedding vector based on the historical dialog person information data;
obtaining a sentence-level text embedding vector based on the historical text data, and determining a text angle sentence-level first context feature based on the sentence-level text embedding vector and the historical dialog information embedding vector;
obtaining a sentence-level voice embedding vector based on the historical voice data, and determining a second context feature of a voice angle sentence level based on the sentence-level voice embedding vector and the historical dialogue person information embedding vector;
obtaining a word-level text embedding vector based on the historical text data, and determining a first prosodic style characteristic of a text angle based on the word-level text embedding vector, the historical dialog information embedding vector and the text embedding sequence;
obtaining a word-level voice embedding vector based on the historical voice data, and determining a second prosodic style characteristic of a voice angle based on the word-level voice embedding vector, the historical dialog information embedding vector and the text embedding sequence;
and obtaining a predicted Mel frequency spectrum based on the first context feature, the second context feature, the first prosody style feature, the second prosody style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio corresponding to the to-be-synthesized voice text data based on the Mel frequency spectrum.
2. The method for synthesizing speech facing to a speech dialogue scene according to claim 1, wherein determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence and obtaining a current dialogue person embedding vector based on the current dialogue person information data comprises:
inputting the phoneme sequence into a first encoder to obtain a text embedding sequence;
and inputting the information data of the current dialog person into a second encoder to obtain an embedded vector of the current dialog person.
3. The speech synthesis method for a speech dialog scene according to claim 1, characterized in that a sentence-level text embedding vector is derived based on the historical text data, a first contextual feature at a text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; the method comprises the following steps:
inputting the historical text data into a first pre-training model to obtain a sentence-level text embedding vector;
splicing the sentence-level text embedded vector with the historical dialog information embedded vector to obtain a first spliced vector;
and inputting the first splicing vector into a third encoder to obtain a first context characteristic of a text angle sentence level.
4. The method for speech synthesis oriented to a speech dialog scene according to claim 3, characterized in that a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical dialog information embedding vector; the method comprises the following steps:
inputting the historical voice data into a second pre-training model to obtain a sentence-level voice embedding vector;
splicing the sentence-level voice embedded vector with the historical dialogue person information embedded vector to obtain a second spliced vector;
and inputting the second splicing vector into a fourth encoder to obtain a second context characteristic of the speech angle sentence level.
5. The method of claim 4, wherein word-level text embedding vectors are obtained based on the historical text data, and first prosodic style features of text angles are determined based on the word-level text embedding vectors, the historical dialog information embedding vectors, and the text embedding sequences; the method comprises the following steps:
inputting the historical text data into a third pre-training model to obtain a text embedding vector at a word level;
splicing the text embedded vector of the word level and the historical dialog information embedded vector to obtain a third spliced vector;
and inputting the third splicing vector and the text embedding sequence into a fifth encoder to obtain a first prosody style characteristic of a text angle.
6. The method of claim 5, wherein word-level speech embedding vectors are obtained based on the historical speech data, and second prosodic style features of speech angles are determined based on the word-level speech embedding vectors, the historical dialog information embedding vectors, and the text embedding sequence; the method comprises the following steps:
inputting the historical voice data into a fourth pre-training model to obtain a voice embedding vector of a word level;
splicing the voice embedded vector of the word level and the historical dialogue person information embedded vector to obtain a fourth spliced vector;
and inputting the fourth splicing vector and the text embedding sequence into a sixth encoder to obtain a second prosodic style characteristic of the speech angle.
7. The method for synthesizing speech facing to the speech dialog scene of claim 1, wherein a predicted mel-frequency spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the mel-frequency spectrum; the method comprises the following steps:
adding the first context feature, the second context feature, the first prosody style feature, the second prosody style feature, the text embedding sequence and the current speaker embedding vector to obtain a fusion embedding sequence;
inputting the fusion embedding sequence into a variation adapter, and adding pitch characteristics and energy characteristics to obtain a Mel embedding vector sequence;
inputting the Mel embedded vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain a predicted Mel frequency spectrum;
and inputting the predicted Mel frequency spectrum into a vocoder to obtain the audio corresponding to the voice text data to be synthesized.
8. The method of claim 6, wherein the first pre-training model is a Sentence BERT model, the second pre-training model is a Fine-tuned Wav2vec model, the third pre-training model is a BERT model, and the fourth pre-training model is a Wav2vec model.
9. A speech synthesis system for a speech dialog scenario, the system comprising a processor and a memory, characterized in that the memory has stored therein computer instructions for executing the computer instructions stored in the memory, the system realizing the steps of the method according to any one of claims 1 to 8 when the computer instructions are executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a method according to any one of claims 1 to 8.
CN202211563513.4A 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene Active CN115578995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211563513.4A CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211563513.4A CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Publications (2)

Publication Number Publication Date
CN115578995A true CN115578995A (en) 2023-01-06
CN115578995B CN115578995B (en) 2023-03-24

Family

ID=84590118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211563513.4A Active CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Country Status (1)

Country Link
CN (1) CN115578995B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000187495A (en) * 1998-12-21 2000-07-04 Nec Corp Method and device for synthesizing speech, and recording medium where speech synthesis program is recorded
JP2004184788A (en) * 2002-12-05 2004-07-02 Casio Comput Co Ltd Voice interaction system and program
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN114175143A (en) * 2019-08-03 2022-03-11 谷歌有限责任公司 Controlling expressiveness in an end-to-end speech synthesis system
US20220277728A1 (en) * 2019-09-12 2022-09-01 Microsoft Technology Licensing, Llc Paragraph synthesis with cross utterance features for neural TTS
US20220005460A1 (en) * 2020-07-02 2022-01-06 Tobrox Computing Limited Methods and systems for synthesizing speech audio
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
WO2022249362A1 (en) * 2021-05-26 2022-12-01 株式会社KPMG Ignition Tokyo Speech synthesis to convert text into synthesized speech
CN114495956A (en) * 2022-02-08 2022-05-13 北京百度网讯科技有限公司 Voice processing method, device, equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ILYES REBAI et al.: "Arabic text to speech synthesis based on neural networks for MFCC estimation", 2013 World Congress on Computer and Information Technology (WCCIT)
NING-QIAN WU et al.: "Discourse-Level Prosody Modeling with a Variational Autoencoder for Non-Autoregressive Expressive Speech Synthesis", ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
RAN ZHANG et al.: "A novel hybrid mandarin speech synthesis system using different base units for model training and concatenation", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ZHANG Yajie: "Research on acoustic modeling methods for speech synthesis based on representation learning", China Doctoral Dissertations Full-text Database, Information Science and Technology
ZHI Pengpeng et al.: "DNN-based emotional speech synthesis using speaker adaptation", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition)
LI Ya et al.: "HMM-based speech synthesis system with a stress adjustment model", Journal of Tsinghua University (Science and Technology)
GAO Yingying et al.: "Speech emotion description and prediction for emotional speech synthesis", Journal of Tsinghua University (Science and Technology)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238275A (en) * 2023-08-24 2023-12-15 北京邮电大学 Speech synthesis model training method and device based on common sense reasoning and synthesis method
CN117238275B (en) * 2023-08-24 2024-03-19 北京邮电大学 Speech synthesis model training method and device based on common sense reasoning and synthesis method

Also Published As

Publication number Publication date
CN115578995B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
US9368104B2 (en) System and method for synthesizing human speech using multiple speakers and context
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
US20060229877A1 (en) Memory usage in a text-to-speech system
KR100932538B1 (en) Speech synthesis method and apparatus
CN101131818A (en) Speech synthesis apparatus and method
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
CN115578995B (en) Speech synthesis method, system and storage medium for speech dialogue scene
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
WO2008147649A1 (en) Method for synthesizing speech
CN113838452B (en) Speech synthesis method, apparatus, device and computer storage medium
CN113450760A (en) Method and device for converting text into voice and electronic equipment
US20070055524A1 (en) Speech dialog method and device
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
CN114242035A (en) Speech synthesis method, apparatus, medium, and electronic device
CN114512121A (en) Speech synthesis method, model training method and device
JP5320341B2 (en) Speaking text set creation method, utterance text set creation device, and utterance text set creation program
JP2006189554A (en) Text speech synthesis method and its system, and text speech synthesis program, and computer-readable recording medium recording program thereon
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium
EP1589524B1 (en) Method and device for speech synthesis
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
EP1640968A1 (en) Method and device for speech synthesis

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant