CN115578995A - Speech synthesis method, system and storage medium for speech dialogue scene

Speech synthesis method, system and storage medium for speech dialogue scene

Info

Publication number
CN115578995A
Authority
CN
China
Prior art keywords: text, vector, embedding, level, historical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211563513.4A
Other languages
Chinese (zh)
Other versions
CN115578995B (en)
Inventor
李雅
薛锦隆
邓雅月
高迎明
王风平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202211563513.4A
Publication of CN115578995A
Application granted
Publication of CN115578995B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/02: Speech or audio signals analysis-synthesis techniques for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • G10L 19/04: Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention provides a speech synthesis method, system and storage medium oriented to a speech dialogue scene, comprising the following steps: determining a text embedding sequence based on the speech text data to be synthesized, and obtaining a current speaker embedding vector and a historical speaker information embedding vector; determining a first context feature and a second context feature based on the sentence-level text embedding vector, the sentence-level speech embedding vector and the historical speaker information embedding vector; determining a first prosodic style feature from the text angle based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; determining a second prosodic style feature from the speech angle based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; and obtaining a predicted Mel spectrum based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio based on the Mel spectrum.

Description

Speech synthesis method, system and storage medium for speech dialogue scene
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method, a system, and a storage medium for speech synthesis oriented to a speech dialogue scene.
Background
In a current speech synthesis system, once the model has been trained on the text and speech in a database, speech can be generated for any input text, realizing the basic text-to-speech function. Current dialogue-oriented speech synthesis systems typically comprise a speech synthesis module, a context encoder and an auxiliary encoder. The speech synthesis module generates the corresponding audio from the text and includes a text encoder, a Mel spectrum decoder, a variational adapter, a vocoder and the like; the context encoder models the dialogue history at the sentence level from the text angle; and the auxiliary encoder extracts useful statistical text features, such as semantic and syntactic features.
In the prior art, when speech is synthesized for a speech dialogue scene, only the text content of the historical dialogue is typically given a sentence-level context embedding representation by a context encoder; that is, only global, sentence-level prosodic style information is considered when synthesizing speech for the dialogue scene, and information at other scales in the dialogue is ignored. The inventors found that, in a real conversation, word-level local information such as keywords, stress emphasis and prosodic changes also contributes to understanding the whole dialogue, and people react to specific words or phrases spoken by others. Therefore, although existing speech synthesis methods for speech dialogue scenes can synthesize speech, the synthesized speech fits the historical dialogue speech poorly; how to improve the fit between the synthesized speech and the historical dialogue is thus a technical problem to be urgently solved.
Disclosure of Invention
In view of the above, the present invention provides a speech synthesis method, system and storage medium for a speech dialogue scene, so as to solve one or more problems in the prior art.
According to one aspect of the invention, a speech synthesis method oriented to a speech dialogue scene is disclosed, which comprises the following steps:
acquiring speech text data to be synthesized and current speaker information data corresponding to the speech text data to be synthesized, determining a phoneme sequence based on the speech text data to be synthesized, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data;
acquiring historical text data, historical speech data and historical speaker information data of a historical dialogue, and determining a historical speaker information embedding vector based on the historical speaker information data;
obtaining a sentence-level text embedding vector based on the historical text data, and determining a first context feature at the text angle sentence level based on the sentence-level text embedding vector and the historical speaker information embedding vector;
obtaining a sentence-level speech embedding vector based on the historical speech data, and determining a second context feature at the speech angle sentence level based on the sentence-level speech embedding vector and the historical speaker information embedding vector;
obtaining a word-level text embedding vector based on the historical text data, and determining a first prosodic style feature of the text angle based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence;
obtaining a word-level speech embedding vector based on the historical speech data, and determining a second prosodic style feature of the speech angle based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence;
and obtaining a predicted Mel spectrum based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio corresponding to the speech text data to be synthesized based on the Mel spectrum.
In some embodiments of the present invention, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data, includes:
inputting the phoneme sequence into a first encoder to obtain a text embedding sequence;
and inputting the current speaker information data into a second encoder to obtain the current speaker embedding vector.
In some embodiments of the invention, a sentence-level text embedding vector is derived based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; this comprises the following steps:
inputting the historical text data into a first pre-training model to obtain the sentence-level text embedding vector;
concatenating the sentence-level text embedding vector with the historical speaker information embedding vector to obtain a first concatenated vector;
and inputting the first concatenated vector into a third encoder to obtain the first context feature at the text angle sentence level.
In some embodiments of the present invention, a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector; this comprises the following steps:
inputting the historical speech data into a second pre-training model to obtain the sentence-level speech embedding vector;
concatenating the sentence-level speech embedding vector with the historical speaker information embedding vector to obtain a second concatenated vector;
and inputting the second concatenated vector into a fourth encoder to obtain the second context feature at the speech angle sentence level.
In some embodiments of the invention, a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; this comprises the following steps:
inputting the historical text data into a third pre-training model to obtain the word-level text embedding vector;
concatenating the word-level text embedding vector with the historical speaker information embedding vector to obtain a third concatenated vector;
and inputting the third concatenated vector and the text embedding sequence into a fifth encoder to obtain the first prosodic style feature of the text angle.
In some embodiments of the invention, a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; this comprises the following steps:
inputting the historical speech data into a fourth pre-training model to obtain the word-level speech embedding vector;
concatenating the word-level speech embedding vector with the historical speaker information embedding vector to obtain a fourth concatenated vector;
and inputting the fourth concatenated vector and the text embedding sequence into a sixth encoder to obtain the second prosodic style feature of the speech angle.
In some embodiments of the present invention, a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum; this comprises the following steps:
adding the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector to obtain a fused embedding sequence;
inputting the fused embedding sequence into a variational adapter and adding pitch and energy features to obtain a Mel embedding vector sequence;
inputting the Mel embedding vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain the predicted Mel spectrum;
and inputting the predicted Mel spectrum into a vocoder to obtain the audio corresponding to the speech text data to be synthesized.
In some embodiments of the invention, the first pre-training model is a Sentence BERT model, the second pre-training model is a Fine-tuned Wav2vec model, the third pre-training model is a BERT model, and the fourth pre-training model is a Wav2vec model.
According to another aspect of the present invention, a speech synthesis system for a speech dialogue scene is also disclosed. The system comprises a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any of the embodiments described above.
According to yet another aspect of the invention, a computer-readable storage medium is also disclosed, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any of the embodiments above.
The speech synthesis method, system and storage medium for a speech dialogue scene disclosed in the embodiments of the invention can simulate a real human speech conversation by extracting sentence-level and word-level text and speech features of the historical dialogue from multi-modal and multi-granularity angles, thereby deepening the understanding of the historical dialogue, generating speech that better fits the historical dialogue, and giving the synthesized speech rich prosody, emotion, style and other characteristics.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to what has been particularly described hereinabove, and that the above and other objects that can be achieved with the present invention will be more clearly understood from the following detailed description.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the invention. For purposes of illustrating and describing some portions of the present invention, corresponding parts of the drawings may be exaggerated, i.e., may be larger, relative to other components in an exemplary apparatus actually manufactured according to the present invention. In the drawings:
fig. 1 is a flowchart of a speech synthesis method for a speech dialogue scene according to an embodiment of the present invention.
Fig. 2 is a flowchart illustrating a speech synthesis method for a speech dialogue scene according to another embodiment of the present invention.
Fig. 3 is a schematic diagram of an architecture of a text angle sentence level module according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of a structure of a speech angle sentence level module according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of an architecture of a text angle word level module according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of an architecture of a speech angle word level module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising/having", when used herein, specifies the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
In a voice conversation, people speak the same content in the same context with different tones, so paralinguistic information (such as prosody) in the conversation also plays an important role in context understanding; that is, in a real voice conversation, the speech itself carries additional dialogue information, such as emotional information, which affects the understanding of the historical dialogue. Therefore, to improve how well the synthesized speech fits the historical dialogue in a speech synthesis system for a multi-person dialogue scene, it is necessary to synthesize speech whose intonation, emotion, accent and other stylistic aspects suit the current text, based on historical dialogue information such as the intonation, emotion, accent and utterance content of the historical speakers, thereby improving the naturalness of the synthesized speech and its fit with the historical dialogue.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same reference numerals denote the same or similar components, or the same or similar steps.
Fig. 1 is a flowchart illustrating a speech synthesis method for a speech dialog scene according to an embodiment of the present invention, as shown in fig. 1, the speech synthesis method for the speech dialog scene at least includes steps S10 to S70.
Step S10: acquiring speech text data to be synthesized and current speaker information data corresponding to the speech text data to be synthesized, determining a phoneme sequence based on the speech text data to be synthesized, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data.
In this step, the speech text data to be synthesized serves as the current text and is converted into a corresponding phoneme sequence by a phoneme converter; referring to fig. 2, the current text is converted into a phoneme sequence by the phoneme converter, and the phoneme sequence is then input into a text encoder to obtain the text embedding sequence. In addition, the current speaker information data is input, as a condition, into a speaker encoder to obtain the current speaker embedding vector.
Exemplarily, determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence and obtaining a current speaker embedding vector based on the current speaker information data specifically includes the following steps: inputting the phoneme sequence into a first encoder to obtain the text embedding sequence; and inputting the current speaker information data into a second encoder to obtain the current speaker embedding vector. The first encoder is the text encoder and the second encoder is the speaker encoder.
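As a rough illustration of this front end, the following PyTorch sketch pairs a minimal text encoder (phoneme IDs to a text embedding sequence) with a lookup-table speaker encoder. The Transformer layers, dimensions and the ID-based speaker lookup are illustrative assumptions; the patent does not specify these internals.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """First encoder: phoneme IDs -> text embedding sequence (assumed Transformer layers)."""
    def __init__(self, n_phonemes=100, d_model=256, n_layers=4, n_heads=2):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, phoneme_ids):                # (B, T_phone)
        return self.encoder(self.phoneme_emb(phoneme_ids))   # (B, T_phone, d_model)

class SpeakerEncoder(nn.Module):
    """Second encoder: speaker ID -> speaker embedding vector (assumed lookup table)."""
    def __init__(self, n_speakers=10, d_model=256):
        super().__init__()
        self.table = nn.Embedding(n_speakers, d_model)

    def forward(self, speaker_id):                 # (B,)
        return self.table(speaker_id)              # (B, d_model)

if __name__ == "__main__":
    phonemes = torch.randint(0, 100, (1, 32))      # a 32-phoneme current utterance
    speaker = torch.tensor([3])                    # current speaker index
    text_embedding_seq = TextEncoder()(phonemes)
    speaker_embedding = SpeakerEncoder()(speaker)
    print(text_embedding_seq.shape, speaker_embedding.shape)
```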
Step S20: historical text data, historical speech data and historical speaker information data of the historical dialogue are obtained, and a historical speaker information embedding vector is determined based on the historical speaker information data.
In this step, the historical speaker information embedding vector may be determined with the speaker encoder, i.e., the historical speaker information data is input into the speaker encoder to obtain a historical speaker information embedding vector for each historical speaker. It is understood that the speaker encoder in this step and the second encoder in step S10 may be the same encoder or different encoders.
Step S30: a sentence-level text embedding vector is obtained based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector.
In this step, sentence-level text features are determined based on the historical text data of the historical dialogue. Referring to fig. 2, the historical text data is fed into the text angle sentence level module, and the output of this module is the first context feature at the text angle sentence level.
In one embodiment, a sentence-level text embedding vector is obtained based on the historical text data, and a first context feature at the text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; this specifically comprises the following steps: inputting the historical text data into a first pre-training model to obtain the sentence-level text embedding vector; concatenating the sentence-level text embedding vector with the historical speaker information embedding vector to obtain a first concatenated vector; and inputting the first concatenated vector into a third encoder to obtain the first context feature at the text angle sentence level.
Fig. 3 is a schematic diagram of the architecture of the text angle sentence level module according to an embodiment of the present invention. Referring to fig. 3, the first pre-training model may be a Sentence BERT model, i.e., the historical text data is passed through the Sentence BERT model to obtain the sentence-level text embedding vector, which is then passed through a fully connected layer and concatenated with the historical speaker information embedding vector to obtain the first concatenated vector. The third encoder may specifically include a GRU layer, a fully connected layer and a self-attention module; the first concatenated vector is taken as input, and the third encoder produces the sentence-level context feature from the text angle, thereby realizing understanding and feature extraction of the historical dialogue at the sentence level from the text angle. The third encoder is a sentence-level context encoder for the text angle.
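A minimal sketch of this sentence-level context path is given below, assuming one 768-dimensional Sentence-BERT embedding per historical utterance and mean pooling after self-attention; the dimensions, pooling choice and single attention head are assumptions made for illustration, not the patent's exact configuration.

```python
import torch
import torch.nn as nn

class SentenceLevelContextEncoder(nn.Module):
    """Third encoder (text angle): GRU -> fully connected -> self-attention over history."""
    def __init__(self, d_sent=768, d_spk=256, d_model=256):
        super().__init__()
        self.proj = nn.Linear(d_sent + d_spk, d_model)   # fuse utterance + speaker embedding
        self.gru = nn.GRU(d_model, d_model, batch_first=True)
        self.fc = nn.Linear(d_model, d_model)
        self.attn = nn.MultiheadAttention(d_model, num_heads=1, batch_first=True)

    def forward(self, sent_emb, spk_emb):
        # sent_emb: (B, N_hist, d_sent); spk_emb: (B, N_hist, d_spk)
        x = self.proj(torch.cat([sent_emb, spk_emb], dim=-1))   # first concatenated vector
        x, _ = self.gru(x)
        x = self.fc(x)
        x, _ = self.attn(x, x, x)
        return x.mean(dim=1)          # (B, d_model): sentence-level context feature

if __name__ == "__main__":
    hist_text = torch.randn(1, 5, 768)   # Sentence-BERT embeddings of 5 historical utterances
    hist_spk = torch.randn(1, 5, 256)    # speaker embeddings of the same utterances
    ctx = SentenceLevelContextEncoder()(hist_text, hist_spk)
    print(ctx.shape)                     # torch.Size([1, 256])
```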
Step S40: a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector.
In this step, sentence-level speech features are determined based on the historical speech data of the historical dialogue. Referring to fig. 2, the historical speech data is fed into the speech angle sentence level module, and the output of this module is the second context feature at the speech angle sentence level.
In one embodiment, a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at the speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical speaker information embedding vector; this specifically comprises the following steps: inputting the historical speech data into a second pre-training model to obtain the sentence-level speech embedding vector; concatenating the sentence-level speech embedding vector with the historical speaker information embedding vector to obtain a second concatenated vector; and inputting the second concatenated vector into a fourth encoder to obtain the second context feature at the speech angle sentence level.
Fig. 4 is a schematic diagram of the architecture of the speech angle sentence level module according to an embodiment of the present invention. Referring to fig. 4, the second pre-training model may be a Fine-tuned Wav2vec model, i.e., the historical speech data is passed through the Fine-tuned Wav2vec model to obtain the sentence-level speech embedding vector, which is then passed through a fully connected layer and concatenated with the historical speaker information embedding vector to obtain the second concatenated vector. The fourth encoder may specifically include a GRU layer, a fully connected layer and a self-attention module; the second concatenated vector is taken as input, and the fourth encoder produces the sentence-level context feature from the speech angle, thereby realizing understanding and feature extraction of the historical dialogue at the sentence level from the speech angle. The fourth encoder is a sentence-level context encoder for the speech angle.
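For the speech side, the sentence-level speech embedding could be extracted roughly as sketched below, using the Hugging Face transformers library with a generic wav2vec 2.0 checkpoint as a stand-in; the patent's fine-tuned Wav2vec model and its pooling strategy are not specified, so the checkpoint name and the mean pooling here are assumptions.

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# Stand-in checkpoint; the patent's fine-tuned Wav2vec model is not specified.
name = "facebook/wav2vec2-base-960h"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
wav2vec = Wav2Vec2Model.from_pretrained(name).eval()

def sentence_level_speech_embedding(waveform, sample_rate=16000):
    """Mean-pool frame-level wav2vec 2.0 features into one utterance-level vector."""
    inputs = extractor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        frames = wav2vec(inputs.input_values).last_hidden_state  # (1, T_frames, 768)
    return frames.mean(dim=1)                                    # (1, 768)

if __name__ == "__main__":
    fake_utterance = torch.randn(16000).numpy()   # 1 s of dummy audio at 16 kHz
    print(sentence_level_speech_embedding(fake_utterance).shape)
```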
Step S50: a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence.
In this step, the text embedding sequence is the text embedding sequence corresponding to the speech text data to be synthesized; that is, prosodic style features are further extracted from the text angle at the word level based on the historical text data, the historical speaker information embedding vector and the text embedding sequence. Referring to fig. 2, the first prosodic style feature of the text angle may be obtained by the text angle word level module, i.e., the historical text data is fed into the text angle word level module, and the output of this module is the first prosodic style feature of the text angle.
In one embodiment, a word-level text embedding vector is obtained based on the historical text data, and a first prosodic style feature of the text angle is determined based on the word-level text embedding vector, the historical speaker information embedding vector and the text embedding sequence; this specifically comprises the following steps: inputting the historical text data into a third pre-training model to obtain the word-level text embedding vector; concatenating the word-level text embedding vector with the historical speaker information embedding vector to obtain a third concatenated vector; and inputting the third concatenated vector and the text embedding sequence into a fifth encoder to obtain the first prosodic style feature of the text angle.
Fig. 5 is a schematic diagram of the architecture of the text angle word level module according to an embodiment of the present invention. Referring to fig. 5, the third pre-training model may be a BERT model, i.e., the historical text data is passed through the BERT model to obtain a word-level text embedding vector sequence, which is then concatenated with the historical speaker information embedding vector to obtain the third concatenated vector. The fifth encoder may specifically include convolutional layers and a multi-head attention module; the third concatenated vector and the text embedding sequence corresponding to the speech text data to be synthesized are input together into the fifth encoder, yielding a prosodic style feature of the text angle with the same length as the text embedding sequence. The fifth encoder is a word-level context encoder for the text angle.
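One plausible reading of this word-level module is sketched below: the current text embedding sequence acts as the attention query over the word-level history (keys and values), so the output prosodic style feature has the same length as the text embedding sequence. The convolution stack, head count and dimensions are illustrative assumptions, and the same sketch also stands in for the sixth encoder of step S60.

```python
import torch
import torch.nn as nn

class WordLevelProsodyEncoder(nn.Module):
    """Fifth/sixth encoder sketch: Conv1d stack + multi-head attention.
    Query: current text embedding sequence; Key/Value: word-level history embeddings."""
    def __init__(self, d_hist=768, d_spk=256, d_model=256, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(d_hist + d_spk, d_model)   # third/fourth concatenated vector
        self.convs = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text_embedding_seq, word_emb, spk_emb):
        # text_embedding_seq: (B, T_phone, d_model)
        # word_emb: (B, N_words, d_hist); spk_emb: (B, N_words, d_spk)
        h = self.proj(torch.cat([word_emb, spk_emb], dim=-1))
        h = self.convs(h.transpose(1, 2)).transpose(1, 2)
        style, _ = self.attn(query=text_embedding_seq, key=h, value=h)
        return style              # (B, T_phone, d_model): same length as the text sequence

if __name__ == "__main__":
    text_seq = torch.randn(1, 32, 256)      # current utterance, 32 phonemes
    word_hist = torch.randn(1, 40, 768)     # 40 history words (BERT or Wav2vec features)
    spk_hist = torch.randn(1, 40, 256)      # speaker embedding repeated per word
    out = WordLevelProsodyEncoder()(text_seq, word_hist, spk_hist)
    print(out.shape)                        # torch.Size([1, 32, 256])
```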
Step S60: a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence.
In this step, prosodic style features are further extracted from the speech angle at the word level based on the historical speech data, the historical speaker information embedding vector and the text embedding sequence. Referring to fig. 2, the second prosodic style feature of the speech angle may be obtained by the speech angle word level module, i.e., the historical speech data is fed into the speech angle word level module, and the output of this module is the second prosodic style feature of the speech angle.
In one embodiment, a word-level speech embedding vector is obtained based on the historical speech data, and a second prosodic style feature of the speech angle is determined based on the word-level speech embedding vector, the historical speaker information embedding vector and the text embedding sequence; this specifically comprises the following steps: inputting the historical speech data into a fourth pre-training model to obtain the word-level speech embedding vector; concatenating the word-level speech embedding vector with the historical speaker information embedding vector to obtain a fourth concatenated vector; and inputting the fourth concatenated vector and the text embedding sequence into a sixth encoder to obtain the second prosodic style feature of the speech angle.
Fig. 6 is a schematic diagram of the architecture of the speech angle word level module according to an embodiment of the present invention. Referring to fig. 6, the fourth pre-training model may be a Wav2vec model, i.e., the historical speech data is passed through the Wav2vec model to obtain a word-level speech embedding vector sequence, which is then concatenated with the historical speaker information embedding vector to obtain the fourth concatenated vector. The sixth encoder may specifically include convolutional layers and a multi-head attention module; the fourth concatenated vector and the text embedding sequence corresponding to the speech text data to be synthesized are input together into the sixth encoder, yielding a prosodic style feature of the speech angle with the same length as the text embedding sequence. The sixth encoder is a word-level context encoder for the speech angle.
Step S70: a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum.
In this step, the text embedding sequence corresponding to the speech text to be synthesized is fused with the sentence-level first and second context features and with the word-level first and second prosodic style features, so that prosody, emotion, style and other characteristics of the historical dialogue are taken into account from both the speech side and the text side, yielding more natural audio that closely fits the historical dialogue.
In an embodiment, a predicted Mel spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the Mel spectrum; this comprises the following steps: adding the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector to obtain a fused embedding sequence; inputting the fused embedding sequence into a variational adapter and adding pitch and energy features to obtain a Mel embedding vector sequence; inputting the Mel embedding vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain the predicted Mel spectrum; and inputting the predicted Mel spectrum into a vocoder to obtain the audio corresponding to the speech text data to be synthesized.
Referring to fig. 2, the text embedding sequence, to which the text angle and speech angle sentence-level and word-level historical dialogue features have been added, passes through the variational adapter, where pitch and energy feature information is added, and then through a length adapter, which expands the text embedding sequence fused with the multi-modal features into a Mel embedding vector sequence with the same length as the target Mel spectrum. The Mel embedding vector sequence, the first context feature and the second context feature are then input into the Mel decoder to obtain the predicted Mel spectrum; finally, the predicted Mel spectrum is input into the vocoder to obtain the finally predicted audio.
In this embodiment, after the word-level and sentence-level information of the speech angle and the text angle has been added, the text embedding sequence corresponding to the speech text data to be synthesized further passes through the variational adapter, the Mel decoder, the vocoder and the like to generate the audio corresponding to the speech text data to be synthesized. The speech synthesis method for a speech dialogue scene therefore considers not only the features of the two modalities of speech and text, but also historical information at different levels and granularities, including the sentence level and the word level, realizing end-to-end multi-scale multi-modal dialogue speech synthesis.
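A compact sketch of this fusion and decoding flow is shown below, assuming a FastSpeech 2-style variance/length adaptation. The pitch, energy and duration predictors are reduced to single linear layers, the Mel decoder is a stand-in linear layer that ignores the extra context conditioning described above, and a neural vocoder (e.g. HiFi-GAN) would follow in practice; all of these simplifications are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class VariationalAdapter(nn.Module):
    """Toy variance adaptor: add pitch/energy terms, then length-regulate to frame level."""
    def __init__(self, d_model=256):
        super().__init__()
        self.pitch = nn.Linear(d_model, d_model)    # stand-in for a pitch predictor + embedding
        self.energy = nn.Linear(d_model, d_model)   # stand-in for an energy predictor + embedding
        self.duration = nn.Linear(d_model, 1)       # stand-in for a duration predictor

    def forward(self, fused_seq):                   # (B, T_phone, d_model)
        x = fused_seq + self.pitch(fused_seq) + self.energy(fused_seq)
        dur = torch.clamp(torch.round(torch.relu(self.duration(x))).long(), min=1)
        # length regulation: repeat each phoneme embedding by its predicted duration
        frames = [xi.repeat_interleave(di.squeeze(-1), dim=0) for xi, di in zip(x, dur)]
        return nn.utils.rnn.pad_sequence(frames, batch_first=True)   # (B, T_frames, d_model)

def fuse_and_decode(text_seq, spk_emb, ctx1, ctx2, style1, style2, adapter, mel_decoder):
    """Additive fusion of all features, variance adaptation, then Mel decoding."""
    fused = text_seq + style1 + style2 + (spk_emb + ctx1 + ctx2).unsqueeze(1)
    mel_emb = adapter(fused)
    return mel_decoder(mel_emb)                     # predicted Mel spectrum

if __name__ == "__main__":
    B, T, D = 1, 32, 256
    text_seq, style1, style2 = (torch.randn(B, T, D) for _ in range(3))
    spk, ctx1, ctx2 = (torch.randn(B, D) for _ in range(3))
    mel_decoder = nn.Linear(D, 80)                  # stand-in for the Mel decoder (80 Mel bins)
    mel = fuse_and_decode(text_seq, spk, ctx1, ctx2, style1, style2,
                          VariationalAdapter(), mel_decoder)
    print(mel.shape)                                # (1, T_frames, 80)
```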
In another embodiment, the speech synthesis method for a speech dialogue scene further comprises the following steps: obtaining a sentence-level context encoding vector of the current utterance from the speech angle based on the predicted Mel spectrum; inputting the first context feature at the text angle sentence level and the second context feature at the speech angle sentence level corresponding to the historical dialogue into a style predictor to obtain a predicted context encoding embedding vector; calculating a loss between the predicted context encoding embedding vector and the current speech angle sentence-level context encoding vector; and updating the parameters of the models, encoders and decoder based on the loss.
In the above embodiment, the sentence-level context encoding embedding vector considered from the text angle and the sentence-level context encoding embedding vector considered from the audio angle are input into the style predictor, which outputs a predicted context encoding embedding vector; this prediction is compared with the sentence-level context encoding embedding vector obtained from the audio corresponding to the current utterance, so that the predicted sentence-level style stays as close as possible to the true style.
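One way this training-time constraint could be realized is sketched below, assuming a small MLP style predictor and an MSE loss between the predicted context encoding and the sentence-level encoding extracted from the current utterance's reference Mel spectrum; the predictor architecture and the choice of loss are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class StylePredictor(nn.Module):
    """Predicts the current utterance's sentence-level (speech angle) context encoding
    from the historical text-angle and speech-angle sentence-level context features."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, text_ctx, speech_ctx):        # each (B, d_model)
        return self.net(torch.cat([text_ctx, speech_ctx], dim=-1))

if __name__ == "__main__":
    predictor = StylePredictor()
    text_ctx, speech_ctx = torch.randn(1, 256), torch.randn(1, 256)
    target_ctx = torch.randn(1, 256)   # encoding extracted from the reference Mel spectrum
    style_loss = nn.functional.mse_loss(predictor(text_ctx, speech_ctx), target_ctx)
    # Backward pass updates the predictor parameters here; in the full model this loss
    # would also back-propagate into the sentence-level context encoders.
    style_loss.backward()
```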
Through the above embodiments, the speech synthesis method for a speech dialogue scene disclosed by the invention can simulate a real human speech conversation and perform all-round feature extraction on the historical dialogue from multi-modal and multi-granularity angles, thereby deepening the understanding of the historical dialogue, generating speech that better fits the historical dialogue, and giving the synthesized speech rich prosody, emotion, style and other characteristics.
Correspondingly, the present invention also provides a speech synthesis system for a speech dialogue scene, which includes a processor and a memory, wherein the memory stores computer instructions and the processor is configured to execute the computer instructions stored in the memory; when the computer instructions are executed by the processor, the system implements the steps of the method according to any one of the above embodiments.
The speech synthesis system for a speech dialogue scene considers both the text information and the paralinguistic information (such as speech prosody) in real voice conversations, and models the emotion, prosody and the like of the historical dialogue from the speech side, so that the information of the historical dialogue is extracted more comprehensively and fully. In addition, the system attends to word-level local information in the historical context on top of the sentence-level global style information; the method and system can therefore pay additional attention to keywords, accents and prosodic changes, making the synthesized speech more natural and better suited to the historical dialogue.
In addition, the invention also discloses a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps of the method according to any of the above embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments can be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer data. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for speech synthesis for a speech dialog scenario, the method comprising:
acquiring voice text data to be synthesized and current speaker information data corresponding to the voice text data to be synthesized, determining a phoneme sequence based on the voice text data to be synthesized, determining a text embedding sequence of the voice text data to be synthesized based on the phoneme sequence, and obtaining a current speaker embedding vector based on the current speaker information data;
acquiring historical text data, historical voice data and historical dialog person information data of historical dialog, and determining a historical dialog person information embedding vector based on the historical dialog person information data;
obtaining a sentence-level text embedding vector based on the historical text data, and determining a text angle sentence-level first context feature based on the sentence-level text embedding vector and the historical dialog information embedding vector;
obtaining a sentence-level voice embedding vector based on the historical voice data, and determining a second context feature of a voice angle sentence level based on the sentence-level voice embedding vector and the historical dialogue person information embedding vector;
obtaining a word-level text embedding vector based on the historical text data, and determining a first prosodic style characteristic of a text angle based on the word-level text embedding vector, the historical dialog information embedding vector and the text embedding sequence;
obtaining a word-level voice embedding vector based on the historical voice data, and determining a second prosodic style characteristic of a voice angle based on the word-level voice embedding vector, the historical dialog information embedding vector and the text embedding sequence;
and obtaining a predicted Mel frequency spectrum based on the first context feature, the second context feature, the first prosody style feature, the second prosody style feature, the text embedding sequence and the current speaker embedding vector, and determining the audio corresponding to the to-be-synthesized voice text data based on the Mel frequency spectrum.
2. The method for synthesizing speech facing to a speech dialogue scene according to claim 1, wherein determining a text embedding sequence of the speech text data to be synthesized based on the phoneme sequence and obtaining a current dialogue person embedding vector based on the current dialogue person information data comprises:
inputting the phoneme sequence into a first encoder to obtain a text embedding sequence;
and inputting the information data of the current dialog person into a second encoder to obtain an embedded vector of the current dialog person.
3. The speech synthesis method for a speech dialog scene according to claim 1, characterized in that a sentence-level text embedding vector is derived based on the historical text data, a first contextual feature at a text angle sentence level is determined based on the sentence-level text embedding vector and the historical speaker information embedding vector; the method comprises the following steps:
inputting the historical text data into a first pre-training model to obtain a sentence-level text embedding vector;
splicing the sentence-level text embedded vector with the historical dialog information embedded vector to obtain a first spliced vector;
and inputting the first splicing vector into a third encoder to obtain a first context characteristic of a text angle sentence level.
4. The method for speech synthesis oriented to a speech dialog scene according to claim 3, characterized in that a sentence-level speech embedding vector is obtained based on the historical speech data, and a second context feature at speech angle sentence level is determined based on the sentence-level speech embedding vector and the historical dialog information embedding vector; the method comprises the following steps:
inputting the historical voice data into a second pre-training model to obtain a sentence-level voice embedding vector;
splicing the sentence-level voice embedded vector with the historical dialogue person information embedded vector to obtain a second spliced vector;
and inputting the second splicing vector into a fourth encoder to obtain a second context characteristic of the speech angle sentence level.
5. The method of claim 4, wherein word-level text embedding vectors are obtained based on the historical text data, and first prosodic style features of text angles are determined based on the word-level text embedding vectors, the historical dialog information embedding vectors, and the text embedding sequences; the method comprises the following steps:
inputting the historical text data into a third pre-training model to obtain a text embedding vector at a word level;
splicing the text embedded vector of the word level and the historical dialog information embedded vector to obtain a third spliced vector;
and inputting the third splicing vector and the text embedding sequence into a fifth encoder to obtain a first prosody style characteristic of a text angle.
6. The method of claim 5, wherein word-level speech embedding vectors are obtained based on the historical speech data, and second prosodic style features of speech angles are determined based on the word-level speech embedding vectors, the historical dialog information embedding vectors, and the text embedding sequence; the method comprises the following steps:
inputting the historical voice data into a fourth pre-training model to obtain a voice embedding vector of a word level;
splicing the voice embedded vector of the word level and the historical dialogue person information embedded vector to obtain a fourth spliced vector;
and inputting the fourth splicing vector and the text embedding sequence into a sixth encoder to obtain a second prosodic style characteristic of the speech angle.
7. The method for synthesizing speech facing to the speech dialog scene of claim 1, wherein a predicted mel-frequency spectrum is obtained based on the first context feature, the second context feature, the first prosodic style feature, the second prosodic style feature, the text embedding sequence and the current speaker embedding vector, and the audio corresponding to the speech text data to be synthesized is determined based on the mel-frequency spectrum; the method comprises the following steps:
adding the first context feature, the second context feature, the first prosody style feature, the second prosody style feature, the text embedding sequence and the current speaker embedding vector to obtain a fusion embedding sequence;
inputting the fusion embedding sequence into a variation adapter, and adding pitch characteristics and energy characteristics to obtain a Mel embedding vector sequence;
inputting the Mel embedded vector sequence, the first context feature and the second context feature into a Mel decoder for decoding to obtain a predicted Mel frequency spectrum;
and inputting the predicted Mel frequency spectrum into a vocoder to obtain the audio corresponding to the voice text data to be synthesized.
8. The method of claim 6, wherein the first pre-training model is a Sentence BERT model, the second pre-training model is a Fine-tuned Wav2vec model, the third pre-training model is a BERT model, and the fourth pre-training model is a Wav2vec model.
9. A speech synthesis system for a speech dialog scenario, the system comprising a processor and a memory, characterized in that the memory has stored therein computer instructions for executing the computer instructions stored in the memory, the system realizing the steps of the method according to any one of claims 1 to 8 when the computer instructions are executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a method according to any one of claims 1 to 8.
CN202211563513.4A 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene Active CN115578995B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211563513.4A CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211563513.4A CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Publications (2)

Publication Number Publication Date
CN115578995A true CN115578995A (en) 2023-01-06
CN115578995B CN115578995B (en) 2023-03-24

Family

ID=84590118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211563513.4A Active CN115578995B (en) 2022-12-07 2022-12-07 Speech synthesis method, system and storage medium for speech dialogue scene

Country Status (1)

Country Link
CN (1) CN115578995B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000187495A (en) * 1998-12-21 2000-07-04 Nec Corp Method and device for synthesizing speech, and recording medium where speech synthesis program is recorded
JP2004184788A (en) * 2002-12-05 2004-07-02 Casio Comput Co Ltd Voice interaction system and program
CN105355193A (en) * 2015-10-30 2016-02-24 百度在线网络技术(北京)有限公司 Speech synthesis method and device
CN114175143A (en) * 2019-08-03 2022-03-11 谷歌有限责任公司 Controlling expressiveness in an end-to-end speech synthesis system
US20220277728A1 (en) * 2019-09-12 2022-09-01 Microsoft Technology Licensing, Llc Paragraph synthesis with cross utterance features for neural TTS
US20220005460A1 (en) * 2020-07-02 2022-01-06 Tobrox Computing Limited Methods and systems for synthesizing speech audio
CN113539231A (en) * 2020-12-30 2021-10-22 腾讯科技(深圳)有限公司 Audio processing method, vocoder, device, equipment and storage medium
WO2022249362A1 (en) * 2021-05-26 2022-12-01 株式会社KPMG Ignition Tokyo Speech synthesis to convert text into synthesized speech
CN114495956A (en) * 2022-02-08 2022-05-13 北京百度网讯科技有限公司 Voice processing method, device, equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ILYES REBAI et al.: "Arabic text to speech synthesis based on neural networks for MFCC estimation", 2013 World Congress on Computer and Information Technology (WCCIT)
NING-QIAN WU et al.: "Discourse-Level Prosody Modeling with a Variational Autoencoder for Non-Autoregressive Expressive Speech Synthesis", ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
RAN ZHANG et al.: "A novel hybrid mandarin speech synthesis system using different base units for model training and concatenation", 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
ZHANG Yajie: "Research on acoustic modeling methods for speech synthesis based on representation learning", China Doctoral Dissertations Full-text Database, Information Science and Technology
ZHI Pengpeng et al.: "DNN-based emotional speech synthesis using speaker adaptation", Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition)
LI Ya et al.: "HMM-based speech synthesis system with a stress adjustment model", Journal of Tsinghua University (Science and Technology)
GAO Yingying et al.: "Speech emotion description and prediction for emotional speech synthesis", Journal of Tsinghua University (Science and Technology)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238275A (en) * 2023-08-24 2023-12-15 北京邮电大学 Speech synthesis model training method and device based on common sense reasoning and synthesis method
CN117238275B (en) * 2023-08-24 2024-03-19 北京邮电大学 Speech synthesis model training method and device based on common sense reasoning and synthesis method

Also Published As

Publication number Publication date
CN115578995B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
US9368104B2 (en) System and method for synthesizing human speech using multiple speakers and context
CN112309366B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
US20060229877A1 (en) Memory usage in a text-to-speech system
KR100932538B1 (en) Speech synthesis method and apparatus
CN101131818A (en) Speech synthesis apparatus and method
CN113808571B (en) Speech synthesis method, speech synthesis device, electronic device and storage medium
CN115578995B (en) Speech synthesis method, system and storage medium for speech dialogue scene
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN114464162B (en) Speech synthesis method, neural network model training method, and speech synthesis model
CN112309367A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN113421550A (en) Speech synthesis method, device, readable medium and electronic equipment
WO2008147649A1 (en) Method for synthesizing speech
CN113838452B (en) Speech synthesis method, apparatus, device and computer storage medium
CN113450760A (en) Method and device for converting text into voice and electronic equipment
US20070055524A1 (en) Speech dialog method and device
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
CN114242035A (en) Speech synthesis method, apparatus, medium, and electronic device
CN114512121A (en) Speech synthesis method, model training method and device
JP5320341B2 (en) Speaking text set creation method, utterance text set creation device, and utterance text set creation program
JP2006189554A (en) Text speech synthesis method and its system, and text speech synthesis program, and computer-readable recording medium recording program thereon
CN114373445B (en) Voice generation method and device, electronic equipment and storage medium
EP1589524B1 (en) Method and device for speech synthesis
CN117636842B (en) Voice synthesis system and method based on prosody emotion migration
EP1640968A1 (en) Method and device for speech synthesis

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant