CN117953854A - Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium - Google Patents


Info

Publication number
CN117953854A
Authority
CN
China
Prior art keywords
dialect
audio data
style
target
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410250198.2A
Other languages
Chinese (zh)
Other versions
CN117953854B (en)
Inventor
张硕
刘刚
苏江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dark Matter Beijing Intelligent Technology Co ltd
DMAI Guangzhou Co Ltd
Original Assignee
Dark Matter Beijing Intelligent Technology Co ltd
DMAI Guangzhou Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dark Matter Beijing Intelligent Technology Co ltd, DMAI Guangzhou Co Ltd filed Critical Dark Matter Beijing Intelligent Technology Co ltd
Priority to CN202410250198.2A priority Critical patent/CN117953854B/en
Publication of CN117953854A publication Critical patent/CN117953854A/en
Application granted granted Critical
Publication of CN117953854B publication Critical patent/CN117953854B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Computing Systems (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a multi-dialect speech synthesis method and device, an electronic device and a readable storage medium, wherein the method includes the following steps: collecting dialect audio data of a target user and a virtual communication object selected by the target user; retrieving reply audio data according to the communication content in the dialect audio data and the target dialect used; retrieving reference audio data of the virtual communication object according to the virtual communication object; extracting text information and dialect style features of the target dialect from the reply audio data, extracting speech style features of the virtual communication object from the reference audio data, and generating a target mel spectrum according to the text information, the dialect style features and the speech style features; and inputting the target mel spectrum into a vocoder to obtain target audio data, and outputting the target audio data to the target user. By this method, the intelligibility and naturalness of the output target audio data are improved.

Description

Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of man-machine interaction technologies, and in particular, to a method and apparatus for synthesizing speech in multiple dialects, an electronic device, and a readable storage medium.
Background
In human-machine interaction scenarios (e.g., voice assistants, intelligent customer service, and virtual interaction systems), a machine or human-machine interaction model (e.g., a voice assistant model, an intelligent customer service model, etc.) typically uses Mandarin to conduct voice communication with users. Considering that users may come from different regions of the country and speak different dialects (such as Cantonese and Minnan), allowing users in different regions to communicate with the machine in their own dialect helps improve the interaction experience and makes human-machine interaction smoother.
Currently, in order for a human-computer interaction model to output multiple dialect audios with the same timbre (the same person's voice), for example, to communicate with users in different regions (i.e., in different dialects) using the voice of a user A, audio of user A speaking the different dialects needs to be collected first. In general, a user only speaks one or two dialects, so if audio of user A in different dialects is needed, user A has to imitate dialects he is not familiar with when recording. As a result, the dialect audio recorded by user A is not authentic enough, not natural enough, or even difficult to understand.
If such recorded dialect audio, which is not authentic enough, not natural enough, and even difficult to understand, is used for model training of the human-computer interaction model, the dialects learned by the model will not be proper and authentic, so that the dialect speech output by the model during human-computer interaction is unnatural and even difficult to understand, i.e., its intelligibility and naturalness are low.
Disclosure of Invention
Accordingly, the present application is directed to a multi-dialect speech synthesis method and apparatus, an electronic device, and a readable storage medium, so that the dialects of the audio data output by a machine or human-machine interaction model are more authentic, and the intelligibility and naturalness of the audio data output by the machine or human-machine interaction model are improved.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech from multiple dialects, including:
in the human-computer interaction process, dialect audio data of a target user are collected, and a virtual communication object selected by the target user is received;
retrieving reply audio data matched with the communication content and the target dialect according to the communication content and the target dialect used in the dialect audio data; and retrieving reference audio data of the virtual communication object according to the virtual communication object;
extracting text information and dialect style characteristics of the target dialect from the reply audio data, and extracting voice style characteristics of the virtual communication object from the reference audio data so as to generate a target Mel frequency spectrum according to the text information, the dialect style characteristics and the voice style characteristics;
inputting the target mel spectrum into a vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data, spoken in the target dialect with the speech style of the virtual communication object, that replies to the dialect audio data of the target user.
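For illustration only, the following Python sketch strings the four steps above together; every callable and parameter name is a hypothetical placeholder introduced for this sketch and is not part of the claimed method.

    from typing import Any, Callable

    def synthesize_multi_dialect_reply(
        dialect_audio: Any,                        # collected dialect audio data of the target user
        virtual_object_id: str,                    # the selected virtual communication object
        recognize: Callable,                       # dialect_audio -> (communication_content, target_dialect)
        retrieve_reply: Callable,                  # (content, target_dialect) -> reply_audio
        retrieve_reference: Callable,              # virtual_object_id -> reference_audio
        extract_text_and_dialect_style: Callable,  # reply_audio -> (text_info, dialect_style)
        extract_voice_style: Callable,             # reference_audio -> voice style of the virtual object
        decode_to_mel: Callable,                   # (text, dialect_style, voice_style, object_id) -> mel
        vocoder: Callable,                         # mel -> waveform
    ) -> Any:
        content, target_dialect = recognize(dialect_audio)
        reply_audio = retrieve_reply(content, target_dialect)
        reference_audio = retrieve_reference(virtual_object_id)
        text_info, dialect_style = extract_text_and_dialect_style(reply_audio)
        voice_style = extract_voice_style(reference_audio)
        target_mel = decode_to_mel(text_info, dialect_style, voice_style, virtual_object_id)
        return vocoder(target_mel)                 # target audio data output to the target user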
With reference to the first aspect, the embodiment of the present application provides a first possible implementation manner of the first aspect, wherein the speech style features include: timbre characteristics, speaking habit characteristics and voice rhythm characteristics; the extracting text information and the dialect style feature of the target dialect from the reply audio data, extracting the voice style feature of the virtual communication object from the reference audio data, so as to generate a target mel frequency spectrum according to the text information, the dialect style feature and the voice style feature, including:
Extracting fundamental frequency information from the reply audio data by using a trained intonation extractor;
Extracting text information and dialect style features of the target dialect from the reply audio data using a trained speech style and text extractor;
Extracting the timbre features and the speaking habit features of the virtual communication object from the reference audio data using a trained global speech feature encoder;
Extracting the phonetic prosody features of the virtual communication object from the reference audio data using a trained local prosody feature encoder;
Inputting the fundamental frequency information, the text information, the dialect style features, the timbre features, the speaking habit features, the speech prosody features and the identity information of the virtual communication object into a trained decoder, and outputting the target mel spectrum.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present application provides a second possible implementation manner of the first aspect, wherein the speech style and text extractor includes a plurality of conformer block layers connected in sequence; between two adjacent conformer block layers, the output of the former conformer block layer is used as the input of the next conformer block layer; the extracting text information and dialect style features of the target dialect from the reply audio data using the trained speech styles and text extractor includes:
After the reply audio data is input into the speech style and text extractor, taking the output of a target conformer block layer in the speech style and text extractor as the text information and the dialect style features of the target dialect extracted from the reply audio data; wherein the target conformer block layer is a layer near the last conformer block layer in the speech style and text extractor.
With reference to the first possible implementation manner of the first aspect, the embodiment of the present application provides a third possible implementation manner of the first aspect, wherein the intonation extractor, the speech style and text extractor, the global speech feature encoder, the local prosodic feature encoder and the decoder are trained by:
model training is carried out on an initial voice style and text extractor to be trained by using an audio training sample, and the voice style and text extractor after training is completed is obtained;
Inputting a first-dialect audio training sample of a sample object into an initial intonation extractor to be trained and the trained speech style and text extractor, extracting sample fundamental frequency information from the first-dialect audio training sample through the initial intonation extractor, and extracting sample text information and sample dialect style features from the first-dialect audio training sample through the speech style and text extractor;
Inputting the first-dialect audio training sample into an initial global speech feature encoder and an initial local prosody feature encoder to be trained, extracting sample timbre features and sample speaking habit features of the sample object from the first-dialect audio training sample through the initial global speech feature encoder, and extracting sample speech prosody features of the sample object from the first-dialect audio training sample through the initial local prosody feature encoder;
inputting the sample fundamental frequency information, the sample text information, the sample dialect style characteristics, the sample tone color characteristics, the sample speaking habit characteristics, the sample voice rhythm characteristics and the identity information of the sample object into an initial decoder to be trained, and outputting a sample Mel frequency spectrum;
generating a label mel spectrum according to the first-dialect audio training sample;
Calculating a loss function value according to the sample mel frequency spectrum and the label mel frequency spectrum;
Judging whether the training completion condition is met according to the loss function value;
When the training completion condition is met, taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder;
When the training completion condition is not met, updating the learnable parameters in the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder by using the loss function value, and returning to the step of inputting a first-dialect audio training sample of a sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, until the calculated loss function value meets the training completion condition, and then taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
With reference to the third possible implementation manner of the first aspect, the embodiment of the present application provides a fourth possible implementation manner of the first aspect, wherein the audio training samples include a pre-training sample and a second dialect audio training sample, the pre-training sample includes a mandarin training sample and a third dialect audio training sample, and the number of mandarin training samples is greater than the number of third dialect audio training samples; model training is carried out on an initial voice style and text extractor to be trained by using an audio training sample to obtain the trained voice style and text extractor, and the method comprises the following steps:
Performing model pre-training on the initial speech style and text extractor to be trained by using the pre-training sample to obtain a preliminary speech style and text extractor;
and performing model training on the preliminary speech style and text extractor by using the second dialect audio training sample to obtain the trained speech style and text extractor.
With reference to the first aspect, the embodiment of the present application provides a fifth possible implementation manner of the first aspect, wherein the retrieving, according to the communication content and the target dialect used in the dialect audio data, the reply audio data matching the communication content and the target dialect; and retrieving reference audio data of the virtual communication object according to the virtual communication object, comprising:
Retrieving the audio data using the target dialect from the pre-stored audio data of each dialect according to the target dialect used in the dialect audio data;
Retrieving reply audio data for replying to the communication content from the audio data using the target dialect according to the communication content in the dialect audio data;
And retrieving the reference audio data of the virtual communication object from the pre-stored reference audio data of each object according to the identification information of the virtual communication object.
In a second aspect, an embodiment of the present application further provides a speech synthesis apparatus for multi-dialects, including:
The acquisition module is used for collecting dialect audio data of a target user and receiving a virtual communication object selected by the target user in the human-computer interaction process;
The retrieving module is used for retrieving reply audio data matched with the communication content and the target dialect according to the communication content in the dialect audio data and the target dialect used, and for retrieving the reference audio data of the virtual communication object according to the virtual communication object;
The extraction module is used for extracting text information and dialect style characteristics of the target dialect from the reply audio data, extracting voice style characteristics of the virtual communication object from the reference audio data, and generating a target Mel frequency spectrum according to the text information, the dialect style characteristics and the voice style characteristics;
The output module is used for inputting the target mel spectrum into a vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data, spoken in the target dialect with the speech style of the virtual communication object, that replies to the dialect audio data of the target user.
With reference to the second aspect, embodiments of the present application provide a first possible implementation manner of the second aspect, where the speech style features include: timbre characteristics, speaking habit characteristics and voice rhythm characteristics; the extracting module is configured to, when extracting text information and a dialect style feature of the target dialect from the reply audio data, extract a speech style feature of the virtual communication object from the reference audio data, so as to generate a target mel spectrum according to the text information, the dialect style feature and the speech style feature, specifically:
Extracting fundamental frequency information from the reply audio data by using a trained intonation extractor;
Extracting text information and dialect style features of the target dialect from the reply audio data using a trained speech style and text extractor;
Extracting the timbre features and the speaking habit features of the virtual communication object from the reference audio data using a trained global speech feature encoder;
Extracting the phonetic prosody features of the virtual communication object from the reference audio data using a trained local prosody feature encoder;
Inputting the fundamental frequency information, the text information, the dialect style features, the timbre features, the speaking habit features, the speech prosody features and the identity information of the virtual communication object into a trained decoder, and outputting the target mel spectrum.
With reference to the first possible implementation manner of the second aspect, an embodiment of the present application provides a second possible implementation manner of the second aspect, where the speech style and text extractor includes a plurality of conformer block layers connected in sequence; between two adjacent conformer block layers, the output of the former conformer block layer is used as the input of the next conformer block layer; the extraction module is specifically configured to, when extracting text information and dialect style features of the target dialect from the reply audio data using the trained speech style and text extractor:
After the reply audio data is input into the speech style and text extractor, taking the output of a target conformer block layer in the speech style and text extractor as the text information and the dialect style features of the target dialect extracted from the reply audio data; wherein the target conformer block layer is a layer near the last conformer block layer in the speech style and text extractor.
With reference to the first possible implementation manner of the second aspect, the embodiment of the present application provides a third possible implementation manner of the second aspect, where the apparatus further includes a training module, and the training module is configured to train to obtain the intonation extractor, the speech style and text extractor, the global speech feature encoder, the local prosodic feature encoder, and the decoder by:
model training is carried out on an initial voice style and text extractor to be trained by using an audio training sample, and the voice style and text extractor after training is completed is obtained;
Inputting a first-dialect audio training sample of a sample object into an initial intonation extractor to be trained and the trained speech style and text extractor, extracting sample fundamental frequency information from the first-dialect audio training sample through the initial intonation extractor, and extracting sample text information and sample dialect style features from the first-dialect audio training sample through the speech style and text extractor;
Inputting the first-dialect audio training sample into an initial global speech feature encoder and an initial local prosody feature encoder to be trained, extracting sample timbre features and sample speaking habit features of the sample object from the first-dialect audio training sample through the initial global speech feature encoder, and extracting sample speech prosody features of the sample object from the first-dialect audio training sample through the initial local prosody feature encoder;
inputting the sample fundamental frequency information, the sample text information, the sample dialect style characteristics, the sample tone color characteristics, the sample speaking habit characteristics, the sample voice rhythm characteristics and the identity information of the sample object into an initial decoder to be trained, and outputting a sample Mel frequency spectrum;
generating a label mel spectrum according to the first-dialect audio training sample;
Calculating a loss function value according to the sample mel frequency spectrum and the label mel frequency spectrum;
Judging whether the training completion condition is met according to the loss function value;
When the training completion condition is met, taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder;
When the training completion condition is not met, updating the learnable parameters in the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder by using the loss function value, and returning to the step of inputting a first-dialect audio training sample of a sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, until the calculated loss function value meets the training completion condition, and then taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
With reference to the third possible implementation manner of the second aspect, the embodiment of the present application provides a fourth possible implementation manner of the second aspect, wherein the audio training samples include a pre-training sample and a second dialect audio training sample, the pre-training sample includes a mandarin training sample and a third dialect audio training sample, and the number of mandarin training samples is greater than the number of third dialect audio training samples; the training module is used for performing model training on an initial voice style and text extractor to be trained by using an audio training sample, and is specifically used for:
Performing model pre-training on the initial speech style and text extractor to be trained by using the pre-training sample to obtain a preliminary speech style and text extractor;
and performing model training on the preliminary speech style and text extractor by using the second dialect audio training sample to obtain the trained speech style and text extractor.
With reference to the second aspect, an embodiment of the present application provides a fifth possible implementation manner of the second aspect, where the retrieving module, when retrieving reply audio data matching the communication content and the target dialect according to the communication content and the target dialect in the dialect audio data and retrieving the reference audio data of the virtual communication object according to the virtual communication object, is specifically configured to:
Retrieve the audio data using the target dialect from the pre-stored audio data of each dialect according to the target dialect used in the dialect audio data;
Retrieve reply audio data for replying to the communication content from the audio data using the target dialect according to the communication content in the dialect audio data;
And retrieve the reference audio data of the virtual communication object from the pre-stored reference audio data of each object according to the identification information of the virtual communication object.
In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of any one of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the possible implementations of the first aspect described above.
According to the multi-dialect speech synthesis method and apparatus, the electronic device, and the readable storage medium provided by the application, during human-computer interaction a target user can select a virtual communication object. After the virtual communication object is selected, reply audio data matching the communication content of the collected dialect audio data of the target user and the target dialect used is retrieved, and the reference audio data of the virtual communication object is retrieved. Next, text information and the dialect style features of the target dialect are extracted from the reply audio data, and the speech style features of the virtual communication object are extracted from the reference audio data, so that the target audio data for replying to the target user is generated according to the text information, the dialect style features and the speech style features. The target audio data is audio data, spoken in the target dialect with the speech style of the virtual communication object, that replies to the dialect audio data of the target user. Compared with the prior art, the method of this embodiment does not require the virtual communication object to imitate dialects it is not familiar with, nor to record audio in various dialects; the virtual communication object only needs to record audio in the dialect (or Mandarin) it is most familiar with, and its speech style features are then extracted from the reference audio data it recorded. In addition, for each dialect, the various audio data can be recorded by the person most familiar with that dialect, that is, the dialect in the recorded audio data is the most authentic, so that when the reply audio data of the target dialect is retrieved, the reply audio data in the most authentic dialect can be retrieved. In this embodiment, the text information and the dialect style features of the target dialect are extracted from the reply audio data in the most authentic dialect, and the speech style features of the virtual communication object are extracted from the reference audio data, so that the finally generated target audio data is audio data spoken in the most authentic target dialect with the speech style of the virtual communication object. Therefore, the target audio data output by the machine or human-machine interaction model during human-computer interaction is more authentic and natural, and the intelligibility and naturalness of the target audio data output by the machine or human-machine interaction model are improved.
In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for synthesizing speech from multiple dialects according to an embodiment of the present application;
FIG. 2 is a flow chart of generating target audio data according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a multi-dialect speech synthesis apparatus according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.
When a human-computer interaction model is trained with dialect audio that is not authentic enough, not natural enough, and even difficult to understand, the dialects learned by the model are not authentic enough, so that the dialect speech output by the model during human-computer interaction is unnatural and difficult to understand, i.e., its intelligibility and naturalness are low. Based on this, the embodiments of the present application provide a multi-dialect speech synthesis method and apparatus, an electronic device, and a readable storage medium, so that the dialects of the audio data output by a machine or human-machine interaction model are more authentic, and the intelligibility and naturalness of the audio data output by the machine or human-machine interaction model are improved.
Embodiment one:
To facilitate understanding of the present embodiment, a multi-dialect speech synthesis method disclosed in the embodiments of the present application is first described in detail. Fig. 1 shows a flowchart of a multi-dialect speech synthesis method according to an embodiment of the present application; as shown in fig. 1, the method includes the following steps:
s101: in the human-computer interaction process, dialect audio data of a target user are collected, and virtual communication objects selected by the target user are received.
In this embodiment, the method may be applied to a human-computer interaction device running a human-computer interaction model, for example, a voice assistant model or an intelligent customer service model. The human-computer interaction device may be, for example, a communication terminal or a human-computer interaction robot.
The human-computer interaction process can be a process that a target user performs human-computer interaction with intelligent customer service by using a communication terminal, or a process that the target user performs face-to-face interaction with a human-computer interaction robot, and the application is not limited to the process.
In the human-computer interaction process of the target user and the human-computer interaction equipment, the human-computer interaction equipment collects dialect audio data of the target user and receives virtual communication objects selected by the target user. Wherein the dialect audio data is dialect audio spoken by the target user to the human-computer interaction device using the most familiar dialect of the target user. The virtual communication object is one object selected by the target user from a plurality of preset objects contained in the man-machine interaction device according to the preference of the target user, and the man-machine interaction device is used for talking with the target user by using the voice style of the virtual communication object selected by the target user after the target user selects the virtual communication object.
The man-machine interaction device comprises a display screen, and identification information of a plurality of preset objects, such as object names, object icons and the like, are displayed on the display screen. The human-computer interaction device may receive the virtual communication object selected by the target user by: and responding to the selection operation of the target user for the identification information of the virtual communication object in the identification information of each object displayed in the display screen, so as to determine the virtual communication object selected by the target user.
S102: retrieving reply audio data matched with the communication content and the target dialect according to the communication content in the dialect audio data and the target dialect used; and retrieving the reference audio data of the virtual communication object according to the virtual communication object.
In this embodiment, for each dialect, a plurality of audio data of the dialect are stored in advance in the human-computer interaction device, and the audio content in different audio data under the dialect is different.
In a possible implementation manner, when performing step S102, the following steps may be specifically performed:
S1021: according to the target dialect used in the dialect audio data, retrieving the audio data using the target dialect from the pre-stored audio data of each dialect.
In this embodiment, there are a plurality of audio data using the target dialect, and the audio contents in different audio data are different.
S1022: according to the communication content in the dialect audio data, retrieving reply audio data for replying to the communication content from the audio data using the target dialect.
For example, if the dialect audio data is "are there any remaining tickets for X" spoken in Minnan, the communication content is "are there any remaining tickets for X" and the target dialect is Minnan. The retrieved reply audio data may be "there are remaining tickets" spoken in Minnan.
S1023: retrieving the reference audio data of the virtual communication object from the pre-stored reference audio data of each object according to the identification information of the virtual communication object.
In this embodiment, the human-computer interaction device further stores reference audio data of each preset object in advance, where each object corresponds to one reference audio data. The reference audio data of each object may be audio data recorded by each object using its own most familiar dialect, or audio data recorded by each object using mandarin.
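A minimal sketch of the retrieval logic of steps S1021 to S1023, assuming the pre-stored audio is organized as a simple dialect-indexed table; the data layout, file paths and the substring-based content matching are illustrative assumptions, not the patent's implementation.

    dialect_audio_store = {
        "minnan": [
            {"content": "are there any remaining tickets", "path": "minnan/reply_001.wav"},
            {"content": "the departure time is eight o'clock", "path": "minnan/reply_002.wav"},
        ],
        "cantonese": [
            {"content": "are there any remaining tickets", "path": "cantonese/reply_001.wav"},
        ],
    }

    reference_audio_store = {
        # one reference clip per preset object, recorded in its own familiar dialect or Mandarin
        "object_a": "refs/object_a.wav",
        "object_b": "refs/object_b.wav",
    }

    def retrieve_reply_audio(communication_content: str, target_dialect: str) -> str:
        """S1021 + S1022: narrow to the target dialect, then match the communication content."""
        for clip in dialect_audio_store.get(target_dialect, []):
            if clip["content"] in communication_content or communication_content in clip["content"]:
                return clip["path"]
        raise LookupError(f"no reply audio for {target_dialect!r}: {communication_content!r}")

    def retrieve_reference_audio(object_id: str) -> str:
        """S1023: look up the reference audio by the object's identification information."""
        return reference_audio_store[object_id]

    print(retrieve_reply_audio("are there any remaining tickets for X", "minnan"))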
S103: the method comprises the steps of extracting text information and dialect style characteristics of a target dialect from reply audio data, and extracting voice style characteristics of a virtual communication object from reference audio data to generate a target Mel frequency spectrum according to the text information, the dialect style characteristics and the voice style characteristics.
In one possible implementation, the speech style features include: timbre features, speaking habit features, and speech prosody features. Each user corresponds to a respective timbre feature; the speaking habit features include, for example, whether the speaker habitually uses retroflex "er" endings (erhua); the speech prosody features include pause patterns (speech rhythm), stress rules, and the like.
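Purely as a reading aid, the three kinds of speech style features listed above can be pictured as one container per speaker; the field names and array types below are assumptions made for this sketch, not an interface defined by the patent.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class SpeechStyleFeatures:
        timbre: np.ndarray           # global timbre embedding, one per speaker
        speaking_habits: np.ndarray  # e.g., habitual retroflex "er" endings
        prosody: np.ndarray          # local prosody: pauses / rhythm and stress patterns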
In executing step S103, it is specifically possible to execute the following steps S1031 to S1035:
s1031: fundamental frequency information is extracted from the reply audio data using a trained intonation extractor.
Fig. 2 is a schematic flow chart of generating target audio data according to an embodiment of the present application. As shown in fig. 2, the reply audio data is input into the trained intonation extractor and the trained speech style and text extractor respectively, and fundamental frequency information is extracted from the reply audio data by the trained intonation extractor. The intonation extractor may be implemented with the pyworld tool or the praat-parselmouth tool. The fundamental frequency information includes tone information and pitch (high/low) information.
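A minimal sketch of extracting fundamental frequency (F0) information with pyworld, one of the two tools mentioned above; the file name, frame period and the DIO-plus-StoneMask combination are illustrative choices for this sketch, not requirements of the embodiment.

    import numpy as np
    import pyworld as pw
    import soundfile as sf

    audio, sample_rate = sf.read("reply_audio.wav")           # mono audio assumed
    audio = np.ascontiguousarray(audio, dtype=np.float64)     # pyworld expects float64

    # Coarse F0 estimation followed by refinement; the resulting F0 contour carries
    # the tone and pitch (high/low) information referred to above.
    f0_coarse, timestamps = pw.dio(audio, sample_rate, frame_period=10.0)
    f0 = pw.stonemask(audio, f0_coarse, timestamps, sample_rate)

    print(f0.shape)                                            # one F0 value per analysis frame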
S1032: the trained speech styles and the text extractor are used to extract the text information and dialect style characteristics of the target dialect from the reply audio data.
In this embodiment, the text information is converted from the audio content in the reply audio data; for example, if the reply audio data is "there are remaining tickets" spoken in Minnan, the text information is "there are remaining tickets". Each dialect corresponds to its own dialect style features.
S1033: and extracting tone characteristics and speaking habit characteristics of the virtual communication object from the reference audio data by using the trained global voice characteristic encoder.
As shown in fig. 2, the reference audio data is input into the trained global speech feature encoder and the local prosody feature encoder, and the timbre features and speaking habit features of the virtual communication object are extracted from the reference audio data by the trained global speech feature encoder.
S1034: and extracting the voice prosody features of the virtual communication object from the reference audio data by using the trained local prosody feature encoder.
S1035: the fundamental frequency information, the text information, the dialect style features, the timbre features, the speaking habit features, the speech prosody features and the identity information of the virtual communication object are input into the trained decoder, and the target mel spectrum is output.
S104: inputting the target mel spectrum into the vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data, spoken in the target dialect with the speech style of the virtual communication object, that replies to the dialect audio data of the target user.
As shown in fig. 2, the target mel spectrum is input into the vocoder, and after the vocoder outputs the target audio data, the human-computer interaction device outputs the target audio data to the target user.
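The embodiment does not name a specific vocoder. Purely to illustrate the "mel spectrum in, waveform out" step, the sketch below inverts a mel spectrogram with librosa's Griffin-Lim based helper; in practice a neural vocoder would typically be used, and the spectrogram shape and signal parameters here are assumptions.

    import numpy as np
    import librosa
    import soundfile as sf

    sample_rate = 22050
    n_mels, n_frames = 80, 200
    target_mel = np.abs(np.random.randn(n_mels, n_frames)).astype(np.float32)  # stand-in for the decoder output

    # Griffin-Lim based inversion of the mel spectrogram into a waveform.
    waveform = librosa.feature.inverse.mel_to_audio(
        target_mel, sr=sample_rate, n_fft=1024, hop_length=256)

    sf.write("target_audio.wav", waveform, sample_rate)        # target audio data output to the target user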
In one possible implementation, the speech style and text extractor comprises multiple conformer block layers connected in sequence; between two adjacent conformer block layers, the output of the previous conformer block layer serves as the input of the next conformer block layer.
When step S1032 is performed to extract the text information and the dialect style feature of the target dialect from the reply audio data using the trained speech style and text extractor, the following steps may be specifically performed:
After the reply audio data is input into the speech style and text extractor, the output of a target conformer block layer in the speech style and text extractor is taken as the text information and the dialect style features of the target dialect extracted from the reply audio data; wherein the target conformer block layer is a layer near the last conformer block layer in the speech style and text extractor.
In this embodiment, the speech style and text extractor specifically includes 12 conformer block layers connected in sequence. After the reply audio data is input to the first conformer block layer in the speech style and text extractor, the output of the 10th conformer block layer is used as the text information and the dialect style features of the target dialect extracted from the reply audio data.
In this embodiment, the features output by the earlier conformer block layers are closer to the speaker's voice style features (e.g., timbre features, speaking habit features, speech prosody features), while the features output by the later conformer block layers contain less and less of the speaker's voice style. The output of the 10th conformer block layer is therefore selected as the extracted text information and dialect style features, so that the obtained dialect style features do not include the speaker's voice style features, retain the dialect style of the target dialect relatively completely, and also contain accurate text information.
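A sketch of tapping the output of an intermediate layer in a 12-layer stack, mirroring the choice of the 10th conformer block layer above; the blocks here are simplified stand-ins (plain Transformer encoder layers rather than full conformer blocks), and the feature dimension and tap index are assumptions.

    import torch
    import torch.nn as nn

    class StyleAndTextExtractor(nn.Module):
        def __init__(self, dim: int = 256, num_layers: int = 12, tap_layer: int = 10):
            super().__init__()
            self.blocks = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
                 for _ in range(num_layers)])
            self.tap_layer = tap_layer     # 1-based index of the layer whose output is kept

        def forward(self, frames: torch.Tensor):               # frames: (batch, time, dim)
            tapped = None
            for i, block in enumerate(self.blocks, start=1):
                frames = block(frames)
                if i == self.tap_layer:
                    tapped = frames        # text information + dialect style, little speaker style
            return tapped, frames          # tapped features and the final-layer output

    extractor = StyleAndTextExtractor()
    tapped, final = extractor(torch.randn(1, 120, 256))
    print(tapped.shape, final.shape)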
In one possible implementation, the intonation extractor, the speech style and text extractor, the global speech feature encoder, the local prosodic feature encoder, the decoder as in fig. 2 are trained by:
S201: and performing model training on the initial voice style and text extractor to be trained by using the audio training sample to obtain the trained voice style and text extractor.
In this embodiment, segmented training is adopted: model training is first performed on the speech style and text extractor, and after the speech style and text extractor is trained, model training is performed on the intonation extractor, the global speech feature encoder, the local prosody feature encoder and the decoder, so that the extracted dialect style features can be guaranteed to be accurate during training. If model training were not performed on the speech style and text extractor first, the extracted dialect style features would be inaccurate, which would affect the training of the intonation extractor, the global speech feature encoder, the local prosody feature encoder and the decoder.
In one possible implementation, the audio training samples include a pre-training sample and a second dialect audio training sample, the pre-training sample includes a mandarin training sample and a third dialect audio training sample, and the number of mandarin training samples is greater than the number of third dialect audio training samples; in executing step S201, it is specifically possible to execute the following steps S2011 to S2012:
s2011: and performing model pre-training on the initial voice style and text extractor to be trained by using the pre-training sample to obtain the preliminary voice style and text extractor.
S2012: performing model training on the preliminary speech style and text extractor by using the second dialect audio training sample to obtain the trained speech style and text extractor.
In this embodiment, the pre-training samples are obtained from an open-source data set in which the number of Mandarin training samples is much greater than the number of dialect audio training samples. If only the pre-training samples (in which Mandarin training samples far outnumber the third-dialect audio training samples) were used to train the initial speech style and text extractor, the trained speech style and text extractor would have a weak ability to extract dialect style features. Meanwhile, considering that the number of dialect audio training samples is small, if only a small number of dialect audio training samples were used to train the initial speech style and text extractor, the ability of the speech style and text extractor to extract dialect style features would also be weak.
Based on this, in this embodiment, a large number of pre-training samples are used to perform model pre-training on the initial speech style and text extractor to be trained, so as to obtain the preliminary speech style and text extractor, and then a small number of second dialect audio training samples are used to perform training on the preliminary speech style and text extractor again, so that the extraction capability of the dialect style characteristics of the speech style and text extractor after training is improved.
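A sketch of this two-stage schedule: pre-train the speech style and text extractor on the Mandarin-heavy pre-training sample, then fine-tune the preliminary extractor on the smaller second-dialect sample. The optimizer, the L1 loss against a generic supervision target, and the data interfaces are placeholders; the embodiment does not specify the training objective.

    import torch

    def run_stage(model, dataloader, epochs: int, lr: float):
        """One training stage; the loss here is only a placeholder objective."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = torch.nn.L1Loss()
        for _ in range(epochs):
            for audio_frames, target in dataloader:            # (input features, supervision target)
                _, final = model(audio_frames)
                loss = loss_fn(final, target)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return model

    # Stage 1: large, Mandarin-dominated pre-training sample -> preliminary extractor.
    # Stage 2: small second-dialect sample, typically with a lower learning rate ->
    # trained extractor with a stronger ability to extract dialect style features.
    # extractor = run_stage(extractor, pretrain_loader, epochs=10, lr=1e-4)
    # extractor = run_stage(extractor, dialect_loader, epochs=5, lr=1e-5)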
S202: inputting the first-dialect audio training sample of the sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, extracting sample fundamental frequency information from the first-dialect audio training sample through the initial intonation extractor, and extracting sample text information and sample dialect style features from the first-dialect audio training sample through the speech style and text extractor.
In this embodiment, after the trained speech style and text extractor is obtained, model training is performed on the intonation extractor, the global speech feature encoder, the local prosody feature encoder and the decoder. Specifically, the first-dialect audio training sample of the sample object is input into the initial intonation extractor to be trained and the trained speech style and text extractor, the sample fundamental frequency information is extracted from the first-dialect audio training sample through the initial intonation extractor, and the sample text information and the sample dialect style features are extracted from the first-dialect audio training sample through the speech style and text extractor.
S203: inputting the first-dialect audio training sample into the initial global speech feature encoder and the initial local prosody feature encoder to be trained, extracting the sample timbre features and the sample speaking habit features of the sample object from the first-dialect audio training sample through the initial global speech feature encoder, and extracting the sample speech prosody features of the sample object from the first-dialect audio training sample through the initial local prosody feature encoder.
S204: sample fundamental frequency information, sample text information, sample dialect style characteristics, sample tone color characteristics, sample speaking habit characteristics, sample voice rhythm characteristics and identification information of a sample object are input into an initial decoder to be trained, and a sample mel frequency spectrum is output.
S205: generating a label mel spectrum according to the first-dialect audio training sample.
S206: and calculating a loss function value according to the sample Mel frequency spectrum and the label Mel frequency spectrum.
S207: and judging whether the training completion condition is met according to the loss function value.
S208: when the training completion condition is met, taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
S209: when the training completion condition is not met, updating the learnable parameters in the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder by using the loss function value, returning to the step of inputting the first-dialect audio training sample of the sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, and performing the subsequent steps until the calculated loss function value meets the training completion condition, and then taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the already trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
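A condensed sketch of the joint training loop of steps S202 to S209: the already-trained speech style and text extractor is frozen, and the intonation extractor, the two encoders and the decoder are updated from the distance between the sample mel spectrum and the label mel spectrum. The module interfaces, the L1 loss and the stopping threshold are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def train_joint(intonation, global_enc, local_enc, decoder, style_text_extractor,
                    dataloader, loss_threshold: float = 0.05, max_steps: int = 100_000, lr: float = 1e-4):
        style_text_extractor.eval()                            # already trained in the first stage
        params = (list(intonation.parameters()) + list(global_enc.parameters())
                  + list(local_enc.parameters()) + list(decoder.parameters()))
        optimizer = torch.optim.Adam(params, lr=lr)
        step = 0
        for audio, speaker_id, label_mel in dataloader:        # first-dialect audio training samples
            f0 = intonation(audio)                             # sample fundamental frequency information
            with torch.no_grad():
                text_feat, dialect_style = style_text_extractor(audio)
            timbre, habits = global_enc(audio)                 # sample timbre and speaking habit features
            prosody = local_enc(audio)                         # sample speech prosody features
            sample_mel = decoder(f0, text_feat, dialect_style, timbre, habits, prosody, speaker_id)
            loss = F.l1_loss(sample_mel, label_mel)            # sample mel spectrum vs. label mel spectrum
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if loss.item() < loss_threshold or step >= max_steps:   # training completion condition
                break
        return intonation, global_enc, local_enc, decoder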
Embodiment two:
based on the same technical concept, the present application further provides a multi-dialect speech synthesis apparatus, and fig. 3 shows a schematic structural diagram of a multi-dialect speech synthesis apparatus provided by an embodiment of the present application, as shown in fig. 3, where the apparatus includes:
The acquisition module 301 is configured to acquire dialect audio data of a target user and receive a virtual communication object selected by the target user in a human-computer interaction process;
a retrieving module 302, configured to retrieve reply audio data matching the communication content and the target dialect according to the communication content and the target dialect used in the dialect audio data, and to retrieve the reference audio data of the virtual communication object according to the virtual communication object;
An extracting module 303, configured to extract text information and dialect style characteristics of the target dialect from the reply audio data, and extract speech style characteristics of the virtual communication object from the reference audio data, so as to generate a target mel spectrum according to the text information, the dialect style characteristics and the speech style characteristics;
An output module 304, configured to input the target mel spectrum into a vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data, spoken in the target dialect with the speech style of the virtual communication object, that replies to the dialect audio data of the target user.
Optionally, the speech style feature includes: timbre characteristics, speaking habit characteristics and voice rhythm characteristics; the extracting module 303 is configured to, when extracting text information and a dialect style feature of the target dialect from the reply audio data, extract a speech style feature of the virtual communication object from the reference audio data, so as to generate a target mel spectrum according to the text information, the dialect style feature and the speech style feature, specifically:
Extracting fundamental frequency information from the reply audio data by using a trained intonation extractor;
Extracting text information and dialect style features of the target dialect from the reply audio data using a trained speech style and text extractor;
Extracting the timbre features and the speaking habit features of the virtual communication object from the reference audio data using a trained global speech feature encoder;
Extracting the phonetic prosody features of the virtual communication object from the reference audio data using a trained local prosody feature encoder;
Inputting the fundamental frequency information, the text information, the dialect style features, the timbre features, the speaking habit features, the speech prosody features and the identity information of the virtual communication object into a trained decoder, and outputting the target mel spectrum.
Optionally, the speech style and text extractor comprises a plurality of conformer block layers connected in sequence; between two adjacent conformer block layers, the output of the former conformer block layer is used as the input of the next conformer block layer; the extracting module 303, when configured to extract text information and dialect style characteristics of the target dialect from the reply audio data using the trained speech style and text extractor, is specifically configured to:
After the reply audio data is input into the speech style and text extractor, taking the output of a target conformer block layer in the speech style and text extractor as the text information and the dialect style features of the target dialect extracted from the reply audio data; wherein the target conformer block layer is a layer near the last conformer block layer in the speech style and text extractor.
Optionally, the apparatus further comprises a training module for training to obtain the intonation extractor, the speech style and text extractor, the global speech feature encoder, the local prosodic feature encoder, the decoder by:
performing model training on an initial speech style and text extractor to be trained by using audio training samples, to obtain the trained speech style and text extractor;
Inputting a first dialect audio training sample of a sample object into an initial intonation extractor to be trained and the trained speech style and text extractor, extracting sample fundamental frequency information from the first dialect audio training sample through the initial intonation extractor, and extracting sample text information and sample dialect style features from the first dialect audio training sample through the speech style and text extractor;
Inputting the first dialect audio training sample into an initial global speech feature encoder and an initial local prosody feature encoder to be trained, extracting sample tone features and sample speaking habit features of the sample object from the first dialect audio training sample through the initial global speech feature encoder, and extracting sample voice prosody features of the sample object from the first dialect audio training sample through the initial local prosody feature encoder;
inputting the sample fundamental frequency information, the sample text information, the sample dialect style characteristics, the sample tone color characteristics, the sample speaking habit characteristics, the sample voice rhythm characteristics and the identity information of the sample object into an initial decoder to be trained, and outputting a sample Mel frequency spectrum;
generating a label mel frequency spectrum according to the first dialect audio training sample;
Calculating a loss function value according to the sample mel frequency spectrum and the label mel frequency spectrum;
Judging whether the training completion condition is met according to the loss function value;
When the training completion condition is met, taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder;
When the training completion condition is not met, updating the learnable parameters in the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder by using the loss function value, and returning to the step of inputting a first dialect audio training sample of a sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, until the calculated loss function value meets the training completion condition; the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the trained speech style and text extractor, are then taken as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
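A rough sketch of the joint training loop described above, assuming PyTorch; the choice of L1 loss on the mel spectrum, the Adam optimizer and the form of the training-completion condition (a loss threshold) are assumptions for illustration, and the helper and parameter names are hypothetical.

import torch
import torch.nn.functional as F

def train_joint(intonation_extractor, style_text_extractor,   # style/text extractor already trained
                global_encoder, local_prosody_encoder, decoder,
                dialect_batches, label_mels, speaker_id_embs,
                lr=1e-4, loss_threshold=0.1, max_rounds=100_000):
    params = (list(intonation_extractor.parameters()) + list(global_encoder.parameters())
              + list(local_prosody_encoder.parameters()) + list(decoder.parameters()))
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(max_rounds):
        for audio, label_mel, spk in zip(dialect_batches, label_mels, speaker_id_embs):
            f0 = intonation_extractor(audio)
            with torch.no_grad():                              # frozen: trained in the first stage
                text_info, dialect_style = style_text_extractor(audio)
            timbre, habit = global_encoder(audio)
            prosody = local_prosody_encoder(audio)
            pred_mel = decoder(f0, text_info, dialect_style, timbre, habit, prosody, spk)
            loss = F.l1_loss(pred_mel, label_mel)              # compare with the label mel spectrum
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        if loss.item() < loss_threshold:                       # training-completion condition (assumed form)
            break
    return intonation_extractor, global_encoder, local_prosody_encoder, decoder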
Optionally, the audio training samples include pre-training samples and second dialect audio training samples, the pre-training samples include Mandarin training samples and third dialect audio training samples, and the number of Mandarin training samples is greater than the number of third dialect audio training samples; the training module, when performing model training on the initial speech style and text extractor to be trained by using the audio training samples, is specifically used for:
Performing model pre-training on the initial speech style and text extractor to be trained by using the pre-training samples, to obtain a preliminary speech style and text extractor;
and performing model training on the preliminary speech style and text extractor by using the second dialect audio training samples, to obtain the trained speech style and text extractor.
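A minimal sketch of this two-stage training of the speech style and text extractor, assuming a generic train_epochs helper (hypothetical) that runs supervised training for a given number of epochs; the epoch counts are arbitrary placeholders.

def train_style_text_extractor(extractor, mandarin_samples, third_dialect_samples,
                               second_dialect_samples, train_epochs):
    # Stage 1: pre-training, with far more Mandarin data than third-dialect data
    pretrain_corpus = list(mandarin_samples) + list(third_dialect_samples)
    preliminary_extractor = train_epochs(extractor, pretrain_corpus, epochs=50)
    # Stage 2: continue training on the second dialect audio training samples
    return train_epochs(preliminary_extractor, list(second_dialect_samples), epochs=20)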
Optionally, the retrieving module 302, when configured to retrieve reply audio data matching the communication content and the target dialect according to the communication content and the target dialect used in the dialect audio data, and to retrieve the reference audio data of the virtual communication object according to the virtual communication object, is specifically configured to:
Calling out the audio data using the target dialect from the audio data of each dialect stored in advance according to the target dialect used in the dialect audio data;
Retrieving reply audio data for replying to the communication content from the audio data using the target dialect according to the communication content in the dialect audio data;
And according to the identification information of the virtual communication object, retrieving the reference audio data of the virtual communication object from the pre-stored reference audio data of each object.
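An illustrative sketch (assumptions only) of this retrieval logic: pre-stored audio is first filtered by the target dialect, a reply is chosen by a content-matching score, and the reference audio is looked up by the identification information of the virtual communication object; the data structures and the match_score function are hypothetical.

def retrieve(dialect_audio_store, reference_audio_store,
             target_dialect, communication_content, object_id, match_score):
    # audio data in the target dialect, from the pre-stored audio of each dialect
    candidates = dialect_audio_store.get(target_dialect, [])
    # reply audio whose content best matches the communication content (scoring is assumed)
    reply_audio = max(candidates, key=lambda item: match_score(item["content"], communication_content))
    # reference audio of the virtual communication object, keyed by its identification information
    reference_audio = reference_audio_store[object_id]
    return reply_audio, reference_audio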
Embodiment III:
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device comprises a processor 401, a memory 402 and a bus 403, wherein the memory 402 stores machine-readable instructions executable by the processor 401. When the electronic device runs the multi-dialect voice synthesis method, the processor 401 communicates with the memory 402 through the bus 403, and the processor 401 executes the machine-readable instructions to perform the method steps described in the first embodiment.
Embodiment four:
The fourth embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor performs the method steps described in the first embodiment.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, electronic device and computer readable storage medium described above may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; the division of the modules is merely a division by logical function, and there may be other division manners in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
Finally, it should be noted that the above embodiments are only specific implementations of the present application, intended to illustrate the technical solutions of the present application rather than to limit them, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the technical field may, within the technical scope disclosed in the present application, still modify the technical solutions described in the foregoing embodiments, easily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis for a plurality of dialects, comprising:
in the human-computer interaction process, dialect audio data of a target user are collected, and a virtual communication object selected by the target user is received;
retrieving reply audio data matched with the communication content and the target dialect according to the communication content and the used target dialect in the dialect audio data; according to the virtual communication object, calling out the reference audio data of the virtual communication object;
extracting text information and dialect style characteristics of the target dialect from the reply audio data, and extracting voice style characteristics of the virtual communication object from the reference audio data so as to generate a target Mel frequency spectrum according to the text information, the dialect style characteristics and the voice style characteristics;
inputting the target Mel frequency spectrum into a vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data that is spoken in the target dialect, in the speech style of the virtual communication object, and that replies to the dialect audio data of the target user.
2. The method of claim 1, wherein the speech style features comprise: timbre characteristics, speaking habit characteristics and voice rhythm characteristics; and the extracting text information and dialect style features of the target dialect from the reply audio data and extracting speech style features of the virtual communication object from the reference audio data, so as to generate a target mel frequency spectrum according to the text information, the dialect style features and the speech style features, comprises:
Extracting fundamental frequency information from the reply audio data by using a trained intonation extractor;
Extracting text information and dialect style features of the target dialect from the reply audio data using a trained speech style and text extractor;
Extracting the timbre features and the speaking habit features of the virtual communication object from the reference audio data using a trained global speech feature encoder;
Extracting the phonetic prosody features of the virtual communication object from the reference audio data using a trained local prosody feature encoder;
Inputting the fundamental frequency information, the text information, the dialect style characteristics, the tone characteristics, the speaking habit characteristics, the voice rhythm characteristics and the identity information of the virtual communication object into a trained decoder, and outputting a target Mel frequency spectrum.
3. The method of claim 2, wherein the speech style and text extractor comprises a plurality of conformer block layers connected in sequence; between two adjacent conformer block layers, the output of the former conformer block layer serves as the input of the latter conformer block layer; and the extracting text information and dialect style features of the target dialect from the reply audio data using the trained speech style and text extractor comprises:
After the reply audio data is input into the speech style and text extractor, taking the output of a target conformer block layer in the speech style and text extractor as the text information and the dialect style features of the target dialect extracted from the reply audio data; wherein the target conformer block layer is a conformer block layer near the last conformer block layer in the speech style and text extractor.
4. The method of claim 2, wherein the intonation extractor, the speech style and text extractor, the global speech feature encoder, the local prosodic feature encoder, and the decoder are trained by:
performing model training on an initial speech style and text extractor to be trained by using audio training samples, to obtain the trained speech style and text extractor;
Inputting a first dialect audio training sample of a sample object into an initial intonation extractor to be trained and the trained speech style and text extractor, extracting sample fundamental frequency information from the first dialect audio training sample through the initial intonation extractor, and extracting sample text information and sample dialect style features from the first dialect audio training sample through the speech style and text extractor;
Inputting the first dialect audio training sample into an initial global speech feature encoder and an initial local prosody feature encoder to be trained, extracting sample tone features and sample speaking habit features of the sample object from the first dialect audio training sample through the initial global speech feature encoder, and extracting sample voice prosody features of the sample object from the first dialect audio training sample through the initial local prosody feature encoder;
inputting the sample fundamental frequency information, the sample text information, the sample dialect style characteristics, the sample tone color characteristics, the sample speaking habit characteristics, the sample voice rhythm characteristics and the identity information of the sample object into an initial decoder to be trained, and outputting a sample Mel frequency spectrum;
generating a label mel frequency spectrum according to the first dialect audio training sample;
Calculating a loss function value according to the sample mel frequency spectrum and the label mel frequency spectrum;
Judging whether the training completion condition is met according to the loss function value;
When the training completion condition is met, taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder;
When the training completion condition is not met, updating the learnable parameters in the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder by using the loss function value, and returning to the step of inputting a first dialect audio training sample of a sample object into the initial intonation extractor to be trained and the trained speech style and text extractor, until the calculated loss function value meets the training completion condition; and then taking the initial intonation extractor, the initial global speech feature encoder, the initial local prosody feature encoder and the initial decoder of the current training round, together with the trained speech style and text extractor, as the trained intonation extractor, speech style and text extractor, global speech feature encoder, local prosody feature encoder and decoder.
5. The method of claim 4, wherein the audio training samples comprise pre-training samples and second dialect audio training samples, the pre-training samples comprise Mandarin training samples and third dialect audio training samples, and the number of Mandarin training samples is greater than the number of third dialect audio training samples; and the performing model training on an initial speech style and text extractor to be trained by using audio training samples to obtain the trained speech style and text extractor comprises:
Performing model pre-training on the initial speech style and text extractor to be trained by using the pre-training samples, to obtain a preliminary speech style and text extractor;
and performing model training on the preliminary speech style and text extractor by using the second dialect audio training samples, to obtain the trained speech style and text extractor.
6. The method of claim 1, wherein the retrieving reply audio data matched with the communication content and the target dialect according to the communication content and the used target dialect in the dialect audio data, and the calling out the reference audio data of the virtual communication object according to the virtual communication object, comprise:
Calling out the audio data using the target dialect from the audio data of each dialect stored in advance according to the target dialect used in the dialect audio data;
Retrieving reply audio data for replying to the communication content from the audio data using the target dialect according to the communication content in the dialect audio data;
And according to the identification information of the virtual communication object, retrieving the reference audio data of the virtual communication object from the pre-stored reference audio data of each object.
7. A multi-dialect speech synthesis apparatus, comprising:
the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring dialect audio data of a target user and receiving a virtual communication object selected by the target user in the human-computer interaction process;
The calling module is used for calling out reply audio data matched with the communication content and the target dialect according to the communication content in the dialect audio data and the used target dialect; according to the virtual communication object, calling out the reference audio data of the virtual communication object;
The extraction module is used for extracting text information and dialect style characteristics of the target dialect from the reply audio data, extracting voice style characteristics of the virtual communication object from the reference audio data, and generating a target Mel frequency spectrum according to the text information, the dialect style characteristics and the voice style characteristics;
The output module is used for inputting the target Mel frequency spectrum into a vocoder to obtain target audio data, so as to output the target audio data to the target user; wherein the target audio data is audio data that is spoken in the target dialect, in the speech style of the virtual communication object, and that replies to the dialect audio data of the target user.
8. The apparatus of claim 7, wherein the speech style features comprise: timbre characteristics, speaking habit characteristics and voice rhythm characteristics; and the extracting module, when configured to extract text information and dialect style features of the target dialect from the reply audio data and to extract speech style features of the virtual communication object from the reference audio data, so as to generate a target mel spectrum according to the text information, the dialect style features and the speech style features, is specifically configured to:
Extracting fundamental frequency information from the reply audio data by using a trained intonation extractor;
Extracting text information and dialect style features of the target dialect from the reply audio data using a trained speech style and text extractor;
Extracting the timbre features and the speaking habit features of the virtual communication object from the reference audio data using a trained global speech feature encoder;
Extracting the phonetic prosody features of the virtual communication object from the reference audio data using a trained local prosody feature encoder;
Inputting the fundamental frequency information, the text information, the dialect style characteristics, the tone characteristics, the speaking habit characteristics, the voice rhythm characteristics and the identity information of the virtual communication object into a trained decoder, and outputting a target Mel frequency spectrum.
9. An electronic device, comprising: a processor, a memory and a bus, said memory storing machine-readable instructions executable by said processor, said processor and said memory communicating over the bus when the electronic device is running, said machine-readable instructions when executed by said processor performing the steps of the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1 to 6.
CN202410250198.2A 2024-03-05 2024-03-05 Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium Active CN117953854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410250198.2A CN117953854B (en) 2024-03-05 2024-03-05 Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN117953854A true CN117953854A (en) 2024-04-30
CN117953854B CN117953854B (en) 2024-07-19

Family

ID=90801483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410250198.2A Active CN117953854B (en) 2024-03-05 2024-03-05 Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN117953854B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200056261A (en) * 2018-11-14 2020-05-22 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN111899719A (en) * 2020-07-30 2020-11-06 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN113409764A (en) * 2021-06-11 2021-09-17 北京搜狗科技发展有限公司 Voice synthesis method and device for voice synthesis
WO2023035261A1 (en) * 2021-09-13 2023-03-16 Microsoft Technology Licensing, Llc An end-to-end neural system for multi-speaker and multi-lingual speech synthesis
CN116312471A (en) * 2023-02-27 2023-06-23 浙江猫精人工智能科技有限公司 Voice migration and voice interaction method and device, electronic equipment and storage medium
CN116863910A (en) * 2023-05-26 2023-10-10 北京有竹居网络技术有限公司 Speech data synthesis method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN117953854B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
CN108447486B (en) Voice translation method and device
CN107657017A (en) Method and apparatus for providing voice service
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN106971709A (en) Statistic parameter model method for building up and device, phoneme synthesizing method and device
CN111653265B (en) Speech synthesis method, device, storage medium and electronic equipment
WO2015058386A1 (en) System and method for text-to-speech performance evaluation
CN113658577B (en) Speech synthesis model training method, audio generation method, equipment and medium
CN112397056B (en) Voice evaluation method and computer storage medium
CN108764114B (en) Signal identification method and device, storage medium and terminal thereof
CN112786012A (en) Voice synthesis method and device, electronic equipment and storage medium
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN111739509A (en) Electronic book audio generation method, electronic device and storage medium
JP2011217018A (en) Voice response apparatus, and program
CN114927122A (en) Emotional voice synthesis method and synthesis device
CN115762467A (en) Speaker characteristic vector distribution space creation and voice synthesis method and related equipment
CN117953854B (en) Multi-dialect voice synthesis method and device, electronic equipment and readable storage medium
CN115985320A (en) Intelligent device control method and device, electronic device and storage medium
CN115188363A (en) Voice processing method, system, device and storage medium
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
CN115359775A (en) End-to-end tone and emotion migration Chinese voice cloning method
CN114242032A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114333758A (en) Speech synthesis method, apparatus, computer device, storage medium and product
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant