WO2023102929A1 - Audio synthesis method, electronic device, program product and storage medium - Google Patents

Audio synthesis method, electronic device, program product and storage medium

Info

Publication number
WO2023102929A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
features
speaking style
feature
target sentence
Prior art date
Application number
PCT/CN2021/137237
Other languages
French (fr)
Chinese (zh)
Inventor
吴志勇
康世胤
雷舜
周逸轩
陈礼扬
Original Assignee
清华大学深圳国际研究生院 (Tsinghua Shenzhen International Graduate School)
广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学深圳国际研究生院 (Tsinghua Shenzhen International Graduate School) and 广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Priority to PCT/CN2021/137237
Publication of WO2023102929A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems

Definitions

  • the present application relates to the technical field of audio processing, in particular to an audio synthesis method, an electronic device, a program product and a storage medium.
  • Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio.
  • TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products.
  • the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid and lifelike to the listener as natural speech.
  • the naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio.
  • Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
  • the application provides an audio synthesis method, an electronic device, a program product and a storage medium, which can further improve the expressiveness of the synthesized audio.
  • an audio synthesis method comprising:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • the contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism; the contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism;
  • the paragraph features of the text data are also extracted based on the position information of each sentence in the text data.
  • the synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence includes:
  • the prosody information includes one or more of pitch information, sound intensity information, or pronunciation duration;
  • the method is applied to an audio synthesis system, the audio synthesis system comprising:
  • An acoustic feature extraction module for extracting the acoustic features of the target sentence
  • a speaking style feature extraction module for extracting the speaking style features of the target sentence
  • a synthesis module for synthesizing the audio data.
  • the speaking style feature extraction module includes:
  • a hierarchical encoder for extracting the sentence feature, the paragraph feature and the speaking style feature.
  • the hierarchical encoder is trained using supervised learning, and the training data includes semantic features annotated with real speaking style features.
  • the hierarchical encoder performs supervised learning through a knowledge distillation mechanism; the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism uses unsupervised learning to extract real speaking style features from real audio data.
  • the hierarchical encoder includes:
  • the synthesis module includes:
  • a converter for converting the acoustic features carrying prosody information into the audio data.
  • the method is performed by a live server, the text data is text data of an audiobook, and the method further includes:
  • a computer program product including a computer program, and when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • an electronic device includes:
  • memory for storing processor-executable instructions
  • the processor is configured as:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • the technical solutions provided by the embodiments of the present application may include the following beneficial effects: in the process of audio synthesis, speaking style features are extracted for each sentence; the speaking style features are determined based on the paragraph features of the text data in which the sentence is located; the paragraph features are extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style.
  • in this way, the influence of the context structure on the speaking style of the sentence is comprehensively considered at the two levels of inter-word relationships and inter-sentence relationships, and the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style.
  • the audio synthesized using the acoustic features of the sentence together with the above speaking style features is more expressive, which improves the richness of the expressive effects of the synthesized audio.
  • Fig. 1 is a flowchart of an audio synthesis method according to an embodiment of the present application.
  • Fig. 2 is a schematic diagram of an audio synthesis system according to an embodiment of the present application.
  • Fig. 3 is a flowchart of an audio synthesis method according to another embodiment of the present application.
  • Fig. 4 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 5 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 6 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 7 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 8 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 9 is a flowchart of an audio synthesis method according to another embodiment of the present application.
  • Fig. 10 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 11 shows an application scenario of an audio synthesis method according to an embodiment of the present application.
  • Fig. 12 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 13 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
  • first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word “if” as used herein may be interpreted as “at” or “when” or “in response to a determination.”
  • Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio.
  • TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products.
  • the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid and lifelike to the listener as natural speech.
  • the naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio.
  • Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
  • the synthesized audio has a single speaking style and a flat tone, which cannot reflect the emotion and the meaning contained in the speech content.
  • Synthetic audio is less expressive, resulting in a gap with real speech.
  • the inventors found that the overall semantics and emotion of a sentence are affected by the context.
  • the expressiveness of a sentence is determined by the context in which the sentence is located: it is related not only to the semantic information of the sentence itself, but also to the semantic information of the other sentences in the context of the text paragraph in which the sentence appears. In other words, the expressiveness of a sentence is related to where the sentence is located in the text paragraph; if the same sentence appears at different positions in a text paragraph, its expressiveness will also differ.
  • Step 110: Acquire a target sentence in text data, the text data including at least two consecutive sentences;
  • Step 120: Acquire the acoustic features of the target sentence;
  • Step 130: Acquire the speaking style features of the target sentence;
  • the speaking style features are determined based on the paragraph features of the text data, and the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style;
  • the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • Step 140: Synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence (an illustrative sketch of this overall flow follows the steps).
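  • As an illustration only, the four steps above can be read as a simple pipeline. The minimal Python sketch below assumes the three modules described later (acoustic feature extraction, speaking style feature extraction, synthesis) are available as callables; every name and signature here is hypothetical and is not part of the application.

```python
from typing import Any, Callable, Sequence

def synthesize_sentence(
    target_sentence: str,
    text_data: Sequence[str],                            # at least two consecutive sentences
    extract_acoustic: Callable[[str], Any],              # acoustic feature extraction module
    extract_style: Callable[[Sequence[str], str], Any],  # context-aware style extraction module
    synthesize: Callable[[Any, Any], bytes],             # synthesis module
) -> bytes:
    """Steps 110-140: acquire features for the target sentence and synthesize audio."""
    assert len(text_data) >= 2 and target_sentence in text_data   # step 110
    acoustic_features = extract_acoustic(target_sentence)         # step 120
    style_features = extract_style(text_data, target_sentence)    # step 130
    return synthesize(acoustic_features, style_features)          # step 140
```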
  • the method shown in FIG. 1 can be used for audio synthesis for each sentence.
  • the target sentence may be a sentence to be synthesized currently.
  • the acoustic feature of the target sentence refers to the feature that can reflect the pronunciation characteristics and acoustic performance of the target sentence, for example, it can be a phoneme-level acoustic feature, that is, each phoneme is represented by a multi-dimensional vector.
  • the semantic features of each sentence in the text data are first extracted, and then, based on the contribution of each word in the sentence to the speaking style, the sentence features of the sentence are extracted from the semantic features of the sentence. A so-called sentence feature represents a sentence as a multidimensional vector that contains the semantic information of the sentence as a whole.
  • the contribution of each word in a sentence to the speaking style can be obtained through an attention-based inter-word network.
  • the paragraph features are extracted from the sentence features of each sentence.
  • the so-called paragraph features use a multi-dimensional vector to represent a whole text, which contains the semantic information of the text data as a whole.
  • the contribution of each sentence of the text data to the speaking style can be obtained through an attention-based inter-sentence network.
  • the speaking style features are extracted from the paragraph features.
  • the extracted speaking style features combine the relationship between words and sentences in each sentence, and comprehensively consider the influence of the context structure on the speaking style of the sentence.
  • the extracted speaking style feature of the target sentence not only carries the contribution information of each word in the target sentence to the speaking style of the sentence, but also carries the contribution information of other sentences in the context to the speaking style. Therefore, the audio synthesized using the acoustic features of the sentence together with the speaking style features is more expressive, and the richness of the expressive effect of the synthesized audio is improved.
  • the audio synthesis system 200 includes an acoustic feature extraction module 210 , a speaking style feature extraction module 220 and a synthesis module 230 .
  • the acoustic feature extraction module 210 is used to extract the acoustic feature of the target sentence from the input target sentence
  • the speaking style feature extraction module 220 is used to extract the speaking style feature of the target sentence from the input text data
  • the synthesis module 230 is used to synthesize the audio data of the target sentence based on the acoustic features and the speaking style features of the target sentence.
  • the acoustic features of the target sentence may be phoneme-level acoustic features.
  • the acquisition of the acoustic features in step 120 above may include the steps shown in Figure 3:
  • Step 121: Obtain the phoneme sequence of the target sentence;
  • Step 122: Obtain phoneme-level features of the target sentence based on the phoneme sequence;
  • Step 123: Extract the acoustic features of the target sentence from the phoneme-level features based on a multi-head attention mechanism.
  • the acoustic feature extraction module 210 may include a text-to-phoneme (Grapheme to Phoneme, G2P) submodule 211, a phoneme embedding (Phoneme Embedding) submodule 212 and a phoneme encoding (Phoneme Encoder) submodule 213.
  • the text-to-phoneme sub-module 211 can convert the input target sentence into a phoneme sequence that can reflect its pronunciation characteristics according to the conversion logic designed by linguistic knowledge.
  • for example, the text-to-phoneme submodule 211 can convert the target sentence into a phoneme sequence beginning "i-n1 ...", wherein "i", "n1", and "zh" in the phoneme sequence are phoneme symbols, and each phoneme symbol represents a pronunciation.
  • the phoneme sequence of the target sentence is not limited to the above expression forms, and may also be a phoneme sequence in other forms that can reflect the pronunciation characteristics of the target sentence, which is not limited in this application. It can be assumed that the length of the phoneme sequence output by the text-to-phoneme sub-module 211 is N.
  • the phoneme embedding sub-module 212 can map each phoneme into a multi-dimensional vector. For example, each phoneme can be mapped to a 256-dimensional floating-point vector. In this way, the phoneme sequence with a length of N can be mapped into an N*256 matrix after passing through the phoneme embedding module 212 , that is, phoneme-level features.
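  • As a concrete but purely illustrative sketch of this mapping (assuming PyTorch and an assumed phoneme vocabulary size of 100), each phoneme id in a length-N sequence is embedded into a 256-dimensional vector, giving an N*256 matrix:

```python
import torch
import torch.nn as nn

# Minimal sketch of the phoneme embedding sub-module 212.
# The phoneme vocabulary size (100) is an illustrative assumption.
phoneme_embedding = nn.Embedding(num_embeddings=100, embedding_dim=256)

phoneme_ids = torch.tensor([[3, 17, 42, 8]])        # (batch=1, N=4) phoneme indices
phoneme_level_features = phoneme_embedding(phoneme_ids)
print(phoneme_level_features.shape)                 # torch.Size([1, 4, 256])
```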
  • the phoneme-level features output by the phoneme embedding sub-module 212 are then input into the phoneme encoding sub-module 213 .
  • the phoneme coding sub-module 213 includes a position coding model and several Transformer models. As shown in FIG. 5 , the phoneme coding sub-module 213 is composed of a position coding model and four Transformer models.
  • the position encoding model can add artificially designed position information to the phoneme-level features, so that the subsequent Transformer model can also take the position of the phoneme into account when calculating.
  • the calculation method of the position code can be implemented with reference to related technologies, and this application will not discuss it here.
  • the Transformer model consists of a multi-head self-attention mechanism with residual connections and layer normalization, and a 1D convolutional layer with residual connections and layer normalization.
  • the Transformer model can extract the acoustic features of the target sentence according to the relationship between phonemes and the information of each phoneme before and after fusion.
  • the acoustic features may be phoneme-level acoustic features, which carry information about the influence of phonemes on pronunciation characteristics.
  • the phoneme-level acoustic features with the same size of N*256 can be extracted. That is, the size of the sequence output by the phoneme coding sub-module 213 can be consistent with the size of the original sequence.
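  • A minimal PyTorch sketch of such a phoneme encoder is shown below. The hidden size of 256, the four blocks, and the preservation of the N*256 shape follow the description; the head count, kernel size, and sinusoidal form of the position encoding are assumptions, not details given in the application.

```python
import math
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One Transformer block as described: multi-head self-attention and a 1-D
    convolution, each with a residual connection and layer normalization."""
    def __init__(self, dim=256, heads=2, kernel_size=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (batch, N, 256)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)              # residual + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)           # residual + layer norm

class PhonemeEncoder(nn.Module):
    """Sketch of sub-module 213: position encoding followed by four Transformer
    blocks; the output keeps the N*256 shape of the input."""
    def __init__(self, dim=256, n_blocks=4, max_len=1000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.blocks = nn.ModuleList(FFTBlock(dim) for _ in range(n_blocks))

    def forward(self, phoneme_level_features):    # (batch, N, 256)
        x = phoneme_level_features + self.pe[: phoneme_level_features.size(1)]
        for block in self.blocks:
            x = block(x)
        return x   # phoneme-level acoustic features, still (batch, N, 256)
```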
  • the process of extracting the speaking style features of the target sentence in step 130 may be performed by the speaking style feature extraction module 220 .
  • the speaking style feature extraction module 220 includes a language model 221 and a hierarchical encoder 222 .
  • the language model 221 is used to extract the semantic features of each sentence in the text data;
  • the hierarchical encoder 222 is used to extract sentence features, paragraph features and speaking style features.
  • the language model 221 may be an XLNET language model (Generalized Autoregressive Pretraining for Language Understanding).
  • the XLNET language model is pretrained on a text corpus of billions of words. A large amount of training text enables the XLNET model to better understand the semantic information of the text it processes.
  • the language model 221 may be a BERT (Bidirectional Encoder Representations from Transformers) language model.
  • the BERT language model can be pre-trained with a large amount of Chinese text data to extract effective semantic features.
  • the language model 221 is not limited to the above two models, and those skilled in the art may select other models capable of extracting semantic features from text data as the language model 221 according to actual needs.
  • the text data input into the language model 221 includes a target sentence and several sentences before and after the target sentence. For example, including the target sentence and L sentences before and after it, 2L+1 sentences in total. Wherein, L can be a positive integer.
  • the semantic features of the text data can be extracted from the text data through the language model 221 .
  • the semantic features of text data can be character-level semantic features, or word-level semantic features.
  • the so-called character-level semantic features are represented by a multidimensional vector for each character.
  • the so-called word-level semantic features are represented by a multidimensional vector for each word.
  • the extracted semantic features may be character-level semantic features or word-level semantic features.
  • if the language model 221 is an XLNET language model, the extracted semantic features may be word-level semantic features. If the language model 221 is another model that can extract semantic features from text data, the text data can first be segmented into words, and word-level semantic features can then be extracted from the segmented text.
  • take the case where the language model 221 is an XLNET language model as an example.
  • the XLNET language model can segment sentences into words according to the knowledge learned by the model. If the input text data has M words in total, and each word is output as a 768-dimensional high-dimensional representation of its meaning, then M*768 word-level semantic features can be extracted from the input text data after it passes through the XLNET language model.
  • the semantic features output by the language model 221 can be divided into 2L+1 sequences according to the sentence to which each word belongs before being input to the hierarchical encoder 222.
  • Each sequence consists of semantic features corresponding to the words included in each sentence.
  • Each segmented sequence is referred to as the semantic feature of the sentence below.
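  • For illustration, the extraction of word-level semantic features and their grouping by sentence might look as follows with the HuggingFace transformers library; the checkpoint name "hfl/chinese-xlnet-base" is an assumption (any pretrained encoder with a 768-dimensional hidden state plays the same role), and this sketch encodes each sentence separately instead of splitting one paragraph-level output.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the application only specifies a pretrained XLNET language model.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# 2L+1 = 3 consecutive sentences (L = 1); the middle one is the target sentence.
sentences = ["Where did Xiao Ming go?",
             "Xiao Ming went to Shenzhen.",
             "Shenzhen is a beautiful city."]

per_sentence_semantics = []
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
        hidden = model(**inputs).last_hidden_state        # (1, tokens_in_sentence, 768)
        per_sentence_semantics.append(hidden.squeeze(0))  # semantic features of one sentence

for features in per_sentence_semantics:
    print(features.shape)   # e.g. torch.Size([tokens_in_sentence, 768])
```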
  • the sentence features of each sentence, the paragraph features of the text data, and the speaking style features of the target sentence can be extracted.
  • the hierarchical encoder 222 includes two layers of attention networks. As shown in FIG. 7, the hierarchical encoder 222 includes an inter-word network 2221 and an inter-sentence network 2222. These two attention networks have similar structures, each mainly comprising a bidirectional Gated Recurrent Unit (GRU) and a scaled dot-product attention mechanism.
  • the inter-word network 2221 is used to obtain the contribution of each word in each sentence to the speaking style, and to extract the sentence features from the semantic features of the sentence (i.e., each sequence) based on the contribution of each word to the speaking style.
  • the contribution of each word to the speaking style can be understood as the degree to which the word affects the speaking style, which can be expressed as the meaning of each word in the same sentence and the relationship between words.
  • the bidirectional GRU will re-extract the semantic features corresponding to each word in consideration of time order and context information.
  • the scaled dot product attention mechanism can be used to calculate the weight corresponding to each word, and the sentence features corresponding to the entire sentence can be summarized according to the weight.
  • the key (Key), value (Value) and query (Query) vectors can be used to calculate the weight corresponding to each word and summarize the sentence features according to the weight.
  • the key and value are vectors obtained by linear transformation of the semantic features corresponding to each word
  • the query vector is a vector trained from the data set.
  • the word-level semantic features of size M*768 can be extracted and divided into 2L+1 sequences; each sequence represents the semantic features of one sentence, giving 2L+1 sets of sentence semantic features.
  • sentence features of 2L+1 sentences are extracted in total, and the sentence features of each sentence are a 256-dimensional vector.
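  • A minimal sketch of such an inter-word network, assuming PyTorch and the 768-to-256 dimensions mentioned above; the exact parameterization of the scaled dot-product attention (learned query, keys and values from linear transforms) follows the text, while other details are assumptions.

```python
import math
import torch
import torch.nn as nn

class InterWordNetwork(nn.Module):
    """Sketch of inter-word network 2221: a bidirectional GRU re-extracts each
    word's features in context; scaled dot-product attention with a learned query
    weights the words by their contribution to the speaking style and summarizes
    the sentence into a single 256-dimensional sentence feature."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, dim // 2, bidirectional=True, batch_first=True)
        self.key = nn.Linear(dim, dim)      # keys from a linear transform of word features
        self.value = nn.Linear(dim, dim)    # values from a linear transform of word features
        self.query = nn.Parameter(torch.randn(dim))   # query vector learned from the data set

    def forward(self, word_semantics):                   # (1, n_words, 768)
        h, _ = self.gru(word_semantics)                  # (1, n_words, 256)
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))   # (1, n_words): word contributions
        weights = torch.softmax(scores, dim=-1)
        sentence_feature = (weights.unsqueeze(-1) * v).sum(dim=1)   # (1, 256)
        return sentence_feature, weights
```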
  • the sentence features of each sentence can be mixed and input to the inter-sentence network 2222 .
  • the inter-word network 2221 extracts a total of 2L+1 256-dimensional sentence features. These sentence features can be concatenated into a (2L+1)*256 mixed feature, and input into the inter-sentence network 2222 .
  • the inter-sentence network 2222 can obtain the contribution of each sentence to the speaking style, extract paragraph features from the sentence features of each sentence based on the contribution of each sentence to the speaking style, and predict the speaking style features of the target sentence based on the paragraph features.
  • the contribution of each sentence to the speaking style can be understood as the extent to which the sentence affects the speaking style, which can be expressed in terms of each sentence's representation and the inter-sentence relationships in the text data.
  • the paragraph features of the text data may be extracted based on the position information of each sentence in the text data, in addition to the contribution of each sentence to the speaking style.
  • the input concatenated mixed features go through a bidirectional GRU to re-extract the features of each sentence in combination with the context.
  • the relative position information between sentences is then added to the re-extracted features of each sentence through position encoding.
  • scaled dot-product attention is adopted to obtain paragraph features.
  • a linear layer is used to predict the speaking style features of the target sentence based on the paragraph features.
  • a (2L+1)*256 mixed feature can predict the 256-dimensional paragraph feature of the text data through the inter-sentence network 2222, and finally predict a 256-dimensional speaking style feature.
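  • A matching sketch of the inter-sentence network, again assuming PyTorch; the sinusoidal form of the position encoding and the single linear projection to the style feature are assumptions consistent with, but not dictated by, the description.

```python
import math
import torch
import torch.nn as nn

class InterSentenceNetwork(nn.Module):
    """Sketch of inter-sentence network 2222: a bidirectional GRU re-extracts each
    sentence feature in context, position encoding adds relative sentence positions,
    scaled dot-product attention pools the 2L+1 sentences into a 256-dim paragraph
    feature, and a linear layer predicts the 256-dim speaking style feature."""
    def __init__(self, dim=256, max_sentences=64):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        pe = torch.zeros(max_sentences, dim)
        pos = torch.arange(max_sentences).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))
        self.to_style = nn.Linear(dim, dim)        # predicts the speaking style feature

    def forward(self, sentence_features):          # (1, 2L+1, 256) mixed feature
        h, _ = self.gru(sentence_features)
        h = h + self.pe[: h.size(1)]               # add relative position information
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))
        weights = torch.softmax(scores, dim=-1)    # contribution of each sentence
        paragraph_feature = (weights.unsqueeze(-1) * v).sum(dim=1)   # (1, 256)
        speaking_style = self.to_style(paragraph_feature)            # (1, 256)
        return speaking_style, paragraph_feature
```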
  • the acoustic features of the target sentence and the speaking style features of the target sentence can be extracted.
  • the inventors found that, in the case of limited audio synthesis training data, it is very difficult for the audio synthesis system 200 to implicitly learn the mapping relationship between the sentence semantics and the speaking style of the synthesized audio.
  • the hierarchical encoder 222 can be trained using supervised learning, and the training data can include semantic features annotated with real speaking style features. Since the hierarchical encoder 222 is trained with supervision, it can learn the speaking style features explicitly; moving from implicit to explicit learning can greatly improve the model's ability to predict the speaking style.
  • the hierarchical encoder can perform supervised learning through a knowledge distillation (Knowledge Distillation) mechanism.
  • the knowledge distillation mechanism includes a teacher network and a student network; a soft target produced by the teacher network is introduced as part of the training target of the student network to guide the training of the student network, thereby achieving knowledge transfer.
  • the real speaking style features may include output features of the teacher model.
  • the teacher network can be a reference encoder. As shown in FIG. 8, the reference encoder includes several layers of two-dimensional convolutional neural networks, such as 6 layers, a GRU network and a fully connected network.
  • the reference encoder is trained with unsupervised learning on audio features of real audio corresponding to text.
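  • A minimal sketch of such a reference encoder, assuming PyTorch, an 80-bin mel-spectrogram input, and illustrative channel counts and strides (only the six 2-D convolution layers, the GRU, and the fully connected layer come from the description).

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of the teacher reference encoder of Fig. 8: 2-D convolutions over the
    mel-spectrogram, a GRU over the resulting frame sequence, and a fully connected
    layer producing a 256-dim 'real' speaking style feature."""
    def __init__(self, n_mels=80, style_dim=256, channels=(32, 32, 64, 64, 128, 128)):
        super().__init__()
        convs, in_ch = [], 1
        for out_ch in channels:                          # six 2-D conv layers
            convs += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                      nn.BatchNorm2d(out_ch), nn.ReLU()]
            in_ch = out_ch
        self.convs = nn.Sequential(*convs)
        freq = n_mels
        for _ in channels:
            freq = (freq - 1) // 2 + 1                   # Conv2d(k=3, s=2, p=1) output size
        self.gru = nn.GRU(channels[-1] * freq, style_dim, batch_first=True)
        self.fc = nn.Linear(style_dim, style_dim)

    def forward(self, mel):                              # mel: (batch, frames, 80)
        x = self.convs(mel.unsqueeze(1))                 # (batch, C, frames', freq')
        x = x.permute(0, 2, 1, 3).flatten(2)             # (batch, frames', C*freq')
        _, h = self.gru(x)                               # final GRU state
        return self.fc(h.squeeze(0))                     # (batch, 256) real style feature
```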
  • the audio feature may be one or more of a mel-spectrogram, LPC (linear prediction coefficients), and the like.
  • the reference encoder can extract the corresponding speaking style features from the mel spectrogram in an unsupervised learning manner.
  • the speaking style features output by the reference encoder can be regarded as real speaking style features.
  • the hierarchical encoder 222 is then trained with semantic features annotated with real speaking style features as training data. In this way, the hierarchical encoder 222 can explicitly learn the speaking style features, which reduces the pressure of model training and greatly enhances the modeling effect of the hierarchical encoder 222 on the speaking style features when the amount of training data is insufficient.
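  • Schematically, one distillation training step could look as follows; `reference_encoder` and `hierarchical_encoder` stand for the teacher and student sketched above, and the mean-squared-error objective is an illustrative choice (the application only states that the teacher's output serves as the real speaking style label for supervised training).

```python
import torch
import torch.nn.functional as F

def distillation_step(hierarchical_encoder, reference_encoder,
                      paragraph_semantics, target_mel, optimizer):
    """One supervised training step for the student (hierarchical encoder)."""
    reference_encoder.eval()
    with torch.no_grad():                                # the teacher is frozen here
        real_style = reference_encoder(target_mel)       # "real" style from real audio
    predicted_style = hierarchical_encoder(paragraph_semantics)  # student prediction
    loss = F.mse_loss(predicted_style, real_style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```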
  • the target sentence can be synthesized into audio data.
  • the audio data synthesis process of step 140 above may include the steps shown in Figure 9:
  • Step 141: Predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features and speaking style features of the target sentence;
  • the prosody information includes one or more of pitch information, sound intensity information, or pronunciation duration.
  • Step 142: Convert the acoustic features carrying prosody information into the audio data.
  • the synthesis module 230 may include a prosody predictor 231 and a converter 232 .
  • the prosody predictor 231 is used to predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence; the converter 232 is used to convert the acoustic features carrying prosody information into audio data.
  • the speaking style features may be copied to the length of the acoustic features.
  • the acoustic feature extraction module 210 outputs a phoneme-level acoustic feature with a size of N*256
  • the speaking style feature extraction module 220 outputs a 256-dimensional speaking style feature. Then the speaking style feature can be copied into a feature of length N and added to the phoneme-level acoustic feature of size N*256.
  • the mixed phone-level acoustic features are then input to the prosody predictor 231 .
  • the prosody predictor 231 includes three speech change predictors, and the structures of these speech change predictors are basically the same, including two one-dimensional convolutional layers with layer normalization and one fully connected layer. Three speech change predictors are used to predict the pitch, intensity and duration of the synthesized audio on that phoneme, respectively.
  • the unit of the pronunciation duration is frame.
  • Each predictor predicts a floating-point number for each mixed phone-level acoustic feature as the prediction result.
  • the pitch prediction and the intensity prediction can each be transformed into a 256-dimensional representation through a fully connected layer and added to the mixed phoneme-level acoustic features.
  • the prediction result of the pronunciation duration is rounded to retain an integer, which represents how many frames the pronunciation duration of the phoneme is.
  • the acoustic features of the phoneme are copied according to the pronunciation duration corresponding to each phoneme, and the copied features are spliced together as the frame-level acoustic features, which carry the prosody information of pitch, sound intensity and pronunciation duration.
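  • A minimal sketch of one such speech-change predictor and of the duration-based length regulation, assuming PyTorch; the kernel size and the minimum of one frame per phoneme are assumptions.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of one speech-change predictor in the prosody predictor 231: two 1-D
    convolutions with layer normalization and a fully connected layer predicting one
    floating-point value (pitch, intensity, or duration) per phoneme."""
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, 1)

    def forward(self, x):                               # x: (batch, N, 256)
        h = self.norm1(torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2))
        h = self.norm2(torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2))
        return self.fc(h).squeeze(-1)                   # (batch, N): one value per phoneme


def length_regulate(phoneme_features, durations):
    """Sketch of the length regulator: repeat each phoneme's 256-dim acoustic feature
    by its rounded predicted duration in frames and splice the copies together as
    frame-level acoustic features."""
    frames = torch.clamp(durations.round().long(), min=1)   # at least one frame per phoneme
    return torch.repeat_interleave(phoneme_features, frames, dim=0)   # (n_frames, 256)
```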
  • the prosody predictor 231 outputs n*256 acoustic features carrying prosody information.
  • the acoustic features carrying prosodic information output by the prosody predictor 231 pass through the converter 232 to synthesize the audio data of the target sentence.
  • the converter 232 includes a decoder and a vocoder.
  • the decoder can convert the acoustic features carrying prosody information into corresponding audio features, such as Mel Spectrum or LPC, etc.
  • the audio data of the target sentence can be output.
  • the vocoder can be a neural network vocoder based on HiFi-GAN; the audio data can be synthesized audio data with a sampling rate of 24 kHz.
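  • The application's converter uses a neural HiFi-GAN vocoder; as a runnable stand-in for illustration only (not the vocoder described here), an 80-bin mel spectrogram at 24 kHz can be inverted with Griffin-Lim via librosa. The STFT parameters below are assumptions.

```python
import numpy as np
import librosa

def mel_to_waveform(mel_80xT: np.ndarray, sr: int = 24000) -> np.ndarray:
    # Griffin-Lim inversion of a (80, T) mel spectrogram; a simple stand-in
    # for a neural vocoder such as HiFi-GAN.
    return librosa.feature.inverse.mel_to_audio(mel_80xT, sr=sr,
                                                n_fft=1024, hop_length=256)

# Example call with a small random mel block, just to show the expected shapes.
audio = mel_to_waveform(np.abs(np.random.randn(80, 100)).astype(np.float32) * 0.01)
print(audio.shape)   # roughly 100 * 256 samples at 24 kHz
```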
  • the speaking style feature is extracted for each sentence; the speaking style feature is determined based on the paragraph feature of the text data where the sentence is located; the paragraph feature is extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style.
  • in the process of speaking style feature extraction, not only the semantic information of the target sentence is considered, but also the influence of the context sentences on the semantic information of the target sentence, so that the speech changes brought about by different contexts can be captured.
  • this application uses a hierarchical encoder to analyze the context semantics and comprehensively considers the influence of the context structure on the speaking style of the sentence at the two levels of inter-word relationships and inter-sentence relationships; the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style.
  • the hierarchical encoder can extract more information from the context and effectively improve the long-distance dependency modeling ability, thus helping to better model speaking style features.
  • the hierarchical encoder adopts a knowledge distillation mechanism: the teacher model extracts speaking style features from the real audio corresponding to the text in an unsupervised manner, which helps the student model (the hierarchical encoder) train better and predict the speaking style of sentences more efficiently.
  • this application effectively improves the ability of the model to model speaking style features from hierarchical context information, and the speaking style of the synthesized audio is affected by both the current sentence and its context, making the synthesized audio more expressive and natural, improving the richness of the synthesized audio, and bringing it closer to real human speech.
  • the audio synthesis method provided by this application can be applied in live broadcasting scenarios. For example, when a virtual anchor live-streams an audio novel, the virtual anchor's voice is synthesized using TTS technology; if the synthesized speech is expressive and can convey emotion and speaking style, it will attract more listeners. As shown in FIG. 11, the audio synthesis method provided by this application can be executed by a live broadcast server. The live broadcast server may be a single server, or may be a server cluster composed of multiple servers.
  • the live broadcast server 1110 can use the method provided by any of the above embodiments to synthesize the text data of the audiobook into corresponding audio data, and then send the synthesized audio data to each audience terminal 1120 in the live broadcast room.
  • the audio synthesis method provided by this application can be applied to smart phones, voice assistants, smart navigation, e-books and other products in addition to live broadcast scenarios, and this application does not limit it here.
  • the present application also provides an audio synthesis method, which is realized by an audio synthesis system as shown in FIG. 12 .
  • the user can input multiple consecutive sentences, for example, "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" with three sentences in total.
  • the corresponding audio data can be synthesized by using the audio synthesis system as shown in FIG. 12 .
  • Each current sentence to be synthesized is the target sentence.
  • the target sentence "Xiao Ming has gone to Shenzhen.” can be input into the acoustic feature extraction module 1210 to extract the acoustic features of the target sentence.
  • the text data including the target sentence "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city” can be input into the XLNET language model 1220 and the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
  • the acoustic feature extraction module 1210 includes a text-to-phoneme module 1211 , a phoneme embedding module 1212 and a phoneme encoding module 1213 .
  • the phoneme sequence can be extracted from the target sentence through the text-to-phoneme module 1211 .
  • the phoneme-level features can be extracted from the output phoneme sequence through the phoneme embedding module 1212 .
  • the output phoneme-level features can be used to extract the acoustic features of the target sentence through the phoneme encoder 1213 .
  • through the XLNET language model 1220, the semantic features of the text data can be extracted.
  • the semantic features are then input into the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
  • the hierarchical encoder 1230 includes an inter-word network 1231 and an inter-sentence network 1232 .
  • through the inter-word network 1231, the sentence features of each sentence in the text data can be extracted.
  • through the inter-sentence network 1232, the paragraph features of the text data can be extracted, and the speaking style features of the target sentence can be extracted according to the paragraph features.
  • the prosody predictor 1240 includes three predictors, which are respectively used to predict the pitch, sound intensity and pronunciation duration of the target sentence.
  • the output results of the pitch predictor and the sound intensity predictor are added to the mixed acoustic features, and the length of the acoustic features is adjusted through the Length Regulator (LR) based on the output of the duration predictor, so that the output acoustic features are frame-level acoustic features carrying prosody information (pitch, sound intensity, and duration).
  • the acoustic features carrying prosodic information output by the prosody predictor 1240 can be converted into an 80-dimensional Mel spectrum through the decoder 1250, and finally the audio data corresponding to the target sentence "Xiao Ming has gone to Shenzhen.” can be synthesized through the vocoder 1260.
  • the audio synthesis system determines the most reasonable speaking style of the target sentence according to the context information. For example, in the above example, given the context of the target sentence "Xiao Ming has gone to Shenzhen.", the synthesized audio may emphasize the word "Shenzhen", prolong its pronunciation, and so on.
  • a hierarchical encoder is used to analyze the context semantics, and the influence of the context structure on the speaking style of the sentence is comprehensively considered from the two levels of inter-word relationship and inter-sentence relationship.
  • the extracted speaking style features of the target sentence not only carry the contribution information of each word in the target sentence to the speaking style of the sentence, but also carry the contribution information of other sentences in the context to the speaking style.
  • the hierarchical encoder can extract more information from the context and effectively improve the long-distance dependency modeling ability, thus helping to better model speaking style features.
  • this application effectively improves the ability of the model to model speaking style features from hierarchical context information, and the speaking style of the synthesized audio is affected by both the current sentence and its context, making the synthesized audio more expressive and natural, improving the richness of the synthesized audio, and bringing it closer to real human speech.
  • the present application also provides a computer program product, including a computer program, which, when executed by a processor, can be used to perform the audio synthesis method described in any of the above embodiments.
  • the present application also provides a schematic structural diagram of an electronic device as shown in FIG. 13 .
  • the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course may also include hardware required by other services.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it.
  • the processor is configured as:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • the present application also provides a computer storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, it can be used to perform a method described in any of the above embodiments.
  • a method of audio synthesis is also provided.

Abstract

Provided in the present application are an audio synthesis method, an electronic device, a program product and a storage medium. During the audio synthesis process, an acoustic feature and a speaking style feature are extracted for each sentence, wherein the speaking style feature is determined on the basis of a paragraph feature of the text data where the sentence is located; the paragraph feature is extracted from the sentence features of the sentences on the basis of the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained on the basis of the contribution of each word in the sentences to the speaking style. On the basis of the acoustic feature and the speaking style feature of a target sentence, the target sentence is synthesized into audio data. In this way, the extracted speaking style feature of the target sentence carries both the contribution information of each word in the target sentence to the speaking style of the sentence and the contribution information of other sentences in the context to the speaking style. Audio synthesized using the acoustic feature and the speaking style feature of a sentence is more expressive, thereby enriching the expressive effects of the synthesized audio.

Description

Audio synthesis method, electronic device, program product and storage medium

Technical Field

The present application relates to the technical field of audio processing, and in particular to an audio synthesis method, an electronic device, a program product and a storage medium.

Background

Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio. TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products. Among them, the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid to the listener as natural speech.

The naturalness of synthesized audio depends largely on its expressiveness. Expressive audio can convey the speaker's emotion and speaking style and has a high degree of naturalness. Improving the richness of the expressive effects of synthesized audio is an important part of TTS technology.
Summary of the Invention

The present application provides an audio synthesis method, an electronic device, a program product and a storage medium, which can improve the expressiveness of synthesized audio.

According to a first aspect of the embodiments of the present application, an audio synthesis method is provided, the method comprising:

acquiring a target sentence in text data, the text data comprising at least two consecutive sentences;

acquiring acoustic features of the target sentence;

acquiring speaking style features of the target sentence, wherein the speaking style features are determined based on paragraph features of the text data, the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, and the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;

synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.

In some examples, the contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism;

the contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism.

In some examples, the paragraph features of the text data are also extracted based on the position information of each sentence in the text data.

In some examples, synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence comprises:

predicting, based on the acoustic features of the target sentence and the speaking style features of the target sentence, acoustic features of the target sentence that carry prosody information, wherein the prosody information includes one or more of pitch information, sound intensity information or pronunciation duration;

converting the acoustic features carrying prosody information into the audio data.
In some examples, the method is applied to an audio synthesis system, the audio synthesis system comprising:

an acoustic feature extraction module for extracting the acoustic features of the target sentence;

a speaking style feature extraction module for extracting the speaking style features of the target sentence;

a synthesis module for synthesizing the audio data.

In some examples, the speaking style feature extraction module comprises:

a language model for extracting the semantic features;

a hierarchical encoder for extracting the sentence features, the paragraph features and the speaking style features.

In some examples, the hierarchical encoder is trained using supervised learning, and the training data includes semantic features annotated with real speaking style features.

In some examples, the hierarchical encoder performs supervised learning through a knowledge distillation mechanism, the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism uses unsupervised learning to extract real speaking style features from real audio data.

In some examples, the hierarchical encoder comprises:

an inter-word network for obtaining the contribution of each word in each sentence to the speaking style and, based on the contribution of each word to the speaking style, extracting the sentence features of the sentence from the semantic features of the sentence;

an inter-sentence network for obtaining the contribution of each sentence to the speaking style, extracting paragraph features from the sentence features of each sentence based on the contribution of each sentence to the speaking style, and predicting the speaking style features of the target sentence based on the paragraph features.

In some examples, the synthesis module comprises:

a prosody predictor for predicting, based on the acoustic features of the target sentence and the speaking style features of the target sentence, the acoustic features of the target sentence that carry prosody information;

a converter for converting the acoustic features carrying prosody information into the audio data.

In some examples, the method is performed by a live broadcast server, the text data is text data of an audiobook, and the method further comprises:

sending the audio data to audience terminals.
According to a second aspect of the embodiments of the present application, a computer program product is provided, including a computer program, wherein when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.

According to a third aspect of the embodiments of the present application, an electronic device is provided, the electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

acquire a target sentence in text data, the text data comprising at least two consecutive sentences;

acquire acoustic features of the target sentence;

acquire speaking style features of the target sentence, wherein the speaking style features are determined based on paragraph features of the text data, the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, and the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;

synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.

According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, wherein when the program is executed by a processor, the steps of the method described in the first aspect are implemented.

The technical solutions provided by the embodiments of the present application may include the following beneficial effects: in the process of audio synthesis, speaking style features are extracted for each sentence; the speaking style features are determined based on the paragraph features of the text data in which the sentence is located; the paragraph features are extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style. In this way, the influence of the context structure on the speaking style of a sentence is comprehensively considered at the two levels of inter-word relationships and inter-sentence relationships, and the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style. The audio synthesized using the acoustic features of the sentence together with the above speaking style features is more expressive, which improves the richness of the expressive effects of the synthesized audio.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.
Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.

Fig. 1 is a flowchart of an audio synthesis method according to an embodiment of the present application.

Fig. 2 is a schematic diagram of an audio synthesis system according to an embodiment of the present application.

Fig. 3 is a flowchart of an audio synthesis method according to another embodiment of the present application.

Fig. 4 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 5 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 6 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 7 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 8 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 9 is a flowchart of an audio synthesis method according to another embodiment of the present application.

Fig. 10 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 11 shows an application scenario of an audio synthesis method according to an embodiment of the present application.

Fig. 12 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 13 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present application as recited in the appended claims.
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本申请可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本申请范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at" or "when" or "in response to a determination."
语音合成(Text-To-Speech,TTS)技术是一种能把文本智能地转化为音频的技术。TTS技术已经广泛地应用到了有声小说、新闻、语音助手、智能导航等产品中。其中,合成音频的自然度是衡量音频合成效果的指标之一,达到高自然度的合成音频能使用户在主观感受上与自然语言一样生动形象。Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio. TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products. Among them, the naturalness of synthesized audio is one of the indicators to measure the effect of audio synthesis. Synthesized audio with high naturalness can make users feel as vivid as natural language in subjective experience.
合成音频的自然度在很大程度上取决于合成音频的表现力,富有表现力的音频能够表现出说话人的情感和说话风格,自然度较高。着力于提高合成音频在表达效果上的丰富性是TTS技术中重要的一部分。The naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio. Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
在相关技术中，合成的音频说话风格单一，语气平淡，无法体现出情感与说话内容所蕴含的含义。合成音频的表现力较差，导致与真实语音仍然存在差距。发明人发现，句子整体的语义和情感会受到上下文影响的。句子的表现力由该句子所处语境决定，这不仅与该句子的语义信息相关，还受到该句子所处的文本段落中上下文内其他句子的语义信息影响，又或者说，句子的表现力与该句子在文本段落中所处的位置相关。同一句子若处于文本段落中不同位置，该句子的表现力也会有所不同。In the related art, synthesized audio has a single speaking style and a flat tone, and cannot reflect the emotion or the meaning carried by the spoken content. Such synthesized audio is poorly expressive, so a gap with real speech remains. The inventors found that the overall semantics and emotion of a sentence are affected by its context. The expressiveness of a sentence is determined by the context in which the sentence occurs: it is related not only to the semantic information of the sentence itself, but also to the semantic information of the other sentences around it in the text paragraph. In other words, the expressiveness of a sentence is related to where the sentence is located in the text paragraph; the same sentence placed at different positions in a paragraph will be expressed differently.
然而在相关技术中，绝大多数的方案仅关注了当前句子的语义信息，而忽视了该句子的上下文内其他句子的语义信息对该句子表现力的影响。这导致了同一句子在不同文本中，或者在同一文本的不同位置下，其合成的音频都是千篇一律的，无法捕捉到不同的上下文带来的各种变化，如语调、节奏、重音、情感的不同。从而合成的音频表现力较差，自然度较低。基于上述问题，本申请提出了一种音频合成方法，包括如图1所示步骤：However, in the related art, most solutions focus only on the semantic information of the current sentence and ignore the influence that the semantic information of the other sentences in its context has on the sentence's expressiveness. As a result, the audio synthesized for the same sentence is identical across different texts, or at different positions within the same text, and fails to capture the variations brought about by different contexts, such as differences in intonation, rhythm, stress and emotion. The synthesized audio is therefore poorly expressive and has low naturalness. To address the above problems, the present application proposes an audio synthesis method, including the steps shown in Fig. 1:
步骤110:获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Step 110: Acquiring a target sentence in text data, the text data including at least two consecutive sentences;
步骤120:获取所述目标句子的声学特征;Step 120: Acquiring the acoustic features of the target sentence;
步骤130:获取所述目标句子的说话风格特征;Step 130: Obtain the speaking style features of the target sentence;
其中,所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Wherein, the speaking style features are determined based on the paragraph features of the text data, and the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, The sentence feature is based on the contribution of each word in the sentence to the speaking style, extracted from the semantic feature of the sentence;
步骤140:基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Step 140: Synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.
对于包括至少两个连续句子的文本数据,针对每个句子均可以采用如图1所示的方法来进行音频合成。其中,目标句子可以是当前待合成的句子。目标句子的声学特征是指能体现出目标句子发音特性、声学表现的特征,例如可以是音素级的声学特征,也即每个音素用一个多维向量来表示。For text data including at least two consecutive sentences, the method shown in FIG. 1 can be used for audio synthesis for each sentence. Wherein, the target sentence may be a sentence to be synthesized currently. The acoustic feature of the target sentence refers to the feature that can reflect the pronunciation characteristics and acoustic performance of the target sentence, for example, it can be a phoneme-level acoustic feature, that is, each phoneme is represented by a multi-dimensional vector.
在提取目标句子的说话风格特征过程中，首先提取了文本数据中各个句子的语义特征，然后基于句子中各个词语对说话风格的贡献，从句子的语义特征中提取出句子的句子特征，所谓句子特征，即用一个多维向量来表示一个句子，其包含了一个句子整体的语义信息。在一些例子中，句子中各个词语对说话风格的贡献，可以通过基于注意力机制的词间网络获取。In the process of extracting the speaking style feature of the target sentence, the semantic features of each sentence in the text data are extracted first. Then, based on the contribution of each word in a sentence to the speaking style, the sentence feature of that sentence is extracted from its semantic features. A sentence feature is a multi-dimensional vector that represents the sentence and contains the semantic information of the sentence as a whole. In some examples, the contribution of each word in a sentence to the speaking style can be obtained through an inter-word network based on an attention mechanism.
随后基于各个句子对说话风格的贡献,从各个句子的句子特征提取出段落特征,所谓段落特征,即用一个多维向量来表示一整个文本,其包含了文本数据整体的语义信息。在一些例子中,文本数据的各个句子对说话风格的贡献,可以通过基于注意力机制的句子间网络获取。Then, based on the contribution of each sentence to the speaking style, the paragraph features are extracted from the sentence features of each sentence. The so-called paragraph features use a multi-dimensional vector to represent a whole text, which contains the semantic information of the text data as a whole. In some examples, the contribution of each sentence of the text data to the speaking style can be obtained through an attention-based inter-sentence network.
最后从段落特征中提取出说话风格特征，如此所提取出的说话风格特征结合了每个句子中的词间关系和句间关系，综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。因此，利用句子的声学特征以及说话风格特征所合成出的音频有更优的表现力，提高了合成音频在表达效果上的丰富性。Finally, the speaking style feature is extracted from the paragraph feature. The speaking style feature extracted in this way combines the word-level relationships within each sentence with the sentence-level relationships between sentences, and thus takes the influence of the context structure on a sentence's speaking style into account. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. Therefore, the audio synthesized from the acoustic features and the speaking style feature of a sentence is more expressive, which enriches the expressive effect of the synthesized audio.
在一些实施例中，上述如图1所示的一种音频合成的方法，可以应用于如图2所示音频合成系统200中。音频合成系统200包括声学特征提取模块210、说话风格特征提取模块220以及合成模块230。其中，声学特征提取模块210用于从输入的目标句子中提取目标句子的声学特征；说话风格特征提取模块220用于从输入的文本数据中提取目标句子的说话风格特征；合成模块230用于基于目标句子的声学特征以及说话风格特征，合成目标句子的音频数据。In some embodiments, the audio synthesis method shown in Fig. 1 can be applied to the audio synthesis system 200 shown in Fig. 2. The audio synthesis system 200 includes an acoustic feature extraction module 210, a speaking style feature extraction module 220 and a synthesis module 230. The acoustic feature extraction module 210 is configured to extract the acoustic features of the target sentence from the input target sentence; the speaking style feature extraction module 220 is configured to extract the speaking style feature of the target sentence from the input text data; and the synthesis module 230 is configured to synthesize the audio data of the target sentence based on the acoustic features and the speaking style feature of the target sentence.
在一些实施例中,目标句子的声学特征可以是音素级声学特征。上述步骤120声学特征的获取过程可以包括如图3所示的步骤:In some embodiments, the acoustic features of the target sentence may be phoneme-level acoustic features. The acquisition process of the above-mentioned step 120 acoustic features may include steps as shown in Figure 3:
步骤121:获取所述目标句子的音素序列;Step 121: Obtain the phoneme sequence of the target sentence;
步骤122:基于所述音素序列,获取所述目标句子的音素级特征;Step 122: Obtain phoneme-level features of the target sentence based on the phoneme sequence;
步骤123:基于多头注意力机制,从所述音素级特征中提取出所述目标句子的声学特征。Step 123: Based on the multi-head attention mechanism, extract the acoustic features of the target sentence from the phoneme-level features.
相应地,如图4所示,声学特征提取模块210可以包括文本转音素(Grapheme to Phoneme,G2P)子模块211、音素嵌入(Phoneme Embedding)子模块212以及音素编码(Phoneme Encoder)子模块213。文本转音素子模块211可以按照语言学知识设计的转换逻辑,将输入的目标句子转换为能体现其发音特点的音素序列。例如,对于目标句子“因为知识实在是太多了。”,文本转音素子模块211可以将该目标句子转换为“i-n1|u-e-i4|zh-iy1|sh-iy2|sh-iy2|z-a-i4|sh-iy4|t-a-i4|d-u-o1|l-e5|。”的音素序列。其中,音素序列中的“i”、“n1”、“zh”等为音素符号,每一个音素符号表示一种发音。“|”是用于分隔相邻字的分隔符号,“-”是用于分隔相邻音素的分隔符号。当然,目标句子的音素序列不限于上述的表现形式,还可以是其他能体现目标句子发音特点的形式的音素序列,本申请在此不做限制。可以假设,文本转音素子模块211输出的音素序列长度为N。Correspondingly, as shown in FIG. 4 , the acoustic feature extraction module 210 may include a text-to-phoneme (Grapheme to Phoneme, G2P) submodule 211, a phoneme embedding (Phoneme Embedding) submodule 212 and a phoneme encoding (Phoneme Encoder) submodule 213. The text-to-phoneme sub-module 211 can convert the input target sentence into a phoneme sequence that can reflect its pronunciation characteristics according to the conversion logic designed by linguistic knowledge. For example, for the target sentence "Because there is so much knowledge.", the text-to-phoneme submodule 211 can convert the target sentence into "i-n1|u-e-i4|zh-iy1|sh-iy2|sh-iy2| z-a-i4|sh-iy4|t-a-i4|d-u-o1|l-e5|." phoneme sequence. Wherein, "i", "n1", and "zh" in the phoneme sequence are phoneme symbols, and each phoneme symbol represents a pronunciation. "|" is a separator used to separate adjacent words, and "-" is a separator used to separate adjacent phonemes. Certainly, the phoneme sequence of the target sentence is not limited to the above expression forms, and may also be a phoneme sequence in other forms that can reflect the pronunciation characteristics of the target sentence, which is not limited in this application. It can be assumed that the length of the phoneme sequence output by the text-to-phoneme sub-module 211 is N.
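As a rough illustration of the grapheme-to-phoneme step described above, the following sketch uses a small hand-written lexicon as a stand-in for whatever pronunciation dictionary and conversion rules a real system would use (the document does not specify them); it only reproduces the output format, with "-" separating the phonemes of one character and "|" separating adjacent characters.

```python
# Hypothetical, hand-written lexicon; a real system would use a full Mandarin
# pronunciation dictionary plus linguistically designed conversion rules.
LEXICON = {
    "因": ["i", "n1"],
    "为": ["u", "e", "i4"],
    "知": ["zh", "iy1"],
    "识": ["sh", "iy2"],
}

def grapheme_to_phoneme(sentence: str) -> str:
    """Convert a sentence into a phoneme string: '-' joins the phonemes of one
    character, '|' separates adjacent characters; punctuation passes through."""
    per_char = []
    for ch in sentence:
        if ch in LEXICON:
            per_char.append("-".join(LEXICON[ch]))
        else:
            per_char.append(ch)  # punctuation or out-of-lexicon characters
    return "|".join(per_char) + "|"

print(grapheme_to_phoneme("因为知识"))  # -> i-n1|u-e-i4|zh-iy1|sh-iy2|
```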
基于文本转音素子模块211输出的音素序列,音素嵌入子模块212可以将每个音素映射为一个多维向量。例如,每个音素可以映射成一个256维的浮点型向量。如此,长度为N的音素序列经过音素嵌入模块212后,可以映射为一个N*256的矩阵, 即音素级特征。Based on the phoneme sequence output by the text-to-phoneme sub-module 211, the phoneme embedding sub-module 212 can map each phoneme into a multi-dimensional vector. For example, each phoneme can be mapped to a 256-dimensional floating-point vector. In this way, the phoneme sequence with a length of N can be mapped into an N*256 matrix after passing through the phoneme embedding module 212 , that is, phoneme-level features.
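A minimal sketch of the phoneme-embedding step, assuming a PyTorch implementation and a toy phoneme vocabulary; it only illustrates how a length-N phoneme sequence becomes an N*256 matrix of phoneme-level features.

```python
import torch
import torch.nn as nn

# Toy vocabulary, assumed for illustration only.
PHONE_VOCAB = {"<pad>": 0, "i": 1, "n1": 2, "u": 3, "e": 4, "i4": 5, "zh": 6, "iy1": 7}

# Each phoneme symbol is mapped to a 256-dimensional floating-point vector.
embedding = nn.Embedding(num_embeddings=len(PHONE_VOCAB), embedding_dim=256, padding_idx=0)

phoneme_ids = torch.tensor([[PHONE_VOCAB[p] for p in ["i", "n1", "u", "e", "i4"]]])  # (1, N)
phoneme_level_features = embedding(phoneme_ids)                                      # (1, N, 256)
print(phoneme_level_features.shape)  # torch.Size([1, 5, 256])
```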
音素嵌入子模块212输出的音素级特征随后输入音素编码子模块213。音素编码子模块213包括一个位置编码模型和若干个Transformer模型。如图5所示,音素编码子模块213由一个位置编码模型和四个Transformer模型组成。位置编码模型可以将人工设计的位置信息添加到音素级特征中,如此,后续的Transformer模型可以在计算时将音素的位置也考虑进去。位置编码的计算方式可以参照相关技术实施,本申请在此不展开论述。The phoneme-level features output by the phoneme embedding sub-module 212 are then input into the phoneme encoding sub-module 213 . The phoneme coding sub-module 213 includes a position coding model and several Transformer models. As shown in FIG. 5 , the phoneme coding sub-module 213 is composed of a position coding model and four Transformer models. The position encoding model can add artificially designed position information to the phoneme-level features, so that the subsequent Transformer model can also take the position of the phoneme into account when calculating. The calculation method of the position code can be implemented with reference to related technologies, and this application will not discuss it here.
Transformer模型由一个带有残差连接和层归一化的多头自注意力机制,以及一个带有残差连接和层归一化的一维卷积层组成。Transformer模型可以根据音素之间的关系以及融合前后各音素的信息,提取出目标句子的声学特征。该声学特征可以是音素级声学特征,携带有音素之间对发音特性影响的信息。大小为N*256的音素级特征经过音素编码子模块213后,可以提取出大小同样为N*256的是音素级声学特征。即音素编码子模块213输出的序列大小可以和原始序列大小保持一致。The Transformer model consists of a multi-head self-attention mechanism with residual connections and layer normalization, and a 1D convolutional layer with residual connections and layer normalization. The Transformer model can extract the acoustic features of the target sentence according to the relationship between phonemes and the information of each phoneme before and after fusion. The acoustic features may be phoneme-level acoustic features, which carry information about the influence of phonemes on pronunciation characteristics. After the phoneme-level features with a size of N*256 are passed through the phoneme encoding sub-module 213, the phoneme-level acoustic features with the same size of N*256 can be extracted. That is, the size of the sequence output by the phoneme coding sub-module 213 can be consistent with the size of the original sequence.
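The following sketch shows one possible PyTorch realization of the phoneme encoder described above: sinusoidal position encoding followed by four blocks, each consisting of multi-head self-attention and a one-dimensional convolution, both with residual connections and layer normalization. The number of attention heads, the kernel size and the two-layer convolution structure are assumptions, not values given in the document.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_position_encoding(length: int, dim: int) -> torch.Tensor:
    """Fixed position encoding added to the phoneme-level features."""
    position = torch.arange(length).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

class FFTBlock(nn.Module):
    """One encoder block: multi-head self-attention and a 1-D convolution,
    each followed by a residual connection and layer normalization."""
    def __init__(self, dim=256, heads=2, kernel_size=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                      # x: (batch, N, dim)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                           # residual + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)                        # residual + layer norm

x = torch.randn(1, 12, 256)                                    # phoneme-level features, N = 12
x = x + sinusoidal_position_encoding(12, 256)                  # add position information
encoder = nn.Sequential(*[FFTBlock() for _ in range(4)])       # four stacked blocks
acoustic_features = encoder(x)                                 # still (1, 12, 256)
```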
在一些实施例中,步骤130目标句子的说话风格特征的提取过程可以由说话风格特征提取模块220执行。如图6所示,说话风格特征提取模块220包括语言模型221和层级编码器222。其中,语言模型221用于提取文本数据中各个句子的语义特征;层级编码器222用于提取句子特征、段落特征以及说话风格特征。In some embodiments, the process of extracting the speaking style features of the target sentence in step 130 may be performed by the speaking style feature extraction module 220 . As shown in FIG. 6 , the speaking style feature extraction module 220 includes a language model 221 and a hierarchical encoder 222 . Among them, the language model 221 is used to extract the semantic features of each sentence in the text data; the hierarchical encoder 222 is used to extract sentence features, paragraph features and speaking style features.
在一些实施例中，语言模型221可以是XLNET语言模型(Generalized Autoregressive Pretraining for Language Understanding)。XLNET语言模型在一个字数多达数十亿的文本数据上提前训练，大量的文本的训练数据可以让XLNET模型更好地理解提取出文本的语义信息。In some embodiments, the language model 221 may be an XLNet language model (Generalized Autoregressive Pretraining for Language Understanding). The XLNet language model is pretrained on text data containing billions of characters; this large amount of training text enables the XLNet model to better understand the text and extract its semantic information.
在另一些实施例中,语言模型221可以是BERT(Bidirectional Encoder Representations from Transformers)语言模型。BERT语言模型可以利用大量的中文文本数据进行预训练,以提取有效的语义特征。In other embodiments, the language model 221 may be a BERT (Bidirectional Encoder Representations from Transformers) language model. The BERT language model can be pre-trained with a large amount of Chinese text data to extract effective semantic features.
语言模型221不限于上述两种模型,本领域技术人员可以根据实际需要选取其他能够实现从文本数据中提取语义特征效果的模型作为语言模型221。The language model 221 is not limited to the above two models, and those skilled in the art may select other models capable of extracting semantic features from text data as the language model 221 according to actual needs.
输入语言模型221的文本数据包括目标句子以及目标句子的前后若干个句子。例如包括目标句子及其前后各L个句子，共2L+1个句子。其中，L可以是正整数。文本数据经过语言模型221可以提取出文本数据的语义特征。在一些实施例中，文本数据的语义特征可以是字符级别的语义特征，也可以是词级别的语义特征。所谓字符级别的语义特征为每个字符用一个多维向量来表示。同理，所谓词级别的语义特征为每个词语用一个多维向量来表示。例如，若语言模型221为BERT语言模型，则提取出的语义特征可以是字符级别的语义特征或词级别的语义特征。若语言模型221为XLNET语言模型，则提取出的语义特征可以是词级别的语义特征。若语言模型221为其他能实现从文本数据中提取语义特征效果的模型，则在提取语义特征前，可以先对文本数据进行分词处理，随后在经过分词处理后的文本数据上提取出词级别的语义特征。The text data input into the language model 221 includes the target sentence and several sentences before and after it, for example the target sentence and L sentences on each side, 2L+1 sentences in total, where L may be a positive integer. The semantic features of the text data can be extracted from the text data by the language model 221. In some embodiments, the semantic features of the text data may be character-level semantic features or word-level semantic features; a character-level semantic feature represents each character with a multi-dimensional vector, and likewise a word-level semantic feature represents each word with a multi-dimensional vector. For example, if the language model 221 is a BERT language model, the extracted semantic features may be character-level or word-level semantic features; if the language model 221 is an XLNet language model, the extracted semantic features may be word-level semantic features. If the language model 221 is another model capable of extracting semantic features from text data, the text data may first be segmented into words, and word-level semantic features may then be extracted from the segmented text data.
以下实施例以语言模型221为XLNET语言模型为例，XLNET语言模型可以按照模型所学习到的知识对句子进行分词，若输入的文本数据共有M个词语，并且对每个词语输出一个表示该词语意思的768维的高维表示，那么输入的文本数据经过XLNET语言模型后可以提取出M*768的词级别的语义特征。In the following embodiments, the language model 221 is an XLNet language model by way of example. The XLNet language model segments sentences into words according to the knowledge it has learned. If the input text data contains M words in total and the model outputs a 768-dimensional representation of the meaning of each word, then M*768 word-level semantic features can be extracted after the input text data passes through the XLNet language model.
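A sketch of this semantic-feature extraction step, assuming the Hugging Face transformers implementation of a pretrained Chinese XLNet checkpoint; the checkpoint name is illustrative, and the model's own subword tokenization stands in for the word segmentation described above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; any pretrained Chinese XLNet (or BERT) model that
# exposes hidden states would play the same role here.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

context = "小明去哪了？小明去深圳了。深圳是一个美丽的城市。"   # 2L+1 consecutive sentences
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

semantic_features = outputs.last_hidden_state   # (1, M, 768): one 768-d vector per token
print(semantic_features.shape)
```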
语言模型221输出的语义特征在输入层级编码器222之前,可以按照每个词语所属的句子,分割为2L+1个序列。每个序列由每个句子所包括的词语对应的语义特征组成。以下将分割的每个序列简称为句子的语义特征。分割后的各个序列输入层级编码器222后,可以提取出各个句子的句子特征、文本数据的段落特征以及目标句子的说话风格特征。层级编码器222包括两层注意力网络,如图7所示,层级编码器222包括词间网络2221和句子间网络2222,这两层注意力网络具有相似的结构,主要包括一个双向门控循环单元(Gated Recurrent Unit,GRU)和一个缩放的点积注意力机制。The semantic features output by the language model 221 can be divided into 2L+1 sequences according to the sentence to which each word belongs before being input to the level encoder 222 . Each sequence consists of semantic features corresponding to the words included in each sentence. Each segmented sequence is referred to as the semantic feature of the sentence below. After each segmented sequence is input into the hierarchical encoder 222, the sentence features of each sentence, the paragraph features of the text data, and the speaking style features of the target sentence can be extracted. The hierarchical encoder 222 includes two layers of attention networks. As shown in FIG. 7, the hierarchical encoder 222 includes an inter-word network 2221 and an inter-sentence network 2222. These two layers of attention networks have similar structures, mainly including a bidirectional gating loop Unit (Gated Recurrent Unit, GRU) and a scaled dot-product attention mechanism.
词间网络2221用于获取每个句子中各个词语对说话风格的贡献，并基于所述各个词语对说话风格的贡献，从所述句子（即每个序列）的语义特征提取所述句子的句子特征。各个词语对说话风格的贡献可以理解为该词语在多大程度上影响了说话风格，可以表现为同一句子内部每个词语的词义和词语之间的关系。句子的语义特征输入词间网络2221后，双向GRU会考虑时间顺序和上下文信息，重新提取每个词语对应的语义特征。同时，由于并非每个词语对句子的意思都有相同的贡献，因此可以采用缩放的点积注意力机制来计算每个词语对应的权重，并根据权重汇总出整个句子对应的句子特征。作为例子，可以利用键(Key)、值(Value)和查询(Query)向量来计算每个词语对应的权重并根据权重汇总出句子特征。其中，键和值是由每个词语对应的语义特征经线性变换得到的向量，查询向量则是一个从数据集中训练得到的向量。如在上述例子中，经过XLNET语言模型提取出的M*768的词级别的语义特征共分割为2L+1个序列，每个序列代表一个句子的语义特征，共有2L+1个句子的语义特征。每个句子的语义特征分别经过词间网络2221后，共提取出2L+1个句子的句子特征，且每个句子的句子特征为一个256维的向量。The inter-word network 2221 is configured to obtain the contribution of each word in a sentence to the speaking style and, based on those contributions, extract the sentence feature of the sentence from the semantic features of the sentence (i.e., from each sequence). The contribution of a word to the speaking style can be understood as the extent to which the word influences the speaking style, and is reflected in the meaning of each word within the sentence and the relationships between words. After the semantic features of a sentence are input into the inter-word network 2221, the bidirectional GRU re-extracts the semantic feature of each word, taking temporal order and context information into account. At the same time, since not every word contributes equally to the meaning of the sentence, a scaled dot-product attention mechanism can be used to compute a weight for each word, and the sentence feature of the whole sentence is aggregated according to these weights. As an example, key, value and query vectors can be used to compute the weight of each word and aggregate the sentence feature: the keys and values are obtained by linear transformations of the semantic feature of each word, while the query is a vector learned from the training data set. In the above example, the M*768 word-level semantic features extracted by the XLNet language model are split into 2L+1 sequences, each representing the semantic features of one sentence. After the semantic features of each sentence pass through the inter-word network 2221, the sentence features of the 2L+1 sentences are obtained, each being a 256-dimensional vector.
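A minimal PyTorch sketch of the inter-word network just described: a bidirectional GRU re-encodes the word-level features, keys and values are linear transformations of the GRU outputs, and a learned query with scaled dot-product attention weighs each word's contribution to produce one 256-dimensional sentence feature. Layer sizes are assumptions consistent with the running example.

```python
import math
import torch
import torch.nn as nn

class InterWordNetwork(nn.Module):
    """Word-level attention pooling: BiGRU re-encoding followed by scaled
    dot-product attention with a learned query vector."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, dim // 2, batch_first=True, bidirectional=True)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))      # trained query vector

    def forward(self, word_features):                    # (1, num_words, 768)
        h, _ = self.gru(word_features)                   # (1, num_words, 256)
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))          # (1, num_words)
        weights = torch.softmax(scores, dim=-1)                    # per-word contribution
        sentence_feature = (weights.unsqueeze(-1) * v).sum(dim=1)  # (1, 256)
        return sentence_feature

net = InterWordNetwork()
sentence_feature = net(torch.randn(1, 7, 768))           # one sentence with 7 words
print(sentence_feature.shape)                            # torch.Size([1, 256])
```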
各个句子的句子特征可以进行混合处理后输入句子间网络2222。如在上述例子中，词间网络2221共提取出2L+1个256维的句子特征。可以将这些句子特征拼接成一个(2L+1)*256的混合特征，并输入句子间网络2222。句子间网络2222可以获取各个句子对说话风格的贡献，基于所述各个句子对说话风格的贡献，从各个句子的句子特征提取段落特征，并基于所述段落特征，预测所述目标句子的说话风格特征。各个句子对说话风格的贡献可以理解为该句子在多大程度上影响了说话风格，可以表现为文本数据中各个句子的句子表征和句子间关系。在一些实施例中，文本数据的段落特征，除了基于各个句子对说话风格的贡献来提取，还可以基于各个句子在所述文本数据中的位置信息进行提取。The sentence features of the individual sentences can be combined and then input into the inter-sentence network 2222. In the above example, the inter-word network 2221 extracts 2L+1 256-dimensional sentence features, which can be concatenated into a (2L+1)*256 mixed feature and input into the inter-sentence network 2222. The inter-sentence network 2222 can obtain the contribution of each sentence to the speaking style, extract the paragraph feature from the sentence features of the individual sentences based on those contributions, and predict the speaking style feature of the target sentence based on the paragraph feature. The contribution of a sentence to the speaking style can be understood as the extent to which the sentence influences the speaking style, and is reflected in the sentence representations and the inter-sentence relationships of the text data. In some embodiments, the paragraph feature of the text data is extracted based not only on the contribution of each sentence to the speaking style, but also on the position information of each sentence within the text data.
如图7所示,与词间网络2221类似,输入的拼接后的混合特征经过一个双向GRU来结合上下文重新提取每个句子的特征。然后通过位置编码向重新提取的每个句子的特征添加句子之间的相对位置信息。并且采用缩放的点积注意力来获得段落特征。最后通过一个线性层来根据段落特征预测目标句子的说话风格特征。如在上述例子中,一个(2L+1)*256的混合特征经过句子间网络2222可以预测出文本数据的256维段落特征,并最后预测出一个256维的说话风格特征。As shown in Fig. 7, similar to the inter-word network 2221, the input concatenated mixed features go through a bidirectional GRU to re-extract the features of each sentence in combination with the context. The relative position information between sentences is then added to the re-extracted features of each sentence through position encoding. And scaled dot-product attention is adopted to obtain paragraph features. Finally, a linear layer is used to predict the speaking style features of the target sentence based on the paragraph features. As in the above example, a (2L+1)*256 mixed feature can predict the 256-dimensional paragraph feature of the text data through the inter-sentence network 2222, and finally predict a 256-dimensional speaking style feature.
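A corresponding sketch of the inter-sentence network: a bidirectional GRU over the 2L+1 sentence features, added position information, scaled dot-product attention that pools them into a paragraph feature, and a linear layer that predicts the 256-dimensional speaking style feature of the target sentence. The learned position embedding stands in for the position encoding mentioned above; all sizes are assumptions consistent with the running example.

```python
import math
import torch
import torch.nn as nn

class InterSentenceNetwork(nn.Module):
    """Sentence-level stage: BiGRU, inter-sentence position information,
    scaled dot-product attention pooling, and a linear output layer."""
    def __init__(self, dim=256, max_sentences=32):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.pos = nn.Embedding(max_sentences, dim)   # stand-in for position encoding
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))
        self.to_style = nn.Linear(dim, dim)           # predicts the speaking style feature

    def forward(self, sentence_features):             # (1, 2L+1, 256)
        h, _ = self.gru(sentence_features)
        h = h + self.pos(torch.arange(h.size(1)))     # add relative position information
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))
        weights = torch.softmax(scores, dim=-1)                        # per-sentence contribution
        paragraph_feature = (weights.unsqueeze(-1) * v).sum(dim=1)     # (1, 256)
        return self.to_style(paragraph_feature)                        # (1, 256) speaking style

net = InterSentenceNetwork()
style = net(torch.randn(1, 5, 256))   # L = 2: target sentence plus two neighbours on each side
print(style.shape)                    # torch.Size([1, 256])
```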
通过上文记载的实施例,可以提取出目标句子的声学特征和目标句子的说话风格特征。其中,发明人发现,在音频合成训练数据有限的情况下,要音频合成系统200隐式地学习到句子语义和合成音频的说话风格之间的映射关系是十分困难的。为此,在一些实施例中,层级编码器222可以采用有监督学习进行训练,训练数据可以包括标注有真实说话风格特征的语义特征。由于层级编码器222采用了有监督学习进行训练,因此可以显式地学习到说话风格特征。从隐式学习到显示学习,能大大提升模型对说话风格的预测效果。Through the embodiments described above, the acoustic features of the target sentence and the speaking style features of the target sentence can be extracted. Among them, the inventors found that, in the case of limited audio synthesis training data, it is very difficult for the audio synthesis system 200 to implicitly learn the mapping relationship between the sentence semantics and the speaking style of the synthesized audio. To this end, in some embodiments, the hierarchical encoder 222 can be trained using supervised learning, and the training data can include semantic features marked with real speaking style features. Since the hierarchical encoder 222 is trained using supervised learning, it can learn the speaking style features explicitly. From implicit learning to explicit learning, the prediction effect of the model on speaking style can be greatly improved.
然而,要获得真实说话风格特征是困难的,为此,在一些实施例中,层级编码器可以通过知识蒸馏(Knowledge Distillation)机制进行有监督学习。知识蒸馏机制包括教师网络(Teacher Network)和学生网络(Student Network),通过引入与教师网络相关的软目标(Soft Target)作为学生网络的训练目标中的一部分,以诱导学生网络的训练,从而实现知识迁移。真实说话风格特征可以包括教师模型的输出特征。在一 些实施例中,教师网络可以是参考编码器,如图8所示,参考编码器包括若干层二维卷积神经网络,如6层,一个GRU网络和一个全连接网络。参考编码器采用无监督学习进行训练,其训练数据为与文本对应的真实音频的音频特征。音频特征可以是梅尔频谱(mel-spectrogram)或LPC(Linear Prediction Coefficient,线性预测系数)等等中的一种或多种。以80维的梅尔频谱为例,在训练过程中,真实音频的80维梅尔频谱输入参考编码器后,以无监督学习的方式让参考编码器从梅尔频谱中提取出相应的说话风格特征。参考编码器输出的说话风格特征可以看做是真实说话风格特征。然后以标注有真实说话风格特征的语义特征作为训练数据来训练层级编码器222。如此,层级编码器222可以显式地学习说话风格特征,减小了模型训练的压力,并大大增强了在训练数据量不足的情况下,层级编码器222对说话风格特征的建模效果。However, it is difficult to obtain real speaking style features. For this reason, in some embodiments, the hierarchical encoder can perform supervised learning through a knowledge distillation (Knowledge Distillation) mechanism. The knowledge distillation mechanism includes the teacher network (Teacher Network) and the student network (Student Network), by introducing the soft target (Soft Target) related to the teacher network as part of the training target of the student network to induce the training of the student network, so as to achieve knowledge transfer. The real speaking style features may include output features of the teacher model. In some embodiments, the teacher network can be a reference encoder. As shown in FIG. 8, the reference encoder includes several layers of two-dimensional convolutional neural networks, such as 6 layers, a GRU network and a fully connected network. The reference encoder is trained with unsupervised learning on audio features of real audio corresponding to text. The audio feature may be one or more of mel-spectrogram or LPC (Linear Prediction Coefficient, linear prediction coefficient), etc. Take the 80-dimensional Mel spectrum as an example. During the training process, after the 80-dimensional Mel spectrum of real audio is input into the reference encoder, the reference encoder can extract the corresponding speaking style from the Mel spectrum in an unsupervised learning manner. feature. The speaking style features output by the reference encoder can be regarded as real speaking style features. The hierarchical encoder 222 is then trained with semantic features annotated with real speaking style features as training data. In this way, the hierarchical encoder 222 can explicitly learn the speaking style features, which reduces the pressure of model training and greatly enhances the modeling effect of the hierarchical encoder 222 on the speaking style features when the amount of training data is insufficient.
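The following sketch illustrates this knowledge-distillation setup under stated assumptions: a reference encoder (the teacher) with six 2-D convolution layers, a GRU and a fully connected layer maps an 80-bin mel-spectrogram to a 256-dimensional "real" speaking style feature, and the hierarchical encoder (the student) is trained to match it. The channel counts, strides and the MSE distillation loss are assumptions, not details given in the document.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Teacher network: stacked 2-D convolutions over the mel-spectrogram,
    a GRU, and a fully connected layer producing a 256-d style feature."""
    def __init__(self, n_mels=80, style_dim=256):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]          # assumed channel progression
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        m_out = n_mels
        for _ in range(6):
            m_out = (m_out + 2 - 3) // 2 + 1           # mel-axis size after each stride-2 conv
        self.gru = nn.GRU(chans[-1] * m_out, style_dim, batch_first=True)
        self.fc = nn.Linear(style_dim, style_dim)

    def forward(self, mel):                            # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))               # (batch, 128, frames', mels')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, h = self.gru(x)                             # h: (1, batch, style_dim)
        return self.fc(h.squeeze(0))                   # (batch, style_dim)

teacher = ReferenceEncoder()
mel = torch.randn(2, 200, 80)                          # mel-spectrograms of real recordings
with torch.no_grad():
    real_style = teacher(mel)                          # treated as the "real" style features

# The hierarchical encoder (student) is trained so that its predicted style
# features match the teacher output; the loss choice below is an assumption.
predicted_style = torch.randn(2, 256, requires_grad=True)   # stand-in for the student output
distillation_loss = nn.functional.mse_loss(predicted_style, real_style)
```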
在提取出目标句子的声学特征和目标句子的说话风格特征后,可以将所述目标句子合成为音频数据。在一些实施例中,上述步骤140音频数据的合成过程可以包括如图9所示的步骤:After the acoustic features of the target sentence and the speaking style features of the target sentence are extracted, the target sentence can be synthesized into audio data. In some embodiments, the above step 140 audio data synthesis process may include the steps shown in Figure 9:
步骤141:基于所述目标句子的声学特征以及说话风格特征,预测所述目标句子的携带韵律信息的声学特征;Step 141: Predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features and speaking style features of the target sentence;
其中,韵律信息包括音高信息、音强信息或发音时长中的一种或多种。Wherein, the prosody information includes one or more of pitch information, sound intensity information or pronunciation duration.
步骤142:将所述携带韵律信息的声学特征转换为所述音频数据。Step 142: Convert the acoustic features carrying prosody information into the audio data.
相应地,如图10所示,合成模块230可以包括韵律预测器231和转换器232。韵律预测器231用于基于目标句子的声学特征以及目标句子的说话风格特征,预测目标句子的携带韵律信息的声学特征;转换器232用于将携带韵律信息的声学特征转换为音频数据。Correspondingly, as shown in FIG. 10 , the synthesis module 230 may include a prosody predictor 231 and a converter 232 . The prosody predictor 231 is used to predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence; the converter 232 is used to convert the acoustic features carrying prosody information into audio data.
在一些实施例中,目标句子的声学特征和说话风格特征在输入韵律预测器231之前,可以先将说话风格特征复制成声学特征的长度。如在上述例子中,声学特征提取模块210输出了大小为N*256的是音素级声学特征,而说话风格特征提取模块220输出了一个256维的说话风格特征。那么可以将说话风格特征复制成长度为N的特征,并添加到大小为N*256的音素级声学特征上。然后将混合后的音素级声学特征输入韵律预测器231。In some embodiments, before the acoustic features and speaking style features of the target sentence are input into the prosody predictor 231, the speaking style features may be copied to the length of the acoustic features. As in the above example, the acoustic feature extraction module 210 outputs a phoneme-level acoustic feature with a size of N*256, and the speaking style feature extraction module 220 outputs a 256-dimensional speaking style feature. Then the speaking style feature can be copied into a feature of length N and added to the phoneme-level acoustic feature of size N*256. The mixed phone-level acoustic features are then input to the prosody predictor 231 .
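A small sketch of the mixing step just described: the single 256-dimensional speaking style feature is copied along the phoneme axis and added to the N*256 phoneme-level acoustic features before they enter the prosody predictor.

```python
import torch

N = 12
acoustic_features = torch.randn(N, 256)   # from the phoneme encoder
style_feature = torch.randn(256)          # from the hierarchical encoder

# Copy the style feature to length N and add it to the phoneme-level features.
mixed = acoustic_features + style_feature.unsqueeze(0).expand(N, -1)
print(mixed.shape)                        # torch.Size([12, 256])
```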
韵律预测器231包括三个语音变化预测器,这些语音变化预测器的结构基本一致,包括两个带有层归一化的一维卷积层以及一个全连接层。三个语音变化预测器分 别用于预测合成音频在该音素上的音高、音强和发音时长。其中,发音时长的单位为帧。每个预测器对每个混合后的音素级声学特征预测出一个浮点数作为预测结果。随后,预测出的音高预测结果和音强预测结果可以通过一个全连接层转换成256维的表征并添加到混合后的音素级声学特征中。而发音时长预测结果则通过四舍五入保留整数,代表了该音素的发音时长有多少帧。按照每个音素对应的发音时长对音素的声学特征进行复制,并将复制后的特征拼接一起作为帧级别的声学特征,该声学特征携带了音高、音强和发音时长的韵律信息。如在上述例子中,若预测出N*256的音素级声学特征的发音总时长包括n帧音频,那么韵律预测器231输出大小为n*256的携带韵律信息的声学特征。The prosody predictor 231 includes three speech change predictors, and the structures of these speech change predictors are basically the same, including two one-dimensional convolutional layers with layer normalization and one fully connected layer. Three speech change predictors are used to predict the pitch, intensity and duration of the synthesized audio on that phoneme, respectively. Wherein, the unit of the pronunciation duration is frame. Each predictor predicts a floating-point number for each mixed phone-level acoustic feature as the prediction result. Subsequently, the predicted pitch predictions and intensity predictions can be transformed into a 256-dimensional representation through a fully connected layer and added to the mixed phone-level acoustic features. The prediction result of the pronunciation duration is rounded to retain an integer, which represents how many frames the pronunciation duration of the phoneme is. The acoustic features of the phoneme are copied according to the pronunciation duration corresponding to each phoneme, and the copied features are spliced together as the frame-level acoustic features, which carry the prosody information of pitch, sound intensity and pronunciation duration. As in the above example, if it is predicted that the total pronunciation duration of N*256 phoneme-level acoustic features includes n frames of audio, then the prosody predictor 231 outputs n*256 acoustic features carrying prosody information.
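A minimal PyTorch sketch of one of the three prosody predictors and of the duration-based length regulation described above; the kernel size and the use of ReLU are assumptions. Pitch and energy predictions would, analogously, be projected back to 256 dimensions and added to the phoneme-level features.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """One of the three predictors (pitch, intensity or duration): two 1-D
    convolutions with layer normalization and a linear layer producing one
    scalar per phoneme."""
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                                        # x: (N, 256)
        h = torch.relu(self.conv1(x.t().unsqueeze(0))).squeeze(0).t()
        h = self.norm1(h)
        h = torch.relu(self.conv2(h.t().unsqueeze(0))).squeeze(0).t()
        h = self.norm2(h)
        return self.out(h).squeeze(-1)                           # one value per phoneme

def length_regulate(features, durations):
    """Repeat each phoneme's feature vector durations[i] times, giving
    frame-level features as described for the duration prediction."""
    return torch.repeat_interleave(features, durations, dim=0)

mixed = torch.randn(12, 256)                                     # style-conditioned phoneme features
duration_pred = VariancePredictor()(mixed)                       # float prediction per phoneme
durations = torch.clamp(torch.round(duration_pred), min=1).long()  # rounded to whole frames
frame_level = length_regulate(mixed, durations)                  # (sum(durations), 256)
```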
随后,韵律预测器231输出的携带韵律信息的声学特征经过转换器232,可以合成出目标句子的音频数据。其中转换器232包括解码器和声码器。解码器可以将携带韵律信息的声学特征转换为相应的音频特征,如梅尔频谱或LPC等。解码器输出的音频特征经过声码器后,可以输出目标句子的音频数据。其中,声码器可以是基于Hifi-GAN的神经网络声码器;音频数据可以是采样率为24kHz的合成音频数据。Subsequently, the acoustic features carrying prosodic information output by the prosody predictor 231 pass through the converter 232 to synthesize the audio data of the target sentence. Wherein the converter 232 includes a decoder and a vocoder. The decoder can convert the acoustic features carrying prosody information into corresponding audio features, such as Mel Spectrum or LPC, etc. After the audio features output by the decoder pass through the vocoder, the audio data of the target sentence can be output. Wherein, the vocoder can be a neural network vocoder based on Hifi-GAN; the audio data can be synthetic audio data with a sampling rate of 24kHz.
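A rough sketch of the converter, under the assumption that the decoder is a small feed-forward network producing an 80-bin mel-spectrogram; the document only fixes the input and output formats, and the vocoder call is a hypothetical placeholder for a HiFi-GAN-style neural vocoder producing a 24 kHz waveform.

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Maps frame-level acoustic features (with prosody information) to an
    80-bin mel-spectrogram; the internal layers are assumptions."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_mels),
        )

    def forward(self, frame_features):        # (n_frames, 256)
        return self.net(frame_features)       # (n_frames, 80)

frame_features = torch.randn(240, 256)        # n = 240 frames from the prosody predictor
mel = MelDecoder()(frame_features)

# A neural vocoder (e.g. a HiFi-GAN generator) would then turn `mel` into a
# 24 kHz waveform; `vocoder` below is a hypothetical callable standing in for it.
# waveform = vocoder(mel.t().unsqueeze(0))    # (1, n_samples) at 24 kHz
```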
本申请提供的一种音频合成方法，在音频合成的过程中，针对每个句子都提取了说话风格特征，而说话风格特征是基于句子所在的文本数据的段落特征确定的，段落特征又是基于文本数据中每个句子对说话风格的贡献，从各个句子的句子特征提取的，句子特征则是基于句子中各个词语对说话风格的贡献获得的。如此，在说话风格特征提取的过程中，不仅考虑了目标句子的语义信息，还考虑了上下文句子对目标句子的语义信息的影响，从而能捕捉到不同上下文带来的语音变化。In the audio synthesis method provided by the present application, a speaking style feature is extracted for each sentence during audio synthesis. The speaking style feature is determined based on the paragraph feature of the text data in which the sentence is located; the paragraph feature is in turn extracted from the sentence features of the individual sentences based on the contribution of each sentence of the text data to the speaking style, and each sentence feature is obtained based on the contribution of each word in the sentence to the speaking style. In this way, the extraction of the speaking style feature considers not only the semantic information of the target sentence but also the influence of the context sentences on it, so that the speech variations brought about by different contexts can be captured.
此外，本申请采用了层级编码器对上下文语义进行分析，从词间关系和句间关系这两个层面上综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。层级编码器可以从上下文中提取出更多的信息并且有效提升了长距离依赖的建模能力，从而帮助更好的建模说话风格特征。In addition, the present application uses a hierarchical encoder to analyze the context semantics and considers the influence of the context structure on a sentence's speaking style at two levels, inter-word relationships and inter-sentence relationships. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. The hierarchical encoder can extract more information from the context and effectively improves the modeling of long-distance dependencies, which helps to better model the speaking style feature.
另外,层级编码器采用了知识蒸馏机制,教师模型以无监督学习的方式从文本对应的真实音频中提取出说话风格特征,以此来帮助学生模型,即层级编码器更好地训练,帮助模型更高效地预测句子的说话风格。In addition, the hierarchical encoder adopts a knowledge distillation mechanism, and the teacher model extracts the speaking style features from the real audio corresponding to the text in an unsupervised learning manner, so as to help the student model, that is, the hierarchical encoder to better train and help the model More efficiently predict the speaking style of sentences.
如此，本申请有效地提升了模型从层级上下文信息中建模说话风格特征的能力，合成的音频的说话风格会受到当前句子和上下文的影响，使得合成出的音频有更优的表现力和自然度，提高了合成音频在表达效果上的丰富性，且更加接近真实人类的语音。In this way, the present application effectively improves the model's ability to learn speaking style features from hierarchical context information. The speaking style of the synthesized audio is influenced by both the current sentence and its context, so the synthesized audio is more expressive and more natural, the expressive effect of the synthesized audio is richer, and the result is closer to real human speech.
在一些实施例中,本申请提供的一种音频合成方法,可以应用在直播场景中,如虚拟主播进行有声小说直播时,由于虚拟主播的语音是利用TTS技术合成的,在进行有声小说直播时,若合成的语音富有表现力,能表现出情感和说话风格,将能吸引更多的听众收听。如图11所示,本申请提供的一种音频合成方法可以由直播服务器执行。其中,直播服务器可以是单独一台服务器,也可以是由多台服务器组成的服务器集群。如图11所示,直播服务器1110可以利用上述任一实施例所提供的方法来将有声读物的文本数据合成为对应的音频数据,然后将合成的音频数据发送至直播间内的各个观众端1120。In some embodiments, the audio synthesis method provided by this application can be applied in live broadcasting scenarios. For example, when a virtual anchor performs a live broadcast of an audio novel, since the voice of the virtual anchor is synthesized using TTS technology, when performing a live broadcast of an audio novel , if the synthesized speech is expressive and can show emotion and speaking style, it will be able to attract more listeners. As shown in FIG. 11 , an audio synthesis method provided by this application can be executed by a live server. Wherein, the live broadcast server may be a single server, or may be a server cluster composed of multiple servers. As shown in Figure 11, the live broadcast server 1110 can use the method provided by any of the above-mentioned embodiments to synthesize the text data of the audiobook into corresponding audio data, and then send the synthesized audio data to each audience terminal 1120 in the live broadcast room .
本申请提供的一种音频合成方法,除了可以应用在直播场景外,还可以应用于智能手机、语音助手、智能导航、电子书等产品中,本申请在此不做限制。The audio synthesis method provided by this application can be applied to smart phones, voice assistants, smart navigation, e-books and other products in addition to live broadcast scenarios, and this application does not limit it here.
此外,本申请还提供了一种音频合成方法,通过如图12所示的音频合成系统实现。用户可以输入多个连续的句子,例如“小明去哪了?小明去深圳了。深圳是一个美丽的城市”共三个句子。针对每一个句子,均可以利用如图12所示的音频合成系统合成相应的音频数据。当前每一个待合成的句子为目标句子。以目标句子为“小明去深圳了。”为例,目标句子“小明去深圳了。”可以输入声学特征提取模块1210以提取目标句子的声学特征。而包括目标句子的文本数据“小明去哪了?小明去深圳了。深圳是一个美丽的城市”可以输入XLNET语言模型1220和层级编码器1230以提取目标句子的说话风格特征。In addition, the present application also provides an audio synthesis method, which is realized by an audio synthesis system as shown in FIG. 12 . The user can input multiple consecutive sentences, for example, "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" with three sentences in total. For each sentence, the corresponding audio data can be synthesized by using the audio synthesis system as shown in FIG. 12 . Each current sentence to be synthesized is the target sentence. Taking the target sentence as "Xiao Ming has gone to Shenzhen." as an example, the target sentence "Xiao Ming has gone to Shenzhen." can be input into the acoustic feature extraction module 1210 to extract the acoustic features of the target sentence. The text data including the target sentence "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" can be input into the XLNET language model 1220 and the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
其中,声学特征提取模块1210包括文本转音素模块1211、音素嵌入模块1212以及音素编码模块1213。目标句子经过文本转音素模块1211可以提取出音素序列。输出的音素序列经过音素嵌入模块1212可以提取出音素级特征。输出的音素级特征经过音素编码器1213可以提取出目标句子的声学特征。Wherein, the acoustic feature extraction module 1210 includes a text-to-phoneme module 1211 , a phoneme embedding module 1212 and a phoneme encoding module 1213 . The phoneme sequence can be extracted from the target sentence through the text-to-phoneme module 1211 . The phoneme-level features can be extracted from the output phoneme sequence through the phoneme embedding module 1212 . The output phoneme-level features can be used to extract the acoustic features of the target sentence through the phoneme encoder 1213 .
文本数据经过XLNET语言模型1220后可以提取出文本数据的语义特征。随后输入层级编码器1230可以提取目标句子的说话风格。其中,层级编码器1230包括词间网络1231和句子间网络1232。文本数据的语义特征输入词间网络1231后可以提取出文本数据中每个句子的句子特征。这些句子特征输入句子间网络后可以提取出文本数据的段落特征,并根据段落特征提取出目标句子的说话风格特征。After the text data passes through the XLNET language model 1220, the semantic features of the text data can be extracted. The input level encoder 1230 may then extract the speaking style of the target sentence. Wherein, the hierarchical encoder 1230 includes an inter-word network 1231 and an inter-sentence network 1232 . After the semantic features of the text data are input into the inter-word network 1231, the sentence features of each sentence in the text data can be extracted. After these sentence features are input into the inter-sentence network, the paragraph features of the text data can be extracted, and the speaking style features of the target sentence can be extracted according to the paragraph features.
随后,目标句子的声学特征与说话风格特征在进行混合处理后可以输入韵律预测器1240。韵律预测器1240包括三个预测器,分别用于预测目标句子的音高、音强和发音时长。音高预测器和音强预测器输出的结果添加到混合处理后的声学特征中,并且基于时长预测器输出的结果,经过长度调节器(Length Regulator,LR)来调节声学特征的长度,以使输出的声学特征为携带韵律信息(音高、音强、时长)的帧级别的声学特征。韵律预测器1240输出的携带韵律信息的声学特征经过解码器1250可以转换为80维的梅尔频谱,最后经过声码器1260可以合成目标句子“小明去深圳了。”对应的音频数据。Subsequently, the acoustic features and speaking style features of the target sentence can be input into the prosody predictor 1240 after being mixed. The prosody predictor 1240 includes three predictors, which are respectively used to predict the pitch, sound intensity and pronunciation duration of the target sentence. The output results of the pitch predictor and the sound intensity predictor are added to the acoustic feature after mixing processing, and based on the output result of the duration predictor, the length of the acoustic feature is adjusted through the length regulator (Length Regulator, LR), so that the output The acoustic features of are frame-level acoustic features carrying prosody information (pitch, sound intensity, duration). The acoustic features carrying prosodic information output by the prosody predictor 1240 can be converted into an 80-dimensional Mel spectrum through the decoder 1250, and finally the audio data corresponding to the target sentence "Xiao Ming has gone to Shenzhen." can be synthesized through the vocoder 1260.
上述实施例的具体实现方式参见上文实施例,本申请在此不再赘述。For the specific implementation manners of the above embodiments, refer to the above embodiments, and the present application will not repeat them here.
如此，通过上述方法，音频合成系统会根据上下文信息确定目标句子最合理的说话风格，如在上述例子中，目标句子“小明去深圳了。”所合成出的音频，可以在“深圳”二字上进行强调，拖长发音等。In this way, with the above method the audio synthesis system determines the most reasonable speaking style for the target sentence according to the context information. In the above example, the audio synthesized for the target sentence “小明去深圳了。” may emphasize the word “深圳” (Shenzhen), for example by prolonging its pronunciation.
利用上述方法在提取说话风格特征时,不仅考虑了目标句子的语义信息,还考虑了上下文句子对目标句子的语义信息的影响,从而能捕捉到不同上下文带来的语音变化。When using the above method to extract the speaking style features, not only the semantic information of the target sentence is considered, but also the influence of the context sentence on the semantic information of the target sentence is considered, so that the speech changes brought about by different contexts can be captured.
此外，还采用了层级编码器对上下文语义进行分析，从词间关系和句间关系这两个层面上综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。层级编码器可以从上下文中提取出更多的信息并且有效提升了长距离依赖的建模能力，从而帮助更好的建模说话风格特征。In addition, a hierarchical encoder is used to analyze the context semantics, and the influence of the context structure on a sentence's speaking style is considered at both the inter-word and inter-sentence levels. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. The hierarchical encoder can extract more information from the context and effectively improves the modeling of long-distance dependencies, which helps to better model the speaking style feature.
如此，本申请有效地提升了模型从层级上下文信息中建模说话风格特征的能力，合成的音频的说话风格会受到当前句子和上下文的影响，使得合成出的音频有更优的表现力和自然度，提高了合成音频在表达效果上的丰富性，且更加接近真实人类的语音。In this way, the present application effectively improves the model's ability to learn speaking style features from hierarchical context information. The speaking style of the synthesized audio is influenced by both the current sentence and its context, so the synthesized audio is more expressive and more natural, the expressive effect of the synthesized audio is richer, and the result is closer to real human speech.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了一种计算机程序产品,包括计算机程序,计算机程序被处理器执行时可用于执行上述任意实施例所述的一种音频合成方法。Based on the audio synthesis method described in any of the above embodiments, the present application also provides a computer program product, including a computer program, which can be used to perform the audio synthesis described in any of the above embodiments when the computer program is executed by a processor method.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了如图13所示 的一种电子设备的结构示意图。如图13,在硬件层面,该电子设备包括处理器、内部总线、网络接口、内存以及非易失性存储器,当然还可能包括其他业务所需要的硬件。处理器从非易失性存储器中读取对应的计算机程序到内存中然后运行,处理器被配置为:Based on the audio synthesis method described in any of the above embodiments, the present application also provides a schematic structural diagram of an electronic device as shown in FIG. 13 . As shown in Figure 13, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it. The processor is configured as:
获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了一种计算机存储介质,存储介质存储有计算机程序,计算机程序被处理器执行时可用于执行上述任意实施例所述的一种音频合成方法。Based on the audio synthesis method described in any of the above embodiments, the present application also provides a computer storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, it can be used to perform a method described in any of the above embodiments. A method of audio synthesis.
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。The foregoing describes specific embodiments of the present application. Other implementations are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible or may be advantageous in certain embodiments.
本领域技术人员在考虑说明书及实践这里申请的发明后，将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未申请的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本申请的真正范围和精神由下面的权利要求指出。Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses or adaptations of the application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application being indicated by the following claims.

Claims (14)

  1. 一种音频合成方法,其特征在于,所述方法包括:A method for audio synthesis, characterized in that the method comprises:
    获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
    获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
    获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
  2. 根据权利要求1所述的方法,其特征在于,The method according to claim 1, characterized in that,
    所述句子中各个词语对说话风格的贡献,通过基于注意力机制的词间网络获取;The contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism;
    所述文本数据的各个句子对说话风格的贡献,通过基于注意力机制的句子间网络获取。The contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism.
  3. 根据权利要求1所述的方法,其特征在于,所述文本数据的段落特征,还基于各个句子在所述文本数据中的位置信息进行提取。The method according to claim 1, wherein the paragraph features of the text data are also extracted based on position information of each sentence in the text data.
  4. 根据权利要求1所述的方法,其特征在于,所述基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据,包括:The method according to claim 1, wherein said target sentence is synthesized into audio data based on the acoustic features of said target sentence and the speaking style features of said target sentence, comprising:
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,预测所述目标句子的携带韵律信息的声学特征;所述韵律信息包括音高信息、音强信息或发音时长中的一种或多种;Based on the acoustic features of the target sentence and the speaking style features of the target sentence, predict the acoustic features of the target sentence carrying prosody information; the prosody information includes one of pitch information, sound intensity information or pronunciation duration. or more;
    将所述携带韵律信息的声学特征转换为所述音频数据。converting the acoustic features carrying prosody information into the audio data.
  5. 根据权利要求1所述的方法,其特征在于,所述方法应用于音频合成系统,所述音频合成系统包括:The method according to claim 1, wherein the method is applied to an audio synthesis system, and the audio synthesis system comprises:
    用于提取所述目标句子的声学特征的声学特征提取模块;An acoustic feature extraction module for extracting the acoustic features of the target sentence;
    用于提取所述目标句子的说话风格特征的说话风格特征提取模块;A speaking style feature extraction module for extracting the speaking style features of the target sentence;
    用于合成所述音频数据的合成模块。A synthesis module for synthesizing the audio data.
  6. 根据权利要求5所述的方法,其特征在于,说话风格特征提取模块包括:The method according to claim 5, wherein the speaking style feature extraction module comprises:
    用于提取所述语义特征的语言模型;A language model for extracting the semantic features;
    用于提取所述句子特征、所述段落特征以及所述说话风格特征的层级编码器。A hierarchical encoder for extracting the sentence feature, the paragraph feature and the speaking style feature.
  7. 根据权利要求6所述的方法,其特征在于,所述层级编码器采用有监督学习进 行训练,训练数据包括标注有真实说话风格特征的语义特征。The method according to claim 6, wherein the hierarchical encoder is trained using supervised learning, and the training data includes semantic features marked with real speaking style features.
  8. 根据权利要求7所述的方法，其特征在于，所述层级编码器通过知识蒸馏机制进行有监督学习，所述层级编码器为所述蒸馏机制中的学生模型，所述蒸馏机制中的教师模型采用无监督学习从真实音频数据中提取出真实说话风格特征。The method according to claim 7, wherein the hierarchical encoder performs supervised learning through a knowledge distillation mechanism, the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism extracts the real speaking style features from real audio data by unsupervised learning.
  9. 根据权利要求6所述的方法,其特征在于,所述层级编码器包括:The method according to claim 6, wherein the hierarchical encoder comprises:
    用于获取每个句子中各个词语对说话风格的贡献,并基于所述各个词语对说话风格的贡献,从所述句子的语义特征提取所述句子的句子特征的词间网络;For obtaining the contribution of each word in each sentence to the speaking style, and based on the contribution of each word to the speaking style, extracting the inter-word network of the sentence feature of the sentence from the semantic feature of the sentence;
    用于获取各个句子对说话风格的贡献，基于所述各个句子对说话风格的贡献，从各个句子的句子特征提取段落特征，并基于所述段落特征，预测所述目标句子的说话风格特征的句子间网络。an inter-sentence network configured to obtain the contribution of each sentence to the speaking style, extract paragraph features from the sentence features of the sentences based on the contribution of each sentence to the speaking style, and predict the speaking style feature of the target sentence based on the paragraph features.
  10. 根据权利要求5所述的方法,其特征在于,所述合成模块包括:The method according to claim 5, wherein the synthesis module comprises:
    用于基于所述目标句子的声学特征以及所述目标句子的说话风格特征,预测所述目标句子的携带韵律信息的声学特征的韵律预测器;A prosody predictor for predicting acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence;
    用于将所述携带韵律信息的声学特征转换为所述音频数据的转换器。A converter for converting the acoustic features carrying prosody information into the audio data.
  11. 根据权利要求1所述的方法,其特征在于,所述方法由直播服务器执行,所述文本数据为有声读物的文本数据,所述方法还包括:The method according to claim 1, wherein the method is executed by a live broadcast server, the text data is text data of an audiobook, and the method further comprises:
    将所述音频数据发送给观众端。Send the audio data to the audience.
  12. 一种计算机程序产品,包括计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1-11任一所述方法的步骤。A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1-11 are implemented.
  13. 一种电子设备,其特征在于,所述电子设备包括:An electronic device, characterized in that the electronic device comprises:
    处理器;processor;
    用于存储处理器可执行指令的存储器;memory for storing processor-executable instructions;
    其中,所述处理器被配置为:Wherein, the processor is configured as:
    获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
    获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
    获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
  14. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述程序被处理器执行时实现权利要求1-11任一项所述方法的步骤。A computer-readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 1-11 are realized.
PCT/CN2021/137237 2021-12-10 2021-12-10 Audio synthesis method, electronic device, program product and storage medium WO2023102929A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137237 WO2023102929A1 (en) 2021-12-10 2021-12-10 Audio synthesis method, electronic device, program product and storage medium


Publications (1)

Publication Number    Publication Date
WO2023102929A1 (en)    2023-06-15

Family

ID=86729552


Country Status (1)

Country Link
WO (1) WO2023102929A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party

Publication Number    Priority Date    Publication Date    Assignee    Title
CN102385858A (en) *    2010-08-31    2012-03-21    International Business Machines Corporation    Emotional voice synthesis method and system
US20160329043A1 (en) *    2014-01-21    2016-11-10    LG Electronics Inc.    Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
US20210151029A1 (en) *    2019-11-15    2021-05-20    Electronic Arts Inc.    Generating Expressive Speech Audio From Text Data



Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21966847

Country of ref document: EP

Kind code of ref document: A1