WO2023102929A1 - Audio synthesis method, electronic device, program product and storage medium - Google Patents

Audio synthesis method, electronic device, program product and storage medium

Info

Publication number
WO2023102929A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
features
speaking style
feature
target sentence
Prior art date
Application number
PCT/CN2021/137237
Other languages
French (fr)
Chinese (zh)
Inventor
吴志勇
康世胤
雷舜
周逸轩
陈礼扬
Original Assignee
清华大学深圳国际研究生院 (Tsinghua Shenzhen International Graduate School)
广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学深圳国际研究生院 (Tsinghua Shenzhen International Graduate School) and 广州虎牙科技有限公司 (Guangzhou Huya Technology Co., Ltd.)
Priority to PCT/CN2021/137237
Publication of WO2023102929A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems

Definitions

  • the present application relates to the technical field of audio processing, in particular to an audio synthesis method, an electronic device, a program product and a storage medium.
  • Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio.
  • TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products.
  • the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid and lifelike to the listener as natural speech.
  • the naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio.
  • Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
  • the application provides an audio synthesis method, an electronic device, a program product and a storage medium, which can further improve the expressiveness of the synthesized audio.
  • an audio synthesis method comprising:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • the contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism; the contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism;
  • the paragraph features of the text data are also extracted based on the position information of each sentence in the text data.
  • the synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence includes:
  • the prosody information includes one or more of pitch information, sound intensity information, or pronunciation duration;
  • the method is applied to an audio synthesis system, the audio synthesis system comprising:
  • An acoustic feature extraction module for extracting the acoustic features of the target sentence
  • a speaking style feature extraction module for extracting the speaking style features of the target sentence
  • a synthesis module for synthesizing the audio data.
  • the speaking style feature extraction module includes:
  • a hierarchical encoder for extracting the sentence feature, the paragraph feature and the speaking style feature.
  • the hierarchical encoder is trained using supervised learning, and the training data includes semantic features annotated with real speaking style features.
  • the hierarchical encoder performs supervised learning through a knowledge distillation mechanism; the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism uses unsupervised learning to extract real speaking style features from real audio data.
  • the hierarchical encoder includes:
  • the synthesis module includes:
  • a converter for converting the acoustic features carrying prosody information into the audio data.
  • the method is performed by a live server, the text data is text data of an audiobook, and the method further includes:
  • a computer program product including a computer program, and when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • an electronic device includes:
  • memory for storing processor-executable instructions
  • the processor is configured as:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • the technical solutions provided by the embodiments of the present application may include the following beneficial effects: in the process of audio synthesis, speaking style features are extracted for each sentence; the speaking style features are determined based on the paragraph features of the text data in which the sentence is located; the paragraph features are extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style.
  • in this way, the influence of the context structure on the speaking style of the sentence is comprehensively considered at the two levels of inter-word relationships and inter-sentence relationships, and the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style.
  • the audio synthesized using the acoustic features of the sentence together with the above speaking style features is more expressive, which improves the richness of the expressive effects of the synthesized audio.
  • Fig. 1 is a flowchart of an audio synthesis method according to an embodiment of the present application.
  • Fig. 2 is a schematic diagram of an audio synthesis system according to an embodiment of the present application.
  • Fig. 3 is a flowchart of an audio synthesis method according to another embodiment of the present application.
  • Fig. 4 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 5 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 6 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 7 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 8 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 9 is a flowchart of an audio synthesis method according to another embodiment of the present application.
  • Fig. 10 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 11 shows an application scenario of an audio synthesis method according to an embodiment of the present application.
  • Fig. 12 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.
  • Fig. 13 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
  • first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word “if” as used herein may be interpreted as “at” or “when” or “in response to a determination.”
  • Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio.
  • TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products.
  • the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid and lifelike to the listener as natural speech.
  • the naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio.
  • Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
  • the synthesized audio has a single speaking style and a flat tone, which cannot reflect the emotion and the meaning contained in the speech content.
  • Synthetic audio is less expressive, resulting in a gap with real speech.
  • the inventors found that the overall semantics and emotion of a sentence are affected by the context.
  • the expressiveness of a sentence is determined by the context in which the sentence is located: it is related not only to the semantic information of the sentence itself, but also to the semantic information of the other sentences in the context of the text paragraph in which the sentence appears. In other words, the expressiveness of a sentence is related to where the sentence is located in the text paragraph; if the same sentence appears at different positions in a text paragraph, its expressiveness will also differ.
  • Step 110: Acquire a target sentence in text data, the text data including at least two consecutive sentences;
  • Step 120: Acquire the acoustic features of the target sentence;
  • Step 130: Acquire the speaking style features of the target sentence;
  • the speaking style features are determined based on the paragraph features of the text data, and the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style;
  • the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • Step 140: Synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence (an illustrative sketch of this overall flow follows the steps).
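  • As an illustration only, the four steps above can be read as a simple pipeline. The minimal Python sketch below assumes the three modules described later (acoustic feature extraction, speaking style feature extraction, synthesis) are available as callables; every name and signature here is hypothetical and is not part of the application.

```python
from typing import Any, Callable, Sequence

def synthesize_sentence(
    target_sentence: str,
    text_data: Sequence[str],                            # at least two consecutive sentences
    extract_acoustic: Callable[[str], Any],              # acoustic feature extraction module
    extract_style: Callable[[Sequence[str], str], Any],  # context-aware style extraction module
    synthesize: Callable[[Any, Any], bytes],             # synthesis module
) -> bytes:
    """Steps 110-140: acquire features for the target sentence and synthesize audio."""
    assert len(text_data) >= 2 and target_sentence in text_data   # step 110
    acoustic_features = extract_acoustic(target_sentence)         # step 120
    style_features = extract_style(text_data, target_sentence)    # step 130
    return synthesize(acoustic_features, style_features)          # step 140
```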
  • the method shown in FIG. 1 can be used for audio synthesis for each sentence.
  • the target sentence may be a sentence to be synthesized currently.
  • the acoustic feature of the target sentence refers to the feature that can reflect the pronunciation characteristics and acoustic performance of the target sentence, for example, it can be a phoneme-level acoustic feature, that is, each phoneme is represented by a multi-dimensional vector.
  • the semantic features of each sentence in the text data are first extracted, and then, based on the contribution of each word in the sentence to the speaking style, the sentence features of the sentence are extracted from the semantic features of the sentence. A so-called sentence feature represents a sentence as a multidimensional vector that contains the semantic information of the sentence as a whole.
  • the contribution of each word in a sentence to the speaking style can be obtained through an attention-based inter-word network.
  • the paragraph features are extracted from the sentence features of each sentence.
  • the so-called paragraph features use a multi-dimensional vector to represent a whole text, which contains the semantic information of the text data as a whole.
  • the contribution of each sentence of the text data to the speaking style can be obtained through an attention-based inter-sentence network.
  • the speaking style features are extracted from the paragraph features.
  • the extracted speaking style features combine the relationship between words and sentences in each sentence, and comprehensively consider the influence of the context structure on the speaking style of the sentence.
  • the extracted speaking style feature of the target sentence not only carries the contribution information of each word in the target sentence to the speaking style of the sentence, but also carries the contribution information of other sentences in the context to the speaking style. Therefore, the audio synthesized using the acoustic features of the sentence together with the speaking style features is more expressive, and the richness of the expressive effect of the synthesized audio is improved.
  • the audio synthesis system 200 includes an acoustic feature extraction module 210 , a speaking style feature extraction module 220 and a synthesis module 230 .
  • the acoustic feature extraction module 210 is used to extract the acoustic feature of the target sentence from the input target sentence
  • the speaking style feature extraction module 220 is used to extract the speaking style feature of the target sentence from the input text data
  • the synthesis module 230 is used to synthesize the audio data of the target sentence based on the acoustic features and the speaking style features of the target sentence.
  • the acoustic features of the target sentence may be phoneme-level acoustic features.
  • the acquisition of the acoustic features in step 120 above may include the steps shown in Figure 3:
  • Step 121: Obtain the phoneme sequence of the target sentence;
  • Step 122: Obtain phoneme-level features of the target sentence based on the phoneme sequence;
  • Step 123: Extract the acoustic features of the target sentence from the phoneme-level features based on a multi-head attention mechanism.
  • the acoustic feature extraction module 210 may include a text-to-phoneme (Grapheme to Phoneme, G2P) submodule 211, a phoneme embedding (Phoneme Embedding) submodule 212 and a phoneme encoding (Phoneme Encoder) submodule 213.
  • the text-to-phoneme sub-module 211 can convert the input target sentence into a phoneme sequence that can reflect its pronunciation characteristics according to the conversion logic designed by linguistic knowledge.
  • for example, the text-to-phoneme submodule 211 can convert the target sentence into a phoneme sequence beginning "i-n1 ...", wherein "i", "n1", and "zh" in the phoneme sequence are phoneme symbols, and each phoneme symbol represents a pronunciation.
  • the phoneme sequence of the target sentence is not limited to the above expression forms, and may also be a phoneme sequence in other forms that can reflect the pronunciation characteristics of the target sentence, which is not limited in this application. It can be assumed that the length of the phoneme sequence output by the text-to-phoneme sub-module 211 is N.
  • the phoneme embedding sub-module 212 can map each phoneme into a multi-dimensional vector. For example, each phoneme can be mapped to a 256-dimensional floating-point vector. In this way, the phoneme sequence with a length of N can be mapped into an N*256 matrix after passing through the phoneme embedding module 212 , that is, phoneme-level features.
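  • As a concrete but purely illustrative sketch of this mapping (assuming PyTorch and an assumed phoneme vocabulary size of 100), each phoneme id in a length-N sequence is embedded into a 256-dimensional vector, giving an N*256 matrix:

```python
import torch
import torch.nn as nn

# Minimal sketch of the phoneme embedding sub-module 212.
# The phoneme vocabulary size (100) is an illustrative assumption.
phoneme_embedding = nn.Embedding(num_embeddings=100, embedding_dim=256)

phoneme_ids = torch.tensor([[3, 17, 42, 8]])        # (batch=1, N=4) phoneme indices
phoneme_level_features = phoneme_embedding(phoneme_ids)
print(phoneme_level_features.shape)                 # torch.Size([1, 4, 256])
```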
  • the phoneme-level features output by the phoneme embedding sub-module 212 are then input into the phoneme encoding sub-module 213 .
  • the phoneme coding sub-module 213 includes a position coding model and several Transformer models. As shown in FIG. 5 , the phoneme coding sub-module 213 is composed of a position coding model and four Transformer models.
  • the position encoding model can add artificially designed position information to the phoneme-level features, so that the subsequent Transformer model can also take the position of the phoneme into account when calculating.
  • the calculation method of the position code can be implemented with reference to related technologies, and this application will not discuss it here.
  • the Transformer model consists of a multi-head self-attention mechanism with residual connections and layer normalization, and a 1D convolutional layer with residual connections and layer normalization.
  • the Transformer model can extract the acoustic features of the target sentence according to the relationship between phonemes and the information of each phoneme before and after fusion.
  • the acoustic features may be phoneme-level acoustic features, which carry information about the influence of phonemes on pronunciation characteristics.
  • the phoneme-level acoustic features with the same size of N*256 can be extracted. That is, the size of the sequence output by the phoneme coding sub-module 213 can be consistent with the size of the original sequence.
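  • A minimal PyTorch sketch of such a phoneme encoder is shown below. The hidden size of 256, the four blocks, and the preservation of the N*256 shape follow the description; the head count, kernel size, and sinusoidal form of the position encoding are assumptions, not details given in the application.

```python
import math
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One Transformer block as described: multi-head self-attention and a 1-D
    convolution, each with a residual connection and layer normalization."""
    def __init__(self, dim=256, heads=2, kernel_size=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                         # x: (batch, N, 256)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)              # residual + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)           # residual + layer norm

class PhonemeEncoder(nn.Module):
    """Sketch of sub-module 213: position encoding followed by four Transformer
    blocks; the output keeps the N*256 shape of the input."""
    def __init__(self, dim=256, n_blocks=4, max_len=1000):
        super().__init__()
        pe = torch.zeros(max_len, dim)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.blocks = nn.ModuleList(FFTBlock(dim) for _ in range(n_blocks))

    def forward(self, phoneme_level_features):    # (batch, N, 256)
        x = phoneme_level_features + self.pe[: phoneme_level_features.size(1)]
        for block in self.blocks:
            x = block(x)
        return x   # phoneme-level acoustic features, still (batch, N, 256)
```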
  • the process of extracting the speaking style features of the target sentence in step 130 may be performed by the speaking style feature extraction module 220 .
  • the speaking style feature extraction module 220 includes a language model 221 and a hierarchical encoder 222 .
  • the language model 221 is used to extract the semantic features of each sentence in the text data;
  • the hierarchical encoder 222 is used to extract sentence features, paragraph features and speaking style features.
  • the language model 221 may be an XLNET language model (Generalized Autoregressive Pretraining for Language Understanding).
  • the XLNET language model is pretrained on a text corpus of billions of words. A large amount of training text enables the XLNET model to better understand the semantic information of the text it processes.
  • the language model 221 may be a BERT (Bidirectional Encoder Representations from Transformers) language model.
  • the BERT language model can be pre-trained with a large amount of Chinese text data to extract effective semantic features.
  • the language model 221 is not limited to the above two models, and those skilled in the art may select other models capable of extracting semantic features from text data as the language model 221 according to actual needs.
  • the text data input into the language model 221 includes a target sentence and several sentences before and after the target sentence. For example, including the target sentence and L sentences before and after it, 2L+1 sentences in total. Wherein, L can be a positive integer.
  • the semantic features of the text data can be extracted from the text data through the language model 221 .
  • the semantic features of text data can be character-level semantic features, or word-level semantic features.
  • the so-called character-level semantic features are represented by a multidimensional vector for each character.
  • the so-called word-level semantic features are represented by a multidimensional vector for each word.
  • the extracted semantic features may be character-level semantic features or word-level semantic features.
  • if the language model 221 is an XLNET language model, the extracted semantic features may be word-level semantic features. If the language model 221 is another model that can extract semantic features from text data, the text data can first be segmented into words, and word-level semantic features can then be extracted from the segmented text.
  • take the case where the language model 221 is an XLNET language model as an example.
  • the XLNET language model can segment sentences into words according to the knowledge learned by the model. If the input text data has M words in total, and each word is output as a 768-dimensional high-dimensional representation of its meaning, then M*768 word-level semantic features can be extracted from the input text data after it passes through the XLNET language model.
  • the semantic features output by the language model 221 can be divided into 2L+1 sequences according to the sentence to which each word belongs before being input to the hierarchical encoder 222.
  • Each sequence consists of semantic features corresponding to the words included in each sentence.
  • Each segmented sequence is referred to as the semantic feature of the sentence below.
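  • For illustration, the extraction of word-level semantic features and their grouping by sentence might look as follows with the HuggingFace transformers library; the checkpoint name "hfl/chinese-xlnet-base" is an assumption (any pretrained encoder with a 768-dimensional hidden state plays the same role), and this sketch encodes each sentence separately instead of splitting one paragraph-level output.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint; the application only specifies a pretrained XLNET language model.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# 2L+1 = 3 consecutive sentences (L = 1); the middle one is the target sentence.
sentences = ["Where did Xiao Ming go?",
             "Xiao Ming went to Shenzhen.",
             "Shenzhen is a beautiful city."]

per_sentence_semantics = []
with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors="pt", add_special_tokens=False)
        hidden = model(**inputs).last_hidden_state        # (1, tokens_in_sentence, 768)
        per_sentence_semantics.append(hidden.squeeze(0))  # semantic features of one sentence

for features in per_sentence_semantics:
    print(features.shape)   # e.g. torch.Size([tokens_in_sentence, 768])
```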
  • the sentence features of each sentence, the paragraph features of the text data, and the speaking style features of the target sentence can be extracted.
  • the hierarchical encoder 222 includes two layers of attention networks. As shown in FIG. 7, the hierarchical encoder 222 includes an inter-word network 2221 and an inter-sentence network 2222. These two attention networks have similar structures, each mainly comprising a bidirectional Gated Recurrent Unit (GRU) and a scaled dot-product attention mechanism.
  • the inter-word network 2221 is used to obtain the contribution of each word in each sentence to the speaking style, and to extract the sentence features from the semantic features of the sentence (i.e., each sequence) based on the contribution of each word to the speaking style.
  • the contribution of each word to the speaking style can be understood as the degree to which the word affects the speaking style, which can be expressed as the meaning of each word in the same sentence and the relationship between words.
  • the bidirectional GRU will re-extract the semantic features corresponding to each word in consideration of time order and context information.
  • the scaled dot product attention mechanism can be used to calculate the weight corresponding to each word, and the sentence features corresponding to the entire sentence can be summarized according to the weight.
  • the key (Key), value (Value) and query (Query) vectors can be used to calculate the weight corresponding to each word and summarize the sentence features according to the weight.
  • the key and value are vectors obtained by linear transformation of the semantic features corresponding to each word
  • the query vector is a vector trained from the data set.
  • the word-level semantic features of size M*768 can be extracted and divided into 2L+1 sequences; each sequence represents the semantic features of one sentence, giving 2L+1 sets of sentence semantic features.
  • sentence features of 2L+1 sentences are extracted in total, and the sentence features of each sentence are a 256-dimensional vector.
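  • A minimal sketch of such an inter-word network, assuming PyTorch and the 768-to-256 dimensions mentioned above; the exact parameterization of the scaled dot-product attention (learned query, keys and values from linear transforms) follows the text, while other details are assumptions.

```python
import math
import torch
import torch.nn as nn

class InterWordNetwork(nn.Module):
    """Sketch of inter-word network 2221: a bidirectional GRU re-extracts each
    word's features in context; scaled dot-product attention with a learned query
    weights the words by their contribution to the speaking style and summarizes
    the sentence into a single 256-dimensional sentence feature."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, dim // 2, bidirectional=True, batch_first=True)
        self.key = nn.Linear(dim, dim)      # keys from a linear transform of word features
        self.value = nn.Linear(dim, dim)    # values from a linear transform of word features
        self.query = nn.Parameter(torch.randn(dim))   # query vector learned from the data set

    def forward(self, word_semantics):                   # (1, n_words, 768)
        h, _ = self.gru(word_semantics)                  # (1, n_words, 256)
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))   # (1, n_words): word contributions
        weights = torch.softmax(scores, dim=-1)
        sentence_feature = (weights.unsqueeze(-1) * v).sum(dim=1)   # (1, 256)
        return sentence_feature, weights
```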
  • the sentence features of each sentence can be mixed and input to the inter-sentence network 2222 .
  • the inter-word network 2221 extracts a total of 2L+1 256-dimensional sentence features. These sentence features can be concatenated into a (2L+1)*256 mixed feature, and input into the inter-sentence network 2222 .
  • the inter-sentence network 2222 can obtain the contribution of each sentence to the speaking style, extract paragraph features from the sentence features of each sentence based on the contribution of each sentence to the speaking style, and predict the speaking style features of the target sentence based on the paragraph features.
  • the contribution of each sentence to the speaking style can be understood as the extent to which the sentence affects the speaking style, which can be expressed in terms of each sentence's representation and the inter-sentence relationships in the text data.
  • the paragraph features of the text data may be extracted based on the position information of each sentence in the text data, in addition to the contribution of each sentence to the speaking style.
  • the input concatenated mixed features go through a bidirectional GRU to re-extract the features of each sentence in combination with the context.
  • the relative position information between sentences is then added to the re-extracted features of each sentence through position encoding.
  • scaled dot-product attention is adopted to obtain paragraph features.
  • a linear layer is used to predict the speaking style features of the target sentence based on the paragraph features.
  • a (2L+1)*256 mixed feature can predict the 256-dimensional paragraph feature of the text data through the inter-sentence network 2222, and finally predict a 256-dimensional speaking style feature.
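  • A matching sketch of the inter-sentence network, again assuming PyTorch; the sinusoidal form of the position encoding and the single linear projection to the style feature are assumptions consistent with, but not dictated by, the description.

```python
import math
import torch
import torch.nn as nn

class InterSentenceNetwork(nn.Module):
    """Sketch of inter-sentence network 2222: a bidirectional GRU re-extracts each
    sentence feature in context, position encoding adds relative sentence positions,
    scaled dot-product attention pools the 2L+1 sentences into a 256-dim paragraph
    feature, and a linear layer predicts the 256-dim speaking style feature."""
    def __init__(self, dim=256, max_sentences=64):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
        pe = torch.zeros(max_sentences, dim)
        pos = torch.arange(max_sentences).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))
        self.to_style = nn.Linear(dim, dim)        # predicts the speaking style feature

    def forward(self, sentence_features):          # (1, 2L+1, 256) mixed feature
        h, _ = self.gru(sentence_features)
        h = h + self.pe[: h.size(1)]               # add relative position information
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))
        weights = torch.softmax(scores, dim=-1)    # contribution of each sentence
        paragraph_feature = (weights.unsqueeze(-1) * v).sum(dim=1)   # (1, 256)
        speaking_style = self.to_style(paragraph_feature)            # (1, 256)
        return speaking_style, paragraph_feature
```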
  • the acoustic features of the target sentence and the speaking style features of the target sentence can be extracted.
  • the inventors found that, in the case of limited audio synthesis training data, it is very difficult for the audio synthesis system 200 to implicitly learn the mapping relationship between the sentence semantics and the speaking style of the synthesized audio.
  • the hierarchical encoder 222 can be trained using supervised learning, and the training data can include semantic features annotated with real speaking style features. Since the hierarchical encoder 222 is trained with supervision, it can learn the speaking style features explicitly; moving from implicit to explicit learning can greatly improve the model's ability to predict the speaking style.
  • the hierarchical encoder can perform supervised learning through a knowledge distillation (Knowledge Distillation) mechanism.
  • the knowledge distillation mechanism includes a teacher network and a student network; a soft target produced by the teacher network is introduced as part of the training target of the student network to guide the training of the student network, thereby achieving knowledge transfer.
  • the real speaking style features may include output features of the teacher model.
  • the teacher network can be a reference encoder. As shown in FIG. 8, the reference encoder includes several layers of two-dimensional convolutional neural networks, such as 6 layers, a GRU network and a fully connected network.
  • the reference encoder is trained with unsupervised learning on audio features of real audio corresponding to text.
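  • A minimal sketch of such a reference encoder, assuming PyTorch, an 80-bin mel-spectrogram input, and illustrative channel counts and strides (only the six 2-D convolution layers, the GRU, and the fully connected layer come from the description).

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Sketch of the teacher reference encoder of Fig. 8: 2-D convolutions over the
    mel-spectrogram, a GRU over the resulting frame sequence, and a fully connected
    layer producing a 256-dim 'real' speaking style feature."""
    def __init__(self, n_mels=80, style_dim=256, channels=(32, 32, 64, 64, 128, 128)):
        super().__init__()
        convs, in_ch = [], 1
        for out_ch in channels:                          # six 2-D conv layers
            convs += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                      nn.BatchNorm2d(out_ch), nn.ReLU()]
            in_ch = out_ch
        self.convs = nn.Sequential(*convs)
        freq = n_mels
        for _ in channels:
            freq = (freq - 1) // 2 + 1                   # Conv2d(k=3, s=2, p=1) output size
        self.gru = nn.GRU(channels[-1] * freq, style_dim, batch_first=True)
        self.fc = nn.Linear(style_dim, style_dim)

    def forward(self, mel):                              # mel: (batch, frames, 80)
        x = self.convs(mel.unsqueeze(1))                 # (batch, C, frames', freq')
        x = x.permute(0, 2, 1, 3).flatten(2)             # (batch, frames', C*freq')
        _, h = self.gru(x)                               # final GRU state
        return self.fc(h.squeeze(0))                     # (batch, 256) real style feature
```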
  • the audio feature may be one or more of a mel-spectrogram, LPC (linear prediction coefficients), and the like.
  • the reference encoder can extract the corresponding speaking style features from the mel spectrogram in an unsupervised learning manner.
  • the speaking style features output by the reference encoder can be regarded as real speaking style features.
  • the hierarchical encoder 222 is then trained with semantic features annotated with real speaking style features as training data. In this way, the hierarchical encoder 222 can explicitly learn the speaking style features, which reduces the pressure of model training and greatly enhances the modeling effect of the hierarchical encoder 222 on the speaking style features when the amount of training data is insufficient.
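  • Schematically, one distillation training step could look as follows; `reference_encoder` and `hierarchical_encoder` stand for the teacher and student sketched above, and the mean-squared-error objective is an illustrative choice (the application only states that the teacher's output serves as the real speaking style label for supervised training).

```python
import torch
import torch.nn.functional as F

def distillation_step(hierarchical_encoder, reference_encoder,
                      paragraph_semantics, target_mel, optimizer):
    """One supervised training step for the student (hierarchical encoder)."""
    reference_encoder.eval()
    with torch.no_grad():                                # the teacher is frozen here
        real_style = reference_encoder(target_mel)       # "real" style from real audio
    predicted_style = hierarchical_encoder(paragraph_semantics)  # student prediction
    loss = F.mse_loss(predicted_style, real_style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```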
  • the target sentence can be synthesized into audio data.
  • the audio data synthesis process of step 140 above may include the steps shown in Figure 9:
  • Step 141: Predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features and speaking style features of the target sentence;
  • the prosody information includes one or more of pitch information, sound intensity information, or pronunciation duration.
  • Step 142: Convert the acoustic features carrying prosody information into the audio data.
  • the synthesis module 230 may include a prosody predictor 231 and a converter 232 .
  • the prosody predictor 231 is used to predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence; the converter 232 is used to convert the acoustic features carrying prosody information into audio data.
  • the speaking style features may be copied to the length of the acoustic features.
  • the acoustic feature extraction module 210 outputs a phoneme-level acoustic feature with a size of N*256
  • the speaking style feature extraction module 220 outputs a 256-dimensional speaking style feature. Then the speaking style feature can be copied into a feature of length N and added to the phoneme-level acoustic feature of size N*256.
  • the mixed phone-level acoustic features are then input to the prosody predictor 231 .
  • the prosody predictor 231 includes three speech change predictors, and the structures of these speech change predictors are basically the same, including two one-dimensional convolutional layers with layer normalization and one fully connected layer. Three speech change predictors are used to predict the pitch, intensity and duration of the synthesized audio on that phoneme, respectively.
  • the unit of the pronunciation duration is frame.
  • Each predictor predicts a floating-point number for each mixed phone-level acoustic feature as the prediction result.
  • the pitch prediction and the intensity prediction can each be transformed into a 256-dimensional representation through a fully connected layer and added to the mixed phoneme-level acoustic features.
  • the prediction result of the pronunciation duration is rounded to retain an integer, which represents how many frames the pronunciation duration of the phoneme is.
  • the acoustic features of the phoneme are copied according to the pronunciation duration corresponding to each phoneme, and the copied features are spliced together as the frame-level acoustic features, which carry the prosody information of pitch, sound intensity and pronunciation duration.
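  • A minimal sketch of one such speech-change predictor and of the duration-based length regulation, assuming PyTorch; the kernel size and the minimum of one frame per phoneme are assumptions.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Sketch of one speech-change predictor in the prosody predictor 231: two 1-D
    convolutions with layer normalization and a fully connected layer predicting one
    floating-point value (pitch, intensity, or duration) per phoneme."""
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, 1)

    def forward(self, x):                               # x: (batch, N, 256)
        h = self.norm1(torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2))
        h = self.norm2(torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2))
        return self.fc(h).squeeze(-1)                   # (batch, N): one value per phoneme


def length_regulate(phoneme_features, durations):
    """Sketch of the length regulator: repeat each phoneme's 256-dim acoustic feature
    by its rounded predicted duration in frames and splice the copies together as
    frame-level acoustic features."""
    frames = torch.clamp(durations.round().long(), min=1)   # at least one frame per phoneme
    return torch.repeat_interleave(phoneme_features, frames, dim=0)   # (n_frames, 256)
```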
  • the prosody predictor 231 outputs n*256 acoustic features carrying prosody information.
  • the acoustic features carrying prosodic information output by the prosody predictor 231 pass through the converter 232 to synthesize the audio data of the target sentence.
  • the converter 232 includes a decoder and a vocoder.
  • the decoder can convert the acoustic features carrying prosody information into corresponding audio features, such as Mel Spectrum or LPC, etc.
  • the audio data of the target sentence can be output.
  • the vocoder can be a neural network vocoder based on HiFi-GAN; the audio data can be synthesized audio data with a sampling rate of 24 kHz.
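  • The application's converter uses a neural HiFi-GAN vocoder; as a runnable stand-in for illustration only (not the vocoder described here), an 80-bin mel spectrogram at 24 kHz can be inverted with Griffin-Lim via librosa. The STFT parameters below are assumptions.

```python
import numpy as np
import librosa

def mel_to_waveform(mel_80xT: np.ndarray, sr: int = 24000) -> np.ndarray:
    # Griffin-Lim inversion of a (80, T) mel spectrogram; a simple stand-in
    # for a neural vocoder such as HiFi-GAN.
    return librosa.feature.inverse.mel_to_audio(mel_80xT, sr=sr,
                                                n_fft=1024, hop_length=256)

# Example call with a small random mel block, just to show the expected shapes.
audio = mel_to_waveform(np.abs(np.random.randn(80, 100)).astype(np.float32) * 0.01)
print(audio.shape)   # roughly 100 * 256 samples at 24 kHz
```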
  • the speaking style feature is extracted for each sentence; the speaking style feature is determined based on the paragraph feature of the text data where the sentence is located; the paragraph feature is extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style.
  • in the process of speaking style feature extraction, not only the semantic information of the target sentence is considered, but also the influence of the context sentences on the semantic information of the target sentence, so that the speech changes brought about by different contexts can be captured.
  • this application uses a hierarchical encoder to analyze the context semantics and comprehensively considers the influence of the context structure on the speaking style of the sentence at the two levels of inter-word relationships and inter-sentence relationships; the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style.
  • the hierarchical encoder can extract more information from the context and effectively improve the long-distance dependency modeling ability, thus helping to better model speaking style features.
  • the hierarchical encoder adopts a knowledge distillation mechanism: the teacher model extracts speaking style features from the real audio corresponding to the text in an unsupervised manner, which helps the student model (the hierarchical encoder) train better and predict the speaking style of sentences more efficiently.
  • this application effectively improves the ability of the model to model speaking style features from hierarchical context information, and the speaking style of the synthesized audio is affected by both the current sentence and its context, making the synthesized audio more expressive and natural, improving the richness of the synthesized audio, and bringing it closer to real human speech.
  • the audio synthesis method provided by this application can be applied in live broadcasting scenarios. For example, when a virtual anchor live-streams an audio novel, the virtual anchor's voice is synthesized using TTS technology; if the synthesized speech is expressive and can convey emotion and speaking style, it will attract more listeners. As shown in FIG. 11, the audio synthesis method provided by this application can be executed by a live broadcast server. The live broadcast server may be a single server, or may be a server cluster composed of multiple servers.
  • the live broadcast server 1110 can use the method provided by any of the above embodiments to synthesize the text data of the audiobook into corresponding audio data, and then send the synthesized audio data to each audience terminal 1120 in the live broadcast room.
  • the audio synthesis method provided by this application can be applied to smart phones, voice assistants, smart navigation, e-books and other products in addition to live broadcast scenarios, and this application does not limit it here.
  • the present application also provides an audio synthesis method, which is realized by an audio synthesis system as shown in FIG. 12 .
  • the user can input multiple consecutive sentences, for example, "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" with three sentences in total.
  • the corresponding audio data can be synthesized by using the audio synthesis system as shown in FIG. 12 .
  • Each current sentence to be synthesized is the target sentence.
  • the target sentence "Xiao Ming has gone to Shenzhen.” can be input into the acoustic feature extraction module 1210 to extract the acoustic features of the target sentence.
  • the text data including the target sentence "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city” can be input into the XLNET language model 1220 and the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
  • the acoustic feature extraction module 1210 includes a text-to-phoneme module 1211 , a phoneme embedding module 1212 and a phoneme encoding module 1213 .
  • the phoneme sequence can be extracted from the target sentence through the text-to-phoneme module 1211 .
  • the phoneme-level features can be extracted from the output phoneme sequence through the phoneme embedding module 1212 .
  • the output phoneme-level features can be used to extract the acoustic features of the target sentence through the phoneme encoder 1213 .
  • through the XLNET language model 1220, the semantic features of the text data can be extracted.
  • the semantic features are then input into the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
  • the hierarchical encoder 1230 includes an inter-word network 1231 and an inter-sentence network 1232 .
  • through the inter-word network 1231, the sentence features of each sentence in the text data can be extracted.
  • through the inter-sentence network 1232, the paragraph features of the text data can be extracted, and the speaking style features of the target sentence can be extracted according to the paragraph features.
  • the prosody predictor 1240 includes three predictors, which are respectively used to predict the pitch, sound intensity and pronunciation duration of the target sentence.
  • the output results of the pitch predictor and the sound intensity predictor are added to the mixed acoustic features, and the length of the acoustic features is adjusted through the Length Regulator (LR) based on the output of the duration predictor, so that the output acoustic features are frame-level acoustic features carrying prosody information (pitch, sound intensity, and duration).
  • the acoustic features carrying prosodic information output by the prosody predictor 1240 can be converted into an 80-dimensional Mel spectrum through the decoder 1250, and finally the audio data corresponding to the target sentence "Xiao Ming has gone to Shenzhen.” can be synthesized through the vocoder 1260.
  • the audio synthesis system determines the most reasonable speaking style of the target sentence according to the context information. For example, in the above example, given the context of the target sentence "Xiao Ming has gone to Shenzhen.", the synthesized audio may emphasize the word "Shenzhen", prolong its pronunciation, and so on.
  • a hierarchical encoder is used to analyze the context semantics, and the influence of the context structure on the speaking style of the sentence is comprehensively considered from the two levels of inter-word relationship and inter-sentence relationship.
  • the extracted speaking style features of the target sentence not only carry the contribution information of each word in the target sentence to the speaking style of the sentence, but also carry the contribution information of other sentences in the context to the speaking style.
  • the hierarchical encoder can extract more information from the context and effectively improve the long-distance dependency modeling ability, thus helping to better model speaking style features.
  • this application effectively improves the ability of the model to model speaking style features from hierarchical context information, and the speaking style of the synthesized audio is affected by both the current sentence and its context, making the synthesized audio more expressive and natural, improving the richness of the synthesized audio, and bringing it closer to real human speech.
  • the present application also provides a computer program product, including a computer program, which, when executed by a processor, can be used to perform the audio synthesis method described in any of the above embodiments.
  • the present application also provides a schematic structural diagram of an electronic device as shown in FIG. 13 .
  • the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course may also include hardware required by other services.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it.
  • the processor is configured as:
  • the speaking style feature is determined based on the paragraph feature of the text data; the paragraph feature of the text data is extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style; and each sentence feature is extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;
  • the present application also provides a computer storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, it can be used to perform a method described in any of the above embodiments.
  • a method of audio synthesis is also provided.

Abstract

Provided in the present application are an audio synthesis method, an electronic device, a program product and a storage medium. During the audio synthesis process, an acoustic feature and a speaking style feature are extracted for each sentence, wherein the speaking style feature is determined on the basis of a paragraph feature of the text data where the sentence is located; the paragraph feature is extracted from the sentence features of the sentences on the basis of the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained on the basis of the contribution of each word in the sentences to the speaking style. On the basis of the acoustic feature and the speaking style feature of a target sentence, the target sentence is synthesized into audio data. In this way, the extracted speaking style feature of the target sentence carries both the contribution information of each word in the target sentence to the speaking style of the sentence and the contribution information of other sentences in the context to the speaking style. Audio synthesized using the acoustic feature and the speaking style feature of a sentence is more expressive, thereby enriching the expressive effects of the synthesized audio.

Description

Audio synthesis method, electronic device, program product and storage medium

Technical Field

The present application relates to the technical field of audio processing, and in particular to an audio synthesis method, an electronic device, a program product and a storage medium.

Background

Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio. TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products. Among them, the naturalness of synthesized audio is one of the indicators used to measure the effect of audio synthesis; synthesized audio with high naturalness sounds as vivid to the listener as natural speech.

The naturalness of synthesized audio depends largely on its expressiveness. Expressive audio can convey the speaker's emotion and speaking style and has a high degree of naturalness. Improving the richness of the expressive effects of synthesized audio is an important part of TTS technology.
Summary of the Invention

The present application provides an audio synthesis method, an electronic device, a program product and a storage medium, which can improve the expressiveness of synthesized audio.

According to a first aspect of the embodiments of the present application, an audio synthesis method is provided, the method comprising:

acquiring a target sentence in text data, the text data comprising at least two consecutive sentences;

acquiring acoustic features of the target sentence;

acquiring speaking style features of the target sentence, wherein the speaking style features are determined based on paragraph features of the text data, the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, and the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;

synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.

In some examples, the contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism;

the contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism.

In some examples, the paragraph features of the text data are also extracted based on the position information of each sentence in the text data.

In some examples, synthesizing the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence comprises:

predicting, based on the acoustic features of the target sentence and the speaking style features of the target sentence, acoustic features of the target sentence that carry prosody information, wherein the prosody information includes one or more of pitch information, sound intensity information or pronunciation duration;

converting the acoustic features carrying prosody information into the audio data.
In some examples, the method is applied to an audio synthesis system, the audio synthesis system comprising:

an acoustic feature extraction module for extracting the acoustic features of the target sentence;

a speaking style feature extraction module for extracting the speaking style features of the target sentence;

a synthesis module for synthesizing the audio data.

In some examples, the speaking style feature extraction module comprises:

a language model for extracting the semantic features;

a hierarchical encoder for extracting the sentence features, the paragraph features and the speaking style features.

In some examples, the hierarchical encoder is trained using supervised learning, and the training data includes semantic features annotated with real speaking style features.

In some examples, the hierarchical encoder performs supervised learning through a knowledge distillation mechanism, the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism uses unsupervised learning to extract real speaking style features from real audio data.

In some examples, the hierarchical encoder comprises:

an inter-word network for obtaining the contribution of each word in each sentence to the speaking style and, based on the contribution of each word to the speaking style, extracting the sentence features of the sentence from the semantic features of the sentence;

an inter-sentence network for obtaining the contribution of each sentence to the speaking style, extracting paragraph features from the sentence features of each sentence based on the contribution of each sentence to the speaking style, and predicting the speaking style features of the target sentence based on the paragraph features.

In some examples, the synthesis module comprises:

a prosody predictor for predicting, based on the acoustic features of the target sentence and the speaking style features of the target sentence, the acoustic features of the target sentence that carry prosody information;

a converter for converting the acoustic features carrying prosody information into the audio data.

In some examples, the method is performed by a live broadcast server, the text data is text data of an audiobook, and the method further comprises:

sending the audio data to audience terminals.
According to a second aspect of the embodiments of the present application, a computer program product is provided, including a computer program, wherein when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.

According to a third aspect of the embodiments of the present application, an electronic device is provided, the electronic device comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to:

acquire a target sentence in text data, the text data comprising at least two consecutive sentences;

acquire acoustic features of the target sentence;

acquire speaking style features of the target sentence, wherein the speaking style features are determined based on paragraph features of the text data, the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, and the sentence features are extracted from the semantic features of the sentence based on the contribution of each word in the sentence to the speaking style;

synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.

According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, wherein when the program is executed by a processor, the steps of the method described in the first aspect are implemented.

The technical solutions provided by the embodiments of the present application may include the following beneficial effects: in the process of audio synthesis, speaking style features are extracted for each sentence; the speaking style features are determined based on the paragraph features of the text data in which the sentence is located; the paragraph features are extracted from the sentence features of each sentence based on the contribution of each sentence in the text data to the speaking style; and the sentence features are obtained based on the contribution of each word in the sentence to the speaking style. In this way, the influence of the context structure on the speaking style of a sentence is comprehensively considered at the two levels of inter-word relationships and inter-sentence relationships, and the extracted speaking style features of the target sentence carry not only the contribution information of each word in the target sentence to the speaking style of the sentence, but also the contribution information of other sentences in the context to the speaking style. The audio synthesized using the acoustic features of the sentence together with the above speaking style features is more expressive, which improves the richness of the expressive effects of the synthesized audio.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.
Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the present application.

Fig. 1 is a flowchart of an audio synthesis method according to an embodiment of the present application.

Fig. 2 is a schematic diagram of an audio synthesis system according to an embodiment of the present application.

Fig. 3 is a flowchart of an audio synthesis method according to another embodiment of the present application.

Fig. 4 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 5 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 6 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 7 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 8 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 9 is a flowchart of an audio synthesis method according to another embodiment of the present application.

Fig. 10 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 11 shows an application scenario of an audio synthesis method according to an embodiment of the present application.

Fig. 12 is a schematic diagram of an audio synthesis system according to another embodiment of the present application.

Fig. 13 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numerals in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present application as recited in the appended claims.
在本申请使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本申请和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中使用的术语“和/或”是指并包含一个或多个相关联的列出项目的任何或所有可能组合。The terminology used in this application is for the purpose of describing particular embodiments only, and is not intended to limit the application. As used in this application and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
应当理解,尽管在本申请可能采用术语第一、第二、第三等来描述各种信息,但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如,在不脱离本申请范围的情况下,第一信息也可以被称为第二信息,类似地,第二信息也可以被称为第一信息。取决于语境,如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "at" or "when" or "in response to a determination."
语音合成(Text-To-Speech,TTS)技术是一种能把文本智能地转化为音频的技术。TTS技术已经广泛地应用到了有声小说、新闻、语音助手、智能导航等产品中。其中,合成音频的自然度是衡量音频合成效果的指标之一,达到高自然度的合成音频能使用户在主观感受上与自然语言一样生动形象。Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio. TTS technology has been widely used in audio novels, news, voice assistants, intelligent navigation and other products. Among them, the naturalness of synthesized audio is one of the indicators to measure the effect of audio synthesis. Synthesized audio with high naturalness can make users feel as vivid as natural language in subjective experience.
合成音频的自然度在很大程度上取决于合成音频的表现力,富有表现力的音频能够表现出说话人的情感和说话风格,自然度较高。着力于提高合成音频在表达效果上的丰富性是TTS技术中重要的一部分。The naturalness of synthesized audio depends largely on the expressiveness of the synthesized audio. Expressive audio can show the speaker's emotion and speaking style, and has a high degree of naturalness. It is an important part of TTS technology to focus on improving the richness of synthetic audio in terms of expressive effects.
在相关技术中，合成的音频说话风格单一，语气平淡，无法体现出情感与说话内容所蕴含的含义。合成音频的表现力较差，导致与真实语音仍然存在差距。发明人发现，句子整体的语义和情感会受到上下文影响的。句子的表现力由该句子所处语境决定，这不仅与该句子的语义信息相关，还受到该句子所处的文本段落中上下文内其他句子的语义信息影响，又或者说，句子的表现力与该句子在文本段落中所处的位置相关。同一句子若处于文本段落中不同位置，该句子的表现力也会有所不同。In the related art, synthesized audio has a single speaking style and a flat tone, and cannot reflect the emotion or the meaning carried by the spoken content. Such synthesized audio is poorly expressive, so a gap with real speech remains. The inventors found that the overall semantics and emotion of a sentence are affected by its context. The expressiveness of a sentence is determined by the context in which the sentence occurs: it is related not only to the semantic information of the sentence itself, but also to the semantic information of the other sentences around it in the text paragraph. In other words, the expressiveness of a sentence is related to where the sentence is located in the text paragraph; the same sentence placed at different positions in a paragraph will be expressed differently.
然而在相关技术中，绝大多数的方案仅关注了当前句子的语义信息，而忽视了该句子的上下文内其他句子的语义信息对该句子表现力的影响。这导致了同一句子在不同文本中，或者在同一文本的不同位置下，其合成的音频都是千篇一律的，无法捕捉到不同的上下文带来的各种变化，如语调、节奏、重音、情感的不同。从而合成的音频表现力较差，自然度较低。基于上述问题，本申请提出了一种音频合成方法，包括如图1所示步骤：However, in the related art, most solutions focus only on the semantic information of the current sentence and ignore the influence that the semantic information of the other sentences in its context has on the sentence's expressiveness. As a result, the audio synthesized for the same sentence is identical across different texts, or at different positions within the same text, and fails to capture the variations brought about by different contexts, such as differences in intonation, rhythm, stress and emotion. The synthesized audio is therefore poorly expressive and has low naturalness. To address the above problems, the present application proposes an audio synthesis method, including the steps shown in Fig. 1:
步骤110:获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Step 110: Acquiring a target sentence in text data, the text data including at least two consecutive sentences;
步骤120:获取所述目标句子的声学特征;Step 120: Acquiring the acoustic features of the target sentence;
步骤130:获取所述目标句子的说话风格特征;Step 130: Obtain the speaking style features of the target sentence;
其中,所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Wherein, the speaking style features are determined based on the paragraph features of the text data, and the paragraph features of the text data are extracted from the sentence features of each sentence based on the contribution of each sentence of the text data to the speaking style, The sentence feature is based on the contribution of each word in the sentence to the speaking style, extracted from the semantic feature of the sentence;
步骤140:基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Step 140: Synthesize the target sentence into audio data based on the acoustic features of the target sentence and the speaking style features of the target sentence.
对于包括至少两个连续句子的文本数据,针对每个句子均可以采用如图1所示的方法来进行音频合成。其中,目标句子可以是当前待合成的句子。目标句子的声学特征是指能体现出目标句子发音特性、声学表现的特征,例如可以是音素级的声学特征,也即每个音素用一个多维向量来表示。For text data including at least two consecutive sentences, the method shown in FIG. 1 can be used for audio synthesis for each sentence. Wherein, the target sentence may be a sentence to be synthesized currently. The acoustic feature of the target sentence refers to the feature that can reflect the pronunciation characteristics and acoustic performance of the target sentence, for example, it can be a phoneme-level acoustic feature, that is, each phoneme is represented by a multi-dimensional vector.
在提取目标句子的说话风格特征过程中，首先提取了文本数据中各个句子的语义特征，然后基于句子中各个词语对说话风格的贡献，从句子的语义特征中提取出句子的句子特征，所谓句子特征，即用一个多维向量来表示一个句子，其包含了一个句子整体的语义信息。在一些例子中，句子中各个词语对说话风格的贡献，可以通过基于注意力机制的词间网络获取。In the process of extracting the speaking style feature of the target sentence, the semantic features of each sentence in the text data are extracted first. Then, based on the contribution of each word in a sentence to the speaking style, the sentence feature of that sentence is extracted from its semantic features. A sentence feature is a multi-dimensional vector that represents the sentence and contains the semantic information of the sentence as a whole. In some examples, the contribution of each word in a sentence to the speaking style can be obtained through an inter-word network based on an attention mechanism.
随后基于各个句子对说话风格的贡献,从各个句子的句子特征提取出段落特征,所谓段落特征,即用一个多维向量来表示一整个文本,其包含了文本数据整体的语义信息。在一些例子中,文本数据的各个句子对说话风格的贡献,可以通过基于注意力机制的句子间网络获取。Then, based on the contribution of each sentence to the speaking style, the paragraph features are extracted from the sentence features of each sentence. The so-called paragraph features use a multi-dimensional vector to represent a whole text, which contains the semantic information of the text data as a whole. In some examples, the contribution of each sentence of the text data to the speaking style can be obtained through an attention-based inter-sentence network.
最后从段落特征中提取出说话风格特征，如此所提取出的说话风格特征结合了每个句子中的词间关系和句间关系，综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。因此，利用句子的声学特征以及说话风格特征所合成出的音频有更优的表现力，提高了合成音频在表达效果上的丰富性。Finally, the speaking style feature is extracted from the paragraph feature. The speaking style feature extracted in this way combines the word-level relationships within each sentence with the sentence-level relationships between sentences, and thus takes the influence of the context structure on a sentence's speaking style into account. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. Therefore, the audio synthesized from the acoustic features and the speaking style feature of a sentence is more expressive, which enriches the expressive effect of the synthesized audio.
在一些实施例中，上述如图1所示的一种音频合成的方法，可以应用于如图2所示音频合成系统200中。音频合成系统200包括声学特征提取模块210、说话风格特征提取模块220以及合成模块230。其中，声学特征提取模块210用于从输入的目标句子中提取目标句子的声学特征；说话风格特征提取模块220用于从输入的文本数据中提取目标句子的说话风格特征；合成模块230用于基于目标句子的声学特征以及说话风格特征，合成目标句子的音频数据。In some embodiments, the audio synthesis method shown in Fig. 1 can be applied to the audio synthesis system 200 shown in Fig. 2. The audio synthesis system 200 includes an acoustic feature extraction module 210, a speaking style feature extraction module 220 and a synthesis module 230. The acoustic feature extraction module 210 is configured to extract the acoustic features of the target sentence from the input target sentence; the speaking style feature extraction module 220 is configured to extract the speaking style feature of the target sentence from the input text data; and the synthesis module 230 is configured to synthesize the audio data of the target sentence based on the acoustic features and the speaking style feature of the target sentence.
在一些实施例中,目标句子的声学特征可以是音素级声学特征。上述步骤120声学特征的获取过程可以包括如图3所示的步骤:In some embodiments, the acoustic features of the target sentence may be phoneme-level acoustic features. The acquisition process of the above-mentioned step 120 acoustic features may include steps as shown in Figure 3:
步骤121:获取所述目标句子的音素序列;Step 121: Obtain the phoneme sequence of the target sentence;
步骤122:基于所述音素序列,获取所述目标句子的音素级特征;Step 122: Obtain phoneme-level features of the target sentence based on the phoneme sequence;
步骤123:基于多头注意力机制,从所述音素级特征中提取出所述目标句子的声学特征。Step 123: Based on the multi-head attention mechanism, extract the acoustic features of the target sentence from the phoneme-level features.
相应地,如图4所示,声学特征提取模块210可以包括文本转音素(Grapheme to Phoneme,G2P)子模块211、音素嵌入(Phoneme Embedding)子模块212以及音素编码(Phoneme Encoder)子模块213。文本转音素子模块211可以按照语言学知识设计的转换逻辑,将输入的目标句子转换为能体现其发音特点的音素序列。例如,对于目标句子“因为知识实在是太多了。”,文本转音素子模块211可以将该目标句子转换为“i-n1|u-e-i4|zh-iy1|sh-iy2|sh-iy2|z-a-i4|sh-iy4|t-a-i4|d-u-o1|l-e5|。”的音素序列。其中,音素序列中的“i”、“n1”、“zh”等为音素符号,每一个音素符号表示一种发音。“|”是用于分隔相邻字的分隔符号,“-”是用于分隔相邻音素的分隔符号。当然,目标句子的音素序列不限于上述的表现形式,还可以是其他能体现目标句子发音特点的形式的音素序列,本申请在此不做限制。可以假设,文本转音素子模块211输出的音素序列长度为N。Correspondingly, as shown in FIG. 4 , the acoustic feature extraction module 210 may include a text-to-phoneme (Grapheme to Phoneme, G2P) submodule 211, a phoneme embedding (Phoneme Embedding) submodule 212 and a phoneme encoding (Phoneme Encoder) submodule 213. The text-to-phoneme sub-module 211 can convert the input target sentence into a phoneme sequence that can reflect its pronunciation characteristics according to the conversion logic designed by linguistic knowledge. For example, for the target sentence "Because there is so much knowledge.", the text-to-phoneme submodule 211 can convert the target sentence into "i-n1|u-e-i4|zh-iy1|sh-iy2|sh-iy2| z-a-i4|sh-iy4|t-a-i4|d-u-o1|l-e5|." phoneme sequence. Wherein, "i", "n1", and "zh" in the phoneme sequence are phoneme symbols, and each phoneme symbol represents a pronunciation. "|" is a separator used to separate adjacent words, and "-" is a separator used to separate adjacent phonemes. Certainly, the phoneme sequence of the target sentence is not limited to the above expression forms, and may also be a phoneme sequence in other forms that can reflect the pronunciation characteristics of the target sentence, which is not limited in this application. It can be assumed that the length of the phoneme sequence output by the text-to-phoneme sub-module 211 is N.
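As a rough illustration of the grapheme-to-phoneme step described above, the following sketch uses a small hand-written lexicon as a stand-in for whatever pronunciation dictionary and conversion rules a real system would use (the document does not specify them); it only reproduces the output format, with "-" separating the phonemes of one character and "|" separating adjacent characters.

```python
# Hypothetical, hand-written lexicon; a real system would use a full Mandarin
# pronunciation dictionary plus linguistically designed conversion rules.
LEXICON = {
    "因": ["i", "n1"],
    "为": ["u", "e", "i4"],
    "知": ["zh", "iy1"],
    "识": ["sh", "iy2"],
}

def grapheme_to_phoneme(sentence: str) -> str:
    """Convert a sentence into a phoneme string: '-' joins the phonemes of one
    character, '|' separates adjacent characters; punctuation passes through."""
    per_char = []
    for ch in sentence:
        if ch in LEXICON:
            per_char.append("-".join(LEXICON[ch]))
        else:
            per_char.append(ch)  # punctuation or out-of-lexicon characters
    return "|".join(per_char) + "|"

print(grapheme_to_phoneme("因为知识"))  # -> i-n1|u-e-i4|zh-iy1|sh-iy2|
```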
基于文本转音素子模块211输出的音素序列,音素嵌入子模块212可以将每个音素映射为一个多维向量。例如,每个音素可以映射成一个256维的浮点型向量。如此,长度为N的音素序列经过音素嵌入模块212后,可以映射为一个N*256的矩阵, 即音素级特征。Based on the phoneme sequence output by the text-to-phoneme sub-module 211, the phoneme embedding sub-module 212 can map each phoneme into a multi-dimensional vector. For example, each phoneme can be mapped to a 256-dimensional floating-point vector. In this way, the phoneme sequence with a length of N can be mapped into an N*256 matrix after passing through the phoneme embedding module 212 , that is, phoneme-level features.
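A minimal sketch of the phoneme-embedding step, assuming a PyTorch implementation and a toy phoneme vocabulary; it only illustrates how a length-N phoneme sequence becomes an N*256 matrix of phoneme-level features.

```python
import torch
import torch.nn as nn

# Toy vocabulary, assumed for illustration only.
PHONE_VOCAB = {"<pad>": 0, "i": 1, "n1": 2, "u": 3, "e": 4, "i4": 5, "zh": 6, "iy1": 7}

# Each phoneme symbol is mapped to a 256-dimensional floating-point vector.
embedding = nn.Embedding(num_embeddings=len(PHONE_VOCAB), embedding_dim=256, padding_idx=0)

phoneme_ids = torch.tensor([[PHONE_VOCAB[p] for p in ["i", "n1", "u", "e", "i4"]]])  # (1, N)
phoneme_level_features = embedding(phoneme_ids)                                      # (1, N, 256)
print(phoneme_level_features.shape)  # torch.Size([1, 5, 256])
```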
音素嵌入子模块212输出的音素级特征随后输入音素编码子模块213。音素编码子模块213包括一个位置编码模型和若干个Transformer模型。如图5所示,音素编码子模块213由一个位置编码模型和四个Transformer模型组成。位置编码模型可以将人工设计的位置信息添加到音素级特征中,如此,后续的Transformer模型可以在计算时将音素的位置也考虑进去。位置编码的计算方式可以参照相关技术实施,本申请在此不展开论述。The phoneme-level features output by the phoneme embedding sub-module 212 are then input into the phoneme encoding sub-module 213 . The phoneme coding sub-module 213 includes a position coding model and several Transformer models. As shown in FIG. 5 , the phoneme coding sub-module 213 is composed of a position coding model and four Transformer models. The position encoding model can add artificially designed position information to the phoneme-level features, so that the subsequent Transformer model can also take the position of the phoneme into account when calculating. The calculation method of the position code can be implemented with reference to related technologies, and this application will not discuss it here.
Transformer模型由一个带有残差连接和层归一化的多头自注意力机制,以及一个带有残差连接和层归一化的一维卷积层组成。Transformer模型可以根据音素之间的关系以及融合前后各音素的信息,提取出目标句子的声学特征。该声学特征可以是音素级声学特征,携带有音素之间对发音特性影响的信息。大小为N*256的音素级特征经过音素编码子模块213后,可以提取出大小同样为N*256的是音素级声学特征。即音素编码子模块213输出的序列大小可以和原始序列大小保持一致。The Transformer model consists of a multi-head self-attention mechanism with residual connections and layer normalization, and a 1D convolutional layer with residual connections and layer normalization. The Transformer model can extract the acoustic features of the target sentence according to the relationship between phonemes and the information of each phoneme before and after fusion. The acoustic features may be phoneme-level acoustic features, which carry information about the influence of phonemes on pronunciation characteristics. After the phoneme-level features with a size of N*256 are passed through the phoneme encoding sub-module 213, the phoneme-level acoustic features with the same size of N*256 can be extracted. That is, the size of the sequence output by the phoneme coding sub-module 213 can be consistent with the size of the original sequence.
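The following sketch shows one possible PyTorch realization of the phoneme encoder described above: sinusoidal position encoding followed by four blocks, each consisting of multi-head self-attention and a one-dimensional convolution, both with residual connections and layer normalization. The number of attention heads, the kernel size and the two-layer convolution structure are assumptions, not values given in the document.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_position_encoding(length: int, dim: int) -> torch.Tensor:
    """Fixed position encoding added to the phoneme-level features."""
    position = torch.arange(length).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

class FFTBlock(nn.Module):
    """One encoder block: multi-head self-attention and a 1-D convolution,
    each followed by a residual connection and layer normalization."""
    def __init__(self, dim=256, heads=2, kernel_size=9):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                                      # x: (batch, N, dim)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)                           # residual + layer norm
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)                        # residual + layer norm

x = torch.randn(1, 12, 256)                                    # phoneme-level features, N = 12
x = x + sinusoidal_position_encoding(12, 256)                  # add position information
encoder = nn.Sequential(*[FFTBlock() for _ in range(4)])       # four stacked blocks
acoustic_features = encoder(x)                                 # still (1, 12, 256)
```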
在一些实施例中,步骤130目标句子的说话风格特征的提取过程可以由说话风格特征提取模块220执行。如图6所示,说话风格特征提取模块220包括语言模型221和层级编码器222。其中,语言模型221用于提取文本数据中各个句子的语义特征;层级编码器222用于提取句子特征、段落特征以及说话风格特征。In some embodiments, the process of extracting the speaking style features of the target sentence in step 130 may be performed by the speaking style feature extraction module 220 . As shown in FIG. 6 , the speaking style feature extraction module 220 includes a language model 221 and a hierarchical encoder 222 . Among them, the language model 221 is used to extract the semantic features of each sentence in the text data; the hierarchical encoder 222 is used to extract sentence features, paragraph features and speaking style features.
在一些实施例中，语言模型221可以是XLNET语言模型(Generalized Autoregressive Pretraining for Language Understanding)。XLNET语言模型在一个字数多达数十亿的文本数据上提前训练，大量的文本的训练数据可以让XLNET模型更好地理解提取出文本的语义信息。In some embodiments, the language model 221 may be an XLNet language model (Generalized Autoregressive Pretraining for Language Understanding). The XLNet language model is pretrained on text data containing billions of characters; this large amount of training text enables the XLNet model to better understand the text and extract its semantic information.
在另一些实施例中,语言模型221可以是BERT(Bidirectional Encoder Representations from Transformers)语言模型。BERT语言模型可以利用大量的中文文本数据进行预训练,以提取有效的语义特征。In other embodiments, the language model 221 may be a BERT (Bidirectional Encoder Representations from Transformers) language model. The BERT language model can be pre-trained with a large amount of Chinese text data to extract effective semantic features.
语言模型221不限于上述两种模型,本领域技术人员可以根据实际需要选取其他能够实现从文本数据中提取语义特征效果的模型作为语言模型221。The language model 221 is not limited to the above two models, and those skilled in the art may select other models capable of extracting semantic features from text data as the language model 221 according to actual needs.
输入语言模型221的文本数据包括目标句子以及目标句子的前后若干个句子。例如包括目标句子及其前后各L个句子，共2L+1个句子。其中，L可以是正整数。文本数据经过语言模型221可以提取出文本数据的语义特征。在一些实施例中，文本数据的语义特征可以是字符级别的语义特征，也可以是词级别的语义特征。所谓字符级别的语义特征为每个字符用一个多维向量来表示。同理，所谓词级别的语义特征为每个词语用一个多维向量来表示。例如，若语言模型221为BERT语言模型，则提取出的语义特征可以是字符级别的语义特征或词级别的语义特征。若语言模型221为XLNET语言模型，则提取出的语义特征可以是词级别的语义特征。若语言模型221为其他能实现从文本数据中提取语义特征效果的模型，则在提取语义特征前，可以先对文本数据进行分词处理，随后在经过分词处理后的文本数据上提取出词级别的语义特征。The text data input into the language model 221 includes the target sentence and several sentences before and after it, for example the target sentence and L sentences on each side, 2L+1 sentences in total, where L may be a positive integer. The semantic features of the text data can be extracted from the text data by the language model 221. In some embodiments, the semantic features of the text data may be character-level semantic features or word-level semantic features; a character-level semantic feature represents each character with a multi-dimensional vector, and likewise a word-level semantic feature represents each word with a multi-dimensional vector. For example, if the language model 221 is a BERT language model, the extracted semantic features may be character-level or word-level semantic features; if the language model 221 is an XLNet language model, the extracted semantic features may be word-level semantic features. If the language model 221 is another model capable of extracting semantic features from text data, the text data may first be segmented into words, and word-level semantic features may then be extracted from the segmented text data.
以下实施例以语言模型221为XLNET语言模型为例，XLNET语言模型可以按照模型所学习到的知识对句子进行分词，若输入的文本数据共有M个词语，并且对每个词语输出一个表示该词语意思的768维的高维表示，那么输入的文本数据经过XLNET语言模型后可以提取出M*768的词级别的语义特征。In the following embodiments, the language model 221 is an XLNet language model by way of example. The XLNet language model segments sentences into words according to the knowledge it has learned. If the input text data contains M words in total and the model outputs a 768-dimensional representation of the meaning of each word, then M*768 word-level semantic features can be extracted after the input text data passes through the XLNet language model.
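A sketch of this semantic-feature extraction step, assuming the Hugging Face transformers implementation of a pretrained Chinese XLNet checkpoint; the checkpoint name is illustrative, and the model's own subword tokenization stands in for the word segmentation described above.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; any pretrained Chinese XLNet (or BERT) model that
# exposes hidden states would play the same role here.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

context = "小明去哪了？小明去深圳了。深圳是一个美丽的城市。"   # 2L+1 consecutive sentences
inputs = tokenizer(context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

semantic_features = outputs.last_hidden_state   # (1, M, 768): one 768-d vector per token
print(semantic_features.shape)
```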
语言模型221输出的语义特征在输入层级编码器222之前,可以按照每个词语所属的句子,分割为2L+1个序列。每个序列由每个句子所包括的词语对应的语义特征组成。以下将分割的每个序列简称为句子的语义特征。分割后的各个序列输入层级编码器222后,可以提取出各个句子的句子特征、文本数据的段落特征以及目标句子的说话风格特征。层级编码器222包括两层注意力网络,如图7所示,层级编码器222包括词间网络2221和句子间网络2222,这两层注意力网络具有相似的结构,主要包括一个双向门控循环单元(Gated Recurrent Unit,GRU)和一个缩放的点积注意力机制。The semantic features output by the language model 221 can be divided into 2L+1 sequences according to the sentence to which each word belongs before being input to the level encoder 222 . Each sequence consists of semantic features corresponding to the words included in each sentence. Each segmented sequence is referred to as the semantic feature of the sentence below. After each segmented sequence is input into the hierarchical encoder 222, the sentence features of each sentence, the paragraph features of the text data, and the speaking style features of the target sentence can be extracted. The hierarchical encoder 222 includes two layers of attention networks. As shown in FIG. 7, the hierarchical encoder 222 includes an inter-word network 2221 and an inter-sentence network 2222. These two layers of attention networks have similar structures, mainly including a bidirectional gating loop Unit (Gated Recurrent Unit, GRU) and a scaled dot-product attention mechanism.
词间网络2221用于获取每个句子中各个词语对说话风格的贡献，并基于所述各个词语对说话风格的贡献，从所述句子（即每个序列）的语义特征提取所述句子的句子特征。各个词语对说话风格的贡献可以理解为该词语在多大程度上影响了说话风格，可以表现为同一句子内部每个词语的词义和词语之间的关系。句子的语义特征输入词间网络2221后，双向GRU会考虑时间顺序和上下文信息，重新提取每个词语对应的语义特征。同时，由于并非每个词语对句子的意思都有相同的贡献，因此可以采用缩放的点积注意力机制来计算每个词语对应的权重，并根据权重汇总出整个句子对应的句子特征。作为例子，可以利用键(Key)、值(Value)和查询(Query)向量来计算每个词语对应的权重并根据权重汇总出句子特征。其中，键和值是由每个词语对应的语义特征经线性变换得到的向量，查询向量则是一个从数据集中训练得到的向量。如在上述例子中，经过XLNET语言模型提取出的M*768的词级别的语义特征共分割为2L+1个序列，每个序列代表一个句子的语义特征，共有2L+1个句子的语义特征。每个句子的语义特征分别经过词间网络2221后，共提取出2L+1个句子的句子特征，且每个句子的句子特征为一个256维的向量。The inter-word network 2221 is configured to obtain the contribution of each word in a sentence to the speaking style and, based on those contributions, extract the sentence feature of the sentence from the semantic features of the sentence (i.e., from each sequence). The contribution of a word to the speaking style can be understood as the extent to which the word influences the speaking style, and is reflected in the meaning of each word within the sentence and the relationships between words. After the semantic features of a sentence are input into the inter-word network 2221, the bidirectional GRU re-extracts the semantic feature of each word, taking temporal order and context information into account. At the same time, since not every word contributes equally to the meaning of the sentence, a scaled dot-product attention mechanism can be used to compute a weight for each word, and the sentence feature of the whole sentence is aggregated according to these weights. As an example, key, value and query vectors can be used to compute the weight of each word and aggregate the sentence feature: the keys and values are obtained by linear transformations of the semantic feature of each word, while the query is a vector learned from the training data set. In the above example, the M*768 word-level semantic features extracted by the XLNet language model are split into 2L+1 sequences, each representing the semantic features of one sentence. After the semantic features of each sentence pass through the inter-word network 2221, the sentence features of the 2L+1 sentences are obtained, each being a 256-dimensional vector.
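A minimal PyTorch sketch of the inter-word network just described: a bidirectional GRU re-encodes the word-level features, keys and values are linear transformations of the GRU outputs, and a learned query with scaled dot-product attention weighs each word's contribution to produce one 256-dimensional sentence feature. Layer sizes are assumptions consistent with the running example.

```python
import math
import torch
import torch.nn as nn

class InterWordNetwork(nn.Module):
    """Word-level attention pooling: BiGRU re-encoding followed by scaled
    dot-product attention with a learned query vector."""
    def __init__(self, in_dim=768, dim=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, dim // 2, batch_first=True, bidirectional=True)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))      # trained query vector

    def forward(self, word_features):                    # (1, num_words, 768)
        h, _ = self.gru(word_features)                   # (1, num_words, 256)
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))          # (1, num_words)
        weights = torch.softmax(scores, dim=-1)                    # per-word contribution
        sentence_feature = (weights.unsqueeze(-1) * v).sum(dim=1)  # (1, 256)
        return sentence_feature

net = InterWordNetwork()
sentence_feature = net(torch.randn(1, 7, 768))           # one sentence with 7 words
print(sentence_feature.shape)                            # torch.Size([1, 256])
```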
各个句子的句子特征可以进行混合处理后输入句子间网络2222。如在上述例子中，词间网络2221共提取出2L+1个256维的句子特征。可以将这些句子特征拼接成一个(2L+1)*256的混合特征，并输入句子间网络2222。句子间网络2222可以获取各个句子对说话风格的贡献，基于所述各个句子对说话风格的贡献，从各个句子的句子特征提取段落特征，并基于所述段落特征，预测所述目标句子的说话风格特征。各个句子对说话风格的贡献可以理解为该句子在多大程度上影响了说话风格，可以表现为文本数据中各个句子的句子表征和句子间关系。在一些实施例中，文本数据的段落特征，除了基于各个句子对说话风格的贡献来提取，还可以基于各个句子在所述文本数据中的位置信息进行提取。The sentence features of the individual sentences can be combined and then input into the inter-sentence network 2222. In the above example, the inter-word network 2221 extracts 2L+1 256-dimensional sentence features, which can be concatenated into a (2L+1)*256 mixed feature and input into the inter-sentence network 2222. The inter-sentence network 2222 can obtain the contribution of each sentence to the speaking style, extract the paragraph feature from the sentence features of the individual sentences based on those contributions, and predict the speaking style feature of the target sentence based on the paragraph feature. The contribution of a sentence to the speaking style can be understood as the extent to which the sentence influences the speaking style, and is reflected in the sentence representations and the inter-sentence relationships of the text data. In some embodiments, the paragraph feature of the text data is extracted based not only on the contribution of each sentence to the speaking style, but also on the position information of each sentence within the text data.
如图7所示,与词间网络2221类似,输入的拼接后的混合特征经过一个双向GRU来结合上下文重新提取每个句子的特征。然后通过位置编码向重新提取的每个句子的特征添加句子之间的相对位置信息。并且采用缩放的点积注意力来获得段落特征。最后通过一个线性层来根据段落特征预测目标句子的说话风格特征。如在上述例子中,一个(2L+1)*256的混合特征经过句子间网络2222可以预测出文本数据的256维段落特征,并最后预测出一个256维的说话风格特征。As shown in Fig. 7, similar to the inter-word network 2221, the input concatenated mixed features go through a bidirectional GRU to re-extract the features of each sentence in combination with the context. The relative position information between sentences is then added to the re-extracted features of each sentence through position encoding. And scaled dot-product attention is adopted to obtain paragraph features. Finally, a linear layer is used to predict the speaking style features of the target sentence based on the paragraph features. As in the above example, a (2L+1)*256 mixed feature can predict the 256-dimensional paragraph feature of the text data through the inter-sentence network 2222, and finally predict a 256-dimensional speaking style feature.
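A corresponding sketch of the inter-sentence network: a bidirectional GRU over the 2L+1 sentence features, added position information, scaled dot-product attention that pools them into a paragraph feature, and a linear layer that predicts the 256-dimensional speaking style feature of the target sentence. The learned position embedding stands in for the position encoding mentioned above; all sizes are assumptions consistent with the running example.

```python
import math
import torch
import torch.nn as nn

class InterSentenceNetwork(nn.Module):
    """Sentence-level stage: BiGRU, inter-sentence position information,
    scaled dot-product attention pooling, and a linear output layer."""
    def __init__(self, dim=256, max_sentences=32):
        super().__init__()
        self.gru = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.pos = nn.Embedding(max_sentences, dim)   # stand-in for position encoding
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.query = nn.Parameter(torch.randn(dim))
        self.to_style = nn.Linear(dim, dim)           # predicts the speaking style feature

    def forward(self, sentence_features):             # (1, 2L+1, 256)
        h, _ = self.gru(sentence_features)
        h = h + self.pos(torch.arange(h.size(1)))     # add relative position information
        k, v = self.key(h), self.value(h)
        scores = (k @ self.query) / math.sqrt(k.size(-1))
        weights = torch.softmax(scores, dim=-1)                        # per-sentence contribution
        paragraph_feature = (weights.unsqueeze(-1) * v).sum(dim=1)     # (1, 256)
        return self.to_style(paragraph_feature)                        # (1, 256) speaking style

net = InterSentenceNetwork()
style = net(torch.randn(1, 5, 256))   # L = 2: target sentence plus two neighbours on each side
print(style.shape)                    # torch.Size([1, 256])
```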
通过上文记载的实施例,可以提取出目标句子的声学特征和目标句子的说话风格特征。其中,发明人发现,在音频合成训练数据有限的情况下,要音频合成系统200隐式地学习到句子语义和合成音频的说话风格之间的映射关系是十分困难的。为此,在一些实施例中,层级编码器222可以采用有监督学习进行训练,训练数据可以包括标注有真实说话风格特征的语义特征。由于层级编码器222采用了有监督学习进行训练,因此可以显式地学习到说话风格特征。从隐式学习到显示学习,能大大提升模型对说话风格的预测效果。Through the embodiments described above, the acoustic features of the target sentence and the speaking style features of the target sentence can be extracted. Among them, the inventors found that, in the case of limited audio synthesis training data, it is very difficult for the audio synthesis system 200 to implicitly learn the mapping relationship between the sentence semantics and the speaking style of the synthesized audio. To this end, in some embodiments, the hierarchical encoder 222 can be trained using supervised learning, and the training data can include semantic features marked with real speaking style features. Since the hierarchical encoder 222 is trained using supervised learning, it can learn the speaking style features explicitly. From implicit learning to explicit learning, the prediction effect of the model on speaking style can be greatly improved.
然而,要获得真实说话风格特征是困难的,为此,在一些实施例中,层级编码器可以通过知识蒸馏(Knowledge Distillation)机制进行有监督学习。知识蒸馏机制包括教师网络(Teacher Network)和学生网络(Student Network),通过引入与教师网络相关的软目标(Soft Target)作为学生网络的训练目标中的一部分,以诱导学生网络的训练,从而实现知识迁移。真实说话风格特征可以包括教师模型的输出特征。在一 些实施例中,教师网络可以是参考编码器,如图8所示,参考编码器包括若干层二维卷积神经网络,如6层,一个GRU网络和一个全连接网络。参考编码器采用无监督学习进行训练,其训练数据为与文本对应的真实音频的音频特征。音频特征可以是梅尔频谱(mel-spectrogram)或LPC(Linear Prediction Coefficient,线性预测系数)等等中的一种或多种。以80维的梅尔频谱为例,在训练过程中,真实音频的80维梅尔频谱输入参考编码器后,以无监督学习的方式让参考编码器从梅尔频谱中提取出相应的说话风格特征。参考编码器输出的说话风格特征可以看做是真实说话风格特征。然后以标注有真实说话风格特征的语义特征作为训练数据来训练层级编码器222。如此,层级编码器222可以显式地学习说话风格特征,减小了模型训练的压力,并大大增强了在训练数据量不足的情况下,层级编码器222对说话风格特征的建模效果。However, it is difficult to obtain real speaking style features. For this reason, in some embodiments, the hierarchical encoder can perform supervised learning through a knowledge distillation (Knowledge Distillation) mechanism. The knowledge distillation mechanism includes the teacher network (Teacher Network) and the student network (Student Network), by introducing the soft target (Soft Target) related to the teacher network as part of the training target of the student network to induce the training of the student network, so as to achieve knowledge transfer. The real speaking style features may include output features of the teacher model. In some embodiments, the teacher network can be a reference encoder. As shown in FIG. 8, the reference encoder includes several layers of two-dimensional convolutional neural networks, such as 6 layers, a GRU network and a fully connected network. The reference encoder is trained with unsupervised learning on audio features of real audio corresponding to text. The audio feature may be one or more of mel-spectrogram or LPC (Linear Prediction Coefficient, linear prediction coefficient), etc. Take the 80-dimensional Mel spectrum as an example. During the training process, after the 80-dimensional Mel spectrum of real audio is input into the reference encoder, the reference encoder can extract the corresponding speaking style from the Mel spectrum in an unsupervised learning manner. feature. The speaking style features output by the reference encoder can be regarded as real speaking style features. The hierarchical encoder 222 is then trained with semantic features annotated with real speaking style features as training data. In this way, the hierarchical encoder 222 can explicitly learn the speaking style features, which reduces the pressure of model training and greatly enhances the modeling effect of the hierarchical encoder 222 on the speaking style features when the amount of training data is insufficient.
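The following sketch illustrates this knowledge-distillation setup under stated assumptions: a reference encoder (the teacher) with six 2-D convolution layers, a GRU and a fully connected layer maps an 80-bin mel-spectrogram to a 256-dimensional "real" speaking style feature, and the hierarchical encoder (the student) is trained to match it. The channel counts, strides and the MSE distillation loss are assumptions, not details given in the document.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    """Teacher network: stacked 2-D convolutions over the mel-spectrogram,
    a GRU, and a fully connected layer producing a 256-d style feature."""
    def __init__(self, n_mels=80, style_dim=256):
        super().__init__()
        chans = [1, 32, 32, 64, 64, 128, 128]          # assumed channel progression
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(chans[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        m_out = n_mels
        for _ in range(6):
            m_out = (m_out + 2 - 3) // 2 + 1           # mel-axis size after each stride-2 conv
        self.gru = nn.GRU(chans[-1] * m_out, style_dim, batch_first=True)
        self.fc = nn.Linear(style_dim, style_dim)

    def forward(self, mel):                            # mel: (batch, frames, n_mels)
        x = self.convs(mel.unsqueeze(1))               # (batch, 128, frames', mels')
        b, c, t, m = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * m)
        _, h = self.gru(x)                             # h: (1, batch, style_dim)
        return self.fc(h.squeeze(0))                   # (batch, style_dim)

teacher = ReferenceEncoder()
mel = torch.randn(2, 200, 80)                          # mel-spectrograms of real recordings
with torch.no_grad():
    real_style = teacher(mel)                          # treated as the "real" style features

# The hierarchical encoder (student) is trained so that its predicted style
# features match the teacher output; the loss choice below is an assumption.
predicted_style = torch.randn(2, 256, requires_grad=True)   # stand-in for the student output
distillation_loss = nn.functional.mse_loss(predicted_style, real_style)
```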
在提取出目标句子的声学特征和目标句子的说话风格特征后,可以将所述目标句子合成为音频数据。在一些实施例中,上述步骤140音频数据的合成过程可以包括如图9所示的步骤:After the acoustic features of the target sentence and the speaking style features of the target sentence are extracted, the target sentence can be synthesized into audio data. In some embodiments, the above step 140 audio data synthesis process may include the steps shown in Figure 9:
步骤141:基于所述目标句子的声学特征以及说话风格特征,预测所述目标句子的携带韵律信息的声学特征;Step 141: Predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features and speaking style features of the target sentence;
其中,韵律信息包括音高信息、音强信息或发音时长中的一种或多种。Wherein, the prosody information includes one or more of pitch information, sound intensity information or pronunciation duration.
步骤142:将所述携带韵律信息的声学特征转换为所述音频数据。Step 142: Convert the acoustic features carrying prosody information into the audio data.
相应地,如图10所示,合成模块230可以包括韵律预测器231和转换器232。韵律预测器231用于基于目标句子的声学特征以及目标句子的说话风格特征,预测目标句子的携带韵律信息的声学特征;转换器232用于将携带韵律信息的声学特征转换为音频数据。Correspondingly, as shown in FIG. 10 , the synthesis module 230 may include a prosody predictor 231 and a converter 232 . The prosody predictor 231 is used to predict the acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence; the converter 232 is used to convert the acoustic features carrying prosody information into audio data.
在一些实施例中,目标句子的声学特征和说话风格特征在输入韵律预测器231之前,可以先将说话风格特征复制成声学特征的长度。如在上述例子中,声学特征提取模块210输出了大小为N*256的是音素级声学特征,而说话风格特征提取模块220输出了一个256维的说话风格特征。那么可以将说话风格特征复制成长度为N的特征,并添加到大小为N*256的音素级声学特征上。然后将混合后的音素级声学特征输入韵律预测器231。In some embodiments, before the acoustic features and speaking style features of the target sentence are input into the prosody predictor 231, the speaking style features may be copied to the length of the acoustic features. As in the above example, the acoustic feature extraction module 210 outputs a phoneme-level acoustic feature with a size of N*256, and the speaking style feature extraction module 220 outputs a 256-dimensional speaking style feature. Then the speaking style feature can be copied into a feature of length N and added to the phoneme-level acoustic feature of size N*256. The mixed phone-level acoustic features are then input to the prosody predictor 231 .
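A small sketch of the mixing step just described: the single 256-dimensional speaking style feature is copied along the phoneme axis and added to the N*256 phoneme-level acoustic features before they enter the prosody predictor.

```python
import torch

N = 12
acoustic_features = torch.randn(N, 256)   # from the phoneme encoder
style_feature = torch.randn(256)          # from the hierarchical encoder

# Copy the style feature to length N and add it to the phoneme-level features.
mixed = acoustic_features + style_feature.unsqueeze(0).expand(N, -1)
print(mixed.shape)                        # torch.Size([12, 256])
```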
韵律预测器231包括三个语音变化预测器,这些语音变化预测器的结构基本一致,包括两个带有层归一化的一维卷积层以及一个全连接层。三个语音变化预测器分 别用于预测合成音频在该音素上的音高、音强和发音时长。其中,发音时长的单位为帧。每个预测器对每个混合后的音素级声学特征预测出一个浮点数作为预测结果。随后,预测出的音高预测结果和音强预测结果可以通过一个全连接层转换成256维的表征并添加到混合后的音素级声学特征中。而发音时长预测结果则通过四舍五入保留整数,代表了该音素的发音时长有多少帧。按照每个音素对应的发音时长对音素的声学特征进行复制,并将复制后的特征拼接一起作为帧级别的声学特征,该声学特征携带了音高、音强和发音时长的韵律信息。如在上述例子中,若预测出N*256的音素级声学特征的发音总时长包括n帧音频,那么韵律预测器231输出大小为n*256的携带韵律信息的声学特征。The prosody predictor 231 includes three speech change predictors, and the structures of these speech change predictors are basically the same, including two one-dimensional convolutional layers with layer normalization and one fully connected layer. Three speech change predictors are used to predict the pitch, intensity and duration of the synthesized audio on that phoneme, respectively. Wherein, the unit of the pronunciation duration is frame. Each predictor predicts a floating-point number for each mixed phone-level acoustic feature as the prediction result. Subsequently, the predicted pitch predictions and intensity predictions can be transformed into a 256-dimensional representation through a fully connected layer and added to the mixed phone-level acoustic features. The prediction result of the pronunciation duration is rounded to retain an integer, which represents how many frames the pronunciation duration of the phoneme is. The acoustic features of the phoneme are copied according to the pronunciation duration corresponding to each phoneme, and the copied features are spliced together as the frame-level acoustic features, which carry the prosody information of pitch, sound intensity and pronunciation duration. As in the above example, if it is predicted that the total pronunciation duration of N*256 phoneme-level acoustic features includes n frames of audio, then the prosody predictor 231 outputs n*256 acoustic features carrying prosody information.
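A minimal PyTorch sketch of one of the three prosody predictors and of the duration-based length regulation described above; the kernel size and the use of ReLU are assumptions. Pitch and energy predictions would, analogously, be projected back to 256 dimensions and added to the phoneme-level features.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """One of the three predictors (pitch, intensity or duration): two 1-D
    convolutions with layer normalization and a linear layer producing one
    scalar per phoneme."""
    def __init__(self, dim=256, kernel_size=3):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(dim)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                                        # x: (N, 256)
        h = torch.relu(self.conv1(x.t().unsqueeze(0))).squeeze(0).t()
        h = self.norm1(h)
        h = torch.relu(self.conv2(h.t().unsqueeze(0))).squeeze(0).t()
        h = self.norm2(h)
        return self.out(h).squeeze(-1)                           # one value per phoneme

def length_regulate(features, durations):
    """Repeat each phoneme's feature vector durations[i] times, giving
    frame-level features as described for the duration prediction."""
    return torch.repeat_interleave(features, durations, dim=0)

mixed = torch.randn(12, 256)                                     # style-conditioned phoneme features
duration_pred = VariancePredictor()(mixed)                       # float prediction per phoneme
durations = torch.clamp(torch.round(duration_pred), min=1).long()  # rounded to whole frames
frame_level = length_regulate(mixed, durations)                  # (sum(durations), 256)
```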
随后,韵律预测器231输出的携带韵律信息的声学特征经过转换器232,可以合成出目标句子的音频数据。其中转换器232包括解码器和声码器。解码器可以将携带韵律信息的声学特征转换为相应的音频特征,如梅尔频谱或LPC等。解码器输出的音频特征经过声码器后,可以输出目标句子的音频数据。其中,声码器可以是基于Hifi-GAN的神经网络声码器;音频数据可以是采样率为24kHz的合成音频数据。Subsequently, the acoustic features carrying prosodic information output by the prosody predictor 231 pass through the converter 232 to synthesize the audio data of the target sentence. Wherein the converter 232 includes a decoder and a vocoder. The decoder can convert the acoustic features carrying prosody information into corresponding audio features, such as Mel Spectrum or LPC, etc. After the audio features output by the decoder pass through the vocoder, the audio data of the target sentence can be output. Wherein, the vocoder can be a neural network vocoder based on Hifi-GAN; the audio data can be synthetic audio data with a sampling rate of 24kHz.
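A rough sketch of the converter, under the assumption that the decoder is a small feed-forward network producing an 80-bin mel-spectrogram; the document only fixes the input and output formats, and the vocoder call is a hypothetical placeholder for a HiFi-GAN-style neural vocoder producing a 24 kHz waveform.

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Maps frame-level acoustic features (with prosody information) to an
    80-bin mel-spectrogram; the internal layers are assumptions."""
    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, n_mels),
        )

    def forward(self, frame_features):        # (n_frames, 256)
        return self.net(frame_features)       # (n_frames, 80)

frame_features = torch.randn(240, 256)        # n = 240 frames from the prosody predictor
mel = MelDecoder()(frame_features)

# A neural vocoder (e.g. a HiFi-GAN generator) would then turn `mel` into a
# 24 kHz waveform; `vocoder` below is a hypothetical callable standing in for it.
# waveform = vocoder(mel.t().unsqueeze(0))    # (1, n_samples) at 24 kHz
```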
本申请提供的一种音频合成方法，在音频合成的过程中，针对每个句子都提取了说话风格特征，而说话风格特征是基于句子所在的文本数据的段落特征确定的，段落特征又是基于文本数据中每个句子对说话风格的贡献，从各个句子的句子特征提取的，句子特征则是基于句子中各个词语对说话风格的贡献获得的。如此，在说话风格特征提取的过程中，不仅考虑了目标句子的语义信息，还考虑了上下文句子对目标句子的语义信息的影响，从而能捕捉到不同上下文带来的语音变化。In the audio synthesis method provided by the present application, a speaking style feature is extracted for each sentence during audio synthesis. The speaking style feature is determined based on the paragraph feature of the text data in which the sentence is located; the paragraph feature is in turn extracted from the sentence features of the individual sentences based on the contribution of each sentence of the text data to the speaking style, and each sentence feature is obtained based on the contribution of each word in the sentence to the speaking style. In this way, the extraction of the speaking style feature considers not only the semantic information of the target sentence but also the influence of the context sentences on it, so that the speech variations brought about by different contexts can be captured.
此外，本申请采用了层级编码器对上下文语义进行分析，从词间关系和句间关系这两个层面上综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。层级编码器可以从上下文中提取出更多的信息并且有效提升了长距离依赖的建模能力，从而帮助更好的建模说话风格特征。In addition, the present application uses a hierarchical encoder to analyze the context semantics and considers the influence of the context structure on a sentence's speaking style at two levels, inter-word relationships and inter-sentence relationships. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. The hierarchical encoder can extract more information from the context and effectively improves the modeling of long-distance dependencies, which helps to better model the speaking style feature.
另外,层级编码器采用了知识蒸馏机制,教师模型以无监督学习的方式从文本对应的真实音频中提取出说话风格特征,以此来帮助学生模型,即层级编码器更好地训练,帮助模型更高效地预测句子的说话风格。In addition, the hierarchical encoder adopts a knowledge distillation mechanism, and the teacher model extracts the speaking style features from the real audio corresponding to the text in an unsupervised learning manner, so as to help the student model, that is, the hierarchical encoder to better train and help the model More efficiently predict the speaking style of sentences.
如此，本申请有效地提升了模型从层级上下文信息中建模说话风格特征的能力，合成的音频的说话风格会受到当前句子和上下文的影响，使得合成出的音频有更优的表现力和自然度，提高了合成音频在表达效果上的丰富性，且更加接近真实人类的语音。In this way, the present application effectively improves the model's ability to learn speaking style features from hierarchical context information. The speaking style of the synthesized audio is influenced by both the current sentence and its context, so the synthesized audio is more expressive and more natural, the expressive effect of the synthesized audio is richer, and the result is closer to real human speech.
在一些实施例中,本申请提供的一种音频合成方法,可以应用在直播场景中,如虚拟主播进行有声小说直播时,由于虚拟主播的语音是利用TTS技术合成的,在进行有声小说直播时,若合成的语音富有表现力,能表现出情感和说话风格,将能吸引更多的听众收听。如图11所示,本申请提供的一种音频合成方法可以由直播服务器执行。其中,直播服务器可以是单独一台服务器,也可以是由多台服务器组成的服务器集群。如图11所示,直播服务器1110可以利用上述任一实施例所提供的方法来将有声读物的文本数据合成为对应的音频数据,然后将合成的音频数据发送至直播间内的各个观众端1120。In some embodiments, the audio synthesis method provided by this application can be applied in live broadcasting scenarios. For example, when a virtual anchor performs a live broadcast of an audio novel, since the voice of the virtual anchor is synthesized using TTS technology, when performing a live broadcast of an audio novel , if the synthesized speech is expressive and can show emotion and speaking style, it will be able to attract more listeners. As shown in FIG. 11 , an audio synthesis method provided by this application can be executed by a live server. Wherein, the live broadcast server may be a single server, or may be a server cluster composed of multiple servers. As shown in Figure 11, the live broadcast server 1110 can use the method provided by any of the above-mentioned embodiments to synthesize the text data of the audiobook into corresponding audio data, and then send the synthesized audio data to each audience terminal 1120 in the live broadcast room .
本申请提供的一种音频合成方法,除了可以应用在直播场景外,还可以应用于智能手机、语音助手、智能导航、电子书等产品中,本申请在此不做限制。The audio synthesis method provided by this application can be applied to smart phones, voice assistants, smart navigation, e-books and other products in addition to live broadcast scenarios, and this application does not limit it here.
此外,本申请还提供了一种音频合成方法,通过如图12所示的音频合成系统实现。用户可以输入多个连续的句子,例如“小明去哪了?小明去深圳了。深圳是一个美丽的城市”共三个句子。针对每一个句子,均可以利用如图12所示的音频合成系统合成相应的音频数据。当前每一个待合成的句子为目标句子。以目标句子为“小明去深圳了。”为例,目标句子“小明去深圳了。”可以输入声学特征提取模块1210以提取目标句子的声学特征。而包括目标句子的文本数据“小明去哪了?小明去深圳了。深圳是一个美丽的城市”可以输入XLNET语言模型1220和层级编码器1230以提取目标句子的说话风格特征。In addition, the present application also provides an audio synthesis method, which is realized by an audio synthesis system as shown in FIG. 12 . The user can input multiple consecutive sentences, for example, "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" with three sentences in total. For each sentence, the corresponding audio data can be synthesized by using the audio synthesis system as shown in FIG. 12 . Each current sentence to be synthesized is the target sentence. Taking the target sentence as "Xiao Ming has gone to Shenzhen." as an example, the target sentence "Xiao Ming has gone to Shenzhen." can be input into the acoustic feature extraction module 1210 to extract the acoustic features of the target sentence. The text data including the target sentence "Where did Xiao Ming go? Xiao Ming went to Shenzhen. Shenzhen is a beautiful city" can be input into the XLNET language model 1220 and the hierarchical encoder 1230 to extract the speaking style features of the target sentence.
其中,声学特征提取模块1210包括文本转音素模块1211、音素嵌入模块1212以及音素编码模块1213。目标句子经过文本转音素模块1211可以提取出音素序列。输出的音素序列经过音素嵌入模块1212可以提取出音素级特征。输出的音素级特征经过音素编码器1213可以提取出目标句子的声学特征。Wherein, the acoustic feature extraction module 1210 includes a text-to-phoneme module 1211 , a phoneme embedding module 1212 and a phoneme encoding module 1213 . The phoneme sequence can be extracted from the target sentence through the text-to-phoneme module 1211 . The phoneme-level features can be extracted from the output phoneme sequence through the phoneme embedding module 1212 . The output phoneme-level features can be used to extract the acoustic features of the target sentence through the phoneme encoder 1213 .
文本数据经过XLNET语言模型1220后可以提取出文本数据的语义特征。随后输入层级编码器1230可以提取目标句子的说话风格。其中,层级编码器1230包括词间网络1231和句子间网络1232。文本数据的语义特征输入词间网络1231后可以提取出文本数据中每个句子的句子特征。这些句子特征输入句子间网络后可以提取出文本数据的段落特征,并根据段落特征提取出目标句子的说话风格特征。After the text data passes through the XLNET language model 1220, the semantic features of the text data can be extracted. The input level encoder 1230 may then extract the speaking style of the target sentence. Wherein, the hierarchical encoder 1230 includes an inter-word network 1231 and an inter-sentence network 1232 . After the semantic features of the text data are input into the inter-word network 1231, the sentence features of each sentence in the text data can be extracted. After these sentence features are input into the inter-sentence network, the paragraph features of the text data can be extracted, and the speaking style features of the target sentence can be extracted according to the paragraph features.
随后,目标句子的声学特征与说话风格特征在进行混合处理后可以输入韵律预测器1240。韵律预测器1240包括三个预测器,分别用于预测目标句子的音高、音强和发音时长。音高预测器和音强预测器输出的结果添加到混合处理后的声学特征中,并且基于时长预测器输出的结果,经过长度调节器(Length Regulator,LR)来调节声学特征的长度,以使输出的声学特征为携带韵律信息(音高、音强、时长)的帧级别的声学特征。韵律预测器1240输出的携带韵律信息的声学特征经过解码器1250可以转换为80维的梅尔频谱,最后经过声码器1260可以合成目标句子“小明去深圳了。”对应的音频数据。Subsequently, the acoustic features and speaking style features of the target sentence can be input into the prosody predictor 1240 after being mixed. The prosody predictor 1240 includes three predictors, which are respectively used to predict the pitch, sound intensity and pronunciation duration of the target sentence. The output results of the pitch predictor and the sound intensity predictor are added to the acoustic feature after mixing processing, and based on the output result of the duration predictor, the length of the acoustic feature is adjusted through the length regulator (Length Regulator, LR), so that the output The acoustic features of are frame-level acoustic features carrying prosody information (pitch, sound intensity, duration). The acoustic features carrying prosodic information output by the prosody predictor 1240 can be converted into an 80-dimensional Mel spectrum through the decoder 1250, and finally the audio data corresponding to the target sentence "Xiao Ming has gone to Shenzhen." can be synthesized through the vocoder 1260.
上述实施例的具体实现方式参见上文实施例,本申请在此不再赘述。For the specific implementation manners of the above embodiments, refer to the above embodiments, and the present application will not repeat them here.
如此，通过上述方法，音频合成系统会根据上下文信息确定目标句子最合理的说话风格，如在上述例子中，目标句子“小明去深圳了。”所合成出的音频，可以在“深圳”二字上进行强调，拖长发音等。In this way, with the above method the audio synthesis system determines the most reasonable speaking style for the target sentence according to the context information. In the above example, the audio synthesized for the target sentence “小明去深圳了。” may emphasize the word “深圳” (Shenzhen), for example by prolonging its pronunciation.
利用上述方法在提取说话风格特征时,不仅考虑了目标句子的语义信息,还考虑了上下文句子对目标句子的语义信息的影响,从而能捕捉到不同上下文带来的语音变化。When using the above method to extract the speaking style features, not only the semantic information of the target sentence is considered, but also the influence of the context sentence on the semantic information of the target sentence is considered, so that the speech changes brought about by different contexts can be captured.
此外，还采用了层级编码器对上下文语义进行分析，从词间关系和句间关系这两个层面上综合考虑了上下文结构对句子说话风格的影响，所提取出的目标句子的说话风格特征不仅携带了目标句子中每个词语对句子说话风格的贡献信息，还携带了上下文其他句子对说话风格的贡献信息。层级编码器可以从上下文中提取出更多的信息并且有效提升了长距离依赖的建模能力，从而帮助更好的建模说话风格特征。In addition, a hierarchical encoder is used to analyze the context semantics, and the influence of the context structure on a sentence's speaking style is considered at both the inter-word and inter-sentence levels. The extracted speaking style feature of the target sentence carries not only the information about how each word in the target sentence contributes to its speaking style, but also the information about how the other sentences in the context contribute to it. The hierarchical encoder can extract more information from the context and effectively improves the modeling of long-distance dependencies, which helps to better model the speaking style feature.
如此，本申请有效地提升了模型从层级上下文信息中建模说话风格特征的能力，合成的音频的说话风格会受到当前句子和上下文的影响，使得合成出的音频有更优的表现力和自然度，提高了合成音频在表达效果上的丰富性，且更加接近真实人类的语音。In this way, the present application effectively improves the model's ability to learn speaking style features from hierarchical context information. The speaking style of the synthesized audio is influenced by both the current sentence and its context, so the synthesized audio is more expressive and more natural, the expressive effect of the synthesized audio is richer, and the result is closer to real human speech.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了一种计算机程序产品,包括计算机程序,计算机程序被处理器执行时可用于执行上述任意实施例所述的一种音频合成方法。Based on the audio synthesis method described in any of the above embodiments, the present application also provides a computer program product, including a computer program, which can be used to perform the audio synthesis described in any of the above embodiments when the computer program is executed by a processor method.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了如图13所示 的一种电子设备的结构示意图。如图13,在硬件层面,该电子设备包括处理器、内部总线、网络接口、内存以及非易失性存储器,当然还可能包括其他业务所需要的硬件。处理器从非易失性存储器中读取对应的计算机程序到内存中然后运行,处理器被配置为:Based on the audio synthesis method described in any of the above embodiments, the present application also provides a schematic structural diagram of an electronic device as shown in FIG. 13 . As shown in Figure 13, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course may also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it. The processor is configured as:
获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
基于上述任意实施例所述的一种音频合成方法,本申请还提供了一种计算机存储介质,存储介质存储有计算机程序,计算机程序被处理器执行时可用于执行上述任意实施例所述的一种音频合成方法。Based on the audio synthesis method described in any of the above embodiments, the present application also provides a computer storage medium, where a computer program is stored in the storage medium, and when the computer program is executed by a processor, it can be used to perform a method described in any of the above embodiments. A method of audio synthesis.
上述对本申请特定实施例进行了描述。其它实施例在所附权利要求书的范围内。在一些情况下,在权利要求书中记载的动作或步骤可以按照不同于实施例中的顺序来执行并且仍然可以实现期望的结果。另外,在附图中描绘的过程不一定要求示出的特定顺序或者连续顺序才能实现期望的结果。在某些实施方式中,多任务处理和并行处理也是可以的或者可能是有利的。The foregoing describes specific embodiments of the present application. Other implementations are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. Multitasking and parallel processing are also possible or may be advantageous in certain embodiments.
本领域技术人员在考虑说明书及实践这里申请的发明后，将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未申请的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本申请的真正范围和精神由下面的权利要求指出。Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses or adaptations of the application that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed in the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the application being indicated by the following claims.

Claims (14)

  1. 一种音频合成方法,其特征在于,所述方法包括:A method for audio synthesis, characterized in that the method comprises:
    获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
    获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
    获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
  2. 根据权利要求1所述的方法,其特征在于,The method according to claim 1, characterized in that,
    所述句子中各个词语对说话风格的贡献,通过基于注意力机制的词间网络获取;The contribution of each word in the sentence to the speaking style is obtained through an inter-word network based on an attention mechanism;
    所述文本数据的各个句子对说话风格的贡献,通过基于注意力机制的句子间网络获取。The contribution of each sentence of the text data to the speaking style is obtained through an inter-sentence network based on an attention mechanism.
  3. 根据权利要求1所述的方法,其特征在于,所述文本数据的段落特征,还基于各个句子在所述文本数据中的位置信息进行提取。The method according to claim 1, wherein the paragraph features of the text data are also extracted based on position information of each sentence in the text data.
  4. 根据权利要求1所述的方法,其特征在于,所述基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据,包括:The method according to claim 1, wherein said target sentence is synthesized into audio data based on the acoustic features of said target sentence and the speaking style features of said target sentence, comprising:
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,预测所述目标句子的携带韵律信息的声学特征;所述韵律信息包括音高信息、音强信息或发音时长中的一种或多种;Based on the acoustic features of the target sentence and the speaking style features of the target sentence, predict the acoustic features of the target sentence carrying prosody information; the prosody information includes one of pitch information, sound intensity information or pronunciation duration. or more;
    将所述携带韵律信息的声学特征转换为所述音频数据。converting the acoustic features carrying prosody information into the audio data.
  5. 根据权利要求1所述的方法,其特征在于,所述方法应用于音频合成系统,所述音频合成系统包括:The method according to claim 1, wherein the method is applied to an audio synthesis system, and the audio synthesis system comprises:
    用于提取所述目标句子的声学特征的声学特征提取模块;An acoustic feature extraction module for extracting the acoustic features of the target sentence;
    用于提取所述目标句子的说话风格特征的说话风格特征提取模块;A speaking style feature extraction module for extracting the speaking style features of the target sentence;
    用于合成所述音频数据的合成模块。A synthesis module for synthesizing the audio data.
  6. 根据权利要求5所述的方法,其特征在于,说话风格特征提取模块包括:The method according to claim 5, wherein the speaking style feature extraction module comprises:
    用于提取所述语义特征的语言模型;A language model for extracting the semantic features;
    用于提取所述句子特征、所述段落特征以及所述说话风格特征的层级编码器。A hierarchical encoder for extracting the sentence feature, the paragraph feature and the speaking style feature.
  7. 根据权利要求6所述的方法,其特征在于,所述层级编码器采用有监督学习进 行训练,训练数据包括标注有真实说话风格特征的语义特征。The method according to claim 6, wherein the hierarchical encoder is trained using supervised learning, and the training data includes semantic features marked with real speaking style features.
  8. 根据权利要求7所述的方法，其特征在于，所述层级编码器通过知识蒸馏机制进行有监督学习，所述层级编码器为所述蒸馏机制中的学生模型，所述蒸馏机制中的教师模型采用无监督学习从真实音频数据中提取出真实说话风格特征。The method according to claim 7, wherein the hierarchical encoder performs supervised learning through a knowledge distillation mechanism, the hierarchical encoder is the student model in the distillation mechanism, and the teacher model in the distillation mechanism extracts the real speaking style features from real audio data by unsupervised learning.
  9. 根据权利要求6所述的方法,其特征在于,所述层级编码器包括:The method according to claim 6, wherein the hierarchical encoder comprises:
    用于获取每个句子中各个词语对说话风格的贡献,并基于所述各个词语对说话风格的贡献,从所述句子的语义特征提取所述句子的句子特征的词间网络;For obtaining the contribution of each word in each sentence to the speaking style, and based on the contribution of each word to the speaking style, extracting the inter-word network of the sentence feature of the sentence from the semantic feature of the sentence;
    用于获取各个句子对说话风格的贡献，基于所述各个句子对说话风格的贡献，从各个句子的句子特征提取段落特征，并基于所述段落特征，预测所述目标句子的说话风格特征的句子间网络。an inter-sentence network configured to obtain the contribution of each sentence to the speaking style, extract paragraph features from the sentence features of the sentences based on the contribution of each sentence to the speaking style, and predict the speaking style feature of the target sentence based on the paragraph features.
  10. 根据权利要求5所述的方法,其特征在于,所述合成模块包括:The method according to claim 5, wherein the synthesis module comprises:
    用于基于所述目标句子的声学特征以及所述目标句子的说话风格特征,预测所述目标句子的携带韵律信息的声学特征的韵律预测器;A prosody predictor for predicting acoustic features carrying prosodic information of the target sentence based on the acoustic features of the target sentence and the speaking style features of the target sentence;
    用于将所述携带韵律信息的声学特征转换为所述音频数据的转换器。A converter for converting the acoustic features carrying prosody information into the audio data.
  11. 根据权利要求1所述的方法,其特征在于,所述方法由直播服务器执行,所述文本数据为有声读物的文本数据,所述方法还包括:The method according to claim 1, wherein the method is executed by a live broadcast server, the text data is text data of an audiobook, and the method further comprises:
    将所述音频数据发送给观众端。Send the audio data to the audience.
  12. 一种计算机程序产品,包括计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1-11任一所述方法的步骤。A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1-11 are implemented.
  13. 一种电子设备,其特征在于,所述电子设备包括:An electronic device, characterized in that the electronic device comprises:
    处理器;processor;
    用于存储处理器可执行指令的存储器;memory for storing processor-executable instructions;
    其中,所述处理器被配置为:Wherein, the processor is configured as:
    获取文本数据中的目标句子,所述文本数据包括至少两个连续句子;Obtaining a target sentence in the text data, the text data including at least two consecutive sentences;
    获取所述目标句子的声学特征;Acquiring the acoustic features of the target sentence;
    获取所述目标句子的说话风格特征;所述说话风格特征是基于所述文本数据的段落特征确定的,所述文本数据的段落特征是基于所述文本数据的各个句子对说话风格的贡献,从各个句子的句子特征提取的,所述句子特征是基于句子中各个词语对说话风格的贡献,从所述句子的语义特征提取的;Obtain the speaking style feature of the target sentence; the speaking style feature is determined based on the paragraph feature of the text data, and the paragraph feature of the text data is based on the contribution of each sentence of the text data to the speaking style, from The sentence features of each sentence are extracted, and the sentence features are based on the contribution of each word in the sentence to the speaking style, and are extracted from the semantic features of the sentence;
    基于所述目标句子的声学特征以及所述目标句子的说话风格特征,将所述目标句子合成为音频数据。Synthesizing the target sentence into audio data based on the acoustic feature of the target sentence and the speaking style feature of the target sentence.
  14. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述程序被处理器执行时实现权利要求1-11任一项所述方法的步骤。A computer-readable storage medium, on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method according to any one of claims 1-11 are realized.
PCT/CN2021/137237 2021-12-10 2021-12-10 Audio synthesis method, electronic device, program product and storage medium WO2023102929A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137237 WO2023102929A1 (en) 2021-12-10 2021-12-10 Audio synthesis method, electronic device, program product and storage medium


Publications (1)

Publication Number    Publication Date
WO2023102929A1 (en)    2023-06-15

Family

ID=86729552


Country Status (1)

Country Link
WO (1) WO2023102929A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party

Publication Number    Priority Date    Publication Date    Assignee    Title
CN102385858A (en) *    2010-08-31    2012-03-21    International Business Machines Corporation    Emotional voice synthesis method and system
US20160329043A1 (en) *    2014-01-21    2016-11-10    LG Electronics Inc.    Emotional-speech synthesizing device, method of operating the same and mobile terminal including the same
US20210151029A1 (en) *    2019-11-15    2021-05-20    Electronic Arts Inc.    Generating Expressive Speech Audio From Text Data



Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application

Ref document number: 21966847

Country of ref document: EP

Kind code of ref document: A1