WO2023102931A1 - Prosodic structure prediction method, electronic device, program product and storage medium - Google Patents

Prosodic structure prediction method, electronic device, program product and storage medium

Info

Publication number
WO2023102931A1
WO2023102931A1 (application PCT/CN2021/137240, CN2021137240W)
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
character
feature
features
prosodic
Prior art date
Application number
PCT/CN2021/137240
Other languages
English (en)
French (fr)
Inventor
康世胤
吴志勇
陈杰
宋长河
Original Assignee
广州虎牙科技有限公司
清华大学深圳国际研究生院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州虎牙科技有限公司 and 清华大学深圳国际研究生院
Priority to PCT/CN2021/137240 priority Critical patent/WO2023102931A1/zh
Publication of WO2023102931A1 publication Critical patent/WO2023102931A1/zh


Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation

Definitions

  • the present application relates to the technical field of audio synthesis, in particular to a prosodic structure prediction method, an electronic device, a program product and a storage medium.
  • Speech synthesis (Text-To-Speech, TTS) technology is a technology that can intelligently convert text into audio.
  • TTS technology has been widely used in smart phones, voice assistants, smart navigation, e-books and other products.
  • the prosodic structure of the text can assist the speech synthesis system to generate natural and intelligible speech. Predicting prosodic structure accurately is an important part of TTS technology.
  • the application provides a prosodic structure prediction method, an electronic device, a program product and a storage medium, which can accurately predict the prosodic structure of a text.
  • a prosodic structure prediction method comprising:
  • the prosodic structure of the sentence is predicted by using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  • the step of obtaining character features includes:
  • the semantic features include one or more of the following: character-level semantic features or word-level semantic features.
  • predicting the prosodic structure of the sentence using the character features of each character of the sentence, the sentence features of the sentence, and the context features includes:
  • the character feature of each character of the sentence, the sentence feature of the sentence and the context feature are mixed to obtain a mixed hierarchical feature
  • the prosodic structure of the sentence is extracted from the mixed hierarchical features.
  • the method is applied to a prosodic structure prediction system, the prosodic structure prediction system comprising:
  • a feature extractor for extracting semantic features of each sentence from said text data
  • a multi-level encoder for extracting character features of each character in the sentence, sentence features of the sentence, and contextual features of the text data
  • a prosodic structure decoder for predicting a prosodic structure of the sentence using character features of each character in the sentence, sentence features of the sentence, and context features of the text data.
  • the feature extractor is a BERT model.
  • the multi-level encoder includes:
  • a character-level encoder for obtaining character features of each character in the sentence from the semantic features of each sentence
  • a sentence-level encoder for obtaining the sentence feature of the sentence using the character feature of each character in the sentence
  • a context-level encoder for acquiring the context features of the text data using the sentence features of each sentence.
  • the character-level encoder includes at least one Transformer model based on a self-attention mechanism, so that the extracted character features of each character represent the influence of other characters in the sentence on the prosodic structure of the character.
  • the sentence-level encoder includes several stacked convolutional layers, and the sentence features of the sentence are obtained by mixing the output features of each convolutional layer through pooling layer processing.
  • the context-level encoder includes several stacked convolutional layers, and the contextual features are obtained by mixing the output features of each convolutional layer through a pooling layer.
  • the prosodic structure decoder includes several classifiers for classifying different types of prosodic structures.
  • the plurality of classifiers for classifying different types of prosodic structures include: a prosodic word classifier, a prosodic phrase classifier, and an intonation phrase classifier.
  • the method is performed by a live broadcast server, the text data is the text data of an audiobook, and the method further includes: sending the synthesized audio data to the audience terminals.
  • a computer program product including a computer program, and when the computer program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • an electronic device includes:
  • memory for storing processor-executable instructions
  • the processor is configured as:
  • the prosodic structure of the sentence is predicted by using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method described in the first aspect are implemented.
  • the present application provides a prosodic structure prediction method, electronic device, program product and storage medium, in which the character features of each character in each sentence of the text data, the sentence features of each sentence, and the context features of the text data are obtained, wherein the character features represent the influence of other characters in the sentence on the prosodic structure of the character, sentence features contain the semantic information of the sentence as a whole, and context features contain the semantic information of the entire text data.
  • Character features represent the influence of other characters in the sentence on the prosodic structure of the character
  • sentence features contain the semantic information of the sentence as a whole
  • context features contain the semantic information of the entire text data.
  • the above three features carry different levels of information, so that multiple kinds of information can be used to assist the prediction of the prosodic structure of a sentence.
  • these different levels of information include contextual information, so the prediction accuracy of the prosodic structure of a sentence can be greatly improved.
  • Fig. 1 is a flow chart of Chinese prosodic structure prediction in the related art.
  • Fig. 2 is a flowchart of a prosodic structure prediction method according to an embodiment of the present application.
  • Fig. 3 is a schematic diagram of a prosodic structure prediction system according to an embodiment of the present application.
  • Fig. 4 is a schematic diagram of BERT language model training according to an embodiment of the present application.
  • Fig. 5 is a schematic diagram of a semantic feature extraction process according to an embodiment of the present application.
  • Fig. 6 is a schematic diagram of a multi-level encoder according to an embodiment of the present application.
  • Fig. 7 is a schematic diagram of a character-level encoder according to an embodiment of the present application.
  • Fig. 8 is a schematic diagram of a sentence-level encoder according to an embodiment of the present application.
  • Fig. 9 is a schematic diagram of a context-level encoder according to an embodiment of the present application.
  • Fig. 10 is a flowchart of a prosodic structure prediction method according to another embodiment of the present application.
  • Fig. 11 is a schematic diagram of a prosodic structure decoder according to an embodiment of the present application.
  • Fig. 12 is a schematic diagram of a prosodic structure prediction system according to another embodiment of the present application.
  • FIG. 13 shows an application scenario of a prosodic structure prediction method according to an embodiment of the present application.
  • Fig. 14 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
  • although the terms first, second, third, etc. may be used in this application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
  • the prosodic structure of the text can assist the speech synthesis system to generate natural and intelligible speech.
  • Prosodic structure refers to the hierarchical organization of discourse pause structure.
  • Chinese prosodic structure is a three-level structure, including prosody word (Prosody Word, PW), prosody phrase (Prosody Phrase, PPH) and intonation phrase (Intonation Phrase, IPH).
  • One or more Chinese characters constitute a PW
  • multiple PWs constitute a PPH
  • multiple PPHs constitute an IPH
  • a sentence can be divided into multiple IPHs.
  • PW, PPH and IPH do not have fixed composition rules.
  • Prosodic structures at different levels show different pause durations.
  • Accurate prosodic structure can assist in the synthesis of highly natural speech, so the prediction of prosodic structure is an important part of TTS technology.
  • in the related art, the prediction process for the Chinese prosodic structure is shown in FIG. 1: the feature extractor 110 extracts text features from the Chinese character sequence, and the text features are then input into the prosody predictor 120 to extract the Chinese prosodic structure of the Chinese character sequence.
  • the inventors found that the prosodic structure of a sentence is determined by the semantic information in the context of the sentence.
  • the prosodic structure of a sentence is not only related to the semantic information of the sentence itself, but is also influenced by the semantic information of other sentences in the context of the text paragraph in which the sentence is located; in other words, the prosodic structure of a sentence is related to the position of the sentence in the text paragraph. If the same sentence appears at different positions in a text paragraph, its prosodic structure will also differ.
  • Step 210: acquiring text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
  • Step 220: obtaining character features of each character in each sentence, sentence features of the sentence, and context features of the text data;
  • the character feature is used to characterize the influence of other characters in the sentence on the prosodic structure of the character
  • the sentence feature of the sentence is obtained by using the character feature of each character
  • the context feature is obtained by using the sentence features of each sentence;
  • Step 230: for each sentence, predict the prosodic structure of the sentence by using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  • the semantic features of each sentence may be extracted from the text data; using the semantic features of each sentence, the character features of each character in each sentence are acquired.
  • the method for obtaining character features is not limited to the above-mentioned methods, and those skilled in the art can also obtain the character features of each word in each sentence through other methods.
  • in the prosodic structure prediction method provided by the present application, the character features of each character in each sentence, the sentence features of each sentence, and the context features of the text data are obtained, wherein the character features represent the influence of other characters in the sentence on the prosodic structure of the character, the sentence features contain the semantic information of the sentence as a whole, and the context features contain the semantic information of the entire text data. In this way, when the above three features are used to predict the prosodic structure of a sentence, not only the semantic information of the sentence itself is considered, but also the semantic information of the entire text data, that is, the semantic information of other sentences in the context of the sentence, is used to predict the prosodic structure of the sentence. The predicted prosodic structure is influenced by the semantics of the context of the sentence, so the accuracy of the predicted prosodic structure is greatly improved.
  • the above-mentioned prosodic structure prediction method as shown in FIG. 2 can be applied to the prosodic structure prediction system 300 as shown in FIG. 3; the prosodic structure prediction system 300 includes a feature extractor 310, a multi-level encoder 320 and a prosodic structure decoder 330.
  • feature extractor 310 is used for extracting the semantic feature of each sentence from text data.
  • the multi-level encoder 320 is used to extract character features of each character in a sentence, sentence features of a sentence, and context features of text data.
  • the prosodic structure decoder 330 is used to predict the prosodic structure of the sentence by using the character features of each character in the sentence, the sentence features of the sentence and the context features of the text data.
  • the feature extractor 310 performs unsupervised learning based on a large amount of data.
  • the multi-level encoder 320 and the prosodic structure decoder 330 perform supervised learning as a whole, and the training data are semantic features marked with prosodic structures.
  • during training, the Adam optimizer is used; it combines the advantages of the AdaGrad and RMSProp optimization algorithms, jointly considering the first-order and second-order moment estimates of the gradient to compute the update step size.
  • the number of samples selected per training step (batch size) is 64, the learning rate is 0.00001, and training continues until the loss on the validation set no longer decreases.
  • the feature extractor 310 may be a BERT (Bidirectional Encoder Representations from Transformers) language model.
  • the BERT language model can be pre-trained with a large amount of Chinese text data to extract effective semantic features.
  • the BERT model can accept a character sequence consisting of one or more sentences as input, where the input character sequence contains a part of characters that are randomly masked.
  • the masking mechanism is as follows: each character in the character sequence has a 15% probability of being selected; a selected character has an 80% probability of being replaced with [MASK], a 10% probability of being randomly replaced with another character, and a 10% probability of being kept as the character itself.
  • the training objectives of the BERT model include predicting the masked characters and judging whether the input sentences form connected context in actual discourse.
  • the BERT language model is selected as the feature extractor because, compared with manually constructed features such as word vectors and word embeddings, it can more effectively extract the semantic information in text data.
  • the feature extractor 310 may also be a pre-trained XLNET language model (Generalized Autoregressive Pretraining for Language Understanding).
  • the feature extractor 310 is not limited to the above two models, and those skilled in the art may select other models capable of extracting semantic features from text data as the feature extractor 310 according to actual needs.
  • the semantic features of each sentence extracted from the text data may be character-level semantic features or word-level semantic features.
  • the so-called character-level semantic feature is represented by an N-dimensional vector for each character.
  • the so-called word-level semantic features are represented by an N-dimensional vector for each word.
  • the N-dimensional vectors corresponding to all characters/words in a sentence form the semantic features of the sentence.
  • the feature extractor 310 is a BERT language model
  • the extracted semantic features may be character-level semantic features or word-level semantic features.
  • the feature extractor 310 is an XLNET language model
  • the extracted semantic features can be word-level semantic features.
  • if the feature extractor 310 is another model that can extract semantic features from text data, the text data can first be segmented into words, and word-level semantic features can then be extracted from the segmented text data.
  • the BERT language model can accept text data including at least two consecutive sentences as input; for example, the text data can be "I eat tomatoes. It's really delicious! Do you want to have one too?", three sentences in total. The BERT language model then extracts the semantic features of each sentence separately.
  • the semantic features of the multiple sentences are used as one set of input features. For example, the semantic features of the sentences are combined into a set of input features in the form of multiple channels, where one channel of the input features is the semantic features of one sentence, and this set is input into the multi-level encoder 320.
  • the multi-level encoder 320 utilizes the semantic features of each sentence to obtain character features of each character in each sentence, sentence features of sentences, and context features of text data.
  • the multi-level encoder 320 includes a character-level encoder 321 , a sentence-level encoder 322 and a context encoder 323 .
  • the character-level encoder 321 is used to obtain the character features of each character in the sentence from the semantic features of each sentence.
  • the sentence-level encoder 322 is used to obtain the sentence features of the sentence by using the character features of each character in the sentence.
  • the context encoder 323 is used to acquire the context features of the text data by using the sentence features of each sentence.
  • the character-level encoder 321 may include at least one Transformer model 3211 based on a self-attention mechanism, for example the two Transformer models 3211 shown in FIG. 7.
  • through the self-attention mechanism of the Transformer model 3211, the long-distance dependencies present in the text data can be better modeled, so that the character features of each character extracted by the character-level encoder 321 can represent the influence of other characters in the sentence on the prosodic structure of that character.
  • positional encoding can be added to the semantic features of the sentence so that the Transformer model can distinguish the position of each character in the semantic features, and the model can take the position information of the characters into account during computation, so that the output character features can represent the influence of characters at different distances in the sentence on the prosodic structure of the character.
  • the positional encoding can be implemented with reference to the related art, and this application does not discuss it further here.
  • after the character-level encoder 321 extracts the character features of each character from the semantic features of each sentence, the character features of the multiple sentences are input into the sentence-level encoder 322 as one set of input features.
  • the sentence-level encoder 322 can aggregate the information carried by each sentence into a vector, that is, sentence features, based on the character features of each character in each sentence.
  • the sentence feature of each sentence contains the semantic information of the sentence as a whole.
  • the sentence-level encoder 322 includes several stacked convolutional layers.
  • in other embodiments, the sentence-level encoder 322 may include a Recurrent Neural Network (RNN). Taking several stacked convolutional layers as an example, as shown in FIG. 8:
  • the sentence-level encoder 322 includes three stacked convolutional layers (Convolutional Neural Networks, CNN) 3221-3223 and a pooling layer 3224.
  • the pooling layer 3224 can be a maximum pooling layer or an average pooling layer.
  • after the sentence-level encoder 322 extracts the sentence features of each sentence from the character features of each sentence, the sentence features of the multiple sentences are input into the context-level encoder 323 as one set of input features.
  • the context-level encoder 323 can aggregate the information carried by the entire text data into a vector based on the sentence features of each sentence, that is, the context feature, which contains the semantic information of the text data.
  • the context-level encoder 323 includes several stacked convolutional layers. As shown in FIG. 9 , the context-level encoder 323 includes three stacked convolutional layers 3231 - 3233 and one pooling layer 3234 .
  • the pooling layer 3234 can be a maximum pooling layer or an average pooling layer.
  • the sentence features of multiple sentences are sequentially processed by the stacked convolutional layers 3231-3233, and the outputs of the convolutional layers 3231-3233 are processed by the pooling layer 3234 and spliced together, so that the information of the text data is aggregated into contextual features. That is, the context feature is obtained by mixing the output features of each convolutional layer through the pooling layer.
  • the above step 230 may include the steps shown in FIG. 10:
  • Step 241: for each sentence, mix the character features of each character of the sentence, the sentence features of the sentence, and the context features to obtain mixed hierarchical features;
  • Step 242: extract the prosodic structure of the sentence from the mixed hierarchical features.
  • prosodic structure decoder 330 may be employed to predict the prosodic structure of a sentence.
  • the prosodic structure decoder 330 may include several classifiers for classifying different types of prosodic structures.
  • Several classifiers for classifying different types of prosodic structures may include: a prosodic word classifier, a prosodic phrase classifier, and an intonation phrase classifier.
  • the prosodic structure decoder 330 may include a combination of one or more of a prosodic word classifier, a prosodic phrase classifier and an intonation phrase classifier.
  • the classifier may be a Feedforward Neural Network (FNN).
  • FNN Feedforward Neural Network
  • in order to analyze the features extracted by the multi-level encoder 320, such as the character features of a sentence, the sentence features and the context features, or the mixed hierarchical features, and to extract information useful for classification, the prosodic structure decoder 330 may also include a Gated Recurrent Unit (GRU) network connected to each classifier, as shown in FIG. 11.
  • the prosodic structure decoder 330 may include a PW GRU 331, a PPH GRU 332 and an IPH GRU 333, as well as a PW FFN 334 connected to the PW GRU 331, a PPH FFN 335 connected to the PPH GRU 332, and an IPH FFN 336 connected to the IPH GRU 333.
  • the PW GRU 331 is responsible for extracting information useful for PW classification
  • the PW FFN 334 is responsible for the PW classification of each sentence.
  • the PPH GRU 332 is responsible for extracting information useful for PPH classification
  • the PPH FFN 335 is responsible for the PPH classification of each sentence.
  • the IPH GRU 333 is responsible for extracting information useful for IPH classification
  • the IPH FFN 336 is responsible for the IPH classification of each sentence.
  • prosodic structure prediction can in fact be regarded as three prediction subtasks: PW prediction, PPH prediction and IPH prediction; each prediction subtask is performed by its own independent binary classifier.
  • each binary classifier is used to judge whether the boundary of its corresponding type of prosodic structure occurs after each character of the sentence. For example, the PW FFN 334 is used to determine whether there is a PW boundary after each character of the sentence, the PPH FFN 335 is used to determine whether there is a PPH boundary after each character, and the IPH FFN 336 is used to determine whether there is an IPH boundary after each character.
  • the PW prediction subtask does not depend on the other subtasks, so the PW GRU 331 accepts the mixed hierarchical features as input, and its output features are classified by the PW FFN 334 to produce the PW prediction result.
  • the PPH prediction subtask relies on the information extracted in the PW prediction subtask, so the input features of the PPH GRU 332 include the output features of the PW GRU 331 in addition to the mixed hierarchical features.
  • the IPH prediction subtask depends on the information extracted in the PW and PPH prediction subtasks, so the input features of the IPH GRU 333 include the output features of the PW GRU 331 and the PPH GRU 332 in addition to the mixed hierarchical features. In this way, after the three classifiers predict the boundaries of the three levels of prosodic structure, text data annotated with the prosodic structure can finally be output.
  • the text data and the prosodic structure of each sentence can be used to synthesize audio data showing the prosodic structure.
  • the text data may be synthesized into corresponding audio data by referring to the TTS technology in the related art, and this application will not describe in detail here.
  • the user only needs to provide text data containing multiple sentences from the same context; the prosodic structure prediction system directly accepts the text data as input, requires no additional preprocessing, and outputs the prosodic structure of the multiple sentences, realizing prosodic structure prediction in an end-to-end manner.
  • this application uses a multi-level encoder to extract the features of multiple sentences to obtain multi-level features, including character features for each character in each sentence, sentence features for each sentence, and contextual features for the entire text data.
  • multi-level features can better model the semantic information of the context existing in the text data.
  • in this way, when predicting the prosodic structure of each sentence, the semantic information of the sentence context can be consulted, which greatly improves the accuracy of prosodic structure prediction.
  • the present application also provides a prosodic structure prediction method, which is realized by an end-to-end prosodic structure prediction system as shown in FIG. 12 .
  • users can input multiple consecutive sentences, such as the three consecutive sentences shown in the figure: "I eat tomatoes. It's really delicious! Do you want to have one too?".
  • these three sentences are referred to as S1, S2, and S3 for short below.
  • from the three sentences input by the user, the BERT language model 1210 can extract the semantic features of S1, S2 and S3 respectively.
  • the semantic features of these three sentences are all character-level semantic features.
  • the multi-level encoder includes a character-level encoder 1221 , a sentence-level encoder 1222 and a context-level encoder 1223 connected in sequence.
  • after the semantic features pass through the character-level encoder 1221, the character features of each character in the sentence can be extracted; that is, each character in the sentence is represented by a D1-dimensional vector. If the sentence includes L characters, the character features of the L characters can be concatenated into an L*D1 matrix as the character features of the sentence (as shown in the figure).
  • the character features of S1, S2 and S3 can be extracted respectively.
  • the character features of these three sentences are input into the sentence-level encoder 1222, which can extract the information contained in each sentence from the character features of each character in the sentence and represent each sentence by a D2-dimensional vector.
  • the character features of S1, S2 and S3 are passed through the sentence-level encoder 1222 to extract the sentence features of S1, S2 and S3 respectively.
  • the sentence features of the three sentences are input into the context-level encoder 1223 to extract the overall semantic information of the three sentences, and represent the three sentences with a D3-dimensional vector to obtain the context features.
  • the character features of the sentence, the sentence features of the sentence and the context features of the entire text data are mixed to obtain mixed hierarchical features.
  • the D2-dimensional sentence feature and the D3-dimensional context feature can be copied several times and then concatenated with the character features, finally forming an L*(D1+D2+D3) matrix (as shown in the figure) as the mixed hierarchical features.
  • the prosodic structure of the sentence can be predicted through three prediction subtasks.
  • the end-to-end prosodic structure prediction system finally outputs the result "I #1 eat #1 tomatoes. #3 It #1 is really #2 delicious! #3 Do you #1 want #1 to have one too? #3".
  • the user only needs to provide text data containing multiple sentences in the same context, and the prosodic structure prediction system directly accepts the text data as input without additional preprocessing and outputs the prosodic structure of the multiple sentences, enabling the prediction of prosodic structures in an end-to-end manner.
  • multi-level features are obtained, including character features for each character in each sentence, sentence features for each sentence, and contextual features for the entire text data, thereby Better modeling of contextual semantic information that exists inside textual data.
  • the semantic information of the sentence context can be referred to, which greatly improves the accuracy of prosodic structure prediction.
  • the prosodic structure prediction method provided by this application can be applied in live broadcast scenarios. For example, when a virtual host performs a live broadcast of an audio novel, the host's voice is synthesized using TTS technology; if the synthesized voice is closer to a real person's pronunciation and has higher naturalness, it will attract more listeners.
  • a prosodic structure prediction method provided by the present application can be executed by a live server.
  • the live broadcast server may be a single server, or may be a server cluster composed of multiple servers, as shown in FIG. 13.
  • the live server 1311 can predict the prosodic structure of the text data of the audiobook by using the method provided by any of the above embodiments, and then send the text data marked with the prosodic structure to another live server 1312 .
  • the live server 1312 can use TTS technology to synthesize the text data marked with prosodic structure into audio data representing the predicted prosodic structure, and send the audio data to each audience terminal 1320 in the live broadcast room.
  • a prosodic structure prediction method provided by this application can be applied to smart phones, voice assistants, smart navigation, e-books and other products in addition to live broadcast scenarios, and this application does not limit it here.
  • the present application also provides a computer program product, including a computer program.
  • when the computer program is executed by a processor, it can be used to perform the prosodic structure prediction method described in any of the above embodiments.
  • the present application also provides a schematic structural diagram of an electronic device as shown in FIG. 14 .
  • the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and of course may also include hardware required by other services.
  • the processor reads the corresponding computer program from the non-volatile memory into the memory and then runs it.
  • the processor is configured as:
  • the prosodic structure of the sentence is predicted by using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  • the above-mentioned electronic device may be a live broadcast server, or may be a user's terminal device, such as a mobile phone, a tablet computer and other terminal devices.
  • the present application also provides a computer storage medium; the storage medium stores a computer program, and when the computer program is executed by a processor, it can be used to execute the prosodic structure prediction method described in any of the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a prosodic structure prediction method, an electronic device, a program product and a storage medium. The method obtains the character feature of each character in each sentence, the sentence feature of each sentence, and the context feature of the text data, where the character feature characterizes the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature contains the semantic information of the sentence as a whole, and the context feature contains the semantic information of the entire text data. For each sentence, the prosodic structure of the sentence is predicted using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature. Multiple kinds of information thus assist the prediction of the prosodic structure of a sentence, and because these different levels of information include contextual information, the prediction accuracy of the prosodic structure of a sentence can be greatly improved.

Description

Prosodic structure prediction method, electronic device, program product and storage medium

Technical Field
The present application relates to the technical field of audio synthesis, and in particular to a prosodic structure prediction method, an electronic device, a program product and a storage medium.
Background
Text-to-speech (TTS) technology intelligently converts text into audio. TTS has been widely applied in smartphones, voice assistants, smart navigation, e-books and other products. The prosodic structure of the text can help a speech synthesis system generate natural, intelligible speech, so accurately predicting the prosodic structure is an important part of TTS technology.
Summary
The present application provides a prosodic structure prediction method, an electronic device, a program product and a storage medium that can accurately predict the prosodic structure of a text.
According to a first aspect of the embodiments of the present application, a prosodic structure prediction method is provided, the method comprising:
acquiring text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
acquiring the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data; wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
for each sentence, predicting the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
In some examples, the step of acquiring the character features comprises:
extracting the semantic feature of each sentence from the text data;
using the semantic feature of each sentence, acquiring the character feature of each character in the sentence.
In some examples, the semantic features comprise one or more of the following: character-level semantic features or word-level semantic features.
In some examples, predicting, for each sentence, the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence and the context feature comprises:
for each sentence, mixing the character feature of each character of the sentence, the sentence feature of the sentence and the context feature to obtain a mixed hierarchical feature;
extracting the prosodic structure of the sentence from the mixed hierarchical feature.
In some examples, after predicting the prosodic structure of each sentence, the method further comprises:
using the text data and the prosodic structure of each sentence, synthesizing audio data that exhibits the prosodic structure.
In some examples, the method is applied to a prosodic structure prediction system, the prosodic structure prediction system comprising:
a feature extractor for extracting the semantic feature of each sentence from the text data;
a multi-level encoder for extracting the character feature of each character in the sentence, the sentence feature of the sentence, and the context feature of the text data;
a prosodic structure decoder for predicting the prosodic structure of the sentence using the character feature of each character in the sentence, the sentence feature of the sentence, and the context feature of the text data.
In some examples, the feature extractor is a BERT model.
In some examples, the multi-level encoder comprises:
a character-level encoder for obtaining, from the semantic feature of each sentence, the character feature of each character in the sentence;
a sentence-level encoder for obtaining the sentence feature of the sentence using the character feature of each character in the sentence;
a context-level encoder for obtaining the context feature of the text data using the sentence features of the individual sentences.
In some examples, the character-level encoder comprises at least one Transformer model based on a self-attention mechanism, so that the extracted character feature of each character characterizes the influence of the other characters in the sentence on the prosodic structure of that character.
In some examples, the sentence-level encoder comprises several stacked convolutional layers, and the sentence feature of the sentence is obtained by passing the output features of each convolutional layer through a pooling layer and then mixing them.
In some examples, the context-level encoder comprises several stacked convolutional layers, and the context feature is obtained by passing the output features of each convolutional layer through a pooling layer and then mixing them.
In some examples, the prosodic structure decoder comprises several classifiers for classifying different types of prosodic structures.
In some examples, the several classifiers for classifying different types of prosodic structures comprise: a prosodic word classifier, a prosodic phrase classifier, and an intonation phrase classifier.
In some examples, the method is executed by a live broadcast server, the text data is the text data of an audiobook, and the method further comprises:
sending the synthesized audio data to the audience terminals.
According to a second aspect of the embodiments of the present application, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method according to the first aspect.
According to a third aspect of the embodiments of the present application, an electronic device is provided, the electronic device comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquire text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
acquire the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data; wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
for each sentence, predict the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
According to a fourth aspect of the embodiments of the present application, a computer-readable storage medium is provided, on which a computer program is stored, the program implementing the steps of the method according to the first aspect when executed by a processor.
The technical solutions provided by the embodiments of the present application may include the following beneficial effects:
The present application provides a prosodic structure prediction method, an electronic device, a program product and a storage medium, which obtain the character feature of each character in each sentence of the text data, the sentence feature of each sentence, and the context feature of the text data, where the character feature characterizes the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature contains the semantic information of the sentence as a whole, and the context feature contains the semantic information of the entire text data. These three features each carry information at a different level, so that multiple kinds of information can assist the prediction of the prosodic structure of a sentence; since these different levels of information include contextual information, the prediction accuracy of the prosodic structure of a sentence can be greatly improved.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory and do not limit the present application.
Brief Description of the Drawings
The accompanying drawings here are incorporated into and constitute a part of this specification, illustrate embodiments consistent with the present application, and together with the specification serve to explain the principles of the present application.
Fig. 1 is a flowchart of Chinese prosodic structure prediction in the related art.
Fig. 2 is a flowchart of a prosodic structure prediction method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a prosodic structure prediction system according to an embodiment of the present application.
Fig. 4 is a schematic diagram of BERT language model training according to an embodiment of the present application.
Fig. 5 is a schematic diagram of a semantic feature extraction process according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a multi-level encoder according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a character-level encoder according to an embodiment of the present application.
Fig. 8 is a schematic diagram of a sentence-level encoder according to an embodiment of the present application.
Fig. 9 is a schematic diagram of a context-level encoder according to an embodiment of the present application.
Fig. 10 is a flowchart of a prosodic structure prediction method according to another embodiment of the present application.
Fig. 11 is a schematic diagram of a prosodic structure decoder according to an embodiment of the present application.
Fig. 12 is a schematic diagram of a prosodic structure prediction system according to another embodiment of the present application.
Fig. 13 shows an application scenario of a prosodic structure prediction method according to an embodiment of the present application.
Fig. 14 is a hardware structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with some aspects of the present application as detailed in the appended claims.
The terms used in the present application are for the purpose of describing particular embodiments only and are not intended to limit the present application. The singular forms "a", "said" and "the" used in the present application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present application to describe various information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present application, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
Text-to-speech (TTS) technology intelligently converts text into audio. TTS has been widely applied in smartphones, voice assistants, smart navigation, e-books and other products. The prosodic structure of the text can help a speech synthesis system generate natural, intelligible speech. Prosodic structure refers to the hierarchical organization of the pause structure of an utterance. Chinese prosodic structure is a three-level structure comprising prosodic words (Prosody Word, PW), prosodic phrases (Prosody Phrase, PPH) and intonation phrases (Intonation Phrase, IPH). One or more Chinese characters constitute a PW, multiple PWs constitute a PPH, multiple PPHs constitute an IPH, and a sentence can in turn be divided into multiple IPHs. PW, PPH and IPH have no fixed composition rules, and prosodic structures at different levels exhibit different pause durations. An accurate prosodic structure helps synthesize highly natural speech, so prosodic structure prediction is an important part of TTS technology.
In the related art, the prediction process for Chinese prosodic structure is shown in Fig. 1: a feature extractor 110 extracts text features from a Chinese character sequence, and the text features are then fed into a prosody predictor 120 to extract the Chinese prosodic structure of the character sequence. The inventors found that the prosodic structure of a sentence is determined by the semantic information of the context in which the sentence occurs: it is related not only to the semantic information of the sentence itself, but is also influenced by the semantic information of the other sentences in the context of the text paragraph in which the sentence is located. In other words, the prosodic structure of a sentence is related to the position of the sentence within the text paragraph; the same sentence placed at different positions in a paragraph will have a different prosodic structure.
In the related art, however, the vast majority of schemes only consider extracting semantic information from the current sentence to predict its prosodic structure, ignoring the influence of the semantic information of the other sentences in the sentence's context. As a result, the predicted prosodic structure of a sentence is the same no matter where the sentence appears in a text paragraph, which lowers the accuracy of the predicted prosodic structure, and the naturalness of the subsequently synthesized audio is also unsatisfactory. Based on the above problems, the present application proposes a prosodic structure prediction method comprising the steps shown in Fig. 2:
Step 210: acquiring text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
Step 220: acquiring the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data;
wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
Step 230: for each sentence, predicting the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
Optionally, in the step of acquiring the character features, the semantic feature of each sentence may first be extracted from the text data, and the character feature of each character in each sentence may then be acquired using the semantic feature of the sentence. Of course, the way of acquiring character features is not limited to the above; those skilled in the art may also obtain the character feature of each character in each sentence in other ways.
In the prosodic structure prediction method provided by the present application, the character feature of each character in each sentence, the sentence feature of each sentence, and the context feature of the text data are obtained, where the character feature characterizes the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature contains the semantic information of the sentence as a whole, and the context feature contains the semantic information of the entire text data. In this way, when these three features are used to predict the prosodic structure of a sentence, not only the semantic information of the sentence itself is considered, but the semantic information of the entire text data, i.e. the semantic information of the other sentences in the sentence's context, is also used. The predicted prosodic structure is influenced by the semantics of the sentence's context, so its accuracy is greatly improved.
In some embodiments, the prosodic structure prediction method shown in Fig. 2 can be applied to the prosodic structure prediction system 300 shown in Fig. 3, which comprises a feature extractor 310, a multi-level encoder 320 and a prosodic structure decoder 330. The feature extractor 310 is used to extract the semantic feature of each sentence from the text data. The multi-level encoder 320 is used to extract the character feature of each character in a sentence, the sentence feature of the sentence, and the context feature of the text data. The prosodic structure decoder 330 is used to predict the prosodic structure of a sentence using the character feature of each character in the sentence, the sentence feature of the sentence, and the context feature of the text data.
In some embodiments, during the training stage, the feature extractor 310 performs unsupervised learning on a large amount of data, while the multi-level encoder 320 and the prosodic structure decoder 330 perform supervised learning as a whole, their training data being semantic features annotated with prosodic structures. During training, the Adam optimizer is used; it combines the advantages of the AdaGrad and RMSProp optimization algorithms, jointly considering the first- and second-moment estimates of the gradient to compute the update step size. The number of samples selected per training step (batch size) is 64, the learning rate is 0.00001, and training continues until the loss on the validation set no longer decreases.
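As an illustration, this training configuration can be sketched in PyTorch as follows; the feature dimensions and the linear stand-in modules are assumptions for illustration only, not the patent's actual network definitions (those are sketched in the sections below):

```python
import torch
from torch import nn

# Stand-ins for the multi-level encoder 320 and the prosodic structure decoder 330;
# sketches of the actual structures appear in the sections below.
encoder = nn.Linear(768, 256)
decoder = nn.Linear(256, 3)                      # three boundary types: PW / PPH / IPH

# The encoder and decoder are trained jointly, as one set of parameters.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-5)    # learning rate 0.00001
criterion = nn.BCEWithLogitsLoss()               # per-character binary boundary labels

def run_epoch(loader):
    """One pass over batches of 64 annotated samples."""
    for feats, labels in loader:
        optimizer.zero_grad()
        loss = criterion(decoder(encoder(feats)), labels)
        loss.backward()
        optimizer.step()
```

Training would repeat run_epoch until the loss on the validation set stops decreasing, as described above.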
In some embodiments, the feature extractor 310 may be a BERT (Bidirectional Encoder Representations from Transformers) language model. The BERT language model can be pre-trained on a large amount of Chinese text data to extract effective semantic features. As shown in Fig. 4, during training the BERT model can accept a character sequence composed of one or more sentences as input, where the input character sequence contains some randomly masked characters. The masking mechanism is as follows: each character in the character sequence has a 15% probability of being selected; a selected character has an 80% probability of being replaced with [MASK], a 10% probability of being randomly replaced with another character, and a 10% probability of being kept as the character itself. To ensure that effective semantic features can be extracted, the training objectives of the BERT model include predicting the masked characters and judging whether the input sentences form connected context in actual discourse. Selecting the BERT language model as the feature extractor makes it possible to extract the semantic information in text data more effectively than manually constructed features such as word vectors and word embeddings.
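A minimal sketch of this masking scheme in Python (the function and variable names are illustrative, not from the patent):

```python
import random

def mask_characters(chars, vocab, mask_token="[MASK]"):
    """Apply the 15% / 80% / 10% / 10% masking scheme to a character sequence."""
    masked, targets = [], []
    for ch in chars:
        if random.random() < 0.15:                   # 15%: character is selected
            targets.append(ch)                       # model must recover the original
            r = random.random()
            if r < 0.8:
                masked.append(mask_token)            # 80%: replace with [MASK]
            elif r < 0.9:
                masked.append(random.choice(vocab))  # 10%: replace with a random character
            else:
                masked.append(ch)                    # 10%: keep the character itself
        else:
            masked.append(ch)
            targets.append(None)                     # not selected, no prediction target
    return masked, targets
```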
In other embodiments, the feature extractor 310 may also be a pre-trained XLNET language model (Generalized Autoregressive Pretraining for Language Understanding). The feature extractor 310 is not limited to the above two models; those skilled in the art may, according to actual needs, select another model capable of extracting semantic features from text data as the feature extractor 310.
In some embodiments, the semantic feature of each sentence extracted from the text data may be a character-level semantic feature or a word-level semantic feature. A character-level semantic feature represents each character by an N-dimensional vector; likewise, a word-level semantic feature represents each word by an N-dimensional vector. The N-dimensional vectors corresponding to all the characters/words in a sentence constitute the semantic feature of that sentence. For example, if the feature extractor 310 is a BERT language model, the extracted semantic features may be character-level or word-level semantic features. If the feature extractor 310 is an XLNET language model, the extracted semantic features may be word-level semantic features. If the feature extractor 310 is another model that can extract semantic features from text data, the text data can first be segmented into words, and word-level semantic features can then be extracted from the segmented text data.
Taking a BERT language model as the feature extractor 310 as an example, as shown in Fig. 5, the BERT language model can accept text data comprising at least two consecutive sentences as input; for example, the text data can be "我吃西红柿。它真的太好吃了！你要不要也来一个" ("I eat tomatoes. It's really delicious! Do you want to have one too?"), three sentences in total. The BERT language model then extracts the semantic features of each sentence separately. The semantic features of the multiple sentences serve as one set of input features; for example, the semantic features of the individual sentences are combined into a set of input features in the form of multiple channels, with one channel being the semantic features of one sentence, and are input into the multi-level encoder 320 in this form. The multi-level encoder 320 uses the semantic features of each sentence to obtain the character feature of each character in each sentence, the sentence feature of the sentence, and the context feature of the text data.
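For illustration, extracting per-sentence, character-level semantic features could look as follows using the Hugging Face transformers library and the public bert-base-chinese checkpoint; the patent names neither a specific library nor a specific checkpoint, so both are assumptions:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

sentences = ["我吃西红柿。", "它真的太好吃了！", "你要不要也来一个？"]

semantic_features = []
with torch.no_grad():
    for s in sentences:
        inputs = tokenizer(s, return_tensors="pt")
        hidden = bert(**inputs).last_hidden_state    # (1, L, 768): one vector per token
        semantic_features.append(hidden.squeeze(0))  # character-level semantic feature

# In practice the sentences would be padded to a common length so that each
# sentence's feature matrix can serve as one channel of the encoder input.
```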
As shown in Fig. 6, the multi-level encoder 320 includes a character-level encoder 321, a sentence-level encoder 322 and a context encoder 323. The character-level encoder 321 is used to obtain the character feature of each character in a sentence from the semantic feature of the sentence. The sentence-level encoder 322 is used to obtain the sentence feature of the sentence using the character features of the characters in the sentence. The context encoder 323 is used to obtain the context feature of the text data using the sentence features of the individual sentences.
With the semantic features of the multiple sentences input into the character-level encoder 321 as one set of input features, the character feature of each character in each sentence can be obtained. In some embodiments, as shown in Fig. 7, the character-level encoder 321 may include at least one Transformer model 3211 based on a self-attention mechanism, for example the two Transformer models 3211 shown in Fig. 7. Through the self-attention mechanism built into the Transformer model 3211, the long-distance dependencies present in the text data can be better modeled, so that the character feature of each character extracted by the character-level encoder 321 can characterize the influence of the other characters in the sentence on the prosodic structure of that character. In some embodiments, positional encoding can be added to the semantic features of the sentence so that the Transformer model can distinguish the position of each character in the semantic features and can take the position information of the characters into account during computation; the output character features can then characterize the influence of characters at different distances in the sentence on the prosodic structure of the character. The positional encoding can be implemented with reference to the related art and is not discussed further here.
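A sketch of such a character-level encoder using PyTorch's built-in Transformer layers; the two layers mirror the two Transformer models 3211 of Fig. 7, while the sinusoidal positional encoding follows the common scheme from the Transformer literature, since the patent defers the positional encoding itself to the related art:

```python
import math
import torch
from torch import nn

class CharEncoder(nn.Module):
    """Character-level encoder: positional encoding + stacked self-attention layers."""
    def __init__(self, d_model=768, n_layers=2, n_heads=8, max_len=512):
        super().__init__()
        # Sinusoidal positional encoding, so the model can tell positions apart.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):              # x: (batch, L, d_model) semantic features
        x = x + self.pe[: x.size(1)]   # inject character position information
        return self.encoder(x)         # (batch, L, d_model) character features
```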
After the character-level encoder 321 extracts the character feature of each character from the semantic feature of each sentence, the character features of the multiple sentences are input into the sentence-level encoder 322 as one set of input features. Based on the character features of the characters in each sentence, the sentence-level encoder 322 can aggregate the information carried by each sentence into one vector, i.e. the sentence feature. The sentence feature of each sentence contains the semantic information of the sentence as a whole. In some embodiments, the sentence-level encoder 322 includes several stacked convolutional layers; in other embodiments, the sentence encoder 322 may include a recurrent neural network (RNN). Taking several stacked convolutional layers as an example, as shown in Fig. 8, the sentence-level encoder 322 includes three stacked convolutional layers (Convolutional Neural Networks, CNN) 3221-3223 and one pooling layer 3224. The pooling layer 3224 may be a max-pooling layer or an average-pooling layer. The character features of each sentence are processed in turn by the stacked convolutional layers 3221-3223, and the outputs of the convolutional layers 3221-3223 are processed by the pooling layer 3224 and concatenated, aggregating the information of the sentence into the sentence feature. That is, the sentence feature of a sentence is obtained by passing the output features of each convolutional layer through the pooling layer and then mixing them.
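A sketch of such a sentence-level encoder; the kernel size, hidden width and ReLU activations are assumptions, since the patent specifies only three stacked convolutional layers whose pooled outputs are concatenated:

```python
import torch
from torch import nn

class SentenceEncoder(nn.Module):
    """Three stacked 1-D convolutions; each layer's output is max-pooled over
    the character axis and the pooled vectors are concatenated."""
    def __init__(self, d_in=768, d_hidden=256):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Conv1d(d_in, d_hidden, kernel_size=3, padding=1),
            nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1),
            nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1),
        ])

    def forward(self, char_feats):               # (batch, L, d_in)
        x = char_feats.transpose(1, 2)           # Conv1d expects (batch, channels, L)
        pooled = []
        for conv in self.convs:
            x = torch.relu(conv(x))
            pooled.append(x.max(dim=2).values)   # pooling over the character axis
        return torch.cat(pooled, dim=1)          # (batch, 3 * d_hidden) sentence feature
```

Since the context-level encoder 323 described next has the same structure, the same module could be instantiated a second time, with d_in set to the sentence-feature dimension, and applied to the sequence of sentence features.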
After the sentence-level encoder 322 extracts the sentence feature of each sentence from its character features, the sentence features of the multiple sentences are input into the context-level encoder 323 as one set of input features. Based on the sentence feature of each sentence, the context-level encoder 323 can aggregate the information carried by the entire text data into one vector, i.e. the context feature, which contains the semantic information of the text data. In some embodiments, the context-level encoder 323 includes several stacked convolutional layers. As shown in Fig. 9, the context-level encoder 323 includes three stacked convolutional layers 3231-3233 and one pooling layer 3234. The pooling layer 3234 may be a max-pooling layer or an average-pooling layer. The sentence features of the multiple sentences are processed in turn by the stacked convolutional layers 3231-3233, and the outputs of the convolutional layers 3231-3233 are processed by the pooling layer 3234 and concatenated, aggregating the information of the text data into the context feature. That is, the context feature is obtained by passing the output features of each convolutional layer through the pooling layer and then mixing them.
After the character feature of each character in each sentence, the sentence feature of the sentence and the context feature of the text data have been obtained, these three features can be used to predict the prosodic structure of the sentence. In some embodiments, the above step 230 may include the steps shown in Fig. 10:
Step 241: for each sentence, mixing the character feature of each character of the sentence, the sentence feature of the sentence and the context feature to obtain a mixed hierarchical feature;
Step 242: extracting the prosodic structure of the sentence from the mixed hierarchical feature.
In some embodiments, the prosodic structure decoder 330 may be used to predict the prosodic structure of a sentence. The prosodic structure decoder 330 may include several classifiers for classifying different types of prosodic structures, which may include a prosodic word classifier, a prosodic phrase classifier and an intonation phrase classifier. The prosodic structure decoder 330 may include a combination of one or more of the prosodic word classifier, the prosodic phrase classifier and the intonation phrase classifier.
In some embodiments, the classifiers may be feed-forward neural networks (FNN). In some embodiments, in order to analyze the features extracted by the multi-level encoder 320 (the character features of a sentence, its sentence feature and the context feature, or the mixed hierarchical features) and extract information useful for classification, the prosodic structure decoder 330 may further include a Gated Recurrent Unit (GRU) network connected to each classifier. As shown in Fig. 11, the prosodic structure decoder 330 may include a PW GRU 331, a PPH GRU 332 and an IPH GRU 333, as well as a PW FFN 334 connected to the PW GRU 331, a PPH FFN 335 connected to the PPH GRU 332, and an IPH FFN 336 connected to the IPH GRU 333. The PW GRU 331 is responsible for extracting information useful for PW classification, and the PW FFN 334 performs the PW classification for each sentence; the PPH GRU 332 is responsible for extracting information useful for PPH classification, and the PPH FFN 335 performs the PPH classification for each sentence; the IPH GRU 333 is responsible for extracting information useful for IPH classification, and the IPH FFN 336 performs the IPH classification for each sentence.
It will be appreciated that prosodic structure prediction can in fact be regarded as three prediction subtasks: PW prediction, PPH prediction and IPH prediction. Each prediction subtask is performed by its own independent binary classifier, and each binary classifier judges whether the boundary of its corresponding type of prosodic structure occurs after each character of the sentence. For example, the PW FFN 334 judges whether a PW boundary follows each character of the sentence, the PPH FFN 335 judges whether a PPH boundary follows each character, and the IPH FFN 336 judges whether an IPH boundary follows each character.
Furthermore, the three prediction subtasks are interdependent. The PW prediction subtask does not depend on the other subtasks, so the PW GRU 331 accepts the mixed hierarchical features as input, and its output features are classified by the PW FFN 334 to produce the PW prediction result. The PPH prediction subtask depends on the information extracted in the PW prediction subtask, so the input features of the PPH GRU 332 include the output features of the PW GRU 331 in addition to the mixed hierarchical features. The IPH prediction subtask depends on the information extracted in both the PW and PPH prediction subtasks, so the input features of the IPH GRU 333 include the output features of the PW GRU 331 and the PPH GRU 332 in addition to the mixed hierarchical features. In this way, after the three classifiers predict the boundaries of the three levels of prosodic structure, text data annotated with the prosodic structure can finally be output.
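A sketch of this cascaded decoder; the GRU hidden width and the single-layer feed-forward classifiers are simplifying assumptions, since the patent specifies only GRU networks feeding feed-forward binary boundary classifiers:

```python
import torch
from torch import nn

class ProsodyDecoder(nn.Module):
    """Cascaded PW -> PPH -> IPH decoder: each GRU reads the mixed hierarchical
    features plus the outputs of the GRUs it depends on; a per-character
    feed-forward classifier then decides boundary / no boundary."""
    def __init__(self, d_mix=1536, d_gru=128):
        super().__init__()
        self.pw_gru = nn.GRU(d_mix, d_gru, batch_first=True)
        self.pph_gru = nn.GRU(d_mix + d_gru, d_gru, batch_first=True)
        self.iph_gru = nn.GRU(d_mix + 2 * d_gru, d_gru, batch_first=True)
        self.pw_ffn = nn.Linear(d_gru, 1)   # PW boundary after this character?
        self.pph_ffn = nn.Linear(d_gru, 1)  # PPH boundary?
        self.iph_ffn = nn.Linear(d_gru, 1)  # IPH boundary?

    def forward(self, mixed):               # mixed: (batch, L, d_mix)
        pw_h, _ = self.pw_gru(mixed)
        pph_h, _ = self.pph_gru(torch.cat([mixed, pw_h], dim=2))
        iph_h, _ = self.iph_gru(torch.cat([mixed, pw_h, pph_h], dim=2))
        return self.pw_ffn(pw_h), self.pph_ffn(pph_h), self.iph_ffn(iph_h)
```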
In some embodiments, after the prosodic structure of each sentence has been predicted, the text data and the prosodic structure of each sentence can further be used to synthesize audio data that exhibits the prosodic structure. The text data can be synthesized into the corresponding audio data with reference to the TTS technology of the related art, which is not described in detail here.
As can be seen from the embodiments described above, with the prosodic structure prediction method provided by the present application, the user only needs to provide text data containing multiple sentences from the same context; the prosodic structure prediction system directly accepts the text data as input, requires no additional preprocessing, and outputs the prosodic structure of the multiple sentences, realizing prosodic structure prediction in an end-to-end manner.
Furthermore, the present application uses a multi-level encoder to extract features from the multiple sentences, obtaining multi-level features that comprise the character feature of each character in each sentence, the sentence feature of each sentence, and the context feature of the entire text data. These multi-level features better model the contextual semantic information present inside the text data, so that the semantic information of a sentence's context can be consulted when predicting its prosodic structure, greatly improving the accuracy of prosodic structure prediction.
The present application further provides a prosodic structure prediction method implemented by the end-to-end prosodic structure prediction system shown in Fig. 12. The user may input multiple consecutive sentences, for example the three consecutive sentences in the figure, "我吃西红柿。它真的太好吃了！你要不要也来一个？" ("I eat tomatoes. It's really delicious! Do you want to have one too?"). For convenience, these three sentences are referred to below as S1, S2 and S3. The three sentences input by the user pass through the BERT language model 1210, which extracts the semantic features of S1, S2 and S3 respectively; the semantic features of all three sentences are character-level semantic features.
The semantic features of the three sentences are then input into the multi-level encoder, which includes a character-level encoder 1221, a sentence-level encoder 1222 and a context-level encoder 1223 connected in sequence. After the semantic features of the three sentences pass through the character-level encoder 1221, the character feature of each character in each sentence can be extracted; that is, each character in a sentence is represented by a D1-dimensional vector. If a sentence includes L characters, the character features of the L characters can be concatenated into an L*D1 matrix as the character features of the sentence (as shown in the figure). As shown, the semantic features of S1, S2 and S3 pass through the character-level encoder 1221 to yield the character features of S1, S2 and S3 respectively.
The character features of the three sentences are then input into the sentence-level encoder 1222, which can extract the information contained in each sentence from the character features of its characters and represent each sentence by a D2-dimensional vector. As shown, the character features of S1, S2 and S3 pass through the sentence-level encoder 1222 to yield the sentence features of S1, S2 and S3 respectively. The sentence features of the three sentences are then input together into the context-level encoder 1223, which extracts the overall semantic information of the three sentences and represents them by a single D3-dimensional vector, yielding the context feature.
For each sentence (the figure shows the processing flow for only one sentence), the character features of the sentence, the sentence feature of the sentence and the context feature of the entire text data are mixed to obtain the mixed hierarchical features. The D2-dimensional sentence feature and the D3-dimensional context feature can be copied several times and then concatenated with the character features, finally forming an L*(D1+D2+D3) matrix (as shown in the figure) as the mixed hierarchical features.
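The tiling-and-concatenation step can be sketched as follows (names are illustrative):

```python
import torch

def mix_features(char_feats, sent_feat, ctx_feat):
    """char_feats: (L, D1), sent_feat: (D2,), ctx_feat: (D3,)
    returns the (L, D1 + D2 + D3) mixed hierarchical feature matrix."""
    L = char_feats.size(0)
    sent = sent_feat.unsqueeze(0).expand(L, -1)  # copy the D2-dim sentence feature L times
    ctx = ctx_feat.unsqueeze(0).expand(L, -1)    # copy the D3-dim context feature L times
    return torch.cat([char_feats, sent, ctx], dim=1)
```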
After the mixed hierarchical features are input into the prosodic structure decoder 1230, the prosodic structure of the sentence can be predicted through the three prediction subtasks. The end-to-end prosodic structure prediction system finally outputs the result "我#1吃#1西红柿。#3它#1真的#2太好吃了！#3你#1要不要#1也来一个？#3" ("I #1 eat #1 tomatoes. #3 It #1 is really #2 delicious! #3 Do you #1 want #1 to have one too? #3").
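For illustration, turning the three per-character boundary decisions into the "#1/#2/#3"-annotated output shown above could look like this; the convention that the deepest boundary label wins at a position is an assumption inferred from the example output:

```python
def annotate(chars, pw, pph, iph):
    """chars: list of characters; pw/pph/iph: per-character 0/1 boundary decisions."""
    out = []
    for ch, w, p, i in zip(chars, pw, pph, iph):
        out.append(ch)
        if i:
            out.append("#3")    # intonation phrase boundary
        elif p:
            out.append("#2")    # prosodic phrase boundary
        elif w:
            out.append("#1")    # prosodic word boundary
    return "".join(out)

# annotate(list("我吃西红柿。"), [1,1,0,0,0,0], [0]*6, [0,0,0,0,0,1])
# -> '我#1吃#1西红柿。#3'
```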
For the specific implementation, see the embodiments above; it is not repeated here.
In this way, through the method described above, the user only needs to provide text data containing multiple sentences from the same context; the prosodic structure prediction system directly accepts the text data as input, requires no additional preprocessing, and outputs the prosodic structure of the multiple sentences, realizing prosodic structure prediction in an end-to-end manner.
Furthermore, by using the multi-level encoder to extract features from the multiple sentences, multi-level features are obtained, comprising the character feature of each character in each sentence, the sentence feature of each sentence, and the context feature of the entire text data, thereby better modeling the contextual semantic information present inside the text data. When predicting the prosodic structure of each sentence, the semantic information of the sentence's context can thus be consulted, greatly improving the accuracy of prosodic structure prediction.
The prosodic structure prediction method provided by the present application can also be applied in live broadcast scenarios. For example, when a virtual host performs a live broadcast of an audio novel, the host's voice is synthesized using TTS technology; if the synthesized voice is closer to a real person's pronunciation and has higher naturalness, it will attract more listeners. As shown in Fig. 13, the prosodic structure prediction method provided by the present application can be executed by a live broadcast server, which may be a single server or a server cluster composed of multiple servers. As shown in Fig. 13, live broadcast server 1311 can predict the prosodic structure of the text data of an audiobook using the method provided by any of the embodiments above, and then send the text data annotated with the prosodic structure to another live broadcast server 1312. Live broadcast server 1312 can use TTS technology to synthesize the annotated text data into audio data exhibiting the predicted prosodic structure, and send the audio data to each audience terminal 1320 in the live broadcast room.
Besides live broadcast scenarios, the prosodic structure prediction method provided by the present application can also be applied in smartphones, voice assistants, smart navigation, e-books and other products, which is not limited here.
Based on the prosodic structure prediction method described in any of the embodiments above, the present application further provides a computer program product comprising a computer program which, when executed by a processor, can be used to perform the prosodic structure prediction method described in any of the embodiments above.
Based on the prosodic structure prediction method described in any of the embodiments above, the present application further provides an electronic device whose structure is shown schematically in Fig. 14. As shown in Fig. 14, at the hardware level the electronic device includes a processor, an internal bus, a network interface, memory and non-volatile storage, and may of course also include hardware required by other services. The processor reads the corresponding computer program from the non-volatile storage into memory and runs it, the processor being configured to:
acquire text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
acquire the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data; wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
for each sentence, predict the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
The electronic device above may be a live broadcast server, or may be a user's terminal device, such as a mobile phone, a tablet computer or another terminal device.
Based on the prosodic structure prediction method described in any of the embodiments above, the present application further provides a computer storage medium storing a computer program which, when executed by a processor, can be used to perform the prosodic structure prediction method described in any of the embodiments above.
Specific embodiments of the present application have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In certain implementations, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the present application will readily occur to those skilled in the art from consideration of the specification and practice of the invention disclosed here. The present application is intended to cover any variations, uses or adaptations that follow its general principles and include common general knowledge or customary technical means in this technical field not disclosed by the present application. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application being indicated by the following claims.

Claims (17)

  1. A prosodic structure prediction method, characterized in that the method comprises:
    acquiring text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
    acquiring the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data; wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
    for each sentence, predicting the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  2. The method according to claim 1, characterized in that the step of acquiring the character features comprises:
    extracting the semantic feature of each sentence from the text data;
    using the semantic feature of each sentence, acquiring the character feature of each character in the sentence.
  3. The method according to claim 2, characterized in that the semantic features comprise one or more of the following: character-level semantic features or word-level semantic features.
  4. The method according to claim 1, characterized in that predicting, for each sentence, the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence and the context feature comprises:
    for each sentence, mixing the character feature of each character of the sentence, the sentence feature of the sentence and the context feature to obtain a mixed hierarchical feature;
    extracting the prosodic structure of the sentence from the mixed hierarchical feature.
  5. The method according to claim 1, characterized in that, after predicting the prosodic structure of each sentence, the method further comprises:
    using the text data and the prosodic structure of each sentence, synthesizing audio data that exhibits the prosodic structure.
  6. The method according to claim 2, characterized in that the method is applied to a prosodic structure prediction system, the prosodic structure prediction system comprising:
    a feature extractor for extracting the semantic feature of each sentence from the text data;
    a multi-level encoder for extracting the character feature of each character in the sentence, the sentence feature of the sentence, and the context feature of the text data;
    a prosodic structure decoder for predicting the prosodic structure of the sentence using the character feature of each character in the sentence, the sentence feature of the sentence, and the context feature of the text data.
  7. The method according to claim 6, characterized in that the feature extractor is a BERT model.
  8. The method according to claim 6, characterized in that the multi-level encoder comprises:
    a character-level encoder for obtaining, from the semantic feature of each sentence, the character feature of each character in the sentence;
    a sentence-level encoder for obtaining the sentence feature of the sentence using the character feature of each character in the sentence;
    a context-level encoder for obtaining the context feature of the text data using the sentence features of the individual sentences.
  9. The method according to claim 8, characterized in that the character-level encoder comprises at least one Transformer model based on a self-attention mechanism, so that the extracted character feature of each character characterizes the influence of the other characters in the sentence on the prosodic structure of that character.
  10. The method according to claim 8, characterized in that the sentence-level encoder comprises several stacked convolutional layers, and the sentence feature of the sentence is obtained by passing the output features of each convolutional layer through a pooling layer and then mixing them.
  11. The method according to claim 8, characterized in that the context-level encoder comprises several stacked convolutional layers, and the context feature is obtained by passing the output features of each convolutional layer through a pooling layer and then mixing them.
  12. The method according to claim 6, characterized in that the prosodic structure decoder comprises several classifiers for classifying different types of prosodic structures.
  13. The method according to claim 12, characterized in that the several classifiers for classifying different types of prosodic structures comprise: a prosodic word classifier, a prosodic phrase classifier, and an intonation phrase classifier.
  14. The method according to claim 5, characterized in that the method is executed by a live broadcast server, the text data is the text data of an audiobook, and the method further comprises:
    sending the synthesized audio data to the audience terminals.
  15. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-14.
  16. An electronic device, characterized in that the electronic device comprises:
    a processor;
    a memory for storing processor-executable instructions;
    wherein the processor is configured to:
    acquire text data comprising at least two consecutive sentences, each sentence comprising a plurality of characters;
    acquire the character feature of each character in each of the sentences, the sentence feature of the sentence, and the context feature of the text data; wherein the character feature is used to characterize the influence of the other characters in the sentence on the prosodic structure of that character, the sentence feature of the sentence is obtained using the character feature of each character, and the context feature is obtained using the sentence features of the individual sentences;
    for each sentence, predict the prosodic structure of the sentence using the character feature of each character of the sentence, the sentence feature of the sentence, and the context feature.
  17. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the method according to any one of claims 1-14.
PCT/CN2021/137240 2021-12-10 2021-12-10 Prosodic structure prediction method, electronic device, program product and storage medium WO2023102931A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137240 WO2023102931A1 (zh) 2021-12-10 2021-12-10 Prosodic structure prediction method, electronic device, program product and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/137240 WO2023102931A1 (zh) 2021-12-10 2021-12-10 Prosodic structure prediction method, electronic device, program product and storage medium

Publications (1)

Publication Number Publication Date
WO2023102931A1 true WO2023102931A1 (zh) 2023-06-15

Family

ID=86729532

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137240 WO2023102931A1 (zh) 2021-12-10 2021-12-10 Prosodic structure prediction method, electronic device, program product and storage medium

Country Status (1)

Country Link
WO (1) WO2023102931A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130188863A1 (en) * 2012-01-25 2013-07-25 Richard Linderman Method for context aware text recognition
CN108470024A (zh) * 2018-03-12 2018-08-31 北京灵伴即时智能科技有限公司 一种融合句法语义语用信息的汉语韵律结构预测方法
US20190164551A1 (en) * 2017-11-28 2019-05-30 Toyota Jidosha Kabushiki Kaisha Response sentence generation apparatus, method and program, and voice interaction system
CN111274807A (zh) * 2020-02-03 2020-06-12 华为技术有限公司 文本信息的处理方法及装置、计算机设备和可读存储介质
CN112463921A (zh) * 2020-11-25 2021-03-09 平安科技(深圳)有限公司 韵律层级划分方法、装置、计算机设备和存储介质
CN112771607A (zh) * 2018-11-14 2021-05-07 三星电子株式会社 电子设备及其控制方法



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21966849

Country of ref document: EP

Kind code of ref document: A1