CN110534087B - Text prosody hierarchical structure prediction method, device, equipment and storage medium - Google Patents

Text prosody hierarchical structure prediction method, device, equipment and storage medium

Info

Publication number
CN110534087B
Authority
CN
China
Prior art keywords: word, sequence, prosodic, hierarchy, vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910834143.5A
Other languages
Chinese (zh)
Other versions
CN110534087A (en)
Inventor
康世胤 (Kang Shiyin)
吴志勇 (Wu Zhiyong)
杜耀 (Du Yao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Shenzhen Graduate School Tsinghua University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Shenzhen Graduate School Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Shenzhen Graduate School Tsinghua University filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910834143.5A
Publication of CN110534087A
Application granted
Publication of CN110534087B
Legal status: Active
Anticipated expiration: not listed

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application discloses a prosodic hierarchy structure prediction method, a device, equipment and a storage medium based on artificial intelligence, wherein the method comprises the following steps: acquiring a target text; performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence; performing word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction; and obtaining a prosodic hierarchy sequence corresponding to the word-level feature sequence through a prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a deep neural network model based on a self-attention mechanism. The method can effectively improve the prediction precision of the prosodic hierarchy structure.

Description

Text prosody hierarchical structure prediction method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech technology, and in particular, to a text prosody hierarchy prediction method, apparatus, device, and storage medium based on an artificial intelligence self-attention mechanism.
Background
The prosodic hierarchy models prosodic features such as speech pauses and rhythm. In the text-processing front end of a speech synthesis system, the prosodic structure prediction task determines the prosodic structure type of each grammatical word in a sentence according to text features.
Prosodic structure prediction is important for the naturalness of the speech synthesized by a speech synthesis system. At present, prosodic structure prediction is commonly modeled with a Conditional Random Field (CRF) or a Recurrent Neural Network (RNN), but in practical applications the modeling performance of both schemes is limited, which in turn constrains the quality of speech synthesis.
Disclosure of Invention
The embodiment of the application provides a text prosody hierarchy prediction method, apparatus, device, and storage medium based on artificial intelligence, which can effectively improve the prediction accuracy of the prosodic hierarchy.
In view of the above, a first aspect of the present application provides a text prosody hierarchy prediction method based on artificial intelligence, including:
acquiring a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
performing word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction;
and obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism.
A second aspect of the present application provides an apparatus for predicting a text prosody hierarchy based on artificial intelligence, comprising:
the acquisition module is used for acquiring a target text;
the word segmentation and part-of-speech tagging module is used for performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
the word level feature extraction module is used for extracting word level features according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level features of each word in the word level feature sequence at least comprise a word vector obtained by semantic feature extraction;
and the prosodic hierarchy structure prediction module is used for obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, and the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism.
A third aspect of the present application provides an artificial intelligence based text prosody hierarchy prediction device, the device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to execute the text prosody hierarchy prediction method according to the first aspect.
A fourth aspect of the present application provides a computer-readable storage medium for storing a computer program for executing the text prosody hierarchy prediction method of the first aspect.
A fifth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of text prosodic hierarchy prediction of the first aspect described above.
According to the technical scheme, the embodiment of the application has the following advantages:
the embodiment of the application provides a text prosody hierarchical structure prediction method, which predicts a prosody hierarchical structure by using a deep neural network model based on a self-attention mechanism and effectively improves the prediction accuracy of the prosody hierarchical structure. Specifically, in the prosodic hierarchy structure prediction method provided in the embodiment of the present application, after a target text is obtained, word segmentation and part-of-speech tagging are performed on the target text to obtain a word segmentation tagging sequence; then, extracting word-level features according to the word segmentation tagging sequence to obtain a word-level feature sequence, wherein the word-level features of each word in the word-level feature training at least comprise word vectors obtained through semantic feature extraction; and further, obtaining a prosodic hierarchy sequence corresponding to the word-level feature sequence through a prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a deep neural network model based on a self-attention mechanism. The prosodic hierarchy structure prediction method adopts the depth neural network model based on the self-attention mechanism to predict the prosodic hierarchy structure of each participle in the target text, the depth neural network model based on the self-attention mechanism can capture the context dependency relationship among the participles in the whole sentence range, and compared with a CRF model and an RNN model in the related technology, the depth neural network model based on the self-attention mechanism has better sequence modeling capability, so that the prediction effect of the prosodic hierarchy structure can be effectively improved, and the quality of voice synthesis is correspondingly improved.
Drawings
Fig. 1 is a schematic view of an application scenario of an artificial intelligence-based prosody hierarchy prediction method according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating an artificial intelligence based prosody hierarchy prediction method according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of an embodiment of a prosodic hierarchy prediction model;
FIG. 4 is a flowchart illustrating a training method of a prosodic hierarchy prediction model according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of calculating similarity of scaled dot product attention according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating a computing process of a multi-head attention mechanism according to an embodiment of the present disclosure;
fig. 7 is a schematic operation diagram of a fully connected network sublayer provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of residual concatenation provided by an embodiment of the present application;
FIG. 9 is a flowchart illustrating another prosodic hierarchy prediction method according to an embodiment of the present disclosure;
FIG. 10 is a schematic structural diagram of a first apparatus for predicting prosody hierarchy based on artificial intelligence according to an embodiment of the present disclosure;
FIG. 11 is a schematic structural diagram of a second artificial intelligence-based prosody hierarchy prediction device according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of a third artificial intelligence-based prosody hierarchy prediction device according to an embodiment of the present application;
FIG. 13 is a schematic structural diagram of a fourth artificial intelligence-based prosody hierarchy prediction device according to an embodiment of the present disclosure;
FIG. 14 is a schematic structural diagram of a fifth artificial intelligence based prosody hierarchy prediction apparatus according to an embodiment of the present application;
FIG. 15 is a schematic structural diagram of an artificial intelligence-based prosodic hierarchy prediction server according to an embodiment of the present disclosure;
fig. 16 is a schematic structural diagram of a terminal device for predicting a prosody hierarchy based on artificial intelligence according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present application relates to the field of Artificial Intelligence (AI), and the following briefly introduces the relevant technologies in the field of Artificial Intelligence.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Among them, Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
In the related technology, CRF and RNN are generally adopted for modeling, and based on the CRF and RNN, the prosodic hierarchy structure of each participle in the text is predicted; however, CRF and RNN are generally unable to capture the dependency between any two words in the whole sentence range, thus limiting their modeling capabilities, and consequently leading to an inability to accurately predict prosodic hierarchies based on them.
In view of the problems in the related art, an embodiment of the present application provides an AI-based prosodic hierarchy prediction method in which a deep neural network model based on a self-attention mechanism is used to predict the prosodic hierarchy of each participle in the target text; through its self-attention sublayers, this model can better capture context dependency relationships over the whole sentence.
It should be understood that the AI-based prosody hierarchy prediction method provided in the embodiments of the present application may be applied to devices with data processing capabilities, such as terminal devices, servers, and the like; the terminal device may be a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, or the like; the server may specifically be an application server or a Web server, and in actual deployment, the server may be an independent server or a cluster server.
In order to facilitate understanding of the technical solution provided by the embodiment of the present application, taking an example that the prosody hierarchy prediction method provided by the embodiment of the present application is applied to a server, an application scenario to which the prosody hierarchy prediction method provided by the embodiment of the present application is applied is exemplarily described below.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of the AI-based prosody hierarchy prediction method according to an embodiment of the present disclosure. As shown in fig. 1, the application scenario includes: terminal device 110 and server 120, terminal device 110 and server 120 communicate through a network. The terminal device 110 is configured to receive a voice signal input by a user and transmit the voice signal to the server 120; the server 120 is configured to determine a response text corresponding to the voice signal transmitted by the terminal device 110, execute the prosody hierarchy structure prediction method provided in the embodiment of the present application, predict a prosody hierarchy structure of each participle in the response text, generate a prosody hierarchy structure sequence corresponding to the response text, convert the response text into a corresponding response voice signal according to the prosody hierarchy structure sequence, and transmit the response voice signal to the terminal device 110.
In a specific application, a user may input a voice signal to the terminal device 110 to request the terminal device 110 to reply a corresponding reply voice signal for the voice signal; after receiving the voice signal input by the user, the terminal device 110 transmits the voice signal to the server 120 through the network.
After receiving the voice signal transmitted by the terminal device 110, the server 120 determines a reply text for replying to the voice signal. Then, taking the reply text as a target text, and performing word segmentation and part-of-speech tagging on the target text to obtain a corresponding word segmentation tagging sequence; then, performing word level feature extraction on the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction; then, a prosodic hierarchy sequence corresponding to the word-level feature sequence is determined through a prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is based on a deep neural network model of a self-attention mechanism.
After the server 120 determines the prosody hierarchy sequence corresponding to the response text through the above processing, a response speech signal corresponding to the response text can be further generated based on the prosody hierarchy sequence, and the response speech signal is more natural and closer to the pronunciation of human. Finally, the server 120 transmits the generated reply voice signal to the terminal device 110, and the terminal device 110 plays the reply voice signal to realize human-computer interaction with the user.
It should be noted that the human-computer interaction application scenario shown in fig. 1 is only an example, and in practical application, the prosody hierarchy prediction method provided in the embodiment of the present application may also be applied to other scenarios, for example, a scenario that a text uploaded by a user is converted into a speech, and the like.
The AI-based prosody hierarchy prediction method provided by the present application is described below by way of example.
Referring to fig. 2, fig. 2 is a flowchart illustrating an AI-based prosody hierarchy prediction method according to an embodiment of the present disclosure. For convenience of description, the following embodiments take a server as an example of an execution subject, and describe the prosody hierarchy prediction method. As shown in fig. 2, the prosodic hierarchy prediction method includes the steps of:
step 201: and acquiring a target text.
When the server needs to synthesize the corresponding speech signal for the target text, in order to obtain a more natural speech signal closer to human pronunciation, the server may predict a prosodic structure hierarchical sequence corresponding to the target text, and then synthesize the speech signal corresponding to the target text based on the prosodic structure hierarchical sequence corresponding to the target text.
It should be noted that, in different application scenarios, the server may obtain the target text in different manners. Taking an application scene of human-computer interaction as an example, the server can acquire a voice signal sent by the terminal equipment and take a text obtained by converting the voice signal as a target text; taking an application scenario of performing voice conversion processing on a text as an example, the server may use a text to be converted sent by the terminal device as a target text, or the server may use a text to be converted obtained from another server or a database as a target text; and so on. The method for acquiring the target text by the server is not limited in any way.
It should be understood that, when the prosody hierarchy prediction method provided in the embodiment of the present application is applied to a terminal device, the terminal device may also obtain a target text in different manners in different application scenarios. Taking a man-machine interaction application scene as an example, the terminal equipment can convert a voice signal input by a user into a corresponding text, and further takes the text as a target text; taking an application scenario of performing voice conversion processing on a text as an example, the terminal device may acquire a text input by a user as a target text, or the terminal device may acquire a text transmitted by the server as the target text; and so on. The method for acquiring the target text by the terminal device is not limited in any way.
Step 202: and performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence.
After the server obtains the target text, word segmentation processing can be performed on the target text to obtain a word segmentation sequence corresponding to the target text; and then, performing part-of-speech tagging processing on each word in the word segmentation sequence to obtain a word segmentation tagging sequence corresponding to the target text.
It should be noted that, a relatively mature word segmentation processing method and a part-of-speech tagging method are available in the related art, and here, the word segmentation processing method in the related art may be directly adopted to perform word segmentation processing on a target text, and the part-of-speech tagging method in the related art is adopted to perform part-of-speech tagging processing on a word segmentation sequence obtained by word segmentation processing, and the application does not make any limitation on the specifically adopted word segmentation processing method and part-of-speech tagging method.
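For illustration only, the following is a minimal sketch of this step using the open-source jieba toolkit; the toolkit choice and the Chinese rendering of the sample sentence are assumptions for demonstration, since the embodiment does not prescribe a specific word segmentation or part-of-speech tagging method:
```python
# Illustrative sketch only: jieba is one mature open-source Chinese
# word segmentation and part-of-speech tagging toolkit; the patent
# does not prescribe a specific tool.
import jieba.posseg as pseg

text = "诚挚的问候和良好的祝愿。"  # "Sincere greetings and good wishes." (assumed rendering)

# Pair each segmented word with its part-of-speech tag to form the
# word segmentation tagging sequence.
tagged_sequence = [(w.word, w.flag) for w in pseg.cut(text)]
print(tagged_sequence)  # e.g. [('诚挚', 'a'), ('的', 'uj'), ('问候', 'vn'), ...]
```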
Step 203: and extracting word level features according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level features of each word in the word level feature sequence at least comprise a word vector obtained by semantic feature extraction.
After the server obtains the segmentation tagging sequence corresponding to the target text, word-level feature extraction can be performed on each word in the segmentation tagging sequence to obtain the word-level feature of each word, and then the word-level features of each word are combined in sequence to obtain the word-level feature sequence corresponding to the target text. The word-level features of each word in the word-level feature sequence at least comprise word vectors obtained by semantic feature extraction.
It should be noted that, in practical applications, in order to further improve the prediction effect on the prosody hierarchy, the word-level features of each word may include at least one of a position vector, a part-of-speech vector, a word length vector, and a word post-punctuation vector corresponding to each word, in addition to the word vector obtained by extracting the semantic features, so that the word-level features of each word are enriched, and it is ensured that the prosody hierarchy can be predicted more accurately based on the enriched word-level features subsequently.
In specific implementation, the server can extract semantic features from the word segmentation tagging sequence to obtain a word vector corresponding to each word, and encode the position information of each word in the text to obtain a position vector corresponding to each word. The word-level feature of each word is then generated from its word vector together with at least one of the position vector, part-of-speech vector, word-length vector, and post-word punctuation type vector corresponding to that word in the word segmentation tagging sequence. Finally, the word-level features corresponding to the words in the word segmentation tagging sequence are combined to obtain the word-level feature sequence.
When the semantic features are extracted specifically, the server can extract the semantic features of the word tagging sequence through a semantic feature extraction model to obtain a word vector corresponding to each word; that is, the server may input each word in the segmentation tagging sequence to the semantic feature extraction model one by one, so as to obtain a word vector corresponding to each word output by the semantic feature extraction model.
It should be noted that, in order to make the word vector obtained through semantic feature extraction rich in semantic features, the server may use a pre-trained Bidirectional Encoder Representations from Transformers (BERT) network structure or a Skip-Gram network structure as the semantic feature extraction model. Such a semantic feature extraction model (i.e., a BERT or Skip-Gram network structure) is pre-trained on a large corpus, so the word vectors it extracts are usually rich in context information and carry good semantic features; performing the subsequent prosodic hierarchy prediction based on these word vectors can improve the prediction of the prosodic hierarchy.
It should be understood that, in practical applications, the server may also use other network structures as the semantic feature extraction model, and the application does not specifically limit the model structure of the semantic feature extraction model.
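For illustration only, a sketch of word vector extraction with a pre-trained BERT model through the HuggingFace transformers library; the library, checkpoint name, and mean-pooling strategy are assumptions, as the embodiment only specifies a pre-trained BERT or Skip-Gram structure:
```python
# Sketch of extracting a semantic word vector with pre-trained BERT.
# Library, checkpoint, and mean-pooling strategy are assumptions.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()

word = "祝愿"  # one word from the segmentation tagging sequence
inputs = tokenizer(word, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state[0]  # (tokens, 768)

# Average the sub-token embeddings, excluding [CLS] and [SEP], to get
# a single word vector rich in pre-trained semantic features.
word_vector = hidden[1:-1].mean(dim=0)            # (768,)
```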
Although a prosodic hierarchy prediction model based on the self-attention mechanism can learn the dependency relationship between words at any distance within a sentence, the self-attention mechanism itself ignores the relative positional distance between words. To allow the prosodic hierarchy prediction model to subsequently exploit relative position information, the present application adopts a timing-signal mechanism that encodes the position information of each word in the target text into a corresponding position vector. The position information can be encoded directly using equations (1) and (2), without learning any parameters:
PE(t, 2i) = sin(t / 10000^(2i/d))    (1)

PE(t, 2i+1) = cos(t / 10000^(2i/d))    (2)
where t is the index value of the time step, 2i and 2i +1 are the index values of the coding dimension, and d is the dimension of the position coding.
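A minimal sketch of this parameter-free timing-signal encoding follows; the framework (PyTorch) and the example shapes are illustrative assumptions:
```python
# Minimal sketch of the parameter-free timing-signal position encoding
# of equations (1) and (2); PyTorch is an illustrative choice.
import torch

def timing_signal(num_positions: int, d: int) -> torch.Tensor:
    """Return a (num_positions, d) matrix of position vectors."""
    t = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)  # time steps
    two_i = torch.arange(0, d, 2, dtype=torch.float32)                 # even dims 2i
    angle = t / torch.pow(torch.tensor(10000.0), two_i / d)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(angle)   # dimension 2i
    pe[:, 1::2] = torch.cos(angle)   # dimension 2i + 1
    return pe

position_vectors = timing_signal(num_positions=32, d=256)  # one vector per word
```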
Part-of-speech vectors represent the parts of speech of words, such as a vector representing that the part of speech is a noun, a vector representing that the part of speech is a verb, and so on. A word-length vector indicates the number of characters contained in a word, such as a vector indicating that a word contains two characters, a vector indicating that a word contains three characters, and so on. The post-word punctuation type vector indicates whether punctuation follows a word and, if so, the punctuation type: if punctuation follows the word, the vector indicates the punctuation type; if not, the vector indicates that no punctuation follows the word. The part-of-speech vector, word-length vector, and post-word punctuation type vector can all be expressed as one-hot vectors.
Take the target text "Sincere greetings and good wishes." as an example. The part-of-speech vector corresponding to the last word, "wishes," is a vector representing that the part of speech is a noun; its word-length vector is a vector representing that the word contains two characters; and its post-word punctuation type vector is a vector representing that the punctuation after the word is a period.
It should be understood that, in practical applications, the server may generate at least one of the above position vector, part of speech vector, word length vector, and post-word punctuation type vector for each word in the segmentation tagging sequence according to actual requirements, and then combine at least one of the generated position vector, part of speech vector, word length vector, and post-word punctuation type vector with the word vector of each word to generate the word-level feature corresponding to each word.
In a possible implementation in which the server generates a position vector, a part-of-speech vector, a word-length vector, and a post-word punctuation type vector for each word, the server may, for each word in the word segmentation tagging sequence, sum that word's word vector and position vector, and vector-concatenate the sum with the word's part-of-speech vector, word-length vector, and post-word punctuation type vector to obtain the word-level feature corresponding to the word.
For example, suppose the server performs semantic feature extraction on the i-th word in the word segmentation tagging sequence with the semantic feature extraction model to obtain the corresponding word vector e_i. The word vector e_i is summed with the position vector obtained by encoding the position information of the i-th word, and the sum is then vector-concatenated with the text feature vector r_i, where r_i is formed by concatenating the part-of-speech vector, word-length vector, and post-word punctuation type vector of the i-th word.
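A minimal sketch of this feature assembly, with all dimensions and one-hot table sizes assumed for illustration:
```python
# Minimal sketch of assembling the word-level feature of the i-th word:
# sum the word vector with its position vector, then concatenate the
# part-of-speech / word-length / post-word-punctuation one-hot vectors.
# All dimensions are illustrative assumptions.
import torch

d_model = 256
e_i = torch.randn(d_model)            # word vector from semantic feature extraction
p_i = torch.randn(d_model)            # position vector from the timing signal
pos_onehot = torch.zeros(30); pos_onehot[5] = 1.0     # part of speech (e.g. noun)
len_onehot = torch.zeros(8);  len_onehot[2] = 1.0     # word length (e.g. 2 chars)
punct_onehot = torch.zeros(6); punct_onehot[1] = 1.0  # post-word punctuation (e.g. period)

r_i = torch.cat([pos_onehot, len_onehot, punct_onehot])  # text feature vector r_i
w_i = torch.cat([e_i + p_i, r_i])     # word-level feature of the i-th word
```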
Therefore, in the process of predicting the prosody hierarchical structure, factors closely related to the prosody hierarchical structure type are taken into consideration, the semantic features referred to in the process of predicting the prosody hierarchical structure are enriched, and the accuracy of predicting the prosody hierarchical structure is further ensured.
After the word-level feature corresponding to each word in the word segmentation tagging sequence is obtained through the above processing, the word-level features are combined according to the order of the words in the word segmentation tagging sequence to obtain the word-level feature sequence corresponding to the target text.
Step 204: and obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism.
After the server generates the word-level feature sequence corresponding to the target text, the word-level feature sequence is further processed with a prosodic hierarchy prediction model to obtain the prosodic hierarchy sequence corresponding to the target text, where the prosodic hierarchy prediction model is a deep neural network model based on a self-attention mechanism.
In a possible implementation manner, the network structure of the prosodic hierarchy prediction model includes a fully-connected layer, N (N is a positive integer) feature processing layers and a normalization layer, which are cascaded; the feature processing layer specifically comprises a non-linear sublayer and a self-attention sublayer.
Referring to fig. 3, fig. 3 is a schematic diagram of the operating architecture of an exemplary prosodic hierarchy prediction model according to an embodiment of the present application. The word-level feature sequence input to the prosodic hierarchy prediction model may be represented as W = (w_1, w_2, …, w_i, …, w_n), where w_i, the word-level feature of the i-th word, is generated as follows: semantic feature extraction is performed on the i-th word with a pre-trained semantic feature extraction model to obtain a word vector; the word vector is summed with a position vector; and the sum is concatenated with the vector formed by splicing the part-of-speech vector, word-length vector, and post-word punctuation type vector, yielding the word-level feature of the i-th word.
The fully connected layer at the front end of the prosodic hierarchy prediction model performs feature mixing on the input word-level feature sequence to learn a higher-level representation of the features. N identical feature processing layers are then stacked to form a deep network; each feature processing layer consists of a nonlinear sublayer and a self-attention sublayer, the input and output of each sublayer are connected through a residual structure, and the output of each sublayer also passes through a layer normalization operation. The last layer is a normalization layer (i.e., a softmax layer) that outputs, for each word in the target text, a probability distribution over prosodic structure types.
It should be noted that, in practical applications, the value of N may be set according to actual requirements, and the value of N is not specifically limited herein.
When a training method of the prosody hierarchy prediction model is introduced in the following, each layer of network structure in the prosody hierarchy prediction model shown in fig. 3 will be described in detail, and details of the training method of the prosody hierarchy prediction model will be referred to in detail, and will not be described herein again.
It should be understood that, in practical applications, other network structures can be utilized as the prosody hierarchy prediction model according to actual needs, and the structure of the prosody hierarchy prediction model shown in fig. 3 is only an example, and the specific structure of the prosody hierarchy prediction model is not limited in any way in this application.
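For illustration only, a minimal PyTorch sketch of the fig. 3 architecture described above; the framework choice, layer sizes, input dimension, sublayer ordering, and the use of a fully connected nonlinear sublayer are assumptions:
```python
# Minimal sketch of the fig. 3 architecture: a front-end fully connected
# layer, N stacked feature-processing layers (nonlinear sublayer and
# self-attention sublayer, each with residual connection and layer
# normalization), and a final softmax. Sizes and the sublayer order
# are illustrative assumptions.
import torch
import torch.nn as nn

class FeatureProcessingLayer(nn.Module):
    def __init__(self, d=256, heads=8):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.norm1(x + self.ffn(x))            # nonlinear sublayer + residual
        x = self.norm2(x + self.attn(x, x, x)[0])  # self-attention sublayer + residual
        return x

class ProsodyModel(nn.Module):
    def __init__(self, d_in=300, d=256, n_layers=4, n_classes=4):
        super().__init__()
        self.mix = nn.Linear(d_in, d)              # front-end fully connected layer
        self.layers = nn.ModuleList(FeatureProcessingLayer(d) for _ in range(n_layers))
        self.out = nn.Linear(d, n_classes)         # followed by softmax

    def forward(self, w):                          # w: (batch, seq_len, d_in)
        x = self.mix(w)
        for layer in self.layers:
            x = layer(x)
        return torch.softmax(self.out(x), dim=-1)  # probabilities per word
```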
In one possible implementation, the server inputs the word-level feature sequence into a prosodic hierarchy prediction model, which is a four-classification model for predicting the probability that each word in the text belongs to a non-prosodic structure boundary (NB), a prosodic word boundary (PW), a prosodic phrase boundary (PPH), and an intonation phrase boundary (IPH); and then, the server obtains a prosody hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosody hierarchy structure sequence comprises each word and a prosody hierarchy structure type identifier with the maximum probability corresponding to each word.
Specifically, after the server inputs the word-level feature sequence into the prosodic hierarchy prediction model, the prosodic hierarchy prediction model correspondingly predicts the probability that each word in the target text belongs to NB, PW, PPH, IPH. Then determining a prosodic hierarchy structure type identifier with the maximum probability corresponding to each word, wherein the prosodic hierarchy structure type identifier can represent a prosodic hierarchy structure corresponding to the word; and then, arranging the prosodic hierarchy type identifications corresponding to each word in sequence to obtain a prosodic hierarchy sequence corresponding to the input word-level feature sequence.
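A small sketch of this decoding step; the class ordering and the stand-in probabilities are assumptions for illustration:
```python
# Sketch of decoding per-word four-class probabilities into a prosodic
# hierarchy sequence by taking the highest-probability type per word.
# The class order NB/PW/PPH/IPH is an assumed convention.
import torch

TYPES = ["NB", "PW", "PPH", "IPH"]
probs = torch.softmax(torch.randn(6, 4), dim=-1)  # stand-in model output, one row per word
prosody_sequence = [TYPES[k] for k in probs.argmax(dim=-1).tolist()]
print(prosody_sequence)  # e.g. ['PW', 'NB', 'PPH', 'PW', 'NB', 'IPH']
```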
In another possible implementation, the server inputs the word-level feature sequence into a prosodic hierarchy prediction model, which is a three-classification model for predicting the probability that each word in the text belongs to a non-prosodic structure boundary (NB), a prosodic word boundary (PW), and a prosodic phrase boundary (PPH); and then, the server obtains a prosody hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosody hierarchy structure sequence comprises each word and a prosody hierarchy structure type identifier with the maximum probability corresponding to each word.
Specifically, after the server inputs the word-level feature sequence into the prosodic hierarchy prediction model, the prosodic hierarchy prediction model predicts the probability that each word in the target text belongs to NB, PW, and PPH accordingly. Then determining a prosodic hierarchy structure type identifier with the maximum probability corresponding to each word, wherein the prosodic hierarchy structure type identifier can represent a prosodic hierarchy structure corresponding to the word; and then, arranging the prosodic hierarchy type identifications corresponding to each word in sequence to obtain a prosodic hierarchy sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy prediction model is a three-classification model, the prosody hierarchy prediction model can be used for predicting the probability that a word belongs to NB, PW, and PPH in a target text, and can also be used for predicting the probability that a word belongs to PW, IPH, and PPH, and certainly can also be used for predicting the probability that a word belongs to other three prosody hierarchies, and no limitation is made to the three prosody hierarchies that can be predicted by the three-classification prosody hierarchy prediction model.
In yet another possible implementation, the server may input the word-level feature sequence into a prosodic hierarchy prediction model, the prosodic hierarchy prediction model being a two-classification model for predicting a probability that each word in the target text belongs to a prosodic phrase boundary (PPH) and a non-prosodic phrase boundary; further, the server obtains a prosody hierarchy sequence corresponding to the word-level feature sequence, wherein the prosody hierarchy sequence comprises each word and an identifier of a prosody hierarchy type with the maximum probability corresponding to each word.
Specifically, after the server inputs the word-level feature sequence into the prosodic hierarchy prediction model, the prosodic hierarchy prediction model predicts the probability that each word in the target text belongs to the PPH and the non-PPH accordingly. When the prosodic hierarchy type with the maximum probability corresponding to the word is marked as PPH, the word is represented to belong to PPH, and when the prosodic hierarchy type with the maximum probability corresponding to the word is marked as non-PPH, the word is represented to belong to non-PPH; and then, arranging the prosodic hierarchy type identifications corresponding to each word in sequence to obtain a prosodic hierarchy sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy prediction model is a binary classification model, it may be used to predict the probability that a word belongs to PPH and non-PPH in the target text, may also be used to predict the probability that a word belongs to IPH and non-IPH, and may also be used to predict the probability that a word belongs to PW and non-PW, where no limitation is made on two prosody hierarchies that can be predicted by the binary classification prosody hierarchy prediction model.
After the processing of steps 201 to 204, the server may obtain a prosody hierarchical structure sequence corresponding to the target text, and further, the server may perform speech synthesis processing according to the prosody hierarchical structure sequence corresponding to the target text, the target voice type, the target speech rate, the target volume, and the target sampling rate, thereby obtaining the target speech corresponding to the target text.
It should be understood that the target sound type, the target speech rate, the target volume and the target sampling rate may be set by a user, or may be parameters set by default by the speech synthesis system, and the setting manner and specific values of the target sound type, the target speech rate, the target volume and the target sampling rate are not limited in any way.
The AI-based prosodic hierarchy prediction method described above uses a deep neural network model based on a self-attention mechanism to predict the prosodic hierarchy of each participle in the target text. Through its self-attention sublayer, this model can better capture context dependency relationships over the whole sentence; compared with the CRF and RNN models in the related art, it has better sequence modeling capability and correspondingly achieves a better prosodic hierarchy prediction effect, which is conducive to improving the quality of subsequent speech synthesis.
It should be understood that, in practical applications, whether the AI-based prosody hierarchy prediction method provided in the embodiment of the present application can accurately predict the prosody hierarchy corresponding to the target text mainly depends on the model performance of the prosody hierarchy prediction model, and the model performance of the prosody hierarchy prediction model is closely related to the training process of the prosody hierarchy prediction model. The training method of the prosodic hierarchy prediction model provided by the present application is described below by way of example.
Referring to fig. 4, fig. 4 is a flowchart illustrating a training method of a prosodic hierarchy prediction model according to an embodiment of the present disclosure. For convenience of description, the following embodiments take a server as an execution subject, and describe a training method of the prosodic hierarchy prediction model. Referring to fig. 4, the training method of the prosodic hierarchy prediction model includes the following steps:
step 401: obtaining a training sample set, wherein the training sample set comprises each training sample and a prosodic hierarchy label corresponding to each training sample.
Before training the prosodic hierarchy prediction model, a large number of training samples and prosodic hierarchy labels corresponding to the training samples are generally required to be obtained to form a training sample set for training the prosodic hierarchy prediction model.
It should be noted that the prosody hierarchy labels labeled for the training samples are closely related to the types of the prosody hierarchy prediction models to be trained; when the prosodic hierarchy prediction model is a four-classification model used for predicting the probability that each word in the text belongs to PW, PPH, IPH and NB, the prosodic hierarchy labeled by the server for the training sample should include four prosodic hierarchy type identifications of PW, PPH, IPH and NB; when the prosodic hierarchy prediction model is a three-classification model used for predicting the probability that each word in the text belongs to NB, PW and PPH, the prosodic hierarchy labeled by the server for the training sample should include three prosodic hierarchy type identifications of NB, PW and PPH; when the prosodic hierarchy prediction model is a binary model for predicting the probability that each word in the text belongs to the PPH and the non-PPH, the prosodic hierarchy labeled by the server for the training sample should include two prosodic hierarchy type identifications of the PPH and the non-PPH; and so on.
Optionally, in order to enable the prosodic hierarchy prediction model to learn a certain uncertainty, help to improve the generalization capability of the prosodic hierarchy prediction model, and improve the prediction effect of the prosodic hierarchy prediction model, the server may perform label smoothing on the prosodic hierarchy label corresponding to each training sample in the training sample set.
Taking prosodic hierarchy labels over the four classes PW, PPH, IPH, and NB as an example, assuming a grammatical word belongs to IPH, the prosodic hierarchy label corresponding to the word is represented by a one-hot vector as follows:
TAG_IPH = (0, 0, 0, 1)
After the label smoothing processing is performed on the prosodic hierarchy label (i.e., noise is added), a certain degree of uncertainty is introduced. Assuming the smoothing value is set to 0.1, the label vector after label smoothing is expressed as follows:
SMOOTH_IPH = (0.03, 0.03, 0.03, 0.9)
Therefore, before the prosodic hierarchy prediction model is trained, smoothing can be performed on the prosodic hierarchy labels corresponding to all training samples, so that the prosodic hierarchy prediction model can learn a certain degree of uncertainty during training.
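A minimal sketch of this label smoothing step (smoothing value 0.1, four classes, matching the example above; the PyTorch framework is an illustrative assumption):
```python
# Minimal sketch of label smoothing for the four-class prosody tags:
# move the smoothing mass off the true class and spread it uniformly
# over the remaining classes, matching the example above.
import torch

def smooth_labels(onehot: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    n_classes = onehot.size(-1)
    return onehot * (1.0 - eps) + (1.0 - onehot) * eps / (n_classes - 1)

tag_iph = torch.tensor([0.0, 0.0, 0.0, 1.0])
print(smooth_labels(tag_iph))  # tensor([0.0333, 0.0333, 0.0333, 0.9000])
```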
Step 402: and performing parameter training on the deep neural network model based on the self-attention mechanism through the training sample set, and taking the trained deep neural network model based on the attention mechanism as the prosodic hierarchy prediction model.
After the training sample set is obtained, the server can perform parameter training on a pre-constructed deep neural network model based on the self-attention mechanism by using the obtained training sample set until the deep neural network model based on the self-attention mechanism meeting training end conditions is obtained through training, and then the deep neural network model based on the self-attention mechanism is used as a rhythm hierarchy structure prediction model and can be put into practical application.
Taking a pre-constructed deep neural network model based on the self-attention mechanism with the model structure shown in fig. 3 as an example, the training method of the prosodic hierarchy prediction model is introduced; the self-attention sublayer, the nonlinear sublayer, and the residual connection in this deep neural network model are introduced below:
the attention mechanism can be regarded as a query (query) and a series of key value (value) pairs to obtain a representation of the query, and the specific processing procedure is as follows: similarity calculation is performed on the query and each key to obtain a series of weights, and then weight summation is performed on corresponding values to obtain representation of the query. In the specific calculation of the similarity, a common similarity calculation method such as additive attention and dot product attention may be adopted, and a similarity calculation process is described below by taking Scaled dot product attention (Scaled dot-product attention) in the dot product attention as an example.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating a process of calculating similarity by using a scaled dot product attention mechanism, where Q is a query sequence, K is a series of keys, and V is a value corresponding to a key. As shown in fig. 5, Q and K are first subjected to matrix multiplication, then scaled by a scaling factor, and then normalized, and finally subjected to matrix multiplication with V to obtain an output. The specific calculation process can be expressed as formula (3):
Attention(Q, K, V) = softmax(QK^T / √d) V    (3)

where √d is the scaling factor and d is the dimension of the query vectors in Q.
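A minimal sketch of equation (3); the PyTorch framework and the example shapes are illustrative assumptions:
```python
# Minimal sketch of scaled dot-product attention, equation (3).
import torch

def scaled_dot_product_attention(Q, K, V):
    d = Q.size(-1)                                   # query dimension
    scores = Q @ K.transpose(-2, -1) / d ** 0.5      # matrix multiply, then scale
    weights = torch.softmax(scores, dim=-1)          # normalize to weights
    return weights @ V                               # weighted sum of values

Q = K = V = torch.randn(6, 64)   # self-attention: one sequence supplies Q, K, V
out = scaled_dot_product_attention(Q, K, V)          # (6, 64)
```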
The self-attention mechanism, in contrast, requires only one sequence, from which a representation of that sequence is computed. Multi-head attention performs h linear transformations on the query, key, and value, and then applies scaled dot-product attention in parallel; each scaled dot-product attention yields a d_v-dimensional representation, and the h d_v-dimensional outputs are concatenated into an h × d_v vector that gives the output. The specific calculation flow of the multi-head attention mechanism is shown in fig. 6.
The calculation formula of the multi-head attention mechanism is shown in formula (4) and formula (5):
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (4)

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W    (5)

where W_i^Q, W_i^K, and W_i^V are the linear transformation matrices of the query, key, and value, respectively, and W is the final linear transformation matrix applied to the concatenation of the scaled dot-product outputs. In the prosodic hierarchy prediction model provided in the embodiment of the present application, the number of heads may be set to 8, and the parameters may be set to d = 256 and d_k = d_v = 64 for each head; of course, in practical applications, the parameters may also be set according to actual requirements, and are not specifically limited herein.
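A minimal sketch of equations (4) and (5); the framework is an illustrative assumption, h = 8 and d = 256 follow the text, and the per-head size is computed as d // h:
```python
# Minimal sketch of multi-head attention, equations (4) and (5):
# h parallel scaled dot-product attentions over linearly transformed
# Q, K, V, concatenated and passed through the final transform W.
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d=256, h=8):
        super().__init__()
        self.h, self.d_k = h, d // h             # per-head size d_k = d // h
        self.w_q = nn.Linear(d, d, bias=False)   # h stacked query transforms W_i^Q
        self.w_k = nn.Linear(d, d, bias=False)   # h stacked key transforms W_i^K
        self.w_v = nn.Linear(d, d, bias=False)   # h stacked value transforms W_i^V
        self.w_o = nn.Linear(d, d, bias=False)   # final transform W

    def forward(self, x):                        # x: (seq_len, d), self-attention
        n = x.size(0)
        # Split each projection into h heads of size d_k.
        q = self.w_q(x).view(n, self.h, self.d_k).transpose(0, 1)
        k = self.w_k(x).view(n, self.h, self.d_k).transpose(0, 1)
        v = self.w_v(x).view(n, self.h, self.d_k).transpose(0, 1)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v          # (h, n, d_k)
        concat = heads.transpose(0, 1).reshape(n, -1)      # Concat(head_1..head_h)
        return self.w_o(concat)

out = MultiHeadAttention()(torch.randn(6, 256))  # (6, 256)
```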
The present application further explores the application of the self-attention mechanism, applying it to prosodic hierarchy prediction. Specifically, the self-attention mechanism is realized through the self-attention sublayer in the prosodic hierarchy prediction model and is mainly used to capture the context dependency relationships between words over the whole sentence, forming word-level feature representations rich in context information. A higher-level prosodic structure may depend on words that are far apart, and the self-attention mechanism is well suited to capturing dependencies between distant words, thereby improving the prediction of the prosodic hierarchy. Suppose a sentence has T words: to compute the feature representation of the last word, the similarities between its semantic features and those of all words in the sentence are computed to obtain a weight for each word, and the feature representation of the last word is then obtained as a weighted sum.
Compared with the CRF and RNN models in the related art, the self-attention mechanism can directly capture the dependency relationship between the first and last words of a sentence and is insensitive to the distance between two words. The CRF and RNN models need T-1 computation steps to form the feature input of the last word and thereby learn the dependency between the first and last words; moreover, after such repeated recurrent computation, there is no guarantee that complete information about the first word is still retained when the last word is reached. The deep neural network model based on the self-attention mechanism is therefore more conducive to learning dependency relationships between distant words. The prediction of intonation phrase boundaries in prosodic hierarchy prediction often depends on the previous, distant intonation phrase boundary; because the self-attention mechanism attends to the input of every word in the whole sentence, its computation is more conducive to learning the overall structural information of the sentence.
To suit different use requirements, the nonlinear sublayer may adopt either a fully connected network sublayer or a recurrent neural network sublayer: when faster training speed is pursued, a fully connected network sublayer may be employed; when higher prediction accuracy is pursued, a recurrent neural network sublayer may be employed. These two types of nonlinear sublayers are described below:
when the non-Linear sublayer is a fully connected network sublayer, as shown in fig. 7, the fully connected network sublayer may be used in combination with the self-attention network sublayer to perform non-Linear transformation on the input, mainly performing two Linear transformations, wherein the middle layer employs a Linear rectification (ReLU) activation function; the specific calculation flow is shown as formula (6):
FFN(X) = ReLU(XW_1)W_2    (6)
where W_1 ∈ R^(d×d) and W_2 ∈ R^(d×d) are the parameters of the fully connected network sublayer learned during training.
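A minimal sketch of equation (6), keeping the d-to-d dimensions required by the residual connection; the framework is an illustrative assumption:
```python
# Minimal sketch of the fully connected network sublayer, equation (6):
# two linear transformations with a ReLU in between, dimensions d -> d.
import torch
import torch.nn as nn

d = 256
W1 = nn.Linear(d, d, bias=False)   # first linear transformation
W2 = nn.Linear(d, d, bias=False)   # second linear transformation

def ffn(X: torch.Tensor) -> torch.Tensor:
    return W2(torch.relu(W1(X)))   # FFN(X) = ReLU(X W1) W2

out = ffn(torch.randn(6, d))       # (6, 256), same shape for the residual connection
```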
When the nonlinear sublayer is a recurrent neural network sublayer: prosodic hierarchy prediction takes a word-level feature sequence as input and outputs a corresponding prosodic hierarchy sequence, i.e., it is a sequence-to-sequence mapping. Although the RNN is well suited to sequence modeling, for longer sequences a plain RNN is difficult to train because of exploding or vanishing gradients.
Although a gated RNN can learn dependencies on past time steps, learning information in only one direction limits its performance; a bidirectional RNN enables the network to learn context dependencies in both directions. Accordingly, bidirectional RNN structures can also be applied in the prosodic hierarchy prediction model of the present application, and the RNN nonlinear sublayer can take the following configurations:
1. a unidirectional GRU-RNN sublayer;
2. a Bidirectional LSTM-RNN sublayer, namely a Bidirectional Long Short-Term Memory unit (BLSTM);
3. a bidirectional GRU-RNN sublayer, namely a Bidirectional Gated Recurrent Unit (BGRU).
Because a residual connection exists between the input and output of each sublayer, the input and output dimensions must be kept the same. Therefore, when the unidirectional GRU-RNN sublayer is adopted, the number of neurons may be set to 256 dimensions; when a BLSTM or BGRU sublayer is adopted, 128 neurons are set in each direction, and the outputs of the two directions are concatenated to form 256 dimensions.
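As a concrete illustration of these dimension settings, the following sketch shows a bidirectional GRU sublayer with 128 hidden units per direction, whose concatenated output is 256-dimensional and therefore matches the 256-dimensional input required by the residual connection; the use of PyTorch and the module layout are assumptions for illustration:

```python
import torch
import torch.nn as nn

class BGRUSublayer(nn.Module):
    """Bidirectional GRU nonlinear sublayer (sketch, assuming PyTorch).

    128 hidden units per direction; concatenating the two directions
    yields a 256-dim output matching the 256-dim input.
    """
    def __init__(self, d=256):
        super().__init__()
        self.rnn = nn.GRU(input_size=d, hidden_size=d // 2,
                          bidirectional=True, batch_first=True)

    def forward(self, x):       # x: (batch, T, 256)
        y, _ = self.rnn(x)      # y: (batch, T, 2 * 128) = (batch, T, 256)
        return y

x = torch.randn(4, 10, 256)
print(BGRUSublayer()(x).shape)  # torch.Size([4, 10, 256])
```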
A deep neural network model may exhibit the phenomenon that, as the number of layers increases, the accuracy on the training set saturates or even decreases; this is the degradation problem of neural network models. Residual connections are an effective method for training deep neural network models. In a specific implementation, residual connections exist between the sublayers in the feature processing layer, and an element-wise addition is performed on each dimension at the connection; the specific operation flow is shown in fig. 8.
A residual connection is used in each sublayer of the prosodic hierarchy prediction model in the present application, and the calculation can be represented by equation (7):
Y = X + SubLayer(X) (7)
where X and Y represent the input and output, respectively, of each sublayer.
After the residual connection, a layer normalization operation may further be performed to control the distribution between layers. The prosodic hierarchy prediction model in the present application stacks the same feature processing layer multiple times, and as the number of stacked layers grows, the increased model depth makes training harder; residual connections help the model train and make it more feasible to try deeper network configurations.
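Under the same PyTorch assumption as above, the residual connection of equation (7) followed by layer normalization can be sketched as a wrapper applied to any sublayer:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Y = X + SubLayer(X) as in equation (7), then layer normalization.

    A sketch assuming PyTorch; the wrapped sublayer may be the
    self-attention sublayer or a nonlinear sublayer.
    """
    def __init__(self, sublayer, d=256):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))  # residual + layer norm
```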
It should be understood that the deep neural network model based on the self-attention mechanism to be trained may have a model structure other than the one shown in fig. 3; the present application does not limit in any way the structure of the deep neural network model based on the self-attention mechanism to be trained.
Specifically, when judging whether the trained deep neural network model satisfies the training end condition, a first model may be verified with test samples, where the first model is obtained by performing a first round of training on the deep neural network model with the training samples in the training sample set. Specifically, the server inputs a test sample into the first model, and the first model processes the input test sample to obtain the prosodic hierarchy corresponding to the test sample. Then, the prediction accuracy of the first model is determined from the labeled prosodic hierarchy corresponding to the test sample and the output of the first model; when the prediction accuracy is greater than a preset threshold, the first model is deemed to perform well enough to meet the requirement, is determined to be a deep neural network model satisfying the training end condition, and is used as the prosodic hierarchy prediction model.
In addition, when judging whether the deep neural network model satisfies the training end condition, whether to continue training may be decided based on the models obtained from multiple rounds of training, so as to obtain the prosodic hierarchy prediction model with the best performance. Specifically, the deep neural network models obtained from the rounds of training may each be verified with the test samples. If the differences between the prediction accuracies of the models from the successive rounds are small, the model performance is considered to have no further room for improvement, and the model with the highest prediction accuracy may be selected as the prosodic hierarchy prediction model satisfying the training end condition. If the differences are large, the performance is considered to still have room for improvement, and training may continue until the model with the most stable and best performance is obtained and used as the prosodic hierarchy prediction model.
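A minimal sketch of this selection logic; the accuracy function, the tolerance value, and the data structures are illustrative assumptions:

```python
def select_model(models, accuracy_of, tol=0.005):
    """Select the prosodic hierarchy prediction model across rounds.

    models: candidate models, one per training round.
    accuracy_of: callable returning a model's prediction accuracy
                 on the test samples.
    Returns (best_model, should_continue_training).
    """
    accs = [accuracy_of(m) for m in models]
    best = models[accs.index(max(accs))]
    # Small spread between rounds: performance has plateaued, so keep
    # the most accurate model and end training; large spread: there is
    # still room for improvement, so continue training.
    return best, (max(accs) - min(accs) > tol)
```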
According to the training method of the prosodic hierarchy prediction model described above, the acquired training sample set is used to perform parameter training on the pre-constructed deep neural network model based on the self-attention mechanism, and the trained model is then used as the prosodic hierarchy prediction model and put into practical application. Through its self-attention sublayer, the deep neural network model based on the self-attention mechanism can better capture context dependencies across the whole sentence; it has a stronger sequence modeling capability than the CRF and RNN models in the related art and can therefore achieve a better prosodic hierarchy prediction effect, which in turn helps improve the quality of subsequent speech synthesis.
To further explain the AI-based prosodic hierarchy prediction method provided in the embodiments of the present application, the method is introduced below as a whole, taking as an example an application scenario in which it is used to synthesize target speech from a target text submitted by a user, with a four-classification prosodic hierarchy prediction model. Referring to fig. 9, fig. 9 is a flowchart of the AI-based prosodic hierarchy prediction method.
When a user needs to synthesize the target speech corresponding to the target text 'getting an honest greeting and a nice wish', the user may input 'getting an honest greeting and a nice wish' into the terminal device so as to transmit the target text to the server through the terminal device. After obtaining the target text, the server first performs word segmentation and part-of-speech tagging on the target text to obtain the word segmentation tagging sequence corresponding to the target text.
Then, the server performs word-level feature extraction on the word segmentation tagging sequence corresponding to the target text to obtain the corresponding word-level feature sequence. Specifically, the server may extract semantic features from the word segmentation tagging sequence with a BERT network structure to obtain the word vector corresponding to each word, and encode the position information of each word in the target text to obtain the position vector corresponding to each word. Then, for each word in the word segmentation tagging sequence, the corresponding word vector and position vector are summed, and the sum is concatenated with the part-of-speech vector, word length vector, and post-word punctuation type vector corresponding to the word to obtain the word-level features of that word. Finally, the word-level features of all words are combined according to the order of the words in the word segmentation tagging sequence to obtain the word-level feature sequence.
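The assembly of one word's features can be sketched as follows; the vector dimensions and the pre-computed component vectors are illustrative assumptions:

```python
import numpy as np

def word_level_feature(word_vec, pos_vec, part_of_speech_vec,
                       word_len_vec, punct_vec):
    """Build one word's word-level feature.

    word_vec, pos_vec: (d,) semantic word vector and position vector,
    summed element-wise; the part-of-speech, word-length, and post-word
    punctuation type vectors are then concatenated onto the sum.
    """
    summed = word_vec + pos_vec
    return np.concatenate([summed, part_of_speech_vec,
                           word_len_vec, punct_vec])

d = 128  # assumed word-vector dimension
rng = np.random.default_rng(0)
feat = word_level_feature(rng.normal(size=d), rng.normal(size=d),
                          rng.normal(size=16), rng.normal(size=8),
                          rng.normal(size=4))
print(feat.shape)  # (156,) = 128 + 16 + 8 + 4
```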
Next, the server inputs the generated word-level feature sequence into the prosodic hierarchy prediction model, which is a deep neural network model based on the self-attention mechanism; the model processes the word-level feature sequence and generates, for each word in the target text, the probabilities that the word belongs to NB, PW, PPH, and IPH. Then, for each word, the prosodic hierarchy identifier with the maximum probability is taken as the prosodic hierarchy corresponding to that word: for example, if for 'order' the probability of belonging to PPH is the largest, the prosodic hierarchy corresponding to 'order' is determined to be PPH; if for 'sincere' the probability of belonging to NB is the largest, the prosodic hierarchy corresponding to 'sincere' is determined to be NB; and so on. Finally, the prosodic hierarchy identifiers determined for the words of the target text are arranged in order, yielding the prosodic hierarchy sequence corresponding to the target text, in which each word is followed by its prosodic hierarchy identifier (PW, PPH, IPH, or NB).
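The final label-decoding step can be sketched as follows; the probability matrix and the label ordering are illustrative assumptions:

```python
import numpy as np

LABELS = ["NB", "PW", "PPH", "IPH"]  # four-classification identifiers

def decode_prosody(probs, words):
    """Pick the maximum-probability prosodic hierarchy label per word.

    probs: (T, 4) per-word probabilities over NB, PW, PPH, IPH.
    words: the T words of the word segmentation tagging sequence.
    Returns the prosodic hierarchy sequence as (word, label) pairs.
    """
    return [(w, LABELS[int(np.argmax(p))]) for w, p in zip(words, probs)]

probs = np.array([[0.1, 0.6, 0.2, 0.1],
                  [0.7, 0.1, 0.1, 0.1]])
print(decode_prosody(probs, ["wish", "nice"]))
# [('wish', 'PW'), ('nice', 'NB')]
```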
Furthermore, the server may generate the target speech corresponding to the target text based on the determined prosodic hierarchy sequence, in combination with the target voice type, target speech rate, target volume, and target sampling rate set by the user, and then transmit the target speech to the terminal device so that it is played through the terminal device.
For the AI-based prosody hierarchy prediction method described above, the present application also provides a corresponding AI-based prosody hierarchy prediction apparatus, so that the method can be applied and implemented in practice.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an AI-based prosody hierarchy prediction apparatus 1000 corresponding to the AI-based prosody hierarchy prediction method shown in fig. 2, the apparatus including:
an obtaining module 1001 configured to obtain a target text;
a segmentation and part-of-speech tagging module 1002, configured to perform segmentation and part-of-speech tagging on the target text to obtain a segmentation tagging sequence;
a word-level feature extraction module 1003, configured to perform word-level feature extraction according to the word segmentation tagging sequence to obtain a word-level feature sequence, where the word-level feature of each word in the word-level feature sequence at least includes a word vector obtained through semantic feature extraction;
a prosodic hierarchy predicting module 1004 configured to obtain a prosodic hierarchy sequence corresponding to the word-level feature sequence through a prosodic hierarchy predicting model, where the prosodic hierarchy predicting model is a deep neural network model based on a self-attention mechanism.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 10, referring to fig. 11, fig. 11 is a schematic structural diagram of another prosody hierarchy prediction apparatus provided in the embodiment of the present application, where the word-level feature extraction module 1003 includes:
a semantic feature extraction submodule 1101, configured to perform semantic feature extraction on the word segmentation tagging sequence to obtain a word vector corresponding to each word;
a position vector coding submodule 1102, configured to code position information of each word in the target text to obtain a position vector corresponding to each word;
a word-level feature generation submodule 1103, configured to generate a word-level feature corresponding to each word according to at least one of a position vector, a part-of-speech vector, a word length vector, and a word post-punctuation type vector corresponding to each word in the word segmentation tagging sequence and a word vector corresponding to each word;
and a combining submodule 1104, configured to combine the word-level features corresponding to each word in the word segmentation tagging sequence to obtain the word-level feature sequence.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 11, the semantic feature extraction sub-module 1101 is specifically configured to:
performing semantic feature extraction on the word segmentation labeling sequence through a semantic feature extraction model to obtain a word vector corresponding to each word; the semantic feature extraction model adopts a BERT network structure or a Skip-Gram network structure.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 11, the word-level feature generation sub-module 1103 is specifically configured to:
for each word in the word segmentation tagging sequence, summing the word vector and the position vector corresponding to the word, and concatenating the sum with the part-of-speech vector, the word length vector, and the post-word punctuation type vector corresponding to the word to obtain the word-level features corresponding to the word.
Alternatively, on the basis of the prosody hierarchy prediction device shown in fig. 10, referring to fig. 12, fig. 12 is a schematic structural diagram of another prosody hierarchy prediction device provided in the embodiment of the present application, where the device further includes:
a sample obtaining module 1201, configured to obtain a training sample set, where the training sample set includes each training sample and a prosody hierarchy structure label corresponding to each training sample;
a training module 1202, configured to perform parameter training on the deep neural network model based on the self-attention mechanism through the training sample set, and use the trained deep neural network model based on the self-attention mechanism as the prosody hierarchical structure prediction model.
Alternatively, on the basis of the prosody hierarchy prediction device shown in fig. 12, referring to fig. 13, fig. 13 is a schematic structural diagram of another prosody hierarchy prediction device provided in the embodiment of the present application, where the device further includes:
a label smoothing module 1301, configured to perform label smoothing on a prosodic hierarchy structure label corresponding to each training sample in the training sample set;
the training module 1202 is specifically configured to:
and performing parameter training on the self-attention mechanism-based deep neural network model through the training sample set subjected to label smoothing processing.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 10, the network structure of the prosody hierarchy prediction model includes a cascade of a fully-connected layer, N feature processing layers, and a normalization layer, where N is a positive integer; each feature processing layer includes a nonlinear sublayer and a self-attention sublayer. A sketch of this cascade is given below.
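Assembled from the sketches given earlier, and still under the PyTorch assumption, the cascade could be composed as follows; the input dimension, the number of feature processing layers, the use of multi-head attention as the self-attention sublayer, and the softmax output are all illustrative assumptions (ResidualBlock and BGRUSublayer refer to the sketches above):

```python
import torch.nn as nn

class SelfAttentionSublayer(nn.Module):
    """Thin wrapper so multi-head self-attention returns one tensor."""
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        y, _ = self.attn(x, x, x)
        return y

def build_predictor(d_in=412, d=256, n_layers=4, n_classes=4):
    """Cascade: fully-connected layer, N feature processing layers
    (each a self-attention sublayer plus a nonlinear sublayer, both
    residual-wrapped), and a normalization (softmax) output layer.
    """
    layers = [nn.Linear(d_in, d)]
    for _ in range(n_layers):
        layers += [ResidualBlock(SelfAttentionSublayer(d), d),
                   ResidualBlock(BGRUSublayer(d), d)]
    layers += [nn.Linear(d, n_classes), nn.Softmax(dim=-1)]
    return nn.Sequential(*layers)
```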
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 10, the prosody hierarchy prediction module 1004 is specifically configured to:
inputting the word-level feature sequence into the prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a four-classification model and is used for predicting the probability that each word in the text belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary and a non-prosodic structure boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and a prosodic hierarchy structure type identifier with the maximum probability corresponding to each word.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 10, the prosody hierarchy prediction module 1004 is specifically configured to:
inputting the word-level feature sequence into the prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a three-classification model and is used for predicting the probability that each word in the text belongs to a non-prosodic structure boundary, a prosodic word boundary and a prosodic phrase boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and an identifier of a prosodic hierarchy structure type with the maximum probability corresponding to each word.
Optionally, on the basis of the prosody hierarchy prediction apparatus shown in fig. 10, the prosody hierarchy prediction module 1004 is specifically configured to:
inputting the word-level feature sequence into the prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a two-classification model and is used for predicting the probability that each word in the text belongs to a prosodic phrase boundary and a non-prosodic phrase boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and an identifier of a prosodic hierarchy structure type with the maximum probability corresponding to each word.
Alternatively, on the basis of the prosody hierarchy prediction device shown in fig. 10, referring to fig. 14, fig. 14 is a schematic structural diagram of another prosody hierarchy prediction device provided in the embodiment of the present application, and the device further includes:
and a speech synthesis module 1401, configured to perform speech synthesis according to the prosody hierarchical structure sequence to obtain a target speech.
The deep neural network model based on the self-attention mechanism can better capture context dependencies across the whole sentence through its self-attention sublayer and has a stronger sequence modeling capability than the CRF and RNN models in the related art; it can accordingly achieve a better prosodic hierarchy prediction effect than those models, which in turn helps improve the quality of subsequent speech synthesis.
The embodiment of the present application further provides a server and a terminal device for predicting a prosody hierarchy, and the server and the terminal device for predicting a prosody hierarchy provided in the embodiment of the present application will be described in terms of hardware implementation.
Referring to fig. 15, fig. 15 is a schematic diagram of a server 1500 according to an embodiment of the present disclosure. The server 1500 may vary considerably in configuration and performance, and may include one or more central processing units (CPUs) 1522 (e.g., one or more processors), a memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) storing applications 1542 or data 1544. The memory 1532 and the storage media 1530 may be transient or persistent storage. A program stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 1522 may be configured to communicate with the storage medium 1530 and execute, on the server 1500, the series of instruction operations in the storage medium 1530.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input-output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiment may be based on the server structure shown in fig. 15.
The CPU 1522 is configured to execute the following steps:
acquiring a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
performing word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction;
and obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism.
Optionally, the CPU 1522 may also execute the method steps of any implementation of the AI-based prosody hierarchy prediction method provided in the embodiments of the present application.
Referring to fig. 16, fig. 16 is a schematic structural diagram of a terminal device according to an embodiment of the present application. For convenience of explanation, only the parts related to the embodiments of the present application are shown; specific technical details are not disclosed. The terminal may be any terminal device, including a computer, a tablet computer, a personal digital assistant (PDA), and the like; the following takes a mobile phone as an example:
fig. 16 is a block diagram illustrating a partial structure of a mobile phone related to a terminal provided in an embodiment of the present application. Referring to fig. 16, the cellular phone includes: radio Frequency (RF) circuit 1610, memory 1620, input unit 1630, display unit 1640, sensor 1650, audio circuit 1660, wireless fidelity (WiFi) module 1670, processor 1680, and power supply 1690. Those skilled in the art will appreciate that the handset configuration shown in fig. 16 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The memory 1620 may be used to store software programs and modules, and the processor 1680 executes the software programs and modules stored in the memory 1620, thereby executing the various functional applications and data processing of the mobile phone. The memory 1620 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data, a phonebook, etc.), and the like. Further, the memory 1620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 1680 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1620 and calling data stored in the memory 1620, thereby performing overall monitoring of the mobile phone. Alternatively, processor 1680 may include one or more processing units; preferably, the processor 1680 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It is to be appreciated that the modem processor described above may not be integrated into processor 1680.
In the embodiment of the present application, the processor 1680 included in the terminal further has the following functions:
acquiring a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
performing word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction;
and obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism.
Optionally, the processor 1680 is further configured to perform the steps of any implementation manner of the AI-based prosody hierarchy prediction method provided in the embodiments of the present application.
The present embodiment also provides a computer-readable storage medium for storing a computer program for executing any one of the embodiments of the AI-based prosody hierarchy prediction methods described in the foregoing embodiments.
The present embodiments also provide a computer program product including instructions, which when executed on a computer, cause the computer to perform any one of the embodiments of the AI-based prosodic hierarchy prediction methods described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (13)

1. A method for predicting a prosodic hierarchy of text, comprising:
acquiring a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
performing word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level feature of each word in the word level feature sequence at least comprises a word vector obtained by semantic feature extraction;
obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism;
the word level feature extraction according to the word segmentation tagging sequence to obtain a word level feature sequence comprises the following steps:
extracting semantic features of the word segmentation labeling sequence to obtain a word vector corresponding to each word;
coding the position information of each word in the target text by using a time signal mechanism to obtain a position vector corresponding to each word;
generating word level characteristics corresponding to each word according to at least one of a position vector, a part of speech vector, a word length vector and a word post punctuation type vector corresponding to each word in the word segmentation tagging sequence and the word vector corresponding to each word;
and combining the word-level features corresponding to each word in the word segmentation tagging sequence to obtain the word-level feature sequence.
2. The method of predicting prosodic hierarchy of text according to claim 1, wherein the extracting semantic features of the segmentation annotation sequence to obtain a word vector corresponding to each word comprises:
performing semantic feature extraction on the word segmentation labeling sequence through a semantic feature extraction model to obtain a word vector corresponding to each word; the semantic feature extraction model adopts a BERT network structure or a Skip-Gram network structure.
3. The method for predicting prosodic hierarchy of text according to claim 1, wherein the generating the word-level features corresponding to each word according to at least one of a position vector, a part-of-speech vector, a word length vector and a word post-punctuation type vector corresponding to each word in the word segmentation tagging sequence and a word vector corresponding to each word comprises:
for each word in the word segmentation tagging sequence, summing the word vector and the position vector corresponding to the word, and concatenating the sum with the part-of-speech vector, the word length vector, and the post-word punctuation type vector corresponding to the word to obtain the word-level features corresponding to the word.
4. The method of text prosody hierarchy prediction of claim 1, the method further comprising:
acquiring a training sample set, wherein the training sample set comprises each training sample and a prosodic hierarchy structure label corresponding to each training sample;
and performing parameter training on the deep neural network model based on the self-attention mechanism through the training sample set, and taking the trained deep neural network model based on the self-attention mechanism as the prosodic hierarchy prediction model.
5. The method of text prosody hierarchy prediction of claim 4, the method further comprising:
performing label smoothing processing on a prosodic hierarchy structure label corresponding to each training sample in the training sample set;
performing parameter training on the attention mechanism-based deep neural network model through the training sample set, including: and performing parameter training on the attention mechanism-based deep neural network model through the training sample set subjected to label smoothing processing.
6. The text prosody hierarchy prediction method according to any one of claims 1 to 5, wherein a network structure of the prosody hierarchy prediction model includes a cascade of a fully-connected layer, N feature processing layers, and a normalization layer; n is a positive integer; the feature handling layer includes a non-linear sublayer and a self-attention sublayer.
7. The method for predicting prosodic hierarchy of text according to any one of claims 1 to 5, wherein the obtaining of the prosodic hierarchy corresponding to the sequence of word-level features by a prosodic hierarchy prediction model comprises:
inputting the word-level feature sequence into the prosodic hierarchy structure prediction model, wherein the prosodic hierarchy structure prediction model is a four-classification model and is used for predicting the probability that each word in the text belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary and a non-prosodic structure boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and a prosodic hierarchy structure type identifier with the maximum probability corresponding to each word.
8. The method for predicting prosodic hierarchy of text according to any one of claims 1 to 5, wherein the obtaining of the prosodic hierarchy corresponding to the sequence of word-level features by a prosodic hierarchy prediction model comprises:
inputting the word-level feature sequence into the prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a three-classification model and is used for predicting the probability that each word in the text belongs to a non-prosodic structure boundary, a prosodic word boundary and a prosodic phrase boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and an identifier of a prosodic hierarchy structure type with the maximum probability corresponding to each word.
9. The method for predicting prosodic hierarchy of text according to any one of claims 1 to 5, wherein the obtaining of the prosodic hierarchy corresponding to the sequence of word-level features by a prosodic hierarchy prediction model comprises:
inputting the word-level feature sequence into the prosodic hierarchy prediction model, wherein the prosodic hierarchy prediction model is a two-classification model and is used for predicting the probability that each word in the text belongs to a prosodic phrase boundary and a non-prosodic phrase boundary;
and acquiring a prosodic hierarchy structure sequence corresponding to the word-level feature sequence, wherein the prosodic hierarchy structure sequence comprises each word and an identifier of a prosodic hierarchy structure type with the maximum probability corresponding to each word.
10. The method of text prosody hierarchy prediction of claim 1, the method further comprising:
and performing voice synthesis according to the prosodic hierarchy structure sequence to obtain target voice.
11. A text prosody hierarchy prediction apparatus, comprising:
the acquisition module is used for acquiring a target text;
the word segmentation and part-of-speech tagging module is used for performing word segmentation and part-of-speech tagging on the target text to obtain a word segmentation tagging sequence;
the word level feature extraction module is used for extracting word level features according to the word segmentation tagging sequence to obtain a word level feature sequence, wherein the word level features of each word in the word level feature sequence at least comprise a word vector obtained by semantic feature extraction;
the prosodic hierarchy structure prediction module is used for obtaining a prosodic hierarchy structure sequence corresponding to the word-level feature sequence through a prosodic hierarchy structure prediction model, and the prosodic hierarchy structure prediction model is a deep neural network model based on a self-attention mechanism;
wherein, the word-level feature extraction module comprises:
the semantic feature extraction submodule is used for extracting semantic features of the word segmentation labeling sequence to obtain a word vector corresponding to each word;
the position vector coding submodule is used for coding the position information of each word in the target text by utilizing a time signal mechanism to obtain a position vector corresponding to each word;
the word level feature generation submodule is used for generating word level features corresponding to each word according to at least one of a position vector, a part-of-speech vector, a word length vector and a word post-punctuation type vector corresponding to each word in the word segmentation tagging sequence and the word vector corresponding to each word;
and the combination submodule is used for combining the word-level features corresponding to each word in the word segmentation tagging sequence to obtain the word-level feature sequence.
12. A text prosodic hierarchy prediction device, the device comprising a processor and a memory:
the memory is used for storing a computer program;
the processor is configured to execute the text prosody hierarchy prediction method of any one of claims 1 to 10 according to the computer program.
13. A computer-readable storage medium for storing a computer program for executing the text prosody hierarchy prediction method according to any one of claims 1 to 10.
CN201910834143.5A 2019-09-04 2019-09-04 Text prosody hierarchical structure prediction method, device, equipment and storage medium Active CN110534087B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910834143.5A CN110534087B (en) 2019-09-04 2019-09-04 Text prosody hierarchical structure prediction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910834143.5A CN110534087B (en) 2019-09-04 2019-09-04 Text prosody hierarchical structure prediction method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110534087A CN110534087A (en) 2019-12-03
CN110534087B true CN110534087B (en) 2022-02-15

Family

ID=68667149

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910834143.5A Active CN110534087B (en) 2019-09-04 2019-09-04 Text prosody hierarchical structure prediction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110534087B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111191689B (en) * 2019-12-16 2023-09-12 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN113129864B (en) * 2019-12-31 2024-05-31 科大讯飞股份有限公司 Speech feature prediction method, device, equipment and readable storage medium
CN111243682A (en) * 2020-01-10 2020-06-05 京东方科技集团股份有限公司 Method, device, medium and apparatus for predicting toxicity of drug
CN111259625B (en) * 2020-01-16 2023-06-27 平安科技(深圳)有限公司 Intention recognition method, device, equipment and computer readable storage medium
CN111292715B (en) * 2020-02-03 2023-04-07 北京奇艺世纪科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN111259041A (en) * 2020-02-26 2020-06-09 山东理工大学 Scientific and technological expert resource virtualization and semantic reasoning retrieval method
CN111339771B (en) * 2020-03-09 2023-08-18 广州深声科技有限公司 Text prosody prediction method based on multitasking multi-level model
CN111524557B (en) * 2020-04-24 2024-04-05 腾讯科技(深圳)有限公司 Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence
CN111667816B (en) 2020-06-15 2024-01-23 北京百度网讯科技有限公司 Model training method, speech synthesis method, device, equipment and storage medium
CN111724765B (en) * 2020-06-30 2023-07-25 度小满科技(北京)有限公司 Text-to-speech method and device and computer equipment
CN111951781A (en) * 2020-08-20 2020-11-17 天津大学 Chinese prosody boundary prediction method based on graph-to-sequence
CN112052673A (en) * 2020-08-28 2020-12-08 丰图科技(深圳)有限公司 Logistics network point identification method and device, computer equipment and storage medium
CN112183084B (en) * 2020-09-07 2024-03-15 北京达佳互联信息技术有限公司 Audio and video data processing method, device and equipment
CN112131878B (en) * 2020-09-29 2022-05-31 腾讯科技(深圳)有限公司 Text processing method and device and computer equipment
CN112086086B (en) * 2020-10-22 2024-06-25 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112309368B (en) * 2020-11-23 2024-08-30 北京有竹居网络技术有限公司 Prosody prediction method, apparatus, device, and storage medium
CN112463921B (en) * 2020-11-25 2024-03-19 平安科技(深圳)有限公司 Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium
CN112668315B (en) * 2020-12-23 2024-07-19 平安科技(深圳)有限公司 Text automatic generation method, system, terminal and storage medium
CN112820269B (en) * 2020-12-31 2024-05-28 平安科技(深圳)有限公司 Text-to-speech method and device, electronic equipment and storage medium
CN113096641B (en) * 2021-03-29 2023-06-13 北京大米科技有限公司 Information processing method and device
CN113901210B (en) * 2021-09-15 2022-12-13 昆明理工大学 Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair
CN113806510B (en) * 2021-09-22 2024-06-28 中国科学院深圳先进技术研究院 Legal provision retrieval method, terminal equipment and computer storage medium
CN114091444A (en) * 2021-11-15 2022-02-25 北京声智科技有限公司 Text processing method and device, computer equipment and storage medium
CN114255736B (en) * 2021-12-23 2024-08-23 思必驰科技股份有限公司 Rhythm marking method and system
CN115116428B (en) * 2022-05-19 2024-03-15 腾讯科技(深圳)有限公司 Prosodic boundary labeling method, device, equipment, medium and program product
CN114999444A (en) * 2022-06-17 2022-09-02 云知声智能科技股份有限公司 Training method and device of speech synthesis model, electronic equipment and storage medium
CN115294958B (en) * 2022-06-28 2024-07-02 北京奕斯伟计算技术股份有限公司 Unit selection method and device for speech synthesis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN107239444A (en) * 2017-05-26 2017-10-10 华中科技大学 A kind of term vector training method and system for merging part of speech and positional information
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN109492583A (en) * 2018-11-09 2019-03-19 安徽大学 A kind of recognition methods again of the vehicle based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101202041B (en) * 2006-12-13 2011-01-05 富士通株式会社 Method and device for making words using Chinese rhythm words

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN106601228A (en) * 2016-12-09 2017-04-26 百度在线网络技术(北京)有限公司 Sample marking method and device based on artificial intelligence prosody prediction
CN107239444A (en) * 2017-05-26 2017-10-10 华中科技大学 A kind of term vector training method and system for merging part of speech and positional information
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN109492583A (en) * 2018-11-09 2019-03-19 安徽大学 A kind of recognition methods again of the vehicle based on deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Prosodic Structure Prediction Based on Deep Neural Networks; Wang Qi; China Master's Theses Full-text Database, Information Science and Technology Series; 2016-07-15; I138-1285 *

Also Published As

Publication number Publication date
CN110534087A (en) 2019-12-03

Similar Documents

Publication Publication Date Title
CN110534087B (en) Text prosody hierarchical structure prediction method, device, equipment and storage medium
EP4024232A1 (en) Text processing model training method, and text processing method and apparatus
CN111460807B (en) Sequence labeling method, device, computer equipment and storage medium
US11210306B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
US11741109B2 (en) Dialogue system, a method of obtaining a response from a dialogue system, and a method of training a dialogue system
CN110379409B (en) Speech synthesis method, system, terminal device and readable storage medium
CN111368993B (en) Data processing method and related equipment
CN110427625B (en) Sentence completion method, apparatus, medium, and dialogue processing system
US20230077849A1 (en) Content recognition method and apparatus, computer device, and storage medium
CN109710953B (en) Translation method and device, computing equipment, storage medium and chip
CN113268609A (en) Dialog content recommendation method, device, equipment and medium based on knowledge graph
CN113435208B (en) Training method and device for student model and electronic equipment
CN113392265A (en) Multimedia processing method, device and equipment
CN114091452B (en) Migration learning method, device, equipment and storage medium based on adapter
CN110852066B (en) Multi-language entity relation extraction method and system based on confrontation training mechanism
CN113948060A (en) Network training method, data processing method and related equipment
CN109933773A (en) A kind of multiple semantic sentence analysis system and method
CN114936274B (en) Model training method, dialogue generating method and device, equipment and storage medium
CN116341651A (en) Entity recognition model training method and device, electronic equipment and storage medium
CN114417891B (en) Reply statement determination method and device based on rough semantics and electronic equipment
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN112818688B (en) Text processing method, device, equipment and storage medium
CN114707509A (en) Traffic named entity recognition method and device, computer equipment and storage medium
CN114333790A (en) Data processing method, device, equipment, storage medium and program product
CN114333772A (en) Speech recognition method, device, equipment, readable storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant