CN110223671B - Method, device, system and storage medium for predicting prosodic boundary of language - Google Patents


Info

Publication number
CN110223671B
CN110223671B
Authority
CN
China
Prior art keywords
task
prosody
granularity
component model
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910492657.7A
Other languages
Chinese (zh)
Other versions
CN110223671A (en)
Inventor
潘华山
李秀林
Current Assignee
Data Baker Shenzhen Technology Co ltd
Original Assignee
Data Baker Shenzhen Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Data Baker Shenzhen Technology Co ltd
Priority to CN201910492657.7A
Publication of CN110223671A
Application granted
Publication of CN110223671B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation

Abstract

The embodiment of the invention provides a method, a device, a system, and a storage medium for predicting prosodic boundaries of a language. The language prosody boundary prediction method comprises: extracting embedded features of a text; predicting a task prosody boundary of a corresponding granularity based on the embedded features by using each of at least two component models respectively, wherein at least one component model, when predicting the task prosody boundary of its corresponding granularity, is further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model; and determining a final prosody boundary based at least on the task prosody boundaries other than the task prosody boundary predicted by the at least one other component model. According to this technical solution, at least two component models, each used for predicting task prosody boundaries of a different granularity, are unified under one framework to perform language prosody boundary prediction, thereby improving the prediction effect.

Description

Method, device, system and storage medium for predicting prosodic boundary of language
Technical Field
The present invention relates to the field of speech analysis and processing, and more particularly, to a method, apparatus, system, and storage medium for language prosody boundary prediction.
Background
In recent years, with the development of speech technology, prosodic structure analysis and prediction play an increasingly important role in the naturalness and intelligibility of speech synthesis, analysis, and processing. Improving the prediction effect of language prosodic boundaries is therefore of great significance.
Currently, language prosodic boundary prediction is typically decomposed into tasks of different granularities, and a component model is built separately for each task. The accuracy of language prosodic boundary prediction using such component models leaves room for improvement.
Disclosure of Invention
The present invention has been made in view of the above problems.
According to an aspect of the present invention, a method for predicting prosodic boundaries of a language is provided. The method comprises the following steps:
extracting embedded features of the text;
predicting a task prosody boundary of a corresponding granularity based on the embedded features by using each of at least two component models respectively, wherein at least one component model, when predicting the task prosody boundary of its corresponding granularity, is further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model; and
determining a final prosody boundary based at least on the task prosody boundary other than the task prosody boundary predicted by the at least one other component model.
Illustratively, for each component model other than the component model used to implement the minimum-granularity prosody boundary prediction task, the component model predicts the task prosody boundaries for the corresponding granularity of the text based on the embedded features and all task prosody boundaries of smaller granularity than the corresponding granularity.
Illustratively, for each of the at least one component model, predicting the task prosody boundary for the corresponding granularity based on the embedded features using the component model comprises:
extracting fusion features of the corresponding granularity based on the embedded features and the task prosodic boundaries predicted by the at least one other component model;
and determining task prosodic boundaries of the corresponding granularity of the text by utilizing the component model based on the fusion features of the corresponding granularity.
Illustratively, extracting fused features of the corresponding granularity based on the embedded features and the predicted task prosodic boundaries of the at least one other component model comprises:
connecting the embedded features and the task prosody boundaries predicted by the at least one other component model to obtain associated features of the corresponding granularity;
and extracting the fusion features of the corresponding granularity based on the associated features of the corresponding granularity.
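Illustratively, the connecting step above can be sketched as a simple per-character concatenation. This is only an illustrative sketch: the function name and the 0/1 encoding of predicted boundaries are assumptions, and the patent does not specify a particular encoding.

```python
import numpy as np

def associated_features(embedded, *task_boundaries):
    """Hypothetical sketch of the "connecting" step: concatenate the
    per-character embedded features with each smaller-granularity
    predicted boundary sequence (encoded here as 0/1 labels) to obtain
    the associated features of the corresponding granularity."""
    columns = [np.asarray(embedded, dtype=float)]
    columns += [np.asarray(b, dtype=float).reshape(-1, 1) for b in task_boundaries]
    return np.concatenate(columns, axis=1)

# 4 characters with 3-dimensional embedded features plus one predicted
# prosodic-word boundary sequence -> 4-dimensional associated features.
assoc = associated_features(np.zeros((4, 3)), [0, 1, 0, 1])
```

The fusion features of the corresponding granularity would then be extracted from these associated features, e.g. by a further network layer.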
Illustratively, determining the final prosody boundary based at least on the task prosody boundary of the corresponding granularity predicted by the at least one component model comprises:
task prosodic boundaries for all granularities of text are merged to determine a final prosodic boundary for the text.
Illustratively, predicting the task prosody boundary of the corresponding granularity based on the embedded feature using each of the at least two component models, respectively, includes:
predicting a task prosodic boundary of a first granularity of the text based on the embedded features using a first component model;
predicting, with a second component model, a second-granularity task prosodic boundary of the text based on the embedded feature and the first-granularity task prosodic boundary; and
predicting, with a third component model, a third granularity of task prosody boundaries for the text based on the embedded features, the first granularity of task prosody boundaries, and the second granularity of task prosody boundaries.
Illustratively, the first granularity is prosodic word granularity, the second granularity is prosodic phrase granularity, and the third granularity is intonation phrase granularity.
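The three cascaded prediction steps above can be sketched as follows. The component models here are trivial stand-ins (a thresholding function) rather than trained models, and all names are illustrative assumptions; the sketch only shows how each larger-granularity model consumes the smaller-granularity predictions alongside the shared embedded features.

```python
import numpy as np

def component_model(features):
    # Trivial stand-in for a trained component model: maps per-character
    # feature vectors to 0/1 boundary labels (illustrative only).
    return (features.mean(axis=-1) > 0).astype(np.int64)

def predict_cascade(embedded):
    # First granularity (prosodic words): embedded features only.
    pw = component_model(embedded)
    # Second granularity (prosodic phrases): embedded features + PW boundaries.
    pph = component_model(np.column_stack([embedded, pw]))
    # Third granularity (intonation phrases): embedded features + PW + PPH.
    iph = component_model(np.column_stack([embedded, pw, pph]))
    return pw, pph, iph

rng = np.random.default_rng(0)
embedded = rng.standard_normal((12, 8))   # 12 characters, 8-dim embedded features
pw, pph, iph = predict_cascade(embedded)
```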
Illustratively, prior to extracting the embedded features of the text, the method further comprises:
and training the component model according to the loss function by using the sample data.
Illustratively, the loss function is determined based on task prosody boundaries for the corresponding granularity of text predicted by each component model.
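Illustratively, such a joint loss can be sketched as a sum of per-task losses, one per granularity. The optional per-task weights are an assumption for illustration; the patent only states that the loss function is determined from every component model's predicted boundaries.

```python
def multitask_loss(task_losses, weights=None):
    """Joint training objective: an (optionally weighted) sum of the
    per-granularity boundary-prediction losses."""
    if weights is None:
        weights = [1.0] * len(task_losses)  # unweighted sum by default
    return sum(w * l for w, l in zip(weights, task_losses))

# e.g. losses from the PW, PPH, and IPH tasks combined into one objective
joint = multitask_loss([0.5, 0.3, 0.2])
```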
Illustratively, the component model is a neural network component model.
Illustratively, the neural network component models include a two-way long-short term memory network and a conditional random field model.
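A heavily simplified, untrained sketch of such a sequence tagger is shown below. A plain bidirectional tanh RNN stands in for the BLSTM, and a greedy per-character argmax stands in for CRF decoding, so this illustrates only the shape of the architecture (bidirectional context features feeding per-character tag scores), not the actual BLSTM-CRF.

```python
import numpy as np

rng = np.random.default_rng(1)

def birnn_tagger(x, hidden=4, num_tags=2):
    """Toy bidirectional recurrent tagger over a (chars, dims) feature
    matrix; weights are random, so outputs are meaningless labels."""
    t, d = x.shape
    Wf, Wb = rng.standard_normal((2, d, hidden)) * 0.1        # input weights
    Uf, Ub = rng.standard_normal((2, hidden, hidden)) * 0.1   # recurrent weights
    V = rng.standard_normal((2 * hidden, num_tags)) * 0.1     # emission weights
    hf, hb = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], []
    for i in range(t):                    # forward pass over characters
        hf = np.tanh(x[i] @ Wf + hf @ Uf)
        fwd.append(hf)
    for i in reversed(range(t)):          # backward pass over characters
        hb = np.tanh(x[i] @ Wb + hb @ Ub)
        bwd.append(hb)
    h = np.concatenate([np.stack(fwd), np.stack(bwd[::-1])], axis=1)
    emissions = h @ V                     # per-character tag scores
    return emissions.argmax(axis=1)       # greedy decode (CRF stand-in)

tags = birnn_tagger(rng.standard_normal((6, 3)))  # 6 characters, 3-dim features
```

In the actual model, the CRF layer would replace the greedy argmax with Viterbi decoding over learned tag-transition scores.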
Illustratively, extracting embedded features of text includes:
performing word segmentation on the text to obtain character-level features;
performing feature embedding processing on the character-level features;
connecting all character-level features subjected to feature embedding processing to obtain connection features; and
and extracting embedded features of the text based on the connection features.
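Illustratively, the embedding-and-connecting steps can be sketched as table lookups followed by per-character concatenation. The two feature types, table names, sizes, and dimensions are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding tables for two character-level feature types,
# e.g. character identity and part of speech (sizes are illustrative).
char_table = rng.standard_normal((100, 8))   # 100 characters, 8-dim embeddings
pos_table = rng.standard_normal((10, 4))     # 10 POS tags, 4-dim embeddings

def embed_text(char_ids, pos_ids):
    # Embed each character-level feature, then connect (concatenate)
    # the embedded features character by character.
    return np.concatenate([char_table[char_ids], pos_table[pos_ids]], axis=1)

feats = embed_text(np.array([3, 7, 42]), np.array([1, 1, 5]))  # 3 characters
```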
According to another aspect of the present invention, there is also provided a language prosody boundary prediction apparatus, including:
an extraction module, configured to extract embedded features of the text;
a prediction module, configured to predict a task prosody boundary of a corresponding granularity based on the embedded features by using each of at least two component models respectively, wherein at least one component model, when predicting the task prosody boundary of its corresponding granularity, is further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model; and
a determination module, configured to determine a final prosody boundary based at least on the task prosody boundaries other than the task prosody boundary predicted by the at least one other component model.
According to still another aspect of the present invention, there is also provided a system for predicting a prosodic boundary of a language, including: a processor and a memory, wherein the memory has stored therein computer program instructions for executing the above-described method of language prosody boundary prediction when executed by the processor.
According to yet another aspect of the present invention, there is also provided a storage medium having stored thereon program instructions for executing the above-described method of language prosody boundary prediction when executed.
According to the technical scheme of the embodiment of the invention, at least two component models which are respectively used for predicting the task prosody boundaries with different granularities are unified to perform language prosody boundary prediction under one framework, so that the prediction effect is improved.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic block diagram of a prior art linguistic prosodic boundary prediction model;
FIG. 2 shows a schematic flow diagram of a method of prosodic boundary prediction of a language according to one embodiment of the invention;
FIG. 3a shows a schematic block diagram of a task layer of a language prosody boundary prediction model according to one embodiment of the invention;
FIG. 3b shows a schematic block diagram of a task layer of a language prosody boundary prediction model according to another embodiment of the present invention;
FIG. 3c shows a schematic block diagram of a task layer of a language prosody boundary prediction model according to yet another embodiment of the present invention;
FIG. 4 shows a schematic block diagram of a feature extraction layer of a language prosody boundary prediction model according to one embodiment of the present invention;
FIG. 5 shows a schematic block diagram of a language prosody boundary prediction model according to one embodiment of the present invention;
FIG. 6 is a schematic block diagram illustrating an apparatus for prosodic boundary prediction for a language according to an embodiment of the present invention;
FIG. 7 shows a schematic block diagram of a language prosody boundary prediction system according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of embodiments of the invention and not all embodiments of the invention, with the understanding that the invention is not limited to the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
The language prosody boundary prediction scheme described herein predicts, based on text content, the positions of language prosody boundaries when that text content is played as speech. The scheme can be used for front-end text processing in application scenarios such as speech synthesis and video generation. Corresponding information, such as speech pauses, can be given according to prosodic boundary positions of different granularities, so that the speech correctly expresses the semantics, the natural fluency of speech playback is improved, and high-quality speech is output.
Prosody is a concept of auditory perception, and is a necessary means for speech interaction, and can help listeners to better understand information carried by speech. Prosody boundary prediction is closely related to the content of the text, and in order to improve the naturalness of voice playing, more prosody-related information needs to be acquired from the text, for example, prosody boundary positions with different granularities.
For example, in Chinese, prosodic boundaries are usually divided according to the prosodic hierarchy. The Chinese prosodic hierarchy is generally divided into three basic units: Prosodic Words (PW), Prosodic Phrases (PPH), and Intonation Phrases (IPH), which are arranged in an orderly tree hierarchy. These three basic units also represent the respective granularities of prosodic boundary division. An intonation phrase may comprise one or more prosodic phrases, and a prosodic phrase may comprise one or more prosodic words. Thus, intonation phrases have the largest granularity, prosodic words have the smallest granularity, and prosodic phrases lie in between. That is, in order from small to large, the granularities of the three basic units are prosodic words, prosodic phrases, and intonation phrases.
Specifically, taking the text "this paper mainly studies the prediction of prosodic structure" as an example, the whole sentence can itself serve as an intonation phrase. The text may be divided by prosodic boundaries into two prosodic phrases: "this paper mainly studies" and "the prediction of prosodic structure". Further, the text may be divided by prosodic boundaries into six prosodic words: "this paper", "mainly", "studies", "prosodic", "structure", and "prediction". Clearly, the granularity of intonation phrases is larger than that of prosodic phrases, which in turn is larger than that of prosodic words.
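The tree hierarchy of the example sentence can be written down directly as nested lists, one level per granularity. The English glosses of the prosodic words are illustrative renderings of the original Chinese:

```python
# One intonation phrase (IPH) containing two prosodic phrases (PPH),
# each containing prosodic words (PW) -- the example sentence's hierarchy.
iph = [
    ["this paper", "mainly", "studies"],       # PPH 1
    ["prosodic", "structure", "prediction"],   # PPH 2
]

def count_units(tree):
    # Each larger-granularity unit contains one or more smaller ones.
    return len(tree), sum(len(pph) for pph in tree)   # (#PPH, #PW)
```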
In the following, the method of language prosody boundary prediction will be described taking Chinese as an example; this should be understood as merely an example and not a limitation of the present invention. The language prosody boundary prediction method may also be used for other languages, such as English, Japanese, and German, among others.
Currently, prosody boundary prediction generally decomposes the Chinese prosody boundary prediction task into three independent tasks of different granularities, PW, PPH, and IPH, which are modeled and processed separately. FIG. 1 shows a schematic block diagram of a prior art language prosody boundary prediction model. As shown in fig. 1, the prior art language prosody boundary prediction model includes three parts: a feature extraction layer 110, a task layer 120, and a result output layer 130. The feature extraction layer 110 is used to extract embedded features of the text. The task layer 120 is configured to predict task prosody boundaries of different granularities through a plurality of component models based on the embedded features extracted by the feature extraction layer 110. The result output layer 130 outputs a final prosody boundary prediction result based on the predicted task prosody boundaries of different granularities.
The task layer 120 may include multiple component models, with different component models being used to predict task prosody boundaries of different granularities. As shown in FIG. 1, task layer 120 may include a first component model 121, a second component model 122, a third component model 123, and so on, and may include an Nth component model, where N is an integer, and may be equal to 4, for example. It is to be understood that although more than 3 component models are included in the task layer 120 shown in FIG. 1, only 2 or 3 component models may be included therein. The component models are respectively used for predicting prosodic boundaries with different granularities, thereby completing the prediction tasks with different granularities. To distinguish from the final prosodic boundary, the prosodic boundary predicted by each component model is referred to as a task prosodic boundary. Each component model independently completes the prediction task with the corresponding granularity, and the component models have no dependency relationship. For example, the multiple component models of the task layer 120 may respectively complete any one of the PW, PPH, and IPH prediction tasks, the first component model 121 may complete the PW prediction task, the second component model 122 may complete the PPH prediction task, and the third component model 123 may complete the IPH task.
Based on the task prosodic boundaries predicted by the multiple component models, a final prosodic boundary for the text may be output.
In the above-described language prosody boundary prediction model 100, the component models for predicting task prosody boundaries of different granularities are independent of each other. Each component model receives features of the text and then performs its own prediction task based only on those features. This approach ignores the dependencies between the task prosody boundaries of the different granularities, which greatly reduces the effect of language prosody boundary prediction.
In order to at least partially solve the above problem, an embodiment of the present invention provides a method for predicting prosodic boundaries of a language. In this method, prediction tasks of different granularities are unified under one multi-task learning framework. The shared input data of the prediction tasks is uniformly represented and shared among the tasks. In addition, dependencies can be established between the component models used to predict task prosody boundaries of different granularities. In particular, a component model for predicting larger-granularity task prosody boundaries also completes its own prediction task based on the task prosody boundaries predicted by the component models for predicting smaller-granularity task prosody boundaries. FIG. 2 shows a schematic flow diagram of a method 200 of language prosody boundary prediction according to one embodiment of the invention. As shown in fig. 2, the method includes the following steps.
And step S210, extracting the embedded features of the text.
The text includes all the text content on which language prosody boundary prediction is to be performed. The language prosody boundary prediction method 200 is described below using the text "this paper mainly studies the prediction of prosodic structure" as an example.
Embedding is a method of representing discrete variables with continuous vectors. An embedded feature is the vector representation output when an original discrete object is converted into a continuous vector using an embedding method. Embedded features capture the intrinsic properties of the original object, so that the similarity of objects can be measured by their similarity in vector space. It will be appreciated that the extracted embedded features of the text are well suited as input for machine learning, such as input to feed-forward neural networks (FNNs) and/or multi-layer feed-forward neural networks (MFNNs).
In one example, character-level features of text may be extracted first. For example, a plurality of feature information of the text, such as Chinese characters, word segmentation, part of speech, word length, distance, etc., may be extracted first. It will be appreciated that different character-level features may be flexibly adjusted, e.g., added or deleted, as desired. And then extracting the embedded features of the text by using an embedded method based on the character-level features.
Step S220, predicting the task prosody boundary with the corresponding granularity based on the embedded feature by using each of the at least two component models, wherein at least one component model predicts the task prosody boundary with the corresponding granularity and is also based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model.
The component model may be any existing or future developed model for predicting prosodic boundaries, and is not limited in this application. For example, the component model may be obtained by modeling based on a Bidirectional Long Short-Term Memory-Conditional Random Field (BLSTM-CRF).
It is understood that there are at least 2 component models, and that different component models are used to predict task prosody boundaries of different granularity. The different component models may be based on the same or different mathematical models. All component models share the embedded features extracted in step S210.
Among all the component models, at least one component model's prediction of the task prosody boundary of its corresponding granularity is further based on the task prosody boundaries predicted by at least one other component model. In other words, these component models are not independent; a dependency or association is established between them. Task prosody boundary prediction is thus modeled as a whole, which avoids the reduction in overall prediction effect caused by ignoring the dependencies among tasks of different granularities. In summary, in the technical solution of this embodiment, some of the component models predict the task prosody boundary of the corresponding granularity based only on the embedded features of the text; these include the component model for predicting the task prosody boundary of the minimum granularity. The remaining component models predict the task prosody boundary of the corresponding granularity based not only on the embedded features of the text but also on the task prosody boundaries predicted by other component models, and for each of these component models, the task prosody boundary it predicts has a larger granularity than the task prosody boundaries on which it is based. In general, a boundary of a larger-granularity prosodic unit must also be a boundary of a smaller-granularity one. Still taking the text "this paper mainly studies the prediction of prosodic structure" as an example, the larger-granularity prosodic-phrase boundary lies between the two prosodic phrases "this paper mainly studies" and "the prediction of prosodic structure"; the same position is also a boundary of the smaller-granularity prosodic words, namely the boundary between the prosodic words "studies" and "prosodic".
Therefore, the task prosody boundary with larger granularity is predicted based on the task prosody boundary with smaller granularity, and the accuracy of the task prosody boundary with larger granularity can be improved.
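The containment property relied on here, that every larger-granularity boundary position is also a smaller-granularity boundary position, can be stated as a simple set check. The boundary positions and names below are illustrative:

```python
def boundaries_consistent(coarse, fine):
    # Every larger-granularity boundary must also appear among the
    # smaller-granularity boundaries (coarse is a subset of fine).
    return set(coarse) <= set(fine)

# Example sentence: 5 prosodic-word boundaries (indexed 1..5 between the
# six prosodic words) and 1 prosodic-phrase boundary.
pw_boundaries = {1, 2, 3, 4, 5}
pph_boundaries = {3}   # after the third prosodic word
```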
FIG. 3a shows a schematic block diagram of a task layer 320a of a language prosody boundary prediction model 300a according to one embodiment of the invention. The task layer 320a is used to predict task prosody boundaries of different granularities through N component models, respectively, based on the embedded features of the text. The N component models include a first component model 321a, a second component model 322a, a third component model 323a, …, and an Nth component model, where N is an integer greater than 1. These component models share the embedded features of the text.
The granularity of the task prosody boundaries predicted by the component models in the order of left to right in fig. 3a gradually increases. That is, the granularity of the task prosody boundary predicted by the second component model 322a is greater than the granularity of the task prosody boundary predicted by the first component model 321a, and so on, the granularity of the task prosody boundary predicted by the nth component model is greater than the granularity of the task prosody boundary predicted by the (N-1) th component model.
The task layer 320a in the language prosody boundary prediction model 300a differs from the task layer 120 in the language prosody boundary prediction model 100 in the following respect. The first component model 121, the second component model 122, the third component model 123, and the Nth component model in the task layer 120 are independent of each other, whereas the second component model 322a, the third component model 323a, the Nth component model, etc. in the task layer 320a may depend on the component models to their left.
Specifically, in the task layer 120 of the language-prosody boundary prediction model 100, the input of each component model includes only the features of the text output by the feature extraction layer 110.
In contrast, in the task layer 320a of the language prosody boundary prediction model 300a, the input of the second component model 322a may include, in addition to the embedded features of the text, the task prosody boundary predicted by the first component model 321a. As shown in FIG. 3a, the input of the Nth component model may include, in addition to the embedded features of the text, the task prosody boundaries predicted by all component models to the left of the Nth component model, such as those of the (N-1)th component model, the (N-2)th component model, …, and the first component model 321a. It will be appreciated that the input to the Nth component model may include, in addition to the embedded features of the text, the task prosody boundaries predicted by any one or more of the component models to its left. In this way, the second component model, the third component model, …, and the Nth component model establish dependencies or associations with other component models.
It will be appreciated that while the above examples show the second component model, the third component model, …, and the Nth component model each having dependencies or associations with all of the component models to their respective left, this is not required. For example, each of the second component model, the third component model, …, and the Nth component model may have a dependency relationship with only one or a few of the component models to its left. In other words, the input of such a component model includes the embedded features of the text and the task prosody boundaries predicted by some, rather than all, of the component models to its left.
FIG. 3b shows a schematic block diagram of a task layer 320b of a language prosody boundary prediction model 300b according to another embodiment of the present invention. Among the component models in the task layer 320b of the language prosody boundary prediction model 300b, not every component model has a dependency relationship with all of the component models to its left. For example, the input of the third component model 323b includes the embedded features of the text and the task prosody boundary predicted by the second component model 322b, but does not include the task prosody boundary predicted by the first component model 321b.
It is to be appreciated that while the above examples illustrate the second component model, the third component model … …, and the Nth component model all having dependencies or associations with other component models, this is not required. For example, there may be a few component models among the second component model, the third component model … …, and the nth component model, each of which has a dependency relationship with at least one component model to its respective left. In other words, the input of the few component models includes the embedded features of the text and the predicted task prosodic boundaries of at least one component model to its respective left.
FIG. 3c shows a schematic block diagram of a task layer 320c of a language prosody boundary prediction model 300c according to yet another embodiment of the present invention. The task layer 320c of the language prosody boundary prediction model 300c is similar in function and position to the task layer 320a of the language prosody boundary prediction model 300a, and is not described again here. The difference is that, among the component models in the task layer 320c, not every component model has a dependency relationship with the component models to its left. For example, the input of the second component model 322c includes only the embedded features of the text and does not include the task prosody boundary predicted by the first component model 321c. The input of the third component model 323c includes, in addition to the embedded features of the text, the task prosody boundary predicted by the second component model 322c.
Step S230: determine a final prosody boundary based at least on a task prosody boundary other than the task prosody boundaries predicted by the at least one other component model. It will be appreciated that, in this step, the final prosody boundary is determined based on the task prosody boundary predicted by at least one component model that depends on other component models.
In one example, the final prosody boundary may be obtained by merging a plurality of task prosody boundaries of different granularities. For example, referring to FIG. 3c, the final prosody boundary may be a combination of the task prosody boundaries predicted by the first component model 321c, the third component model 323c, …, and the Nth component model.
Alternatively, the task prosody boundaries of all granularities of the text may be merged to determine the final prosody boundary of the text. For example, referring to FIG. 3a, the final prosody boundary may be a combination of the task prosody boundaries of the corresponding granularities predicted by the first component model 321a, the second component model 322a, the third component model 323a, …, and the Nth component model, respectively.
It will be appreciated that each component model yields a task prosody boundary of its corresponding granularity. In a sense, the task prosody boundaries of different granularities can each independently represent the prosody boundaries of the text. For the same text, a large-granularity task prosody boundary has fewer positions, while a small-granularity task prosody boundary has relatively more positions. Taking the text "This paper mainly studies the prediction of prosodic structure" as an example, there is one boundary position at the prosodic phrase granularity, between the prosodic phrases "This paper mainly studies" and "the prediction of prosodic structure". There are five boundary positions at the prosodic word granularity, between the prosodic words "This paper", "mainly", "studies", "prosodic", "structure", and "prediction", respectively. Merging the task prosody boundaries of all granularities retains more prosody boundary information, so the final prosody boundary determined in this way is more desirable.
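The merging described above can be sketched in Python. This is an illustrative sketch, not the patent's implementation: it assumes each task prosody boundary is represented as a set of character positions after which a break occurs, and that a break at a larger granularity coincides with a break at every smaller granularity; the function name `merge_prosody_boundaries` is hypothetical.

```python
# Hedged sketch: merging task prosody boundaries of several granularities into a
# single final boundary annotation. Positions and granularity ranks are assumptions.

def merge_prosody_boundaries(boundaries_by_granularity):
    """Merge {granularity_rank: set_of_positions} into {position: max_rank}.

    A position present at several granularities keeps the largest (coarsest)
    rank, on the assumption that a prosodic-phrase break is also a word break.
    """
    merged = {}
    for rank, positions in boundaries_by_granularity.items():
        for pos in positions:
            merged[pos] = max(merged.get(pos, 0), rank)
    return merged

# Five prosodic-word boundaries (rank 1); one prosodic-phrase boundary (rank 2)
# that coincides with one of the word boundaries.
word_bounds = {2, 4, 6, 8, 10}
phrase_bounds = {6}
final = merge_prosody_boundaries({1: word_bounds, 2: phrase_bounds})
print(final[6])  # → 2: position 6 is kept at the coarser phrase granularity
```

Representing the merge as a position-to-granularity map keeps all boundary information from every granularity, matching the intuition above that merging all granularities is more desirable.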
Alternatively, the final prosody boundary may be any larger-granularity task prosody boundary that is predicted based not only on the embedded features of the text but also on smaller-granularity task prosody boundaries, or may be determined based on such a larger-granularity task prosody boundary. For example, referring again to FIG. 3b, the final prosody boundary may be the task prosody boundary predicted by the Nth component model, where the Nth component model predicts its task prosody boundary based on the smaller-granularity task prosody boundary predicted by the (N-1)th component model.
It will be appreciated that the final prosodic boundary may be used for speech synthesis and the like.
According to the above technical solution, at least two component models, each used to predict task prosody boundaries of a different granularity, are unified under one framework to perform language prosody boundary prediction, which improves the prediction effect.
In one example, the above step S210 of extracting the embedded features of the text includes the following sub-steps.
Sub-step S211: perform word segmentation on the text to obtain character-level features.
Various character-level features, such as Chinese character, word segmentation, part of speech, word length, and word distance, can be obtained by segmenting the text. These character-level features can be flexibly adjusted as needed, for example by adding or deleting one or more of them.
For convenience of processing, each character-level feature may be expressed as a one-hot feature, i.e., using one-hot encoding. One-hot encoding converts categorical variables into a form that is readily usable by machine learning algorithms: a variable with N states is encoded with an N-bit register in which each state has its own bit and, at any time, exactly one bit is set.
Sub-step S212: perform feature embedding processing on the character-level features.
The feature embedding process reduces the dimensionality of the character-level features. For example, a dictionary of common Chinese characters contains roughly 5,000 to 10,000 entries, so the one-hot vector of a Chinese character also has roughly 5,000 to 10,000 dimensions. The embedding process converts one-hot character-level features into low-dimensional features.
Sub-step S212 will now be described in detail, taking the feature embedding of the one-hot character-level features of Chinese characters according to an embodiment of the present invention as an example. The feature embedding result of a Chinese character can be determined according to the following formula:
EMB_cc = X_{1×N_cc} × W_{N_cc×D_cc} + B_cc

where EMB_cc is the feature-embedded character-level feature of the Chinese character, X_{1×N_cc} is the one-hot character-level feature of the Chinese character, N_cc is the dictionary size, D_cc is the embedding dimension, and W and B are model parameters. The model parameters may be adjusted as appropriate; for example, they are randomly initialized before model training and adjusted according to a loss function during training.
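Because X is one-hot, the embedding formula amounts to selecting one row of the weight matrix W and adding the bias B. A small pure-Python sketch with toy sizes and values (all numbers here are illustrative assumptions):

```python
# Sketch of EMB_cc = X_{1×N_cc} × W_{N_cc×D_cc} + B_cc with a one-hot input:
# the matrix product simply picks out the row of W for the character's index.
N_cc, D_cc = 4, 3  # toy dictionary size and embedding dimension (assumptions)
W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6],
     [0.7, 0.8, 0.9],
     [1.0, 1.1, 1.2]]
B = [0.0, 0.0, 0.0]
x = [0, 1, 0, 0]  # one-hot vector for the character with dictionary index 1

emb = [sum(x[i] * W[i][j] for i in range(N_cc)) + B[j] for j in range(D_cc)]
print(emb)  # → [0.4, 0.5, 0.6], i.e. row 1 of W plus B
```

This is why embedding layers are implemented as table lookups in practice: the multiply-by-one-hot never needs to be materialized.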
In one example, the model used to implement sub-step S212 may be a feedforward neural network.
Similarly, feature embedding results for the other character-level features, such as word segmentation, part of speech, word length, and distance, can be obtained, and are not described again here.
Sub-step S213: connect all the feature-embedded character-level features to obtain a connection feature.
The connection feature can be obtained by connecting the feature embedding results for the Chinese characters, word segmentation, part of speech, word length, and distance obtained in sub-step S212. The connection feature includes the information of all the features of the text.
Sub-step S214: extract the embedded features of the text based on the connection feature.
Based on the connection feature obtained in sub-step S213, feature extraction may be enhanced by a multi-layer fully connected neural network to obtain the embedded features of the text. Illustratively, sub-step S214 may be performed using a Multi-layer Feedforward Neural Network (MFNN), which may employ the tanh function as its activation function. Alternatively, other neural networks, such as a Convolutional Neural Network (CNN) or a Bidirectional Recurrent Neural Network (B-RNN), may be used.
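A minimal sketch of such an MFNN with tanh activations, assuming plain fully connected layers; the function names and toy weights below are illustrative assumptions, not the patent's implementation:

```python
import math

def linear(x, W, b):
    # W stored as rows: W[i][j] maps input dimension i to output dimension j
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]

def mfnn(x, layers):
    """Multi-layer feedforward network; layers = [(W, b), ...], tanh after each layer."""
    for W, b in layers:
        x = [math.tanh(v) for v in linear(x, W, b)]
    return x

# Toy two-layer MFNN mapping a 3-dim connection feature to a 2-dim embedded feature.
W1 = [[0.5, -0.5], [0.5, 0.5], [-0.5, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 0.0], [0.0, 1.0]]  # identity-like second layer for illustration
b2 = [0.0, 0.0]
feat = mfnn([1.0, 0.0, 0.0], [(W1, b1), (W2, b2)])
```

The tanh activation keeps every output in (-1, 1), which is one reason it is a common choice for feature-extraction layers like this one.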
FIG. 4 shows a schematic block diagram of the feature extraction layer 410 of a language prosody boundary prediction model 400 according to one embodiment of the present invention. As shown in FIG. 4, the feature extraction layer first obtains the character-level features of the text, such as Chinese characters, word segmentation, part of speech, word length, and distance. Each of these character-level features is input into a feedforward neural network for feature embedding. The feature-embedded character-level features are connected via a connector to obtain the connection feature. Finally, an MFNN extracts the embedded features of the text based on the connection feature.
The above method of extracting the embedded features of the text yields input that is more amenable to machine learning, effectively improves the accuracy of language prosody boundary prediction, and reduces the computational cost.
Illustratively, for each component model other than the one used to implement the minimum-granularity prosody boundary prediction task, the component model predicts the task prosody boundary of the corresponding granularity of the text based on the embedded features and on all task prosody boundaries of granularities smaller than the corresponding granularity.
Taking FIG. 3a as an example, each of the second component model 322a, the third component model 323a, …, and the Nth component model in the task layer 320a predicts the task prosody boundary of its corresponding granularity based on the embedded features and on the task prosody boundaries predicted by all component models to its left. For example, the second component model 322a predicts its task prosody boundary based on the embedded features and the task prosody boundary predicted by the first component model 321a; the third component model 323a predicts its task prosody boundary based on the embedded features and the task prosody boundaries predicted by the first component model 321a and the second component model 322a; and the Nth component model predicts its task prosody boundary based on the embedded features and the task prosody boundaries predicted by the first component model 321a, the second component model 322a, the third component model 323a, …, and the (N-1)th component model.
Because task prosody boundaries of different granularities have, to a certain extent, dependency or association relationships, in order to improve the overall prosody boundary prediction, a component model predicts the task prosody boundary of the corresponding granularity of the text based on the embedded features and on all task prosody boundaries of granularities smaller than the corresponding granularity. In this way, the dependency relationships between a component model and the other component models that predict smaller-granularity task prosody boundaries are established to the greatest extent, the information of the smaller-granularity task prosody boundaries is fully utilized, and the accuracy of prosody boundary prediction is improved.
Illustratively, for each component model that predicts its task prosody boundary based on the task prosody boundaries predicted by other component models, predicting the task prosody boundary of the corresponding granularity of the text based on the embedded features of the text using the component model includes the following steps.
Step S221: extract the fused features of the corresponding granularity based on the embedded features of the text and the task prosody boundaries predicted by all the other component models on which the component model depends. In other words, for each component model that depends on other component models, its fused features of the corresponding granularity are extracted based on the embedded features of the text and the task prosody boundaries predicted by those other component models. Referring again to the task layer 320c shown in FIG. 3c, the third component model 323c extracts fused features of the corresponding granularity based on the embedded features of the text and the task prosody boundary predicted by the second component model 322c.
The fused features of a certain granularity are extracted based on the embedded features and the task prosody boundaries predicted by the component models on which the corresponding component model depends. The fused features can be extracted using a nonlinear transformation algorithm, for example based on an FNN_tanh function. A fused feature of a particular granularity fuses the information of the embedded features and of all task prosody boundaries of granularities smaller than that particular granularity, which is more favorable for prosody boundary prediction at the current granularity.
Illustratively, for each component model that predicts its task prosody boundary based on the task prosody boundaries predicted by other component models, step S221 specifically includes the following steps.
First, the embedded features and the task prosody boundaries predicted by all the other component models on which the component model depends are connected to obtain the associated features of the corresponding granularity. The associated features associate the embedded features of the text with all the task prosody boundaries received by the component model, and include the information of both.
Then, the fused features of the corresponding granularity are extracted based on the associated features of the corresponding granularity. Because the associated features include the information of both the embedded features of the text and all the task prosody boundaries received by the component model, the fused features of the corresponding granularity can be extracted from the associated features. The extraction algorithm of the fused features may utilize a nonlinear transformation method, such as an FNN_tanh function.
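The two sub-steps (connection, then nonlinear transformation) can be sketched as below. The `fuse` function, toy weights, and feature sizes are illustrative assumptions, with a single tanh layer standing in for FNN_tanh:

```python
import math

def fuse(embedded, boundary_feats, W, b):
    """Connect the embedded feature with all depended-on boundary features
    (the associated feature), then apply a tanh layer (the fused feature)."""
    concat = embedded + [f for feats in boundary_feats for f in feats]
    return [math.tanh(sum(c * W[i][j] for i, c in enumerate(concat)) + b[j])
            for j in range(len(b))]

embedded = [0.2, -0.1]     # toy embedded feature of one token (assumption)
boundary = [[1.0]]         # toy boundary feature from one depended-on model
W = [[0.5, 0.0], [0.0, 0.5], [0.5, 0.5]]  # (3 inputs × 2 outputs), toy weights
b = [0.0, 0.0]
fused = fuse(embedded, boundary, W, b)
```

Because the boundary features enter the same nonlinear layer as the embedded features, the fused feature carries information from both, matching the description above.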
According to the above technical solution, the associated features are obtained by connection, which guarantees the accuracy of the predicted task prosody boundaries and, in turn, the accuracy of the final prosody boundary, and is easy to implement.
Step S222, determining a task prosody boundary of the text with the corresponding granularity by using the component model based on the fusion feature with the corresponding granularity.
As described above, the fused features of the corresponding granularity fuse the information of the embedded features and of the task prosody boundaries predicted by all the other component models on which the component model depends; based on the fused features, the component model can determine the task prosody boundary of the corresponding granularity of the text more accurately.
Illustratively, the component model described above is a neural network component model.
It is appreciated that predicting the prosodic boundaries of text based on the neural network component model may take advantage of the self-learning capabilities of the neural network, thereby enabling more accurate prosodic boundary results.
Illustratively, the above neural network component model includes a Bidirectional Long Short-Term Memory network (BLSTM) and a Conditional Random Field (CRF) model. The BLSTM-CRF model belongs to an end-to-end prosody prediction framework, which not only predicts prosody boundaries more accurately but is also language-independent, i.e., it can predict prosody boundaries for texts in various languages.
Alternatively, the neural network component model may be any one of a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Gated Recurrent Unit (GRU), a Long Short-Term Memory network (LSTM), and the like.

In one example, from another perspective, predicting the task prosody boundary of the corresponding granularity of the text based on the embedded features of the text using each of the at least two component models includes the following steps.
Step S221', predicting a task prosodic boundary of a first granularity of the text based on the embedded feature using the first component model.
It is understood that the granularity of the task prosodic boundary of the first granularity is the smallest. The first component model is independent of other component models, and predicts task prosody boundaries for a first granularity of text based only on embedded features.
Optionally, predicting a first granularity of task prosodic boundaries for the text based on the embedded features comprises performing a first granularity of prediction tasks using a BLSTM-CRF model.
The task prosody boundary of the first granularity is determined based on the embedded features according to the following formula:
First_pred = BLSTM-CRF_first(FEAT_embed)

where First_pred represents the task prosody boundary of the first granularity, FEAT_embed represents the embedded features, and BLSTM-CRF_first represents the BLSTM-CRF model of the first granularity.
Step S222', predicting a task prosody boundary of a second granularity of the text based on the embedded feature and the task prosody boundary of the first granularity by using a second component model. This step may include the following substeps.
Sub-step 1: connect the embedded features and the task prosody boundary of the first granularity to obtain the associated features of the second granularity.
The associated features of the second granularity may be determined according to the following formula:
Concat_second = [FEAT_embed; First_pred]

where Concat_second represents the associated features of the second granularity, First_pred represents the task prosody boundary of the first granularity, and FEAT_embed represents the embedded features.
Sub-step 2: extract the fused features of the second granularity based on the associated features of the second granularity.
The fused features of the second granularity may be determined according to the following equation:
Second_in = FNN_tanh(Concat_second)

where Second_in represents the fused features of the second granularity, Concat_second represents the associated features of the second granularity, and FNN_tanh represents a feedforward neural network with tanh as its activation function.
Sub-step 3: determine the task prosody boundary of the second granularity of the text using the second component model based on the fused features of the second granularity.
Similar to the formula for determining the task prosody boundary of the first granularity in step S221', the task prosody boundary of the second granularity is determined according to the following formula:
Second_pred = BLSTM-CRF_second(Second_in)

where Second_pred represents the task prosody boundary of the second granularity, Second_in represents the fused features of the second granularity, and BLSTM-CRF_second represents the BLSTM-CRF model of the second granularity.
Step S223', predict a task prosody boundary of a third granularity of the text based on the embedded feature, the task prosody boundary of the first granularity, and the task prosody boundary of the second granularity using a third component model.
The prediction of the task prosody boundary of the third granularity is similar to that of the second granularity; the specific calculation is as follows:
Concat_third = [FEAT_embed; Second_pred; First_pred]

Third_in = FNN_tanh(Concat_third)

Third_pred = BLSTM-CRF_third(Third_in)

where Concat_third represents the associated features of the third granularity, Third_in represents the fused features of the third granularity, Third_pred represents the task prosody boundary of the third granularity, and BLSTM-CRF_third represents the BLSTM-CRF model of the third granularity.
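Putting the three granularities together, the cascaded structure can be sketched as follows. The BLSTM-CRF stages and FNN_tanh are replaced here by caller-supplied stand-in functions, so only the wiring (what feeds into what) reflects the model described above; everything else is an illustrative assumption:

```python
def predict_prosody(feat_embed, model_first, model_second, model_third, fnn_tanh):
    """Cascade: First_pred -> Second_pred -> Third_pred, per the formulas above.
    Features and predictions are represented as plain lists for illustration."""
    first_pred = model_first(feat_embed)                        # First_pred
    second_in = fnn_tanh(feat_embed + first_pred)               # Concat + FNN_tanh
    second_pred = model_second(second_in)                       # Second_pred
    third_in = fnn_tanh(feat_embed + second_pred + first_pred)  # Concat + FNN_tanh
    third_pred = model_third(third_in)                          # Third_pred
    return first_pred, second_pred, third_pred

# Identity stand-ins; real usage would pass trained BLSTM-CRF models and an FNN.
ident = lambda x: x
f, s, t = predict_prosody([1.0], ident, ident, ident, ident)
```

Note how the third stage receives both earlier predictions alongside the embedded features, which is the dependency structure that lets larger granularities exploit smaller-granularity boundary information.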
Illustratively, the first granularity is prosodic word granularity, the second granularity is prosodic phrase granularity, and the third granularity is intonation phrase granularity.
It can be understood that performing prosody boundary prediction at the three granularities of prosodic word, prosodic phrase, and intonation phrase divides the prosody boundaries of the text reasonably, thereby meeting the requirements of speech synthesis.
To more clearly illustrate the present invention, fig. 5 illustrates a schematic block diagram of a language prosody boundary prediction model according to still another embodiment of the present invention. As shown in fig. 5, the prosodic boundary prediction model includes three parts, i.e., a feature extraction layer 510, a task layer 520, and a result output layer 530. The function, position and structure of the feature extraction layer 510 are similar to those of the feature extraction layer 410 in the prosodic boundary prediction model 400, and are not described herein again.
Task layer 520 includes a first component model 521, a second component model 522, and a third component model 523.
The first component model 521 is used to perform the above step S221', and predict a task prosody boundary of a first granularity based on the embedded features of the text. The first granularity may be a prosodic word granularity that is the smallest.
The second component model 522 is used to perform the above step S222', and predicts the task prosody boundary of the second granularity based on the embedded features of the text and the task prosody boundary of the first granularity predicted by the first component model 521. The second granularity is larger than the first granularity. The second granularity may be a prosodic phrase granularity.

The third component model 523 is used to perform the above step S223', and predicts the task prosody boundary of the third granularity of the text based on the embedded features, the task prosody boundary of the first granularity, and the task prosody boundary of the second granularity. The third granularity is larger than both the first granularity and the second granularity. The third granularity may be an intonation phrase granularity.
The result output layer 530 serves to combine the task prosody boundaries predicted by the three component models, the first component model 521, the second component model 522, and the third component model 523, to output a final prosody boundary of the text.
The above technical solution predicts the final prosody boundary of the text based on the language prosody boundary prediction model 500, which can obtain a more accurate prediction result. In addition, this technical solution is well suited to predicting Chinese texts.
According to another aspect of the present invention, there is also provided a device for predicting prosodic boundaries of a language. FIG. 6 shows a schematic block diagram of a language prosody boundary prediction apparatus according to an embodiment of the present invention.
As shown in fig. 6, the apparatus 600 includes an extraction module 610, a prediction module 620, and a determination module 630.
The respective modules may respectively perform the respective steps/functions for the method for language prosody boundary prediction described above. Only the main functions of the components of the device 600 will be described below, and details that have been described above will be omitted.
An extraction module 610, configured to extract the embedded features of the text.
A prediction module 620, configured to predict a task prosody boundary of a corresponding granularity based on the embedded features extracted by the extraction module 610, using each of at least two component models respectively, wherein at least one component model predicts the task prosody boundary of its corresponding granularity further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than the granularity of the task prosody boundary predicted by the at least one other component model.
A determination module 630, configured to determine a final prosody boundary based at least on a task prosody boundary other than the task prosody boundaries predicted by the at least one other component model.
According to still another aspect of the present invention, there is also provided a system for predicting a prosodic boundary of a language, including: a processor and a memory, wherein the memory has stored therein computer program instructions for executing the above-described method of language prosody boundary prediction when executed by the processor.
FIG. 7 shows a schematic block diagram of a language prosody boundary prediction system 700 according to one embodiment of the invention. As shown in FIG. 7, the system 700 includes an input device 710, a storage device 720, a processor 730, and an output device 740.
The input device 710 is used for receiving an operation instruction input by a user and collecting data. The input device 710 may include one or more of a keyboard, a mouse, a microphone, a touch screen, an image capture device, and the like.
The storage 720 stores computer program instructions for implementing the respective steps in the method for language prosody boundary prediction according to the embodiment of the present invention.
The processor 730 is configured to run the computer program instructions stored in the storage 720 to perform the corresponding steps of the method for predicting the prosody boundary of the language according to the embodiment of the present invention, and is configured to implement the extraction module 610, the prediction module 620 and the determination module 630 for use in the apparatus for predicting the prosody boundary of the language according to the embodiment of the present invention.
The output device 740 is used to output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, and the like.
In one embodiment, the computer program instructions, when executed by the processor 730, cause the system 700 to perform the steps of:
extracting embedded features of the text;
predicting a task prosody boundary of a corresponding granularity based on the embedded features by using each of at least two component models respectively, wherein at least one component model predicts the task prosody boundary of its corresponding granularity further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model;
a final prosody boundary is determined based at least on the task prosody boundary other than the task prosody boundary predicted by the at least one other component model.
Furthermore, according to still another aspect of the present invention, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor, cause the computer or the processor to execute the respective steps of the above-described language prosody boundary prediction method according to an embodiment of the present invention, and are used to implement the respective modules in the above-described language prosody boundary prediction apparatus according to an embodiment of the present invention or the respective modules for use in the above-described language prosody boundary prediction system. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the steps of:
extracting embedded features of the text;
predicting a task prosody boundary of a corresponding granularity based on the embedded features by using each of at least two component models respectively, wherein at least one component model predicts the task prosody boundary of its corresponding granularity further based on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than that of the task prosody boundary predicted by the at least one other component model;
a final prosody boundary is determined based at least on the task prosody boundary other than the task prosody boundary predicted by the at least one other component model.
According to the language prosody boundary prediction scheme, at least two component models which are respectively used for predicting task prosody boundaries with different granularities are unified to perform language prosody boundary prediction under one framework, and the prediction effect is improved.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present invention should not be construed to reflect the intent: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules of a language prosody boundary prediction apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third, etcetera does not indicate any ordering. These words may be interpreted as names.
The above description is merely illustrative of specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (13)

1. A method of prosodic boundary prediction for a language, comprising:
extracting embedded features of the text;
predicting a task prosody boundary of a corresponding granularity based on the embedded features using each of at least two component models respectively, wherein at least one component model predicts the task prosody boundary of its corresponding granularity based also on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than the granularity of the task prosody boundary predicted by the at least one other component model; and
determining a final prosody boundary based at least on a task prosody boundary other than the task prosody boundary predicted by the at least one other component model;
wherein for each component model of the at least one component model, predicting, using the component model, a task prosody boundary of a corresponding granularity based on the embedded features comprises:
concatenating the embedded features with the task prosody boundary predicted by the at least one other component model to obtain associated features of the corresponding granularity;
extracting fusion features of the corresponding granularity based on the associated features of the corresponding granularity; and
determining, using the component model, the task prosody boundary of the corresponding granularity of the text based on the fusion features of the corresponding granularity.
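The cascade recited in claim 1, in which a larger-granularity task consumes the boundaries predicted for smaller granularities as extra input features, can be sketched outside the claim language as follows. The stub predictors below are hypothetical stand-ins for the trained component models, not the patented models themselves:

```python
from typing import Callable, List

Vec = List[float]

def concat_features(embeddings: List[Vec], boundaries: List[int]) -> List[Vec]:
    """Concatenate each token's embedding with the boundary label predicted
    by the lower-granularity task (the 'associated features' of claim 1)."""
    return [e + [float(b)] for e, b in zip(embeddings, boundaries)]

def cascade_predict(embeddings: List[Vec],
                    tasks: List[Callable[[List[Vec]], List[int]]]) -> List[List[int]]:
    """Run tasks from smallest to largest granularity; every task after the
    first sees the previous task's boundaries appended to its input."""
    results: List[List[int]] = []
    feats = embeddings
    for task in tasks:
        boundaries = task(feats)
        results.append(boundaries)
        feats = concat_features(feats, boundaries)  # feed forward to next task
    return results

# Hypothetical stub predictors standing in for trained component models:
pw = lambda f: [1 if i % 2 else 0 for i in range(len(f))]       # prosodic word
pp = lambda f: [1 if i % 3 == 2 else 0 for i in range(len(f))]  # prosodic phrase
emb = [[0.1, 0.2]] * 6                                          # toy embeddings
word_b, phrase_b = cascade_predict(emb, [pw, pp])
```

The prosodic-phrase task here receives three-dimensional inputs (two embedding dimensions plus the prosodic-word boundary label), which is the feature concatenation the claim describes.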
2. The method of claim 1, wherein, for each component model other than the component model used to implement the minimum-granularity prosodic boundary prediction task, the component model predicts a task prosodic boundary of the text for a corresponding granularity based on the embedded features and all task prosodic boundaries of smaller granularity than the corresponding granularity.
3. The method of claim 1 or 2, wherein the determining a final prosodic boundary based at least on the task prosodic boundary of the corresponding granularity predicted by the at least one component model comprises:
merging task prosodic boundaries of all granularities of the text to determine a final prosodic boundary of the text.
4. The method of claim 1 or 2, wherein the predicting task prosody boundaries for corresponding granularity based on the embedded features using each of at least two component models, respectively, comprises:
predicting a task prosodic boundary of a first granularity of the text based on the embedded features using a first component model;
predicting, with a second component model, a second-granularity task prosodic boundary of the text based on the embedded feature and the first-granularity task prosodic boundary; and
predicting, with a third component model, a third granularity of task prosody boundaries for the text based on the embedded features, the first granularity of task prosody boundaries, and the second granularity of task prosody boundaries.
5. The method of claim 4, wherein the first granularity is prosodic word granularity, the second granularity is prosodic phrase granularity, and the third granularity is intonation phrase granularity.
6. The method of claim 1 or 2, wherein prior to said extracting embedded features of text, the method further comprises:
training the component models according to a loss function using sample data.
7. The method of claim 6, wherein the loss function is determined based on task prosody boundaries for a corresponding granularity of the text predicted by each component model.
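Claims 6 and 7 require only that the loss depend on each component model's predicted task prosody boundary; they do not fix its form. One common realization, offered here purely as an assumption, is a (possibly weighted) sum of per-granularity task losses:

```python
from typing import Optional, Sequence

def joint_loss(task_losses: Sequence[float],
               weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-task losses into one multi-task training objective.
    An unweighted sum is assumed when no weights are given; each entry of
    task_losses would come from one component model's granularity."""
    if weights is None:
        weights = [1.0] * len(task_losses)
    return sum(w * l for w, l in zip(weights, task_losses))

total = joint_loss([0.7, 0.2, 0.1])  # one loss per prosodic granularity
```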
8. The method of claim 1 or 2, wherein the component model is a neural network component model.
9. The method of claim 8, wherein the neural network component model comprises a two-way long-short term memory network and a conditional random field model.
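Claim 9 names a bidirectional LSTM feeding a conditional random field. The trained network is not reproducible from the claims, but the CRF decoding step is standard Viterbi search over per-token emission scores and a label-transition matrix, sketched below with illustrative scores (not the patent's parameters):

```python
from typing import List

def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring label sequence.
    emissions[t][y]  : score of label y at token t (e.g. BiLSTM output)
    transitions[a][b]: score of moving from label a to label b (CRF layer)."""
    n_labels = len(emissions[0])
    score = list(emissions[0])       # best score of any path ending in each label
    back: List[List[int]] = []       # backpointers, one row per later token
    for em in emissions[1:]:
        new_score, ptrs = [], []
        for y in range(n_labels):
            best_prev = max(range(n_labels),
                            key=lambda a: score[a] + transitions[a][y])
            new_score.append(score[best_prev] + transitions[best_prev][y] + em[y])
            ptrs.append(best_prev)
        score, back = new_score, back + [ptrs]
    y = max(range(n_labels), key=lambda l: score[l])  # best final label
    path = [y]
    for ptrs in reversed(back):      # trace backpointers to the first token
        y = ptrs[y]
        path.append(y)
    return path[::-1]

# Labels: 0 = no boundary, 1 = boundary (illustrative scores only)
em = [[2.0, 0.0], [0.0, 1.5], [1.0, 0.2]]
tr = [[0.5, 0.0], [0.0, -0.5]]      # negative score discourages repeated boundaries
best_path = viterbi_decode(em, tr)  # [0, 1, 0]
```

The transition matrix is what lets the CRF layer enforce sequence-level consistency that per-token softmax output cannot, e.g. penalizing two boundary labels in a row.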
10. The method of claim 1 or 2, wherein the extracting embedded features of text comprises:
segmenting the text to obtain character-level features;
performing feature embedding processing on the character-level features;
concatenating all character-level features subjected to the feature embedding processing to obtain connection features; and
extracting the embedded features of the text based on the connection features.
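The front end of claim 10 can be sketched as a character-level lookup-and-concatenate step. The embedding table and unknown-character vector below are hypothetical; in the patented system these values would come from a trained embedding layer:

```python
from typing import Dict, List

def embed_characters(text: str,
                     table: Dict[str, List[float]],
                     unk: List[float]) -> List[float]:
    """Split the text into characters, look up each character's embedding,
    and concatenate the per-character vectors into one connection-feature
    vector (the 'connection features' of claim 10)."""
    chars = list(text)                            # character-level segmentation
    vectors = [table.get(c, unk) for c in chars]  # per-character embedding lookup
    flat: List[float] = []
    for v in vectors:                             # concatenation step
        flat.extend(v)
    return flat

# Hypothetical two-dimensional embedding table:
table = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
features = embed_characters("ab?", table, unk=[0.5, 0.5])
```

Downstream, the claim's final step would pass this connection-feature vector through a sequence encoder to obtain the embedded features used by every component model.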
11. A language prosody boundary prediction device, comprising:
an extraction module for extracting embedded features of text;
a prediction module for predicting a task prosody boundary of a corresponding granularity based on the embedded features using each of at least two component models respectively, wherein at least one component model predicts the task prosody boundary of its corresponding granularity based also on the task prosody boundary predicted by at least one other component model, and the granularity of the task prosody boundary predicted by the at least one component model is larger than the granularity of the task prosody boundary predicted by the at least one other component model; wherein for each component model of the at least one component model, predicting, using the component model, a task prosody boundary of the corresponding granularity based on the embedded features comprises: concatenating the embedded features with the task prosody boundary predicted by the at least one other component model to obtain associated features of the corresponding granularity; extracting fusion features of the corresponding granularity based on the associated features of the corresponding granularity; and determining, using the component model, the task prosody boundary of the corresponding granularity of the text based on the fusion features of the corresponding granularity; and
a determination module for determining a final prosody boundary based at least on a task prosody boundary other than the task prosody boundary predicted by the at least one other component model.
12. A language prosodic boundary prediction system, comprising: a processor and a memory, wherein the memory has stored therein computer program instructions, wherein the computer program instructions, when executed by the processor, are for performing a method of language prosody boundary prediction according to any one of claims 1 to 10.
13. A storage medium on which program instructions are stored, the program instructions being operable when executed to perform a method of language prosody boundary prediction according to any one of claims 1 to 10.
CN201910492657.7A 2019-06-06 2019-06-06 Method, device, system and storage medium for predicting prosodic boundary of language Active CN110223671B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910492657.7A CN110223671B (en) 2019-06-06 2019-06-06 Method, device, system and storage medium for predicting prosodic boundary of language

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910492657.7A CN110223671B (en) 2019-06-06 2019-06-06 Method, device, system and storage medium for predicting prosodic boundary of language

Publications (2)

Publication Number Publication Date
CN110223671A CN110223671A (en) 2019-09-10
CN110223671B true CN110223671B (en) 2021-08-10

Family

ID=67816056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910492657.7A Active CN110223671B (en) 2019-06-06 2019-06-06 Method, device, system and storage medium for predicting prosodic boundary of language

Country Status (1)

Country Link
CN (1) CN110223671B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110767212B (en) * 2019-10-24 2022-04-26 百度在线网络技术(北京)有限公司 Voice processing method and device and electronic equipment
CN110782871B (en) 2019-10-30 2020-10-30 百度在线网络技术(北京)有限公司 Rhythm pause prediction method and device and electronic equipment
CN110853613B (en) * 2019-11-15 2022-04-26 百度在线网络技术(北京)有限公司 Method, apparatus, device and medium for correcting prosody pause level prediction
CN111226275A (en) * 2019-12-31 2020-06-02 深圳市优必选科技股份有限公司 Voice synthesis method, device, terminal and medium based on rhythm characteristic prediction
CN111339771B (en) * 2020-03-09 2023-08-18 广州深声科技有限公司 Text prosody prediction method based on multitasking multi-level model
CN112131878B (en) * 2020-09-29 2022-05-31 腾讯科技(深圳)有限公司 Text processing method and device and computer equipment
CN112364653A (en) * 2020-11-09 2021-02-12 北京有竹居网络技术有限公司 Text analysis method, apparatus, server and medium for speech synthesis
CN112863484B (en) * 2021-01-25 2024-04-09 中国科学技术大学 Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method
CN112802451B (en) * 2021-03-30 2021-07-09 北京世纪好未来教育科技有限公司 Prosodic boundary prediction method and computer storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
US7136816B1 (en) * 2002-04-05 2006-11-14 At&T Corp. System and method for predicting prosodic parameters
CN105185374A (en) * 2015-09-11 2015-12-23 百度在线网络技术(北京)有限公司 Prosodic hierarchy annotation method and device
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN106228977A (en) * 2016-08-02 2016-12-14 合肥工业大学 The song emotion identification method of multi-modal fusion based on degree of depth study
CN107578772A (en) * 2017-08-17 2018-01-12 天津快商通信息技术有限责任公司 Merge acoustic feature and the pronunciation evaluating method and system of pronunciation movement feature

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN101051458B (en) * 2006-04-04 2011-02-09 中国科学院自动化研究所 Rhythm phrase predicting method based on module analysis


Non-Patent Citations (1)

Title
Automatic prosody prediction for Chinese speech synthesis using BLSTM-RNN and embedding features; Chuang Ding et al.; 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU); 2016-02-11; pp. 98-100 *

Also Published As

Publication number Publication date
CN110223671A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN110223671B (en) Method, device, system and storage medium for predicting prosodic boundary of language
CN109344413B (en) Translation processing method, translation processing device, computer equipment and computer readable storage medium
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN111312245B (en) Voice response method, device and storage medium
Fu et al. CRNN: a joint neural network for redundancy detection
CN111274807B (en) Text information processing method and device, computer equipment and readable storage medium
CN114676234A (en) Model training method and related equipment
CN112837669B (en) Speech synthesis method, device and server
CN112634865B (en) Speech synthesis method, apparatus, computer device and storage medium
CN112329476A (en) Text error correction method and device, equipment and storage medium
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN115688937A (en) Model training method and device
CN114882862A (en) Voice processing method and related equipment
WO2022222757A1 (en) Method for converting text data into acoustic feature, electronic device, and storage medium
CN111241820A (en) Bad phrase recognition method, device, electronic device, and storage medium
CN113380223A (en) Method, device, system and storage medium for disambiguating polyphone
CN112559725A (en) Text matching method, device, terminal and storage medium
CN111968646A (en) Voice recognition method and device
CN116469374A (en) Speech synthesis method, device, equipment and storage medium based on emotion space
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
CN111445545B (en) Text transfer mapping method and device, storage medium and electronic equipment
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant