CN107039034B - Prosody prediction method and system

Info

Publication number: CN107039034B
Application number: CN201610084393.8A
Authority: CN
Inventors: 周明; 江源; 胡国平; 胡郁; 刘庆峰
Original assignee: iFlytek Co Ltd
Current assignee: iFlytek Co Ltd
Priority date: 2016-02-04
Filing date: 2016-02-04
Publication date: 2020-05-01
Anticipated expiration: 2036-02-04
Also published as: CN107039034A

Abstract

The invention discloses a prosody prediction method and system. The method comprises: constructing a text prosody model in advance by collecting text data that has corresponding voice data, performing automatic prosody labeling on the text data based on the prosodic information of the voice data to obtain automatically labeled text data, and training the text prosody model with the automatically labeled text data; receiving text data to be predicted; extracting text features of the text data to be predicted; and performing prosody prediction on the text data to be predicted by using the text features and the text prosody model. Because all of the collected text data has corresponding voice data, and the voice data carries prosodic information, the text data can be labeled automatically. This solves the problems of high labor cost and long time consumption in the prior art, in which the prosodic boundaries of all training text data must be labeled manually in order to train a text prosody model.

Description

Prosody prediction method and system

Technical Field

The invention relates to the field of natural language processing, in particular to a prosody prediction method and a prosody prediction system.

Background

Speech synthesis is an important part of language information processing. It refers to converting text into speech output, where the synthesized speech should be as natural and intelligible as possible. Prosody prediction mainly concerns the prediction of prosodic phrases in text data, that is, predicting the positions of the prosodic boundaries in the text. It is generally used in the front-end text processing of speech synthesis: once the prosodic boundary positions are predicted, corresponding pauses can be placed at those positions during synthesis, which improves the naturalness of the synthesized speech. It can also be used in natural language understanding, since different prosodic divisions can yield different semantic information. A prosodic phrase is a language unit with a stable prosodic pattern, formed by several characters or words in natural spoken communication; a pause usually occurs between adjacent prosodic phrases, which produces a suitable rhythm. An example is the text data "more like a heavy burden # pressing on one's heart.", where "#" is a prosodic boundary identifier used to mark a prosodic boundary.

Prosody prediction has long been an important task in text processing, especially in the front-end text processing of speech synthesis, where the prosodic boundary positions directly affect the naturalness of the synthesized speech. It also plays an important role in natural language understanding, because different prosodic divisions often yield different semantic information. For example, for the text "where can be kept for a long time", a different prosodic boundary division, "where can be kept for a long time", gives a meaning completely different from the original text. Accurately predicting the prosodic boundaries of text data is therefore important. Existing prosody prediction methods generally construct a text prosody model by a supervised method and use it to predict the prosody of text data. This has two drawbacks. First, the prosodic boundaries of all training text data must be labeled manually before the model can be trained, which is costly in labor and time. Second, the prior art uses a shallow learning model as the text prosody model; such a model has limited capacity to describe the data and tends to saturate once the training data reaches a certain scale, so a large amount of data cannot be used effectively to improve the prediction accuracy of the text prosody model.

Disclosure of Invention

The embodiment of the invention provides a prosody prediction method and a prosody prediction system, which are used for solving the problems of high labor cost and long time consumption caused by the fact that a text prosody model needs to be trained by manually marking prosody boundaries of all training text data when the model is constructed in the prior art.

Therefore, the embodiment of the invention provides the following technical scheme:

a prosody prediction method, comprising:

constructing a text prosody model in advance;

receiving text data to be predicted;

extracting text features of text data to be predicted;

performing prosody prediction on the text data to be predicted by using the text features and the text prosody model;

wherein constructing the text prosody model in advance comprises:

collecting text data having corresponding voice data;

performing automatic prosody annotation on the corresponding text data based on prosody information of the voice data to obtain automatically labeled text data;

and training the text prosody model by using the automatically labeled text data.

Preferably, the performing automatic prosody annotation on corresponding text data based on prosody information of the voice data, and acquiring the automatically annotated text data includes:

aligning the collected text data with the corresponding voice data;

inserting prosodic boundary identifiers into corresponding text data according to prosodic information of the voice data;

and filtering the incorrect prosodic boundary identifiers to obtain the automatic labeling text data.

Preferably, the method further comprises:

and deleting garbled characters and special characters in the text data before aligning the collected text data with the corresponding voice data.

Preferably, the inserting a prosody boundary identifier in the corresponding text data according to prosody information of the voice data includes:

taking a character as the unit, acquiring from the corresponding voice data the pause duration following each character in the text data;

inserting the prosodic boundary identifier after each character whose pause duration exceeds a preset threshold.

Preferably, the filtering the incorrect prosodic boundary identifiers and obtaining the automatically labeled text data includes:

carrying out endpoint detection on voice data, and filtering prosodic boundary identifiers according to endpoint detection results; and any one or more of the following:

filtering prosodic boundary identifiers according to the speech rate of the speaker;

filtering prosodic boundary identifiers according to the word boundary detection result;

and filtering prosodic boundary identifiers with the spacing smaller than a set threshold.

Preferably, the method further comprises:

collecting manually labeled text data in advance, wherein the manually labeled text data are text data with manually labeled prosodic boundary information;

after the text prosody model is trained by using the automatically labeled text data, retraining the resulting text prosody model by using the manually labeled text data, so as to optimize the text prosody model;

performing prosody prediction on the text data to be predicted by using the text features and the text prosody model comprises the following steps:

and performing prosody prediction on the text data to be predicted by utilizing the text characteristics and the optimized text prosody model.

Preferably, the text prosody model employs a deep neural network structure.

A prosody prediction system, comprising:

the model building module is used for building a text prosody model in advance;

the receiving module is used for receiving text data to be predicted;

the characteristic extraction module is used for extracting text characteristics of the text data to be predicted;

the prosody prediction module is used for carrying out prosody prediction on the text data to be predicted by utilizing the text characteristics and the text prosody model;

the model building module comprises:

a collecting unit for collecting text data having corresponding voice data;

the labeling unit is used for carrying out automatic prosody labeling on the corresponding text data based on the prosody information of the voice data to obtain automatically labeled text data;

and the training unit is used for training the text prosody model by using the automatically labeled text data.

Preferably, the labeling unit includes:

the alignment subunit is used for aligning the collected text data with the corresponding voice data;

the labeling subunit is used for inserting prosodic boundary identifiers into the corresponding text data according to the prosodic information of the voice data;

and the error filtering subunit is used for filtering the incorrect prosodic boundary identifier to acquire the automatic labeling text data.

Preferably, the system further comprises:

and the deleting module is connected with the model building module and is used for deleting garbled characters and special characters in the text data before aligning the collected text data with the corresponding voice data.

Preferably, the labeling subunit includes:

a pause duration acquiring function block for acquiring, in units of characters, the pause duration following each character in the text data from the corresponding voice data;

an identification function block for inserting a prosodic boundary identifier after each character whose pause duration exceeds a preset threshold.

Preferably, the error filtering subunit includes:

the end point error filtering function block is used for carrying out end point detection on the voice data and filtering a prosodic boundary identifier according to an end point detection result; and any one or more of the following functional blocks:

a speech rate error filtering function block for filtering prosodic boundary identifiers according to a speech rate of a speaker;

a word boundary error filtering function block for filtering prosodic boundary identifiers according to the word boundary detection result;

and the interval error filtering functional block is used for filtering the prosody boundary identifiers with intervals smaller than a set threshold.

Preferably, the system further comprises:

the manually labeled text data collection module is used for collecting manually labeled text data in advance, wherein the manually labeled text data are text data with manually labeled prosodic boundary information;

the optimization module is connected with the model building module and is used for, after the text prosody model is trained by using the automatically labeled text data, retraining that text prosody model by using the manually labeled text data to optimize the text prosody model;

the prosody prediction module is specifically used for performing prosody prediction on the text data to be predicted by using the text features and the optimized text prosody model.

According to the prosody prediction method and system provided by the embodiment of the invention, text data with corresponding voice data is collected, automatic prosody labeling is performed on the text data based on the prosodic information of the voice data to obtain automatically labeled text data, and the text prosody model is trained with the automatically labeled text data as training data to obtain the text prosody model parameters; text features are then extracted from the received text data to be predicted, and prosody prediction is finally performed by using the text features and the trained text prosody model. Because all of the collected text data has corresponding voice data, and the voice data carries prosodic information, the text data can be labeled automatically based on that prosodic information. This solves the problems of high labor cost and long time consumption in the prior art, in which the prosodic boundaries of all training text data must be labeled manually to train the text prosody model.

Further, according to the prosody prediction method and system provided by the embodiment of the invention, a small amount of manually labeled text data, that is, text data with manually labeled prosodic boundary information, can also be collected. After model training with the automatically labeled text data is completed, the model is trained again with the manually labeled text data to optimize the parameters of the text prosody model. This effectively removes the influence of incorrect prosodic boundaries in the automatically labeled text data on the model parameters, so that prosody prediction with the optimized text prosody model gives more accurate results.

Further, according to the prosody prediction method and system provided by the embodiment of the invention, the text prosody model adopts a deep neural network structure. A deep neural network has a stronger learning ability. Massive automatically labeled text data can be obtained through the automatic labeling method, and training the deep neural network on this data yields more stable model parameters; a small amount of manually labeled text data is then used to optimize the model parameters, giving an optimized prosody model. The prediction accuracy of the text prosody model can thereby be further improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.

FIG. 1 is a flow chart of a prior art method of prosody prediction;

FIG. 2 is a flow chart of a prosody prediction method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a prosody prediction system according to an embodiment of the present invention.

Detailed Description

In order to make those skilled in the art better understand the scheme of the embodiment of the present invention, an outline of the existing prosody prediction method is first introduced. In the prior art, a text prosody model is generally constructed by a supervised method, and prosody of text data is predicted to obtain a prosody prediction result, as shown in fig. 1, the method specifically includes:

firstly, a text prosody model is constructed by a supervised method. The training data used to construct the text prosody model by the supervised method must be fully labeled manually in advance to give the labeling results. For example, if the prosodic boundaries of the training data are labeled manually, the model can be constructed as follows: a large amount of text data is collected and its prosodic boundaries are labeled manually, and the manual labeling result is taken as the labeling feature of the text data, namely whether or not the current word is a prosodic boundary; after word segmentation is performed on the text data, text features are extracted, including the word surface form, word length and part of speech; the text prosody model is then constructed from the manual labeling features and the text features of the text data. The text prosody model is a shallow learning model, such as a Maximum Entropy Model (MEM) or a Conditional Random Field (CRF) model. During construction, the text features of the text data are used as the model input and the labeling features as the model output, and the model is trained to obtain the text prosody model;

then, text data to be predicted is received, word segmentation is performed on it, and its text features are extracted, wherein the text features comprise: word surface form, word length and part of speech;

and finally, taking the text characteristics of the text data to be predicted as the input characteristics of the text prosody model, and predicting prosody boundaries by using the text prosody model to obtain a prosody prediction result.
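
As a rough illustration of the text features named above (word surface form, word length and part of speech), the following Python sketch builds one feature record per segmented word. The input format (a list of word and part-of-speech pairs written in pinyin), the function name `extract_features`, the tag strings and the neighbouring-word context fields are assumptions made for illustration; the description does not specify them.

```python
from typing import Dict, List, Tuple


def extract_features(tagged: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Build one feature record per word from (word, part_of_speech) pairs.

    Words are given in pinyin with one syllable per character, so the word
    length in characters is taken as the number of syllables (an assumption).
    """
    features = []
    for i, (word, pos) in enumerate(tagged):
        features.append({
            "surface": word,                      # word surface form
            "length": len(word.split()),          # word length in characters
            "pos": pos,                           # part of speech
            # Context fields are a hypothetical addition, common in such models.
            "prev_word": tagged[i - 1][0] if i > 0 else "<s>",
            "next_word": tagged[i + 1][0] if i + 1 < len(tagged) else "</s>",
        })
    return features


if __name__ == "__main__":
    # Hypothetical segmentation and tags for the start of a sentence.
    for record in extract_features([("tiao li", "n"), ("yu", "p"), ("jin nian", "t")]):
        print(record)
```

Each record would then be encoded numerically and passed to the model, with the boundary label of the word as the training target.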

According to the prosody prediction method and system provided by the invention, the training corpus can be obtained by combining automatic labeling with manual labeling, and the text prosody model, which uses a deep neural network structure, is trained on this corpus. Specifically, a large amount of text data is labeled by the automatic labeling method and used to train the text prosody model; a small amount of manually labeled text data is then used to optimize the model parameters, which completes the model training process. Text data to be predicted is then received and segmented into words, and its text features are extracted, wherein the text features comprise: word surface form, word length, part of speech and word vector. Finally, prosody prediction is performed on the text to be predicted by using these text features and the trained text prosody model.

With this scheme, the prosodic boundaries of the training text data are labeled mainly by the automatic labeling method, with manual labeling as a supplement, which effectively reduces the manual labeling cost of the prior art. Because the automatic labeling method can label the prosodic boundaries of massive text data, massive training text data becomes available for training the text prosody model. The text prosody model is described by a deep neural network structure, which has a stronger learning ability: training the deep neural network on the massive automatically labeled data gives better model parameters, and a small amount of manually labeled data can then be used to fine-tune them, giving an optimized model. The prediction accuracy of the text prosody model can thereby be greatly improved.
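
The two-stage training schedule described above (train on a large automatically labeled set, then retrain on a small manually labeled set) can be sketched as follows. To keep the example self-contained, a logistic-regression classifier trained by stochastic gradient descent stands in for the deep neural network; this is a deliberate simplification, not the model of the invention. The feature vectors, the data sizes, the learning rates and the names `train` and `predict` are all hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

# (feature vector, label); label 1 means a prosodic boundary follows the unit.
Example = Tuple[Sequence[float], int]


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Boundary probability; the last entry of `weights` is the bias."""
    z = sum(w * xi for w, xi in zip(weights, x)) + weights[-1]
    return 1.0 / (1.0 + math.exp(-z))


def train(weights: Sequence[float], data: List[Example],
          lr: float, epochs: int, seed: int = 0) -> List[float]:
    """Run SGD on the logistic loss, starting from the given weights."""
    rng = random.Random(seed)
    w = list(weights)
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            x, y = data[i]
            grad = predict(w, x) - y
            for j, xj in enumerate(x):
                w[j] -= lr * grad * xj
            w[-1] -= lr * grad
    return w


# Stage 1: large automatically labeled set, including a few wrong labels.
auto_labeled = ([([1.0, 0.0], 1)] * 50 + [([0.0, 1.0], 0)] * 50
                + [([1.0, 0.0], 0)] * 5)
# Stage 2: small manually labeled set, assumed to be clean.
manual_labeled = [([1.0, 0.0], 1), ([0.0, 1.0], 0)] * 5

pretrained = train([0.0, 0.0, 0.0], auto_labeled, lr=0.1, epochs=5)
# Fine-tuning starts from the pretrained weights, here with a smaller step size.
finetuned = train(pretrained, manual_labeled, lr=0.01, epochs=5)

if __name__ == "__main__":
    print(predict(finetuned, [1.0, 0.0]), predict(finetuned, [0.0, 1.0]))
```

The point of the sketch is the hand-off: the second call starts from the parameters produced by the first rather than from scratch, so the small manual set only adjusts what the large automatic set has already learned.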

Embodiments of the invention may be implemented in hardware, firmware, software, or any combination thereof. Embodiments of the invention may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device), such as Read Only Memory (ROM), Random Access Memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.

The embodiments of the present invention will be described in further detail with reference to the drawings and the following description.

As shown in fig. 2, it is a flowchart of a prosody prediction method provided in an embodiment of the present invention, including the following steps:

step S01, a text prosody model is constructed in advance.

In this embodiment, constructing the text prosody model in advance includes: collecting text data having corresponding voice data; performing automatic prosody annotation on the corresponding text data based on prosody information of the voice data to obtain automatically labeled text data; and training the text prosody model by using the automatically labeled text data.

In practical application, text data and the voice data corresponding to it are collected first. The collected text data can be text data on a network that has corresponding voice data, such as the text of audio novels, the text of distance-education material with corresponding voice data, or the text of movie dialogue. Alternatively, the text data can be designed in advance and read aloud by different people for recording; for example, scripts and novels with rich prosody are selected and read by announcers or voice actors, and ordinary speakers may also be recorded reading the texts. Of course, it is also possible to collect the voice data first, through a network or by recording, and then obtain the corresponding text data by manual transcription or by speech recognition; the embodiment of the invention is not limited in this respect.

Furthermore, in order to improve the accuracy of the subsequent prosody prediction, the method provided by the invention can also collect manually labeled text data with manually labeled prosodic boundary information, so that the text prosody model can later be trained again with this data to optimize the model parameters. Because automatic labeling marks prosodic boundaries according to the prosodic information of the voice data corresponding to the text data, it may produce some incorrect prosodic boundaries; these affect the trained model parameters and therefore the accuracy of prosody prediction with the text prosody model. A small amount of manually labeled text data can be added to fine-tune the model parameters obtained by training on the automatically labeled text data, so that the accuracy of prosody prediction is further improved without greatly increasing the manual workload. Specifically, the text data may be text data on a network or text data designed in advance. After the text data is collected, its prosodic boundary positions, such as prosodic phrase boundary positions, are labeled manually. Taking the text data "more like a heavy burden pressing on one's heart" as an example, the text data after manual labeling of the prosodic boundary positions is "more like a heavy burden # pressing on one's heart", where "#" is the prosodic boundary identifier. The prosodic boundary identifier can of course be represented in other ways, which is not limited here.

Automatic prosody labeling labels the prosody of text data according to the prosodic information of the voice data corresponding to that text data. A prosodic boundary has an obvious acoustic manifestation, for example a pause (that is, the length of a silent segment in the voice data). Specifically, automatic prosody labeling sets some rules in advance, uses them to determine from the prosodic information of the voice data whether each character in the text data is followed by a prosodic boundary, and marks a prosodic boundary wherever the rules are satisfied.

In this embodiment, performing automatic prosody labeling on the corresponding text data based on the prosodic information of the voice data to obtain the automatically labeled text data may include: aligning the collected text data with the corresponding voice data; inserting prosodic boundary identifiers into the corresponding text data according to the prosodic information of the voice data; and filtering out the incorrect prosodic boundary identifiers to obtain the automatically labeled text data. Specifically, inserting prosodic boundary identifiers into the corresponding text data according to the prosodic information of the voice data includes: taking a character as the unit, acquiring from the voice data the pause duration following each character in the text data; and inserting a prosodic boundary identifier after each character whose pause duration exceeds a preset threshold. Filtering out the incorrect prosodic boundary identifiers to obtain the automatically labeled text data includes step a, together with any one or more of steps b to d:

step a: performing endpoint detection on the voice data, and filtering prosodic boundary identifiers according to the endpoint detection result;

step b: filtering prosodic boundary identifiers according to the speech rate of the speaker;

step c: filtering prosodic boundary identifiers according to the word boundary detection result;

step d: filtering prosodic boundary identifiers whose spacing is smaller than a set threshold.

Step a, performing endpoint detection on the voice data and filtering prosodic boundary identifiers according to the detection result, is the basic step of filtering out incorrect prosodic boundary identifiers. If characters are missing from the text data relative to the voice data, that is, some pronunciations in the voice data have no corresponding characters in the text data, the missing characters are treated as silent segments during alignment, so the pause at the position of the missing characters appears longer. Whether such a pause is reasonable therefore has to be checked against the endpoint detection result. Specifically, endpoint detection is first performed on the voice data to find the starting time of each speech segment in the voice data, and the endpoint detection result is compared with the alignment result. For each prosodic boundary identifier with a long pause duration in the alignment result, it is then determined whether the endpoint detection result also shows a long pause at that position. If so, the prosodic boundary is considered correct; otherwise it is considered incorrect, and the prosodic boundary identifier at that boundary is deleted.
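
The check in step a can be sketched as follows, assuming the endpoint detector has already produced a list of silent intervals (the gaps between detected speech segments) in frames. The `Boundary` record, the function name `filter_by_endpoints`, the `long_pause` cutoff that decides which pauses count as long, and the `min_overlap` ratio are hypothetical choices for illustration; the description leaves these details open.

```python
from typing import List, NamedTuple, Tuple


class Boundary(NamedTuple):
    index: int   # the identifier follows the text unit with this index
    start: int   # frame at which the aligned pause starts
    end: int     # frame at which the aligned pause ends


def filter_by_endpoints(boundaries: List[Boundary],
                        silences: List[Tuple[int, int]],
                        long_pause: int = 20,
                        min_overlap: float = 0.5) -> List[Boundary]:
    """Keep a long aligned pause only if endpoint detection also found silence there."""
    kept = []
    for b in boundaries:
        duration = b.end - b.start
        if duration < long_pause:
            kept.append(b)  # short pauses are not checked in this step
            continue
        # Number of frames of the aligned pause that fall inside detected silence.
        overlap = sum(max(0, min(b.end, s_end) - max(b.start, s_start))
                      for s_start, s_end in silences)
        if overlap >= min_overlap * duration:
            kept.append(b)
    return kept


if __name__ == "__main__":
    candidates = [Boundary(0, 100, 113), Boundary(3, 300, 325), Boundary(5, 400, 425)]
    detected_silences = [(398, 430)]  # hypothetical endpoint detection output
    print(filter_by_endpoints(candidates, detected_silences))
```

In this hypothetical input, the 25-frame pause after unit 3 has no matching silence in the audio, which is the signature of a character that is spoken but missing from the text, so its identifier is dropped.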

Step b, filtering prosodic boundary identifiers according to the speech rate of the speaker, is a preferred step that can reduce the error rate of the automatic labeling result. In specific filtering, the voice data corresponding to the text data (for example, one or more segments of it) can be listened to in advance, and a pause duration threshold range for that speaker is set according to the speaker's speech rate and used to filter the prosodic boundaries; for example, the pause duration threshold range is set to between 4 frames and 40 frames. A prosodic boundary whose pause duration is within the threshold range is judged to be a correct prosodic boundary; a prosodic boundary whose pause duration is outside the threshold range is judged to be an incorrect prosodic boundary, and the prosodic boundary identifier marked at that pause position is deleted.
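
A minimal sketch of the range check in step b is given below. The 4-frame and 40-frame bounds come from the example in the text; treating the bounds as inclusive, the tuple layout and the function name `filter_by_pause_range` are assumptions, and the input units are placeholders rather than data from the embodiment.

```python
from typing import List, Tuple

# (text unit, pause after it in frames, whether an identifier was inserted after it)
Unit = Tuple[str, int, bool]


def filter_by_pause_range(units: List[Unit], min_frames: int = 4,
                          max_frames: int = 40, marker: str = "#") -> str:
    """Keep an identifier only when its pause lies within [min_frames, max_frames]."""
    pieces = []
    for text, pause, has_boundary in units:
        pieces.append(text)
        if has_boundary and min_frames <= pause <= max_frames:
            pieces.append(marker)
    return " ".join(pieces)


if __name__ == "__main__":
    units = [("a", 13, True), ("b", 3, True), ("c", 55, True), ("d", 8, True), ("e", 0, False)]
    print(filter_by_pause_range(units))
```

With these placeholder values the identifiers after "b" (3 frames, too short) and "c" (55 frames, too long) are removed, while those after "a" and "d" are kept.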

Step c, filtering prosodic boundary identifiers according to the word boundary detection result, is also a preferred step for reducing the error rate of the automatic labeling result. There are two types of word boundaries: grammatical word boundaries and prosodic word boundaries. A grammatical word is a fixed combination of characters listed in a dictionary, and a prosodic word is a group of characters that are pronounced continuously in speech. Prosodic boundaries generally do not appear inside either kind of word, so in specific filtering, prosodic boundary identifiers that appear inside grammatical words or prosodic words are deleted.
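
The rule in step c can be illustrated with character offsets. In the sketch below a boundary at offset k sits between character k-1 and character k, and both segmentations are given as lists of words that cover the same character sequence. The offset convention, the function names and the placeholder strings are assumptions for illustration; how the grammatical and prosodic segmentations are produced is outside the sketch.

```python
from typing import Iterable, List, Set


def inner_offsets(words: Iterable[str]) -> Set[int]:
    """Offsets that fall strictly inside a word, i.e. not at a word boundary."""
    inside = set()
    offset = 0
    for word in words:
        inside.update(offset + k for k in range(1, len(word)))
        offset += len(word)
    return inside


def filter_by_word_boundaries(boundaries: Iterable[int],
                              grammatical_words: List[str],
                              prosodic_words: List[str]) -> List[int]:
    """Drop every boundary that lies inside a grammatical word or a prosodic word."""
    blocked = inner_offsets(grammatical_words) | inner_offsets(prosodic_words)
    return sorted(b for b in boundaries if b not in blocked)


if __name__ == "__main__":
    # Placeholder text "abcdefg" with two hypothetical segmentations.
    grammatical = ["ab", "c", "de", "fg"]
    prosodic = ["abc", "defg"]
    print(filter_by_word_boundaries({1, 3, 5}, grammatical, prosodic))
```

Only a boundary that coincides with a word boundary in both segmentations survives, which matches the statement that prosodic boundaries do not appear inside either kind of word.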

Step d, filtering prosodic boundary identifiers whose spacing is smaller than a set threshold, is also a preferred step, since prosodic boundaries typically do not occur close together. In this embodiment, if two prosodic boundary identifiers are close to each other, that is, only a few characters apart (for example, within 3 characters), one of them can be filtered out directly; in specific filtering, the prosodic boundary identifier with the shorter pause duration can be deleted.
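
One possible reading of step d is a single left-to-right pass that, whenever two neighbouring identifiers are closer than the threshold, keeps the one with the longer pause. The sketch below implements that reading; the greedy pass, the strict comparison against `min_gap` and the name `filter_by_spacing` are assumptions, since the text only says that the identifier with the shorter pause can be deleted.

```python
from typing import List, Tuple

# (character offset of the identifier, pause duration in frames)
Mark = Tuple[int, int]


def filter_by_spacing(boundaries: List[Mark], min_gap: int = 3) -> List[Mark]:
    """Among identifiers fewer than `min_gap` characters apart, keep the longer pause."""
    kept: List[Mark] = []
    for pos, pause in sorted(boundaries):
        if kept and pos - kept[-1][0] < min_gap:
            # Too close to the previously kept identifier: keep the longer pause.
            if pause > kept[-1][1]:
                kept[-1] = (pos, pause)
        else:
            kept.append((pos, pause))
    return kept


if __name__ == "__main__":
    marks = [(2, 13), (3, 5), (9, 25), (11, 8), (16, 9)]  # hypothetical input
    print(filter_by_spacing(marks))
```

In this input the identifiers at offsets 3 and 11 each sit within three characters of a neighbour with a longer pause, so they are the ones removed.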

It should be noted that steps b to d may be used together or in part, and there is no special requirement on the order in which they are executed. In addition, other methods can be used to filter out incorrect prosodic boundary identifiers, and this scheme is not limited in that respect. For example, a pre-trained prosodic boundary prediction model can be used to predict the prosodic boundaries and obtain a prediction probability for each prosodic boundary identifier, and the prosodic boundary identifiers with low prediction probability are deleted; the training method of the prosodic boundary prediction model is the same as in the prior art and is not described in detail here. Which method or methods are used to filter out incorrect prosodic boundary identifiers can be decided according to the effect in use. After filtering is finished, the positions of the remaining prosodic boundary identifiers are taken as correct prosodic boundaries, which gives the automatically labeled text data.
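
The probability-based alternative mentioned above reduces to a threshold on the score that a pre-trained boundary model assigns to each identifier. In the sketch below the model is replaced by a lookup table of made-up scores, and the 0.5 cutoff and the name `filter_by_probability` are hypothetical; the text only says that identifiers with low prediction probability are deleted.

```python
from typing import Callable, Iterable, List


def filter_by_probability(positions: Iterable[int],
                          boundary_prob: Callable[[int], float],
                          min_prob: float = 0.5) -> List[int]:
    """Keep only the identifiers whose predicted boundary probability is high enough."""
    return [p for p in positions if boundary_prob(p) >= min_prob]


if __name__ == "__main__":
    # Stand-in for a pre-trained prosodic boundary prediction model.
    scores = {2: 0.91, 5: 0.12, 9: 0.67}
    print(filter_by_probability(sorted(scores), lambda p: scores[p]))
```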

Further, in order to facilitate specific operations such as inserting prosodic boundary identifiers, the method further includes: before aligning the collected text data with the corresponding voice data, deleting garbled characters and special characters from the text data; prosodic boundary identifiers are then inserted after the corresponding characters of the text data according to the prosodic information of the voice data.

In practical application, if the text data and the voice data are long, they can first be divided into short segments of text data and voice data before alignment. When the text data and the voice data are aligned, a dynamic programming method can be used to search for the optimal matching path between them; aligning the two along this path gives the time boundaries of the text data on the voice data, and hence the duration of each character in the text data and the pause duration between characters. Then, taking a character as the unit, a prosodic boundary identifier is inserted at each position where the pause duration between characters exceeds a preset threshold. For example, if the pause duration threshold is set to 2 frames and the pause between two characters is longer than 2 frames, a prosodic boundary identifier is inserted between the two characters; otherwise, no prosodic boundary identifier is inserted.
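
The last part of this procedure, going from time boundaries to inserted identifiers, can be sketched as follows. The alignment itself is assumed to have been done already and to yield a start frame and an end frame per character; the tuple layout, the function names and the placeholder characters are hypothetical. The 2-frame default mirrors the example threshold in the text.

```python
from typing import List, Tuple

# (character, start frame, end frame) as produced by a prior alignment step
Span = Tuple[str, int, int]


def pauses_after(spans: List[Span]) -> List[Tuple[str, int]]:
    """Pause after each character: gap between its end and the next character's start."""
    result = []
    for i, (char, _start, end) in enumerate(spans):
        if i + 1 < len(spans):
            pause = max(0, spans[i + 1][1] - end)
        else:
            pause = 0  # nothing follows the last character
        result.append((char, pause))
    return result


def insert_boundaries(spans: List[Span], threshold_frames: int = 2,
                      marker: str = "#") -> str:
    """Insert the marker after every character whose following pause exceeds the threshold."""
    pieces = []
    for char, pause in pauses_after(spans):
        pieces.append(char)
        if pause > threshold_frames:
            pieces.append(marker)
    return " ".join(pieces)


if __name__ == "__main__":
    spans = [("a", 0, 10), ("b", 11, 20), ("c", 26, 35), ("d", 35, 44)]
    print(insert_boundaries(spans))
```

Here only the 6-frame gap between "b" and "c" exceeds the 2-frame threshold, so a single identifier is inserted after "b".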

In a specific embodiment, the automatic labeling process is described by taking as an example the text data "the rule is formally implemented on October 1 of this year" and the corresponding voice data "the rule is formally implemented on October 11 of this year"; after automatic labeling, the incorrect prosodic boundary identifiers are filtered using steps a to d. First, the text data "the rule is formally implemented on October 1 of this year" is aligned with the voice data and, taking a character as the unit, a prosodic boundary identifier is added after each character whose pause duration exceeds the threshold, giving the aligned text data. Written in pinyin, the aligned text data is "tiao li sp(13) yu sp(5) jin nian sp(3) shi yue sp(25) yi sp(8) hao sp(25) zheng sp(3) shi sp(9) shi", and the automatic labeling result corresponding to the text data is "rule # in october # of this year # october # 1 # and # positive # implementation". Here sp denotes a prosodic boundary identifier marked character by character after alignment and has the same function as the identifier "#"; the number in parentheses is the duration of each pause, expressed as a number of frames. The threshold can be determined from experience, the speaker's speech rate and/or the effect in actual use; the threshold used in this embodiment is 2 frames, but a larger threshold, for example 5 frames, may also be used. The incorrect prosodic boundary identifiers are then filtered.
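
The pinyin alignment string in this example can be processed mechanically. The sketch below parses it into pairs of text unit and pause duration and then places "#" wherever the pause exceeds a chosen threshold. The string is copied from the example, with the syllable spelled "nian"; the parsing approach and the names `parse_alignment` and `label` are assumptions, and the output is a pinyin rendering rather than the labeled text given in the embodiment.

```python
import re
from typing import List, Tuple

ALIGNED = ("tiao li sp(13) yu sp(5) jin nian sp(3) shi yue sp(25) "
           "yi sp(8) hao sp(25) zheng sp(3) shi sp(9) shi")


def parse_alignment(line: str) -> List[Tuple[str, int]]:
    """Split 'text sp(N) text sp(N) ... text' into (text, pause_frames) pairs."""
    # With one capture group, re.split returns [text, N, text, N, ..., text].
    parts = re.split(r"\s*sp\((\d+)\)\s*", line.strip())
    chunks = parts[0::2]
    pauses = [int(p) for p in parts[1::2]]
    pauses += [0] * (len(chunks) - len(pauses))  # no pause recorded after the last unit
    return [(chunk, pause) for chunk, pause in zip(chunks, pauses) if chunk]


def label(pairs: List[Tuple[str, int]], threshold_frames: int = 2) -> str:
    """Render the text with '#' after each unit whose pause exceeds the threshold."""
    return " ".join(text + (" #" if pause > threshold_frames else "")
                    for text, pause in pairs)


if __name__ == "__main__":
    pairs = parse_alignment(ALIGNED)
    print(label(pairs, threshold_frames=2))
    print(label(pairs, threshold_frames=5))
```

With the 2-frame threshold every recorded pause produces an identifier, while the 5-frame threshold drops the identifiers at the 3-frame and 5-frame pauses, which shows how the threshold choice changes the number of candidates passed on to filtering.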

For step a: the text data is missing the character "shi" (ten) after the character "yue" (month), so the pause after "yue" appears long when the text data is aligned with the voice data. If endpoint detection on the voice data finds no long pause at that position, the pause identifier after "yue" is an erroneous pause identifier and is deleted. The text data after filtering the prosodic boundary identifiers through step a is "tiao li # yu # jin nian # shi yue yi # hao # zheng # shi # shi".

For step b: one or more segments of voice data are listened to in advance, from which the speaker's pause durations are found to range from 4 frames to 40 frames. The duration of each pause is obtained from the alignment result of the text data and the voice data, and the prosodic boundary identifiers whose pause duration falls outside the range of 4 to 40 frames are deleted. The text data after filtering the incorrect prosodic boundary identifiers is "tiao li # yu # jin nian shi yue yi # hao # zheng shi # shi".
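
Step b can be sketched as a range check on each remaining identifier. The sketch assumes that the identifiers treated as errors are those whose pause lies outside the speaker's range; the function name `filter_by_pause_range`, the use of character indices to denote "identifier after character i", and the zero pauses are hypothetical choices made for illustration.

```python
from typing import List, Set


def filter_by_pause_range(pauses: List[int], boundaries: Set[int],
                          min_frames: int = 4, max_frames: int = 40) -> Set[int]:
    """Keep boundary i (placed after character i) only if its pause lies in the speaker's range."""
    return {i for i in boundaries if min_frames <= pauses[i] <= max_frames}


# Pause after each of the 12 pinyin syllables, in frames (0 = assumed, no pause reported).
pauses = [0, 13, 5, 0, 3, 0, 25, 8, 25, 3, 9, 0]

# Identifiers still present once the one after "yue" (index 6) has been removed by step a.
after_step_a = {1, 2, 4, 7, 8, 9, 10}

after_step_b = filter_by_pause_range(pauses, after_step_a)
```

Under these assumptions the two 3-frame pauses (after "nian" and after "zheng") are the only ones removed.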

For step c: the grammatical-word segmentation result of the text "the regulations are formally implemented on October 1st of this year" is obtained according to the word segmentation dictionary, and the prosodic-word segmentation result is obtained according to the prosodic model. Because prosodic boundaries do not appear inside grammatical words or prosodic words, the prosodic boundary identifiers inside grammatical words and prosodic words are filtered out, and the text data obtained after filtering the incorrect prosodic boundary identifiers is "tiao li # yu jin nian shi yue yi hao # zheng shi shi".
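
A minimal sketch of step c follows, assuming each segmentation is given as a list of word lengths in characters. The function names and both example segmentations are hypothetical; the source does not give the actual grammatical-word or prosodic-word segmentation of the example sentence.

```python
from typing import List, Set


def word_final_positions(word_lengths: List[int]) -> Set[int]:
    """Return the index of the last character of every word."""
    ends: Set[int] = set()
    pos = -1
    for n in word_lengths:
        pos += n
        ends.add(pos)
    return ends


def filter_inside_words(boundaries: Set[int], *segmentations: List[int]) -> Set[int]:
    """Keep a boundary only if it coincides with a word end in every segmentation."""
    kept = set(boundaries)
    for lengths in segmentations:
        kept &= word_final_positions(lengths)
    return kept


# Hypothetical segmentations of a 12-character sentence (word lengths in characters).
grammatical_words = [2, 1, 2, 2, 2, 2, 1]
prosodic_words = [2, 7, 3]

remaining = filter_inside_words({1, 2, 7, 8, 10}, grammatical_words, prosodic_words)
```

Intersecting with the word-final positions of each segmentation removes every identifier that would split a grammatical word or a prosodic word.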

For step d: it can be preset that prosodic boundary identifiers lying within a specified number of characters of each other (for example, 4 characters) must be filtered. In the text data, the prosodic boundary identifier after "tiao li" and the prosodic boundary identifier after "yu" are separated by only one character, which is smaller than the set threshold, so one of the two identifiers must be filtered out.
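
Step d can be sketched as a spacing check. The source does not say which of two close identifiers is removed; the sketch below assumes, purely for illustration, a left-to-right pass that keeps the earlier identifier and drops the later one. The function name `filter_close_boundaries` is hypothetical.

```python
from typing import List, Set


def filter_close_boundaries(boundaries: Set[int], min_gap: int = 4) -> List[int]:
    """Drop any boundary that lies fewer than `min_gap` characters after the last kept one.

    Assumption: of two boundaries that are too close, the earlier one is kept.
    """
    kept: List[int] = []
    for pos in sorted(boundaries):
        if not kept or pos - kept[-1] >= min_gap:
            kept.append(pos)
    return kept


# Identifier after "tiao li" (index 1) and after "yu" (index 2): one character apart.
pair = filter_close_boundaries({1, 2})
longer = filter_close_boundaries({1, 2, 7, 8, 10})
```

Other tie-breaking rules, such as keeping the identifier with the longer pause, would fit the same description equally well.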

It should be noted that the above example is extreme: the pause duration threshold is not optimized, and the pronunciation in the voice data contains many prosody errors. Voice data collected in actual use usually does not present such extreme cases, and the automatically labeled text data obtained after filtering the incorrect prosodic boundary identifiers through steps b to d already conforms to the prosodic patterns of human speech. Any one or more of steps b to d may be selected to filter the incorrect prosodic boundary identifiers according to the effect in practical use; error filtering may of course also be based on other suitable rules, for example rules that take part of speech or syntactic structure into account, which is not limited herein.

The automatically labeled text data is obtained through the above steps. The text prosody model is then trained with the automatically labeled text data, specifically as follows:

First, text features of the automatically labeled text data are extracted. The text features comprise the word surface form, word length, part of speech, and word vector; for example, word segmentation is performed on the text data, and the text features are extracted in units of words. The word surface form is the word itself, for example, the word surface form of "regulations" is "regulations"; the part of speech represents the grammatical role of a word, such as noun or adjective; the word length is the number of single characters contained in the word; the word vector maps a word to a low-dimensional real-valued vector, and the specific mapping method is the same as in the prior art and is not described in detail here. In particular, when manually labeled text data is also obtained, the text features of the manually labeled text data are extracted as well.
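
The four text features can be collected per word as sketched below. The sketch assumes that word segmentation and part-of-speech tagging have already been done by some upstream tool; the names `WordFeatures` and `extract_features`, the tags "n" and "p", the zero-vector fallback for unknown words, and the toy vector values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class WordFeatures:
    surface: str         # the word itself
    length: int          # number of single characters in the word
    pos: str             # part-of-speech tag
    vector: List[float]  # low-dimensional real-valued word vector


def extract_features(tagged_words: Sequence[Tuple[Sequence[str], str]],
                     embeddings: Dict[str, List[float]],
                     dim: int) -> List[WordFeatures]:
    """Build one feature record per segmented word.

    Each word is given as its sequence of characters (pinyin syllables here)
    together with its part-of-speech tag.
    """
    unknown = [0.0] * dim  # assumed fallback for words without a vector
    features = []
    for chars, pos in tagged_words:
        surface = " ".join(chars)
        vector = list(embeddings.get(surface, unknown))
        features.append(WordFeatures(surface, len(chars), pos, vector))
    return features


tagged = [(["tiao", "li"], "n"), (["yu"], "p")]
embeddings = {"tiao li": [0.12, -0.40, 0.05]}
features = extract_features(tagged, embeddings, dim=3)
```

Before being fed to a model, the surface form and part of speech would normally be encoded numerically as well; that encoding is left out of this sketch.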

Then, model training is performed using the text features of the automatically labeled text data. The text prosody model may be a shallow learning model, such as MEM or CRF, or a deep learning model, that is, a deep neural network model such as a DNN, an RNN, or a CNN; preferably, the present invention employs a recurrent neural network structure as the structure of the text prosody model. Because a deep neural network has better learning ability and massive training text data can be obtained by the automatic labeling method, training the deep neural network with this data yields more stable model parameters, which effectively improves accuracy when the text prosody model is used for prosody prediction. Specifically, the recurrent neural network structure includes an input layer, a hidden layer, and an output layer, and the layers are connected to each other by connection weights (i.e., the model parameters). The hidden layer may include a combination of multiple bidirectional recurrent layers and a feedforward layer, where a recurrent layer is one in which connections exist between nodes of the same layer. During training, the text features of the automatically labeled text data are used as the input of the recurrent neural network, and the prosodic boundary labeling result of the automatically labeled text data, namely whether or not the current word is a prosodic boundary, is used as the output; the model parameters are obtained when training is finished.
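
To make the layer structure concrete, the following is a minimal forward-pass sketch of a single bidirectional recurrent layer followed by a sigmoid output that scores each word as boundary or non-boundary. It is an illustration only: the weights are random and untrained, the layer sizes, the tanh activation, and all names (`recurrent_pass`, `boundary_probabilities`, `random_params`) are assumptions, and the training procedure is not shown.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[Vector]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def recurrent_pass(inputs: List[Vector], w_in: Matrix, w_rec: Matrix) -> List[Vector]:
    """Run one recurrent layer over the sequence and return its hidden states."""
    hidden = [0.0] * len(w_rec)
    states = []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(w_in, x), matvec(w_rec, hidden))]
        hidden = [math.tanh(p) for p in pre]
        states.append(hidden)
    return states


def boundary_probabilities(inputs: List[Vector], params: Dict) -> List[float]:
    """Combine forward and backward states and score each word with a sigmoid."""
    forward = recurrent_pass(inputs, params["w_in_f"], params["w_rec_f"])
    backward = recurrent_pass(inputs[::-1], params["w_in_b"], params["w_rec_b"])[::-1]
    probs = []
    for h_f, h_b in zip(forward, backward):
        z = params["b_out"] + sum(w * h for w, h in zip(params["w_out"], h_f + h_b))
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


def random_params(n_in: int, n_hidden: int, seed: int = 0) -> Dict:
    """Create small random (untrained) connection weights."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w_in_f": mat(n_hidden, n_in), "w_rec_f": mat(n_hidden, n_hidden),
            "w_in_b": mat(n_hidden, n_in), "w_rec_b": mat(n_hidden, n_hidden),
            "w_out": [rng.uniform(-0.1, 0.1) for _ in range(2 * n_hidden)],
            "b_out": 0.0}


# Five words, each represented by a toy 3-dimensional feature vector.
sentence = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.0],
            [0.2, 0.1, 0.9], [0.3, 0.3, 0.3]]
probs = boundary_probabilities(sentence, random_params(n_in=3, n_hidden=4))
```

Because the backward pass reads the sentence from the end, the score for each word depends on the words both before and after it, which is the property a bidirectional recurrent layer adds over a feedforward one.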

In particular, when the text features of manually labeled text data have been extracted, they are used to train the model again, starting from the text prosody model trained with the automatically labeled data, so as to optimize the parameters of the text prosody model. This effectively removes the influence of incorrect prosodic boundaries in the automatically labeled text data on the model parameters, so that the results obtained are more accurate when the optimized text prosody model is used for prosody prediction. Specifically, the text features of the manually labeled text data are used as the input of the deep neural network, and the prosodic boundary result of the manually labeled text data, namely whether or not the current word is a prosodic boundary, is used as the output; the model parameters obtained by training on the text features of the automatically labeled text data are used as the initial values of the model parameters to be optimized. Model training is then carried out, and the optimized prosody model is obtained when training is finished.
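
The two-stage training scheme can be illustrated with a deliberately tiny stand-in model. The sketch below uses a logistic classifier rather than the deep neural network of the embodiment, and the toy feature vectors, labels, learning rates, and the names `predict` and `train` are all hypothetical; the only point shown is that the second training run starts from the parameters learned in the first.

```python
import math
from typing import List, Optional, Sequence, Tuple

Sample = Tuple[List[float], int]  # (feature vector, 1 = boundary / 0 = not a boundary)


def predict(weights: Sequence[float], x: Sequence[float]) -> float:
    """Probability that the current word is a prosodic boundary."""
    return 1.0 / (1.0 + math.exp(-sum(w * xi for w, xi in zip(weights, x))))


def train(samples: Sequence[Sample], init: Optional[Sequence[float]] = None,
          epochs: int = 100, lr: float = 0.5) -> List[float]:
    """Stochastic gradient training; `init` supplies the starting parameters."""
    weights = list(init) if init is not None else [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, y in samples:
            error = y - predict(weights, x)
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Stage 1: large, noisy automatically labeled set (toy data; the last label is "wrong").
auto_labeled = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1)]
pretrained = train(auto_labeled)

# Stage 2: small manually labeled set, starting from the stage-1 parameters.
manual = [([1.0, 1.0], 0)]
optimized = train(manual, init=pretrained, epochs=20, lr=0.1)
```

Starting the second run from `pretrained` instead of from zeros is what lets a small amount of manually labeled data correct the model without discarding what was learned from the automatically labeled data.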

In step S02, text data to be predicted is received.

In the present invention, the text data may be Chinese characters, English letters, phonetic symbols, etc.

In step S03, text features of the text data to be predicted are extracted.

In this embodiment, word segmentation may first be performed on the text data to be predicted, and the text features of each segmented word are then extracted according to the word segmentation result. The text features include the word surface form, part of speech, word length, and word vector.

In one embodiment, the text data "become much dimmer at once" is used as an example. First, word segmentation is performed on the text data, giving the word segmentation result "at once / become / dim / (particle) / much"; then the text features of each segmented word are extracted: word surface form, part of speech, word length, and word vector. The extracted text features are shown in Table 1:

Table 1: text features extracted from the text data

The first column is the word surface form; the second column is the part of speech, where the letters identify specific parts of speech: m denotes a numeral, v a verb, u an auxiliary word, and a an adjective; the third column is the word length; and the fourth column is the word vector corresponding to the current word.

In step S04, prosody prediction is performed on the text data to be predicted by using the text features and the text prosody model.

In this embodiment, the text features are used as the input of the text prosody model, and the model predicts the prosodic boundaries of the text data to obtain a prosodic boundary prediction result, in which, for example, 0 represents a non-phrase boundary and 1 represents a phrase boundary. Taking the text data "become much dimmer at once" as an example, the prosodic boundary prediction result is output as: at once/0 become/1 dim/0 (particle)/0 much/0. A prosodic boundary identifier is then added at each predicted prosodic boundary to obtain the prosody prediction result of the text data to be predicted: "at once become # dim (particle) much".
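
Turning the per-word 0/1 predictions into marked text is a simple post-processing step, sketched below. The function name `insert_markers` is hypothetical, and the word list uses English glosses of the example sentence, with "(particle)" standing for the auxiliary word.

```python
from typing import List


def insert_markers(words: List[str], labels: List[int], marker: str = "#") -> str:
    """Append the boundary marker after every word predicted as a phrase boundary (label 1)."""
    if len(words) != len(labels):
        raise ValueError("words and labels must have the same length")
    out: List[str] = []
    for word, label in zip(words, labels):
        out.append(word)
        if label == 1:
            out.append(marker)
    return " ".join(out)


words = ["at once", "become", "dim", "(particle)", "much"]
labels = [0, 1, 0, 0, 0]
result = insert_markers(words, labels)
```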

The prosody prediction method provided by the embodiment of the invention constructs a text prosody model in advance, specifically by: collecting text data with corresponding voice data, performing automatic prosody labeling on the corresponding text data based on prosody information of the voice data to obtain automatically labeled text data, and training the text prosody model with the automatically labeled text data. Text features of the text data to be predicted are then extracted, and prosody prediction is performed on the text data to be predicted by using the text features and the text prosody model. Because the training data of the text prosody model is obtained by automatically labeling the prosody of the corresponding text data based on the prosody information of the voice data, no manual labeling is needed in this process, which solves the prior-art problems of high labor cost and long time consumption caused by having to manually label the prosodic boundaries of all training text data in order to train a text prosody model. In addition, starting from the prosody model trained with automatically labeled data, a small amount of manually labeled data is used to optimize the prosody model, which effectively improves the accuracy of prosody prediction with the text prosody model provided by the invention.

Accordingly, an embodiment of the present invention further provides a prosody prediction system; a schematic structural diagram of the system is shown in Fig. 3.

In this embodiment, the system includes:

a model building module 301, configured to pre-build a text prosody model;

a receiving module 402, configured to receive text data to be predicted;

a feature extraction module 403, configured to extract text features of text data to be predicted;

a prosody prediction module 404, configured to perform prosody prediction on the text data to be predicted by using the text feature and the text prosody model;

the model building module 301 includes:

a collecting unit for collecting text data having corresponding voice data;

In the present invention, the labeling unit includes:

Further, in order to facilitate the system to perform automatic prosody labeling on the text data, the system further comprises:

a deleting module 505, connected to the model building module 301, configured to delete garbled characters and special characters in the text data before the collected text data is aligned with the corresponding voice data.

Specifically, the labeling subunit includes:

Furthermore, the error filtering subunit includes:

In order to further improve the accuracy of prosody prediction by using the text prosody model provided by the invention, the system further comprises:

a manually labeled text data collection module 606, configured to collect manually labeled text data in advance, where the manually labeled text data is text data with manually labeled prosodic boundary information;

an optimizing module 607, connected to the model building module 301, configured to, after the text prosody model has been trained with the automatically labeled text data, train that model again with the manually labeled text data to optimize the text prosody model;

the prosody prediction module 404 is specifically configured to perform prosody prediction on the text data to be predicted by using the text features and the optimized text prosody model.

Preferably, the text prosody model employs a deep neural network structure.

Of course, the system may further include a storage module (not shown) for storing relevant information such as text prosody model parameters, and in addition, may also be used for storing relevant information in the prosody prediction process, such as text features. In this way, automatic processing by a computer is facilitated, and a prosody prediction result or the like can be stored.

In the prosody prediction system provided in the embodiment of the present invention, the model building module 301 builds a text prosody model in advance, where the model building module 301 specifically includes a collecting unit, a labeling unit, and a training unit. The text data to be predicted is received by the receiving module 402, the text features of the text data to be predicted are extracted by the feature extraction module 403, and finally prosody prediction is performed on the text data to be predicted by the prosody prediction module 404. Because the labeling unit of the model building module 301 can perform automatic prosody labeling on the corresponding text data based on prosody information of the voice data, a large amount of automatically labeled text data can be obtained for training the text prosody model, which improves the accuracy of prosody prediction with the model and solves the prior-art problems of high labor cost and long time consumption caused by obtaining training data through manual labeling.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, they are described in a relatively simple manner, and reference may be made to some descriptions of method embodiments for relevant points. The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.

The embodiments of the present invention are described in detail above, and specific embodiments are used herein to explain the invention; the above description of the embodiments is only intended to help in understanding the method and system of the present invention. Meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A prosody prediction method, comprising:

constructing a text prosody model in advance;

receiving text data to be predicted;

extracting text features of text data to be predicted;

wherein constructing the text prosody model in advance comprises:

collecting text data having corresponding voice data;

performing automatic prosody labeling on the corresponding text data based on prosody information of the voice data to obtain automatically labeled text data, wherein the automatic labeling result is filtered in any one or more of the following ways: filtering prosodic boundary identifiers according to the speaker's speech rate; filtering prosodic boundary identifiers according to the word boundary detection result; filtering prosodic boundary identifiers whose spacing is smaller than a set threshold;

and training the text prosody model by using the automatically labeled text data.

2. The method of claim 1, wherein performing automatic prosody labeling on corresponding text data based on prosody information of the voice data, and obtaining automatically labeled text data comprises:

aligning the collected text data with the corresponding voice data;

3. The method of claim 2, further comprising:

4. The method of claim 2, wherein inserting prosodic boundary identifiers in corresponding text data according to prosodic information of the speech data comprises:

5. The method of claim 2, wherein filtering the incorrect prosodic boundary identifiers and obtaining automatically labeled text data comprises:

performing endpoint detection on the voice data, and filtering prosodic boundary identifiers according to the endpoint detection result.

6. The method according to any one of claims 1 to 5, further comprising:

7. The method of any of claims 1-5, wherein the text prosody model employs a deep neural network structure.

8. A prosody prediction system, comprising:

the model building module is used for building a text prosody model in advance;

the receiving module is used for receiving text data to be predicted;

the model building module comprises:

a collecting unit for collecting text data having corresponding voice data;

the labeling unit is used for performing automatic prosody labeling on the corresponding text data based on prosody information of the voice data to obtain automatically labeled text data, including filtering the automatic prosody labeling result in any one or more of the following ways: filtering prosodic boundary identifiers according to the speaker's speech rate; filtering prosodic boundary identifiers according to the word boundary detection result; filtering prosodic boundary identifiers whose spacing is smaller than a set threshold;

9. The system of claim 8, wherein the labeling unit comprises:

10. The system of claim 9, further comprising:

11. The system of claim 9, wherein the labeling subunit comprises:

12. The system of claim 9, wherein the error filtering subunit comprises:

13. The system of any one of claims 8 to 12, further comprising: