CN113129863B - Voice duration prediction method, device, equipment and readable storage medium


Info

Publication number: CN113129863B
Application number: CN201911417701.4A
Authority: CN (China)
Prior art keywords: duration, sequence, voice, prosody, prosodic
Legal status: Active
Other languages: Chinese (zh)
Other versions: CN113129863A
Inventors: 胡亚军, 江源, 胡国平, 胡郁
Current assignee: iFlytek Co Ltd
Original assignee: iFlytek Co Ltd
Legal events: application CN201911417701.4A filed by iFlytek Co Ltd; publication of application CN113129863A; application granted; publication of CN113129863B


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L2013/105 Duration
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09 Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/12 Speech or voice analysis techniques in which the extracted parameters are prediction coefficients


Abstract

The embodiment of the application discloses a voice duration prediction method, apparatus, device and readable storage medium. After text data is acquired, a pre-trained duration prediction model encodes the text data at at least two prosody levels to obtain coding feature sequences of the at least two prosody levels, and then generates a voice duration sequence corresponding to the text data according to those coding feature sequences. Because the encoding is performed at at least two prosody levels, the voice duration can be controlled at different prosody levels; when speech synthesis is performed based on the voice duration predicted by this method, the probability of the choppy, word-by-word phenomenon is reduced and the continuity of the synthesized speech is better.

Description

Voice duration prediction method, device, equipment and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for predicting speech duration.
Background
With the development and application of artificial intelligence, the application field of speech synthesis technology keeps expanding, from broadcasting scenarios (such as train stations, banks, and airport announcements) to man-machine interaction scenarios (such as AI assistants and customer service). These applications place higher requirements on the expressiveness, sound quality, etc. of the synthesized speech.
Duration prediction is an important link in speech synthesis. However, the inventors of the present application found through research that when speech synthesis is performed based on the voice durations predicted by current duration prediction methods, a choppy, word-by-word phenomenon easily occurs and the continuity of the speech is poor.
Disclosure of Invention
In view of the above, the present application provides a voice duration prediction method, apparatus, device and readable storage medium, so as to improve the continuity of synthesized speech.
In order to achieve the above object, the following solutions are proposed:
a speech duration prediction method, comprising:
acquiring text data;
encoding the text data at at least two prosody levels by using a pre-trained duration prediction model, to obtain coding feature sequences of the at least two prosody levels;
and generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model.
In the above method, preferably, the process of obtaining the coding feature sequences of the at least two prosody levels by using the duration prediction model, and generating the voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels, includes:
encoding the text data at the at least two prosody levels by using a coding module of the duration prediction model, to obtain the coding feature sequences of the at least two prosody levels;
and generating the voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using a duration generation module of the duration prediction model.
In the above method, preferably, encoding the text data at the at least two prosody levels by using the coding module of the duration prediction model to obtain the coding feature sequences of the at least two prosody levels includes:
extracting features of the text data by using a feature extraction module of the duration prediction model, to obtain a text feature sequence as the coding feature sequence of the lowest prosody level;
and encoding the text feature sequence at at least one non-lowest prosody level by using a prosodic feature acquisition module of the duration prediction model, to obtain a prosodic feature sequence of the at least one non-lowest prosody level as the coding feature sequences of the other prosody levels among the coding feature sequences of the at least two prosody levels.
In the above method, preferably, generating, by using the duration generation module of the duration prediction model, the voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels includes:
generating, by using a first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, as the voice duration sequence corresponding to the text data, where the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence;
or,
generating, by using the first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels;
generating, by using a second duration generation module of the duration prediction model, a voice duration sequence corresponding to each prosodic feature sequence according to that prosodic feature sequence, where the voice durations in the voice duration sequence correspond one-to-one to the prosodic features in the prosodic feature sequence;
and processing the voice duration sequence corresponding to the text feature sequence and the voice duration sequences corresponding to the prosodic feature sequences, to obtain the voice duration sequence corresponding to the text data.
In the above method, preferably, generating, by using the first duration generation module, the voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels includes:
splicing, for each text feature, the text feature with the prosodic features of the prosody levels generated based on the text feature by using a splicing module of the duration prediction model, to obtain a spliced feature corresponding to the text feature;
sampling from a preset variable conforming to a preset distribution by using a sampling module of the duration prediction model, to obtain a target value corresponding to each spliced feature;
performing, by using a transformation module of the duration prediction model, a first transformation on the target value sequence formed by the target values according to the spliced feature sequence formed by the spliced features corresponding to the text features, to obtain the voice duration sequence corresponding to the text feature sequence; where the transformation module has the capability of performing an inverse transformation on the voice duration sequence corresponding to the text feature sequence according to the spliced feature sequence, to obtain the target value sequence.
In the above method, preferably, the process of sampling, by using the sampling module, from the variable conforming to the preset distribution includes:
for each spliced feature, randomly sampling from the variable conforming to the preset distribution by using the sampling module, to obtain the target value corresponding to the spliced feature;
or,
for each spliced feature, adjusting a parameter of the variable conforming to the preset distribution by using the sampling module, to determine a target sampling range corresponding to the spliced feature, and randomly sampling within the target sampling range corresponding to the spliced feature, to obtain the target value corresponding to the spliced feature;
or,
for one part of the spliced features, randomly sampling from the variable conforming to the preset distribution by using the sampling module, to obtain the target value corresponding to each spliced feature in this part;
for the other part of the spliced features, adjusting a parameter of the variable conforming to the preset distribution by using the sampling module, to determine the target sampling range corresponding to each spliced feature, and randomly sampling within the target sampling range corresponding to each spliced feature, to obtain the target value corresponding to each spliced feature in this other part.
In the above method, preferably, the preset distribution is a Gaussian distribution, and for each spliced feature, adjusting, by using the sampling module, the parameter of the variable conforming to the preset distribution to determine the target sampling range corresponding to the spliced feature includes:
for each spliced feature, adjusting the variance of the Gaussian distribution by using the sampling module, and taking the value range of the variable conforming to the Gaussian distribution with the adjusted variance as the target sampling range corresponding to the spliced feature.
In the above method, preferably, processing the voice duration sequence corresponding to the text feature sequence and the voice duration sequences corresponding to the prosodic feature sequences to obtain the voice duration sequence corresponding to the text data includes:
in order of prosody level from high to low, for the voice duration sequences of any two adjacent prosody levels, adjusting the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the voice duration sequence of the lower prosody level is equal to the sum of the voice durations in the voice duration sequence of the higher prosody level.
In the above method, preferably, the training process of the voice duration prediction model includes:
extracting features of a text sample by using the feature extraction module, to obtain a text feature sequence of the text sample;
encoding the text feature sequence of the text sample at at least one prosody level by using the prosodic feature acquisition module, to obtain a prosodic feature sequence of the at least one prosody level of the text sample;
generating, by using the first duration generation module, a voice duration sequence corresponding to the text feature sequence according to the text feature sequence of the text sample and each prosodic feature sequence;
generating, by using the second duration generation module, a voice duration sequence corresponding to each prosodic feature sequence according to that prosodic feature sequence;
and updating the parameters of the voice duration prediction model with the targets that the voice duration sequence corresponding to the text feature sequence approaches the voice duration labels corresponding to the text feature sequence, and that the voice duration sequence corresponding to each prosodic feature sequence approaches the voice duration labels corresponding to that prosodic feature sequence.
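The patent does not prescribe a concrete loss function for these targets. Purely as an illustration, the following is a minimal sketch of such a multi-level training objective, assuming a mean-squared-error loss per level; all function and parameter names are hypothetical.

```python
import torch
import torch.nn.functional as F

def duration_loss(pred_text_durs: torch.Tensor,
                  text_dur_labels: torch.Tensor,
                  pred_prosody_durs: list,
                  prosody_dur_labels: list) -> torch.Tensor:
    """Sum of per-level regression losses: the text-feature-level duration
    sequence and every prosodic-level duration sequence are each pushed
    toward their own duration labels."""
    loss = F.mse_loss(pred_text_durs, text_dur_labels)
    for pred, label in zip(pred_prosody_durs, prosody_dur_labels):
        loss = loss + F.mse_loss(pred, label)
    return loss
```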
A speech duration prediction apparatus comprising:
the acquisition module is used for acquiring text data;
the coding control module is configured to encode the text data at at least two prosody levels by using a pre-trained duration prediction model, to obtain coding feature sequences of the at least two prosody levels;
And the generation control module is used for generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by utilizing the duration prediction model.
In the above apparatus, preferably, the coding control module is specifically configured to:
encode the text data at the at least two prosody levels by using a coding module of the duration prediction model, to obtain the coding feature sequences of the at least two prosody levels;
the generation control module is specifically configured to:
And generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using a duration generation module of the duration prediction model.
In the above apparatus, preferably, the coding control module includes:
the feature extraction control module is configured to extract features of the text data by using the feature extraction module of the duration prediction model, to obtain a text feature sequence as the coding feature sequence of the lowest prosody level;
and the prosodic feature acquisition control module is configured to encode the text feature sequence at at least one non-lowest prosody level by using the prosodic feature acquisition module of the duration prediction model, to obtain a prosodic feature sequence of the at least one non-lowest prosody level as the coding feature sequences of the other prosody levels among the coding feature sequences of the at least two prosody levels.
In the above apparatus, preferably, the generation control module includes:
the first control module is configured to generate, by using a first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, as the voice duration sequence corresponding to the text data, where the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence;
or,
the first control module is configured to generate, by using the first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels;
the second control module is configured to generate, by using a second duration generation module of the duration prediction model, a voice duration sequence corresponding to each prosodic feature sequence according to that prosodic feature sequence, where the voice durations in the voice duration sequence correspond one-to-one to the prosodic features in the prosodic feature sequence;
and the duration processing module is configured to process the voice duration sequence corresponding to the text feature sequence and the voice duration sequences corresponding to the prosodic feature sequences, to obtain the voice duration sequence corresponding to the text data.
In the above apparatus, preferably, the first control module includes:
the splicing control module is configured to splice, for each text feature, the text feature with the prosodic features of the prosody levels generated based on the text feature by using the splicing module of the duration prediction model, to obtain a spliced feature corresponding to the text feature;
the sampling control module is configured to sample from a preset variable conforming to a preset distribution by using the sampling module of the duration prediction model, to obtain a target value corresponding to each spliced feature;
and the transformation control module is configured to perform, by using the transformation module of the duration prediction model, a first transformation on the target value sequence formed by the target values according to the spliced feature sequence formed by the spliced features corresponding to the text features, to obtain the voice duration sequence corresponding to the text feature sequence; where the transformation module has the capability of performing an inverse transformation on the voice duration sequence corresponding to the text feature sequence according to the spliced feature sequence, to obtain the target value sequence.
In the above apparatus, preferably, the sampling control module is specifically configured to:
for each spliced feature, randomly sample from the variable conforming to the preset distribution by using the sampling module, to obtain the target value corresponding to the spliced feature;
or,
for each spliced feature, adjust a parameter of the variable conforming to the preset distribution by using the sampling module, to determine a target sampling range corresponding to the spliced feature, and randomly sample within the target sampling range corresponding to the spliced feature, to obtain the target value corresponding to the spliced feature;
or,
for one part of the spliced features, randomly sample from the variable conforming to the preset distribution by using the sampling module, to obtain the target value corresponding to each spliced feature in this part;
for the other part of the spliced features, adjust a parameter of the variable conforming to the preset distribution by using the sampling module, to determine the target sampling range corresponding to each spliced feature, and randomly sample within the target sampling range corresponding to each spliced feature, to obtain the target value corresponding to each spliced feature in this other part.
In the above apparatus, preferably, the preset distribution is a Gaussian distribution, and when, for each spliced feature, adjusting the parameter of the variable conforming to the preset distribution by using the sampling module to determine the target sampling range corresponding to the spliced feature, the sampling control module is configured to:
for each spliced feature, adjust the variance of the Gaussian distribution by using the sampling module, and take the value range of the variable conforming to the Gaussian distribution with the adjusted variance as the target sampling range corresponding to the spliced feature.
In the above apparatus, preferably, the duration processing module is specifically configured to:
in order of prosody level from high to low, for the voice duration sequences of any two adjacent prosody levels, adjust the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the voice duration sequence of the lower prosody level is equal to the sum of the voice durations in the voice duration sequence of the higher prosody level.
In the above apparatus, preferably, the voice duration prediction apparatus further includes a training module, specifically configured to:
extract features of a text sample by using the feature extraction module, to obtain a text feature sequence of the text sample;
encode the text feature sequence of the text sample at at least one prosody level by using the prosodic feature acquisition module, to obtain a prosodic feature sequence of the at least one prosody level of the text sample;
generate, by using the first duration generation module, a voice duration sequence corresponding to the text feature sequence according to the text feature sequence of the text sample and each prosodic feature sequence;
generate, by using the second duration generation module, a voice duration sequence corresponding to each prosodic feature sequence according to that prosodic feature sequence;
and update the parameters of the voice duration prediction model with the targets that the voice duration sequence corresponding to the text feature sequence approaches the voice duration labels corresponding to the text feature sequence, and that the voice duration sequence corresponding to each prosodic feature sequence approaches the voice duration labels corresponding to that prosodic feature sequence.
A speech duration prediction apparatus comprising a memory and a processor;
The memory is used for storing programs;
The processor is configured to execute the program to implement the steps of the voice duration prediction method described in any one of the above.
A readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the voice duration prediction method described in any one of the above.
According to the above technical solution, after the text data is acquired, the pre-trained duration prediction model encodes the text data at at least two prosody levels to obtain the coding feature sequences of the at least two prosody levels, and generates the voice duration sequence corresponding to the text data according to those coding feature sequences. Because the encoding is performed at at least two prosody levels, the voice duration can be controlled at different prosody levels; when speech synthesis is performed based on the voice duration predicted by this method, the probability of the word-by-word phenomenon is reduced and the continuity of the synthesized speech is better.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flowchart of an implementation of a method for predicting a duration of speech according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a speech duration prediction model according to an embodiment of the present application;
FIG. 3 is a flowchart of an implementation of obtaining the coding feature sequences of at least two prosody levels by using the duration prediction model and generating a voice duration sequence corresponding to text data according to the coding feature sequences of the at least two prosody levels, according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a coding module according to an embodiment of the present application;
FIG. 5 is an exemplary diagram of a prosodic feature acquisition module encoding two prosody levels according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a first duration generation module according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a duration generation module according to an embodiment of the present application;
FIG. 8 is a flowchart of an implementation of generating a voice duration sequence corresponding to text data according to the coding feature sequences of at least two prosody levels by using the duration generation module, according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a voice duration prediction apparatus according to an embodiment of the present application;
FIG. 10 is a block diagram of a hardware structure of a voice duration prediction device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to solve the problems that, when speech is synthesized with the voice durations predicted by existing duration prediction methods, the word-by-word phenomenon is obvious and the continuity is poor, the basic idea of the application is as follows: text data is encoded at at least two prosody levels by using a pre-trained duration prediction model, and the voice durations are generated according to the coding feature sequences of the at least two prosody levels, so that the voice duration can be controlled at different prosody levels, the probability of the word-by-word phenomenon is reduced, and the synthesized speech is more continuous and sounds more natural.
Based on the basic idea, an implementation flowchart of the voice duration prediction method provided by the embodiment of the present application is shown in fig. 1, and may include:
step S11: text data is acquired.
The text data is the text of the speech to be synthesized, as specified by a user. The text data may be text entered by the user in real time, or may be data in a text specified by the user.
Step S12: and coding the text data by using a pre-trained duration prediction model in at least two prosody levels to obtain coding feature sequences of the at least two prosody levels.
Optionally, the at least two prosody levels may be at least two of the following prosody levels: phoneme, prosodic word, prosodic phrase, prosodic clause, prosodic sentence, and the like. These prosody levels can be regarded as divisions of a sentence at different scales, obtained by text analysis of the sentence.
Table 1 shows an example in which the sentence "the Chinese women's volleyball team won the ticket for the first time, opening the journey of the tenth competition in history" is divided into prosody levels according to the present application. In this example, L1 denotes the prosodic-word level, L3 the prosodic-phrase level, L4 the prosodic-clause level, and L5 the prosodic-sentence level.
TABLE 1
Of course, in addition to the prosody levels listed above, more levels may be defined in embodiments of the application, for example levels smaller than the phoneme, such as a multi-frame level.
The modeling scales of different prosody levels differ. The modeling scale of the phoneme level is the phoneme, i.e., the coding features in the coding feature sequence of the phoneme level correspond one-to-one to the phonemes in the phoneme string obtained by parsing the text data. The modeling scale of the prosodic-word level is the word, i.e., the coding features in the coding feature sequence of the prosodic-word level correspond one-to-one to the words in the word sequence obtained by parsing the text data. The modeling scale of the prosodic-phrase level is the phrase, i.e., the coding features of the prosodic-phrase level correspond one-to-one to the phrases in the phrase sequence obtained by parsing the text data. The modeling scale of the prosodic-clause level is the clause, i.e., the coding features of the prosodic-clause level correspond one-to-one to the clauses in the clause sequence obtained by parsing the text data. The modeling scale of the prosodic-sentence level is the sentence, i.e., the coding features in the coding feature sequence of the prosodic-sentence level correspond one-to-one to the sentences in the sentence sequence obtained by parsing the text data.
Step S13: and generating a voice duration sequence corresponding to the text data according to the coding feature sequences of at least two prosody levels by using the duration prediction model.
The modeling scale of the speech duration may be the phoneme, or another modeling scale, for example a syllable, a word, or a phrase.
For example, if the modeling scale of the speech duration is the phoneme, each speech duration in the voice duration sequence corresponds to one phoneme in the phoneme string parsed from the text data; if the modeling scale of the speech duration is the word, each speech duration in the voice duration sequence corresponds to one word in the text data.
According to the voice duration prediction method provided by the embodiment of the application, the text data is encoded at at least two prosody levels by using the pre-trained duration prediction model to obtain the coding feature sequences of the at least two prosody levels, and the voice duration sequence corresponding to the text data is generated according to the coding feature sequences of the at least two prosody levels. Because the scheme encodes the text data at at least two prosody levels, the voice duration can be controlled at different prosody levels; when speech synthesis is performed based on the voice duration predicted by this method, the probability of the word-by-word phenomenon is reduced and the continuity of the synthesized speech is better.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech duration prediction model according to an embodiment of the present application, which may include:
A coding module 21 and a duration generating module 22; wherein,
The encoding module 21 is configured to encode the text data in at least two prosodic levels, resulting in a sequence of encoded features in at least two prosodic levels.
When the text data is encoded at at least two prosody levels, the text data is usually first encoded at a lower prosody level to obtain the coding feature sequence of the lower prosody level, and then the coding feature sequence of the lower prosody level is used for encoding at a higher prosody level to obtain the coding feature sequence of the higher prosody level.
The coding module 21 may be implemented by a neural network; for example, it may be implemented by a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a long short-term memory network (Long Short-Term Memory, LSTM), a combination of such networks, etc. The present application does not specifically limit which network form is used.
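As an illustration of one of the network forms named above, the following is a minimal sketch of an encoding module built from a bidirectional LSTM; it is not the patent's reference implementation, and all names and dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class EncodingModule(nn.Module):
    """Sketch: a bidirectional LSTM mapping a text feature sequence to a
    coding feature sequence of the same length (one vector per phoneme)."""

    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (batch, seq_len, feat_dim)
        coded, _ = self.lstm(text_feats)
        return coded  # (batch, seq_len, 2 * hidden_dim)
```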
The duration generation module 22 is configured to generate the voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels, or to generate voice duration sequences of the at least two prosody levels as at least two initial voice duration sequences. The at least two initial voice duration sequences are then used to generate the voice duration sequence corresponding to the text data.
In the embodiment of the application, different prosody levels may use different prediction networks to predict durations; of course, different prosody levels may also use the same prediction network. Each prosody level selects one of the following networks as the duration prediction network for that prosody level: a convolutional neural network (CNN), a recurrent neural network (RNN), a long short-term memory network (LSTM), a conditional flow generative model (cFlow), etc. The present application does not specifically limit which network form is used.
Correspondingly, the coding feature sequences of at least two prosody levels are obtained by using the duration prediction model; an implementation flowchart for generating a voice duration sequence corresponding to text data according to the coding feature sequences of the at least two prosody levels is shown in fig. 3, and may include:
Step S31: the text data is encoded at least two prosodic levels using the encoding module 21, resulting in a sequence of encoded features for at least two prosodic levels.
Step S32: generating, by using the duration generation module 22, the voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels. Specifically,
if the duration generation module 22 directly generates a voice duration sequence corresponding to the text data, that voice duration sequence is taken as the voice duration sequence corresponding to the text data; if the duration generation module 22 generates at least two initial voice duration sequences, the duration generation module 22 may then process the at least two initial voice duration sequences to generate the voice duration sequence corresponding to the text data.
In an alternative embodiment, a schematic structural diagram of the encoding module 21 is shown in fig. 4, and may include:
a feature extraction module 41 and a prosodic feature acquisition module 42; wherein,
The feature extraction module 41 is configured to extract features of the text data, and obtain a text feature sequence as a coding feature sequence of the lowest prosody level.
In the embodiment of the application, features can be extracted from the text data according to the modeling scale of the voice duration, and the modeling scale of the voice duration may be a phoneme, a syllable, or another modeling scale, such as a character or a word.
Specifically, after the modeling scale of the voice duration is determined, the text data is parsed based on this modeling scale to obtain each modeling object of this scale in the text data, and then features are extracted from the modeling objects to obtain the text feature sequence of the text data. Each text feature in the text feature sequence corresponds to one modeling object in the text data, and the text feature corresponding to a modeling object may be an explicit feature of the modeling object, i.e., a feature with a clear meaning.
Taking the phoneme as the modeling scale of the voice duration as an example, the text data is parsed to obtain a phoneme string, and features are then extracted from the phoneme string to obtain the text feature corresponding to each phoneme; the text features corresponding to the phonemes in the phoneme string form the text feature sequence. The text feature corresponding to a phoneme includes one or more of the following pieces of information: what the phoneme is, what the previous phoneme is, what the next phoneme is, what the tone (pitch) of the phoneme is, and position information, such as the position of the phoneme in the sentence, the position of the phoneme in the current clause, etc.
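For illustration only, a minimal sketch of assembling these explicit phoneme features; the input format and field names are hypothetical stand-ins for a real text front end.

```python
from typing import Dict, List

def extract_phoneme_features(phonemes: List[Dict]) -> List[Dict]:
    """Build an explicit feature dict per phoneme: identity, neighboring
    phonemes, tone, and position information, as described above."""
    feats = []
    for i, ph in enumerate(phonemes):
        feats.append({
            "phoneme": ph["symbol"],
            "prev_phoneme": phonemes[i - 1]["symbol"] if i > 0 else "<s>",
            "next_phoneme": phonemes[i + 1]["symbol"] if i + 1 < len(phonemes) else "</s>",
            "tone": ph.get("tone"),                    # tone/pitch of the phoneme
            "pos_in_sentence": i,                      # position in the sentence
            "pos_in_clause": ph.get("pos_in_clause"),  # position in the current clause
        })
    return feats
```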
In the embodiment of the application, the modeling scale of the text features is the same as the modeling scale of the voice duration.
The prosodic feature acquisition module 42 is configured to encode the text feature sequence at at least one non-lowest prosody level, to obtain a prosodic feature sequence of each of the at least one non-lowest prosody level, as the coding feature sequences of the other prosody levels among the coding feature sequences of the at least two prosody levels.
For example, if the modeling scale of the text features is the phoneme when features are extracted from the text data, the lowest prosody level is the phoneme, and accordingly the at least one non-lowest prosody level may be at least one of the following prosody levels: prosodic word, prosodic phrase, prosodic clause, prosodic sentence. If the modeling scale of the text features is the word, that is, each feature in the text feature sequence corresponds to one word in the text data, the lowest prosody level is the prosodic word, and accordingly the at least one non-lowest prosody level may be at least one of the following prosody levels: prosodic phrase, prosodic clause, prosodic sentence.
For each prosody level, there is a temporal dependency between the prosodic features during encoding: when a prosodic feature is calculated, not only the relevant text features are used, but also the prosodic feature calculated at the previous time step.
Fig. 5 is an exemplary diagram of the prosodic feature acquisition module 42 encoding two prosody levels (usually hidden-layer encoding, i.e., the prosodic features obtained by the encoding are hidden-layer features) according to an embodiment of the present application. This example uses the clause "opening the journey of the tenth competition in history" to describe the process of encoding the text feature sequence at the L1 and L2 prosody levels. In this example, the text feature sequence of the clause is a phoneme-level text feature sequence, i.e., each text feature corresponds to one phoneme. After acquiring the text feature sequence of the clause (i.e., the phoneme-level input in fig. 5, where characters are displayed to illustrate the correspondence between prosodic features and text features; what the prosodic feature acquisition module actually takes as input are the text features corresponding to the phonemes parsed from the characters), the prosodic feature acquisition module first performs character-level encoding to obtain character-level prosodic features, then performs L1-level encoding on the character-level prosodic features to obtain the prosodic feature sequence of the L1 level, and then performs L2-level encoding on the L1-level prosodic features to obtain the prosodic feature sequence of the L2 level. At each level of encoding, every prosodic feature except the first one is calculated using the prosodic feature obtained at the previous time step. For example, when the character-level prosodic feature of the second character of the word "opening" (开启) is calculated, the prosodic feature of its first character is used in addition to the corresponding text features. Likewise, in the L1-level encoding, when the prosodic feature of the word "competition" (比赛) is calculated, the prosodic feature of the preceding word "tenth" (第十次) is used in addition to the character-level prosodic features of the two characters of "competition".
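The patent does not fix a network form for this hierarchical encoding (fig. 5 merely uses an RNN as an example). Below is a minimal sketch of one level of it, assuming a GRU and mean-pooling over the span of each prosodic unit; stacking such encoders (characters → L1 → L2) reproduces the structure described above. All names are hypothetical.

```python
import torch
import torch.nn as nn

class ProsodyLevelEncoder(nn.Module):
    """One level of prosodic encoding: pool the lower-level features inside
    each prosodic unit, then run a unidirectional GRU across the units so
    that each prosodic feature also depends on the feature computed at the
    previous time step."""

    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden_dim, batch_first=True)

    def forward(self, lower_feats: torch.Tensor, spans: list) -> torch.Tensor:
        # lower_feats: (seq_len, in_dim); spans: [(start, end), ...] gives the
        # lower-level index range covered by each unit at this level
        unit_feats = torch.stack([lower_feats[s:e].mean(dim=0) for s, e in spans])
        out, _ = self.rnn(unit_feats.unsqueeze(0))  # temporal dependency across units
        return out.squeeze(0)  # (num_units, hidden_dim)
```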
It should be noted that fig. 5 is an example of an RNN network structure, but the prosodic feature obtaining module 42 in the embodiment of the application is not limited to implementation through the RNN network, and may be other network structures.
In an alternative embodiment, the duration generation module 22 may include:
the first duration generation module, configured to generate a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, as the voice duration sequence corresponding to the text data, where the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence.
Optionally, a schematic structural diagram of the first duration generation module is shown in fig. 6, and may include:
a splicing module 61, configured to splice, for each text feature, the text feature with the prosodic features of the prosody levels generated based on the text feature, to obtain a spliced feature corresponding to the text feature.
Taking the sentence "the Chinese women's volleyball team (中国女排) won the ticket for the first time, opening the journey of the tenth competition in history" as an example, assume that each text feature in the text feature sequence corresponds to one phoneme. The character "中" parses into the two phonemes zh and ong, and the character "排" parses into the two phonemes p and ai. Assume that the text feature sequence is encoded at two prosody levels, L1 and L3: the prosodic feature sequence of the L1 level contains 12 prosodic features, corresponding to the 12 words at the L1 level, and the prosodic feature sequence of the L3 level contains 4 prosodic features, corresponding to the four prosodic phrases at the L3 level. In the embodiment of the application, the text feature corresponding to the phoneme zh is spliced with the prosodic feature of the word "中国" (China) and the prosodic feature of the phrase "中国女排" (Chinese women's volleyball team), to obtain the spliced feature corresponding to that text feature; the text feature corresponding to the phoneme ong is spliced in the same way. Similarly, the text feature corresponding to the phoneme p is spliced with the prosodic feature of the word "女排" (women's volleyball) and the prosodic feature of the phrase "中国女排", to obtain its spliced feature.
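A minimal sketch of this splicing step follows, assuming precomputed phoneme-to-word and phoneme-to-phrase index maps; the function and variable names are hypothetical.

```python
import torch

def splice_features(text_feats: torch.Tensor,
                    l1_feats: torch.Tensor, l3_feats: torch.Tensor,
                    phon2word: list, phon2phrase: list) -> torch.Tensor:
    """For each phoneme-level text feature, concatenate it with the prosodic
    features of the word and the phrase that contain this phoneme."""
    spliced = []
    for i, tf in enumerate(text_feats):
        word_feat = l1_feats[phon2word[i]]      # e.g. feature of "中国" for zh/ong
        phrase_feat = l3_feats[phon2phrase[i]]  # e.g. feature of "中国女排"
        spliced.append(torch.cat([tf, word_feat, phrase_feat], dim=-1))
    return torch.stack(spliced)  # (num_phonemes, text_dim + l1_dim + l3_dim)
```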
A sampling module 62, configured to sample from a preset variable conforming to a preset distribution, to obtain a target value corresponding to each spliced feature.
The variable conforming to the preset distribution is a variable related to duration. It may be a variable conforming to a standard Gaussian distribution, i.e., all values of the variable conform to the standard Gaussian distribution. The number of target values is the same as the number of text features in the text feature sequence. That is, if the number of text features in the text feature sequence is N, the variable conforming to the preset distribution is also sampled N times, yielding N target values. The N samples may be drawn sequentially or simultaneously, i.e., N identical variables conforming to the preset distribution are sampled at the same time (equivalent to copying the variable N-1 times), to obtain the N target values.
Alternatively, the sampling module 62 may sample in any of the following ways:
Mode one:
For each spliced feature, randomly sample the variable conforming to the preset distribution, to obtain the target value corresponding to that spliced feature.
Due to the randomness of the sampling, a more expressive speech duration can be generated.
Mode two:
For each spliced feature, adjust a parameter of the variable conforming to the preset distribution to determine a target sampling range corresponding to the spliced feature, and randomly sample within this target sampling range to obtain the target value corresponding to the spliced feature. Specifically, if the preset distribution is a Gaussian distribution, the variance of the Gaussian distribution can be adjusted, and the value range of the variable conforming to the Gaussian distribution with the adjusted variance is taken as the target sampling range corresponding to the spliced feature.
In general, the smaller the variance, the more stable the predicted speech duration.
Mode three:
For one part of the spliced features, randomly sample the variable conforming to the preset distribution, to obtain the target value corresponding to each spliced feature in this part;
for the other part of the spliced features, adjust a parameter of the variable conforming to the preset distribution to determine the target sampling range corresponding to each spliced feature, and randomly sample within the target sampling range corresponding to each spliced feature, to obtain the target value corresponding to each spliced feature in this other part.
Obviously, the third mode is a combination of the first mode and the second mode.
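The first two modes can be sketched as follows (mode three simply applies a different mode per feature); this is an illustration only, with hypothetical names, assuming a standard Gaussian as the preset distribution.

```python
import torch

def sample_targets(num_feats: int, mode: str = "random",
                   sigma: float = 1.0) -> torch.Tensor:
    """Draw one target value per spliced feature from the preset
    distribution, optionally with an adjusted (reduced) variance."""
    if mode == "random":
        # mode one: sample directly from N(0, 1)
        return torch.randn(num_feats)
    if mode == "adjusted":
        # mode two: shrink the variance to narrow the target sampling range;
        # a smaller sigma yields more stable predicted durations
        return sigma * torch.randn(num_feats)
    raise ValueError(f"unknown mode: {mode}")
```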
A transformation module 63, configured to perform a pre-trained first transformation on the target value sequence formed by the target values according to the spliced feature sequence formed by the spliced features corresponding to the text features, to obtain the voice duration sequence corresponding to the text feature sequence. The transformation module 63 has the capability of performing an inverse transformation on the voice duration sequence corresponding to the text feature sequence according to the spliced feature sequence, to obtain the target value sequence.
That is, given a text A and the voice duration sequence At formed by the voice durations of the phonemes in the voice signal corresponding to the text A, the transformation module 63 can perform the inverse of the first transformation on the voice duration sequence At by using the (phoneme-level) text feature sequence of the text A, to obtain a target value sequence that conforms to the preset distribution.
Optionally, the transformation module 63 may be implemented by a cFlow model. cFlow is a probabilistic modeling framework based on change of variables. Assume that the text feature sequence is c = [c1, c2, …, cN], each text feature in c corresponds to one phoneme, and the duration sequence formed by the voice durations corresponding to the phonemes is x = [x1, x2, …, xN]. Then x can be transformed into another variable s = [s1, s2, …, sN] by an invertible transformation f, namely:
s = f(x, c), x = f⁻¹(s, c)
Assuming that the distribution of s is simple and its likelihood can be calculated directly (e.g., a Gaussian distribution), the distribution of x is:
p(x | c) = p(s) · |det(∂f(x, c)/∂x)|, where s = f(x, c)
In the cFlow model, a complex transformation is typically obtained by concatenating multiple transformations f. For example, if the K transformations are f1, f2, f3, …, fK, they transform x in turn as x → s1 → s2 → s3 → … → sK, and the log-probability of x is:
log p(x | c) = log p(sK) + Σ (k = 1 … K) log |det(∂sk/∂s(k−1))|, where s0 = x
The cFlow model can be understood as follows: to model a variable x, x is first transformed into another variable s by a series of invertible transformations f, and s is constrained to follow a preset simple distribution (e.g., the standard Gaussian distribution N(0, 1)); the variable x is thus described by the variable s and the invertible transformations f. When voice durations need to be generated, the text features of the text data are obtained, a value is sampled from s, and the inverse transformation f⁻¹ is applied to the sampled value, conditioned on those text features, to obtain the voice durations.
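For illustration, here is a minimal sketch of a single feature-conditioned affine flow step in the spirit of cFlow; a real model would stack K such invertible steps (often coupling layers). The network and names are hypothetical, not the patent's implementation.

```python
import torch
import torch.nn as nn

class ConditionalAffineStep(nn.Module):
    """One invertible affine step conditioned on spliced features c:
    s = (x - mu(c)) * exp(-logs(c)),  x = s * exp(logs(c)) + mu(c)."""

    def __init__(self, cond_dim: int):
        super().__init__()
        self.net = nn.Linear(cond_dim, 2)  # predicts (mu, log-scale) per position

    def forward(self, x: torch.Tensor, c: torch.Tensor):
        # x: (N,) durations; c: (N, cond_dim) spliced features
        mu, logs = self.net(c).unbind(-1)
        s = (x - mu) * torch.exp(-logs)   # duration -> latent target value
        log_det = -logs.sum()             # log |det(ds/dx)| for the likelihood
        return s, log_det

    def inverse(self, s: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        # latent sample (target value) -> duration, at generation time
        mu, logs = self.net(c).unbind(-1)
        return s * torch.exp(logs) + mu
```

Training would maximize log p(s) + log_det with s pushed toward the preset N(0, 1); at generation time, s is sampled (possibly with adjusted variance) and inverse() yields the durations.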
In this embodiment, the voice duration sequence generated by the first duration generation module is directly used as the voice duration sequence corresponding to the text data.
In another alternative embodiment, a schematic structure of the duration generation module 22 is shown in fig. 7, and may include:
a first duration generation module 71 and a second duration generation module 72; wherein,
the first duration generation module 71 is configured to generate a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, where the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence. The structure of the first duration generation module 71 is the same as that of the first duration generation module shown in fig. 6, and is not described again.
The second duration generation module 72 is configured to generate a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence; the speech duration in the sequence of speech durations corresponds one-to-one with the prosodic features in the sequence of prosodic features.
In the embodiment of the application, a voice duration sequence is generated separately for each prosody level. The networks used to predict the durations of different prosody levels may be the same or different, and the duration prediction networks of different prosody levels may be independent of each other.
Accordingly, the foregoing implementation flowchart of generating, by using the duration generating module 22, a voice duration sequence corresponding to text data according to the coding feature sequences of at least two prosody levels is shown in fig. 8, and may include:
Step S81: generating, by using the first duration generation module 71, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels.
Step S82: generating a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using a second duration generation module 72; the speech duration in the sequence of speech durations corresponds one-to-one with the prosodic features in the sequence of prosodic features.
Step S83: processing the voice duration sequence corresponding to the text feature sequence and the voice duration sequence corresponding to each prosodic feature sequence to obtain the voice duration sequence corresponding to the text data, wherein the voice duration sequence specifically comprises:
in order of prosody level from high to low, for the voice duration sequences of any two adjacent prosody levels, adjusting the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the voice duration sequence of the lower prosody level is equal to the sum of the voice durations in the voice duration sequence of the higher prosody level;
and determining the adjusted voice duration sequence of the lowest prosody level as the voice duration sequence corresponding to the text data.
In this embodiment, constraining durations at multiple prosody scales improves the stability of duration prediction while preserving the expressiveness of the durations.
Take three prosody levels L1, L3 and L4 as an example, using an example sentence glossed from the Chinese as "begin the tenth match journey in history": the L4 level contains the whole clause; the L3 level contains the two phrases "begin the tenth in history" and "match journey"; the L1 level contains the five words "begin", "in history", "tenth", "match" and "journey".

Assume the duration predictions at each prosody level are: the total duration of the clause is 300 frames (one frame is typically 5 ms); at L3, the total duration of the phrase "begin the tenth in history" is 180 frames and that of the phrase "match journey" is 110 frames; at L1, the word "begin" is 40 frames, "in history" 60 frames, "tenth" 80 frames, "match" 60 frames and "journey" 56 frames. The duration predictions of the levels are then reconciled as follows:

The total duration of the two L3 phrases is 180 + 110 = 290 frames, 10 frames shorter than the 300-frame clause, so the two phrase durations must be stretched to total 300 frames. Each phrase duration is stretched to 300/290 of its original value: "begin the tenth in history" becomes 180 × 300/290 ≈ 186 frames, and "match journey" becomes 110 × 300/290 ≈ 114 frames.

The total duration of the three words "begin", "in history" and "tenth" is 40 + 60 + 80 = 180 frames, 6 frames shorter than the 186 frames of their parent phrase, so the three word durations are stretched to 186/180 of their original values: "begin" becomes 40 × 186/180 ≈ 41 frames, "in history" becomes 60 × 186/180 = 62 frames, and "tenth" becomes 80 × 186/180 ≈ 83 frames.

The total duration of the two words "match" and "journey" is 60 + 56 = 116 frames, 2 frames longer than the 114 frames of their parent phrase, so the two word durations are compressed to 114/116 of their original values: "match" becomes 60 × 114/116 ≈ 59 frames and "journey" becomes 56 × 114/116 ≈ 55 frames.

If the L1 prosody level is the lowest prosody level, the voice duration sequence corresponding to the example sentence is determined as [41, 62, 83, 59, 55].

If the L1 prosody level is not the lowest level and phonemes form the lowest level, the duration of each phoneme in the phoneme string of the sentence must likewise be adjusted according to the adjusted L1 voice duration sequence [41, 62, 83, 59, 55], so that the adjusted phoneme durations sum to the sum of the adjusted L1 durations (41 + 62 + 83 + 59 + 55 = 300).
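The multi-level adjustment in this example boils down to proportionally rescaling each group of child durations to its parent's total. Below is a minimal sketch with hypothetical function and variable names; the cumulative-rounding trick keeps integer frame counts summing exactly, and it happens to reproduce the numbers above.

```python
def rescale_to_parent(child_durs, parent_total):
    # Stretch or compress child durations so they sum to parent_total.
    # Cumulative rounding guarantees the integer outputs sum exactly.
    ratio = parent_total / sum(child_durs)
    out, cum, prev = [], 0.0, 0
    for d in child_durs:
        cum += d * ratio
        r = round(cum)
        out.append(r - prev)
        prev = r
    return out

# L4 clause: 300 frames; L3 phrases and their L1 word groups.
phrases = rescale_to_parent([180, 110], 300)          # -> [186, 114]
words = (rescale_to_parent([40, 60, 80], phrases[0])  # -> [41, 62, 83]
         + rescale_to_parent([60, 56], phrases[1]))   # -> [59, 55]
print(words)  # [41, 62, 83, 59, 55], summing to 300
```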
The training process of the duration prediction model is described below; it may specifically include:
the feature extraction module 41 is utilized to extract the features of the text sample, and a text feature sequence of the text sample is obtained.
The text feature sequence of the text sample is encoded at at least one prosodic level using the prosodic feature acquisition module 42 to obtain the prosodic feature sequence of at least one prosodic level of the text sample.
The first duration generation module 71 is utilized to generate a voice duration sequence corresponding to the text feature sequence according to the text feature sequence of the text sample and each prosodic feature sequence.
The second duration generation module 72 is utilized to generate a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence.
And updating the parameters of the voice duration prediction model with the goal that the voice duration sequence corresponding to the text feature sequence of the text sample approaches its voice duration sequence label, and that the voice duration sequence corresponding to each prosodic feature sequence of the text sample approaches the label of that prosodic feature sequence; that is, the parameters are updated so that the voice duration sequence of each prosody level approaches the voice duration sequence label of that level.
That is, in this embodiment of the application, training the duration prediction model takes the losses of different prosody levels into account, which improves the duration modeling precision of the model and effectively improves the continuity and expressiveness of the predicted durations.
The loss function Loss of the duration prediction model may be a weighted sum of the loss functions corresponding to the respective prosody levels. Optionally:
If the first duration generation module 71 is a cFlow network, it can be trained by the maximum likelihood criterion; in the standard normalizing-flow form, the loss function Loss_1 is:

Loss_1 = log p(s) + log |det(∂s/∂x)|,  where s = f(x) and p is the preset simple distribution (e.g., N(0, 1))

The loss function has two objectives: the transformed variable s should be as close as possible to the preset simple distribution, and the transformation process should be as simple as possible.
The loss function of each prosody level in the second duration generation module 72 may be trained by the MSE (mean square error) criterion; for the i-th prosody level:

Loss_i = (1/N_i) Σ_j (d_ij − d̂_ij)²

where Loss_i denotes the loss function of the i-th prosody level in the second duration generation module 72, N_i is the number of prosodic units at that level, d_ij is the predicted duration of the j-th unit, and d̂_ij is its duration label.
Because the objective for Loss_1 is to take as large a value as possible, while the objective for each Loss_i is to take as small a value as possible, the loss function Loss of the duration prediction model may specifically be:

Loss = −Loss_1 + Σ_i ω_i · Loss_i

where ω_i is the weight of the loss function of the i-th prosody level. The goal is for Loss to take as small a value as possible.
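As a rough illustration, the combined objective might be written as follows in PyTorch; the tensors, the weights, and the stand-in log-likelihood value are illustrative assumptions, not the patented training code.

```python
import torch

def total_loss(log_likelihood, level_preds, level_labels, weights):
    # Loss = -Loss_1 + sum_i w_i * Loss_i: the cFlow log-likelihood Loss_1
    # is maximized, so it enters negated; each Loss_i is the MSE of the
    # durations predicted at one prosody level against its labels.
    loss = -log_likelihood
    for pred, label, w in zip(level_preds, level_labels, weights):
        loss = loss + w * torch.mean((pred - label) ** 2)
    return loss

# Toy usage with two prosody levels in the second duration generation module:
ll = torch.tensor(3.2)  # stand-in value of Loss_1
preds = [torch.tensor([186., 114.]), torch.tensor([41., 62., 83., 59., 55.])]
labels = [torch.tensor([180., 110.]), torch.tensor([40., 60., 80., 60., 56.])]
print(total_loss(ll, preds, labels, weights=[0.5, 0.5]))
```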
Corresponding to the method embodiment, the embodiment of the application also provides a voice duration prediction device, the structure schematic diagram of which is shown in fig. 9, which may include:
an acquisition module 91, a coding control module 92 and a generation control module 93; wherein,
The acquiring module 91 is configured to acquire text data;
The encoding control module 92 is configured to encode at least two prosodic levels on the text data using a pre-trained duration prediction model, so as to obtain an encoding feature sequence of at least two prosodic levels;
The generation control module 93 is configured to generate a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model.
With the voice duration prediction device provided by this embodiment of the application, the text data is encoded at at least two prosody levels using a pre-trained duration prediction model to obtain coding feature sequences of the at least two prosody levels, and a voice duration sequence corresponding to the text data is generated from those coding feature sequences. Because the voice duration can be controlled at different prosody levels, the probability of the word-by-word phenomenon occurring is reduced when speech is synthesized from the predicted durations, and the continuity of the synthesized speech is better.
In an alternative embodiment, the encoding control module 92 is specifically configured to:
Coding at least two prosody levels on the text data by using a coding module of the duration prediction model to obtain coding feature sequences of at least two prosody levels;
the generation control module is specifically configured to:
And generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using a duration generation module of the duration prediction model.
In an alternative embodiment, the encoding control module 92 may include:
the feature extraction control module is used for extracting the features of the text data by utilizing the feature extraction module of the duration prediction model to obtain a text feature sequence which is used as a coding feature sequence of the lowest prosody level;
And the prosodic feature acquisition control module is used for coding at least one non-lowest prosodic level of the text feature sequence by utilizing the prosodic feature acquisition module of the duration prediction model to obtain a prosodic feature sequence of at least one prosodic level, and the prosodic feature sequence is used as the coding feature sequence of other prosodic levels in the coding feature sequences of at least two prosodic levels.
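One way to picture this hierarchy is the sketch below, which derives word- and phrase-level prosodic features from phone-level text features. Mean pooling over assumed unit boundaries stands in for the trained prosodic feature acquisition module; the shapes and boundaries are made up for the example.

```python
import numpy as np

def encode_level(lower_feats, boundaries):
    # One prosodic feature per higher-level unit, obtained by aggregating
    # the lower-level features it spans (mean pooling as a stand-in).
    return np.stack([lower_feats[s:e].mean(axis=0) for s, e in boundaries])

phone_feats = np.random.randn(5, 8)                        # lowest level: 5 text features
word_feats = encode_level(phone_feats, [(0, 2), (2, 5)])   # 2 word-level features
phrase_feats = encode_level(word_feats, [(0, 2)])          # 1 phrase-level feature
print(word_feats.shape, phrase_feats.shape)                # (2, 8) (1, 8)
```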
In an alternative embodiment, the generation control module may include:
The first control module is used for generating, by the first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, as the voice duration sequence corresponding to the text data; the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence;
in an alternative embodiment, the generation control module may include:
The first control module is used for generating, by the first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels;
the second control module is used for generating a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by utilizing a second duration generation module of the duration prediction model; the voice duration in the voice duration sequence corresponds to the prosodic features in the prosodic feature sequence one by one;
the duration processing module is used for processing the voice duration sequences corresponding to the text feature sequences and the voice duration sequences corresponding to the prosody feature sequences to obtain the voice duration sequences corresponding to the text data.
In an alternative embodiment, the first control module includes:
The splicing control module is used for splicing, for each text feature, the text feature with the prosodic features of each prosody level generated based on that text feature, by the splicing module of the duration prediction model, to obtain the splicing feature corresponding to the text feature;
the sampling control module is used for sampling in preset variables conforming to preset distribution by utilizing the sampling module of the duration prediction model to obtain target values corresponding to all splicing characteristics;
The transformation control module is used for performing, by the transformation module of the duration prediction model, a pre-trained first transformation on the target value sequence formed by the target values, according to the splicing feature sequence formed by the splicing features corresponding to the text features, to obtain the voice duration sequence corresponding to the text feature sequence; the transformation module is also capable of inversely transforming the voice duration sequence corresponding to the text feature sequence, according to the splicing feature sequence, to recover the target value sequence.
In an alternative embodiment, the sampling control module may specifically be configured to:

corresponding to each splicing feature, randomly sample, by the sampling module, within the variable conforming to the preset distribution to obtain the target value corresponding to the splicing feature;

or

corresponding to each splicing feature, adjust, by the sampling module, the parameters of the variable conforming to the preset distribution to determine the target sampling range corresponding to the splicing feature, and randomly sample within that target sampling range to obtain the target value corresponding to the splicing feature;

or

corresponding to one part of the splicing features, randomly sample, by the sampling module, within the variable conforming to the preset distribution to obtain the target value corresponding to each splicing feature in that part; and

corresponding to the other part of the splicing features, adjust, by the sampling module, the parameters of the variable conforming to the preset distribution to determine the target sampling range corresponding to each of those splicing features, and randomly sample within each target sampling range to obtain the target value corresponding to each splicing feature in the other part.

In an optional embodiment, the preset distribution is a Gaussian distribution, and, when adjusting the parameters of the variable for each splicing feature to determine its target sampling range, the sampling control module is specifically configured to:

adjust, for each splicing feature, the variance of the Gaussian distribution using the sampling module, and take the value range of the variable of the Gaussian distribution with the adjusted variance as the target sampling range corresponding to that splicing feature.
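For example, the two sampling strategies could look like the following sketch; the variance-prediction rule here is a made-up placeholder, since the text only requires that the variance be adjusted per splicing feature. A smaller variance concentrates samples near the mean, trading expressiveness for stability.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_plain():
    # Strategy 1: draw the target value directly from the preset N(0, 1).
    return rng.standard_normal()

def sample_with_adjusted_variance(splice_feature):
    # Strategy 2: adjust the Gaussian variance per splicing feature, which
    # narrows or widens the target sampling range (placeholder rule).
    sigma = 0.3 + 0.2 * abs(float(np.tanh(splice_feature.mean())))
    return rng.normal(loc=0.0, scale=sigma)

splice_feature = rng.normal(size=16)
print(sample_plain(), sample_with_adjusted_variance(splice_feature))
```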
In an alternative embodiment, the duration processing module may specifically be configured to:
In order from the highest prosody level to the lowest, for the voice duration sequences of any two adjacent prosody levels, adjusting the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the lower-level sequence equals the sum of the voice durations in the higher-level sequence;
and determining the adjusted voice duration sequence of the lowest prosody level as the voice duration sequence corresponding to the text data.
In an optional embodiment, the voice duration prediction apparatus may further include a training module, specifically configured to:
Extracting the characteristics of a text sample by using the characteristic extraction module to obtain a text characteristic sequence of the text sample;
Encoding at least one prosodic hierarchy of the text feature sequence of the text sample by utilizing the prosodic feature acquisition module to obtain the prosodic feature sequence of the at least one prosodic hierarchy of the text sample;
Generating, by the first duration generation module, a voice duration sequence corresponding to the text feature sequence according to the text feature sequence of the text sample and each prosodic feature sequence;
Generating a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using the second duration generation module;
And updating the parameters of the voice duration prediction model with the goal that the voice duration sequence corresponding to the text feature sequence approaches the voice duration label corresponding to the text feature sequence, and that the voice duration sequence corresponding to each prosodic feature sequence approaches the voice duration label corresponding to that prosodic feature sequence; that is, updating the parameters so that the voice duration sequence of each prosody level approaches the voice duration label of that level.
The voice duration prediction device provided by this embodiment of the application can be applied to voice duration prediction equipment, such as a PC terminal, a cloud platform, a server, or a server cluster. Optionally, fig. 10 shows a block diagram of the hardware structure of the voice duration prediction device; referring to fig. 10, the hardware structure may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In this embodiment of the application, the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention;
the memory 3 may comprise a high-speed RAM memory and may further comprise a non-volatile memory, such as at least one magnetic disk memory;
Wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring text data;
Coding at least two prosody levels on the text data by utilizing a pre-trained duration prediction model to obtain coding feature sequences of the at least two prosody levels;
and generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model.
Optionally, for the refinement and extension functions of the program, reference may be made to the description above.
The embodiment of the present application also provides a storage medium storing a program adapted to be executed by a processor, the program being configured to:
acquiring text data;
Coding at least two prosody levels on the text data by utilizing a pre-trained duration prediction model to obtain coding feature sequences of the at least two prosody levels;
and generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model.
Optionally, for the refinement and extension functions of the program, reference may be made to the description above.
Finally, it should also be noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts between the embodiments, reference may be made to one another.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (9)

1. A method for predicting a duration of speech, comprising:
acquiring text data;
extracting features of the text data using a feature extraction module of a pre-trained duration prediction model to obtain a text feature sequence, which serves as the coding feature sequence of the lowest prosody level; encoding, by a prosodic feature acquisition module of the duration prediction model, at least one non-lowest prosodic level from the text feature sequence and the prosodic feature sequence calculated at the previous moment, to obtain the prosodic feature sequence of the at least one non-lowest prosodic level at the current moment, which serves as the coding feature sequences of the other prosody levels among the coding feature sequences of at least two prosody levels;
generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model;
performing, in order from the highest prosody level to the lowest, for the voice duration sequences of any two adjacent prosody levels, duration stretching or duration compression processing on the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the lower-level sequence equals the sum of the voice durations in the higher-level sequence;
and determining the adjusted voice duration sequence of the lowest prosody level as the final voice duration sequence corresponding to the text data.
2. The method according to claim 1, wherein the generating, by using the duration generating module of the duration prediction model, a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels includes:
generating, by a first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, as the voice duration sequence corresponding to the text data; the voice durations in the voice duration sequence correspond one-to-one to the text features in the text feature sequence.
3. The method according to claim 1, wherein the generating, by using the duration generating module of the duration prediction model, a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels includes:
generating, by a first duration generation module of the duration prediction model, a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels;
generating a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using a second duration generation module of the duration prediction model; the voice duration in the voice duration sequence corresponds to the prosodic features in the prosodic feature sequence one by one;
And taking the voice duration sequence corresponding to the text feature sequence and the voice duration sequence corresponding to each prosody feature sequence as the voice duration sequence corresponding to the text data.
4. The method according to claim 2 or 3, wherein the process of generating, by the first duration generation module, a voice duration sequence corresponding to the text feature sequence from the coding feature sequences of the at least two prosody levels comprises:
splicing, for each text feature, the text feature with the prosodic features of each prosody level generated based on the text feature, by a splicing module of the duration prediction model, to obtain the splicing feature corresponding to the text feature;
sampling in a preset variable which accords with preset distribution by utilizing a sampling module of the duration prediction model to obtain target values corresponding to all splicing characteristics;
And performing a pre-trained first transformation on a target value sequence formed by each target value according to a spliced feature sequence formed by spliced features corresponding to each text feature by using a transformation module of the duration prediction model to obtain a voice duration sequence corresponding to the text feature sequence.
5. The method of claim 4, wherein sampling within the variable conforming to the preset distribution by the sampling module comprises:

corresponding to each splicing feature, randomly sampling, by the sampling module, within the variable conforming to the preset distribution to obtain the target value corresponding to the splicing feature;

or

corresponding to each splicing feature, adjusting, by the sampling module, the parameters of the variable conforming to the preset distribution to determine the target sampling range corresponding to the splicing feature, and randomly sampling within that target sampling range to obtain the target value corresponding to the splicing feature;

or

corresponding to one part of the splicing features, randomly sampling, by the sampling module, within the variable conforming to the preset distribution to obtain the target value corresponding to each splicing feature in that part; and

corresponding to the other part of the splicing features, adjusting, by the sampling module, the parameters of the variable conforming to the preset distribution to determine the target sampling range corresponding to each of those splicing features, and randomly sampling within each target sampling range to obtain the target value corresponding to each splicing feature in the other part.
6. A method according to claim 2 or 3, wherein the training process of the speech duration prediction model comprises:
updating the parameters of the voice duration prediction model with the goal that the voice duration sequence of each prosody level approaches the voice duration sequence label corresponding to that prosody level.
7. A speech duration prediction apparatus, comprising:
the acquisition module is used for acquiring text data;
the coding control module is used for extracting features of the text data using a feature extraction module of a pre-trained duration prediction model to obtain a text feature sequence, which serves as the coding feature sequence of the lowest prosody level; and for encoding, by a prosodic feature acquisition module of the duration prediction model, at least one non-lowest prosodic level from the text feature sequence and the prosodic feature sequence calculated at the previous moment, to obtain the prosodic feature sequence of the at least one non-lowest prosodic level at the current moment, which serves as the coding feature sequences of the other prosody levels among the coding feature sequences of at least two prosody levels;
the generation control module is used for generating a voice duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by utilizing the duration prediction model;
performing, in order from the highest prosody level to the lowest, for the voice duration sequences of any two adjacent prosody levels, duration stretching or duration compression processing on the voice durations in the voice duration sequence of the lower prosody level according to the voice duration sequence of the higher prosody level, so that the sum of the voice durations in the lower-level sequence equals the sum of the voice durations in the higher-level sequence;
and determining the adjusted voice duration sequence of the lowest prosody level as the final voice duration sequence corresponding to the text data.
8. A speech duration prediction apparatus comprising a memory and a processor;
The memory is used for storing programs;
The processor is configured to execute the program to implement the steps of the speech duration prediction method according to any one of claims 1 to 6.
9. A readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the steps of the speech duration prediction method according to any one of claims 1-6.
CN201911417701.4A 2019-12-31 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium Active CN113129863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911417701.4A CN113129863B (en) 2019-12-31 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113129863A CN113129863A (en) 2021-07-16
CN113129863B true CN113129863B (en) 2024-05-31

Family

ID=76769373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911417701.4A Active CN113129863B (en) 2019-12-31 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113129863B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628610B (en) * 2021-08-12 2024-02-13 科大讯飞股份有限公司 Voice synthesis method and device and electronic equipment
CN113823260A (en) * 2021-10-20 2021-12-21 科大讯飞股份有限公司 Speech synthesis model training method, speech synthesis method and device

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6184699A (en) * 1984-10-02 1986-04-30 三洋電機株式会社 Voice synthesization system
WO2000031722A1 (en) * 1998-11-25 2000-06-02 Deutsche Telekom Ag Method for controlling duration in speech synthesis
JP2004347994A (en) * 2003-05-23 2004-12-09 Sharp Corp Device and method for speech synthesis, and program implementing same speech synthesizing method
JP2005018036A (en) * 2003-06-05 2005-01-20 Kenwood Corp Device and method for speech synthesis and program
CN1956057A (en) * 2005-10-28 2007-05-02 富士通株式会社 Voice time premeauring device and method based on decision tree
CN101814288A (en) * 2009-02-20 2010-08-25 富士通株式会社 Method and equipment for self-adaption of speech synthesis duration model
CN102222501A (en) * 2011-06-15 2011-10-19 中国科学院自动化研究所 Method for generating duration parameter in speech synthesis
CN102231276A (en) * 2011-06-21 2011-11-02 北京捷通华声语音技术有限公司 Method and device for forecasting duration of speech synthesis unit
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN109801618A (en) * 2017-11-16 2019-05-24 深圳市腾讯计算机系统有限公司 A kind of generation method and device of audio-frequency information
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110444191A (en) * 2019-01-22 2019-11-12 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110010136A (en) * 2019-04-04 2019-07-12 北京地平线机器人技术研发有限公司 The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Clustering of duration patterns in speech for Text-to-Speech Synthesis; K.S. Sreelekshmi et al.; 2012 Annual IEEE India Conference (INDICON); 20130128; full text *
"Annotation generation method for Mandarin statistical parametric speech synthesis" (面向汉语统计参数语音合成的标注生成方法); Hao Dongliang, Yang Hongwu, Zhang Ce, Zhang Shuai, Guo Lizhao, Yang Jingbo; Computer Engineering and Applications; 20161001(19); full text *

Also Published As

Publication number Publication date
CN113129863A (en) 2021-07-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant