CN113129863A - Speech duration prediction method, apparatus, device and readable storage medium - Google Patents

Speech duration prediction method, apparatus, device and readable storage medium

Info

Publication number
CN113129863A
Authority
CN
China
Prior art keywords
sequence
duration
prosody
feature
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911417701.4A
Other languages
Chinese (zh)
Other versions
CN113129863B (en)
Inventor
胡亚军
江源
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911417701.4A priority Critical patent/CN113129863B/en
Priority claimed from CN201911417701.4A external-priority patent/CN113129863B/en
Publication of CN113129863A publication Critical patent/CN113129863A/en
Application granted granted Critical
Publication of CN113129863B publication Critical patent/CN113129863B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10: Prosody rules derived from text; Stress or intonation
    • G10L2013/105: Duration
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: ... using predictive techniques
    • G10L19/08: Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/09: Long term prediction, i.e. removing periodical redundancies, e.g. by using adaptive codebook or pitch predictor
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: ... characterised by the type of extracted parameters
    • G10L25/12: ... the extracted parameters being prediction coefficients

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this application disclose a speech duration prediction method, apparatus, device, and readable storage medium. After text data is obtained, a pre-trained duration prediction model encodes the text data at at least two prosody levels to obtain encoded feature sequences at the at least two prosody levels; the duration prediction model then generates a speech duration sequence corresponding to the text data from those encoded feature sequences. Because encoding is performed at at least two prosody levels, the speech duration can be controlled at different prosody levels.

Description

Speech duration prediction method, apparatus, device and readable storage medium
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for predicting speech duration.
Background
With the development of artificial-intelligence-related technologies, the application fields of speech synthesis keep expanding, from broadcast applications (stations, banks, airport announcements, etc.) to human-computer interaction applications (AI assistants, customer service, etc.). These applications place higher requirements on the expressiveness, sound quality, and other aspects of synthesized speech.
However, the inventors of this application found that when speech is synthesized using durations predicted by current duration prediction methods, choppy word-by-word pausing tends to occur and the speech lacks fluency.
Disclosure of Invention
In view of the above, the present application provides a method, an apparatus, a device and a readable storage medium for predicting a speech duration to improve continuity of synthesized speech.
In order to achieve the above object, the following solutions are proposed:
a speech duration prediction method comprises the following steps:
acquiring text data;
coding at least two prosody levels on the text data by using a pre-trained duration prediction model to obtain coding feature sequences of at least two prosody levels;
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the time length prediction model.
In the above method, preferably, using the duration prediction model to obtain the encoded feature sequences at the at least two prosody levels and to generate the speech duration sequence corresponding to the text data from those sequences includes:
encoding the text data at the at least two prosody levels using an encoding module of the duration prediction model to obtain the encoded feature sequences at the at least two prosody levels;
and generating, using a duration generation module of the duration prediction model, the speech duration sequence corresponding to the text data from the encoded feature sequences at the at least two prosody levels.
Preferably, in the above method, encoding the text data at the at least two prosody levels using the encoding module of the duration prediction model to obtain the encoded feature sequences includes:
extracting features of the text data using a feature extraction module of the duration prediction model to obtain a text feature sequence, which serves as the encoded feature sequence of the lowest prosody level;
and performing at least one encoding at a non-lowest prosody level on the text feature sequence using a prosodic feature acquisition module of the duration prediction model to obtain at least one prosodic feature sequence at a non-lowest prosody level, which serves as the encoded feature sequence of the other prosody levels among the at least two prosody levels.
Preferably, in the method, generating, using the duration generation module of the duration prediction model, the speech duration sequence corresponding to the text data from the encoded feature sequences at the at least two prosody levels includes:
generating, using a first duration generation module of the duration prediction model, a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels, which serves as the speech duration sequence corresponding to the text data, where the speech durations in the speech duration sequence correspond one-to-one to the text features in the text feature sequence;
or,
generating, using the first duration generation module of the duration prediction model, a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels;
generating, using a second duration generation module of the duration prediction model, a speech duration sequence corresponding to each prosodic feature sequence from that prosodic feature sequence, where the speech durations in the speech duration sequence correspond one-to-one to the prosodic features in the prosodic feature sequence;
and processing the speech duration sequence corresponding to the text feature sequence and the speech duration sequence corresponding to each prosodic feature sequence to obtain the speech duration sequence corresponding to the text data.
Preferably, in the method, generating, by the first duration generation module, the speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels includes:
for each text feature, splicing (concatenating) the text feature with the prosodic features at each prosody level generated based on that text feature, using a splicing module of the duration prediction model, to obtain the splicing feature corresponding to the text feature;
sampling from a preset variable conforming to a preset distribution, using a sampling module of the duration prediction model, to obtain the target value corresponding to each splicing feature;
applying a first transformation, using a transformation module of the duration prediction model, to the target value sequence formed by the target values, conditioned on the splicing feature sequence formed by the splicing features of the text features, to obtain the speech duration sequence corresponding to the text feature sequence; where the transformation module has the capability of applying the inverse of the first transformation to the speech duration sequence corresponding to the text feature sequence, conditioned on the splicing feature sequence, to obtain the target value sequence.
Preferably, in the method, sampling from the variable conforming to the preset distribution using the sampling module includes:
for each splicing feature, randomly sampling from the variable conforming to the preset distribution using the sampling module to obtain the target value corresponding to the splicing feature;
or,
for each splicing feature, adjusting a parameter of the variable conforming to the preset distribution using the sampling module to determine a target sampling range corresponding to the splicing feature, and randomly sampling within that target sampling range to obtain the target value corresponding to the splicing feature;
or,
for one part of the splicing features, randomly sampling from the variable conforming to the preset distribution using the sampling module to obtain the target value corresponding to each splicing feature in that part;
and for the other part of the splicing features, adjusting a parameter of the variable conforming to the preset distribution using the sampling module to determine a target sampling range corresponding to each splicing feature, and randomly sampling within each target sampling range to obtain the target value corresponding to each splicing feature in that other part.
In the above method, preferably, the preset distribution is a Gaussian distribution, and adjusting, for each splicing feature, the parameter of the variable conforming to the preset distribution using the sampling module to determine the target sampling range corresponding to the splicing feature includes:
for each splicing feature, adjusting the variance of the Gaussian distribution using the sampling module, and taking the value range of a variable conforming to the Gaussian distribution with the adjusted variance as the target sampling range corresponding to the splicing feature.
Preferably, processing the speech duration sequence corresponding to the text feature sequence and the speech duration sequence corresponding to each prosodic feature sequence to obtain the speech duration sequence corresponding to the text data includes:
in order of prosody level from high to low, for the speech duration sequences of any two adjacent prosody levels, adjusting the speech durations in the sequence of the lower of the two levels according to the sequence of the higher level, so that the sum of the speech durations in the lower-level sequence equals the sum of the speech durations in the higher-level sequence.
In the above method, preferably, the training process of the speech duration prediction model includes:
extracting features of a text sample using the feature extraction module to obtain a text feature sequence of the text sample;
encoding the text feature sequence of the text sample at at least one prosody level using the prosodic feature acquisition module to obtain at least one prosodic feature sequence of the text sample;
generating, using the first duration generation module, a speech duration sequence corresponding to the text feature sequence from the text feature sequence of the text sample and each prosodic feature sequence;
generating, using the second duration generation module, a speech duration sequence corresponding to each prosodic feature sequence from that prosodic feature sequence;
and updating the parameters of the speech duration prediction model so that the speech duration sequence corresponding to the text feature sequence approaches the speech duration labels for the text feature sequence, and the speech duration sequence corresponding to each prosodic feature sequence approaches the speech duration labels for that prosodic feature sequence.
A speech duration prediction apparatus, comprising:
an acquisition module, configured to acquire text data;
an encoding control module, configured to encode the text data at at least two prosody levels using a pre-trained duration prediction model to obtain encoded feature sequences at the at least two prosody levels;
and a generation control module, configured to generate, using the duration prediction model, a speech duration sequence corresponding to the text data from the encoded feature sequences at the at least two prosody levels.
Preferably, in the above apparatus, the encoding control module is specifically configured to:
encode the text data at the at least two prosody levels using an encoding module of the duration prediction model to obtain the encoded feature sequences at the at least two prosody levels;
and the generation control module is specifically configured to:
generate, using a duration generation module of the duration prediction model, the speech duration sequence corresponding to the text data from the encoded feature sequences at the at least two prosody levels.
Preferably, in the above apparatus, the encoding control module includes:
a feature extraction control module, configured to extract features of the text data using the feature extraction module of the duration prediction model to obtain a text feature sequence, which serves as the encoded feature sequence of the lowest prosody level;
and a prosodic feature acquisition control module, configured to perform at least one encoding at a non-lowest prosody level on the text feature sequence using the prosodic feature acquisition module of the duration prediction model to obtain at least one prosodic feature sequence at a non-lowest prosody level, which serves as the encoded feature sequence of the other prosody levels among the at least two prosody levels.
Preferably, in the above apparatus, the generation control module includes:
a first control module, configured to generate, using a first duration generation module of the duration prediction model, a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels, which serves as the speech duration sequence corresponding to the text data, where the speech durations in the speech duration sequence correspond one-to-one to the text features in the text feature sequence;
or,
a first control module, configured to generate, using the first duration generation module of the duration prediction model, a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels;
a second control module, configured to generate, using a second duration generation module of the duration prediction model, a speech duration sequence corresponding to each prosodic feature sequence from that prosodic feature sequence, where the speech durations correspond one-to-one to the prosodic features in the prosodic feature sequence;
and a duration processing module, configured to process the speech duration sequence corresponding to the text feature sequence and the speech duration sequences corresponding to the prosodic feature sequences to obtain the speech duration sequence corresponding to the text data.
In the above apparatus, preferably, the first control module includes:
a splicing control module, configured to splice, for each text feature, the text feature with the prosodic features at each prosody level generated based on that text feature, using the splicing module of the duration prediction model, to obtain the splicing feature corresponding to the text feature;
a sampling control module, configured to sample from a preset variable conforming to a preset distribution, using the sampling module of the duration prediction model, to obtain the target value corresponding to each splicing feature;
and a transformation control module, configured to apply a first transformation, using the transformation module of the duration prediction model, to the target value sequence formed by the target values, conditioned on the splicing feature sequence formed by the splicing features of the text features, to obtain the speech duration sequence corresponding to the text feature sequence; where the transformation module has the capability of applying the inverse of the first transformation to the speech duration sequence corresponding to the text feature sequence, conditioned on the splicing feature sequence, to obtain the target value sequence.
Preferably, in the above apparatus, the sampling control module is specifically configured to:
for each splicing feature, randomly sample from the variable conforming to the preset distribution using the sampling module to obtain the target value corresponding to the splicing feature;
or,
for each splicing feature, adjust a parameter of the variable conforming to the preset distribution using the sampling module to determine a target sampling range corresponding to the splicing feature, and randomly sample within that target sampling range to obtain the target value corresponding to the splicing feature;
or,
for one part of the splicing features, randomly sample from the variable conforming to the preset distribution using the sampling module to obtain the target value corresponding to each splicing feature in that part;
and for the other part of the splicing features, adjust a parameter of the variable conforming to the preset distribution using the sampling module to determine a target sampling range corresponding to each splicing feature, and randomly sample within each target sampling range to obtain the target value corresponding to each splicing feature in that other part.
In the above apparatus, preferably, the preset distribution is a Gaussian distribution, and the sampling control module, in adjusting for each splicing feature the parameter of the variable conforming to the preset distribution using the sampling module to determine the target sampling range corresponding to the splicing feature, is specifically configured to:
for each splicing feature, adjust the variance of the Gaussian distribution using the sampling module, and take the value range of a variable conforming to the Gaussian distribution with the adjusted variance as the target sampling range corresponding to the splicing feature.
Preferably, the duration processing module is specifically configured to:
in order of prosody level from high to low, for the speech duration sequences of any two adjacent prosody levels, adjust the speech durations in the sequence of the lower of the two levels according to the sequence of the higher level, so that the sum of the speech durations in the lower-level sequence equals the sum of the speech durations in the higher-level sequence.
Preferably, the above speech duration prediction apparatus further includes a training module, specifically configured to:
extract features of a text sample using the feature extraction module to obtain a text feature sequence of the text sample;
encode the text feature sequence of the text sample at at least one prosody level using the prosodic feature acquisition module to obtain at least one prosodic feature sequence of the text sample;
generate, using the first duration generation module, a speech duration sequence corresponding to the text feature sequence from the text feature sequence of the text sample and each prosodic feature sequence;
generate, using the second duration generation module, a speech duration sequence corresponding to each prosodic feature sequence from that prosodic feature sequence;
and update the parameters of the speech duration prediction model so that the speech duration sequence corresponding to the text feature sequence approaches the speech duration labels for the text feature sequence, and the speech duration sequence corresponding to each prosodic feature sequence approaches the speech duration labels for that prosodic feature sequence.
A speech duration prediction device, comprising a memory and a processor;
the memory is configured to store a program;
and the processor is configured to execute the program to implement the steps of the speech duration prediction method according to any one of the above.
A readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the steps of the speech duration prediction method according to any one of the above.
According to the above technical solutions, the speech duration prediction method, apparatus, device, and readable storage medium provided in the embodiments of this application, after obtaining text data, encode the text data at at least two prosody levels using a pre-trained duration prediction model to obtain encoded feature sequences at the at least two prosody levels, and generate, using the duration prediction model, a speech duration sequence corresponding to the text data from those encoded feature sequences. Because the encoding is performed at at least two prosody levels, the speech duration can be controlled at different prosody levels.
Drawings
To more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only embodiments of the present application, and that those skilled in the art can obtain other drawings from the provided drawings without creative effort.
Fig. 1 is a flowchart of an implementation of the speech duration prediction method disclosed in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech duration prediction model disclosed in an embodiment of the present application;
FIG. 3 is a flowchart of using the duration prediction model to obtain encoded feature sequences at at least two prosody levels and to generate the speech duration sequence corresponding to the text data from those sequences;
FIG. 4 is a schematic structural diagram of the encoding module disclosed in embodiments of the present application;
Fig. 5 is an example of the prosodic feature acquisition module encoding two prosody levels, according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of the first duration generation module disclosed in an embodiment of the present application;
Fig. 7 is a schematic structural diagram of the duration generation module disclosed in an embodiment of the present application;
FIG. 8 is a flowchart of the duration generation module generating the speech duration sequence corresponding to the text data from the encoded feature sequences at at least two prosody levels, according to an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of a speech duration prediction apparatus according to an embodiment of the present application;
Fig. 10 is a block diagram of a hardware configuration of a speech duration prediction device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
To overcome the problems that speech synthesized from durations predicted by existing duration prediction methods exhibits obvious word-by-word pausing and poor fluency, the basic idea of this application is to encode text data at at least two prosody levels using a pre-trained duration prediction model, and to control the speech duration at different prosody levels according to the encoded feature sequences at those levels. This reduces the probability of word-by-word pausing, gives the synthesized speech better continuity, and makes it sound more natural.
Based on the foregoing basic ideas, an implementation flowchart of the speech duration prediction method provided in the embodiment of the present application is shown in fig. 1, and may include:
step S11: text data is acquired.
The text data is the text of the speech to be synthesized, as specified by a user. It may be text entered by the user in real time, or data in a text designated by the user.
Step S12: and coding at least two prosody levels on the text data by using a pre-trained duration prediction model to obtain coding feature sequences of at least two prosody levels.
Optionally, the at least two prosody levels may be at least two of the following prosody levels: phonemes, prosodic words, prosodic phrases, prosodic clauses, prosodic sentences, and the like. These prosodic levels can be thought of as the division of sentences on different scales based on textual analysis of the sentences.
Table 1 shows an example, from an embodiment of this application, of dividing the sentence "The Chinese women's volleyball team got tickets for the game at the first opportunity and started its tenth game trip in history" into prosody levels. In this example, L1 denotes the prosody level of prosodic words, L3 the level of prosodic phrases, L4 the level of prosodic clauses, and L5 the level of prosodic sentences.
TABLE 1 (rendered as an image in the original publication: the example sentence segmented at the L1, L3, L4, and L5 prosody levels)
Of course, in addition to the prosody levels listed above, more levels may be defined in embodiments of this application; for example, levels smaller than a phoneme may be defined, such as the multi-frame level or the frame level.
Different prosody levels have different modeling scales:
the modeling scale of the phoneme level is the phoneme, i.e., the encoded features in the phoneme-level sequence correspond one-to-one to the phonemes in the phoneme string parsed from the text data;
the modeling scale of the prosodic word level is the word, i.e., the encoded features correspond one-to-one to the words in the word sequence parsed from the text data;
the modeling scale of the prosodic phrase level is the phrase, i.e., the encoded features correspond one-to-one to the phrases parsed from the text data;
the modeling scale of the prosodic clause level is the clause, i.e., the encoded features correspond one-to-one to the clauses parsed from the text data;
and the modeling scale of the prosodic sentence level is the sentence, i.e., the encoded features in the prosodic-sentence-level sequence correspond one-to-one to the sentences parsed from the text data.
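As an illustration only, a minimal Python sketch of how one sentence might be divided at several prosody levels; the segment boundaries, level names, and data structure below are assumptions for exposition, not taken from the patent:

```python
# Illustrative only: one sentence divided at different prosodic scales.
# Higher levels yield shorter sequences, so each level gets its own
# encoded feature sequence (and later its own duration sequence).
segmentation = {
    "L1_prosodic_words":   ["start", "history", "tenth", "game", "trip"],
    "L3_prosodic_phrases": ["start the tenth in history", "the game trip"],
    "L4_prosodic_clauses": ["start the tenth game trip in history"],
}

for level, segments in segmentation.items():
    print(level, "->", len(segments), "units")
```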
Step S13: and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of at least two prosody levels by using the time length prediction model.
The modeling scale of the speech duration may be the phoneme, or another scale such as the syllable, the word, or the phrase.
For example, if the modeling scale of the speech duration is the phoneme, each speech duration in the speech duration sequence corresponds to a phoneme in the phoneme string parsed from the text data; if the modeling scale is the word, each speech duration corresponds to a word in the text data.
The speech duration prediction method provided in the embodiments of this application encodes text data at at least two prosody levels using a pre-trained duration prediction model to obtain encoded feature sequences at those levels, and generates the speech duration sequence corresponding to the text data from them. Because the encoding is performed at at least two prosody levels, the speech duration can be controlled at different prosody levels.
Referring to fig. 2, fig. 2 is a schematic structural diagram of a speech duration prediction model according to an embodiment of the present application, which may include:
an encoding module 21 and a duration generation module 22, where:
the encoding module 21 is configured to perform encoding on the text data at least two prosody levels to obtain encoding feature sequences of at least two prosody levels.
When text data is encoded at at least two prosody levels, it is usually first encoded at a lower prosody level to obtain the lower-level encoded feature sequence, which is then encoded at a higher prosody level to obtain the higher-level encoded feature sequence.
The encoding module 21 may be implemented by a neural network, for example a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), or a combination of several networks. The specific network form is not limited by this application.
The duration generation module 22 is configured either to generate the speech duration sequence corresponding to the text data directly from the encoded feature sequences at the at least two prosody levels, or to generate the speech duration sequences of the at least two prosody levels as at least two initial speech duration sequences, which are then used to produce the speech duration sequence corresponding to the text data.
In embodiments of this application, different prediction networks may be used for duration prediction at different prosody levels, or the same network may be used at all levels. Each prosody level selects one of the following as its duration prediction network: a Convolutional Neural Network (CNN), a Recurrent Neural Network (RNN), a Long Short-Term Memory network (LSTM), a conditional Flow-based generative model (cFlow), etc. The specific network form is not limited by this application.
Correspondingly, a flowchart of using the duration prediction model to obtain the encoded feature sequences at at least two prosody levels and to generate the speech duration sequence corresponding to the text data from them is shown in fig. 3, and may include:
step S31: and encoding the text data by using an encoding module 21 at least two prosody levels to obtain at least two prosody level encoding feature sequences.
Step S32: generate, using the duration generation module 22, the speech duration sequence corresponding to the text data from the encoded feature sequences at the at least two prosody levels. Specifically,
if the duration generation module 22 directly generates a speech duration sequence for the text data, that sequence is used as the speech duration sequence corresponding to the text data. If the duration generation module 22 generates at least two initial speech duration sequences, those sequences are subsequently processed to produce the speech duration sequence corresponding to the text data. A minimal sketch of these two paths follows.
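The following is a minimal, self-contained sketch of the two generation paths just described; all function names and the dummy durations are hypothetical stand-ins for the encoding module 21 and the duration generation module 22, not the patent's actual interfaces:

```python
# Hypothetical stand-ins for modules 21 and 22 (illustration only).
def encode(text):
    # Encoding module 21: one encoded feature sequence per prosody level
    # (here just token lists, for illustration).
    return {"phoneme": list(text.replace(" ", "")), "word": text.split()}

def generate_direct(feature_seqs):
    # Path 1: module 22 directly outputs the final phoneme-level
    # duration sequence (dummy constant durations here).
    return [5 for _ in feature_seqs["phoneme"]]

def generate_per_level(feature_seqs):
    # Path 2: one initial duration sequence per prosody level.
    return {lvl: [5 for _ in seq] for lvl, seq in feature_seqs.items()}

def reconcile(initial_seqs):
    # Stand-in for the post-processing that aligns the level totals
    # (detailed in the worked example near the end of this description).
    return initial_seqs["phoneme"]

feats = encode("open the tenth game trip")
durations = generate_direct(feats)                 # path 1
durations = reconcile(generate_per_level(feats))   # path 2
```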
In an alternative embodiment, a schematic structural diagram of the encoding module 21 is shown in fig. 4, and may include:
a feature extraction module 41 and a prosodic feature acquisition module 42, where:
the feature extraction module 41 is configured to extract features of the text data to obtain a text feature sequence, which is used as a coding feature sequence of the lowest prosody level.
In embodiments of this application, feature extraction may be performed on the text data according to the modeling scale of the speech duration, which may be the phoneme, the syllable, or another scale such as a character or a word.
Specifically, after the modeling scale of the speech duration is determined, the text data is parsed at that scale to obtain each modeling object in the text data, and the features of the modeling objects are then extracted to obtain the text feature sequence of the text data. Each text feature in the sequence corresponds to one modeling object, and the text feature of a modeling object may be an explicit feature, i.e., a feature with a definite meaning.
Taking the phoneme as the modeling scale of the speech duration as an example, the text data is parsed to obtain a phoneme string, and feature extraction is then performed on the phoneme string to obtain the text feature of each phoneme; the text features of all phonemes in the string form the text feature sequence. The text feature of each phoneme contains one or more of the following pieces of information: which phoneme it is, which phoneme precedes it, which phoneme follows it, and its tone, and it may further include position information such as the phoneme's position in the sentence and its position in the current clause.
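As an illustration, a minimal sketch of what such an explicit phoneme-level text feature could look like, following the information listed above; the field names are assumptions, not the patent's:

```python
# Hypothetical structure for a phoneme-level text feature (illustration only).
from dataclasses import dataclass

@dataclass
class PhonemeFeature:
    phoneme: str          # which phoneme this is
    prev_phoneme: str     # preceding phoneme ("" at sentence start)
    next_phoneme: str     # following phoneme ("" at sentence end)
    tone: int             # lexical tone of the syllable containing the phoneme
    pos_in_sentence: int  # position of the phoneme in the sentence
    pos_in_clause: int    # position of the phoneme in the current clause

# Hypothetical feature for the phoneme "zh" of the character 中 (zhong, tone 1):
feat = PhonemeFeature("zh", "", "ong", 1, 0, 0)
```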
In embodiments of this application, the modeling scale of the text features is the same as the modeling scale of the speech duration.
The prosodic feature acquisition module 42 is configured to perform at least one encoding at a non-lowest prosody level on the text feature sequence to obtain at least one prosodic feature sequence at a non-lowest prosody level, which serves as the encoded feature sequence of the other prosody levels among the at least two prosody levels.
For example, if the modeling scale of the text features is the phoneme when features are extracted from the text data, the lowest prosody level is defined as the phoneme, and accordingly the at least one non-lowest prosody level may be at least one of: prosodic words, prosodic phrases, prosodic clauses, prosodic sentences. If the modeling scale of the text features is the word, i.e., each feature in the text feature sequence corresponds to a word in the text data, the lowest prosody level is defined as the prosodic word, and accordingly the at least one non-lowest prosody level may be at least one of: prosodic phrases, prosodic clauses, prosodic sentences.
For each prosody level, the encoding at that level is temporally correlated: when a prosodic feature is computed, not only the relevant text features are used, but also the prosodic feature computed at the previous time step.
Fig. 5 shows an example of the prosodic feature acquisition module 42 encoding two prosody levels (typically hidden-layer encoding, i.e., the prosodic features obtained are hidden-layer features). The example uses "start the tenth game trip in history" to describe encoding the text feature sequence at the L1 and L2 prosody levels. Here, the text feature sequence of the sentence is a phoneme-level sequence, i.e., one text feature per phoneme. After acquiring the text feature sequence (the phoneme-level input in fig. 5; the displayed words merely illustrate the correspondence between prosodic features and text features, while what is actually input to the prosodic feature acquisition module are the text features of the phonemes parsed from the words), the prosodic feature acquisition module 42 performs word-level encoding to obtain word-level prosodic features, performs L1-level encoding on the word-level prosodic features to obtain the L1 prosodic feature sequence, and performs L2-level encoding on the L1 prosodic features to obtain the L2 prosodic feature sequence. At each level, every prosodic feature except the first is computed using the prosodic feature obtained at the previous time step. For example, when computing the word-level prosodic feature of the word "start", the prosodic feature of the preceding unit is used in addition to the text features corresponding to "start". Likewise, in the L1-level encoding, when computing the prosodic feature of the word "game", the prosodic features of its two constituent characters are used together with the L1 prosodic feature of the preceding unit "tenth".
It should be noted that fig. 5 takes an RNN structure as an example, but the prosodic feature acquisition module 42 in embodiments of this application is not limited to an RNN and may be implemented by other network structures. A minimal sketch of this hierarchical encoding follows.
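Below is a minimal numpy sketch of such hierarchical, temporally correlated encoding: each level pools the features of the lower-level units it spans and runs a simple recurrent cell so that every unit also sees its predecessor's state. The pooling, cell, and dimensions are illustrative assumptions, not the patent's network:

```python
import numpy as np

def rnn_encode(inputs, W, U, b):
    """Run a simple tanh RNN over `inputs`; return one hidden state per step."""
    h = np.zeros(U.shape[0])
    states = []
    for x in inputs:
        h = np.tanh(W @ x + U @ h + b)  # uses the previous step's state
        states.append(h)
    return states

def encode_level(lower_feats, spans, W, U, b):
    """Encode one prosody level: pool lower-level features over each span,
    then run the recurrent cell so each unit sees its predecessor."""
    pooled = [np.mean(lower_feats[a:z], axis=0) for a, z in spans]
    return rnn_encode(pooled, W, U, b)

rng = np.random.default_rng(0)
d = 8
phone_feats = rng.normal(size=(10, d))            # 10 phoneme-level features
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
word_feats = encode_level(phone_feats, [(0, 3), (3, 6), (6, 10)], W, U, b)
l1_feats = encode_level(np.array(word_feats), [(0, 2), (2, 3)], W, U, b)
print(len(word_feats), len(l1_feats))  # 3 word units, 2 L1 units
```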
In an alternative embodiment, the duration generation module 22 may include:
a first duration generation module, configured to generate a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels, which serves as the speech duration sequence corresponding to the text data; the speech durations in this sequence correspond one-to-one to the text features in the text feature sequence.
Optionally, a schematic structural diagram of the first duration generating module is shown in fig. 6, and may include:
and the splicing module 61 is configured to splice the text features and prosodic features of each prosodic hierarchy generated based on the text features to obtain corresponding splicing features of the text features.
Taking "get the game ticket in the first time of the female rank of china and start the travel of the tenth game in the history" as an example, it is assumed that each text feature in the text feature sequence corresponds to a phoneme, and the following explains the example of the phoneme corresponding to the "middle" word and the phoneme corresponding to the "rank" word. The two phonemes zh and ong can be analyzed by the Chinese word, and the two phonemes p and ai can be analyzed by the row word. Assuming that the text feature sequence is encoded at two prosody levels, namely, the L1 prosody level and the L3 prosody level, there are 12 prosody features in the prosody feature sequence at the L1 prosody level corresponding to 12 words at the L1 prosody level, and there are 4 prosody features in the prosody feature sequence at the L3 prosody level corresponding to four prosody phrases at the L3 level. The text feature corresponding to the phoneme of zh is obtained by splicing the text feature corresponding to zh, the prosodic feature of "china" and the prosodic feature of "female line of china" in the embodiment of the present application. Similarly, the text feature corresponding to the ong phoneme is obtained by splicing the text feature corresponding to the ong phoneme, the prosody feature of "china" and the prosody feature of "female chinese line" in the embodiment of the present application. Similarly, corresponding to the text feature corresponding to the p phoneme, in the embodiment of the present application, the text feature corresponding to p, the prosody feature of "female line" and the prosody feature of "female line of china" are spliced to obtain the splicing feature corresponding to the text feature corresponding to the p phoneme.
a sampling module 62, configured to sample from a preset variable conforming to a preset distribution to obtain the target value corresponding to each splicing feature.
The variable conforming to the preset distribution is a duration-related variable. It may be a variable conforming to a standard Gaussian distribution, i.e., all of its values follow the standard Gaussian distribution. The number of target values equals the number of text features in the text feature sequence: if there are N text features, the variable is sampled N times to obtain N target values. The N samplings may be performed sequentially or simultaneously; sampling simultaneously means sampling N identical copies of the variable at once (equivalent to duplicating the variable N-1 times) to obtain the N target values.
Optionally, the sampling module 62 may sample in any one of the following modes, illustrated in the sketch below.
Mode 1:
for each splicing feature, randomly sample from the variable conforming to the preset distribution to obtain the target value corresponding to that splicing feature.
Due to the randomness of the sampling, speech durations with stronger expressiveness can be generated.
Mode 2:
for each splicing feature, adjust a parameter of the variable conforming to the preset distribution to determine a target sampling range corresponding to the splicing feature, then randomly sample within that range to obtain the target value. Specifically, if the preset distribution is Gaussian, its variance may be adjusted, and the value range of the variance-adjusted Gaussian variable is used as the target sampling range corresponding to the splicing feature.
In general, the smaller the variance, the more stable the predicted speech duration.
Mode 3:
for one part of the splicing features, randomly sample from the variable conforming to the preset distribution to obtain the target values of the splicing features in that part;
for the other part of the splicing features, adjust a parameter of the variable conforming to the preset distribution to determine a target sampling range for each splicing feature, and randomly sample within each range to obtain the corresponding target values.
Mode 3 is clearly a combination of modes 1 and 2.
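A minimal numpy sketch of the three sampling modes, assuming a standard Gaussian as the preset distribution; the "variance adjustment" of mode 2 is shown as shrinking the standard deviation (illustration only, not the patent's code):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6  # one target value per splicing feature

# Mode 1: unconstrained random sampling -> more expressive durations.
targets_1 = rng.normal(loc=0.0, scale=1.0, size=N)

# Mode 2: shrink the variance first -> more stable durations.
sigma = 0.3
targets_2 = rng.normal(loc=0.0, scale=sigma, size=N)

# Mode 3: mix the two approaches, per splicing feature.
use_full_variance = np.array([True, True, False, False, True, False])
targets_3 = np.where(use_full_variance,
                     rng.normal(0.0, 1.0, size=N),
                     rng.normal(0.0, sigma, size=N))
```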
a transformation module 63, configured to apply a pre-trained first transformation to the target value sequence formed by the target values, conditioned on the splicing feature sequence formed by the splicing features of the text features, to obtain the speech duration sequence corresponding to the text feature sequence. The transformation module 63 has the capability of applying the inverse of the first transformation to the speech duration sequence corresponding to the text feature sequence, conditioned on the splicing feature sequence, to obtain the target value sequence.
That is, given a text A and the speech duration sequence At formed by the durations of the phonemes in the speech signal corresponding to A, the transformation module 63 can apply the inverse of the first transformation to At, using the (phoneme-level) text feature sequence of A, to obtain a target value sequence that conforms to the preset distribution.
Optionally, the transformation module 63 may be implemented by a cFlow model. cFlow is a probabilistic modeling framework based on change of variables. Assume the text feature sequence is c = [c1, c2, ..., cN], where each text feature corresponds to a phoneme, and the duration sequence formed by the speech durations of those phonemes is x = [x1, x2, ..., xN]. Then x can be transformed into another variable s = [s1, s2, ..., sN] by an invertible transformation f, that is:
s = f(x, c),  x = f⁻¹(s, c)
Assuming the distribution of s is simple so that its likelihood can be computed directly (e.g., a Gaussian), the distribution of x follows from the change-of-variables formula:
p(x | c) = p(s | c) · |det(∂s/∂x)|,  where s = f(x, c)
In the cFlow model, a complex transformation is usually obtained by composing multiple transformations f. For example, given K transformations f1, f2, ..., fK, x can be transformed step by step as x → s1 → s2 → ... → sK, and the probability of x is then:
log p(x | c) = log p(sK | c) + Σ_{k=1..K} log |det(∂sk/∂s(k-1))|,  with s0 = x
the cFlow model can be understood as: to model a variable x, x is first transformed to another variable s through a series of invertible transformations f, and s is subjected to a simple distribution (e.g., a standard gaussian distribution N (0, 1)) that is preset, with the variable s and invertible transformations f describing the variable x. When the voice time length needs to be generated, acquiring the text characteristics of the text data of the voice time length to be generated, sampling from s, and carrying out inverse transformation f through the sampling value and the text characteristics of the text data of the voice time length to be generated-1And obtaining the voice time length.
In this embodiment, the speech duration sequence generated by the first duration generation module is used directly as the speech duration sequence corresponding to the text data.
In another alternative embodiment, a schematic structural diagram of the duration generation module 22 is shown in fig. 7, and may include:
a first duration generation module 71 and a second duration generation module 72, where:
the first time length generation module 71 is configured to generate a speech time length sequence corresponding to the text feature sequence according to the coding feature sequences of at least two prosody hierarchies; and the voice duration in the voice duration sequence corresponds to the text features in the text feature sequence one by one. The structure of the first duration generating module 71 is the same as that of the first duration generating module shown in fig. 6, and is not described herein again.
The second duration generating module 72 is configured to generate a voice duration sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence; the voice time length in the voice time length sequence corresponds to the prosodic feature in the prosodic feature sequence one by one.
In the embodiment of the application, the voice duration of each prosody level is generated corresponding to each prosody level. The networks to which the durations of the different prosodic levels are generated may be the same or different. The networks that predict speech duration may be independent of each other at different prosodic levels.
Accordingly, a flowchart of the duration generation module 22 generating the speech duration sequence corresponding to the text data from the encoded feature sequences at at least two prosody levels is shown in fig. 8, and may include:
Step S81: generate, using the first duration generation module 71, a speech duration sequence corresponding to the text feature sequence from the encoded feature sequences at the at least two prosody levels.
Step S82: generate, using the second duration generation module 72, a speech duration sequence corresponding to each prosodic feature sequence, where the speech durations correspond one-to-one to the prosodic features.
Step S83: process the speech duration sequence corresponding to the text feature sequence and the speech duration sequences corresponding to the prosodic feature sequences to obtain the speech duration sequence corresponding to the text data. Specifically:
in order of prosody level from high to low, for the speech duration sequences of any two adjacent prosody levels, adjust the speech durations in the lower-level sequence according to the higher-level sequence, so that the sum of the durations in the lower-level sequence equals the sum of the durations in the higher-level sequence;
and determine the adjusted speech duration sequence of the lowest prosody level as the speech duration sequence corresponding to the text data.
In this embodiment, constraining the durations at multiple prosodic scales preserves the expressiveness of the durations while improving the stability of duration prediction.
Take three prosody hierarchies L1, L3, and L4 as an example. At the L4 prosody level, the clause is "the trip of the tenth game in the opening history"; at the L3 prosody level, the corresponding phrases are "the tenth time in the opening history" and "the trip of the game"; at the L1 prosody level, the words are "on", "history", "tenth", "race" and "trip":
Assume that the duration predictions for each prosody level are: the total duration of the clause "the trip of the tenth game in the opening history" is 300 frames (one frame is usually 5 ms); the total duration of the phrase "the tenth time in the opening history" is 180 frames, and the total duration of the phrase "the trip of the game" is 110 frames; the duration of the word "on" is 40 frames, the word "history" 60 frames, the word "tenth" 80 frames, the word "race" 60 frames, and the word "trip" 56 frames. The duration prediction results of the prosody levels are then processed as follows:
the total duration of the phrases "the tenth time in the opening history" and "the trip of the game" is 180 + 110 = 290 frames, which is 10 frames shorter than the 300-frame total duration of the clause "the trip of the tenth game in the opening history", so the durations of the two phrases need to be stretched until their total is 300 frames. Each phrase duration can be stretched to 300/290 times the original, i.e., the duration of "the tenth time in the opening history" is stretched to 180 × 300/290 ≈ 186 frames, and the duration of "the trip of the game" is stretched to 110 × 300/290 ≈ 114 frames.
The total duration of the three words "on", "history" and "tenth" is 40 + 60 + 80 = 180 frames, which is 6 frames shorter than the 186-frame duration of "the tenth time in the opening history", so the durations of these three words need to be stretched until their total is 186 frames. Each word duration can be stretched to 186/180 times the original, i.e., the duration of "on" is stretched to 40 × 186/180 ≈ 41 frames, the duration of "history" to 60 × 186/180 = 62 frames, and the duration of "tenth" to 80 × 186/180 ≈ 83 frames.
The total duration of the two words "race" and "trip" is 60 + 56 = 116 frames, which is 2 frames longer than the 114-frame duration of "the trip of the game", so the durations of these two words need to be compressed until their total is 114 frames. Each word duration can be compressed to 114/116 of the original, i.e., the duration of "race" is compressed to 60 × 114/116 ≈ 59 frames, and the duration of "trip" is compressed to 56 × 114/116 ≈ 55 frames.
If the L1 prosody level is the lowest prosody level, the voice duration sequence corresponding to the clause "the trip of the tenth game in the opening history" can be determined as: [41, 62, 83, 59, 55].
If the L1 prosody level is not the lowest prosody level and phonemes are the lowest prosody level, the durations of the phonemes in the phoneme string corresponding to "the trip of the tenth game in the opening history" are further adjusted according to the adjusted voice duration sequence of the L1 prosody level (i.e., [41, 62, 83, 59, 55]), so that the sum of the adjusted phoneme durations equals the sum of the voice durations in the adjusted voice duration sequence of the L1 prosody level (41 + 62 + 83 + 59 + 55 = 300).
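The adjustment rule used in this example can be sketched as a short Python function; the proportional rescaling follows the description above, while the rounding step is a simplification that may leave an off-by-one residual a full implementation would have to redistribute.

def rescale_durations(child_durations, parent_total):
    """Stretch or compress child-level durations so that they sum to the
    parent-level total while preserving their relative proportions."""
    ratio = parent_total / sum(child_durations)
    # Rounding is a simplification: in general it may leave an off-by-one
    # residual that would need to be assigned to some unit.
    return [round(d * ratio) for d in child_durations]

# Reproducing the worked example above (L4 -> L3 -> L1):
phrases = rescale_durations([180, 110], 300)    # -> [186, 114]
words_a = rescale_durations([40, 60, 80], 186)  # -> [41, 62, 83]
words_b = rescale_durations([60, 56], 114)      # -> [59, 55]
print(phrases, words_a + words_b, sum(words_a + words_b))  # total is 300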
The following describes the training process of the duration prediction model. The training process of the duration prediction model specifically may include:
and extracting the features of the text sample by using the feature extraction module 41 to obtain a text feature sequence of the text sample.
And performing at least one prosody level coding on the text feature sequence of the text sample by using a prosody feature acquisition module 42 to obtain at least one prosody level prosody feature sequence of the text sample.
And generating a voice time length sequence corresponding to the text characteristic sequence according to the text characteristic sequence of the text sample and each prosody characteristic sequence by using the first time length generation module 71.
And generating a voice time length sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using the second time length generating module 72.
And updating the parameters of the voice duration prediction model with the goals that the voice duration sequence corresponding to the text feature sequence of the text sample approaches the voice duration sequence label corresponding to the text feature sequence, and that the voice duration sequence corresponding to each prosodic feature sequence of the text sample approaches the voice duration sequence label corresponding to that prosodic feature sequence. That is, the parameters of the voice duration prediction model are updated with the goal that the voice duration sequence corresponding to each prosody level approaches the voice duration sequence label corresponding to that prosody level.
That is to say, in the embodiment of the application, when the time length prediction model is trained, losses of different prosody levels are considered, so that the time length modeling precision of the time length prediction model is improved, and the continuity and expressive force of the predicted time length are effectively improved.
The loss function Loss of the duration prediction model may be a weighted sum of the loss functions corresponding to the respective prosody levels. Optionally:
If the first duration generation module 71 is a cFlow network, it can be trained with the maximum likelihood criterion, and its loss function Loss_1 is:
Loss_1 = log p(x) = log p(s_K) + ∑_{i=1}^{K} log |det(∂s_i/∂s_{i−1})|
where p(s_K) is the probability of the transformed variable s_K under the preset simple distribution (e.g., N(0, 1)).
This loss function has two objectives: one is that the transformed variable s should be as close as possible to the preset simple distribution (e.g., N(0, 1)); the other is that the transformation process should be as simple as possible.
The loss function of each prosody level in the second duration generation module 72 may be trained with the MSE (mean square error) criterion:
Loss_i = (1/N_i) ∑_{n=1}^{N_i} (d̂_n^(i) − d_n^(i))²
where Loss_i denotes the loss function of the i-th prosody level in the second duration generation module 72, d̂_n^(i) is the predicted duration of the n-th unit at that level, d_n^(i) is the corresponding duration label, and N_i is the number of units at that level.
Since the objective of the Loss_1 function is to take as large a value as possible, while the objective of each Loss_i function is to take as small a value as possible, the loss function Loss of the duration prediction model may specifically be:
Loss = -Loss_1 + ∑_i ω_i · Loss_i
where ω_i is the weight of the loss function of the i-th prosody level. The objective of Loss is to take as small a value as possible.
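A minimal Python sketch of this combined loss computation follows; the per-level predictions, labels, and weights are made-up illustrative values, not data from this application.

import numpy as np

def total_loss(log_px, preds_by_level, labels_by_level, weights):
    """Sketch of Loss = -Loss_1 + sum_i w_i * Loss_i: the flow term Loss_1
    is a log-likelihood to be maximized, so it enters with a minus sign,
    while each prosody level i contributes a weighted MSE term."""
    loss = -log_px
    for w, pred, label in zip(weights, preds_by_level, labels_by_level):
        loss += w * float(np.mean((np.asarray(pred) - np.asarray(label)) ** 2))
    return loss

# Toy usage with made-up numbers (two prosody levels):
loss = total_loss(
    log_px=3.2,                                  # Loss_1 from the flow network
    preds_by_level=[[41, 62, 83], [186, 114]],   # predicted durations
    labels_by_level=[[40, 60, 85], [180, 120]],  # duration labels
    weights=[1.0, 0.5],                          # omega_i per prosody level
)
print(loss)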
Corresponding to the method embodiment, an embodiment of the present application further provides a speech duration prediction apparatus, a schematic structural diagram of which is shown in fig. 9, and the speech duration prediction apparatus may include:
an acquisition module 91, a coding control module 92 and a generation control module 93; wherein:
the obtaining module 91 is configured to obtain text data;
the encoding control module 92 is configured to perform encoding on the text data by using a pre-trained duration prediction model to obtain at least two prosody levels of encoding feature sequences;
the generation control module 93 is configured to generate a speech duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration prediction model.
The voice duration prediction device provided in the embodiment of the application uses a pre-trained duration prediction model to perform encoding of at least two prosody levels on the text data, obtaining encoding feature sequences of at least two prosody levels, and generates the voice duration sequence corresponding to the text data according to these encoding feature sequences. Because the encoding of the text data is performed at two or more prosody levels, the voice duration can be controlled at different prosody levels.
In an optional embodiment, the encoding control module 92 is specifically configured to:
coding at least two prosody levels on the text data by using a coding module of the duration prediction model to obtain coding feature sequences of at least two prosody levels;
the generation control module is specifically configured to:
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using a time length generation module of the time length prediction model.
In an alternative embodiment, the encoding control module 92 may include:
the feature extraction control module is used for extracting the features of the text data by using the feature extraction module of the duration prediction model to obtain a text feature sequence serving as a coding feature sequence of the lowest prosody level;
and the prosodic feature acquisition control module is used for coding at least one non-lowest prosodic level on the text feature sequence by using the prosodic feature acquisition module of the duration prediction model to obtain a prosodic feature sequence of at least one prosodic level as a coding feature sequence of other prosodic levels in the coding feature sequences of at least two prosodic levels.
In an optional embodiment, the generation control module may include:
a first control module, configured to generate, by using a first duration generation module of the duration prediction model, a speech duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels, where the speech duration sequence is used as a speech duration sequence corresponding to the text data; the voice time length in the voice time length sequence corresponds to the text features in the text feature sequence one by one.
in an optional embodiment, the generation control module may include:
the first control module is used for generating a voice time length sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels by utilizing a first time length generation module of the time length prediction model;
the second control module is used for generating a voice time length sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using a second time length generation module of the time length prediction model; the voice time length in the voice time length sequence corresponds to the prosodic feature in the prosodic feature sequence one by one;
and the duration processing module is used for processing the voice duration sequences corresponding to the text characteristic sequences and the voice duration sequences corresponding to the prosody characteristic sequences to obtain the voice duration sequences corresponding to the text data.
In an alternative embodiment, the first control module comprises:
the splicing control module is used for splicing the text features and the prosodic features of each prosodic level generated based on the text features by utilizing the splicing module of the duration prediction model to obtain the splicing features corresponding to the text features;
the sampling control module is used for sampling variables which are preset and accord with preset distribution by using the sampling module of the duration prediction model to obtain target values corresponding to all the splicing characteristics;
the conversion control module is used for performing a pre-trained first transformation on the target value sequence formed by the target values, according to the splicing feature sequence formed by the splicing features corresponding to the text features, by using the transformation module of the duration prediction model, to obtain a voice duration sequence corresponding to the text feature sequence; wherein the transformation module has the capability of: performing, according to the splicing feature sequence, the inverse of the first transformation on the voice duration sequence corresponding to the text feature sequence to obtain the target value sequence.
In an optional embodiment, the sampling control module may be specifically configured to:
corresponding to each splicing characteristic, randomly sampling the variables which accord with the preset distribution by using the sampling module to obtain a target value corresponding to the splicing characteristic;
or,
corresponding to each splicing characteristic, adjusting the parameters of the variables which accord with the preset distribution by using the sampling module so as to determine a target sampling range corresponding to the splicing characteristic; randomly sampling in a target sampling range corresponding to the splicing characteristic to obtain a target value corresponding to the splicing characteristic;
or,
corresponding to part of splicing characteristics, randomly sampling the variables which accord with the preset distribution by using the sampling module to obtain target values corresponding to all the splicing characteristics in the part of splicing characteristics;
corresponding to the other part of the splicing characteristics, adjusting the parameters of the variables which accord with the preset distribution by using the sampling module so as to determine a target sampling range corresponding to each splicing characteristic; and randomly sampling in a target sampling range corresponding to each splicing feature to obtain a target value corresponding to each splicing feature in the other part of splicing features.
In an optional embodiment, the preset distribution is a Gaussian distribution, and the sampling control module, when adjusting, for each splicing feature, the parameters of the variable conforming to the preset distribution by using the sampling module to determine the target sampling range corresponding to the splicing feature, is specifically configured to:
and corresponding to each splicing characteristic, adjusting the variance of the Gaussian distribution by using the sampling module, and taking the value range of the variable of the Gaussian distribution which accords with the adjusted variance as a target sampling range corresponding to the splicing characteristic.
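For illustration, a minimal Python sketch of this variance-adjustment strategy follows; the scale values are arbitrary examples, not parameters from this application.

import numpy as np

rng = np.random.default_rng(seed=0)

def sample_target_value(scale=1.0):
    """Sample a target value from N(0, scale^2). Shrinking the variance
    narrows the effective sampling range, trading duration diversity for
    stability; scale=1.0 keeps the full standard Gaussian prior."""
    return rng.normal(loc=0.0, scale=scale)

stable = sample_target_value(scale=0.3)      # narrower range, more stable
expressive = sample_target_value(scale=1.0)  # full prior, more expressive
print(stable, expressive)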
In an optional embodiment, the duration processing module may be specifically configured to:
according to the sequence of the prosody levels from high to low, for the voice time length sequences of any two adjacent prosody levels, according to the voice time length sequence of a higher prosody level in the voice time length sequences of the two adjacent prosody levels, the voice time length of the voice time length sequence of a lower prosody level in the voice time length sequences of the two adjacent prosody levels is adjusted, so that the sum of the voice time lengths in the voice time length sequence of the lower prosody level is equal to the sum of the voice time lengths in the voice time length sequence of the higher prosody level;
and determining the adjusted voice duration sequence of the lowest prosody level as the voice duration sequence corresponding to the text data.
In an optional embodiment, the speech duration prediction apparatus may further include a training module, specifically configured to:
extracting the characteristics of a text sample by using the characteristic extraction module to obtain a text characteristic sequence of the text sample;
coding at least one prosody level on the text feature sequence of the text sample by using a prosody feature acquisition module to obtain a prosody feature sequence of at least one prosody level of the text sample;
generating a voice time length sequence corresponding to the text characteristic sequence according to the text characteristic sequence of the text sample and each prosody characteristic sequence by using the first time length generation module;
generating a voice time length sequence corresponding to each prosodic feature sequence by using the second time length generation module according to each prosodic feature sequence;
and updating the parameters of the voice duration prediction model with the goals that the voice duration sequence corresponding to the text feature sequence approaches the voice duration label corresponding to the text feature sequence and that the voice duration sequence corresponding to each prosodic feature sequence approaches the voice duration label corresponding to that prosodic feature sequence; that is, the parameters of the voice duration prediction model are updated with the goal that the voice duration sequence corresponding to each prosody level approaches the voice duration sequence label corresponding to that prosody level.
The voice duration prediction device provided by the embodiment of the application can be applied to voice duration prediction equipment, such as a PC terminal, a cloud platform, a server cluster and the like. Alternatively, fig. 10 shows a block diagram of a hardware structure of the speech duration prediction apparatus, and referring to fig. 10, the hardware structure of the speech duration prediction apparatus may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete mutual communication through the communication bus 4;
the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
the memory 3 may include high-speed RAM, and may further include non-volatile memory, such as at least one disk memory;
wherein the memory stores a program and the processor can call the program stored in the memory, the program for:
acquiring text data;
coding at least two prosody levels on the text data by using a pre-trained duration prediction model to obtain coding feature sequences of at least two prosody levels;
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the time length prediction model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Embodiments of the present application further provide a storage medium, where a program suitable for execution by a processor may be stored, where the program is configured to:
acquiring text data;
coding at least two prosody levels on the text data by using a pre-trained duration prediction model to obtain coding feature sequences of at least two prosody levels;
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the time length prediction model.
Alternatively, the detailed function and the extended function of the program may be as described above.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A method for predicting speech duration, comprising:
acquiring text data;
coding at least two prosody levels on the text data by using a pre-trained duration prediction model to obtain coding feature sequences of at least two prosody levels;
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the time length prediction model.
2. The method of claim 1, wherein the process of coding at least two prosody levels on the text data by using the duration prediction model to obtain the coding feature sequences of at least two prosody levels, and generating the speech duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels, comprises:
coding at least two prosody levels on the text data by using a coding module of the duration prediction model to obtain coding feature sequences of at least two prosody levels;
and generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using a time length generation module of the time length prediction model.
3. The method of claim 2, wherein the coding at least two prosody levels on the text data by using the coding module of the duration prediction model to obtain the coding feature sequences of at least two prosody levels comprises:
extracting the characteristics of the text data by using a characteristic extraction module of the duration prediction model to obtain a text characteristic sequence serving as a coding characteristic sequence of the lowest prosody level;
and performing at least one non-lowest prosody level coding on the text feature sequence by using a prosody feature acquisition module of the duration prediction model to obtain at least one prosody feature sequence of the non-lowest prosody level as a coding feature sequence of other prosody levels in the coding feature sequences of the at least two prosody levels.
4. The method according to claim 3, wherein the generating the speech duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration generation module of the duration prediction model comprises:
generating a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels by using a first duration generation module of the duration prediction model, wherein the voice duration sequence is used as a voice duration sequence corresponding to the text data; and the voice duration in the voice duration sequence corresponds to the text features in the text feature sequence one by one.
5. The method according to claim 3, wherein the generating the speech duration sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the duration generation module of the duration prediction model comprises:
generating a voice duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels by using a first duration generation module of the duration prediction model;
generating a voice time sequence corresponding to each prosodic feature sequence according to each prosodic feature sequence by using a second time generation module of the time prediction model; the voice time length in the voice time length sequence corresponds to the prosodic feature in the prosodic feature sequence one by one;
and processing the voice time length sequence corresponding to the text characteristic sequence and the voice time length sequence corresponding to each prosody characteristic sequence to obtain the voice time length sequence corresponding to the text data.
6. The method according to claim 4 or 5, wherein the generating, by the first duration generating module, the speech duration sequence corresponding to the text feature sequence according to the coding feature sequences of the at least two prosody levels comprises:
splicing the text features and prosodic features of each prosodic level generated based on the text features by utilizing a splicing module of the duration prediction model corresponding to each text feature to obtain splicing features corresponding to the text features;
sampling in preset variables which accord with preset distribution by using a sampling module of the duration prediction model to obtain target values corresponding to all splicing characteristics;
and performing pre-trained first transformation on the target value sequence formed by the target values by utilizing a transformation module of the duration prediction model according to the splicing characteristic sequence formed by the splicing characteristics corresponding to the text characteristics to obtain the voice duration sequence corresponding to the text characteristic sequence.
7. The method according to claim 6, wherein the sampling in the preset variables which accord with the preset distribution by using the sampling module to obtain the target values corresponding to all splicing characteristics comprises:
corresponding to each splicing characteristic, randomly sampling the variables which accord with the preset distribution by using the sampling module to obtain a target value corresponding to the splicing characteristic;
or,
corresponding to each splicing characteristic, adjusting the parameters of the variables which accord with the preset distribution by using the sampling module so as to determine a target sampling range corresponding to the splicing characteristic; randomly sampling in a target sampling range corresponding to the splicing characteristic to obtain a target value corresponding to the splicing characteristic;
or,
corresponding to part of splicing characteristics, randomly sampling the variables which accord with the preset distribution by using the sampling module to obtain target values corresponding to all the splicing characteristics in the part of splicing characteristics;
corresponding to the other part of the splicing characteristics, adjusting the parameters of the variables which accord with the preset distribution by using the sampling module so as to determine a target sampling range corresponding to each splicing characteristic; and randomly sampling in a target sampling range corresponding to each splicing feature to obtain a target value corresponding to each splicing feature in the other part of splicing features.
8. The method according to claim 5, wherein the processing the voice duration sequence corresponding to the text feature sequence and the voice duration sequence corresponding to each prosody feature sequence to obtain the voice duration sequence corresponding to the text data includes:
according to the sequence of the prosody levels from high to low, for the voice time length sequences of any two adjacent prosody levels, according to the voice time length sequence of a higher prosody level in the voice time length sequences of the two adjacent prosody levels, the voice time length of the voice time length sequence of a lower prosody level in the voice time length sequences of the two adjacent prosody levels is adjusted, so that the sum of the voice time lengths in the voice time length sequence of the lower prosody level is equal to the sum of the voice time lengths in the voice time length sequence of the higher prosody level;
and determining the adjusted voice duration sequence of the lowest prosody level as the voice duration sequence corresponding to the text data.
9. The method according to claim 4 or 5, wherein the training process of the speech duration prediction model comprises:
and updating the parameters of the voice time length prediction model by taking the voice time length sequence corresponding to each prosody level approaching to the voice time length sequence label corresponding to the prosody level as a target.
10. A speech duration prediction apparatus, comprising:
the acquisition module is used for acquiring text data;
the encoding control module is used for encoding at least two prosody levels on the text data by utilizing a pre-trained duration prediction model to obtain encoding feature sequences of at least two prosody levels;
and the generation control module is used for generating a voice time length sequence corresponding to the text data according to the coding feature sequences of the at least two prosody levels by using the time length prediction model.
11. A speech duration prediction apparatus comprising a memory and a processor;
the memory is used for storing programs;
the processor, configured to execute the program, and implement the steps of the speech duration prediction method according to any one of claims 1 to 9.
12. A readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the speech duration prediction method according to any one of claims 1-9.
CN201911417701.4A 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium Active CN113129863B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911417701.4A CN113129863B (en) 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911417701.4A CN113129863B (en) 2019-12-31 Voice duration prediction method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113129863A true CN113129863A (en) 2021-07-16
CN113129863B CN113129863B (en) 2024-05-31

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628610A (en) * 2021-08-12 2021-11-09 科大讯飞股份有限公司 Voice synthesis method and device and electronic equipment
CN113823260A (en) * 2021-10-20 2021-12-21 科大讯飞股份有限公司 Speech synthesis model training method, speech synthesis method and device

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6184699A (en) * 1984-10-02 1986-04-30 三洋電機株式会社 Voice synthesization system
WO2000031722A1 (en) * 1998-11-25 2000-06-02 Deutsche Telekom Ag Method for controlling duration in speech synthesis
JP2004347994A (en) * 2003-05-23 2004-12-09 Sharp Corp Device and method for speech synthesis, and program implementing same speech synthesizing method
JP2005018036A (en) * 2003-06-05 2005-01-20 Kenwood Corp Device and method for speech synthesis and program
CN1956057A (en) * 2005-10-28 2007-05-02 富士通株式会社 Voice time premeauring device and method based on decision tree
CN101814288A (en) * 2009-02-20 2010-08-25 富士通株式会社 Method and equipment for self-adaption of speech synthesis duration model
CN102222501A (en) * 2011-06-15 2011-10-19 中国科学院自动化研究所 Method for generating duration parameter in speech synthesis
CN102231276A (en) * 2011-06-21 2011-11-02 北京捷通华声语音技术有限公司 Method and device for forecasting duration of speech synthesis unit
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN109801618A (en) * 2017-11-16 2019-05-24 深圳市腾讯计算机系统有限公司 A kind of generation method and device of audio-frequency information
CN110010136A (en) * 2019-04-04 2019-07-12 北京地平线机器人技术研发有限公司 The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Patent Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6184699A (en) * 1984-10-02 1986-04-30 三洋電機株式会社 Voice synthesization system
WO2000031722A1 (en) * 1998-11-25 2000-06-02 Deutsche Telekom Ag Method for controlling duration in speech synthesis
JP2004347994A (en) * 2003-05-23 2004-12-09 Sharp Corp Device and method for speech synthesis, and program implementing same speech synthesizing method
JP2005018036A (en) * 2003-06-05 2005-01-20 Kenwood Corp Device and method for speech synthesis and program
CN1956057A (en) * 2005-10-28 2007-05-02 富士通株式会社 Voice time premeauring device and method based on decision tree
CN101814288A (en) * 2009-02-20 2010-08-25 富士通株式会社 Method and equipment for self-adaption of speech synthesis duration model
CN102222501A (en) * 2011-06-15 2011-10-19 中国科学院自动化研究所 Method for generating duration parameter in speech synthesis
CN102231276A (en) * 2011-06-21 2011-11-02 北京捷通华声语音技术有限公司 Method and device for forecasting duration of speech synthesis unit
CN104916284A (en) * 2015-06-10 2015-09-16 百度在线网络技术(北京)有限公司 Prosody and acoustics joint modeling method and device for voice synthesis system
CN105244020A (en) * 2015-09-24 2016-01-13 百度在线网络技术(北京)有限公司 Prosodic hierarchy model training method, text-to-speech method and text-to-speech device
CN109801618A (en) * 2017-11-16 2019-05-24 深圳市腾讯计算机系统有限公司 A kind of generation method and device of audio-frequency information
CN108597492A (en) * 2018-05-02 2018-09-28 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device
CN109697973A (en) * 2019-01-22 2019-04-30 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110444191A (en) * 2019-01-22 2019-11-12 清华大学深圳研究生院 A kind of method, the method and device of model training of prosody hierarchy mark
CN110010136A (en) * 2019-04-04 2019-07-12 北京地平线机器人技术研发有限公司 The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110534089A (en) * 2019-07-10 2019-12-03 西安交通大学 A kind of Chinese speech synthesis method based on phoneme and rhythm structure

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
K.S. SREELEKSHMI ET AL: "Clustering of duration patterns in speech for Text-to-Speech Synthesis", 2012 ANNUAL IEEE INDIA CONFERENCE (INDICON), 28 January 2013 (2013-01-28) *
HAO DONGLIANG; YANG HONGWU; ZHANG CE; ZHANG SHUAI; GUO LIZHAO; YANG JINGBO: "An annotation generation method for Chinese statistical parametric speech synthesis", Computer Engineering and Applications, no. 19, 1 October 2016 (2016-10-01) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113628610A (en) * 2021-08-12 2021-11-09 科大讯飞股份有限公司 Voice synthesis method and device and electronic equipment
CN113628610B (en) * 2021-08-12 2024-02-13 科大讯飞股份有限公司 Voice synthesis method and device and electronic equipment
CN113823260A (en) * 2021-10-20 2021-12-21 科大讯飞股份有限公司 Speech synthesis model training method, speech synthesis method and device

Similar Documents

Publication Publication Date Title
CN107203511B (en) Network text named entity identification method based on neural network probability disambiguation
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN107451115B (en) Method and system for constructing end-to-end Chinese prosody hierarchical structure prediction model
CN112397056B (en) Voice evaluation method and computer storage medium
CN111508470A (en) Training method and device of speech synthesis model
CN112885336B (en) Training and recognition method and device of voice recognition system and electronic equipment
CN111653274B (en) Wake-up word recognition method, device and storage medium
CN114627863A (en) Speech recognition method and device based on artificial intelligence
CN112329476A (en) Text error correction method and device, equipment and storage medium
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
CN111710337A (en) Voice data processing method and device, computer readable medium and electronic equipment
CN113450757A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
CN113823257B (en) Speech synthesizer construction method, speech synthesis method and device
CN112951209A (en) Voice recognition method, device, equipment and computer readable storage medium
CN115098722B (en) Text and image matching method and device, electronic equipment and storage medium
CN113129863B (en) Voice duration prediction method, device, equipment and readable storage medium
CN113129863A (en) Voice time length prediction method, device, equipment and readable storage medium
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114048929A (en) Stock price data prediction method and device
CN113096646B (en) Audio recognition method and device, electronic equipment and storage medium
CN113129864B (en) Speech feature prediction method, device, equipment and readable storage medium
CN113129864A (en) Voice feature prediction method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant