CN110992926B - Speech synthesis method, apparatus, system and storage medium - Google Patents

Info

Publication number
CN110992926B
Authority
CN
China
Prior art keywords
sequence
determining
decoder
voice
phonon
Prior art date
Legal status
Active
Application number
CN201911366561.2A
Other languages
Chinese (zh)
Other versions
CN110992926A (en)
Inventor
黄志强
李秀林
Current Assignee
Beibei (Qingdao) Technology Co.,Ltd.
Original Assignee
Databaker Beijng Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Databaker Beijng Technology Co ltd
Priority to CN201911366561.2A
Publication of CN110992926A
Application granted
Publication of CN110992926B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 - Prosody rules derived from text; Stress or intonation
    • G10L 2013/105 - Duration
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Embodiments of the invention provide a speech synthesis method, apparatus, system, and storage medium. The method comprises: converting a text to be synthesized into an input sequence containing a plurality of phoneme elements; inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence, and outputting the acoustic feature sequence containing speech frame sets, wherein each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set; determining, based on the weights, the pronunciation duration of each phoneme element in each speech frame set; and determining the total pronunciation duration of each phoneme element in the acoustic feature sequence from its pronunciation duration in each speech frame set. This scheme yields natural, fluent speech while also providing the pronunciation duration of each phoneme element, effectively improving the user experience.

Description

Speech synthesis method, apparatus, system and storage medium
Technical Field
The present invention relates to the field of data processing, and more particularly, to a method, apparatus, system, and storage medium for speech synthesis.
Background
Speech synthesis technology converts text into speech so that a machine can produce sound, and it is an important link in human-computer interaction. To convert text into speech, the text is generally processed into a corresponding input sequence, which is then converted into speech by a speech synthesis model. With the rapid development of speech synthesis technology, the pronunciation duration of each phoneme in the speech has attracted increasing attention. For example, in applications such as virtual anchors and animation, the correct mouth shape of pronunciation can be rendered from the per-phoneme duration information, making the virtual character's expression more lifelike.
Some existing speech synthesis techniques produce speech of poor quality that sounds stiff, giving a poor user experience; others produce more fluent and natural speech but cannot provide phoneme duration information, which is inconvenient for users.
Disclosure of Invention
The present invention has been made in view of the above problems.
According to an aspect of the present invention, there is provided a speech synthesis method including:
converting a text to be synthesized into an input sequence containing a plurality of phoneme elements;
inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence, and outputting the acoustic feature sequence containing speech frame sets, wherein each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set;
determining, based on the weights, the pronunciation duration of each phoneme element in each speech frame set;
and determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to its pronunciation duration in each speech frame set.
Illustratively, the attention-based sequence-to-sequence neural network model comprises an encoder and a decoder, and the inputting the input sequence into the attention-based sequence-to-sequence neural network model to obtain the correlation matrix between the input sequence and the acoustic feature sequence comprises:
inputting the input sequence into the encoder for encoding to obtain a plurality of encoder hidden states respectively corresponding to the plurality of phonon elements;
determining a decoder implicit state;
determining the correlation matrix based on the plurality of encoder hidden states and decoder hidden states.
Illustratively, the determining the decoder implicit state comprises:
initializing the decoder to determine an initial implicit state of the decoder;
determining a current relevance vector in the relevance matrix based on the plurality of encoder hidden states and a decoder previous hidden state;
Determining a current weighted feature based on the plurality of encoder implicit states and a current correlation vector in the correlation matrix;
determining the current implicit state of the decoder based on the previous implicit state of the decoder and the current weighted feature;
the determining the correlation matrix based on the plurality of encoder hidden states and decoder hidden states comprises:
and determining the correlation matrix according to all the correlation vectors.
Illustratively, the determining a current relevance vector in the relevance matrix based on the plurality of encoder implicit states and the decoder previous implicit state comprises:
calculating the probability of each encoder hidden state relative to the previous hidden state of the decoder;
normalizing the probability to obtain a normalized probability;
determining the current relevance vector based on all normalized probabilities.
Illustratively, the calculating the probability of each encoder implicit state relative to the previous implicit state of the decoder comprises:
calculating a dot product value of each encoder implicit state and a previous implicit state of the decoder;
and taking the dot product value as the probability.
Illustratively, the determining the pronunciation duration occupied by each phoneme element in each voice frame set based on the weight value comprises:
Acquiring unit time length of each voice frame set;
and multiplying the unit time length by the weighted value to take the product as the pronunciation time length occupied by each phoneme element in each voice frame set.
Illustratively, the determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each speech frame set includes:
and adding the pronunciation duration occupied by each phonon element in all the voice frame sets according to an identifier to obtain the total pronunciation duration occupied by each phonon element in the acoustic feature sequence, wherein the identifier is used for identifying the position of the phonon element in the input sequence.
Illustratively, after determining the total pronunciation time length occupied by each phoneme element in the acoustic feature sequence according to the pronunciation time length occupied by each phoneme element in each voice frame set, the method further comprises:
determining the pronunciation time period of each phoneme element in the acoustic feature sequence based on the total pronunciation duration of that phoneme element in the acoustic feature sequence.
Illustratively, the acoustic feature sequence is any one of or a combination of a mel-frequency spectrum parameter sequence, a fundamental frequency parameter sequence and a bark spectrum parameter sequence.
According to still another aspect of the present invention, there is provided a video playback method including:
synthesizing the voice of the virtual anchor in the video by using the voice synthesis method and determining the pronunciation time period of each phononic element in the acoustic feature sequence of the voice;
and the virtual anchor emits the voice, and determines the playing mouth shape when the virtual anchor emits the voice according to the pronunciation time period of each phonon element in the acoustic feature sequence of the voice.
According to another aspect of the present invention, there is provided a method for playing a text, including:
synthesizing the voice of the text by using the voice synthesis method and determining the pronunciation time period of each phonon element in the acoustic feature sequence of the voice;
and playing the text by using the voice, and simultaneously displaying the text, wherein the display state of the corresponding characters in the text is controlled according to the pronunciation time period of each phonon element in the acoustic feature sequence of the voice so as to highlight the current playing position of the voice.
According to a fourth aspect of the present invention, there is provided a speech synthesis apparatus comprising:
the input sequence module is used for converting the text to be synthesized into an input sequence containing a plurality of phonon elements;
the output sequence module is used for inputting the input sequence into an attention-based sequence-to-sequence neural network model so as to obtain a correlation matrix between the input sequence and the acoustic feature sequence and output the acoustic feature sequence containing speech frame sets, wherein each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set;
a unit time length determining module, configured to determine, based on the weight value, a pronunciation time length occupied by each phoneme element in each voice frame set;
and the total duration determining module is used for determining the total pronunciation duration of each phonon element in the acoustic feature sequence according to the pronunciation duration of each phonon element in each voice frame set.
According to a fifth aspect of the present invention, there is provided a speech synthesis system comprising: a processor and a memory, wherein the memory has stored therein computer program instructions for performing the above-described speech synthesis method when executed by the processor.
According to a sixth aspect of the present invention, there is provided a storage medium having stored thereon program instructions for performing a speech synthesis method as described above when executed.
According to the technical scheme of the embodiment of the invention, the pronunciation duration in the output acoustic feature sequence is distributed to the phonon elements in the input sequence by utilizing the correlation matrix obtained by the attention-based sequence-to-sequence neural network model. Therefore, the technical scheme not only obtains natural and smooth voice, but also provides information of pronunciation duration of the phonon element, and effectively improves user experience.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent by describing in more detail embodiments of the present invention with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 shows a schematic flow diagram of a speech synthesis method according to one embodiment of the invention;
FIG. 2A illustrates a composition diagram of a correlation matrix according to one embodiment of the invention;
FIG. 2B is a diagram illustrating the calculation of phone durations based on a correlation matrix, according to one embodiment of the invention;
FIG. 3 shows a schematic flow diagram for obtaining a correlation matrix from the attention-based sequence-to-sequence neural network model, according to one embodiment of the present invention;
FIG. 4 shows a schematic flow chart diagram for determining the implicit state of a decoder according to one embodiment of the present invention;
FIG. 5 shows a schematic diagram of speech synthesis according to one embodiment of the invention;
FIG. 6 shows a schematic flow diagram for determining a correlation vector according to one embodiment of the invention;
FIG. 7 shows a schematic flow diagram of a video playback method according to one embodiment of the invention;
FIG. 8 shows a schematic flow diagram of a text playback method according to one embodiment of the invention;
FIG. 9 shows a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention; and
FIG. 10 shows a schematic block diagram of a speech synthesis system according to one embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments according to the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some of the embodiments of the present invention, and not all of the embodiments of the present invention, and it should be understood that the present invention is not limited by the exemplary embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the invention described herein without inventive step, shall fall within the scope of protection of the invention.
Currently, common speech synthesis approaches include the parametric method, the concatenation method, the sequence-to-sequence method, and the like. The parametric and concatenation methods require training corpora annotated with phoneme boundary information, so the phoneme duration information corresponding to the speech is easy to obtain, and both approaches have provided large-scale, stable service in industry for many years. However, the speech they generate sounds relatively stiff, and the user experience is poor. Compared with these two traditional methods, speech synthesized by an attention-based sequence-to-sequence speech synthesis model is noticeably smoother and more natural in quality, but at present it cannot provide phoneme duration information.
The speech synthesis scheme of the present invention is implemented using an attention-based sequence-to-sequence (Seq2Seq) neural network model. The attention mechanism selects, from many pieces of information, those most relevant to the current target task. A sequence-to-sequence neural network model transforms one sequence into another by means of recurrent neural networks. An attention-based sequence-to-sequence neural network model can not only translate an input sequence into a corresponding output sequence, but can also determine, for any output element of the output sequence, which input elements are more relevant, i.e., the degree of correlation between the input elements and the output elements.
FIG. 1 shows a schematic flow diagram of a speech synthesis method 100 according to one embodiment of the present invention. As shown in fig. 1, the speech synthesis method 100 includes the following steps.
S110: convert the text to be synthesized into an input sequence comprising a plurality of phoneme elements.
The text to be synthesized is the text on which speech synthesis is to be performed. In one example, taking the Chinese text "你好" ("hello") as the text to be synthesized, it may be converted into the toned phoneme input sequence [n, i3, h, ao3]. Each element of the input sequence is referred to as a phoneme element; in this example, "n", "i3", "h", and "ao3" are the phoneme elements of the input sequence. A toy conversion of this kind is sketched below.
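As an illustration of this conversion step, the following minimal Python sketch maps the example text to its toned phoneme sequence. The lexicon, the function name, and the assumption that each character maps to a fixed phoneme list are hypothetical simplifications of a real text-analysis front end, which would also handle polyphonic characters, text normalization, and prosody.

```python
# Hypothetical toy front end: map each character of the text to its toned
# phonemes and flatten the result into the input sequence. A stand-in for
# illustration only, not the patent's actual front end.
from typing import List

LEXICON = {
    "你": ["n", "i3"],   # ni3
    "好": ["h", "ao3"],  # hao3
}

def text_to_input_sequence(text: str) -> List[str]:
    sequence: List[str] = []
    for char in text:
        sequence.extend(LEXICON.get(char, []))  # unknown characters are skipped here
    return sequence

print(text_to_input_sequence("你好"))  # ['n', 'i3', 'h', 'ao3']
```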
S120: input the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and the acoustic feature sequence, and output the acoustic feature sequence containing speech frame sets. Each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set.
This step generates the synthesized speech corresponding to the text to be synthesized by means of an attention-based sequence-to-sequence neural network model. For this neural network model, the generated speech is represented as an acoustic feature sequence, which a player can convert directly into audible speech. The acoustic feature sequence may be any one of, or a combination of, a Mel-spectrum parameter sequence, a fundamental frequency parameter sequence, and a Bark-spectrum parameter sequence, and it may include one or more speech frame sets.
In one example, the acoustic feature sequence is a Mel-spectrum parameter sequence comprising speech frame sets. Each time the attention-based sequence-to-sequence neural network model is about to generate one speech frame set of the Mel-spectrum parameter sequence, it computes a correlation vector between all the phoneme elements of the input sequence and the current speech frame set. Each element of the correlation vector represents the weight of the corresponding phoneme element in the current speech frame set. All the correlation vectors together constitute the correlation matrix between the input sequence and the acoustic feature sequence.
FIG. 2A is a diagram illustrating the composition of a correlation matrix according to one embodiment of the invention. As shown in FIG. 2A, the first row of the table represents the acoustic feature sequence, comprising 6 speech frame sets, and the first column represents the input sequence, comprising 4 phoneme elements. The values between 0 and 1 in the table are the weights of the corresponding phoneme elements in the corresponding speech frame sets, and empty cells represent a weight of 0. For example, the value 0.7 in the sixth column of the second row indicates that the phoneme element "ao3" has a weight of 0.7 in the fifth speech frame set, and the value 0.7 in the second column of the fifth row indicates that the phoneme element "n" has a weight of 0.7 in the first speech frame set.
S130: determine, based on the weights, the pronunciation duration of each phoneme element in each speech frame set.
As indicated above, the weights reflect the proportional relationship between the phoneme elements of the input sequence and the speech frame sets of the acoustic feature sequence. For example, in the example shown in FIG. 2A, in the first speech frame set the phoneme element "n" has a weight of 0.7 and the phoneme element "i3" has a weight of 0.3.
When the speech frame sets are actually pronounced, the duration of a phoneme within each speech frame set is strongly correlated with its weight. In an embodiment of the present application, each phoneme element is therefore assigned a pronunciation duration in each speech frame set based on its weight.
In the example shown in FIG. 2A, the input sequence contains 4 phoneme elements [n, i3, h, ao3], and the output sequence is a Mel-spectrum parameter sequence containing 6 speech frame sets [y1, y2, y3, y4, y5, y6]. Each speech frame set comprises one or more speech frames, and each speech frame has a specific duration, so the pronunciation duration t0 of each speech frame set can be determined. On this basis, if the weight of the phoneme element "n" in the speech frame set y1 is a1, then the pronunciation duration of "n" in y1 is t1 = a1 × t0.
S140: determine the total pronunciation duration of each phoneme element in the acoustic feature sequence from its pronunciation duration in each speech frame set.
In the example shown in FIG. 2A, the acoustic feature sequence contains 6 speech frame sets. The pronunciation durations t1, t2, t3, t4, t5, and t6 of the phoneme element "n" in the individual speech frame sets are computed, and the total pronunciation duration of "n" in the acoustic feature sequence [y1, y2, y3, y4, y5, y6] is T(n) = t1 + t2 + t3 + t4 + t5 + t6.
FIG. 2B is a diagram illustrating the calculation of phoneme durations based on a correlation matrix according to one embodiment of the invention. As shown in FIG. 2B, given that the pronunciation duration of each speech frame set is 100 ms and the weight of the phoneme element "n" in the first speech frame set is 0.7 (see FIG. 2A), the pronunciation duration of "n" in the first speech frame set is 100 × 0.7 = 70 ms. According to FIG. 2B, the duration of "n" in the second speech frame set is 60 ms, its duration in the third speech frame set is 10 ms, and its durations in the remaining speech frame sets are empty (representing 0), so the total pronunciation duration of "n" in the whole acoustic feature sequence is 70 + 60 + 10 = 140 ms.
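The duration arithmetic of steps S130 and S140 can be sketched in a few lines of Python. Only the weights of the "n" row (0.7, 0.6, 0.1) and the 100 ms frame-set duration come from the example above; the remaining matrix entries are illustrative fill values, not the actual figures of FIG. 2A.

```python
# Sketch of S130/S140: per-frame-set durations are weight x frame-set duration,
# total durations are the row sums. Values other than the "n" row are made up.
import numpy as np

phonemes = ["n", "i3", "h", "ao3"]
correlation = np.array([              # rows: phoneme elements, columns: frame sets y1..y6
    [0.7, 0.6, 0.1, 0.0, 0.0, 0.0],   # "n"  (weights from the example)
    [0.3, 0.3, 0.3, 0.0, 0.0, 0.0],   # "i3" (illustrative)
    [0.0, 0.1, 0.6, 0.3, 0.0, 0.0],   # "h"  (illustrative)
    [0.0, 0.0, 0.0, 0.7, 1.0, 1.0],   # "ao3" (illustrative)
])
frame_set_ms = 100.0                  # assumed unit duration of one speech frame set

per_set_ms = correlation * frame_set_ms      # S130: duration of each phoneme in each frame set
total_ms = per_set_ms.sum(axis=1)            # S140: total duration of each phoneme

for phoneme, duration in zip(phonemes, total_ms):
    print(f"{phoneme}: {duration:.0f} ms")   # "n" -> 140 ms, matching the text
```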
In this technical scheme, speech synthesis is performed with an attention-based sequence-to-sequence neural network model, and the pronunciation durations in the output acoustic feature sequence are distributed among the phoneme elements of the input sequence according to the generated correlation matrix. This not only yields natural and fluent speech but also provides the pronunciation duration of each phoneme element, effectively improving the user experience.
In one example, the attention-based sequence-to-sequence neural network model includes an encoder and a decoder.
The encoder may be implemented using a neural network with specific parameters. For different input elements, the encoder will produce different implicit outputs, which are called encoder implicit states.
The decoder may also be implemented using a neural network with specific parameters. The decoder is connected with the encoder. The input elements of the decoder include all the output elements of the encoder. For different input elements of the decoder, the decoder will also produce different implicit outputs, referred to as decoder implicit states.
As previously described, in step S120 the input sequence is input into the attention-based sequence-to-sequence neural network model to obtain the correlation matrix between the input sequence and the acoustic feature sequence. In one embodiment of the invention, this model includes the encoder and decoder described above. FIG. 3 shows a schematic flow diagram for obtaining the correlation matrix from the attention-based sequence-to-sequence neural network model according to one embodiment of the present invention. As shown in FIG. 3, obtaining the correlation matrix includes the following steps:
S310: input the input sequence into the encoder for encoding, to obtain a plurality of encoder hidden states corresponding respectively to the plurality of phoneme elements. In this embodiment, the phoneme elements of the input sequence are encoded by the encoder, yielding one encoder hidden state for each phoneme element.
S320: determine the decoder hidden states. The decoder hidden states may include a preset initial hidden state. As mentioned above, the output elements of the encoder can also serve as inputs to the decoder in order to obtain the decoder hidden states.
S330: determine the correlation matrix based on the plurality of encoder hidden states and the decoder hidden states. The encoder hidden states and decoder hidden states carry the information of the phoneme elements and of the speech frame sets, so the correlation matrix, and in turn the phoneme durations, can be determined accurately.
FIG. 4 shows a schematic flow chart of the above step S320 of determining the decoder hidden states according to an embodiment of the present invention. As shown in FIG. 4, step S320 includes the following steps:
S321: initialize the decoder to determine its initial hidden state. Preset initialization parameters may be input to the decoder to obtain the initial decoder hidden state.
S322: determine the current correlation vector in the correlation matrix based on the plurality of encoder hidden states and the previous decoder hidden state.
S323: determine the current weighted feature based on the plurality of encoder hidden states and the current correlation vector in the correlation matrix.
S324: determine the current decoder hidden state based on the previous decoder hidden state and the current weighted feature.
The above-described process of determining the implicit state of the decoder is described in detail below with reference to fig. 5.
FIG. 5 shows a schematic diagram of speech synthesis according to one embodiment of the invention. As shown in fig. 5, for the attention-based sequence-to-sequence neural network model, x1, x2, x3, and x4 are input elements in the input sequence, respectively, and y1, y2, y3, y4, y5, and y6 are output elements in the output sequence, respectively. In one example, x1, x2, x3, x4 are phone elements, respectively, and y1, y2, y3, y4, y5, y6 are mel-spectrum frame sets in the acoustic feature sequence, respectively. Each speech frame set contains a specific number of speech frames, for example, each speech frame set contains 5 speech frames. Each speech frame has a specific duration. Thus, each set of speech frames also has a specific duration, e.g. 100ms, accordingly.
The attention-based sequence-to-sequence neural network model shown in fig. 5 includes an encoder and a decoder. The encoder implicit states generated by the encoder for the phoneme elements x1, x2, x3, x4 are represented in fig. 5 by h1, h2, h3, h4, respectively. In one example, the encoder implicit states h1, h2, h3, and h4 are in a linear relationship with the phoneme elements x1, x2, x3, and x4, respectively.
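A minimal sketch of such an encoder is shown below, under the assumption stated above that each encoder hidden state is a linear function of its phoneme element. The vocabulary, dimensions, and randomly initialized matrices are stand-ins for trained parameters; a real encoder would normally also add recurrent or convolutional context.

```python
# Toy linear encoder: phoneme id -> embedding -> linear projection = hidden state.
import numpy as np

rng = np.random.default_rng(0)
phoneme_vocab = {"n": 0, "i3": 1, "h": 2, "ao3": 3}   # assumed toy vocabulary
embed_dim, hidden_dim = 8, 16

embedding = rng.normal(size=(len(phoneme_vocab), embed_dim))  # stand-in for trained embeddings
W_enc = rng.normal(size=(embed_dim, hidden_dim))              # stand-in for a trained projection

def encode(input_sequence):
    """Return one encoder hidden state per phoneme element (h1..h4)."""
    ids = [phoneme_vocab[p] for p in input_sequence]
    return embedding[ids] @ W_enc                             # shape: (len(sequence), hidden_dim)

encoder_states = encode(["n", "i3", "h", "ao3"])
print(encoder_states.shape)                                   # (4, 16)
```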
Data processing in an attention-based sequence-to-sequence neural network model is temporally ordered. The Mel-spectrum frame sets in the acoustic feature sequence finally output by the model form a time series, and different output elements of the output sequence are generated at different moments. The decoder hidden states likewise follow this temporal order. In the embodiment of FIG. 5, the decoder hidden states are denoted H0, H1, H2, H3, H4, H5, and H6, where H0 is the initial decoder hidden state and may be preset as an all-zero vector. H1 through H6 are the subsequent decoder hidden states, corresponding respectively to the different speech frame sets.
After the input sequence has been input to the encoder and encoded into the encoder hidden states h1, h2, h3, and h4, the current correlation vector in the correlation matrix can be determined from these encoder hidden states and the previous decoder hidden state. The relationship between the encoder hidden states and the previous decoder hidden state determines the correlation between the two, from which the current correlation vector [a1, a2, a3, a4] is obtained.
The current weighted feature may be determined from the encoder hidden states h1, h2, h3, and h4 and the current correlation vector [a1, a2, a3, a4] in the correlation matrix; it may be a weighted sum of h1, h2, h3, and h4 with weights given by the current correlation vector. As shown in FIG. 5, a1, a2, a3, and a4 are the weights corresponding to h1, h2, h3, and h4, respectively. The weighted features at different moments use different values of the correlation vector.
Each of the decoder hidden states H1, H2, H3, H4, H5, and H6 corresponds to one weighted feature. The current decoder hidden state may be determined from the previous decoder hidden state and the current weighted feature.
In the embodiment shown in FIG. 5, the decoder is first initialized to determine its initial hidden state H0.
The first correlation vector in the correlation matrix is determined from the encoder hidden states h1, h2, h3, and h4 and the initial decoder hidden state H0. The first weighted feature C1 is determined from the encoder hidden states and the first correlation vector. The first decoder hidden state H1 is then determined jointly by H0 and C1.
The second correlation vector in the correlation matrix is determined from the encoder hidden states h1, h2, h3, and h4 and the first decoder hidden state H1. The second weighted feature C2 is determined from the encoder hidden states and the second correlation vector. The second decoder hidden state H2 is then determined jointly by H1 and C2.
By analogy, the i-th correlation vector in the correlation matrix is determined from the encoder hidden states h1, h2, h3, and h4 and the (i-1)-th decoder hidden state H(i-1). The i-th weighted feature Ci is determined from the encoder hidden states and the i-th correlation vector. The i-th decoder hidden state Hi is then determined jointly by the (i-1)-th decoder hidden state H(i-1) and the i-th weighted feature Ci.
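The recurrence just described can be sketched as a short numpy loop: at each step, dot-product scores between the encoder hidden states and the previous decoder hidden state are softmax-normalized into the correlation vector Ai, the weighted feature Ci is the Ai-weighted sum of the encoder states, and the next decoder state Hi is computed from H(i-1) and Ci. The tanh update and the random matrices are assumed stand-ins for the model's trained decoder, which the patent does not spell out.

```python
# Sketch of the attention loop: correlation vectors, weighted features, and
# decoder states. The state-update rule here is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(1)
hidden_dim, n_frame_sets = 16, 6
encoder_states = rng.normal(size=(4, hidden_dim))      # h1..h4 (e.g. from the encoder sketch)
W_h = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))  # stand-in decoder parameters
W_c = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H = np.zeros(hidden_dim)                 # H0: preset all-zero initial decoder state
correlation_vectors = []
for _ in range(n_frame_sets):            # one decoder step per speech frame set
    scores = encoder_states @ H          # dot products h_j . H(i-1)
    a = softmax(scores)                  # correlation vector Ai (weights over phonemes)
    C = a @ encoder_states               # weighted feature Ci
    H = np.tanh(W_h @ H + W_c @ C)       # Hi from H(i-1) and Ci (assumed update rule)
    correlation_vectors.append(a)

correlation_matrix = np.stack(correlation_vectors)  # 6 frame sets x 4 phoneme elements
print(correlation_matrix.shape)                     # (6, 4)
```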
After the decoder hidden states are determined in step S320, step S330 is performed to determine the correlation matrix. As described above, the current correlation vector in the correlation matrix is determined from the encoder hidden states and the previous decoder hidden state, and all the correlation vectors together form the correlation matrix.
The correlation matrix determined in this scheme accurately reflects the correlation between the output acoustic feature sequence and the input sequence, which in turn ensures that more accurate phoneme durations can be obtained.
FIG. 6 shows a schematic flow chart of step S322, determining the correlation vector, according to an embodiment of the invention. As shown in FIG. 6, determining the current correlation vector in step S322 specifically includes the following steps:
the probability of each encoder implicit state relative to the previous implicit state of the decoder is calculated S331.
Assume that the current correlation vector is the ith correlation vector Ai, i is a natural number. In the example of FIG. 5, the probability of each of the encoder hidden states H1, H2, H3, and H4 is calculated relative to the decoder previous hidden state H (i-1).
In one example, a dot product value of each encoder implicit state with a previous implicit state of the decoder is calculated to take the dot product value as the probability. That is, H1. H (i-1), H2. H (i-1), H3. H (i-1) and H4. H (i-1) were calculated, respectively.
And S332, normalizing the probability to obtain a normalized probability.
In one example, the probabilities are normalized using a softmax function to obtain softmax (H1. H (i-1)), softmax (H2. H (i-1)), softmax (H3. H (i-1)), and softmax (H4. H (i-1)), respectively.
And S333, determining the current relevance vector based on all the normalized probabilities.
In the above example, [ softmax (H1 · H (i-1)), softmax (H2 · H (i-1)), softmax (H3 · H (i-1)), softmax (H4 · H (i-1)) ] is the current correlation vector, on the basis of obtaining each normalized probability.
Once each current correlation vector has been determined, a more accurate correlation matrix can be obtained. The correlation vector represents the weight of each input phoneme element in the corresponding output speech frame set. Allocating duration to each input element according to these weights makes each input element's pronunciation duration correspond to its share of the output element's duration, so the phoneme durations in the acoustic feature sequence can be predicted accurately.
In one example, a unique identifier is added to each phoneme element based on its position in the input sequence. Since the position of each phoneme element in the input sequence is uniquely determined, the identifier derived from that position is also unique. In this case, the step of determining the total pronunciation duration of each phoneme element in the acoustic feature sequence from its pronunciation duration in each speech frame set comprises: adding up, according to the identifier, the pronunciation durations of each phoneme element in all the speech frame sets to obtain its total pronunciation duration in the acoustic feature sequence. For example, the phoneme element "n" shown in FIG. 2B occupies 70 ms in the first speech frame set, 60 ms in the second, and 10 ms in the third, so its total pronunciation duration in the acoustic feature sequence is 140 ms.
Because each phoneme element has a unique position identifier, and the durations of the corresponding phoneme element in all speech frame sets are added up according to that position identifier, the durations of the same phoneme appearing at different positions of the input sequence are not mistakenly merged, which ensures accurate pronunciation durations.
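The role of the position identifier can be shown with a small sketch in which a phoneme occurs twice in the input sequence. The sequence and the per-frame-set durations below are made-up illustrative values; the point is only that totals are accumulated per position, not per phoneme symbol.

```python
# Accumulate durations per position identifier so repeated phonemes stay separate.
input_sequence = ["n", "i3", "n", "i3"]          # hypothetical input with repeats
durations_ms = {                                 # position -> duration (ms) in each frame set
    0: [70, 30, 0, 0],
    1: [30, 60, 10, 0],
    2: [0, 10, 80, 20],
    3: [0, 0, 10, 80],
}

totals = {pos: sum(per_set) for pos, per_set in durations_ms.items()}
for pos, phoneme in enumerate(input_sequence):
    print(f"position {pos} ({phoneme}): {totals[pos]} ms")
# The two "n" occurrences keep separate totals (100 ms and 110 ms) instead of
# being merged into a single 210 ms value.
```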
In one example, the pronunciation time period of each phoneme element in the acoustic feature sequence may be determined from its pronunciation duration in the acoustic feature sequence. Since the phoneme elements of the input sequence are known, once the pronunciation duration of each phoneme element has been determined, its pronunciation time period in the acoustic feature sequence can be obtained. The start time of a phoneme element's pronunciation period is the moment corresponding to the sum of the durations of the phoneme elements preceding it, and the end time is that start time plus the phoneme element's own pronunciation duration. For example, in FIG. 2B, the start time of the phoneme element "h" is the end time of the first speech frame set plus the time for which "n" and "i3" are pronounced in the second speech frame set, and its end time is 10 ms later, i.e., the moment at which the second speech frame set ends. In other words, the pronunciation period of the phoneme element "h" is from 190 ms (100 + 60 + 30) to 200 ms.
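Deriving the time periods from the total durations is a running sum, as in the sketch below. The duration values are assumed for illustration and are not meant to reproduce the exact figures of FIG. 2B.

```python
# Each phoneme's period starts where the previous one ends.
from itertools import accumulate

phonemes = ["n", "i3", "h", "ao3"]
total_ms = [140, 90, 10, 270]                      # assumed total pronunciation durations

ends = list(accumulate(total_ms))                  # cumulative end times
starts = [end - dur for end, dur in zip(ends, total_ms)]

for phoneme, start, end in zip(phonemes, starts, ends):
    print(f"{phoneme}: {start} ms .. {end} ms")
```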
Determining the pronunciation time period of each phoneme from the phoneme durations allows a specific application object to be adjusted according to which phoneme elements are being pronounced in each time period, so that speech synthesis can be applied in a wider range of scenarios.
According to another aspect of the present invention, a video playing method is also provided. The video relates to a virtual anchor, such as an animated character or the like. Fig. 7 shows a schematic flow chart of a video playing method according to an embodiment of the invention. As shown in fig. 7, the video playing method includes:
and S710, synthesizing the voice of the virtual anchor in the video by using the voice synthesis method in the text, and determining the pronunciation time period of each phonon element in the acoustic feature sequence of the voice.
The virtual anchor is a simulated character processed by digital technology, and can replace the real character in the audio-visual media to interact with audiences. The voice uttered by the virtual anchor is synthesized by a voice synthesis method based on a text prepared in advance. It is understood that the phononic element is obtained by converting text corresponding to speech. The text may be converted into an input sequence comprising a plurality of phononic elements. Inputting the input sequence into a sequence based on an attention mechanism to a sequence neural network model, obtaining a correlation matrix between the input sequence and the acoustic characteristic sequence and outputting the acoustic characteristic sequence containing the voice frame set. And obtaining the voice of the virtual anchor based on the acoustic feature sequence. Each element in the correlation matrix represents a weight value of a corresponding phonon element in a corresponding voice frame set. The pronunciation duration occupied by each phonon element in each voice frame set can be determined based on the weight value. And determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set. Determining a pronunciation time period of each phononic element in the acoustic feature sequence based on a pronunciation time period of each phononic element in the acoustic feature sequence.
S720, the virtual anchor sends out the voice, and the playing mouth shape of the virtual anchor when sending out the voice is determined according to the pronunciation time period of each phonon element in the acoustic feature sequence of the voice. When the virtual main broadcast is played in the video and the virtual main broadcast sends out the voice, the playing mouth shape of the virtual main broadcast is determined according to the pronunciation time interval of the phonon element in the acoustic feature sequence of the voice, which is determined in the step S710. In other words, the speaking mouth shape corresponds to the pronunciation time interval of the phonon element.
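One way to drive the mouth shape from these time periods is a simple lookup: given the current playback time, find the phoneme whose period contains it and display its mouth shape. The phoneme-to-mouth-shape table and the period values below are hypothetical.

```python
# Pick the mouth shape for the phoneme being pronounced at playback time t_ms.
from typing import List, Tuple

periods: List[Tuple[str, int, int]] = [            # (phoneme, start_ms, end_ms), illustrative
    ("n", 0, 140), ("i3", 140, 230), ("h", 230, 240), ("ao3", 240, 510),
]
MOUTH_SHAPE = {"n": "closed", "i3": "spread", "h": "slightly open", "ao3": "open and rounded"}

def mouth_shape_at(t_ms: float) -> str:
    for phoneme, start, end in periods:
        if start <= t_ms < end:
            return MOUTH_SHAPE.get(phoneme, "neutral")
    return "neutral"                               # outside all periods: silence

print(mouth_shape_at(180))                         # "spread" (inside the "i3" period)
```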
This video playing method makes the virtual anchor in the video more vivid and realistic, improves the playback quality, and provides the audience with a better audio-visual experience.
According to the third aspect of the present invention, a text playing method is also provided. Fig. 8 shows a schematic flow diagram of a text playback method according to an embodiment of the invention. As shown in fig. 8, the text playing method includes:
and S810, synthesizing the voice of the text by using the voice synthesis method in the text and determining the pronunciation time period of each phonon element in the acoustic feature sequence of the voice. It will be appreciated that the text may be converted into an input sequence comprising a plurality of phononic elements. Inputting the input sequence into a sequence based on an attention mechanism to a sequence neural network model, obtaining a correlation matrix between the input sequence and the acoustic characteristic sequence and outputting the acoustic characteristic sequence containing the voice frame set. The synthesized speech is obtained based on the acoustic feature sequence. Each element in the correlation matrix represents a weight value of a corresponding phonon element in a corresponding voice frame set. The pronunciation duration occupied by each phonon element in each voice frame set can be determined based on the weight value. And determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set. Determining a pronunciation time period of each phononic element in the acoustic feature sequence based on a pronunciation time period of each phononic element in the acoustic feature sequence.
And S820, playing the text by using the voice and simultaneously displaying the text, wherein the display state of the corresponding characters in the text is controlled according to the pronunciation time period of each phonon element in the acoustic feature sequence of the voice so as to highlight the current playing position of the voice.
In the text playing process, the text can be played with voice while being displayed. And when the text is displayed, controlling the display state of the corresponding characters in the text according to the pronunciation time period of each phonon element in the acoustic feature sequence of the voice so as to highlight the current playing position of the voice. For example, text whose speech is currently being uttered is highlighted.
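Highlighting works the same way as the mouth-shape lookup: the phoneme periods are grouped per character, and the character whose period contains the current playback time is highlighted. The character-to-period alignment below is an assumed illustration.

```python
# Find the character to highlight at playback time t_ms.
char_periods = [              # (character, start_ms, end_ms), illustrative alignment
    ("你", 0, 230),            # covers the "n" and "i3" periods
    ("好", 230, 510),          # covers the "h" and "ao3" periods
]

def char_to_highlight(t_ms: float):
    for char, start, end in char_periods:
        if start <= t_ms < end:
            return char
    return None                                    # nothing highlighted outside speech

print(char_to_highlight(300))                      # "好"
```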
With this text playing method, the display state of the corresponding characters tracks the speech playback progress, so the user can see where the currently played speech is located in the text. This makes the playback progress easy to follow and improves the user's reading and listening experience.
According to the fourth aspect of the invention, a speech synthesis apparatus is also provided. Fig. 9 shows a schematic block diagram of a speech synthesis apparatus according to an embodiment of the present invention.
As shown in fig. 9, the apparatus 900 includes an input sequence module 910, an output sequence module 920, a unit duration determination module 930, and a total duration determination module 940.
The various modules may perform the various steps/functions of the speech synthesis method described above, respectively. Only the main functions of the components of the device 900 are described below, and details that have been described above are omitted.
An input sequence module 910, configured to convert a text to be synthesized into an input sequence including a plurality of phononic elements;
an output sequence module 920, configured to input the input sequence into an attention-based sequence-to-sequence neural network model, so as to obtain a correlation matrix between the input sequence and the acoustic feature sequence and output the acoustic feature sequence containing speech frame sets, where each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set;
a unit duration determining module 930, configured to determine, based on the weight value, a pronunciation duration occupied by each phoneme element in each voice frame set;
and a total duration determining module 940, configured to determine the total duration of pronunciation of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set.
According to a fifth aspect of the present invention, there is also provided a speech synthesis system comprising: a processor and a memory, wherein the memory has stored therein computer program instructions for performing the above-described speech synthesis method when executed by the processor.
FIG. 10 shows a schematic block diagram for a speech synthesis system 1000 according to one embodiment of the present invention. As shown in fig. 10, the system 1000 includes an input device 1010, a storage device 1020, a processor 1030, and an output device 1040.
The input device 1010 is used for receiving an operation instruction input by a user and collecting data. The input device 1010 may include one or more of a keyboard, a mouse, a microphone, a touch screen, an image capture device, and the like.
The storage 1020 stores computer program instructions for implementing the corresponding steps in the speech synthesis method according to an embodiment of the present invention.
The processor 1030 is configured to run the computer program instructions stored in the storage 1020 to perform the corresponding steps of the speech synthesis method according to the embodiment of the present invention, and is configured to implement the input sequence module 910, the output sequence module 920, the unit duration determination module 930, and the total duration determination module 940 in the speech synthesis apparatus according to the embodiment of the present invention.
The output device 1040 is used to output various information (e.g., images and/or sounds) to an external (e.g., user), and may include one or more of a display, a speaker, etc.
In one embodiment, the computer program instructions, when executed by the processor 1030, cause the system 1000 to perform the steps of:
converting a text to be synthesized into an input sequence containing a plurality of phononic elements;
inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence and output the acoustic feature sequence containing speech frame sets, wherein each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set;
determining the pronunciation duration occupied by each phoneme element in each voice frame set based on the weight value;
and determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set.
Furthermore, according to a sixth aspect of the present invention, there is also provided a storage medium on which program instructions are stored, which when executed by a computer or a processor cause the computer or the processor to perform the respective steps of the above-mentioned speech synthesis method according to an embodiment of the present invention, and to implement the respective modules in the above-mentioned speech synthesis apparatus according to an embodiment of the present invention or the respective modules for the above-mentioned speech synthesis system. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium may be any combination of one or more computer-readable storage media.
In one embodiment, the computer program instructions, when executed by a computer or processor, cause the computer or processor to perform the steps of:
converting a text to be synthesized into an input sequence containing a plurality of phononic elements;
inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence and output the acoustic feature sequence containing speech frame sets, wherein each element in the correlation matrix represents the weight of the corresponding phoneme element in the corresponding speech frame set;
determining the pronunciation duration occupied by each phoneme element in each voice frame set based on the weight value;
and determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set.
The detailed implementation and technical effects of the above speech synthesis apparatus, speech synthesis system and storage medium can be understood by those skilled in the art from reading the above description of the speech synthesis method. For brevity, no further description is provided herein.
Although the illustrative embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the foregoing illustrative embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as set forth in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
Various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules used in a speech synthesis apparatus according to embodiments of the present invention. The present invention may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any ordering; these words may be interpreted as names.
The above description covers only specific embodiments of the present invention, and the protection scope of the present invention is not limited thereto. Any changes or substitutions that a person skilled in the art can readily conceive of within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (9)

1. A speech synthesis method, comprising:
converting a text to be synthesized into an input sequence containing a plurality of phoneme elements;
inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence and to output the acoustic feature sequence containing voice frame sets, wherein each element in the correlation matrix represents a weight value occupied by a corresponding phoneme element in a corresponding voice frame set;
determining the pronunciation duration occupied by each phoneme element in each voice frame set based on the weight values;
determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set;
wherein the attention-based sequence-to-sequence neural network model includes an encoder and a decoder,
and the inputting the input sequence into the attention-based sequence-to-sequence neural network model to obtain the correlation matrix between the input sequence and the acoustic feature sequence comprises:
inputting the input sequence into the encoder for encoding to obtain a plurality of encoder hidden states respectively corresponding to the plurality of phoneme elements;
determining a decoder hidden state;
determining the correlation matrix based on the plurality of encoder hidden states and the decoder hidden state.
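For illustration only, not part of the claimed subject matter: a minimal Python/NumPy sketch of the duration computation recited in claim 1, under the assumptions that the correlation matrix has one row per voice frame and one column per phoneme element and that every frame spans a hypothetical fixed shift of 12.5 ms; all function and variable names here are invented for this sketch.

import numpy as np

def total_phoneme_durations(correlation, frame_shift_ms=12.5):
    # correlation: (num_frames, num_phonemes); row t holds the weight values of
    # all phoneme elements in voice frame t and sums to (approximately) 1.
    per_frame_ms = correlation * frame_shift_ms   # duration share of each phoneme element in each frame
    return per_frame_ms.sum(axis=0)               # total pronunciation duration per phoneme element (ms)

# toy correlation matrix: 4 voice frames, 2 phoneme elements
A = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
print(total_phoneme_durations(A))                 # -> [25. 25.]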
2. The speech synthesis method of claim 1, wherein
the determining the decoder hidden state comprises:
initializing the decoder to determine an initial hidden state of the decoder;
determining a current correlation vector in the correlation matrix based on the plurality of encoder hidden states and a previous hidden state of the decoder;
determining a current weighted feature based on the plurality of encoder hidden states and the current correlation vector;
determining a current hidden state of the decoder based on the previous hidden state of the decoder and the current weighted feature;
and the determining the correlation matrix based on the plurality of encoder hidden states and the decoder hidden state comprises:
determining the correlation matrix according to all of the correlation vectors.
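For illustration only: a structural sketch, in the same Python/NumPy style, of the decoder loop of claim 2. The attention function of claim 3 and the state-update rule are passed in as placeholders; nothing here fixes the actual network used by the patent.

import numpy as np

def run_decoder(encoder_states, init_state, num_frames, attention_fn, update_fn):
    # encoder_states: (num_phonemes, dim); init_state: the initial decoder hidden state.
    state = init_state
    correlation_rows = []
    for _ in range(num_frames):
        alpha = attention_fn(encoder_states, state)   # current correlation vector (see claim 3)
        context = alpha @ encoder_states              # current weighted feature
        state = update_fn(state, context)             # current decoder hidden state
        correlation_rows.append(alpha)
    return np.stack(correlation_rows)                 # correlation matrix: (num_frames, num_phonemes)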
3. The speech synthesis method of claim 2, wherein the determining the current correlation vector in the correlation matrix based on the plurality of encoder hidden states and the previous hidden state of the decoder comprises:
calculating a probability of each encoder hidden state relative to the previous hidden state of the decoder;
normalizing the probabilities to obtain normalized probabilities;
determining the current correlation vector based on all of the normalized probabilities.
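For illustration only: one possible form of the scoring and normalization step of claim 3, using a dot-product score followed by a softmax. The claim does not prescribe a particular scoring function, so this choice is an assumption of the sketch.

import numpy as np

def attention_probabilities(encoder_states, prev_decoder_state):
    scores = encoder_states @ prev_decoder_state   # one score per encoder hidden state
    scores = scores - scores.max()                 # shift for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()           # normalized probabilities (current correlation vector)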
4. The speech synthesis method according to claim 1, wherein after determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set, the method further comprises:
determining a pronunciation time period of each phoneme element in the acoustic feature sequence based on the total pronunciation duration of each phoneme element in the acoustic feature sequence.
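For illustration only: a small sketch of claim 4, assuming the phoneme elements are pronounced back to back starting at time zero, so each pronunciation time period follows from accumulating the total pronunciation durations.

import numpy as np

def pronunciation_periods(total_durations_ms):
    ends = np.cumsum(total_durations_ms)
    starts = np.concatenate(([0.0], ends[:-1]))
    return list(zip(starts.tolist(), ends.tolist()))   # (start_ms, end_ms) per phoneme element

print(pronunciation_periods(np.array([120.0, 80.0, 200.0])))
# [(0.0, 120.0), (120.0, 200.0), (200.0, 400.0)]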
5. A video playback method, comprising:
synthesizing the voice of a virtual anchor in a video using the speech synthesis method of claim 4, and determining the pronunciation time period of each phoneme element in the acoustic feature sequence of the voice;
and causing the virtual anchor to utter the voice, and determining the broadcasting mouth shape of the virtual anchor while uttering the voice according to the pronunciation time period of each phoneme element in the acoustic feature sequence of the voice.
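For illustration only: one way the mouth shape of claim 5 could be chosen at playback time from the per-phoneme pronunciation time periods; the phoneme-to-viseme lookup table is a hypothetical input, not something defined by the patent.

def mouth_shape_at(t_ms, phonemes, periods, phoneme_to_viseme):
    # phonemes: phoneme elements in order; periods: matching (start_ms, end_ms) time periods.
    for phoneme, (start, end) in zip(phonemes, periods):
        if start <= t_ms < end:
            return phoneme_to_viseme.get(phoneme, "neutral")
    return "neutral"   # silence, or t_ms lies outside the utterance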
6. A method for playing text, comprising:
synthesizing speech of the text using the speech synthesis method of claim 4, and determining the pronunciation time period of each phoneme element in the acoustic feature sequence of the speech;
and playing the speech while displaying the text, wherein the display state of the corresponding characters in the text is controlled according to the pronunciation time period of each phoneme element in the acoustic feature sequence of the speech so as to highlight the current playing position of the speech.
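For illustration only: a sketch of the display-state control of claim 6, assuming the phoneme time periods have already been merged into per-character (start, end) periods; names are invented for the sketch.

def character_display_states(t_ms, char_periods):
    # Classify every character of the text at playback time t_ms so the player can
    # highlight the current playing position of the speech.
    states = []
    for start, end in char_periods:
        if end <= t_ms:
            states.append("spoken")
        elif start <= t_ms < end:
            states.append("current")   # character to highlight
        else:
            states.append("pending")
    return states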
7. A speech synthesis apparatus, comprising:
the input sequence module is used for converting the text to be synthesized into an input sequence containing a plurality of phoneme elements;
the output sequence module is used for inputting the input sequence into an attention-based sequence-to-sequence neural network model to obtain a correlation matrix between the input sequence and an acoustic feature sequence and to output the acoustic feature sequence containing voice frame sets, wherein each element in the correlation matrix represents a weight value occupied by a corresponding phoneme element in a corresponding voice frame set, and the attention-based sequence-to-sequence neural network model includes an encoder and a decoder; the output sequence module is specifically configured to input the input sequence into the encoder for encoding to obtain a plurality of encoder hidden states respectively corresponding to the plurality of phoneme elements, to determine a decoder hidden state, and to determine the correlation matrix based on the plurality of encoder hidden states and the decoder hidden state;
the unit duration determining module is used for determining, based on the weight values, the pronunciation duration occupied by each phoneme element in each voice frame set;
and the total duration determining module is used for determining the total pronunciation duration of each phoneme element in the acoustic feature sequence according to the pronunciation duration of each phoneme element in each voice frame set.
8. A speech synthesis system, comprising: a processor and a memory, wherein the memory stores computer program instructions which, when executed by the processor, perform the speech synthesis method according to any one of claims 1 to 4.
9. A storage medium on which program instructions are stored, characterized in that the program instructions are adapted to perform a speech synthesis method according to any one of claims 1 to 4 when executed.
CN201911366561.2A 2019-12-26 2019-12-26 Speech synthesis method, apparatus, system and storage medium Active CN110992926B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911366561.2A CN110992926B (en) 2019-12-26 2019-12-26 Speech synthesis method, apparatus, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911366561.2A CN110992926B (en) 2019-12-26 2019-12-26 Speech synthesis method, apparatus, system and storage medium

Publications (2)

Publication Number Publication Date
CN110992926A (en) 2020-04-10
CN110992926B (en) 2022-06-10

Family

ID=70077193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911366561.2A Active CN110992926B (en) 2019-12-26 2019-12-26 Speech synthesis method, apparatus, system and storage medium

Country Status (1)

Country Link
CN (1) CN110992926B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111653266B (en) * 2020-04-26 2023-09-05 北京大米科技有限公司 Speech synthesis method, device, storage medium and electronic equipment
CN111583900B (en) * 2020-04-27 2022-01-07 北京字节跳动网络技术有限公司 Song synthesis method and device, readable medium and electronic equipment
CN112397048B (en) * 2020-12-10 2023-07-14 标贝(北京)科技有限公司 Speech synthesis pronunciation stability evaluation method, device and system and storage medium
CN115206284B (en) * 2022-09-19 2022-11-22 腾讯科技(深圳)有限公司 Model training method, device, server and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101364309A (en) * 2008-10-09 2009-02-11 中国科学院计算技术研究所 Cartoon generating method for mouth shape of source virtual characters
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN107705782A (en) * 2017-09-29 2018-02-16 百度在线网络技术(北京)有限公司 Method and apparatus for determining phoneme pronunciation duration
CN108763190A (en) * 2018-04-12 2018-11-06 平安科技(深圳)有限公司 Voice-based mouth shape cartoon synthesizer, method and readable storage medium storing program for executing
CN109036375A (en) * 2018-07-25 2018-12-18 腾讯科技(深圳)有限公司 Phoneme synthesizing method, model training method, device and computer equipment
CN109767755A (en) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 A kind of phoneme synthesizing method and system
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10872598B2 (en) * 2017-02-24 2020-12-22 Baidu Usa Llc Systems and methods for real-time neural text-to-speech
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
US10872596B2 (en) * 2017-10-19 2020-12-22 Baidu Usa Llc Systems and methods for parallel wave generation in end-to-end text-to-speech

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Durian: Duration informed attention network for multimodal synthesis";Chengzhu Yu等;《arXiv:1909.01700v2 [cs.CL]》;20190905;全文 *
"Pre-Alignment Guided Attention for Improving Training Efficiency and Model Stability";Zhu Xiaolian等;《IEEE ACCESS》;20190101;全文 *

Also Published As

Publication number Publication date
CN110992926A (en) 2020-04-10

Similar Documents

Publication Publication Date Title
CN110992926B (en) Speech synthesis method, apparatus, system and storage medium
US12033611B2 (en) Generating expressive speech audio from text data
JP4296231B2 (en) Voice quality editing apparatus and voice quality editing method
CN111489424A (en) Virtual character expression generation method, control method, device and terminal equipment
CN110867177A (en) Voice playing system with selectable timbre, playing method thereof and readable recording medium
KR101153736B1 (en) Apparatus and method for generating the vocal organs animation
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN111105779B (en) Text playing method and device for mobile client
CN113314094A (en) Lip-shaped model training method and device and voice animation synthesis method and device
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN113111812A (en) Mouth action driving model training method and assembly
CN112735371A (en) Method and device for generating speaker video based on text information
CN111916054A (en) Lip-based voice generation method, device and system and storage medium
CN113724683A (en) Audio generation method, computer device, and computer-readable storage medium
CN114255737B (en) Voice generation method and device and electronic equipment
CN112908308A (en) Audio processing method, device, equipment and medium
CN116363268A (en) Method and device for generating mouth shape animation, electronic equipment and storage medium
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN114944146A (en) Voice synthesis method and device
CN116912375A (en) Facial animation generation method and device, electronic equipment and storage medium
CN115529500A (en) Method and device for generating dynamic image
CN113963092B (en) Audio and video fitting associated computing method, device, medium and equipment
CN113990295A (en) Video generation method and device
CN109697985A (en) Audio signal processing method, device and terminal
JP6163454B2 (en) Speech synthesis apparatus, method and program thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 1201, Building B, Phase 1, Innovation Park, No. 1 Keyuan Weiyi Road, Laoshan District, Qingdao City, Shandong Province, 266101

Patentee after: Beibei (Qingdao) Technology Co.,Ltd.

Address before: Zhongguancun Dongsheng Science Park, No. 66, xixiaokou Road, Haidian District, Beijing 100080 A203a, 2f, building B-2, Northern Territory (Dongsheng area)

Patentee before: DATABAKER (BEIJNG) TECHNOLOGY Co.,Ltd.