CN115497450A - Speech synthesis method and device - Google Patents

Speech synthesis method and device

Info

Publication number
CN115497450A
Authority
CN
China
Prior art keywords
text
data
feature data
sub
pitch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211123920.3A
Other languages
Chinese (zh)
Inventor
江明奇
王瑞
陈云琳
叶顺平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wenwen Intelligent Information Technology Co ltd
Original Assignee
Wenwen Intelligent Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wenwen Intelligent Information Technology Co ltd
Priority to CN202211123920.3A
Publication of CN115497450A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/027 Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/0335 Pitch control
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a speech synthesis method and apparatus. The method comprises the following steps: acquiring text data and pitch data corresponding to the text data, wherein the text data comprises a plurality of texts and the pitch data represents the pitch corresponding to each text; encoding the text data and the pitch data to obtain text feature data and pitch feature data; performing duration prediction on the plurality of texts to obtain a predicted duration of each text, wherein the predicted duration represents the number of frames corresponding to the text; merging the text feature data and the pitch feature data to obtain first feature data; performing duration expansion on the first feature data using the predicted duration of each text to obtain second feature data; and decoding the second feature data to obtain speech spectrum parameters, and inputting the speech spectrum parameters into a pre-trained neural network vocoder to generate the target speech.

Description

Speech synthesis method and device
Technical Field
The present application relates to the field of data processing, deep learning, and speech synthesis technologies, and in particular, to a speech synthesis method and apparatus.
Background
There are currently two main approaches to synthesizing a song from text. The first synthesizes the song with a hidden Markov model; because this model does not take pitch features into account, the synthesized singing voice lacks emotion. The second synthesizes the song with an existing neural network model; because too few feature dimensions are considered, the model tends to overfit, and the synthesized song is not sufficiently stable or realistic.
Disclosure of Invention
The present application provides a speech synthesis method and apparatus to solve the above technical problems.
To this end, an aspect of the embodiments of the present application provides a speech synthesis method, where the method includes:
acquiring text data and pitch data corresponding to the text data, wherein the text data comprises a plurality of texts, and the pitch data represents the pitch corresponding to each text;
coding the text data and the pitch data to obtain text characteristic data and pitch characteristic data;
carrying out duration prediction on the plurality of texts to obtain the predicted duration of each text, wherein the predicted duration represents the number of frames corresponding to the text;
combining the text characteristic data and the pitch characteristic data to obtain first characteristic data;
carrying out time length expansion on the first characteristic data by utilizing the predicted time length of each text to obtain second characteristic data;
and decoding the second characteristic data to obtain a voice spectrum parameter, inputting the voice spectrum parameter into a pre-trained neural network vocoder, and generating and obtaining target voice.
Wherein the merging the text feature data and the pitch feature data includes:
determining text sub-feature data and pitch sub-feature data corresponding to each text from the text feature data and the pitch feature data;
determining first sub-feature data of the text according to the text sub-feature data and the pitch sub-feature data of the text;
and combining all the first sub-feature data according to the sequence of the corresponding texts in the text data to obtain first feature data.
Wherein, the performing duration expansion on the first feature data by using the predicted duration of each text to obtain second feature data comprises:
determining first sub-feature data corresponding to each text from the first feature data;
expanding the first sub-feature data corresponding to the text to the frame number indicated by the prediction duration according to the prediction duration of the text to obtain second sub-feature data;
and merging the second sub-feature data according to the sequence of the corresponding texts in the text data to obtain second feature data.
After the obtaining of the second feature data, the method further includes:
determining second sub-feature data corresponding to each text from the second feature data;
traversing all the second sub-feature data;
determining the similarity of the current second sub-feature data and each of other second sub-feature data, and determining the weight according to the similarity;
adjusting the current second sub-feature data according to the current second sub-feature data, all other second sub-feature data and the weight of each other second sub-feature data of the current second sub-feature data;
and after traversing all the second sub-feature data, combining all the adjusted second sub-feature data according to the sequence of the corresponding texts in the text data to obtain second feature data.
Another aspect of the embodiments of the present application provides a method for training a speech synthesis model, where the method includes:
acquiring a plurality of text sample data and a label voice spectrum parameter corresponding to each text sample data;
inputting the text sample data into an initial speech synthesis model to obtain predicted speech spectrum parameters of the text sample data;
determining a loss value of the text sample data according to the label voice spectrum parameter and the prediction voice spectrum parameter of the text sample data;
and optimizing the initial voice synthesis model according to the loss values of the plurality of text sample data to obtain a voice synthesis model.
Another aspect of the embodiments of the present application provides a speech synthesis apparatus, including:
a first acquisition module, configured to acquire text data and pitch data corresponding to the text data, wherein the text data comprises a plurality of texts, and the pitch data represents a pitch corresponding to each text;
the encoding module is used for encoding the text data and the pitch data to obtain text characteristic data and pitch characteristic data;
the first deep learning module is used for predicting the duration of the texts to obtain the predicted duration of each text, and the predicted duration represents the number of frames corresponding to the text;
the calculation module is used for combining the text characteristic data and the pitch characteristic data to obtain first characteristic data;
the first deep learning module is further configured to perform duration expansion on the first feature data by using the predicted duration of each text to obtain second feature data;
and the decoding module is used for decoding the second characteristic data to obtain a voice spectrum parameter, inputting the voice spectrum parameter into a pre-trained neural network vocoder, and generating and obtaining target voice.
The calculation module is further configured to determine text sub-feature data and pitch sub-feature data corresponding to each text from the text feature data and the pitch feature data;
the calculation module is further used for determining first sub-feature data of the text according to the text sub-feature data and the pitch sub-feature data of the text;
the calculation module is further configured to combine all the first sub-feature data according to the sequence of the corresponding texts in the text data, so as to obtain first feature data.
The first deep learning module is further configured to determine first sub-feature data corresponding to each text from the first feature data;
the first deep learning module is further configured to expand the first sub-feature data corresponding to the text to the number of frames indicated by the predicted duration according to the predicted duration of the text to obtain second sub-feature data;
the first deep learning module is further configured to merge the second sub-feature data according to the sequence of the corresponding texts in the text data to obtain second feature data.
The first deep learning module is further configured to determine second sub-feature data corresponding to each text from the second feature data;
the first deep learning module is further used for traversing all the second sub-feature data;
the first deep learning module is further configured to determine similarity between the current second sub-feature data and each of the other second sub-feature data, and determine a weight according to the similarity;
the first deep learning module is further configured to adjust the current second sub-feature data according to the current second sub-feature data, all other second sub-feature data, and the weight of each other second sub-feature data of the current second sub-feature data;
and the first deep learning module is further configured to merge all the adjusted second sub-feature data according to the sequence of the corresponding texts in the text data after traversing all the second sub-feature data, so as to obtain second feature data.
Another aspect of the embodiments of the present application provides a speech synthesis model training apparatus, where the apparatus includes:
the second acquisition module is used for acquiring a plurality of text sample data and a label voice spectrum parameter corresponding to each text sample data;
the second deep learning module is used for inputting the text sample data into an initial speech synthesis model to obtain predicted speech spectrum parameters of the text sample data;
the second deep learning module is further configured to determine a loss value of the text sample data according to the tag speech spectrum parameter and the predicted speech spectrum parameter of the text sample data;
and the second deep learning module is also used for optimizing the initial voice synthesis model according to the loss values of the plurality of text sample data to obtain a voice synthesis model.
In the above scheme, text feature data and pitch feature data are obtained by encoding the text data and the pitch data corresponding to the text data. Merging the text feature data and the pitch feature data allows the resulting first feature data to contain both at the same time. The predicted duration of each text is then used to expand the corresponding portion of the first feature data, so that the second feature data obtained after duration expansion simultaneously carries text, pitch, and rhythm features. The target speech obtained by decoding the second feature data and feeding the result into the neural network vocoder therefore fully reflects the text, pitch, and rhythm features, so the target speech carries emotion and has higher stability and realism.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 shows a flow diagram of a speech synthesis method according to an embodiment of the present application;
FIG. 2 shows a flow diagram of a method of merging text feature data and pitch feature data according to another embodiment of the present application;
FIG. 3 shows a flow diagram of a method of time duration augmentation of first feature data according to another embodiment of the present application;
fig. 4 shows a flow chart of a method of adapting second characteristic data according to another embodiment of the present application;
FIG. 5 shows a flow diagram of a method of speech synthesis model training according to an embodiment of the present application;
FIG. 6 shows a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 7 shows a schematic structural diagram of a speech synthesis model training apparatus according to another embodiment of the present application.
Detailed Description
In order to make the objects, features and advantages of the present application more obvious and understandable, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application.
In order to give the synthesized song emotion and to improve the stability and realism of the song, an embodiment of the present application provides a speech synthesis method. As shown in fig. 1, the method includes:
Step 101: acquire text data and pitch data corresponding to the text data, where the text data comprises a plurality of texts and the pitch data represents the pitch corresponding to each text.
Step 102: encode the text data and the pitch data to obtain text feature data and pitch feature data.
Step 103: perform duration prediction on the plurality of texts to obtain the predicted duration of each text, where the predicted duration represents the number of frames corresponding to the text.
For example, suppose the text data is "今天" ("today"), which contains two texts, the characters "今" and "天". Duration prediction on the two texts gives "今" a predicted duration of 5 speech frames and "天" a predicted duration of 4 speech frames.
As another example, consider the text "我" ("me"), which contains two phonemes, "w" and "o". Its predicted duration is [2, 2]: the predicted duration can further characterize the number of frames for each phoneme in the text, here 2 speech frames for the phoneme "w" and 2 speech frames for the phoneme "o".
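As a small illustration of the data involved in this step, the sketch below (Python; the dictionary layout and values are assumptions for illustration, not taken from the patent) shows predicted durations stored per text and, optionally, per phoneme:

```python
# Hypothetical layout for predicted durations: frames per text, plus an optional
# per-phoneme breakdown, matching the two examples above.
text_durations = {"今": 5, "天": 4}      # "今天" ("today"): 5 and 4 speech frames
phoneme_durations = {"我": [2, 2]}       # "我" ("me"): phonemes "w" and "o", 2 frames each

# Total number of frames the expanded features will span for "今天".
total_frames = sum(text_durations.values())
print(total_frames)  # 9
```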
Step 104: merge the text feature data and the pitch feature data to obtain first feature data.
Merging the text feature data and the pitch feature data allows the resulting first feature data to contain both the text features and the pitch features at the same time.
Step 105: perform duration expansion on the first feature data using the predicted duration of each text to obtain second feature data.
The predicted duration of each text is used to expand the portion of the first feature data corresponding to that text to the number of frames indicated by the predicted duration.
Step 106: decode the second feature data to obtain speech spectrum parameters, and input the speech spectrum parameters into a pre-trained neural network vocoder to generate the target speech.
The speech spectrum parameters obtained by decoding the second feature data cannot be played directly; they must be processed by a trained neural network vocoder to obtain target speech that can be played.
In this embodiment, the speech spectrum parameters are processed by a pre-trained neural vocoder. The neural vocoder uses a convolutional neural network to model the waveform directly at the level of sampling points, so the generated speech is more realistic.
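The patent does not disclose the vocoder's architecture, so the following is only a minimal sketch of the idea described above: a convolutional network that maps spectrum frames directly to waveform sample points. The layer sizes, the hop length, and the use of PyTorch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyConvVocoder(nn.Module):
    """Toy stand-in for a pre-trained neural vocoder: upsamples spectrum
    parameters to waveform sample points with 1-D convolutions."""

    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose1d(n_mels, 32, kernel_size=hop, stride=hop),  # frames -> samples
            nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=7, padding=3),                   # smooth the waveform
            nn.Tanh(),                                                    # keep samples in [-1, 1]
        )

    def forward(self, spectrum: torch.Tensor) -> torch.Tensor:
        # spectrum: (batch, n_mels, frames) -> waveform: (batch, frames * hop)
        return self.net(spectrum).squeeze(1)

vocoder = ToyConvVocoder()
waveform = vocoder(torch.randn(1, 80, 9))   # 9 spectrum frames
print(waveform.shape)                        # torch.Size([1, 2304])
```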
Encoding the text data and its corresponding pitch data yields text feature data and pitch feature data. Merging the text feature data and the pitch feature data produces first feature data that contains both at the same time. Using the predicted duration of each text to expand the corresponding portion of the first feature data produces second feature data that simultaneously carries text, pitch, and rhythm features. The target speech obtained by decoding the second feature data and feeding the result into the neural network vocoder therefore fully reflects the text, pitch, and rhythm features, so the target speech carries emotion and has higher stability and realism.
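To make the flow of steps 101 to 106 concrete, here is a minimal end-to-end sketch. It is not the patent's model: the module types, dimensions, the element-wise-addition merge, and the use of PyTorch are all assumptions chosen only to illustrate the order of operations.

```python
import torch
import torch.nn as nn

class TinySynthesisSketch(nn.Module):
    def __init__(self, vocab_size=100, n_pitches=128, dim=64, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, dim)    # step 102: text feature data
        self.pitch_encoder = nn.Embedding(n_pitches, dim)    # step 102: pitch feature data
        self.duration_predictor = nn.Linear(dim, 1)          # step 103: frames per text
        self.decoder = nn.Linear(dim, n_mels)                # step 106: spectrum parameters

    def forward(self, text_ids, pitch_ids):
        text_feat = self.text_encoder(text_ids)              # (num_texts, dim)
        pitch_feat = self.pitch_encoder(pitch_ids)           # (num_texts, dim)
        first_feat = text_feat + pitch_feat                  # step 104: merge (addition variant)
        frames = torch.clamp(self.duration_predictor(text_feat).squeeze(-1).round().long(), min=1)
        second_feat = torch.repeat_interleave(first_feat, frames, dim=0)  # step 105: expansion
        return self.decoder(second_feat)                     # (total_frames, n_mels)

model = TinySynthesisSketch()
spectrum = model(torch.tensor([3, 7]), torch.tensor([60, 64]))
print(spectrum.shape)  # (total predicted frames, 80); this would then go to a neural vocoder
```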
As shown in fig. 2, this embodiment further provides a method for merging the text feature data and the pitch feature data, including:
Step 201: determine the text sub-feature data and the pitch sub-feature data corresponding to each text from the text feature data and the pitch feature data.
The text sub-feature data corresponding to each text are determined from the text feature data.
For example, suppose the text data is "今天天气真好" ("the weather is really nice today"). The text feature data obtained by encoding this text data is 48-dimensional, and each text in the text data corresponds to 8 dimensions: the text "今" corresponds to dimensions 1 to 8 of the text feature data, "天" to dimensions 9 to 16, the second "天" to dimensions 17 to 24, "气" to dimensions 25 to 32, "真" to dimensions 33 to 40, and "好" to dimensions 41 to 48. The portion of the text feature data corresponding to each text is taken as that text's text sub-feature data.
The pitch sub-feature data corresponding to each text are determined from the pitch feature data.
For example, suppose the text data is "你好" ("hello"). The pitch feature data obtained by encoding the pitch data corresponding to this text data is 20-dimensional, and each text corresponds to 10 dimensions: the pitch data corresponding to "你" corresponds to dimensions 1 to 10 of the pitch feature data, and the pitch data corresponding to "好" corresponds to dimensions 11 to 20. The portion of the pitch feature data corresponding to each text is taken as that text's pitch sub-feature data.
Step 202: determine the first sub-feature data of each text from the text sub-feature data and the pitch sub-feature data of that text.
In this embodiment, there are two methods for determining the first sub-feature data of a text from its text sub-feature data and pitch sub-feature data.
The first method: add the text sub-feature data and the pitch sub-feature data of the text to obtain the first sub-feature data of the text.
For example, if the text sub-feature data of a certain text is "21322148" and its pitch sub-feature data is "12856231", adding the two yields the first sub-feature data "34178379" of the text.
The second method: concatenate the text sub-feature data and the pitch sub-feature data of the text to obtain the first sub-feature data of the text.
For example, if the text sub-feature data of a certain text is the 8-digit value "21322148" and its pitch sub-feature data is the 8-digit value "12856231", concatenating the two yields the 16-digit first sub-feature data "2132214812856231" of the text.
In other embodiments, any other method capable of determining the first sub-feature data of a text from its text sub-feature data and pitch sub-feature data may also be used.
Step 203: merge all the first sub-feature data according to the order of the corresponding texts in the text data to obtain the first feature data.
For example, suppose a certain text data contains three texts in order. The first sub-feature data of the first text is "21322148", that of the second text is "12856231", and that of the third text is "14257855". Merging the three first sub-feature data in the order of the corresponding texts in the text data yields the first feature data "213221481285623114257855".
The first feature data is determined from both the text feature data and the pitch feature data, so it contains both text and pitch features. The target speech obtained after the first feature data is processed in the subsequent steps therefore fully incorporates pitch features, which gives the synthesized target speech emotion.
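A small sketch of the two merging variants described above (element-wise addition versus concatenation) and of the final in-order merge, using NumPy arrays as stand-ins for the per-text sub-feature data; the dimensions, values, and example texts are illustrative assumptions.

```python
import numpy as np

# Assumed 8-dimensional sub-feature data for the two texts of "你好".
text_sub = {"你": np.array([2., 1., 3., 2., 2., 1., 4., 8.]),
            "好": np.array([1., 2., 8., 5., 6., 2., 3., 1.])}
pitch_sub = {"你": np.array([1., 2., 8., 5., 6., 2., 3., 1.]),
             "好": np.array([1., 4., 2., 5., 7., 8., 5., 5.])}
order = ["你", "好"]  # order of the texts in the text data

# Method 1: element-wise addition keeps the original dimensionality.
first_sub_add = [text_sub[t] + pitch_sub[t] for t in order]

# Method 2: concatenation doubles the dimensionality.
first_sub_cat = [np.concatenate([text_sub[t], pitch_sub[t]]) for t in order]

# Step 203: merge the per-text first sub-feature data in text order.
first_feature_add = np.concatenate(first_sub_add)
first_feature_cat = np.concatenate(first_sub_cat)
print(first_feature_add.shape, first_feature_cat.shape)  # (16,) (32,)
```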
As shown in fig. 3, this embodiment further provides a method for performing duration expansion on the first feature data, including:
Step 301: determine the first sub-feature data corresponding to each text from the first feature data.
In this embodiment, the first sub-feature data corresponding to each text obtained in step 202 may be used directly, or the corresponding first sub-feature data may be determined from the first feature data according to the text.
Step 302: according to the predicted duration of the text, expand the first sub-feature data corresponding to the text to the number of frames indicated by the predicted duration to obtain second sub-feature data.
For example, suppose the text data is "你好" ("hello"), with the texts "你" and "好". The predicted duration of "你" is 3 frames and its first sub-feature data is "21322148"; the predicted duration of "好" is 2 frames and its first sub-feature data is "12856231".
According to the predicted duration of "你", its first sub-feature data "21322148" is copied 3 times and the copies are combined, giving the second sub-feature data "213221482132214821322148" of that text.
According to the predicted duration of "好", its first sub-feature data "12856231" is copied 2 times and the copies are combined, giving the second sub-feature data "1285623112856231" of that text.
Step 303: merge the second sub-feature data according to the order of the corresponding texts in the text data to obtain the second feature data.
For example, for the text data "你好", the second sub-feature data of "你" is "213221482132214821322148" and the second sub-feature data of "好" is "1285623112856231". Merging the two second sub-feature data in the order of the corresponding texts in the text data yields the second feature data "2132214821322148213221481285623112856231".
The first sub-feature data corresponding to each text is expanded in time using that text's predicted duration, so the length of the corresponding portion of the second feature data better matches the length of the real pronunciation, which makes the synthesized target speech more stable and realistic.
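The duration expansion above amounts to repeating each text's first sub-feature data for the number of frames given by its predicted duration and concatenating the results in text order. A minimal NumPy sketch (vectors, frame counts, and texts are illustrative assumptions) follows:

```python
import numpy as np

first_sub = {"你": np.array([2., 1., 3., 2., 2., 1., 4., 8.]),
             "好": np.array([1., 2., 8., 5., 6., 2., 3., 1.])}
predicted_frames = {"你": 3, "好": 2}   # predicted durations in frames
order = ["你", "好"]

# Step 302: repeat each text's first sub-feature data once per predicted frame.
second_sub = {t: np.tile(first_sub[t], (predicted_frames[t], 1)) for t in order}

# Step 303: merge the expanded pieces in text order to obtain the second feature data.
second_feature = np.concatenate([second_sub[t] for t in order], axis=0)
print(second_feature.shape)  # (5, 8): 3 frames for "你" plus 2 frames for "好"
```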
As shown in fig. 4, this embodiment further provides a method for adjusting the second feature data, including:
Step 401: determine the second sub-feature data corresponding to each text from the second feature data.
In this embodiment, the second sub-feature data corresponding to each text obtained in step 302 may be used directly, or the corresponding second sub-feature data may be determined from the second feature data according to the text.
Step 402: traverse all the second sub-feature data.
Step 403: determine the similarity between the current second sub-feature data and each of the other second sub-feature data, and determine weights from the similarities.
Step 404: adjust the current second sub-feature data according to the current second sub-feature data, all the other second sub-feature data, and the weight between the current second sub-feature data and each of the other second sub-feature data.
Step 405: after all the second sub-feature data have been traversed, merge all the adjusted second sub-feature data according to the order of the corresponding texts in the text data to obtain the adjusted second feature data.
In this embodiment, a trained attention layer is used to compute the similarity between the current second sub-feature data and each of the other second sub-feature data, and the weights are determined from these similarities. The current second sub-feature data is then adjusted according to the current second sub-feature data, all the other second sub-feature data, and the corresponding weights.
Because the current second sub-feature data is adjusted from all the second sub-feature data and their weights, the contextual relations between the current second sub-feature data and the other second sub-feature data are fully fused, which further improves the realism of the finally synthesized target speech.
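The adjustment described above can be realized, for example, with a self-attention-style computation: pairwise similarities are turned into softmax weights, and each vector is replaced by the weighted sum of all vectors. The sketch below is an assumption (one vector per text, dot-product similarity, inclusion of the current vector itself), not the patent's exact attention layer.

```python
import numpy as np

def adjust_second_sub_features(sub_feats: np.ndarray) -> np.ndarray:
    """sub_feats: (num_texts, dim) array, one second sub-feature vector per text."""
    d = sub_feats.shape[1]
    scores = sub_feats @ sub_feats.T / np.sqrt(d)       # pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)         # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)       # similarities -> weights
    return weights @ sub_feats                          # weighted combination of all vectors

sub_feats = np.random.rand(3, 8)                        # e.g. 3 texts, 8-dimensional features
adjusted = adjust_second_sub_features(sub_feats)
print(adjusted.shape)                                   # (3, 8)
```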
As shown in fig. 5, this embodiment further provides a method for training a speech synthesis model, including:
Step 501: acquire a plurality of text sample data and the label speech spectrum parameters corresponding to each text sample data.
Step 502: input the text sample data into an initial speech synthesis model to obtain the predicted speech spectrum parameters of the text sample data.
Step 503: determine a loss value of the text sample data from the label speech spectrum parameters and the predicted speech spectrum parameters of the text sample data.
Step 504: optimize the initial speech synthesis model according to the loss values of the plurality of text sample data to obtain the speech synthesis model.
The parameters of the initial speech synthesis model are optimized based on the loss values. If the model has not converged, the optimized model is used to predict on all the text sample data again, the loss values are recomputed, and a new round of parameter optimization is performed on the optimized model, repeating until the model converges.
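A minimal training-loop sketch of the procedure above. The optimizer, the mean-squared-error loss between predicted and label speech spectrum parameters, the convergence test, and the dummy model are all illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_speech_synthesis_model(model, samples, labels, max_epochs=100, tol=1e-4):
    """samples: list of model inputs; labels: list of label speech spectrum parameters."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.MSELoss()
    prev_total = float("inf")
    for _ in range(max_epochs):
        total = 0.0
        for x, y in zip(samples, labels):
            optimizer.zero_grad()
            loss = criterion(model(x), y)   # loss between predicted and label spectrum parameters
            loss.backward()
            optimizer.step()                # optimize the model based on the loss value
            total += loss.item()
        if abs(prev_total - total) < tol:   # crude convergence check
            break
        prev_total = total
    return model

# Usage with a dummy stand-in model and random data.
dummy_model = nn.Linear(16, 80)
xs = [torch.randn(5, 16) for _ in range(4)]
ys = [torch.randn(5, 80) for _ in range(4)]
train_speech_synthesis_model(dummy_model, xs, ys, max_epochs=5)
```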
In order to implement the above-mentioned speech synthesis method, as shown in fig. 6, an example of the present application further provides a speech synthesis apparatus, including:
the first acquisition module 10 is configured to acquire text data and pitch data corresponding to the text data, where the text data includes multiple texts, and the pitch data represents a pitch corresponding to each text;
the encoding module 20 is configured to encode the text data and the pitch data to obtain text characteristic data and pitch characteristic data;
the first deep learning module 30 is configured to perform duration prediction on the multiple texts to obtain a predicted duration of each text, where the predicted duration represents a frame number corresponding to the text;
the calculation module 40 is configured to combine the text feature data and the pitch feature data to obtain first feature data;
the first deep learning module 30 is further configured to perform duration expansion on the first feature data by using the predicted duration of each text to obtain second feature data;
and a decoding module 50, configured to decode the second feature data to obtain a voice spectrum parameter, and input the voice spectrum parameter into a pre-trained neural network vocoder to generate and obtain a target voice.
The calculating module 40 is further configured to determine text sub-feature data and pitch sub-feature data corresponding to each text from the text feature data and pitch feature data;
the calculating module 40 is further configured to determine first sub-feature data of the text according to the text sub-feature data and the pitch sub-feature data of the text;
the calculating module 40 is further configured to combine all the first sub-feature data according to the sequence of the corresponding text in the text data, so as to obtain first feature data.
The first deep learning module 30 is further configured to determine, from the first feature data, first sub-feature data corresponding to each text;
the first deep learning module 30 is further configured to expand the first sub-feature data corresponding to the text to the number of frames indicated by the predicted duration according to the predicted duration of the text, so as to obtain second sub-feature data;
the first deep learning module 30 is further configured to merge the second sub-feature data according to the sequence of the corresponding text in the text data, so as to obtain second feature data.
The first deep learning module 30 is further configured to determine second sub-feature data corresponding to each text from the second feature data;
the first deep learning module 30 is further configured to traverse all the second sub-feature data;
the first deep learning module 30 is further configured to determine similarity between the current second sub-feature data and each of the other second sub-feature data, and determine a weight according to the similarity;
the first deep learning module 30 is further configured to adjust the current second sub-feature data according to the current second sub-feature data, all other second sub-feature data, and the weight of each other second sub-feature data of the current second sub-feature data;
the first deep learning module 30 is further configured to, after traversing all the second sub-feature data, merge all the adjusted second sub-feature data according to the order of the corresponding text in the text data to obtain second feature data.
In order to implement the above-mentioned speech synthesis model training method, as shown in fig. 7, an example of the present application further provides a speech synthesis model training apparatus, including:
the second acquisition module 60 is configured to acquire a plurality of text sample data and a tag voice spectrum parameter corresponding to each text sample data;
a second deep learning module 70, configured to input the text sample data into an initial speech synthesis model to obtain predicted speech spectrum parameters of the text sample data;
the second deep learning module 70 is further configured to determine a loss value of the text sample data according to the tag speech spectrum parameter and the predicted speech spectrum parameter of the text sample data;
the second deep learning module 70 is further configured to optimize the initial speech synthesis model according to the loss values of the plurality of text sample data, so as to obtain a speech synthesis model.
In an example, an embodiment of the present application further provides a mobile terminal, which includes at least one processor and a memory communicatively connected to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are configured to perform the speech synthesis method of any of the embodiments of fig. 1 to 4 and the speech synthesis model training method of the embodiment of fig. 5.
In addition, an embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions, where the computer-executable instructions are used to execute the speech synthesis method flow described in any one of the above embodiments in fig. 1 to 4 and the speech synthesis model training method flow described in the above embodiment in fig. 5.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Moreover, various embodiments or examples and features of various embodiments or examples described in this specification can be combined and combined by one skilled in the art without being mutually inconsistent.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the embodiments of the present application, "a plurality" means two or more unless specifically defined otherwise.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if", as used herein, may be interpreted as "when", "upon", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if (a stated condition or event) is detected" may be interpreted as "upon determining", "in response to determining", "upon detecting (the stated condition or event)", or "in response to detecting (the stated condition or event)", depending on the context.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a Processor (Processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring text data and pitch data corresponding to the text data, wherein the text data comprises a plurality of texts, and the pitch data represents the pitch corresponding to each text;
coding the text data and the pitch data to obtain text characteristic data and pitch characteristic data;
carrying out time length prediction on the plurality of texts to obtain the predicted time length of each text, wherein the predicted time length represents the frame number corresponding to the text;
combining the text characteristic data and the pitch characteristic data to obtain first characteristic data;
carrying out time length expansion on the first characteristic data by utilizing the predicted time length of each text to obtain second characteristic data;
and decoding the second characteristic data to obtain a voice spectrum parameter, and inputting the voice spectrum parameter into a pre-trained neural network vocoder to generate and obtain a target voice.
2. The speech synthesis method of claim 1, wherein the combining the text feature data and the pitch feature data comprises:
determining text sub-feature data and pitch sub-feature data corresponding to each text from the text feature data and the pitch feature data;
determining first sub-feature data of the text according to the text sub-feature data and the pitch sub-feature data of the text;
and combining all the first sub-characteristic data according to the sequence of the corresponding texts in the text data to obtain first characteristic data.
3. The speech synthesis method of claim 1, wherein the time-length-extending the first feature data by using the predicted time-length of each text to obtain second feature data comprises:
determining first sub-feature data corresponding to each text from the first feature data;
expanding the first sub-feature data corresponding to the text to the frame number indicated by the prediction duration according to the prediction duration of the text to obtain second sub-feature data;
and merging the second sub-feature data according to the sequence of the corresponding texts in the text data to obtain second feature data.
4. The speech synthesis method of claim 1, wherein after obtaining the second feature data, the method further comprises:
determining second sub-feature data corresponding to each text from the second feature data;
traversing all the second sub-feature data;
determining the similarity between the current second sub-feature data and each of other second sub-feature data, and determining the weight according to the similarity;
adjusting the current second sub-feature data according to the current second sub-feature data, all other second sub-feature data and the weight of each other second sub-feature data of the current second sub-feature data;
and after traversing all the second sub-feature data, combining all the adjusted second sub-feature data according to the sequence of the corresponding texts in the text data to obtain second feature data.
5. A method for training a speech synthesis model, comprising:
acquiring a plurality of text sample data and a label voice spectrum parameter corresponding to each text sample data;
inputting the text sample data into an initial speech synthesis model to obtain predicted speech spectrum parameters of the text sample data;
determining a loss value of the text sample data according to the label voice spectrum parameter and the prediction voice spectrum parameter of the text sample data;
and optimizing the initial voice synthesis model according to the loss values of the plurality of text sample data to obtain a voice synthesis model.
6. A speech synthesis apparatus, characterized in that the apparatus comprises:
a first acquisition module, configured to acquire text data and pitch data corresponding to the text data, wherein the text data comprises a plurality of texts, and the pitch data represents a pitch corresponding to each text;
the encoding module is used for encoding the text data and the pitch data to obtain text characteristic data and pitch characteristic data;
the first deep learning module is used for carrying out duration prediction on the plurality of texts to obtain the predicted duration of each text, and the predicted duration represents the frame number corresponding to the text;
the calculation module is used for combining the text characteristic data and the pitch characteristic data to obtain first characteristic data;
the first deep learning module is further configured to perform duration expansion on the first feature data by using the predicted duration of each text to obtain second feature data;
and the decoding module is used for decoding the second characteristic data to obtain a voice spectrum parameter, inputting the voice spectrum parameter into a pre-trained neural network vocoder, and generating and obtaining target voice.
7. The speech synthesis apparatus of claim 6, comprising:
the calculation module is further used for determining text sub-feature data and pitch sub-feature data corresponding to each text from the text feature data and the pitch feature data;
the calculation module is further used for determining first sub-feature data of the text according to the text sub-feature data and the pitch sub-feature data of the text;
the calculation module is further configured to combine all the first sub-feature data according to the sequence of the corresponding text in the text data, so as to obtain first feature data.
8. The speech synthesis apparatus according to claim 6, comprising:
the first deep learning module is further configured to determine first sub-feature data corresponding to each text from the first feature data;
the first deep learning module is further configured to expand the first sub-feature data corresponding to the text to the number of frames indicated by the predicted duration according to the predicted duration of the text, so as to obtain second sub-feature data;
the first deep learning module is further configured to merge the second sub-feature data according to the sequence of the corresponding text in the text data to obtain second feature data.
9. The speech synthesis apparatus according to claim 6, comprising:
the first deep learning module is further configured to determine second sub-feature data corresponding to each text from the second feature data;
the first deep learning module is further used for traversing all the second sub-feature data;
the first deep learning module is further configured to determine similarity between the current second sub-feature data and each of the other second sub-feature data, and determine a weight according to the similarity;
the first deep learning module is further configured to adjust the current second sub-feature data according to the current second sub-feature data, all other second sub-feature data, and the weight of each other second sub-feature data of the current second sub-feature data;
and the first deep learning module is further configured to merge all the adjusted second sub-feature data according to the sequence of the corresponding texts in the text data after all the second sub-feature data are traversed, so as to obtain second feature data.
10. A speech synthesis model training apparatus, comprising:
the second acquisition module is used for acquiring a plurality of text sample data and a label voice spectrum parameter corresponding to each text sample data;
the second deep learning module is used for inputting the text sample data into an initial speech synthesis model to obtain predicted speech spectrum parameters of the text sample data;
the second deep learning module is further configured to determine a loss value of the text sample data according to the tag speech spectrum parameter and the predicted speech spectrum parameter of the text sample data;
and the second deep learning module is further used for optimizing the initial speech synthesis model according to the loss values of the plurality of text sample data to obtain a speech synthesis model.
CN202211123920.3A 2022-09-15 2022-09-15 Speech synthesis method and device Pending CN115497450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211123920.3A CN115497450A (en) 2022-09-15 2022-09-15 Speech synthesis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211123920.3A CN115497450A (en) 2022-09-15 2022-09-15 Speech synthesis method and device

Publications (1)

Publication Number Publication Date
CN115497450A 2022-12-20

Family

ID=84468487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211123920.3A Pending CN115497450A (en) 2022-09-15 2022-09-15 Speech synthesis method and device

Country Status (1)

Country Link
CN (1) CN115497450A (en)

Similar Documents

Publication Publication Date Title
CN108573693B (en) Text-to-speech system and method, and storage medium therefor
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
CN106971709B (en) Statistical parameter model establishing method and device and voice synthesis method and device
US20230081659A1 (en) Cross-speaker style transfer speech synthesis
JP4328698B2 (en) Fragment set creation method and apparatus
CN109979432B (en) Dialect translation method and device
US20140236597A1 (en) System and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
CN111508470A (en) Training method and device of speech synthesis model
KR20230127293A (en) Information synthesis method and device, electronic device and computer readable storage medium
CN112802444B (en) Speech synthesis method, device, equipment and storage medium
EP3376497A1 (en) Text-to-speech synthesis using an autoencoder
CN113327574B (en) Speech synthesis method, device, computer equipment and storage medium
WO2023245389A1 (en) Song generation method, apparatus, electronic device, and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN113129864A (en) Voice feature prediction method, device, equipment and readable storage medium
CN112750445A (en) Voice conversion method, device and system and storage medium
WO2014176489A2 (en) A system and method for supervised creation of personalized speech samples libraries in real-time for text-to-speech synthesis
JPH05197398A (en) Method for expressing assembly of acoustic units in compact mode and chaining text-speech synthesizer system
CN113948062B (en) Data conversion method and computer storage medium
CN115497450A (en) Speech synthesis method and device
EP1632933A1 (en) Device, method, and program for selecting voice data
CN113327578A (en) Acoustic model training method and device, terminal device and storage medium
JP2020129099A (en) Estimation device, estimation method and program
CN115457931B (en) Speech synthesis method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination