CN118262698A - Speech synthesis method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN118262698A
Authority
CN
China
Prior art keywords
acoustic feature
voice
speech
data
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410416936.6A
Other languages
Chinese (zh)
Inventor
朱叶凡
叶金辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Original Assignee
Zhejiang Geely Holding Group Co Ltd
Geely Automobile Research Institute Ningbo Co Ltd
Filing date
2024-04-08
Publication date
2024-06-28
Application filed by Zhejiang Geely Holding Group Co Ltd and Geely Automobile Research Institute Ningbo Co Ltd
Publication of CN118262698A
Legal status: Pending (current)


Abstract

The present specification provides a speech synthesis method, an apparatus, an electronic device, and a storage medium. The method comprises the following steps: generating an acoustic feature sequence corresponding to a target text to be synthesized, where the acoustic feature sequence characterizes the acoustic features of the target text; fusing the acoustic feature sequence with a phoneme sequence of the target text extracted by a speech synthesis model to obtain a fusion sequence; and obtaining target synthesized speech generated by the speech synthesis model according to a reference speech and the fusion sequence.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, a device, an electronic apparatus, and a storage medium.
Background
Speech synthesis, also known as text-to-speech (TTS), is a technique that converts arbitrary input text into corresponding speech. The synthesized speech can imitate characteristics of human speech such as timbre, pitch, and speaking rate, allowing computers to interact with humans more intelligently. In daily life, speech synthesis technology is widely applied in devices such as voice assistants, automatic voice answering systems, and self-service terminals. With this technique, a device can answer a user's questions in a human-like voice.
A speech synthesis model is a specific algorithm or framework for implementing speech synthesis. Taking the VALL-E model as an example, when text information and a reference speech are input into the VALL-E model, the model outputs synthesized speech that expresses the text information, with voice characteristics similar to those of the reference speech.
However, current speech synthesis models still suffer from pronunciation errors, such as poor clarity of the synthesized speech and poor differentiation between similarly pronounced words.
Disclosure of Invention
In order to overcome the problems in the related art, the present specification provides a method, an apparatus, an electronic device, and a storage medium for speech synthesis.
According to a first aspect of embodiments of the present specification, there is provided a speech synthesis method, the method comprising:
Generating an acoustic feature sequence corresponding to a target text to be synthesized, wherein the acoustic feature sequence is used for representing acoustic features of the target text;
fusing the acoustic feature sequence and the phoneme sequence of the target text extracted by the voice synthesis model to obtain a fused sequence;
and obtaining target synthesized voice generated by the voice synthesis model according to the reference voice and the fusion sequence.
According to a second aspect of embodiments of the present specification, there is provided a speech synthesis apparatus comprising:
An acoustic feature sequence generating unit, configured to generate an acoustic feature sequence corresponding to a target text to be synthesized, where the acoustic feature sequence is used to characterize acoustic features of the target text;
the fusion sequence generating unit is used for fusing the acoustic feature sequence and the phoneme sequence of the target text extracted by the voice synthesis model to form a fusion sequence;
and the target synthetic voice generating unit is used for acquiring target synthetic voice generated by the voice synthesis model according to the reference voice and the fusion sequence.
According to a third aspect of embodiments of the present specification, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the method according to the first aspect when the program is executed.
According to a fourth aspect of embodiments of the present description, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method according to the first aspect.
The technical scheme provided by the embodiment of the specification can comprise the following beneficial effects:
In the embodiments of the present disclosure, an acoustic feature sequence corresponding to a target text to be synthesized is generated to characterize the acoustic features of the target text, and the acoustic feature sequence is fused with a phoneme sequence of the target text extracted by a speech synthesis model to form a fusion sequence, so that the fusion sequence contains acoustic features characterizing the target text; the speech synthesis model can then generate target synthesized speech based on the fusion sequence and a reference speech.
Therefore, in the process of generating the target synthesized speech from the target text, the method supplements the target text with corresponding acoustic feature information. Specifically, instead of directly inputting the target text and the reference speech into the speech synthesis model, an acoustic feature sequence characterizing the acoustic features of the target text is generated first and then fused with a phoneme sequence of the target text extracted by the speech synthesis model to form a fusion sequence, so that the speech synthesis model generates the corresponding target synthesized speech based on the fusion sequence and the reference speech. In this way, the speech synthesis model can convert the target text into the target synthesized speech more accurately, thereby reducing pronunciation errors in the target synthesized speech.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the specification and together with the description, serve to explain the principles of the specification.
Fig. 1 is a schematic diagram of a speech synthesis model according to an exemplary embodiment of the present disclosure.
Fig. 2 is a flow chart illustrating a method of speech synthesis according to an exemplary embodiment of the present disclosure.
FIG. 3 is a model framework diagram of an acoustic feature predictor shown in the present specification in accordance with an exemplary embodiment.
Fig. 4 is a schematic flow diagram of generating a fusion sequence according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic diagram of the structure of a speech synthesis model in the inference phase according to an exemplary embodiment of the present specification.
FIG. 6 is a flow chart illustrating a training method for an acoustic feature predictor according to an exemplary embodiment of the present disclosure.
FIG. 7 is a schematic diagram of a speech synthesis model of a training phase shown in this specification according to an exemplary embodiment.
FIG. 8 is a schematic diagram of another training method for an acoustic feature predictor, according to an exemplary embodiment of the present description.
Fig. 9 is a schematic flow chart of acquiring a training data set according to an exemplary embodiment.
Fig. 10 is a schematic flow chart of another method for acquiring a training data set according to an exemplary embodiment.
Fig. 11 is a block diagram of an electronic device according to an exemplary embodiment of the present description.
Fig. 12 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment of the present specification.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the present specification. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present description as detailed in the accompanying claims.
The terminology used in the description presented herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the description. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in this specification to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of the present specification. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
Next, embodiments of the present specification will be described in detail.
As shown in fig. 1, fig. 1 shows the model architecture of a speech synthesis model 1 disclosed in the related art. The speech synthesis model 1 may be a VALL-E model; of course, other token-based speech synthesis models are also possible, such as the Coqui TTS model or the BASE TTS model. The model framework of the speech synthesis model 1 may include, but is not limited to, a phoneme conversion module (Phoneme Conversion) 102, an audio encoder (Audio Codec Encoder) 105, an acoustic model 106, and an audio decoder (Audio Codec Decoder) 107. If the speech synthesis model is the VALL-E model, the acoustic model 106 is specifically a neural codec language model (Neural Codec Language Model). Specifically, the phoneme conversion module 102 converts the target text 101 to be synthesized into a phoneme sequence 103 and inputs it into the acoustic model 106; the audio encoder 105 converts the reference speech 104 into a first discrete acoustic sequence and inputs it into the acoustic model 106; and the second discrete acoustic sequence output by the acoustic model 106 is input into the audio decoder 107, by which the target synthesized speech 108 is generated.
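To make the data flow concrete, the following is a minimal sketch of this related-art inference pipeline; the module names are illustrative placeholders mirroring the reference numerals in fig. 1 and are not an actual VALL-E implementation.

```python
# Illustrative sketch of the related-art pipeline of fig. 1; all module names are placeholders.
def synthesize(target_text, reference_speech, model):
    phoneme_seq = model.phoneme_conversion(target_text)               # module 102 -> phoneme sequence 103
    first_discrete_seq = model.audio_codec_encoder(reference_speech)  # module 105 -> first discrete acoustic sequence
    second_discrete_seq = model.acoustic_model(phoneme_seq, first_discrete_seq)  # module 106
    return model.audio_codec_decoder(second_discrete_seq)             # module 107 -> target synthesized speech 108
```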
In the speech synthesis model 1 of the related art, the processing within the dashed box 10 is as follows: the phoneme conversion module 102 directly converts the target text 101 to be synthesized into the phoneme sequence 103 and inputs it into the acoustic model 106. Because the phoneme sequence 103 input to the acoustic model 106 is converted directly from the target text 101 to be synthesized, it contains only the basic units of speech and lacks finer-grained acoustic feature information about the target text 101. In practice, however, the same phoneme can have different acoustic realizations in different contexts, so the speech synthesis model 1 may fail to capture this complex relationship when mapping the target text 101 to the acoustic features of the target synthesized speech, resulting in inaccurate pronunciation in the finally generated target synthesized speech 108.
To address the above problems in the model architecture of the speech synthesis model 1 in fig. 1, the present specification provides a speech synthesis method, as shown in fig. 2. The method comprises the following steps:
step 201: an acoustic feature sequence corresponding to the target text to be synthesized is generated.
In one embodiment, the target text to be synthesized is the text content from which the target synthesized speech is to be generated. The target text to be synthesized may be text in any form: a sentence, a paragraph, or even an entire article. For example, the target text to be synthesized that is input to the speech synthesis model may be a short sentence, such as "Tomorrow's weather forecast shows that it will rain in Hangzhou."; it may also be a longer article, such as "This is a research paper about artificial intelligence, which details the principles and applications of machine learning and deep learning. ...".
In an embodiment, an acoustic feature sequence corresponding to the target text to be synthesized is generated, and the acoustic feature sequence is used to characterize the acoustic features of the target text. The "feature sequence" in this embodiment may take the form of a feature vector, a feature matrix, or the like. Specifically, a trained acoustic feature predictor converts the target text to be synthesized into a corresponding phoneme sequence, and then converts the phoneme sequence into the corresponding acoustic feature sequence that characterizes the acoustic features of the target text. The acoustic feature predictor may be any deep learning model capable of generating an acoustic feature sequence corresponding to the target text to be synthesized; it is not the name of a specific model. For example, the acoustic feature predictor may be a common Transformer model or an RNN model, trained so that it acquires the ability to generate an acoustic feature sequence corresponding to the target text to be synthesized. Of course, instead of using the architecture of an existing deep learning model, a person skilled in the art may customize the model architecture of the acoustic feature predictor to improve the suitability of the generated acoustic feature sequence.
In another embodiment, the model architecture of the acoustic feature predictor is customized as follows:
As shown in fig. 3, the acoustic feature predictor 3 may consist of two parts: a phoneme conversion module 30 and an acoustic feature conversion module 31. The phoneme conversion module 30 may use the same model architecture as the phoneme conversion module 102 in the speech synthesis model 1 of fig. 1; it converts the target text 101 to be synthesized into a phoneme sequence and inputs the phoneme sequence to the acoustic feature conversion module 31, which then generates the acoustic feature sequence 307 corresponding to the target text 101 to be synthesized. The acoustic feature conversion module 31 comprises five conversion sub-modules connected in sequence, with the output of each sub-module serving as the input of the next. In fig. 3, Conv1D denotes a one-dimensional convolution; RMSNorm stands for Root Mean Square Normalization; DP stands for Dropout; and Linear denotes a linear transformation.
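The following is a minimal PyTorch sketch of such an acoustic feature conversion module with five sequential Conv1D/RMSNorm/Dropout/Linear sub-modules, assuming the phoneme sequence has already been embedded as a tensor; the layer sizes and dropout rate are illustrative assumptions, not values from this specification (nn.RMSNorm requires PyTorch 2.4 or later).

```python
import torch.nn as nn

class ConversionSubModule(nn.Module):
    """One conversion sub-module: Conv1D -> RMSNorm -> Dropout -> Linear."""
    def __init__(self, dim: int, kernel_size: int = 3, dropout: float = 0.1):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.norm = nn.RMSNorm(dim)          # PyTorch >= 2.4
        self.drop = nn.Dropout(dropout)
        self.linear = nn.Linear(dim, dim)

    def forward(self, x):                                   # x: (batch, seq_len, dim)
        y = self.conv(x.transpose(1, 2)).transpose(1, 2)    # Conv1d expects (batch, dim, seq_len)
        y = self.drop(self.norm(y))
        return self.linear(y)

class AcousticFeatureConversionModule(nn.Module):
    """Five conversion sub-modules in sequence; each sub-module's output feeds the next."""
    def __init__(self, dim: int = 256, num_blocks: int = 5):
        super().__init__()
        self.blocks = nn.ModuleList(ConversionSubModule(dim) for _ in range(num_blocks))

    def forward(self, phoneme_embeddings):                  # (batch, seq_len, dim)
        x = phoneme_embeddings
        for block in self.blocks:
            x = block(x)
        return x                                            # acoustic feature sequence 307
```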
Step 202: and fusing the acoustic feature sequence and the phoneme sequence of the target text extracted by the voice synthesis model to obtain a fused sequence.
In one embodiment, as shown in fig. 4, the target text 101 to be synthesized is input to the phoneme conversion module 102 and the acoustic feature predictor 3, respectively, and the phoneme sequence 103 output by the phoneme conversion module 102 and the acoustic feature sequence 307 output by the acoustic feature predictor 3 are processed into a fusion sequence 402 through a fusion process 401.
In an embodiment, the fusion process 401 may splice the acoustic feature sequence 307 and the phoneme sequence 103 together to generate the fusion sequence. For example, if the acoustic feature sequence 307 is [1,2,1] and the phoneme sequence 103 is [1,1,1], the fusion sequence obtained by splicing the two may be [1,2,1,1,1,1] or [1,1,1,1,2,1]. The present specification does not limit the order in which the acoustic feature sequence is spliced with the phoneme sequence.
In another embodiment, the fusion process 401 may instead add the acoustic feature sequence 307 to the feature values at the corresponding positions in the phoneme sequence 103 to generate the fusion sequence. For example, if the acoustic feature sequence is [1,2,1] and the phoneme sequence is [1,1,1], the fusion sequence may be [2,3,2]. Of course, the summed feature values may further be averaged, in which case the fusion sequence becomes [2/2,3/2,2/2], i.e. [1,1.5,1].
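Both fusion strategies can be written in a few lines; a minimal sketch, assuming the two sequences are already represented as numeric tensors of compatible shape:

```python
import torch

acoustic_feature_seq = torch.tensor([1.0, 2.0, 1.0])   # sequence 307 from the example above
phoneme_seq = torch.tensor([1.0, 1.0, 1.0])            # sequence 103 from the example above

# Option 1: splicing (concatenation) -> [1, 2, 1, 1, 1, 1]
fused_concat = torch.cat([acoustic_feature_seq, phoneme_seq])

# Option 2: element-wise addition -> [2, 3, 2], optionally averaged -> [1.0, 1.5, 1.0]
fused_sum = acoustic_feature_seq + phoneme_seq
fused_mean = fused_sum / 2
```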
Through the above fusion process, the features of the acoustic feature sequence and the corresponding phoneme sequence are combined together, so that the fusion sequence simultaneously contains the features of the acoustic feature sequence and the phoneme sequence. In other words, the fusion sequence contains acoustic feature information about the target text 101.
Step 203: and obtaining target synthesized voice generated by the voice synthesis model according to the reference voice and the fusion sequence.
In summary, the method supplements the target text with corresponding acoustic feature information in the process of generating the target synthesized speech from the target text. Specifically, instead of directly inputting the target text and the reference speech into the speech synthesis model, an acoustic feature sequence characterizing the acoustic features of the target text is generated first and then fused with the phoneme sequence of the target text extracted by the speech synthesis model to form a fusion sequence, so that the speech synthesis model generates the corresponding target synthesized speech based on the fusion sequence and the reference speech. In this way, the speech synthesis model can convert the target text into the target synthesized speech more accurately, thereby reducing pronunciation errors in the target synthesized speech.
In one embodiment, the reference speech is a speech sample used as a reference by the speech synthesis model. It is typically a human recording and carries characteristics of a particular language, accent, and speaking style. The reference speech may be used to instruct the speech synthesis model to generate synthesized speech with a particular style or intonation; in other words, the speech synthesis model can generate a similar speech output by learning the speech features of the reference speech.
In an embodiment, the target synthesized speech is the audio representation of the target text to be synthesized and is used to convey the semantic information of the target text. For example, if the semantic information of the target text is "Hello, world!", the target synthesized speech expresses "Hello, world!" using the voice characteristics of the reference speech.
In one embodiment, as shown in FIG. 5, the speech synthesis model 1 generates the target synthesized speech 108 from the reference speech 104 and the fusion sequence 402.
It should be noted that, in the inference stage, the speech synthesis model 1 of fig. 5 (the speech synthesis model proposed by the present specification, which includes the acoustic feature predictor 3) differs from the speech synthesis model 1 of fig. 1 (the speech synthesis model of the related art) only in the processing logic within the dashed box 10; the other modules of the speech synthesis model 1 of fig. 5 outside the dashed box 10 (including the audio encoder 105, the acoustic model 106, and the audio decoder 107) may use the model architecture of the corresponding modules in the speech synthesis model 1 of fig. 1.
As shown in fig. 6, fig. 6 shows a training method for the acoustic feature predictor 3 of fig. 5; the training method of fig. 6 is explained next in conjunction with fig. 7. The method comprises the following steps:
step 601: a training data set is obtained, the training data set comprising text data and corresponding speech data.
In one embodiment, the training data set may be built from a public corpus, such as the publicly available AISHELL data set, which includes 178 hours of speech data. Of course, for a specific scene or task, a task-related data set may also be collected and labeled; for example, a recording device may be used to record speech data, and the text data corresponding to the speech data is then annotated. The present specification does not limit the source of the training data set.
The training data set may include text data and corresponding speech data. For example, the text data may be the sentence "Hello, world!", and the speech data may be a recording of that sentence being read aloud.
Step 602: the text data is input into the acoustic feature predictor to be trained, and a first acoustic feature sequence corresponding to the text data is generated through the acoustic feature predictor.
As shown in fig. 7, the text data 701 is input to the acoustic feature predictor 3 to be trained, and a first acoustic feature sequence 702 corresponding to the text data 701 is generated by the acoustic feature predictor 3.
Step 603: inputting the voice data into the voice representation model, and generating a voice representation vector corresponding to the voice data through the voice representation model; and inputting the speech characterization vector and the text data into the acoustic feature aligner, and converting the speech characterization vector into a second acoustic feature sequence by the acoustic feature aligner.
In one embodiment, as shown in FIG. 7, speech data 707 is input to a speech representation model 703, and speech characterization vectors 704 corresponding to the speech data 707 are generated by the speech representation model 703; and inputting the speech characterization vector 704 and text data 701 to an acoustic feature aligner 705, whereby the speech characterization vector 704 is converted into a second acoustic feature sequence 706 by the acoustic feature aligner 705.
In one embodiment, the speech representation model 703 may be a trained HuBERT model, or it may be another trained deep learning model that has the ability to convert speech data into speech characterization vectors 704. The present specification places no limit on this.
In one embodiment, the acoustic feature aligner 705 may adopt the architecture of a conventional Transformer model, which has the ability to convert the speech characterization vector 704 into the second acoustic feature sequence 706. Of course, a custom model architecture with the same ability is also possible. The present specification places no limit on the model architecture of the acoustic feature aligner.
It should be noted that, in the training stage, this embodiment adds two auxiliary training modules (the speech representation model 703 and the acoustic feature aligner 705) to the speech synthesis model 1 of fig. 5 in order to train the acoustic feature predictor 3; see in particular the processing logic of the dashed box 70 in fig. 7. These auxiliary training modules appear only in the training phase of the acoustic feature predictor 3. In the inference stage, only the model architecture of the speech synthesis model 1 shown in fig. 5 is used; in other words, the actual inference process of the speech synthesis model 1 is as shown in fig. 5 and does not include the processing of the dashed box 70.
Step 604: updating parameters of the acoustic feature predictor based on a similarity loss between the first acoustic feature sequence and the second acoustic feature sequence.
In an embodiment, the parameters of the acoustic feature predictor may be updated based on a loss function such as the mean squared error, the absolute error, or the cross-entropy loss between the first acoustic feature sequence and the second acoustic feature sequence. It will be appreciated that the mean squared error, absolute error, and cross-entropy loss all measure the similarity loss between the first and second acoustic feature sequences, i.e. the difference between them, and this difference drives the update of the acoustic feature predictor's parameters. By continuously and iteratively updating the parameters of the acoustic feature predictor, the first acoustic feature sequence and the second acoustic feature sequence can be made closer and closer.
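A minimal training-step sketch for this update, assuming the acoustic feature predictor and the auxiliary modules (speech representation model and acoustic feature aligner) are already instantiated PyTorch modules, and using the mean-squared-error form of the similarity loss:

```python
import torch
import torch.nn.functional as F

def train_step(predictor, speech_repr_model, aligner, optimizer, text_batch, speech_batch):
    # First acoustic feature sequence: predicted from the text data (fig. 7, 701 -> 702).
    first_seq = predictor(text_batch)

    # Second acoustic feature sequence: derived from the paired speech data (707 -> 704 -> 706).
    with torch.no_grad():                      # only the acoustic feature predictor is updated here
        repr_vec = speech_repr_model(speech_batch)
        second_seq = aligner(repr_vec, text_batch)

    # Similarity loss; the mean squared error is one of the options named above.
    loss = F.mse_loss(first_seq, second_seq)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```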
In this embodiment, the speech representation model extracts the speech features of the speech data, and the acoustic feature aligner further extracts the acoustic features within those speech features. Using the similarity loss between the first acoustic feature sequence output by the acoustic feature predictor and the second acoustic feature sequence output by the acoustic feature aligner, the difference between the second acoustic feature sequence extracted from the speech data and the first acoustic feature sequence extracted from the text data is reduced, so that the acoustic feature predictor is trained to extract acoustic features from text data.
In an embodiment, the other modules of the speech synthesis model 1 (the modules other than the acoustic feature predictor 3 in fig. 5) may be trained with the loss functions defined for them in the related art. For example, if the speech synthesis model is the VALL-E model, the loss functions customized for each module of the VALL-E model may be used to train the other modules. The present specification places no limit on this.
The speech characterization vector 704 generated by the speech representation model 703 in fig. 7 may contain too many interference factors related to speaking style. The acoustic feature sequence generated by the trained acoustic feature predictor 3 would then also contain these interference factors, which ultimately degrades the quality and naturalness of the synthesized speech generated by the speech synthesis model.
In view of the above, in this embodiment, after the speech characterization vector 704 output by the speech representation model 703 is obtained, it is further processed to eliminate its interference factors before being input to the acoustic feature aligner 705.
Specifically, as shown in fig. 8, all speech characterization vectors 704 are partitioned into different classes by a clustering process 801, wherein the similarity between speech characterization vectors 704 of the same class is higher than the similarity between speech characterization vectors 704 of different classes. In this way, speech characterization vectors of similar speech styles can be clustered into the same class.
In particular, a clustering approach may be utilized to categorize all speech characterization vectors into different categories. The clustering method can be K-means or DBSCAN clustering methods. The present specification does not set any limit to the clustering method.
Taking the K-means method as an example, the specific dividing steps are as follows:
① 512 speech characterization vectors are arbitrarily selected as the initial centroids.
② For each speech characterization vector, its distance to every centroid is calculated, and the vector is assigned to the cluster whose centroid is closest to it.
③ The average of the speech characterization vectors within each cluster is calculated as the new centroid.
④ Steps ② and ③ are repeated, iteratively reassigning the speech characterization vectors and updating the centroids, until the centroids no longer change or a preset number of iterations is reached.
Finally, all speech characterization vectors are classified into different categories.
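The K-means procedure above can be reproduced with scikit-learn; a minimal sketch, assuming the speech characterization vectors have been stacked into a NumPy array (the 512 centroids follow step ①, while the number and dimension of the vectors are placeholder assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

# One row per speech characterization vector (placeholder data; the dimension is assumed).
speech_vecs = np.random.randn(10000, 768).astype(np.float32)

kmeans = KMeans(n_clusters=512, max_iter=300, n_init=1, random_state=0)
cluster_ids = kmeans.fit_predict(speech_vecs)   # one class label per speech characterization vector
centroids = kmeans.cluster_centers_             # final centroids after convergence
```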
For all speech characterization vectors of the same class, each vector in the class is converted into the same number, and that number is then converted into an embedding vector. For example, for the speech characterization vectors [1,2,3], [2,2,3], and [1,2,2] in class 1, each is converted into the number 1, where 1 may be the number of the class; this number 1 is then converted into the embedding vectors [0.1,0.2,0.2], [0.1,0.2,0.2], and [0.1,0.2,0.2] corresponding to the respective speech characterization vectors. It can be seen that this processing eliminates the differences in speech style between speech characterization vectors with similar styles, thereby removing the interference factors of the speech characterization vectors with respect to speech style.
Of course, in addition to the above manner, each speech characterization vector in the same class may instead be concatenated with a class embedding vector that characterizes the class, forming the embedding vector corresponding to that speech characterization vector. For example, if the class embedding vector of class 1 is [1,1,1] and the speech characterization vectors in class 1 are [1,2,3], [2,2,3], and [1,2,2], concatenating each speech characterization vector with the class embedding vector yields the embedding vectors [1,2,3,1,1,1], [2,2,3,1,1,1], and [1,2,2,1,1,1]. This processing partially eliminates the differences in speech style between similar speech characterization vectors, thereby reducing the interference factors of the speech characterization vectors with respect to speech style.
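Both conversion schemes amount to attaching a class identity to each speech characterization vector; a minimal sketch, assuming the cluster labels from the preceding clustering step (the embedding dimension and example values are assumptions):

```python
import torch
import torch.nn as nn

num_classes, embed_dim = 512, 256                       # assumed values
class_embedding = nn.Embedding(num_classes, embed_dim)  # one embedding vector per class

cluster_ids = torch.tensor([0, 0, 1])                   # class label of each speech characterization vector
speech_vecs = torch.randn(3, 768)                       # the original speech characterization vectors

# Scheme 1: replace each vector by the embedding of its class number;
# vectors in the same class become identical embedding vectors.
embedded = class_embedding(cluster_ids)                 # shape (3, 256)

# Scheme 2: concatenate each vector with its class embedding vector.
embedded_concat = torch.cat([speech_vecs, class_embedding(cluster_ids)], dim=-1)  # shape (3, 768 + 256)
```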
The embedded vectors are input to an acoustic feature aligner and each embedded vector is converted by the acoustic feature aligner into a corresponding second acoustic feature sequence, respectively.
In this embodiment, clustering similar speech characterization vectors together means that speech characterization vectors of the same class characterize similar acoustic features. Speech characterization vectors of the same class are then converted into the same or similar embedding vectors. The converted embedding vectors therefore eliminate or reduce the interference factors related to speech style, which helps the speech synthesis model to better capture and maintain the consistency of the speech, improving the quality and naturalness of the synthesized speech.
Acquiring a large-scale speech data set is a challenging task: it requires substantial time, resources, and cost, and attention must also be paid to data privacy and compliance when collecting the speech data. To address this problem and to ensure that there is enough speech data to adequately train the speech synthesis model, the present specification provides the method of fig. 9 for augmenting the training data set by generating synthesized speech data. The method comprises the following steps:
step 901: acquiring an initial data set, wherein the initial data set comprises initial text data and corresponding initial voice data;
The initial data set may be a public corpus, such as the publicly available AISHELL data set.
Step 902: generating synthesized speech data using at least a portion of the initial speech data in the initial dataset
In one embodiment, a portion of the initial speech data in the initial data set may be selected to generate synthesized speech data. For example, 10% of the initial speech data in the initial data set may be selected to generate corresponding synthesized speech data, or all of the initial speech data in the initial data set may be used to generate corresponding synthesized speech data.
In an embodiment, for speech data with a single timbre and a long duration, which is difficult to collect in the real world, corresponding synthesized speech data can be generated specifically for such speech data to compensate for its scarcity.
In one embodiment, for each piece of initial speech data for which synthesized speech data needs to be generated, the initial speech data is input into a timbre transformation model to obtain multiple pieces of synthesized speech data output by the model, with a timbre difference between any two of them. For example, for the speech data "Hello, world!", the timbre transformation model can output multiple pieces of synthesized speech data that read "Hello, world!" in different timbres.
In an embodiment, the timbre transformation model may specifically be a trained UNet model. Of course, other trained models with a capability similar to that of the UNet model may also be used.
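A minimal sketch of this augmentation step, where `timbre_model` is a hypothetical callable standing in for the trained timbre transformation model; its interface (a waveform tensor plus a target-timbre index) is an assumption, not an API defined in this specification.

```python
import torch

def augment_with_timbres(initial_speech, timbre_model, num_timbres=5):
    """Generate several timbre variants of one piece of initial speech data.

    `timbre_model` is a hypothetical callable (e.g. a trained UNet-style converter)
    that takes a waveform tensor and a target-timbre index and returns a converted waveform.
    """
    variants = []
    with torch.no_grad():
        for timbre_id in range(num_timbres):
            variants.append(timbre_model(initial_speech, timbre_id))
    return variants   # same content, different timbre in each variant
```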
Step 903: the synthesized speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
Because synthesized speech data can be generated easily in this embodiment, the scale of the training data set can be increased to improve the generalization ability of the model. In addition, since the multiple pieces of synthesized speech data corresponding to the same initial speech data differ in timbre, the diversity of the training data set in terms of timbre is increased, which helps the model learn more timbre patterns.
Although current speech synthesis models can clone timbre, timbre inconsistency in synthesized speech remains a non-negligible problem in speech synthesis tasks. For example, the first half of a synthesized utterance may exhibit the timbre of speaker A while the second half shifts to the timbre of speaker B, making the timbre of the synthesized speech inconsistent.
To address this problem, the present specification generates a training data set containing timbre perturbations by the method shown in fig. 10, so as to enhance the timbre consistency of the synthesized speech generated by the trained speech synthesis model:
step 1001: an initial data set is acquired, the initial data set including initial text data and corresponding initial speech data.
Step 1002: disturbance speech data is generated using at least part of the initial speech data of the initial data set.
Step 1003: the perturbed speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
Step 1001 is the same as step 901; see the embodiments of step 901 for details, which are not repeated here.
In one embodiment, a portion of the speech segments in any piece of initial speech data is replaced with other speech segments that carry the same semantics, to obtain the perturbed speech data corresponding to that initial speech data.
In one embodiment, a speech segment in one piece of initial speech data is replaced with a speech segment from a different piece of initial speech data that carries the same semantics. For example, suppose the first speech segment of initial speech data 1 is "hello, world", the second speech segment of initial speech data 2 is also "hello, world", and initial speech data 1 and initial speech data 2 are not the same piece of initial speech data; the first speech segment of initial speech data 1 can then be replaced with the second speech segment of initial speech data 2, and the resulting initial speech data 1 is taken as the perturbed speech data corresponding to initial speech data 1.
In another embodiment, a speech segment in a piece of initial speech data is replaced with another speech segment from the same initial speech data that carries the same semantics. For example, if a speech segment of some initial speech data is "hello, world" and another segment with the same semantics appears elsewhere in the same initial speech data, that other segment is used to replace the original segment, yielding the perturbed speech data corresponding to that initial speech data.
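A minimal sketch of this segment-replacement perturbation on raw waveforms, assuming the segment boundaries and their semantic equivalence are already known; the sample indices and waveform lengths are illustrative placeholders.

```python
import numpy as np

def replace_segment(waveform, segment_span, replacement_segment):
    """Splice a semantically equivalent segment (e.g. taken from another speaker)
    into `waveform`, replacing the samples in segment_span = (start, end)."""
    start, end = segment_span
    return np.concatenate([waveform[:start], replacement_segment, waveform[end:]])

# Example with placeholder data: replace the first segment of speech data 1
# with the semantically equivalent segment taken from speech data 2.
speech_1 = np.random.randn(16000).astype(np.float32)          # initial speech data 1
speech_2_segment = np.random.randn(4000).astype(np.float32)   # the "hello, world" segment of speech data 2
perturbed_speech_1 = replace_segment(speech_1, (0, 3500), speech_2_segment)
```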
In this embodiment, by adding timbre perturbations to the speech data in the training data set, the speech synthesis model trained on the perturbed speech data acquires strong robustness against timbre perturbations, thereby ensuring the timbre consistency of the synthesized speech.
Corresponding to the embodiments of the aforementioned method, the present specification also provides embodiments of the apparatus and the terminal to which it is applied.
As shown in fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present specification. At the hardware level, the device includes a processor 1102, an internal bus 1104, a network interface 1106, a memory 1108, and a non-volatile storage 1110, and may of course also include hardware required by other services. One or more embodiments of the present specification may be implemented in software, for example by the processor 1102 reading the corresponding computer program from the non-volatile storage 1110 into the memory 1108 and then running it. Of course, in addition to a software implementation, one or more embodiments of the present specification do not exclude other implementations, such as logic devices or a combination of software and hardware; that is, the execution subject of the following processing flow is not limited to logic modules and may also be hardware or a logic device.
As shown in fig. 12, fig. 12 is a block diagram of a speech synthesis apparatus according to an exemplary embodiment of the present specification. The device 12 can be applied to the electronic equipment shown in fig. 11 to implement the technical solution of the present specification. The device 12 comprises:
an acoustic feature sequence generating unit 1204, configured to generate an acoustic feature sequence corresponding to a target text to be synthesized, where the acoustic feature sequence is used to characterize acoustic features of the target text;
A fusion sequence generating unit 1206, configured to fuse the acoustic feature sequence with a phoneme sequence of the target text extracted by the speech synthesis model to obtain a fusion sequence;
A target synthetic speech generating unit 1208, configured to obtain a target synthetic speech generated by the speech synthesis model according to the reference speech and the fusion sequence.
Optionally, the speech synthesis model includes an acoustic feature predictor, and the acoustic feature sequence generating unit 1204 is specifically configured to input the target text to be synthesized into the trained acoustic feature predictor, so as to generate an acoustic feature sequence corresponding to the target text by the acoustic feature predictor.
Optionally, the speech synthesis model includes a speech representation model and an acoustic feature aligner, training the acoustic feature predictor, including: acquiring a training data set, wherein the training data set comprises text data and corresponding voice data thereof; inputting the text data into the acoustic feature predictor to be trained, and generating a first acoustic feature sequence corresponding to the text data through the acoustic feature predictor; inputting the voice data into the voice representation model, and generating a voice representation vector corresponding to the voice data through the voice representation model; and inputting the speech characterization vector and the text data into the acoustic feature aligner, converting the speech characterization vector into a second acoustic feature sequence by the acoustic feature aligner; updating parameters of the acoustic feature predictor based on a similarity loss between the first acoustic feature sequence and the second acoustic feature sequence.
Optionally, the apparatus 12 further comprises a clustering unit 1202 for classifying all speech token vectors into different classes, wherein the similarity between speech token vectors of a same class is higher than the similarity between speech token vectors of different classes; determining the embedded vectors corresponding to each divided class, wherein the embedded vector corresponding to any one class is obtained by converting each voice characterization vector belonging to the class; the inputting the speech characterization vector and the text data into the acoustic feature aligner, converting the speech characterization vector into a second acoustic feature sequence by the acoustic feature aligner, comprising: and inputting the embedded vectors into the acoustic feature aligner, and respectively converting each embedded vector into a corresponding second acoustic feature sequence through the acoustic feature aligner.
Optionally, the clustering unit 1202 is specifically configured to divide all speech characterization vectors into different classes by a clustering algorithm.
Optionally, the training data set includes an initial data set and an extended data set, and the acquiring the training data set includes: acquiring an initial data set, wherein the initial data set comprises initial text data and corresponding initial voice data; generating synthetic voice data by utilizing at least part of initial voice data in the initial data set, wherein any initial voice data is input into a tone conversion model to obtain a plurality of synthetic voice data output by the model, and tone difference exists between any two of the plurality of synthetic voice data; the synthesized speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
Optionally, the training data set includes an initial data set and an extended data set, and the acquiring the training data set includes: acquiring an initial data set, wherein the initial data set comprises initial text data and corresponding initial voice data; generating disturbance voice data by utilizing at least part of initial voice data in the initial data set, wherein part of voice fragments in any initial voice data are replaced by other voice fragments containing the same semantics, so as to obtain disturbance voice data corresponding to any initial voice data; the perturbed speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
The implementation process of the functions and roles of each module in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, reference is made to the description of the method embodiments for the relevant points, since they essentially correspond to the method embodiments. The apparatus embodiments described above are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present description. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present specification also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of any of the foregoing speech synthesis methods provided by the present application.
In particular, computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Other embodiments of the present description will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This specification is intended to cover any variations, uses, or adaptations of the specification following, in general, the principles of the specification and including such departures from the present disclosure as come within known or customary practice within the art to which the specification pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the specification being indicated by the following claims.
It is to be understood that the present description is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present description is limited only by the appended claims.
The foregoing description of the preferred embodiments is provided for the purpose of illustration only, and is not intended to limit the scope of the disclosure, since any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the disclosure are intended to be included within the scope of the disclosure.

Claims (10)

1. A method of speech synthesis, the method comprising:
Generating an acoustic feature sequence corresponding to a target text to be synthesized, wherein the acoustic feature sequence is used for representing acoustic features of the target text;
fusing the acoustic feature sequence and the phoneme sequence of the target text extracted by the voice synthesis model to obtain a fused sequence;
and obtaining target synthesized voice generated by the voice synthesis model according to the reference voice and the fusion sequence.
2. The method of claim 1, wherein the speech synthesis model includes an acoustic feature predictor, the generating an acoustic feature sequence corresponding to a target text to be synthesized, comprising:
A target text to be synthesized is input to a trained acoustic feature predictor to generate an acoustic feature sequence corresponding to the target text by the acoustic feature predictor.
3. The method of claim 2, wherein the speech synthesis model comprises a speech representation model and an acoustic feature aligner, the training of the acoustic feature predictor comprising:
Acquiring a training data set, wherein the training data set comprises text data and corresponding voice data thereof;
Inputting the text data into the acoustic feature predictor to be trained, and generating a first acoustic feature sequence corresponding to the text data through the acoustic feature predictor;
Inputting the voice data into the voice representation model, and generating a voice representation vector corresponding to the voice data through the voice representation model; and inputting the speech characterization vector and the text data into the acoustic feature aligner, converting the speech characterization vector into a second acoustic feature sequence by the acoustic feature aligner;
updating parameters of the acoustic feature predictor based on a similarity loss between the first acoustic feature sequence and the second acoustic feature sequence.
4. A method according to claim 3, characterized in that the method further comprises:
Dividing all the voice characterization vectors into different categories, wherein the similarity between the voice characterization vectors of the same category is higher than the similarity between the voice characterization vectors of different categories;
Determining the embedded vectors corresponding to each divided class, wherein the embedded vector corresponding to any one class is obtained by converting each voice characterization vector belonging to the class;
the inputting the speech characterization vector and the text data into the acoustic feature aligner, converting the speech characterization vector into a second acoustic feature sequence by the acoustic feature aligner, comprising:
And inputting the embedded vectors into the acoustic feature aligner, and respectively converting each embedded vector into a corresponding second acoustic feature sequence through the acoustic feature aligner.
5. The method of claim 4, wherein the classifying all speech characterization vectors into different categories comprises:
All speech characterization vectors are classified into different categories by a clustering algorithm.
6. A method according to claim 3, wherein the training data set comprises an initial data set and an extended data set, the acquiring the training data set comprising:
Acquiring an initial data set, wherein the initial data set comprises initial text data and corresponding initial voice data;
Generating synthetic voice data by utilizing at least part of initial voice data in the initial data set, wherein any initial voice data is input into a tone conversion model to obtain a plurality of synthetic voice data output by the model, and tone difference exists between any two of the plurality of synthetic voice data;
the synthesized speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
7. A method according to claim 3, wherein the training data set comprises an initial data set and an extended data set, the acquiring the training data set comprising:
Acquiring an initial data set, wherein the initial data set comprises initial text data and corresponding initial voice data;
Generating disturbance voice data by utilizing at least part of initial voice data in the initial data set, wherein part of voice fragments in any initial voice data are replaced by other voice fragments containing the same semantics, so as to obtain disturbance voice data corresponding to any initial voice data;
The perturbed speech data is added to the expanded data set and the initial data set and the expanded data set are combined into the training data set.
8. A speech synthesis apparatus, the apparatus comprising:
An acoustic feature sequence generating unit, configured to generate an acoustic feature sequence corresponding to a target text to be synthesized, where the acoustic feature sequence is used to characterize acoustic features of the target text;
the fusion sequence generating unit is used for fusing the acoustic feature sequence and the phoneme sequence of the target text extracted by the voice synthesis model to form a fusion sequence;
and the target synthetic voice generating unit is used for acquiring target synthetic voice generated by the voice synthesis model according to the reference voice and the fusion sequence.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 7 when the program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
CN202410416936.6A (filed 2024-04-08): Speech synthesis method, device, electronic equipment and storage medium. Status: Pending. Published as CN118262698A.

Publications (1)

Publication Number: CN118262698A
Publication Date: 2024-06-28

Legal Events

PB01: Publication