CN115101041A - Method and apparatus for speech synthesis and for training a speech synthesis model - Google Patents

Method and apparatus for speech synthesis and for training a speech synthesis model

Info

Publication number
CN115101041A
Authority
CN
China
Prior art keywords
frame
feature
features
predicted
audio data
Prior art date
Legal status
Pending
Application number
CN202210502540.4A
Other languages
Chinese (zh)
Inventor
黄一鸣
张辉
原湉
梁芸铭
杨烨华
陈泽裕
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210502540.4A
Publication of CN115101041A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/0018 - Speech coding using phonetic or linguistical decoding of the source; Reconstruction using text-to-speech synthesis
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The present disclosure provides a speech synthesis method and apparatus, a method and apparatus for training a speech synthesis model, an electronic device, and a readable storage medium, and relates to artificial intelligence technologies such as deep learning and natural language processing. The speech synthesis method comprises the following steps: acquiring a text to be processed to obtain a phoneme sequence of the text to be processed; encoding the phoneme sequence to obtain a first feature; decoding the first feature to obtain multiple frames of second features; obtaining a third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames; and obtaining audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames. The training method of the speech synthesis model comprises the following steps: acquiring a data set; constructing a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder; and training the neural network model based on a plurality of training texts and training audio data of the training texts to obtain the speech synthesis model.

Description

Method and apparatus for speech synthesis and for training a speech synthesis model
Technical Field
The present disclosure relates to the field of computer technology, and more particularly to artificial intelligence technologies such as deep learning and natural language processing. Provided are a speech synthesis method and apparatus, a method and apparatus for training a speech synthesis model, an electronic device, and a readable storage medium.
Background
In a speech synthesis task, after a text to be processed is input, audio data corresponding to the text to be processed is output. However, in existing speech synthesis approaches, the first piece of audio data can be returned only after the entire computation has finished, so low-latency application scenarios cannot be satisfied and the response time of speech synthesis is long.
Disclosure of Invention
According to a first aspect of the present disclosure, there is provided a speech synthesis method, comprising: acquiring a text to be processed to obtain a phoneme sequence of the text to be processed; encoding the phoneme sequence to obtain a first feature; decoding the first feature to obtain multiple frames of second features; obtaining a third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames; and obtaining audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames.
According to a second aspect of the present disclosure, there is provided a training method of a speech synthesis model, comprising: acquiring a data set, wherein the data set comprises a plurality of training texts and training audio data of the plurality of training texts; constructing a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder, wherein the encoder is used for obtaining a predicted first feature according to a phoneme sequence of a training text, the decoder is used for obtaining multiple frames of predicted second features according to the predicted first feature, the posterior network layer is used for obtaining a predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature and its adjacent feature frames, and the vocoder is used for obtaining predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature and its adjacent feature frames; and training the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts until the neural network model converges, so as to obtain the speech synthesis model.
According to a third aspect of the present disclosure, there is provided a speech synthesis apparatus, comprising: a first obtaining unit, configured to acquire a text to be processed to obtain a phoneme sequence of the text to be processed; a first processing unit, configured to encode the phoneme sequence to obtain a first feature; a second processing unit, configured to decode the first feature to obtain multiple frames of second features; a third processing unit, configured to obtain a third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames; and a synthesizing unit, configured to obtain audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames.
According to a fourth aspect of the present disclosure, there is provided a training apparatus for a speech synthesis model, comprising: a second obtaining unit, configured to acquire a data set, wherein the data set comprises a plurality of training texts and training audio data of the plurality of training texts; a construction unit, configured to construct a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder, wherein the encoder is used for obtaining a predicted first feature according to a phoneme sequence of a training text, the decoder is used for obtaining multiple frames of predicted second features according to the predicted first feature, the posterior network layer is used for obtaining a predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature and its adjacent feature frames, and the vocoder is used for obtaining predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature and its adjacent feature frames; and a training unit, configured to train the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts until the neural network model converges, so as to obtain the speech synthesis model.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method as described above.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described above.
According to the technical solution of the present disclosure, the third features and the audio data are each obtained in combination with adjacent feature frames, so that the boundary of each feature frame can be fully taken into account, thereby improving the continuity between the pieces of audio data obtained for different third features.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become readily apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 4 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a sixth embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing speech synthesis or a method of training a speech synthesis model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram according to a first embodiment of the present disclosure. As shown in fig. 1, the speech synthesis method of the present embodiment specifically includes the following steps:
S101, acquiring a text to be processed to obtain a phoneme sequence of the text to be processed;
S102, encoding the phoneme sequence to obtain a first feature;
S103, decoding the first feature to obtain multiple frames of second features;
S104, obtaining a third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames;
and S105, obtaining audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames.
The speech synthesis method of the embodiment obtains the third features and the audio data respectively by combining the adjacent feature frames, and can fully consider the boundary of each feature, thereby improving the continuity between the obtained audio data corresponding to different third features.
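As an illustration only, the following Python sketch strings the five steps together as a streaming generator; the component names (g2p, encoder, decoder, post_net, vocoder) and the one-neighbor context window are hypothetical placeholders rather than interfaces defined by the present disclosure.

```python
# Hypothetical component names; the present disclosure does not prescribe these APIs.
def synthesize_streaming(text, g2p, encoder, decoder, post_net, vocoder):
    """Sketch of steps S101-S105: yields audio chunks as they become available."""
    phonemes = g2p(text)                       # S101: text -> phoneme sequence
    first_feature = encoder(phonemes)          # S102: encode to the first feature
    second_frames = decoder(first_feature)     # S103: decode to multi-frame second features

    third_frames = []
    for i, frame in enumerate(second_frames):  # S104: combine each frame with its neighbors
        neighbors = second_frames[max(0, i - 1):i + 2]
        third_frames.append(post_net(frame, neighbors))

    for i, frame in enumerate(third_frames):   # S105: the vocoder also sees adjacent frames
        neighbors = third_frames[max(0, i - 1):i + 2]
        yield vocoder(frame, neighbors)        # audio chunks can be returned one by one
```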
In the embodiment, when S101 is executed to acquire a text to be processed, the text input by the input end or the text selected by the input end may be used as the text to be processed.
In this embodiment, when S101 is executed to obtain a phoneme sequence of a text to be processed, a phoneme corresponding to each character in the text to be processed may be first obtained, and then a phoneme sequence of the text to be processed may be formed according to phonemes corresponding to all characters; before S101 is executed to obtain a phoneme sequence of the text to be processed, punctuation marks in the text to be processed may be removed.
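A minimal sketch of this text-to-phoneme step is given below, assuming a per-character lexicon dictionary; the patent does not prescribe how per-character phonemes are looked up, so `lexicon` and the regular expression used to strip punctuation are illustrative assumptions.

```python
import re

def text_to_phonemes(text, lexicon):
    """Sketch of S101: strip punctuation marks, then map each character to its phonemes.

    `lexicon` is an assumed dict from character to a list of phonemes.
    """
    cleaned = re.sub(r"[^\w\s]", "", text)            # remove punctuation marks
    phoneme_sequence = []
    for ch in cleaned:
        phoneme_sequence.extend(lexicon.get(ch, []))  # phonemes corresponding to each character
    return phoneme_sequence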
After the phoneme sequence of the text to be processed is obtained by executing S101, the embodiment executes S102 to encode the phoneme sequence, and obtains the first feature.
In this embodiment, when the phoneme sequence is encoded in step S102 to obtain the first feature, the optional implementation manner that can be adopted is as follows: and acquiring the encoder characteristics of the phoneme sequence, and taking the acquired encoder characteristics as first characteristics.
In addition, in the present embodiment, when the phoneme sequence is encoded to obtain the first feature in step S102, the phoneme sequence may be input to a speech synthesis model trained in advance, and the phoneme sequence may be encoded by an encoder in the speech synthesis model, so that the output encoder feature may be used as the first feature.
After the first feature is obtained in S102, S103 is executed to decode the first feature to obtain multiple frames of second features; the second feature obtained in this embodiment is an acoustic feature, such as Mel-frequency spectrum (Mel spectrum).
In this embodiment, when S103 is executed to obtain the second feature of multiple frames according to the first feature, the optional implementation manner that may be adopted is: acquiring the decoder characteristic of the first characteristic, and acquiring a plurality of frames of second characteristics according to the acquired decoder characteristic; it can be understood that the decoder feature obtained by executing S103 in this embodiment may be directly the second feature of multiple frames, or may be a cut of the obtained decoder feature, and the cut result is taken as the second feature of multiple frames.
In this embodiment, when S103 is executed to decode the first feature to obtain the second features of multiple frames, the first feature may be input into a pre-trained speech synthesis model, and the decoder in the speech synthesis model decodes the first feature, so that the output features of the decoder of multiple frames are used as the second features of multiple frames.
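Under one possible reading, in which each "frame of second features" is a fixed-size block of spectrogram frames cut from the decoder output, the slicing could look like the following sketch; the 80-dimensional mel features and the 50-frame block size are assumptions, not values fixed by the present disclosure.

```python
import numpy as np

def slice_into_chunks(decoder_features, frames_per_chunk=50):
    """Sketch: cut the decoder output (e.g. a mel spectrogram of shape [num_frames, 80])
    into multi-frame "second feature" blocks.  The block size is an assumption."""
    chunks = []
    for start in range(0, len(decoder_features), frames_per_chunk):
        chunks.append(decoder_features[start:start + frames_per_chunk])
    return chunks

mel = np.random.randn(230, 80)                 # stand-in for the decoder output
second_features = slice_into_chunks(mel)
print([c.shape[0] for c in second_features])   # e.g. [50, 50, 50, 50, 30]
```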
After executing S103 to obtain multiple frames of second features, executing S104 to obtain a third feature corresponding to each frame of second features according to each frame of second features and the adjacent feature frame of each frame of second features.
Specifically, when S104 is executed to obtain the third feature corresponding to each frame of the second feature according to each frame of the second feature and the adjacent feature frame of each frame of the second feature, this embodiment may adopt an optional implementation manner: determining adjacent feature frames of the second features of each frame from the second features of the frames; intercepting filling features from adjacent feature frames of each frame of second features, and filling the intercepted filling features into each frame of second features; and obtaining a third feature corresponding to each frame of second feature according to each frame of second feature after filling.
That is to say, in the present embodiment, the filling features intercepted from the adjacent feature frames of each frame of second features are used to fill each frame of second features, so as to obtain the third features corresponding to each frame of second features, and each frame of second features after the filling processing includes partial features in the adjacent feature frames, so that the boundary of each frame of second features can be accurately calculated, and the accuracy of the obtained third features corresponding to each frame of second features is improved.
In this embodiment, when S104 is executed to intercept a filling feature from an adjacent feature frame of each frame of second features, and fill the intercepted filling feature into each frame of second features, according to a position relationship between each frame of second features and the adjacent feature frame, a feature located at a first preset position of the adjacent feature frame may be intercepted as the filling feature, and the intercepted filling feature may be filled into a second preset position of each frame of second features.
In this embodiment, the first preset position and the second preset position have a corresponding relationship, if the first preset position is a beginning, the second preset position is an end, and if the first preset position is an end, the second preset position is a beginning.
For example, if the multi-frame second feature obtained by performing S103 in the embodiment includes M1, M2, and M3, the embodiment performs S104 to determine that the adjacent feature frame of M1 is M2, determine that the adjacent feature frame of M2 is M1 and M3, and determine that the adjacent feature frame of M3 is M2; for M1, if M2 is located after M1, then the feature located at the beginning of M2 is intercepted as a filling feature and filled to the end of M1; for M2, if M3 is located after M2, then the feature located at the beginning of M3 is intercepted as a filling feature and filled to the end of M2, and if M1 is located before M2, then the feature located at the end of M1 is intercepted as a filling feature and filled to the beginning of M2; for M3, M2 precedes M3, and then the feature at the end of M2 is truncated as a padding feature to fill the beginning of M3.
In this embodiment, when S104 is executed to obtain a third feature corresponding to each frame of second features according to each frame of second features after filling, an optional implementation manner that may be adopted is as follows: acquiring the posterior feature of each frame of second features after filling, wherein the posterior feature is a convolution feature, for example, the second features after filling are processed by using a convolution network (for example, posterior network Post Net) containing 5 layers of convolution, and the processing result is used as the posterior feature of each frame of second features after filling; and removing the posterior features corresponding to the filling features from the posterior features to obtain third features corresponding to the second features of each frame.
That is to say, when the third feature corresponding to each frame of the second feature is obtained, the posterior feature corresponding to the filling feature in the posterior features of the second feature may be removed, so as to further improve the accuracy of the obtained third feature and avoid the repetition of different finally generated audio data.
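The padding-then-trimming procedure of S104 can be illustrated with the following PyTorch sketch, in which each second-feature frame is treated as a block of mel frames, a Post-Net-like stack of five 1-D convolutions stands in for the posterior network layer, and the 4-frame padding width and layer sizes are assumptions rather than values taken from the present disclosure.

```python
import torch
import torch.nn as nn

# A minimal Post-Net-like stack of five 1-D convolutions (layer sizes are assumptions).
post_net = nn.Sequential(
    nn.Conv1d(80, 512, kernel_size=5, padding=2), nn.Tanh(),
    nn.Conv1d(512, 512, kernel_size=5, padding=2), nn.Tanh(),
    nn.Conv1d(512, 512, kernel_size=5, padding=2), nn.Tanh(),
    nn.Conv1d(512, 512, kernel_size=5, padding=2), nn.Tanh(),
    nn.Conv1d(512, 80, kernel_size=5, padding=2),
)

def third_feature_for_chunk(chunks, i, pad=4):
    """Sketch of S104 for block i: pad with features intercepted from the neighboring
    blocks, run the convolutional posterior network, then drop the padded part."""
    left = chunks[i - 1][:, :, -pad:] if i > 0 else None                 # end of the previous block
    right = chunks[i + 1][:, :, :pad] if i + 1 < len(chunks) else None   # beginning of the next block
    pieces = [p for p in (left, chunks[i], right) if p is not None]
    padded = torch.cat(pieces, dim=-1)                                   # [1, 80, pad + T + pad]
    out = post_net(padded)
    start = pad if left is not None else 0                               # remove outputs that belong
    end = -pad if right is not None else None                            # to the filling features
    return out[:, :, start:end]

chunks = [torch.randn(1, 80, 50) for _ in range(3)]    # three "second feature" frames
third = [third_feature_for_chunk(chunks, i) for i in range(3)]
print([t.shape[-1] for t in third])                    # each block stays 50 frames long
```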
In this embodiment, when S104 is executed to obtain the third feature corresponding to each frame of the second feature according to each frame of the second feature and the adjacent feature frame of each frame of the second feature, the adjacent feature frame of each frame of the second feature and each frame of the second feature may be further input into a pre-trained speech synthesis model, and the posterior network layer in the speech synthesis model processes each frame of the second feature and the adjacent feature frame of each frame of the second feature to output the third feature corresponding to each frame of the second feature.
In addition, when performing S104 to obtain the third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames, the embodiment may further include the following: acquiring the number of frames of second features that have been obtained; in this embodiment, the obtained second features may be put into a first buffer, so that the number of frames of second features stored in the first buffer is acquired; and, in a case that the acquired number of frames meets a first preset requirement, obtaining the third feature corresponding to each frame of second feature according to that frame of second feature and its adjacent feature frames.
That is to say, in this embodiment, only when the number of the acquired frames of the second feature meets the first preset requirement, the step of obtaining the third feature according to the second feature is executed again, and since the number of the second feature is sufficient, the accuracy of the obtained third feature can be improved.
In this embodiment, when S104 is executed, in a case that it is determined that the obtained number of frames of the second feature exceeds the first frame number threshold, it may be determined that the obtained number of frames of the second feature meets the first preset requirement.
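A minimal sketch of such a buffer is shown below; the threshold value and the class interface are illustrative assumptions.

```python
class FrameBuffer:
    """Sketch of the first buffer: second-feature frames are accumulated, and the next
    stage runs only once enough frames are available (the threshold is an assumption)."""

    def __init__(self, frame_threshold=3):
        self.frame_threshold = frame_threshold
        self.frames = []

    def push(self, frame):
        self.frames.append(frame)

    def ready(self):
        # "first preset requirement": the number of buffered frames exceeds the threshold
        return len(self.frames) > self.frame_threshold

    def pop_all(self):
        frames, self.frames = self.frames, []
        return frames
```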
After the third feature corresponding to the second feature of each frame is obtained in S104, S105 is performed to obtain audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame.
Specifically, when S105 is executed to obtain the audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame, this embodiment may adopt an optional implementation manner as follows: determining adjacent feature frames of the third features of each frame from the third features of the plurality of frames; intercepting filling features from adjacent feature frames of each frame of third features, and filling the intercepted filling features into each frame of third features; and obtaining audio data corresponding to the third feature of each frame according to the filled third feature of each frame.
Similarly, when S105 is executed to intercept a filling feature from an adjacent feature frame of each frame of the third feature and fill the intercepted filling feature into each frame of the third feature, the present embodiment may intercept, according to a positional relationship between each frame of the third feature and the adjacent feature frame, a feature located at a first preset position of the adjacent feature frame as the filling feature and fill the intercepted filling feature into a second preset position of each frame of the third feature; the first preset position and the second preset position have a corresponding relation.
That is to say, in the present embodiment, the audio data corresponding to each frame of third feature is obtained in a manner of filling each frame of third feature with the filling feature intercepted from the adjacent feature frame of each frame of third feature, and each frame of third feature after the filling processing includes a part of features in the adjacent feature frame, so that the boundary of each frame of third feature can be accurately calculated, and the continuity between the obtained audio data corresponding to different third features is improved.
In this embodiment, when performing S105 to obtain the audio data corresponding to each frame of third feature according to that frame of third feature after filling, the optional implementation manner that can be adopted is: acquiring the audio data of each frame of third feature after filling; and removing the audio data corresponding to the filling features from the audio data to obtain the audio data corresponding to each frame of third feature.
That is to say, when the audio data corresponding to each frame of the third feature is obtained, the audio data corresponding to the filling feature in the audio data may be removed, so as to avoid repetition between the generated audio data corresponding to different third features.
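The corresponding step for S105 is sketched below under the same block-based reading; `vocoder` stands for any assumed mel-to-waveform model, and the hop size of 256 samples per frame is an illustrative assumption used to convert the trimmed frame count into trimmed waveform samples.

```python
import torch

HOP = 256  # assumed hop size: one mel frame corresponds to 256 waveform samples

def audio_for_chunk(vocoder, thirds, i, pad=4):
    """Sketch of S105 for block i: pad the third feature with neighboring frames, run the
    (assumed) vocoder, then cut away the waveform samples produced for the padded frames."""
    left = thirds[i - 1][:, :, -pad:] if i > 0 else None
    right = thirds[i + 1][:, :, :pad] if i + 1 < len(thirds) else None
    pieces = [p for p in (left, thirds[i], right) if p is not None]
    wav = vocoder(torch.cat(pieces, dim=-1))           # assumed output shape [1, samples]
    start = pad * HOP if left is not None else 0
    end = -pad * HOP if right is not None else None
    return wav[:, start:end]
```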
In this embodiment, when S105 is executed to obtain the audio data corresponding to each frame of the third feature according to each frame of the third feature and the adjacent feature frame of each frame of the third feature, the audio data corresponding to each frame of the third feature may also be obtained by inputting the adjacent feature frame of each frame of the third feature and each frame of the third feature into a speech synthesis model obtained through pre-training, and processing the adjacent feature frame of each frame of the third feature and each frame of the third feature by a vocoder in the speech synthesis model.
In addition, when performing S105 to obtain the audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames, the embodiment may further include the following: acquiring the number of frames of third features that have been obtained; in this embodiment, the obtained third features may be put into a second buffer, so that the number of frames of third features stored in the second buffer is acquired; and, in a case that the acquired number of frames meets a second preset requirement, obtaining the audio data corresponding to each frame of third feature according to that frame of third feature and its adjacent feature frames.
That is to say, in this embodiment, only when the number of the acquired frames of the third feature meets the second preset requirement, the step of obtaining the audio data according to the third feature is executed again, and since the number of the third feature is sufficient, the accuracy of the obtained audio data can be improved.
In this embodiment, when S105 is executed, in a case that it is determined that the obtained number of frames of the third feature exceeds the second frame number threshold, it may be determined that the obtained number of frames of the third feature meets the second preset requirement.
After the audio data corresponding to each frame of third feature is obtained in S105, the obtained audio data may be sequentially returned to the input end, thereby achieving streaming speech synthesis; the embodiment can therefore reduce the response time of speech synthesis and increase the speed of speech synthesis.
Fig. 2 is a schematic diagram according to a second embodiment of the present disclosure. As shown in fig. 2, the training method of the speech synthesis model of the embodiment specifically includes the following steps:
S201, acquiring a data set, wherein the data set comprises a plurality of training texts and training audio data of the plurality of training texts;
S202, constructing a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder, wherein the encoder is used for obtaining a predicted first feature according to a phoneme sequence of a training text, the decoder is used for obtaining multiple frames of predicted second features according to the predicted first feature, the posterior network layer is used for obtaining a predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature and its adjacent feature frames, and the vocoder is used for obtaining predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature and its adjacent feature frames;
and S203, training the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts until the neural network model converges, so as to obtain the speech synthesis model.
In the training method of the speech synthesis model of this embodiment, the posterior network layer and the vocoder in the constructed neural network model obtain the predicted third feature and the predicted audio data by combining the adjacent feature frames, so that the accuracy of the obtained predicted third feature and the continuity between different predicted audio data can be improved, and corresponding audio data is generated according to different feature frames, so that the speech synthesis model can achieve the purpose of streaming speech synthesis, thereby enhancing the synthesis effect of the speech synthesis model.
In this embodiment, after S201 is executed to obtain the data set comprising a plurality of training texts and training audio data of the plurality of training texts, the data set may be further divided into a training set, an evaluation set and a test set; the training set is used for training the neural network model, the evaluation set is used for evaluating the neural network model in the training process, and the test set is used for testing the effect of the neural network model obtained through training.
After the data set is acquired in S201, the embodiment may further extract training acoustic features from the training audio data, and align the phoneme sequence of the training text with the extracted training acoustic features; the purpose of extracting training acoustic features from training audio data is to calculate loss functions of an encoder and a decoder in a neural network model, and then update parameters in the encoder and the decoder according to the loss functions.
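As an illustration, training acoustic features could be extracted as log-mel spectrograms with a library such as librosa; the sampling rate, FFT size, hop length and mel-band count below are assumptions, and the phoneme-to-frame alignment step itself is not shown.

```python
import librosa
import numpy as np

def extract_training_mel(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    """Sketch: extract a log-mel spectrogram from a training audio file to serve as the
    training acoustic feature (all signal parameters here are assumptions)."""
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    return np.log(mel + 1e-6).T   # shape [num_frames, n_mels], log-compressed
```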
After the data set is acquired in S201, a neural network model including an encoder, a decoder, an a posteriori network layer and a vocoder is constructed in S202.
In the neural network model constructed in step S202, the encoder, the decoder, and the posterior network layer form an acoustic model, and the acoustic model is used to convert the input text into acoustic features, such as mel-frequency spectrum features; the structure of the acoustic model in this embodiment is based on the Tacotron2 structure.
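A hypothetical skeleton of such a model is sketched below; the submodules are placeholders, and the residual connection around the posterior network layer is a Tacotron2 convention assumed here rather than a detail stated in the present disclosure.

```python
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    """Hypothetical skeleton of the constructed model: a Tacotron2-style acoustic model
    (encoder + autoregressive decoder + posterior network layer) plus a vocoder.
    Module internals are placeholders, not the patent's exact layer definitions."""

    def __init__(self, encoder, decoder, post_net, vocoder):
        super().__init__()
        self.encoder = encoder      # phoneme sequence -> predicted first feature
        self.decoder = decoder      # first feature -> multi-frame predicted second features
        self.post_net = post_net    # second features (+ neighbors) -> predicted third features
        self.vocoder = vocoder      # third features (+ neighbors) -> predicted audio data

    def forward(self, phoneme_ids):
        first = self.encoder(phoneme_ids)
        second = self.decoder(first)
        third = second + self.post_net(second)   # residual refinement, assumed from Tacotron2
        audio = self.vocoder(third)
        return second, third, audio
```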
In the neural network model constructed by executing S202 in this embodiment, the encoder is configured to obtain a predicted first feature according to the phoneme sequence of the training text, where the predicted first feature is an encoder feature.
In the neural network model constructed in S202, the decoder is configured to obtain a multi-frame predicted second feature according to the predicted first feature, where the predicted second feature is, for example, an acoustic feature of a mel-frequency spectrum.
In the neural network model constructed in S202 of this embodiment, the decoder outputs the predicted second feature of the current frame and the decoder state of the current frame according to the predicted first feature, the predicted second feature of the previous frame and the decoder state of the previous frame; that is, the decoder in this embodiment outputs the predicted second features one frame at a time.
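The frame-by-frame decoding loop can be sketched as follows; `decoder_step`, the go-frame initialization and the fixed maximum frame count are assumptions (stop-token handling, which Tacotron2-style decoders normally use, is omitted).

```python
def decode_all_frames(decoder_step, first_feature, max_frames, go_frame, init_state):
    """Sketch of the frame-by-frame decoding described above: each step consumes the
    previous frame's predicted second feature and decoder state and emits the current
    ones.  `decoder_step` is an assumed callable, not the patent's exact interface."""
    prev_frame, state = go_frame, init_state
    frames = []
    for _ in range(max_frames):
        prev_frame, state = decoder_step(first_feature, prev_frame, state)
        frames.append(prev_frame)
    return frames   # multi-frame predicted second features, one frame per step
```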
In the neural network model constructed in S202, the posterior network layer is configured to obtain, according to each frame of predicted second feature and its adjacent feature frames, the predicted third feature corresponding to that frame of predicted second feature; the posterior network layer in the present embodiment may be a Post Net (posterior network) formed by a 5-layer convolutional network.
Specifically, in the neural network model constructed in S202 of this embodiment, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature and its adjacent feature frames, an optional implementation manner that may be adopted is: determining the adjacent feature frames of each frame of predicted second feature from the multiple frames of predicted second features; intercepting filling features from the adjacent feature frames of each frame of predicted second feature, and filling the intercepted filling features into that frame of predicted second feature; and obtaining, according to each frame of predicted second feature after filling, the predicted third feature corresponding to that frame of predicted second feature.
In the neural network model constructed by executing S202 in this embodiment, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature after filling, an optional implementation manner that may be adopted is: obtaining the posterior feature of each frame of predicted second feature after filling, where the posterior feature in this embodiment is obtained by the posterior network layer processing the input filled predicted second feature; and removing, from the posterior features, the posterior features corresponding to the filling features, so as to obtain the predicted third feature corresponding to each frame of predicted second feature.
In addition, in the neural network model constructed in S202, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature according to that frame of predicted second feature and its adjacent feature frames, the following may further be included: acquiring the number of frames of predicted second features that have been obtained; and, in a case that the acquired number of frames meets a first preset requirement, obtaining, according to each frame of predicted second feature and its adjacent feature frames, the predicted third feature corresponding to that frame of predicted second feature.
In this embodiment, the posterior network layer may determine that the obtained number of frames of the predicted second feature satisfies the first preset requirement when it is determined that the obtained number of frames of the predicted second feature exceeds the first frame number threshold.
In the neural network model constructed in step S202, the vocoder is configured to obtain, according to each frame of predicted third feature and its adjacent feature frames, the predicted audio data corresponding to that frame of predicted third feature.
Specifically, in the neural network model constructed in step S202, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature and its adjacent feature frames, an optional implementation manner that may be adopted is: determining the adjacent feature frames of each frame of predicted third feature from the multiple frames of predicted third features; intercepting filling features from the adjacent feature frames of each frame of predicted third feature, and filling the intercepted filling features into that frame of predicted third feature; and obtaining, according to each frame of predicted third feature after filling, the predicted audio data corresponding to that frame of predicted third feature.
In the neural network model constructed in S202, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature after filling, an optional implementation manner that may be adopted is: acquiring the predicted audio data of each frame of predicted third feature after filling; and removing, from the predicted audio data, the predicted audio data corresponding to the filling features, so as to obtain the predicted audio data corresponding to each frame of predicted third feature.
In addition, in the neural network model constructed in step S202, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature according to that frame of predicted third feature and its adjacent feature frames, the following may further be included: acquiring the number of frames of predicted third features that have been obtained; and, in a case that the acquired number of frames meets a second preset requirement, obtaining, according to each frame of predicted third feature and its adjacent feature frames, the predicted audio data corresponding to that frame of predicted third feature.
The vocoder may determine that the obtained number of predicted third features satisfies a second preset requirement, in a case that it is determined that the obtained number of predicted third features exceeds a second frame number threshold.
In this embodiment, after the step S202 of constructing the neural network model including the encoder, the decoder, the posterior network layer and the vocoder is performed, the step S203 of training the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts is performed until the neural network model converges to obtain the speech synthesis model.
Specifically, when performing S203 to train the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts, the present embodiment may adopt an optional implementation manner as follows: respectively inputting phoneme sequences of the plurality of training texts into the neural network model, and acquiring predicted audio data output by the neural network model for each training text; calculating a first loss function value according to the training audio data and the predicted audio data of the plurality of training texts; and adjusting parameters of the encoder, the decoder, the posterior network layer and the vocoder in the neural network model according to the calculated first loss function value until the first loss function value converges, so as to obtain the speech synthesis model.
In addition, in the data set acquired in step S201 in this embodiment, if training acoustic features corresponding to phoneme sequences of training texts are acquired at the same time, in this embodiment, when S203 is executed to train a neural network model based on a plurality of training texts and training audio data of the plurality of training texts, an optional implementation manner that may be adopted is: respectively inputting the phoneme sequences of the training texts into a neural network model, and acquiring a predicted second characteristic and predicted audio data output by the neural network model aiming at each training text; calculating a first loss function value according to training audio data and predicted audio data of a plurality of training texts, and calculating a second loss function value according to predicted second characteristics and training acoustic characteristics of a plurality of training texts; and adjusting parameters of a vocoder in the neural network model according to the first loss function value obtained by calculation, and adjusting parameters of an encoder, a decoder and a posterior network layer in the neural network model according to the second loss function value obtained by calculation until the first loss function value and the second loss function value are converged to obtain the speech synthesis model.
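A sketch of one training step under the second variant is given below, reusing the hypothetical model skeleton sketched earlier; the L1 losses, the Tacotron2-style double acoustic loss, the detach that confines the audio loss to the vocoder, and the two-optimizer split are assumptions about how the per-loss parameter updates could be realized, not details fixed by the present disclosure.

```python
import torch.nn.functional as F

def training_step(model, phoneme_ids, train_mel, train_audio, opt_acoustic, opt_vocoder):
    """Sketch of one training step: an acoustic-feature loss updates the encoder, decoder
    and posterior network layer, while an audio loss updates the vocoder."""
    first = model.encoder(phoneme_ids)               # predicted first feature
    second = model.decoder(first)                    # multi-frame predicted second features
    third = second + model.post_net(second)          # predicted third features

    # Second loss: predicted acoustic features vs. training acoustic features.  Applying it
    # both before and after the posterior network layer (Tacotron2 practice, an assumption
    # here) lets the posterior network layer receive a gradient as well.
    loss_acoustic = F.l1_loss(second, train_mel) + F.l1_loss(third, train_mel)
    opt_acoustic.zero_grad()
    loss_acoustic.backward()
    opt_acoustic.step()                              # encoder, decoder, posterior network layer

    # First loss: predicted audio vs. training audio; detaching keeps this loss from
    # reaching the acoustic model, so it only adjusts the vocoder parameters.
    pred_audio = model.vocoder(third.detach())
    loss_audio = F.l1_loss(pred_audio, train_audio)
    opt_vocoder.zero_grad()
    loss_audio.backward()
    opt_vocoder.step()                               # vocoder
    return loss_acoustic.item(), loss_audio.item()
```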
Fig. 3 is a schematic diagram according to a third embodiment of the present disclosure. A flow chart of the embodiment when a second feature of multiple frames and a third feature of multiple frames are obtained is shown in fig. 3, M in fig. 3 represents the second feature output by the decoder, and M' represents the third feature output by the a posteriori network layer; in fig. 3, the second feature output by the decoder is also put into the first buffer, and the third feature output by the a posteriori network layer is put into the second buffer.
Fig. 4 is a schematic diagram according to a fourth embodiment of the present disclosure. A flowchart of the present embodiment when audio data corresponding to the multiple frames of third features is obtained is shown in fig. 4, where M in fig. 4 denotes the third feature, and W denotes the audio data.
Fig. 5 is a schematic diagram according to a fifth embodiment of the present disclosure. As shown in fig. 5, the speech synthesis apparatus 500 of the present embodiment includes:
the first obtaining unit 501 is configured to obtain a text to be processed, and obtain a phoneme sequence of the text to be processed;
the first processing unit 502 is configured to encode the phoneme sequence to obtain a first feature;
the second processing unit 503 is configured to decode the first feature to obtain multiple frames of second features;
the third processing unit 504 is configured to obtain a third feature corresponding to each frame of the second feature according to each frame of the second feature and an adjacent feature frame of each frame of the second feature;
and the synthesizing unit 505 is configured to obtain audio data corresponding to each frame of third features according to each frame of third features and the adjacent feature frame of each frame of third features.
When the first obtaining unit 501 obtains the text to be processed, the text input by the input terminal or the text selected by the input terminal may be used as the text to be processed.
When the first obtaining unit 501 obtains the phoneme sequence of the text to be processed, it may first obtain a phoneme corresponding to each character in the text to be processed, and then form a phoneme sequence of the text to be processed according to phonemes corresponding to all characters; before the first obtaining unit 501 obtains the phoneme sequence of the text to be processed, punctuation marks in the text to be processed may be removed.
In the embodiment, after the first obtaining unit 501 obtains the phoneme sequence of the text to be processed, the first processing unit 502 encodes the phoneme sequence to obtain the first feature.
When the first processing unit 502 encodes the phoneme sequence to obtain the first feature, the optional implementation manners that may be adopted are: and acquiring the encoder features of the phoneme sequences, and taking the acquired encoder features as first features.
When the phoneme sequence is encoded to obtain the first feature, the first processing unit 502 may input the phoneme sequence to a speech synthesis model trained in advance, encode the phoneme sequence by an encoder in the speech synthesis model, and set the output encoder feature as the first feature.
In this embodiment, after the first processing unit 502 obtains the first feature, the second processing unit 503 decodes the first feature to obtain multiple frames of second features; the second feature obtained by the second processing unit 503 is an acoustic feature, such as a Mel-frequency spectrum (Mel spectrum).
When the second processing unit 503 decodes the first feature to obtain the multi-frame second feature, the optional implementation manners that can be adopted are as follows: acquiring the decoder characteristic of the first characteristic, and acquiring a plurality of frames of second characteristics according to the acquired decoder characteristic; it is understood that the decoder feature obtained by the second processing unit 503 may be a multi-frame second feature directly, or may be a single decoder feature obtained by slicing, and the slicing result is used as a multi-frame second feature.
When the first feature is decoded to obtain the multi-frame second feature, the second processing unit 503 may input the first feature into a pre-trained speech synthesis model, and decode the first feature by a decoder in the speech synthesis model, so as to use the output multi-frame decoder feature as the multi-frame second feature.
After the second processing unit 503 obtains multiple frames of second features, the embodiment obtains a third feature corresponding to each frame of second features by the third processing unit 504 according to each frame of second features and the adjacent feature frame of each frame of second features.
Specifically, when the third processing unit 504 obtains the third feature corresponding to each frame of the second feature according to each frame of the second feature and the adjacent feature frame of each frame of the second feature, the optional implementation manners that can be adopted are as follows: determining adjacent feature frames of the second features of each frame from the second features of the frames; intercepting filling features from adjacent feature frames of each frame of second features, and filling the intercepted filling features into each frame of second features; and obtaining a third feature corresponding to each frame of second feature according to each frame of second feature after filling.
That is to say, the third processing unit 504 obtains the third feature corresponding to each frame of second feature in a manner of filling each frame of second feature with the filling feature intercepted from the adjacent feature frame of each frame of second feature, and since each frame of second feature after the filling processing includes a part of features in the adjacent feature frame, the boundary of each frame of second feature can be accurately calculated, and the accuracy of the obtained third feature corresponding to each frame of second feature is improved.
The third processing unit 504 may intercept a filling feature from an adjacent feature frame of each frame of second features, and when the intercepted filling feature is filled into each frame of second features, intercept a feature located at a first preset position of the adjacent feature frame as the filling feature according to a positional relationship between each frame of second features and the adjacent feature frame, and fill the intercepted filling feature into a second preset position of each frame of second features.
In this embodiment, the first preset position and the second preset position have a corresponding relationship, if the first preset position is a beginning, the second preset position is an end, and if the first preset position is an end, the second preset position is a beginning.
When the third processing unit 504 obtains the third feature corresponding to each frame of second feature according to each frame of second feature after filling, the optional implementation manner that can be adopted is as follows: obtaining posterior features of each frame of second features after filling; and removing the posterior features corresponding to the filling features from the posterior features to obtain third features corresponding to the second features of each frame.
That is to say, when the third processing unit 504 obtains the third feature corresponding to each frame of the second feature, the posterior features corresponding to the filling features in the posterior features of the second feature may be removed, so as to improve the accuracy of the obtained third feature and avoid the repetition of different finally generated audio data.
When the third processing unit 504 obtains the third feature corresponding to each frame of the second feature according to each frame of the second feature and the adjacent feature frame of each frame of the second feature, the third processing unit may further input the each frame of the second feature and the adjacent feature frame of each frame of the second feature into a speech synthesis model obtained by pre-training, and process the each frame of the second feature and the adjacent feature frame of each frame of the second feature by an a posteriori network layer in the speech synthesis model to output the third feature corresponding to each frame of the second feature.
In addition, when obtaining the third feature corresponding to the second feature of each frame according to the second feature of each frame and the adjacent feature frame of the second feature of each frame, the third processing unit 504 may further include the following: acquiring the frame number of the obtained second characteristic; and under the condition that the number of the acquired frames meets a first preset requirement, obtaining a third characteristic corresponding to each frame of the second characteristic according to each frame of the second characteristic and an adjacent characteristic frame of each frame of the second characteristic.
The third processing unit 504 may determine that the obtained number of frames of the second feature satisfies the first preset requirement, in a case that it is determined that the obtained number of frames of the second feature exceeds the first frame number threshold.
After the third processing unit 504 obtains the third feature corresponding to the second feature of each frame, the present embodiment obtains the audio data corresponding to the third feature of each frame by the synthesizing unit 505 according to the third feature of each frame and the adjacent feature frame of the third feature of each frame.
Specifically, when the synthesis unit 505 obtains the audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame, the optional implementation manners that may be adopted are: determining adjacent feature frames of the third features of each frame from the third features of the plurality of frames; intercepting filling features from adjacent feature frames of each frame of third features, and filling the intercepted filling features into each frame of third features; and obtaining audio data corresponding to the third feature of each frame according to the filled third feature of each frame.
Similarly, when the synthesizing unit 505 intercepts the filling feature from the adjacent feature frame of each frame of the third feature and fills the intercepted filling feature into each frame of the third feature, it may intercept a feature located at a first preset position of the adjacent feature frame as the filling feature according to a position relationship between each frame of the third feature and the adjacent feature frame, and fill the intercepted filling feature into a second preset position of each frame of the third feature; the first preset position and the second preset position have a corresponding relation.
That is to say, the synthesizing unit 505 obtains the audio data corresponding to each frame of third feature by filling each frame of third feature with the filling feature intercepted from the adjacent feature frame of each frame of third feature, and since each frame of third feature after the filling processing includes a part of features in the adjacent feature frame, the boundary of each frame of third feature can be accurately calculated, and the continuity between the obtained audio data corresponding to different third features is improved.
When the synthesizing unit 505 obtains the audio data corresponding to each frame of third feature according to that frame of third feature after filling, the following optional implementation manner may be adopted: acquiring the audio data of each frame of third feature after filling; and removing the audio data corresponding to the filling features from the audio data to obtain the audio data corresponding to each frame of third feature.
That is to say, when the synthesis unit 505 obtains the audio data corresponding to the third feature of each frame, the audio data corresponding to the filler feature in the audio data may be removed, so as to avoid duplication between the generated audio data corresponding to different third features.
When the audio data corresponding to each frame of third feature is obtained according to each frame of third feature and the adjacent feature frame of each frame of third feature, the synthesizing unit 505 may further input the each frame of third feature and the adjacent feature frame of each frame of third feature into a speech synthesis model obtained by pre-training, and a vocoder in the speech synthesis model processes the each frame of third feature and the adjacent feature frame of each frame of third feature to obtain the audio data corresponding to each frame of third feature.
In addition, when the synthesis unit 505 obtains the audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame, the following contents may also be included: acquiring the frame number of the obtained third characteristic; and under the condition that the acquired frame number meets a second preset requirement, obtaining audio data corresponding to each frame of third feature according to each frame of third feature and the adjacent feature frame of each frame of third feature.
Wherein, the synthesizing unit 505 may determine that the obtained number of frames of the third feature satisfies a second preset requirement in a case that it is determined that the obtained number of frames of the third feature exceeds a second frame number threshold.
After obtaining the audio data corresponding to each frame of third feature, the synthesizing unit 505 may sequentially return the obtained audio data to the input end, thereby achieving streaming speech synthesis; the embodiment can therefore reduce the response time of speech synthesis and increase the speed of speech synthesis.
Fig. 6 is a schematic diagram according to a sixth embodiment of the present disclosure. As shown in fig. 6, the training apparatus 600 for a speech synthesis model according to the present embodiment includes:
a second obtaining unit 601, configured to obtain a data set, where the data set includes a plurality of training texts and training audio data of the plurality of training texts;
the constructing unit 602 is configured to construct a neural network model including an encoder, a decoder, an a posteriori network layer, and a vocoder, where the encoder is configured to obtain a predicted first feature according to a phoneme sequence of a training text, the decoder is configured to obtain a multi-frame predicted second feature according to the predicted first feature, the a posteriori network layer is configured to obtain a predicted third feature corresponding to the predicted second feature of each frame according to the predicted second feature of each frame and an adjacent feature frame of the predicted second feature of each frame, and the vocoder is configured to obtain predicted audio data corresponding to the predicted third feature of each frame according to the predicted third feature of each frame and the adjacent feature frame of the predicted third feature of each frame;
the training unit 603 is configured to train the neural network model based on the training texts and the training audio data of the training texts until the neural network model converges to obtain the speech synthesis model.
After the second obtaining unit 601 obtains the data set containing the plurality of training texts and their training audio data, the data set may be further divided into a training set, an evaluation set, and a test set; the training set is used to train the neural network model, the evaluation set is used to evaluate the neural network model during training, and the test set is used to test the effect of the trained neural network model.
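The split can be as simple as the following sketch; the 8:1:1 ratio and the fixed random seed are assumed example values, not requirements of the disclosure.

```python
import random

def split_dataset(samples, train_ratio=0.8, eval_ratio=0.1, seed=0):
    """Shuffle (text, audio) pairs and split them into train/eval/test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    n_train = int(len(samples) * train_ratio)
    n_eval = int(len(samples) * eval_ratio)
    return (samples[:n_train],
            samples[n_train:n_train + n_eval],
            samples[n_train + n_eval:])
```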
After acquiring the data set, the second obtaining unit 601 may further extract training acoustic features from the training audio data and align the phoneme sequence of each training text with the extracted training acoustic features; the training acoustic features are extracted so that loss functions for the encoder and decoder of the neural network model can be computed and the parameters of the encoder and decoder updated accordingly.
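A hedged sketch of the feature-extraction step using librosa log-mel spectrograms; the library choice and the parameter values (sampling rate, 80 mel bands, hop length 256) are common defaults assumed for illustration, and the subsequent phoneme-to-frame alignment step is not shown.

```python
import librosa
import numpy as np

def extract_training_acoustic_features(wav_path, sr=22050, n_mels=80, hop_length=256):
    """Load a training audio file and return a log-mel spectrogram of shape
    (n_frames, n_mels) to be aligned with the phoneme sequence."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return np.log(mel + 1e-6).T
```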
After the second obtaining unit 601 acquires the data set, the constructing unit 602 of this embodiment constructs a neural network model including an encoder, a decoder, a posterior network layer, and a vocoder.
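A skeletal PyTorch sketch of this four-module decomposition is given below; the use of PyTorch, the layer choices (an embedding plus linear encoder, a GRU decoder, single convolutions standing in for the posterior network layer and for a real neural vocoder), and all dimensions are assumptions for illustration, since the disclosure only fixes the encoder/decoder/posterior-network-layer/vocoder structure.

```python
import torch.nn as nn

class SpeechSynthesisNet(nn.Module):
    """Skeletal encoder -> decoder -> posterior network layer -> vocoder stack."""

    def __init__(self, n_phonemes=100, enc_dim=256, mel_dim=80):
        super().__init__()
        self.encoder = nn.Sequential(nn.Embedding(n_phonemes, enc_dim),
                                     nn.Linear(enc_dim, enc_dim))
        self.decoder = nn.GRU(enc_dim, mel_dim, batch_first=True)              # predicts second features
        self.post_net = nn.Conv1d(mel_dim, mel_dim, kernel_size=5, padding=2)  # posterior network layer
        self.vocoder = nn.Conv1d(mel_dim, 1, kernel_size=1)                    # stand-in for a neural vocoder

    def forward(self, phoneme_ids):                       # phoneme_ids: (batch, seq_len) LongTensor
        first = self.encoder(phoneme_ids)                 # predicted first feature
        second, _ = self.decoder(first)                   # predicted second features, frame by frame
        third = self.post_net(second.transpose(1, 2))     # predicted third features
        audio = self.vocoder(third)                       # predicted audio data (toy: one sample per frame)
        return second, third.transpose(1, 2), audio.squeeze(1)
```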
In the neural network model constructed by the construction unit 602, the encoder, the decoder, and the posterior network layer together form an acoustic model, which converts input text into acoustic features such as mel-spectrogram features; the acoustic model in this embodiment is based on the Tacotron2 structure.
In the neural network model constructed by the construction unit 602, the encoder is configured to obtain a predicted first feature according to the phoneme sequence of the training text, where the predicted first feature is an encoder feature.
In the neural network model constructed by the construction unit 602, the decoder is configured to obtain multi-frame predicted second features according to the predicted first feature, where the predicted second feature is an acoustic feature such as a mel-spectrogram feature.
In the neural network model constructed by the constructing unit 602, the decoder outputs the predicted second feature of the current frame and the decoder state of the current frame according to the predicted first feature, the predicted second feature of the previous frame, and the decoder state of the previous frame; that is, the decoder in this embodiment outputs the predicted second features frame by frame.
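This frame-by-frame behaviour can be pictured as a single step function that consumes the previous frame and the previous decoder state and emits the current frame and the new state; in the sketch below a GRUCell stands in for the actual decoder cell, and a fixed-size context vector stands in for the attention over the predicted first feature, both of which are assumptions.

```python
import torch
import torch.nn as nn

class StreamingDecoderStep(nn.Module):
    """One autoregressive decoder step:
    (first feature context, previous second feature, previous state)
    -> (current second feature, current state)."""

    def __init__(self, enc_dim=256, mel_dim=80, state_dim=512):
        super().__init__()
        self.cell = nn.GRUCell(enc_dim + mel_dim, state_dim)
        self.proj = nn.Linear(state_dim, mel_dim)

    def forward(self, first_feature, prev_second_feature, prev_state):
        step_input = torch.cat([first_feature, prev_second_feature], dim=-1)
        state = self.cell(step_input, prev_state)
        second_feature = self.proj(state)   # predicted second feature of the current frame
        return second_feature, state
```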
In the neural network model constructed by the construction unit 602, the posterior network layer is configured to obtain, from each frame of predicted second feature and its adjacent feature frame, the predicted third feature corresponding to that frame of predicted second feature; the posterior network layer in this embodiment may be a Post Net (posterior network) formed by a 5-layer convolutional network.
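A sketch of such a 5-layer convolutional Post Net is shown below; the channel width, kernel size, batch normalization, and tanh activations are assumptions borrowed from common Tacotron2-style implementations, while the disclosure only specifies the 5-layer convolutional structure.

```python
import torch.nn as nn

def build_post_net(mel_dim=80, channels=512, kernel_size=5, n_layers=5):
    """Build a 5-layer 1-D convolutional posterior network (Post Net) sketch."""
    layers = []
    in_ch = mel_dim
    for i in range(n_layers):
        out_ch = mel_dim if i == n_layers - 1 else channels
        layers += [nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
                   nn.BatchNorm1d(out_ch)]
        if i < n_layers - 1:        # no activation after the last layer
            layers.append(nn.Tanh())
        in_ch = out_ch
    return nn.Sequential(*layers)
```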
Specifically, in the neural network model constructed by the construction unit 602, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature from that frame and its adjacent feature frame, an optional implementation is: determine the adjacent feature frame of each frame of predicted second feature from the multiple frames of predicted second features; intercept a filling feature from the adjacent feature frame of each frame of predicted second feature and fill it into that frame; and obtain the predicted third feature corresponding to each frame of predicted second feature from the padded frame of predicted second feature.
In the neural network model constructed by the construction unit 602, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature from the padded frame of predicted second feature, an optional implementation is: obtain the posterior feature of each padded frame of predicted second feature, where the posterior feature in this embodiment is produced by the posterior network layer processing the input padded predicted second feature; and remove the posterior features corresponding to the filling features, to obtain the predicted third feature corresponding to each frame of predicted second feature.
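Putting the padding and removal steps together, the posterior-network side mirrors the vocoder-side padding sketched earlier; a minimal illustration, in which post_net is any nn.Module (for example the Post Net built above) and the padding widths are assumed values supplied by the caller:

```python
import torch
import torch.nn as nn

def postnet_on_padded_frame(post_net: nn.Module, padded_second: torch.Tensor,
                            n_left_pad: int, n_right_pad: int) -> torch.Tensor:
    """Run the posterior network on a padded frame of predicted second feature
    and strip the posterior features that correspond to the padding, keeping
    only the predicted third feature of the frame itself.

    padded_second: tensor of shape (1, mel_dim, n_left_pad + frame_len + n_right_pad).
    """
    posterior = post_net(padded_second)          # posterior features of the padded frame
    end = posterior.size(-1) - n_right_pad
    return posterior[:, :, n_left_pad:end]       # predicted third feature for the frame
```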
In addition, in the neural network model constructed by the construction unit 602, when the posterior network layer obtains the predicted third feature corresponding to each frame of predicted second feature from that frame and its adjacent feature frame, the following may also be included: acquire the number of frames of predicted second feature obtained so far; and, when the acquired frame count meets a first preset requirement, obtain the predicted third feature corresponding to each frame of predicted second feature from that frame and its adjacent feature frame.
In this embodiment, the posterior network layer may determine that the first preset requirement is met when the number of obtained frames of predicted second feature exceeds a first frame number threshold.
In the neural network model constructed by the construction unit 602, the vocoder is configured to obtain, from each frame of predicted third feature and its adjacent feature frame, the predicted audio data corresponding to that frame of predicted third feature.
Specifically, in the neural network model constructed by the construction unit 602, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature from that frame and its adjacent feature frame, an optional implementation is: determine the adjacent feature frame of each frame of predicted third feature from the multiple frames of predicted third features; intercept a filling feature from the adjacent feature frame of each frame of predicted third feature and fill it into that frame; and obtain the predicted audio data corresponding to each frame of predicted third feature from the padded frame of predicted third feature.
In the neural network model constructed by the constructing unit 602, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature from the padded frame of predicted third feature, an optional implementation is: acquire the predicted audio data of each padded frame of predicted third feature; and remove from it the predicted audio data corresponding to the filling features, to obtain the predicted audio data corresponding to each frame of predicted third feature.
In addition, in the neural network model constructed by the constructing unit 602, when the vocoder obtains the predicted audio data corresponding to each frame of predicted third feature from that frame and its adjacent feature frame, the following may also be included: acquire the number of frames of predicted third feature obtained so far; and, when the acquired frame count meets a second preset requirement, obtain the predicted audio data corresponding to each frame of predicted third feature from that frame and its adjacent feature frame.
The vocoder may determine that the second preset requirement is met when the number of obtained frames of predicted third feature exceeds a second frame number threshold.
In this embodiment, after the construction unit 602 constructs the neural network model including the encoder, the decoder, the posterior network layer, and the vocoder, the training unit 603 trains the neural network model based on the training texts and the training audio data of the training texts until the neural network model converges to obtain the speech synthesis model.
Specifically, when the training unit 603 trains the neural network model based on the plurality of training texts and the training audio data of the plurality of training texts, an optional implementation is: input the phoneme sequences of the training texts into the neural network model and obtain the predicted audio data output by the neural network model for each training text; calculate a first loss function value from the training audio data and the predicted audio data of the training texts; and adjust the parameters of the encoder, decoder, posterior network layer, and vocoder in the neural network model according to the calculated first loss function value until the first loss function value converges, thereby obtaining the speech synthesis model.
In addition, if the data set acquired by the second obtaining unit 601 also contains the training acoustic features corresponding to the phoneme sequences of the training texts, then when the training unit 603 trains the neural network model based on the plurality of training texts and their training audio data, an optional implementation is: input the phoneme sequences of the training texts into the neural network model and obtain the predicted second features and the predicted audio data output by the neural network model for each training text; calculate a first loss function value from the training audio data and the predicted audio data of the training texts, and a second loss function value from the predicted second features and the training acoustic features of the training texts; and adjust the parameters of the vocoder in the neural network model according to the calculated first loss function value, and the parameters of the encoder, decoder, and posterior network layer according to the calculated second loss function value, until both loss function values converge, thereby obtaining the speech synthesis model.
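A hedged sketch of one training step for this two-loss variant, reusing the toy SpeechSynthesisNet above: the L1 losses, the two separate optimizers (one over the encoder/decoder/posterior-network parameters, one over the vocoder parameters), and the extra loss term on the predicted third features (borrowed from Tacotron2-style training so that the posterior network layer also receives gradient) are all assumptions; the disclosure only specifies which loss value is used to adjust which modules.

```python
import torch.nn.functional as F

def train_step(model, phoneme_ids, train_audio, train_acoustic_feat,
               opt_acoustic, opt_vocoder):
    """One training step: opt_acoustic holds the encoder/decoder/posterior-network
    parameters, opt_vocoder holds the vocoder parameters; shapes follow the toy
    SpeechSynthesisNet sketch above."""
    pred_second, pred_third, _ = model(phoneme_ids)

    # Second loss: predicted acoustic features vs. training acoustic features.
    loss_second = (F.l1_loss(pred_second, train_acoustic_feat)
                   + F.l1_loss(pred_third, train_acoustic_feat))
    opt_acoustic.zero_grad()
    loss_second.backward()
    opt_acoustic.step()

    # First loss: predicted audio vs. training audio; detaching the third
    # feature keeps this loss from updating the acoustic model.
    pred_audio = model.vocoder(pred_third.detach().transpose(1, 2)).squeeze(1)
    loss_first = F.l1_loss(pred_audio, train_audio)
    opt_vocoder.zero_grad()
    loss_first.backward()
    opt_vocoder.step()

    return loss_first.item(), loss_second.item()
```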
In the technical solution of the present disclosure, the collection, storage, and use of the personal information of the users involved all comply with the relevant laws and regulations and do not violate public order or good morals.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
Fig. 7 is a block diagram of an electronic device for the speech synthesis method or the method of training a speech synthesis model according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not intended to limit implementations of the disclosure described and/or claimed herein.
As shown in Fig. 7, the device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 701 performs the methods and processes described above, such as the speech synthesis method or the method of training a speech synthesis model. For example, in some embodiments, the speech synthesis method or the method of training a speech synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708.
In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the speech synthesis method or the method of training a speech synthesis model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other suitable way (for example, by means of firmware) to perform the speech synthesis method or the method of training a speech synthesis model.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable speech synthesis or speech synthesis model training apparatus such that the program codes, when executed by the processor or controller, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a presentation device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for presenting information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service expansibility of traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations and substitutions are possible, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (27)

1. A method of speech synthesis comprising:
acquiring a text to be processed to obtain a phoneme sequence of the text to be processed;
coding the phoneme sequence to obtain a first characteristic;
decoding the first characteristic to obtain a multi-frame second characteristic;
obtaining a third feature corresponding to each frame of second feature according to each frame of second feature and an adjacent feature frame of each frame of second feature;
and obtaining audio data corresponding to each frame of third features according to each frame of third features and adjacent feature frames of each frame of third features.
2. The method of claim 1, wherein the deriving a third feature corresponding to the second feature of each frame according to the second feature of each frame and the adjacent feature frame of the second feature of each frame comprises:
acquiring the frame number of the second characteristic;
and under the condition that the frame number meets a first preset requirement, obtaining a third feature corresponding to each frame of second feature according to each frame of second feature and the adjacent feature frame of each frame of second feature.
3. The method according to any one of claims 1-2, wherein the deriving a third feature corresponding to each frame of second features according to each frame of second features and the adjacent feature frame of each frame of second features comprises:
determining adjacent feature frames of the second feature of each frame from the second features of the plurality of frames;
intercepting filling features from adjacent feature frames of the second features of each frame, and filling the filling features into the second features of each frame;
and obtaining the third feature corresponding to each frame of second feature according to each frame of second feature after filling.
4. The method of claim 3, wherein the deriving the third feature corresponding to each frame of the second features according to each frame of the second features after the padding comprises:
obtaining posterior features of each frame of second features after filling;
and removing posterior features corresponding to the filling features from the posterior features to obtain third features corresponding to each frame of second features.
5. The method according to any one of claims 1-4, wherein the deriving audio data corresponding to each frame of third features according to each frame of third features and the adjacent feature frame of each frame of third features comprises:
acquiring the frame number of the third characteristic;
and under the condition that the frame number meets a second preset requirement, obtaining audio data corresponding to each frame of third feature according to each frame of third feature and the adjacent feature frame of each frame of third feature.
6. The method according to any one of claims 1-5, wherein the deriving audio data corresponding to each frame of third features from the third features of each frame and the neighboring feature frames of the third features of each frame comprises:
determining an adjacent feature frame of each frame of third features from the plurality of frames of third features;
intercepting filling features from adjacent feature frames of each frame of third features, and filling the filling features into each frame of third features;
and obtaining the audio data corresponding to the third feature of each frame according to the filled third feature of each frame.
7. The method of claim 6, wherein the deriving the audio data corresponding to each frame of third features from each frame of third features after padding comprises:
acquiring audio data of each frame of third characteristics after filling;
and removing the audio data corresponding to the filling features in the audio data to obtain the audio data corresponding to each frame of third features.
8. The method of claim 1, wherein said encoding the sequence of phonemes resulting in a first feature comprises:
inputting the phoneme sequence into a speech synthesis model;
and coding the phoneme sequence by a coder in the speech synthesis model to obtain the first characteristic output by the coder.
9. The method of claim 1, wherein the decoding the first feature to obtain a multiframe second feature comprises:
inputting the first feature into a speech synthesis model;
and decoding the first characteristic by a decoder in the speech synthesis model to obtain the multiframe second characteristic output by the decoder.
10. The method of claim 1, wherein the deriving a third feature corresponding to the second feature of each frame according to the second feature of each frame and the adjacent feature frame of the second feature of each frame comprises:
inputting the second feature of each frame and the adjacent feature frame of the second feature of each frame into a speech synthesis model;
and processing each frame of second feature and an adjacent feature frame of each frame of second feature by a posterior network layer in the speech synthesis model to obtain a third feature which is output by the posterior network layer and corresponds to each frame of second feature.
11. The method of claim 1, wherein the deriving audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame comprises:
inputting the third feature of each frame and the adjacent feature frame of the third feature of each frame into a speech synthesis model;
and processing the third feature of each frame and the adjacent feature frame of the third feature of each frame by a vocoder in the voice synthesis model to obtain the audio data which is output by the vocoder and corresponds to the third feature of each frame.
12. A method of training a speech synthesis model, comprising:
acquiring a data set, wherein the data set comprises a plurality of training texts and training audio data of the training texts;
constructing a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder, wherein the encoder is used for obtaining a first predicted characteristic according to a phoneme sequence of a training text, the decoder is used for obtaining a multi-frame predicted second characteristic according to the first predicted characteristic, the posterior network layer is used for obtaining a predicted third characteristic corresponding to the second predicted characteristic of each frame according to the second predicted characteristic of each frame and an adjacent characteristic frame of the second predicted characteristic of each frame, and the vocoder is used for obtaining predicted audio data corresponding to the third predicted characteristic of each frame according to the third predicted characteristic of each frame and the adjacent characteristic frame of the third predicted characteristic of each frame;
and training the neural network model based on the training texts and the training audio data of the training texts until the neural network model converges to obtain the speech synthesis model.
13. The method of claim 10, wherein the obtaining, by the posterior network layer according to each frame of predicted second features and the adjacent feature frame of each frame of predicted second features, a predicted third feature corresponding to each frame of predicted second features comprises:
determining adjacent feature frames of each frame predicting the second feature from the plurality of frames predicting the second feature;
intercepting filling features from adjacent feature frames of the predicted second feature of each frame, and filling the filling features into the predicted second feature of each frame;
and predicting the second characteristic according to each frame after filling to obtain a predicted third characteristic corresponding to the predicted second characteristic of each frame.
14. The method of claim 13, wherein the obtaining, by the posterior network layer according to each frame of padded predicted second features, a predicted third feature corresponding to each frame of predicted second features comprises:
obtaining posterior features of the filled predicted second features of each frame;
and removing the posterior features corresponding to the filling features in the posterior features to obtain the predicted third features corresponding to the predicted second features of each frame.
15. The method of any of claims 12-14, wherein the obtaining, by the vocoder according to each frame of predicted third features and the adjacent feature frame of each frame of predicted third features, predicted audio data corresponding to each frame of predicted third features comprises:
determining an adjacent feature frame of each frame predicting the third feature from the plurality of frames predicting the third feature;
intercepting filling features from adjacent feature frames of the predicted third feature of each frame, and filling the filling features into the predicted third feature of each frame;
and predicting the third characteristic according to each frame after filling to obtain the predicted audio data corresponding to the predicted third characteristic of each frame.
16. The method of claim 15, wherein the obtaining, by the vocoder according to each frame of padded predicted third features, the predicted audio data corresponding to each frame of predicted third features comprises:
acquiring the prediction audio data of each frame after filling to predict the third characteristic;
and removing the predicted audio data corresponding to the filling features in the predicted audio data to obtain the predicted audio data corresponding to the predicted third feature of each frame.
17. A speech synthesis apparatus comprising:
a first acquisition unit, configured to acquire a text to be processed and obtain a phoneme sequence of the text to be processed;
the first processing unit is used for coding the phoneme sequence to obtain a first characteristic;
the second processing unit is used for decoding the first characteristic to obtain a plurality of frames of second characteristics;
the third processing unit is used for obtaining a third feature corresponding to each frame of second feature according to each frame of second feature and an adjacent feature frame of each frame of second feature;
and the synthesis unit is used for obtaining the audio data corresponding to the third feature of each frame according to the third feature of each frame and the adjacent feature frame of the third feature of each frame.
18. The apparatus according to claim 17, wherein the third processing unit, when obtaining a third feature corresponding to each frame of the second feature according to each frame of the second feature and an adjacent feature frame of each frame of the second feature, specifically performs:
acquiring the frame number of the second characteristic;
and under the condition that the frame number meets a first preset requirement, obtaining a third feature corresponding to each frame of second feature according to each frame of second feature and the adjacent feature frame of each frame of second feature.
19. The apparatus according to any one of claims 17 to 18, wherein the third processing unit, when obtaining a third feature corresponding to each frame of second features according to each frame of second features and an adjacent feature frame of each frame of second features, specifically performs:
determining adjacent feature frames of the second feature of each frame from the plurality of frames of the second feature;
intercepting filling features from adjacent feature frames of the second features of each frame, and filling the filling features into the second features of each frame;
and obtaining the third feature corresponding to the second feature of each frame according to the second feature of each frame after filling.
20. The apparatus according to claim 19, wherein the third processing unit, when obtaining the third feature corresponding to each frame of second features according to each frame of second features after the padding, specifically performs:
obtaining posterior features of each frame of second features after filling;
and removing posterior features corresponding to the filling features from the posterior features to obtain third features corresponding to each frame of second features.
21. The apparatus according to any one of claims 17 to 20, wherein the synthesis unit, when obtaining the audio data corresponding to the third feature of each frame from the third feature of each frame and an adjacent feature frame of the third feature of each frame, specifically performs:
acquiring the frame number of the third characteristic;
and under the condition that the frame number meets a second preset requirement, obtaining audio data corresponding to each frame of third feature according to each frame of third feature and the adjacent feature frame of each frame of third feature.
22. The apparatus according to any one of claims 17 to 21, wherein the synthesis unit, when obtaining the audio data corresponding to the third feature of each frame from the third feature of each frame and an adjacent feature frame of the third feature of each frame, specifically performs:
determining an adjacent feature frame of each frame of third features from the plurality of frames of third features;
intercepting filling features from adjacent feature frames of each frame of third features, and filling the filling features into each frame of third features;
and obtaining the audio data corresponding to the third feature of each frame according to the filled third feature of each frame.
23. The apparatus according to claim 22, wherein the synthesizing unit, when obtaining the audio data corresponding to the third feature of each frame from the third feature of each frame after the padding, specifically performs:
acquiring audio data of each frame of third characteristics after filling;
and removing the audio data corresponding to the filling features in the audio data to obtain the audio data corresponding to the third features of each frame.
24. An apparatus for training a speech synthesis model, comprising:
a second obtaining unit, configured to obtain a data set, where the data set includes a plurality of training texts and training audio data of the plurality of training texts;
a construction unit, configured to construct a neural network model comprising an encoder, a decoder, a posterior network layer and a vocoder, wherein the encoder is used for obtaining a predicted first feature according to a phoneme sequence of a training text, the decoder is used for obtaining multi-frame predicted second features according to the predicted first feature, the posterior network layer is used for obtaining a predicted third feature corresponding to each frame of predicted second features according to each frame of predicted second features and an adjacent feature frame of each frame of predicted second features, and the vocoder is used for obtaining predicted audio data corresponding to each frame of predicted third features according to each frame of predicted third features and the adjacent feature frame of each frame of predicted third features;
and the training unit is used for training the neural network model based on the training texts and the training audio data of the training texts until the neural network model converges to obtain the speech synthesis model.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-16.
26. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-16.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-16.