WO2022141671A1 - Speech synthesis method and apparatus, device, and storage medium - Google Patents

Speech synthesis method and apparatus, device, and storage medium

Info

Publication number
WO2022141671A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
matching
synthesis
speech
feature
Prior art date
Application number
PCT/CN2021/071672
Other languages
French (fr)
Chinese (zh)
Inventor
周良
孟廷
侯秋侠
刘丹
江源
胡亚军
Original Assignee
科大讯飞股份有限公司
Priority date
Filing date
Publication date
Application filed by 科大讯飞股份有限公司 filed Critical 科大讯飞股份有限公司
Publication of WO2022141671A1 publication Critical patent/WO2022141671A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks

Definitions

  • the present application relates to the technical field of speech processing, and more particularly, to a speech synthesis method, apparatus, device and storage medium.
  • Speech synthesis is a research hot spot in human-computer interaction both in China and abroad. Speech synthesis is the process of converting the input original text to be synthesized into speech output.
  • the traditional speech synthesis model is generally based on an end-to-end speech synthesis scheme: the training text and the corresponding speech data or waveform data are used directly to train the speech synthesis model, and based on the input original text to be synthesized, the trained speech synthesis model can output the synthesized speech directly, or output waveform data from which the corresponding synthesized speech is then obtained.
  • the present application is proposed to provide a speech synthesis method, apparatus, device and storage medium to improve the quality of synthesized speech.
  • the specific solutions are as follows:
  • a speech synthesis method comprising:
  • obtaining the original text to be synthesized;
  • obtaining an auxiliary synthesis feature corresponding to a matching text, where the matching text and the original text have text fragments that match, and the auxiliary synthesis feature is a feature for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • with reference to the auxiliary synthesis feature, speech synthesis is performed on the original text to obtain synthesized speech.
  • the method according to claim 1, wherein the obtaining auxiliary synthesis features corresponding to the matching text comprises:
  • the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matched text is acquired.
  • the auxiliary synthesis features include:
  • the phoneme sequence corresponding to the matching text determined based on the pronunciation audio corresponding to the matching text;
  • the acoustic feature of the pronunciation audio corresponding to the matched text.
  • the obtaining of the matching text that has text fragments matching the original text includes:
  • in the preconfigured template texts, matching texts that match text fragments within the original text are determined.
  • the obtaining of the matching text that has text fragments matching the original text includes:
  • the uploaded text in the uploaded data is acquired as the matching text, the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have text fragments that match.
  • the preconfigured template text includes:
  • Template text in each preconfigured resource package, wherein each resource package includes a template text and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • determining the matching text that matches the text fragment in the original text includes:
  • in the template texts of the preconfigured resource packages, the matching text that matches the text fragment in the original text is determined.
  • the obtaining of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
  • the process of determining the preconfigured resource package includes:
  • the phoneme sequence and prosody information are used as auxiliary synthesis features corresponding to the template text, and the auxiliary synthesis features and the template text are organized into a resource package.
  • the process of determining the preconfigured resource package further includes:
  • the phoneme-level prosodic encoding is incorporated into the resource bundle.
  • the phoneme-level prosodic coding corresponding to the template text is determined based on the template text and the corresponding pronunciation audio, including:
  • the encoding prediction network and the generation network are trained with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the phoneme-level prosodic coding predicted by the trained encoding prediction network is obtained.
  • before acquiring the uploaded text in the uploaded data, the method further includes:
  • the uploaded text is the text segment that is synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the incorrectly synthesized text segment;
  • or, the uploaded text is an extended text that includes a text fragment synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the extended text.
  • the obtaining of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech including:
  • speech synthesis is performed on the original text to obtain synthesized speech.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech further comprising:
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matched text includes:
  • the pronunciation dictionary is queried to determine the phoneme sequences of the other text segments in the original text except the same text segment, which are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech including:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame is determined;
  • the current speech frame is predicted, and after all speech frames are predicted, synthesized speech is composed of the predicted speech frames.
  • the target acoustic features required for predicting the current speech frame are determined based on the context information, the matched text and the acoustic features of the pronunciation audio, including:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio is obtained;
  • based on the correlation degree, the target acoustic feature required for predicting the current speech frame is determined.
  • the obtaining of the correlation degree between the context information and each frame of acoustic features includes:
  • obtaining a first attention weight matrix of the acoustic features to the matched text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matched text;
  • obtaining a second attention weight matrix of the context information to the matched text, where the second attention weight matrix includes the attention weight of the context information to each text unit in the matched text;
  • based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information to the acoustic features is obtained, where the third attention weight matrix includes the attention weight of the context information to the acoustic features of each frame, which is used as the correlation degree between the context information and the acoustic features of each frame.
  • determining the target acoustic feature required for predicting the current speech frame based on the correlation degree includes:
  • Each of the correlation degrees is normalized, and each normalized correlation degree is used as a weight, and the acoustic features of each frame of the pronunciation audio are weighted and added to obtain a target acoustic feature.
  • the prediction of the current speech frame based on the context information and the determined target acoustic feature includes:
  • the target acoustic feature and the context information are fused, and the current speech frame is predicted based on the fusion result.
  • a speech synthesis apparatus comprising:
  • an original text obtaining unit used to obtain the original text to be synthesized
  • Auxiliary synthesis feature acquisition unit used to obtain the auxiliary synthesis feature corresponding to the matching text, the matching text and the original text have matching text fragments, and the auxiliary synthesis feature is determined based on the pronunciation audio corresponding to the matching text.
  • An auxiliary speech synthesis unit configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • a speech synthesis device comprising: a memory and a processor;
  • the memory for storing programs
  • the processor is configured to execute the program to implement each step of the speech synthesis method described above.
  • a storage medium on which a computer program is stored, and when the computer program is executed by a processor, implements each step of the above-mentioned speech synthesis method.
  • a computer program product which, when run on a terminal device, causes the terminal device to execute each step of the above-mentioned speech synthesis method.
  • In the speech synthesis method of the present application, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to the matching text that has text fragments matching the original text, the auxiliary synthesis feature being a feature for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • By referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced in the speech synthesis of the original text, thereby improving the speech synthesis quality of the original text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the solution of the present application can be applied to these two types of speech synthesis systems at the same time.
  • the auxiliary synthesis feature corresponding to the above matching text can be used as the analysis result of the speech synthesis front end, or used to assist in correcting the analysis result of the speech synthesis front end, and the analysis result is then sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the auxiliary synthesis feature corresponding to the matched text can be directly used as the reference information when the speech synthesis system synthesizes the original text.
  • the speech synthesis of the original text is performed with reference to the auxiliary synthesis feature of the present application, which can enrich the reference information during speech synthesis, thereby improving the quality of the synthesized speech.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 illustrates a schematic diagram of a phoneme sequence extraction model architecture;
  • FIG. 3 illustrates a schematic diagram of a synthesis flow of a speech synthesis back end;
  • FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture;
  • FIG. 5 illustrates a process schematic diagram of a prediction-generation network determining phoneme-level prosodic coding;
  • FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present application.
  • the present application provides a speech synthesis solution, which can be applied to various speech synthesis tasks.
  • the speech synthesis solution of the present application can be applied to speech synthesis work in a human-computer interaction scenario, as well as various other scenarios that require speech synthesis.
  • the solution of the present application can be implemented based on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, or the like.
  • the speech synthesis method of the present application may include the following steps:
  • Step S100 acquiring the original text to be synthesized.
  • the original text is the text to be synthesized into speech.
  • the original text may be provided by the user, or may be the text provided by other devices or applications that needs to be synthesized by speech.
  • Step S110 acquiring auxiliary synthesis features corresponding to the matched text, where the matched text and the original text have text fragments that match.
  • the matching text can be text that matches the original text or a text fragment within the original text; for example, if the original text is "this pair of pants is not discounted", the matching text may be "this pair of pants is not discounted" or "discounted".
  • the matching text may also be text that contains text fragments that match the text fragments within the original text. Still taking the above-mentioned original text as an example, the matching text may be "do you have a discount on this dress?", that is, the matching text includes a text fragment "discount” that matches the original text.
  • the matching text may be the text that is pre-configured and stored in the present application.
  • the fixed utterance texts may be recorded in advance, and the utterance texts may be stored. Then, the utterance text matching the original text is searched for among the stored utterance texts as the matching text.
  • there are some fixed utterance texts, such as the prompt content texts that an intelligent customer service or terminal needs to use to prompt the user for information, for example, "May I ask what you need to inquire about", "Hello, how may I help you?", "Press 1 for phone bills, press 2 for data", and so on.
  • the pronunciation audio of these fixed utterance texts may be pre-recorded as prompt sounds and stored together with the utterance texts.
  • the matching text can also be user-uploaded text.
  • when uploading the original text to be synthesized, the user uploads the text in the original text that is prone to synthesis errors as the matching text, and can also upload the pronunciation audio corresponding to the matching text.
  • the synthesis system outputs the initial synthesized speech. The user can determine the incorrectly synthesized text in the initial synthesized speech, record the pronunciation audio corresponding to the incorrectly synthesized text, and upload the incorrectly synthesized text and the corresponding pronunciation audio to the speech synthesis system. Alternatively, the user uploads the extended text containing the incorrectly synthesized text, together with the pronunciation audio corresponding to the extended text.
  • the auxiliary synthesis feature corresponding to the matching text may be a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, such as the phoneme sequence of the pronunciation, pause information, stress, rhythm, emotion, and other pronunciation information.
  • the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • the pronunciation of the text segment in which the matched text matches the original text is the standard pronunciation of that text segment in the original text.
  • the original text is "These pants are not discounted”.
  • the matching text is "discount”
  • the pronunciation audio corresponding to the matching text is the audio corresponding to "da zhe", not other pronunciation audio such as "da she".
  • auxiliary synthesis features can be determined based on the corresponding pronunciation audio of the matched text to assist speech synthesis of the original text.
  • the auxiliary synthesis feature can be determined in advance based on the pronunciation audio corresponding to the matched text and stored in a local or third-party device.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text may be to search for the auxiliary synthesis feature corresponding to the pre-stored matching text in the local or third-party storage.
  • Alternatively, the process of obtaining the auxiliary synthesis feature corresponding to the matching text in this step may be: after obtaining the pronunciation audio corresponding to the matching text, the auxiliary synthesis feature is determined based on the pronunciation audio.
  • Step S120 referring to the auxiliary synthesis feature, perform speech synthesis on the original text to obtain synthesized speech.
  • the speech synthesis system when it performs speech synthesis on the original text in this step, in addition to referring to the original text, it may further refer to the auxiliary synthesis feature corresponding to the matched text, that is, to enrich the information referenced in the speech synthesis process of the original text.
  • since the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • In the speech synthesis method, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to the matched text of the text segment that matches the original text, and the auxiliary synthesis feature is a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can utilize the pronunciation information in the pronunciation audio corresponding to the matching text to assist in the speech synthesis of the original text, which enriches the information referenced in the speech synthesis of the original text, thereby improving the speech synthesis quality of the original text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the solution of the present application can be applied to these two types of speech synthesis systems at the same time.
  • the auxiliary synthesis feature corresponding to the above matching text can be used as the analysis result of the speech synthesis front end, or used to assist in correcting the analysis result of the speech synthesis front end, and the analysis result is then sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the auxiliary synthesis feature corresponding to the matched text can be directly used as the reference information when the speech synthesis system synthesizes the original text.
  • the speech synthesis of the original text is carried out with reference to the auxiliary synthesis feature of the present application, which can enrich the reference information during speech synthesis, thereby improving the quality of the synthesized speech.
  • Next, the auxiliary synthesis feature corresponding to the matching text mentioned above and the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature are described.
  • the auxiliary synthesis feature is a feature used for auxiliary speech synthesis determined based on the pronunciation audio corresponding to the matching text.
  • the auxiliary synthesis feature includes the pronunciation information of the pronunciation audio corresponding to the matching text, and the pronunciation information can assist the speech synthesis of the original text and improve the speech synthesis quality of the original text.
  • the auxiliary synthesis feature is the phoneme sequence corresponding to the matching text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing. Among them, a speech synthesis system with front-end preprocessing first performs front-end analysis on the original text before performing speech synthesis, for example, querying a pre-built pronunciation dictionary to obtain the phoneme sequence of the original text, and the back end then uses the phoneme sequence for speech synthesis.
  • This processing method can improve the quality of speech synthesis to a certain extent, but when there is an error in the pre-built pronunciation dictionary, it will lead to errors in the back-end synthesized speech.
  • a phoneme sequence corresponding to the matched text may be determined as an auxiliary synthesis feature.
  • the pronunciation audio corresponding to the matched text is the correct pronunciation, so the correct phoneme sequence corresponding to the matched text can be extracted from the pronunciation audio.
  • the correct phoneme sequence can be used as an auxiliary synthesis feature to participate in the speech synthesis process of the original text.
  • This embodiment provides an implementation manner of extracting the phoneme sequence from the pronunciation audio corresponding to the matched text.
  • FIG. 2 exemplifies a schematic diagram of the architecture of a phoneme sequence extraction model.
  • This application can pre-train a phoneme sequence extraction model for extracting phoneme sequences from pronunciation audio.
  • the phoneme sequence extraction model can adopt the LSTM (long short-term memory) network architecture or other optional network architectures such as HMM and CNN. FIG. 2 exemplifies a phoneme sequence extraction model with an encoder-attention-decoder architecture.
  • the encoding end uses the LSTM network to encode the audio feature sequence (x_1, x_2, ..., x_n) of the pronunciation audio to obtain the hidden-layer encoding sequence (h_1, h_2, ..., h_n). The decoding end also uses an LSTM network: at decoding time t, the decoder hidden-layer state at time t-1 and the context vector c_{t-1} calculated by the attention module are jointly used to compute the decoder hidden-layer vector s_t, and the phoneme y_t at time t is then obtained through a projection.
  • the decoding stops when the special end-of-sequence symbol is decoded, and the phoneme sequence (y_1, y_2, ..., y_t) is obtained.
  • An example is as follows:
  • the phoneme sequence extracted from the pronunciation audio corresponding to the matching text is: [zh e4 jian4 i1 f u7 b u4 d a3 zh e2].
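  • As an illustration of such a model, the following is a minimal sketch of an encoder-attention-decoder phoneme extractor, assuming PyTorch; the dimensions, the dot-product attention, and the greedy decoding loop are illustrative assumptions rather than the patent's exact model.

    import torch
    import torch.nn as nn

    class PhonemeExtractor(nn.Module):
        def __init__(self, feat_dim=80, hidden=256, n_phonemes=100, eos_id=0):
            super().__init__()
            self.eos_id = eos_id
            # Encoder LSTM: audio features (x_1..x_n) -> hidden encodings (h_1..h_n)
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
            self.embed = nn.Embedding(n_phonemes, hidden)       # previous phoneme y_{t-1}
            self.decoder_cell = nn.LSTMCell(hidden * 2, hidden) # input: [emb(y_{t-1}); c_{t-1}]
            self.proj = nn.Linear(hidden, n_phonemes)           # s_t -> phoneme logits

        def attend(self, s, enc):
            # Dot-product attention: weight each encoder frame by its match to s_t
            scores = torch.bmm(enc, s.unsqueeze(2))             # (B, n, 1)
            weights = torch.softmax(scores, dim=1)
            return (weights * enc).sum(dim=1)                   # context vector c_t: (B, hidden)

        def forward(self, audio_feats, max_len=100):
            enc, _ = self.encoder(audio_feats)                  # (B, n, hidden)
            B, hidden = audio_feats.size(0), enc.size(2)
            s = audio_feats.new_zeros(B, hidden)                # decoder hidden state s_t
            cell = audio_feats.new_zeros(B, hidden)
            c = audio_feats.new_zeros(B, hidden)                # context vector c_t
            y = torch.full((B,), self.eos_id, dtype=torch.long, device=audio_feats.device)
            phonemes = []
            for _ in range(max_len):                            # greedy decode until end symbol
                s, cell = self.decoder_cell(torch.cat([self.embed(y), c], dim=1), (s, cell))
                c = self.attend(s, enc)
                y = self.proj(s).argmax(dim=1)                  # phoneme y_t at step t
                phonemes.append(y)
                if (y == self.eos_id).all():
                    break
            return torch.stack(phonemes, dim=1)                 # (B, T) phoneme ids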
  • the process of performing speech synthesis on the original text may include:
  • the same text segment shared by the matched text and the original text may be determined, and the phoneme sequence corresponding to the same text segment is extracted from the phoneme sequence corresponding to the matched text.
  • the pronunciation dictionary is queried to determine the phoneme sequences of other text segments in the original text except the same text segment, and combine with the phoneme sequences corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • Alternatively, the initial phoneme sequence corresponding to the original text can be determined by querying the pronunciation dictionary, and the phoneme sequence corresponding to the same text segment, extracted from the phoneme sequence corresponding to the matching text, is used to replace the part of the initial phoneme sequence corresponding to that same text segment, so as to obtain the replaced phoneme sequence corresponding to the original text.
  • the phoneme sequence of the original text can be used as the text analysis result of the speech synthesis front end, and sent to the speech synthesis back end to assist in the speech synthesis of the original text.
  • the phoneme sequence of the original text obtained in this embodiment includes the phoneme sequence corresponding to the matching text, and this part of the phoneme sequence is determined based on the correct pronunciation audio corresponding to the matching text, so using the phoneme sequence of the original text to assist speech synthesis can improve the accuracy of the synthesized speech, especially for some polyphonic words and easily mispronounced words. A sketch of this splicing is given below.
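  • The following is a minimal sketch of that splicing, in Python; the texts, the phonemes, and the one-phoneme-per-character dictionary are illustrative assumptions (a real front end aligns phonemes to words rather than single characters).

    def splice_phonemes(original_text, matching_text, matched_phonemes, dictionary):
        # Locate the text fragment that the original text shares with the matching text.
        start = original_text.find(matching_text)
        if start < 0:
            # No shared fragment: fall back to the pronunciation dictionary only.
            return [dictionary[ch] for ch in original_text]
        # Dictionary lookup for the surrounding segments; audio-derived phonemes
        # (assumed one per character here) for the shared segment.
        before = [dictionary[ch] for ch in original_text[:start]]
        after = [dictionary[ch] for ch in original_text[start + len(matching_text):]]
        return before + matched_phonemes + after

    # Hypothetical usage: the audio-derived "da3 zhe2" overrides a wrong dictionary entry.
    dictionary = {"这": "zhe4", "条": "tiao2", "裤": "ku4", "子": "zi5",
                  "不": "bu4", "打": "da2", "折": "zhe2"}  # "打" deliberately wrong here
    print(splice_phonemes("这条裤子不打折", "打折", ["da3", "zhe2"], dictionary))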
  • the auxiliary synthesis feature is the prosodic information corresponding to the matching text.
  • the speech synthesis front end can perform text analysis on the original text.
  • the process of text analysis can also predict the prosody information of the original text, and then the synthesis back-end performs speech synthesis based on the original text and prosody information.
  • the naturalness of the synthesized speech can be improved.
  • the prosody information predicted for the original text may also be wrong, which in turn leads to errors in the prosody of the back-end synthesized speech and affects the quality of the synthesized speech.
  • prosodic information corresponding to the matched text may be determined based on the pronunciation audio corresponding to the matched text, as an auxiliary synthesis feature.
  • the prosody information corresponding to the matched text may be phoneme-level prosody information, which includes the prosody information of each phoneme unit in the phoneme sequence corresponding to the matched text.
  • the pronunciation audio corresponding to the matched text is the correct pronunciation, so the correct prosody information corresponding to the matched text can be extracted from the pronunciation audio.
  • the correct prosody information can be used as an auxiliary synthesis feature to participate in the speech synthesis process of the original text. For example, the corrected prosody information of the original text is determined based on the correct prosody information, and then sent to the synthesis back-end for speech synthesis.
  • the process of performing speech synthesis on the original text may include:
  • the same text segment shared by the matched text and the original text may be determined, and the prosody information corresponding to the same text segment is acquired based on the prosody information corresponding to the matched text.
  • prosody prediction technology can be used to predict the prosody information of the remaining text segments in the original text except the same text segment, and combine with the prosody information corresponding to the same text segment to obtain the prosody information of the original text.
  • the process of performing speech synthesis on the original text may include:
  • the auxiliary synthesis feature is the phoneme-level prosodic coding corresponding to the matching text.
  • the phoneme-level prosodic coding corresponding to the matched text includes some pronunciation information of the pronunciation audio corresponding to the matched text, such as prosodic features such as pronunciation duration, stress and emphasis.
  • When the speech synthesis back end performs speech synthesis, it can model the prosody information of the original text, thereby improving the naturalness of the synthesized speech.
  • the phoneme-level prosody coding corresponding to the matched text can be used as an auxiliary synthesis feature, and sent to the speech synthesis backend to assist in speech synthesis.
  • the phoneme-level prosody coding corresponding to the matching text contains the correct pronunciation information corresponding to the matching text.
  • when the speech synthesis back end performs speech synthesis based on the phoneme-level prosody coding corresponding to the matching text, the same text fragment shared by the original text and the matching text can be synthesized into speech that is consistent with the pronunciation audio of the matching text.
  • the back end of speech synthesis performs operations such as convolution on the original text, and in this processing refers to the phoneme-level prosodic coding corresponding to the same text fragment, so as to use the phoneme-level prosodic encoding of the same text segment to assist in improving the speech synthesis quality of the remaining text segments in the original text.
  • In another scheme, speech synthesis is only performed on the non-identical text segments in the original text, and the synthesized speech of the non-identical text segments is then spliced with the pre-configured speech of the same text segment to obtain the whole synthesized speech corresponding to the original text.
  • This processing method will lead to the problem of inconsistent timbre of the overall synthesized speech of the original text, and reduce the quality of the synthesized speech.
  • the speech synthesis system of the present application is still a complete synthesis system. By performing overall speech synthesis on the original text, it can be ensured that the timbre of the synthesized speech is consistent.
  • depending on the structure of the speech synthesis back end, the phoneme-level prosody coding in this embodiment may also differ.
  • FIG. 3 illustrates a schematic diagram of a synthesis flow of a speech synthesis back-end.
  • the back-end of speech synthesis includes a duration model and an acoustic model, and the duration prosody information and the acoustic parameter prosody information are modeled respectively by the duration model and the acoustic model.
  • the phoneme-level prosodic coding corresponding to the matched text in this embodiment of the present application may include duration coding and acoustic parameter coding.
  • the duration encoding can be sent to the duration model to assist in phoneme-level duration modeling;
  • the acoustic parameter encoding can be sent to the acoustic model to assist in phoneme-level acoustic parameter modeling.
  • the acoustic parameter encoding may include one or more different acoustic parameter encodings, such as fundamental frequency acoustic parameter encoding or other acoustic parameter encoding, and the like.
  • With reference to the auxiliary synthesis feature, the above step S120 of performing speech synthesis on the original text may further include:
  • the same text segment shared by the matched text and the original text is determined, and then the phoneme-level prosodic coding corresponding to the same text segment is extracted from the phoneme-level prosodic coding corresponding to the matching text.
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • Phoneme-level prosody coding includes duration coding and acoustic parameter coding.
  • the speech synthesis back end can send the duration encoding corresponding to the same text segment into the duration model for phoneme-level duration modeling, and send the acoustic parameter encoding corresponding to the same text segment into the acoustic model for phoneme-level acoustic parameter modeling, and the synthesized speech is finally obtained by the speech synthesis back end. A sketch of such supplementary inputs follows.
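  • The following is a minimal sketch of such supplementary input, assuming PyTorch: a duration model consumes phoneme embeddings plus an optional per-phoneme prosody code, with zeros standing in for phonemes outside the matched segment. The dimensions and architecture are illustrative assumptions.

    import torch
    import torch.nn as nn

    class DurationModel(nn.Module):
        def __init__(self, n_phonemes=100, emb=128, code_dim=16, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(n_phonemes, emb)
            self.rnn = nn.LSTM(emb + code_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, 1)   # predicted duration (frames) per phoneme

        def forward(self, phoneme_ids, prosody_codes):
            # Concatenate the supplementary prosody code to each phoneme embedding.
            x = torch.cat([self.embed(phoneme_ids), prosody_codes], dim=-1)
            h, _ = self.rnn(x)
            return self.out(h).squeeze(-1)

    model = DurationModel()
    ids = torch.randint(0, 100, (1, 8))       # 8 phonemes of the original text
    codes = torch.zeros(1, 8, 16)             # default: no supplementary code
    codes[0, 3:5] = torch.randn(2, 16)        # codes for the matched segment's phonemes
    durations = model(ids, codes)             # (1, 8) phoneme-level durations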
  • the auxiliary synthesis feature is the acoustic feature of the pronunciation audio corresponding to the matching text.
  • speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing.
  • the speech synthesis system without front-end preprocessing does not perform front-end analysis on the original text, but directly performs speech synthesis on the original text.
  • the acoustic feature of the pronunciation audio corresponding to the matching text may be used as an auxiliary synthesis feature, and sent to the speech synthesis system to assist in the speech synthesis of the original text.
  • the acoustic feature contains the pronunciation information of the pronunciation audio corresponding to the matching text.
  • when synthesizing each speech frame, the acoustic features associated with that frame can be extracted from the acoustic features to assist the synthesis, which can be used to correct pronunciation errors, such as the pronunciation errors of rare words, special symbols, polyphonic words, and foreign words that are prone to errors, finally obtaining high-quality synthesized speech.
  • the acoustic features include but are not limited to cepstral features of pronunciation audio.
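  • For instance, cepstral features could be extracted as in the following minimal sketch, assuming librosa is available; the file name, sample rate, and number of coefficients are illustrative assumptions.

    import librosa

    # Load the pronunciation audio of the matching text (hypothetical file name).
    audio, sr = librosa.load("matching_text_pronunciation.wav", sr=16000)
    # Mel-frequency cepstral coefficients, one 13-dimensional vector per frame.
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    acoustic_features = mfcc.T                               # frame-major layout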
  • the process of performing speech synthesis on the original text may include:
  • S1. Process the original text based on a speech synthesis model to obtain context information for predicting the current speech frame.
  • the speech synthesis model can adopt an encoder-decoder architecture, with the encoding and decoding layers further connected through an attention module. Then, through the encoder-decoder architecture and the attention module, the context information C_t required when synthesizing the current speech frame y_t can be obtained from the original text.
  • the context information C_t indicates the text information in the original text needed to synthesize the current speech frame y_t.
  • S2. Based on the context information, the matched text, and the acoustic features of the pronunciation audio, determine the target acoustic feature required for predicting the current speech frame.
  • step S2 may include:
  • through the attention mechanism, the similarity between the context information and the matching text can be obtained, and through the attention matrix of the acoustic features of the pronunciation audio to the matching text, the correlation between each frame of acoustic features and the matching text can be obtained.
  • the correlation between the context information and the acoustic features of each frame can be obtained.
  • the correlation indicates the proximity between the context information and each frame of acoustic features. It can be understood that, when the context information is highly correlated with the acoustic feature of a target frame, the pronunciation of the text corresponding to the context information is strongly correlated with the acoustic feature of the target frame.
  • an optional implementation manner of step S21 is introduced below, which may include the following steps:
  • the first attention weight matrix W_mx includes the attention weight of each frame of acoustic features to each text unit in the matched text.
  • the size of the matrix W_mx is T_my * T_mx, where T_my represents the frame length of the acoustic features corresponding to the pronunciation audio, and T_mx represents the length of the matched text.
  • the second attention weight matrix W_cmx includes the attention weight of the context information C_t to each text unit in the matched text.
  • the size of the matrix W_cmx is 1 * T_mx.
  • the third attention weight matrix W_cmy includes the attention weight of the context information C_t on the acoustic features of each frame, as the degree of correlation between the context information and the acoustic features of each frame.
  • the size of the matrix W_cmy is 1 * T_my.
  • the matrix W_cmy can be expressed as: W_cmy = W_cmx * W_mx', where W_mx' represents the transpose of the matrix W_mx.
  • each correlation degree may be normalized first, and with each normalized correlation degree used as a weight, the acoustic features of each frame of the pronunciation audio are weighted and added to obtain the target acoustic feature required for predicting the current speech frame.
  • the target acoustic feature can be denoted as C_mt. A numerical sketch of this composition is given below.
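  • The following is a minimal numpy sketch of this composition, following the matrix sizes given above; the random inputs are placeholders for real attention weights and acoustic features.

    import numpy as np

    T_my, T_mx, feat_dim = 50, 8, 80           # frames, matched-text units, feature size
    W_mx = np.random.rand(T_my, T_mx)          # acoustic frames -> matched text units
    W_cmx = np.random.rand(1, T_mx)            # context C_t -> matched text units
    acoustic = np.random.rand(T_my, feat_dim)  # per-frame acoustic features

    W_cmy = W_cmx @ W_mx.T                     # (1, T_my): C_t -> each acoustic frame
    w = np.exp(W_cmy) / np.exp(W_cmy).sum()    # softmax normalization of correlations
    C_mt = (w.T * acoustic).sum(axis=0)        # weighted sum: target acoustic feature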
  • this embodiment provides a solution so that, when performing speech synthesis on the original text, the amount of information of the referenced target acoustic feature C_mt can be controlled for different speech frames to be predicted.
  • the specific implementation process can include:
  • a threshold mechanism or other strategies may be used to determine the fusion coefficient a_gate of the target acoustic feature C_mt when predicting the current speech frame.
  • a_gate can be expressed as: a_gate = sigmoid(g_g(C_mt, s_t)), where s_t represents the current hidden-layer vector at the decoding end, and g_g() represents the set functional relationship.
  • the current speech frame y_t can then be expressed as a set functional relationship g() of the gated target acoustic feature a_gate * C_mt and the context information C_t.
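  • A minimal sketch of this threshold mechanism follows, assuming the gate is a learned linear layer over the concatenation [C_mt; s_t] followed by a sigmoid; the exact forms of g_g() and the fusion g() are not specified in the text and are assumptions here.

    import torch
    import torch.nn as nn

    hidden = 256
    gate_layer = nn.Linear(hidden * 2, 1)   # a hypothetical g_g()
    C_mt = torch.randn(1, hidden)           # target acoustic feature
    s_t = torch.randn(1, hidden)            # decoder hidden-layer vector
    a_gate = torch.sigmoid(gate_layer(torch.cat([C_mt, s_t], dim=1)))
    gated = a_gate * C_mt                   # controlled amount of acoustic information
    # 'gated' is then fused with the context information C_t to predict the frame y_t.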
  • FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture.
  • the speech synthesis system illustrated in FIG. 4 adopts an end-to-end synthesis process with an encoder-decoder and an attention mechanism.
  • the original text is encoded by the encoding end to obtain the encoding vector of the original text, and the context information C_t required for predicting the current speech frame y_t can be obtained through the first attention module.
  • the matching text is encoded by the encoding end to obtain the encoding vector of the matching text. Further, through the second attention module, the attention weight of the context information C_t to each text unit in the matched text can be obtained, forming the second attention weight matrix.
  • the attention weight of the acoustic features of the pronunciation audio of the matched text to the matched text can be obtained, forming the first attention weight matrix. Further, based on the first attention weight matrix and the second attention weight matrix, the third attention weight matrix of the context information C_t to the acoustic features is obtained.
  • the third attention weight matrix includes the correlation degree between the context information C_t and the acoustic features of each frame.
  • the target acoustic feature C_mt required for predicting the current speech frame y_t is obtained by performing softmax regularization on the third attention weight matrix and performing weighted addition with the acoustic features of each frame of the pronunciation audio.
  • the decoder can predict the current speech frame y_t based on the target acoustic feature C_mt and the context information C_t.
  • the expression used by the decoding end to predict the current speech frame y_t may refer to the foregoing related introduction.
  • Each predicted speech frame is mapped to synthesized speech by a vocoder.
  • the process of acquiring the auxiliary synthesis feature corresponding to the matching text is introduced.
  • the process may include:
  • the application can collect and record a large number of fixed utterance texts in advance for the speech synthesis scenario, take the collected utterance texts as template texts, and store the template texts together with the corresponding pronunciation audio.
  • the auxiliary synthesis feature is determined based on the pronunciation audio of the template text, and then the template text and the auxiliary synthesis feature are stored together.
  • step S1 may include:
  • each resource package includes a template text and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • the auxiliary synthesis features may include phoneme sequences and prosodic information corresponding to the template text. Further, the auxiliary synthesis feature may also include phoneme-level prosodic coding corresponding to the template text.
  • the template text is "Welcome to AI Voice Assistant".
  • the auxiliary synthesis features that can be determined may include the phoneme sequence of the template text, prosodic information, phoneme-level prosodic coding, and the like. Furthermore, the template text and auxiliary synthesis features can be packaged into a resource package.
  • the packaged resource package can be encoded into binary resource data, so as to reduce storage space occupation and facilitate processing and identification by the subsequent speech synthesis system. A sketch of such packaging follows.
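  • A minimal sketch of such packaging follows; pickle and the field names are illustrative assumptions, not the patent's binary format.

    import pickle
    from dataclasses import dataclass, field

    @dataclass
    class ResourcePackage:
        template_text: str
        phoneme_sequence: list
        prosody_info: list
        phoneme_prosody_codes: list = field(default_factory=list)

    pkg = ResourcePackage(
        template_text="Welcome to AI Voice Assistant",
        phoneme_sequence=["w", "eh1", "l", "k", "ah0", "m"],   # hypothetical phonemes
        prosody_info=[{"pause_after": False}, {"pause_after": True}],
    )
    blob = pickle.dumps(pkg)        # binary resource for compact storage
    restored = pickle.loads(blob)   # read back by the synthesis system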
  • the process of determining the phoneme-level prosodic coding corresponding to the template text is introduced.
  • the phoneme-level prosodic coding corresponding to the template text can be determined based on the coding prediction network and the generation network, which can specifically include the following steps:
  • A1. Extract the phoneme-level prosody information from the pronunciation audio corresponding to the template text.
  • A2. Input the template text and the phoneme-level prosody information into a coding prediction network to obtain a predicted phoneme-level prosodic code.
  • A3. Input the predicted phoneme-level prosodic code and the template text into a generation network to obtain the generated phoneme-level prosody information.
  • A4. Train the coding prediction network and the generation network with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the trained coding prediction network is obtained.
  • the process of training the coding prediction network and the generation network with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information may specifically be: calculating the mean square error (MSE) between the generated phoneme-level prosody information and the extracted phoneme-level prosody information, and adjusting the network parameters through iterative training; when the MSE reaches a preset threshold, the training can be ended. A sketch of this training loop is given below.
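  • The following is a minimal sketch of that loop, assuming PyTorch; both networks are stand-in MLPs, and the dimensions, batch, and threshold are illustrative assumptions.

    import torch
    import torch.nn as nn

    dim = 64
    predictor = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, 8))
    generator = nn.Sequential(nn.Linear(dim + 8, dim), nn.ReLU(), nn.Linear(dim, dim))
    optim = torch.optim.Adam(list(predictor.parameters()) + list(generator.parameters()))
    mse = nn.MSELoss()

    text_repr = torch.randn(32, dim)   # encoded template text (placeholder)
    extracted = torch.randn(32, dim)   # prosody extracted from the pronunciation audio

    for step in range(10000):
        code = predictor(torch.cat([text_repr, extracted], dim=1))   # predicted prosody code
        generated = generator(torch.cat([text_repr, code], dim=1))   # regenerated prosody
        loss = mse(generated, extracted)
        optim.zero_grad(); loss.backward(); optim.step()
        if loss.item() < 1e-3:         # preset MSE threshold ends the training
            break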
  • the implementation process of the above-mentioned step S11, determining, in the pre-configured and stored template texts, the matching text that matches the text fragment in the original text, may include:
  • if a completely matching template text exists, it is determined as the matching text. If it does not exist, partial matching can be performed, for example, starting from one or both ends of the original text and looking for the maximum-length match in the template text of each resource package as the matching text.
  • For example, the original text is "Are you Wang Ning?"; no identical template text is matched, but the template text "Is your name Liu Wu?" is matched. By matching the original text against the above template text at maximum length, the matching texts "Are you" and "Are you?" can be obtained. A sketch of such maximum-length matching follows.
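  • The following is a minimal sketch of such matching in Python: an exact match is tried first, then the longest fragment shared with any template, scanning from both ends of the original text. The template below is hypothetical and the strategy is an illustrative assumption.

    def find_matching_texts(original, templates):
        if original in templates:
            return [original]                      # complete match
        matches = []
        for n in range(len(original), 0, -1):      # longest fragments first
            for frag in (original[:n], original[-n:]):
                if any(frag in t for t in templates) and frag not in matches:
                    matches.append(frag)
            if matches:
                break
        return matches

    templates = ["Are you Liu Wu?"]                # hypothetical stored template text
    print(find_matching_texts("Are you Wang Ning?", templates))   # ['Are you ']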
  • the present application can obtain the data uploaded by the user.
  • the uploaded data includes the uploaded text and the pronunciation audio corresponding to the uploaded text.
  • the uploaded text has a matching text fragment with the original text.
  • the uploaded text can be used as matching text.
  • Before the uploaded data is acquired, initial speech synthesis may be performed on the original text, and the initial synthesized speech of the original text may be output.
  • the process of initial speech synthesis on the original text can use various existing or possible future speech synthesis schemes.
  • the user can determine the incorrectly synthesized text segment in the initial synthesized speech and the correct pronunciation corresponding to that text segment, then use the incorrectly synthesized text segment as the uploaded text and the correct pronunciation corresponding to it as the pronunciation audio corresponding to the uploaded text, and upload them as the upload data.
  • Alternatively, the user can obtain the extended text containing the incorrectly synthesized text segment in the initial synthesized speech and the correct pronunciation corresponding to the extended text, take the extended text as the uploaded text and the correct pronunciation corresponding to the extended text as the pronunciation audio corresponding to the uploaded text, and upload them together as the upload data.
  • the auxiliary synthesis feature can be determined in advance based on the pronunciation audio corresponding to the matching text and stored in the local or a third-party device.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text may be to search for the auxiliary synthesis feature corresponding to the pre-stored matching text in the local or third-party storage.
  • Alternatively, the process of obtaining the auxiliary synthesis feature corresponding to the matching text in this step may be: after obtaining the pronunciation audio corresponding to the matching text, the auxiliary synthesis feature is determined based on the pronunciation audio.
  • the matching text in the above step S1 may be obtained by the first method 1), that is, the original text is respectively matched against the template text in each pre-configured resource package and the matching degree is calculated.
  • the implementation process of the above-mentioned step S2 may specifically include:
  • the resource package contains auxiliary synthesis features corresponding to the template text, such as phoneme sequences, prosody information, and phoneme-level prosodic coding.
  • the matching text is the same as the template text or belongs to a partial text segment in the template text. Therefore, auxiliary synthesis features corresponding to the matching text can be extracted from the auxiliary synthesis features corresponding to the template text.
  • the implementation process of the above step S2 may specifically include:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • the speech synthesis apparatus provided by the embodiments of the present application is described below, and the speech synthesis apparatus described below and the speech synthesis method described above can be referred to each other correspondingly.
  • FIG. 6 is a schematic structural diagram of a speech synthesis apparatus disclosed in an embodiment of the present application.
  • the apparatus may include:
  • the original text acquisition unit 11 is used to acquire the original text to be synthesized
  • Auxiliary synthesis feature acquisition unit 12 is used to obtain an auxiliary synthesis feature corresponding to a matching text, where the matching text and the original text have matching text fragments, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • the auxiliary speech synthesis unit 13 is configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • the process of obtaining the auxiliary synthesis feature corresponding to the matching text by the above-mentioned auxiliary synthesis feature obtaining unit may include:
  • the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matched text is acquired.
  • auxiliary synthesis features may include:
  • the phoneme sequence corresponding to the matching text determined based on the pronunciation audio corresponding to the matching text
  • the acoustic feature of the pronunciation audio corresponding to the matched text.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the matching text that has text fragments matching the original text may include:
  • matching texts that match text fragments within the original text are determined.
  • the above preconfigured template text may include:
  • each resource package includes a template text, and an auxiliary synthesis feature corresponding to the template text determined based on the pronunciation audio corresponding to the template text.
  • the process of determining the matching text that matches the text fragment in the original text in the preconfigured template text by the above-mentioned auxiliary synthesis feature obtaining unit may include:
  • in the template texts of the preconfigured resource packages, the matching text that matches the text fragment in the original text is determined.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text may include:
  • the apparatus of the present application may further include: a resource package configuration unit for configuring resource packages, and the process may include:
  • the phoneme sequence and prosody information are used as auxiliary synthesis features corresponding to the template text, and the auxiliary synthesis features and the template text are organized into a resource package.
  • the process of configuring the resource package by the resource package configuration unit may further include:
  • the phoneme-level prosodic encoding is incorporated into the resource bundle.
  • the process in which the above-mentioned resource package configuration unit determines the phoneme-level prosodic coding corresponding to the template text based on the template text and the corresponding pronunciation audio may include:
  • the encoding prediction network and the generation network are trained with the goal of the generated phoneme-level prosody information approaching the extracted phoneme-level prosody information; when the training ends, the phoneme-level prosodic coding predicted by the trained encoding prediction network is obtained.
  • the process in which the above-mentioned auxiliary synthesis feature obtaining unit obtains the matching text that has text fragments matching the original text may include:
  • the uploaded text in the uploaded data is acquired as the matching text, the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have text fragments that match.
  • the process for the auxiliary synthesis feature acquisition unit to obtain the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text may include:
  • the auxiliary synthesis feature corresponding to the matching text is determined.
  • the apparatus of the present application may further include: an initial synthesized speech output unit, configured to output an initial synthesized speech of the original text before acquiring the uploaded text in the uploaded data.
  • the uploaded text is the incorrectly synthesized text segment in the initial synthesized speech
  • the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the incorrectly synthesized text segment
  • or, the uploaded text is the extended text that contains the text fragments synthesized incorrectly in the initial synthesized speech,
  • the pronunciation audio corresponding to the uploaded text is the correct pronunciation corresponding to the expanded text.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may include:
  • speech synthesis is performed on the original text to obtain synthesized speech.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may also include:
  • the phoneme-level prosody coding corresponding to the same text segment is used as a supplementary input of the speech synthesis model to obtain synthesized speech.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines the phoneme sequence of the original text based on the phoneme sequence corresponding to the matched text may include:
  • the pronunciation dictionary is queried to determine the phoneme sequences of the other text segments in the original text except the same text segment, which are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
  • the process in which the above-mentioned auxiliary speech synthesis unit, referring to the auxiliary synthesis feature, performs speech synthesis on the original text to obtain synthesized speech may include:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame is determined;
  • the current speech frame is predicted, and after all speech frames are predicted, synthesized speech is composed of the predicted speech frames.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines the target acoustic feature required for predicting the current speech frame may include:
  • based on the context information, the matched text, and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio is obtained;
  • based on the correlation degree, the target acoustic feature required for predicting the current speech frame is determined.
  • the process in which the above-mentioned auxiliary speech synthesis unit obtains the correlation degree between the context information and each frame of acoustic features of the pronunciation audio may include:
  • obtaining a first attention weight matrix of the acoustic features to the matched text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matched text;
  • obtaining a second attention weight matrix of the context information to the matched text, where the second attention weight matrix includes the attention weight of the context information to each text unit in the matched text;
  • based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information to the acoustic features is obtained, where the third attention weight matrix includes the attention weight of the context information to the acoustic features of each frame, which is used as the correlation degree between the context information and the acoustic features of each frame.
  • the process in which the above-mentioned auxiliary speech synthesis unit determines, based on the correlation degrees, the target acoustic feature required for predicting the current speech frame may include:
  • normalizing the correlation degrees, and using the normalized correlation degrees as weights to compute a weighted sum of the frame-level acoustic features of the pronunciation audio, obtaining the target acoustic feature
  • the process in which the above-mentioned auxiliary speech synthesis unit predicts the current speech frame based on the context information and the determined target acoustic feature may include:
  • fusing the target acoustic feature and the context information, and predicting the current speech frame based on the fusion result
  • FIG. 7 shows a block diagram of the hardware structure of the speech synthesis device.
  • the hardware structure of the speech synthesis device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
  • the number of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
  • the memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
  • the memory stores a program;
  • the processor can call the program stored in the memory, and the program is used for:
  • acquiring the auxiliary synthesis feature corresponding to the matching text, where the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  • for the refinement and extension functions of the program, reference may be made to the description above.
  • An embodiment of the present application further provides a storage medium, where the storage medium stores a program suitable for execution by a processor, and the program is used for:
  • acquiring the auxiliary synthesis feature corresponding to the matching text, where the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
  • performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech;
  • for the refinement and extension functions of the program, reference may be made to the description above.
  • An embodiment of the present application further provides a computer program product, which, when run on a terminal device, causes the terminal device to execute any one of the above-mentioned speech synthesis methods.

Abstract

A speech synthesis method and apparatus, a device, and a storage medium. In the method, during speech synthesis of original text to be synthesized, an auxiliary synthesis feature corresponding to matching text is referenced, where the matching text contains a text segment matching a text segment of the original text, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined on the basis of the pronunciation audio corresponding to the matching text. By referencing the auxiliary synthesis feature corresponding to the matching text, the pronunciation information in the pronunciation audio corresponding to the matching text can be used to assist the speech synthesis of the original text, thereby enriching the information referenced during the speech synthesis of the original text and improving its speech synthesis quality.

Description

Speech synthesis method, apparatus, device and storage medium
This application claims priority to the Chinese patent application No. 202011607966.3, entitled "Speech Synthesis Method, Apparatus, Device and Storage Medium", filed with the State Intellectual Property Office of China on December 30, 2020, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of speech processing, and more particularly, to a speech synthesis method, apparatus, device and storage medium.
Background
In recent years, with the development of information technology and the rise of artificial intelligence, human-computer interaction has become increasingly important. Speech synthesis is a research hotspot of human-computer interaction both at home and abroad. Speech synthesis is the process of synthesizing input original text to be synthesized into speech output.
A traditional speech synthesis model is generally an end-to-end speech synthesis scheme: training text and the corresponding speech data or waveform data are used directly to train the speech synthesis model, and the trained model, given input original text to be synthesized, outputs synthesized speech, or outputs waveform data from which the corresponding synthesized speech is then obtained.
Existing speech synthesis solutions refer only to the original text during synthesis, which makes the synthesized speech error-prone and the synthesis effect poor.
SUMMARY OF THE INVENTION
In view of the above problems, the present application is proposed to provide a speech synthesis method, apparatus, device and storage medium, so as to improve the quality of synthesized speech. The specific solutions are as follows:
In a first aspect of the present application, a speech synthesis method is provided, comprising:
acquiring original text to be synthesized;
acquiring an auxiliary synthesis feature corresponding to matching text, where the matching text and the original text have a matching text segment, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
performing speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
Preferably, the acquiring of the auxiliary synthesis feature corresponding to the matching text includes:
acquiring matching text having a text segment that matches the original text;
acquiring the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text.
Preferably, the auxiliary synthesis feature includes:
a phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
phoneme-level prosody coding corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
acoustic features of the pronunciation audio corresponding to the matching text.
Preferably, the acquiring of matching text having a text segment that matches the original text includes:
determining, among preconfigured template texts, matching text that matches a text segment in the original text.
Preferably, the acquiring of matching text having a text segment that matches the original text includes:
acquiring uploaded text in uploaded data as the matching text, where the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text and the original text have a matching text segment.
Preferably, the preconfigured template texts include:
template texts in respective preconfigured resource packages, where each resource package contains a template text and an auxiliary synthesis feature corresponding to the template text, determined based on the pronunciation audio corresponding to the template text.
Preferably, the determining, among the preconfigured template texts, of matching text that matches a text segment in the original text includes:
performing matching calculation between the original text and the template text in each preconfigured resource package;
determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches the text segment in the original text.
Preferably, the acquiring of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
acquiring the auxiliary synthesis feature corresponding to the matching text contained in the resource package with the highest matching degree.
Preferably, the process of determining a preconfigured resource package includes:
acquiring a preconfigured template text and the corresponding pronunciation audio;
determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
using the phoneme sequence and prosody information as the auxiliary synthesis feature corresponding to the template text, and organizing the auxiliary synthesis feature and the template text into a resource package.
Preferably, the process of determining a preconfigured resource package further includes:
determining, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody coding corresponding to the template text;
incorporating the phoneme-level prosody coding into the resource package.
Preferably, the determining, based on the template text and the corresponding pronunciation audio, of the phoneme-level prosody coding corresponding to the template text includes:
extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody coding;
inputting the predicted phoneme-level prosody coding and the template text into a generation network to obtain generated phoneme-level prosody information;
training the coding prediction network and the generation network with the goal of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and obtaining, at the end of the training, the phoneme-level prosody coding predicted by the trained coding prediction network.
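As an illustration only (not part of the original disclosure), the following Python sketch shows one way the coding prediction network and generation network described above could be trained jointly; the module structures, dimensions, optimizer and MSE objective are all assumptions.

```python
import torch
import torch.nn as nn

class CodingPredictionNetwork(nn.Module):
    """Predicts phoneme-level prosody codes from text and extracted prosody."""
    def __init__(self, text_dim=128, prosody_dim=8, code_dim=32):
        super().__init__()
        self.proj = nn.Linear(text_dim + prosody_dim, code_dim)

    def forward(self, text_emb, prosody):
        # text_emb: (batch, n_phonemes, text_dim); prosody: (batch, n_phonemes, prosody_dim)
        return self.proj(torch.cat([text_emb, prosody], dim=-1))

class GenerationNetwork(nn.Module):
    """Regenerates phoneme-level prosody from the predicted codes and the text."""
    def __init__(self, text_dim=128, code_dim=32, prosody_dim=8):
        super().__init__()
        self.proj = nn.Linear(text_dim + code_dim, prosody_dim)

    def forward(self, text_emb, codes):
        return self.proj(torch.cat([text_emb, codes], dim=-1))

predictor = CodingPredictionNetwork()
generator = GenerationNetwork()
optimizer = torch.optim.Adam(
    list(predictor.parameters()) + list(generator.parameters()), lr=1e-3)
criterion = nn.MSELoss()

def train_step(text_emb, extracted_prosody):
    # Train both networks so that the regenerated prosody approaches the
    # prosody extracted from the template text and its pronunciation audio.
    codes = predictor(text_emb, extracted_prosody)
    generated = generator(text_emb, codes)
    loss = criterion(generated, extracted_prosody)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: one batch of a template text with 10 phonemes.
loss = train_step(torch.randn(1, 10, 128), torch.randn(1, 10, 8))
```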
Preferably, before the acquiring of the uploaded text in the uploaded data, the method further includes:
acquiring and outputting an initial synthesized speech of the original text;
in this case, the uploaded text is a text segment that was synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the incorrectly synthesized text segment;
or, the uploaded text is an extended text containing the text segment that was synthesized incorrectly in the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
Preferably, the acquiring of the auxiliary synthesis feature determined based on the pronunciation audio corresponding to the matching text includes:
determining the auxiliary synthesis feature corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech includes:
determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
and/or,
determining the prosody information of the original text based on the prosody information corresponding to the matching text;
performing speech synthesis on the original text based on the phoneme sequence and/or prosody information of the original text to obtain synthesized speech.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech further includes:
acquiring, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the same text segment in the matching text and the original text;
during the speech synthesis of the original text, using the phoneme-level prosody coding corresponding to the same text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
Preferably, the determining of the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text includes:
acquiring, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the same text segment in the matching text and the original text;
querying a pronunciation dictionary to determine the phoneme sequences of the text segments in the original text other than the same text segment, and combining them with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
Preferably, the performing of speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech includes:
processing the original text based on a speech synthesis model to obtain context information for predicting the current speech frame;
determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature required for predicting the current speech frame;
predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
Preferably, the determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, of the target acoustic feature required for predicting the current speech frame includes:
obtaining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the correlation degree between the context information and each frame of acoustic features of the pronunciation audio;
determining, based on the correlation degrees, the target acoustic feature required for predicting the current speech frame.
Preferably, the obtaining of the correlation degree between the context information and each frame of acoustic features of the pronunciation audio includes:
acquiring a first attention weight matrix of the acoustic features of the pronunciation audio with respect to the matching text, the first attention weight matrix including the attention weight of each frame of acoustic features to each text unit in the matching text;
acquiring a second attention weight matrix of the context information with respect to the matching text, the second attention weight matrix including the attention weight of the context information to each text unit in the matching text;
obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information with respect to the acoustic features, the third attention weight matrix including the attention weight of the context information to each frame of acoustic features, which is used as the correlation degree between the context information and each frame of acoustic features.
Preferably, the determining, based on the correlation degrees, of the target acoustic feature required for predicting the current speech frame includes:
normalizing the correlation degrees, and using the normalized correlation degrees as weights to compute a weighted sum of the frame-level acoustic features of the pronunciation audio, obtaining the target acoustic feature.
Preferably, the predicting of the current speech frame based on the context information and the determined target acoustic feature includes:
determining, based on the current hidden layer vector at the decoding end of the speech synthesis model and the target acoustic feature, a fusion coefficient of the target acoustic feature for predicting the current speech frame;
fusing the target acoustic feature and the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
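As a minimal sketch of the fusion step just described (the sigmoid gating form, dimensions and module names are assumptions, not the patent's implementation):

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuses the target acoustic feature with the decoder context, using a
    fusion coefficient computed from the decoder hidden vector and the
    target acoustic feature (a sigmoid gate is assumed here)."""
    def __init__(self, hidden_dim=256, feat_dim=80):
        super().__init__()
        self.gate = nn.Linear(hidden_dim + feat_dim, 1)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)

    def forward(self, decoder_hidden, context, target_feat):
        # decoder_hidden, context: (batch, hidden_dim); target_feat: (batch, feat_dim)
        g = torch.sigmoid(self.gate(torch.cat([decoder_hidden, target_feat], dim=-1)))
        # Fusion result, from which the current speech frame would be predicted.
        return g * self.feat_proj(target_feat) + (1 - g) * context

fusion = GatedFusion()
fused = fusion(torch.randn(1, 256), torch.randn(1, 256), torch.randn(1, 80))
```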
In a second aspect of the present application, a speech synthesis apparatus is provided, comprising:
an original text acquisition unit, configured to acquire original text to be synthesized;
an auxiliary synthesis feature acquisition unit, configured to acquire an auxiliary synthesis feature corresponding to matching text, where the matching text and the original text have a matching text segment, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
an auxiliary speech synthesis unit, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
In a third aspect of the present application, a speech synthesis device is provided, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to implement the steps of the speech synthesis method described above.
In a fourth aspect of the present application, a storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the speech synthesis method described above are implemented.
In a fifth aspect of the present application, a computer program product is provided, which, when run on a terminal device, causes the terminal device to execute the steps of the speech synthesis method described above.
By virtue of the above technical solutions, the speech synthesis method of the present application, in the process of performing speech synthesis on original text to be synthesized, refers to the auxiliary synthesis feature corresponding to matching text that has a text segment matching the original text, the auxiliary synthesis feature being a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can thus be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced during the speech synthesis of the original text and thereby improves the speech synthesis quality of the original text.
It can be understood that speech synthesis systems can be divided into two types, with front-end preprocessing and without front-end preprocessing, and the solution of the present application is applicable to both. For a speech synthesis system with front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can serve as the analysis result of the speech synthesis front end, or assist in correcting that analysis result, and the analysis result is then sent to the speech synthesis back end to assist the speech synthesis of the original text. For a speech synthesis system without front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can directly serve as reference information when the speech synthesis system synthesizes the original text. For both types of speech synthesis systems, performing the speech synthesis of the original text with reference to the auxiliary synthesis feature of the present application enriches the reference information during speech synthesis and can thereby improve the quality of the synthesized speech.
Description of the Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are for the purpose of illustrating the preferred embodiments only and are not to be considered limiting of the present application. Throughout the drawings, the same reference numerals denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a phoneme sequence extraction model architecture;
FIG. 3 illustrates a schematic diagram of the synthesis flow of a speech synthesis back end;
FIG. 4 illustrates a schematic diagram of a speech synthesis system architecture;
FIG. 5 illustrates a schematic diagram of the process by which a prediction-generation network determines phoneme-level prosody coding;
FIG. 6 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a speech synthesis device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The present application provides a speech synthesis solution that is applicable to various speech synthesis tasks, such as speech synthesis in human-computer interaction scenarios and various other scenarios in which speech synthesis is required.
The solution of the present application can be implemented on a terminal with data processing capability, and the terminal can be a mobile phone, a computer, a server, a cloud, or the like.
Next, in conjunction with FIG. 1, the speech synthesis method of the present application may include the following steps:
Step S100: acquire the original text to be synthesized.
Specifically, the original text is the text from which speech is to be synthesized. The original text may be provided by a user, or may be text requiring speech synthesis provided by another device or application.
Step S110: acquire the auxiliary synthesis feature corresponding to the matching text, where the matching text and the original text have a matching text segment.
The matching text may be text that matches the original text or a text segment within the original text. For example, if the original text is "这条裤子不打折" ("these pants are not discounted"), the matching text may be "这条裤子不打折" or "打折" ("discounted"). In addition, the matching text may be text containing a text segment that matches a text segment within the original text. Still taking the above original text as an example, the matching text may be "你这件衣服打折吗" ("is this piece of clothing discounted"), that is, the matching text contains the text segment "打折" that matches the original text.
The matching text may be text preconfigured and stored by the present application. For example, in customer service, interaction and similar scenarios, fixed script texts may be recorded in advance and stored, and the script text matching the original text is then looked up among the stored script texts as the matching text. Taking customer service and interaction scenarios as an example, there are some fixed script texts, such as the prompt texts with which an intelligent customer service agent or a terminal prompts the user, for example "what would you like to inquire about", "hello, how may I help you", or "press 1 to check your phone bill, press 2 to check your data usage". Correspondingly, these fixed script texts can be recorded in advance, and the recordings can be stored together with the script texts as prompt audio.
In addition, the matching text may be text uploaded by a user. For example, when uploading the original text to be synthesized, the user also uploads, as the matching text, the text in the original text that is prone to synthesis errors, and may further upload the pronunciation audio corresponding to the matching text. For another example, after the user uploads the original text to be synthesized, the synthesis system outputs the synthesized initial synthesized speech. The user can identify the incorrectly synthesized text in the initial synthesized speech, record the pronunciation audio corresponding to that text, and upload the incorrectly synthesized text and the corresponding pronunciation audio to the speech synthesis system. Alternatively, the user uploads extended text containing the incorrectly synthesized text, together with the pronunciation audio corresponding to the extended text.
The auxiliary synthesis feature corresponding to the matching text may be a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. The auxiliary synthesis feature contains pronunciation information of the pronunciation audio corresponding to the matching text, such as the phoneme sequence of the pronunciation, pause information, stress, prosody, emotion and other pronunciation information, which can assist the speech synthesis of the original text and improve its speech synthesis quality.
In the pronunciation audio corresponding to the matching text, the pronunciation of the text segment in which the matching text matches the original text is the standard pronunciation of that text segment in the original text. For example, the original text is "这条裤子不打折" and the matching text is "打折"; the pronunciation audio corresponding to the matching text is then the audio corresponding to "da zhe", rather than other pronunciations such as "da she". On this basis, the auxiliary synthesis feature can be determined based on the pronunciation audio corresponding to the matching text, so as to assist the speech synthesis of the original text.
It can be understood that, if the pronunciation audio corresponding to the matching text can be obtained before speech synthesis is performed on the original text, the auxiliary synthesis feature can be determined in advance based on that pronunciation audio and stored locally or on a third-party device. In this case, the process of acquiring the auxiliary synthesis feature corresponding to the matching text in this step may be searching local or third-party storage for the prestored auxiliary synthesis feature corresponding to the matching text.
In addition, if the pronunciation audio corresponding to the matching text is obtained on the fly during the speech synthesis of the original text, the process of acquiring the auxiliary synthesis feature corresponding to the matching text in this step may be determining the auxiliary synthesis feature based on the pronunciation audio once it has been obtained.
Step S120: perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
Specifically, in this step, when performing speech synthesis on the original text, the speech synthesis system may refer not only to the original text but also to the auxiliary synthesis feature corresponding to the matching text, which enriches the information referenced during the speech synthesis of the original text. Meanwhile, since the auxiliary synthesis feature contains pronunciation information of the pronunciation audio corresponding to the matching text, this pronunciation information can assist the speech synthesis of the original text and improve its speech synthesis quality.
In the speech synthesis method provided by the embodiments of the present application, in the process of performing speech synthesis on the original text to be synthesized, reference is made to the auxiliary synthesis feature corresponding to matching text that has a text segment matching the original text, the auxiliary synthesis feature being a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It can thus be seen that, by referring to the auxiliary synthesis feature corresponding to the matching text, the present application can use the pronunciation information in the pronunciation audio corresponding to the matching text to assist the speech synthesis of the original text, which enriches the information referenced during the speech synthesis of the original text and thereby improves the speech synthesis quality of the original text.
It can be understood that speech synthesis systems can be divided into two types, with front-end preprocessing and without front-end preprocessing, and the solution of the present application is applicable to both. For a speech synthesis system with front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can serve as the analysis result of the speech synthesis front end, or assist in correcting that analysis result, and the analysis result is then sent to the speech synthesis back end to assist the speech synthesis of the original text. For a speech synthesis system without front-end preprocessing, the auxiliary synthesis feature corresponding to the matching text can directly serve as reference information when the speech synthesis system synthesizes the original text. For both types of speech synthesis systems, performing the speech synthesis of the original text with reference to the auxiliary synthesis feature of the present application enriches the reference information during speech synthesis and can thereby improve the quality of the synthesized speech.
In some embodiments of the present application, the auxiliary synthesis feature corresponding to the matching text mentioned above, and the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature, are described.
The auxiliary synthesis feature is a feature for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text. It contains pronunciation information of the pronunciation audio corresponding to the matching text, and this pronunciation information can assist the speech synthesis of the original text and improve its speech synthesis quality.
Several optional forms of the auxiliary synthesis feature are provided in this embodiment and are introduced separately below:
1) The auxiliary synthesis feature is the phoneme sequence corresponding to the matching text.
Specifically, speech synthesis systems can be divided into two types: with front-end preprocessing and without front-end preprocessing. A speech synthesis system with front-end preprocessing first performs front-end analysis on the original text before synthesizing it, for example predicting the phoneme sequence corresponding to the original text by querying a pronunciation dictionary; the speech synthesis back end then performs speech synthesis based on the original text and the phoneme sequence.
This processing can improve the quality of speech synthesis to a certain extent; however, when the pre-built pronunciation dictionary contains errors, the back end will synthesize incorrect speech.
For this reason, in this embodiment, the phoneme sequence corresponding to the matching text may be determined based on the pronunciation audio corresponding to the matching text and used as the auxiliary synthesis feature.
It can be understood that the pronunciation audio corresponding to the matching text is the correct pronunciation, so the correct phoneme sequence corresponding to the matching text can be extracted from the pronunciation audio. This correct phoneme sequence can serve as the auxiliary synthesis feature and participate in the speech synthesis process of the original text.
This embodiment provides an implementation of extracting the phoneme sequence from the pronunciation audio corresponding to the matching text.
As shown in FIG. 2, which illustrates a schematic diagram of a phoneme sequence extraction model architecture:
The present application may pre-train a phoneme sequence extraction model for extracting phoneme sequences from pronunciation audio.
The phoneme sequence extraction model may adopt an LSTM (long short-term memory) network architecture, or other optional network architectures such as HMM or CNN. FIG. 2 illustrates a phoneme sequence extraction model adopting an encoder-attention-decoder architecture.
The encoder uses an LSTM network to encode the audio feature sequence (x_1, x_2, ..., x_n) of the pronunciation audio into a hidden encoding sequence (h_1, h_2, ..., h_n). The decoder also uses an LSTM network: at decoding time t, the decoder hidden vector s_t is computed jointly from the input hidden state h_(t-1) at time t-1 and the context vector c_(t-1) computed by the attention module, and the phoneme y_t at time t is then obtained by projection. Decoding stops when the special end-of-sequence symbol is decoded, yielding the phoneme sequence (y_1, y_2, ..., y_t).
An illustrative example: when the matching text is "这件衣服不打折" ("this piece of clothing is not discounted"), the phoneme sequence extracted from the pronunciation audio corresponding to the matching text is: [zh e4 j ian4 i1 f u7 b u4 d a3 zh e2].
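As an illustration only, the following Python sketch mirrors the encoder-attention-decoder extraction loop described above; the dimensions, the dot-product attention and the greedy decoding are assumptions, and this is not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhonemeExtractor(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, n_phonemes=100, sos_id=0, eos_id=1):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.decoder_cell = nn.LSTMCell(hidden * 2, hidden)
        self.out = nn.Linear(hidden, n_phonemes)
        self.sos_id, self.eos_id = sos_id, eos_id

    def forward(self, audio_feats, max_len=100):
        # audio_feats: (1, n_frames, feat_dim) -> hidden encodings (h_1, ..., h_n)
        enc, _ = self.encoder(audio_feats)
        h = enc.new_zeros(1, enc.size(-1))
        c = enc.new_zeros(1, enc.size(-1))
        y = torch.full((1,), self.sos_id, dtype=torch.long)
        phonemes = []
        for _ in range(max_len):
            # Attention of the decoder state over the encoder outputs gives
            # the context vector for this decoding step.
            scores = torch.einsum("bh,bnh->bn", h, enc)
            ctx = (F.softmax(scores, dim=-1).unsqueeze(-1) * enc).sum(dim=1)
            h, c = self.decoder_cell(torch.cat([self.embed(y), ctx], dim=-1), (h, c))
            y = self.out(h).argmax(dim=-1)   # phoneme y_t obtained by projection
            if y.item() == self.eos_id:      # stop at the special end symbol
                break
            phonemes.append(y.item())
        return phonemes

extractor = PhonemeExtractor()
print(extractor(torch.randn(1, 120, 80)))  # untrained, so the output is arbitrary
```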
When the auxiliary synthesis feature is a phoneme sequence, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text.
Specifically, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the same text segment in the matching text and the original text may be acquired.
For example, the same text segment in the matching text and the original text is determined, and the phoneme sequence corresponding to that segment is then extracted from the phoneme sequence corresponding to the matching text.
Further, a pronunciation dictionary is queried to determine the phoneme sequences of the text segments in the original text other than the same text segment, and these are combined with the phoneme sequence corresponding to the same text segment to obtain the phoneme sequence of the original text.
Of course, the initial phoneme sequence corresponding to the original text may also be determined by querying the pronunciation dictionary, and the phoneme sequence corresponding to the same text segment, extracted from the phoneme sequence corresponding to the matching text, may be used to replace the phoneme sequence of that segment within the initial phoneme sequence, yielding the replaced phoneme sequence corresponding to the original text. A sketch of this strategy is shown below.
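The following plain Python sketch illustrates the combination strategy; the character-level tokenization, the toy dictionary entries (including the deliberately error-prone entry for "折") and the function name are assumptions made for illustration.

```python
from typing import Dict, List

def phonemes_for_original(original: str,
                          segment: str,
                          segment_phonemes: List[str],
                          dictionary: Dict[str, List[str]]) -> List[str]:
    """Build the original text's phoneme sequence, using audio-derived phonemes
    for the segment shared with the matching text and dictionary lookups for
    the remaining characters."""
    start = original.find(segment)
    phonemes: List[str] = []
    for i, ch in enumerate(original):
        if start != -1 and start <= i < start + len(segment):
            if i == start:
                phonemes.extend(segment_phonemes)  # correct, audio-derived phonemes
            continue
        phonemes.extend(dictionary.get(ch, []))    # dictionary phonemes elsewhere
    return phonemes

# Toy usage: the shared segment "打折" uses the phonemes extracted from the
# pronunciation audio instead of the (possibly wrong) dictionary entry.
dictionary = {"这": ["zh e4"], "条": ["t iao2"], "裤": ["k u4"], "子": ["z i7"],
              "不": ["b u4"], "打": ["d a3"], "折": ["sh e2"]}
print(phonemes_for_original("这条裤子不打折", "打折", ["d a3", "zh e2"], dictionary))
```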
S2. Perform speech synthesis on the original text based on its phoneme sequence to obtain synthesized speech.
Specifically, the phoneme sequence of the original text may be used as the text analysis result of the speech synthesis front end and sent to the speech synthesis back end to assist the speech synthesis of the original text.
Since the phoneme sequence of the original text obtained in this embodiment contains the phoneme sequence corresponding to the matching text, and that part of the sequence is determined based on the correct pronunciation audio corresponding to the matching text, assisting speech synthesis with the phoneme sequence of the original text can improve the accuracy of the synthesized speech; in particular, for polyphonic and error-prone characters, the accuracy of the synthesized speech is greatly improved.
2) The auxiliary synthesis feature is the prosody information corresponding to the matching text.
As introduced above, the speech synthesis front end can perform text analysis on the original text, and this text analysis process can also predict the prosody information of the original text, so that the synthesis back end performs speech synthesis based on the original text and the prosody information. By taking prosody information into account, the naturalness of the synthesized speech can be improved.
It can be understood that the prosody information predicted for the original text may also be wrong, which in turn leads to errors in the prosody of the back-end synthesized speech and affects the quality of the synthesized speech.
For this reason, in this embodiment, the prosody information corresponding to the matching text may be determined based on the pronunciation audio corresponding to the matching text and used as the auxiliary synthesis feature. Here, the prosody information corresponding to the matching text may be phoneme-level prosody information, which includes the prosody information of each phoneme unit in the phoneme sequence corresponding to the matching text.
It can be understood that the pronunciation audio corresponding to the matching text is the correct pronunciation, so the correct prosody information corresponding to the matching text can be extracted from the pronunciation audio. This correct prosody information can serve as the auxiliary synthesis feature and participate in the speech synthesis process of the original text. For example, the corrected prosody information of the original text is determined based on this correct prosody information and then sent to the synthesis back end for speech synthesis.
When the auxiliary synthesis feature is prosody information, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the prosody information of the original text based on the prosody information corresponding to the matching text.
Specifically, based on the prosody information corresponding to the matching text, the prosody information corresponding to the same text segment in the matching text and the original text may be acquired.
Further, prosody prediction technology may be used to predict the prosody information of the text segments in the original text other than the same text segment, and this is combined with the prosody information corresponding to the same text segment to obtain the prosody information of the original text.
S2. Perform speech synthesis on the original text based on its prosody information to obtain synthesized speech.
In another case, when the auxiliary synthesis feature contains both the phoneme sequence and the prosody information, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Determine the phoneme sequence and prosody information of the original text based on the phoneme sequence and prosody information corresponding to the matching text.
S2. Perform speech synthesis on the original text based on its phoneme sequence and prosody information to obtain synthesized speech.
3) The auxiliary synthesis feature is the phoneme-level prosody coding corresponding to the matching text.
Specifically, the phoneme-level prosody coding corresponding to the matching text contains some pronunciation information of the pronunciation audio corresponding to the matching text, such as prosodic features like pronunciation duration and stress or emphasis.
When performing speech synthesis, the speech synthesis back end can model the prosody information of the original text, thereby improving the naturalness of the synthesized speech. In this embodiment, in order to improve the accuracy with which the speech synthesis back end models the prosody information of the original text, the phoneme-level prosody coding corresponding to the matching text may be used as the auxiliary synthesis feature and sent to the speech synthesis back end to assist the speech synthesis.
It can be understood that the phoneme-level prosody coding corresponding to the matching text contains the correct pronunciation information corresponding to the matching text. When the speech synthesis back end performs speech synthesis based on the phoneme-level prosody coding corresponding to the matching text, it can, for the same text segment shared by the original text and the matching text, synthesize speech consistent with the pronunciation audio of the matching text.
Meanwhile, the speech synthesis back end performs convolution and other processing on the original text; for the text segments in the original text other than the same text segment, this processing also refers to the phoneme-level prosody coding corresponding to the same text segment, so that the phoneme-level prosody coding of the same text segment assists in improving the speech synthesis quality of the remaining text segments in the original text.
In addition, in some prior art, speech synthesis is performed only on the non-identical text segments in the original text, and the synthesized speech of the non-identical text segments is then spliced with preconfigured speech of the identical text segments to obtain the overall synthesized speech corresponding to the original text. This processing leads to inconsistent timbre in the overall synthesized speech of the original text and reduces the quality of the synthesized speech.
In contrast, the speech synthesis system of the present application remains a complete synthesis system; by performing speech synthesis on the original text as a whole, it can ensure that the timbre of the synthesized speech is consistent.
Further, depending on the different ways in which the speech synthesis back end models prosody information, the phoneme-level prosody coding in this embodiment may also differ.
FIG. 3 illustrates a schematic diagram of the synthesis flow of a speech synthesis back end.
As can be seen from FIG. 3, the speech synthesis back end includes a duration model and an acoustic model, which model the duration prosody information and the acoustic parameter prosody information respectively.
Then, in order to adapt to the model structure of the speech synthesis back end shown in FIG. 3, the phoneme-level prosody coding corresponding to the matching text in this embodiment of the present application may include duration coding and acoustic parameter coding.
When the prosody coding corresponding to the matching text is sent to the speech synthesis back end to assist speech synthesis, the duration coding may be sent to the duration model to assist phoneme-level duration modeling, and the acoustic parameter coding may be sent to the acoustic model to assist phoneme-level acoustic parameter modeling.
The acoustic parameter coding may include one or more different acoustic parameter codings, for example fundamental frequency acoustic parameter coding or other acoustic parameter codings.
On the basis that the auxiliary synthesis feature of the foregoing examples includes the phoneme sequence and prosody information, and further when the auxiliary synthesis feature also includes phoneme-level prosody coding, in the above step S120, the process of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may further include:
S3. Acquire, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the same text segment in the matching text and the original text.
Specifically, the same text segment in the matching text and the original text may be determined, and the phoneme-level prosody coding corresponding to that segment may then be extracted from the phoneme-level prosody coding corresponding to the matching text.
S4. During the speech synthesis of the original text, use the phoneme-level prosody coding corresponding to the same text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
Still taking FIG. 3 as an example:
The phoneme-level prosody coding includes duration coding and acoustic parameter coding.
Then, during the speech synthesis of the original text, the speech synthesis back end may send the duration coding corresponding to the same text segment into the duration model for phoneme-level duration modeling, and send the acoustic parameter coding corresponding to the same text segment into the acoustic model for phoneme-level acoustic parameter modeling, with the speech synthesis back end finally producing the synthesized speech.
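As an illustration only, the following sketch shows one way the duration coding and acoustic parameter coding of the same text segment could enter a duration model and an acoustic model as supplementary inputs; the zero vectors for phonemes not covered by the segment, the concatenation scheme and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class BackEnd(nn.Module):
    def __init__(self, phone_dim=64, dur_code_dim=8, ac_code_dim=8, feat_dim=80):
        super().__init__()
        # Stand-ins for the duration model and acoustic model of FIG. 3.
        self.duration_model = nn.Linear(phone_dim + dur_code_dim, 1)
        self.acoustic_model = nn.Linear(phone_dim + ac_code_dim, feat_dim)

    def forward(self, phone_emb, dur_code, ac_code):
        # phone_emb: (n_phonemes, phone_dim); the codes are zero everywhere
        # except at the positions of the segment shared with the matching text.
        durations = self.duration_model(torch.cat([phone_emb, dur_code], dim=-1))
        acoustics = self.acoustic_model(torch.cat([phone_emb, ac_code], dim=-1))
        return durations, acoustics

n = 7  # phonemes of the original text (toy example)
phone_emb = torch.randn(n, 64)
dur_code = torch.zeros(n, 8)
ac_code = torch.zeros(n, 8)
dur_code[5:7] = torch.randn(2, 8)  # prosody codes extracted for the shared segment
ac_code[5:7] = torch.randn(2, 8)
durations, acoustics = BackEnd()(phone_emb, dur_code, ac_code)
```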
4) The auxiliary synthesis feature is an acoustic feature of the pronunciation audio corresponding to the matching text.
As described above, speech synthesis systems fall into two types: with and without front-end preprocessing. A system without front-end preprocessing performs no front-end analysis of the original text and synthesizes speech from it directly. To better control the quality of the speech synthesized from the original text, in this embodiment the acoustic features of the pronunciation audio corresponding to the matching text may be used as auxiliary synthesis features and fed into the speech synthesis system to assist synthesis of the original text. These acoustic features carry the pronunciation information of the matching text's audio, so while synthesizing the original text frame by frame the speech synthesis system can extract from them the acoustic features associated with each frame to assist synthesis of that frame. This corrects pronunciation errors, for example those of error-prone rare characters, special symbols, polyphonic characters and loanwords, and ultimately yields higher-quality synthesized speech.
The acoustic features include, but are not limited to, cepstral features of the pronunciation audio.
When the auxiliary synthesis feature is an acoustic feature of the pronunciation audio corresponding to the matching text, the process of the above step S120 of performing speech synthesis on the original text with reference to the auxiliary synthesis feature may include:
S1. Process the original text with the speech synthesis model to obtain the context information for predicting the current speech frame.
Specifically, the speech synthesis model may adopt an encoder-decoder architecture, with an attention module connecting the encoder and decoder layers. Passing the original text through the encoder-decoder architecture and the attention module yields the context information C_t needed to synthesize the current speech frame y_t. The context information C_t indicates the text information in the original text that is needed to synthesize the current speech frame y_t.
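As a minimal sketch of how such a context vector might be computed, assuming a simple dot-product attention over the encoder outputs (the actual attention variant is not specified here):

```python
import numpy as np

def attention_context(decoder_state, encoder_outputs):
    """Dot-product attention: one context vector C_t for the current frame.

    decoder_state:   (d,)    current decoder hidden state s_t
    encoder_outputs: (T, d)  encoded representation of the original text
    """
    scores = encoder_outputs @ decoder_state   # (T,) alignment scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over text positions
    return weights @ encoder_outputs           # (d,) context vector C_t
```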
S2. Based on the context information, the matching text and the acoustic features of the pronunciation audio, determine the target acoustic feature needed to predict the current speech frame.
In an optional implementation, step S2 may include:
S21. Based on the context information, the matching text and the acoustic features of the pronunciation audio, obtain the degree of association between the context information and each frame of the acoustic features of the pronunciation audio.
Specifically, the similarity between the context information and the matching text can be obtained through the attention mechanism, and the association between each frame of acoustic features and the matching text can be obtained from the attention matrix of the pronunciation audio's acoustic features over the matching text. On that basis, from the similarity between the context information and the matching text together with the association between each frame of acoustic features and the matching text, the degree of association between the context information and each frame of acoustic features can be obtained; this degree of association indicates how close the context information is to each frame of acoustic features. Understandably, when the context information is strongly associated with the acoustic features of a target frame, the pronunciation of the text corresponding to the context information is strongly correlated with the acoustic features of that frame.
An optional implementation of step S21 is introduced next, and may include the following steps:
S211. Obtain a first attention weight matrix W_mx of the acoustic features of the pronunciation audio over the matching text.
The first attention weight matrix W_mx contains the attention weights of each frame of acoustic features over the text units of the matching text. The matrix W_mx has size T_my × T_mx, where T_my is the frame length of the acoustic features of the pronunciation audio and T_mx is the length of the matching text.
S212. Obtain a second attention weight matrix W_cmx of the context information C_t over the matching text.
The second attention weight matrix W_cmx contains the attention weights of the context information C_t over the text units of the matching text. The matrix W_cmx has size 1 × T_mx.
S213. Based on the first attention weight matrix W_mx and the second attention weight matrix W_cmx, obtain a third attention weight matrix W_cmy of the context information C_t over the acoustic features.
The third attention weight matrix W_cmy contains the attention weights of the context information C_t over each frame of acoustic features, which serve as the degrees of association between the context information and each frame of acoustic features. The matrix W_cmy has size 1 × T_my and can be expressed as:
W_cmy = W_cmx * W_mx′
where W_mx′ denotes the transpose of the matrix W_mx.
S22. Based on the degrees of association, determine the target acoustic feature needed to predict the current speech frame.
Specifically, after the degrees of association between the context information and each frame of the pronunciation audio's acoustic features have been obtained in the above steps, the degrees of association may first be normalized, and the normalized degrees may then be used as weights in a weighted sum of the per-frame acoustic features of the pronunciation audio, yielding the target acoustic feature needed to predict the current speech frame. The target acoustic feature may be denoted C_mt.
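The composition of the attention matrices and the subsequent normalization and pooling in steps S213 and S22 can be sketched in a few lines of NumPy; this is a minimal illustration of the formulas above, with the shapes as defined, not the patented implementation:

```python
import numpy as np

def target_acoustic_feature(W_cmx, W_mx, acoustic_frames):
    """Compose the attention matrices and pool the reference audio frames.

    W_cmx:           (1, T_mx)    context-over-matching-text weights
    W_mx:            (T_my, T_mx) audio-frames-over-matching-text weights
    acoustic_frames: (T_my, d)    per-frame acoustic features of the audio
    """
    W_cmy = W_cmx @ W_mx.T            # (1, T_my): association per frame
    w = np.exp(W_cmy - W_cmy.max())
    w /= w.sum()                      # softmax normalization of the degrees
    return (w @ acoustic_frames)[0]   # (d,): target acoustic feature C_mt
```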
S3. Based on the context information and the determined target acoustic feature, predict the current speech frame; after all speech frames have been predicted, the predicted frames make up the synthesized speech.
Understandably, although the original text and the matching text share matching text segments, the original text is not necessarily identical to the matching text. Consequently, the target acoustic feature C_mt obtained in the above steps for predicting the current speech frame is only of use when synthesizing the segments of the original text that are identical to the matching text; synthesizing the remaining segments does not need the target acoustic feature C_mt. This embodiment therefore provides a solution whereby, during speech synthesis of the original text, the amount of information drawn from the referenced target acoustic feature C_mt can be controlled for each speech frame to be predicted. The specific implementation may include:
S31. Based on the current hidden-layer vector of the speech synthesis model's decoder and the target acoustic feature C_mt, determine the fusion coefficient a_gate of the target acoustic feature C_mt for predicting the current speech frame.
Specifically, in this embodiment a gating mechanism or another strategy may be used to decide the fusion coefficient a_gate of the target acoustic feature C_mt when predicting the current speech frame. Taking the gating mechanism as an example, a_gate can be expressed as:
a_gate = sigmoid(g_g(C_mt, s_t))
where s_t is the current hidden-layer vector of the decoder and g_g(·) is a preset functional relationship.
S32. With reference to the fusion coefficient a_gate, fuse the target acoustic feature C_mt with the context information C_t, and predict the current speech frame based on the fusion result.
Specifically, the current speech frame y_t can be expressed as:
y_t = g(y_{t-1}, s_t, (1 - a_gate) * C_t + a_gate * C_mt)
where g(·) is a preset functional relationship.
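A minimal sketch of the gating of steps S31 and S32 follows; `W_gate` is a hypothetical learned projection standing in for the preset function g_g(·), and the frame-prediction function g(·) itself is omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(C_t, C_mt, s_t, W_gate):
    """Gate how much reference-audio information enters frame prediction.

    C_t, C_mt: (d,) text context and target acoustic feature (same dim
               assumed here for the convex combination); s_t: decoder state.
    """
    a_gate = sigmoid(W_gate @ np.concatenate([C_mt, s_t]))
    # Intuitively, for segments absent from the matching text the learned
    # gate should approach 0, so prediction falls back to the text context.
    return (1.0 - a_gate) * C_t + a_gate * C_mt
```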
Figure 4 illustrates a schematic architecture of a speech synthesis system.
The speech synthesis system of Figure 4 adopts an end-to-end synthesis flow of encoding and decoding plus an attention mechanism.
The original text is encoded by the encoder into an encoding vector of the original text, and the context information C_t needed to predict the current speech frame y_t is obtained through a first attention module.
The matching text is encoded by the encoder into an encoding vector of the matching text. Further, the attention weights of the context information C_t over the text units of the matching text can be obtained through a second attention module, forming the second attention weight matrix.
Meanwhile, in this embodiment the attention weights of the acoustic features of the matching text's pronunciation audio over the matching text can be obtained, forming the first attention weight matrix. Based on the first and second attention weight matrices, the third attention weight matrix of the context information C_t over the acoustic features is then obtained; it contains the degree of association between C_t and each frame of acoustic features. The third attention weight matrix is softmax-normalized and used in a weighted sum of the per-frame acoustic features of the pronunciation audio, yielding the target acoustic feature C_mt needed to predict the current speech frame y_t.
The decoder can then predict the current speech frame y_t based on the target acoustic feature C_mt and the context information C_t.
For the expression by which the decoder predicts the current speech frame y_t, refer to the related description above.
Each predicted speech frame is mapped to synthesized speech by a vocoder.
In some embodiments of the present application, the process of the foregoing step S110 of obtaining the auxiliary synthesis features corresponding to the matching text is introduced. Specifically, the process may include:
S1. Obtain a matching text that contains text segments matching the original text.
This embodiment provides two different implementations, introduced separately below:
1) In an optional implementation, a large number of fixed scripted texts in the speech synthesis scenario may be collected and recorded in advance; the collected scripted texts serve as template texts, and the template texts are stored together with their corresponding pronunciation audio. Alternatively, the auxiliary synthesis features are determined from the pronunciation audio of a template text, and the template text is then stored together with its auxiliary synthesis features.
On this basis, the implementation of step S1 may include:
S11. Among the preconfigured, stored template texts, determine the matching text that matches text segments in the original text.
Optionally, in this embodiment the collected template texts and their corresponding pronunciation audio may be organized and packaged into resource packages. Specifically, each resource package contains one template text and the auxiliary synthesis features corresponding to that template text, determined based on its pronunciation audio.
The auxiliary synthesis features may include the phoneme sequence and prosody information corresponding to the template text. Further, the auxiliary synthesis features may also include the phoneme-level prosody codes corresponding to the template text.
For example:
The template text is "欢迎使用人工智能语音助手" ("Welcome to the AI voice assistant").
Based on the pronunciation audio corresponding to the template text, the auxiliary synthesis features that can be determined may include the template text's phoneme sequence, prosody information, phoneme-level prosody codes, and the like. The template text and the auxiliary synthesis features can then be packaged into one resource package.
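For illustration only, a resource package of this kind might be represented as follows; the field names are hypothetical and the prosody annotation is truncated:

```python
from dataclasses import dataclass, field

@dataclass
class ResourcePackage:
    """One preconfigured package: a template text plus the auxiliary
    synthesis features derived from its recorded pronunciation audio."""
    template_text: str
    phoneme_sequence: list[str]
    prosody_info: str                               # annotated text, format below
    prosody_codes: list[list[float]] = field(default_factory=list)

pkg = ResourcePackage(
    template_text="欢迎使用人工智能语音助手",
    phoneme_sequence=["huan1", "ying2", "shi3", "yong4", "ren2", "gong1",
                      "zhi4", "neng2", "yu3", "yin1", "zhu4", "shou3"],
    prosody_info="欢[=huan1]迎[=ying2][w1]...",      # truncated for brevity
)
```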
Taking the prosody information of the template text as an example, an exemplary format can be as follows:
"欢[=huan1]迎[=ying2][w1]使[=shi3]用[=yong4][w3]人[=ren2]工[=gong1]智[=zhi4]能[=neng2][w1]语[=yu3]音[=yin1][w1]助[=zhu4]手[=shou3]"
Here the pronunciation of each character is specified by [=pinyin], and "[w1]" and "[w3]" denote different prosodic pause levels.
Understandably, the above is merely one way of representing prosody information exemplified in the present application; those skilled in the art may also use other markup formats to represent the prosody information of the template text.
A packaged resource package can be encoded into a binary resource file, reducing storage usage and facilitating processing and recognition by the subsequent speech synthesis system.
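A minimal parser for the exemplary markup above might look as follows; it assumes exactly the [=pinyin] and [wN] conventions shown and would need adapting for any other markup format:

```python
import re

def parse_prosody_markup(markup):
    """Parse markup of the form 字[=pinyin] with optional [wN] pauses.

    Returns (pinyins, pauses), where pauses[i] is the pause level after
    the i-th character (0 if no pause marker follows it).
    """
    pinyins, pauses = [], []
    for pinyin, pause in re.findall(r"\[=([a-z]+\d)\](?:\[w(\d)\])?", markup):
        pinyins.append(pinyin)
        pauses.append(int(pause) if pause else 0)
    return pinyins, pauses

# parse_prosody_markup("欢[=huan1]迎[=ying2][w1]") -> (["huan1", "ying2"], [0, 1])
```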
The process of determining the phoneme-level prosody codes corresponding to a template text is introduced with reference to Figure 5.
As shown in Figure 5, the phoneme-level prosody codes corresponding to the template text can be determined based on a coding prediction network and a generation network; specifically, the following steps may be included:
A1. Based on the template text and the corresponding pronunciation audio, extract phoneme-level prosody information.
A2. Input the template text and the phoneme-level prosody information into the coding prediction network to obtain predicted phoneme-level prosody codes.
A3. Input the predicted phoneme-level prosody codes and the template text into the generation network to obtain generated phoneme-level prosody information.
A4. Train the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information; when training ends, the phoneme-level prosody codes predicted by the trained coding prediction network are obtained as those corresponding to the template text.
Specifically, training toward this objective may consist of computing the mean square error (MSE) between the generated phoneme-level prosody information and the extracted phoneme-level prosody information, and adjusting the network parameters through iterative training; once the MSE reaches a preset threshold, training can end.
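A schematic training loop for steps A1–A4 might look as follows (PyTorch-style; `encoder` and `generator` stand for the coding prediction network and the generation network, whose architectures are not specified here and are assumed to accept these arguments):

```python
import torch
import torch.nn as nn

def train_prosody_codes(encoder, generator, text_feats, prosody_target,
                        threshold=1e-3, max_steps=10000, lr=1e-3):
    """Jointly train the coding prediction network and the generation network
    until the MSE between generated and extracted prosody information falls
    below a preset threshold, then return the predicted prosody codes."""
    params = list(encoder.parameters()) + list(generator.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(max_steps):
        codes = encoder(text_feats, prosody_target)   # predicted prosody codes
        generated = generator(codes, text_feats)      # reconstructed prosody
        loss = loss_fn(generated, prosody_target)     # MSE objective (A4)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if loss.item() < threshold:
            break
    return codes.detach()   # codes stored with the template's resource package
```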
Further, based on the above preconfigured resource packages, the implementation of the above step S11 of determining, among the preconfigured, stored template texts, the matching text that matches text segments in the original text may include:
S111. Perform a matching computation between the original text and the template text in each preconfigured resource package.
S112. In the template text contained in the resource package with the highest matching degree, determine the matching text that matches text segments in the original text.
Specifically, the matching computation may first determine whether a template text exists that completely matches the original text; if so, the completely matching template text is taken as the matching text. If not, partial matching may be performed, for example by starting from one or both ends of the original text and searching the template text of each resource package for the maximum-length matching text, which is taken as the matching text.
For example, suppose the original text is "请问您是王宁吗？" ("Excuse me, are you Wang Ning?"). When it is matched against the template texts in the resource packages, no identical template text is found, but the template text "请问您是刘武吗？" ("Excuse me, are you Liu Wu?") is matched. Maximum-length matching of the original text against this template text yields the matching texts "请问您是" and "吗？".
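The maximum-length matching from both ends illustrated by this example can be sketched as follows:

```python
def end_matches(original, template):
    """Maximum-length matching texts from both ends, as in the example above."""
    # Longest common prefix.
    p = 0
    while p < min(len(original), len(template)) and original[p] == template[p]:
        p += 1
    # Longest common suffix that does not overlap the prefix of the original.
    s = 0
    while (s < min(len(original), len(template)) - p
           and original[-1 - s] == template[-1 - s]):
        s += 1
    return original[:p], (original[len(original) - s:] if s else "")

# end_matches("请问您是王宁吗？", "请问您是刘武吗？") -> ("请问您是", "吗？")
```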
2) In another optional implementation, the present application may obtain data uploaded by a user. The uploaded data contains an uploaded text and the pronunciation audio corresponding to the uploaded text. The uploaded text contains text segments matching the original text, and may therefore be taken as the matching text.
In an optional scenario, after the original text to be synthesized is obtained in the above step S100, initial speech synthesis may be performed and the initial synthesized speech of the original text output. The initial speech synthesis of the original text may use any existing speech synthesis scheme or one that may appear in the future. After receiving the initial synthesized speech, the user can identify the incorrectly synthesized text segment in it and determine the correct pronunciation of that segment; the user may then upload the incorrectly synthesized text segment as the uploaded text and its correct pronunciation as the corresponding pronunciation audio, together forming the uploaded data. Alternatively, the user may obtain an extended text containing the incorrectly synthesized text segment of the initial synthesized speech, together with the correct pronunciation of the extended text, and upload the extended text as the uploaded text and its correct pronunciation as the corresponding pronunciation audio, together forming the uploaded data.
S2. Obtain the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
Specifically, as described above, if the pronunciation audio corresponding to the matching text can be obtained before speech synthesis of the original text, the auxiliary synthesis features may be determined in advance from that audio and stored locally or on a third-party device. In that case, obtaining the auxiliary synthesis features corresponding to the matching text in this step may consist of looking up the prestored auxiliary synthesis features in local or third-party storage.
In addition, if the pronunciation audio corresponding to the matching text is obtained on the fly during speech synthesis of the original text, obtaining the auxiliary synthesis features in this step may consist of determining them from the pronunciation audio once it has been obtained.
It should be noted that, if the matching text in the above step S1 is obtained in the first manner 1), that is, by matching the original text against the template text of each preconfigured resource package and determining, in the template text of the resource package with the highest matching degree, the matching text that matches text segments in the original text, then the implementation of the above step S2 may specifically include:
S21. Obtain the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
Understandably, a resource package contains the auxiliary synthesis features corresponding to its template text, such as the phoneme sequence, prosody information and phoneme-level prosody codes. Since the matching text is identical to the template text or is a partial segment of it, the auxiliary synthesis features corresponding to the matching text can be extracted from those corresponding to the template text.
Further, if the matching text in the above step S1 is obtained in the second manner 2), that is, by taking the uploaded text in the user's uploaded data as the matching text, then the implementation of the above step S2 may specifically include:
determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
The speech synthesis apparatus provided by the embodiments of the present application is described below; the apparatus described below and the speech synthesis method described above may be referred to correspondingly.
Referring to Figure 6, Figure 6 is a schematic structural diagram of a speech synthesis apparatus disclosed in an embodiment of the present application.
As shown in Figure 6, the apparatus may include:
an original text obtaining unit 11, configured to obtain the original text to be synthesized;
an auxiliary synthesis feature obtaining unit 12, configured to obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
an auxiliary speech synthesis unit 13, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features corresponding to the matching text may include:
obtaining a matching text that contains text segments matching the original text;
obtaining the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
Optionally, the auxiliary synthesis features may include:
the phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the phoneme-level prosody codes corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
and/or,
the acoustic features of the pronunciation audio corresponding to the matching text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains a matching text containing text segments matching the original text may include:
determining, among preconfigured template texts, the matching text that matches text segments in the original text.
Optionally, the preconfigured template texts may include:
the template texts in the preconfigured resource packages, where each resource package contains one template text and the auxiliary synthesis features corresponding to that template text, determined based on the pronunciation audio corresponding to the template text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit determines, among the preconfigured template texts, the matching text that matches text segments in the original text may include:
performing a matching computation between the original text and the template text in each preconfigured resource package;
determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches text segments in the original text.
Optionally, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text may include:
obtaining the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
Optionally, the apparatus of the present application may further include a resource package configuration unit, configured to configure resource packages; the process may include:
obtaining a preconfigured template text and the corresponding pronunciation audio;
determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
taking the phoneme sequence and prosody information as the auxiliary synthesis features corresponding to the template text, and organizing the auxiliary synthesis features and the template text into one resource package.
Optionally, the resource package configuration unit's process of configuring resource packages may further include:
determining, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody codes corresponding to the template text;
merging the phoneme-level prosody codes into the resource package.
Optionally, the process by which the resource package configuration unit determines, based on the template text and the corresponding pronunciation audio, the phoneme-level prosody codes corresponding to the template text may include:
extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody codes;
inputting the predicted phoneme-level prosody codes and the template text into a generation network to obtain generated phoneme-level prosody information;
training the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and, when training ends, obtaining the phoneme-level prosody codes predicted by the trained coding prediction network.
In another optional case, the process by which the auxiliary synthesis feature obtaining unit obtains a matching text containing text segments matching the original text may include:
obtaining the uploaded text in uploaded data as the matching text, where the uploaded data further includes the pronunciation audio corresponding to the uploaded text, and the uploaded text contains text segments matching the original text. On this basis, the process by which the auxiliary synthesis feature obtaining unit obtains the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text may include:
determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
Optionally, the apparatus of the present application may further include an initial synthesized speech output unit, configured to output the initial synthesized speech of the original text before the uploaded text in the uploaded data is obtained. On this basis, the uploaded text is the incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of that incorrectly synthesized text segment; or, the uploaded text is an extended text containing the incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
Optionally, when the auxiliary synthesis features include the phoneme sequence and/or prosody information corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may include:
determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
and/or,
determining the prosody information of the original text based on the prosody information corresponding to the matching text;
performing speech synthesis on the original text based on the phoneme sequence and/or prosody information of the original text to obtain synthesized speech.
Further optionally, when the auxiliary synthesis features also include the phoneme-level prosody codes corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may further include:
obtaining, based on the phoneme-level prosody codes corresponding to the matching text, the phoneme-level prosody codes corresponding to the text segments shared by the matching text and the original text;
during speech synthesis of the original text, using the phoneme-level prosody codes corresponding to the shared text segments as a supplementary input of the speech synthesis model to obtain synthesized speech.
Optionally, the process by which the auxiliary speech synthesis unit determines the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text may include:
obtaining, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the text segments shared by the matching text and the original text;
querying a pronunciation dictionary to determine the phoneme sequences of the remaining text segments of the original text other than the shared segments, and combining them with the phoneme sequences corresponding to the shared segments to obtain the phoneme sequence of the original text.
Optionally, when the auxiliary synthesis features include the acoustic features of the pronunciation audio corresponding to the matching text, the process by which the auxiliary speech synthesis unit performs speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech may include:
processing the original text with the speech synthesis model to obtain the context information for predicting the current speech frame;
determining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature needed to predict the current speech frame;
predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
Optionally, the process by which the auxiliary speech synthesis unit determines, based on the context information, the matching text and the acoustic features of the pronunciation audio, the target acoustic feature needed to predict the current speech frame may include:
obtaining, based on the context information, the matching text and the acoustic features of the pronunciation audio, the degree of association between the context information and each frame of the acoustic features of the pronunciation audio;
determining, based on the degrees of association, the target acoustic feature needed to predict the current speech frame.
Optionally, the process by which the auxiliary speech synthesis unit obtains the degree of association between the context information and each frame of the acoustic features of the pronunciation audio may include:
obtaining a first attention weight matrix of the acoustic features of the pronunciation audio over the matching text, the first attention weight matrix containing the attention weights of each frame of acoustic features over the text units of the matching text;
obtaining a second attention weight matrix of the context information over the matching text, the second attention weight matrix containing the attention weights of the context information over the text units of the matching text;
obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information over the acoustic features, the third attention weight matrix containing the attention weights of the context information over each frame of acoustic features, which serve as the degrees of association between the context information and each frame of acoustic features.
Optionally, the process by which the auxiliary speech synthesis unit determines, based on the degrees of association, the target acoustic feature needed to predict the current speech frame may include:
normalizing the degrees of association and, using the normalized degrees as weights, computing a weighted sum of the per-frame acoustic features of the pronunciation audio to obtain the target acoustic feature.
Optionally, the process by which the auxiliary speech synthesis unit predicts the current speech frame based on the context information and the determined target acoustic feature may include:
determining, based on the current hidden-layer vector of the speech synthesis model's decoder and the target acoustic feature, the fusion coefficient of the target acoustic feature for predicting the current speech frame;
fusing the target acoustic feature with the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
The speech synthesis apparatus provided by the embodiments of the present application can be applied to speech synthesis devices, such as terminals: mobile phones, computers, and the like. Optionally, Figure 7 shows a block diagram of the hardware structure of a speech synthesis device. Referring to Figure 7, the hardware structure of the speech synthesis device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
In this embodiment of the present application, there is at least one of each of the processor 1, the communication interface 2, the memory 3 and the communication bus 4, and the processor 1, the communication interface 2 and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention;
The memory 3 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor may invoke the program stored in the memory, the program being used to:
obtain the original text to be synthesized;
obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, for the refined and extended functions of the program, refer to the description above.
An embodiment of the present application further provides a storage medium storing a program executable by a processor, the program being used to:
obtain the original text to be synthesized;
obtain auxiliary synthesis features corresponding to a matching text, where the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on the pronunciation audio corresponding to the matching text;
perform speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
Optionally, for the refined and extended functions of the program, refer to the description above.
Further, an embodiment of the present application provides a computer program product which, when run on a terminal device, causes the terminal device to execute any implementation of the above speech synthesis method.
Finally, it should also be noted that, herein, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments; the embodiments may be combined as needed, and for the same or similar parts reference may be made between them.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (25)

  1. A speech synthesis method, characterized by comprising:
    obtaining an original text to be synthesized;
    obtaining auxiliary synthesis features corresponding to a matching text, wherein the matching text contains text segments matching the original text, and the auxiliary synthesis features are features for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
    performing speech synthesis on the original text with reference to the auxiliary synthesis features to obtain synthesized speech.
  2. The method according to claim 1, characterized in that the obtaining of the auxiliary synthesis features corresponding to the matching text comprises:
    obtaining a matching text that contains text segments matching the original text;
    obtaining the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text.
  3. The method according to claim 1 or 2, characterized in that the auxiliary synthesis features comprise:
    a phoneme sequence corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    prosody information corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    phoneme-level prosody codes corresponding to the matching text, determined based on the pronunciation audio corresponding to the matching text;
    and/or,
    acoustic features of the pronunciation audio corresponding to the matching text.
  4. The method according to claim 2, characterized in that the obtaining of a matching text containing text segments matching the original text comprises:
    determining, among preconfigured template texts, the matching text that matches text segments in the original text.
  5. The method according to claim 2, characterized in that the obtaining of a matching text containing text segments matching the original text comprises:
    obtaining an uploaded text in uploaded data as the matching text, wherein the uploaded data further includes pronunciation audio corresponding to the uploaded text, and the uploaded text contains text segments matching the original text.
  6. The method according to claim 4, characterized in that the preconfigured template texts comprise:
    the template texts in preconfigured resource packages, wherein each resource package contains one template text and auxiliary synthesis features corresponding to the template text, determined based on the pronunciation audio corresponding to the template text.
  7. The method according to claim 6, characterized in that the determining, among the preconfigured template texts, of the matching text that matches text segments in the original text comprises:
    performing a matching computation between the original text and the template text in each preconfigured resource package;
    determining, in the template text contained in the resource package with the highest matching degree, the matching text that matches text segments in the original text.
  8. The method according to claim 7, characterized in that the obtaining of the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text comprises:
    obtaining the auxiliary synthesis features corresponding to the matching text contained in the resource package with the highest matching degree.
  9. The method according to any one of claims 6 to 8, characterized in that the process of determining a preconfigured resource package comprises:
    obtaining a preconfigured template text and corresponding pronunciation audio;
    determining, based on the pronunciation audio, the phoneme sequence and prosody information corresponding to the template text;
    taking the phoneme sequence and prosody information as the auxiliary synthesis features corresponding to the template text, and organizing the auxiliary synthesis features and the template text into one resource package.
  10. The method according to claim 9, characterized in that the process of determining a preconfigured resource package further comprises:
    determining, based on the template text and the corresponding pronunciation audio, phoneme-level prosody codes corresponding to the template text;
    merging the phoneme-level prosody codes into the resource package.
  11. The method according to claim 10, characterized in that the determining, based on the template text and the corresponding pronunciation audio, of the phoneme-level prosody codes corresponding to the template text comprises:
    extracting phoneme-level prosody information based on the template text and the corresponding pronunciation audio;
    inputting the template text and the phoneme-level prosody information into a coding prediction network to obtain predicted phoneme-level prosody codes;
    inputting the predicted phoneme-level prosody codes and the template text into a generation network to obtain generated phoneme-level prosody information;
    training the coding prediction network and the generation network with the objective of making the generated phoneme-level prosody information approach the extracted phoneme-level prosody information, and, when training ends, obtaining the phoneme-level prosody codes predicted by the trained coding prediction network.
  12. The method according to claim 5, characterized in that, before the obtaining of the uploaded text in the uploaded data, the method further comprises:
    obtaining and outputting an initial synthesized speech of the original text;
    wherein the uploaded text is an incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the incorrectly synthesized text segment;
    or, the uploaded text is an extended text containing an incorrectly synthesized text segment of the initial synthesized speech, and the pronunciation audio corresponding to the uploaded text is the correct pronunciation of the extended text.
  13. The method according to claim 5 or 12, characterized in that the obtaining of the auxiliary synthesis features determined based on the pronunciation audio corresponding to the matching text comprises:
    determining the auxiliary synthesis features corresponding to the matching text based on the pronunciation audio corresponding to the matching text in the uploaded data.
  14. The method according to claim 3, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech comprises:
    determining the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text;
    and/or,
    determining the prosody information of the original text based on the prosody information corresponding to the matching text;
    performing speech synthesis on the original text based on the phoneme sequence and/or the prosody information of the original text to obtain synthesized speech.
  15. The method according to claim 14, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech further comprises:
    acquiring, based on the phoneme-level prosody coding corresponding to the matching text, the phoneme-level prosody coding corresponding to the text segment shared by the matching text and the original text;
    during speech synthesis of the original text, using the phoneme-level prosody coding corresponding to the shared text segment as a supplementary input of the speech synthesis model to obtain synthesized speech.
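
As a hedged sketch of claim 15's supplementary input, the phoneme-level prosody codes retrieved for the shared text segment might be aligned to the original text's phoneme positions, with zero vectors elsewhere; the zero-padding convention and all names here are assumptions, not claim requirements.

```python
# Illustrative alignment of claim-15's supplementary prosody codes with the
# original text's phonemes; zero vectors outside the shared segment are an
# assumption, not something the claims require.
import numpy as np

def build_supplementary_input(num_phonemes, shared_span, shared_codes):
    """num_phonemes: phoneme count of the original text.
    shared_span: (start, end) phoneme indices covered by the matching text.
    shared_codes: array of shape (end - start, code_dim) from the resource package."""
    start, end = shared_span
    codes = np.zeros((num_phonemes, shared_codes.shape[1]), dtype=np.float32)
    codes[start:end] = shared_codes  # inject codes only where the texts overlap
    return codes  # concatenated to the synthesis model's encoder input
```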
  16. The method according to claim 14, wherein the determining of the phoneme sequence of the original text based on the phoneme sequence corresponding to the matching text comprises:
    acquiring, based on the phoneme sequence corresponding to the matching text, the phoneme sequence corresponding to the text segment shared by the matching text and the original text;
    querying a pronunciation dictionary to determine the phoneme sequences of the remaining text segments of the original text other than the shared text segment, and combining these with the phoneme sequence corresponding to the shared text segment to obtain the phoneme sequence of the original text.
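
A minimal sketch of the splicing step in claim 16, assuming word-level alignment and a simple dict-based pronunciation lexicon; the phoneme notation and helper names are hypothetical.

```python
# Sketch of claim 16: reuse the matched segment's phonemes and fill the rest
# from a pronunciation dictionary, preserving original text order.
def build_phoneme_sequence(original_words, shared_span, shared_phonemes, lexicon):
    """original_words: tokenized original text.
    shared_span: (start, end) word indices that also occur in the matching text.
    shared_phonemes: phoneme list for those words, taken from the matching text.
    lexicon: dict mapping a word to its phoneme list."""
    start, end = shared_span
    pieces = []
    for i, word in enumerate(original_words):
        if start <= i < end:
            continue                      # covered by the shared segment below
        pieces.append((i, lexicon[word])) # dictionary lookup for remaining words
    pieces.append((start, shared_phonemes))  # phonemes inherited from matching text
    pieces.sort(key=lambda p: p[0])          # restore original text order
    return [ph for _, seq in pieces for ph in seq]
```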
  17. The method according to claim 3, wherein the performing, with reference to the auxiliary synthesis feature, of speech synthesis on the original text to obtain synthesized speech comprises:
    processing the original text with a speech synthesis model to obtain context information for predicting a current speech frame;
    determining, based on the context information, the matching text, and acoustic features of the pronunciation audio, a target acoustic feature required for predicting the current speech frame;
    predicting the current speech frame based on the context information and the determined target acoustic feature, and, after all speech frames have been predicted, composing the synthesized speech from the predicted speech frames.
  18. The method according to claim 17, wherein the determining, based on the context information, the matching text, and the acoustic features of the pronunciation audio, of the target acoustic feature required for predicting the current speech frame comprises:
    acquiring, based on the context information, the matching text, and the acoustic features of the pronunciation audio, the degree of association between the context information and each frame of the acoustic features of the pronunciation audio;
    determining, based on the degrees of association, the target acoustic feature required for predicting the current speech frame.
  19. The method according to claim 18, wherein the acquiring of the degree of association between the context information and each frame of the acoustic features of the pronunciation audio comprises:
    acquiring a first attention weight matrix of the acoustic features of the pronunciation audio over the matching text, the first attention weight matrix including the attention weight of each frame of the acoustic features over each text unit in the matching text;
    acquiring a second attention weight matrix of the context information over the matching text, the second attention weight matrix including the attention weights of the context information over each text unit in the matching text;
    obtaining, based on the first attention weight matrix and the second attention weight matrix, a third attention weight matrix of the context information over the acoustic features, the third attention weight matrix including the attention weight of the context information over each frame of the acoustic features, taken as the degree of association between the context information and that frame of the acoustic features.
  20. The method according to claim 18, wherein the determining, based on the degrees of association, of the target acoustic feature required for predicting the current speech frame comprises:
    normalizing the degrees of association, and, using the normalized degrees of association as weights, performing a weighted sum over the frames of the acoustic features of the pronunciation audio to obtain the target acoustic feature.
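
Claims 19 and 20 together amount to composing two text-anchored attention matrices into a context-to-audio association and then averaging the reference audio's frames under those weights. The sketch below assumes a softmax normalization, which the claims do not mandate; all names are illustrative.

```python
# Hedged sketch of claims 19-20: compose the context-to-frame association
# from two text-anchored attention matrices, then take a weighted sum of
# the pronunciation audio's acoustic frames.
import numpy as np

def target_acoustic_feature(attn_audio_to_text, attn_ctx_to_text, acoustic_frames):
    """attn_audio_to_text: (F, U), weight of each audio frame over each text unit.
    attn_ctx_to_text: (U,), weight of the current context over each text unit.
    acoustic_frames: (F, D), acoustic features of the pronunciation audio."""
    assoc = attn_audio_to_text @ attn_ctx_to_text  # (F,), per-frame association
    assoc = np.exp(assoc - assoc.max())            # numerically stable softmax
    weights = assoc / assoc.sum()                  # normalized association degrees
    return weights @ acoustic_frames               # (D,), target acoustic feature
```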
  21. The method according to any one of claims 17 to 20, wherein the predicting of the current speech frame based on the context information and the determined target acoustic feature comprises:
    determining, based on the current hidden-layer vector of the decoder of the speech synthesis model and the target acoustic feature, a fusion coefficient for the target acoustic feature when predicting the current speech frame;
    fusing the target acoustic feature and the context information with reference to the fusion coefficient, and predicting the current speech frame based on the fusion result.
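
A hedged sketch of the fusion in claim 21, assuming a scalar sigmoid gate computed from the decoder state and the target acoustic feature; the claims do not fix the functional form of the fusion coefficient, and all module names and dimensions are assumptions.

```python
# Illustrative gated fusion for claim 21 (PyTorch); a sigmoid gate is one
# plausible realization of the fusion coefficient, not the patent's design.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, hidden_dim, acoustic_dim, ctx_dim, out_dim):
        super().__init__()
        # fusion coefficient from decoder hidden vector + target acoustic feature
        self.gate = nn.Sequential(nn.Linear(hidden_dim + acoustic_dim, 1), nn.Sigmoid())
        self.acoustic_proj = nn.Linear(acoustic_dim, out_dim)
        self.ctx_proj = nn.Linear(ctx_dim, out_dim)

    def forward(self, decoder_hidden, target_acoustic, context):
        g = self.gate(torch.cat([decoder_hidden, target_acoustic], dim=-1))
        fused = g * self.acoustic_proj(target_acoustic) + (1 - g) * self.ctx_proj(context)
        return fused  # fed to the frame predictor for the current speech frame
```

The gate lets the model lean on the reference audio where the association is strong (e.g., over the shared text segment) and fall back to the plain synthesis context elsewhere.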
  22. A speech synthesis apparatus, comprising:
    an original text acquiring unit, configured to acquire original text to be synthesized;
    an auxiliary synthesis feature acquiring unit, configured to acquire an auxiliary synthesis feature corresponding to matching text, wherein the matching text has a text segment matching the original text, and the auxiliary synthesis feature is a feature for assisting speech synthesis determined based on pronunciation audio corresponding to the matching text;
    an auxiliary speech synthesis unit, configured to perform speech synthesis on the original text with reference to the auxiliary synthesis feature to obtain synthesized speech.
  23. A speech synthesis device, comprising a memory and a processor;
    the memory being configured to store a program;
    the processor being configured to execute the program to implement the steps of the speech synthesis method according to any one of claims 1 to 21.
  24. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the speech synthesis method according to any one of claims 1 to 21.
  25. A computer program product which, when run on a terminal device, causes the terminal device to execute the steps of the speech synthesis method according to any one of claims 1 to 21.
PCT/CN2021/071672 2020-12-30 2021-01-14 Speech synthesis method and apparatus, device, and storage medium WO2022141671A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011607966.3 2020-12-30
CN202011607966.3A CN112802444B (en) 2020-12-30 2020-12-30 Speech synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022141671A1 true WO2022141671A1 (en) 2022-07-07

Family

ID=75804405

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/071672 WO2022141671A1 (en) 2020-12-30 2021-01-14 Speech synthesis method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN112802444B (en)
WO (1) WO2022141671A1 (en)


Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421547B (en) * 2021-06-03 2023-03-17 华为技术有限公司 Voice processing method and related equipment
CN113672144A (en) * 2021-09-06 2021-11-19 北京搜狗科技发展有限公司 Data processing method and device
CN114373445B (en) * 2021-12-23 2022-10-25 北京百度网讯科技有限公司 Voice generation method and device, electronic equipment and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071300B (en) * 2020-11-12 2021-04-06 深圳追一科技有限公司 Voice conversation method, device, computer equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101171624A (en) * 2005-03-11 2008-04-30 株式会社建伍 Speech synthesis device, speech synthesis method, and program
US20200335080A1 (en) * 2017-10-31 2020-10-22 Sk Telecom Co., Ltd. Speech synthesis apparatus and method
CN109102796A (en) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 A kind of phoneme synthesizing method and device
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN111816158A (en) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 Voice synthesis method and device and storage medium
CN111930900A (en) * 2020-09-28 2020-11-13 北京世纪好未来教育科技有限公司 Standard pronunciation generating method and related device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117765926A (en) * 2024-02-19 2024-03-26 上海蜜度科技股份有限公司 Speech synthesis method, system, electronic equipment and medium
CN117765926B (en) * 2024-02-19 2024-05-14 上海蜜度科技股份有限公司 Speech synthesis method, system, electronic equipment and medium

Also Published As

Publication number Publication date
CN112802444B (en) 2023-07-25
CN112802444A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
WO2022141671A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
JP6550068B2 (en) Pronunciation prediction in speech recognition
US20230043916A1 (en) Text-to-speech processing using input voice characteristic data
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
US11410684B1 (en) Text-to-speech (TTS) processing with transfer of vocal characteristics
CN106971709B (en) Statistical parameter model establishing method and device and voice synthesis method and device
WO2019165748A1 (en) Speech translation method and apparatus
CN108899009B (en) Chinese speech synthesis system based on phoneme
US6119086A (en) Speech coding via speech recognition and synthesis based on pre-enrolled phonetic tokens
CN111667812A (en) Voice synthesis method, device, equipment and storage medium
Peyser et al. Improving performance of end-to-end ASR on numeric sequences
CN112634856A (en) Speech synthesis model training method and speech synthesis method
WO2021051765A1 (en) Speech synthesis method and apparatus, and storage medium
CN111930900B (en) Standard pronunciation generating method and related device
WO2024088262A1 (en) Data processing system and method for speech recognition model, and speech recognition method
CN113327574A (en) Speech synthesis method, device, computer equipment and storage medium
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
CN114944149A (en) Speech recognition method, speech recognition apparatus, and computer-readable storage medium
CN113327575B (en) Speech synthesis method, device, computer equipment and storage medium
CN112908293B (en) Method and device for correcting pronunciations of polyphones based on semantic attention mechanism
CN113450760A (en) Method and device for converting text into voice and electronic equipment
US20230360633A1 (en) Speech processing techniques
WO2023116243A1 (en) Data conversion method and computer storage medium
WO2021231050A1 (en) Automatic audio content generation

Legal Events

Date Code Title Description

121 Ep: the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21912454
    Country of ref document: EP
    Kind code of ref document: A1

NENP Non-entry into the national phase
    Ref country code: DE

122 Ep: PCT application non-entry in European phase
    Ref document number: 21912454
    Country of ref document: EP
    Kind code of ref document: A1