WO2021127817A1 - Multilingual text-to-speech synthesis method, apparatus, device, and storage medium - Google Patents

Multilingual text-to-speech synthesis method, apparatus, device, and storage medium

Info

Publication number
WO2021127817A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
encoding
joint
multilingual
synthesized
Prior art date
Application number
PCT/CN2019/127334
Other languages
English (en)
French (fr)
Inventor
黄东延
盛乐园
熊友军
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司
Priority to CN201980003170.6A (CN111247581B)
Priority to PCT/CN2019/127334
Publication of WO2021127817A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04: Speech or audio analysis-synthesis using predictive techniques
    • G10L19/16: Vocoder architecture
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03: Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18: Speech or voice analysis techniques characterised by extracted parameters being spectral information of each sub-band
    • G10L25/24: Speech or voice analysis techniques characterised by extracted parameters being the cepstrum
    • G10L25/27: Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30: Speech or voice analysis techniques characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of speech technology, and in particular to a method, device, equipment and storage medium for speech synthesis of multilingual text.
  • Speech synthesis is an important task in speech interaction. Its goal is to synthesize text information into natural speech that sounds as if spoken by a real person.
  • the traditional speech synthesis system consists of two parts: the front end and the back end.
  • the function of the front end is to analyze the text and extract linguistic information, such as word segmentation, part-of-speech tagging, prosodic structure prediction, etc.
  • the back end is to synthesize speech from the linguistic information obtained from the front end.
  • end-to-end speech synthesis systems such as Tacotron (end-to-end deep learning speech synthesis model) and Tacotron2, which use neural networks to simplify the front-end of traditional speech synthesis.
  • Tacotron and Tacotron2 first generate spectral features (mel spectrograms) directly from the text, and then use a vocoder such as Griffin-Lim (an audio generation model based on the Griffin-Lim algorithm) or WaveNet (a raw-audio generation model) to synthesize the spectral features into speech.
  • This end-to-end model based on neural network greatly improves the synthesized speech quality.
  • the end-to-end model here refers to a sequence-to-sequence model with an attention mechanism.
  • the text sequence is mapped to the semantic space using an encoder and a series of encoder hidden states are generated, and then the decoder uses the attention mechanism to use the hidden states of these semantic spaces as context information, constructs the hidden states of the decoder, and then outputs the spectral feature frame.
  • the attention mechanism often includes recurrent neural networks.
  • a recurrent neural network can generate an output sequence from an input sequence, where the current output is determined jointly by all previous outputs and the current hidden state. For a particular spectral frame, insufficient encoder input information or insufficient encoding may leave it deviating from the target even after many recurrent steps. In the synthesized speech, this may sound like missing or skipped words.
  • although such single-language speech synthesis systems can already meet daily needs in most scenarios, some specific scenarios, such as robots and translation devices, require a multilingual speech synthesis system. Training a separate system for each language would make model deployment very costly. It is therefore particularly important to develop a multilingual text-to-speech synthesis method that does not miss or skip words and is simple to deploy.
  • the present invention provides a multilingual text speech synthesis method, the method includes:
  • the predicted frequency spectrum feature is input to the vocoder for synthesis processing to obtain the target speech corresponding to the multilingual text to be synthesized.
  • said converting all text encodings corresponding to the encoding rules into joint text encodings includes:
  • the spliced text encoding is subjected to linear affine transformation to obtain a joint text encoding.
  • the inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features includes:
  • the high-level features of the joint text encoding and the standard spectrum feature data are input to a decoder for predictive decoding to obtain the predicted spectrum feature.
  • the inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding to obtain the text encoding corresponding to the encoding rules includes:
  • the multilingual text to be synthesized is input into a phoneme encoder for encoding, and a phoneme text encoding corresponding to the phoneme encoder is obtained.
  • the splicing all text codes corresponding to the encoding rules to obtain spliced text codes includes:
  • the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding are spliced in the channel dimension to obtain spliced text encoding.
  • the spliced text encoding includes three-dimensional data, wherein the first dimension is the One-hot text encoding, the second dimension is the UTF-8 text encoding, and the third dimension is the phoneme text encoding.
  • the performing linear affine transformation on the spliced text encoding to obtain a joint text encoding includes:
  • the multi-dimensional spliced text code is input into the first neural network to perform linear affine transformation to select the text code corresponding to the coding rule to obtain a joint text code.
  • the performing high-level feature extraction on the joint text coding to obtain high-level features of the joint text coding includes:
  • the joint text coding is input into the second neural network for high-level feature extraction, and the joint text coding high-level features are obtained.
  • the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
  • the inputting the joint text encoding high-level features and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral features includes:
  • the third neural network of the decoder performs spectrum feature prediction according to the joint text coding, the standard spectrum feature data and the attention mechanism, and obtains the predicted spectrum feature.
  • the method before acquiring the multilingual text to be synthesized, the method further includes:
  • the present invention also provides a multi-language text speech synthesis device, the device includes:
  • the joint encoding module is used to obtain the multilingual text to be synthesized, input the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding to obtain the text encodings corresponding to the encoding rules, and convert all text encodings corresponding to the encoding rules into a joint text encoding;
  • the speech synthesis module is used to input the joint text encoding and standard spectral feature data into the decoder for predictive decoding to obtain the predicted spectral features, and to input the predicted spectral features into the vocoder for synthesis processing to obtain the target speech corresponding to the multilingual text to be synthesized.
  • the joint coding module includes a separate coding sub-module and a joint coding sub-module
  • the separate encoding sub-module is used to obtain the multilingual text to be synthesized, and input the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain the text encoding corresponding to the encoding rule;
  • the joint coding submodule is used to splice all text codes corresponding to the coding rules to obtain a spliced text code, and perform linear affine transformation on the spliced text code to obtain a joint text code.
  • the speech synthesis module includes a high-level feature extraction sub-module and a spectral feature prediction sub-module;
  • the high-level feature extraction submodule is used to perform high-level feature extraction on the joint text coding to obtain high-level features of the joint text coding;
  • the spectral feature prediction submodule is used to input the high-level features of the joint text encoding and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral feature.
  • the present invention also provides a storage medium storing a computer instruction program, which when executed by a processor causes the processor to execute the steps of any one of the methods described in the first aspect.
  • the present invention also provides a multilingual text speech synthesis device, including at least one memory and at least one processor, the memory storing a computer instruction program which, when executed by the processor, causes the processor to execute the steps of any one of the methods in the first aspect.
  • the multilingual text speech synthesis method of the present invention inputs the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and then converts all text encodings corresponding to the encoding rules into a joint text encoding.
  • encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment. Therefore, the present invention retains the characteristics of the text more fully, facilitates the processing of multilingual text, and reduces the difficulty and cost of deployment.
  • FIG. 1 is a flowchart of a method for speech synthesis of multilingual text in an embodiment
  • FIG. 2 is a flowchart of determining joint text encoding of the multilingual text speech synthesis method of FIG. 1;
  • Fig. 3 is a flowchart of predictive decoding of the multilingual text speech synthesis method of Fig. 1;
  • Figure 4 is a flowchart of a method for speech synthesis of multilingual text in another embodiment
  • Figure 5 is a structural block diagram of a multilingual text speech synthesis device in an embodiment
  • FIG. 6 is a structural block diagram of the joint coding module of the multi-language text speech synthesis device of FIG. 5;
  • Fig. 7 is a structural block diagram of a speech synthesis module of the multilingual text speech synthesis device of Fig. 5;
  • Fig. 8 is a structural block diagram of a computer device in an embodiment.
  • a method for speech synthesis of multilingual text includes:
  • the multilingual text refers to the text containing at least two types of languages at the same time.
  • the multilingual text includes a mixture of Chinese, English, French, and Arabic numerals, and the examples here are not specifically limited.
  • a multilingual text is obtained from a text input device, a database, or a network and used as the multilingual text to be synthesized, so that it can be synthesized into speech; the content expressed by the multilingual text before synthesis and by the speech after synthesis does not change.
  • the user can trigger the input of text through the text input device.
  • the text input device starts to collect text
  • the text input device stops collecting text, so that the text input device can collect a piece of text.
  • the multi-language text to be synthesized is sequentially inputted into at least two encoders with different encoding rules in the reading order to be encoded, and the text encoding corresponding to the encoding rule is obtained.
  • the number of encoders with different encoding rules can be two, three, four, five, six, or seven, which is not specifically limited in this example. It is understandable that encoders with different encoding rules capture language features of different dimensions when encoding.
  • encoding with at least two such encoders therefore captures the language features of the multilingual text to be synthesized fully from multiple dimensions, avoiding the problem of insufficient language features or insufficient output information from a single encoder.
  • each encoder needs to separately encode the multilingual text to be synthesized in the reading order.
  • the characters or glyphs in the multi-language text to be synthesized are respectively input into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules.
  • the encoder can select from the prior art to encode the text to obtain language features, such as One-hot encoder, UTF-8 encoder, phoneme encoder, and the examples are not specifically limited here.
  • the encoder can be selected according to the type of language in the multilingual text to be synthesized, or the encoder can be selected according to the field involved in the content of the multilingual text to be synthesized, which is not specifically limited in this example.
  • all text codes corresponding to the coding rules obtained by encoding at least two encoders with different coding rules are subjected to joint coding conversion to obtain joint text coding.
  • the outputs of at least two encoders with different encoding rules are spliced and subjected to a linear affine transformation to obtain a joint text encoding, and the dimension of the joint text encoding is one.
  • the standard spectrum feature data is input to the decoder for the decoder to learn from; by decoding the joint text encoding, the decoder obtains the spectral features corresponding to the joint text encoding, and these spectral features are used as the predicted spectral features.
  • the spectrum feature can be implemented as a Mel spectrum, which is not specifically limited in this example.
  • the standard spectral feature data is a standard spectral feature database pre-trained through a neural network.
  • the spectrum feature is implemented as a mel spectrum
  • the mel spectrogram is passed through a vocoder for speech synthesis processing to obtain the target speech corresponding to the predicted spectral features, and this speech is used as the target speech corresponding to the multilingual text to be synthesized.
  • the vocoder can select a Universal Vocoding vocoder from the prior art, which is not specifically limited in this example.
  • the multilingual text to be synthesized is respectively input into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and then all text encodings corresponding to the encoding rules
  • are converted into a joint text encoding; encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment.
  • said converting all the text encodings corresponding to the encoding rules into joint text encoding includes:
  • S202 Splicing all text codes corresponding to the encoding rules to obtain spliced text codes
  • the text encoding corresponding to each encoding rule is used as one-dimensional data, and all one-dimensional data of the text encoding corresponding to the encoding rules are sequentially spliced to obtain spliced text encoding.
  • the head ends of the one-dimensional data of the text encoding corresponding to all the encoding rules are aligned, and all the one-dimensional data of the text encoding corresponding to the encoding rules are spliced into multi-dimensional data to obtain the spliced text encoding.
  • the multi-dimensional spliced text encoding is subjected to linear affine transformation to select the text encoding corresponding to the encoding rule to obtain a joint text encoding, and the dimension of the joint text encoding is one dimension.
  • the linear affine transformation is used to select one of the text codes corresponding to the encoding rule as the target text code corresponding to the text unit for each text unit, and concatenate all the target text codes in sequence to obtain a joint text code.
  • the inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain the text encoding corresponding to the encoding rule includes: inputting the multilingual text to be synthesized Enter the One-hot encoder for encoding, and obtain the One-hot text encoding corresponding to the One-hot encoder; enter the multilingual text to be synthesized into the UTF-8 encoder for encoding, and obtain the corresponding UTF-8 encoder UTF-8 text encoding; input the multilingual text to be synthesized into the phoneme encoder for encoding, and obtain the phoneme text encoding corresponding to the phoneme encoder.
  • the One-Hot encoding is one-hot encoding, also known as one-bit effective encoding.
  • the method is to use an N-bit status register to encode N states; each state has its own independent register bit, and at any time only one of them is valid.
  • One-Hot Encoding can put together a collection of characters or glyphs of different languages as an input dictionary.
  • UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode standard; its basic set of 128 characters includes upper- and lower-case letters, the digits 0-9, punctuation marks, non-printing characters (line break, tab, and the like) and control characters (backspace, bell, and the like), and the encoding can accommodate all characters in the world.
  • the phoneme is the smallest phonetic unit divided according to the natural attributes of the speech, and is analyzed according to the pronunciation actions in the syllable, and one action constitutes one phoneme.
  • the One-hot encoder is obtained through neural network training by adopting the One-hot encoding method, and the training method can be selected from the prior art, which will not be repeated here.
  • the UTF-8 encoder is obtained through neural network training using the UTF-8 encoding scheme, and the training method can be selected from the prior art, which will not be repeated here. It is used to map each input character or glyph to an entry with 256 possible values as encoder input.
  • the phoneme encoder is obtained through neural network training by adopting a phoneme encoding method, and the training method can be selected from the prior art.
  • the phoneme encoder does not need to learn complicated pronunciation rules, and the same phoneme can be shared in different languages.
  • the One-hot encoder, UTF-8 encoder, and phoneme encoder are currently widely used encoders for extracting text encodings. By using these three encoders, this method retains more of the language features of the text and is more conducive to the processing of multilingual text. It is understandable that this method can also adopt other encoders for extracting text encodings, which are not specifically limited in this example.
  • the splicing all text codes corresponding to the encoding rules to obtain spliced text codes includes:
  • the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding are spliced in the channel dimension to obtain spliced text encoding.
  • the spliced text encoding includes three-dimensional data, wherein the first dimension is the One-hot text encoding, the second dimension is the UTF-8 text encoding, and the third dimension is the phoneme text encoding.
  • the performing linear affine transformation on the spliced text encoding to obtain a joint text encoding includes:
  • the multi-dimensional spliced text code is input into the first neural network to perform linear affine transformation to select the text code corresponding to the coding rule to obtain a joint text code, and the dimension of the joint text code is one dimension.
  • the text unit is used as an independent unit: through the trained first neural network, one of the text encodings corresponding to the encoding rules is selected from the multi-dimensional spliced text encoding as the target text encoding for that text unit, and all target text encodings are spliced in sequence to obtain a joint text encoding; the selection rule for the text encodings corresponding to the encoding rules is learned by the first neural network through training.
  • for example, when the One-hot encoder, the UTF-8 encoder, and the phoneme encoder are used, encoding yields the One-hot text encoding corresponding to the One-hot encoder, the UTF-8 text encoding corresponding to the UTF-8 encoder, and the phoneme text encoding corresponding to the phoneme encoder; for each text unit, one of these three encodings is selected as the target text encoding corresponding to that text unit.
  • the first neural network can select a neural network that can perform linear affine transformation from the prior art, which will not be repeated here.
  • the inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; The high-level features of the joint text encoding and the standard spectrum feature data are input to a decoder for predictive decoding to obtain the predicted spectrum feature.
  • the high-level features refer to features related to language classification, semantic information, etc. included in the multilingual text to be synthesized.
  • High-level features are features related to language classification, semantic information, etc., and the predicted spectrum features obtained by predicting and decoding the joint text encoding containing high-level features retain the language classification and semantic information of the multilingual text to be synthesized, thereby further improving The accuracy of the final synthesized target speech corresponding to the multilingual text to be synthesized.
  • the high-level feature extraction of the joint text coding to obtain the high-level feature of the joint text coding includes:
  • the joint text coding is input into the second neural network for high-level feature extraction, and the high-level features of the joint text coding are obtained.
  • the rules for performing high-level feature extraction on the joint text encoding can be obtained by training the second neural network.
  • the second neural network can select a neural network that can perform high-level feature extraction on text encoding from the prior art, which will not be repeated here.
  • the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
  • the character-level convolutional neural network is used to implement character embedding, and the detailed structure can be selected from the prior art, which will not be repeated here.
  • the three convolutional layers are used to implement high-level feature extraction, and the detailed structure can be selected from the prior art, which will not be repeated here.
  • the bidirectional long short-term memory recurrent neural network is used for semantic relationship recognition; the recurrent network learns the semantic feature representation of the sentence directly from the words.
  • the detailed structure can be selected from the prior art and will not be repeated here.
  • inputting the joint text encoding high-level feature and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral feature includes:
  • the third neural network of the decoder performs spectrum feature prediction according to the high-level features of the joint text encoding, the standard spectrum feature data and the attention mechanism, and obtains the predicted spectrum feature.
  • the standard spectrum feature data is input to a third neural network for learning, and the third neural network is trained to obtain a decoder; according to the attention mechanism, the decoder maps the high-level features of the joint text encoding into a sequence of spectral features, and this sequence is used as the predicted spectral features.
  • the decoder obtained by learning and training the third neural network can capture the pronunciation of words, as well as various subtle changes in human speech, including volume, speaking speed and intonation.
  • the third neural network can select a neural network capable of extracting text encoding from the prior art, which will not be repeated here.
  • the third neural network includes a 2-layer preprocessing neural network, a 2-layer long-short-term memory network, a linear affine transformation neural network, and a 5-layer convolutional post-processing neural network.
  • the detailed structure of the 2-layer pre-processing neural network, 2-layer long short-term memory network, linear affine transformation neural network, and 5-layer convolutional post-processing neural network can be selected from the prior art, and will not be repeated here.
  • the long and short-term memory network is used to utilize context-related information in the mapping process between input and output sequences.
  • a method for speech synthesis of multilingual text is also proposed, and the method includes:
  • the multi-language text refers to the text containing multiple types of languages at the same time.
  • the multi-language text includes a mixture of Chinese, English, French, and Arabic numerals, and the examples here are not specifically limited.
  • the multi-language text to be processed refers to obtaining multi-language text from a text input device or a database or a network.
  • S404 Perform language standardization processing according to the multi-language text to be processed to obtain the multi-language text to be synthesized
  • non-standard usages, such as abbreviations and shortened forms of English words, or several words joined together by a connector, may cause the problem of missing or skipped words when synthesizing speech from text.
  • the language standardization process includes the abbreviation reduction, the abbreviation reduction, and the disconnection of multiple words connected together, which are not specifically limited in this example.
  • S414 Input the predicted spectral characteristics into the vocoder for synthesis processing to obtain a target speech corresponding to the multilingual text to be synthesized.
  • the multilingual text to be synthesized is obtained by performing language normalization on the multilingual text to be processed, and the multilingual text to be synthesized is then used as input for synthesizing speech, which further avoids the phenomenon of missing or skipped words and further improves the quality of the synthesized speech.
  • the present invention provides a multilingual text speech synthesis device, the device includes:
  • the joint coding module 502 is used to obtain the multi-language text to be synthesized, input the multi-language text to be synthesized into at least two encoders with different encoding rules for encoding, obtain the text encoding corresponding to the encoding rule, and convert all the The text encoding corresponding to the encoding rule is converted into a joint text encoding;
  • the speech synthesis module 504 is configured to input the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain a predicted spectral feature, and input the predicted spectral feature into a vocoder for synthesis processing to obtain and the to-be-synthesized The target voice corresponding to the multilingual text.
  • the multilingual text speech synthesis device of this embodiment inputs the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and then converts all text encodings corresponding to the encoding rules
  • into a joint text encoding; encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment.
  • the joint encoding module includes a separate encoding sub-module 5022 and a joint encoding sub-module 5024;
  • the separate encoding sub-module 5022 is configured to obtain the multi-language text to be synthesized, and input the multi-language text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain a text encoding corresponding to the encoding rule;
  • the joint coding sub-module 5024 is used for concatenating all text codes corresponding to the coding rules to obtain a concatenated text code, and performing linear affine transformation on the concatenated text code to obtain a joint text code.
  • the speech synthesis module includes a high-level feature extraction sub-module 5042, a spectral feature prediction sub-module 5044;
  • the high-level feature extraction submodule 5042 is configured to perform high-level feature extraction on the joint text coding to obtain high-level features of the joint text coding;
  • the spectral feature prediction sub-module 5044 is configured to input the high-level features of the joint text encoding and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral feature.
  • Fig. 8 shows an internal structure diagram of a computer device in an embodiment.
  • the computer device may specifically be a terminal or a server.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system and may also store a computer program.
  • when the computer program is executed by the processor, it enables the processor to implement a multilingual text speech synthesis method.
  • a computer program may also be stored in the internal memory, and when the computer program is executed by the processor, the processor can make the processor execute the method of speech synthesis of multi-language text.
  • FIG. 8 is only a block diagram of part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
  • a multilingual text speech synthesis method provided by the present application can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 8.
  • the memory of the computer equipment can store various program templates of a multi-language text speech synthesis device.
  • the present invention provides a storage medium that stores a computer instruction program, and when the computer instruction program is executed by a processor, the processor executes the following method steps:
  • the predicted frequency spectrum feature is input to the vocoder for synthesis processing to obtain the target speech corresponding to the multilingual text to be synthesized.
  • the multilingual text to be synthesized is respectively input into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and then all text encodings corresponding to the encoding rules are converted into a joint text encoding; encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment.
  • the converting all text codes corresponding to the coding rules into joint text codes includes: concatenating all text codes corresponding to the coding rules to obtain a concatenated text code; encoding the concatenated text Perform linear affine transformation to obtain joint text encoding.
  • the inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; The high-level features of the joint text encoding and the standard spectrum feature data are input to a decoder for predictive decoding to obtain the predicted spectrum feature.
  • the inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain the text encoding corresponding to the encoding rule includes: inputting the multilingual text to be synthesized Enter the One-hot encoder for encoding, and obtain the One-hot text encoding corresponding to the One-hot encoder; enter the multilingual text to be synthesized into the UTF-8 encoder for encoding, and obtain the corresponding UTF-8 encoder UTF-8 text encoding; input the multilingual text to be synthesized into the phoneme encoder for encoding, and obtain the phoneme text encoding corresponding to the phoneme encoder.
  • the splicing of all the text encodings corresponding to the encoding rules to obtain the spliced text encoding includes: splicing the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding in the channel dimension to obtain the spliced text encoding.
  • the spliced text encoding includes three-dimensional data; wherein the first-dimensional data is the One-hot text encoding, the second-dimensional data is the UTF-8 text encoding, and the third-dimensional data is the phoneme text encoding.
  • the performing of a linear affine transformation on the spliced text encoding to obtain a joint text encoding includes: inputting the multi-dimensional spliced text encoding into a first neural network, which performs a linear affine transformation to select among the text encodings corresponding to the encoding rules, obtaining the joint text encoding.
  • the performing high-level feature extraction on the joint text coding to obtain the high-level features of the joint text coding includes: inputting the joint text coding into a second neural network for high-level feature extraction to obtain the high-level features of the joint text coding .
  • the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
  • the inputting of the high-level features of the joint text encoding and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral features includes: obtaining the standard spectral feature data; inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder including a third neural network; and the third neural network of the decoder performing spectral feature prediction according to the joint text encoding, the standard spectral feature data, and the attention mechanism, to obtain the predicted spectral features.
  • the method before acquiring the multi-language text to be synthesized, the method further includes: acquiring the multi-language text to be processed; and performing language standardization processing according to the multi-language text to be processed to obtain the multi-language text to be synthesized.
  • the present invention provides a multi-language text speech synthesis device, including at least one memory and at least one processor, the memory stores a computer instruction program, and the computer instruction program is executed by the processor When the time, the processor is caused to execute the following method steps:
  • the predicted frequency spectrum feature is input to the vocoder for synthesis processing to obtain the target speech corresponding to the multilingual text to be synthesized.
  • the multilingual text to be synthesized is input into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and then all text encodings corresponding to the encoding rules are converted into a joint text encoding; encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment.
  • the converting all text codes corresponding to the coding rules into joint text codes includes: concatenating all text codes corresponding to the coding rules to obtain a concatenated text code; encoding the concatenated text Perform linear affine transformation to obtain joint text encoding.
  • the inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; The high-level features of the joint text encoding and the standard spectrum feature data are input to a decoder for predictive decoding to obtain the predicted spectrum feature.
  • the inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain the text encoding corresponding to the encoding rule includes: inputting the multilingual text to be synthesized Enter the One-hot encoder for encoding, and obtain the One-hot text encoding corresponding to the One-hot encoder; enter the multilingual text to be synthesized into the UTF-8 encoder for encoding, and obtain the corresponding UTF-8 encoder UTF-8 text encoding; input the multilingual text to be synthesized into the phoneme encoder for encoding, and obtain the phoneme text encoding corresponding to the phoneme encoder.
  • the splicing of all the text encodings corresponding to the encoding rules to obtain the spliced text encoding includes: splicing the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding in the channel dimension to obtain the spliced text encoding.
  • the spliced text encoding includes three-dimensional data; wherein the first-dimensional data is the One-hot text encoding, the second-dimensional data is the UTF-8 text encoding, and the third-dimensional data is the phoneme text encoding.
  • the performing of a linear affine transformation on the spliced text encoding to obtain a joint text encoding includes: inputting the multi-dimensional spliced text encoding into a first neural network, which performs a linear affine transformation to select among the text encodings corresponding to the encoding rules, obtaining the joint text encoding.
  • the performing high-level feature extraction on the joint text coding to obtain the high-level features of the joint text coding includes: inputting the joint text coding into a second neural network for high-level feature extraction to obtain the high-level features of the joint text coding .
  • the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
  • the inputting of the high-level features of the joint text encoding and the standard spectral feature data into a decoder for predictive decoding to obtain the predicted spectral features includes: obtaining the standard spectral feature data; inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder including a third neural network; and the third neural network of the decoder performing spectral feature prediction according to the joint text encoding, the standard spectral feature data, and the attention mechanism, to obtain the predicted spectral features.
  • the method before acquiring the multi-language text to be synthesized, the method further includes: acquiring the multi-language text to be processed; and performing language standardization processing according to the multi-language text to be processed to obtain the multi-language text to be synthesized.
  • the multilingual text speech synthesis method, the multilingual text speech synthesis device, the storage medium, and the multilingual text speech synthesis equipment described above belong to one general inventive concept; the content in their respective embodiments may be applied to one another.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention discloses a speech synthesis method, apparatus, device, and storage medium for multilingual text. The method includes: obtaining multilingual text to be synthesized; inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules; converting all text encodings corresponding to the encoding rules into a joint text encoding; inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features; and inputting the predicted spectral features into a vocoder for synthesis processing to obtain target speech corresponding to the multilingual text to be synthesized. The present invention facilitates the processing of multilingual text and reduces both the difficulty and the cost of deployment.

Description

Multilingual text-to-speech synthesis method, apparatus, device, and storage medium
Technical Field
This application relates to the field of speech technology, and in particular to a speech synthesis method, apparatus, device, and storage medium for multilingual text.
Background Art
Speech synthesis is an important task in speech interaction. Its goal is to synthesize text information into natural speech that sounds as if spoken by a real person. A traditional speech synthesis system consists of two parts: a front end and a back end. The front end analyzes the text and extracts linguistic information, such as word segmentation, part-of-speech tagging, and prosodic structure prediction. The back end synthesizes speech from the linguistic information obtained from the front end.
Technical Problem
Over the past decade or so, concatenative synthesis and parametric synthesis have been widely used and have achieved good results. Concatenative synthesis requires a large corpus from which speech segments are selected to assemble the required speech; although the naturalness of each synthesized segment is relatively high, the continuity within the speech is not good enough. Parametric synthesis requires less data than concatenative synthesis, but because the models are relatively complex and contain a large number of parameters, they are time-consuming and laborious to modify.
In recent years, with the development of deep learning, end-to-end speech synthesis systems have been proposed, such as Tacotron (an end-to-end deep learning speech synthesis model) and Tacotron2, which use neural networks to simplify the front end of traditional speech synthesis. Tacotron and Tacotron2 first generate spectral features (mel spectrograms) directly from the text, and then use a vocoder such as Griffin-Lim (an audio generation model based on the Griffin-Lim algorithm) or WaveNet (a raw-audio generation model) to synthesize the spectral features into speech. Such neural-network-based end-to-end models greatly improve the quality of synthesized speech. Here, an end-to-end model refers to a sequence-to-sequence model with an attention mechanism: the text sequence is mapped into a semantic space by an encoder, producing a series of encoder hidden states; the decoder then uses the attention mechanism to treat these semantic-space hidden states as context information, constructs the decoder hidden states, and outputs spectral feature frames. The attention mechanism often includes a recurrent neural network. A recurrent neural network can generate an output sequence from an input sequence, where the current output is determined jointly by all previous outputs and the current hidden state. For a particular spectral frame, insufficient encoder input information or insufficient encoding may leave it deviating from the target even after many recurrent steps; in the synthesized speech, this may sound like missing or skipped words.
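To make the attention step above concrete, the following minimal PyTorch sketch shows content-based attention over encoder hidden states; the dot-product scoring and tensor shapes are illustrative assumptions rather than the specific mechanism used by Tacotron or by this application.

```python
import torch
import torch.nn.functional as F

def content_based_attention(decoder_state, encoder_states):
    """Weight encoder hidden states by similarity to the current decoder state.

    decoder_state:  (batch, dim)        current decoder hidden state
    encoder_states: (batch, time, dim)  hidden states produced by the text encoder
    Returns the context vector (batch, dim) and the attention weights (batch, time).
    """
    # Dot-product alignment scores between the decoder state and every encoder state.
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)
    # Context vector: attention-weighted sum of encoder states.
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)
    return context, weights
```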
At the same time, although such single-language speech synthesis systems can already meet daily needs in most scenarios, some specific scenarios, such as robots and translation devices, require a speech synthesis system that handles multiple languages. Training a separate system for each language would make model deployment very costly. It is therefore particularly important to develop a multilingual text-to-speech synthesis method that does not miss or skip words and is simple to deploy.
Technical Solution
On this basis, it is necessary to address the above problems by proposing a speech synthesis method, apparatus, device, and storage medium for multilingual text, which are used to solve the technical problems of missing or skipped words and complex deployment in the prior art.
In a first aspect, the present invention provides a speech synthesis method for multilingual text, the method comprising:
obtaining multilingual text to be synthesized;
inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules;
converting all text encodings corresponding to the encoding rules into a joint text encoding;
inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
In one embodiment, converting all text encodings corresponding to the encoding rules into a joint text encoding comprises:
splicing all text encodings corresponding to the encoding rules to obtain a spliced text encoding;
performing a linear affine transformation on the spliced text encoding to obtain the joint text encoding.
In one embodiment, inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features comprises:
performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding;
inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
In one embodiment, inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules, comprises:
inputting the multilingual text to be synthesized into a One-hot encoder for encoding, to obtain a One-hot text encoding corresponding to the One-hot encoder;
inputting the multilingual text to be synthesized into a UTF-8 encoder for encoding, to obtain a UTF-8 text encoding corresponding to the UTF-8 encoder;
inputting the multilingual text to be synthesized into a phoneme encoder for encoding, to obtain a phoneme text encoding corresponding to the phoneme encoder.
In one embodiment, splicing all text encodings corresponding to the encoding rules to obtain a spliced text encoding comprises:
splicing the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding along the channel dimension to obtain the spliced text encoding, the spliced text encoding comprising three-dimensional data, wherein the first dimension is the One-hot text encoding, the second dimension is the UTF-8 text encoding, and the third dimension is the phoneme text encoding.
In one embodiment, performing a linear affine transformation on the spliced text encoding to obtain the joint text encoding comprises:
inputting the multi-dimensional spliced text encoding into a first neural network, which performs a linear affine transformation to select among the text encodings corresponding to the encoding rules, to obtain the joint text encoding.
In one embodiment, performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding comprises:
inputting the joint text encoding into a second neural network for high-level feature extraction, to obtain the high-level features of the joint text encoding.
In one embodiment, the second neural network comprises a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
In one embodiment, inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding to obtain the predicted spectral features comprises:
obtaining the standard spectral feature data;
inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder comprising a third neural network;
the third neural network of the decoder performing spectral feature prediction according to the joint text encoding, the standard spectral feature data, and an attention mechanism, to obtain the predicted spectral features.
In one embodiment, before obtaining the multilingual text to be synthesized, the method further comprises:
obtaining multilingual text to be processed;
performing language normalization on the multilingual text to be processed, to obtain the multilingual text to be synthesized.
In a second aspect, the present invention further provides a speech synthesis apparatus for multilingual text, the apparatus comprising:
a joint encoding module, configured to obtain multilingual text to be synthesized, input the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding to obtain text encodings corresponding to the encoding rules, and convert all text encodings corresponding to the encoding rules into a joint text encoding;
a speech synthesis module, configured to input the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features, and input the predicted spectral features into a vocoder for synthesis processing to obtain target speech corresponding to the multilingual text to be synthesized.
In one embodiment, the joint encoding module comprises a separate encoding sub-module and a joint encoding sub-module;
the separate encoding sub-module is configured to obtain the multilingual text to be synthesized and input the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules;
the joint encoding sub-module is configured to splice all text encodings corresponding to the encoding rules to obtain a spliced text encoding, and to perform a linear affine transformation on the spliced text encoding to obtain the joint text encoding.
In one embodiment, the speech synthesis module comprises a high-level feature extraction sub-module and a spectral feature prediction sub-module;
the high-level feature extraction sub-module is configured to perform high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding;
the spectral feature prediction sub-module is configured to input the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
In a third aspect, the present invention further provides a storage medium storing a computer instruction program which, when executed by a processor, causes the processor to perform the steps of any one of the methods of the first aspect.
In a fourth aspect, the present invention further provides a speech synthesis device for multilingual text, comprising at least one memory and at least one processor, the memory storing a computer instruction program which, when executed by the processor, causes the processor to perform the steps of any one of the methods of the first aspect.
Advantageous Effects
In summary, in the multilingual text-to-speech synthesis method of the present invention, the multilingual text to be synthesized is input into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules, and all text encodings corresponding to the encoding rules are then converted into a joint text encoding. Encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment. Therefore, the present invention retains the characteristics of the text more fully, facilitates the processing of multilingual text, and reduces the difficulty and cost of deployment.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
In the drawings:
FIG. 1 is a flowchart of a speech synthesis method for multilingual text in one embodiment;
FIG. 2 is a flowchart of determining the joint text encoding in the method of FIG. 1;
FIG. 3 is a flowchart of predictive decoding in the method of FIG. 1;
FIG. 4 is a flowchart of a speech synthesis method for multilingual text in another embodiment;
FIG. 5 is a structural block diagram of a speech synthesis apparatus for multilingual text in one embodiment;
FIG. 6 is a structural block diagram of the joint encoding module of the apparatus of FIG. 5;
FIG. 7 is a structural block diagram of the speech synthesis module of the apparatus of FIG. 5;
FIG. 8 is a structural block diagram of a computer device in one embodiment.
Embodiments of the Invention
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some, not all, of the embodiments of this application. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of this application.
As shown in FIG. 1, in one embodiment, a speech synthesis method for multilingual text is proposed, the method comprising:
S102: obtaining multilingual text to be synthesized;
Multilingual text refers to text that contains at least two kinds of languages at the same time; for example, multilingual text may be a mixture of Chinese, English, French, and Arabic numerals, and the examples here are not limiting.
Specifically, multilingual text is obtained from a text input device, a database, or a network, and this multilingual text is used as the multilingual text to be synthesized, so that it can be synthesized into speech; the content expressed by the multilingual text before synthesis and by the speech after synthesis does not change.
A user can trigger text input through the text input device: when the user starts to input, the text input device starts to collect text, and when the user stops inputting, the text input device stops collecting text, so that the text input device can collect a passage of text.
S104: inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules;
Specifically, the multilingual text to be synthesized is input, in reading order, into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules. The number of encoders with different encoding rules may be two, three, four, five, six, or seven, and the examples here are not limiting. It will be understood that encoders with different encoding rules capture language features of different dimensions when encoding; encoding with at least two such encoders therefore captures the language features of the multilingual text to be synthesized fully from multiple dimensions, avoiding the problem of insufficient language features or insufficient output information from a single encoder.
It will be understood that each encoder needs to encode the multilingual text to be synthesized separately, in reading order.
Optionally, the characters or glyphs in the multilingual text to be synthesized are input into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules.
The encoders may be selected from the prior art for encoding text to obtain language features, for example a One-hot encoder, a UTF-8 encoder, or a phoneme encoder; the examples here are not limiting.
The encoders may be selected according to the kinds of languages in the multilingual text to be synthesized, or according to the domain to which the content of the multilingual text to be synthesized relates; the examples here are not limiting.
S106: converting all text encodings corresponding to the encoding rules into a joint text encoding;
Specifically, all text encodings corresponding to the encoding rules, produced by the at least two encoders with different encoding rules, are jointly converted to obtain the joint text encoding.
Optionally, the outputs of the at least two encoders with different encoding rules are spliced and subjected to a linear affine transformation to obtain the joint text encoding, the joint text encoding being one-dimensional.
S108: inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
Specifically, the standard spectral feature data is input into the decoder for the decoder to learn from; the decoder decodes the joint text encoding to obtain the spectral features corresponding to the joint text encoding, and these spectral features are used as the predicted spectral features.
The spectral features may be implemented as a mel spectrogram; the examples here are not limiting.
The standard spectral feature data is a standard spectral feature database pre-trained by means of a neural network.
S110: inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
Optionally, the spectral features are implemented as a mel spectrogram, and the mel spectrogram is passed through a vocoder for speech synthesis to obtain the target speech corresponding to the predicted spectral features; this speech is used as the target speech corresponding to the multilingual text to be synthesized.
When the spectral features are implemented as a mel spectrogram, the vocoder may be a Universal Vocoding vocoder selected from the prior art; the examples here are not limiting.
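By way of illustration of this final vocoding step, the following sketch assumes the librosa library and typical mel parameters (the sample rate, FFT size, and hop length are assumed values not given in this application) and inverts a mel spectrogram with Griffin-Lim; a trained neural vocoder would normally replace this step for higher quality.

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_waveform(mel, sr=22050, n_fft=1024, hop_length=256):
    """Invert a (n_mels, frames) power mel spectrogram to audio via Griffin-Lim."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)

# Example usage with a dummy spectrogram of 80 mel bands and 200 frames.
if __name__ == "__main__":
    mel = np.abs(np.random.randn(80, 200)).astype(np.float32)
    sf.write("demo.wav", mel_to_waveform(mel), 22050)
```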
In the speech synthesis method for multilingual text of this embodiment, the multilingual text to be synthesized is input into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules, and all text encodings corresponding to the encoding rules are then converted into a joint text encoding. Encoding with at least two encoders that follow different rules retains the characteristics of the text more fully and also facilitates the processing of multilingual text; converting the separately encoded results into a joint text encoding improves the stability of the synthesized speech, while also reducing the difficulty and cost of deployment.
As shown in FIG. 2, in one embodiment, converting all text encodings corresponding to the encoding rules into a joint text encoding comprises:
S202: splicing all text encodings corresponding to the encoding rules to obtain a spliced text encoding;
Specifically, the text encoding corresponding to each encoding rule is treated as one-dimensional data, and the one-dimensional data of all text encodings corresponding to the encoding rules are spliced in sequence to obtain the spliced text encoding.
It will be understood that the heads of the one-dimensional data of all text encodings corresponding to the encoding rules are aligned, and the one-dimensional data are spliced into multi-dimensional data to obtain the spliced text encoding.
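A small illustrative sketch of this head-aligned splicing (the padding value and the integer-id representation are assumptions made for the example): three per-character encoding streams of possibly different lengths are aligned at their heads, padded to a common length, and stacked along a channel dimension.

```python
import numpy as np

def splice_encodings(one_hot_ids, utf8_ids, phoneme_ids, pad=0):
    """Align three integer encoding sequences at their heads and stack them.

    Returns an array of shape (3, max_len): channel 0 holds the One-hot ids,
    channel 1 the UTF-8 byte ids, channel 2 the phoneme ids.
    """
    streams = [one_hot_ids, utf8_ids, phoneme_ids]
    max_len = max(len(s) for s in streams)
    padded = [np.pad(np.asarray(s), (0, max_len - len(s)), constant_values=pad)
              for s in streams]
    return np.stack(padded, axis=0)

print(splice_encodings([5, 9, 2], [228, 184, 173, 84], [17, 3]).shape)  # (3, 4)
```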
S204: performing a linear affine transformation on the spliced text encoding to obtain the joint text encoding.
Specifically, a linear affine transformation is performed on the multi-dimensional spliced text encoding to select among the text encodings corresponding to the encoding rules, obtaining the joint text encoding, the joint text encoding being one-dimensional.
The linear affine transformation is used to select, for each text unit, one of the text encodings corresponding to the encoding rules as the target text encoding for that text unit, and all target text encodings are spliced in sequence to obtain the joint text encoding.
In one embodiment, inputting the multilingual text to be synthesized into at least two encoders with different encoding rules for encoding, to obtain text encodings corresponding to the encoding rules, comprises: inputting the multilingual text to be synthesized into a One-hot encoder for encoding, to obtain a One-hot text encoding corresponding to the One-hot encoder; inputting the multilingual text to be synthesized into a UTF-8 encoder for encoding, to obtain a UTF-8 text encoding corresponding to the UTF-8 encoder; and inputting the multilingual text to be synthesized into a phoneme encoder for encoding, to obtain a phoneme text encoding corresponding to the phoneme encoder.
One-Hot encoding, also known as one-bit-effective encoding, uses an N-bit status register to encode N states; each state has its own independent register bit, and at any time only one of them is valid. One-Hot encoding can put the characters or glyphs of different languages together as an input dictionary.
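A tiny illustration of one-hot encoding over a mixed-language input dictionary (the characters chosen are arbitrary):

```python
import numpy as np

vocab = ["a", "b", "多", "语"]           # characters of different languages in one dictionary
one_hot = np.eye(len(vocab), dtype=int)  # one register bit per state, only one bit active
print(dict(zip(vocab, one_hot.tolist())))
```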
UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode that can represent any character in the Unicode standard; its basic set of 128 characters includes upper- and lower-case letters, the digits 0-9, punctuation marks, non-printing characters (line break, tab, and the like) and control characters (backspace, bell, and the like), and the encoding can accommodate all characters in the world.
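A small illustration of why a UTF-8 view suits mixed-language input: every character, whatever the script, becomes a short sequence of byte values in the range 0-255, so a single 256-entry table covers all languages.

```python
text = "多语言 TTS example"
byte_ids = list(text.encode("utf-8"))
print(byte_ids)
# Chinese characters expand to three bytes each and ASCII letters to one byte,
# but every value stays within 0-255 and can index a 256-entry embedding table.
```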
A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; it is analysed according to the articulatory actions within a syllable, one action constituting one phoneme.
The One-hot encoder is obtained by training a neural network with the One-hot encoding scheme; the training method may be selected from the prior art and is not repeated here.
The UTF-8 encoder is obtained by training a neural network with the UTF-8 encoding scheme; the training method may be selected from the prior art and is not repeated here. It is used to map each input character or glyph to an entry with 256 possible values as encoder input.
The phoneme encoder is obtained by training a neural network with a phoneme encoding scheme; the training method may be selected from the prior art. The phoneme encoder does not need to learn complicated pronunciation rules, and the same phoneme can be shared across different languages.
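For illustration, a toy shared-phoneme lookup (the phoneme symbols and lexicon entries are invented for the example and are not this application's phoneme inventory): per-language pronunciation entries draw on one common symbol set, which is what lets different languages reuse the same phonemes.

```python
# Toy shared-phoneme lexicon; the symbols and entries are made up for illustration.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "你好":  ["N", "I", "H", "AW"],   # shares symbols with the English entry
}

def words_to_phonemes(words):
    """Map each word to its phoneme sequence, falling back to a placeholder."""
    return [p for w in words for p in LEXICON.get(w, ["<unk>"])]

print(words_to_phonemes(["你好", "hello"]))
```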
One-hot编码器、UTF-8编码器、音素编码器是目前应用比较广泛的提取文本编码的编码器,本方法通过采用这三种编码器,提高了保留的文本的语言特征,也更有利于多语言文本的处理。可以理解的是,本方法还可以采取其他提取文本编码的编码器,在此举例不作具体限定。
在一个实施例中,所述将所有所述编码规则对应的文本编码进行拼接,得到拼接文本编码,包括:
将所述One-hot文本编码、所述UTF-8文本编码、所述音素文本编码在通道维度上进行拼接,得到拼接文本编码,所述拼接文本编码包括三维数据;其中,第一维数据为所述One-hot文本编码,第二维数据为所述UTF-8文本编码,第三维数据为所述音素文本编码。
In one embodiment, applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding includes:
inputting the multi-dimensional concatenated text encoding into a first neural network, which applies a linear affine transformation to select among the text encodings corresponding to the encoding rules, obtaining the joint text encoding, the joint text encoding being one-dimensional.
Specifically, taking each text unit as an independent unit, the trained first neural network selects, from the multi-dimensional concatenated text encoding, among the text encodings corresponding to the encoding rules, choosing one of them as the target text encoding corresponding to that text unit; all the target text encodings are concatenated in sequence to obtain the joint text encoding. The rule by which the text encodings corresponding to the encoding rules are selected is learned by training the first neural network. For example, when the One-hot encoder, the UTF-8 encoder, and the phoneme encoder are selected to extract the text encodings corresponding to the encoding rules, encoding yields the One-hot text encoding corresponding to the One-hot encoder, the UTF-8 text encoding corresponding to the UTF-8 encoder, and the phoneme text encoding corresponding to the phoneme encoder; for each text unit, one of these three is selected as the target text encoding corresponding to that text unit.
The first neural network may be selected from the prior art as a neural network capable of performing a linear affine transformation and is not described in detail here.
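A minimal PyTorch sketch of this joint conversion is given below, assuming the three per-text-unit encodings have already been embedded to a common width; the learned linear layer stands in for the first neural network's affine transformation, and the softmax weighting over the three channels is one assumed way to realize a (soft) per-unit selection.

```python
import torch
import torch.nn as nn

class JointTextEncoding(nn.Module):
    """Stack three per-unit encodings along a channel dimension and select
    among them with a learned affine transformation (a sketch, not the
    application's exact first neural network)."""

    def __init__(self, dim: int):
        super().__init__()
        self.selector = nn.Linear(dim, 1)   # scores each channel per text unit

    def forward(self, one_hot_emb, utf8_emb, phoneme_emb):
        # Each input: (batch, seq_len, dim). Stack -> (batch, seq_len, 3, dim).
        stacked = torch.stack([one_hot_emb, utf8_emb, phoneme_emb], dim=2)
        weights = torch.softmax(self.selector(stacked).squeeze(-1), dim=-1)
        # Weighted sum over the three channels -> (batch, seq_len, dim).
        return (weights.unsqueeze(-1) * stacked).sum(dim=2)

# joint = JointTextEncoding(dim=256)(e1, e2, e3)  # e*: (B, T, 256) tensors
```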
In one embodiment, inputting the joint text encoding and the standard spectral feature data into a decoder for predictive decoding, to obtain the predicted spectral features, includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; and inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
The high-level features refer to features contained in the multilingual text to be synthesized that relate to language classification, semantic information, and the like.
Since the high-level features relate to language classification, semantic information, and the like, predicting the spectral features from a joint text encoding that carries these high-level features preserves the language classification and semantic information of the multilingual text to be synthesized, which further improves the accuracy of the finally synthesized target speech corresponding to the multilingual text to be synthesized.
In one embodiment, performing high-level feature extraction on the joint text encoding to obtain the high-level features of the joint text encoding includes:
inputting the joint text encoding into a second neural network for high-level feature extraction, to obtain the high-level features of the joint text encoding. The rule for extracting high-level features from the joint text encoding may be learned by training the second neural network.
The second neural network may be selected from the prior art as a neural network capable of extracting high-level features from a text encoding and is not described in detail here.
In one embodiment, the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network, arranged in sequence.
The character-level convolutional neural network is used to implement character embedding; its detailed structure may be selected from the prior art and is not described in detail here.
The three convolutional layers are used to implement high-level feature extraction; their detailed structure may be selected from the prior art and is not described in detail here.
The bidirectional long short-term memory recurrent neural network is used for semantic relation recognition, using the recurrent neural network to learn a semantic feature representation of the input sentence directly from its words; its detailed structure may be selected from the prior art and is not described in detail here.
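The following PyTorch sketch shows one plausible arrangement of the second neural network as described (a character-level convolution, three convolutional layers, then a bidirectional LSTM); the channel sizes, kernel widths, and layer details are assumed values, not taken from this application.

```python
import torch
import torch.nn as nn

class HighLevelFeatureExtractor(nn.Module):
    """Char-level conv -> three conv layers -> bidirectional LSTM (a sketch)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.char_conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, kernel_size=5, padding=2),
                          nn.BatchNorm1d(dim), nn.ReLU())
            for _ in range(3)])
        self.blstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, joint_encoding):
        # joint_encoding: (batch, seq_len, dim)
        x = joint_encoding.transpose(1, 2)        # (batch, dim, seq_len)
        x = self.char_conv(x)
        x = self.convs(x).transpose(1, 2)         # back to (batch, seq_len, dim)
        features, _ = self.blstm(x)               # (batch, seq_len, dim)
        return features
```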
As shown in FIG. 3, in one embodiment, inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features, includes:
S302: acquiring standard spectral feature data;
S304: inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder including a third neural network;
S306: the third neural network of the decoder predicting spectral features based on the high-level features of the joint text encoding and the standard spectral feature data in combination with an attention mechanism, to obtain the predicted spectral features.
Specifically, the standard spectral feature data is input into the third neural network for learning, and the third neural network is trained to obtain the decoder; based on the attention mechanism, the decoder maps the high-level features of the joint text encoding to a sequence of spectral features, and this sequence of spectral features is taken as the predicted spectral features. The decoder obtained by training the third neural network can capture the pronunciation of words as well as the various subtle variations of human speech, including volume, speaking rate, and intonation.
The third neural network may be selected from the prior art as a neural network capable of decoding text encodings and is not described in detail here.
In one embodiment, the third neural network includes a 2-layer pre-processing network, a 2-layer long short-term memory network, a linear affine transformation network, and a 5-layer convolutional post-processing network. The detailed structures of the 2-layer pre-processing network, the 2-layer long short-term memory network, the linear affine transformation network, and the 5-layer convolutional post-processing network may be selected from the prior art and are not described in detail here.
The long short-term memory network is used to exploit contextual information in the mapping between input and output sequences.
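A compact PyTorch sketch of such a decoding step is shown below (a 2-layer prenet, attention over the encoder features, 2 LSTM layers, a linear projection to the Mel frame, and a 5-layer convolutional postnet); it follows a generic Tacotron 2-style layout rather than this application's exact third neural network, and all layer sizes are assumed.

```python
import torch
import torch.nn as nn

class SpectrogramDecoder(nn.Module):
    """Prenet -> attention -> 2 LSTM layers -> frame projection -> postnet."""

    def __init__(self, enc_dim=256, mel_dim=80, hidden=512):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(mel_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, hidden), nn.ReLU())
        self.attention = nn.MultiheadAttention(hidden, num_heads=1,
                                               kdim=enc_dim, vdim=enc_dim,
                                               batch_first=True)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.frame_proj = nn.Linear(hidden, mel_dim)
        self.postnet = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(mel_dim, mel_dim, 5, padding=2), nn.Tanh())
            for _ in range(5)])

    def forward(self, prev_frames, encoder_features):
        # prev_frames: (B, T_out, mel_dim); encoder_features: (B, T_in, enc_dim)
        x = self.prenet(prev_frames)
        context, _ = self.attention(x, encoder_features, encoder_features)
        x, _ = self.lstm(x + context)
        mel = self.frame_proj(x)                               # coarse frames
        residual = self.postnet(mel.transpose(1, 2)).transpose(1, 2)
        return mel + residual                                  # refined frames
```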
As shown in FIG. 4, in one embodiment, another multilingual text-to-speech synthesis method is provided, the method including:
S402: acquiring multilingual text to be processed;
The multilingual text refers to text that contains multiple kinds of languages at the same time; for example, the multilingual text may be a mixture of Chinese, English, French, and Arabic numerals; these examples do not constitute a specific limitation.
The multilingual text to be processed refers to multilingual text acquired from a text input device, a database, or a network.
S404: performing language normalization processing on the multilingual text to be processed, to obtain the multilingual text to be synthesized;
In the use of language, non-standardized usages exist, for example abbreviations and shortened forms of English words, or multiple words joined together by hyphens; such non-standardized usages may cause missed or skipped words when synthesizing speech from text.
The language normalization processing includes expanding abbreviations, expanding shortened forms, and splitting apart multiple words joined together; these examples do not constitute a specific limitation.
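As a minimal sketch, the snippet below illustrates this kind of normalization for English text; the abbreviation table is an assumed toy example, and a real system would use a much larger, language-aware lexicon.

```python
import re

# Assumed toy abbreviation table; a production system would use a full lexicon.
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

def normalize(text: str) -> str:
    """Expand abbreviations and split hyphen-joined words before synthesis."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    # Split words joined by hyphens, e.g. "state-of-the-art" -> "state of the art".
    text = re.sub(r"(\w)-(\w)", r"\1 \2", text)
    return text

print(normalize("Dr. Smith lives on 3rd St. in a state-of-the-art house."))
```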
S406: acquiring the multilingual text to be synthesized;
S408: inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
S410: converting all the text encodings corresponding to the encoding rules into a joint text encoding;
S412: inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
S414: inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
In this embodiment, the multilingual text to be processed undergoes language normalization processing to obtain the multilingual text to be synthesized, which is then used as the input for speech synthesis; this further avoids missed or skipped words and further improves the quality of the synthesized speech.
As shown in FIG. 5, in one embodiment, the present invention provides a multilingual text-to-speech synthesis apparatus, the apparatus including:
a joint encoding module 502, configured to acquire multilingual text to be synthesized, input the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding to obtain text encodings corresponding to the encoding rules, and convert all the text encodings corresponding to the encoding rules into a joint text encoding;
a speech synthesis module 504, configured to input the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features, and input the predicted spectral features into a vocoder for synthesis processing to obtain target speech corresponding to the multilingual text to be synthesized.
The multilingual text-to-speech synthesis apparatus of this embodiment inputs the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding to obtain text encodings corresponding to the encoding rules, and then converts all the text encodings corresponding to the encoding rules into a joint text encoding. Encoding under different rules with at least two encoders with different encoding rules preserves the features of the text more fully and also facilitates the processing of multilingual text; encoding under different rules with at least two encoders and then converting the results into a joint text encoding improves the stability of the text-to-speech synthesis result, while also reducing deployment difficulty and deployment cost.
As shown in FIG. 6, in one embodiment, the joint encoding module includes a separate encoding sub-module 5022 and a joint encoding sub-module 5024;
the separate encoding sub-module 5022 is configured to acquire the multilingual text to be synthesized and input the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
the joint encoding sub-module 5024 is configured to concatenate all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding, and apply a linear affine transformation to the concatenated text encoding to obtain the joint text encoding.
As shown in FIG. 7, in one embodiment, the speech synthesis module includes a high-level feature extraction sub-module 5042 and a spectral feature prediction sub-module 5044;
the high-level feature extraction sub-module 5042 is configured to perform high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding;
the spectral feature prediction sub-module 5044 is configured to input the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
FIG. 8 shows an internal structure diagram of a computer device in one embodiment. The computer device may specifically be a terminal or a server. As shown in FIG. 8, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the multilingual text-to-speech synthesis method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to perform the multilingual text-to-speech synthesis method. Those skilled in the art can understand that the structure shown in FIG. 8 is merely a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
In one embodiment, the multilingual text-to-speech synthesis method provided by the present application may be implemented in the form of a computer program, and the computer program may be run on a computer device as shown in FIG. 8. The memory of the computer device may store the program templates that make up the multilingual text-to-speech synthesis apparatus, for example the joint encoding module 502 and the speech synthesis module 504.
In one embodiment, the present invention provides a storage medium storing a computer instruction program which, when executed by a processor, causes the processor to implement the following method steps:
acquiring multilingual text to be synthesized;
inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
converting all the text encodings corresponding to the encoding rules into a joint text encoding;
inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
When the method steps of this embodiment are executed, the multilingual text to be synthesized is input into at least two encoders with different encoding rules respectively for encoding to obtain text encodings corresponding to the encoding rules, and all the text encodings corresponding to the encoding rules are then converted into a joint text encoding. Encoding under different rules with at least two encoders with different encoding rules preserves the features of the text more fully and also facilitates the processing of multilingual text; encoding under different rules with at least two encoders and then converting the results into a joint text encoding improves the stability of the text-to-speech synthesis result, while also reducing deployment difficulty and deployment cost.
In one embodiment, converting all the text encodings corresponding to the encoding rules into a joint text encoding includes: concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding; and applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding.
In one embodiment, inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features, includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; and inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
In one embodiment, inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules, includes: inputting the multilingual text to be synthesized into a One-hot encoder for encoding, to obtain a One-hot text encoding corresponding to the One-hot encoder; inputting the multilingual text to be synthesized into a UTF-8 encoder for encoding, to obtain a UTF-8 text encoding corresponding to the UTF-8 encoder; and inputting the multilingual text to be synthesized into a phoneme encoder for encoding, to obtain a phoneme text encoding corresponding to the phoneme encoder.
In one embodiment, concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding includes: concatenating the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding along the channel dimension to obtain the concatenated text encoding, the concatenated text encoding comprising three-dimensional data, wherein the first dimension of data is the One-hot text encoding, the second dimension of data is the UTF-8 text encoding, and the third dimension of data is the phoneme text encoding.
In one embodiment, applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding includes: inputting the multi-dimensional concatenated text encoding into a first neural network, which applies a linear affine transformation to select among the text encodings corresponding to the encoding rules, to obtain the joint text encoding.
In one embodiment, performing high-level feature extraction on the joint text encoding to obtain the high-level features of the joint text encoding includes: inputting the joint text encoding into a second neural network for high-level feature extraction, to obtain the high-level features of the joint text encoding.
In one embodiment, the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network, arranged in sequence.
In one embodiment, inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features, includes: acquiring standard spectral feature data; inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder including a third neural network; and the third neural network of the decoder predicting spectral features based on the joint text encoding and the standard spectral feature data in combination with an attention mechanism, to obtain the predicted spectral features.
In one embodiment, before acquiring the multilingual text to be synthesized, the method further includes: acquiring multilingual text to be processed; and performing language normalization processing on the multilingual text to be processed, to obtain the multilingual text to be synthesized.
In one embodiment, the present invention provides a multilingual text-to-speech synthesis device, including at least one memory and at least one processor, the memory storing a computer instruction program which, when executed by the processor, causes the processor to perform the following method steps:
acquiring multilingual text to be synthesized;
inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
converting all the text encodings corresponding to the encoding rules into a joint text encoding;
inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
When the method steps of this embodiment are executed, the multilingual text to be synthesized is input into at least two encoders with different encoding rules respectively for encoding to obtain text encodings corresponding to the encoding rules, and all the text encodings corresponding to the encoding rules are then converted into a joint text encoding. Encoding under different rules with at least two encoders with different encoding rules preserves the features of the text more fully and also facilitates the processing of multilingual text; encoding under different rules with at least two encoders and then converting the results into a joint text encoding improves the stability of the text-to-speech synthesis result, while also reducing deployment difficulty and deployment cost.
In one embodiment, converting all the text encodings corresponding to the encoding rules into a joint text encoding includes: concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding; and applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding.
In one embodiment, inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features, includes: performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding; and inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
In one embodiment, inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules, includes: inputting the multilingual text to be synthesized into a One-hot encoder for encoding, to obtain a One-hot text encoding corresponding to the One-hot encoder; inputting the multilingual text to be synthesized into a UTF-8 encoder for encoding, to obtain a UTF-8 text encoding corresponding to the UTF-8 encoder; and inputting the multilingual text to be synthesized into a phoneme encoder for encoding, to obtain a phoneme text encoding corresponding to the phoneme encoder.
In one embodiment, concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding includes: concatenating the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding along the channel dimension to obtain the concatenated text encoding, the concatenated text encoding comprising three-dimensional data, wherein the first dimension of data is the One-hot text encoding, the second dimension of data is the UTF-8 text encoding, and the third dimension of data is the phoneme text encoding.
In one embodiment, applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding includes: inputting the multi-dimensional concatenated text encoding into a first neural network, which applies a linear affine transformation to select among the text encodings corresponding to the encoding rules, to obtain the joint text encoding.
In one embodiment, performing high-level feature extraction on the joint text encoding to obtain the high-level features of the joint text encoding includes: inputting the joint text encoding into a second neural network for high-level feature extraction, to obtain the high-level features of the joint text encoding.
In one embodiment, the second neural network includes a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network, arranged in sequence.
In one embodiment, inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features, includes: acquiring standard spectral feature data; inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder including a third neural network; and the third neural network of the decoder predicting spectral features based on the joint text encoding and the standard spectral feature data in combination with an attention mechanism, to obtain the predicted spectral features.
In one embodiment, before acquiring the multilingual text to be synthesized, the method further includes: acquiring multilingual text to be processed; and performing language normalization processing on the multilingual text to be processed, to obtain the multilingual text to be synthesized.
It should be noted that the above multilingual text-to-speech synthesis method, multilingual text-to-speech synthesis apparatus, storage medium, and multilingual text-to-speech synthesis device belong to one general inventive concept, and the contents of the embodiments of the multilingual text-to-speech synthesis method, the multilingual text-to-speech synthesis apparatus, the storage medium, and the multilingual text-to-speech synthesis device are mutually applicable.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through a computer program; the program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined in any manner; for brevity of description, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the present application. It should be pointed out that, for those of ordinary skill in the art, several modifications and improvements may be made without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (15)

  1. A multilingual text-to-speech synthesis method, the method comprising:
    acquiring multilingual text to be synthesized;
    inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
    converting all the text encodings corresponding to the encoding rules into a joint text encoding;
    inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features;
    inputting the predicted spectral features into a vocoder for synthesis processing, to obtain target speech corresponding to the multilingual text to be synthesized.
  2. The multilingual text-to-speech synthesis method according to claim 1, wherein converting all the text encodings corresponding to the encoding rules into a joint text encoding comprises:
    concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding;
    applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding.
  3. The multilingual text-to-speech synthesis method according to claim 1, wherein inputting the joint text encoding and standard spectral feature data into a decoder for predictive decoding, to obtain predicted spectral features, comprises:
    performing high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding;
    inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
  4. The multilingual text-to-speech synthesis method according to claim 2, wherein inputting the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules, comprises:
    inputting the multilingual text to be synthesized into a One-hot encoder for encoding, to obtain a One-hot text encoding corresponding to the One-hot encoder;
    inputting the multilingual text to be synthesized into a UTF-8 encoder for encoding, to obtain a UTF-8 text encoding corresponding to the UTF-8 encoder;
    inputting the multilingual text to be synthesized into a phoneme encoder for encoding, to obtain a phoneme text encoding corresponding to the phoneme encoder.
  5. The multilingual text-to-speech synthesis method according to claim 4, wherein concatenating all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding comprises:
    concatenating the One-hot text encoding, the UTF-8 text encoding, and the phoneme text encoding along the channel dimension to obtain the concatenated text encoding, the concatenated text encoding comprising three-dimensional data, wherein the first dimension of data is the One-hot text encoding, the second dimension of data is the UTF-8 text encoding, and the third dimension of data is the phoneme text encoding.
  6. The multilingual text-to-speech synthesis method according to claim 2, wherein applying a linear affine transformation to the concatenated text encoding to obtain the joint text encoding comprises:
    inputting the multi-dimensional concatenated text encoding into a first neural network, which applies a linear affine transformation to select among the text encodings corresponding to the encoding rules, to obtain the joint text encoding.
  7. The multilingual text-to-speech synthesis method according to claim 3, wherein performing high-level feature extraction on the joint text encoding to obtain the high-level features of the joint text encoding comprises:
    inputting the joint text encoding into a second neural network for high-level feature extraction, to obtain the high-level features of the joint text encoding.
  8. The multilingual text-to-speech synthesis method according to claim 7, wherein the second neural network comprises a character-level convolutional neural network, three convolutional layers, and a bidirectional long short-term memory recurrent neural network arranged in sequence.
  9. The multilingual text-to-speech synthesis method according to claim 3, wherein inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features, comprises:
    acquiring standard spectral feature data;
    inputting the high-level features of the joint text encoding and the standard spectral feature data into the decoder, the decoder comprising a third neural network;
    the third neural network of the decoder predicting spectral features based on the joint text encoding and the standard spectral feature data in combination with an attention mechanism, to obtain the predicted spectral features.
  10. The multilingual text-to-speech synthesis method according to any one of claims 1 to 9, wherein before acquiring the multilingual text to be synthesized, the method further comprises:
    acquiring multilingual text to be processed;
    performing language normalization processing on the multilingual text to be processed, to obtain the multilingual text to be synthesized.
  11. A multilingual text-to-speech synthesis apparatus, wherein the apparatus comprises:
    a joint encoding module, configured to acquire multilingual text to be synthesized, input the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding to obtain text encodings corresponding to the encoding rules, and convert all the text encodings corresponding to the encoding rules into a joint text encoding;
    a speech synthesis module, configured to input the joint text encoding and standard spectral feature data into a decoder for predictive decoding to obtain predicted spectral features, and input the predicted spectral features into a vocoder for synthesis processing to obtain target speech corresponding to the multilingual text to be synthesized.
  12. The multilingual text-to-speech synthesis apparatus according to claim 11, wherein the joint encoding module comprises a separate encoding sub-module and a joint encoding sub-module;
    the separate encoding sub-module is configured to acquire the multilingual text to be synthesized and input the multilingual text to be synthesized into at least two encoders with different encoding rules respectively for encoding, to obtain text encodings corresponding to the encoding rules;
    the joint encoding sub-module is configured to concatenate all the text encodings corresponding to the encoding rules to obtain a concatenated text encoding, and apply a linear affine transformation to the concatenated text encoding to obtain the joint text encoding.
  13. The multilingual text-to-speech synthesis apparatus according to claim 11, wherein the speech synthesis module comprises a high-level feature extraction sub-module and a spectral feature prediction sub-module;
    the high-level feature extraction sub-module is configured to perform high-level feature extraction on the joint text encoding to obtain high-level features of the joint text encoding;
    the spectral feature prediction sub-module is configured to input the high-level features of the joint text encoding and the standard spectral feature data into the decoder for predictive decoding, to obtain the predicted spectral features.
  14. A storage medium storing a computer instruction program, wherein the computer instruction program, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 10.
  15. A multilingual text-to-speech synthesis device, comprising at least one memory and at least one processor, the memory storing a computer instruction program, wherein the computer instruction program, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 10.
PCT/CN2019/127334 2019-12-23 2019-12-23 一种多语言文本合成语音方法、装置、设备及存储介质 WO2021127817A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980003170.6A CN111247581B (zh) 2019-12-23 2019-12-23 一种多语言文本合成语音方法、装置、设备及存储介质
PCT/CN2019/127334 WO2021127817A1 (zh) 2019-12-23 2019-12-23 一种多语言文本合成语音方法、装置、设备及存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/127334 WO2021127817A1 (zh) 2019-12-23 2019-12-23 一种多语言文本合成语音方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021127817A1 true WO2021127817A1 (zh) 2021-07-01

Family

ID=70880890

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127334 WO2021127817A1 (zh) 2019-12-23 2019-12-23 一种多语言文本合成语音方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111247581B (zh)
WO (1) WO2021127817A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220215827A1 (en) * 2020-05-13 2022-07-07 Tencent Technology (Shenzhen) Company Limited Audio synthesis method and apparatus, computer readable medium, and electronic device
US12106746B2 (en) * 2020-05-13 2024-10-01 Tencent Technology (Shenzhen) Company Limited Audio synthesis method and apparatus, computer readable medium, and electronic device

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112133282B (zh) * 2020-10-26 2022-07-08 厦门大学 轻量级多说话人语音合成系统及电子设备
CN112365878B (zh) * 2020-10-30 2024-01-23 广州华多网络科技有限公司 语音合成方法、装置、设备及计算机可读存储介质
CN112634858B (zh) * 2020-12-16 2024-01-23 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及存储介质
CN112712789B (zh) * 2020-12-21 2024-05-03 深圳市优必选科技股份有限公司 跨语言音频转换方法、装置、计算机设备和存储介质
WO2022133630A1 (zh) * 2020-12-21 2022-06-30 深圳市优必选科技股份有限公司 跨语言音频转换方法、计算机设备和存储介质
CN112634865B (zh) * 2020-12-23 2022-10-28 爱驰汽车有限公司 语音合成方法、装置、计算机设备和存储介质
CN112652294B (zh) * 2020-12-25 2023-10-24 深圳追一科技有限公司 语音合成方法、装置、计算机设备和存储介质
CN112735373B (zh) * 2020-12-31 2024-05-03 科大讯飞股份有限公司 语音合成方法、装置、设备及存储介质
CN113160792B (zh) * 2021-01-15 2023-11-17 广东外语外贸大学 一种多语种的语音合成方法、装置和系统
CN113033150A (zh) * 2021-03-18 2021-06-25 深圳市元征科技股份有限公司 一种程序文本的编码处理方法、装置以及存储介质
CN113870834A (zh) * 2021-09-26 2021-12-31 平安科技(深圳)有限公司 多语言语音合成方法、系统、设备和存储介质
CN118506764A (zh) * 2024-07-17 2024-08-16 成都索贝数码科技股份有限公司 基于自回归类深度学习语音合成的可控输出方法及设备

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1188957A (zh) * 1996-09-24 1998-07-29 索尼公司 矢量量化方法和语音编码方法及其装置
US20140025381A1 (en) * 2012-07-20 2014-01-23 Microsoft Corporation Evaluating text-to-speech intelligibility using template constrained generalized posterior probability
CN104732542A (zh) * 2015-03-27 2015-06-24 安徽省道一电子科技有限公司 基于多摄像头自标定的全景车辆安全系统的图像处理方法
CN105390141A (zh) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 声音转换方法和装置
US20170103749A1 (en) * 2015-10-13 2017-04-13 GM Global Technology Operations LLC Dynamically adding or removing functionality to speech recognition systems
CN109767755A (zh) * 2019-03-01 2019-05-17 广州多益网络股份有限公司 一种语音合成方法和系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11069335B2 (en) * 2016-10-04 2021-07-20 Cerence Operating Company Speech synthesis using one or more recurrent neural networks
BR112019006979A2 (pt) * 2016-10-24 2019-06-25 Semantic Machines Inc sequência para sequenciar transformações para síntese de fala via redes neurais recorrentes
US10249289B2 (en) * 2017-03-14 2019-04-02 Google Llc Text-to-speech synthesis using an autoencoder
JP7112075B2 (ja) * 2017-08-07 2022-08-03 国立研究開発法人情報通信研究機構 音声合成のためのフロントエンドの学習方法、コンピュータプログラム、音声合成システム、及び音声合成のためのフロントエンド処理方法
WO2019139428A1 (ko) * 2018-01-11 2019-07-18 네오사피엔스 주식회사 다중 언어 텍스트-음성 합성 방법
CN109326283B (zh) * 2018-11-23 2021-01-26 南京邮电大学 非平行文本条件下基于文本编码器的多对多语音转换方法

Also Published As

Publication number Publication date
CN111247581A (zh) 2020-06-05
CN111247581B (zh) 2023-10-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19957924

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19957924

Country of ref document: EP

Kind code of ref document: A1