WO2020190050A1 - Speech synthesis apparatus and method therefor (Appareil de synthèse vocale et procédé associé) - Google Patents

Speech synthesis apparatus and method therefor

Info

Publication number
WO2020190050A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech synthesis
information
neural network
speech
vector
Prior art date
Application number
PCT/KR2020/003753
Other languages
English (en)
Korean (ko)
Inventor
박중배
한기종
Original Assignee
휴멜로 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 휴멜로 주식회사 filed Critical 휴멜로 주식회사
Publication of WO2020190050A1 publication Critical patent/WO2020190050A1/fr

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10Prosody rules derived from text; Stress or intonation
    • G10L2013/105Duration

Definitions

  • The present disclosure relates to a speech synthesis apparatus and a method thereof.
  • More particularly, the present disclosure relates to a speech synthesis apparatus equipped with functions for adjusting pitch, sound intensity, and beat, a speech synthesis method performed by the apparatus, and a method of constructing a neural-network-based speech synthesis model equipped with the above-listed adjustment functions.
  • Speech synthesis technology is a technology that synthesizes, from an input text, a sound similar to a human speaking voice, and is commonly known as text-to-speech (TTS) technology.
  • As personal portable devices such as smartphones, e-book readers, and vehicle navigation systems have been actively developed and distributed, the demand for speech synthesis technology for voice output is rapidly increasing.
  • a technical problem to be solved through some embodiments of the present disclosure is to provide a speech synthesis apparatus capable of performing pitch adjustment together in synthesizing a speech for a given text, and a method performed by the apparatus.
  • Another technical problem to be solved through some embodiments of the present disclosure is to provide a speech synthesis apparatus capable of performing tempo adjustment together in synthesizing a speech for a given text, and a method performed by the apparatus.
  • Another technical problem to be solved through some embodiments of the present disclosure is to provide a speech synthesis device capable of performing sound intensity adjustment together in synthesizing a speech for a given text, and a method performed by the device.
  • According to some embodiments of the present disclosure, a speech synthesis apparatus includes a preprocessor that performs preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and beat information into a neural-network-based speech synthesis model to synthesize a target speech in which the beat information is reflected with respect to the input text.
  • The speech synthesis model may include: an embedding module that converts the preprocessed text into a character embedding vector; an aggregator module that generates an input vector constituting an input sequence by aggregating the beat information and the character embedding vector; an encoder neural network that encodes the input sequence and outputs an encoded vector; and a decoder neural network that decodes the encoded vector and outputs an output sequence associated with the target speech.
  • In some embodiments, the speech synthesis model may further include an attention module positioned between the encoder neural network and the decoder neural network, which determines the portion of the encoded vector on which the decoder neural network should focus.
  • the encoder neural network and the decoder neural network may be implemented based on a recurrent neural network (RNN) or a self-attention technique.
  • the aggregator module may generate the input vector by concatenating the character embedding vector and the beat information.
  • the beat information may be information about duration of a sound set for each phoneme or for each syllable for the input text.
  • the output sequence is composed of data in the form of a spectrogram
  • the speech synthesis unit may further include a vocoder unit for converting the output sequence into the target speech.
  • In some embodiments, the output sequence is composed of data in the form of a spectrogram, and the speech synthesis unit may input the training text and correct answer beat information preprocessed by the preprocessor into the speech synthesis model, compare the resulting predicted spectrogram data with the correct answer spectrogram data to calculate an error value, and back-propagate the calculated error value to train the speech synthesis model.
  • According to some embodiments of the present disclosure, a speech synthesis apparatus includes a preprocessor that performs preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and prosody information into a neural-network-based speech synthesis model to synthesize a target speech in which the prosody information is reflected with respect to the input text.
  • The speech synthesis model may include: an embedding module that converts the preprocessed text into a character embedding vector; an aggregator module that generates an input vector constituting an input sequence by aggregating the prosody information and the character embedding vector; an encoder neural network that encodes the input sequence and outputs an encoded vector; and a decoder neural network that decodes the encoded vector and outputs an output sequence associated with the target speech.
  • According to some other embodiments of the present disclosure, a speech synthesis apparatus includes a preprocessor that performs preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and prosody information into a neural-network-based speech synthesis model to synthesize a target speech in which the prosody information is reflected with respect to the input text.
  • Here, the speech synthesis model may include: an embedding module that converts the preprocessed text into a character embedding vector; an encoder neural network that encodes an input sequence composed of the character embedding vectors and outputs an encoded vector; and a decoder neural network that decodes the encoded vector using the prosody information and outputs an output sequence associated with the target speech.
  • FIG. 1 is a diagram for explaining input and output of a speech synthesis apparatus according to some embodiments of the present disclosure.
  • FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus and data flow in a learning process according to some embodiments of the present disclosure.
  • FIG. 3 is an exemplary block diagram illustrating a preprocessor according to some embodiments of the present disclosure.
  • FIG. 4 is an exemplary diagram illustrating an operation of a text preprocessor according to some embodiments of the present disclosure.
  • FIG. 5 is an exemplary diagram for explaining an operation of a voice analysis unit according to some embodiments of the present disclosure.
  • FIG. 6 is an exemplary block diagram illustrating a speech synthesizer according to some embodiments of the present disclosure.
  • FIG. 7 is a diagram illustrating a neural network structure of a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 8 is an exemplary diagram for describing an operation of an aggregator module according to some embodiments of the present disclosure.
  • FIG. 9 is an exemplary diagram illustrating an LSTM recurrent neural network that can be used in a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 10 is an exemplary diagram for explaining a learning process for a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 11 is an exemplary block diagram illustrating a speech synthesis apparatus and data flow in a synthesis process according to some embodiments of the present disclosure.
  • FIGS. 12 and 13 are diagrams for explaining a neural network structure of a modified speech synthesis model according to various embodiments of the present disclosure.
  • FIG. 14 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure.
  • FIG. 15 is an exemplary flowchart for further explaining a learning process for a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 16 is an exemplary flowchart for further explaining a synthesis process using a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 17 is an exemplary diagram for explaining a user interface (UI) representing a result of speech synthesis that can be referred to in various embodiments of the present disclosure.
  • FIG. 18 is a diagram illustrating an exemplary computing device capable of implementing a speech synthesis apparatus according to some embodiments of the present disclosure.
  • Terms such as first, second, A, B, (a), and (b) may be used herein. These terms are used only to distinguish one component from another, and the nature, sequence, or order of the components is not limited by the terms.
  • When a component is described as being "connected", "coupled", or "linked" to another component, the component may be directly connected or linked to that other component, but it should be understood that still another component may be "connected", "coupled", or "linked" between the two components.
  • In the following description, the prosody information may include all kinds of information related to the prosody of the synthesized sound, such as the pitch and the intensity (i.e., strength and weakness) of the sound for the input text.
  • the prosody information may be expressed as pitch information or sound intensity information corresponding to unit text or unit time, but the technical scope of the present disclosure is not limited thereto.
  • Here, the unit text may be a phoneme, a syllable, a word, or a phrase, but the technical scope of the present disclosure is not limited thereto.
  • the pitch may, for example, be expressed in the form of a frequency (e.g. fundamental frequency F0), but the technical scope of the present disclosure is not limited thereto.
  • The intensity of the sound may be expressed, for example, in decibels (dB), but the technical scope of the present disclosure is not limited thereto.
  • the beat information may include all kinds of information related to the beat of the synthesized sound, such as information on the duration of the sound for the input text.
  • The beat information may be expressed as sound length (duration) information corresponding to a unit text or a unit time, but the technical scope of the present disclosure is not limited thereto.
  • Here, the unit text may be a phoneme, a syllable, a word, or a phrase, but the technical scope of the present disclosure is not limited thereto.
  • The length of a sound may be expressed as a single value such as a duration, or as a range such as (start time, end time), but the technical scope of the present disclosure is not limited thereto.
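  • As a purely illustrative aid (not part of the disclosure), the per-unit prosody and beat information described above could be represented as a simple record per phoneme or syllable; all field names below are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UnitAnnotation:
    """Per-phoneme (or per-syllable) annotation; all field names are illustrative assumptions."""
    unit: str                       # the phoneme or syllable text
    duration_sec: Optional[float]   # beat information: length of the sound
    pitch_hz: Optional[float]       # prosody: pitch, e.g. fundamental frequency F0
    intensity_db: Optional[float]   # prosody: sound intensity in decibels

# e.g. one syllable held for 0.5 s at 660 Hz
note = UnitAnnotation(unit="i", duration_sec=0.5, pitch_hz=660.0, intensity_db=-20.0)
```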
  • the target voice literally means a synthesized sound that is a target to be generated from a given text.
  • an instruction refers to a series of computer-readable instructions grouped on the basis of functions, which are components of a computer program and executed by a processor.
  • FIG. 1 is an exemplary diagram illustrating inputs and outputs of a speech synthesis apparatus 10 according to some embodiments of the present disclosure.
  • As shown in FIG. 1, the speech synthesis device 10 receives at least one of text 1, prosody information 3, and beat information 5, and synthesizes and outputs a corresponding voice 7.
  • the voice 7 refers to a target voice (ie, target synthesized sound) in which the input prosody information 3 and/or the beat information 5 is reflected.
  • For example, the computing device may be a notebook computer, a desktop computer, a laptop computer, or the like, but is not limited thereto and may include any type of device equipped with a computing function. For an example of such a computing device, refer to FIG. 18.
  • The speech synthesis device 10 may be implemented as a single computing device, but a first function of the speech synthesis device 10 may of course be implemented in a first computing device and a second function in a second computing device.
  • the speech synthesis apparatus 10 uses a speech synthesis model based on a neural network to synthesize a target speech reflecting the prosody information 3 and/or the beat information 5 .
  • In some embodiments, the speech synthesis apparatus 10 may operate to reflect only the prosody information 3 or only the beat information 5.
  • FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus 10 according to some embodiments of the present disclosure.
  • FIG. 2 shows a data flow for a process in which the speech synthesis model 63 is trained.
  • Hereinafter, the operation of each component 21 to 27 in the learning process of the speech synthesis model 63 will be described first, and the operation of each component in the speech synthesis process will be described later with reference to FIG. 11.
  • the speech synthesis apparatus 10 may include an input unit 21, a preprocessor 23, a storage unit 25, and a speech synthesis unit 27.
  • Note that only the components related to the embodiments of the present disclosure are shown in FIG. 2. Accordingly, those of ordinary skill in the art to which the present disclosure pertains will recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 2.
  • Each of the components of the speech synthesis apparatus 10 shown in FIG. 2 represents a functionally divided functional element, and a plurality of components may be implemented in a form integrated with each other in an actual physical environment.
  • each component will be described in detail.
  • The input unit 21 receives a training data set including text for learning and correct answer voice data.
  • The correct answer voice data is voice data corresponding to the learning text, and the voice reflects prosody characteristics (i.e., pitch and intensity) and beat characteristics according to the speaker's tone. Accordingly, by machine-learning the prosody information and beat information reflected in the correct answer voice data, it is possible to synthesize a natural voice while controlling the prosody or beat.
  • the learning text and correct answer voice data are input to the preprocessor 23 for preprocessing.
  • the preprocessor 23 performs preprocessing on the inputted learning text and correct answer voice data.
  • the preprocessor 23 may include a text preprocessor 31, a speech analysis unit 33, and a speech preprocessor 35 as illustrated in FIG. 3.
  • the text preprocessor 31 performs preprocessing on the input text.
  • The preprocessing may be performed in various ways, such as dividing the input text into sentences, parsing the sentence-level text into units such as phrases, words, characters, and phonemes, and converting numbers and special characters into characters; the specific preprocessing method may vary according to the embodiment. Some examples of the preprocessing process are shown in FIG. 4.
  • For example, the text preprocessor 31 may convert the numbers in the input text 41 into characters to generate character-form text 43, and may convert the text 43 into phoneme-level text 45.
  • However, this is only an example for explaining the operation of the text preprocessor 31, and the text preprocessor 31 may perform natural-language preprocessing functions in various ways.
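  • For illustration only, a minimal sketch of such a text preprocessing pipeline is given below; the helper functions and the digit-to-word table are hypothetical stand-ins for the language-specific number normalization and grapheme-to-phoneme conversion that the disclosure leaves unspecified.

```python
import re

# Hypothetical helpers: a real system would use language-specific number normalization
# and grapheme-to-phoneme (G2P) conversion; this only illustrates the pipeline shape.
NUM_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
             "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_numbers(text: str) -> str:
    """Spell out each digit as a word (a crude stand-in for full number expansion)."""
    return re.sub(r"\d", lambda m: " " + NUM_WORDS[m.group()] + " ", text)

def to_phoneme_like_units(text: str) -> list:
    """Split normalized text into character-level units (a stand-in for G2P)."""
    return [ch for ch in normalize_numbers(text).lower() if ch.strip()]

print(to_phoneme_like_units("Room 41"))
# ['r', 'o', 'o', 'm', 'f', 'o', 'u', 'r', 'o', 'n', 'e']
```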
  • the voice analysis unit 33 extracts time signature information and prosody information for the correct answer voice data through voice analysis on the correct answer voice data.
  • the speech analysis unit 33 may further receive text for learning together with correct answer speech data in order to increase the accuracy of extraction.
  • the voice analysis unit 33 may include a time signature information extraction unit 33-1 and a prosody information extraction unit 33-2.
  • The beat information extraction unit 33-1 extracts beat information from the correct answer speech data in an audio format (e.g., wav audio), and the prosody information extraction unit 33-2 extracts prosody information from the correct answer speech data in the audio format.
  • the beat information extractor 33-1 may analyze the correct answer voice data to extract a length of a sound corresponding to each phoneme or syllable (or word, sentence, etc.).
  • the prosody information extracting unit 33-2 may analyze the correct answer speech data to extract a pitch (e.g. frequency) or intensity of a sound corresponding to each phoneme or syllable (or word, sentence, etc.).
  • one or more speech analysis algorithms well known in the art may be used.
  • speech analysis and annotation tools well known in the art, such as the SPPAS tool, may be used.
  • the SPPAS tool provides a function of analyzing audio speech and extracting information about duration and pitch in phoneme units.
  • An example of extracting beat information using the SPPAS tool is shown in FIG. 5. As shown in FIG. 5, the SPPAS tool analyzes the voice data 51 in an audio format and extracts sound length information 53 for each phoneme.
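  • As a hedged illustration, frame-level pitch and intensity could be extracted with an off-the-shelf library such as librosa, as sketched below; the file name and parameters are assumptions, and duration (beat) extraction via forced alignment is left to a tool such as SPPAS.

```python
import librosa
import numpy as np

# "answer.wav" is an assumed file name for one correct-answer recording.
y, sr = librosa.load("answer.wav", sr=22050)

# Frame-level pitch (F0) via probabilistic YIN; unvoiced frames come back as NaN.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)

# Frame-level intensity: RMS energy converted to decibels.
intensity_db = librosa.amplitude_to_db(librosa.feature.rms(y=y)[0])

# Frame-level values would then be pooled per phoneme/syllable using the
# alignment (durations) produced by the beat information extraction step.
print(np.nanmean(f0), intensity_db.mean())
```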
  • the speech preprocessor 35 converts correct answer speech data (e.g. wav format audio) in an audio format into spectrogram format data.
  • the speech preprocessor 35 may convert speech data into STFT spectrogram data by performing Short Time Fourier Transform (STFT) signal processing, or may convert the STFT spectrogram data into mel-scale.
  • the spectrogram data may be used to train the speech synthesis model 63.
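  • A minimal sketch of this conversion using librosa is shown below; the STFT and mel parameters are assumptions, since the disclosure does not fix them.

```python
import librosa
import numpy as np

y, sr = librosa.load("answer.wav", sr=22050)                          # assumed file name
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))            # STFT magnitude spectrogram
mel = librosa.feature.melspectrogram(S=stft**2, sr=sr, n_mels=80)     # mel-scale spectrogram
mel_db = librosa.power_to_db(mel)                                     # log-mel, a common training target
print(stft.shape, mel_db.shape)                                       # (freq_bins, frames), (80, frames)
```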
  • Data preprocessed by the preprocessor 23, such as the preprocessed text, prosody information, beat information, and spectrogram data, may be stored in the storage unit 25.
  • the data stored in the storage unit 25 may be used to perform iterative learning on the speech synthesis model 63, build another speech synthesis model, or rebuild the speech synthesis model 63.
  • the storage unit 25 stores and manages various data such as text, prosody information, time signature information, voice data, and spectrogram data.
  • the storage unit 25 may use a database.
  • the speech synthesis unit 27 receives the pre-processed training text and correct answer spectrogram data and constructs a neural network-based speech synthesis model 63 by using them.
  • In order to provide functions for adjusting the pitch, the intensity of the sound, and/or the beat, prosody information and beat information are additionally used for the learning of the speech synthesis model 63.
  • the voice synthesis model 63 may be constructed by further learning pitch information.
  • the voice synthesis model 63 may be constructed by further learning the beat information.
  • the speech synthesis unit 27 may include a learning unit 61, a speech synthesis model 63, a synthesis unit 65, and a vocoder unit 67.
  • the learning unit 61 trains the speech synthesis model 63 using the training data set. That is, the learning unit 61 may construct the speech synthesis model 63 by updating the weight of the speech synthesis model 63 so that the prediction error of the speech synthesis model 63 is minimized using the training data set.
  • the training dataset may be provided from the preprocessor 23 or the storage unit 25.
  • the neural network structure of the speech synthesis model 63 will be first described, and then the operation of the learning unit 61 will be described in detail.
  • The speech synthesis model 63 is a neural-network-based model that receives the preprocessed text, prosody information, and/or beat information and synthesizes the corresponding speech. As shown in FIG. 7, the speech synthesis model 63 according to some embodiments of the present disclosure includes an embedding module 71, an aggregator module 73, an encoder neural network 75, an attention module 77, and a decoder neural network 79.
  • The embedding module 71 is a module that converts the preprocessed text into a character embedding vector through an embedding technique.
  • The embedding module 71 may generate character embedding vectors in units of phonemes, or in units such as syllables, words, or phrases.
  • The embedding module 71 may generate the character embedding vector using, for example, a fastText embedding technique, an auto-encoder embedding technique, or a self-attention embedding technique, but the technical scope of the present disclosure is not limited thereto.
  • The aggregator module 73 is a module that aggregates the input information to generate input vectors constituting an input sequence for the encoder neural network 75. For example, when a character embedding vector, prosody information, and beat information are input, the aggregator module 73 may aggregate the character embedding vector, the prosody information, and the beat information to generate an input vector for the encoder neural network 75. The generated sequence of input vectors may then be input to the encoder neural network 75.
  • the specific manner in which the aggregation is performed may vary depending on the embodiment.
  • For example, the aggregator module 73 may perform the aggregation by concatenating the input information.
  • FIG. 8 illustrates an example of generating an input sequence by concatenating the phoneme-level beat information 53 shown in FIG. 5 with the corresponding character embedding vectors 81 and 84.
  • As shown, the aggregator module 73 may generate a first input vector 83 by concatenating the first beat information 82 for the phoneme "p" with the first character embedding vector 81 for the phoneme "p", and may generate a second input vector 86 by concatenating the second beat information 85 for the next phoneme "r" with the second character embedding vector 84 for the phoneme "r".
  • The generated input vectors 83 and 86 constitute the input sequence of the encoder neural network 75. Prosody information can also be concatenated with the character embedding vectors in a similar way.
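  • The concatenation-based aggregation described above can be sketched as follows in PyTorch; the vocabulary size, embedding dimension, and the example ids/durations are illustrative assumptions, not values from the disclosure.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 70, 256                        # illustrative sizes
embedding = nn.Embedding(vocab_size, embed_dim)        # stands in for the embedding module

phoneme_ids = torch.tensor([[12, 27, 5]])              # ids of three phonemes (batch of 1)
durations = torch.tensor([[[0.10], [0.20], [0.15]]])   # beat information per phoneme, in seconds

char_emb = embedding(phoneme_ids)                      # (1, 3, 256) character embedding vectors
input_seq = torch.cat([char_emb, durations], dim=-1)   # (1, 3, 257): concatenation-based aggregation
```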
  • As another example, the input information may be aggregated through an artificial neural network; for instance, the aggregation may be performed using a self-attention or auto-encoder technique.
  • As yet another example, the aggregation may be performed through a linear model that combines the character embedding vectors (e.g., 81, 84) and the beat information (e.g., 82, 85) to generate the input vectors (or input sequence) of the encoder neural network 75.
  • However, the technical scope of the present disclosure is not limited to the examples listed above.
  • the encoder neural network 75 is a neural network that receives an input sequence composed of one or more vectors, encodes the input sequence, and outputs an encoded vector. As the learning progresses, the encoder neural network 75 understands the context contained in the input sequence and outputs an encoded vector representing the understood context.
  • the encoded vector may be referred to as a context vector in the art.
  • the encoder neural network 75 and the decoder neural network 79 may be implemented as a recurrent neural network (RNN) to be suitable for inputting and outputting sequences.
  • For example, the encoder neural network 75 and the decoder neural network 79 may be implemented as a Long Short-Term Memory (LSTM) recurrent neural network 90, as shown in FIG. 9.
  • However, the present disclosure is not limited thereto, and at least some of the encoder neural network 75 and the decoder neural network 79 may be implemented using self-attention, a transformer network, or the like. Since those skilled in the art will clearly understand self-attention and transformer networks, detailed descriptions of these techniques are omitted.
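  • For illustration, a minimal recurrent encoder consistent with this description could look like the following PyTorch sketch; the bidirectional LSTM and the layer sizes are assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

# Bidirectional LSTM encoder; input size 257 matches the concatenated vectors sketched earlier.
encoder = nn.LSTM(input_size=257, hidden_size=256,
                  batch_first=True, bidirectional=True)

input_seq = torch.randn(1, 3, 257)     # aggregated input sequence (stand-in)
encoded, _ = encoder(input_seq)        # (1, 3, 512): one encoded vector per input position
```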
  • The attention module 77 is a module that provides attention information indicating which part of the encoded vector the decoder neural network 79 should focus on (and which part it should not) when learning/predicting the output sequence. As the learning progresses, the attention module 77 learns a mapping relationship between the encoded vector and the output sequence, thereby providing attention information that indicates the portions to be focused on and the portions to be ignored during decoding.
  • the attention information may be provided in the form of a weight vector (or weight matrix), but the technical scope of the present disclosure is not limited thereto. Those skilled in the art will be able to clearly understand the attention mechanism, and a detailed description thereof will be omitted.
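  • As one possible concretization, an additive (Bahdanau-style) attention module producing such a weight vector is sketched below; the disclosure does not fix the exact mechanism, so the formulation and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention producing a weight vector over encoded positions."""
    def __init__(self, enc_dim=512, dec_dim=256, attn_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, encoded, dec_state):
        # encoded: (batch, seq, enc_dim), dec_state: (batch, dec_dim)
        scores = self.v(torch.tanh(self.w_enc(encoded) + self.w_dec(dec_state).unsqueeze(1)))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)       # where the decoder should focus
        context = (weights.unsqueeze(-1) * encoded).sum(dim=1)    # weighted summary of the encoded vectors
        return context, weights
```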
  • the decoder neural network 79 receives the encoded vector and the attention information and outputs an output sequence corresponding to the encoded vector.
  • the decoder neural network 79 predicts an output sequence for a voice reflecting specific prosody and time signature information using the encoded vector and the attention information.
  • the output sequence may be composed of spectrogram data in units of frames, but the technical scope of the present disclosure is not limited thereto.
  • In some embodiments, the decoder neural network 79 may further receive the spectrogram data of the previous frame as input and sequentially output the spectrogram data of the current frame to construct the output sequence.
  • the spectrogram data is data representing a spectrogram of a voice signal, and may be STFT spectrogram data or mel-spectrogram data, but the technical scope of the present disclosure is not limited thereto.
  • The reason the decoder neural network 79 is configured to output spectrogram data instead of a raw speech signal is that, when learning is performed on spectrogram data, the prediction error can be calculated more accurately than on a speech signal, so a speech synthesis model with superior performance can be constructed.
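  • A minimal autoregressive decoder step consistent with this description is sketched below, assuming mel-spectrogram frames of 80 bins and the attention context from the previous sketch; it is an illustration, not the patented decoder.

```python
import torch
import torch.nn as nn

class FrameDecoder(nn.Module):
    """One autoregressive step: previous spectrogram frame + attention context -> current frame."""
    def __init__(self, n_mels=80, enc_dim=512, hidden=256):
        super().__init__()
        self.cell = nn.LSTMCell(n_mels + enc_dim, hidden)
        self.proj = nn.Linear(hidden, n_mels)

    def step(self, prev_frame, context, state=None):
        # prev_frame: (batch, n_mels), context: (batch, enc_dim)
        h, c = self.cell(torch.cat([prev_frame, context], dim=-1), state)
        return self.proj(h), (h, c)    # predicted current frame, updated recurrent state
```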
  • each of the training data 100 may include a training text 101 and a correct answer voice data 102.
  • In some embodiments, the training data 100 may instead be composed of the training text 101, correct answer spectrogram data, and at least one of training beat information and training prosody information.
  • The correct answer voice data 102 is voice data in an audio format corresponding to the text 101. Before learning is performed, the correct answer voice data 102 may be converted into correct answer spectrogram data 104 by the voice preprocessor 35, and the text 101 may be appropriately preprocessed by the text preprocessor 31.
  • the voice analysis unit 33 may analyze the correct answer voice data 102 to extract time signature information and/or prosody information.
  • The process by which the learning unit 61 trains the speech synthesis model 63 is as follows. First, when the preprocessed text 101 is input to the embedding module 71, it is converted into character embedding vectors. The character embedding vectors and the beat information and/or prosody information are combined into input vectors by the aggregator module 73. The sequence of input vectors is fed to the encoder neural network 75, and as a result an output sequence composed of the predicted spectrogram data 103 is output from the decoder neural network 79.
  • Next, the learning unit 61 compares the predicted spectrogram data 103 with the correct answer spectrogram data 104 to calculate a prediction error 105, back-propagates the prediction error 105, and updates the weights of the speech synthesis model 63.
  • the weights of the encoder neural network 75, the attention module 77, and the decoder neural network 79 may be updated at once through the backpropagation.
  • If the embedding module 71 and/or the aggregator module 73 are implemented as layers of the neural network, the weights of the embedding module 71 and/or the aggregator module 73 may also be updated together.
  • the learning unit 61 may build the speech synthesis model 63 by repeating this learning process for a plurality of training data.
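  • The training loop described above can be summarized by the following sketch, assuming a `model` that bundles the modules sketched earlier and a `loader` that yields preprocessed training examples; the L1 loss and optimizer choice are assumptions.

```python
import torch
import torch.nn.functional as F

# `model` bundles the embedding/aggregator/encoder/attention/decoder sketched above,
# and `loader` yields (phoneme_ids, durations, target_mel) built by the preprocessor.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for phoneme_ids, durations, target_mel in loader:
    pred_mel = model(phoneme_ids, durations)       # predicted spectrogram data (103)
    loss = F.l1_loss(pred_mel, target_mel)         # prediction error vs. correct answer spectrogram (104)
    optimizer.zero_grad()
    loss.backward()                                # backpropagation reaches all modules at once
    optimizer.step()
```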
  • In some embodiments, the prediction error 105 may further include an attention error 106 value.
  • The attention error 106 is calculated based on the difference between the attention information provided by the attention module 77 (i.e., prediction information) and the beat information (i.e., correct answer beat information) or prosody information (i.e., correct answer prosody information) extracted by the voice analysis unit 33.
  • Ideally, the attention information should have a value similar to the pitch information. Therefore, when the weights of the speech synthesis model 63 are updated so that the difference between the attention information and the pitch information (that is, the attention error 106) is minimized, a more accurate pitch adjustment function can be provided through the speech synthesis model 63.
  • a parameter for controlling the degree to which the aforementioned attention error 106 is reflected in the prediction error 105 may be further used.
  • the parameter is a kind of hyper-parameter and may be set to adjust the degree to which the attention error 106 is reflected in the prediction error 105 before the model 63 is trained.
  • For example, the value of the set parameter may be applied to the attention error 106 (e.g., by multiplication or addition) to change its magnitude, and the attention error 107 with the changed magnitude may then be reflected in (e.g., added to) the prediction error 105.
  • learning about the speech synthesis model 63 may be performed while changing the value of the parameter.
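  • In code form, this weighting amounts to a single scaled addition, as in the sketch below; `spec_loss` and `attention_error` are assumed to be the loss tensors from the forward pass, and the hyper-parameter value is illustrative.

```python
# `spec_loss` and `attention_error` are assumed to be loss tensors from the forward pass.
lam = 0.1                                          # illustrative hyper-parameter value
total_loss = spec_loss + lam * attention_error     # scaled attention error added to the prediction error
```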
  • For example, a first speech synthesis model and a second speech synthesis model may be trained with different parameter values, and the speech synthesis model to be used in the actual synthesis process may be determined according to the performance evaluation results of the first speech synthesis model and the second speech synthesis model.
  • the learning unit 61 and the speech synthesis model 63 have been described with reference to FIGS. 7 to 10.
  • the other components 65 and 67 of the speech synthesis unit 27 are those used in actual speech synthesis, and will be described with reference to FIG. 11.
  • FIG. 11 is an exemplary block diagram illustrating the speech synthesis apparatus 10 according to some embodiments of the present disclosure, and it also illustrates the data flow in the speech synthesis process. Hereinafter, the description refers to FIGS. 6 and 11 together.
  • Once the speech synthesis model 63 has been built as described above, a speech synthesis function for text for synthesis can be provided.
  • the input unit 21 receives information 111 for synthesis.
  • The synthesis information 111 includes text for synthesis, prosody information for synthesis, and beat information for synthesis. Among them, the text for synthesis is input to the text preprocessor 31, and preprocessing is performed by the text preprocessor 31.
  • The prosody information for synthesis and the beat information for synthesis are input to the speech synthesis unit 27, more precisely, to the synthesis unit 65.
  • The synthesis unit 65 inputs the preprocessed text for synthesis, the prosody information for synthesis, and the beat information for synthesis into the speech synthesis model 63, and obtains an output sequence associated with the target speech as a result.
  • the output sequence may be composed of, for example, spectrogram data in units of frames.
  • the output sequence is input to the vocoder unit 67 to be converted into a target voice in an audio format.
  • the vocoder unit 67 converts the output sequence into audio data in an audio format (ie, target audio).
  • the vocoder unit 67 may be implemented in any way.
  • For example, the vocoder unit 67 may be implemented with one or more vocoder modules well known in the art (e.g., WaveNet, Griffin-Lim). In order not to obscure the subject matter of the present disclosure, a detailed description of the vocoder unit 67 is omitted.
  • the target voice output by the vocoder unit 67 is a voice in which prosody information for synthesis and time signature information for synthesis are reflected.
  • Accordingly, the prosody or beat of the finally synthesized target speech may be adjusted by adjusting the input prosody information or beat information.
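  • As a hedged example of the vocoding step, the Griffin-Lim variant mentioned above could be realized with librosa as sketched below; a neural vocoder such as WaveNet would replace this, and the mel/STFT parameters must match those used in preprocessing.

```python
import librosa
import soundfile as sf

# mel_db is assumed to be the (80, frames) log-mel output sequence from the decoder;
# the mel/STFT parameters must match those used in preprocessing.
mel_power = librosa.db_to_power(mel_db)
stft_mag = librosa.feature.inverse.mel_to_stft(mel_power, sr=22050, n_fft=1024)
waveform = librosa.griffinlim(stft_mag, hop_length=256)
sf.write("target_voice.wav", waveform, 22050)
```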
  • Note that not all of the components shown in FIGS. 2, 3, 6, and 11 are necessarily essential for implementing the speech synthesis apparatus 10. That is, the speech synthesis apparatus 10 according to some other embodiments of the present disclosure may be implemented with only some of the components shown in FIGS. 2, 3, 6, and 11.
  • Each of the components shown in FIGS. 2, 3, 6, and 11 may refer to software, or to hardware such as a Field Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC).
  • However, the components are not limited to software or hardware, and may be configured to reside in an addressable storage medium or to run on one or more processors.
  • the functions provided in the above components may be implemented by more subdivided components, or may be implemented as one component that performs a specific function by combining a plurality of components.
  • According to the speech synthesis apparatus 10 described so far, since a neural-network-based speech synthesis model is constructed by learning the beat information and prosody information reflected in the actual speech data of a speaker, a natural synthesized sound can be generated while the beat and prosody are controlled. For example, the prosody contained in the synthesized target voice may be adjusted by adjusting the intensity or pitch of the sound in the prosody information.
  • Hereinafter, the neural network structures of modified speech synthesis models according to various embodiments of the present disclosure will be described with reference to FIGS. 12 and 13.
  • a description of a portion overlapping with the above-described speech synthesis model 63 will be omitted.
  • FIG. 12 illustrates a neural network structure of a modified speech synthesis model 120 according to the first embodiment of the present disclosure.
  • The components 121 to 125 of the speech synthesis model 120 according to the first embodiment are similar to those of the speech synthesis model 63 described above. However, it differs from the speech synthesis model 63 in that the aggregator module 122 generates the input sequence based on the beat information 126 and the character embedding vector, while the prosody information 127 is provided to the decoder neural network 125.
  • When the prosody information 127 is provided to the decoder neural network 125, the prosody information 127 may be transformed according to the decoding unit of the decoder neural network 125.
  • For example, when the prosody information 127 is pitch information set in phoneme units, the pitch information may be divided not in phoneme units but in frame units of the spectrogram data and provided to the decoder neural network 125.
  • The decoder neural network 125 decodes the encoded vector and the attention information and outputs the output sequence; at this time, the prosody information 127 is also used to perform the decoding.
  • the operation of the other modules 121, 123, 124 is similar to that described above.
  • According to the first embodiment, the prosody of the synthesized speech can be adjusted based on the prosody information that is input at decoding time. That is, since the prosody adjustment is performed based on prosody information input immediately before decoding, the prosody can be adjusted more precisely.
  • The effect mentioned above applies equally to the case of beat information.
  • FIG. 13 illustrates a neural network structure of a modified speech synthesis model 130 according to a second embodiment of the present disclosure.
  • The components 131 to 135 of the speech synthesis model 130 according to the second embodiment are similar to those of the speech synthesis models 63 and 120 described above. However, it differs from the speech synthesis model 63 in that the prosody information 137 is input to the aggregator module 133 and is also input to the decoder neural network 135.
  • The decoder neural network 135 decodes the encoded vector and the attention information and outputs the output sequence; at this time, the prosody information 137 is also used to perform the decoding.
  • the operation of the other modules 131, 133, and 134 is similar to that described above.
  • modified speech synthesis models 120 and 130 according to various embodiments of the present disclosure have been described with reference to FIGS. 12 and 13.
  • a voice synthesis method according to some embodiments of the present disclosure will be described in detail with reference to FIGS. 14 to 17.
  • Each step of the speech synthesis method may be performed by a computing device.
  • Each step of the speech synthesis method may be implemented with one or more instructions executed by a processor of a computing device. All the steps included in the speech synthesis method may be performed by one physical computing device, but first steps of the method may be performed by a first computing device and second steps by a second computing device. In the following, the description proceeds on the assumption that each step of the speech synthesis method is performed by the speech synthesis device 10; however, for convenience of explanation, the subject performing each step may be omitted.
  • FIG. 14 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and of course, some steps may be added or deleted as necessary.
  • the speech synthesis method includes a learning process of constructing a speech synthesis model and a synthesis process of synthesizing speech using the speech synthesis model.
  • the learning process starts in step S100 of acquiring a learning dataset.
  • each of the training data included in the training dataset is composed of a training text and a correct answer voice data.
  • In step S200, a neural-network-based speech synthesis model is constructed using the training dataset. Since the structure of the speech synthesis model has already been described above, further description is omitted; details of step S200 will be described later with reference to FIG. 15.
  • the synthesis process starts in step S300 of obtaining data for synthesis.
  • The data for synthesis is composed of text for synthesis, prosody information for synthesis, and beat information for synthesis.
  • the prosody information for synthesis may be excluded from the synthesis data.
  • the synthesized beat information may be excluded from the synthesized data.
  • In step S400, a target speech for the text for synthesis is synthesized and output using the speech synthesis model. More specifically, an output sequence composed of spectrogram data may be output from the speech synthesis model, and the target voice may be obtained by vocoding the output sequence.
  • The target speech is a speech in which the prosody information for synthesis and the beat information for synthesis are reflected.
  • For an example of the synthesized target voice, refer to FIG. 17.
  • The prosody of the synthesized target voice may be adjusted by adjusting the pitch or intensity of the sound in the prosody information, and likewise the beat of the synthesized target voice may be adjusted by adjusting the sound lengths in the beat information.
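  • Concretely, such adjustment only requires editing the synthesis-time inputs before they are fed to the model, as in the sketch below; the `synthesis_units` structure and the scaling factors are assumptions for illustration.

```python
# `synthesis_units` is an assumed list of per-syllable annotations (see the earlier sketch).
for unit in synthesis_units:
    unit.duration_sec *= 1.5              # slow the beat: lengthen every sound by 50%
    unit.pitch_hz *= 2 ** (2 / 12)        # raise the pitch by two semitones
```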
  • The above-described steps S100 and S200 may be performed by the input unit 21, the preprocessor 23, and the learning unit 61, and steps S300 and S400 may be performed by the input unit 21, the preprocessor 23, the synthesis unit 65, and the vocoder unit 67.
  • FIG. 15 is an exemplary flowchart illustrating the learning process of a speech synthesis model according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and some steps may of course be added or deleted as necessary.
  • the learning process begins in step S210 of pre-processing the text for learning and the correct answer voice data.
  • the correct answer voice data may be converted into correct answer spectrogram data through preprocessing. Since the contents of the pre-processing are the same as described above, further description will be omitted.
  • In step S220, the correct answer voice data is analyzed to extract beat information and prosody information for the text for learning.
  • In step S230, the preprocessed text is converted into a character embedding vector through embedding.
  • the character embedding vector may be generated for each phoneme or for each syllable, which may vary depending on the embodiment.
  • The embedding may be performed by the embedding module (e.g., 71 in FIG. 7) constituting the speech synthesis model, or by a separate embedding module.
  • In step S240, an input vector constituting the input sequence of the encoder neural network (e.g., 75 in FIG. 7) is generated by aggregating the character embedding vector, the prosody information, and the beat information in the aggregator module (e.g., 73 in FIG. 7) of the speech synthesis model.
  • For example, the input vector may be generated by concatenating the prosody information and the beat information with the character embedding vector, and the sequence of such input vectors forms the input sequence of the encoder neural network.
  • the technical scope of the present disclosure is not limited thereto.
  • In step S250, encoding is performed on the input sequence in the encoder neural network (e.g., 75 in FIG. 7). Through this, the input sequence is transformed into an encoded vector, and the encoded vector is output from the encoder neural network.
  • In step S260, the encoded vector is decoded in the decoder neural network (e.g., 79 in FIG. 7) of the speech synthesis model.
  • the encoded vector is converted into an output sequence composed of predictive spectrogram data, and the output sequence is output from the decoder neural network.
  • the decoder neural network may further receive attention information from an attention module (e.g. 77 of FIG. 7) positioned between the encoder neural network and the decoder neural network.
  • the decoder neural network may receive prediction spectrogram data of a previous frame and may further use this to output prediction spectrogram data of a current frame.
  • In step S270, the weights of the speech synthesis model are updated by back-propagating the error value between the correct answer spectrogram data and the predicted spectrogram data.
  • weights of the encoder neural network and the decoder neural network may be updated at once through the error backpropagation. If an embedding module is included in the speech synthesis model, the weight of the embedding module may be updated as well.
  • the error value may further include an attention error. Since the attention error is the same as described above, further description will be omitted.
  • In the above description, step S210 may be performed by the text preprocessor 31 and the voice preprocessor 35, step S220 by the speech analysis unit 33, and the remaining steps S230 to S270 by the learning unit 61 and the speech synthesis model 63.
  • a method of constructing a speech synthesis model according to some embodiments of the present disclosure has been described with reference to FIG. 15. According to the above-described method, a speech synthesis model capable of adjusting prosody and beat can be constructed.
  • Next, the speech synthesis process that can be performed in step S400 will be described in detail with reference to FIG. 16.
  • FIG. 16 is an exemplary flowchart illustrating a speech synthesis process based on a speech synthesis model according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and some steps may of course be added or deleted as necessary.
  • the speech synthesis method starts in step S410 of preprocessing the text for synthesis.
  • In step S420, the preprocessed text for synthesis is converted into a character embedding vector through embedding.
  • the character embedding vector may be generated for each phoneme or for each syllable, which may vary depending on the embodiment.
  • The embedding may be performed by the embedding module (e.g., 71 in FIG. 7) constituting the speech synthesis model, or by a separate embedding module.
  • In step S430, an input vector constituting the input sequence of the encoder neural network (e.g., 75 in FIG. 7) is generated by aggregating the character embedding vector, the prosody information for synthesis, and the beat information for synthesis in the aggregator module (e.g., 73 in FIG. 7) of the speech synthesis model.
  • For example, the input vector may be generated by concatenating the prosody information and the beat information with the character embedding vector, and the sequence of such input vectors forms the input sequence of the encoder neural network.
  • the technical scope of the present disclosure is not limited thereto.
  • In step S440, encoding is performed on the input sequence in the encoder neural network (e.g., 75 in FIG. 7). Through this, the input sequence is transformed into an encoded vector, and the encoded vector is output from the encoder neural network.
  • In step S450, the encoded vector is decoded in the decoder neural network (e.g., 79 in FIG. 7) of the speech synthesis model.
  • the encoded vector is converted into an output sequence composed of spectrogram data in a frame unit, and the output sequence is output from the decoder neural network.
  • the decoder neural network may further receive attention information from an attention module (eg, 77 in FIG. 7) located between the encoder neural network and the decoder neural network, and perform decoding by further using the attention information.
  • the decoder neural network may receive spectrogram data of a previous frame and may further use this to output spectrogram data of a current frame.
  • In step S460, a target voice in an audio format is synthesized by vocoding the frame-level spectrogram data included in the output sequence.
  • The target voice is a voice in which the prosody information for synthesis and the beat information for synthesis are reflected.
  • the target voice may be visually provided through a GUI (Graphical User Interface).
  • An example of such a GUI is shown in FIG. 17.
  • The table 171 shown at the top of FIG. 17 shows the text, prosody, and beat information for synthesis in syllable units, and the GUI 175 shown at the bottom of FIG. 17 displays the target voice synthesized from the synthesis information of the table 171 in the form of a voice waveform.
  • the illustrated speech waveform shows an actual result synthesized through a speech synthesis model constructed according to the above-described embodiments.
  • Referring to FIG. 17, it can be seen that the target voice is accurately synthesized according to the prosody information for synthesis and the beat information for synthesis.
  • For example, the pitch information 173 of one syllable ("i") is set to "660 Hz" and the pitch information 174 of another syllable ("i") to "663 Hz". In addition, the beat information 172 of the syllable ("Yo") is set to "0.5 sec", a relatively long sound length compared to the other syllables, and it can be confirmed that the voice waveform 176 of that syllable was synthesized with a correspondingly long duration.
  • In the above description, step S410 may be performed by the text preprocessor 31, steps S420 to S450 by the synthesis unit 65 and the speech synthesis model 63, and step S460 by the vocoder unit 67.
  • So far, a speech synthesis method according to some embodiments of the present disclosure has been described with reference to FIGS. 14 to 17.
  • an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure will be described.
  • FIG. 18 is a hardware configuration diagram illustrating an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure.
  • As shown in FIG. 18, the computing device 180 may include one or more processors 181, a bus 183, a communication interface 184, a memory 182 that loads a computer program 186 executed by the processor 181, and a storage 185 that stores the computer program 186.
  • However, only the components related to the embodiments of the present disclosure are shown in FIG. 18. Accordingly, those of ordinary skill in the art to which the present disclosure pertains will recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 18.
  • the processor 181 controls the overall operation of each component of the computing device 180.
  • The processor 181 may be a Central Processing Unit (CPU), a Micro Processor Unit (MPU), a Micro Controller Unit (MCU), a Graphics Processing Unit (GPU), or any type of processor well known in the art of the present disclosure. The processor 181 may also perform operations on at least one application or program for executing the methods according to the embodiments of the present disclosure.
  • the computing device 180 may include one or more processors.
  • the memory 182 stores various types of data, commands and/or information.
  • the memory 182 may load one or more programs 186 from the storage 185 in order to execute the speech synthesis method according to embodiments of the present disclosure.
  • a module as shown in FIG. 2 may be implemented on the memory 182.
  • the memory 182 may be implemented as a volatile memory such as RAM, but the technical scope of the present disclosure is not limited thereto.
  • the bus 183 provides communication functions between components of the computing device 180.
  • the bus 183 may be implemented as various types of buses such as an address bus, a data bus, and a control bus.
  • the communication interface 184 supports wired/wireless Internet communication of the computing device 180.
  • the communication interface 184 may support various communication methods other than Internet communication.
  • the communication interface 184 may be configured to include a communication module well known in the technical field of the present disclosure.
  • the communication interface 184 may be omitted.
  • the storage 185 may non-temporarily store the one or more programs 186 and various data.
  • the various types of data may include data managed by the storage unit 25.
  • The storage 185 may be configured to include a nonvolatile memory such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, a hard disk, a removable disk, or any computer-readable recording medium well known in the art to which the present disclosure belongs.
  • The computer program 186 may include one or more instructions that, when loaded into the memory 182, cause the processor 181 to perform methods/operations according to various embodiments of the present disclosure. That is, the processor 181 may perform the methods/operations according to various embodiments of the present disclosure by executing the one or more instructions.
  • For example, the computer program 186 may include instructions for performing an operation of acquiring a training data set, an operation of constructing a speech synthesis model using the training data set, an operation of acquiring data for synthesis, and an operation of synthesizing a target voice for the data for synthesis using the speech synthesis model.
  • the speech synthesis device 10 according to some embodiments of the present disclosure may be implemented through the computing device 180.
  • An exemplary computing device 180 capable of implementing the speech synthesis device 10 according to an embodiment of the present disclosure has been described so far with reference to FIGS. 1 to 18.
  • the technical idea of the present disclosure described with reference to FIGS. 1 to 18 so far may be implemented as computer-readable code on a computer-readable medium.
  • The computer-readable recording medium may be, for example, a removable recording medium (CD, DVD, Blu-ray disc, USB storage device, or removable hard disk) or a fixed recording medium (ROM, RAM, or a computer-embedded hard disk).
  • the computer program recorded in the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to a speech synthesis apparatus having a function for adjusting time signature, pitch or intensity. The speech synthesis apparatus according to several embodiments of the present invention comprises: a preprocessing unit for preprocessing an input text; and a speech synthesis unit for inputting the preprocessed text and time-signature information into a neural-network-based speech synthesis model so as to synthesize the input text into a target speech in which the time-signature information is reflected, wherein the speech synthesis model comprises: an embedding module for converting the preprocessed text into a character embedding vector; an aggregator module for aggregating the time-signature information and the character embedding vector to generate an input vector constituting an input sequence; an encoder neural network for encoding the input sequence to output an encoded vector; and a decoder neural network for decoding the encoded vector to output an output sequence associated with the target speech.
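As a rough illustration of the architecture summarised in the abstract (embedding module → aggregator module → encoder neural network → decoder neural network), the following PyTorch sketch shows one way the pieces could be wired together. The layer choices (GRUs, a 256-dimensional embedding, an 80-bin mel-spectrogram output) and the simplification that the output sequence has the same length as the input sequence are assumptions made for brevity, not features claimed by the disclosure.

```python
# A minimal PyTorch sketch of the pipeline in the abstract; sizes and layer types are
# illustrative assumptions, not the claimed model.
import torch
import torch.nn as nn


class SpeechSynthesisModel(nn.Module):
    def __init__(self, vocab_size=80, emb_dim=256, hidden=256, n_mels=80):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)       # embedding module
        self.aggregator = nn.Linear(emb_dim + 1, emb_dim)        # fuses tempo info with char embedding
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True) # encoder neural network
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)  # decoder neural network
        self.to_mel = nn.Linear(hidden, n_mels)                  # projection to the output sequence

    def forward(self, char_ids, tempo):
        # char_ids: (batch, T) character indices of the preprocessed text
        # tempo:    (batch, T) per-character time-signature (tempo) values
        emb = self.embedding(char_ids)                           # (batch, T, emb_dim)
        x = torch.cat([emb, tempo.unsqueeze(-1)], dim=-1)        # aggregate tempo with embeddings
        x = torch.tanh(self.aggregator(x))                       # input vectors of the input sequence
        enc, _ = self.encoder(x)                                 # encoded vectors
        dec, _ = self.decoder(enc)                               # decoded output sequence
        return self.to_mel(dec)                                  # (batch, T, n_mels)


model = SpeechSynthesisModel()
chars = torch.randint(0, 80, (1, 12))   # a 12-character dummy input
tempo = torch.ones(1, 12)               # uniform tempo information
print(model(chars, tempo).shape)        # torch.Size([1, 12, 80])
```

In a practical attention-based realisation, the decoder would typically generate spectrogram frames autoregressively, and a vocoder would then convert the spectrogram into a waveform.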
PCT/KR2020/003753 2019-03-19 2020-03-19 Speech synthesis apparatus and method therefor WO2020190050A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR1020190030903A KR102057926B1 (ko) 2019-03-19 2019-03-19 Speech synthesis apparatus and method thereof
KR10-2019-0030903 2019-03-19

Publications (1)

Publication Number Publication Date
WO2020190050A1 true WO2020190050A1 (fr) 2020-09-24

Family

ID=69062788

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/003753 WO2020190050A1 (fr) 2019-03-19 2020-03-19 Speech synthesis apparatus and method therefor

Country Status (2)

Country Link
KR (1) KR102057926B1 (fr)
WO (1) WO2020190050A1 (fr)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006A (zh) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium and device
CN112887789A (zh) * 2021-01-22 2021-06-01 北京百度网讯科技有限公司 Construction of a video generation model, and video generation method, apparatus, device and medium
CN113205793A (zh) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and apparatus, storage medium and electronic device
WO2022156654A1 (fr) * 2021-01-22 2022-07-28 华为技术有限公司 Text data processing method and apparatus

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 Speech synthesis apparatus and method thereof
CN111133506A (zh) * 2019-12-23 2020-05-08 深圳市优必选科技股份有限公司 Speech synthesis model training method and apparatus, computer device and storage medium
KR102173382B1 (ko) * 2020-02-25 2020-11-03 휴멜로 주식회사 Text generation apparatus and method
KR102277205B1 2020-03-18 2021-07-15 휴멜로 주식회사 Audio conversion apparatus and method
KR20210145490A (ko) 2020-05-25 2021-12-02 삼성전자주식회사 Method and apparatus for improving the performance of an attention-based sequence-to-sequence model
KR102168529B1 2020-05-29 2020-10-22 주식회사 수퍼톤 Method and apparatus for synthesizing singing voice using an artificial neural network
KR102414521B1 2020-08-13 2022-06-30 국방과학연구소 Speech synthesis system applying an attention mechanism, and method thereof
KR102498667B1 (ko) * 2020-08-27 2023-02-10 네오사피엔스 주식회사 Method and system for applying synthesized speech to a speaker image
KR102392904B1 (ko) * 2020-09-25 2022-05-02 주식회사 딥브레인에이아이 Text-based speech synthesis method and apparatus
CN112542153A (zh) * 2020-12-02 2021-03-23 北京沃东天骏信息技术有限公司 Duration prediction model training method and apparatus, and speech synthesis method and apparatus
CN113035169B (zh) * 2021-03-12 2021-12-07 北京帝派智能科技有限公司 Speech synthesis method and system capable of online training of a personalized timbre library
CN113421547B (zh) * 2021-06-03 2023-03-17 华为技术有限公司 Speech processing method and related device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006084967A (ja) * 2004-09-17 2006-03-30 Advanced Telecommunication Research Institute International Method for creating a prediction model, and computer program
KR20080045413A (ko) * 2006-11-20 2008-05-23 한국전자통신연구원 Method for predicting phrase breaks reflecting static and dynamic characteristics, and speech synthesis method and system based thereon
KR20150087023A (ko) * 2014-01-21 2015-07-29 엘지전자 주식회사 Emotional speech synthesis apparatus, operating method thereof, and mobile terminal including the same
JP2016061968A (ja) * 2014-09-18 2016-04-25 株式会社東芝 Speech processing device, speech processing method and program
KR20190016889A (ko) * 2017-08-09 2019-02-19 한국과학기술원 Text-to-speech conversion method and system
KR102057926B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 Speech synthesis apparatus and method thereof

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112786006A (zh) * 2021-01-13 2021-05-11 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium and device
WO2022151931A1 (fr) * 2021-01-13 2022-07-21 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, synthesis model training method and apparatus, medium and device
CN112786006B (zh) * 2021-01-13 2024-05-17 北京有竹居网络技术有限公司 Speech synthesis method, synthesis model training method, apparatus, medium and device
CN112887789A (zh) * 2021-01-22 2021-06-01 北京百度网讯科技有限公司 Construction of a video generation model, and video generation method, apparatus, device and medium
WO2022156654A1 (fr) * 2021-01-22 2022-07-28 华为技术有限公司 Text data processing method and apparatus
CN112887789B (zh) * 2021-01-22 2023-02-21 北京百度网讯科技有限公司 Construction of a video generation model, and video generation method, apparatus, device and medium
CN113205793A (zh) * 2021-04-30 2021-08-03 北京有竹居网络技术有限公司 Audio generation method and apparatus, storage medium and electronic device
CN113205793B (zh) * 2021-04-30 2022-05-31 北京有竹居网络技术有限公司 Audio generation method and apparatus, storage medium and electronic device

Also Published As

Publication number Publication date
KR102057926B1 (ko) 2019-12-20

Similar Documents

Publication Publication Date Title
WO2020190050A1 (fr) Speech synthesis apparatus and method therefor
WO2020190054A1 (fr) Speech synthesis apparatus and method therefor
WO2020027619A1 (fr) Method, device and computer-readable storage medium for speech synthesis using machine learning on the basis of a sequential prosody feature
WO2019139430A1 (fr) Text-to-speech synthesis method and apparatus using machine learning, and computer-readable storage medium
JP7445267B2 (ja) Speech translation method and system using a multilingual text-to-speech synthesis model
WO2019139431A1 (fr) Speech translation method and system using a multilingual text-to-speech synthesis model
WO2020145439A1 (fr) Emotion information-based speech synthesis method and device
WO2019139428A1 (fr) Multilingual text-to-speech synthesis method
US7502739B2 (en) Intonation generation method, speech synthesis apparatus using the method and voice server
EP3818518A1 (fr) Electronic apparatus and control method thereof
JP2001282279A (ja) Speech information processing method and apparatus, and storage medium
WO2020209647A1 (fr) Method and system for generating text-to-speech synthesis through a user interface
CN112102811B (zh) Synthesized speech optimization method and apparatus, and electronic device
KR20200111609A (ko) Speech synthesis apparatus and method thereof
WO2022045651A1 (fr) Method and system for applying synthetic speech to a speaker image
WO2022260432A1 (fr) Method and system for generating composite speech using a style tag expressed in natural language
JPS62231998A (ja) Speech synthesis method and apparatus
WO2021040490A1 (fr) Speech synthesis method and apparatus
KR20200111608A (ko) Speech synthesis apparatus and method thereof
CN113178188A (zh) Speech synthesis method, apparatus, device and storage medium
US6178402B1 (en) Method, apparatus and system for generating acoustic parameters in a text-to-speech system using a neural network
JP2001265375A (ja) Rule-based speech synthesis device
WO2022169208A1 (fr) Speech visualization system for English learning, and method therefor
JP2583074B2 (ja) Speech synthesis method
KR100806287B1 (ko) Method for predicting sentence-final intonation, and speech synthesis method and system based thereon

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20774253

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20774253

Country of ref document: EP

Kind code of ref document: A1