WO2020190054A1 - Speech synthesis apparatus and method therefor - Google Patents

Speech synthesis apparatus and method therefor

Info

Publication number
WO2020190054A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech synthesis
neural network
emotion
information
vector
Prior art date
Application number
PCT/KR2020/003768
Other languages
English (en)
Korean (ko)
Inventor
이자룡
박중배
Original Assignee
휴멜로 주식회사
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 휴멜로 주식회사
Publication of WO2020190054A1

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present disclosure relates to a speech synthesis apparatus and method thereof.
  • the present invention relates to an apparatus for synthesizing an emotional speech reflecting emotion information using a neural network-based speech synthesis model, a speech synthesis method performed by the apparatus, and a method of constructing the speech synthesis model.
  • Speech synthesis technology is a technology that synthesizes a sound similar to a human speaking sound from an input text, and is commonly known as a text-to-speech (TTS) technology.
  • as personal portable devices such as smartphones, e-book readers, and vehicle navigation systems have been actively developed and distributed, the demand for speech synthesis technology for voice output is rapidly increasing.
  • an audio post-processing method was mainly used to synthesize emotional voices.
  • the audio post-processing method first synthesizes a voice for the input text and then modifies the audio signal of the synthesized voice according to the desired emotion. Because this method artificially modifies the audio signal, it has the problem that the naturalness of the voice is lost.
  • a technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of synthesizing emotional voices containing various emotions for a given text and a method performed by the apparatus.
  • Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of constructing a neural network-based speech synthesis model capable of synthesizing an emotional voice for a given text, and a method performed in the apparatus.
  • Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of synthesizing emotional voices containing various emotions for each speaker with respect to a given text, and a method performed by the apparatus.
  • Another technical problem to be solved through some embodiments of the present disclosure is to provide an apparatus capable of constructing a neural network-based speech synthesis model capable of synthesizing emotional voices containing various emotions for each speaker for a given text, and a method performed in the apparatus.
  • a speech synthesis apparatus according to some embodiments of the present disclosure includes a preprocessor that performs preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize an emotional voice in which the emotion information is reflected with respect to the input text.
  • the speech synthesis model may include a character embedding module that converts the preprocessed text into a character embedding vector, an emotion embedding module that converts the emotion information into an emotion embedding vector, an encoder neural network that receives an input sequence composed of the character embedding vector and the emotion embedding vector and outputs an encoded vector, and a decoder neural network that receives the encoded vector and outputs an output sequence associated with the emotional voice.
  • the speech synthesis model may further include an attention module that is positioned between the encoder neural network and the decoder neural network and determines which portion of the encoded vector the decoder neural network should focus on.
  • the output sequence may be composed of data in the form of a spectrogram, and the speech synthesis unit may further include a vocoder unit for converting the output sequence into the emotional voice.
  • the decoder neural network may further receive the emotion embedding vector and output the output sequence.
  • the output sequence may be composed of spectrogram-type data, and the speech synthesis unit may train the speech synthesis model by inputting training text preprocessed by the preprocessor into the speech synthesis model, comparing the spectrogram data obtained as a result with correct answer spectrogram data to calculate an error value, and back-propagating the calculated error value.
  • the speech synthesis model may further include a speaker embedding module for converting speaker information into a speaker embedding vector, and the speech synthesis unit may input the speaker information into the speech synthesis model so that a voice of the specific speaker indicated by the speaker information, reflecting the emotion information, is output as the emotional voice.
  • a speech synthesis apparatus according to some other embodiments of the present disclosure includes a preprocessor for performing preprocessing on an input text, and a speech synthesis unit that inputs the preprocessed text and emotion information into a neural network-based speech synthesis model to synthesize an emotional voice in which the emotion information is reflected with respect to the input text, wherein the speech synthesis model may include a character embedding module for converting the preprocessed text into a character embedding vector, an emotion embedding module that converts the emotion information into an emotion embedding vector, an encoder neural network that receives an input sequence composed of the character embedding vector and outputs an encoded vector, and a decoder neural network that receives the encoded vector and the emotion embedding vector and outputs an output sequence associated with the emotional voice.
  • a method of constructing a speech synthesis model according to some embodiments of the present disclosure, for solving the above technical problem, is a method performed in a computing device for constructing a speech synthesis model that includes an encoder neural network and a decoder neural network in order to synthesize emotional speech.
  • FIG. 1 is a diagram for explaining input and output of a speech synthesis apparatus according to some embodiments of the present disclosure.
  • FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus according to some embodiments of the present disclosure.
  • FIG. 3 is an exemplary diagram for describing an operation of a preprocessor according to some embodiments of the present disclosure.
  • FIG. 4 is an exemplary block diagram illustrating a speech synthesizer according to some embodiments of the present disclosure.
  • FIGS. 5 and 6 are diagrams for explaining a neural network structure of a speech synthesis model according to some embodiments of the present disclosure.
  • FIGS. 7 and 8 are exemplary diagrams for explaining emotional information that may be referred to in various embodiments of the present disclosure.
  • FIG. 9 is an exemplary diagram illustrating an LSTM recurrent neural network that can be used in a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 10 is an exemplary diagram for explaining a learning operation for a speech synthesis model according to some embodiments of the present disclosure.
  • FIGS. 11 to 15 are diagrams for explaining a neural network structure of a modified speech synthesis model according to various embodiments of the present disclosure.
  • FIG. 16 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure.
  • FIG. 17 is an exemplary flowchart illustrating a method of constructing a speech synthesis model according to some embodiments of the present disclosure.
  • FIG. 18 is a diagram illustrating an exemplary computing device capable of implementing a speech synthesis device according to some embodiments of the present disclosure.
  • in describing the components of the present disclosure, terms such as first, second, A, B, (a) and (b) may be used. These terms are only used to distinguish one component from other components, and the nature, order, or sequence of the components is not limited by these terms.
  • when a component is described as being "connected", "coupled" or "linked" to another component, that component may be directly connected or linked to the other component, but it should be understood that yet another component may be "connected", "coupled" or "linked" between the two components.
  • emotional speech (or emotional voice) means speech that is synthesized so as to literally contain human emotion.
  • an instruction refers to a series of computer-readable instructions grouped on a function basis, which is a component of a computer program and executed by a processor.
  • FIG. 1 is an exemplary diagram illustrating inputs and outputs of a speech synthesis apparatus 10 according to some embodiments of the present disclosure.
  • the speech synthesis device 10 is a computing device that receives text 1 and emotion information 3 and synthesizes and outputs an emotional voice 7 corresponding thereto.
  • the emotional voice 7 at this time refers to the voice reflecting the emotional information 3.
  • the computing device may be a notebook computer, a desktop computer, a laptop computer, etc., but is not limited thereto and may include all types of devices equipped with a computing function.
  • for an example of the computing device, refer further to the description of FIG. 18.
  • the speech synthesis device 10 may be implemented as a single computing device, but a first function of the speech synthesis device 10 may be implemented in a first computing device and a second function may be implemented in a second computing device.
  • the speech synthesis apparatus 10 may further receive speaker information 5 and synthesize and output the emotional voice 7 of a specific speaker indicated by the speaker information 5.
  • the emotional voice 7 at this time means a voice reflecting the emotion information 3 of the specific speaker.
  • the speech synthesis apparatus 10 may construct a neural network-based speech synthesis model in order to synthesize speech reflecting various and continuous emotions, and may synthesize the emotional voice 7 through the speech synthesis model.
  • FIG. 2 is an exemplary block diagram illustrating a speech synthesis apparatus 10 according to some embodiments of the present disclosure.
  • the speech synthesis apparatus 10 may include an input unit 21, a preprocessor 23, a storage unit 25, and a speech synthesis unit 27.
  • only the components related to the embodiments of the present disclosure are shown in FIG. 2. Accordingly, those of ordinary skill in the art to which the present disclosure pertains will recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 2.
  • each of the constituent elements of the speech synthesis apparatus 10 shown in FIG. 2 represents functional elements that are functionally divided, and a plurality of constituent elements may be implemented in a form integrated with each other in an actual physical environment.
  • each component will be described in detail.
  • the input unit 21 receives text, emotion information, speaker information, and the like.
  • text may be provided to the preprocessor 23 for preprocessing, and the remaining information (e.g. emotion information, speaker information, etc.) may be provided to the speech synthesis unit 27.
  • the input unit 21 may receive a training data set including training text, training emotion information, training speaker information, and correct answer voice data for training the speech synthesis model 43. Each type of data is described later.
  • the preprocessor 23 performs preprocessing on the input text.
  • the preprocessing may be performed in various ways, such as dividing the input text into sentences, parsing the sentence-level text into units such as words, characters, and phonemes, and converting numbers and special characters into characters; the specific preprocessing method may vary according to the embodiment. Some examples of the preprocessing process are shown in FIG. 3.
  • for example, the preprocessor 23 may convert the numbers in the input text 31 into characters to generate text 33 in character form, and may convert the text 33 into text 35 in phoneme units. However, this is only an example for describing the operation of the preprocessor 23, and the preprocessor 23 may perform natural-language preprocessing functions in various ways.
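  • As an illustrative sketch only (not part of the original disclosure), the kind of text normalization described above could look like the following; the digit-to-word table and the character-level splitting are simplified stand-ins, and the patent's own example operates on Korean text.

```python
import re

# Hypothetical digit-to-word table; the patent's example operates on Korean text,
# so this English mapping is purely for demonstration.
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_text(text: str) -> str:
    """Spell out digits so that every token consists of plain characters."""
    return re.sub(r"\d", lambda m: " " + DIGIT_WORDS[m.group()] + " ", text)

def to_units(text: str):
    """Split normalized text into character-level units (a stand-in for the
    word/character/phoneme-level parsing described above)."""
    return list(normalize_text(text).lower())
```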
  • the preprocessor 23 may further perform not only a text preprocessing function, but also a preprocessing function of converting voice data (e.g. wav format audio) into spectrogram format data.
  • the preprocessor 23 may perform Short Time Fourier Transform (STFT) signal processing to convert voice data into STFT spectrogram data, or may further transform the STFT spectrogram data into a mel-scale spectrogram.
  • the spectrogram data may be used to train the speech synthesis model 43.
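  • As a minimal sketch of the audio-side preprocessing described above (not taken from the disclosure itself), the STFT and mel-scale conversion could be written as follows, assuming the librosa library is available; the sampling rate, frame sizes, and number of mel bands are common illustrative choices, not values specified in the patent.

```python
import librosa
import numpy as np

def wav_to_mel(path: str, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Load a waveform, apply the STFT, and map the magnitude spectrogram
    onto the mel scale (parameter values are illustrative)."""
    y, _ = librosa.load(path, sr=sr)
    magnitude = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel = mel_basis @ magnitude                # (n_mels, frames)
    return np.log(np.clip(mel, 1e-5, None))    # log compression is a common extra step
```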
  • the storage unit 25 stores and manages various data such as text, emotion information, speaker information, voice data, and spectrogram data. For effective data management, the storage unit 25 may manage the various types of data in a database. The various data may be used as training data for constructing the speech synthesis model 43, but the technical scope of the present disclosure is not limited thereto.
  • the speech synthesis unit 27 receives the preprocessed text, emotion information, and speaker information, and generates (synthesizes) the emotional voice of the specific speaker indicated by the speaker information. That is, the speech synthesis unit 27 may generate voices of different speakers for the same text, or may generate emotional voices reflecting different emotions. For example, when first speaker information is input, the speech synthesis unit 27 synthesizes and outputs the voice of the first speaker, and when first emotion information is input, the speech synthesis unit 27 may synthesize and output an emotional voice containing the first emotion.
  • the speech synthesis unit 27 may include a learning unit 41, a speech synthesis model 43, a synthesis unit 45, and a vocoder unit 47.
  • the learning unit 41 trains the speech synthesis model 43 using the training data set. That is, the learning unit 41 may construct the speech synthesis model 43 by updating the weight of the speech synthesis model 43 so that the prediction error of the speech synthesis model 43 is minimized using the training data set.
  • the training data set may be provided from the storage unit 25, but the technical scope of the present disclosure is not limited thereto.
  • the structure of the neural network of the speech synthesis model 43 will be first described, and then the operation of the learning unit 41 will be described in detail.
  • the speech synthesis model 43 is a neural network-based model that receives preprocessed text, emotion information, and/or speaker information and synthesizes an emotional speech corresponding thereto. As shown in FIG. 5, the speech synthesis model 43 according to some embodiments of the present disclosure may include an embedding module 51, an encoder neural network 53, an attention module 55, and a decoder neural network 57.
  • the embedding module 51 is a module that embeds input information and converts it into vector data. As illustrated in FIG. 6, the embedding module 51 may include a character embedding module 61, an emotion embedding module 63, and a speaker embedding module 65.
  • the character embedding module 61 is a module that embeds preprocessed text information and converts it into a character embedding vector.
  • the character embedding module 61 may generate a character embedding vector using, for example, a fastText embedding technique, an auto-encoder embedding technique, or a self-attention embedding technique, but the technical scope of the present disclosure is not limited thereto.
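  • As a non-limiting sketch of such a character embedding module (the vocabulary size and dimensions are assumptions, and the real module could equally be a fastText, auto-encoder, or self-attention embedder as noted above), a simple lookup-table embedding in PyTorch could look like this:

```python
import torch
import torch.nn as nn

class CharEmbedding(nn.Module):
    """Maps a sequence of character/phoneme ids to dense embedding vectors."""
    def __init__(self, vocab_size: int = 80, embed_dim: int = 256):  # assumed sizes
        super().__init__()
        self.table = nn.Embedding(vocab_size, embed_dim)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        # char_ids: (batch, seq_len) -> (batch, seq_len, embed_dim)
        return self.table(char_ids)
```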
  • the emotion embedding module 63 is a module that embeds emotion information and converts it into an emotion embedding vector.
  • the emotion embedding module 63 may be implemented as a specific layer of the speech synthesis model 43.
  • the emotion embedding module 63 may be implemented as a fully connected layer or a fully connected network located in front of the encoder neural network 53 and/or the decoder neural network 57.
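  • Since the disclosure states that the emotion embedding module may be a fully connected layer placed in front of the encoder and/or decoder, a minimal sketch under that assumption could look as follows (the number of emotion classes, the embedding dimension, and the tanh activation are illustrative assumptions):

```python
import torch
import torch.nn as nn

class EmotionEmbedding(nn.Module):
    """Projects an emotion vector (per-class probabilities) or a one-hot
    emotion label into a dense emotion embedding vector."""
    def __init__(self, num_emotions: int = 5, embed_dim: int = 64):  # assumed sizes
        super().__init__()
        self.proj = nn.Linear(num_emotions, embed_dim)

    def forward(self, emotion: torch.Tensor) -> torch.Tensor:
        # emotion: (batch, num_emotions) -> (batch, embed_dim)
        return torch.tanh(self.proj(emotion))
```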
  • the emotion embedding module 63 and the other modules are configured as one organic neural network, so that end-to-end learning and speech synthesis may be performed. That is, all components 63, 53 to 57 of the speech synthesis model 43 may be learned at once through error backpropagation.
  • various advantages of the end-to-end method can be secured compared to a conventional speech synthesis model implemented by integrating a plurality of independent modules.
  • for example, the problem of model performance deteriorating due to the accumulated loss of a specific module is alleviated, learning is easier, and a high-performance speech synthesis model can be built with a smaller amount of training data.
  • the specific form of the emotion information may vary according to embodiments.
  • the emotion information may be an emotion vector indicating a probability of one or more emotions.
  • for example, emotion information in which one emotion is dominant and the other emotions are mixed in very small amounts may be expressed by the emotion vector 73 on the right side.
  • a speech synthesis model capable of finer emotion control and generating a complex emotion voice can be constructed.
  • the emotion information may be label information indicating a specific emotion.
  • for example, emotion information indicating a specific emotion (e.g. happy) may be expressed as the emotion label 83 on the right side.
  • emotion information such as an emotion vector or an emotion label may be automatically generated by a machine learning model for classifying emotion classes.
  • the machine learning model is a model that receives voice data or spectrogram data and outputs an emotion class.
  • for example, the emotion vector may be generated based on the confidence score for each emotion class output by the machine learning model, and the emotion label may be generated based on the final classification result of the machine learning model.
  • when emotion information is automatically generated in this way, the time and human cost required for generating a training data set can be reduced.
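  • One illustrative way (an assumption, not a procedure spelled out in the disclosure) of turning such a classifier's outputs into the emotion vector and emotion label described above is sketched below; `classifier` stands for any pretrained emotion classification model and is hypothetical.

```python
import torch
import torch.nn.functional as F

def make_emotion_targets(classifier, spectrogram: torch.Tensor, class_names):
    """classifier: hypothetical pretrained model mapping a spectrogram to one
    logit per emotion class. Returns (emotion_vector, emotion_label)."""
    logits = classifier(spectrogram.unsqueeze(0))           # (1, num_emotions)
    emotion_vector = F.softmax(logits, dim=-1).squeeze(0)   # confidence per class
    emotion_label = class_names[int(emotion_vector.argmax())]
    return emotion_vector, emotion_label
```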
  • the speaker embedding module 65 is a module that embeds speaker information and converts it into a speaker embedding vector.
  • the speaker information may be label information (refer to FIG. 9) indicating a specific speaker, but the technical scope of the present disclosure is not limited thereto.
  • the speaker embedding module 65 may be implemented as a specific layer of the speech synthesis model 43.
  • the speaker embedding module 65 may be implemented as a fully connected layer or fully connected network located in front of the encoder neural network 53 and/or the decoder neural network 57.
  • the speaker embedding module 65 and the other modules are configured as one organic neural network, so that end-to-end learning and speech synthesis can be performed. That is, all components (e.g. 65 and 53 to 57) of the speech synthesis model 43 may be learned at once through error backpropagation.
  • at least some of the above-described embedding modules 61 to 65 may be implemented as separate modules that independently perform an embedding function. That is, at least some of the embedding modules 61 to 65 may be implemented as modules that are not affected by the training of the speech synthesis model 43, as separately trained embedding modules, or as modules that perform embedding through a mathematical algorithm without needing to be trained.
  • the output vectors (e.g. a character embedding vector, an emotion embedding vector, and a speaker embedding vector) of the embedding modules 61 to 65 are input to the encoder neural network 53.
  • a vector generated by concatenating an emotion and/or speaker embedding vector to a character embedding vector may be input to the encoder neural network 53.
  • alternatively, each of the output vectors may be independently input to the encoder neural network 53; this may be varied according to the implementation of the input layer of the encoder neural network 53.
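  • A minimal sketch of the concatenation-based input mentioned above (broadcasting the utterance-level vectors along the character axis is an implementation assumption; the disclosure only states that the vectors may be concatenated or fed independently):

```python
import torch

def build_encoder_input(char_emb: torch.Tensor,
                        emotion_emb: torch.Tensor,
                        speaker_emb: torch.Tensor) -> torch.Tensor:
    """char_emb: (batch, seq_len, c_dim); emotion_emb: (batch, e_dim);
    speaker_emb: (batch, s_dim). Repeats the utterance-level vectors along
    the character axis and concatenates them to every character embedding."""
    seq_len = char_emb.size(1)
    emo = emotion_emb.unsqueeze(1).expand(-1, seq_len, -1)
    spk = speaker_emb.unsqueeze(1).expand(-1, seq_len, -1)
    return torch.cat([char_emb, emo, spk], dim=-1)  # (batch, seq_len, c+e+s)
```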
  • the encoder neural network 53 is a neural network that receives an input sequence composed of one or more character embedding vectors, an emotion embedding vector and/or a speaker embedding vector, encodes input information, and outputs the encoded vector. As the learning progresses, the encoder neural network 53 understands the context according to the input sequence, the emotion embedding vector, and the speaker embedding vector, and outputs an encoded vector representing the understood context.
  • the encoded vector may be referred to as a context vector in the art.
  • the encoder neural network 53 and the decoder neural network 57 may be implemented as a recurrent neural network (RNN) to be suitable for receiving and outputting a sequence.
  • the encoder neural network 53 and the decoder neural network 57 may be implemented as a Long Short-Term Memory Model (LSTM) neural network 90 as shown in FIG. 9.
  • however, the present disclosure is not limited thereto, and at least some of the encoder neural network 53 and the decoder neural network 57 may be implemented through self-attention, a Transformer network, or the like. Those skilled in the art will clearly understand self-attention and Transformer networks, and detailed descriptions of these techniques are omitted.
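  • A bare-bones LSTM encoder of the kind mentioned above could be sketched as follows (a single unidirectional layer and the dimensions shown are assumptions; the disclosure only requires a recurrent, self-attention, or Transformer-style network):

```python
import torch
import torch.nn as nn

class EncoderRNN(nn.Module):
    """LSTM encoder that turns the input sequence into encoded (context) vectors."""
    def __init__(self, input_dim: int = 384, hidden_dim: int = 256):  # assumed sizes
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        # inputs: (batch, seq_len, input_dim) -> encoded: (batch, seq_len, hidden_dim)
        encoded, _ = self.lstm(inputs)
        return encoded
```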
  • the attention module 55 is a module that provides attention information indicating which part of the encoded vector the decoder neural network 57 should focus on (and which part it should not focus on) when learning/predicting the output sequence. As the learning progresses, the attention module 55 may learn a mapping relationship between the encoded vector and the output sequence so as to provide attention information indicating which portions to focus on, and which portions not to focus on, when decoding.
  • the attention information may be provided in the form of a weight vector (or weight matrix), but the technical scope of the present disclosure is not limited thereto. Those skilled in the art will be able to clearly understand the attention mechanism, and a detailed description thereof will be omitted.
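  • The weight-vector form of attention information mentioned above could, under the assumption of a simple dot-product scoring function (the disclosure does not fix the scoring function), be sketched as:

```python
import torch
import torch.nn.functional as F

def attend(decoder_state: torch.Tensor, encoded: torch.Tensor):
    """decoder_state: (batch, hidden); encoded: (batch, seq_len, hidden) with
    matching hidden sizes (an assumption). Returns the context vector and the
    attention weights over encoder steps."""
    scores = torch.bmm(encoded, decoder_state.unsqueeze(-1)).squeeze(-1)  # (batch, seq_len)
    weights = F.softmax(scores, dim=-1)                                   # the "weight vector"
    context = torch.bmm(weights.unsqueeze(1), encoded).squeeze(1)         # (batch, hidden)
    return context, weights
```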
  • the decoder neural network 57 receives the encoded vector and the attention information and outputs an output sequence corresponding to the encoded vector. More specifically, the decoder neural network 57 predicts an output sequence associated with the emotional voice of a specific speaker using the encoded vector and the attention information. In this case, the output sequence may be composed of spectrogram data in units of frames, but the technical scope of the present disclosure is not limited thereto.
  • the decoder neural network 57 may further receive the spectrogram data of a previous frame as input and sequentially output the spectrogram data of the current frame to construct the output sequence.
  • the spectrogram data is data representing a spectrogram of a voice signal, and may be STFT spectrogram data or mel-spectrogram data, but the technical scope of the present disclosure is not limited thereto.
  • the reason why the decoder neural network 57 is configured to output spectrogram data instead of a speech signal is that when learning is performed with spectrogram data, a prediction error can be calculated more accurately than that of a speech signal.
  • accordingly, a speech synthesis model with superior performance can be constructed.
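  • A single autoregressive decoding step of the kind described above could be sketched as follows (the LSTM cell, the linear output layer, and the dimensions are assumptions; production decoders typically add pre-nets, post-nets, and stop-token prediction, none of which are shown here):

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    """One step: previous spectrogram frame + attention context -> current frame."""
    def __init__(self, n_mels: int = 80, ctx_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.cell = nn.LSTMCell(n_mels + ctx_dim, hidden_dim)
        self.to_frame = nn.Linear(hidden_dim, n_mels)

    def forward(self, prev_frame, context, state=None):
        # prev_frame: (batch, n_mels); context: (batch, ctx_dim)
        h, c = self.cell(torch.cat([prev_frame, context], dim=-1), state)
        return self.to_frame(h), (h, c)  # current frame and updated recurrent state
```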
  • each learning data 100 may include text 101, emotion information 102, speaker information 103, and correct answer voice data 104.
  • the correct answer voice data 104 is voice data of a specific speaker (e.g. wav format audio) indicated by the speaker information 103, and corresponds to the text 101 and reflects the emotion information 102.
  • the correct answer voice data 104 is converted into correct answer spectrogram data 106 through the preprocessor 23, and the text 101 is subjected to appropriate preprocessing by the preprocessor 23.
  • the process by which the learning unit 41 trains the speech synthesis model 43 is as follows. First, the preprocessed text 101 is input to the character embedding module 61, and the emotion information 102 and the speaker information 103 are input to the emotion embedding module 63 and the speaker embedding module 65, respectively. Then, spectrogram data 105 predicted by the decoder neural network 57 is obtained as a result.
  • next, the learning unit 41 compares the predicted spectrogram data 105 and the correct answer spectrogram data 106 to calculate a prediction error 107, and backpropagates the prediction error 107 to update the weights of the speech synthesis model 43.
  • the weights of the encoder neural network 53, the attention module 55, and the decoder neural network 57 may be updated at once through this backpropagation.
  • if the embedding module 51 is implemented as layers of the neural network, the weights of the embedding module 51 may also be updated.
  • the learning unit 41 may build the speech synthesis model 43 by repeating such a learning process for a plurality of training data.
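  • A condensed sketch of one such training step is shown below; `model` is assumed to bundle the embedding, encoder, attention, and decoder modules and to return a predicted spectrogram, and the mean-squared error is an assumed choice of prediction error (the disclosure does not name a specific loss function):

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, char_ids, emotion, speaker, target_spectrogram):
    """One update: compare predicted and correct-answer spectrograms, backpropagate."""
    optimizer.zero_grad()
    predicted = model(char_ids, emotion, speaker)                  # (batch, frames, n_mels)
    loss = nn.functional.mse_loss(predicted, target_spectrogram)   # prediction error
    loss.backward()            # gradients flow to encoder, attention, decoder (and
    optimizer.step()           # embedding layers, if they are part of the network) at once
    return loss.item()
```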
  • the synthesis unit 45 predicts and outputs spectrogram data using the speech synthesis model 43 trained by the learning unit 41. More specifically, the synthesis unit 45 inputs text for synthesis for which no correct answer speech data exists, emotion information for synthesis, and speaker information for synthesis into the speech synthesis model 43, and as a result predicts an output sequence for the specific speaker indicated by the speaker information for synthesis. As described above, the output sequence may consist of, for example, frame-by-frame predicted spectrogram data.
  • the speaker information for synthesis may be label information indicating the specific speaker whose voice is to be synthesized, and the emotion information for synthesis may be information in the form of an emotion vector or an emotion label indicating the specific emotion to be expressed.
  • the vocoder unit 47 converts the predicted spectrogram data included in the output sequence into emotional voice data (e.g. wav format audio). As long as it can perform this conversion function, the vocoder unit 47 may be implemented in any way.
  • for example, the vocoder unit 47 may be implemented with one or more vocoder modules (e.g. WaveNet, Griffin-Lim) well known in the art. In order not to obscure the subject matter of the present disclosure, further description of the vocoder unit 47 is omitted.
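  • As an illustrative sketch only, a Griffin-Lim based conversion of a (linear-magnitude) spectrogram back to a waveform could be written as follows, assuming the librosa and soundfile libraries (a mel spectrogram would first need to be mapped back to the linear scale, and the hop length and iteration count shown are arbitrary):

```python
import librosa
import numpy as np
import soundfile as sf

def spectrogram_to_wav(magnitude: np.ndarray, out_path: str,
                       sr: int = 22050, hop_length: int = 256, n_iter: int = 60):
    """Invert a linear STFT magnitude spectrogram to audio with Griffin-Lim
    and write the result to disk (parameter values are illustrative)."""
    audio = librosa.griffinlim(magnitude, n_iter=n_iter, hop_length=hop_length)
    sf.write(out_path, audio, sr)
```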
  • note that not all of the components shown in FIG. 2 or FIG. 4 are essential for implementing the speech synthesis apparatus 10. That is, the speech synthesis apparatus 10 according to some other embodiments of the present disclosure may be implemented with only some of the components illustrated in FIG. 2 or FIG. 4.
  • Each component shown in FIG. 2 or 4 may mean software or hardware such as a Field Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC).
  • however, the components are not limited to software or hardware, and each component may be configured to reside in an addressable storage medium or configured to execute on one or more processors.
  • the functions provided in the above components may be implemented by more subdivided components, or may be implemented as one component that performs a specific function by combining a plurality of components.
  • according to the speech synthesis apparatus 10 described so far, since a neural network-based speech synthesis model is constructed by learning emotion information and emotional voice data, an emotional speech reflecting the emotion information may be synthesized through the speech synthesis model. Because the emotional voice is not synthesized by performing audio post-processing or by combining voice fragments, a natural emotional voice can be generated.
  • in addition, a speech synthesis function capable of controlling emotion may be provided by changing the emotion information input to the speech synthesis model. For example, the emotion contained in the synthesized voice may be adjusted by adjusting the type and/or strength of the emotion in the emotion information.
  • further, when a speech synthesis model is constructed by learning speaker information and emotional voice data, emotional voices of a plurality of speakers can be synthesized through one speech synthesis model.
  • hereinafter, the neural network structures of modified speech synthesis models according to various embodiments of the present disclosure will be described with reference to FIGS. 11 to 15.
  • a description of a portion overlapping with the above-described speech synthesis model 43 will be omitted.
  • FIG. 11 illustrates a neural network structure of a modified speech synthesis model 110 according to the first embodiment of the present disclosure.
  • the speech synthesis model 110 further includes an emotion embedding module 115 and a speaker embedding module 116 for the decoder neural network 114. That is, the decoder neural network 114 further receives the output vectors (i.e., the emotion embedding vector and the speaker embedding vector) of the emotion embedding module 115 and the speaker embedding module 116.
  • the emotion embedding module 115 and the speaker embedding module 116 may be implemented as a specific layer (e.g. a fully connected layer) located in front of the decoder neural network 114.
  • alternatively, the speech synthesis model 110 may be implemented in a form in which the emotion embedding module 115 and the speaker embedding module 116 do not exist, and the output vectors of the emotion embedding module and the speaker embedding module included in the embedding module 111 are input to the decoder neural network 114.
  • the embedding module 111 may include a character embedding module, an emotion embedding module, and a speaker embedding module, and the functions of the embedding module 111, the encoder neural network 112, the attention module 113, and the decoder neural network 114 are similar to those described above. However, there is a difference in that the decoder neural network 114 further receives an emotion embedding vector and a speaker embedding vector and outputs the output sequence.
  • FIG. 12 illustrates a neural network structure of a modified speech synthesis model 120 according to a second embodiment of the present disclosure.
  • in the second embodiment, the character embedding vector output from the character embedding module 121 is input to the encoder neural network 122, and the output vectors (i.e., the emotion embedding vector and the speaker embedding vector) of the emotion embedding module 125 and the speaker embedding module 126 are input to the decoder neural network 124.
  • the emotion embedding module 125 and the speaker embedding module 126 may be implemented as a specific layer located in front of the decoder neural network 124.
  • the overall structure of the speech synthesis model 120 and the operation of each of the modules 121 to 126 are similar to those of the speech synthesis model 110 according to the first embodiment described above, but there is a difference in that the encoder neural network 122 uses only the character embedding vector as an input.
  • FIG. 13 illustrates a neural network structure of a modified speech synthesis model 130 according to a third embodiment of the present disclosure.
  • since the speech synthesis model 130 according to the third embodiment is a model for a single speaker, it does not include a speaker embedding module. Accordingly, the encoder neural network 133 uses only the output vectors (i.e., the character embedding vector and the emotion embedding vector) of the character embedding module 131 and the emotion embedding module 132 as input values.
  • the overall structure of the speech synthesis model 130 and the operation of each of the modules 131 to 135 are similar to those of the above-described embodiments.
  • FIG. 14 illustrates a neural network structure of a modified speech synthesis model 140 according to a fourth embodiment of the present disclosure.
  • the speech synthesis model 140 is also a model for a single speaker, similar to the third embodiment described above. Accordingly, the speech synthesis model 140 also does not include a speaker embedding module. However, in the fourth embodiment, the emotion embedding vector is further input to the decoder neural network 145.
  • the emotion embedding module 146 may be implemented as a specific layer (e.g. a fully connected layer) located in front of the decoder neural network 145.
  • alternatively, the speech synthesis model 140 may be implemented in a form in which the emotion embedding module 146 does not exist and the emotion embedding vector of the emotion embedding module 142 is input to the decoder neural network 145.
  • in some cases, the emotion embedding module 142 may be omitted. That is, in this embodiment, similarly to the above-described second embodiment, only the character embedding vector may be input to the encoder neural network 143, and the emotion embedding vector may be input only to the decoder neural network 145.
  • the speech synthesis models 130 and 140 described with reference to FIGS. 13 and 14 may be constructed for each speaker.
  • for example, a first speech synthesis model 150-1 for synthesizing the voice of a first speaker may be constructed, a second speech synthesis model 150-2 for synthesizing the voice of a second speaker may be separately constructed, and an n-th speech synthesis model 150-n for synthesizing the voice of an n-th speaker may be separately constructed.
  • modified speech synthesis models 110 to 140 and 150-1 to 150-n according to various embodiments of the present disclosure have been described with reference to FIGS. 11 to 15.
  • various speech synthesis models (e.g. 43, 110 to 140, 150-1 to 150-n) have been described so far, and the effects achieved may vary according to the configuration of each model.
  • a model in which emotion information is input to an encoder neural network may more accurately control the speed of a synthesized speech according to the emotion information.
  • the model in which the emotion information is input to the decoder neural network can more accurately adjust the tone or pitch of the synthesized speech according to the emotion information.
  • a voice containing natural emotions can be synthesized as if a real person speaks.
  • in addition, since a model to which the speaker information is further input can synthesize speech for multiple speakers, the overall cost required for model construction (e.g., the computing cost of training) can be saved compared to the case of building a separate speech synthesis model for each speaker.
  • furthermore, since a synergy effect occurs when learning is performed on data from a large number of speakers, a relatively high-performance speech synthesis model can be built even when the amount of training data for each speaker is small, and the cost of building the training data can also be reduced.
  • Each step of the speech synthesis method may be performed by a computing device.
  • each step of the speech synthesis method may be implemented with one or more instructions executed by a processor of a computing device. All the steps included in the speech synthesis method may be performed by one physical computing device, but some steps of the method may be performed by a first computing device and other steps may be performed by a second computing device. In the following, the description continues on the assumption that each step of the speech synthesis method is performed by the speech synthesis device 10; however, for convenience of explanation, the subject performing each step may be omitted from the description.
  • FIG. 16 is an exemplary flowchart illustrating a speech synthesis method according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and some steps may of course be added or deleted as necessary.
  • the speech synthesis method includes a learning process of constructing a speech synthesis model and a synthesis process of synthesizing speech using the speech synthesis model.
  • the learning process starts in step S100 of acquiring a learning dataset.
  • each piece of training data included in the training dataset is composed of training text, training emotion information, training speaker information, and correct answer voice data.
  • the training speaker information may be excluded from the training dataset.
  • in step S200, a neural network-based speech synthesis model is constructed using the training dataset. Since the structure of the speech synthesis model has already been described above, further description is omitted here, and the details of step S200 are described later with reference to FIG. 17.
  • the synthesis process starts in step S300 of obtaining data for synthesis.
  • the synthesis data is composed of text for synthesis, emotion information for synthesis, and speaker information for synthesis.
  • speaker information may be excluded from the synthesis data.
  • the emotion information for synthesis may be an emotion vector or an emotion label.
  • in step S400, an emotional voice of a specific speaker for the text for synthesis is output using the speech synthesis model.
  • the emotional voice means a voice in which the emotional information for synthesis is reflected.
  • an output sequence composed of spectrogram data may be output from a speech synthesis model, and the emotional voice may be output by vocoding the output sequence.
  • the emotion of the synthesized emotional voice may be adjusted by adjusting the type or strength of the emotion in the emotion information. Also, by changing the speaker information, emotional voices of different speakers can be synthesized.
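  • One illustrative way (an assumption, not a procedure mandated by the disclosure) of adjusting the type and strength of emotion in an emotion vector before handing it to the model is sketched below; the class ordering in the comment is hypothetical.

```python
import numpy as np

def adjust_emotion(emotion_vector: np.ndarray, index: int, strength: float) -> np.ndarray:
    """Raise (or lower) one emotion component and renormalize so the
    components still behave like probabilities."""
    adjusted = emotion_vector.copy()
    adjusted[index] = strength
    return adjusted / adjusted.sum()

# e.g. classes = [neutral, happy, sad, angry, surprised] (hypothetical ordering)
stronger_happiness = adjust_emotion(np.array([0.9, 0.05, 0.03, 0.01, 0.01]),
                                    index=1, strength=0.6)
```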
  • steps S100 and S200 may be performed by the input unit 21, the preprocessor 23, and the learning unit 41, and steps S300 and S400 may be performed by the input unit 21, the preprocessor 23, the synthesis unit 45, and the vocoder unit 47.
  • FIG. 17 is an exemplary flowchart illustrating a method of constructing a speech synthesis model according to some embodiments of the present disclosure. However, this is only a preferred embodiment for achieving the object of the present disclosure, and some steps may of course be added or deleted as necessary.
  • the method of constructing the speech synthesis model begins in step S210 of performing preprocessing on the training text and the correct answer speech data. Since the preprocessing is the same as described above, further description is omitted.
  • in step S220, the preprocessed text is converted into a character embedding vector, the training emotion information is converted into an emotion embedding vector, and the training speaker information is converted into a speaker embedding vector.
  • the conversion process may be performed in an embedding module (e.g. 51 in FIG. 5) constituting the speech synthesis model (e.g. 43 in FIG. 5), but may be performed in a separate embedding module.
  • in step S230, the character embedding vector and the emotion embedding vector are input to an encoder neural network (e.g. 53 in FIG. 5) of the speech synthesis model, and an encoded vector is output by encoding them. More precisely, an input sequence composed of the character embedding vector and the emotion embedding vector may be input to the encoder neural network.
  • in step S240, the encoded vector is input to a decoder neural network (e.g. 57 in FIG. 5) of the speech synthesis model, and predicted spectrogram data is output.
  • the decoder neural network may further receive attention information from an attention module (e.g. 55) located between the encoder neural network and the decoder neural network.
  • also, the decoder neural network may receive the predicted spectrogram data of a previous frame and further use it to output the predicted spectrogram data of the current frame. The sequentially output predicted spectrogram data corresponds to the output sequence.
  • in step S250, the weights of the speech synthesis model are updated by backpropagating the error between the correct answer spectrogram data and the predicted spectrogram data.
  • weights of the encoder neural network and the decoder neural network may be updated at once through the error backpropagation. If an embedding module is included in the speech synthesis model, the weight of the embedding module may be updated as well.
  • by repeating this process, a speech synthesis model capable of synthesizing emotional speech may be constructed.
  • the above-described steps S210 to S250 may be performed by the learning unit 41 and the speech synthesis model 43.
  • a method of constructing a speech synthesis model according to some embodiments of the present disclosure has been described with reference to FIG. 17. According to the above-described method, a speech synthesis model capable of controlling emotion and capable of synthesizing speech for a plurality of speakers can be constructed.
  • an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure will be described.
  • FIG. 18 is a hardware configuration diagram illustrating an exemplary computing device 180 capable of implementing the speech synthesis device 10 according to some embodiments of the present disclosure.
  • the computing device 180 may include one or more processors 181, a memory 182 for loading a computer program 186 executed by the processor 181, a bus 183, a communication interface 184, and a storage 185 for storing the computer program 186.
  • only the components related to the embodiments of the present disclosure are shown in FIG. 18. Accordingly, those of ordinary skill in the art to which the present disclosure pertains will recognize that other general-purpose components may be further included in addition to the components illustrated in FIG. 18.
  • the processor 181 controls the overall operation of each component of the computing device 180.
  • the processor 181 may be configured to include a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphics Processing Unit), or any type of processor well known in the art of the present disclosure. Also, the processor 181 may perform operations on at least one application or program for executing the method according to the embodiments of the present disclosure.
  • the computing device 180 may include one or more processors.
  • the memory 182 stores various types of data, commands and/or information.
  • the memory 182 may load one or more programs 186 from the storage 185 in order to execute the speech synthesis method according to embodiments of the present disclosure.
  • a module as shown in FIG. 2 may be implemented on the memory 182.
  • the memory 182 may be implemented as a volatile memory such as RAM, but the technical scope of the present disclosure is not limited thereto.
  • the bus 183 provides communication functions between components of the computing device 180.
  • the bus 183 may be implemented as various types of buses such as an address bus, a data bus, and a control bus.
  • the communication interface 184 supports wired/wireless Internet communication of the computing device 180.
  • the communication interface 184 may support various communication methods other than Internet communication.
  • the communication interface 184 may be configured to include a communication module well known in the technical field of the present disclosure.
  • the communication interface 184 may be omitted.
  • the storage 185 may non-temporarily store the one or more programs 186 and various data.
  • the various types of data may include data managed by the storage unit 25.
  • the storage 185 may be configured to include a nonvolatile memory such as a Read Only Memory (ROM), an Erasable Programmable ROM (EPROM), an Electrically Erasable Programmable ROM (EEPROM), or a flash memory, a hard disk, a removable disk, or any computer-readable recording medium well known in the technical field to which the present disclosure belongs.
  • Computer program 186 may include one or more instructions that when loaded into memory 182 cause processor 181 to perform a method/operation in accordance with various embodiments of the present disclosure. That is, the processor 181 may perform methods/operations according to various embodiments of the present disclosure by executing the one or more instructions.
  • for example, the computer program 186 may include instructions for performing an operation of acquiring a training data set, an operation of constructing a speech synthesis model using the training data set, an operation of acquiring data for synthesis, and an operation of synthesizing an emotional voice for the data for synthesis using the speech synthesis model.
  • as another example, the computer program 186 may include instructions for performing an operation of embedding the training text and converting it into a character embedding vector, an operation of embedding the training emotion information and converting it into an emotion embedding vector, an operation of receiving the character embedding vector and the emotion embedding vector in the encoder neural network and outputting an encoded vector, an operation of receiving the encoded vector in the decoder neural network and outputting predicted spectrogram data, and an operation of updating the speech synthesis model by backpropagating the error between the correct answer spectrogram data and the predicted spectrogram data.
  • the speech synthesis apparatus 10 may be implemented through the computing device 180.
  • An exemplary computing device 180 capable of implementing the speech synthesis device 10 according to an embodiment of the present disclosure has been described so far with reference to FIGS. 1 to 18.
  • the technical idea of the present disclosure described with reference to FIGS. 1 to 18 so far may be implemented as computer-readable code on a computer-readable medium.
  • the computer-readable recording medium may be, for example, a removable recording medium (a CD, a DVD, a Blu-ray disc, a USB storage device, or a removable hard disk) or a fixed recording medium (a ROM, a RAM, or a computer-equipped hard disk).
  • the computer program recorded in the computer-readable recording medium may be transmitted to another computing device through a network such as the Internet and installed in the other computing device, thereby being used in the other computing device.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Disclosed is a speech synthesis apparatus capable of synthesizing emotional speech that reflects emotion information. A speech synthesis apparatus according to some embodiments of the present invention may comprise: a preprocessing unit for preprocessing an input text; and a speech synthesis unit which inputs the preprocessed text and the emotion information into a neural network-based speech synthesis model so as to synthesize, for the input text, emotional speech that reflects the emotion information, wherein the speech synthesis model comprises an encoder neural network and a decoder neural network, and the emotional speech reflecting the emotion information can be output by using an emotion embedding vector for the emotion information as an input to the encoder neural network.
PCT/KR2020/003768 2019-03-19 2020-03-19 Appareil de synthèse de la parole et procédé associé WO2020190054A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
KR10-2019-0030905 2019-03-19
KR1020190030905A KR102057927B1 (ko) 2019-03-19 2019-03-19 음성 합성 장치 및 그 방법

Publications (1)

Publication Number Publication Date
WO2020190054A1 (fr)

Family

ID=69062875

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2020/003768 WO2020190054A1 (fr) 2019-03-19 2020-03-19 Appareil de synthèse de la parole et procédé associé

Country Status (2)

Country Link
KR (1) KR102057927B1 (fr)
WO (1) WO2020190054A1 (fr)


Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102057927B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법
KR102277205B1 (ko) * 2020-03-18 2021-07-15 휴멜로 주식회사 오디오 변환 장치 및 방법
CN111402923B (zh) * 2020-03-27 2023-11-03 中南大学 基于wavenet的情感语音转换方法
CN111627420B (zh) * 2020-04-21 2023-12-08 升智信息科技(南京)有限公司 极低资源下的特定发音人情感语音合成方法及装置
CN111667812B (zh) * 2020-05-29 2023-07-18 北京声智科技有限公司 一种语音合成方法、装置、设备及存储介质
KR102382191B1 (ko) * 2020-07-03 2022-04-04 한국과학기술원 음성 감정 인식 및 합성의 반복 학습 방법 및 장치
CN111973178A (zh) * 2020-08-14 2020-11-24 中国科学院上海微系统与信息技术研究所 一种脑电信号识别系统及方法
KR102392904B1 (ko) * 2020-09-25 2022-05-02 주식회사 딥브레인에이아이 텍스트 기반의 음성 합성 방법 및 장치
CN112365881A (zh) 2020-11-11 2021-02-12 北京百度网讯科技有限公司 语音合成方法及对应模型的训练方法、装置、设备与介质
KR102503066B1 (ko) * 2020-11-24 2023-03-02 주식회사 자이냅스 어텐션 얼라인먼트의 스코어를 이용하여 스펙트로그램의 품질을 평가하는 방법 및 음성 합성 시스템
KR102576606B1 (ko) * 2021-03-26 2023-09-08 주식회사 엔씨소프트 음색 임베딩 모델 학습 장치 및 방법


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006084967A (ja) * 2004-09-17 2006-03-30 Advanced Telecommunication Research Institute International 予測モデルの作成方法およびコンピュータプログラム
KR20130091364A (ko) * 2011-12-26 2013-08-19 한국생산기술연구원 로봇의 학습이 가능한 감정생성장치 및 감정생성방법
KR20190016889A (ko) * 2017-08-09 2019-02-19 한국과학기술원 텍스트-음성 변환 방법 및 시스템
KR101954447B1 (ko) * 2018-03-12 2019-03-05 박기수 이동 단말 및 고정 단말 간 연동 기반 텔레마케팅 서비스 제공 방법
KR102057927B1 (ko) * 2019-03-19 2019-12-20 휴멜로 주식회사 음성 합성 장치 및 그 방법

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIN, ZHOUHAN ET AL.: "A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING", ARXIV:1703.03130V1, 9 March 2017 (2017-03-09), pages 1 - 15, XP080755413, Retrieved from the Internet <URL:https://arxiv.org/pdf/1703.03130v1.pdf> [retrieved on 20200608] *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11241574B2 (en) 2019-09-11 2022-02-08 Bose Corporation Systems and methods for providing and coordinating vagus nerve stimulation with audio therapy
WO2022105553A1 (fr) * 2020-11-20 2022-05-27 北京有竹居网络技术有限公司 Procédé et appareil de synthèse de la parole, support lisible et dispositif électronique
CN112633364B (zh) * 2020-12-21 2024-04-05 上海海事大学 一种基于Transformer-ESIM注意力机制的多模态情绪识别方法
CN112633364A (zh) * 2020-12-21 2021-04-09 上海海事大学 一种基于Transformer-ESIM注意力机制的多模态情绪识别方法
CN112992177B (zh) * 2021-02-20 2023-10-17 平安科技(深圳)有限公司 语音风格迁移模型的训练方法、装置、设备及存储介质
CN112992177A (zh) * 2021-02-20 2021-06-18 平安科技(深圳)有限公司 语音风格迁移模型的训练方法、装置、设备及存储介质
CN113257218B (zh) * 2021-05-13 2024-01-30 北京有竹居网络技术有限公司 语音合成方法、装置、电子设备和存储介质
WO2022237665A1 (fr) * 2021-05-13 2022-11-17 北京有竹居网络技术有限公司 Procédé et appareil de synthèse de la parole, dispositif électronique, et support de stockage
CN113257218A (zh) * 2021-05-13 2021-08-13 北京有竹居网络技术有限公司 语音合成方法、装置、电子设备和存储介质
CN113421546A (zh) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 基于跨被试多模态的语音合成方法及相关设备
CN113421546B (zh) * 2021-06-30 2024-03-01 平安科技(深圳)有限公司 基于跨被试多模态的语音合成方法及相关设备
CN117423327A (zh) * 2023-10-12 2024-01-19 北京家瑞科技有限公司 基于gpt神经网络的语音合成方法和装置
CN117423327B (zh) * 2023-10-12 2024-03-19 北京家瑞科技有限公司 基于gpt神经网络的语音合成方法和装置

Also Published As

Publication number Publication date
KR102057927B1 (ko) 2019-12-20

Similar Documents

Publication Publication Date Title
WO2020190054A1 (fr) Appareil de synthèse de la parole et procédé associé
WO2020190050A1 (fr) Appareil de synthèse vocale et procédé associé
JP7445267B2 (ja) 多言語テキスト音声合成モデルを利用した音声翻訳方法およびシステム
WO2019139430A1 (fr) Procédé et appareil de synthèse texte-parole utilisant un apprentissage machine, et support de stockage lisible par ordinateur
WO2019139428A1 (fr) Procédé de synthèse vocale à partir de texte multilingue
WO2019139431A1 (fr) Procédé et système de traduction de parole à l'aide d'un modèle de synthèse texte-parole multilingue
WO2020145439A1 (fr) Procédé et dispositif de synthèse vocale basée sur des informations d'émotion
EP3614376B1 (fr) Procédé de synthèse vocale, serveur et support de stockage
US20210209315A1 (en) Direct Speech-to-Speech Translation via Machine Learning
Zhao et al. Foreign Accent Conversion by Synthesizing Speech from Phonetic Posteriorgrams.
KR20200111609A (ko) 음성 합성 장치 및 그 방법
KR102306844B1 (ko) 비디오 번역 및 립싱크 방법 및 시스템
WO2022045651A1 (fr) Procédé et système pour appliquer une parole synthétique à une image de haut-parleur
US20200410979A1 (en) Method, device, and computer-readable storage medium for speech synthesis in parallel
WO2020209647A1 (fr) Procédé et système pour générer une synthèse texte-parole par l'intermédiaire d'une interface utilisateur
JP2022512233A (ja) 多言語スタイル依存音声言語処理のためのニューラル調整コード
WO2022260432A1 (fr) Procédé et système pour générer une parole composite en utilisant une étiquette de style exprimée en langage naturel
CN112102811A (zh) 一种合成语音的优化方法、装置及电子设备
WO2022203152A1 (fr) Procédé et dispositif de synthèse de parole sur la base d'ensembles de données d'apprentissage de locuteurs multiples
WO2019088635A1 (fr) Dispositif et procédé de synthèse vocale
KR20200111608A (ko) 음성 합성 장치 및 그 방법
WO2022177091A1 (fr) Dispositif électronique et son procédé de commande
Seong et al. Multilingual speech synthesis for voice cloning
WO2022034982A1 (fr) Procédé de réalisation d'opération de génération de parole synthétique sur un texte
KR102277205B1 (ko) 오디오 변환 장치 및 방법

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20773072

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20773072

Country of ref document: EP

Kind code of ref document: A1