WO2022121179A1 - 语音合成方法、装置、设备及存储介质 - Google Patents

语音合成方法、装置、设备及存储介质 (Speech synthesis method, apparatus, device, and storage medium) Download PDF

Info

Publication number
WO2022121179A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
vector information
text
speech synthesis
synthesized
Prior art date
Application number
PCT/CN2021/084215
Other languages
English (en)
French (fr)
Inventor
孙奥兰
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022121179A1 publication Critical patent/WO2022121179A1/zh

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum

Definitions

  • the present application relates to the technical field of speech synthesis, and in particular, to a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • A text-to-speech (TTS) system is an indispensable part of an intelligent dialogue system. Academia and industry have tried to achieve human-like speech synthesis with limited resources and time. In recent years, after the release of Google's Tacotron and WaveNet, neural-network methods have become the mainstream solution in the field of speech synthesis.
  • The inventors have realized that current neural-network-based TTS models already show good synthesis results, but prosody embedding remains a challenging task in speech synthesis. Prosody vectors were first extracted from the Mel spectrogram and then fed into the attention mechanism of an end-to-end model together with the encoder output, but this approach is sensitive to sentence length and its synthesis quality is not robust. Multi-head global style tokens were then proposed to represent different speaking styles; these methods control the global style of the synthesized speech, but local speaking prosody such as pauses, stress and intonation remains crucial to the naturalness of the synthesized speech. Researchers therefore proposed using temporal structure to control the speaking style of synthesized speech, or using a variational auto-encoder to learn hidden state vectors of speaking style, so that end-to-end models can more easily be used for local style control. Although this solves local prosody control to some extent, the process of manually selecting reference speech during prosody control may cause model errors to accumulate, and the accuracy of the synthesized speech is low.
  • One of the purposes of the embodiments of the present application is to provide a speech synthesis method, apparatus, computer device and computer-readable storage medium, so as to solve the technical problem that, in the existing prosody control process, manually selecting reference speech may cause model errors to accumulate, resulting in low accuracy of the synthesized speech.
  • an embodiment of the present application provides a speech synthesis method, including:
  • acquiring text to be synthesized, and inputting the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • converting the text to be synthesized into graph embedding vector information based on the application layer;
  • encoding the graph embedding vector information according to the graph encoder to generate corresponding first prosody vector information, and using the first prosody vector information as first intermediate vector information;
  • generating corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • outputting, through the output layer, the speech synthesis information corresponding to the Mel spectrum information.
  • an embodiment of the present application provides a speech synthesis apparatus, including:
  • a first acquisition module configured to acquire text to be synthesized, and input the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • a conversion model configured to convert the text to be synthesized into graph embedding vector information based on the application layer;
  • a first generation module configured to encode the graph embedding vector information according to the graph encoder, generate corresponding first prosody vector information, and use the first prosody vector information as the first intermediate vector information;
  • a second generation module configured to generate corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • a second obtaining module configured to output, through the output layer, the speech synthesis information corresponding to the Mel spectrum information.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
  • acquiring text to be synthesized, and inputting the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • converting the text to be synthesized into graph embedding vector information based on the application layer;
  • encoding the graph embedding vector information according to the graph encoder to generate corresponding first prosody vector information, and using the first prosody vector information as first intermediate vector information;
  • generating corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • outputting, through the output layer, the speech synthesis information corresponding to the Mel spectrum information.
  • an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium stores a computer program which, when executed by a processor, implements:
  • acquiring text to be synthesized, and inputting the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • converting the text to be synthesized into graph embedding vector information based on the application layer;
  • encoding the graph embedding vector information according to the graph encoder to generate corresponding first prosody vector information, and using the first prosody vector information as first intermediate vector information;
  • generating corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • outputting, through the output layer, the speech synthesis information corresponding to the Mel spectrum information.
  • Compared with the prior art, the embodiments of the present application have the following beneficial effect: text to be synthesized is acquired and input into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism; the text to be synthesized is converted into graph embedding vector information based on the application layer; the graph embedding vector information is encoded according to the graph encoder to generate corresponding first prosody vector information, and the first prosody vector information is used as first intermediate vector information; corresponding Mel spectrum information is generated from the first intermediate vector information based on the attention mechanism; and the speech synthesis information corresponding to the Mel spectrum information is output through the output layer. The graph-assisted encoder thus analyzes the specific semantic information of the text and maps it to different speech prosody rhythms, making prosody adjustment a fully automated process and improving the accuracy of speech synthesis.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 3 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 4 is a schematic flowchart of another speech synthesis method provided by an embodiment of the present application.
  • FIG. 5 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • Embodiments of the present application provide a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • the speech synthesis method may be applied to computer equipment, and the computer equipment may be electronic equipment such as notebook computers and desktop computers.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method includes steps S101 to S105.
  • Step S101 Acquire the text to be synthesized, and input the to-be-synthesized text into a speech synthesis model.
  • the to-be-synthesized text is acquired, where the to-be-synthesized text includes short sentences, short texts, and the like.
  • the obtaining method includes obtaining the text input by the user, or obtaining the text stored in a preset storage path, and the like, wherein the preset storage path includes a blockchain.
  • When the text to be synthesized is obtained, it is input into the speech synthesis model, and the speech synthesis model may be stored in a preset blockchain.
  • the speech synthesis model includes an application layer, an output layer, a graph encoder, an attention mechanism, and the like.
  • In an embodiment, before acquiring the text to be synthesized, the method further includes: acquiring speech text to be trained, wherein the speech text to be trained includes text information and speech information corresponding to the text information; training a preset speech sequence model with the text information and the speech information to obtain the graph embedding vector information corresponding to the text information and the Mel spectrum information corresponding to the speech information; and obtaining a corresponding loss function from the graph embedding vector information and the Mel spectrum information, and updating the model parameters of the preset speech sequence model with the loss function to generate a corresponding speech synthesis model.
  • Exemplarily, the speech text to be trained is obtained, where the speech text to be trained includes text information and speech information corresponding to the text information.
  • The preset speech sequence model is trained with the text information and the speech information: the corresponding graph embedding vector information is obtained from the text information through the graph encoder in the preset speech sequence model, and a Mel spectrogram is obtained from the speech information through the attention mechanism in the model.
  • The corresponding loss function is obtained from the graph embedding vector information and the Mel spectrogram, and the model parameters of the preset speech sequence model are optimized with this loss function.
  • After the model parameters have been optimized and the preset speech sequence model has converged, the converged preset speech sequence model is taken as the corresponding speech synthesis model.
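  • As a rough illustration of such a training step (not the patent's actual implementation), the following Python sketch assumes a PyTorch model that returns both the graph embedding vectors and a predicted Mel spectrogram, and uses an L1 loss against the ground-truth Mel spectrogram; the model interface and the choice of loss are assumptions.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, text_batch, target_mel):
    """One optimization step of the preset speech-sequence model (illustrative)."""
    optimizer.zero_grad()
    # Forward pass: the model maps the text to graph-embedding vectors and
    # predicts a Mel spectrogram for the paired speech.
    graph_vectors, predicted_mel = model(text_batch)
    # Loss between predicted and ground-truth Mel spectrogram.
    loss = F.l1_loss(predicted_mel, target_mel)
    loss.backward()      # back-propagate the loss
    optimizer.step()     # update the model parameters
    return loss.item()
```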
  • Step S102 Convert the text to be synthesized into graph embedding vector information based on the application layer.
  • When the text to be synthesized is input into the speech synthesis model, which includes an application layer, the application layer converts the text to be synthesized into graph embedding vector information upon detecting it.
  • Graph embedding is the process of mapping the high-dimensional, dense matrix of graph data into low-dimensional dense vectors, representing a graph as a set of low-dimensional vectors. There are different types of graphs, such as homogeneous graphs, heterogeneous graphs, and attribute graphs.
  • the graph embedding vector information includes node vector information and edge vector information, the vector information of each word is obtained through the node vector information, and the prosodic relationship between each word is obtained through the edge vector information.
  • the edge vector information includes directed edge vector information, reverse edge vector information and sequential edge vector information.
  • step S102 includes: sub-step S1021 to sub-step S1022.
  • Sub-step S1021 Split the text to be synthesized into each word by the application layer, and obtain the order relationship between each word.
  • the application layer splits the to-be-synthesized text into various words, and obtains the order relationship between the various words.
  • For example, the text to be synthesized is "我爱中国" ("I love China"); "我爱中国" is split into "我", "爱", "中", "国", and the order among "我", "爱", "中", "国" is obtained as "我" -- "爱" -- "中" -- "国".
  • Sub-step S1022 Map and convert each word and the order relationship between the words to obtain the graph embedding vector information corresponding to the text to be synthesized.
  • Exemplarily, when the words of the text to be synthesized and the order relationship between them are obtained, the words and their order relationship are mapped to obtain the word vector information of each word and the order vector information between the words, i.e. the edge vector information; the obtained word vector information and edge vector information are combined to obtain the corresponding graph embedding vector information, where the weights in the edge vector information are 0.
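  • The mapping above can be pictured with a small Python sketch that splits a sentence into characters, records the sequential edges between them, and assigns illustrative node and edge vectors; the random node embeddings and zero-initialised edge vectors are assumptions for illustration, not the patent's learned representation.

```python
import numpy as np

def build_graph_embedding(text, dim=8, seed=0):
    """Split text into units, build sequential edges, and return node/edge vectors."""
    rng = np.random.default_rng(seed)
    words = list(text)                                       # "我爱中国" -> ["我","爱","中","国"]
    edges = [(i, i + 1) for i in range(len(words) - 1)]      # order relationship
    node_vectors = {w: rng.normal(size=dim) for w in words}  # illustrative word vectors
    edge_vectors = {e: np.zeros(dim) for e in edges}         # edge vectors start at weight 0
    return words, edges, node_vectors, edge_vectors

words, edges, node_vecs, edge_vecs = build_graph_embedding("我爱中国")
print(words)   # ['我', '爱', '中', '国']
print(edges)   # [(0, 1), (1, 2), (2, 3)]
```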
  • Step S103 Encode the graph embedding vector information according to the graph encoder, generate corresponding first prosodic vector information, and use the first prosody vector information as the first intermediate vector information.
  • When the graph embedding vector information of the text to be synthesized is obtained, it is encoded by the graph encoder in the speech synthesis model to generate the corresponding first prosody vector information.
  • For example, the graph encoder includes a mapping function; the graph embedding vector information is mapped and encoded by the mapping function to obtain the first prosody vector information corresponding to the graph embedding vector information, and when the first prosody vector information is obtained, it is used as the first intermediate vector information.
  • the graph embedding vector information includes a plurality of node vectors and a plurality of edge vectors
  • the encoding of the graph embedding vector information according to the graph encoder to generate the corresponding first prosody vector information includes: obtaining, through the graph encoder, the edge vectors between the node vectors, and encoding the edge vectors to obtain the first prosody vector information corresponding to the graph embedding vector information, wherein an edge vector represents the prosodic relationship between the two node vectors it connects.
  • Exemplarily, when the graph embedding vector information of the text to be synthesized is obtained, the edge vectors between the node vectors are encoded by the graph encoder to obtain the prosody vector information between the node vectors; the corresponding first prosody vector information is then obtained from the order relationship between the nodes and the prosody vector information between the node vectors.
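  • A minimal sketch of what such an edge-wise encoding could look like, assuming a small PyTorch module that maps each edge (the pair of node vectors it connects) to a prosody vector through a learned linear layer; the architecture and dimensions are illustrative assumptions, not the patent's mapping function.

```python
import torch
import torch.nn as nn

class GraphProsodyEncoder(nn.Module):
    """Illustrative graph encoder: maps each edge (a pair of node vectors)
    to a prosody vector, stacked in the nodes' order relationship."""

    def __init__(self, node_dim, prosody_dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * node_dim, prosody_dim), nn.Tanh())

    def forward(self, node_vectors, edges):
        # node_vectors: (num_nodes, node_dim); edges: list of (src, dst) index pairs
        pairs = torch.stack(
            [torch.cat([node_vectors[i], node_vectors[j]]) for i, j in edges]
        )
        return self.edge_mlp(pairs)  # (num_edges, prosody_dim): first prosody vectors

encoder = GraphProsodyEncoder(node_dim=8, prosody_dim=16)
nodes = torch.randn(4, 8)                              # e.g. the four characters of "我爱中国"
print(encoder(nodes, [(0, 1), (1, 2), (2, 3)]).shape)  # torch.Size([3, 16])
```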
  • Step S104 based on the attention mechanism, generate corresponding Mel spectrum information according to the first intermediate vector information.
  • When the first intermediate vector information is obtained, it is input into the attention mechanism, context learning is performed on it through the attention mechanism, and the corresponding Mel spectrum information is generated from it.
  • For example, when the attention mechanism is a multi-head attention mechanism, information that should not yet be known when the sequence is generated (i.e., illegal future information) is masked through the multi-head attention.
  • The masking in multi-head attention mainly keeps training consistent with inference. For example, during training one wants to predict the pronunciation of "w", but the entire prosody vector actually enters the network; the part of the sequence after "w" is therefore masked from the network, preventing the network from seeing information that it will have to predict later, because that information cannot be seen during inference.
  • It should be noted that multi-head attention is composed of several self-attention heads; 4-head attention, for example, essentially performs self-attention over the sequence four times.
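  • As a hedged sketch of this masking (using torch.nn.MultiheadAttention rather than whatever attention module the patent's model actually uses), an upper-triangular boolean mask hides every position after the one currently being predicted; the batch size, sequence length and feature dimension below are arbitrary.

```python
import torch
import torch.nn as nn

def causal_self_attention(x, num_heads=4):
    """Multi-head self-attention where position t cannot attend to positions > t."""
    seq_len, dim = x.shape[1], x.shape[2]
    attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
    # True entries are disallowed: the upper triangle hides future positions.
    future_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    out, _ = attn(x, x, x, attn_mask=future_mask)
    return out

x = torch.randn(2, 10, 64)             # (batch, sequence, feature) prosody vectors
print(causal_self_attention(x).shape)  # torch.Size([2, 10, 64])
```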
  • step S104 includes: sub-step S1041 to sub-step S1042.
  • Sub-step S1041 Input the first intermediate vector information into the attention mechanism, and obtain the context prosody information of each node in the first intermediate vector information through the weight matrix in the attention mechanism.
  • Exemplarily, when the first prosody vector information is acquired, the first prosody vector information is used as the first intermediate vector information and input into the attention mechanism, and the contextual prosody information of each node in the first intermediate vector is acquired through the weight matrix in the attention mechanism.
  • Sub-step S1042 Generate corresponding Mel spectrum information by decoding the context prosody information of each node in the first intermediate vector information.
  • Exemplarily, when the contextual prosody information of each node in the first intermediate vector is acquired, it is decoded by a preset decoder in the attention mechanism to obtain the corresponding Mel spectrum information.
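  • The decoding step might look roughly like the following sketch, in which a recurrent layer plus a linear projection turns the per-node context vectors into 80-band Mel frames; the decoder architecture and dimensions are assumptions for illustration only, not the patent's preset decoder.

```python
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    """Illustrative decoder: turns per-node context prosody vectors into Mel frames."""

    def __init__(self, context_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(context_dim, context_dim, batch_first=True)
        self.to_mel = nn.Linear(context_dim, n_mels)

    def forward(self, context):              # context: (batch, nodes, context_dim)
        hidden, _ = self.rnn(context)
        return self.to_mel(hidden)           # (batch, nodes, n_mels) Mel spectrum frames

mel = MelDecoder()(torch.randn(2, 10, 64))
print(mel.shape)                             # torch.Size([2, 10, 80])
```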
  • Step S105 Output the speech synthesis information corresponding to the Mel spectrum information through the output layer.
  • the output layer outputs the speech synthesis information corresponding to the Mel spectrum information.
  • For example, the output layer includes a vocoder; the vocoder acquires the speech frequency-domain feature information in the Mel spectrum information, and generates the corresponding speech synthesis information by synthesizing these speech frequency-domain features.
  • In an embodiment, outputting the speech synthesis information corresponding to the Mel spectrum information through the output layer includes: extracting the speech frequency-domain features in the Mel spectrum information through the output layer; and mapping the speech frequency-domain features to obtain the corresponding speech synthesis information output by the output layer.
  • Exemplarily, when the Mel spectrum information is obtained, the speech frequency-domain features in the Mel spectrum information are extracted through the output layer, and the extracted speech frequency-domain features are mapped to obtain the corresponding speech synthesis information output by the output layer.
  • For example, the output layer includes an extraction layer and a mapping layer; the speech frequency-domain features in the Mel spectrum information are extracted through the extraction layer, and the speech frequency-domain features are activated and mapped through the activation function in the mapping layer to obtain the corresponding speech synthesis information.
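  • The patent does not name a specific vocoder; purely as a stand-in, the sketch below inverts a Mel spectrogram to a waveform with librosa's Griffin-Lim based helper (a neural vocoder such as WaveNet would normally be used instead), and the sample rate, FFT size and hop length are assumed values.

```python
import numpy as np
import librosa

def mel_to_waveform(mel_spectrogram, sr=22050, n_fft=1024, hop_length=256):
    """Map Mel-spectrogram features back to an audio waveform (Griffin-Lim stand-in)."""
    # mel_spectrogram: (n_mels, frames) non-negative power Mel spectrogram
    return librosa.feature.inverse.mel_to_audio(
        mel_spectrogram, sr=sr, n_fft=n_fft, hop_length=hop_length
    )

mel = np.abs(np.random.randn(80, 100))   # dummy 80-band Mel spectrogram
audio = mel_to_waveform(mel)
print(audio.shape)                       # 1-D waveform reconstructed from the Mel frames
```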
  • In this embodiment of the present application, the text to be synthesized is acquired and input into the speech synthesis model; the application layer converts the text to be synthesized into graph embedding vector information; the graph encoder encodes the graph embedding vector information to generate the corresponding first prosody vector information; the attention mechanism generates the corresponding Mel spectrum information from the first intermediate vector information; and the output layer outputs the speech synthesis information corresponding to the Mel spectrum information. In this way, the graph-assisted encoder analyzes the specific semantic information of the text to map it to different speech prosody rhythms, making prosody adjustment a fully automated process and improving the accuracy of speech synthesis.
  • FIG. 4 is a schematic flowchart of another speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method includes steps S201 to S208.
  • Step S201 Acquire the text to be synthesized, and input the to-be-synthesized text into a speech synthesis model.
  • the to-be-synthesized text is acquired, where the to-be-synthesized text includes short sentences, short texts, and the like.
  • the obtaining method includes obtaining the text input by the user, or obtaining the text stored in a preset storage path, and the like, wherein the preset storage path includes a blockchain.
  • When the text to be synthesized is obtained, it is input into the speech synthesis model, and the speech synthesis model may be stored in a preset blockchain.
  • the speech synthesis model includes an application layer, an output layer, a graph encoder, an attention mechanism, and the like.
  • In an embodiment, before acquiring the text to be synthesized, the method further includes: acquiring speech text to be trained, wherein the speech text to be trained includes text information and speech information corresponding to the text information; training a preset speech sequence model with the text information and the speech information to obtain the graph embedding vector information corresponding to the text information and the Mel spectrum information corresponding to the speech information; and obtaining a corresponding loss function from the graph embedding vector information and the Mel spectrum information, and updating the model parameters of the preset speech sequence model with the loss function to generate a corresponding speech synthesis model.
  • Exemplarily, the speech text to be trained is obtained, where the speech text to be trained includes text information and speech information corresponding to the text information.
  • The preset speech sequence model is trained with the text information and the speech information: the corresponding graph embedding vector information is obtained from the text information through the graph encoder in the preset speech sequence model, and a Mel spectrogram is obtained from the speech information through the attention mechanism in the model.
  • The corresponding loss function is obtained from the graph embedding vector information and the Mel spectrogram, and the model parameters of the preset speech sequence model are optimized with this loss function.
  • After the model parameters have been optimized and the preset speech sequence model has converged, the converged preset speech sequence model is taken as the corresponding speech synthesis model.
  • Step S202 Convert the text to be synthesized into graph embedding vector information based on the application layer.
  • When the text to be synthesized is input into the speech synthesis model, which includes an application layer, the application layer converts the text to be synthesized into graph embedding vector information upon detecting it.
  • Graph embedding is the process of mapping the high-dimensional, dense matrix of graph data into low-dimensional dense vectors, representing a graph as a set of low-dimensional vectors. There are different types of graphs, such as homogeneous graphs, heterogeneous graphs, and attribute graphs.
  • the graph embedding vector information includes node vector information and edge vector information, the vector information of each word is obtained through the node vector information, and the prosodic relationship between each word is obtained through the edge vector information.
  • the edge vector information includes directed edge vector information, reverse edge vector information and sequential edge vector information.
  • Step S203 Encode the graph embedding vector information according to the graph encoder, generate corresponding first prosodic vector information, and use the first prosodic vector information as the first intermediate vector information.
  • When the graph embedding vector information of the text to be synthesized is obtained, it is encoded by the graph encoder in the speech synthesis model to generate the corresponding first prosody vector information.
  • For example, the graph encoder includes a mapping function; the graph embedding vector information is mapped and encoded by the mapping function to obtain the first prosody vector information corresponding to the graph embedding vector information, and when the first prosody vector information is obtained, it is used as the first intermediate vector information.
  • Step S204 Convert the text to be synthesized into text vector information based on the application layer.
  • When the text to be synthesized is input into the speech synthesis model, it is converted into text vector information through the application layer in the speech synthesis model.
  • The application layer extracts the position of each word in the text to be synthesized and the pinyin corresponding to each word, obtains from a preset coding rule the numbers or letters corresponding to each word's position and to each word's pinyin, and converts the text to be synthesized into text vector information through these numbers or letters.
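  • A toy sketch of this kind of text vectorisation is shown below; it assumes the third-party pypinyin package for the pinyin lookup and uses a hand-written dictionary as the "preset coding rule", both of which are illustrative assumptions rather than the patent's actual encoding.

```python
from pypinyin import lazy_pinyin

def text_to_vector(text, pinyin_codes):
    """Pair each character's position with a numeric code for its pinyin."""
    vector = []
    for position, syllable in enumerate(lazy_pinyin(text)):
        vector.append((position, pinyin_codes.get(syllable, 0)))
    return vector

codes = {"wo": 1, "ai": 2, "zhong": 3, "guo": 4}   # toy preset coding rule
print(text_to_vector("我爱中国", codes))             # [(0, 1), (1, 2), (2, 3), (3, 4)]
```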
  • Step S205 Encode the text vector information according to the encoder to generate corresponding hidden vector information.
  • When the text vector information is acquired, the encoder encodes the text vector information to generate the corresponding hidden vector information.
  • the encoder includes an encoding rule, and the text vector information is encoded by the encoding rule to obtain corresponding hidden vector information.
  • Step S206 splicing the hidden vector information and the first intermediate vector information to generate corresponding second intermediate vector information.
  • Exemplarily, when the hidden vector information and the first intermediate vector information are obtained, the dimension information of the hidden vector information and the dimension information of the first intermediate vector information are obtained respectively; the dimension along which the hidden vector information and the first intermediate vector information are of the same size is determined from this dimension information, and the hidden vector information and the first intermediate vector information are spliced along that dimension to generate the corresponding second intermediate vector information.
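  • For instance, the splicing can be a simple tensor concatenation once both vectors agree on the shared dimensions, as in the sketch below; the concrete shapes and the choice to concatenate along the feature axis are assumptions for illustration.

```python
import torch

hidden = torch.randn(2, 10, 256)   # hidden vector information (batch, seq, dim_h)
prosody = torch.randn(2, 10, 64)   # first intermediate vector information (batch, seq, dim_p)

# Batch and sequence dimensions match, so the two can be spliced along the
# feature dimension to form the second intermediate vector information.
assert hidden.shape[:2] == prosody.shape[:2]
second_intermediate = torch.cat([hidden, prosody], dim=-1)
print(second_intermediate.shape)   # torch.Size([2, 10, 320])
```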
  • Step S207 generating corresponding Mel spectrum information based on the attention mechanism and the second intermediate vector information.
  • Exemplarily, when the second intermediate vector information is obtained, it is input into the attention mechanism, and the contextual prosody information of each node in the second intermediate vector is acquired through the weight matrix in the attention mechanism.
  • a preset decoder is used to decode the contextual prosody information of each node in the second intermediate vector to obtain corresponding Mel spectrum information.
  • Step S208 outputting the speech synthesis information corresponding to the Mel spectrum information through the output layer.
  • the output layer outputs the speech synthesis information corresponding to the Mel spectrum information.
  • For example, the output layer includes a vocoder; the vocoder acquires the speech frequency-domain feature information in the Mel spectrum information, and generates the corresponding speech synthesis information by synthesizing these speech frequency-domain features.
  • In this embodiment of the present application, the text to be synthesized is input into the speech synthesis model to obtain the text vector information and graph embedding vector information corresponding to the text to be synthesized; the graph embedding vector information is encoded by the graph encoder and the text vector information is encoded by the encoder, the Mel spectrum information output by the attention mechanism is obtained, and the speech synthesis information corresponding to that Mel spectrum information is output by the output layer. Semantic structure information is thus embedded into the speech synthesis model: the graph-assisted encoder analyzes prosody information from the text side while the encoder analyzes word-position information from the text side, so that the specific semantic information of the text is mapped to different speech prosody rhythms, making prosody adjustment a fully automated process and improving the accuracy of speech synthesis.
  • FIG. 5 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • the speech synthesis apparatus 400 includes: a first acquisition module 401 , a conversion model 402 , a first generation module 403 , a second generation module 404 , and a second acquisition module 405 .
  • the first obtaining module 401 is configured to obtain the text to be synthesized, and input the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • a conversion model 402 configured to convert the text to be synthesized into graph embedding vector information based on the application layer
  • a first generating module 403, configured to encode the graph embedding vector information according to the graph encoder, generate corresponding first prosody vector information, and use the first prosody vector information as the first intermediate vector information;
  • a second generating module 404 configured to generate corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • the second obtaining module 405 is configured to output the speech synthesis information corresponding to the Mel spectrum information through the output layer.
  • the speech synthesis device is also specifically used for:
  • the text vector information is encoded according to the encoder to generate corresponding hidden vector information
  • the first generation module 403 is also specifically used for:
  • the edge vectors between the node vectors are obtained through the graph encoder, and the edge vectors are encoded to obtain the first prosody vector information corresponding to the graph embedding vector information, wherein an edge vector represents the prosodic relationship between the two corresponding node vectors.
  • the second generation module 404 is specifically also used for:
  • Corresponding Mel spectrum information is generated by decoding the context prosody information of each node in the first intermediate vector information.
  • the second obtaining module 405 is also specifically used for:
  • the conversion model 402 is also specifically used for:
  • the speech synthesis device is also used for:
  • the voice text to be trained includes text information and voice information corresponding to the text information
  • a corresponding loss function is obtained through the graph embedding vector information and the Mel spectrum information, and the model parameters of the preset speech sequence model are updated through the loss function to generate a corresponding speech synthesis model.
  • the apparatuses provided in the above embodiments may be implemented in the form of a computer program, and the computer program may be executed on the computer device as shown in FIG. 6 .
  • FIG. 6 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • the computer device may be a terminal.
  • the computer device includes a processor, a memory, and a network interface connected through a system bus, wherein the memory may include a volatile storage medium, a non-volatile storage medium, and an internal memory.
  • the nonvolatile storage medium can store operating systems and computer programs.
  • the computer program includes program instructions that, when executed, cause the processor to perform any speech synthesis method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • The internal memory provides an environment for the running of the computer program in the non-volatile storage medium; when the computer program is executed by the processor, it can cause the processor to perform any one of the speech synthesis methods.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 6 is only a block diagram of a partial structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • acquiring text to be synthesized, and inputting the text to be synthesized into a speech synthesis model, wherein the speech synthesis model includes an application layer, an output layer, a graph encoder and an attention mechanism;
  • converting the text to be synthesized into graph embedding vector information based on the application layer;
  • encoding the graph embedding vector information according to the graph encoder to generate corresponding first prosody vector information, and using the first prosody vector information as first intermediate vector information;
  • generating corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism;
  • outputting, through the output layer, the speech synthesis information corresponding to the Mel spectrum information.
  • In one embodiment, the speech synthesis model further includes an encoder; when implementing the method, the processor is further configured to implement:
  • the text vector information is encoded according to the encoder to generate corresponding hidden vector information
  • In one embodiment, the graph embedding vector information includes a plurality of node vectors and a plurality of edge vectors;
  • when the processor encodes the graph embedding vector information according to the graph encoder to generate the corresponding first prosody vector information, it is configured to implement:
  • obtaining the edge vectors between the node vectors through the graph encoder, and encoding the edge vectors to obtain the first prosody vector information corresponding to the graph embedding vector information, wherein an edge vector represents the prosodic relationship between the two corresponding node vectors.
  • In one embodiment, when the processor generates the corresponding Mel spectrum information according to the first intermediate vector information based on the attention mechanism, the processor is configured to implement:
  • Corresponding Mel spectrum information is generated by decoding the context prosody information of each node in the first intermediate vector information.
  • In one embodiment, when the processor outputs the speech synthesis information corresponding to the Mel spectrum information through the output layer, the processor is configured to implement:
  • In one embodiment, when the processor converts the text to be synthesized into graph embedding vector information based on the application layer, the processor is configured to implement:
  • In one embodiment, before the processor acquires the text to be synthesized, the processor is configured to implement:
  • acquiring speech text to be trained, wherein the speech text to be trained includes text information and speech information corresponding to the text information;
  • obtaining a corresponding loss function through the graph embedding vector information and the Mel spectrum information, and updating the model parameters of the preset speech sequence model through the loss function to generate a corresponding speech synthesis model.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, the computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the speech synthesis method of the present application.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
  • the computer-readable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like; the stored data area may store data created according to the use of blockchain nodes, and the like.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed storage of the speech synthesis model, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • A blockchain is essentially a decentralized database, a chain of data blocks generated in association with one another using cryptographic methods; each data block contains a batch of network transaction information, which is used to verify the validity of its information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to the field of artificial intelligence and discloses a speech synthesis method and apparatus, a computer device, and a computer-readable storage medium. The method includes: acquiring text to be synthesized, converting the text to be synthesized into graph embedding vector information through a speech synthesis model, encoding the graph embedding vector information according to a graph encoder to generate corresponding first intermediate vector information, generating corresponding Mel spectrum information according to the first intermediate vector information, and outputting the speech synthesis information corresponding to the Mel spectrum information. A graph-assisted encoder thereby analyzes the specific semantic information of the text and maps it to different speech prosody rhythms, making prosody adjustment a fully automated process and improving the accuracy of speech synthesis. The present application also relates to blockchain technology, and can be applied to fields such as smart government affairs, smart education, and smart healthcare, thereby further advancing the construction of smart cities.

Description

语音合成方法、装置、设备及存储介质
本申请要求于2020年12月11日在中华人民共和国国家知识产权局专利局提交的、申请号为202011446751.8、发明名称为“语音合成方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语义合成技术领域,尤其涉及一种语音合成方法、装置、计算机设备及计算机可读存储介质。
背景技术
TTS语音合成系统(Text To Speech  语音合成系统),是智能对话系统中不可或缺的一部分。学术界和工业界尝试用有限的资源和时间来实现真人式语音的合成。近些年,神经网络的方法在Google的Tacotron 和Wavenet发布后,成为语音合成领域的主流解决方案。
发明人意识到,目前基于神经网络的TTS模型已经展示出了良好的合成效果,但是在语音合成的过程中,韵律嵌入仍旧是一个有挑战性的任务。韵律向量首先尝试被从Mel语谱中提取出之后,在端到端模型的注意力机制处与编码器的输出一同输入attention机制,但这种方法对于句子长度敏感,合成效果鲁棒性差。为此提出多头全局风格标记,被用来代表语音的不同的说话风格,这些方法控制了合成语音的全局风格,但是局部说话韵律如停顿、重读和语调,对于合成语音的自然度来说仍旧至关重要。因此学者提出用时间结构来控制合成语音的说话风格,或采用变分自编码器来学习说话风格的隐状态向量,使得端到端模型能够更容易用于局部风格控制。虽然一定程度上解决了语音合成过程中的局部韵律控制,但是在进行韵律控制的过程中,手动挑选参考语音的过程可能会造成模型误差的累积,合成语音的准确率较低。
技术问题
本申请实施例的目的之一在于:提供了一种语音合成方法、装置、计算机设备及计算机可读存储介质,以解决现有韵律控制的过程中,手动挑选参考语音的过程可能会造成模型误差的累积,合成语音的准确率较低的技术问题。
技术解决方案
第一方面,本申请实施例提供了一种语音合成方法,包括:
获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
基于所述应用层将所述待合成文本转换为图嵌入向量信息;
根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
第二方面,本申请实施例提供了一种语音合成装置,包括:
第一获取模块,用于获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
转换模型,用于基于所述应用层将所述待合成文本转换为图嵌入向量信息;
第一生成模块,用于根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
第二生成模块,用于基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
第二获取模块,用于通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
第三方面,本申请实施例提供了一种计算机设备,包括存储器、处理器以及存储在存储器中并可在处理器上运行的计算机程序,所述处理器执行计算机程序时实现:
获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
基于所述应用层将所述待合成文本转换为图嵌入向量信息;
根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
第四方面,本申请实施例提供了一种计算机可读存储介质,计算机可读存储介质可以是非易失性,也可以是易失性,计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现:
获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
基于所述应用层将所述待合成文本转换为图嵌入向量信息;
根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
有益效果
本申请实施例与现有技术相比存在的有益效果是:通过获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;基于所述应用层将所述待合成文本转换为图嵌入向量信息;根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;通过所述输出层输出所述梅尔语谱信息对应的语音合成信息,实现通过图辅助编码器分析文本信息的具体语义信息来映射到不同的语音韵律节奏,使得韵律调节的过程成为一个全自动化的过程,提高了语音合成的准确率。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例提供的一种语音合成方法的流程示意图;
图2为图1中的语音合成方法的子步骤流程示意图;
图3为图1中的语音合成方法的子步骤流程示意图;
图4为本申请实施例提供的另一种语音合成方法的流程示意图;
图5为本申请实施例提供的一种语音合成装置的示意性框图;
图6为本申请一实施例涉及的计算机设备的结构示意框图。
本申请目的的实现、功能特点及优点将结合实施例,参照附图做进一步说明。
本发明的实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
附图中所示的流程图仅是示例说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解、组合或部分合并,因此实际执行的顺序有可能根据实际情况改变。
本申请实施例提供一种语音合成方法、装置、计算机设备及计算机可读存储介质。其中,该语音合成方法可应用于计算机设备中,该计算机设备可以是笔记本电脑、台式电脑等电子设备。
下面结合附图,对本申请的一些实施方式作详细说明。在不冲突的情况下,下述的实施例及实施例中的特征可以相互组合。
请参照图1,图1 为本申请的实施例提供的一种语音合成方法的流程示意图。
如图1所示,该语音合成方法包括步骤S101至步骤S105。
步骤S101、获取待合成文本,并将所述待合成文本输入语音合成模型。
示范性的,获取待合成文本,该待合成文本包括短句和短文本等。该获取的方式包括获取用户输入的文本,或获取预置存储路径中存储的文本等,其中该预置存储路径包括区块链。在获取到待合成文本时,将该待合成文本输入到语义合成模型中,该语音合成模型可以存储在预置区块链中,该语音合成模型包括应用层、输出层、图编码器以及注意力机制等。
在一实施例中,所述获取待合成文本之前,还包括:获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
示范性的,获取待训练语音文本,该待训练语音文本包括文本信息以及所述文本信息对应的语音信息。通过该文本信息和语音信息训练预置语音序列模型,通过该文本信息和该预置语音序列模型中的图编码器,得到对应的图嵌入向量信息,通过该语音信息和预置语音序列模型中的注意力机制得到对一个的梅尔语谱图,通过该图嵌入向量信息和梅尔语谱图得到对应的损失函数,通过该损失函数对预置语音序列模型的模型参数进行优化,在确定优化该预置语音序列模型的模型参数后,该预置语音序列模型处于收敛状态时,将该预置语音序列模型生成对应的语音合成模型。
步骤S102、基于所述应用层将所述待合成文本转换为图嵌入向量信息。
示范性的,在将待合成文本输入到语音合成模型中,该语音合成模型中包括应用层。该应用层在检测到待合成文本时,将该待合成文本转换为图嵌入向量信息。图嵌入是一种将图数据高维稠密的矩阵映射为低微稠密向量的过程,通过将图表示为一组低维向量,存在不同类型的图,例如,同构图,异构图,属性图等。该图嵌入向量信息包括结点向量信息和边向量信息,通过该结点向量信息得到各个字词的向量信息,通过该边向量信息得到各个字词之间的韵律关系。其中,边向量信息包括有向边向量信息、反向边向量信息以及顺序边向量信息。
在一实施例中,具体地,参照图2,步骤S102包括:子步骤S1021至子步骤S1022。
子步骤S1021、通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系。
示范性的,该应用层在检测到该合成文本时,将该待合成文本拆分为各个字词,并获取各个字词之间的顺序关系。例如,待合成文本为“我爱中国”,将该“我爱中国”拆分为“我”、“爱”、“中”、“国”。并获取该“我”、“爱”、“中”、“国”之间的顺序为“我”--“爱”--“中”--“国”。
子步骤S1022、对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
示范性的,在获取到待合成文本的各个字词和各个字词之间的顺序关系时,对各个字词以及该各个字词的顺序关系进行映射,得到各个字词的字词向量信息以及各个字词之间的顺序向量信息即边向量信息,将得到的字词向量信息和边向量信息进行组合,得到对应的图嵌入向量信息,其中,边向量信息中的权重为0。
步骤S103、根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息。
示范性的,在获取到待合成文本的图嵌入向量信息时,通过该语音合成模型中的图编码器对应该图嵌入向量信息进行编码,生成对应的第一韵律向量信息。例如,该图编码器中包括映射函数,通过该映射函数对图嵌入向量信息进行映射编码,得到该图嵌入向量信息对应的第一韵律向量信息,在得到该第一韵律向量信息时,将该第一韵律向量信息作为第一中间向量信息。
在一实时例中,所述图嵌入向量信息包括点多个结点向量和多个边向量;
所述根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,包括:通过所述图编码器获取各个所述结点向量之间的边向量,并对所述边向量进行编码,得到所述图嵌入向量信息对应的第一韵律向量信息,其中,所述边向量表示对应两个所述结点向量的韵律关系。
示范性的,在获取到待合成文本的图嵌入向量信息时,通过图编码器对该各个结点向量间的边向量进行编码,得到各个结点向量之间的韵律向量信息,通过各个结点之间的顺序关系以及各个结点向量之间的韵律向量信息,得到对应的第一韵律向量信息。
步骤S104、基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息。
示范性的,在获取到第一中间向量信息时,将第一中间向量信息输入至注意力机制中,通过该注意力机制对该第一中间向量信息进行上下文学习,将该第一中间向量信息生成对应的梅尔语谱信息。例如,该注意力机制为多头注意力机制时,通过多头注意力遮蔽序列生成时不应知道的信息(即不合法的信息)。其中,多头注意力主要是为了训练时和推断时要一致,比如,在训练时,想要预测“w”这个发音,但是实际上进入网络时是整个韵律向量都会进入,要把这个韵律向量在“w”这个之后的序列都对网络屏蔽,防止网络看到未来需要预测的信息,因为这些信息在推断时是看不到的。
需要说明的是,多头注意力由几个自注意力组成,比如4头注意力,实质上就是对序列做4次自注意力。
在一实施例中,具体地,参照图3,步骤S104包括:子步骤S1041至子步骤S1042。
子步骤S1041、将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息。
示范性的,在获取到该第一韵律向量信息时,将第一韵律向量信息作为中间向量信息并输入至注意力机制中,通过该注意力机制中的权重矩阵获取该第一中间向量中各个结点的上下文韵律信息。
子步骤S1042、通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
示范性的,在获取到该第一中间向量中各个结点的上下文韵律信息时,通过该注意力机制中的预置解码器对该第一中间向量中各个结点的上下文韵律信息进行解码,得到对应的梅尔语谱信息。
步骤S105、通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
示范性的,在获取到梅尔语谱信息,通过输出层输出该梅尔语谱信息对应的语音合成信息。例如,该输出层包括声码器,该声码器获取该梅尔语谱信息中的语音频域特征信息,通过对该语音频域特征信息进行合成,生成对应的语音合成信息。
在一实施例中,所通过所述输出层输出所述梅尔语谱信息对应的语音合成信息,包括:通过所述输出层提取所述梅尔频谱信息中的语音频域特征;并对所述语音频域特征进行映射,获取所述输出层输出对应的语音合成信息。
示范性的,在获取到梅尔语谱信息时,通过该输出层提取该梅尔频谱信息中的语音频域特征,在提取到该梅尔频谱信息中的语音频域特征,对该语音频域特征进行映射,获取输出层输出对应的语音合成信息。例如,该输出层包括提取层和映射层,通过该提取层提取该该梅尔频谱信息中的语音频域特征,通过该映射层中的激活函数对该语音频域特征进行激活映射,得到对应的语音合成信息。
在本申请实施例中,获取待合成文本,并将待合成文本输入语音合成模型,应用层将待合成文本转换为图嵌入向量信息,图编码器对图嵌入向量信息进行编码,生成对应的第一韵律向量信息,注意力机制根据第一中间向量信息生成对应的梅尔语谱信息,输出层输出梅尔语谱信息对应的语音合成信息,实现通过图辅助编码器分析文本信息的具体语义信息来映射到不同的语音韵律节奏,使得韵律调节的过程成为一个全自动化的过程,提高了语音合成的准确率。
请参照图4,图4为本申请的实施例提供的另一种语音合成方法的流程示意图。
如图4所示,该语音合成方法包括步骤S201至步骤S208。
步骤S201、获取待合成文本,并将所述待合成文本输入语音合成模型。
示范性的,获取待合成文本,该待合成文本包括短句和短文本等。该获取的方式包括获取用户输入的文本,或获取预置存储路径中存储的文本等,其中该预置存储路径包括区块链。在获取到待合成文本时,将该待合成文本输入到语义合成模型中,该语音合成模型可以存储在预置区块链中,该语音合成模型包括应用层、输出层、图编码器以及注意力机制等。
在一实施例中,所述获取待合成文本之前,还包括:获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
示范性的,获取待训练语音文本,该待训练语音文本包括文本信息以及所述文本信息对应的语音信息。通过该文本信息和语音信息训练预置语音序列模型,通过该文本信息和该预置语音序列模型中的图编码器,得到对应的图嵌入向量信息,通过该语音信息和预置语音序列模型中的注意力机制得到对一个的梅尔语谱图,通过该图嵌入向量信息和梅尔语谱图得到对应的损失函数,通过该损失函数对预置语音序列模型的模型参数进行优化,在确定优化该预置语音序列模型的模型参数后,该预置语音序列模型处于收敛状态时,将该预置语音序列模型生成对应的语音合成模型。
步骤S202、基于所述应用层将所述待合成文本转换为图嵌入向量信息。
示范性的,在将待合成文本输入到语音合成模型中,该语音合成模型中包括应用层。该应用层在检测到待合成文本时,将该待合成文本转换为图嵌入向量信息。图嵌入是一种将图数据高维稠密的矩阵映射为低微稠密向量的过程,通过将图表示为一组低维向量,存在不同类型的图,例如,同构图,异构图,属性图等。该图嵌入向量信息包括结点向量信息和边向量信息,通过该结点向量信息得到各个字词的向量信息,通过该边向量信息得到各个字词之间的韵律关系。其中,边向量信息包括有向边向量信息、反向边向量信息以及顺序边向量信息。
步骤S203、根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息。
示范性的,在获取到待合成文本的图嵌入向量信息时,通过该语音合成模型中的图编码器对应该图嵌入向量信息进行编码,生成对应的第一韵律向量信息。例如,该图编码器中包括映射函数,通过该映射函数对图嵌入向量信息进行映射编码,得到该图嵌入向量信息对应的第一韵律向量信息,在得到该第一韵律向量信息时,将该第一韵律向量信息作为第一中间向量信息。
步骤S204、基于所述应用层将所述待合成文本转换为文本向量信息。
示范性的,在将该待合成文本输入待语音合成模型中,通过该语音合成模型中的应用层将该待合成文本转换为文本向量信息。通过应用层提取待合成文本中各个字词的位置以及各个字词对应的拼音,获取预置编码规中各个字词的位置以及各个字词对应的拼音对应的数字或字母,通过数字或字母将待合成文本转化为文本向量信息。
步骤S205、根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息。
示范性的,在获取到文本向量信息时,通过该编码器对该文本向量信息将进行编码,生成对应的隐藏向量信息。例如,该编码器中包括编码规则,通过该编码规则对该文本向量信息进行编码,得到对应的隐藏向量信息。
步骤S206、将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息。
示范性的,在获取到隐藏向量信息和第一中间向量信息时,分别获取隐藏向量信息的维度信息和第一中间向量信息的维度信息,通过该维度信息,确定隐藏向量信息与第一中间向量信息同一维度,并在同一维度将隐藏向量信息与第一中间向量信息进行拼接,生成对应的第二中间向量信息。
步骤S207、基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
示范性的,在获取到该第二中间向量信息时,将输入至注意力机制中,通过该注意力机制中的权重矩阵获取该第二中间向量中各个结点的上下文韵律信息。在获取到该第二中间向量中各个结点的上下文韵律信息时,通过预置解码器对该第二中间向量中各个结点的上下文韵律信息进行解码,得到对应的梅尔语谱信息。
步骤S208、通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
示范性的,在获取到梅尔语谱信息,通过输出层输出该梅尔语谱信息对应的语音合成信息。例如,该输出层包括声码器,该声码器获取该梅尔语谱信息中的语音频域特征信息,通过对该语音频域特征信息进行合成,生成对应的语音合成信息。
在本申请实施例中,通过将待合成文本输入到语音合成模型中,得到该待合成文本对应的文本向量信息和图嵌入向量信息,通过该图编码器对图嵌入向量信息进行编码和编码器对文本向量信息进行编码,得到注意力机制输出的梅尔语谱信息,并获取输出层输出梅尔语谱信息对应的语音合成信息,将语义结构信息嵌入到语音合成模型中,而图辅助编码器从文本侧分析韵律信息,编码器从文本侧分析字词位置信息,实现通过图辅助编码器分析文本信息的具体语义信息来映射到不同的语音韵律节奏,使得韵律调节的过程成为一个全自动化的过程,提高了语音合成的准确率。
请参照图5,图5为本申请实施例提供的一种语音合成装置的示意性框图。
如图5所示,该语音合成装置400,包括:第一获取模块401、转换模型402、第一生成模块403、第二生成模块404、第二获取模块405。
第一获取模块401,用于获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
转换模型402,用于基于所述应用层将所述待合成文本转换为图嵌入向量信息;
第一生成模块403,用于根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
第二生成模块404,用于基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
第二获取模块405,用于通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
其中,语音合成装置具体还用于:
基于所述应用层将所述待合成文本转换为文本向量信息;
根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息;
将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息;
基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
其中,第一生成模块403具体还用于:
通过所述图编码器获取各个所述结点向量之间的边向量,并对所述边向量进行编码,得到所述图嵌入向量信息对应的第一韵律向量信息,其中,所述边向量表示对应两个所述结点向量的韵律关系。
其中,第二生成模块404具体还用于:
将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息;
通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
其中,第二获取模块405具体还用于:
通过所述输出层提取所述梅尔频谱信息中的语音频域特征;
并对所述语音频域特征进行映射,获取所述输出层输出对应的语音合成信息。
其中,转换模型402具体还用于:
通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系;
对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
其中,语音合成装置还用于:
获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;
通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;
通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
需要说明的是,所属领域的技术人员可以清楚地了解到,为了描述的方便和简洁,上述描述的装置和各模块及单元的具体工作过程,可以参考前述语音合成方法实施例中的对应过程,在此不再赘述。
上述实施例提供的装置可以实现为一种计算机程序的形式,该计算机程序可以在如图6所示的计算机设备上运行。
请参阅图6,图6为本申请实施例提供的一种计算机设备的结构示意性框图。该计算机设备可以为终端。
如图6所示,该计算机设备包括通过系统总线连接的处理器、存储器和网络接口,其中,存储器可以包括易失性存储介质、非易失性存储介质以及内存储器。
非易失性存储介质可存储操作系统和计算机程序。该计算机程序包括程序指令,该程序指令被执行时,可使得处理器执行任意一种语音合成方法。
处理器用于提供计算和控制能力,支撑整个计算机设备的运行。
内存储器为非易失性存储介质中的计算机程序的运行提供环境,该计算机程序被处理器执行时,可使得处理器执行任意一种语音合成方法。
该网络接口用于进行网络通信,如发送分配的任务等。本领域技术人员可以理解,图6中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
应当理解的是,处理器可以是中央处理单元 (Central Processing Unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器 (Digital Signal Processor,DSP)、专用集成电路 (Application Specific Integrated Circuit,ASIC)、现场可编程门阵列 (Field-Programmable Gate Array,FPGA) 或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。其中,通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
其中,在一个实施例中,所述处理器用于运行存储在存储器中的计算机程序,以实现如下步骤:
获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
基于所述应用层将所述待合成文本转换为图嵌入向量信息;
根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
在一个实施例中,所述处理器所述语音合成模型还包括编码器;所述方法实现时,用于实现:
基于所述应用层将所述待合成文本转换为文本向量信息;
根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息;
将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息;
基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
在一个实施例中,所述处理器所述图嵌入向量信息包括点多个结点向量和多个边向量;
所述根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息实现时,用于实现:
通过所述图编码器获取各个所述结点向量之间的边向量,并对所述边向量进行编码,得到所述图嵌入向量信息对应的第一韵律向量信息,其中,所述边向量表示对应两个所述结点向量的韵律关系。
在一个实施例中,所述处理器所述基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息实现时,用于实现:
将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息;
通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
在一个实施例中,所述处理器所通过所述输出层输出所述梅尔语谱信息对应的语音合成信息实现时,用于实现:
通过所述输出层提取所述梅尔频谱信息中的语音频域特征;
并对所述语音频域特征进行映射,获取所述输出层输出对应的语音合成信息。
在一个实施例中,所述处理器所述基于所述应用层将所述待合成文本转换为图嵌入向量信实现时,用于实现:
通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系;
对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
在一个实施例中,所述处理器所述获取待合成文本之前实现时,用于实现:
获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;
通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;
通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
本申请实施例还提供一种计算机可读存储介质,所述计算机可读存储介质上存储有计算机程序,所述计算机程序中包括程序指令,所述程序指令被执行时所实现的方法可参照本申请语音合成方法的各个实施例。
其中,所述计算机可读存储介质可以是前述实施例所述的计算机设备的内部存储单元,例如所述计算机设备的硬盘或内存。所述计算机可读存储介质也可以是所述计算机设备的外部存储设备,例如所述计算机设备上配备的插接式硬盘,智能存储卡(Smart Media Card,SMC),安全数字(Secure Digital,SD)卡,闪存卡(Flash Card)等。
进一步地,所述计算机可读存储介质可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序等;存储数据区可存储根据区块链节点的使用所创建的数据等。
本申请所指区块链是语音合成模型的存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者系统不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者系统所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者系统中还存在另外的相同要素。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到各种等效的修改或替换,这些修改或替换都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。

Claims (20)

  1. 一种语音合成方法,其中,包括:
    获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
    基于所述应用层将所述待合成文本转换为图嵌入向量信息;
    根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
    基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
    通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
  2. 如权利要求1所述的语音合成方法,其中,所述语音合成模型还包括编码器;所述并将所述第一韵律向量信息作为第一中间向量信息之后,所述通过所述输出层输出所述梅尔语谱信息对应的语音合成信息之前,还包括:
    基于所述应用层将所述待合成文本转换为文本向量信息;
    根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息;
    将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息;
    基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
  3. 如权利要求1所述的语音合成方法,其中,所述图嵌入向量信息包括点多个结点向量和多个边向量;
    所述根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,包括:
    通过所述图编码器获取各个所述结点向量之间的边向量,并对所述边向量进行编码,得到所述图嵌入向量信息对应的第一韵律向量信息,其中,所述边向量表示对应两个所述结点向量的韵律关系。
  4. 如权利要求1所述的语音合成方法,其中,所述基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息,包括:
    将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息;
    通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
  5. 如权利要求1所述的语音合成方法,其中,所通过所述输出层输出所述梅尔语谱信息对应的语音合成信息,包括:
    通过所述输出层提取所述梅尔频谱信息中的语音频域特征;
    并对所述语音频域特征进行映射,获取所述输出层输出对应的语音合成信息。
  6. 如权利要求1所述的语音合成方法,其中,所述基于所述应用层将所述待合成文本转换为图嵌入向量信,包括:
    通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系;
    对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
  7. 如权利要求1所述的语音合成方法,其中,所述获取待合成文本之前,还包括:
    获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;
    通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;
    通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
  8. 一种语音合成装置,其中,包括:
    第一获取模块,用于获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
    转换模型,用于基于所述应用层将所述待合成文本转换为图嵌入向量信息;
    第一生成模块,用于根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
    第二生成模块,用于基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
    第二获取模块,用于通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
  9. 一种计算机设备,其中,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现:
    获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
    基于所述应用层将所述待合成文本转换为图嵌入向量信息;
    根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
    基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
    通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
  10. 如权利要求9所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现:
    基于所述应用层将所述待合成文本转换为文本向量信息;
    根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息;
    将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息;
    基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
  11. 如权利要求9所述的计算机设备,其中,所述图嵌入向量信息包括点多个结点向量和多个边向量,所述处理器执行所述计算机程序时还实现:
    通过所述图编码器获取各个所述结点向量之间的边向量,并对所述边向量进行编码,得到所述图嵌入向量信息对应的第一韵律向量信息,其中,所述边向量表示对应两个所述结点向量的韵律关系。
  12. 如权利要求9所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现:
    将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息;
    通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
  13. 如权利要求9所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现:
    通过所述输出层提取所述梅尔频谱信息中的语音频域特征;
    并对所述语音频域特征进行映射,获取所述输出层输出对应的语音合成信息。
  14. 如权利要求9所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现:
    通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系;
    对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
  15. 如权利要求9所述的计算机设备,其中,所述处理器执行所述计算机程序时还实现:
    获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;
    通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;
    通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
  16. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其中,所述计算机程序被处理器执行时实现:
    获取待合成文本,并将所述待合成文本输入语音合成模型,其中,所述语音合成模型包括应用层、输出层、图编码器和注意力机制;
    基于所述应用层将所述待合成文本转换为图嵌入向量信息;
    根据所述图编码器对所述图嵌入向量信息进行编码,生成对应的第一韵律向量信息,并将所述第一韵律向量信息作为第一中间向量信息;
    基于所述注意力机制,根据所述第一中间向量信息生成对应的梅尔语谱信息;
    通过所述输出层输出所述梅尔语谱信息对应的语音合成信息。
  17. 如权利要求16所述的计算机可读存储介质,其中,所述计算机程序被处理器执行时还实现:
    基于所述应用层将所述待合成文本转换为文本向量信息;
    根据所述编码器对所述文本向量信息进行编码,生成对应的隐藏向量信息;
    将所述隐藏向量信息和所述第一中间向量信息进行拼接,生成对应的第二中间向量信息;
    基于所述注意力机制和所述第二中间向量信息,生成对应的梅尔语谱信息。
  18. 如权利要求16所述的计算机可读存储介质,其中,所述处理器执行所述计算机程序时还实现:
    将所述第一中间向量信息输入到所述注意力机制中,通过所述注意力机制中的权重矩阵获取所述第一中间向量信息中各个结点的上下文韵律信息;
    通过对所述第一中间向量信息中各个结点的上下文韵律信息进行解码,生成对应的梅尔频谱信息。
  19. 如权利要求16所述的计算机可读存储介质,其中,所述处理器执行所述计算机程序时还实现:
    通过所述应用层将所述待合成文本拆分为各个字词,并获取各个字词之间的顺序关系;
    对各个字词以及各个所述字词之间的顺序关系进行映射转换,得到所述待合成文本对应的图嵌入向量信息。
  20. 如权利要求16所述的计算机可读存储介质,其中,所述处理器执行所述计算机程序时还实现:
    获取待训练语音文本,其中,所述待训练语音文本包括文本信息以及所述文本信息对应的语音信息;
    通过所述文本信息和所述语音信息训练预置语音序列模型,得到所述文本信息对应的图嵌入向量信息和所述语音信息对应的梅尔频谱信息;
    通过所述图嵌入向量信息和所述梅尔频谱信息得到对应的损失函数,并通过所述损失函数更新所述预置语音序列模型的模型参数,生成对应的语音合成模型。
PCT/CN2021/084215 2020-12-11 2021-03-31 语音合成方法、装置、设备及存储介质 WO2022121179A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011446751.8 2020-12-11
CN202011446751.8A CN112349269A (zh) 2020-12-11 2020-12-11 语音合成方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2022121179A1 true WO2022121179A1 (zh) 2022-06-16

Family

ID=74427800

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084215 WO2022121179A1 (zh) 2020-12-11 2021-03-31 语音合成方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN112349269A (zh)
WO (1) WO2022121179A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349269A (zh) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 语音合成方法、装置、设备及存储介质
CN112786009A (zh) * 2021-02-26 2021-05-11 平安科技(深圳)有限公司 语音合成方法、装置、设备及存储介质
CN112948584B (zh) * 2021-03-03 2023-06-23 北京百度网讯科技有限公司 短文本分类方法、装置、设备以及存储介质
CN113096641B (zh) * 2021-03-29 2023-06-13 北京大米科技有限公司 信息处理方法及装置
CN113345412A (zh) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 语音合成方法、装置、设备以及存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264991A (zh) * 2019-05-20 2019-09-20 平安科技(深圳)有限公司 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质
CN110288972A (zh) * 2019-08-07 2019-09-27 北京新唐思创教育科技有限公司 语音合成模型训练方法、语音合成方法及装置
CN110782870A (zh) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质
CN111754973A (zh) * 2019-09-23 2020-10-09 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN111816158A (zh) * 2019-09-17 2020-10-23 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN111954903A (zh) * 2018-12-11 2020-11-17 微软技术许可有限责任公司 多说话者神经文本到语音合成
CN111951781A (zh) * 2020-08-20 2020-11-17 天津大学 一种基于图到序列的中文韵律边界预测的方法
CN112349269A (zh) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 语音合成方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN112349269A (zh) 2021-02-09

Similar Documents

Publication Publication Date Title
JP7464621B2 (ja) 音声合成方法、デバイス、およびコンピュータ可読ストレージ媒体
WO2022121179A1 (zh) 语音合成方法、装置、设备及存储介质
JP7066349B2 (ja) 翻訳方法、翻訳装置及びコンピュータプログラム
WO2022178941A1 (zh) 语音合成方法、装置、设备及存储介质
KR20210146368A (ko) 숫자 시퀀스에 대한 종단 간 자동 음성 인식
KR20220004737A (ko) 다국어 음성 합성 및 언어간 음성 복제
WO2022188734A1 (zh) 一种语音合成方法、装置以及可读存储介质
JP4901155B2 (ja) 音声認識装置が使用するのに適した文法を生成するための方法、媒体、およびシステム
JP2023535230A (ja) 2レベル音声韻律転写
WO2020248393A1 (zh) 语音合成方法、系统、终端设备和可读存储介质
CN115485766A (zh) 使用bert模型的语音合成韵律
CN112352275A (zh) 具有多级别文本信息的神经文本到语音合成
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN112365878A (zh) 语音合成方法、装置、设备及计算机可读存储介质
JP2024505076A (ja) 多様で自然なテキスト読み上げサンプルを生成する
CN118043885A (zh) 用于半监督语音识别的对比孪生网络
CN116343747A (zh) 语音合成方法、语音合成装置、电子设备及存储介质
US11960852B2 (en) Robust direct speech-to-speech translation
CN113450758B (zh) 语音合成方法、装置、设备及介质
JP2010139745A (ja) 統計的発音変異モデルを記憶する記録媒体、自動音声認識システム及びコンピュータプログラム
CN113823259A (zh) 将文本数据转换为音素序列的方法及设备
EP4218006A1 (en) Using speech recognition to improve cross-language speech synthesis
CN116844522A (zh) 音律边界标签标注方法和语音合成方法
Zahariev et al. An approach to speech ambiguities eliminating using semantically-acoustical analysis
US20220189455A1 (en) Method and system for synthesizing cross-lingual speech

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901895

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901895

Country of ref document: EP

Kind code of ref document: A1