WO2022178941A1 - Speech synthesis method and apparatus, and device and storage medium - Google Patents

Speech synthesis method and apparatus, and device and storage medium

Info

Publication number
WO2022178941A1
Authority
WO
WIPO (PCT)
Prior art keywords
style
audio
text
vector information
speech
Application number
PCT/CN2021/084167
Other languages
French (fr)
Chinese (zh)
Inventor
孙奥兰 (Sun Aolan)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Original Assignee
平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2022178941A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • the present application relates to the technical field of speech processing, and in particular, to a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • One of the purposes of the embodiments of the present application is to provide a speech synthesis method, apparatus, computer device and computer-readable storage medium, so as to solve the prior-art technical problem that the speaking style cannot be individually controlled and the emotional expression of the synthesized speech is very limited.
  • an embodiment of the present application provides a speech synthesis method, including:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • an embodiment of the present application provides a speech synthesis apparatus, including:
  • the first acquisition module is used to acquire the text to be processed and the audio of the speech style to be synthesized, and input the text to be processed and the audio of the speech style to be synthesized into a preset speech synthesis model, wherein the speech synthesis model includes multiple reference encoder, text encoder, fully connected layer and output layer;
  • a second acquisition module, configured to encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;
  • a third acquisition module, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information;
  • a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;
  • an output module, configured to perform feature extraction on the Mel spectrogram through the output layer and output the target audio of the text to be processed.
  • an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements when executing the computer program:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium stores a computer program that, when executed by a processor, implements:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • the embodiments of the present application have the following beneficial effects: the text to be processed and the speaking-style audio to be synthesized are acquired and input into a preset speech synthesis model,
  • where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer; the speaking-style audio to be synthesized is encoded based on the multi-reference encoder to obtain style embedding vector information; the text to be processed is encoded based on the text encoder to obtain text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate a Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 3 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;
  • FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of the present application.
  • Embodiments of the present application provide a speech synthesis method, apparatus, computer device, and computer-readable storage medium.
  • the speech synthesis method may be applied to a computer device, and the computer device may be an electronic device such as a notebook computer or a desktop computer.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method includes steps S101 to S105.
  • Step S101: Acquire the text to be processed and the speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a connection layer, and an output layer.
  • the text to be processed and the speaking-style audio to be synthesized are acquired, where the text to be processed includes short sentences or short texts, and the speaking-style audio to be synthesized covers timbre, emotion, and prosody.
  • the acquisition method includes acquiring pre-stored text to be processed and/or speech-style audio to be synthesized through a preset storage path, or acquiring pre-stored text to be processed and/or speech-style audio to be synthesized from a preset blockchain.
  • when the text to be processed and the speaking-style audio to be synthesized are acquired, they are input into the preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, and the like.
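  • For illustration only (not part of the application's disclosure), the following is a minimal Python sketch of step S101's input preparation, assuming the style reference audio is converted to a log-mel representation before it reaches the reference encoders; the file paths, sample rate and 80-mel configuration are assumptions for the example only.

```python
import librosa
import numpy as np

def load_inputs(text_path: str, audio_path: str, n_mels: int = 80):
    # Text to be processed: a short sentence or short text.
    with open(text_path, encoding="utf-8") as f:
        text = f.read().strip()

    # Speaking-style reference audio (carries timbre / emotion / prosody).
    wav, sr = librosa.load(audio_path, sr=22050)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)    # log compression, a common TTS front end
    return text, log_mel.T          # (frames, n_mels)
```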
  • Step S102 Encode the speech style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information.
  • the speech style audio to be synthesized is encoded by the multi-reference encoder in the speech synthesis model to obtain style embedding vector information corresponding to the speech style audio to be synthesized.
  • the reference encoder is composed of a convolutional neural network (Convolutional Neural Network, CNN) and a recurrent neural network (Recurrent Neural Network, RNN); the convolutional part consists of multiple two-dimensional convolutional layers, and the recurrent part consists of one RNN.
  • the kernel of each two-dimensional convolutional layer may be 3*3, and the stride may be 2*2.
  • for example, if the CNN part has six two-dimensional convolutional layers, their output channels can be set to 32, 32, 64, 64, 128 and 128 in sequence.
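  • For illustration only, the following is a minimal PyTorch sketch of one such reference encoder: six two-dimensional convolutional layers with 3*3 kernels, 2*2 strides and output channels 32, 32, 64, 64, 128, 128, followed by a single recurrent layer. The GRU choice, the batch normalization and the 128-dimensional output are assumptions; the application only specifies a CNN followed by an RNN.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, out_dim: int = 128):
        super().__init__()
        channels = [1, 32, 32, 64, 64, 128, 128]
        self.convs = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(channels[i], channels[i + 1],
                          kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(channels[i + 1]),
                nn.ReLU(),
            )
            for i in range(6)
        ])
        # Each stride-2 convolution halves the mel axis (rounding up).
        freq = n_mels
        for _ in range(6):
            freq = (freq + 1) // 2
        self.rnn = nn.GRU(128 * freq, out_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, frames, n_mels) -> add a channel axis for Conv2d.
        x = self.convs(mel.unsqueeze(1))            # (B, 128, T', F')
        b, c, t, f = x.shape
        x = x.transpose(1, 2).reshape(b, t, c * f)  # keep the time axis
        _, h = self.rnn(x)                          # final hidden state
        return h.squeeze(0)                         # (B, out_dim) reference embedding
```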
  • step S102 includes: sub-step S1021 to sub-step S1022.
  • Sub-step S1021: Encode the timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio with the plurality of reference encoders, respectively, to obtain reference embedded latent vector information.
  • the speech synthesis model includes a plurality of reference encoders, and each reference encoder encodes the timbre, emotional and prosodic speaking-style audio to obtain the target reference embedding vector corresponding to the speaking-style audio to be synthesized.
  • specifically, audio features are extracted from each of the timbre, emotional and prosodic speaking-style audio and passed in turn through each two-dimensional convolutional layer of the convolutional neural network in the reference encoder; the resulting tensor is transformed into a three-dimensional tensor while the time resolution of the output is preserved, and the three-dimensional tensor is then processed by the recurrent neural network layer in the reference encoder to obtain the reference embedded latent vector information corresponding to the timbre, emotional and prosodic speaking-style audio.
  • Sub-step S1022 Calculate the reference embedded latent vector information according to the multi-head attention mechanism to obtain style embedded vector information.
  • after the reference embedded latent vector information is obtained, the multi-head attention mechanism is used to calculate the similarity between the preset vector corresponding to each preset style tag and the reference embedded latent vector information.
  • these similarities determine the style weight of each preset style tag for the timbre, emotional and prosodic speaking-style audio: the similarities of all preset style tags are accumulated into a total similarity, the ratio of each tag's similarity to the total similarity is calculated, and that ratio is taken as the tag's style weight.
  • for example, if there are five preset style tags whose similarities to the reference embedded latent vector information are 0.6, 0.3, 0.4, 0.4 and 0.3, the total similarity is 2, and the style weights of the tags for the timbre, emotional and prosodic speaking-style audio are 0.3, 0.15, 0.2, 0.2 and 0.15, respectively.
  • the style weight of each style tag is then multiplied by the reference embedded latent vector information to obtain the style embedding vector of each preset style tag, and the style embedding vectors of all tags are accumulated to obtain the target style embedding vector corresponding to the speaking-style audio to be synthesized.
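  • As a hedged illustration of sub-step S1022 under the ratio-based weighting described above, the sketch below scales the tag vectors by their weights, which is the usual style-token formulation; the text can also be read as scaling the reference embedding itself, and a full multi-head attention would additionally split queries and keys into several heads. The tag values and dimensions are assumptions.

```python
import torch

def style_embedding(ref_embed: torch.Tensor,
                    style_tags: torch.Tensor) -> torch.Tensor:
    """ref_embed: (dim,) reference embedded latent vector;
    style_tags: (num_tags, dim) preset vectors of the style tags."""
    sims = style_tags @ ref_embed   # one similarity per preset style tag
    weights = sims / sims.sum()     # ratio to the total similarity, e.g.
                                    # 0.6,0.3,0.4,0.4,0.3 -> 0.3,0.15,0.2,0.2,0.15
    # Accumulate the weight-scaled tag vectors into one style embedding.
    return (weights.unsqueeze(1) * style_tags).sum(dim=0)
```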
  • Step S103 Encode the text to be processed based on the text encoder to obtain text encoding vector information.
  • the text to be processed is encoded by the text encoder to obtain the corresponding text encoding vector information; for example, the text encoder includes a weight matrix, and the text to be processed is mapped through the weight matrix to obtain the corresponding text encoding vector information.
  • step S103 includes: sub-step S1031 to sub-step S1032.
  • Sub-step S1031: Split the text to be processed into individual words through the text encoder, and obtain the sequence relationship between the words.
  • when the encoder detects the text to be processed, it splits the text into individual words and obtains the sequence relationship between them. For example, the text to be processed is "我爱中国" ("I love China"), which is split into "我", "爱", "中", "国", and the obtained order is "我"--"爱"--"中"--"国".
  • Sub-step S1032: Map and convert the words and the sequence relationships between them to generate the text encoding vector information of the text to be synthesized.
  • when the words of the text to be processed and the sequence relationships between them are obtained, each word and the sequence relationships are mapped to obtain word vector information for each word and sequence vector information between the words, i.e., edge vector information; the word vector information and the edge vector information are combined to obtain the corresponding text encoding vector information, where the weight in the edge vector information is 0.
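  • The following is a minimal sketch of sub-steps S1031 and S1032, assuming a character-level vocabulary and an embedding lookup standing in for the weight matrix; the vocabulary and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab: dict, dim: int = 128):
        super().__init__()
        self.vocab = vocab                          # e.g. {"我": 0, "爱": 1, "中": 2, "国": 3}
        self.embed = nn.Embedding(len(vocab), dim)  # the weight matrix

    def forward(self, text: str):
        tokens = list(text)                         # "我爱中国" -> ["我", "爱", "中", "国"]
        ids = torch.tensor([self.vocab[t] for t in tokens])
        word_vecs = self.embed(ids)                 # (seq_len, dim) word vector information
        # Sequence relation between neighbouring words, kept as zero-weighted edges.
        edges = [(i, i + 1, 0.0) for i in range(len(tokens) - 1)]
        return word_vecs, edges
```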
  • Step S104: Splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram.
  • the Mel spectrogram is obtained by splicing the style embedding vector information and the text encoding vector information.
  • for example, the dimension information of the style embedding vector information and of the text encoding vector information is obtained respectively, and the two are spliced in the same dimension to generate the Mel spectrogram.
  • in an embodiment, the style embedding vector information is obtained through connection-layer broadcasting and is connected with the text encoding vector information to obtain splicing vector information; the splicing vector information is then decoded by a preset decoder to generate the Mel spectrogram.
  • specifically, the connection layer sends a broadcast to each multi-reference encoder; when a multi-reference encoder finishes encoding the speaking-style audio to be synthesized and obtains style embedding vector information, it sends that information to the connection layer of the fully connected layer.
  • when the connection layer obtains the style embedding vector information, it obtains the dimension information of the style embedding vector information and of the text encoding vector information respectively and splices the two according to that dimension information; the splicing includes dimension splicing.
  • for example, the dimension coordinates of the style embedding vector information and of the text encoding vector are determined, and the style embedding vector information and the text encoding vector information are spliced at the same dimension coordinates to obtain the corresponding splicing vector information.
  • when the splicing vector information is obtained, it is input into the preset decoder, which decodes it to generate the corresponding Mel spectrogram.
  • for example, the decoder converts the incoming splicing vector information into spectral signal information through its own decoding, and generates the Mel spectrogram from the spectral signal information.
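  • As an illustrative sketch of step S104 (the stand-in GRU decoder is an assumption; the application does not fix the decoder's structure), the style embedding is broadcast across the text sequence, concatenated with the text encoding along the same feature dimension, and decoded into Mel-spectrogram frames.

```python
import torch
import torch.nn as nn

def splice(style: torch.Tensor, text_enc: torch.Tensor) -> torch.Tensor:
    # style: (style_dim,); text_enc: (seq_len, text_dim)
    tiled = style.unsqueeze(0).expand(text_enc.size(0), -1)  # broadcast over time
    return torch.cat([text_enc, tiled], dim=-1)              # splice at the same dimension

class MelDecoder(nn.Module):
    def __init__(self, in_dim: int, n_mels: int = 80):
        super().__init__()
        self.rnn = nn.GRU(in_dim, 256, batch_first=True)
        self.proj = nn.Linear(256, n_mels)

    def forward(self, spliced: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(spliced.unsqueeze(0))
        return self.proj(out).squeeze(0)   # (seq_len, n_mels) Mel spectrogram
```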
  • Step S105: Perform feature extraction on the Mel spectrogram through the output layer, and output the target audio of the text to be processed.
  • after the Mel spectrum information is obtained, the output layer outputs the speech synthesis information of the Mel spectrum information.
  • for example, the output layer includes a vocoder; the vocoder acquires the voice and audio domain feature information in the Mel spectrum information, and generates speech synthesis information by synthesizing these voice and audio domain features.
  • in an embodiment, performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed includes: extracting the voice and audio domain features in the Mel spectrum information through the output layer, mapping the voice and audio domain features, and outputting the target audio of the text to be processed.
  • exemplarily, the voice and audio domain features in the Mel spectrum information are extracted through the output layer, and these features are mapped to obtain the speech synthesis information output by the output layer.
  • for example, the output layer includes an extraction layer and a mapping layer: the voice and audio domain features in the Mel spectrum information are extracted through the extraction layer, and the features are activated and mapped through the activation function in the mapping layer to obtain the speech synthesis information.
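  • For illustration, a minimal sketch of the extraction-layer/mapping-layer reading of step S105; this toy module stands in for a real neural vocoder (which the application does not pin down), and the convolution size, hop length and Tanh activation are assumptions.

```python
import torch
import torch.nn as nn

class OutputLayer(nn.Module):
    def __init__(self, n_mels: int = 80, hop: int = 256):
        super().__init__()
        self.extract = nn.Conv1d(n_mels, 256, kernel_size=5, padding=2)  # extraction layer
        self.map = nn.Sequential(nn.Linear(256, hop), nn.Tanh())         # activation mapping layer

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (frames, n_mels) -> (1, n_mels, frames) for Conv1d
        feats = self.extract(mel.T.unsqueeze(0)).squeeze(0).T  # (frames, 256) domain features
        frames = self.map(feats)                               # (frames, hop) waveform chunks
        return frames.reshape(-1)                              # target audio samples
```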
  • the acquired text to be processed and speaking-style audio to be synthesized are input into a preset speech synthesis model and encoded to obtain style embedding vector information and text encoding vector information; the style embedding vector information is spliced with the text encoding vector information to generate the Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.
  • FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • the speech synthesis apparatus 400 includes: a first acquisition module 401 , a second acquisition module 402 , a third acquisition module 403 , a generation module 404 , and an output module 405 .
  • the first acquisition module 401 is used to acquire the text to be processed and the speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • a second obtaining module 402 configured to encode the speech style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information
  • a third acquiring module 403, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information
  • the generating module 404 is used to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;
  • the output module 405 is configured to perform feature extraction on the Mel spectrogram through the output layer, and output the target audio of the text to be processed.
  • the second obtaining module 402 is also specifically used for:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are respectively encoded to obtain reference embedded latent vector information
  • the reference embedded latent vector information is calculated according to the multi-head attention mechanism to obtain style embedded vector information.
  • the second obtaining module 402 is also specifically used for:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are encoded by the convolutional neural networks in the multiple reference encoders, so as to obtain the three-dimensional tensor of the speaking style audio to be synthesized;
  • the three-dimensional tensor is processed by the recurrent neural network in the reference encoder to obtain the reference embedded latent vector information of the speech style audio to be synthesized.
  • the second obtaining module 402 is also specifically used for:
  • the similarity between the preset vector of each preset style tag and the reference embedded latent vector information is calculated through the multi-head attention mechanism, the style weight of each preset style tag is determined from the similarity, and the style embedding vector of each preset style tag is obtained and accumulated to obtain the style embedding vector information.
  • the third acquiring module 403 is also specifically used for: splitting the text to be processed into individual words through the text encoder, obtaining the sequence relationship between the words, and mapping and converting the words and the sequence relationships to generate the text encoding vector information;
  • the generating module 404 is also specifically used for: obtaining the style embedding vector information through connection-layer broadcasting, and connecting it with the text encoding vector information to obtain splicing vector information;
  • the splicing vector information is decoded by the preset decoder to generate a Mel spectrogram.
  • the output module is also used for:
  • the voice and audio domain features in the Mel spectrum information are extracted through the output layer, and the voice and audio domain features are mapped to output the target audio of the text to be processed.
  • the apparatuses provided by the above embodiments may be implemented in the form of a computer program, and the computer program may be executed on the computer device as shown in FIG. 5 .
  • FIG. 5 is a schematic block diagram of the structure of a computer device according to an embodiment of the present application.
  • the computer device may be a terminal.
  • the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
  • the nonvolatile storage medium can store operating systems and computer programs.
  • the computer program includes program instructions that, when executed, cause the processor to perform any speech synthesis method.
  • the processor is used to provide computing and control capabilities to support the operation of the entire computer equipment.
  • the internal memory provides an environment for running the computer program in the non-volatile storage medium.
  • when the computer program is executed by the processor, it can cause the processor to execute any speech synthesis method.
  • the network interface is used for network communication, such as sending assigned tasks.
  • FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the processor may be a central processing unit (Central Processing Unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor can be a microprocessor or the processor can also be any conventional processor or the like.
  • the processor is configured to run a computer program stored in the memory to implement the following steps:
  • the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
  • Feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
  • in an embodiment, the multi-reference encoder includes multiple reference encoders and a multi-head attention mechanism;
  • the speaking-style audio to be synthesized includes timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio; when the processor implements encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information, it is configured to implement:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are respectively encoded to obtain reference embedded latent vector information
  • the reference embedded latent vector information is calculated according to the multi-head attention mechanism to obtain style embedded vector information.
  • when the processor implements encoding the speaking-style audio to be synthesized with the plurality of reference encoders to obtain reference embedded latent vector information, it is configured to implement:
  • the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio are encoded by the convolutional neural networks in the multiple reference encoders, so as to obtain the three-dimensional tensor of the speaking style audio to be synthesized;
  • the three-dimensional tensor is processed by the recurrent neural network in the reference encoder to obtain the reference embedded latent vector information of the speech style audio to be synthesized.
  • when the processor implements calculating the reference embedded latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information, it is configured to implement:
  • calculating the similarity between the preset vector of each preset style tag and the reference embedded latent vector information, determining the style weight of each preset style tag from the similarity, and obtaining and accumulating the style embedding vector of each preset style tag to obtain the style embedding vector information.
  • when the processor implements encoding the text to be processed based on the text encoder to obtain text encoding vector information, it is configured to implement: splitting the text to be processed into individual words through the text encoder, obtaining the sequence relationship between the words, and mapping and converting the words and the sequence relationships to generate the text encoding vector information.
  • the fully connected layer includes a connection layer and a preset decoder; when the processor implements splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram, it is configured to implement:
  • obtaining the style embedding vector information through connection-layer broadcasting, and connecting it with the text encoding vector information to obtain splicing vector information;
  • the splicing vector information is decoded by the preset decoder to generate a Mel spectrogram.
  • when the processor implements performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed, it is configured to implement:
  • extracting the voice and audio domain features in the Mel spectrum information through the output layer, and mapping the voice and audio domain features to output the target audio of the text to be processed.
  • Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile; a computer program is stored on the computer-readable storage medium, the computer program includes program instructions, and for the method implemented when the program instructions are executed, reference may be made to the embodiments of the speech synthesis method of the present application.
  • the computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) equipped on the computer device.
  • the computer-readable storage medium may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, an application program required by at least one function, and the like, and the storage data area may store data created according to the use of the blockchain node, and the like.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed storage of the preset speech synthesis model, point-to-point transmission, consensus mechanisms, and encryption algorithms.
  • a blockchain is essentially a decentralized database: a chain of data blocks associated with one another by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech synthesis method and apparatus, and a computer device and a computer-readable storage medium. The method comprises: acquiring text to be processed and speaking-style audio to be synthesized, and inputting said text and said speaking-style audio into a preset speech synthesis model; encoding said speaking-style audio on the basis of a multi-reference encoder, so as to obtain style embedding vector information; encoding said text on the basis of a text encoder, so as to obtain text encoding vector information; splicing the style embedding vector information and the text encoding vector information by means of a fully connected layer, so as to generate a Mel spectrogram; and performing feature extraction on the Mel spectrogram by means of an output layer, and outputting target audio of said text. Control over the speaking style of synthesized speech is thus realized, such that speech with richer emotional expression is synthesized.

Description

Speech synthesis method, apparatus, device and storage medium

This application claims priority to the Chinese patent application No. 202110218672.X, entitled "Speech synthesis method, apparatus, device and storage medium", filed with the Patent Office of the State Intellectual Property Office of the People's Republic of China on February 26, 2021, the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the technical field of speech processing, and in particular, to a speech synthesis method, apparatus, computer device and computer-readable storage medium.
Background

In the process of speech synthesis, not only the clarity and fluency of the synthesized speech must be considered, but also its prosody information, so that the synthesized speech has rich emotional expression. When synthesizing speech, beyond the smoothness of sentences, changes in the speaker's emotional state must also be considered, and a model is used to learn the style information of reference audio so as to reach a level comparable to the human voice. The inventor realized that in current prosody model construction, the common approach is to group all speaking styles into a single expression; the speaking styles cannot be separated, so they cannot be individually controlled, and the emotional expression of the synthesized speech is very limited.

Technical Problem

One of the purposes of the embodiments of the present application is to provide a speech synthesis method, apparatus, computer device and computer-readable storage medium, so as to solve the prior-art technical problem that the speaking style cannot be individually controlled and the emotional expression of the synthesized speech is very limited.
Technical Solutions

In a first aspect, an embodiment of the present application provides a speech synthesis method, including:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.

In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:

a first acquisition module, configured to acquire text to be processed and speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

a second acquisition module, configured to encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

a third acquisition module, configured to encode the text to be processed based on the text encoder to obtain text encoding vector information;

a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

an output module, configured to perform feature extraction on the Mel spectrogram through the output layer and output the target audio of the text to be processed.

In a third aspect, an embodiment of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, which may be non-volatile or volatile, and which stores a computer program; when the computer program is executed by a processor, it implements:

acquiring text to be processed and speaking-style audio to be synthesized, and inputting the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;

encoding the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information;

encoding the text to be processed based on the text encoder to obtain text encoding vector information;

splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram;

performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
Beneficial Effects

Compared with the prior art, the embodiments of the present application have the following beneficial effects: the text to be processed and the speaking-style audio to be synthesized are acquired and input into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer; the speaking-style audio to be synthesized is encoded based on the multi-reference encoder to obtain style embedding vector information; the text to be processed is encoded based on the text encoder to obtain text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate a Mel spectrogram; feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output, thereby realizing control over the speaking style of the synthesized speech and synthesizing speech with richer emotional expression.

Description of Drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.

FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;

FIG. 3 is a schematic flowchart of sub-steps of the speech synthesis method in FIG. 1;

FIG. 4 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application;

FIG. 5 is a schematic structural block diagram of a computer device according to an embodiment of the present application.

The realization, functional characteristics and advantages of the purpose of the present application will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Embodiments of the Present Invention

The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.

The flowcharts shown in the figures are only illustrative; they need not include all of the contents and operations/steps, nor be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual execution order may change according to the actual situation.

Embodiments of the present application provide a speech synthesis method, apparatus, computer device and computer-readable storage medium. The speech synthesis method may be applied to a computer device, which may be an electronic device such as a notebook computer or a desktop computer.

Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The embodiments described below and the features in the embodiments may be combined with each other without conflict.

Please refer to FIG. 1, which is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.

As shown in FIG. 1, the speech synthesis method includes steps S101 to S105.
Step S101: Acquire text to be processed and speaking-style audio to be synthesized, and input the text to be processed and the speaking-style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a connection layer, and an output layer.

Exemplarily, the text to be processed and the speaking-style audio to be synthesized are acquired; the text to be processed includes a short sentence or a short text, and the speaking-style audio to be synthesized covers timbre, emotion and prosody. The acquisition methods include obtaining pre-stored text to be processed and/or speaking-style audio to be synthesized through a preset storage path, or obtaining them from a preset blockchain. When the text to be processed and the speaking-style audio to be synthesized are acquired, they are input into the preset speech synthesis model, which includes a multi-reference encoder, a text encoder, and the like.

Step S102: Encode the speaking-style audio to be synthesized based on the multi-reference encoder to obtain style embedding vector information.

Exemplarily, the speaking-style audio to be synthesized is encoded by the multi-reference encoder in the speech synthesis model to obtain the style embedding vector information corresponding to the speaking-style audio to be synthesized. In an embodiment, the reference encoder is composed of a convolutional neural network (CNN) and a recurrent neural network (RNN); the convolutional part consists of multiple two-dimensional convolutional layers, and the recurrent part consists of one RNN. The kernel of each two-dimensional convolutional layer may be 3*3 and the stride may be 2*2; for example, if the CNN part has six two-dimensional convolutional layers, their output channels may be set to 32, 32, 64, 64, 128 and 128 in sequence.

In an embodiment, specifically, referring to FIG. 2, step S102 includes sub-step S1021 to sub-step S1022.

Sub-step S1021: Encode the timbre speaking-style audio, emotional speaking-style audio and prosodic speaking-style audio with the plurality of reference encoders, respectively, to obtain reference embedded latent vector information.

Exemplarily, the speech synthesis model includes a plurality of reference encoders, and each reference encoder encodes the timbre, emotional and prosodic speaking-style audio to obtain the target reference embedding vector corresponding to the speaking-style audio to be synthesized. Specifically, the timbre, emotional and prosodic speaking-style audio are processed by the convolutional neural network in the reference encoder to obtain corresponding three-dimensional tensors: audio features are extracted from each of the timbre, emotional and prosodic speaking-style audio, passed in turn through each two-dimensional convolutional layer in the convolutional neural network to obtain a tensor, and that tensor is transformed into a three-dimensional tensor while the time resolution of the output is preserved; the three-dimensional tensor is then processed by the recurrent neural network layer in the reference encoder to obtain the reference embedded latent vector information corresponding to the timbre, emotional and prosodic speaking-style audio.

Sub-step S1022: Calculate the style embedding vector information from the reference embedded latent vector information according to the multi-head attention mechanism.

Exemplarily, after the reference embedded latent vector information is obtained, the similarity between the preset vector corresponding to each preset style tag and the reference embedded latent vector information is calculated through the multi-head attention mechanism. After these similarities are determined, the style weight of each preset style tag for the timbre, emotional and prosodic speaking-style audio is determined: the similarities of all preset style tags are accumulated to obtain a total similarity, the ratio of each tag's similarity to the total similarity is calculated, and that ratio is taken as the tag's style weight.

For example, if the number of preset style tags is 5 and the similarities between their preset vectors and the reference embedded latent vector information are 0.6, 0.3, 0.4, 0.4 and 0.3, the total similarity is 2, the ratios are 0.3, 0.15, 0.2, 0.2 and 0.15, and the style weights of the tags for the timbre, emotional and prosodic speaking-style audio are therefore 0.3, 0.15, 0.2, 0.2 and 0.15, respectively.

After the style weights are determined, the style weight of each style tag for the speaking-style audio to be synthesized is multiplied by the reference embedded latent vector information to obtain the style embedding vector of each preset style tag, and the style embedding vectors of all tags are accumulated to obtain the target style embedding vector corresponding to the speaking-style audio to be synthesized.
Step S103: Encode the text to be processed based on the text encoder to obtain text encoding vector information.

Exemplarily, the text to be processed is encoded by the text encoder to obtain the corresponding text encoding vector information. For example, the text encoder includes a weight matrix, and the text to be processed is mapped through the weight matrix to obtain the corresponding text encoding vector information.

In an embodiment, specifically, referring to FIG. 3, step S103 includes sub-step S1031 to sub-step S1032.

Sub-step S1031: Split the text to be processed into individual words through the text encoder, and obtain the sequence relationship between the words.

Exemplarily, when the encoder detects the text to be processed, it splits the text into individual words and obtains the sequence relationship between them. For example, the text to be processed is "我爱中国" ("I love China"), which is split into "我", "爱", "中", "国", and the obtained order is "我"--"爱"--"中"--"国".

Sub-step S1032: Map and convert the words and the sequence relationships between them to generate the text encoding vector information of the text to be synthesized.

Exemplarily, when the words of the text to be processed and the sequence relationships between them are obtained, each word and the sequence relationships are mapped to obtain word vector information for each word and sequence vector information between the words, i.e., edge vector information; the word vector information and the edge vector information are combined to obtain the corresponding text encoding vector information, where the weight in the edge vector information is 0.
步骤S104、通过所述全连接层对所述风格嵌入向量信息和所述文本编码向量信息进行拼接,生成梅尔语谱图。Step S104, splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel language spectrogram.
示范性的,通过将该风格嵌入向量信息和文本编码向量信息进行拼接,得到梅尔语谱图。例如,分别获取风格嵌入向量信息和文本编码向量信息的维度信息,在同一维度上将风格嵌入向量信息和文本编码向量信息,进行拼接,生成梅尔语谱图。Exemplarily, the Mel language spectrogram is obtained by splicing the style embedding vector information and the text encoding vector information. For example, the dimension information of the style embedding vector information and the text encoding vector information are obtained respectively, and the style embedding vector information and the text encoding vector information are spliced in the same dimension to generate a Mel language spectrogram.
In an embodiment, the style embedding vector information is acquired through a broadcast of the connection layer, and the acquired style embedding vector information is connected with the text encoding vector information to obtain spliced vector information; the spliced vector information is decoded by a preset decoder to generate the Mel spectrogram.
Exemplarily, the style embedding vector information is acquired through a broadcast of the connection layer and is connected with the text encoding vector information to obtain the spliced vector information. As an example, the connection layer sends a broadcast to each multi-reference encoder; when the multi-reference encoders encode the speaking style audio to be synthesized and obtain the style embedding vector information, each multi-reference encoder sends the obtained style embedding vector information to the connection layer of the fully connected layer. Upon acquiring the style embedding vector information, the connection layer acquires the dimension information of the style embedding vector information and of the text encoding vector information respectively and splices the two accordingly, the splicing including dimension splicing. For example, the dimension information of the style embedding vector information and of the text encoding vector information is acquired, the dimension coordinates of the style embedding vector information and of the text encoding vector are determined, and the style embedding vector information and the text encoding vector information are spliced at the same dimension coordinates to obtain the corresponding spliced vector information.
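By way of illustration only, the dimension splicing can be sketched as broadcasting the single style embedding vector along the time axis of the text encoding and concatenating the two at matching coordinates. The shapes and variable names below are illustrative assumptions, not values fixed by the patent.

```python
import torch

T, d_text, d_style = 4, 8, 6
text_enc = torch.randn(T, d_text)    # one encoding vector per input character
style_emb = torch.randn(d_style)     # single style embedding vector

# Broadcast the style embedding across the time axis, then splice it with
# the text encoding at the same (time) coordinates along the feature dim.
spliced = torch.cat([text_enc, style_emb.expand(T, d_style)], dim=-1)
print(spliced.shape)  # torch.Size([4, 14])
```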
When the spliced vector information is obtained, it is input into the preset decoder, and the spliced vector information is decoded by the preset decoder to generate the corresponding Mel spectrogram. For example, the decoder decodes the incoming spliced vector information into spectral signal information, and the Mel spectrogram is generated from the spectral signal information.
Step S105: feature extraction is performed on the Mel spectrogram through the output layer, and the target audio of the text to be processed is output.
Exemplarily, after the Mel spectrum information is acquired, the speech synthesis information of the Mel spectrum information is output through the output layer. For example, the output layer includes a vocoder; the vocoder acquires the speech frequency-domain feature information in the Mel spectrum information and generates the speech synthesis information by synthesizing this feature information.
Specifically, performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed includes: extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
Exemplarily, when the Mel spectrum information is acquired, the speech frequency-domain features therein are extracted through the output layer; after extraction, the speech frequency-domain features are mapped, and the speech synthesis information output by the output layer is acquired. For example, the output layer includes an extraction layer and a mapping layer: the extraction layer extracts the speech frequency-domain features from the Mel spectrum information, and the activation function in the mapping layer performs activation mapping on these features to obtain the speech synthesis information, as sketched below.
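By way of illustration only, since the description does not fix a concrete vocoder, the following sketch stands in for the output layer with Griffin-Lim inversion of a mel power spectrogram via librosa; the array shape and parameter values are assumptions, and a neural vocoder would be a drop-in alternative.

```python
import numpy as np
import librosa

# Assume `mel` is an (n_mels, frames) power mel spectrogram produced by the
# decoder; here it is random data standing in for real decoder output.
mel = np.abs(np.random.randn(80, 200)) ** 2

# Griffin-Lim inversion recovers a waveform from the mel spectrogram,
# playing the role of the unspecified vocoder in the output layer.
audio = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256
)
print(audio.shape)  # roughly hop_length * frames waveform samples
```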
In the embodiments of the present application, the acquired text to be processed and speaking style audio to be synthesized are input into a preset speech synthesis model for encoding to obtain style embedding vector information and text encoding vector information; the style embedding vector information and the text encoding vector information are spliced through the fully connected layer to generate the Mel spectrogram; and feature extraction is performed on the Mel spectrogram through the output layer to output the target audio of the text to be processed. The speaking style of the synthesized speech can thus be controlled, and speech with richer emotional expression can be synthesized.
Please refer to FIG. 4, which is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
As shown in FIG. 4, the speech synthesis apparatus 400 includes: a first acquisition module 401, a second acquisition module 402, a third acquisition module 403, a generation module 404 and an output module 405.
The first acquisition module 401 is configured to acquire text to be processed and speaking style audio to be synthesized, and to input the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, where the speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer.
The second acquisition module 402 is configured to encode the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information.
The third acquisition module 403 is configured to encode the text to be processed on the basis of the text encoder to obtain text encoding vector information.
The generation module 404 is configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram.
The output module 405 is configured to perform feature extraction on the Mel spectrogram through the output layer and to output the target audio of the text to be processed.
The second acquisition module 402 is further specifically configured to:
encode the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
calculate the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
The second acquisition module 402 is further specifically configured to:
encode the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
process the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized, as illustrated by the sketch below.
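By way of illustration only, one reference encoder of the kind just described can be sketched as a small convolution stack producing a three-dimensional feature tensor per utterance, followed by a recurrent network whose final state serves as the reference embedding latent vector. The class name, layer sizes and channel counts are illustrative assumptions, not the patented configuration.

```python
import torch
import torch.nn as nn

class TinyReferenceEncoder(nn.Module):
    """Hedged sketch of one reference encoder: CNN -> 3-D tensor -> RNN."""
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        # Two strided convolutions yield a (channels, time, freq) tensor.
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), dim, batch_first=True)

    def forward(self, mel):                 # mel: (batch, frames, n_mels)
        x = self.conv(mel.unsqueeze(1))     # (batch, 32, frames/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        _, h = self.gru(x)                  # final RNN state as the embedding
        return h.squeeze(0)                 # (batch, dim)

ref = TinyReferenceEncoder()
print(ref(torch.randn(2, 200, 80)).shape)  # torch.Size([2, 128])
```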
The second acquisition module 402 is further specifically configured to:
acquire the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
multiply the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
accumulate the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
The third acquisition module 403 is further specifically configured to:
split the text to be processed into individual words through the text encoder and acquire the order relation between the words; and
perform mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
The generation module 404 is further specifically configured to:
acquire the style embedding vector information through a broadcast of the connection layer, and connect the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
decode the spliced vector information through the preset decoder to generate the Mel spectrogram.
The output module 405 is further configured to:
extract the speech frequency-domain features from the Mel spectrum information through the output layer, map the speech frequency-domain features, and output the target audio of the text to be processed.
It should be noted that those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the apparatus and of the modules and units described above, reference may be made to the corresponding processes in the foregoing embodiments of the speech synthesis method, which are not repeated here.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 5.
Please refer to FIG. 5, which is a schematic structural block diagram of a computer device provided by an embodiment of the present application. The computer device may be a terminal.
As shown in FIG. 5, the computer device includes a processor, a memory and a network interface connected through a system bus, where the memory may include a non-volatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions which, when executed, cause the processor to perform any one of the speech synthesis methods.
The processor is used to provide computing and control capabilities and supports the operation of the entire computer device.
The internal memory provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor, the processor is caused to perform any one of the speech synthesis methods.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art can understand that the structure shown in FIG. 5 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In an embodiment, the processor is configured to run a computer program stored in the memory to implement the following steps:
acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, where the preset speech synthesis model includes a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
In an embodiment, the multi-reference encoder includes multiple reference encoders and a multi-head attention mechanism, and the speaking style audio to be synthesized includes timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio; when implementing the encoding of the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain the style embedding vector information, the processor is configured to implement:
encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
In an embodiment, when implementing the encoding of the speaking style audio to be synthesized according to the multiple reference encoders to obtain the reference embedding latent vector information, the processor is configured to implement:
encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
In an embodiment, when implementing the calculation of the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information, the processor is configured to implement:
acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
In an embodiment, when implementing the encoding of the text to be processed on the basis of the text encoder to obtain the text encoding vector information, the processor is configured to implement:
splitting the text to be processed into individual words through the text encoder and acquiring the order relation between the words; and
performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
In an embodiment, the fully connected layer includes a connection layer and a preset decoder; when implementing the splicing of the style embedding vector information and the text encoding vector information through the fully connected layer to generate the Mel spectrogram, the processor is configured to implement:
acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
In an embodiment, when implementing the feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed, the processor is configured to implement:
extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
Embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile. A computer program is stored on the computer-readable storage medium, and the computer program includes program instructions; for the method implemented when the program instructions are executed, reference may be made to the various embodiments of the speech synthesis method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiments, such as a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store the operating system, an application program required by at least one function, and the like, and the data storage area may store data created according to the use of blockchain nodes, etc.
The blockchain referred to in this application is a new application mode of computer technologies such as the storage of the preset speech synthesis model, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database, a chain of data blocks generated in association using cryptographic methods; each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, etc.
It should be noted that, herein, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element qualified by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes the element.
The above serial numbers of the embodiments of the present application are for description only and do not represent the merits of the embodiments. The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and such modifications or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A speech synthesis method, comprising:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  2. The speech synthesis method according to claim 1, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, and the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio; and
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain the style embedding vector information comprises:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  3. The speech synthesis method according to claim 2, wherein encoding the speaking style audio to be synthesized according to the multiple reference encoders to obtain the reference embedding latent vector information comprises:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  4. The speech synthesis method according to claim 2, wherein calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information comprises:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  5. The speech synthesis method according to claim 1, wherein encoding the text to be processed on the basis of the text encoder to obtain the text encoding vector information comprises:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
  6. The speech synthesis method according to claim 1, wherein the fully connected layer comprises a connection layer and a preset decoder, and splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate the Mel spectrogram comprises:
    acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
    decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
  7. The speech synthesis method according to claim 1, wherein performing feature extraction on the Mel spectrogram through the output layer and outputting the target audio of the text to be processed comprises:
    extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
  8. A speech synthesis apparatus, comprising:
    a first acquisition module, configured to acquire text to be processed and speaking style audio to be synthesized, and to input the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    a second acquisition module, configured to encode the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    a third acquisition module, configured to encode the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    a generation module, configured to splice the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    an output module, configured to perform feature extraction on the Mel spectrogram through the output layer, and to output the target audio of the text to be processed.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  10. The computer device according to claim 9, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio, and the processor, when executing the computer program, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  11. The computer device according to claim 10, wherein the processor, when executing the computer program, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  12. The computer device according to claim 10, wherein the processor, when executing the computer program, further implements:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  13. The computer device according to claim 9, wherein the processor, when executing the computer program, further implements:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
  14. The computer device according to claim 9, wherein the fully connected layer comprises a connection layer and a preset decoder, and the processor, when executing the computer program, further implements:
    acquiring the style embedding vector information through a broadcast of the connection layer, and connecting the acquired style embedding vector information with the text encoding vector information to obtain spliced vector information; and
    decoding the spliced vector information through the preset decoder to generate the Mel spectrogram.
  15. The computer device according to claim 9, wherein the processor, when executing the computer program, further implements:
    extracting the speech frequency-domain features from the Mel spectrum information through the output layer, mapping the speech frequency-domain features, and outputting the target audio of the text to be processed.
  16. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements:
    acquiring text to be processed and speaking style audio to be synthesized, and inputting the text to be processed and the speaking style audio to be synthesized into a preset speech synthesis model, wherein the preset speech synthesis model comprises a multi-reference encoder, a text encoder, a fully connected layer and an output layer;
    encoding the speaking style audio to be synthesized on the basis of the multi-reference encoder to obtain style embedding vector information;
    encoding the text to be processed on the basis of the text encoder to obtain text encoding vector information;
    splicing the style embedding vector information and the text encoding vector information through the fully connected layer to generate a Mel spectrogram; and
    performing feature extraction on the Mel spectrogram through the output layer, and outputting the target audio of the text to be processed.
  17. The computer-readable storage medium according to claim 16, wherein the multi-reference encoder comprises multiple reference encoders and a multi-head attention mechanism, the speaking style audio to be synthesized comprises timbre speaking style audio, emotional speaking style audio and prosodic speaking style audio, and the computer program, when executed by the processor, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio respectively according to the multiple reference encoders to obtain reference embedding latent vector information; and
    calculating the reference embedding latent vector information according to the multi-head attention mechanism to obtain the style embedding vector information.
  18. The computer-readable storage medium according to claim 17, wherein the computer program, when executed by the processor, further implements:
    encoding the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio through the convolutional neural networks in the multiple reference encoders to obtain a three-dimensional tensor of the speaking style audio to be synthesized; and
    processing the three-dimensional tensor through the recurrent neural network in the reference encoder to obtain the reference embedding latent vector information of the speaking style audio to be synthesized.
  19. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    acquiring the style weights of each preset style tag in the multi-head attention mechanism with respect to the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio;
    multiplying the style weights of each preset style tag for the timbre speaking style audio, the emotional speaking style audio and the prosodic speaking style audio by the reference embedding latent vector information to obtain the style embedding vector of each preset style tag; and
    accumulating the style embedding vectors of the preset style tags to obtain the style embedding vector information of the speaking style audio to be synthesized.
  20. The computer-readable storage medium according to claim 16, wherein the computer program, when executed by the processor, further implements:
    splitting the text to be processed into individual words through the text encoder, and acquiring the order relation between the words; and
    performing mapping conversion on the words and on the order relation between the words to generate the text encoding vector information of the text to be synthesized.
PCT/CN2021/084167 2021-02-26 2021-03-30 Speech synthesis method and apparatus, and device and storage medium WO2022178941A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110218672.XA CN112786009A (en) 2021-02-26 2021-02-26 Speech synthesis method, apparatus, device and storage medium
CN202110218672.X 2021-02-26

Publications (1)

Publication Number Publication Date
WO2022178941A1 true WO2022178941A1 (en) 2022-09-01

Family

ID=75761958

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/084167 WO2022178941A1 (en) 2021-02-26 2021-03-30 Speech synthesis method and apparatus, and device and storage medium

Country Status (2)

Country Link
CN (1) CN112786009A (en)
WO (1) WO2022178941A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113345412A (en) * 2021-05-31 2021-09-03 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium
CN113345466B (en) * 2021-06-01 2024-03-01 平安科技(深圳)有限公司 Main speaker voice detection method, device and equipment based on multi-microphone scene
CN113822017A (en) * 2021-06-03 2021-12-21 腾讯科技(深圳)有限公司 Audio generation method, device, equipment and storage medium based on artificial intelligence
CN113409765B (en) * 2021-06-11 2024-04-26 北京搜狗科技发展有限公司 Speech synthesis method and device for speech synthesis
CN113506562B (en) * 2021-07-19 2022-07-19 武汉理工大学 End-to-end voice synthesis method and system based on fusion of acoustic features and text emotional features
CN113345416B (en) * 2021-08-02 2021-10-29 智者四海(北京)技术有限公司 Voice synthesis method and device and electronic equipment
CN115272537A (en) * 2021-08-06 2022-11-01 宿迁硅基智能科技有限公司 Audio driving expression method and device based on causal convolution
CN113744713A (en) * 2021-08-12 2021-12-03 北京百度网讯科技有限公司 Speech synthesis method and training method of speech synthesis model
CN113707125B (en) * 2021-08-30 2024-02-27 中国科学院声学研究所 Training method and device for multi-language speech synthesis model
CN113744716B (en) * 2021-10-19 2023-08-29 北京房江湖科技有限公司 Method and apparatus for synthesizing speech
CN114255737B (en) * 2022-02-28 2022-05-17 北京世纪好未来教育科技有限公司 Voice generation method and device and electronic equipment
CN114822495B (en) * 2022-06-29 2022-10-14 杭州同花顺数据开发有限公司 Acoustic model training method and device and speech synthesis method


Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN102103856A (en) * 2009-12-21 2011-06-22 盛大计算机(上海)有限公司 Voice synthesis method and system
WO2020209647A1 (en) * 2019-04-09 2020-10-15 네오사피엔스 주식회사 Method and system for generating synthetic speech for text through user interface
CN110718208A (en) * 2019-10-15 2020-01-21 四川长虹电器股份有限公司 Voice synthesis method and system based on multitask acoustic model
CN112270920A (en) * 2020-10-28 2021-01-26 北京百度网讯科技有限公司 Voice synthesis method and device, electronic equipment and readable storage medium
CN112382272B (en) * 2020-12-11 2023-05-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium capable of controlling speech speed

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20200394998A1 (en) * 2018-08-02 2020-12-17 Neosapience, Inc. Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN110288973A (en) * 2019-05-20 2019-09-27 平安科技(深圳)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110473516A (en) * 2019-09-19 2019-11-19 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device and electronic equipment
CN112164379A (en) * 2020-10-16 2021-01-01 腾讯科技(深圳)有限公司 Audio file generation method, device, equipment and computer readable storage medium
CN112349269A (en) * 2020-12-11 2021-02-09 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium

Cited By (2)

Publication number Priority date Publication date Assignee Title
CN115470507A (en) * 2022-10-31 2022-12-13 青岛他坦科技服务有限公司 Medium and small enterprise research and development project data management method
CN115470507B (en) * 2022-10-31 2023-02-07 青岛他坦科技服务有限公司 Medium and small enterprise research and development project data management method

Also Published As

Publication number Publication date
CN112786009A (en) 2021-05-11

Similar Documents

Publication Publication Date Title
WO2022178941A1 (en) Speech synthesis method and apparatus, and device and storage medium
CN110264991B (en) Training method of speech synthesis model, speech synthesis method, device, equipment and storage medium
CN106688034B (en) Text-to-speech conversion with emotional content
KR20220004737A (en) Multilingual speech synthesis and cross-language speech replication
WO2020232997A1 (en) Speech synthesis method and apparatus, and device and computer-readable storage medium
CN112687259B (en) Speech synthesis method, device and readable storage medium
KR20210146368A (en) End-to-end automatic speech recognition for digit sequences
TWI698857B (en) Speech recognition system and method thereof, and computer program product
WO2022121179A1 (en) Speech synthesis method and apparatus, device, and storage medium
KR102625184B1 (en) Speech synthesis training to create unique speech sounds
WO2022203699A1 (en) Unsupervised parallel tacotron non-autoregressive and controllable text-to-speech
WO2021174922A1 (en) Statement sentiment classification method and related device
CN115210809A (en) Consistent prediction of streaming sequence models
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
US20220156552A1 (en) Data conversion learning device, data conversion device, method, and program
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
CN111444379A (en) Audio feature vector generation method and audio segment representation model training method
CN112735377B (en) Speech synthesis method, device, terminal equipment and storage medium
CN111898363B (en) Compression method, device, computer equipment and storage medium for long and difficult text sentence
US11960852B2 (en) Robust direct speech-to-speech translation
WO2021114617A1 (en) Voice synthesis method and apparatus, computer device, and computer readable storage medium
CN114464163A (en) Method, device, equipment, storage medium and product for training speech synthesis model
CN113112969A (en) Buddhism music score recording method, device, equipment and medium based on neural network
Matoušek et al. VITS: quality vs. speed analysis
WO2021182199A1 (en) Information processing method, information processing device, and information processing program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21927375

Country of ref document: EP

Kind code of ref document: A1