WO2022188734A1 - Speech synthesis method, apparatus, and readable storage medium - Google Patents

Speech synthesis method, apparatus, and readable storage medium (一种语音合成方法、装置以及可读存储介质)

Info

Publication number: WO2022188734A1
Authority: WO (WIPO, PCT)
Prior art keywords: sequence, layer, attention, text, encoding
Application number: PCT/CN2022/079502
Other languages: English (en), French (fr)
Inventors: 郑艺斌, 李新辉, 卢鲤
Original assignee: 腾讯科技(深圳)有限公司
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2022188734A1
Priority to US 17/984,437 (granted as US 12033612 B2)


Classifications

    • G10L 13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L 13/047 — Architecture of speech synthesisers (details of speech synthesis systems, e.g. synthesiser structure or memory management)
    • G10L 19/04 — Speech or audio signal analysis-synthesis techniques for redundancy reduction, using predictive techniques
    • G10L 21/043 — Time compression or expansion by changing speed
    • G06N 3/0455 — Auto-encoder networks; encoder-decoder networks
    • G06N 3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N 3/08 — Learning methods (computing arrangements based on biological models; neural networks)

Definitions

  • The present application relates to the field of Internet technology, and in particular to speech synthesis.
  • The embodiments of the present application provide a speech synthesis method, apparatus, and readable storage medium, which can accelerate the convergence of the model and improve the stability of synthesized speech.
  • In one aspect, the embodiments of the present application provide a speech synthesis method, including:
  • the N encoding layers include an encoding layer E_i and an encoding layer E_(i+1), where the encoding layer E_(i+1) is the next encoding layer after the encoding layer E_i, N is an integer greater than 1, i is a positive integer, and i is less than N; the encoding layer E_(i+1) includes a first multi-head self-attention network;
  • the target text encoding sequence of the encoding layer E_(i+1) is generated according to the second attention matrix and the historical text encoding sequence, and synthesized speech data matching the text input sequence is generated based on the target text encoding sequence.
  • In another aspect, the embodiments of the present application provide a speech synthesis method, including:
  • the N initial encoding layers include an initial encoding layer X_i and an initial encoding layer X_(i+1), where the initial encoding layer X_(i+1) is the next encoding layer after the initial encoding layer X_i, N is an integer greater than 1, i is a positive integer, and i is less than N; the initial encoding layer X_(i+1) includes an initial multi-head self-attention network;
  • a speech loss function is generated according to the predicted speech data and the reference speech data, and the model parameters of the initial residual attention acoustic model are corrected with the speech loss function to obtain the residual attention acoustic model; the residual attention acoustic model is used to generate synthesized speech data that matches a text input sequence.
  • In one aspect, the embodiments of the present application provide a speech synthesis apparatus, including:
  • a conversion module, configured to convert a text input sequence into a text feature representation sequence;
  • a matrix generation module, configured to input the text feature representation sequence into an encoder comprising N encoding layers, the N encoding layers including an encoding layer E_i and an encoding layer E_(i+1), and the encoding layer E_(i+1) including a first multi-head self-attention network; and configured to obtain the first attention matrix and the historical text encoding sequence output by the encoding layer E_i, and generate the second attention matrix of the encoding layer E_(i+1) according to the residual connection between the first attention matrix and the first multi-head self-attention network, and the historical text encoding sequence; the encoding layer E_(i+1) is the next encoding layer after the encoding layer E_i, N is an integer greater than 1, i is a positive integer, and i is less than N;
  • a speech synthesis module, configured to generate the target text encoding sequence of the encoding layer E_(i+1) according to the second attention matrix and the historical text encoding sequence, and generate synthesized speech data matching the text input sequence based on the target text encoding sequence.
  • In another aspect, the embodiments of the present application provide a speech synthesis apparatus, including:
  • a conversion module, configured to input a text sample sequence into an initial residual attention acoustic model, and convert the text sample sequence into a text feature sample sequence through the initial residual attention acoustic model;
  • a matrix generation module, configured to input the text feature sample sequence into an initial encoder comprising N initial encoding layers in the initial residual attention acoustic model, the N initial encoding layers including an initial encoding layer X_i and an initial encoding layer X_(i+1), and the initial encoding layer X_(i+1) including an initial multi-head self-attention network; and configured to obtain the first attention matrix and the historical text encoding sequence output by the initial encoding layer X_i, and generate the second attention matrix of the initial encoding layer X_(i+1) according to the residual connection between the first attention matrix and the initial multi-head self-attention network, and the historical text encoding sequence; the initial encoding layer X_(i+1) is the next encoding layer after the initial encoding layer X_i, N is an integer greater than 1, i is a positive integer, and i is less than N;
  • a speech synthesis module, configured to generate the target text encoding sequence of the initial encoding layer X_(i+1) according to the second attention matrix and the historical text encoding sequence, and generate predicted speech data matching the text sample sequence based on the target text encoding sequence;
  • a correction module, configured to generate a speech loss function according to the predicted speech data and the reference speech data, and correct the model parameters of the initial residual attention acoustic model with the speech loss function to obtain a residual attention acoustic model; the residual attention acoustic model is used to generate synthesized speech data that matches a text input sequence.
  • An aspect of the embodiments of the present application provides a computer device, including: a processor, a memory, and a network interface;
  • The processor is connected to the memory and the network interface, where the network interface is configured to provide a data communication function, the memory is configured to store a computer program, and the processor is configured to call the computer program to execute the method in the embodiments of the present application.
  • An aspect of the embodiments of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is adapted to be loaded by a processor to execute the method in the embodiments of the present application.
  • An aspect of the embodiments of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the methods in the embodiments of the present application.
  • a text input sequence can be converted into a text feature representation sequence, and then the text feature representation sequence can be input into an encoder including N-layer coding layers.
  • The second attention matrix of the current encoding layer can be generated according to the residual connection between the first attention matrix output by the previous encoding layer and the multi-head self-attention network in the current encoding layer, and the historical text encoding sequence output by the previous encoding layer; further, the target text encoding sequence of the current encoding layer can be generated according to the obtained second attention matrix and the historical text encoding sequence, and finally synthesized speech data matching the above text input sequence can be generated based on the target text encoding sequence.
  • In this way, the embodiments of the present application can make full use of the calculation results of each network layer by putting the residual into the attention matrix, that is, by applying a residual connection to the attention matrix of each layer. The attention matrices of the layers can thus communicate with each other, which effectively accelerates the convergence of the model; at the same time, the attention matrices of the network layers tend to become consistent, which can improve the clarity and stability of the synthesized speech.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIGS. 2a-2c are schematic diagrams of a speech synthesis scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIGS. 4a-4b are schematic diagrams of the network structure of a residual attention acoustic model provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a scenario of voice adjustment provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a model training provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • Artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, involving a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • speech processing technology includes automatic speech recognition technology (Automatic Speech Recognition, ASR), speech synthesis technology and voiceprint recognition technology.
  • Speech synthesis technology, also known as text-to-speech conversion technology, converts text information generated by the computer itself or input from external sources into intelligible and highly natural speech output; it is equivalent to installing an artificial mouth on the machine, so that the machine can say what it wants to express in different timbres.
  • Speech synthesis technology involves multiple disciplines such as acoustics, linguistics, digital signal processing, and computer science.
  • Speech synthesis technology is mainly divided into a language analysis part and an acoustic system part, also known as the front-end part and the back-end part. The language analysis part mainly analyzes the input text information and generates the corresponding linguistic specification, that is, it determines how the text should be read; the acoustic system part mainly generates the corresponding audio according to the linguistic specification provided by the language analysis part, to realize the vocalization function.
  • the acoustic system part currently has three main technical implementation methods, namely: waveform splicing, parameter synthesis and end-to-end speech synthesis technology.
  • the end-to-end speech synthesis technology is currently a relatively popular technology.
  • Representative end-to-end techniques include WaveNet (a technique that uses neural networks to model raw audio waveforms), Tacotron (an end-to-end speech synthesis model that synthesizes speech directly from text), Tacotron2 (an improved version of Tacotron), DeepVoice3 (a fully convolutional neural speech synthesis system based on an attention mechanism), and other technologies.
  • However, such acoustic models are themselves autoregressive in nature, so the speed of generating acoustic parameters is slow.
  • Inaccurate attention alignment can also lead to unstable synthesized speech, resulting in missing or repeated words.
  • Although some speech synthesis acoustic models use Transformer-based feedforward networks to alleviate the above problems, these acoustic models simply stack multiple feedforward networks; when the number of stacked layers is relatively large, gradients tend to vanish, which affects the convergence of the model and the stability of the final synthesized speech.
  • FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • The system architecture may include a service server 100 and a terminal cluster, and the terminal cluster may include terminal device 200a, terminal device 200b, terminal device 200c, ..., terminal device 200n.
  • For example, a communication connection exists between terminal device 200a and terminal device 200b, and a communication connection exists between terminal device 200a and terminal device 200c.
  • Any terminal device in the terminal cluster may have a communication connection with the service server 100.
  • The above communication connection is not limited to a particular mode: a direct or indirect connection may be established through wired communication, a direct or indirect connection may be established through wireless communication, or other methods may be used, which is not limited in this application.
  • each terminal device in the terminal cluster shown in FIG. 1 may be installed with an application client, and when the application client runs in each terminal device, it can be respectively connected with the service server 100 shown in FIG. 1 above. Data interaction is performed between them, so that the service server 100 can receive service data from each terminal device.
  • the application client can be a game application, a social application, an instant messaging application, a car application, a live broadcast application, a short video application, a video application, a music application, a shopping application, an education application, a novel application, a news application, a payment application, a browsing application
  • the application client may be an independent client, or may be an embedded sub-client integrated in a client (eg, a game client, a shopping client, a news client, etc.), which is not limited herein.
  • the service server 100 may be a collection of multiple servers such as a background server, a data processing server, and a stream cache server corresponding to the application client.
  • the service server 100 can provide a text-to-speech service for the terminal cluster through a communication function.
  • For example, a terminal device (which may be terminal device 200a, terminal device 200b, terminal device 200c, or terminal device 200n) may, in a certain application client A listed above (for example, a news application), send the displayed text data to the service server 100 as a text input sequence.
  • The service server 100 can then call a trained residual attention acoustic model based on deep learning technology. In the residual attention acoustic model, the above text input sequence is converted into a text feature representation sequence, and the text feature representation sequence is then subjected in turn to processing operations such as encoding, length adjustment, decoding, and linear transformation to obtain the corresponding acoustic feature sequence; finally, synthesized speech data matching the above text input sequence can be obtained based on the acoustic feature sequence. The obtained synthesized speech data can then be returned to application client A, and the terminal device can play the synthesized speech data in application client A. For example, when application client A is the client corresponding to a news application, all the text of a certain news item can be converted into synthesized speech data, so that the user can obtain the relevant information in the news by playing the synthesized speech data.
  • In an in-vehicle scenario, a vehicle-mounted terminal is configured on the vehicle.
  • An independent in-vehicle application with a text-to-speech function can be installed on the vehicle-mounted terminal, so that the content of a short message or conversational message can be converted into speech and played by triggering the voice conversion control in the in-vehicle application; or the text-to-speech function can be embedded into other in-vehicle applications, so that the text information the user wants to obtain is converted into a voice broadcast; or a mobile terminal may install an application client with the text-to-speech function, and the mobile terminal and the vehicle-mounted terminal can establish a data connection through a local wireless local area network or Bluetooth. After text-to-speech conversion is completed on the mobile terminal, the mobile terminal can send the synthesized speech data to the vehicle-mounted terminal, and after the vehicle-mounted terminal receives the speech data, it can be played through the vehicle-mounted terminal.
  • The system architecture may include multiple service servers, one terminal device may be connected to one service server, and each service server may obtain service data (for example, the whole text data in a web page, or the part of the text data selected by the user), so that the residual attention acoustic model can be invoked to convert the service data into synthesized speech data.
  • the terminal device can also obtain service data, so that the residual attention acoustic model can be invoked to convert the service data into synthetic speech data.
  • The above residual attention acoustic model is a parallel speech synthesis acoustic model based on residual attention; that is, during encoding or decoding, the attention matrices calculated by the layers of the network are residually connected. Therefore, in the process of synthesizing speech data, the calculation results of each network layer can be fully utilized, so that the attention matrices of the layers can communicate with each other, which effectively accelerates the convergence of the model; at the same time, the attention matrices tend to become consistent, which improves the clarity and stability of the synthesized speech.
  • the methods provided in the embodiments of the present application may be executed by computer devices, and the computer devices include but are not limited to terminal devices or service servers.
  • The service server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • Terminal devices can be smart phones, tablet computers, laptop computers, desktop computers, PDAs, mobile internet devices (MIDs), wearable devices (such as smart watches and smart bracelets), smart computers, smart cars, or other intelligent terminals that can run the above application client.
  • the terminal device and the service server may be directly or indirectly connected in a wired or wireless manner, which is not limited in this embodiment of the present application.
  • the service server may also be a node on the blockchain network.
  • Blockchain is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms, and it can verify, store, and update data at the same time.
  • A blockchain is essentially a decentralized database, and each node in the database stores an identical blockchain.
  • The blockchain network includes consensus nodes, which are responsible for the consensus of the entire blockchain network. It can be understood that a block is a data packet carrying transaction data (that is, a transaction business) on the blockchain network; it is a data structure marked with a timestamp and the hash value of the previous block, and blocks are verified by the network's consensus mechanism, which determines the transactions in the block.
  • A blockchain node system may include multiple nodes, and the blockchain node system may correspond to a blockchain network (including but not limited to a blockchain network corresponding to a consortium chain). The multiple nodes may specifically include the above-mentioned service server, and the nodes here can be collectively referred to as blockchain nodes.
  • Data sharing can be performed between blockchain nodes and blockchain nodes.
  • Each node can receive input information during normal work, and maintain the shared data in the blockchain node system based on the received input information.
  • There can be an information connection between the nodes in the blockchain node system, and information can be transmitted between the nodes through that information connection.
  • When any node in the blockchain node system receives input information (such as text data), the other nodes in the blockchain node system obtain the input information according to the consensus algorithm, and the input information is stored as shared data, so that the data stored on all nodes in the blockchain node system is consistent.
  • the method provided by the present application can be naturally applied to any scenario where text needs to be converted into speech.
  • the following will take the terminal device 200a converting a piece of text into speech through the service server 100 as an example for specific description.
  • FIG. 2a to FIG. 2c are schematic diagrams of a speech synthesis scenario provided by an embodiment of the present application.
  • The implementation process of the speech synthesis scenario may be performed in the service server 100 shown in FIG. 1, may be performed in a terminal device (any one of the terminal device 200a, the terminal device 200b, the terminal device 200c, or the terminal device 200n shown in FIG. 1), or may be executed jointly by the terminal device and the service server, which is not limited here.
  • The embodiments of this application are described by taking joint execution by the terminal device 200a and the service server 100 as an example.
  • Suppose the target user holds a terminal device 200a, on which multiple applications (such as education applications, shopping applications, and reading applications) can be installed. When the target user wants to open one of them, such as a target application A1, the terminal device 200a can respond to a trigger operation (such as a click operation) on the target application A1 and display the display interface corresponding to the target application A1 on the screen.
  • After the target application A1 is opened, the terminal device 200a can send a data access request to the service server 100, and the service server 100 can recommend specific electronic books for the target user according to the data access request. For example, a recommendation list of electronic books matching the target user's reading preferences may be displayed, a recommendation list of electronic books similar to those the target user has browsed historically may be displayed, or a recommendation list of currently popular electronic books may be displayed.
  • Content related to the target user can be displayed in the display area 301a, the title and author of the electronic book currently opened by the user can be displayed in the display area 302a together with picture data such as the relevant cover image or illustrations, and the content of the current chapter can be displayed in the display area 305a.
  • As shown in FIG. 2a, when the target user wants to hear the speech converted from the content of the current chapter, he or she can click the conversion control in the display area 303a. The terminal device 200a can then respond to the click operation on the conversion control and send a speech synthesis request to the service server 100. At this time, the current playback progress and the total duration shown on the speech playback progress bar displayed in the display area 304a are both "00:00", indicating that no speech data has been synthesized yet.
  • After the service server 100 receives the speech synthesis request, it can obtain all the content contained in the current chapter, extract the text data from it, and further perform text processing on the extracted text data, including filtering out useless characters and standardizing the format, so as to obtain a text input sequence (here, a character sequence) that is convenient for the acoustic model to process. Further, the service server 100 inputs the text input sequence into the pre-trained residual attention acoustic model, and the corresponding synthesized speech data can be obtained through the acoustic model. The specific process of speech synthesis through the residual attention acoustic model is shown in FIG. 2b.
  • As shown in FIG. 2b, the above text input sequence is first converted to obtain the text feature representation sequence, and the text feature representation sequence can then be input into the encoder.
  • The encoder can include N encoding layers, namely encoding layer E_1, encoding layer E_2, ..., encoding layer E_N, and the network structure of each encoding layer is the same; the size of N can be adjusted according to the size of the corpus, which is not limited here.
  • Each encoding layer includes a multi-head self-attention network and a one-dimensional convolutional network (the multi-head self-attention network in each encoding layer can be called the first multi-head self-attention network), and a residual connection is applied between the multi-head self-attention network and the one-dimensional convolutional network of each encoding layer. The attention matrix of each encoding layer can be calculated through the multi-head self-attention network, and the text encoding sequence of each encoding layer can be obtained based on that attention matrix, so the text encoding sequence output by the last encoding layer (i.e., encoding layer E_N) can be determined as the target text encoding sequence.
  • A residual connection is applied to the attention matrix of each layer; that is, when calculating the attention matrix of the current encoding layer, the attention matrix information of the previous encoding layer needs to be used.
  • The text feature representation sequence is first input into encoding layer E_1, and encoding layer E_1 can generate the attention matrix B1 and the text encoding sequence C1 according to the text feature representation sequence; the text encoding sequence C1 and the attention matrix B1 can then be passed to encoding layer E_2.
  • In encoding layer E_2, the attention matrix B2 can be generated according to the residual connection between the attention matrix B1 and the multi-head self-attention network of encoding layer E_2, and the text encoding sequence C1; the text encoding sequence C2 can then be generated according to the attention matrix B2 and the text encoding sequence C1.
  • The encoding process in encoding layer E_3 and encoding layer E_4 is similar to the encoding process in encoding layer E_2, and will not be repeated here.
  • The text encoding sequence C4 generated by encoding layer E_4 can finally be determined as the encoding result output by the encoder, that is, the target text encoding sequence.
  • Further, synthesized speech data matching the text input sequence can be generated based on the target text encoding sequence, that is, the synthesized speech data corresponding to the third chapter of the current electronic book "Besieged City" is generated.
  • This process also involves a duration predictor, a length regulator, a decoder, and a linear output layer in the residual attention acoustic model, and the specific process can refer to the embodiment corresponding to FIG. 5 below.
  • the service server 100 can use a text database with massive text and an audio database to train a deep neural network to obtain a residual attention acoustic model.
  • For the specific process, refer to the embodiment corresponding to FIG. 7 below.
  • the residual attention acoustic model in Figure 2b only shows part of the network structure for simple illustration, and for a more detailed model framework, please refer to the corresponding embodiments in Figures 4a-4b. Related descriptions are not repeated here.
  • the service server 100 can return the synthesized voice data finally generated to the terminal device 200a.
  • After the terminal device 200a receives the synthesized speech data, it can play the data on the display interface 300b of the target application A1.
  • The style of the conversion control in the display area 302b has also changed at this time, that is, it has switched from the "stopped" state shown in the display area 303a of the display interface 300a in FIG. 2a described above to the "playing" state.
  • The terminal device 200a can respond to a trigger operation on the conversion control by the target user and pause the synthesized speech data being played; by triggering the conversion control again, the paused synthesized speech data can resume playback.
  • a voice playback progress bar can be displayed, including the current playback progress (for example, "06:01", that is, the current playback to the 6th minute and 1st second) and the total duration of the synthesized speech (for example, "15:26” , that is, the total duration is 15 minutes and 26 seconds), the target user can also adjust the current playback progress by dragging and dropping the voice playback progress bar.
  • the representation of the text input sequence can also be in the form of a phoneme sequence.
  • a phoneme is the smallest phonetic unit divided according to the natural attributes of the voice.
  • the attention acoustic model processes the phoneme sequence in the same way as the character sequence, so it will not be repeated here.
  • the method provided in this application is suitable for any scenario that needs to convert text into speech. Therefore, in addition to the reading application described above, the target application A1 can also be other types of applications.
  • When the target application A1 is a news application, the news content can be converted into speech data; when the target application A1 is a game application, the speech that needs to be played in the game, such as plot introductions and character monologues, can be synthesized by entering the corresponding text data; when the target application A1 is an application with an intelligent customer service function (such as a shopping application), the relevant text data can also be entered and converted into speech data, and when a customer's response triggers a certain rule, the intelligent customer service plays the corresponding speech data.
  • In addition, the service server 100 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Therefore, the above calculation process can be distributed over multiple physical servers or multiple cloud servers, that is, all text-to-speech calculations are completed in parallel in a distributed or clustered manner, so that the synthesized speech data matching the text input sequence can be obtained quickly.
  • In summary, the embodiments of the present application provide a residual-attention-based acoustic model for speech synthesis built on a deep neural network. The embodiments of the present application can make full use of the calculation results of each layer in the acoustic model: the residual is put into the attention matrix, that is, a residual connection is applied to the attention matrix of each layer, so that the attention matrices of the layers can communicate with each other, which effectively accelerates the convergence of the model. At the same time, the attention matrices of the network layers tend to become consistent, which can improve the clarity and stability of the synthesized speech. In this way, the embodiments of the present application realize the function of rapidly converting text data into high-quality speech data.
  • FIG. 3 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method may be performed by a computer device, and the computer device may include a terminal device or a service server as described in FIG. 1 .
  • the speech synthesis method may include at least the following S101-S103:
  • S101: Convert a text input sequence into a text feature representation sequence.
  • The method provided in this application can be modeled on characters or phonemes, so the computer device can first perform text preprocessing on the input characters or phonemes to obtain a text input sequence, and then input the text input sequence into the vector conversion layer (Token Embedding) of the trained residual attention acoustic model for conversion, obtaining a text feature representation sequence that is convenient for the model to process.
  • The specific process is as follows: in the residual attention acoustic model, the text input sequence is first input into the vector conversion layer, and the vector conversion layer searches for a match in the vector conversion table, so that the feature vector matching the text input sequence can be used as the text feature representation sequence.
  • the search process can be implemented by one-hot table lookup (also called one-hot encoding, which mainly uses M-bit state registers to encode M states, each state has its own independent register bits, and only one bit is valid at any time).
  • the above-mentioned vector conversion table may include the mapping relationship between each character or phoneme and the feature vector, so the vector conversion table may be pre-built before applying the model.
  • the maximum sequence length of the input characters or phonemes may be limited to 256, and the dimension of the vector corresponding to the text feature representation sequence may be set to 256.
  • FIGS. 4a-4b are schematic diagrams of the network structure of a residual attention acoustic model provided by an embodiment of the present application.
  • the first part is the input layer (Input) of the residual attention acoustic model.
  • the input layer can detect the length, format, etc. corresponding to the input characters or phonemes.
  • the second part of the network structure is the vector conversion layer, which can also be called the character/phoneme vector layer (Token Embedding).
  • the vector conversion layer can convert each input character or phoneme into a corresponding fixed-dimensional vector. After each character or phoneme is transformed into a vector, the text feature representation sequence can be obtained.
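  • As a rough illustration of this lookup, the following is a minimal PyTorch sketch of the vector conversion layer; the 256-dimensional vectors and the 256-token maximum length follow the description above, while the module name, the vocabulary size, and all other details are assumptions rather than values taken from the application.

```python
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Hypothetical sketch of the vector conversion (Token Embedding) layer."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, max_len: int = 256):
        super().__init__()
        self.max_len = max_len
        # Equivalent to a one-hot table lookup: row t of the weight matrix is the
        # feature vector mapped to character/phoneme id t in the vector conversion table.
        self.table = nn.Embedding(vocab_size, embed_dim)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer character/phoneme ids, seq_len <= max_len
        assert token_ids.size(1) <= self.max_len
        return self.table(token_ids)  # text feature representation sequence (batch, seq_len, embed_dim)

# usage: TokenEmbedding(vocab_size=100)(torch.randint(0, 100, (1, 12))).shape -> (1, 12, 256)
```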
  • S102: Input the text feature representation sequence into an encoder comprising N encoding layers, where the N encoding layers include an encoding layer E_i and an encoding layer E_(i+1), and the encoding layer E_(i+1) includes a first multi-head self-attention network; obtain the first attention matrix and the historical text encoding sequence output by the encoding layer E_i, and generate the second attention matrix of the encoding layer E_(i+1) according to the residual connection between the first attention matrix and the first multi-head self-attention network, and the historical text encoding sequence; the encoding layer E_(i+1) is the next encoding layer after the encoding layer E_i, N is an integer greater than 1, i is a positive integer, and i is less than N.
  • The first attention matrix is used to identify the attention parameters used in the process in which the encoding layer E_i encodes its input data to obtain the target text encoding sequence.
  • The first attention matrix and the first multi-head self-attention network are residually connected to obtain the second attention matrix, which serves as the attention parameters used by the encoding layer E_(i+1) in the process of encoding the historical text encoding sequence to obtain the target text encoding sequence.
  • In this way, the attention parameters used by the encoding layer E_i and the encoding layer E_(i+1) can communicate with each other, so that in the process of encoding through the encoder to obtain the target text encoding sequence, the attention matrices used by the encoding layers of the encoder tend to be consistent, which helps to improve the clarity and stability of the subsequently generated speech data.
  • Specifically, the computer device can input the converted text feature representation sequence into the encoder in the network structure shown in FIG. 4a.
  • The encoder includes N encoding layers, where N is an integer greater than 1; it can be understood that N can be adjusted according to the size of the corpus.
  • The structure of each encoding layer is the same; refer to FIG. 4b for a schematic diagram of the specific structure.
  • The encoder in the residual attention acoustic model provided by the embodiments of the present application is a feed-forward network structure based on the multi-head self-attention mechanism (Multi-Head Self-attention), in which a multi-head self-attention layer with a residual attention connection (Residual Multi-Head Self-attention Layer) and a one-dimensional convolution are combined. The multi-head self-attention layer with the residual attention connection uses the multi-head self-attention network to extract cross-position information and encodes the input information through that cross-position information; that is, each encoding layer includes a multi-head self-attention network and a one-dimensional convolutional network (the multi-head self-attention network in each encoding layer can be called the first multi-head self-attention network), and these two networks use a residual connection. Each multi-head self-attention network includes at least two single-head self-attention networks (each single-head self-attention network in the first multi-head self-attention network can be called a first single-head self-attention network), and the specific number can be adjusted according to actual needs, which is not limited in the embodiments of the present application.
  • The N encoding layers include an encoding layer E_i and an encoding layer E_(i+1), where the encoding layer E_(i+1) is the next encoding layer after the encoding layer E_i, i is a positive integer, and i is less than N.
  • In order to obtain the encoding result of the encoding layer E_(i+1), it is first necessary to use the output of the previous encoding layer to generate the attention matrix of the encoding layer E_(i+1), which is called the second attention matrix. That is, it is necessary to obtain the first attention matrix and the historical text encoding sequence output by the encoding layer E_i; the encoding layer E_(i+1) can then generate its second attention matrix according to the residual connection between the first attention matrix and the first multi-head self-attention network in the encoding layer E_(i+1), and the historical text encoding sequence.
  • The specific process may be as follows: obtain the first attention matrix and the historical text encoding sequence output by the encoding layer E_i, where the historical text encoding sequence may include at least two first matching matrices, corresponding respectively to the first single-head self-attention networks; that is, the first matching matrices may be initialized according to the historical text encoding sequence, and the first matching matrices include the Query matrix, Key matrix, and Value matrix corresponding to the first multi-head self-attention network.
  • The Query matrix, Key matrix, and Value matrix corresponding to the multi-head self-attention network can be used to reduce the attention paid to irrelevant characters or phonemes while keeping the attention paid to the current character or phoneme unchanged.
  • For each first single-head self-attention network, its corresponding Query matrix, Key matrix, and Value matrix are all equal to the historical text encoding sequence output by the encoding layer E_i. Further, the first mapping matrix corresponding to the first multi-head self-attention network can be obtained, and the first mapping matrix is used to map the above first matching matrices into different forms.
  • The first mapping matrix includes three different matrices: the mapping matrix W^Q corresponding to the Query matrix, the mapping matrix W^K corresponding to the Key matrix, and the mapping matrix W^V corresponding to the Value matrix. The mapping matrices W^Q, W^K, and W^V can all be randomly initialized and then optimized by the relevant network, so the three mapping matrices are different for each first single-head self-attention network.
  • The Query matrix, the Key matrix, the Value matrix, the mapping matrices, and the split matrix Prev_i are substituted into the following formula to calculate the sub-attention matrix head_i corresponding to the i-th first single-head self-attention network (i is a positive integer):
  • head_i = Softmax( (Q·W_i^Q)·(K·W_i^K)^T / sqrt(d_k) + Prev_i ) · (V·W_i^V)
  • Q, K, and V are used to represent the Query matrix, the Key matrix, and the Value matrix respectively; W_i^Q, W_i^K, and W_i^V represent the mapping matrices corresponding to the i-th first single-head self-attention network; d_k is the dimension of the mapped Key matrix; and Prev_i represents the split matrix extracted from the first attention matrix for the i-th first single-head self-attention network.
  • The first attention matrix can be divided equally among the first single-head self-attention networks, and each first single-head self-attention network uses one of the split matrices for its calculation.
  • Adding Prev_i to the formula means that a residual connection is applied to the attention matrices of the two adjacent encoding layers.
  • Referring to FIG. 4b again, the multi-head self-attention network of the current encoding layer is residually connected with the attention matrix output by the previous encoding layer, so that the attention matrices of the encoding layers can communicate naturally.
  • The Softmax function, also called the normalized exponential function, normalizes the calculation results and finally presents them in the form of probabilities; a weighted summation is then performed over the matrix V' to obtain the sub-attention matrix head_i corresponding to the i-th first single-head self-attention network.
  • In this way, the second attention matrix of the encoding layer E_(i+1) can be obtained. It can be understood that the second attention matrix will be passed to the next encoding layer, which follows the same process as above.
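  • The per-head calculation can be sketched as follows; this is a minimal PyTorch sketch of the formula above, not the application's own code, and it assumes that the split matrix Prev_i is added to the scaled dot-product scores before the Softmax, with an all-zero Prev used in the first encoding layer E_1.

```python
import math
import torch

def residual_attention_head(x, w_q, w_k, w_v, prev_scores):
    """Sub-attention matrix head_i of one first single-head self-attention network.

    x:           (batch, seq_len, d_model) historical text encoding sequence (Q = K = V = x)
    w_q/w_k/w_v: (d_model, d_k) mapping matrices W_i^Q, W_i^K, W_i^V of this head
    prev_scores: (batch, seq_len, seq_len) split matrix Prev_i taken from the first
                 attention matrix of the previous encoding layer (zeros in layer E_1)
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    scores = scores + prev_scores           # residual connection on the attention matrix
    attn = torch.softmax(scores, dim=-1)    # normalized attention weights of this head
    head = attn @ v                         # weighted summation over the mapped Value matrix V'
    return head, attn                       # attn is handed on to the next encoding layer
```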
  • The advantage of the multi-head self-attention network is that it performs h calculations instead of just one, which allows the residual attention acoustic model to learn relevant information in different representation subspaces.
  • In contrast, existing Transformer-based acoustic models for speech synthesis simply stack the modules containing the multi-head self-attention mechanism and do not fully exploit the calculation results of the preceding network layers.
  • By applying a residual connection to the attention matrix of each network layer, the embodiments of the present application reduce the model instability problem caused by gradient calculation, which effectively accelerates the convergence of the model.
  • In addition, powerful parallel computing capability can be achieved through the multi-head self-attention mechanism, and it is also convenient to use other, more efficient optimization methods to improve speed.
  • For the first encoding layer E_1, the input data is the text feature representation sequence output by the vector conversion layer, so in the encoding layer E_1 the text feature representation sequence can be regarded as the historical text encoding sequence, and the first attention matrix is set to an all-zero matrix; the calculation process is otherwise consistent with that of the encoding layer E_(i+1) and is not repeated here.
  • S103: Generate the target text encoding sequence of the encoding layer E_(i+1) according to the second attention matrix and the historical text encoding sequence, and generate synthesized speech data matching the text input sequence based on the target text encoding sequence.
  • The second attention matrix provides the attention parameters used by the encoding layer E_(i+1) for encoding the historical text encoding sequence, and the corresponding target text encoding sequence is obtained through the encoding performed by the encoding layer E_(i+1).
  • The computer device determines the target text encoding sequence corresponding to the text input sequence through the encoder in the residual attention acoustic model.
  • The residual attention acoustic model can generate speech data based on the target text encoding sequence; as the quantized representation obtained by the encoding layers encoding the text input sequence, the target text encoding sequence can accurately reflect the semantic information of the text input sequence and the related information used for synthesizing speech, and identify the association between the text and the phonemes in the text input sequence.
  • The target text encoding sequence of the encoding layer E_(i+1) can be generated according to the second attention matrix and the historical text encoding sequence.
  • Specifically, the second attention matrix obtained in S102 above can be multiplied by the historical text encoding sequence to obtain the first intermediate encoding sequence. The specific calculation formula is as follows:
  • ResidualMultiHead(Q, K, V, Prev) = Concat(head_1, ..., head_h) · W^O
  • where Prev represents the first attention matrix and W^O represents the historical text encoding sequence. Combined with the calculation formula in S102 above, when the first multi-head self-attention network includes h first single-head self-attention networks (h is an integer greater than 1), the sub-attention matrix head_1 of the first first single-head self-attention network, the sub-attention matrix head_2 of the second first single-head self-attention network, ..., and the sub-attention matrix head_h of the h-th first single-head self-attention network are concatenated with the Concat function to obtain the second attention matrix, which is then multiplied by the historical text encoding sequence W^O to obtain the first intermediate encoding sequence.
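  • A sketch of this multi-head combination is shown below, reusing the per-head function from the sketch above. Note that the sketch treats W_O simply as the matrix that multiplies the concatenated heads (in standard multi-head attention this is a learned output mapping); how exactly it coincides with the historical text encoding sequence mentioned above is left open here, so that detail is an assumption.

```python
import torch

def residual_multi_head(x, head_params, prev_attn, w_o):
    """ResidualMultiHead(Q, K, V, Prev) = Concat(head_1, ..., head_h) @ W_O (sketch).

    head_params: list of h tuples (W_i^Q, W_i^K, W_i^V)
    prev_attn:   (batch, h, seq_len, seq_len) first attention matrix of the previous
                 encoding layer, split equally over the h heads
    w_o:         (h * d_k, d_model) matrix applied to the concatenated heads
    Returns the first intermediate encoding sequence and this layer's second attention matrix.
    """
    heads, attns = [], []
    for i, (w_q, w_k, w_v) in enumerate(head_params):
        # residual_attention_head is the per-head sketch shown earlier
        head, attn = residual_attention_head(x, w_q, w_k, w_v, prev_attn[:, i])
        heads.append(head)
        attns.append(attn)
    concat = torch.cat(heads, dim=-1)               # Concat(head_1, ..., head_h)
    return concat @ w_o, torch.stack(attns, dim=1)  # second attention matrix goes to the next layer
```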
  • Referring to FIG. 4b again, after the first intermediate encoding sequence and the historical text encoding sequence are subjected to residual connection and layer normalization (Add & Norm), a second intermediate encoding sequence can be obtained. The second intermediate encoding sequence is then input into the first convolutional network in the encoding layer E_(i+1), a third intermediate encoding sequence is output through the first convolutional network, and the third intermediate encoding sequence and the second intermediate encoding sequence are again subjected to residual connection and normalization, so that the current text encoding sequence of the encoding layer E_(i+1) is finally obtained.
  • When the above current text encoding sequence is the text encoding sequence output by the N-th encoding layer (i.e., the last encoding layer), the current text encoding sequence can, for ease of distinction, be determined as the target text encoding sequence (also called the hidden state sequence of the characters/phonemes).
  • The above first convolutional network may be composed of a two-layer one-dimensional convolutional network with a Rectified Linear Unit (ReLU) activation function; other activation functions (such as the Sigmoid function or the Tanh function) may also be used.
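  • Putting the pieces together, one encoding layer can be sketched as below. This is again an assumption-laden PyTorch sketch that reuses residual_multi_head from the sketch above; the hidden sizes, number of heads, and kernel size are illustrative defaults, not values from the application.

```python
import torch
import torch.nn as nn

class ResidualAttentionEncoderLayer(nn.Module):
    """Sketch of one encoding layer E_(i+1) of the encoder."""
    def __init__(self, d_model=256, n_heads=2, conv_channels=1024, kernel_size=3):
        super().__init__()
        self.n_heads = n_heads
        d_k = d_model // n_heads
        # per-head mapping matrices W_i^Q, W_i^K, W_i^V and the output mapping W_O
        self.w_qkv = nn.Parameter(torch.randn(n_heads, 3, d_model, d_k) * 0.02)
        self.w_o = nn.Parameter(torch.randn(n_heads * d_k, d_model) * 0.02)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # first convolutional network: two 1-D convolutions with a ReLU in between
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, conv_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, d_model, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, hist, prev_attn):
        # hist:      (batch, seq_len, d_model) historical text encoding sequence from E_i
        # prev_attn: (batch, n_heads, seq_len, seq_len) first attention matrix from E_i
        #            (an all-zero matrix when this is the first encoding layer E_1)
        head_params = [tuple(self.w_qkv[i]) for i in range(self.n_heads)]
        inter1, attn = residual_multi_head(hist, head_params, prev_attn, self.w_o)
        inter2 = self.norm1(hist + inter1)                          # Add & Norm -> second intermediate sequence
        inter3 = self.conv(inter2.transpose(1, 2)).transpose(1, 2)  # third intermediate sequence
        out = self.norm2(inter2 + inter3)                           # Add & Norm -> current text encoding sequence
        return out, attn
```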
  • Further, the synthesized speech data matching the text input sequence can be generated. Referring to FIG. 4a again, the target text encoding sequence passes through the duration predictor (Duration Predictor), the length regulator (Length Regulator), the decoder, and the first linear output layer (Linear Layer) in turn, and the acoustic feature sequence is output; based on the acoustic feature sequence, the synthesized speech data can be obtained.
  • It should be noted that the length of the target text encoding sequence does not match the length of the acoustic feature sequence, and the length of the target text encoding sequence is usually smaller than the length of the acoustic feature sequence.
  • In addition, the Encoder-Attention-Decoder mechanism may lead to misalignment between the phonemes and the mel-spectrogram, which in turn leads to repeated or missing words in the generated speech. Therefore, in the embodiments of the present application, the lengths of the target text encoding sequence and the acoustic feature sequence are aligned by the length regulator to solve this problem.
  • Since this alignment requires the duration of each character/phoneme, a duration predictor is also required to predict the duration information of each character/phoneme.
  • FIG. 5 is a schematic flowchart of a voice synthesis method provided by an embodiment of the present application.
  • the process of the speech synthesis method includes the following S201-S205, and S201-S205 is a specific embodiment of S103 in the embodiment corresponding to FIG. 3, and the speech synthesis process includes the following steps:
  • the duration predictor may include a two-layer one-dimensional convolutional network activated by a ReLU activation function (or other activation functions) and a second linear output layer. It should be noted that , the duration predictor is stacked on top of the encoder, which can be used as a module independent of the residual-attention acoustic model and jointly trained end-to-end with the residual-attention acoustic model, or it can be directly used as A module in the residual attention acoustic model to predict the duration information for each character or phoneme.
  • specifically, the computer device can input the target text encoding sequence into the first one-dimensional convolutional layer in the duration predictor for feature extraction and normalization to obtain the first duration feature; the first duration feature can then be input into the second one-dimensional convolutional layer for feature extraction and normalization again to obtain the second duration feature; the second duration feature can be input into the second linear output layer and linearly transformed through the second linear output layer, and the output scalars form the predicted duration sequence corresponding to the text input sequence.
  • the predicted duration sequence includes at least two duration parameters, and the duration parameters are used to represent duration information corresponding to each character or phoneme.
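A hedged sketch of a duration predictor with this structure is shown below; the channel widths and kernel size are assumptions, and the embodiment may place the normalization differently.

```python
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    """Sketch of the duration predictor described above: two 1-D convolutional
    layers with ReLU and normalization, followed by a linear output layer that
    emits one scalar per character/phoneme. All sizes are assumptions."""
    def __init__(self, d_model: int = 256, d_hidden: int = 256, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.relu = nn.ReLU()
        self.linear = nn.Linear(d_hidden, 1)   # the "second linear output layer"

    def forward(self, target_text_encoding: torch.Tensor) -> torch.Tensor:
        # target_text_encoding: (batch, time, d_model)
        x = self.relu(self.conv1(target_text_encoding.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x)                       # first duration feature
        x = self.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm2(x)                       # second duration feature
        return self.linear(x).squeeze(-1)       # predicted duration sequence: (batch, time)
```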
  • S202 input the target text encoding sequence into a length regulator, and in the length regulator, perform sequence length expansion on the target text encoding sequence according to the predicted duration sequence, to obtain an expanded target text encoding sequence;
  • the above-mentioned target text encoding sequence includes at least two encoding vectors, so the computer device can input the target text encoding sequence into the length regulator; in the length regulator, the encoding vectors are respectively copied according to the at least two duration parameters in the predicted duration sequence to obtain copied encoding vectors. Further, the copied encoding vectors and the target text encoding sequence can be spliced to obtain the expanded target text encoding sequence, wherein the sequence length of the expanded target text encoding sequence is equal to the sum of the at least two duration parameters.
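The expansion itself is a simple copy operation. The sketch below assumes that each duration parameter gives the total number of occurrences of the corresponding encoding vector, so that the copy-and-splice step described above reduces to repeating each vector duration-many times; `torch.repeat_interleave` is used as a stand-in for that step.

```python
import torch

def length_regulate(encodings: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand the target text encoding sequence so that each encoding vector
    occurs as many times as its (integer) duration parameter; the expanded
    length equals the sum of the duration parameters."""
    return torch.repeat_interleave(encodings, durations, dim=0)

# Example: 4 encoding vectors with durations [2, 2, 3, 1] expand to 8 vectors.
enc = torch.randn(4, 256)
dur = torch.tensor([2, 2, 3, 1])
expanded = length_regulate(enc, dur)
print(expanded.shape)   # torch.Size([8, 256])
```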
  • the structure of the decoder is consistent with the network structure of the encoder shown in Figure 4b above, that is, a combination of the residual multi-head self-attention layer (Residual Multi-Head Self-Attention Layer) and one-dimensional convolution.
  • the multi-head self-attention layer with a residual attention connection uses the multi-head self-attention network to decode the input information. As shown in Figure 4b, each decoding layer includes a multi-head self-attention network and a one-dimensional convolutional network (the multi-head self-attention network in each decoding layer can be called the second multi-head self-attention network), and these two networks are joined by a residual connection, wherein each multi-head self-attention network includes at least two single-head self-attention networks (each single-head self-attention network in the second multi-head self-attention network can be called a second single-head self-attention network).
  • the decoder includes N decoding layers (the same number as the encoding layers in the encoder), and it can be understood that N can be adjusted according to the size of the corpus.
  • the above N-layer decoding layers include decoding layer D j and decoding layer D j+1 , and decoding layer D j+1 is a decoding layer next to decoding layer D j , j is a positive integer, and j is less than N.
  • in order to obtain the decoding result of the decoding layer D j+1 , it is first necessary to use the output result of the previous decoding layer to generate the attention matrix of the decoding layer D j+1 , which is called the fourth attention matrix. Specifically, the third attention matrix output by the decoding layer D j and the historical speech decoding sequence are obtained, and the decoding layer D j+1 can then generate its fourth attention matrix according to the residual connection between the third attention matrix and the second multi-head self-attention network in the decoding layer D j+1 , together with the historical speech decoding sequence.
  • the specific process can be: obtain the third attention matrix output by the decoding layer D j and the historical speech decoding sequence, wherein the historical speech decoding sequence may include at least two second matching matrices respectively corresponding to the second single-head self-attention networks, and the second matching matrices include the Query matrix, the Key matrix and the Value matrix corresponding to the second multi-head self-attention network. These three matrices can be initialized according to the historical speech decoding sequence, that is, for each second single-head self-attention network, the corresponding Query matrix, Key matrix and Value matrix are all equal to the historical speech decoding sequence output by the decoding layer D j .
  • further, the second mapping matrix corresponding to the second multi-head self-attention network in the decoding layer D j+1 can be obtained; the second mapping matrix is used to map the above-mentioned second matching matrices into different forms, and it also includes three different matrices, namely the mapping matrix corresponding to the Query matrix, the mapping matrix corresponding to the Key matrix, and the mapping matrix corresponding to the Value matrix. The process of generating the second mapping matrix is the same as the process of generating the first mapping matrix in the above S102 and will not be repeated here. The sub-attention matrix corresponding to each second single-head self-attention network can then be calculated (for the specific calculation formula, refer to the calculation formula of the sub-attention matrix in S102), and the Concat function is used to concatenate all the sub-attention matrices corresponding to the second single-head self-attention networks, followed by a linear transformation, to obtain the fourth attention matrix of the decoding layer D j+1 .
  • for the first decoding layer, the input data is the expanded target text encoding sequence, so the expanded target text encoding sequence can be used as the historical speech decoding sequence, and the third attention matrix is set to an all-zero matrix; its calculation process is consistent with the calculation process of the decoding layer D j+1 above and will not be repeated here.
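For intuition, the following NumPy sketch shows one plausible realization of such a residual multi-head self-attention step: the previous layer's attention matrix is added to the current pre-softmax scores, and each head's sub-attention matrix is applied to its value projection before the heads are concatenated and linearly transformed. This is an illustrative reading only; the embodiment's exact formula is the one referenced in S102, and its concatenation of sub-attention matrices may be arranged differently.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_multihead_self_attention(H, prev_attn, Wq, Wk, Wv, Wo):
    """Hedged sketch of a multi-head self-attention layer whose attention scores
    are residually connected to the previous layer's attention matrix.
    H: (T, d) historical decoding sequence (Query = Key = Value source);
    prev_attn: (heads, T, T) third attention matrix (all zeros for layer 1);
    Wq, Wk, Wv: (heads, d, d_k) mapping matrices; Wo: (heads * d_k, d)."""
    n_heads, _, d_k = Wq.shape
    outputs, attn_maps = [], []
    for h in range(n_heads):
        Q, K, V = H @ Wq[h], H @ Wk[h], H @ Wv[h]
        scores = Q @ K.T / np.sqrt(d_k) + prev_attn[h]   # residual on attention scores
        A = softmax(scores)                              # sub-attention matrix of head h
        attn_maps.append(A)
        outputs.append(A @ V)
    out = np.concatenate(outputs, axis=-1) @ Wo          # concatenate heads + linear transform
    return out, np.stack(attn_maps)                      # layer output and attention to pass on

# Toy shapes: T=6 frames, d=8, 2 heads of size 4.
T, d, heads, d_k = 6, 8, 2, 4
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d))
out, attn = residual_multihead_self_attention(
    H, np.zeros((heads, T, T)),
    rng.standard_normal((heads, d, d_k)), rng.standard_normal((heads, d, d_k)),
    rng.standard_normal((heads, d, d_k)), rng.standard_normal((heads * d_k, d)))
print(out.shape, attn.shape)   # (6, 8) (2, 6, 6)
```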
  • the target speech decoding sequence of the decoding layer D j+1 can be generated according to the fourth attention matrix and the historical speech decoding sequence.
  • the specific process is: multiply the fourth attention matrix and the historical speech decoding sequence to obtain the first intermediate decoding sequence (for the specific calculation formula, refer to the formula for calculating the first intermediate coding sequence above); then perform residual connection and normalization processing on the first intermediate decoding sequence and the historical speech decoding sequence to obtain the second intermediate decoding sequence; the second intermediate decoding sequence is input to the second convolutional network in the decoding layer D j+1 , which outputs the third intermediate decoding sequence; residual connection and normalization are performed again on the third intermediate decoding sequence and the second intermediate decoding sequence, and the current speech decoding sequence of the decoding layer D j+1 is finally obtained.
  • when the current speech decoding sequence is the speech decoding sequence output by the Nth decoding layer, the current speech decoding sequence may be determined as the target speech decoding sequence.
  • the above-mentioned second convolutional network may be composed of a two-layer one-dimensional convolutional network with a ReLU activation function.
  • the decoding process of the decoder is also parallel. It should be noted that, since the structure of the encoder and decoder in the classical Transformer-based speech synthesis acoustic model is similar to the structure of the encoder or decoder in this application, the method provided in this application can be naturally extended to any Transformer-based acoustic model for speech synthesis, including the autoregressive Transformer acoustic model.
  • the target speech decoding sequence decoded in parallel by the decoder is input into the first linear output layer, and the target speech decoding sequence is linearly transformed through the first linear output layer, so that the acoustic feature sequence corresponding to the text input sequence can be obtained; in this embodiment of the present application, the acoustic feature sequence may specifically be a mel-spectrogram (Mel-Spectrogram) sequence.
  • the computer device may use a pre-trained vocoder (Vocoder) to perform acoustic feature conversion on the acoustic feature sequence, that is, convert the acoustic feature sequence into synthetic speech data matching the text input sequence.
  • the vocoder can specifically be a WaveGlow network (a flow-based network that synthesizes high-quality speech from a mel spectrogram), which can realize parallelized speech synthesis, or a SqueezeWave network (a lightweight flow model that can be used for mobile speech synthesis), which can effectively improve the speed of speech synthesis; alternatively, vocoders such as Griffin-Lim, WaveNet or other parallel vocoders can be used to synthesize speech from the acoustic feature sequence, and the appropriate vocoder can be chosen according to actual needs, which is not limited in this embodiment of the present application.
  • the MOS (Mean Opinion Score, subjective mean opinion score) index is used to measure how natural the sound is and how close its quality is to a human voice; the method provided by the embodiments of the present application effectively improves the clarity and naturalness of the synthesized speech, and the sound quality is comparable to that of the autoregressive Transformer TTS and Tacotron2.
  • in addition, voice adjustment parameters can first be obtained, the above-mentioned predicted duration sequence can then be updated according to the voice adjustment parameters to obtain an updated predicted duration sequence, and the speech rate or rhythm of the synthesized speech data can further be adjusted according to the updated predicted duration sequence. That is to say, the duration of characters/phonemes can be extended or shortened proportionally to control the speed of the synthesized speech, thereby determining the length of the generated mel spectrogram; pauses between words can also be controlled by adjusting the duration of space characters in the sentence, i.e., adding spaces between adjacent characters/phonemes, which allows part of the prosody of the synthesized speech to be adjusted.
  • FIG. 6 is a schematic diagram of a speech adjustment scenario provided by an embodiment of the present application.
  • (FIG. 6 shows an example in which the adjustment parameter α = 1, the predicted duration sequence is D1 = [2, 2, 3, 1], and the expanded phoneme sequence is P1 = [s sp p iy iy iy ch].)
  • for example, if slower pronunciation is desired, the speech adjustment parameter α can be reduced a little; the resulting expanded sequence P3 is longer, so slower pronunciation can be achieved.
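A tiny numeric illustration of this adjustment, under the assumption that each duration parameter is divided by the adjustment parameter α (so a smaller α lengthens the durations, as in the example above); the exact convention used in the embodiment may differ:

```python
import numpy as np

def adjust_durations(predicted_durations, alpha):
    """Hypothetical voice adjustment: scale each duration parameter by 1/alpha,
    so a smaller alpha yields longer durations and slower speech."""
    d = np.asarray(predicted_durations, dtype=float) / alpha
    return np.maximum(1, np.round(d)).astype(int)

print(adjust_durations([2, 2, 3, 1], alpha=1.0))   # [2 2 3 1] -> original pace
print(adjust_durations([2, 2, 3, 1], alpha=0.5))   # [4 4 6 2] -> longer sequence, slower speech
```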
  • it can be seen that the embodiments of the present application can convert a text input sequence into a text feature representation sequence, and the text feature representation sequence can then be input into an encoder including N coding layers. In the encoder, when the attention matrix of the current coding layer is calculated, the second attention matrix of the current coding layer can be generated according to the residual connection between the first attention matrix output by the previous coding layer and the multi-head self-attention network in the current coding layer, together with the historical text coding sequence output by the previous coding layer; further, the target text coding sequence of the current coding layer can be generated according to the obtained second attention matrix and the historical text coding sequence, and finally, based on the target text coding sequence, length adjustment and decoding can be performed to obtain the synthesized speech data.
  • in the process of synthesizing speech data, the embodiments of the present application can make full use of the calculation results of each layer of the network and put the residual into the attention matrix, that is, perform residual connection on the attention matrix of each layer, so that the attention matrices of the layers can communicate with each other, which effectively accelerates the convergence of the model; at the same time, it also makes the attention matrices of the layers of the network tend to be consistent, which can improve the clarity and stability of the synthesized speech.
  • compared with existing solutions, the clarity and naturalness of the speech synthesized in the embodiments of the present application are better, the spectral details of the synthesized speech are also clearer, and, in addition, the problems of pronunciation errors, wrong intonation and unnatural prosody can be well alleviated.
  • FIG. 7 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • the speech synthesis method may be performed by a computer device, and the computer device may include a terminal device or a service server as described in FIG. 1 .
  • the speech synthesis method may include at least the following S301-S304:
  • the computer device can select a part of the massive sample data for model training, and the selected sample data is used as a data set; the data set includes reference speech data and corresponding text records for training the model. The remaining data in the data set can also be divided into a test set and a validation set, which are respectively used to verify the generalization performance of the model and to adjust the hyperparameters of the model, which will not be described in detail here.
  • specifically, the text sample sequence is input into the vector conversion layer, the vector conversion layer searches and matches in the vector conversion table, and the feature vectors matching the text sample sequence are then used as the text feature sample sequence.
  • S302, input the text feature sample sequence into the initial encoder, which includes N initial encoding layers, in the initial residual attention acoustic model; the N initial encoding layers include the initial encoding layer X i and the initial encoding layer X i+1 , and the initial encoding layer X i+1 includes the initial multi-head self-attention network; obtain the first attention matrix output by the initial encoding layer X i and the historical text encoding sequence, and generate the second attention matrix of the initial encoding layer X i+1 according to the residual connection between the first attention matrix and the initial multi-head self-attention network and the historical text encoding sequence; the initial encoding layer X i+1 is the next encoding layer of the initial encoding layer X i , N is an integer greater than 1, i is a positive integer, and i is less than N;
  • the initial residual attention acoustic model is configured with an initial encoder including N layers of initial coding layers, where N is an integer greater than 1. It can be understood that N can be adjusted according to the size of the corpus.
  • the structure of each initial coding layer is the same: each initial coding layer includes a multi-head self-attention network and a one-dimensional convolutional network (the multi-head self-attention network in each initial coding layer can be called an initial multi-head self-attention network), and these two networks use a residual connection, wherein each multi-head self-attention network includes at least two single-head self-attention networks (each single-head self-attention network in the initial multi-head self-attention network can be called an initial single-head self-attention network). It is assumed that the N initial coding layers include the initial coding layer X i and the initial coding layer X i+1 , and the initial coding layer X i+1 is the next coding layer of the initial coding layer X i , where i is a positive integer and i is less than N. In order to obtain the encoding result of the initial encoding layer X i+1 , it is first necessary to use the output result of the previous initial encoding layer to generate the attention matrix of the initial encoding layer X i+1 , which is called the second attention matrix.
  • the specific process is: obtain the first attention matrix and the historical text coding sequence output by the initial coding layer X i , and then initialize the matching matrices according to the historical text coding sequence, wherein the matching matrices include the Query matrix, the Key matrix and the Value matrix corresponding to the initial multi-head self-attention network; for each initial single-head self-attention network, the corresponding Query matrix, Key matrix and Value matrix are all equal to the historical text encoding sequence output by the initial encoding layer X i . Further, the mapping matrix corresponding to the initial multi-head self-attention network can be obtained; the mapping matrix is used to map the above matching matrices into different forms, and it also includes three different matrices, namely the mapping matrix W Q corresponding to the Query matrix, the mapping matrix W K corresponding to the Key matrix, and the mapping matrix W V corresponding to the Value matrix. The mapping matrix W Q , the mapping matrix W K and the mapping matrix W V can be randomly initialized and then obtained through optimization of the associated network, so these three mapping matrices are different for each initial single-head self-attention network.
  • on this basis, the sub-attention matrix corresponding to each initial single-head self-attention network in the initial coding layer X i+1 can be calculated; all the sub-attention matrices corresponding to the initial single-head self-attention networks are spliced and then subjected to a linear transformation to obtain the second attention matrix of the initial coding layer X i+1 .
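Written out per head, one assumed formulation of this computation (with the residual taken on the pre-softmax attention scores) is the following; the exact formula of the embodiment is the one referred to in S102 and Fig. 4b and may differ in detail:

```latex
A_h^{(i+1)} \;=\; \mathrm{softmax}\!\left(\frac{\bigl(H_i W_h^{Q}\bigr)\bigl(H_i W_h^{K}\bigr)^{\top}}{\sqrt{d_k}} \;+\; A^{(i)}\right),
\qquad
\mathrm{head}_h \;=\; A_h^{(i+1)}\,\bigl(H_i W_h^{V}\bigr)
```

Here H_i denotes the historical text encoding sequence output by the initial encoding layer X i, A^(i) the first attention matrix output by that layer, and W_h^Q, W_h^K, W_h^V the mapping matrices of head h; the second attention matrix of the initial encoding layer X i+1 is then obtained by concatenating the per-head results and applying a linear transformation, as described above.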
  • for the first initial coding layer X 1 , the input data is the text feature sample sequence output by the initial vector conversion layer, so in the initial coding layer X 1 the text feature sample sequence can be used as the historical text coding sequence, and the first attention matrix is set to an all-zero matrix; its calculation process is consistent with the calculation process of the above-mentioned initial coding layer X i+1 and will not be repeated here.
  • the computer device can multiply the second attention matrix obtained in S302 by the historical text encoding sequence to obtain the first intermediate encoding sequence, and perform residual connection and normalization on the first intermediate encoding sequence and the historical text encoding sequence to obtain the second intermediate coding sequence; the second intermediate coding sequence can then be input into the initial convolutional network in the initial coding layer X i+1 , and the third intermediate coding sequence can be output through the initial convolutional network; the third intermediate encoding sequence and the second intermediate encoding sequence are subjected to residual connection and normalization processing, and the current text encoding sequence of the initial encoding layer X i+1 is finally obtained.
  • when the above-mentioned current text encoding sequence is the text encoding sequence output by the Nth initial encoding layer (i.e., the last initial encoding layer), for the convenience of distinction, the current text encoding sequence may be determined as the target text encoding sequence.
  • the above-mentioned initial convolutional network may be composed of a two-layer one-dimensional convolutional network having a ReLU activation function or another activation function (e.g., a Sigmoid function or a Tanh function), which is not limited in this embodiment of the present application.
  • after the target text coding sequence is obtained through the parallel coding of the initial encoder, further, based on the target text coding sequence, predicted speech data that matches the text sample sequence can be generated: the target text coding sequence will sequentially pass through the duration predictor, the length regulator, the initial decoder and the linear output layer in the initial residual attention acoustic model.
  • the network structure of the initial decoder is the same as the network structure of the above-mentioned initial encoder.
  • the computer device can generate a speech loss function (for example, a mean square error loss function) according to the predicted speech data and the reference speech data corresponding to the text sample sequence, which is used to represent the difference between the synthesized predicted speech data and the real reference speech data. Then, the model parameters in the initial residual attention acoustic model can be modified through the speech loss function to obtain a trained residual attention acoustic model. Among them, the residual attention acoustic model is used to generate synthetic speech data that matches the text input sequence.
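A minimal, hypothetical training step built around such a speech loss might look as follows; the model, optimizer and tensor names are illustrative stand-ins, since the embodiment only specifies that a loss such as mean squared error between the predicted and reference speech data is used to correct the model parameters.

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()   # one possible choice of speech loss, as suggested above

def training_step(model, optimizer, text_sample_seq, reference_mel):
    """Hypothetical training step for the initial residual attention acoustic model."""
    predicted_mel = model(text_sample_seq)        # predicted speech data (mel-spectrogram)
    loss = mse(predicted_mel, reference_mel)      # speech loss function
    optimizer.zero_grad()
    loss.backward()                               # gradients used to correct model parameters
    optimizer.step()
    return loss.item()
```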
  • the residual attention acoustic model can include a trained vector conversion layer, an encoder, a duration predictor, a length regulator, a decoder, a linear output layer and a vocoder. It should be noted that the duration predictor and the vocoder can be used either as part of the model or as modules independent of the model; when they are independent modules, they can be jointly trained end-to-end with the residual attention acoustic model.
  • FIG. 8 is a schematic flowchart of a model training provided by an embodiment of the present application.
  • the schematic diagram of the flow mainly includes two parts.
  • the first part is data preparation, including text preprocessing, acoustic feature extraction and phoneme duration information extraction; the second part uses the given data (including the text sample sequence obtained by text preprocessing and the duration information obtained by phoneme duration information extraction) to train the residual-attention-based parallel speech synthesis acoustic model (that is, the residual attention acoustic model), so as to achieve high-precision parallel acoustic model modeling.
  • the text sample sequence and reference speech data are input into the model for training, and the encoder-decoder attention alignment can be obtained, which can then be used to train the duration predictor.
  • the embodiment of the present application provides a method for modeling a parallel speech synthesis acoustic model based on residual attention: by combining a text sample sequence and reference speech data into paired data and inputting them into an initial residual attention acoustic model for training, predicted speech data matching the text sample sequence can be obtained; the model parameters in the initial residual attention acoustic model can then be modified according to the speech loss function generated from the predicted speech data and the reference speech data, so that a high-precision residual attention acoustic model can be obtained, which can predict acoustic parameters accurately, stably and efficiently.
  • the acoustic model obtained by this modeling method can be applied to any scene that needs to convert text into speech.
  • compared with existing solutions, the clarity and naturalness of the synthesized speech are better and the spectral details are also clearer; in addition, the problems of pronunciation errors, intonation errors and unnatural prosody in the existing solutions can be well alleviated, and the method can be naturally extended to any language, dialect, speaker and adaptation-related speech synthesis. Moreover, the Transformer structure used in it is improved, so the method has good scalability.
  • FIG. 9 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • the speech synthesis apparatus may be a computer program (including program code) running on a computer device, for example, the speech synthesis apparatus is an application software; the apparatus may be used to execute corresponding steps in the speech synthesis method provided by the embodiments of the present application.
  • the speech synthesis apparatus 1 may include: a conversion module 11, a matrix generation module 12, and a speech synthesis module 13;
  • the conversion module 11 is used to convert the text input sequence into a text feature representation sequence
  • the above-mentioned conversion module 11 is specifically used for inputting the text input sequence into the vector conversion layer, searching in the vector conversion table through the vector conversion layer, and using the feature vector matching the text input sequence as the text feature representation sequence;
  • the vector conversion table includes the mapping relationship between characters or phonemes and feature vectors;
  • the matrix generation module 12 is used to input the text feature representation sequence into an encoder comprising N encoding layers; the N encoding layers include the encoding layer E i and the encoding layer E i+1 , and the encoding layer E i+1 includes the first multi-head self-attention network; obtain the first attention matrix output by the encoding layer E i and the historical text encoding sequence, and generate the second attention matrix of the encoding layer E i+1 according to the residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoding sequence; the encoding layer E i+1 is the next encoding layer of the encoding layer E i , N is an integer greater than 1, i is a positive integer, and i is less than N;
  • the speech synthesis module 13 is configured to generate the target text encoding sequence of the encoding layer E i+1 according to the second attention matrix and the historical text encoding sequence, and generate synthesized speech data matching the text input sequence based on the target text encoding sequence.
  • the specific function implementation of the conversion module 11 may refer to S101 in the embodiment corresponding to FIG. 3 above, the specific function implementation of the matrix generation module 12 may refer to S102 in the embodiment corresponding to FIG. 3 above, and the specific function implementation of the speech synthesis module 13 may refer to S103 in the embodiment corresponding to FIG. 3 and S201-S205 in the embodiment corresponding to FIG. 5, which will not be repeated here.
  • the speech synthesis device 1 may further include: a speech adjustment module 14;
  • the speech adjustment module 14 is used for acquiring speech adjustment parameters, updating the prediction duration sequence according to the speech adjustment parameters, and obtaining an updated prediction duration sequence; and adjusting the speech rate or rhythm of the synthesized speech data according to the updated prediction duration sequence.
  • the specific function implementation manner of the voice adjustment module 14 may refer to S103 in the embodiment corresponding to FIG. 3 above, which will not be repeated here.
  • the first multi-head self-attention network includes at least two first single-head self-attention networks
  • the above-mentioned matrix generation module 12 may include: a first matrix generation unit 121 and a second matrix generation unit 122;
  • the first matrix generation unit 121 is used to obtain the first attention matrix output by the encoding layer E i and the historical text encoding sequence, the historical text encoding sequence including at least two first matching matrices respectively corresponding to the first single-head self-attention networks; and to obtain the first mapping matrix corresponding to the first multi-head self-attention network, and generate, according to the residual connection between the first mapping matrix, the first matching matrix and the first attention matrix, the sub-attention matrices respectively corresponding to the at least two first single-head self-attention networks;
  • the second matrix generation unit 122 is configured to splicing at least two sub-attention matrices to obtain the second attention matrix of the coding layer E i+1 .
  • the encoding layer E i+1 includes a first convolutional network
  • the above-mentioned speech synthesis module 13 may include: an encoding unit 131 and a speech generating unit 132;
  • the encoding unit 131 is used to multiply the second attention matrix and the historical text encoding sequence to obtain the first intermediate encoding sequence; perform residual connection and normalization processing on the first intermediate encoding sequence and the historical text encoding sequence to obtain the second intermediate coding sequence, and input the second intermediate coding sequence into the first convolutional network to obtain the third intermediate coding sequence; perform residual connection and normalization on the third intermediate coding sequence and the second intermediate coding sequence to obtain the current text encoding sequence of the encoding layer E i+1 ; and, when the current text encoding sequence is the text encoding sequence of the Nth encoding layer, determine the current text encoding sequence as the target text encoding sequence;
  • the speech generation unit 132 is used to input the target text encoding sequence into the duration predictor to obtain the predicted duration sequence corresponding to the text input sequence; input the target text encoding sequence into the length regulator, and in the length regulator expand the sequence length of the target text encoding sequence according to the predicted duration sequence to obtain the expanded target text encoding sequence; input the expanded target text encoding sequence into a decoder including N decoding layers to generate the target speech decoding sequence; input the target speech decoding sequence into the first linear output layer, and in the first linear output layer linearly transform the target speech decoding sequence to obtain the acoustic feature sequence; and perform acoustic feature conversion on the acoustic feature sequence to obtain the synthesized speech data matching the text input sequence.
  • the specific function implementation of the encoding unit 131 may refer to S103 in the embodiment corresponding to FIG. 3 above, and the specific function implementation of the speech generating unit 132 may refer to S201-S205 in the embodiment corresponding to FIG. 5 above, which will not be repeated here.
  • the above-mentioned N-layer decoding layers include a decoding layer D j and a decoding layer D j+1 , the decoding layer D j+1 is a decoding layer next to the decoding layer D j , j is a positive integer, and j is less than N; the decoding layer D j+1 includes a second multi-head self-attention network;
  • the above-mentioned speech generation unit 132 may include: a matrix generation subunit 1321, a decoding subunit 1322, a duration prediction subunit 1323, and a sequence expansion subunit 1324;
  • the matrix generation subunit 1321 is used to obtain the third attention matrix output by the decoding layer Dj and the historical speech decoding sequence, according to the residual connection between the third attention matrix and the second multi-head self-attention network and the historical speech Decoding the sequence to generate the fourth attention matrix of the decoding layer D j+1 ;
  • the second multi-head self-attention network includes at least two second single-head self-attention networks
  • the above-mentioned matrix generation subunit 1321 is specifically used to obtain the third attention matrix output by the decoding layer D j and the historical speech decoding sequence, the historical speech decoding sequence including at least two second matching matrices respectively corresponding to the second single-head self-attention networks; obtain the second mapping matrix corresponding to the second multi-head self-attention network, and generate, according to the residual connection between the second mapping matrix, the second matching matrix and the third attention matrix, the sub-attention matrices respectively corresponding to the at least two second single-head self-attention networks; and splice the at least two sub-attention matrices to obtain the fourth attention matrix of the decoding layer D j+1 ;
  • the decoding subunit 1322 is used to generate the target speech decoding sequence of the decoding layer D j+1 according to the fourth attention matrix and the historical speech decoding sequence; if the decoding layer D j is the first decoding layer, the historical speech decoding sequence of the decoding layer D j is the expanded target text encoding sequence;
  • the decoding layer D j+1 includes a second convolutional network
  • the above decoding subunit 1322 is specifically used to multiply the fourth attention matrix and the historical speech decoding sequence to obtain the first intermediate decoding sequence; perform residual connection and normalization processing on the first intermediate decoding sequence and the historical speech decoding sequence to obtain the second intermediate decoding sequence, and input the second intermediate decoding sequence into the second convolutional network to obtain the third intermediate decoding sequence; perform residual connection and normalization processing on the third intermediate decoding sequence and the second intermediate decoding sequence to obtain the current speech decoding sequence of the decoding layer D j+1 ; and, when the current speech decoding sequence is the speech decoding sequence of the Nth decoding layer, determine the current speech decoding sequence as the target speech decoding sequence;
  • the duration prediction subunit 1323 is used to input the target text encoding sequence into the two-layer one-dimensional convolutional network in the duration predictor to obtain duration features, input the duration features into the second linear output layer, and linearly transform the duration features through the second linear output layer to obtain the predicted duration sequence corresponding to the text input sequence.
  • the target text encoding sequence includes at least two encoding vectors; the predicted duration sequence includes at least two duration parameters;
  • the sequence expansion subunit 1324 is used to input the target text encoding sequence into the length regulator, and in the length regulator, at least two encoding vectors are copied according to at least two duration parameters in the predicted duration sequence to obtain the copied encoding vector;
  • the copied coding vector is spliced with the target text coding sequence to obtain the expanded target text coding sequence; the sequence length of the expanded target text coding sequence is equal to the sum of at least two duration parameters.
  • the specific function implementation of the matrix generation subunit 1321 and the decoding subunit 1322 may refer to S203 in the embodiment corresponding to FIG. 5 above, the specific function implementation of the duration prediction subunit 1323 may refer to S201 in the embodiment corresponding to FIG. 5 above, and the specific function implementation of the sequence expansion subunit 1324 may refer to S202 in the embodiment corresponding to FIG. 5 above, which will not be repeated here.
  • a text input sequence can be converted into a text feature representation sequence, and then the text feature representation sequence can be input into an encoder including N-layer coding layers.
  • in the encoder, when the attention matrix of the current coding layer is calculated, the second attention matrix of the current coding layer can be generated according to the residual connection between the first attention matrix output by the previous coding layer and the multi-head self-attention network in the current coding layer, together with the historical text coding sequence output by the previous coding layer; further, the target text coding sequence of the current coding layer can be generated according to the obtained second attention matrix and the historical text coding sequence, and finally synthetic speech data matching the above text input sequence can be generated based on the target text coding sequence.
  • it can be seen that, in the process of synthesizing speech data, the embodiment of the present application can make full use of the calculation results of each layer of the network and put the residual into the attention matrix, that is, perform residual connection on the attention matrix of each layer, so that the attention matrices of the layers can communicate with each other, which effectively accelerates the convergence of the model; at the same time, it also makes the attention matrices of the layers of the network tend to be consistent, which can improve the clarity and stability of the synthesized speech.
  • FIG. 10 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • the speech synthesis apparatus may be a computer program (including program code) running on a computer device, for example, the speech synthesis apparatus is an application software; the apparatus may be used to execute corresponding steps in the speech synthesis method provided by the embodiments of the present application.
  • the speech synthesis device 2 may include: a conversion module 21, a matrix generation module 22, a speech synthesis module 23 and a correction module 24;
  • the conversion module 21 is used to input the text sample sequence into the initial residual attention acoustic model, and convert the text sample sequence into the text feature sample sequence through the initial residual attention acoustic model;
  • the matrix generation module 22 is used to input the text feature sample sequence into the initial encoder, which includes N initial encoding layers, in the initial residual attention acoustic model; the N initial encoding layers include the initial encoding layer X i and the initial encoding layer X i+1 , and the initial encoding layer X i+1 includes the initial multi-head self-attention network; obtain the first attention matrix and the historical text encoding sequence output by the initial encoding layer X i , and generate the second attention matrix of the initial encoding layer X i+1 according to the residual connection between the first attention matrix and the initial multi-head self-attention network and the historical text encoding sequence; the initial encoding layer X i+1 is the next encoding layer of the initial encoding layer X i , N is an integer greater than 1, i is a positive integer, and i is less than N;
  • the speech synthesis module 23 is used for generating the target text coding sequence of the initial coding layer X i+1 according to the second attention matrix and the historical text coding sequence, and generating the predicted speech data matching the text sample sequence based on the target text coding sequence;
  • the correction module 24 is used to generate a speech loss function according to the predicted speech data and the reference speech data, and modify the model parameters in the initial residual attention acoustic model through the speech loss function to obtain a residual attention acoustic model;
  • the residual attention acoustic model is used to generate synthetic speech data that matches the text input sequence.
  • the specific function implementation of the conversion module 21 may refer to S301 in the embodiment corresponding to FIG. 7 above, the specific function implementation of the matrix generation module 22 may refer to S302 in the embodiment corresponding to FIG. 7 above, the specific function implementation of the speech synthesis module 23 may refer to S303 in the embodiment corresponding to FIG. 7 above, and the specific function implementation of the correction module 24 may refer to S304 in the embodiment corresponding to FIG. 7 above, which will not be repeated here.
  • the embodiment of the present application provides a method for modeling a parallel speech synthesis acoustic model based on residual attention: by combining a text sample sequence and reference speech data into paired data and inputting them into an initial residual attention acoustic model for training, predicted speech data matching the text sample sequence can be obtained; the model parameters in the initial residual attention acoustic model can then be modified according to the speech loss function generated from the predicted speech data and the reference speech data, so that a high-precision residual attention acoustic model can be obtained, which can predict acoustic parameters accurately, stably and efficiently.
  • the acoustic model obtained by this modeling method can be applied to any scene that needs to convert text into speech.
  • compared with existing solutions, the clarity and naturalness of the synthesized speech are better and the spectral details are also clearer; in addition, the problems of pronunciation errors, intonation errors and unnatural prosody in the existing solutions can be well alleviated, and the method can be naturally extended to any language, dialect, speaker and adaptation-related speech synthesis. Moreover, the Transformer structure used in it is improved, so the method has good scalability.
  • FIG. 11 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • the computer device 1000 may include: a processor 1001 , a network interface 1004 and a memory 1005 , in addition, the above-mentioned computer device 1000 may further include: a user interface 1003 , and at least one communication bus 1002 .
  • the communication bus 1002 is used to realize the connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located away from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide a network communication function;
  • the user interface 1003 is mainly used to provide an input interface for the user; and
  • the processor 1001 can be used to invoke the device control application stored in the memory 1005, so as to implement:
  • converting the text input sequence into a text feature representation sequence, and inputting the text feature representation sequence into an encoder including N encoding layers; the N encoding layers include the encoding layer E i and the encoding layer E i+1 , the encoding layer E i+1 is the next encoding layer of the encoding layer E i , N is an integer greater than 1, i is a positive integer, and i is less than N; the encoding layer E i+1 includes the first multi-head self-attention network;
  • obtaining the first attention matrix output by the encoding layer E i and the historical text encoding sequence, and generating the second attention matrix of the encoding layer E i+1 according to the residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoding sequence;
  • the target text coding sequence of the coding layer E i+1 is generated according to the second attention matrix and the historical text coding sequence, and synthetic speech data matching the text input sequence is generated based on the target text coding sequence.
  • the computer device 1000 described in this embodiment of the present application can execute the description of the speech synthesis method in any of the foregoing embodiments corresponding to FIG. 3 and FIG. 5 , and details are not repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • the computer device 2000 may include: a processor 2001 , a network interface 2004 and a memory 2005 , in addition, the above-mentioned computer device 2000 may further include: a user interface 2003 , and at least one communication bus 2002 .
  • the communication bus 2002 is used to realize the connection and communication between these components.
  • the user interface 2003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 2003 may also include a standard wired interface and a wireless interface.
  • the network interface 2004 may include a standard wired interface and a wireless interface (eg, a WI-FI interface).
  • the memory 2005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory.
  • the memory 2005 may also be at least one storage device located away from the aforementioned processor 2001 .
  • the memory 2005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 2004 can provide network communication functions;
  • the user interface 2003 is mainly used to provide an input interface for the user; and
  • the processor 2001 can be used to invoke the device control application stored in the memory 2005, so as to implement:
  • inputting the text sample sequence into the initial residual attention acoustic model, and converting the text sample sequence into a text feature sample sequence through the initial residual attention acoustic model; inputting the text feature sample sequence into the initial encoder, which includes N initial encoding layers, in the initial residual attention acoustic model; the N initial encoding layers include the initial encoding layer X i and the initial encoding layer X i+1 , the initial encoding layer X i+1 is the next encoding layer of the initial encoding layer X i , N is an integer greater than 1, i is a positive integer, and i is less than N; the initial encoding layer X i+1 includes the initial multi-head self-attention network;
  • obtaining the first attention matrix output by the initial encoding layer X i and the historical text encoding sequence, and generating the second attention matrix of the initial encoding layer X i+1 according to the residual connection between the first attention matrix and the initial multi-head self-attention network and the historical text encoding sequence; generating the target text encoding sequence of the initial encoding layer X i+1 according to the second attention matrix and the historical text encoding sequence, and generating predicted speech data matching the text sample sequence based on the target text encoding sequence;
  • generating a speech loss function according to the predicted speech data and the reference speech data, and modifying the model parameters in the initial residual attention acoustic model through the speech loss function to obtain the residual attention acoustic model; the residual attention acoustic model is used for generating synthetic speech data that matches the text input sequence.
  • the computer device 2000 described in this embodiment of the present application can execute the description of the above-mentioned speech synthesis method in the foregoing embodiment corresponding to FIG. 7 , and details are not repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • the embodiment of the present application further provides a computer-readable storage medium, and the computer-readable storage medium stores the computer program executed by the aforementioned speech synthesis apparatus 1 and speech synthesis apparatus 2; the computer program includes program instructions, and when the processor executes the program instructions, it can perform the description of the speech synthesis method in any of the foregoing embodiments corresponding to FIG. 3, FIG. 5 and FIG. 7, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated.
  • for technical details not disclosed in the computer-readable storage medium embodiments involved in the present application, please refer to the description of the method embodiments of the present application.
  • the embodiments of the present application also provide a computer program product including instructions, which, when executed on a computer, cause the computer to execute the methods provided by the above embodiments.
  • the above-mentioned computer-readable storage medium may be an internal storage unit of the speech synthesis apparatus provided in any of the foregoing embodiments or of the above-mentioned computer device, such as a hard disk or a memory of the computer device.
  • the computer-readable storage medium can also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (smart media card, SMC), a secure digital (secure digital, SD) card equipped on the computer device, Flash card (flash card), etc.
  • the computer-readable storage medium may also include both an internal storage unit of the computer device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the computer device.
  • the computer-readable storage medium can also be used to temporarily store data that has been or will be output.
  • the embodiments of the present application further provide a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the method provided by any of the foregoing embodiments corresponding to FIG. 3 , FIG. 5 , and FIG. 7 .
  • each process and/or block in the method flowcharts and/or schematic structural diagrams, as well as combinations of processes and/or blocks in the flowcharts and/or schematic structural diagrams, can be implemented by computer program instructions.
  • These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the schematic structural diagrams.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, and the instruction apparatus implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the schematic structural diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, and the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the schematic structural diagrams.


Abstract

A speech synthesis method and apparatus, and a readable storage medium. The method includes: converting a text input sequence into a text feature representation sequence (S101); inputting the text feature representation sequence into an encoder including N encoding layers, where the N encoding layers include an encoding layer Ei and an encoding layer Ei+1, and the encoding layer Ei+1 includes a first multi-head self-attention network; obtaining a first attention matrix and a historical text encoding sequence output by the encoding layer Ei, and generating a second attention matrix of the encoding layer Ei+1 according to a residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoding sequence (S102); and generating a target text encoding sequence of the encoding layer Ei+1 according to the second attention matrix and the historical text encoding sequence, and generating, based on the target text encoding sequence, synthesized speech data matching the text input sequence (S103). The method can effectively accelerate the convergence of the model and improve the stability of the synthesized speech.

Description

A speech synthesis method and apparatus, and a readable storage medium
This application claims priority to Chinese Patent Application No. 202110267221.5, filed with the China Patent Office on March 11, 2021 and entitled "Speech synthesis method and apparatus, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of Internet technologies, and in particular to speech synthesis.
Background
In recent years, with the large-scale improvement of computing power, deep learning has been extensively studied and applied, further promoting the development of speech synthesis technology. At present, neural-network-based end-to-end text-to-speech (TTS) synthesis is developing rapidly. Compared with concatenative synthesis and statistical parametric synthesis in traditional speech synthesis, the speech data generated by end-to-end text-to-speech synthesis usually has better naturalness. The basic idea of this technology is to use an encoder-decoder framework based on the attention mechanism (for example, a Transformer-based speech synthesis acoustic model) to predict the corresponding acoustic feature sequence directly from the input character sequence or phoneme sequence, and it has been widely adopted in both academia and industry.
However, the structure of the speech synthesis acoustic models in the related art tends to affect the convergence of the model and the stability of the finally synthesized speech.
Summary
Embodiments of this application provide a speech synthesis method and apparatus, and a readable storage medium, which can accelerate the convergence of the model and improve the stability of the synthesized speech.
In one aspect, an embodiment of this application provides a speech synthesis method, including:
converting a text input sequence into a text feature representation sequence;
inputting the text feature representation sequence into an encoder including N encoding layers, where the N encoding layers include an encoding layer E i and an encoding layer E i+1 , the encoding layer E i+1 is the next encoding layer of the encoding layer E i , N is an integer greater than 1, i is a positive integer, i is less than N, and the encoding layer E i+1 includes a first multi-head self-attention network;
obtaining a first attention matrix and a historical text encoding sequence output by the encoding layer E i , and generating a second attention matrix of the encoding layer E i+1 according to a residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoding sequence; and
generating a target text encoding sequence of the encoding layer E i+1 according to the second attention matrix and the historical text encoding sequence, and generating, based on the target text encoding sequence, synthesized speech data matching the text input sequence.
In one aspect, an embodiment of this application provides a speech synthesis method, including:
inputting a text sample sequence into an initial residual attention acoustic model, and converting the text sample sequence into a text feature sample sequence through the initial residual attention acoustic model;
inputting the text feature sample sequence into an initial encoder, which includes N initial encoding layers, in the initial residual attention acoustic model, where the N initial encoding layers include an initial encoding layer X i and an initial encoding layer X i+1 , the initial encoding layer X i+1 is the next encoding layer of the initial encoding layer X i , N is an integer greater than 1, i is a positive integer, i is less than N, and the initial encoding layer X i+1 includes an initial multi-head self-attention network;
obtaining a first attention matrix and a historical text encoding sequence output by the initial encoding layer X i , and generating a second attention matrix of the initial encoding layer X i+1 according to a residual connection between the first attention matrix and the initial multi-head self-attention network and the historical text encoding sequence;
generating a target text encoding sequence of the initial encoding layer X i+1 according to the second attention matrix and the historical text encoding sequence, and generating, based on the target text encoding sequence, predicted speech data matching the text sample sequence; and
generating a speech loss function according to the predicted speech data and reference speech data, and modifying model parameters in the initial residual attention acoustic model through the speech loss function to obtain a residual attention acoustic model, where the residual attention acoustic model is used for generating synthesized speech data matching a text input sequence.
In one aspect, an embodiment of this application provides a speech synthesis apparatus, including:
a conversion module, configured to convert a text input sequence into a text feature representation sequence;
a matrix generation module, configured to input the text feature representation sequence into an encoder including N encoding layers, where the N encoding layers include an encoding layer E i and an encoding layer E i+1 , and the encoding layer E i+1 includes a first multi-head self-attention network; and obtain a first attention matrix and a historical text encoding sequence output by the encoding layer E i , and generate a second attention matrix of the encoding layer E i+1 according to a residual connection between the first attention matrix and the first multi-head self-attention network and the historical text encoding sequence, where the encoding layer E i+1 is the next encoding layer of the encoding layer E i , N is an integer greater than 1, i is a positive integer, and i is less than N; and
a speech synthesis module, configured to generate a target text encoding sequence of the encoding layer E i+1 according to the second attention matrix and the historical text encoding sequence, and generate, based on the target text encoding sequence, synthesized speech data matching the text input sequence.
In one aspect, an embodiment of this application provides a speech synthesis apparatus, including:
a conversion module, configured to input a text sample sequence into an initial residual attention acoustic model, and convert the text sample sequence into a text feature sample sequence through the initial residual attention acoustic model;
a matrix generation module, configured to input the text feature sample sequence into an initial encoder, which includes N initial encoding layers, in the initial residual attention acoustic model, where the N initial encoding layers include an initial encoding layer X i and an initial encoding layer X i+1 , and the initial encoding layer X i+1 includes an initial multi-head self-attention network; and obtain a first attention matrix and a historical text encoding sequence output by the initial encoding layer X i , and generate a second attention matrix of the initial encoding layer X i+1 according to a residual connection between the first attention matrix and the initial multi-head self-attention network and the historical text encoding sequence, where the initial encoding layer X i+1 is the next encoding layer of the initial encoding layer X i , N is an integer greater than 1, i is a positive integer, and i is less than N;
a speech synthesis module, configured to generate a target text encoding sequence of the initial encoding layer X i+1 according to the second attention matrix and the historical text encoding sequence, and generate, based on the target text encoding sequence, predicted speech data matching the text sample sequence; and
a correction module, configured to generate a speech loss function according to the predicted speech data and reference speech data, and modify model parameters in the initial residual attention acoustic model through the speech loss function to obtain a residual attention acoustic model, where the residual attention acoustic model is used for generating synthesized speech data matching a text input sequence.
In one aspect, an embodiment of this application provides a computer device, including a processor, a memory and a network interface;
the processor is connected to the memory and the network interface, where the network interface is configured to provide a data communication function, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to perform the method in the embodiments of this application.
In one aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program, where the computer program is adapted to be loaded by a processor to perform the method in the embodiments of this application.
In one aspect, an embodiment of this application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions stored in a computer-readable storage medium; a processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, so that the computer device performs the method in the embodiments of this application.
In the embodiments of this application, a text input sequence can be converted into a text feature representation sequence, and the text feature representation sequence can then be input into an encoder including N encoding layers. In the encoder, when the attention matrix of the current encoding layer is calculated, the second attention matrix of the current encoding layer can be generated according to the residual connection between the first attention matrix output by the previous encoding layer and the multi-head self-attention network in the current encoding layer, together with the historical text encoding sequence output by the previous encoding layer. Further, the target text encoding sequence of the current encoding layer can be generated according to the obtained second attention matrix and the historical text encoding sequence, and finally synthesized speech data matching the text input sequence can be generated based on the target text encoding sequence. It can thus be seen that, in the process of synthesizing speech data, the embodiments of this application can make full use of the computation results of every layer of the network and place the residual into the attention matrix, that is, perform a residual connection on the attention matrix of every layer, so that the attention matrices of the layers can communicate with each other, which effectively accelerates the convergence of the model; at the same time, it also makes the attention matrices of the layers of the network tend to be consistent, which can improve the clarity and stability of the synthesized speech.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture provided by an embodiment of this application;
FIG. 2a to FIG. 2c are schematic diagrams of a speech synthesis scenario provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 4a and FIG. 4b are schematic diagrams of the network structure of a residual attention acoustic model provided by an embodiment of this application;
FIG. 5 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of a speech adjustment scenario provided by an embodiment of this application;
FIG. 7 is a schematic flowchart of a speech synthesis method provided by an embodiment of this application;
FIG. 8 is a schematic flowchart of model training provided by an embodiment of this application;
FIG. 9 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of this application;
FIG. 10 is a schematic structural diagram of a speech synthesis apparatus provided by an embodiment of this application;
FIG. 11 is a schematic structural diagram of a computer device provided by an embodiment of this application;
FIG. 12 is a schematic structural diagram of a computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. Apparently, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key technologies of speech technology include automatic speech recognition (ASR), speech synthesis and voiceprint recognition. Among them, speech synthesis (Text-to-Speech, TTS), also called text-to-speech conversion, is a technology that converts text information generated by a computer itself or input from the outside into intelligible speech output with high naturalness; this is equivalent to fitting a machine with an artificial mouth, so that the machine can speak what it wants to express with different timbres. Speech synthesis involves multiple disciplines such as acoustics, linguistics, digital signal processing and computer science.
Speech synthesis is mainly divided into a linguistic analysis part and an acoustic system part, also called a front-end part and a back-end part. The linguistic analysis part mainly analyzes the input text information and generates the corresponding linguistic specification, deciding how the text should be read; the acoustic system part mainly generates the corresponding audio according to the linguistic specification provided by the linguistic analysis part, realizing the function of producing sound. At present, the acoustic system part mainly has three technical implementations: waveform concatenation, parametric synthesis, and end-to-end speech synthesis. End-to-end speech synthesis is currently a popular technology: through neural network learning, text or phonetic characters are input directly, the middle is a black box, and synthesized audio is output, which greatly simplifies the originally complex linguistic analysis part; examples include wavenet (a technology that uses neural networks to model the raw audio waveform), Tacotron (an end-to-end speech synthesis model that synthesizes speech directly from text), Tacotron2 (an improved version of Tacotron) and deepvoice3 (a fully convolutional neural speech synthesis system based on the attention mechanism). End-to-end speech synthesis greatly reduces the requirement for linguistic knowledge, can realize speech synthesis in multiple languages, and is no longer limited by linguistic knowledge. The audio synthesized end-to-end is further improved in quality, and the sound is closer to a real person. However, the acoustic models in the related art have the characteristic of autoregressive generation, and the speed of generating acoustic parameters is slow. In addition, inaccurate attention alignment also makes the synthesized speech unstable, leading to problems of missing and repeated words. Although some speech synthesis acoustic models use a Transformer-based feed-forward network to optimize the above problems to some extent, such acoustic models merely stack multiple feed-forward networks; when the number of stacked layers is relatively large, the gradient is prone to vanish, which affects the convergence of the model and the stability of the finally synthesized speech.
The solutions provided in the embodiments of this application involve technologies such as artificial-intelligence speech synthesis and deep learning, and the specific process is described through the following embodiments.
请参见图1,是本申请实施例提供的一种系统架构示意图。该系统架构可以包括业务服务器100以及终端集群,终端集群可以包括:终端设备200a、终端设备200b、终端设备200c、…、终端设备200n,其中,终端集群之间可以存在通信连接,例如终端设备200a与终端设备200b之间存在通信连接,终端设备200a与终端设备200c之间存在通信连接。同时,终端集群中的任一终端设备可以与业务服务器100存在通信连接,例如终端设备200a与业务服务器100之间存在通信连接,其中,上述通信连接不限定连接方式,可以通过有线通信方式进行直接或间接地连接,也可以通过无线通信方式进行直接或间接地连接,还可以通过其它方式,本申请在此不做限制。
应该理解,如图1所示的终端集群中的每个终端设备均可以安装有应用客户端,当该应用客户端运行于各终端设备中时,可以分别与上述图1所示的业务服务器100之间进行数据交互,使得业务服务器100可以接收来自于每个终端设备的业务数据。其中,该应用客户端可以为游戏应用、社交应用、即时通信应用、车载应用、直播应用、短视频应用、视频应用、音乐应用、购物应用、教育应用、小说应用、新闻应用、支付应用、浏览器等具有显示文字、图像、音频、视频等数据信息功能的应用客户端。其中,该应用客户端可以为独立的客户端,也可以为集成在某客户端(例如游戏客户端、购物客户端、新闻客户端等)中的嵌入式子客户端,在此不做限定。其中,业务服务器100可以为该应用客户端对应的后台服务器、数据处理服务器、流缓存服务器等多个服务器的集合。
业务服务器100可以通过通信功能为终端集群提供文本转语音服务,例如,终端设备(可以是终端设备200a、终端设备200b、终端设备200c或者终端设备200n)可以获取上述列举的某个应用客户端A(例如新闻应用)中所显示的文本数据,并对这些文本数据进行文本处理,得到文本输入序列。进一步,业务服务器100可以调用训练好的、基于深度学习技术的残差式注意力声学模型,在该残差式注意力声学模型中,将上述文本输入序列转换为文本特征表示序列,进而可以对该文本特征表示序列依次进行编码、长度调节、解码、线性变换等处理操作,得到对应的声学特征序列,最终可以基于该声学特征序列得到与上述文本输入序列相匹配的合成语音数据。然后可以将得到的合成语音数据返回给应用客户端A,终端设备可以在应用客户端A中播放该合成语音数据。例如,当应用客户端A为新闻应用对应的客户端时,可以将某则新闻中的文字全部转换为合成语音数据,因此用户可以通过播放该合成语音数据获取该则新闻中的相关信息。
在车载场景下,车载终端会配置在车辆上,出于安全性和便捷性的考量,车载终端上可以安装具有文字转语音功能的独立的车载应用,例如当用户在驾驶车辆的过程中接收到一条短信或者会话消息时,可以通过触发该车载应用中的语音转换控件,将短信或会话消息的内容转换为语音后播放出来;或者可以将具有文字转语音功能的车载应用嵌入到其它车载应用中,在特殊情况下(例如车辆行驶过程中或用户触发相关控件时)将用户希望获取的文本信息转换为语音播报出来;又或者可以在智能手机、平板电脑等具备移动联网功能的移动终端上安装具有文字转语音功能的应用客户端,移动终端和车载终端可通过本地无线局域网或者蓝牙建立数据连接,在移动终端上完成文字转语音后,移动终端可以将合成的语音数据发送至车载终端,车载终端接收到语音数据后可通过车载音响进行播放。
可选的,可以理解的是,系统架构中可以包括多个业务服务器,一个终端设备可以与一个业务服务器相连接,每个业务服务器可以获取到与之相连接的终端设备中的业务数据(例如,一个网页中的全部文本数据,或者,用户选择的部分文本数据),从而可以调用残差式注意力声学模型将该业务数据转换为合成语音数据。
可选的,可以理解的是,终端设备也可以获取到业务数据,从而可以调用残差式注意力声学模型将该业务数据转换为合成语音数据。
其中,上述残差式注意力声学模型为基于残差式注意力的并行语音合成声学模型,即在该模型进行编码或解码的过程中,会对每一层网络计算得到的注意力矩阵进行残差连接, 因此在合成语音数据的过程中,可以充分利用每一层网络的计算结果,使得每一层的注意力矩阵能够互通,有效加速了模型的收敛,同时,也使得每一层网络的注意力矩阵趋于一致性,从而可以提升合成语音的清晰度和稳定性。
可以理解的是,本申请实施例提供的方法可以由计算机设备执行,计算机设备包括但不限于终端设备或业务服务器。其中,业务服务器可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云数据库、云服务、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端设备可以是智能手机、平板电脑、笔记本电脑、台式计算机、掌上电脑、移动互联网设备(mobile internet device,MID)、可穿戴设备(例如智能手表、智能手环等)、智能电脑、智能车载等可以运行上述应用客户端的智能终端。其中,终端设备和业务服务器可以通过有线或无线方式进行直接或间接地连接,本申请实施例在此不做限制。
需要说明的是,业务服务器还可以是区块链网络上的一个节点。区块链是一种分布式数据存储、点对点传输、共识机制以及加密算法等计算机技术的新型应用模式,主要用于对数据按时间顺序进行整理,并加密成账本,使其不可被篡改和伪造,同时可进行数据的验证、存储和更新。区块链本质上是一个去中心化的数据库,该数据库中的每个节点均存储一条相同的区块链,区块链网络中包括共识节点,共识节点负责区块链全网的共识。可以理解的是,区块(Block)是在区块链网络上承载交易数据(即交易业务)的数据包,是一种被标记上时间戳和之前一个区块的哈希值的数据结构,区块经过网络的共识机制验证并确定区块中的交易。
在一个区块链节点系统中可以包括多个节点,该区块链节点系统可以对应于区块链网络(包括但不限于联盟链所对应的区块链网络),多个节点具体可以包括上述所说的业务服务器,这里的节点可统称为区块链节点。区块链节点与区块链节点之间可以进行数据共享,每个节点在进行正常工作时可以接收到输入信息,并基于接收到的输入信息维护该区块链节点系统内的共享数据。为了保证区块链节点系统内的信息互通,区块链节点系统中的每个节点之间可以存在信息连接,节点之间可以通过上述信息连接进行信息传输。例如,当区块链节点系统中的任意节点(如上述业务服务器)接收到输入信息(如文本数据)时,区块链节点系统中的其他节点便根据共识算法获取该输入信息,将该输入信息作为共享数据中的数据进行存储,使得区块链节点系统中全部节点上存储的数据均一致。
本申请提供的方法可以自然运用于任何需要将文字转换为语音的场景,为了便于理解,下面以终端设备200a通过业务服务器100将一段文字转换成语音为例进行具体说明。
请一并参见图2a-图2c,是本申请实施例提供的一种语音合成的场景示意图。该语音合成场景的实现过程可以在如图1所示的业务服务器100中进行,也可以在终端设备(如图1所示的终端设备200a、终端设备200b、终端设备200c或终端设备200n中的任意一个)中进行,还可以由终端设备和业务服务器共同执行,此处不做限制,本申请实施例以终端设备200a和业务服务器100共同执行为例进行说明。如图2a所示,目标用户持有终端设备200a,在终端设备200a上可以安装有多个应用(例如教育应用、购物应用、阅读类应用等),假设 目标用户希望打开其中一个应用,如目标应用A1,则终端设备200a可以响应针对目标应用A1的触发操作(如点击操作),在屏幕上显示目标应用A1对应的显示界面,假设目标应用A1为阅读类应用,则终端设备200a可以向业务服务器100发送数据访问请求,业务服务器100可以根据该数据访问请求,为目标用户推荐特定的电子读物,例如可以在目标应用A1对应的默认首页中,显示根据目标用户的阅读喜好相匹配的电子读物推荐列表,或者可以显示与目标用户浏览过的历史电子读物相似的电子读物推荐列表,或者还可以显示当前热度较高的电子读物推荐列表。如图2a所示,假设目标用户通过选择电子读物推荐列表中的其中一个选项,打开了《围城》第三章,则在目标应用A1对应的显示界面300a中,可以在显示区域301a中显示目标用户当前打开的电子读物所对应的标题以及作者,在显示区域302a中还可以显示相关的封面图或插图等图片数据,在显示区域305a中则可以显示当前章节对应的内容。
请再次参见图2a,可以理解,当目标用户希望听到根据当前章节中的内容转换而成的语音时,可以点击显示区域303中的转换控件,进而终端设备200a可以响应针对该转换控件的点击操作,向业务服务器100发送语音合成请求,此时显示区域304a中所显示的语音播放进度条对应的当前播放进度以及合成语音总时长均为“00:00”,用于表示此刻还未合成语音数据。业务服务器100接收到语音合成请求后,可以获取当前章节包含的全部内容,并从中提取出文本数据,进一步可以对提取出的文本数据进行文本处理,包括滤除无用字符、格式标准化处理等,从而可以得到后续便于声学模型处理的文本输入序列(此处为字符序列)。再进一步,业务服务器100会将该文本输入序列输入提前训练好的残差式注意力声学模型中,通过该声学模型可以得到对应的合成语音数据。通过残差式注意力声学模型进行语音合成的具体过程可以参见图2b,如图2b所示,在残差式注意力声学模型中,首先对上述文本输入序列进行转换,得到文本特征表示序列,进而可以将该文本特征表示序列输入编码器,如图2b所示,该编码器可以包括N层编码层,即编码层E1、编码层E2、……、编码层EN,每层编码层的网络结构都是相同的。其中,N的大小可以根据语料规模进行调整,此处不做限制。如图2b所示,每层编码层均包括一个多头自注意力网络和一个一维卷积网络(每层编码层中的多头自注意力网络均可以称之为第一多头自注意力网络),每层编码层的多头自注意力网络与一维卷积网络之间进行残差连接,通过多头自注意力网络可以计算得到每一层编码层的注意力矩阵,基于注意力矩阵可以计算得到每一层编码层的文本编码序列,因此最后一层编码层(即编码层EN)输出的文本编码序列可以确定为目标文本编码序列。需要说明的是,在上述残差式注意力声学模型中,会对每一层的注意力矩阵进行残差连接,即在计算当前编码层的注意力矩阵时,需要利用前一层编码层的注意力矩阵信息。
例如,假设编码器有4层编码层,分别为编码层E1、编码层E2、编码层E3以及编码层E4,则文本特征表示序列先输入编码层E1,编码层E1可以根据该文本特征表示序列生成注意力矩阵B1以及文本编码序列C1,进而可以将文本编码序列C1和注意力矩阵B1传入编码层E2,在编码层E2中,可以根据注意力矩阵B1与编码层E2中的多头自注意力网络之间的残差连接以及文本编码序列C1,生成注意力矩阵B2,进而可以根据注意力矩阵B2以及文本编码序列C1生成文本编码序列C2。在编码层E3和编码层E4中的编码过程与编码层E2中的编码过程类似,这里不再进行赘述。依此类推,最终可以将编码层E4生成的文本编码序列C4确定为编码器输出的编码结果,即目标文本编码序列。
请再次参见图2b,进一步,在残差式注意力声学模型中,可以基于目标文本编码序列生成与文本输入序列相匹配的合成语音数据,结合前述图2a,也就是说,可以基于目标文本编码序列生成当前电子读物《围城》第三章对应的合成语音数据。该过程还涉及到残差式注意力声学模型中的时长预测器、长度调节器、解码器以及线性输出层等,具体过程可以参见下述图5所对应的实施例。
其中,业务服务器100可以利用具有海量文本的文本数据库以及音频数据库,训练深度神经网络得到残差式注意力声学模型,具体过程可以参见下述图7所对应的实施例。需要说明的是,图2b中的残差式注意力声学模型仅仅显示了部分网络结构用于简单的举例说明,更详细的模型框架可以一并参见后续图4a-图4b所对应实施例中的相关描述,这里不再进行赘述。
请参见图2c,如图2c所示,业务服务器100可以将最终生成的合成语音数据返回给终端设备200a,终端设备200a接收到该合成语音数据后,可以在目标应用A1的显示界面300b中播放该合成语音数据,可以看到此时显示区域302b中的转换控件的样式也发生了变化,即由上述图2a的显示界面300a中的显示区域303a中的“停止状态”更新为当前显示界面300b中的显示区域302b中的“播放状态”,终端设备200a可以响应目标用户针对该转换控件的触发操作,将正在播放的合成语音数据进行暂停处理,后续还可以通过再次触发该转换控件将暂停状态下的合成语音数据恢复播放。此外,在显示区域301b中,可以显示语音播放进度条,包括当前播放进度(例如“06:01”,即当前播放到第6分第1秒)以及合成语音总时长(例如“15:26”,即总时长为15分26秒),目标用户通过对语音播放进度条进行拖拽操作,还可以调整当前播放进度。
需要说明的是,文本输入序列的表示形式除了上述场景中所描述的字符序列外,还可以为音素序列的形式,音素(phone)是根据语音的自然属性划分出来的最小语音单位,残差式注意力声学模型对音素序列的处理和对字符序列的处理过程是一样的,因此这里不再进行赘述。此外,本申请提供的方法适用于任何需要将文字转换为语音的场景,因此除了上述描述的阅读类应用,目标应用A1还可以为其他类型的应用,例如当目标应用A1为新闻应用时,可以将新闻内容转换为语音数据;当目标应用A1为游戏应用时,可以将剧情介绍、人物独白等需要在游戏中播放的语音通过录入相应的文本数据进行合成;当目标应用A1为包含智能客服的应用(例如购物应用)时,也可以通过录入相关的文本数据并将其转换为语音数据,当客户的应答触发某个规则时,则智能客服会播放相应的语音数据。
可以理解的是,业务服务器100可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。所以上述提及的计算过程均可以发布在多个物理服务器,或多个云服务器上,即通过分布式或集群并行完成所有文本转语音的计算,进而可以快速地获取到与文本输入序列相匹配的合成语音数据。
上述可知,本申请实施例基于深度神经网络,提供了一种基于残差式注意力的语音合成声学模型,在合成语音数据的过程中,本申请实施例可以充分利用该声学模型中每一层网络的计算结果,将残差放到注意力矩阵中,即对每一层注意力矩阵进行残差连接,从而使得每一层的注意力矩阵能够互通,有效加速了模型的收敛,同时,也使得每一层网络的注意力矩阵趋于一致性,从而可以提升合成语音的清晰度和稳定性。本申请实施例实现了将文本数据快速转换为高质量语音数据的功能。
请参见图3,图3是本申请实施例提供的一种语音合成方法的流程示意图。该语音合成方法可以由计算机设备执行,计算机设备可以包括如图1所述的终端设备或业务服务器。该语音合成方法至少可以包括以下S101-S103:
S101,将文本输入序列转换为文本特征表示序列;
具体的,本申请提供的方法可以基于字符或音素进行建模,因此计算机设备可以先对输入的字符或音素进行文本预处理,得到文本输入序列,进而将文本输入序列输入训练好的残差式注意力声学模型中的向量转换层(Token Embedding)进行转换,得到便于模型处理的文本特征表示序列,具体过程为:在残差式注意力声学模型中,首先将文本输入序列输入向量转换层,通过向量转换层在向量转换表中进行查找匹配,从而可以将与文本输入序列相匹配的特征向量作为文本特征表示序列。可选的,查找的过程可以通过one-hot查表实现(也可以称为独热编码,主要是采用M位状态寄存器来对M个状态进行编码,每个状态都有自己独立的寄存器位,并且在任意时候只有一位有效)。其中,上述向量转换表可以包括各个字符或音素与特征向量之间的映射关系,因此可以在应用模型前预先构建向量转换表。
在一个优选的实施例中,可以限定输入的字符或音素的最大序列长度为256,将文本特征表示序列对应的向量维度设置为256。
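为便于理解上述向量转换过程,下面给出一个示意性的实现草图(采用PyTorch风格,其中的词表大小等数值均为示例假设,向量维度按上文优选实施例取256,并非本申请的官方实现):

import torch
import torch.nn as nn

vocab_size, d_model = 80, 256                           # 词表(字符/音素)大小为假设值;向量维度取256
token_embedding = nn.Embedding(vocab_size, d_model)      # 向量转换层:按字符/音素id在向量转换表中查找

text_ids = torch.randint(0, vocab_size, (1, 32))         # 文本输入序列(示例id),形状为[批大小, 序列长度]
text_features = token_embedding(text_ids)                # 文本特征表示序列,形状为[1, 32, 256]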
请一并参见图4a-图4b,是本申请实施例提供的一种残差式注意力声学模型的网络结构示意图。如图4a所示,在该网络结构中,第一部分为残差式注意力声学模型的输入层(Input),输入层可以对输入的字符或音素所对应的长度、格式等内容进行检测。该网络结构的第二部分即为向量转换层,也可以称之为字符/音素向量层(Token Embedding),向量转换层可以将输入的各个字符或音素转换成对应的固定维度的向量,输入的每个字符或音素均完成向量转换后即可得到文本特征表示序列。
S102,将文本特征表示序列输入包含N层编码层的编码器;N层编码层中包括编码层E i以及编码层E i+1;编码层E i+1包括第一多头自注意力网络;获取编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与第一多头自注意力网络之间的残差连接以及历史文本编码序列,生成编码层E i+1的第二注意力矩阵;编码层E i+1为编码层E i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;
第一注意力矩阵用于标识编码层E i对输入数据编码得到目标文本编码序列的过程中所采用的注意力参数,计算机设备通过将第一注意力矩阵与编码层E i+1包括的第一多头自注意力网络进行残差连接,可以得到第二注意力矩阵,该第二注意力矩阵作为编码层E i+1对历史文本编码序列进行编码得到目标文本编码序列过程中使用的注意力参数。
也就是说,在生成第二注意力矩阵时,编码层E i和编码层E i+1所采用的注意力参数实现了互通,由此在通过编码器进行编码得到目标文本编码序列的过程中,编码器的每一层编码层采用的注意力矩阵能够趋于一致,有助于提升后续生成语音数据的清晰度和稳定性。
具体的,在残差式注意力声学模型中,计算机设备可以将转换得到的文本特征表示序列输入如图4a所示的网络结构中的编码器,该编码器包含有N层编码层,N为大于1的整数,可以理解,N可以根据语料规模进行调整。每层编码层的结构是相同的,具体结构示意图请一并参见图4b,如图4b所示,本申请实施例提供的残差式注意力声学模型中的编码器,为基于多头自注意力机制(Multi-Head Self-attention)的、带残差注意力连接的多头自注意力层(Residual Multi-Head Self-attention Layer)和一维卷积相结合的一种前馈网络结构,带残差注意力连接的多头自注意力层采用多头自注意力网络提取交叉位置信息,通过交叉位置信息对输入信息进行编码,即每层编码层均包括一个多头自注意力网络以及一个一维卷积网络(每层编码层中的多头自注意力网络均可以称之为第一多头自注意力网络),且这两层网络均采用一个残差连接,其中,每个多头自注意力网络均包括至少两个单头自注意力网络(第一多头自注意力网络中的每个单头自注意力网络均可以称之为第一单头自注意力网络),具体数量可以根据实际需要进行调整,本申请实施例对此不做限制。
假设上述N层编码层中包括编码层E i以及编码层E i+1,且编码层E i+1为编码层E i的下一层编码层,其中,i为正整数,且i小于N。为了得到编码层E i+1的编码结果,首先需要利用前一层编码层的输出结果生成编码层E i+1的注意力矩阵,称为第二注意力矩阵,也就是说,需要获取编码层E i输出的第一注意力矩阵以及历史文本编码序列,进而编码层E i+1可以根据第一注意力矩阵与编码层E i+1中的第一多头自注意力网络之间的残差连接以及历史文本编码序列,生成编码层E i+1的第二注意力矩阵,具体过程可以为:获取编码层E i输出的第一注意力矩阵以及历史文本编码序列,其中,历史文本编码序列可以包括至少两个第一单头自注意力网络分别对应的第一匹配矩阵,也就是说,可以根据历史文本编码序列对第一匹配矩阵进行初始化,其中,第一匹配矩阵包括第一多头自注意力网络对应的Query矩阵、Key矩阵以及Value矩阵,这三个矩阵可以用于在保持对当前字符或音素的关注度不变的情况下,降低对不相关字符或音素的关注度。对于每个第一单头自注意力网络来说,其对应的Query矩阵、Key矩阵以及Value矩阵均等于编码层E i输出的历史文本编码序列。进一步,可以获取第一多头自注意力网络对应的第一映射矩阵,第一映射矩阵用于将上述第一匹配矩阵映射为不同的形式,可以理解,第一映射矩阵同样包括三个不同的矩阵,分别为Query矩阵对应的映射矩阵W Q、Key矩阵对应的映射矩阵W K以及Value矩阵对应的映射矩阵W V,其中,映射矩阵W Q、映射矩阵W K、映射矩阵W V均可以经过随机初始化后再通过相关网络优化得到,因此对于每个第一单头自注意力网络来说,这三个映射矩阵都是不一样的。
在计算第二注意力矩阵前,需要根据上述第一映射矩阵、第一匹配矩阵与第一注意力矩阵之间的残差连接,计算得到编码层E i+1中的每个第一单头自注意力网络分别对应的子注意力矩阵,具体的计算公式如下:
head_i=Attention(Q·W_i^Q,K·W_i^K,V·W_i^V,Prev_i)
其中,
Attention(Q′,K′,V′,Prev′)=Softmax(Q′·K′^T/√d_k+Prev′)·V′
将Q′=Q·W_i^Q,K′=K·W_i^K,V′=V·W_i^V,Prev′=Prev_i代入上述公式即可计算得到第i个第一单头自注意力网络对应的子注意力矩阵head_i(i为正整数)。上述计算公式中,Q、K、V分别用于表示Query矩阵、Key矩阵以及Value矩阵,W_i^Q、W_i^K以及W_i^V表示第i个第一单头自注意力网络对应的映射矩阵,Prev_i表示第i个第一单头自注意力网络对应的、从第一注意力矩阵提取出的拆分矩阵,具体的,可以根据第一单头自注意力网络的总数量将第一注意力矩阵进行均等划分,例如假设有4个第一单头自注意力网络,第一注意力矩阵的维度为4*16,则可以将第一注意力矩阵均等划分为4个4*4的拆分矩阵,每个第一单头自注意力网络会使用其中一个拆分矩阵进行计算,在公式中加上Prev′即表示对相邻两层编码层的注意力矩阵进行了残差连接,请再参见图4b,在编码器的网络结构中,当前编码层的多头自注意力网络会与前一层编码层输出的注意力矩阵进行残差连接,使得每一层编码层的注意力矩阵能够自然互通。除以√d_k(d_k为矩阵K的维度)可以起到调节作用,防止梯度消失,且√d_k这一缩放因子并不是唯一值,主要由经验所得。Softmax函数又可以称为归一化指数函数,可以对计算结果进行标准化,最终以概率的形式展现出来,再对矩阵V′进行加权求和,即可得到第i个第一单头自注意力网络对应的子注意力矩阵head_i。进一步,将所有第一单头自注意力网络对应的子注意力矩阵进行拼接,再进行一次线性变换,即可得到编码层E i+1的第二注意力矩阵。可以理解,第二注意力矩阵会经过上述同样的过程传递到下一层编码层。
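为便于理解上述带残差连接的注意力矩阵计算过程,下面给出一个示意性的PyTorch实现草图(类名、头数等均为示例假设,并非本申请的官方实现;其中前一层的注意力矩阵直接加在Softmax之前的注意力得分上,与上述公式对应):

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualMultiHeadSelfAttention(nn.Module):
    # 带残差注意力连接的多头自注意力(示意实现)
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # 映射矩阵 W^Q
        self.w_k = nn.Linear(d_model, d_model)   # 映射矩阵 W^K
        self.w_v = nn.Linear(d_model, d_model)   # 映射矩阵 W^V
        self.w_o = nn.Linear(d_model, d_model)   # 拼接后的线性变换

    def forward(self, x, prev_attn=None):
        # x: 前一层输出的(历史)编码序列,形状为[B, T, d_model],Q=K=V=x
        B, T, _ = x.shape
        q = self.w_q(x).view(B, T, self.h, self.d_k).transpose(1, 2)   # [B, h, T, d_k]
        k = self.w_k(x).view(B, T, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(B, T, self.h, self.d_k).transpose(1, 2)

        scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.d_k)   # [B, h, T, T]
        if prev_attn is not None:
            scores = scores + prev_attn          # 与前一层编码层输出的注意力矩阵做残差连接
        attn = F.softmax(scores, dim=-1)
        out = torch.matmul(attn, v).transpose(1, 2).reshape(B, T, -1)
        return self.w_o(out), scores             # 同时返回本层注意力矩阵,供下一层继续残差连接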
上述可知,与单头自注意力网络相比,假设多头自注意力网络包括h个单头自注意力网络,则多头自注意力网络的优势在于进行了h次计算而不仅仅是一次计算,这样做的好处是允许残差式注意力声学模型在不同的表示子空间里学习到相关的信息。现有基于Transformer的语音合成声学模型由于仅仅是对包含多头自注意力机制的模块进行简单堆叠,没有充分挖掘前一层网络的计算结果,当随着堆叠层数增加导致梯度消失时,会影响模型的收敛和最终合成语音的稳定性,而本申请实施例为了充分利用每一层网络的计算结果,通过对每一层网络的注意力矩阵进行残差连接来减少梯度计算导致的模型不稳定问题,有效加速了模型的收敛。此外,通过多头自注意力机制可以实现强大的并行计算能力,同时也方便使用其他更高效的优化方法来提升速度。
需要说明的是,对于第一层编码层E 1,其输入数据为向量转换层输出的文本特征表示序列,因此在编码层E 1中,可以将文本特征表示序列作为历史文本编码序列,将第一注意力矩阵设置为全零矩阵,其计算过程与上述编码层E i+1的计算过程一致,这里不再进行赘述。
S103,根据第二注意力矩阵以及历史文本编码序列生成编码层E i+1的目标文本编码序列,基于目标文本编码序列生成与文本输入序列相匹配的合成语音数据。
如前所述,第二注意力矩阵为编码层E i+1用于对历史文本编码序列进行编码时采用的注意力参数,通过第二注意力矩阵的注意力指示,编码层E i+1编码得到对应的目标编码序列。
计算机设备通过残差式注意力声学模型中的编码器确定文本输入序列对应的目标文本编码序列,残差式注意力声学模型可以基于目标文本编码序列生成语音数据,目标文本编码序列作为编码层E i对文本输入序列编码得到的量化表示,可以准确的体现出文本输入序列的语义信息和用于合成语音的相关信息,标识了文本输入序列中文本与音素间的关联,从而基于残差式注意力声学模型的解码器对目标文本编码序列进行解码后,可以得到清晰、 流畅的合成语音数据。
具体的,可以根据第二注意力矩阵以及历史文本编码序列生成编码层E i+1的目标文本编码序列,首先可以将上述S102得到的第二注意力矩阵与历史文本编码序列进行相乘,得到第一中间编码序列,具体的计算公式如下:
ResidualMultiHead(Q,K,V,Prev)=Concat(head_1,…,head_h)·W^O
此处的Prev表示第一注意力矩阵,W^O表示历史文本编码序列,结合上述S102中的子注意力矩阵计算公式
head_i=Attention(Q·W_i^Q,K·W_i^K,V·W_i^V,Prev_i),
假设第一多头自注意力网络包括h个第一单头自注意力网络(h为大于1的整数),则将第1个第一单头自注意力网络的子注意力矩阵head_1、第2个第一单头自注意力网络的子注意力矩阵head_2、……、第h个第一单头自注意力网络的子注意力矩阵head_h使用Concat函数进行拼接得到的第二注意力矩阵,再乘以历史文本编码序列W^O可以得到第一中间编码序列。进一步,请再次参见图4b,如图4b所示,对第一中间编码序列和历史文本编码序列进行残差连接以及归一化处理后,可以得到第二中间编码序列,进而可以将第二中间编码序列输入编码层E i+1中的第一卷积网络,通过第一卷积网络可以输出第三中间编码序列,再次对第三中间编码序列和第二中间编码序列进行残差连接以及归一化处理,最终得到编码层E i+1的当前文本编码序列。可以理解,当上述当前文本编码序列为第N层编码层(即最后一层编码层)输出的文本编码序列时,为了便于区分,可以将当前文本编码序列确定为目标文本编码序列(也可以称为字符/音素隐藏状态序列)。可选的,上述第一卷积网络可以由具有修正线性单元(Rectified Linear Unit,ReLU)激活函数的两层一维卷积网络构成,或者还可以使用其它激活函数(例如Sigmoid函数、Tanh函数等)以加入非线性因素,用于对第二中间编码序列进行非线性变换,本申请实施例对此不做限制。由图4b可知,在整个编码器的网络结构中均使用了残差连接和对层进行了归一化处理(简称为Add&Norm),这样做可以更好地优化深度网络。
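结合上述编码层结构,下面给出一个将带残差注意力的自注意力与一维卷积、Add&Norm组合成完整编码层并堆叠N层的示意草图(沿用上文草图中的import与ResidualMultiHeadSelfAttention;卷积通道数、卷积核大小、层数N等均为示例假设,并非本申请的官方实现):

class ResidualAttentionEncoderLayer(nn.Module):
    def __init__(self, d_model=256, num_heads=4, conv_channels=1024, kernel_size=9):
        super().__init__()
        self.self_attn = ResidualMultiHeadSelfAttention(d_model, num_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(                       # 两层一维卷积,ReLU激活(激活函数可替换)
            nn.Conv1d(d_model, conv_channels, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(conv_channels, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, prev_attn=None):
        attn_out, attn_scores = self.self_attn(x, prev_attn)
        x = self.norm1(x + attn_out)                     # 第一中间编码序列与历史文本编码序列的残差连接及归一化
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x = self.norm2(x + conv_out)                     # 第三中间编码序列与第二中间编码序列的残差连接及归一化
        return x, attn_scores                            # 返回当前文本编码序列与本层注意力矩阵

encoder_layers = nn.ModuleList([ResidualAttentionEncoderLayer() for _ in range(4)])   # N=4 仅为示例

def encode(text_features):
    x, attn = text_features, None        # 第一层编码层的前置注意力矩阵取全零(此处以None等价表示)
    for layer in encoder_layers:
        x, attn = layer(x, attn)         # 注意力矩阵逐层传递,实现层间互通
    return x                             # 目标文本编码序列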
经上述编码器并行编码得到目标文本编码序列后,进一步,可以基于目标文本编码序列生成与文本输入序列相匹配的合成语音数据,请再次参见图4a,目标文本编码序列会依次经过时长预测器(Duration Predictor)、长度调节器(Length Regulator)、解码器以及第一线性输出层(Linear Layer),然后输出声学特征序列,基于声学特征序列可以得到合成语音数据。可以理解,一般情况下,目标文本编码序列的长度与声学特征序列的长度是不匹配的,目标文本编码序列的长度通常要小于声学特征序列的长度,而现有的自回归模型中常用的编码器-注意力-解码器(Encoder-Attention-Decoder)机制可能导致音素和梅尔谱之间的错误对齐,进而导致生成的语音出现重复吐词或漏词,因此本申请实施例为了解决这个问题,将会通过长度调节器对目标文本编码序列和声学特征序列的长度进行对齐,在推理过程中,由于字符/音素的时长信息(即每个字符/音素对齐的声学特征序列的长度)没有给定,因此还需要时长预测器对每个字符/音素的时长进行预测。
生成语音数据的具体过程请参见图5,图5是本申请实施例提供的一种语音合成方法的流程示意图。如图5所示,该语音合成方法的过程包括如下S201-S205,且S201-S205为图3所对应实施例中S103的一个具体实施例,该语音合成过程包括如下步骤:
S201,将目标文本编码序列输入时长预测器,获取文本输入序列对应的预测时长序列;
具体的,请再次参见图4a中的时长预测器,该时长预测器可以包括由ReLU激活函数(或其它激活函数)激活的两层一维卷积网络以及第二线性输出层,需要说明的是,该时长预测器堆叠在编码器的顶部,其可以作为一个独立于残差式注意力声学模型的模块,并与残差式注意力声学模型端到端一起联合训练得到,或者,可以直接作为残差式注意力声学模型中的一个模块,用以预测每个字符或音素对应的时长信息。
计算机设备可以将目标文本编码序列输入时长预测器中的第1层一维卷积网络进行特征提取并进行归一化处理,可以得到第一时长特征,进而可以将第一时长特征输入第2层一维卷积网络再次进行特征提取以及归一化处理,得到第二时长特征,进一步,可以将第二时长特征输入第二线性输出层,通过第二线性输出层对第二时长特征进行线性变换输出标量,可以得到文本输入序列对应的预测时长序列。其中,预测时长序列包括至少两个时长参数,时长参数用于表示每个字符或音素对应的时长信息。
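作为参考,下面给出时长预测器的一个示意草图(两层一维卷积加第二线性输出层;卷积核大小、dropout等超参数为示例假设,沿用前文草图中的import,并非本申请的官方实现):

class DurationPredictor(nn.Module):
    def __init__(self, d_model=256, filter_size=256, kernel_size=3, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_model, filter_size, kernel_size, padding=kernel_size // 2)
        self.norm1 = nn.LayerNorm(filter_size)
        self.conv2 = nn.Conv1d(filter_size, filter_size, kernel_size, padding=kernel_size // 2)
        self.norm2 = nn.LayerNorm(filter_size)
        self.linear = nn.Linear(filter_size, 1)          # 第二线性输出层,对时长特征做线性变换输出标量
        self.dropout = nn.Dropout(dropout)

    def forward(self, encoder_out):                      # encoder_out: 目标文本编码序列 [B, T, d_model]
        x = F.relu(self.conv1(encoder_out.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm1(x))                  # 第一时长特征
        x = F.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.dropout(self.norm2(x))                  # 第二时长特征
        return self.linear(x).squeeze(-1)                # 预测时长序列 [B, T],每个位置对应一个字符/音素的时长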
S202,将目标文本编码序列输入长度调节器,在长度调节器中,根据预测时长序列对目标文本编码序列进行序列长度拓展,得到拓展后的目标文本编码序列;
具体的,上述目标文本编码序列包括至少两个编码向量,因此计算机设备可以将目标文本编码序列输入长度调节器,在长度调节器中,根据预测时长序列中的至少两个时长参数分别对编码向量进行复制,得到复制编码向量。进一步,可以将复制编码向量与目标文本编码序列进行拼接,从而得到拓展后的目标文本编码序列,其中,拓展后的目标文本编码序列的序列长度与至少两个时长参数的总和相等。
例如,记目标文本编码序列为H=[h_1,h_2,…,h_n],n为目标文本编码序列的长度,h_i表示目标文本编码序列中的第i个编码向量,记预测时长序列为D=[d_1,d_2,…,d_n],其中,
d_1+d_2+…+d_n=m,
m为对应声学特征序列的长度,假设给定H=[h_1,h_2,h_3]和D=[2,3,1],那么拓展后的目标文本编码序列变为H′=[h_1,h_1,h_2,h_2,h_2,h_3],即长度调节器对编码向量h_1复制1次,对编码向量h_2复制2次,对编码向量h_3不进行复制。
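上述长度拓展操作可以用如下示意代码表达(函数名为示例假设,沿用前文草图中的import;durations为取整后的预测时长序列):

def length_regulate(encoder_out, durations):
    # encoder_out: 目标文本编码序列 [B, T, d_model];durations: 整数时长 [B, T]
    expanded = []
    for hidden, dur in zip(encoder_out, durations):
        expanded.append(torch.repeat_interleave(hidden, dur, dim=0))   # 第i个编码向量被复制为d_i份
    # 批内各样本拓展后长度不同,按最长序列补零对齐;单个样本拓展后的长度等于各时长参数之和
    return nn.utils.rnn.pad_sequence(expanded, batch_first=True)

# 对应上文示例:H=[h_1,h_2,h_3]、D=[2,3,1]时,单个样本拓展为[h_1,h_1,h_2,h_2,h_2,h_3]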
S203,将拓展后的目标文本编码序列输入包含N层解码层的解码器,生成目标语音解码序列;
具体的,在残差式注意力声学模型中,解码器的结构与上述图4b所示的编码器的网络结构是一致的,即同样由带残差注意力连接的自注意力层(Residual Multi-Head Self-attention Layer)和一维卷积组合而成,带残差注意力连接的多头自注意力层采用多头自注意力网络对输入信息进行解码,如图4b所示,即每层解码层均包括一个多头自注意力网络以及一个一维卷积网络(每层解码层中的多头自注意力网络均可以称之为第二多头自注意力网络),且这两层网络均采用一个残差连接,每个多头自注意力网络均包括至少两个单头自注意力网络(第二多头自注意力网络中的每个单头自注意力网络均可以称之为第二单头自注意力网络)。其中,解码器包括N层解码层(与编码器中的编码层的数量相同),可以理解,N可以根据语料规模进行调整。
假设上述N层解码层中包括解码层D j以及解码层D j+1,且解码层D j+1为解码层D j的下一层解码层,j为正整数,且j小于N。为了得到解码层D j+1的解码结果,首先需要利用前一层 解码层的输出结果生成解码层D j+1的注意力矩阵,称为第四注意力矩阵,也就是说,需要获取解码层D j输出的第三注意力矩阵以及历史语音解码序列,进而解码层D j+1可以根据第三注意力矩阵与解码层D j+1中的第二多头自注意力网络之间的残差连接以及历史语音解码序列,生成解码层D j+1的第四注意力矩阵,具体过程可以为:获取解码层D j输出的第三注意力矩阵以及历史语音解码序列,其中,历史语音解码序列可以包括至少两个第二单头自注意力网络分别对应的第二匹配矩阵,这里的第二匹配矩阵包括第二多头自注意力网络对应的Query矩阵、Key矩阵以及Value矩阵,同样可以根据历史语音解码序列对这3个矩阵进行初始化,即对于每个第二单头自注意力网络来说,其对应的Query矩阵、Key矩阵以及Value矩阵均等于解码层D j输出的历史语音解码序列。进一步,可以获取解码层D j+1中的第二多头自注意力网络对应的第二映射矩阵,第二映射矩阵用于将上述第二匹配矩阵映射为不同的形式,可以理解,第二映射矩阵同样包括三个不同的矩阵,分别为Query矩阵对应的映射矩阵、Key矩阵对应的映射矩阵以及Value矩阵对应的映射矩阵,生成第二映射矩阵的过程与上述S102中生成第一映射矩阵的过程是一样的,这里不再进行赘述。
在计算第四注意力矩阵前,需要根据上述第二映射矩阵、第二匹配矩阵与第三注意力矩阵之间的残差连接,计算得到解码层D j+1中的每个第二单头自注意力网络分别对应的子注意力矩阵,具体计算公式可参见S102中子注意力矩阵的计算公式。进一步,使用Concat函数将所有第二单头自注意力网络对应的子注意力矩阵进行拼接,再进行一次线性变换,即可得到解码层D j+1的第四注意力矩阵。
需要说明的是,对于第一层解码层D 1,其输入数据为拓展后的目标文本编码序列,因此在解码层D 1中,可以将拓展后的目标文本编码序列作为历史语音解码序列,将第三注意力矩阵设置为全零矩阵,其计算过程与上述解码层D j+1的计算过程一致,这里不再进行赘述。
进一步,可以根据第四注意力矩阵以及历史语音解码序列生成解码层D j+1的目标语音解码序列,具体过程为:将第四注意力矩阵与历史语音解码序列进行相乘,得到第一中间解码序列(具体计算公式可以参见上述计算第一中间编码序列的公式),进而对第一中间解码序列和历史语音解码序列进行残差连接以及归一化处理,得到第二中间解码序列,再将第二中间解码序列输入解码层D j+1中的第二卷积网络,通过第二卷积网络可以输出第三中间解码序列,再次对第三中间解码序列和第二中间解码序列进行残差连接以及归一化处理,最终得到解码层D j+1的当前语音解码序列。可以理解,当上述当前语音解码序列为第N层解码层(即最后一层解码层)输出的语音解码序列时,为了便于区分,可以将当前语音解码序列确定为目标语音解码序列。可选的,上述第二卷积网络可以由具有ReLU激活函数的两层一维卷积网络构成。
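由于解码层与编码层的网络结构一致,解码器可以直接复用上文编码层草图进行堆叠,示意如下(层数仅为示例假设,并非本申请的官方实现):

decoder_layers = nn.ModuleList([ResidualAttentionEncoderLayer() for _ in range(4)])   # 解码层结构与编码层相同

def decode(expanded_encoder_out):
    x, attn = expanded_encoder_out, None    # 第一层解码层以拓展后的目标文本编码序列作为历史语音解码序列
    for layer in decoder_layers:
        x, attn = layer(x, attn)            # 第三/第四注意力矩阵逐层传递,实现残差式注意力连接
    return x                                # 目标语音解码序列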
可以理解,与编码器一样,解码器的解码过程也是并行的。需要说明的是,由于经典的基于Transformer的语音合成声学模型中的编码器和解码器的结构与本申请中的编码器或解码器的结构是类似的,因此本申请提供的方法可以自然拓展到任何基于Transformer的语音合成声学模型中,包括自回归的Transformer声学模型。
S204,将目标语音解码序列输入第一线性输出层,在第一线性输出层中,对目标语音解码序列进行线性变换,得到声学特征序列;
具体的,如图4a所示,将解码器并行解码出来的目标语音解码序列输入第一线性输出层,通过第一线性输出层对目标语音解码序列进行线性变换,从而可以得到文本输入序列对应的声学特征序列,在本申请实施例中,声学特征序列具体可以为梅尔频谱图(Mel-Spectrogram)序列。
S205,对声学特征序列进行声学特征转换,得到与文本输入序列相匹配的合成语音数据。
具体的,计算机设备可以使用预先训练好的声码器(Vocoder)对声学特征序列进行声学特征转换,即将声学特征序列转换为与文本输入序列相匹配的合成语音数据。其中,声码器具体可以为WaveGlow网络(一种依靠流的从梅尔频谱图合成高质量语音的网络),可以实现并行化的语音合成,或者可以为SqueezeWave网络(一种可用于移动端语音合成的轻量级的流模型),可以有效提升语音合成的速度,或者还可以使用诸如Griffin-Lim,WaveNet,Parallel的声码器从声学特征序列合成语音,可以根据实际需要选取合适的声码器,本申请实施例对此不做限制。
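将上述各模块串联起来,推理阶段从文本到合成语音的主干流程大致如下(仅为示意性草图:梅尔频谱维度取80为常见假设,vocoder表示一个假设的、已训练好的声码器可调用对象,并非特定库的真实接口):

duration_predictor = DurationPredictor()
mel_linear = nn.Linear(256, 80)                   # 第一线性输出层:将目标语音解码序列映射为梅尔频谱图序列

def synthesize(text_ids, vocoder):
    feats = token_embedding(text_ids)             # 文本特征表示序列
    enc = encode(feats)                           # 目标文本编码序列
    dur = torch.clamp(torch.round(duration_predictor(enc)), min=1).long()   # 预测时长序列(取整)
    mel = mel_linear(decode(length_regulate(enc, dur)))                     # 声学特征序列(梅尔频谱图)
    return vocoder(mel)                           # 声学特征转换,得到合成语音数据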
对于合成语音数据的声音质量,可以采用MOS(Mean Opinion Score,主观平均意见值)测试进行评估,MOS指标用来衡量声音接近人声的自然度和音质,而本申请实施例提供的方法有效提高了合成语音的清晰度和自然度,其音质可以与自回归的Transformer TTS和Tacotron2相媲美。
此外,传统的自回归声学模型会自动一个接一个地生成梅尔频谱图,而没有明确利用文本和语音之间的对齐方式,以至于在自回归生成中通常很难直接控制合成语音的速度和韵律,而本申请实施例采用非自回归(non auto-regressive)的序列到序列(seq-to-seq)模型,不需要依赖上一个时间步的输入,可以让整个模型真正地并行化,还可以支持显式地控制合成语音数据的语速或者韵律停顿,具体的,引入语音调节参数α(长度调节机制),用户可以通过调节语音调节参数α来调节合成语音数据的语速或者韵律,在长度调节器中,首先获取语音调节参数,进而可以根据语音调节参数对上述预测时长序列进行更新,得到更新后的预测时长序列,进一步可以根据更新后的预测时长序列,调节合成语音数据的语速或韵律。也就是说,可以等比例地延长或者缩短字符/音素的持续时间,用于控制合成语音的速度,从而确定生成的梅尔频谱图的长度,还可以通过调整句子中空格字符的持续时间来控制单词之间的停顿,即在相邻字符/音素之间添加间隔,从而实现调整合成语音的部分韵律。
请一并参见图6,是本申请实施例提供的一种语音调节的场景示意图。如图6所示,针对调整一个英文单词“speech”的合成语音的速度的场景,单词“speech”对应的音素序列为P=[s p iy ch],可用于表示对应的发音,若设置语音调节参数α=1,则其预测时长序列为D1=[2,2,3,1],因此单词“speech”对应的音素序列可以拓展为P1=[s s p p iy iy iy ch],即表示当前单词“speech”对应的语速为1倍速(即正常语速)。如果用户希望合成更快速的语音,则可以把语音调节参数α调小一点,例如将语音调节参数α设置为0.5时,对应的预测时长序列会更新为D2=0.5*[2,2,3,1]=[1,1,1.5,0.5],四舍五入得到D2=[1,1,2,1],单词“speech”对应的音素序列则相应更新为P2=[s p iy iy ch],与上述序列P1相比,序 列P2更短,因此可以实现更快速的发音。如果用户希望合成慢速的语音,则可以把语音调节参数α调大一点,例如将语音调节参数α设置为2时,对应的预测时长序列会更新为D3=2*[2,2,3,1]=[4,4,6,2],单词“speech”对应的音素序列则相应更新为P3=[s s s s p p p p iy iy iy iy iy iy ch ch],与上述序列P1相比,序列P3更长,因此可以实现更慢的发音。
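上述通过语音调节参数α更新预测时长序列的操作,可以用如下示意代码表达(函数名为示例假设,沿用前文草图中的import):

def adjust_speed(durations, alpha=1.0):
    # α小于1时语速更快,大于1时语速更慢;按四舍五入取整,并保证每个字符/音素至少保留1帧
    scaled = torch.floor(durations.float() * alpha + 0.5)
    return torch.clamp(scaled, min=1).long()

# 例如对D1=[2, 2, 3, 1]:α=0.5时得到[1, 1, 2, 1],α=2时得到[4, 4, 6, 2],与上文示例一致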
可以理解的是,本申请实施例中所示的数字均为虚构数字,实际应用时,应以实际数字为准。
本申请实施例基于深度神经网络,可以将文本输入序列转换为文本特征表示序列,进而可以将该文本特征表示序列输入包含N层编码层的编码器,在该编码器中,计算当前编码层的注意力矩阵时,可以根据前一层编码层输出的第一注意力矩阵与当前编码层中的多头自注意力网络之间的残差连接以及前一层编码层输出的历史文本编码序列,生成当前编码层的第二注意力矩阵,进一步,可以根据得到的第二注意力矩阵以及历史文本编码序列生成当前编码层的目标文本编码序列,最终可以基于该目标文本编码序列,通过长度调节、解码、线性变换、声学特征转换等处理过程,生成与上述文本输入序列相匹配的合成语音数据。由此可见,在合成语音数据的过程中,本申请实施例可以充分利用每一层网络的计算结果,将残差放到注意力矩阵中,即对每一层注意力矩阵进行残差连接,从而使得每一层的注意力矩阵能够互通,有效加速了模型的收敛,同时,也使得每一层网络的注意力矩阵趋于一致性,从而可以提升合成语音的清晰度和稳定性。与现有语音合成方案相比,本申请实施例合成语音的清晰度和自然度更好,合成语音的频谱细节上看也更清晰,另外可以很好地缓解现有方案中存在的发音错误、语调错误和韵律不自然的问题。
请参见图7,是本申请实施例提供的一种语音合成方法的流程示意图。该语音合成方法可以由计算机设备执行,计算机设备可以包括如图1所述的终端设备或业务服务器。该语音合成方法至少可以包括以下S301-S304:
S301,将文本样本序列输入初始残差式注意力声学模型,通过初始残差式注意力声学模型将文本样本序列转换为文本特征样本序列;
具体的,计算机设备可以从海量的样本数据中选取一部分用于模型训练,选取出的样本数据作为数据集,该数据集包含有参考语音数据以及相应的文本记录,用来训练模型,此外,数据集中剩余的数据还可以划分为测试集和验证集,分别用于验证模型的泛化性能和调整模型的超参数,这里不进行具体描述。对这些文本记录进行文本预处理,可以得到文本样本序列,进而将文本样本序列输入初始残差式注意力声学模型中的初始向量转换层,预先根据业务需要构建了向量转换表,因此可以通过初始向量转换层在向量转换表中进行查找匹配,进而将与文本样本序列相匹配的特征向量作为文本特征样本序列。
S302,将文本特征样本序列输入初始残差式注意力声学模型中的包含N层初始编码层的初始编码器;N层初始编码层中包括初始编码层X i以及初始编码层X i+1;初始编码层X i+1包括初始多头自注意力网络;获取初始编码层X i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与初始多头自注意力网络之间的残差连接以及历史文本编码序列,生成初始编码层X i+1的第二注意力矩阵;初始编码层X i+1为初始编码层X i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;
具体的,初始残差式注意力声学模型中配置有包含N层初始编码层的初始编码器,N为大于1的整数,可以理解,N可以根据语料规模进行调整,初始编码器中每层初始编码层的结构都是相同的,初始编码层的具体网络结构可以参见上述图4b所示的结构示意图,每层初始编码层都包括一个多头自注意力网络以及一个一维卷积网络(每层初始编码层中的多头自注意力网络均可以称之为初始多头自注意力网络),且这两层网络均采用一个残差连接,其中,每个多头自注意力网络均包括至少两个单头自注意力网络(初始多头自注意力网络中的每个单头自注意力网络均可以称之为初始单头自注意力网络)。假设N层初始编码层中包括初始编码层X i以及初始编码层X i+1,且初始编码层X i+1为初始编码层X i的下一层编码层,其中,i为正整数,且i小于N。为了得到初始编码层X i+1的编码结果,首先需要利用前一层初始编码层的输出结果生成初始编码层X i+1的注意力矩阵,称为第二注意力矩阵,具体过程为:获取初始编码层X i输出的第一注意力矩阵以及历史文本编码序列,进而可以根据历史文本编码序列对匹配矩阵进行初始化,其中,匹配矩阵包括初始多头自注意力网络对应的Query矩阵、Key矩阵以及Value矩阵,对于每个初始单头自注意力网络来说,其对应的Query矩阵、Key矩阵以及Value矩阵均等于初始编码层X i输出的历史文本编码序列。进一步,可以获取初始多头自注意力网络对应的映射矩阵,映射矩阵用于将上述匹配矩阵映射为不同的形式,可以理解,映射矩阵同样包括三个不同的矩阵,分别为Query矩阵对应的映射矩阵W Q、Key矩阵对应的映射矩阵W K以及Value矩阵对应的映射矩阵W V,其中,映射矩阵W Q、映射矩阵W K、映射矩阵W V均可以经过随机初始化后再通过相关网络优化得到,因此对于每个初始单头自注意力网络来说,这三个映射矩阵都是不一样的。
进一步,根据上述映射矩阵、匹配矩阵与第一注意力矩阵之间的残差连接,可以计算得到初始编码层X i+1中的每个初始单头自注意力网络分别对应的子注意力矩阵,将所有初始单头自注意力网络对应的子注意力矩阵进行拼接,再进行一次线性变换,即可得到初始编码层X i+1的第二注意力矩阵。
该步骤更具体的处理过程可以参见上述图3所对应实施例中的S102,这里不再进行赘述。
需要说明的是,对于第一层初始编码层X 1,其输入数据为初始向量转换层输出的文本特征样本序列,因此在初始编码层X 1中,可以将文本特征样本序列作为历史文本编码序列,将第一注意力矩阵设置为全零矩阵,其计算过程与上述初始编码层X i+1的计算过程一致,这里不再进行赘述。
S303,根据第二注意力矩阵以及历史文本编码序列生成初始编码层X i+1的目标文本编码序列,基于目标文本编码序列生成与文本样本序列相匹配的预测语音数据;
具体的,计算机设备可以将上述S302得到的第二注意力矩阵与历史文本编码序列进行相乘,得到第一中间编码序列,对第一中间编码序列和历史文本编码序列进行残差连接以及归一化处理后,可以得到第二中间编码序列,进而可以将第二中间编码序列输入初始编码层X i+1中的初始卷积网络,通过初始卷积网络可以输出第三中间编码序列,再次对第三中间编码序列和第二中间编码序列进行残差连接以及归一化处理,最终得到初始编码层X i+1的当前文本编码序列。可以理解,当上述当前文本编码序列为第N层初始编码层(即最后一层初始编码层)输出的文本编码序列时,为了便于区分,可以将当前文本编码序列确定 为目标文本编码序列。其中,上述初始卷积网络可以由具有ReLU激活函数或其它激活函数(例如Sigmod函数、Tanh函数等)的两层一维卷积网络构成,本申请实施例对此不做限制。
经上述初始编码器并行编码得到目标文本编码序列后,进一步,可以基于目标文本编码序列生成与文本样本序列相匹配的合成语音数据:目标文本编码序列会依次经过初始残差式注意力声学模型中的初始时长预测器、初始长度调节器、包含N层初始解码层的初始解码器以及初始线性输出层,然后输出声学特征序列,使用初始声码器对声学特征序列进行声学特征转换,可以得到预测语音数据。生成预测语音数据的具体过程可以参考上述图5所对应的实施例,这里不再进行赘述。其中,初始解码器的网络结构与上述初始编码器的网络结构是相同的。
S304,根据预测语音数据以及参考语音数据生成语音损失函数,通过语音损失函数对初始残差式注意力声学模型中的模型参数进行修正,得到残差式注意力声学模型;残差式注意力声学模型用于生成与文本输入序列相匹配的合成语音数据。
具体的,计算机设备可以根据预测语音数据以及文本样本序列对应的参考语音数据生成语音损失函数(例如可以是均方误差损失函数),用于表示合成的预测语音数据与真实的参考语音数据之间的差距,进而可以通过该语音损失函数对初始残差式注意力声学模型中的模型参数进行修正,得到训练好的残差式注意力声学模型。其中,残差式注意力声学模型用于生成与文本输入序列相匹配的合成语音数据,该残差式注意力声学模型可以包括训练好的向量转换层、编码器、时长预测器、长度调节器、解码器、线性输出层以及声码器,需要说明的是,时长预测器和声码器既可以作为模型的一部分,又可以作为独立于模型的模块,当它们作为独立模块时,可以与残差式注意力声学模型进行端到端的协同训练。
请一并参见图8,是本申请实施例提供的一种模型训练的流程示意图。如图8所示,该流程示意图主要包括两个部分,第一部分是数据准备,包括文本预处理、声学特征提取、音素时长信息提取;第二部分利用给定的数据(包括文本预处理后得到的文本样本序列以及进行音素时长信息提取得到的时长信息)训练基于残差式注意力的并行语音合成声学模型(即残差式注意力声学模型),实现高精度的并行声学模型建模。其中,将文本样本序列以及参考语音数据输入到模型进行训练,可获得编码器-解码器注意对齐,进而可用于训练时长预测器。
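训练阶段的损失计算可以参考如下示意草图(此处以梅尔频谱与时长两项均方误差之和为例,具体的损失组合方式仅为一种常见假设,沿用前文草图中的import,并非本申请限定的实现):

def speech_loss(pred_mel, ref_mel, pred_dur, ref_dur):
    mel_loss = F.mse_loss(pred_mel, ref_mel)                     # 预测语音与参考语音声学特征之间的差距
    dur_loss = F.mse_loss(pred_dur.float(), ref_dur.float())     # 预测时长与从参考语音提取的时长之间的差距
    return mel_loss + dur_loss                                   # 语音损失,用于修正模型参数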
本申请实施例提供了一种基于残差式注意力的并行语音合成声学模型建模方法,通过将文本样本序列以及参考语音数据共同组成配对数据输入初始残差式注意力声学模型中进行训练,可以得到与文本样本序列相匹配的预测语音数据,进一步,为了提升模型合成语音的准确率和效率,可以根据由预测语音数据和参考语音数据生成的语音损失函数来对初始残差式注意力声学模型中的模型参数进行修正,从而可以得到一个高精度的残差式注意力声学模型,该声学模型可以准确、稳定、高效地进行声学参数预测。此外,通过该建模方法得到的声学模型可以运用于任何需要将文字转换为语音的场景,与现有语音合成方案相比,该声学模型合成语音的清晰度和自然度更好,合成语音的频谱细节上看也更清晰,另外可以很好地缓解现有方案中存在的发音错误、语调错误和韵律不自然的问题,且可以自然拓展到任意语言、方言、说话人以及自适应相关的语音合成任务,对其中用到的 Transformer结构进行改进,具有很好的可拓展性。
请参见图9,是本申请实施例提供的一种语音合成装置的结构示意图。该语音合成装置可以是运行于计算机设备的一个计算机程序(包括程序代码),例如该语音合成装置为一个应用软件;该装置可以用于执行本申请实施例提供的语音合成方法中的相应步骤。如图9所示,该语音合成装置1可以包括:转换模块11、矩阵生成模块12、语音合成模块13;
转换模块11,用于将文本输入序列转换为文本特征表示序列;
上述转换模块11,具体用于将文本输入序列输入向量转换层,通过向量转换层在向量转换表中进行查找,将与文本输入序列相匹配的特征向量作为文本特征表示序列;向量转换表包括字符或音素与特征向量之间的映射关系;
矩阵生成模块12,用于将文本特征表示序列输入包含N层编码层的编码器;N层编码层中包括编码层E i以及编码层E i+1;编码层E i+1包括第一多头自注意力网络;获取编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与第一多头自注意力网络之间的残差连接以及历史文本编码序列,生成编码层E i+1的第二注意力矩阵;编码层E i+1为编码层E i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;
语音合成模块13,用于根据第二注意力矩阵以及历史文本编码序列生成编码层E i+1的目标文本编码序列,基于目标文本编码序列生成与文本输入序列相匹配的合成语音数据。
其中,转换模块11的具体功能实现方式可以参见上述图3所对应实施例中的S101,矩阵生成模块12的具体功能实现方式可以参见上述图3所对应实施例中的S102,语音合成模块13的具体功能实现方式可以参见上述图3所对应实施例中的S103以及上述图5所对应实施例中的S201-S205,这里不再进行赘述。
请一并参见图9,该语音合成装置1还可以包括:语音调节模块14;
语音调节模块14,用于获取语音调节参数,根据语音调节参数对预测时长序列进行更新,得到更新后的预测时长序列;根据更新后的预测时长序列,调节合成语音数据的语速或韵律。
其中,语音调节模块14的具体功能实现方式可以参见上述图3所对应实施例中的S103,这里不再进行赘述。
在一种实施方式中,第一多头自注意力网络包括至少两个第一单头自注意力网络;
请一并参见图9,上述矩阵生成模块12可以包括:第一矩阵生成单元121、第二矩阵生成单元122;
第一矩阵生成单元121,用于获取编码层E i输出的第一注意力矩阵以及历史文本编码序列;历史文本编码序列包括至少两个第一单头自注意力网络分别对应的第一匹配矩阵;获取第一多头自注意力网络对应的第一映射矩阵,根据第一映射矩阵、第一匹配矩阵与第一注意力矩阵之间的残差连接,生成至少两个第一单头自注意力网络分别对应的子注意力矩阵;
第二矩阵生成单元122,用于将至少两个子注意力矩阵进行拼接,得到编码层E i+1的第二注意力矩阵。
其中,第一矩阵生成单元121以及第二矩阵生成单元122的具体功能实现方式可以参见 上述图3所对应实施例中的S102,这里不再进行赘述。
在一种实施方式中,编码层E i+1包括第一卷积网络;
请一并参见图9,上述语音合成模块13可以包括:编码单元131、语音生成单元132;
编码单元131,用于将第二注意力矩阵以及历史文本编码序列进行相乘,得到第一中间编码序列;对第一中间编码序列和历史文本编码序列进行残差连接以及归一化处理,得到第二中间编码序列,将第二中间编码序列输入第一卷积网络,得到第三中间编码序列;对第三中间编码序列和第二中间编码序列进行残差连接以及归一化处理,得到编码层E i+1的当前文本编码序列;当当前文本编码序列为第N层编码层的文本编码序列时,将当前文本编码序列确定为目标文本编码序列;
语音生成单元132,用于将目标文本编码序列输入时长预测器,获取文本输入序列对应的预测时长序列;将目标文本编码序列输入长度调节器,在长度调节器中,根据预测时长序列对目标文本编码序列进行序列长度拓展,得到拓展后的目标文本编码序列;将拓展后的目标文本编码序列输入包含N层解码层的解码器,生成目标语音解码序列;将目标语音解码序列输入第一线性输出层,在第一线性输出层中,对目标语音解码序列进行线性变换,得到声学特征序列;对声学特征序列进行声学特征转换,得到与文本输入序列相匹配的合成语音数据。
其中,编码单元131的具体功能实现方式可以参见上述图3所对应实施例中的S103,语音生成单元132的具体功能实现方式可以参见上述图5所对应实施例中的S201-S205,这里不再进行赘述。
在一种实施方式中,上述N层解码层中包括解码层D j以及解码层D j+1,解码层D j+1为解码层D j的下一层解码层,j为正整数,且j小于N;解码层D j+1包括第二多头自注意力网络;
请一并参见图9,上述语音生成单元132可以包括:矩阵生成子单元1321、解码子单元1322、时长预测子单元1323以及序列拓展子单元1324;
矩阵生成子单元1321,用于获取解码层D j输出的第三注意力矩阵以及历史语音解码序列,根据第三注意力矩阵与第二多头自注意力网络之间的残差连接以及历史语音解码序列,生成解码层D j+1的第四注意力矩阵;
在一种实施方式中,第二多头自注意力网络包括至少两个第二单头自注意力网络;
上述矩阵生成子单元1321,具体用于获取解码层D j输出的第三注意力矩阵以及历史语音解码序列;历史语音解码序列包括至少两个第二单头自注意力网络分别对应的第二匹配矩阵;获取第二多头自注意力网络对应的第二映射矩阵,根据第二映射矩阵、第二匹配矩阵与第三注意力矩阵之间的残差连接,生成至少两个第二单头自注意力网络分别对应的子注意力矩阵;将至少两个子注意力矩阵进行拼接,得到解码层D j+1的第四注意力矩阵;
解码子单元1322,用于根据第四注意力矩阵以及历史语音解码序列生成解码层D j+1的目标语音解码序列;若解码层D j为第一层解码层,则解码层D j的历史语音解码序列为拓展后的目标文本编码序列;
在一种实施方式中,解码层D j+1包括第二卷积网络;
上述解码子单元1322,具体用于将第四注意力矩阵以及历史语音解码序列进行相乘, 得到第一中间解码序列;对第一中间解码序列和历史语音解码序列进行残差连接以及归一化处理,得到第二中间解码序列,将第二中间解码序列输入第二卷积网络,得到第三中间解码序列;对第三中间解码序列和第二中间解码序列进行残差连接以及归一化处理,得到解码层D j+1的当前语音解码序列;当当前语音解码序列为第N层解码层的语音解码序列时,将当前语音解码序列确定为目标语音解码序列;
时长预测子单元1323,用于将目标文本编码序列输入时长预测器中的两层一维卷积网络,得到时长特征;将时长特征输入第二线性输出层,通过第二线性输出层对时长特征进行线性变换,得到文本输入序列对应的预测时长序列。
在一种实施方式中,目标文本编码序列包括至少两个编码向量;预测时长序列包括至少两个时长参数;
序列拓展子单元1324,用于将目标文本编码序列输入长度调节器,在长度调节器中,根据预测时长序列中的至少两个时长参数对至少两个编码向量进行复制,得到复制编码向量;将复制编码向量与目标文本编码序列进行拼接,得到拓展后的目标文本编码序列;拓展后的目标文本编码序列的序列长度与至少两个时长参数的总和相等。
其中,矩阵生成子单元1321以及解码子单元1322的具体功能实现方式可以参见上述图5所对应实施例中的S203,时长预测子单元1323的具体功能实现方式可以参见上述图5所对应实施例中的S201,序列拓展子单元1324的具体功能实现方式可以参见上述图5所对应实施例中的S202,这里不再进行赘述。
本申请实施例可以将文本输入序列转换为文本特征表示序列,进而可以将该文本特征表示序列输入包含N层编码层的编码器,在该编码器中,计算当前编码层的注意力矩阵时,可以根据前一层编码层输出的第一注意力矩阵与当前编码层中的多头自注意力网络之间的残差连接以及前一层编码层输出的历史文本编码序列,生成当前编码层的第二注意力矩阵,进一步,可以根据得到的第二注意力矩阵以及历史文本编码序列生成当前编码层的目标文本编码序列,最终可以基于该目标文本编码序列生成与上述文本输入序列相匹配的合成语音数据。由此可见,在合成语音数据的过程中,本申请实施例可以充分利用每一层网络的计算结果,将残差放到注意力矩阵中,即对每一层注意力矩阵进行残差连接,从而使得每一层的注意力矩阵能够互通,有效加速了模型的收敛,同时,也使得每一层网络的注意力矩阵趋于一致性,从而可以提升合成语音的清晰度和稳定性。
请参见图10,是本申请实施例提供的一种语音合成装置的结构示意图。该语音合成装置可以是运行于计算机设备的一个计算机程序(包括程序代码),例如该语音合成装置为一个应用软件;该装置可以用于执行本申请实施例提供的语音合成方法中的相应步骤。如图10所示,该语音合成装置2可以包括:转换模块21、矩阵生成模块22、语音合成模块23以及修正模块24;
转换模块21,用于将文本样本序列输入初始残差式注意力声学模型,通过初始残差式注意力声学模型将文本样本序列转换为文本特征样本序列;
矩阵生成模块22,用于将文本特征样本序列输入初始残差式注意力声学模型中的包含N层初始编码层的初始编码器;N层初始编码层中包括初始编码层X i以及初始编码层X i+1; 初始编码层X i+1包括初始多头自注意力网络;获取初始编码层X i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与初始多头自注意力网络之间的残差连接以及历史文本编码序列,生成初始编码层X i+1的第二注意力矩阵;初始编码层X i+1为初始编码层X i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;
语音合成模块23,用于根据第二注意力矩阵以及历史文本编码序列生成初始编码层X i+1的目标文本编码序列,基于目标文本编码序列生成与文本样本序列相匹配的预测语音数据;
修正模块24,用于根据预测语音数据以及参考语音数据生成语音损失函数,通过语音损失函数对初始残差式注意力声学模型中的模型参数进行修正,得到残差式注意力声学模型;残差式注意力声学模型用于生成与文本输入序列相匹配的合成语音数据。
其中,转换模块21的具体功能实现方式可以参见上述图7所对应实施例中的S301,矩阵生成模块22的具体功能实现方式可以参见上述图7所对应实施例中的S302,语音合成模块23的具体功能实现方式可以参见上述图7所对应实施例中的S303,修正模块24的具体功能实现方式可以参见上述图7所对应实施例中的S304,这里不再进行赘述。
本申请实施例提供了一种基于残差式注意力的并行语音合成声学模型建模方法,通过将文本样本序列以及参考语音数据共同组成配对数据输入初始残差式注意力声学模型中进行训练,可以得到与文本样本序列相匹配的预测语音数据,进一步,为了提升模型合成语音的准确率和效率,可以根据由预测语音数据和参考语音数据生成的语音损失函数来对初始残差式注意力声学模型中的模型参数进行修正,从而可以得到一个高精度的残差式注意力声学模型,该声学模型可以准确、稳定、高效地进行声学参数预测。此外,通过该建模方法得到的声学模型可以运用于任何需要将文字转换为语音的场景,与现有语音合成方案相比,该声学模型合成语音的清晰度和自然度更好,合成语音的频谱细节上看也更清晰,另外可以很好地缓解现有方案中存在的发音错误、语调错误和韵律不自然的问题,且可以自然拓展到任意语言、方言、说话人以及自适应相关的语音合成任务,对其中用到的Transformer结构进行改进,具有很好的可拓展性。
请参见图11,是本申请实施例提供的一种计算机设备的结构示意图。如图11所示,该计算机设备1000可以包括:处理器1001,网络接口1004和存储器1005,此外,上述计算机设备1000还可以包括:用户接口1003,和至少一个通信总线1002。其中,通信总线1002用于实现这些组件之间的连接通信。其中,用户接口1003可以包括显示屏(Display)、键盘(Keyboard),可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器,也可以是非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图11所示,作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在如图11所示的计算机设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以实现:
将文本输入序列转换为文本特征表示序列;
将文本特征表示序列输入包含N层编码层的编码器;N层编码层中包括编码层E i以及编码层E i+1,编码层E i+1为编码层E i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;编码层E i+1包括第一多头自注意力网络;
获取编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与第一多头自注意力网络之间的残差连接以及历史文本编码序列,生成编码层E i+1的第二注意力矩阵;
根据第二注意力矩阵以及历史文本编码序列生成编码层E i+1的目标文本编码序列,基于目标文本编码序列生成与文本输入序列相匹配的合成语音数据。
应当理解,本申请实施例中所描述的计算机设备1000可执行前文图3、图5任一个所对应实施例中对该语音合成方法的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
请参见图12,是本申请实施例提供的一种计算机设备的结构示意图。如图12所示,该计算机设备2000可以包括:处理器2001,网络接口2004和存储器2005,此外,上述计算机设备2000还可以包括:用户接口2003,和至少一个通信总线2002。其中,通信总线2002用于实现这些组件之间的连接通信。其中,用户接口2003可以包括显示屏(Display)、键盘(Keyboard),可选用户接口2003还可以包括标准的有线接口、无线接口。网络接口2004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器2005可以是高速RAM存储器,也可以是非易失性存储器(non-volatile memory),例如至少一个磁盘存储器。存储器2005可选的还可以是至少一个位于远离前述处理器2001的存储装置。如图12所示,作为一种计算机可读存储介质的存储器2005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在如图12所示的计算机设备2000中,网络接口2004可提供网络通讯功能;而用户接口2003主要用于为用户提供输入的接口;而处理器2001可以用于调用存储器2005中存储的设备控制应用程序,以实现:
将文本样本序列输入初始残差式注意力声学模型,通过初始残差式注意力声学模型将文本样本序列转换为文本特征样本序列;
将文本特征样本序列输入初始残差式注意力声学模型中的包含N层初始编码层的初始编码器;N层初始编码层中包括初始编码层X i以及初始编码层X i+1,初始编码层X i+1为初始编码层X i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;初始编码层X i+1包括初始多头自注意力网络;
获取初始编码层X i输出的第一注意力矩阵以及历史文本编码序列,根据第一注意力矩阵与初始多头自注意力网络之间的残差连接以及历史文本编码序列,生成初始编码层X i+1的第二注意力矩阵;
根据第二注意力矩阵以及历史文本编码序列生成初始编码层X i+1的目标文本编码序列,基于目标文本编码序列生成与文本样本序列相匹配的预测语音数据;
根据预测语音数据以及参考语音数据生成语音损失函数,通过语音损失函数对初始残差式注意力声学模型中的模型参数进行修正,得到残差式注意力声学模型;残差式注意力声学模型用于生成与文本输入序列相匹配的合成语音数据。
应当理解,本申请实施例中所描述的计算机设备2000可执行前文图7所对应实施例中对上述语音合成方法的描述,在此不再赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。
此外,这里需要指出的是:本申请实施例还提供了一种计算机可读存储介质,且上述计算机可读存储介质中存储有前文提及的语音合成装置1和语音合成装置2所执行的计算机程序,且上述计算机程序包括程序指令,当上述处理器执行上述程序指令时,能够执行前文图3、图5、图7任一个所对应实施例中对语音合成方法的描述,因此,这里将不再进行赘述。另外,对采用相同方法的有益效果描述,也不再进行赘述。对于本申请所涉及的计算机可读存储介质实施例中未披露的技术细节,请参照本申请方法实施例的描述。
本申请实施例还提供了一种包括指令的计算机程序产品,当其在计算机上运行时,使得计算机执行上述实施例提供的方法。
上述计算机可读存储介质可以是前述任一实施例提供的语音合成装置或者上述计算机设备的内部存储单元,例如计算机设备的硬盘或内存。该计算机可读存储介质也可以是该计算机设备的外部存储设备,例如该计算机设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。进一步地,该计算机可读存储介质还可以既包括该计算机设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该计算机设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
此外,这里需要指出的是:本申请实施例还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。计算机设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行前文图3、图5、图7任一个所对应实施例提供的方法。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
本申请实施例提供的方法及相关装置是参照本申请实施例提供的方法流程图和/或结构示意图来描述的,具体可由计算机程序指令实现方法流程图和/或结构示意图的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。这些计算机程序指令可提供到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或结构示意图一个方框或多个方框中指定的功能的装置。这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作 的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或结构示意图一个方框或多个方框中指定的功能。这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或结构示意一个方框或多个方框中指定的功能的步骤。
以上所揭露的仅为本申请较佳实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (16)

  1. 一种语音合成方法,所述方法由计算机设备执行,所述方法包括:
    将文本输入序列转换为文本特征表示序列;
    将所述文本特征表示序列输入包含N层编码层的编码器;所述N层编码层中包括编码层E i以及编码层E i+1,所述编码层E i+1为所述编码层E i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;所述编码层E i+1包括第一多头自注意力网络;
    获取所述编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据所述第一注意力矩阵与所述第一多头自注意力网络之间的残差连接以及所述历史文本编码序列,生成所述编码层E i+1的第二注意力矩阵;
    根据所述第二注意力矩阵以及所述历史文本编码序列生成所述编码层E i+1的目标文本编码序列,基于所述目标文本编码序列生成与所述文本输入序列相匹配的合成语音数据。
  2. 根据权利要求1所述的方法,所述第一多头自注意力网络包括至少两个第一单头自注意力网络;
    所述获取所述编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据所述第一注意力矩阵与所述第一多头自注意力网络之间的残差连接以及所述历史文本编码序列,生成所述编码层E i+1的第二注意力矩阵,包括:
    获取所述编码层E i输出的第一注意力矩阵以及历史文本编码序列;所述历史文本编码序列包括所述至少两个第一单头自注意力网络分别对应的第一匹配矩阵;
    获取所述第一多头自注意力网络对应的第一映射矩阵,根据所述第一映射矩阵、所述第一匹配矩阵与所述第一注意力矩阵之间的残差连接,生成所述至少两个第一单头自注意力网络分别对应的子注意力矩阵;
    将至少两个子注意力矩阵进行拼接,得到所述编码层E i+1的第二注意力矩阵。
  3. 根据权利要求1所述的方法,所述编码层E i+1包括第一卷积网络;
    所述根据所述第二注意力矩阵以及所述历史文本编码序列生成所述编码层E i+1的目标文本编码序列,包括:
    将所述第二注意力矩阵以及所述历史文本编码序列进行相乘,得到第一中间编码序列;
    对所述第一中间编码序列和所述历史文本编码序列进行残差连接以及归一化处理,得到第二中间编码序列,将所述第二中间编码序列输入所述第一卷积网络,得到第三中间编码序列;
    对所述第三中间编码序列和所述第二中间编码序列进行残差连接以及归一化处理,得到所述编码层E i+1的当前文本编码序列;
    当所述当前文本编码序列为第N层编码层的文本编码序列时,将所述当前文本编码序列确定为目标文本编码序列。
  4. 根据权利要求1所述的方法,所述基于所述目标文本编码序列生成与所述文本输入序列相匹配的合成语音数据,包括:
    将所述目标文本编码序列输入时长预测器,获取所述文本输入序列对应的预测时长序列;
    将所述目标文本编码序列输入长度调节器,在所述长度调节器中,根据所述预测时长序列对所述目标文本编码序列进行序列长度拓展,得到拓展后的目标文本编码序列;
    将所述拓展后的目标文本编码序列输入包含N层解码层的解码器,生成目标语音解码序列;
    将所述目标语音解码序列输入第一线性输出层,在所述第一线性输出层中,对所述目标语音解码序列进行线性变换,得到声学特征序列;
    对所述声学特征序列进行声学特征转换,得到与所述文本输入序列相匹配的合成语音数据。
  5. 根据权利要求4所述的方法,所述N层解码层中包括解码层D j以及解码层D j+1,所述解码层D j+1为所述解码层D j的下一层解码层,j为正整数,且j小于N;所述解码层D j+1包括第二多头自注意力网络;
    所述将所述拓展后的目标文本编码序列输入包含N层解码层的解码器,生成目标语音解码序列,包括:
    获取所述解码层D j输出的第三注意力矩阵以及历史语音解码序列,根据所述第三注意力矩阵与所述第二多头自注意力网络之间的残差连接以及所述历史语音解码序列,生成所述解码层D j+1的第四注意力矩阵;
    根据所述第四注意力矩阵以及所述历史语音解码序列生成所述解码层D j+1的目标语音解码序列;若所述解码层D j为第一层解码层,则所述解码层D j的历史语音解码序列为所述拓展后的目标文本编码序列。
  6. 根据权利要求5所述的方法,所述第二多头自注意力网络包括至少两个第二单头自注意力网络;
    所述获取所述解码层D j输出的第三注意力矩阵以及历史语音解码序列,根据所述第三注意力矩阵与所述第二多头自注意力网络之间的残差连接以及所述历史语音解码序列,生成所述解码层D j+1的第四注意力矩阵,包括:
    获取所述解码层D j输出的第三注意力矩阵以及历史语音解码序列;所述历史语音解码序列包括所述至少两个第二单头自注意力网络分别对应的第二匹配矩阵;
    获取所述第二多头自注意力网络对应的第二映射矩阵,根据所述第二映射矩阵、所述第二匹配矩阵与所述第三注意力矩阵之间的残差连接,生成所述至少两个第二单头自注意力网络分别对应的子注意力矩阵;
    将至少两个子注意力矩阵进行拼接,得到所述解码层D j+1的第四注意力矩阵。
  7. 根据权利要求5所述的方法,所述解码层D j+1包括第二卷积网络;
    所述根据所述第四注意力矩阵以及所述历史语音解码序列生成所述解码层D j+1的目标语音解码序列,包括:
    将所述第四注意力矩阵以及所述历史语音解码序列进行相乘,得到第一中间解码序列;
    对所述第一中间解码序列和所述历史语音解码序列进行残差连接以及归一化处理,得到第二中间解码序列,将所述第二中间解码序列输入所述第二卷积网络,得到第三中间解码序列;
    对所述第三中间解码序列和所述第二中间解码序列进行残差连接以及归一化处理,得到所述解码层D j+1的当前语音解码序列;
    当所述当前语音解码序列为第N层解码层的语音解码序列时,将所述当前语音解码序列确定为目标语音解码序列。
  8. 根据权利要求4所述的方法,所述时长预测器包括两层一维卷积网络以及第二线性输出层;
    所述将所述目标文本编码序列输入时长预测器,获取所述文本输入序列对应的预测时长序列,包括:
    将所述目标文本编码序列输入所述时长预测器中的所述两层一维卷积网络,得到时长特征;
    将所述时长特征输入所述第二线性输出层,通过所述第二线性输出层对所述时长特征进行线性变换,得到所述文本输入序列对应的预测时长序列。
  9. 根据权利要求4所述的方法,所述目标文本编码序列包括至少两个编码向量;所述预测时长序列包括至少两个时长参数;
    所述将所述目标文本编码序列输入长度调节器,在所述长度调节器中,根据所述预测时长序列对所述目标文本编码序列进行序列长度拓展,得到拓展后的目标文本编码序列,包括:
    将所述目标文本编码序列输入长度调节器,在所述长度调节器中,根据所述预测时长序列中的所述至少两个时长参数对所述至少两个编码向量进行复制,得到复制编码向量;
    将所述复制编码向量与所述目标文本编码序列进行拼接,得到拓展后的目标文本编码序列;所述拓展后的目标文本编码序列的序列长度与所述至少两个时长参数的总和相等。
  10. 根据权利要求4所述的方法,还包括:
    获取语音调节参数,根据所述语音调节参数对所述预测时长序列进行更新,得到更新后的预测时长序列;
    根据所述更新后的预测时长序列,调节所述合成语音数据的语速或韵律。
  11. 根据权利要求1所述的方法,所述将文本输入序列转换为文本特征表示序列,包括:
    将文本输入序列输入向量转换层,通过所述向量转换层在向量转换表中进行查找,将与所述文本输入序列相匹配的特征向量作为文本特征表示序列;所述向量转换表包括字符或音素与特征向量之间的映射关系。
  12. 一种语音合成方法,所述方法由计算机设备执行,所述方法包括:
    将文本样本序列输入初始残差式注意力声学模型,通过所述初始残差式注意力声学模型将所述文本样本序列转换为文本特征样本序列;
    将所述文本特征样本序列输入所述初始残差式注意力声学模型中的包含N层初始编码层的初始编码器;所述N层初始编码层中包括初始编码层X i以及初始编码层X i+1,所述初始编码层X i+1为所述初始编码层X i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;所述初始编码层X i+1包括初始多头自注意力网络;
    获取所述初始编码层X i输出的第一注意力矩阵以及历史文本编码序列,根据所述第一 注意力矩阵与所述初始多头自注意力网络之间的残差连接以及所述历史文本编码序列,生成所述初始编码层X i+1的第二注意力矩阵;
    根据所述第二注意力矩阵以及所述历史文本编码序列生成所述初始编码层X i+1的目标文本编码序列,基于所述目标文本编码序列生成与所述文本样本序列相匹配的预测语音数据;
    根据所述预测语音数据以及参考语音数据生成语音损失函数,通过所述语音损失函数对所述初始残差式注意力声学模型中的模型参数进行修正,得到残差式注意力声学模型;所述残差式注意力声学模型用于生成与文本输入序列相匹配的合成语音数据。
  13. 一种语音合成装置,包括:
    转换模块,用于将文本输入序列转换为文本特征表示序列;
    矩阵生成模块,用于将所述文本特征表示序列输入包含N层编码层的编码器;所述N层编码层中包括编码层E i以及编码层E i+1;所述编码层E i+1包括第一多头自注意力网络;获取所述编码层E i输出的第一注意力矩阵以及历史文本编码序列,根据所述第一注意力矩阵与所述第一多头自注意力网络之间的残差连接以及所述历史文本编码序列,生成所述编码层E i+1的第二注意力矩阵;所述编码层E i+1为所述编码层E i的下一层编码层,N为大于1的整数,i为正整数,且i小于N;
    语音合成模块,用于根据所述第二注意力矩阵以及所述历史文本编码序列生成所述编码层E i+1的目标文本编码序列,基于所述目标文本编码序列生成与所述文本输入序列相匹配的合成语音数据。
  14. 一种计算机设备,包括:处理器、存储器以及网络接口;
    所述处理器与所述存储器、所述网络接口相连,其中,所述网络接口用于提供数据通信功能,所述存储器用于存储程序代码,所述处理器用于调用所述程序代码,以执行权利要求1-12任一项所述的方法。
  15. 一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机程序,所述计算机程序适于由处理器加载并执行权利要求1-12任一项所述的方法。
  16. 一种包括指令的计算机程序产品,当其在计算机上运行时,使得所述计算机执行权利要求1-12任一项所述的方法。
PCT/CN2022/079502 2021-03-11 2022-03-07 一种语音合成方法、装置以及可读存储介质 WO2022188734A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/984,437 US12033612B2 (en) 2021-03-11 2022-11-10 Speech synthesis method and apparatus, and readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110267221.5 2021-03-11
CN202110267221.5A CN112687259B (zh) 2021-03-11 2021-03-11 一种语音合成方法、装置以及可读存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/984,437 Continuation US12033612B2 (en) 2021-03-11 2022-11-10 Speech synthesis method and apparatus, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2022188734A1 true WO2022188734A1 (zh) 2022-09-15

Family

ID=75455509

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/079502 WO2022188734A1 (zh) 2021-03-11 2022-03-07 一种语音合成方法、装置以及可读存储介质

Country Status (2)

Country Link
CN (1) CN112687259B (zh)
WO (1) WO2022188734A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117809621A (zh) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 一种语音合成方法、装置、电子设备及存储介质

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112687259B (zh) * 2021-03-11 2021-06-18 腾讯科技(深圳)有限公司 一种语音合成方法、装置以及可读存储介质
CN113160794B (zh) * 2021-04-30 2022-12-27 京东科技控股股份有限公司 基于音色克隆的语音合成方法、装置及相关设备
CN113628630B (zh) * 2021-08-12 2023-12-01 科大讯飞股份有限公司 基于编解码网络的信息转换方法和装置、电子设备
CN113781995B (zh) * 2021-09-17 2024-04-05 上海喜马拉雅科技有限公司 语音合成方法、装置、电子设备及可读存储介质
CN114783407B (zh) * 2022-06-21 2022-10-21 平安科技(深圳)有限公司 语音合成模型训练方法、装置、计算机设备及存储介质
CN115394284A (zh) * 2022-08-23 2022-11-25 平安科技(深圳)有限公司 语音合成方法、系统、设备及存储介质
CN116364055B (zh) * 2023-05-31 2023-09-01 中国科学院自动化研究所 基于预训练语言模型的语音生成方法、装置、设备及介质
CN117333950B (zh) * 2023-11-30 2024-03-12 苏州元脑智能科技有限公司 动作生成方法、装置、计算机设备和存储介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114522A1 (en) * 2016-10-24 2018-04-26 Semantic Machines, Inc. Sequence to sequence transformations for speech synthesis via recurrent neural networks
CN112687259A (zh) * 2021-03-11 2021-04-20 腾讯科技(深圳)有限公司 一种语音合成方法、装置以及可读存储介质

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10796686B2 (en) * 2017-10-19 2020-10-06 Baidu Usa Llc Systems and methods for neural text-to-speech using convolutional sequence learning
CN109543824B (zh) * 2018-11-30 2023-05-23 腾讯科技(深圳)有限公司 一种序列模型的处理方法和装置
US11011154B2 (en) * 2019-02-08 2021-05-18 Tencent America LLC Enhancing hybrid self-attention structure with relative-position-aware bias for speech synthesis
CN110070852B (zh) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 合成中文语音的方法、装置、设备及存储介质
CN111353299B (zh) * 2020-03-03 2022-08-09 腾讯科技(深圳)有限公司 基于人工智能的对话场景确定方法和相关装置
CN111930942B (zh) * 2020-08-07 2023-08-15 腾讯云计算(长沙)有限责任公司 文本分类方法、语言模型训练方法、装置及设备

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180114522A1 (en) * 2016-10-24 2018-04-26 Semantic Machines, Inc. Sequence to sequence transformations for speech synthesis via recurrent neural networks
CN112687259A (zh) * 2021-03-11 2021-04-20 腾讯科技(深圳)有限公司 一种语音合成方法、装置以及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
PENG LIU; YUEWEN CAO; SONGXIANG LIU; NA HU; GUANGZHI LI; CHAO WENG; DAN SU: "VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 12 February 2021 (2021-02-12), 201 Olin Library Cornell University Ithaca, NY 14853 , XP081884333 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117809621A (zh) * 2024-02-29 2024-04-02 暗物智能科技(广州)有限公司 一种语音合成方法、装置、电子设备及存储介质
CN117809621B (zh) * 2024-02-29 2024-06-11 暗物智能科技(广州)有限公司 一种语音合成方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN112687259B (zh) 2021-06-18
CN112687259A (zh) 2021-04-20
US20230075891A1 (en) 2023-03-09

Similar Documents

Publication Publication Date Title
WO2022188734A1 (zh) 一种语音合成方法、装置以及可读存储介质
CN112735373B (zh) 语音合成方法、装置、设备及存储介质
CN109859736B (zh) 语音合成方法及系统
JP2022534764A (ja) 多言語音声合成およびクロスランゲージボイスクローニング
CN115516552A (zh) 使用未说出的文本和语音合成的语音识别
WO2021189984A1 (zh) 语音合成方法、装置、设备及计算机可读存储介质
WO2022178969A1 (zh) 语音对话数据处理方法、装置、计算机设备及存储介质
WO2021227707A1 (zh) 音频合成方法、装置、计算机可读介质及电子设备
WO2023245389A1 (zh) 歌曲生成方法、装置、电子设备和存储介质
WO2022222757A1 (zh) 将文本数据转换为声学特征的方法、电子设备和存储介质
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN114360493A (zh) 语音合成方法、装置、介质、计算机设备和程序产品
CN112035699A (zh) 音乐合成方法、装置、设备和计算机可读介质
CN113450758B (zh) 语音合成方法、装置、设备及介质
Shechtman et al. Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture.
CN116958343A (zh) 面部动画生成方法、装置、设备、介质及程序产品
CN116665639A (zh) 语音合成方法、语音合成装置、电子设备及存储介质
CN114464163A (zh) 语音合成模型的训练方法、装置、设备、存储介质和产品
US12033612B2 (en) Speech synthesis method and apparatus, and readable storage medium
Zahariev et al. Intelligent voice assistant based on open semantic technology
KR20210131125A (ko) 발화 속도 조절이 가능한 텍스트 음성 변환 학습 장치 및 발화 속도 조절이 가능한 텍스트 음성 변환 장치
Liu et al. Exploring effective speech representation via asr for high-quality end-to-end multispeaker tts
WO2023102929A1 (zh) 音频合成方法、电子设备、程序产品及存储介质
WO2023102931A1 (zh) 韵律结构的预测方法、电子设备、程序产品及存储介质
Xu et al. End-to-End Speech Synthesis Method for Lhasa-Tibetan Multi-speaker

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22766258

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22766258

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15-02-2024)