WO2022121158A1 - Speech synthesis method and apparatus, electronic device and storage medium - Google Patents

Speech synthesis method and apparatus, electronic device and storage medium Download PDF

Info

Publication number
WO2022121158A1
WO2022121158A1 PCT/CN2021/083186 CN2021083186W
Authority
WO
WIPO (PCT)
Prior art keywords
character
vector
attention
sequence
feature
Prior art date
Application number
PCT/CN2021/083186
Other languages
English (en)
Chinese (zh)
Inventor
孙奥兰
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2022121158A1 publication Critical patent/WO2022121158A1/fr

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Definitions

  • the present application relates to the field of speech synthesis, and in particular, to a speech synthesis method, apparatus, electronic device, and computer-readable storage medium.
  • a speech synthesis method comprising:
  • Receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;
  • the attention feature model includes a multi-head attention network and a character feature extraction network
  • the character vector is input into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • Residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • a speech synthesis device comprising:
  • a character vector construction module, used to receive character text, perform pinyin replacement on the character text to obtain character pinyin, use a pre-built alphabet to calculate the character position of the character pinyin in the alphabet, and perform an encoding operation on the character position and the character pinyin to obtain a character vector;
  • a character feature sequence extraction module, used to input the character vector into the pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network; to perform attention calculation on the character vector using the multi-head attention network to obtain an attention vector; to perform residual connection on the attention vector and the character vector to obtain a character attention vector; and to perform feature extraction on the character attention vector using the character feature extraction network to obtain a character feature sequence;
  • a pronunciation pause sequence extraction module for inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • a speech synthesis module, used to perform residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and to use a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • An electronic device comprising:
  • a memory storing instructions, and a processor that executes the instructions stored in the memory to implement the following steps:
  • Receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;
  • the attention feature model includes a multi-head attention network and a character feature extraction network
  • the character vector is input into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • Residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • a computer-readable storage medium comprising a storage data area and a storage program area, wherein the storage data area stores created data and the storage program area stores a computer program; when the computer program is executed by a processor, the following steps are implemented:
  • Receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;
  • the attention feature model includes a multi-head attention network and a character feature extraction network
  • the character vector is input into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • Residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • the present application can solve the problem that the synthesized speech is not smooth and natural enough.
  • FIG. 1 is a schematic flowchart of a speech synthesis method provided by an embodiment of the present application.
  • FIG. 2 is a detailed schematic flowchart of S6 in a speech synthesis method provided by an embodiment of the present application
  • FIG. 3 is a schematic block diagram of a speech synthesis apparatus provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an internal structure of an electronic device for implementing a speech synthesis method provided by an embodiment of the present application
  • the embodiments of the present application provide a speech synthesis method, and the execution subject of the speech synthesis method includes, but is not limited to, at least one of electronic devices that can be configured to execute the method provided by the embodiments of the present application, such as a server and a terminal.
  • the speech synthesis method can be executed by software or hardware installed in a terminal device or a server device, and the software can be a blockchain platform.
  • the server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like.
  • the speech synthesis method includes:
  • the character text input by the user can be received; for example, the user inputs character text A: "Hello, today's trip is accompanied by heavy rain and strong wind, please pay attention to safety". Character text A is then subjected to pinyin replacement to obtain character pinyin B: "nihao, jintianchuxingbanyoubaoyukuangfeng, qingzhuyianquan". In the embodiment of the present application, performing pinyin replacement on the character text to obtain the character pinyin comprises: building a pinyin replacement program using Pinyin4j in the JAVA language, and using the pinyin replacement program to perform pinyin replacement on the character text to obtain the character pinyin.
  • pinyin4j is located under net.sourceforge.pinyin4j in the JAVA language, so import net.sourceforge.pinyin4j can be used to import pinyin4j and obtain the pinyin replacement program.
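  • As an illustrative sketch of this pinyin replacement step only (the patent builds its replacement program with pinyin4j in JAVA; the Python library pypinyin is used here as a hypothetical stand-in):

```python
# Sketch only: pypinyin stands in for the pinyin4j-based replacement program named by the patent.
from pypinyin import lazy_pinyin  # pip install pypinyin


def to_character_pinyin(character_text: str) -> str:
    """Replace Chinese character text with a toneless pinyin string."""
    # lazy_pinyin yields one pinyin syllable per Chinese character and leaves
    # non-Chinese characters (such as punctuation) unchanged.
    return "".join(lazy_pinyin(character_text))


# Chinese input reconstructed from the embodiment's pinyin example (an assumption).
print(to_character_pinyin("你好，今天出行伴有暴雨狂风，请注意安全"))
# expected output (approximately): nihao，jintianchuxingbanyoubaoyukuangfeng，qingzhuyianquan
```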
  • the alphabet is constructed by using pinyin.
  • a corresponds to 1
  • b corresponds to 2
  • c corresponds to 3
  • the above-mentioned character pinyin B "nihao, jintianchuxingbanyoubaoyukuangfeng, qingzhuyianquan" is mapped through the constructed alphabet to obtain character positions consisting of numbers.
  • the embodiment of the present application adopts a one-hot encoding method to perform encoding operations on the character position and the character pinyin to obtain a character vector.
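  • A minimal sketch of the alphabet lookup and one-hot encoding described above (the alphabet contents beyond a corresponding to 1, b to 2, c to 3, and the exact vector layout, are assumptions):

```python
import numpy as np

# Pre-built alphabet: each pinyin letter maps to a numeric position (a -> 1, b -> 2, c -> 3, ...).
ALPHABET = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")}


def character_positions(character_pinyin: str) -> list:
    """Calculate the position of every pinyin letter in the pre-built alphabet."""
    return [ALPHABET[ch] for ch in character_pinyin if ch in ALPHABET]


def one_hot_encode(positions: list, depth: int = len(ALPHABET) + 1) -> np.ndarray:
    """One-hot encode the character positions into a character vector matrix."""
    vectors = np.zeros((len(positions), depth), dtype=np.float32)
    vectors[np.arange(len(positions)), positions] = 1.0
    return vectors


positions = character_positions("nihao")       # [14, 9, 8, 1, 15]
character_vector = one_hot_encode(positions)   # shape (5, 27)
```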
  • before performing S3, the attention feature model needs to be trained.
  • the training of the attention feature model includes:
  • Step A: construct an attention feature model to be trained, including the multi-head attention network and the character feature extraction network.
  • the step A includes: constructing the multi-head attention network according to a multi-head attention mechanism; constructing the character feature extraction network according to a convolutional neural network; and combining the multi-head attention network and the character feature extraction network to obtain the attention feature model to be trained.
  • constructing the multi-head attention network according to the multi-head attention mechanism includes: receiving a trained Transformer model, extracting the encoder from the Transformer model, and constructing the multi-head attention network using the multi-head attention mechanism in the encoder.
  • the user can train the Transformer model in advance.
  • the Transformer model is a deep learning model that can perform classification or fitting, and includes an encoder and a decoder, wherein the encoder includes a multi-head attention mechanism.
  • the network layer where the multi-head attention mechanism is located is extracted to construct the multi-head attention network.
  • the attention feature model to be trained is obtained by combining the multi-head attention network and the character feature extraction network, as sketched below.
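  • A rough PyTorch sketch of this Step A under assumed dimensions (the layer sizes, head count, and the use of nn.Transformer as a stand-in for the trained model are illustrative assumptions, not the patent's specification):

```python
import torch.nn as nn

# Stand-in for the trained Transformer model received in Step A (assumed dimensions).
trained_transformer = nn.Transformer(d_model=256, nhead=4, batch_first=True)

# Extract the encoder and reuse the network layer holding its multi-head attention mechanism.
encoder_layer = trained_transformer.encoder.layers[0]
multi_head_attention = encoder_layer.self_attn  # an nn.MultiheadAttention module

# Character feature extraction network built on a convolutional neural network;
# a single 3*3 convolution is used here, matching the kernel size given later.
char_feature_extraction = nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Combining both parts yields the attention feature model to be trained.
attention_feature_model = nn.ModuleDict({
    "multi_head_attention": multi_head_attention,
    "char_feature_extraction": char_feature_extraction,
})
```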
  • Step B: receive a training text set and a training label set, input the training text set into the attention feature model to be trained for feature extraction, and obtain a feature sequence training set.
  • the training text set is a text set collected and sorted out by a user in advance
  • the training label set is a voice set corresponding to the training text set.
  • obtaining a feature sequence training set includes: performing pinyin replacement on the training text set to obtain a pinyin training set; calculating the character positions of the pinyin training set in the alphabet to obtain a position training set; performing an encoding operation on the pinyin training set and the position training set to obtain a vector training set; performing attention calculation on the vector training set using the multi-head attention network to obtain an attention vector set; performing residual connection on the attention vector set and the vector training set to obtain an attention vector training set; and performing feature extraction on the attention vector training set using the character feature extraction network to obtain the feature sequence training set.
  • attention calculation is performed on the vector training set to obtain the attention vector set.
  • the present application uses the following formula to perform residual connection on the attention vector set and the vector training set: result_attention = s + p, where result_attention represents the attention vector training set, s represents the attention vector set, and p represents the vector training set.
  • the convolution operation in the character feature extraction network is used to sequentially perform feature extraction on each attention vector in the attention vector training set, and then the feature sequence training set is obtained.
  • the convolution operation is a convolution calculation operation based on a convolution kernel, and the size of the convolution kernel is set to 3*3 in this application, so as to obtain the feature sequence training set.
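  • A minimal sketch of this feature-extraction pass under assumed shapes (batch size, sequence length, and embedding dimension are placeholders):

```python
import torch
import torch.nn as nn

multi_head_attention = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
char_feature_extraction = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # 3*3 convolution kernel

p = torch.randn(8, 50, 256)           # hypothetical vector training set: (batch, seq_len, dim)
s, _ = multi_head_attention(p, p, p)  # attention calculation -> attention vector set s
result_attention = s + p              # residual connection: result_attention = s + p

# 3*3 convolution over each attention vector matrix -> feature sequence training set.
feature_sequences = char_feature_extraction(result_attention.unsqueeze(1)).squeeze(1)
```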
  • Step C: build multiple linear activation layers.
  • the present application constructs linear activation layers to help train the attention feature model to be trained, wherein a linear activation layer includes normalization and an activation function, and the activation function can be a Gaussian distribution function.
  • Step D: use the multi-layer linear activation layer to perform an activation operation on the feature sequence training set to obtain a prediction sequence set.
  • using the multi-layer linear activation layer to perform an activation operation on the feature sequence training set to obtain a prediction sequence set includes: performing normalization on the feature sequence training set to obtain a normalized feature sequence set, using the Gaussian distribution function to calculate the Gaussian distribution of the normalized feature sequence set, and obtaining the prediction sequence set according to the Gaussian distribution.
  • the normalization is an operation of mapping the values in the feature sequence training set to a specified range. For example, mapping the values in the feature sequence training set to the [0, 1] range scales down the values and reduces the computational load.
  • calculating the Gaussian distribution of the normalized feature sequence set using the Gaussian distribution function includes: using the Gaussian distribution function to calculate the mean and variance of the normalized feature sequence set, and obtaining the Gaussian distribution of the normalized feature sequence set from the mean and variance.
  • the Gaussian distribution represents the probability distribution of data within a specified range
  • the maximum probability distribution of the training set of feature sequences is found from the Gaussian distribution, that is, the set of prediction sequences is obtained.
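  • One possible reading of Steps C and D, sketched with assumed shapes (the patent does not state how the Gaussian probabilities are turned into the prediction sequence set, so the final argmax is an assumption):

```python
import numpy as np


def linear_activation(feature_sequences: np.ndarray) -> np.ndarray:
    """Sketch of a linear activation layer: normalization followed by a Gaussian distribution function."""
    # Normalization: map values into the [0, 1] range to reduce the computational load.
    lo, hi = feature_sequences.min(), feature_sequences.max()
    normalized = (feature_sequences - lo) / (hi - lo + 1e-8)
    # Gaussian distribution function: mean and variance of the normalized set, then
    # the probability density of every value under that Gaussian.
    mean, var = normalized.mean(), normalized.var()
    return np.exp(-(normalized - mean) ** 2 / (2 * var + 1e-8)) / np.sqrt(2 * np.pi * var + 1e-8)


features = np.random.rand(8, 50)                # hypothetical feature sequence training set
density = linear_activation(features)
prediction_sequences = density.argmax(axis=-1)  # positions of maximum probability (assumption)
```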
  • Step E: calculate the error value between the predicted sequence set and the training label set, and determine the magnitude relationship between the error value and a preset error threshold.
  • the squared difference formula is used to calculate the error value between the predicted sequence set and the training label set.
  • Step F: if the error value is greater than the error threshold, adjust the internal parameters of the attention feature model to be trained, and return to Step B.
  • Step G: if the error value is less than or equal to the error threshold, obtain the attention feature model composed of the multi-head attention network and the character feature extraction network.
  • if the error value is less than or equal to the error threshold, it indicates that the attention feature model to be trained has a strong character feature extraction capability, and training is completed to obtain the attention feature model.
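  • Steps E to G amount to an error-threshold training loop; a schematic sketch in which the optimizer, threshold value, and data handling are placeholder assumptions:

```python
import torch
import torch.nn as nn


def train_attention_feature_model(model: nn.Module, train_vectors: torch.Tensor,
                                  train_labels: torch.Tensor,
                                  error_threshold: float = 1e-3,
                                  max_epochs: int = 1000) -> nn.Module:
    """Repeat Steps B to F until the squared-difference error falls below the threshold (Step G)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        predictions = model(train_vectors)                  # prediction sequence set
        error = ((predictions - train_labels) ** 2).mean()  # Step E: squared-difference error value
        if error.item() <= error_threshold:                 # Step G: training is complete
            break
        optimizer.zero_grad()                               # Step F: adjust internal parameters
        error.backward()
        optimizer.step()
    return model
```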
  • the character vector can be input into the pre-trained attention feature model.
  • S4 is similar to the training stage in S3 above; both use the principle of the multi-head attention mechanism of the encoder in the Transformer model to perform the attention calculation and obtain the attention vector.
  • the residual connection uses the following formula: character_attention = m + u, where character_attention represents the character attention vector, m represents the attention vector, and u represents the character vector.
  • the S6 includes:
  • the normalization is, as described above, the operation of mapping the values in the character attention vector to a specified range.
  • For example, the values in the character attention vector are mapped to the range [0, 1].
  • performing a convolution operation on the normalized vector to obtain a character convolution vector includes: constructing a convolution kernel according to a preset convolution kernel dimension, and using the convolution kernel to perform a convolution operation on the normalized vector to obtain the character convolution vector.
  • the residual connection is the same as the above, and the character convolution vector and the character attention vector are correspondingly added to obtain the character feature sequence.
  • the pronunciation pause prediction model is formed based on a plurality of fast Fourier transform modules.
  • 10 fast Fourier transform modules are used to form the pronunciation pause prediction model.
  • the S7 includes: transforming the character pinyin into a word vector to obtain a pinyin vector; inputting the pinyin vector and the character vector into the pronunciation pause prediction model, and using the pronunciation pause prediction model to perform Fourier transform on the pinyin vector and the character vector to obtain a Fourier transform sequence; and performing pronunciation pause prediction on the Fourier transform sequence to obtain the pronunciation pause sequence.
  • the fast Fourier transform is a fast algorithm of the discrete Fourier transform (DFT), which can predict the Fourier transform sequence corresponding to the character vector and the pinyin vector, wherein the Fourier transform sequence includes speech frequency, amplitude and phase, and the pronunciation pause sequence can be obtained from the Fourier transform sequence.
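  • A loose numpy illustration of how a Fourier transform over the pinyin and character vectors exposes frequency, amplitude, and phase; the pause-decision rule at the end is purely an assumption, since the patent does not give the prediction criterion:

```python
import numpy as np


def pronunciation_pause_sequence(pinyin_vector: np.ndarray, character_vector: np.ndarray,
                                 amplitude_threshold: float = 0.1) -> np.ndarray:
    """Toy sketch: FFT of the combined vectors -> amplitude/phase -> pause flags."""
    combined = pinyin_vector + character_vector
    spectrum = np.fft.fft(combined)   # fast Fourier transform (fast algorithm of the DFT)
    amplitude = np.abs(spectrum)      # amplitude
    phase = np.angle(spectrum)        # phase (not used by this toy rule)
    # Toy rule (assumption): mark positions of low spectral amplitude as pauses.
    return (amplitude < amplitude_threshold * amplitude.max()).astype(np.int32)


pauses = pronunciation_pause_sequence(np.random.rand(64), np.random.rand(64))
```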
  • DFT discrete Fourier transform
  • the vocoder is a decoder that can realize speech synthesis, and includes a channel vocoder, a formant vocoder, a pattern vocoder, a linear prediction vocoder, a quadrature function vocoder, and the like.
  • the synthesized speech of the character text can be obtained by inputting the speech sequence into the vocoder.
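  • A final-assembly sketch of S8 under the assumption that some vocoder object exposes a synthesize() call (the patent does not fix a vocoder API, so both the object and the method name are hypothetical placeholders):

```python
import numpy as np


def synthesize_speech(character_feature_sequence: np.ndarray,
                      pronunciation_pause_sequence: np.ndarray,
                      vocoder) -> np.ndarray:
    """S8 sketch: residual connection of the two sequences, then vocoder synthesis."""
    # Residual connection: element-wise addition of the character feature sequence
    # and the pronunciation pause sequence -> speech sequence.
    speech_sequence = character_feature_sequence + pronunciation_pause_sequence
    # Hypothetical vocoder call standing in for the pre-built vocoder.
    return vocoder.synthesize(speech_sequence)
```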
  • speech synthesis is performed in two parts: first, a pre-trained attention feature model is used to perform feature extraction on the character text to obtain a character feature sequence; second, a pronunciation pause prediction model is used to predict the pronunciation pause sequence of the character text. Finally, residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • the present application not only predicts the character feature sequence but also adds a prediction process for the pronunciation pause sequence, so the synthesized speech is closer to a natural human voice in frequency, amplitude, and the like; therefore, the speech synthesis method, apparatus, and computer-readable storage medium proposed in this application can solve the problem that the synthesized speech is not smooth and natural enough.
  • FIG. 3 is a schematic block diagram of the speech synthesis apparatus of the present application.
  • the speech synthesis apparatus 100 described in this application can be installed in an electronic device.
  • the speech synthesis apparatus may include a character vector construction module 101 , a character feature sequence extraction module 102 , a pronunciation pause sequence extraction module 103 and a speech synthesis module 104 .
  • the modules described in the present application may also be called units, which refer to a series of computer program segments that can be executed by the processor of the electronic device, can perform fixed functions, and are stored in the memory of the electronic device.
  • each module/unit is as follows:
  • the character vector construction module 101 is used to receive character text, perform pinyin replacement on the character text to obtain character pinyin, use a pre-built alphabet to calculate the character position of the character pinyin in the alphabet, and perform an encoding operation on the character position and the character pinyin to obtain a character vector;
  • the character feature sequence extraction module 102 is configured to input the character vector into the pre-trained attention feature model, wherein the attention feature model includes a multi-head attention network and a character feature extraction network; to perform attention calculation on the character vector using the multi-head attention network to obtain an attention vector; to perform residual connection on the attention vector and the character vector to obtain a character attention vector; and to perform feature extraction on the character attention vector using the character feature extraction network to obtain a character feature sequence;
  • the pronunciation pause sequence extraction module 103 is used to input the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence;
  • the speech synthesis module 104 is used to perform residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and to use a pre-built vocoder to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • Each module in the speech synthesis apparatus 100 provided by the embodiment of the present application can use the same means as the above-mentioned speech synthesis method, and the specific implementation steps will not be repeated here.
  • the technical effect is the same as that of the above-mentioned speech synthesis method, that is, the problem that the synthesized speech is not smooth and natural is solved.
  • FIG. 4 is a schematic structural diagram of an electronic device implementing the speech synthesis method of the present application.
  • the electronic device 1 may include a processor 10, a memory 11 and a bus, and may also include a computer program stored in the memory 11 and executable on the processor 10, such as a speech synthesis program 12.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, mobile hard disk, multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, magnetic disk, optical disc, etc.
  • the memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a mobile hard disk of the electronic device 1 .
  • the memory 11 may also be an external storage device of the electronic device 1, such as a pluggable mobile hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash memory card (Flash Card) equipped on the electronic device 1.
  • the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device.
  • the memory 11 can not only be used to store application software installed in the electronic device 1 and various types of data, such as the code of the speech synthesis program 12, etc., but also can be used to temporarily store data that has been output or will be output.
  • the processor 10 may be composed of integrated circuits, for example, a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPU), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like.
  • the processor 10 is the control core (Control Unit) of the electronic device; it connects the various components of the entire electronic device using various interfaces and lines, and executes the various functions of the electronic device 1 and processes data by running or executing the programs or modules stored in the memory 11 (for example, the speech synthesis program) and calling data stored in the memory 11.
  • the bus may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like.
  • PCI peripheral component interconnect
  • EISA Extended industry standard architecture
  • the bus can be divided into address bus, data bus, control bus and so on.
  • the bus is configured to implement connection communication between the memory 11 and at least one processor 10 and the like.
  • FIG. 4 only shows an electronic device with some components. Those skilled in the art can understand that the structure shown in FIG. 4 does not constitute a limitation on the electronic device 1, and the electronic device may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the electronic device 1 may also include a power supply (such as a battery) for powering the various components. Preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that the power management device implements functions such as charge management, discharge management, and power consumption management.
  • the power source may also include one or more DC or AC power sources, recharging devices, power failure detection circuits, power converters or inverters, power status indicators, and any other components.
  • the electronic device 1 may further include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be repeated here.
  • the electronic device 1 may also include a network interface; optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface or a Bluetooth interface), which is usually used to establish a communication connection between the electronic device 1 and other electronic devices.
  • the electronic device 1 may further include a user interface, and the user interface may be a display (Display), an input unit (eg, a keyboard (Keyboard)), optionally, the user interface may also be a standard wired interface or a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, and the like.
  • the display may also be appropriately called a display screen or a display unit, which is used for displaying information processed in the electronic device 1 and for displaying a visualized user interface.
  • the speech synthesis program 12 stored in the memory 11 in the electronic device 1 is a combination of multiple instructions, and when running in the processor 10, it can realize:
  • Receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;
  • the attention feature model includes a multi-head attention network and a character feature extraction network
  • the character vector is input into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • Residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • the modules/units integrated in the electronic device 1 may be stored in a computer-readable storage medium.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM, Read-Only Memory).
  • the computer-usable storage medium may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function, and the like, and the stored data area may store data created during use, etc.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be volatile or non-volatile.
  • the readable storage medium stores a computer program, and when the computer program is executed by the processor of an electronic device, it can implement:
  • Receiving character text, performing pinyin replacement on the character text to obtain character pinyin, and using a pre-built alphabet to calculate the character position of the character pinyin in the alphabet;
  • the attention feature model includes a multi-head attention network and a character feature extraction network
  • the character vector is input into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence
  • Residual connection is performed on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and a pre-built vocoder is used to perform speech synthesis on the speech sequence to obtain the synthesized speech of the character text.
  • modules described as separate components may or may not be physically separated, and the components shown as modules may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units can be implemented in the form of hardware, or can be implemented in the form of hardware plus software function modules.
  • the blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • Blockchain is essentially a decentralized database; it is a series of data blocks associated with each other using cryptographic methods, and each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Speech synthesis method and speech synthesis apparatus (100), and electronic device (1) and storage medium. The method comprises: obtaining a character vector, and performing attention calculation on the character vector using a multi-head attention network to obtain an attention vector (S4); performing residual connection on the attention vector and the character vector to obtain a character attention vector (S5); performing feature extraction on the character attention vector using a character feature extraction network to obtain a character feature sequence (S6); inputting the character vector into a pre-built pronunciation pause prediction model to obtain a pronunciation pause sequence (S7); and performing residual connection on the character feature sequence and the pronunciation pause sequence to obtain a speech sequence, and performing speech synthesis on the speech sequence using a pre-built vocoder to obtain synthesized speech of a character text (S8). The problem that the synthesized speech is not smooth and natural enough can be solved.
PCT/CN2021/083186 2020-12-11 2021-03-26 Speech synthesis method and apparatus, electronic device and storage medium WO2022121158A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011452787.7A CN112509554A (zh) 2020-12-11 2020-12-11 语音合成方法、装置、电子设备及存储介质
CN202011452787.7 2020-12-11

Publications (1)

Publication Number Publication Date
WO2022121158A1 true WO2022121158A1 (fr) 2022-06-16

Family

ID=74972920

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083186 WO2022121158A1 (fr) Speech synthesis method and apparatus, electronic device and storage medium

Country Status (2)

Country Link
CN (1) CN112509554A (fr)
WO (1) WO2022121158A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112509554A (zh) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质
CN113112985B (zh) * 2021-04-21 2022-01-18 合肥工业大学 一种基于深度学习的语音合成方法
CN114154459A (zh) * 2021-10-28 2022-03-08 北京搜狗科技发展有限公司 语音识别文本处理方法、装置、电子设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190172443A1 (en) * 2017-12-06 2019-06-06 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
CN110534089A (zh) * 2019-07-10 2019-12-03 西安交通大学 一种基于音素和韵律结构的中文语音合成方法
CN110782870A (zh) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质
CN110808027A (zh) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 语音合成方法、装置以及新闻播报方法、系统
CN111899716A (zh) * 2020-08-03 2020-11-06 北京帝派智能科技有限公司 一种语音合成方法和系统
CN112509554A (zh) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190172443A1 (en) * 2017-12-06 2019-06-06 International Business Machines Corporation System and method for generating expressive prosody for speech synthesis
CN110534089A (zh) * 2019-07-10 2019-12-03 西安交通大学 一种基于音素和韵律结构的中文语音合成方法
CN110782870A (zh) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质
CN110808027A (zh) * 2019-11-05 2020-02-18 腾讯科技(深圳)有限公司 语音合成方法、装置以及新闻播报方法、系统
CN111899716A (zh) * 2020-08-03 2020-11-06 北京帝派智能科技有限公司 一种语音合成方法和系统
CN112509554A (zh) * 2020-12-11 2021-03-16 平安科技(深圳)有限公司 语音合成方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN112509554A (zh) 2021-03-16

Similar Documents

Publication Publication Date Title
WO2022121158A1 (fr) Procédé et appareil de synthèse de la parole, et dispositif électronique et support d'enregistrement
WO2022121176A1 (fr) Procédé et appareil de synthèse de la parole, dispositif électronique et support de stockage lisible
US8527276B1 (en) Speech synthesis using deep neural networks
WO2021189984A1 (fr) Procédé et appareil de synthèse de la parole et dispositif et support de stockage lisible par ordinateur
US11488577B2 (en) Training method and apparatus for a speech synthesis model, and storage medium
CN110264991A (zh) 语音合成模型的训练方法、语音合成方法、装置、设备及存储介质
WO2022227190A1 (fr) Procédé et appareil de synthèse vocale, dispositif électronique et support de stockage
WO2022116420A1 (fr) Procédé et appareil de détection d'événement vocal, dispositif électronique, et support de stockage informatique
WO2021212683A1 (fr) Procédé et appareil d'interrogation basés sur une carte de connaissances juridiques, dispositif électronique et support
WO2022121157A1 (fr) Procédé et appareil de synthèse de la parole, dispositif électronique et support de stockage
WO2021151344A1 (fr) Procédé et appareil de synthèse de chanson, et support de stockage lisible par ordinateur
CN113642316B (zh) 中文文本纠错方法、装置、电子设备及存储介质
CN113096242A (zh) 虚拟主播生成方法、装置、电子设备及存储介质
CN104126200A (zh) 声学处理单元
WO2022194062A1 (fr) Procédé et appareil de détection de marqueur de maladie, dispositif électronique et support d'enregistrement
CN115953997A (zh) 使用神经网络的文本到语音合成的无监督对齐
CN113345431A (zh) 跨语言语音转换方法、装置、设备及介质
CN114863945A (zh) 基于文本的语音变声方法、装置、电子设备及存储介质
CN113205814B (zh) 语音数据标注方法、装置、电子设备及存储介质
WO2021208700A1 (fr) Procédé et appareil de sélection de données vocales, dispositif électronique et support d'enregistrement
WO2022121152A1 (fr) Procédé de dialogue intelligent, appareil, dispositif électronique et support de stockage
CN114155832A (zh) 基于深度学习的语音识别方法、装置、设备及介质
CN116564322A (zh) 语音转换方法、装置、设备及存储介质
WO2022142105A1 (fr) Procédé et appareil de synthèse texte-parole , dispositif électronique et support d'enregistrement
WO2022141867A1 (fr) Procédé et appareil de reconnaissance de parole, dispositif électronique et support de stockage lisible

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21901875

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21901875

Country of ref document: EP

Kind code of ref document: A1