WO2020215666A1 - Speech synthesis method and apparatus, computer device and storage medium - Google Patents

Speech synthesis method and apparatus, computer device and storage medium

Info

Publication number
WO2020215666A1
WO2020215666A1 PCT/CN2019/116509 CN2019116509W WO2020215666A1 WO 2020215666 A1 WO2020215666 A1 WO 2020215666A1 CN 2019116509 W CN2019116509 W CN 2019116509W WO 2020215666 A1 WO2020215666 A1 WO 2020215666A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
acoustic
model
sample
feature sequence
Prior art date
Application number
PCT/CN2019/116509
Other languages
English (en)
Chinese (zh)
Inventor
彭俊清
尚迪雅
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2020215666A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks

Definitions

  • This application relates to the field of speech synthesis, and in particular to a speech synthesis method, device, computer equipment and storage medium.
  • Although existing text-to-speech (TTS, Text-To-Speech) systems include sound processing systems capable of synthesizing speech that imitates a real person's pronunciation (recordings), such systems suffer from slow processing speed and a high error rate.
  • a method of speech synthesis including:
  • Train a preset acoustic model according to the speech sample and the corresponding acoustic feature sequence, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model, where the preset acoustic model is based on the WaveNet network;
  • a speech synthesis device includes:
  • a sample acquisition module, used to acquire voice samples;
  • the feature extraction module is configured to perform feature extraction on the voice sample to obtain the acoustic feature sequence corresponding to the voice sample;
  • the training module is configured to train a preset acoustic model according to the speech sample and the acoustic feature sequence corresponding to it, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model.
  • the preset acoustic model is based on the WaveNet network;
  • the synthesis module is used to obtain speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain the synthesized speech output by the speech synthesis model.
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • Train a preset acoustic model according to the speech sample and the corresponding acoustic feature sequence, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model, where the preset acoustic model is based on the WaveNet network;
  • One or more readable storage media storing computer-readable instructions;
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps. In other words, the computer-readable storage medium stores a computer program that, when executed by the processor, implements the speech synthesis method according to any one of claims 1 to 5.
  • Train a preset acoustic model according to the speech sample and the corresponding acoustic feature sequence, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model, where the preset acoustic model is based on the WaveNet network;
  • FIG. 1 is a schematic diagram of an application environment of a speech synthesis method in an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a speech synthesis method in an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a speech synthesis method in an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a speech synthesis method in an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a speech synthesis method in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a speech synthesis method in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a speech synthesis device in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a speech synthesis device in an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a speech synthesis device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the speech synthesis method provided in this embodiment can be applied in an application environment as shown in FIG. 1, where the client communicates with the server through the network.
  • Clients include, but are not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with an independent server or a server cluster composed of multiple servers. In some cases, the server may obtain the speech text to be synthesized from the client, and generate a synthesized speech corresponding to the speech text to be synthesized.
  • In an embodiment, a speech synthesis method is provided. The method is described by taking its application to the server in FIG. 1 as an example, and includes the following steps:
  • S20 Perform feature extraction on the voice sample to obtain an acoustic feature sequence corresponding to the voice sample
  • the voice samples may be derived from a public voice database or voice data collected by oneself. Normally, if the format of the acquired voice data is inconsistent, the voice data needs to be uniformly processed into the same format before it can be used as a voice sample.
  • the format of the voice data includes but is not limited to the file type and sampling frequency of the voice data.
  • STRAIGHT can be used to process the voice samples, extract acoustic features from the voice samples, and form an acoustic feature sequence.
  • Each voice sample can generate a corresponding acoustic feature sequence.
  • the acoustic feature sequence consists of the acoustic features extracted from the speech sample, arranged in a certain order (for example, in time order).
  • STRAIGHT is an algorithm for speech signal analysis and synthesis.
  • the characteristic of STRAIGHT is that it can separate the spectral envelope of the voice from the voice information, decompose the voice signal into independent spectrum parameters and fundamental frequency parameters, and can flexibly adjust the fundamental frequency, duration, speech rate and other parameters of the voice signal.
  • the voice samples and the acoustic feature sequence obtained after the voice sample processing can be used for the training of the acoustic model.
  • the acoustic model may be a neural network model constructed based on a wavenet network.
  • the wavenet network is an autoregressive network based on CNN (Convolutional Neural Network).
  • the WaveNet network models speech directly at the waveform level and decomposes the joint probability of the waveform sequence into a product of conditional probabilities.
  • the following factorization can be used to model the probability p(x) of the speech waveform, predicting each sample point from its history:
  • x_t is the sampling point at time t;
  • each factor represents the probability distribution of the current sampling point, using the history up to time t as input to predict the current sampling point.
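  • Written out explicitly (following the standard WaveNet formulation that the above bullets describe), the factorization is $p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$, where each conditional distribution is modeled by the network from the preceding samples.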
  • the WaveNet network models the waveform sequence o_{1:T} conditioned on the auxiliary feature sequence a_{1:N}, as shown in the following formula:
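  • A conditional form consistent with this description (the exact conditioning scheme used in the application may differ) is $p(o_{1:T} \mid a_{1:N}) = \prod_{t=1}^{T} p(o_t \mid o_1, \ldots, o_{t-1}, a_{1:N})$, i.e. the auxiliary feature sequence is supplied as an extra input to every conditional distribution.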
  • the wavenet network can produce very high-quality synthetic speech in terms of naturalness.
  • structurally, the WaveNet network uses a gated activation function similar to that of PixelCNN, where the gate unit can be calculated by the following formula:
  • W_{f,k} and W_{g,k} respectively denote the filter convolution weights and the gate convolution weights of the k-th layer.
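  • In the standard WaveNet formulation, the gate unit of the k-th layer can be written as $\mathbf{z} = \tanh(W_{f,k} * \mathbf{x}) \odot \sigma(W_{g,k} * \mathbf{x})$, where $*$ denotes dilated causal convolution, $\odot$ element-wise multiplication, and $\sigma$ the sigmoid function.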
  • the WaveNet network also uses residual connections and parameterized skip connections to build a deep network; this structure also helps speed up model convergence.
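  • For illustration only (a generic sketch, not the implementation described in the application; the channel sizes, kernel size and the absence of a conditioning input are assumptions), one gated residual layer with a parameterized skip connection could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class GatedResidualLayer(nn.Module):
    """One WaveNet-style layer: dilated causal conv -> gated activation
    -> 1x1 convolutions for the residual and skip paths."""
    def __init__(self, channels: int, skip_channels: int, dilation: int):
        super().__init__()
        self.filter_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.gate_conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.res_conv = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip_conv = nn.Conv1d(channels, skip_channels, kernel_size=1)
        self.dilation = dilation

    def forward(self, x: torch.Tensor):
        # Left-pad so the convolution stays causal (no future samples are used).
        padded = nn.functional.pad(x, (self.dilation, 0))
        z = torch.tanh(self.filter_conv(padded)) * torch.sigmoid(self.gate_conv(padded))
        skip = self.skip_conv(z)           # parameterized skip connection
        residual = self.res_conv(z) + x    # residual connection aids convergence
        return residual, skip
```

  • Stacking such layers with increasing dilations enlarges the receptive field while the residual path keeps gradients flowing, which is consistent with the convergence benefit mentioned above.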
  • the acoustic model whose output result meets the preset requirement can be determined as the speech synthesis model.
  • the obtained speech synthesis model can be used to convert text data into audio data (ie, synthesized speech).
  • the preset requirements can be determined according to actual conditions. For example, when the difference between the synthesized speech output by the WaveNet network and the original speech sample is less than a specified threshold, the output of the acoustic model can be determined to meet the preset requirements. Alternatively, when the number of training iterations of the acoustic model reaches a preset value, the output of the acoustic model can also be determined to meet the preset requirements.
  • the speech text to be synthesized refers to the text that needs to be converted into audio data.
  • a preset text analysis model can be used to parse the to-be-synthesized speech text to obtain the speech parameters corresponding to the to-be-synthesized speech text.
  • Speech parameters include, but are not limited to, the pitch, prosody, and syllables of words, and the intervals between sentences.
  • a speech sample is obtained, and the speech sample can be used to train the speech synthesis model to be constructed, so as to improve the authenticity of the pronunciation produced by the speech synthesis model.
  • Feature extraction is performed on the voice sample to obtain the acoustic feature sequence corresponding to the voice sample; separating this feature extraction step (that is, obtaining the acoustic feature sequence) from the training process of the acoustic model reduces the amount of calculation during acoustic model training.
  • A preset acoustic model is trained according to the speech sample and the corresponding acoustic feature sequence, and the acoustic model that meets the preset requirements after training is determined as the speech synthesis model; because the preset acoustic model is based on the WaveNet network, the resulting speech synthesis model is of high quality in terms of naturalness (the authenticity with which it simulates human speech).
  • The speech parameters parsed from the speech text to be synthesized are acquired, the speech parameters are input into the speech synthesis model, and the synthesized speech output by the speech synthesis model is obtained, thereby completing the speech synthesis task and obtaining the synthesized speech corresponding to the speech text to be synthesized.
  • the method further includes:
  • S42 Input the to-be-synthesized speech text into a preset text analysis model, and obtain speech parameters corresponding to the to-be-synthesized speech text output by the preset text analysis model.
  • the trained speech synthesis model can synthesize the to-be-synthesized speech text into a synthesized speech close to the pronunciation of a real person.
  • the preset text parsing model can analyze the pitch, prosody, syllable and other information of all words contained in the speech text to be synthesized, and generate corresponding speech parameters (which can be expressed in the form of annotated files with context information). This process can also be called text analysis.
  • the obtained speech parameters can be converted into synthesized speech through the speech synthesis model.
  • text analysis provides an important basis for back-end speech synthesis. The effect of text analysis will directly affect the naturalness and accuracy of synthesized speech.
  • the initials and finals of Mandarin Chinese are used as the primitives of speech synthesis.
  • with the help of a grammar dictionary and a grammar rule library, the input Chinese text undergoes text normalization, grammatical analysis, prosody prediction analysis, and character-to-sound conversion, yielding the sentence information, word information and prosodic structure information of the input text as well as the initials and finals of each Chinese character;
  • in this way, the speech synthesis primitives (initials and finals) of the input Mandarin text and the context-related information of each speech synthesis primitive are obtained, and the corresponding speech parameters are finally generated.
  • the speech parameters include the monophone annotations and context-dependent annotations of each word in the Chinese text.
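  • Purely as a hypothetical illustration (the application does not specify its annotation format), a monophone annotation might simply list the phone sequence of a word, e.g. `n i3 h ao3` for "你好", while a context-dependent annotation would additionally record, for each phone, its neighbouring phones, tone, and position within the syllable, word and sentence, in the style of HTS full-context labels.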
  • the speech text to be synthesized is obtained as the text to be processed, and this text is used to synthesize the corresponding speech.
  • the speech text to be synthesized is input into a preset text analysis model, and the speech parameters output by the preset text analysis model for that text are obtained; these parameters match the speech text to be synthesized more closely, so that higher-quality synthesized speech can be obtained.
  • step S20 includes:
  • S201 Cut the speech sample into multiple speech frames;
  • S202 Calculate the acoustic features of each of the speech frames respectively, where the acoustic features include fundamental frequency, energy, and Mel frequency cepstrum coefficients;
  • S203 Sort the acoustic features of each of the speech frames in time sequence to form the acoustic feature sequence.
  • the speech sample may be segmented according to actual conditions, and divided into multiple speech frames.
  • the speech sample is a speech signal, and signal processing can only be applied to steady-state information. Since the speech signal is only quasi-steady-state, the speech sample needs to be divided into frames. When processing speech samples, the length of each speech frame can be 20 ms to 30 ms; within such an interval the speech signal can be regarded as steady-state.
  • the speech signal can be divided into frames to perform wavelet transformation, that is, after the speech signal is divided into frames, wavelet transformation and processing are performed on each frame.
  • Feature extraction can be performed on the speech information of each frame to obtain the acoustic characteristics of each speech frame. Acoustic features obtained include but are not limited to fundamental frequency, energy, and Mel frequency cepstrum coefficients.
  • the fundamental frequency determines the timbre and tone changes of the voice.
  • the airflow causes the vocal cords to vibrate periodically to produce the fundamental sound, and the frequency of the fundamental sound vibration is the fundamental frequency.
  • the short-term autocorrelation function can be used to extract the fundamental frequency. The steps are as follows:
  • the estimated value of the pitch period of the speech is the position (lag) where the first peak of the autocorrelation function appears, excluding the zero-lag point;
  • the short-term autocorrelation function can be defined as:
  • x(n) refers to the speech signal
  • m indicates that the window function is applied starting from the m-th point.
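  • A standard short-time autocorrelation definition consistent with these symbols (the exact windowing used in the application is not specified) is $R_m(k) = \sum_{n=m}^{m+N-1-k} \big[x(n)\,w(n-m)\big]\big[x(n+k)\,w(n-m-k)\big]$, where $w(\cdot)$ is the analysis window of length $N$ and $k$ is the lag; the lag of the first non-zero peak of $R_m(k)$ gives the pitch-period estimate.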
  • Speech can be divided into silent segment, unvoiced segment and voiced segment.
  • the voice signal amplitude of the unvoiced segment is relatively small and has no regularity; the voice signal amplitude of the voiced segment is relatively large.
  • in the voiced segment, the voice signal changes regularly and has a certain quasi-periodicity. The short-term (10-30 ms) stationarity of speech is therefore exploited: because speech changes very slowly, it is usually assumed that the speech is almost unchanged over a short period of time. For the voiced segment, the signal can therefore be considered periodic over a short interval, that is, the voiced signal satisfies:
  • x(t + nτ₀) ≈ x(t)
  • x(t) represents the speech signal at time t;
  • τ₀ is the period of the voiced signal, called the pitch period;
  • f₀ = 1/τ₀ is the fundamental frequency. Since τ₀ is almost constant over a short interval, f₀ can also be considered almost constant.
  • Energy reflects the intensity of the speaker’s voice, and the intensity of speech in different emotional states is not the same.
  • in an agitated state, the intensity of the speech is obviously greater than in the normal state, while in a low, depressed state the intensity of the speech is significantly reduced.
  • short-term energy and short-term average amplitude are used to express the strength of the sound.
  • the short-term energy of the speech signal is defined as:
  • n denotes the n-th moment;
  • x(m) is the speech signal, that is, the short-term energy is the weighted sum of squares of the sample value of a frame.
  • each sample value of the speech signal is squared and then passed through a filter h(n); the output is a time sequence composed of the short-term energy.
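  • A standard definition matching this description is $E_n = \sum_{m=-\infty}^{\infty} \big[x(m)\,w(n-m)\big]^2 = \sum_{m=-\infty}^{\infty} x^2(m)\,h(n-m)$ with $h(n) = w^2(n)$, i.e. the squared samples are filtered by $h(n)$ to give the short-term energy sequence.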
  • MFCC (Mel frequency cepstral coefficients) are cepstral features computed on the Mel frequency scale, which approximates human auditory perception.
  • the STRAIGHT-based processing model can also be used to process each speech frame to generate corresponding acoustic features.
  • the acoustic features may also include the pronunciation duration in the speech sample and features related to the pronunciation duration.
  • the pronunciation duration can refer to the duration of a person's speech. People speak differently in different emotional states. For example, when people are agitated or excited, their nerves are in a highly excited state and they tend to speak faster; on the contrary, when people are sad, their speech becomes weaker and slower due to the depressed mood.
  • the voice sample is divided into multiple voice frames, that is, into multiple voice segments that are convenient for computer processing.
  • the acoustic characteristics of each of the speech frames are calculated respectively, and the acoustic characteristics include the fundamental frequency, energy, and Mel frequency cepstrum coefficients to obtain important acoustic characteristics of the speech sample.
  • the acoustic features of each speech frame are sorted in time sequence to form the acoustic feature sequence, and the acoustic feature sequence can be directly used to input the acoustic model for training, so as to reduce the amount of computation and computation time for acoustic model training.
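  • As a rough sketch of steps S201-S203 (the application itself uses STRAIGHT for feature extraction; the library, sample rate, frame and hop lengths below are assumptions chosen for illustration), frame-wise fundamental frequency, energy and MFCC features can be computed and stacked in time order as follows:

```python
import librosa
import numpy as np

def extract_acoustic_feature_sequence(wav_path: str,
                                      frame_length: int = 1024,
                                      hop_length: int = 256) -> np.ndarray:
    """Return a (num_frames, num_features) array ordered by time:
    [f0, energy, 13 MFCCs] per frame."""
    y, sr = librosa.load(wav_path, sr=16000)

    # Fundamental frequency per frame (unvoiced frames come back as NaN).
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"),
                            sr=sr, frame_length=frame_length,
                            hop_length=hop_length)
    f0 = np.nan_to_num(f0)

    # Short-term energy per frame (RMS of each windowed frame).
    energy = librosa.feature.rms(y=y, frame_length=frame_length,
                                 hop_length=hop_length)[0]

    # 13 Mel frequency cepstral coefficients per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=frame_length, hop_length=hop_length)

    # Align lengths and stack so rows are frames in time order.
    n = min(len(f0), len(energy), mfcc.shape[1])
    return np.column_stack([f0[:n], energy[:n], mfcc[:, :n].T])
```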
  • step S30 includes:
  • S302 Input the mixed speech data and the corresponding acoustic feature sequence into the acoustic model for pre-training, to obtain a pre-training model;
  • the mixed voice data may refer to a sample set composed of voice samples containing different human voices.
  • one voice sample may include more than one person's voice, or the speaker of one voice sample is different from the speaker of another voice sample.
  • Feature extraction can be performed on each voice sample in the mixed voice data to obtain the acoustic feature sequence corresponding to each voice sample.
  • the pre-training procedure is the same as the normal training process; the term "pre-training" is used here only to distinguish it from the training process of step S304. The number of iterations required for training can be determined according to actual needs. After training with the mixed speech data, the corresponding pre-training model is obtained.
  • the specific voice data may refer to a sample set composed of voice samples containing the pronunciation of the same person; for example, all voice samples in the specific voice data are uttered by user A. Feature extraction can be performed on each voice sample in the specific voice data to obtain the acoustic feature sequence corresponding to each voice sample. After training on the specific voice data, a speech synthesis model matching the speaker of the specific voice data can be obtained; that is, the obtained speech synthesis model can generate synthesized speech similar to that speaker, for example the pronunciation of a specific celebrity. Because the model is first pre-trained with mixed speech data and then further adaptively trained with the specific speech data, the trained acoustic model has better generalized representation ability: it can combine the acoustic features of different speakers to establish the mapping relationship between acoustic features and voice waveforms.
  • the mixed speech data and the corresponding acoustic feature sequences are acquired, so as to obtain the speech samples used for training.
  • the specific speech data and the corresponding acoustic feature sequence are input into the pre-training model for training, so as to obtain a speech synthesis model with higher synthesis quality.
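  • A minimal sketch of this two-stage procedure (pre-training on mixed speech data, then adaptive training on a specific speaker's data); the generic training loop, the `model.loss` interface and the data loaders are assumptions, not the application's actual training code:

```python
import torch

def train(model, loader, epochs, lr, device="cpu"):
    """One generic training loop; the loss interface is a stand-in."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device).train()
    for _ in range(epochs):
        for features, waveform in loader:  # acoustic feature sequence, target speech
            features, waveform = features.to(device), waveform.to(device)
            loss = model.loss(waveform, condition=features)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model

# Stage 1: pre-train on mixed speech data from many speakers (step S302).
# pretrained = train(acoustic_model, mixed_loader, epochs=50, lr=1e-3)

# Stage 2: adapt the pre-trained model on one speaker's data (step S304),
# typically with a smaller learning rate and fewer epochs.
# speech_synthesis_model = train(pretrained, specific_loader, epochs=10, lr=1e-4)
```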
  • step S30 further includes:
  • noise may be added to part of the speech samples.
  • the preset ratio can be determined according to actual needs, such as 1-10%.
  • noise can refer to environmental sounds in different scenarios, such as airport noise, office noise, vegetable market noise, supermarket noise, etc.
  • the voice sample after adding noise is the noise sample.
  • the process of extracting features from the noise samples and feeding them into the acoustic model for training is basically the same as that for speech samples without noise, so it is not repeated here.
  • the voice samples of a preset ratio are selected, and different selection ratios can be set for different scenarios.
  • Acoustic feature sequences corresponding to the noise samples are acquired, and the acquired acoustic feature sequences can be directly input into the acoustic model for training. The preset acoustic model is trained according to the noise samples and their corresponding acoustic feature sequences together with the speech samples and their corresponding acoustic feature sequences, so that the obtained speech synthesis model has better adaptability and generalization capability.
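  • A simple sketch of selecting a preset ratio of speech samples and mixing scene noise into them to form noise samples; the ratio, signal-to-noise ratio and noise clips below are placeholders:

```python
import random
import numpy as np

def add_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float = 10.0) -> np.ndarray:
    """Mix a noise clip into a speech clip at a given signal-to-noise ratio."""
    noise = np.resize(noise, speech.shape)           # loop/trim noise to length
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def make_noise_samples(samples, noises, ratio=0.05, seed=0):
    """Pick a preset ratio of the samples and return noisy copies of them."""
    rng = random.Random(seed)
    chosen = rng.sample(range(len(samples)), k=max(1, int(ratio * len(samples))))
    return [add_noise(samples[i], rng.choice(noises)) for i in chosen]
```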
  • a speech synthesis device is provided, and the speech synthesis device corresponds to the speech synthesis method in the foregoing embodiment one-to-one.
  • the speech synthesis device includes a sample acquisition module 10, a feature extraction module 20, a training module 30 and a synthesis module 40.
  • the detailed description of each functional module is as follows:
  • the sample obtaining module 10 is used to obtain voice samples
  • the feature extraction module 20 is configured to perform feature extraction on the voice sample to obtain the acoustic feature sequence corresponding to the voice sample;
  • the training module 30 is configured to train a preset acoustic model according to the voice sample and the acoustic feature sequence corresponding thereto, and determine the acoustic model that meets the preset requirements after the training as a speech synthesis model.
  • the preset acoustic model is based on the wavenet network;
  • the synthesis module 40 is configured to obtain speech parameters parsed from the speech text to be synthesized, input the speech parameters into the speech synthesis model, and obtain synthesized speech output by the speech synthesis model.
  • the speech synthesis device further includes:
  • the text obtaining module 50 is used to obtain the speech text to be synthesized;
  • the voice parameter obtaining module 60 is configured to input the voice text to be synthesized into a preset text analysis model, and obtain voice parameters corresponding to the voice text to be synthesized from the preset text analysis model.
  • the feature extraction module 20 includes:
  • the speech cutting unit 201 is configured to cut the speech sample into multiple speech frames
  • the acoustic feature extraction unit 202 is configured to extract and calculate the acoustic features of each of the speech frames, where the acoustic features include fundamental frequency, energy, and Mel frequency cepstrum coefficients;
  • the generating acoustic feature sequence unit 203 is configured to sort the acoustic features of each speech frame in time sequence to form the acoustic feature sequence.
  • the training module 30 includes:
  • a pre-training unit configured to input the mixed speech data and the corresponding acoustic feature sequence into the acoustic model for pre-training to obtain a pre-training model
  • the specific voice training unit is used to input the specific voice data and the corresponding acoustic feature sequence into the pre-training model for training.
  • the training module 30 also includes:
  • the noise adding unit is used to add noise to the selected voice samples to form noise samples;
  • the noise feature acquiring unit is configured to acquire the acoustic feature sequences corresponding to the noise samples;
  • the noise sample training unit is configured to train a preset acoustic model according to the noise sample and the corresponding acoustic feature sequence, the voice sample, and the acoustic feature sequence corresponding thereto.
  • Each module in the above-mentioned speech synthesis device can be implemented in whole or in part by software, hardware and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer device includes a processor, a memory, a network interface and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store the data involved in the speech synthesis method.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions are executed by the processor to realize a speech synthesis method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium
  • a computer device includes a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, and the processor implements the following steps when executing the computer-readable instructions:
  • Train a preset acoustic model according to the speech sample and the corresponding acoustic feature sequence, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model, where the preset acoustic model is based on the WaveNet network;
  • one or more computer-readable storage media storing computer-readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media.
  • the readable storage medium stores computer readable instructions, and the computer readable instructions implement the following steps when executed by one or more processors:
  • Train a preset acoustic model according to the speech sample and the corresponding acoustic feature sequence, and determine the acoustic model that meets the preset requirements after training as a speech synthesis model, where the preset acoustic model is based on the WaveNet network;
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention relates to a speech synthesis method and apparatus, a computer device and a storage medium. The method comprises: acquiring a speech sample (S10); performing feature extraction on the speech sample to obtain an acoustic feature sequence corresponding to the speech sample (S20); training a preset acoustic model according to the speech sample and the acoustic feature sequence corresponding to it, and determining the acoustic model that meets preset requirements after training as a speech synthesis model, the preset acoustic model being based on a WaveNet network (S30); and acquiring speech parameters parsed from a speech text to be synthesized, inputting the speech parameters into the speech synthesis model, and acquiring the synthesized speech output by the speech synthesis model (S40). The speech synthesis method can improve the processing speed and accuracy of a sound processing system capable of simulating the pronunciation of a real person.
PCT/CN2019/116509 2019-04-23 2019-11-08 Speech synthesis method and apparatus, computer device and storage medium WO2020215666A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910328125.XA CN110033755A (zh) 2019-04-23 2019-04-23 语音合成方法、装置、计算机设备及存储介质
CN201910328125.X 2019-04-23

Publications (1)

Publication Number Publication Date
WO2020215666A1 (fr)

Family

ID=67239848

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116509 WO2020215666A1 (fr) 2019-04-23 2019-11-08 Procédé et appareil de synthèse de la parole, dispositif informatique et support de stockage

Country Status (2)

Country Link
CN (1) CN110033755A (fr)
WO (1) WO2020215666A1 (fr)

Families Citing this family (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110033755A (zh) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及存储介质
CN110675881B (zh) * 2019-09-05 2021-02-19 北京捷通华声科技股份有限公司 一种语音校验方法和装置
CN111816158B (zh) * 2019-09-17 2023-08-04 北京京东尚科信息技术有限公司 一种语音合成方法及装置、存储介质
CN111081216B (zh) * 2019-12-26 2022-04-15 度小满科技(北京)有限公司 一种音频合成方法、装置、服务器及存储介质
CN113192482B (zh) * 2020-01-13 2023-03-21 北京地平线机器人技术研发有限公司 语音合成方法及语音合成模型的训练方法、装置、设备
CN111276119B (zh) * 2020-01-17 2023-08-22 平安科技(深圳)有限公司 语音生成方法、系统和计算机设备
CN111276120B (zh) * 2020-01-21 2022-08-19 华为技术有限公司 语音合成方法、装置和计算机可读存储介质
CN111276121B (zh) * 2020-01-23 2021-04-30 北京世纪好未来教育科技有限公司 语音对齐方法、装置、电子设备及存储介质
CN113299272B (zh) * 2020-02-06 2023-10-31 菜鸟智能物流控股有限公司 语音合成模型训练和语音合成方法、设备及存储介质
CN111292720B (zh) * 2020-02-07 2024-01-23 北京字节跳动网络技术有限公司 语音合成方法、装置、计算机可读介质及电子设备
CN111312208A (zh) * 2020-03-09 2020-06-19 广州深声科技有限公司 一种说话人不相干的神经网络声码器系统
CN111402923B (zh) * 2020-03-27 2023-11-03 中南大学 基于wavenet的情感语音转换方法
CN111489734B (zh) * 2020-04-03 2023-08-22 支付宝(杭州)信息技术有限公司 基于多说话人的模型训练方法以及装置
CN113257236B (zh) * 2020-04-30 2022-03-29 浙江大学 一种基于核心帧筛选的模型得分优化方法
CN111696517A (zh) * 2020-05-28 2020-09-22 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及计算机可读存储介质
CN111785303B (zh) * 2020-06-30 2024-04-16 合肥讯飞数码科技有限公司 模型训练方法、模仿音检测方法、装置、设备及存储介质
CN111916049B (zh) * 2020-07-15 2021-02-09 北京声智科技有限公司 一种语音合成方法及装置
CN111968678B (zh) * 2020-09-11 2024-02-09 腾讯科技(深圳)有限公司 一种音频数据处理方法、装置、设备及可读存储介质
CN112289298A (zh) * 2020-09-30 2021-01-29 北京大米科技有限公司 合成语音的处理方法、装置、存储介质以及电子设备
CN112349268A (zh) * 2020-11-09 2021-02-09 湖南芒果听见科技有限公司 一种应急广播音频处理系统及其运行方法
WO2022141126A1 (fr) * 2020-12-29 2022-07-07 深圳市优必选科技股份有限公司 Procédé d'apprentissage de conversion de parole personnalisé, dispositif informatique et support de stockage
CN112767957B (zh) * 2020-12-31 2024-05-31 中国科学技术大学 获得预测模型的方法、语音波形的预测方法及相关装置
CN112863483B (zh) * 2021-01-05 2022-11-08 杭州一知智能科技有限公司 支持多说话人风格、语言切换且韵律可控的语音合成装置
CN112992162B (zh) * 2021-04-16 2021-08-20 杭州一知智能科技有限公司 一种音色克隆方法、系统、装置及计算机可读存储介质
CN112951203B (zh) * 2021-04-25 2023-12-29 平安创科科技(北京)有限公司 语音合成方法、装置、电子设备及存储介质
CN113450764B (zh) * 2021-07-08 2024-02-06 平安科技(深圳)有限公司 文本语音识别方法、装置、设备及存储介质
CN113569196A (zh) * 2021-07-15 2021-10-29 苏州仰思坪半导体有限公司 数据处理方法、装置、介质和设备
CN113838450B (zh) * 2021-08-11 2022-11-25 北京百度网讯科技有限公司 音频合成及相应的模型训练方法、装置、设备及存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514878A (zh) * 2012-06-27 2014-01-15 北京百度网讯科技有限公司 声学建模方法及装置和语音识别方法及装置
CN107945786A (zh) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 语音合成方法和装置
CN108630190A (zh) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 用于生成语音合成模型的方法和装置
CN109102796A (zh) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 一种语音合成方法及装置
CN110033755A (zh) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及存储介质

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573694B (zh) * 2018-02-01 2022-01-28 北京百度网讯科技有限公司 基于人工智能的语料扩充及语音合成系统构建方法及装置
CN108597492B (zh) * 2018-05-02 2019-11-26 百度在线网络技术(北京)有限公司 语音合成方法和装置
CN109036371B (zh) * 2018-07-19 2020-12-18 北京光年无限科技有限公司 用于语音合成的音频数据生成方法及系统
CN108899009B (zh) * 2018-08-17 2020-07-03 百卓网络科技有限公司 一种基于音素的中文语音合成系统

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514878A (zh) * 2012-06-27 2014-01-15 北京百度网讯科技有限公司 声学建模方法及装置和语音识别方法及装置
CN107945786A (zh) * 2017-11-27 2018-04-20 北京百度网讯科技有限公司 语音合成方法和装置
CN108630190A (zh) * 2018-05-18 2018-10-09 百度在线网络技术(北京)有限公司 用于生成语音合成模型的方法和装置
CN109102796A (zh) * 2018-08-31 2018-12-28 北京未来媒体科技股份有限公司 一种语音合成方法及装置
CN110033755A (zh) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 语音合成方法、装置、计算机设备及存储介质

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LING, ZHENHUA ET AL.: "Research on Speech Synthesis Vocoders Using WaveNet", AI-VIEW, no. 1, 28 February 2018 (2018-02-28), pages 1 - 64, XP055747021, ISSN: 2096-5036 *
WU, HONGZHUAN ET AL.: "The Speech Parameter Synthesizer based on Deep Convolutional Neural Network)", THE PROCEEDINGS OF THE 14TH NATIONAL CONFERENCE ON MAN-MACHINE SPEECH COMMUNICATION (NCMMSC’2017)), 31 October 2017 (2017-10-31), pages 177 - 181 *
WU, HONGZHUAN: "Research on Speech Synthesis Vocoders Using Convolutional Neural Networks", A DISSERTATION FOR MASTER`S DEGREE, no. 1, 15 January 2019 (2019-01-15), XP055747021, ISSN: 1674-0246 *
WU, HONGZHUAN: "Research on Speech Synthesis Vocoders Using Convolutional Neural Networks", RESEARCH ON SPEECH SYNTHESIS VOCODERS USING CONVOLUTIONAL NEUREAL NETWORKS, no. 1, 15 January 2019 (2019-01-15), XP055747021, ISSN: 1674-0246 *

Also Published As

Publication number Publication date
CN110033755A (zh) 2019-07-19

Similar Documents

Publication Publication Date Title
WO2020215666A1 (fr) Speech synthesis method and apparatus, computer device and storage medium
CN112689871B (zh) 使用神经网络以目标讲话者的话音从文本合成语音
US11664011B2 (en) Clockwork hierarchal variational encoder
US11881210B2 (en) Speech synthesis prosody using a BERT model
JP2023535230A (ja) 2レベル音声韻律転写
US8594993B2 (en) Frame mapping approach for cross-lingual voice transformation
CN111833843B (zh) 语音合成方法及系统
JP7257593B2 (ja) 区別可能な言語音を生成するための音声合成のトレーニング
CN110570876B (zh) 歌声合成方法、装置、计算机设备和存储介质
US20220246132A1 (en) Generating Diverse and Natural Text-To-Speech Samples
JP7393585B2 (ja) テキスト読み上げのためのWaveNetの自己トレーニング
WO2021134591A1 (fr) Procédé et appareil de synthèse de la parole, terminal intelligent et support d'enregistrement
JP2024529880A (ja) 合成トレーニングデータを使用した2レベルのテキスト読上げシステム
WO2015025788A1 (fr) Dispositif et procédé de génération quantitative motif f0, et dispositif et procédé d'apprentissage de modèles pour la génération d'un motif f0
WO2022072936A2 (fr) Synthèse texte-parole à l'aide d'une prédiction de durée
Singh et al. Spectral modification based data augmentation for improving end-to-end ASR for children's speech
Dua et al. Spectral warping and data augmentation for low resource language ASR system under mismatched conditions
Yang et al. Adversarial feature learning and unsupervised clustering based speech synthesis for found data with acoustic and textual noise
CN113963679A (zh) 一种语音风格迁移方法、装置、电子设备及存储介质
Chen et al. The USTC System for Voice Conversion Challenge 2016: Neural Network Based Approaches for Spectrum, Aperiodicity and F0 Conversion.
Kurian et al. Connected digit speech recognition system for Malayalam language
JP7357518B2 (ja) 音声合成装置及びプログラム
WO2022039636A1 (fr) Procédé de synthèse vocale avec attribution d'une intonation fiable d'un modèle à cloner
US11335321B2 (en) Building a text-to-speech system from a small amount of speech data
CN115394284B (zh) 语音合成方法、系统、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19926179

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19926179

Country of ref document: EP

Kind code of ref document: A1