WO2022141126A1 - Personalized speech conversion training method, computer device, and storage medium - Google Patents

Personalized speech conversion training method, computer device, and storage medium

Info

Publication number
WO2022141126A1
WO2022141126A1 (PCT/CN2020/141091)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
voice
data
model
corpus
Prior art date
Application number
PCT/CN2020/141091
Other languages
French (fr)
Chinese (zh)
Inventor
黄东延
王若童
Original Assignee
深圳市优必选科技股份有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Priority to PCT/CN2020/141091 priority Critical patent/WO2022141126A1/en
Publication of WO2022141126A1 publication Critical patent/WO2022141126A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management

Definitions

  • the present invention relates to the field of computer technology, in particular to a personalized voice conversion training method, computer equipment and storage medium.
  • speech synthesis technology which is one of the important ways of human-computer communication, has received extensive attention from researchers because of its convenience and speed.
  • the goal of speech synthesis is to make the synthesized speech intelligible, clear, natural and expressive.
  • in order to make the synthesized speech clearer, more natural, and more expressive, existing speech synthesis systems generally select a target speaker, record a large amount of that speaker's pronunciation data, and use these recordings as the basic data for speech synthesis.
  • the advantage of this method is that the sound quality and timbre of the synthesized speech are more similar to the speaker's own voice, and its clarity and naturalness are greatly improved; the disadvantage is that a large amount of sample speech data from the target speaker must be collected, which consumes considerable material and financial resources and makes it very difficult to build a unique personalized speech synthesis model for each individual user.
  • the present invention provides a personalized speech conversion training method, the method comprising:
  • voice corpus data is acquired from a voice corpus; the voice corpus data includes the voice parallel corpus of N speakers, where a voice parallel corpus means that multiple speakers' voice corpora correspond to the same voice text content;
  • the initial speech conversion model is trained based on the speech parallel corpus of N speakers, and the average speech conversion model is obtained;
  • the voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;
  • the first sample voice data of the target speaker is acquired, and the second sample voice data corresponding to the specific speaker is obtained; the first sample voice data and the second sample voice data correspond to the same text content, and the scale of the first sample voice data is much smaller than the scale of the speech parallel corpus;
  • the specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.
  • the present invention provides a computer device, comprising a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:
  • voice corpus data is acquired from a voice corpus; the voice corpus data includes the voice parallel corpus of N speakers, where a voice parallel corpus means that multiple speakers' voice corpora correspond to the same voice text content;
  • the initial speech conversion model is trained based on the speech parallel corpus of N speakers, and the average speech conversion model is obtained;
  • the voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;
  • the first sample voice data of the target speaker is acquired, and the second sample voice data corresponding to the specific speaker is obtained; the first sample voice data and the second sample voice data correspond to the same text content, and the scale of the first sample voice data is much smaller than the scale of the speech parallel corpus;
  • the specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.
  • the present invention provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:
  • voice corpus data is acquired from a voice corpus; the voice corpus data includes the voice parallel corpus of N speakers, where a voice parallel corpus means that multiple speakers' voice corpora correspond to the same voice text content;
  • the initial speech conversion model is trained based on the speech parallel corpus of N speakers, and the average speech conversion model is obtained;
  • the voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;
  • the first sample voice data of the target speaker is acquired, and the second sample voice data corresponding to the specific speaker is obtained; the first sample voice data and the second sample voice data correspond to the same text content, and the scale of the first sample voice data is much smaller than the scale of the speech parallel corpus;
  • the specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.
  • the present invention provides a personalized speech conversion training method, computer equipment and storage medium.
  • the initial speech conversion model is trained with the acquired speech parallel corpora of N speakers to obtain an average speech conversion model; the speech parallel corpus of a specific speaker is acquired and combined with each of the N speakers' parallel corpora to obtain N groups of training speech data; the average speech conversion model is trained on the N groups of training speech data to obtain a specific speech conversion average model; the first sample speech data of the target speaker and the second sample speech data corresponding to the specific speaker are acquired, and the specific speech conversion average model is trained to obtain a target speech conversion model for converting the specific speech to the target speech.
  • since the scale of the target speaker's first sample speech data is much smaller than the scale of the speech parallel corpus, the present invention needs only a small amount of the target speaker's sample speech data to synthesize high-quality personalized speech. This greatly reduces the production cost of personalized speech, so that a unique personalized speech synthesis model can be produced for each individual user.
  • FIG. 1 is a flowchart of a personalized speech conversion training method in one embodiment
  • FIG. 2 is a flowchart of a personalized speech conversion training method in another embodiment
  • FIG. 3 is a schematic diagram of an algorithm for aligning source speech acoustic features with desired speech acoustic features in one embodiment
  • FIG. 4 is a schematic flowchart of aligning source speech acoustic features with desired speech acoustic features in one embodiment
  • FIG. 5 is a flowchart of a personalized speech conversion method in one embodiment
  • FIG. 6 is a flowchart of a personalized speech conversion training method in yet another embodiment
  • FIG. 7 is a structural block diagram of a personalized speech conversion training apparatus in one embodiment
  • FIG. 8 is a structural block diagram of a personalized speech conversion training apparatus in another embodiment
  • Figure 9 is a diagram of the internal structure of a computer device in one embodiment.
  • the present invention provides a personalized voice conversion training method, the method includes:
  • Step 102 Acquire voice corpus data in the voice corpus, where the voice corpus data includes: parallel voice corpora of N speakers, and the parallel voice corpus refers to the voice corpus of multiple people corresponding to the same voice text content.
  • the speech corpus refers to a place where speech corpus data is stored.
  • the speech corpus includes a sufficient amount of speech corpus, and the speech corpus may include speech samples and text samples corresponding to the speech samples.
  • a speech-parallel corpus means that every speaker's spoken text content is the same; for example, each speaker has 300 spoken sentences, and the text content of those 300 sentences is identical.
  • step 104 the initial speech conversion model is trained based on the speech parallel corpus of N speakers to obtain an average speech conversion model.
  • the parallel speech corpora of the N speakers are combined in pairs to obtain N×(N−1) groups of training speech data; the initial speech conversion model is trained on these N×(N−1) groups, and the average speech conversion model is obtained.
  • the initial speech conversion model is obtained based on a neural network deep learning model.
  • the neural network model can be a BiLSTM (Bi-directional Long Short-Term Memory) model; the initial speech conversion model built on the BiLSTM model is trained with the parallel corpora of the N speakers to obtain the average speech conversion model.
  • Step 106 Acquire the voice parallel corpus of the specific speaker, and combine the voice parallel corpus of the specific speaker with the voice parallel corpus of N speakers to obtain N groups of training voice data.
  • the parallel speech corpus of a specific speaker is combined with the speech parallel corpus of N speakers to obtain N groups of training speech data, that is, N groups of training speech data converted from a specific speaker to N speakers are obtained.
  • the N groups of training voice data can be stored in the cloud, and correspondingly, the N groups of training voice data can also be stored in the local device.
  • the specific speaker is A, and there are 10 speakers, each with 300 parallel utterances.
  • three hundred parallel utterances of specific speaker A are acquired and combined with each of the 10 speakers' parallel corpora, yielding 1 × 10 × 300 = 3000 groups of training speech data.
  • Step 108 train the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model.
  • the voice conversion average model is trained by using N groups of training speech data to obtain a specific voice conversion average model, that is, a specific voice conversion average model that can be converted from a specific speaker to N speakers.
  • Step 110: Obtain the first sample voice data of the target speaker and the second sample voice data corresponding to the specific speaker; the first sample voice data and the second sample voice data correspond to the same text content, and the scale of the first sample voice data is much smaller than the scale of the speech parallel corpus.
  • the second sample voice data corresponding to the specific speaker is obtained from the voice corpus, and the first sample data of the target speaker is obtained through a voice acquisition device.
  • for example, the first sample data of the target speaker can be recorded in a studio with the corresponding equipment. Since the scale of the first sample speech data is much smaller than the scale of the speech parallel corpus, the first sample speech data of the target speaker is a small speech sample of the target speaker.
  • Step 112 train the average model of specific voice conversion based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.
  • the first sample speech data of the target speaker and the second sample speech data corresponding to the specific speaker are combined to obtain training speech data converted from the specific speaker to the target speaker.
  • the invention provides a personalized voice conversion training method, device and computer equipment.
  • the initial voice conversion model is trained with the acquired voice parallel corpora of N speakers to obtain an average voice conversion model; the voice parallel corpus of a specific speaker is acquired and combined with each of the N speakers' parallel corpora to obtain N groups of training voice data; the average voice conversion model is trained on the N groups of training voice data to obtain a specific voice conversion average model; the first sample voice data of the target speaker and the second sample voice data corresponding to the specific speaker are acquired, and the specific voice conversion average model is trained to obtain a target voice conversion model for converting the specific voice to the target voice.
  • since the scale of the target speaker's first sample speech data is much smaller than the scale of the speech parallel corpus, the present invention needs only a small amount of the target speaker's sample speech data to synthesize high-quality personalized speech, greatly reducing the production cost of personalized speech, so that a unique personalized speech synthesis model can be produced for each individual user.
  • in the N groups of training speech data, the specific speaker's parallel corpus is used as the source speech and the N speakers' parallel corpora are used as the desired speech; the method also includes:
  • Step 202 using a voice feature analyzer to extract acoustic features of the source voice and the desired voice respectively, to obtain the acoustic features of the source voice and the desired voice.
  • before the speech feature analyzer extracts the acoustic features of the source speech and the desired speech, the source speech and the desired speech are audio-resampled; an audio resampling algorithm can convert an audio signal between arbitrary sample rates.
  • Step 204: align the acoustic features of the source speech with the acoustic features of the desired speech on the time axis.
  • a dynamic programming time alignment (Dynamic Time Warping) method is used to align the acoustic features of the source speech to the acoustic feature length of the desired speech. Since the acoustic features are extracted frame by frame, it is necessary to measure the distance between frames at time t.
  • the function measuring the distance between frames at time t is $d(t)=\sqrt{\sum_{n=1}^{N}(I_{t,n}-J_{t,n})^{2}}$, where I and J are feature matrices of dimension T (number of frames) × N (feature dimension).
  • Step 206 using the aligned source speech acoustic features and expected speech acoustic features to train the preset neural network model to obtain an initial speech conversion model.
  • the aligned source speech acoustic features and the desired speech acoustic features are sent into the bidirectional long short-term memory recurrent neural network BLSTM model to obtain the initial speech conversion model, that is, the initial speech conversion model that can convert the source speech to the desired speech is obtained.
  • the function measuring the distance between frames at time t is $d(t)=\sqrt{\sum_{n=1}^{N}(I_{t,n}-J_{t,n})^{2}}$, where I and J are feature matrices of dimension T (number of frames) × N (feature dimension).
  • the aligned source speech acoustic features x(T x N) are fed into the bidirectional long short-term memory recurrent neural network BLSTM model.
  • the relevant parameters of the bidirectional long short-term memory recurrent neural network BLSTM model are shown in Table 1.
  • the aligned source speech acoustic features x (T × N, N = 130 here) are sent into the bidirectional long short-term memory recurrent neural network BLSTM model, and the output converted acoustic features are ŷ (T × N, N = 130 here).
  • the initial speech conversion model is trained based on the parallel corpora of the N speakers to obtain an average speech conversion model, including: combining the N speakers' parallel corpora in pairs to obtain N×(N−1) groups of training speech data, and training the initial speech conversion model on these groups to obtain the average speech conversion model.
  • combining the N speakers' parallel corpora in pairs yields the N×(N−1) groups of training speech data for conversion between every ordered pair of the N speakers. Since every speaker has the same number of parallel utterances, each group of training speech data includes multiple sub-groups of training speech data. For example, when each of the N speakers has 300 parallel utterances, combining the corpora in pairs yields N×(N−1)×300 sub-groups of training speech data; in a specific embodiment, with 300 parallel utterances from each of 10 different speakers, the pairwise combination yields 10 × 9 × 300 = 27000 sub-groups of training speech data.
  • the method further includes: acquiring the speech text to be converted, and converting it into speech data of the specific speaker through a speech synthesis model; the speech data of the specific speaker is then used as the input of the target speech conversion model, and the target speech data output by the target speech conversion model is obtained.
  • the speech data of the specific speaker is used as the input of the target speech conversion model, and the target speech data output by the target speech conversion model is obtained.
  • combining speech synthesis technology with speech conversion technology, that is, adding a speech conversion model after the speech synthesis model, means that each target speaker only needs to provide a small amount of sample speech data to achieve high-quality personalized speech synthesis.
  • before the speech text to be converted is converted into speech data of the specific speaker by the speech synthesis model, the method further includes:
  • Step 602 Acquire target speech corpus data corresponding to a specific speaker.
  • the target speech corpus data is included in the target speech corpus, and the target speech corpus data includes target speech data and target text data corresponding to the target speech data.
  • Step 604 Perform text analysis and speech analysis on the target speech corpus data to obtain text features of the speech corpus and sound features of the speech corpus, respectively.
  • the sound feature of the speech corpus is obtained by a speech analyzer, and includes at least one of a timbre parameter, a pitch parameter and a loudness parameter.
  • the text analysis can be lexical analysis or syntactic analysis, and the text features include: phoneme sequence, part of speech, word length, and prosodic pause.
  • Step 606 using the text features of the speech corpus and the voice features of the speech corpus to train the preset neural network model to obtain a speech synthesis model corresponding to the specific speaker.
  • the preset neural network model is trained by using the text features of the speech corpus and the sound features of the speech corpus to obtain a speech synthesis model corresponding to a specific speaker.
  • a bi-directional long short-term memory neural network BiLSTM model is selected.
  • the bidirectional long short-term memory neural network BiLSTM model is a variant of the LSTM model and is composed of a forward LSTM model and a backward LSTM model.
  • the speech synthesis model includes a duration model and an acoustic model; acquiring the speech text to be converted and converting it into speech data of the specific speaker through the speech synthesis model further includes: performing text analysis on the text to be converted to obtain the features of the text to be converted; using the features of the text to be converted as the input of the duration model to obtain the corresponding duration features; and inputting the duration features and the features of the text to be converted into the acoustic model to obtain the specific speaker's speech data.
  • phonemes are the smallest phonetic units, divided according to the natural attributes of speech; the duration model predicts how long each phoneme is pronounced and thereby controls the speaking rate.
  • the acoustic model obtains the speech data of the specific speaker from the duration features and the features of the text to be converted.
  • the features of the text to be converted are obtained through an optimized front-end sub-module, which is based on a neural network deep learning model and includes a prosody prediction module, a part-of-speech module, a word-length module, and a phoneme-sequence module.
  • the present invention provides a personalized voice conversion training device, which includes:
  • the first obtaining module 702 is configured to obtain the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of multiple people corresponding to the same voice text content.
  • the first training module 704 is used for training the initial speech conversion model based on the speech parallel corpus of N speakers to obtain an average speech conversion model.
  • the second obtaining module 706 is configured to obtain the speech parallel corpus of a specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of N speakers to obtain N groups of training speech data.
  • the second training module 708 is configured to train the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model.
  • the second training module 708 is further configured to combine the N speakers' parallel corpora in pairs to obtain N×(N−1) groups of training speech data, and to train the initial speech conversion model on these groups to obtain the average speech conversion model.
  • the parallel speech corpus of a specific speaker is used as the source speech in the N groups of training speech data
  • the speech parallel corpus of N speakers is used as the desired speech in the N groups of training speech data.
  • the second training module 708 is further configured to use the speech feature analyzer to extract the acoustic features of the source speech and the desired speech; align the source speech acoustic features with the desired speech acoustic features on the time axis; and train the preset neural network model with the aligned features to obtain the initial speech conversion model.
  • the third obtaining module 710 is configured to obtain the first sample voice data of the target speaker and the second sample voice data corresponding to the specific speaker; the first sample voice data and the second sample voice data correspond to the same text content, and the scale of the first sample voice data is much smaller than the scale of the speech parallel corpus.
  • the third training module 712 is configured to train a specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting specific voices to target voices.
  • a personalized voice conversion training device further includes:
  • the fourth obtaining module 714 is configured to obtain the speech text to be converted.
  • the speech synthesis module 716 is used for converting the speech text to be converted into speech data of a specific speaker through a speech synthesis model.
  • the speech synthesis module 716 is further used to obtain the target speech corpus data corresponding to the specific speaker; perform text analysis and speech analysis on the target speech corpus data to obtain the text features of the speech corpus and the sound features of the speech corpus respectively; The text features of the speech corpus and the sound features of the speech corpus are used to train a preset neural network model to obtain a speech synthesis model corresponding to a specific speaker.
  • the speech conversion module 718 is configured to use the speech data of the specific speaker as the input of the target speech conversion model, and obtain the target speech data output by the target speech conversion model.
  • the computer equipment may be a personalized voice conversion training device or a terminal or server connected to the personalized voice conversion training device.
  • the computer device includes a processor, memory, and a network interface connected by a system bus.
  • the memory includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device stores an operating system, and also stores a computer program.
  • the processor can implement the personalized voice conversion training method.
  • a computer program can also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the personalized speech conversion training method.
  • the network interface is used to communicate with external devices.
  • FIG. 9 is only a block diagram of a partial structure related to the solution of the present application and does not limit the computer equipment to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • the personalized speech conversion training method provided by the present application can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 9 .
  • the individual program templates that make up the personalized speech conversion training device can be stored in the memory of the computer device.
  • a computer device includes a memory and a processor; the memory stores a computer program which, when executed by the processor, causes the processor to execute the above-mentioned personalized speech conversion training method.
  • a computer-readable storage medium stores a computer program, which, when executed by a processor, causes the processor to execute the above-mentioned personalized speech conversion training method.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Provided is a personalized speech conversion training method, comprising: acquiring a parallel speech corpus of N speakers to train an initial speech conversion model and obtain an average speech conversion model (104); acquiring a parallel speech corpus of a specific speaker and combining it with the parallel corpora of the N speakers to obtain N sets of training speech data (106); training the average speech conversion model on the basis of the N sets of training speech data to obtain a specific average speech conversion model (108); obtaining first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker (110); and training the specific average speech conversion model to obtain a target speech conversion model for converting specific speech to target speech (112). The invention also relates to a computer device and a storage medium.

Description

Personalized speech conversion training method, computer device, and storage medium

Technical Field
The present invention relates to the field of computer technology, and in particular to a personalized speech conversion training method, a computer device, and a storage medium.
Background
With the continuous development of multimedia communication technology, speech synthesis, one of the important modes of human-computer communication, has received extensive attention from researchers because of its convenience and speed. The goal of speech synthesis is to make the synthesized speech intelligible, clear, natural, and expressive. To that end, existing speech synthesis systems generally select a target speaker, record a large amount of that speaker's pronunciation data, and use these recordings as the basic data for speech synthesis. The advantage of this approach is that the sound quality and timbre of the synthesized speech closely resemble the speaker's own voice, greatly improving clarity and naturalness. The disadvantage is that a large amount of sample speech data from the target speaker must be collected, which consumes considerable material and financial resources and makes it very difficult to build a unique personalized speech synthesis model for each individual user.
Summary of the Invention
In view of the above problems, it is necessary to provide a personalized speech conversion training method, a computer device, and a storage medium that require only a small amount of sample speech data from the target speaker.
In a first aspect, the present invention provides a personalized speech conversion training method, the method comprising:
acquiring speech corpus data from a speech corpus, the speech corpus data including parallel speech corpora of N speakers, where a parallel speech corpus means that multiple speakers' speech corpora correspond to the same text content;
training an initial speech conversion model based on the parallel corpora of the N speakers to obtain an average speech conversion model;
acquiring a parallel speech corpus of a specific speaker, and combining it with each of the N speakers' parallel corpora to obtain N groups of training speech data;
training the average speech conversion model based on the N groups of training speech data to obtain a specific average speech conversion model;
acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, where the first and second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than the scale of the parallel speech corpus;
training the specific average speech conversion model based on the first and second sample speech data to obtain a target speech conversion model for converting the specific speech to the target speech.
In a second aspect, the present invention provides a computer device comprising a memory and a processor; the memory stores a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring speech corpus data from a speech corpus, the speech corpus data including parallel speech corpora of N speakers, where a parallel speech corpus means that multiple speakers' speech corpora correspond to the same text content;

training an initial speech conversion model based on the parallel corpora of the N speakers to obtain an average speech conversion model;

acquiring a parallel speech corpus of a specific speaker, and combining it with each of the N speakers' parallel corpora to obtain N groups of training speech data;

training the average speech conversion model based on the N groups of training speech data to obtain a specific average speech conversion model;

acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, where the first and second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than the scale of the parallel speech corpus;

training the specific average speech conversion model based on the first and second sample speech data to obtain a target speech conversion model for converting the specific speech to the target speech.
In a third aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring speech corpus data from a speech corpus, the speech corpus data including parallel speech corpora of N speakers, where a parallel speech corpus means that multiple speakers' speech corpora correspond to the same text content;

training an initial speech conversion model based on the parallel corpora of the N speakers to obtain an average speech conversion model;

acquiring a parallel speech corpus of a specific speaker, and combining it with each of the N speakers' parallel corpora to obtain N groups of training speech data;

training the average speech conversion model based on the N groups of training speech data to obtain a specific average speech conversion model;

acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, where the first and second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than the scale of the parallel speech corpus;

training the specific average speech conversion model based on the first and second sample speech data to obtain a target speech conversion model for converting the specific speech to the target speech.
The present invention provides a personalized speech conversion training method, a computer device, and a storage medium. An initial speech conversion model is trained with the acquired parallel corpora of N speakers to obtain an average speech conversion model; a parallel corpus of a specific speaker is acquired and combined with each of the N speakers' parallel corpora to obtain N groups of training speech data; the average speech conversion model is trained on the N groups of training speech data to obtain a specific average speech conversion model; first sample speech data of the target speaker and second sample speech data corresponding to the specific speaker are acquired, and the specific average speech conversion model is trained to obtain a target speech conversion model for converting the specific speech to the target speech. Since the scale of the target speaker's first sample speech data is much smaller than the scale of the parallel speech corpus, the invention needs only a small amount of the target speaker's sample speech data to synthesize high-quality personalized speech, greatly reducing the production cost of personalized speech, so that a unique personalized speech synthesis model can be produced for each individual user.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the structures shown in these drawings without creative effort.
FIG. 1 is a flowchart of a personalized speech conversion training method in one embodiment;

FIG. 2 is a flowchart of a personalized speech conversion training method in another embodiment;

FIG. 3 is a schematic diagram of an algorithm for aligning source speech acoustic features with desired speech acoustic features in one embodiment;

FIG. 4 is a schematic flowchart of aligning source speech acoustic features with desired speech acoustic features in one embodiment;

FIG. 5 is a flowchart of a personalized speech conversion method in one embodiment;

FIG. 6 is a flowchart of a personalized speech conversion training method in yet another embodiment;

FIG. 7 is a structural block diagram of a personalized speech conversion training apparatus in one embodiment;

FIG. 8 is a structural block diagram of a personalized speech conversion training apparatus in another embodiment;

FIG. 9 is a diagram of the internal structure of a computer device in one embodiment.
Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
As shown in FIG. 1, the present invention provides a personalized speech conversion training method, the method comprising:
Step 102: Acquire speech corpus data from a speech corpus, the speech corpus data including parallel speech corpora of N speakers, where a parallel speech corpus means that multiple speakers' speech corpora correspond to the same text content.
Here, the speech corpus is where the speech corpus data is stored; it contains a sufficient amount of speech material, which may include speech samples and the text samples corresponding to them. A parallel speech corpus means that every speaker's spoken text content is the same; for example, each speaker has 300 spoken sentences, and the text content of those 300 sentences is identical.
Step 104: Train an initial speech conversion model based on the parallel corpora of the N speakers to obtain an average speech conversion model.
Specifically, the parallel corpora of the N speakers are combined in pairs to obtain N×(N−1) groups of training speech data (one group per ordered source-to-target speaker pair), and the initial speech conversion model is trained on these groups to obtain the average speech conversion model. The initial speech conversion model is built on a neural-network deep-learning model.
In one embodiment, the neural network model can be a BiLSTM (Bi-directional Long Short-Term Memory) model; the initial speech conversion model built on the BiLSTM model is trained with the parallel corpora of the N speakers to obtain the average speech conversion model.
Step 106: Acquire a parallel speech corpus of a specific speaker, and combine it with each of the N speakers' parallel corpora to obtain N groups of training speech data.

Combining the specific speaker's parallel corpus with each of the N speakers' parallel corpora produces N groups of training speech data, that is, N groups of data for converting from the specific speaker to each of the N speakers. The N groups of training speech data can be stored in the cloud or, equally, on a local device.
In one embodiment, the specific speaker is A and there are 10 speakers, each with 300 parallel utterances. Three hundred parallel utterances of speaker A are acquired and combined with each of the 10 speakers' parallel corpora, yielding 1 × 10 × 300 = 3000 groups of training speech data.
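As an illustration only (not part of the patent), a minimal Python sketch of how these training groups could be assembled, assuming a parallel corpus stored as a per-speaker list of utterances; the `corpus` layout and speaker names are hypothetical:

```python
def build_specific_to_n_groups(corpus, specific="A", n_sentences=300):
    """Pair the specific speaker's parallel utterances with each reference
    speaker's utterances: one group per reference speaker. `corpus` maps a
    speaker name to its list of utterances (same texts for every speaker)."""
    groups = []
    for speaker, utterances in corpus.items():
        if speaker == specific:
            continue
        groups.append([(corpus[specific][i], utterances[i])
                       for i in range(n_sentences)])
    return groups  # with 10 speakers: 10 groups x 300 pairs = 3000 pairs
```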
Step 108: Train the average speech conversion model based on the N groups of training speech data to obtain a specific average speech conversion model.

The average speech conversion model is trained with the N groups of training speech data, producing a specific average speech conversion model that can convert from the specific speaker to each of the N speakers.
Step 110: Obtain first sample speech data of the target speaker and second sample speech data corresponding to the specific speaker; the first and second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than the scale of the parallel speech corpus.

The second sample speech data corresponding to the specific speaker is taken from the speech corpus, while the first sample data of the target speaker is captured with a speech acquisition device; for example, it can be recorded in a studio with the corresponding equipment. Since its scale is much smaller than the scale of the parallel speech corpus, the first sample speech data constitutes a small speech sample of the target speaker.
Step 112: Train the specific average speech conversion model based on the first and second sample speech data to obtain a target speech conversion model for converting the specific speech to the target speech.

The target speaker's first sample speech data and the specific speaker's second sample speech data are combined into training speech data for converting from the specific speaker to the target speaker. For example, the specific average speech conversion model is trained with this data to obtain the target speech conversion model for converting the specific speech to the target speech.
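For illustration, a minimal fine-tuning sketch of this last step in PyTorch (a framework choice assumed here; the patent does not name one), where `model` is the specific average conversion model and `pairs` holds the aligned (specific, target) feature pairs built from the first and second sample speech data:

```python
import torch

def finetune(model, pairs, epochs=20, lr=1e-4):
    """Fine-tune the specific average conversion model on the small set of
    aligned (specific, target) acoustic-feature pairs. `model` is assumed
    to map a (T, N) feature tensor to a (T, N) feature tensor."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    for _ in range(epochs):
        for src, tgt in pairs:      # small sample: only a few utterances
            optimizer.zero_grad()
            loss = loss_fn(model(src), tgt)
            loss.backward()
            optimizer.step()
    return model
```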
The invention provides a personalized speech conversion training method, apparatus, and computer device. An initial speech conversion model is trained with the acquired parallel corpora of N speakers to obtain an average speech conversion model; the parallel corpus of a specific speaker is acquired and combined with each of the N speakers' parallel corpora to obtain N groups of training speech data; the average speech conversion model is trained on the N groups to obtain a specific average speech conversion model; the first sample speech data of the target speaker and the second sample speech data corresponding to the specific speaker are acquired, and the specific average speech conversion model is trained to obtain a target speech conversion model for converting the specific speech to the target speech. Since the scale of the target speaker's first sample speech data is much smaller than the scale of the parallel speech corpus, only a small amount of the target speaker's sample speech data is needed to synthesize high-quality personalized speech, greatly reducing the production cost of personalized speech and making it possible to build a unique personalized speech synthesis model for every individual user.
In one embodiment, as shown in FIG. 2, in the N groups of training speech data the specific speaker's parallel corpus serves as the source speech and the N speakers' parallel corpora serve as the desired speech; the method further includes:
Step 202: Use a speech feature analyzer to extract acoustic features from the source speech and the desired speech, obtaining the source speech acoustic features and the desired speech acoustic features.

To reduce computation and storage complexity, the source speech and the desired speech are first audio-resampled; an audio resampling algorithm can convert an audio signal between arbitrary sample rates. The speech feature analyzer then extracts acoustic features from the resampled source and desired speech, converting their speech signals into acoustic features such as the spectrum, the fundamental frequency, and the aperiodic spectrum.
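As a sketch of this step, one possible feature analyzer is the WORLD vocoder via the `pyworld` package (a library choice assumed here, not named by the patent); the target sample rate is likewise an assumption:

```python
import numpy as np
import scipy.signal
import pyworld  # WORLD vocoder bindings, one possible speech feature analyzer

def extract_features(wave, fs, target_fs=16000):
    """Resample the audio, then extract the spectrum, fundamental frequency
    (F0), and aperiodicity described above."""
    if fs != target_fs:
        wave = scipy.signal.resample_poly(wave, target_fs, fs)
    x = np.ascontiguousarray(wave, dtype=np.float64)
    f0, spectrum, aperiodicity = pyworld.wav2world(x, target_fs)
    return f0, spectrum, aperiodicity
```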
Step 204: Align the source speech acoustic features with the desired speech acoustic features on the time axis.
In one embodiment, as shown in FIG. 3 and FIG. 4, dynamic time warping (DTW) is used to align the source speech acoustic features to the length of the desired speech acoustic features. Since acoustic features are extracted frame by frame, the distance between frames at time t must be measured. The frame distance at time t is

$$d(t)=\sqrt{\sum_{n=1}^{N}\left(I_{t,n}-J_{t,n}\right)^{2}},$$

where I and J are feature matrices of dimension T (number of frames) × N (feature dimension).
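A minimal DTW sketch of this alignment step (illustrative, not the patent's exact implementation), using the per-frame Euclidean distance d(t) defined above:

```python
import numpy as np

def dtw_align(I, J):
    """Warp source features I (T1 x N) onto the time axis of desired
    features J (T2 x N) with dynamic time warping. Returns a (T2 x N)
    array of source frames aligned to the length of J."""
    T1, T2 = len(I), len(J)
    dist = np.linalg.norm(I[:, None, :] - J[None, :, :], axis=-1)  # (T1, T2)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):          # accumulated-cost matrix
        for j in range(1, T2 + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrack the optimal path, keeping one source frame per target frame.
    i, j, match = T1, T2, {}
    while i > 0 and j > 0:
        match[j - 1] = i - 1
        step = int(np.argmin([cost[i - 1, j - 1],
                              cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    while j > 0:                        # pad if the path reached i == 0 early
        match[j - 1] = 0
        j -= 1
    return np.stack([I[match[t]] for t in range(T2)])
```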
Step 206: Train a preset neural network model with the aligned source and desired speech acoustic features to obtain the initial speech conversion model.
The aligned source and desired speech acoustic features are fed into a bidirectional long short-term memory recurrent neural network (BLSTM) model, yielding an initial speech conversion model that can convert the source speech into the desired speech.
In one embodiment, the distance between frames at time t is measured with the function d(t) above, where I and J are feature matrices of dimension T (number of frames) × N (feature dimension). When N is 130, the aligned source speech acoustic features x (T × N) are fed into the BLSTM model, whose relevant parameters are shown in Table 1.
Table 1. Relevant parameters of the bidirectional long short-term memory (BLSTM) model.
The aligned source speech acoustic features x (T × N, N = 130 here) are fed into the BLSTM model, which outputs the converted acoustic features ŷ (T × N, N = 130 here).
The loss of ŷ is then computed against y, the correctly labeled acoustic features (for example, as a mean squared error, $L(\hat{y},y)=\frac{1}{T}\sum_{t=1}^{T}\lVert\hat{y}_{t}-y_{t}\rVert^{2}$). Gradient descent on the computed loss updates the parameter weights of the BLSTM model, thereby producing the initial speech conversion model.
In one embodiment, training the initial speech conversion model based on the parallel corpora of the N speakers to obtain the average speech conversion model includes: combining the N speakers' parallel corpora in pairs to obtain N×(N−1) groups of training speech data, and training the initial speech conversion model on these groups to obtain the average speech conversion model.
Specifically, combining the speech parallel corpora of the N speakers in pairs yields N(N-1)/2 groups of training speech data for conversion between every pair of the N speakers. Since every speaker has the same number of parallel sentences, each of these groups in turn contains multiple sub-groups of training speech data. For example, when each of the N speakers has 300 parallel sentences, pairwise combination yields 300 × N(N-1)/2 sub-groups of training speech data. In a specific embodiment, with 300 parallel sentences from each of 10 different speakers, pairwise combination yields 300 × C(10,2) = 13,500 sub-groups of training speech data (as above, the exact counts appear only as equation images in the original; these values follow from the assumed pairwise-combination reading).
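The pair enumeration can be sketched with itertools.combinations, as below; the corpus layout (a dict from speaker id to a list of parallel utterances) is an invented convenience, and the unordered-pair reading matches the assumed C(N,2) count:

```python
from itertools import combinations

def build_training_pairs(corpus):
    """corpus: dict mapping speaker id -> list of parallel utterances.

    Every speaker recorded the same sentences in the same order, so the
    k-th utterance of any two speakers forms one (source, target) pair.
    Returns C(N, 2) groups, each holding one sub-pair per sentence.
    """
    groups = []
    for spk_a, spk_b in combinations(corpus, 2):   # N*(N-1)/2 speaker pairs
        pairs = list(zip(corpus[spk_a], corpus[spk_b]))
        groups.append(((spk_a, spk_b), pairs))
    return groups

# e.g. 10 speakers x 300 sentences -> 45 groups x 300 = 13,500 sub-pairs
```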
In one embodiment, as shown in FIG. 5, the method further includes: acquiring the speech text to be converted and converting it into speech data of the specific speaker through a speech synthesis model; and taking the speech data of the specific speaker as the input of the target speech conversion model to obtain the target speech data output by the target speech conversion model.

That is, after the speech text to be converted has been converted into speech data of the specific speaker by the speech synthesis model, that speech data is fed into the target speech conversion model and the target speech data output by the model is obtained. Combining speech synthesis with speech conversion in this way, namely appending a speech conversion model after the speech synthesis model, allows each target speaker to obtain high-quality personalized speech synthesis while providing only a small amount of sample speech data.
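A sketch of this two-stage pipeline is shown below. The function names and model interfaces are hypothetical; only the order of operations follows the text: synthesize in the specific speaker's voice first, then convert to the target voice.

```python
def personalized_tts(text, synthesis_model, conversion_model):
    """Hypothetical two-stage pipeline: TTS in a well-resourced
    'specific speaker' voice, then voice conversion to a target
    speaker trained from only a few samples."""
    specific_speech = synthesis_model.synthesize(text)          # stage 1: TTS
    target_speech = conversion_model.convert(specific_speech)   # stage 2: VC
    return target_speech
```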
In one embodiment, as shown in FIG. 6, before the speech text to be converted is converted into speech data of the specific speaker through the speech synthesis model, the method further includes:
Step 602: acquire the target speech corpus data corresponding to the specific speaker.

The target speech corpus data is contained in a target speech corpus and includes target speech data together with the target text data corresponding to that speech data.
Step 604: perform text analysis and speech analysis on the target speech corpus data to obtain, respectively, the text features and the sound features of the speech corpus.

The sound features of the speech corpus are obtained with a speech analyzer and include at least one of a timbre parameter, a pitch parameter, and a loudness parameter. The text analysis may be lexical analysis or syntactic analysis, and the text features include the phoneme sequence, part of speech, word length, prosodic pauses, and the like.
Step 606: train the preset neural network model with the text features and the sound features of the speech corpus to obtain a speech synthesis model corresponding to the specific speaker.

In other words, training the preset neural network model on the text features and the sound features of the speech corpus yields the speech synthesis model corresponding to the specific speaker.
In one embodiment, the bidirectional long short-term memory neural network (BiLSTM) model is selected as the deep learning model. The BiLSTM model is a variant of the LSTM model, formed by combining a forward LSTM with a backward LSTM. An LSTM cell has three gate structures: a forget gate, an input gate, and an output gate. The LSTM structure captures relationships between samples across long time sequences, and the input, forget, and output gates retain or discard historical state, providing an effective cache of long-range historical information. The BiLSTM model is therefore selected for training to obtain the speech synthesis model corresponding to the specific speaker.
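For reference, the gate structure described here corresponds to the standard LSTM update equations (textbook formulations, not reproduced from the patent), with the BiLSTM output concatenating the forward and backward hidden states:

```latex
\begin{aligned}
f_t &= \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right) &&\text{forget gate}\\
i_t &= \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right) &&\text{input gate}\\
o_t &= \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right) &&\text{output gate}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right)\\
h_t &= o_t \odot \tanh(c_t), \qquad
h_t^{\mathrm{BiLSTM}} = \left[\overrightarrow{h}_t ;\, \overleftarrow{h}_t\right]
\end{aligned}
```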
In one embodiment, the speech synthesis model includes a duration model and an acoustic model, and acquiring the speech text to be converted and converting it into speech data of the specific speaker through the speech synthesis model further includes: performing text analysis on the speech text to be converted to obtain the text features to be converted; feeding the text features to be converted into the duration model to obtain the corresponding duration features; and feeding the duration features together with the text features to be converted into the acoustic model to obtain the speech data of the specific speaker.

Here, the smallest speech unit obtained by dividing speech according to its natural attributes is the phoneme (phone); the duration model predicts how long each phoneme is pronounced and thus controls the speed of pronunciation. The acoustic model obtains the speech data of the specific speaker from the duration features and the text features to be converted. Specifically, the text features to be converted are produced by an optimal front-end submodule, which is built on a neural network deep learning model and includes a prosody prediction module, a part-of-speech module, a word-length module, a phoneme-sequence module, and the like.
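The two-stage flow described here can be sketched as below; the model interfaces (predict methods) and feature shapes are hypothetical, and only the stated data flow is mirrored: text features into the duration model, then duration plus text features into the acoustic model.

```python
import numpy as np

def synthesize(text_features, duration_model, acoustic_model):
    """Hypothetical two-stage synthesis following the described flow.

    text_features: per-phoneme front-end features (P x D array).
    duration_model.predict -> per-phoneme frame counts, shape (P,).
    acoustic_model.predict -> acoustic frames for the specific speaker.
    """
    durations = duration_model.predict(text_features)           # frames per phoneme
    # Upsample each phoneme's features to its predicted number of frames,
    # so the acoustic model sees one input vector per output frame.
    frame_features = np.repeat(text_features, durations.astype(int), axis=0)
    return acoustic_model.predict(frame_features)                # (T x N) acoustics
```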
As shown in FIG. 7, the present invention provides a personalized speech conversion training apparatus, which includes:

The first acquisition module 702 is configured to acquire speech corpus data from a speech corpus, the speech corpus data including speech parallel corpora of N speakers, where a speech parallel corpus means that the speech corpora of multiple people correspond to the same speech text content.

The first training module 704 is configured to train the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain the average speech conversion model.

The second acquisition module 706 is configured to acquire the speech parallel corpus of a specific speaker and to combine it with the speech parallel corpora of the N speakers respectively, obtaining N groups of training speech data.

The second training module 708 is configured to train the average speech conversion model based on the N groups of training speech data to obtain the specific speech conversion average model.
In one embodiment, the second training module 708 is further configured to combine the speech parallel corpora of the N speakers in pairs to obtain N(N-1)/2 groups of training speech data, and to train the initial speech conversion model based on these groups to obtain the average speech conversion model.
In one embodiment, within the N groups of training speech data, the speech parallel corpus of the specific speaker serves as the source speech and the speech parallel corpora of the N speakers serve as the desired speech. The second training module 708 is further configured to extract acoustic features from the source speech and the desired speech respectively with a speech feature analyzer, obtaining source speech acoustic features and desired speech acoustic features; to align the source speech acoustic features with the desired speech acoustic features on the time axis; and to train the preset neural network model with the aligned source speech acoustic features and the desired speech acoustic features, obtaining the initial speech conversion model.
The third acquisition module 710 is configured to acquire first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, where the first sample speech data and the second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than that of the speech parallel corpora.

The third training module 712 is configured to train the specific speech conversion average model based on the first sample speech data and the second sample speech data, obtaining a target speech conversion model that converts the specific speech into the target speech.
In one embodiment, as shown in FIG. 8, the personalized speech conversion training apparatus further includes:

The fourth acquisition module 714 is configured to acquire the speech text to be converted.

The speech synthesis module 716 is configured to convert the speech text to be converted into speech data of the specific speaker through a speech synthesis model.

In one embodiment, the speech synthesis module 716 is further configured to acquire target speech corpus data corresponding to the specific speaker; to perform text analysis and speech analysis on the target speech corpus data, obtaining text features and sound features of the speech corpus respectively; and to train the preset neural network model with the text features and the sound features of the speech corpus, obtaining the speech synthesis model corresponding to the specific speaker.

The speech conversion module 718 is configured to take the speech data of the specific speaker as the input of the target speech conversion model and to obtain the target speech data output by the target speech conversion model.
FIG. 9 shows the internal structure of a computer device in one embodiment. The computer device may be a personalized speech conversion training apparatus, or a terminal or server connected to such an apparatus. As shown in FIG. 9, the computer device includes a processor, a memory, and a network interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the personalized speech conversion training method. A computer program may also be stored in the internal memory; when executed by the processor, it causes the processor to perform the personalized speech conversion training method. The network interface is used to communicate with external devices. Those skilled in the art will understand that the structure shown in FIG. 9 is merely a block diagram of a partial structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
In one embodiment, the personalized speech conversion training method provided by the present application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 9. The memory of the computer device may store the program modules constituting the personalized speech conversion training apparatus, for example the first acquisition module 702, the first training module 704, the second acquisition module 706, the second training module 708, the third acquisition module 710, the third training module 712, the fourth acquisition module 714, the speech synthesis module 716, and the speech conversion module 718.
A computer device includes a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the personalized speech conversion training method described above.

A computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the personalized speech conversion training method described above.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not be construed as limiting the patent scope of the present application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (18)

1. A personalized speech conversion training method, characterized in that the method comprises:
acquiring speech corpus data from a speech corpus, the speech corpus data comprising speech parallel corpora of N speakers, wherein a speech parallel corpus means that the speech corpora of multiple people correspond to the same speech text content;
training an initial speech conversion model based on the speech parallel corpora of the N speakers to obtain an average speech conversion model;
acquiring a speech parallel corpus of a specific speaker, and combining the speech parallel corpus of the specific speaker with the speech parallel corpora of the N speakers respectively to obtain N groups of training speech data;
training the average speech conversion model based on the N groups of training speech data to obtain a specific speech conversion average model;
acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, wherein the first sample speech data and the second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than that of the speech parallel corpora;
training the specific speech conversion average model based on the first sample speech data and the second sample speech data to obtain a target speech conversion model for converting the specific speech into the target speech.
2. The method according to claim 1, characterized in that, in the N groups of training speech data, the speech parallel corpus of the specific speaker serves as source speech and the speech parallel corpora of the N speakers serve as desired speech; the method further comprises:
extracting acoustic features from the source speech and the desired speech respectively with a speech feature analyzer to obtain source speech acoustic features and desired speech acoustic features;
aligning the source speech acoustic features with the desired speech acoustic features on the time axis;
training a preset neural network model with the aligned source speech acoustic features and the desired speech acoustic features to obtain the initial speech conversion model.
3. The method according to claim 1, characterized in that training the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain the average speech conversion model comprises:
combining the speech parallel corpora of the N speakers in pairs to obtain N(N-1)/2 groups of training speech data (the count appears only as an equation image in the original; the pairwise-combination reading is assumed);
training the initial speech conversion model based on the N(N-1)/2 groups of training speech data to obtain the average speech conversion model.
4. The method according to claim 1, characterized in that the method further comprises:
acquiring speech text to be converted, and converting the speech text to be converted into speech data of the specific speaker through a speech synthesis model;
taking the speech data of the specific speaker as the input of the target speech conversion model, and obtaining the target speech data output by the target speech conversion model.
5. The method according to claim 4, characterized in that, before the speech text to be converted is converted into the speech data of the specific speaker through the speech synthesis model, the method further comprises:
acquiring target speech corpus data corresponding to the specific speaker;
performing text analysis and speech analysis on the target speech corpus data to obtain text features and sound features of the speech corpus respectively;
training a preset neural network model with the text features and the sound features of the speech corpus to obtain the speech synthesis model corresponding to the specific speaker.
6. The method according to claim 4, characterized in that the speech synthesis model comprises a duration model and an acoustic model, and that acquiring the speech text to be converted and converting it into the speech data of the specific speaker through the speech synthesis model further comprises:
performing text analysis on the speech text to be converted to obtain text features to be converted;
taking the text features to be converted as the input of the duration model to obtain duration features corresponding to the text features to be converted;
inputting the duration features and the text features to be converted into the acoustic model to obtain the speech data of the specific speaker.
7. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the following steps:
acquiring speech corpus data from a speech corpus, the speech corpus data comprising speech parallel corpora of N speakers, wherein a speech parallel corpus means that the speech corpora of multiple people correspond to the same speech text content;
training an initial speech conversion model based on the speech parallel corpora of the N speakers to obtain an average speech conversion model;
acquiring a speech parallel corpus of a specific speaker, and combining the speech parallel corpus of the specific speaker with the speech parallel corpora of the N speakers respectively to obtain N groups of training speech data;
training the average speech conversion model based on the N groups of training speech data to obtain a specific speech conversion average model;
acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, wherein the first sample speech data and the second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than that of the speech parallel corpora;
training the specific speech conversion average model based on the first sample speech data and the second sample speech data to obtain a target speech conversion model for converting the specific speech into the target speech.
8. The device according to claim 7, characterized in that, in the N groups of training speech data, the speech parallel corpus of the specific speaker serves as source speech and the speech parallel corpora of the N speakers serve as desired speech; the method further comprises:
extracting acoustic features from the source speech and the desired speech respectively with a speech feature analyzer to obtain source speech acoustic features and desired speech acoustic features;
aligning the source speech acoustic features with the desired speech acoustic features on the time axis;
training a preset neural network model with the aligned source speech acoustic features and the desired speech acoustic features to obtain the initial speech conversion model.
9. The device according to claim 7, characterized in that training the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain the average speech conversion model comprises:
combining the speech parallel corpora of the N speakers in pairs to obtain N(N-1)/2 groups of training speech data (the count appears only as an equation image in the original; the pairwise-combination reading is assumed);
training the initial speech conversion model based on the N(N-1)/2 groups of training speech data to obtain the average speech conversion model.
10. The device according to claim 7, characterized in that the method further comprises:
acquiring speech text to be converted, and converting the speech text to be converted into speech data of the specific speaker through a speech synthesis model;
taking the speech data of the specific speaker as the input of the target speech conversion model, and obtaining the target speech data output by the target speech conversion model.
11. The device according to claim 10, characterized in that, before the speech text to be converted is converted into the speech data of the specific speaker through the speech synthesis model, the method further comprises:
acquiring target speech corpus data corresponding to the specific speaker;
performing text analysis and speech analysis on the target speech corpus data to obtain text features and sound features of the speech corpus respectively;
training a preset neural network model with the text features and the sound features of the speech corpus to obtain the speech synthesis model corresponding to the specific speaker.
12. The device according to claim 10, characterized in that the speech synthesis model comprises a duration model and an acoustic model, and that acquiring the speech text to be converted and converting it into the speech data of the specific speaker through the speech synthesis model further comprises:
performing text analysis on the speech text to be converted to obtain text features to be converted;
taking the text features to be converted as the input of the duration model to obtain duration features corresponding to the text features to be converted;
inputting the duration features and the text features to be converted into the acoustic model to obtain the speech data of the specific speaker.
13. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the following steps:
acquiring speech corpus data from a speech corpus, the speech corpus data comprising speech parallel corpora of N speakers, wherein a speech parallel corpus means that the speech corpora of multiple people correspond to the same speech text content;
training an initial speech conversion model based on the speech parallel corpora of the N speakers to obtain an average speech conversion model;
acquiring a speech parallel corpus of a specific speaker, and combining the speech parallel corpus of the specific speaker with the speech parallel corpora of the N speakers respectively to obtain N groups of training speech data;
training the average speech conversion model based on the N groups of training speech data to obtain a specific speech conversion average model;
acquiring first sample speech data of a target speaker and second sample speech data corresponding to the specific speaker, wherein the first sample speech data and the second sample speech data correspond to the same text content, and the scale of the first sample speech data is much smaller than that of the speech parallel corpora;
training the specific speech conversion average model based on the first sample speech data and the second sample speech data to obtain a target speech conversion model for converting the specific speech into the target speech.
14. The storage medium according to claim 13, characterized in that, in the N groups of training speech data, the speech parallel corpus of the specific speaker serves as source speech and the speech parallel corpora of the N speakers serve as desired speech; the method further comprises:
extracting acoustic features from the source speech and the desired speech respectively with a speech feature analyzer to obtain source speech acoustic features and desired speech acoustic features;
aligning the source speech acoustic features with the desired speech acoustic features on the time axis;
training a preset neural network model with the aligned source speech acoustic features and the desired speech acoustic features to obtain the initial speech conversion model.
15. The storage medium according to claim 13, characterized in that training the initial speech conversion model based on the speech parallel corpora of the N speakers to obtain the average speech conversion model comprises:
combining the speech parallel corpora of the N speakers in pairs to obtain N(N-1)/2 groups of training speech data (the count appears only as an equation image in the original; the pairwise-combination reading is assumed);
training the initial speech conversion model based on the N(N-1)/2 groups of training speech data to obtain the average speech conversion model.
16. The storage medium according to claim 13, characterized in that the method further comprises:
acquiring speech text to be converted, and converting the speech text to be converted into speech data of the specific speaker through a speech synthesis model;
taking the speech data of the specific speaker as the input of the target speech conversion model, and obtaining the target speech data output by the target speech conversion model.
17. The storage medium according to claim 16, characterized in that, before the speech text to be converted is converted into the speech data of the specific speaker through the speech synthesis model, the method further comprises:
acquiring target speech corpus data corresponding to the specific speaker;
performing text analysis and speech analysis on the target speech corpus data to obtain text features and sound features of the speech corpus respectively;
training a preset neural network model with the text features and the sound features of the speech corpus to obtain the speech synthesis model corresponding to the specific speaker.
18. The storage medium according to claim 16, characterized in that the speech synthesis model comprises a duration model and an acoustic model, and that acquiring the speech text to be converted and converting it into the speech data of the specific speaker through the speech synthesis model further comprises:
performing text analysis on the speech text to be converted to obtain text features to be converted;
taking the text features to be converted as the input of the duration model to obtain duration features corresponding to the text features to be converted;
inputting the duration features and the text features to be converted into the acoustic model to obtain the speech data of the specific speaker.
PCT/CN2020/141091 2020-12-29 2020-12-29 Personalized speech conversion training method, computer device, and storage medium WO2022141126A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141091 WO2022141126A1 (en) 2020-12-29 2020-12-29 Personalized speech conversion training method, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141091 WO2022141126A1 (en) 2020-12-29 2020-12-29 Personalized speech conversion training method, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022141126A1 true WO2022141126A1 (en) 2022-07-07

Family

ID=82259903

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141091 WO2022141126A1 (en) 2020-12-29 2020-12-29 Personalized speech conversion training method, computer device, and storage medium

Country Status (1)

Country Link
WO (1) WO2022141126A1 (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105185372A (en) * 2015-10-20 2015-12-23 百度在线网络技术(北京)有限公司 Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device
CN105206257A (en) * 2015-10-14 2015-12-30 科大讯飞股份有限公司 Voice conversion method and device
CN109285535A (en) * 2018-10-11 2019-01-29 四川长虹电器股份有限公司 Phoneme synthesizing method based on Front-end Design
CN110033755A (en) * 2019-04-23 2019-07-19 平安科技(深圳)有限公司 Phoneme synthesizing method, device, computer equipment and storage medium
CN110335588A (en) * 2019-06-26 2019-10-15 中国科学院自动化研究所 More speaker speech synthetic methods, system and device
JP2020027168A (en) * 2018-08-10 2020-02-20 大学共同利用機関法人情報・システム研究機構 Learning device, learning method, voice synthesis device, voice synthesis method and program
CN111433847A (en) * 2019-12-31 2020-07-17 深圳市优必选科技股份有限公司 Speech conversion method and training method, intelligent device and storage medium
CN111613224A (en) * 2020-04-10 2020-09-01 云知声智能科技股份有限公司 Personalized voice synthesis method and device


Similar Documents

Publication Publication Date Title
WO2020215666A1 (en) Speech synthesis method and apparatus, computer device, and storage medium
JP7427723B2 (en) Text-to-speech synthesis in target speaker's voice using neural networks
US11664011B2 (en) Clockwork hierarchal variational encoder
JP7395792B2 (en) 2-level phonetic prosody transcription
Dutoit et al. The MBROLA project: Towards a set of high quality speech synthesizers free of use for non commercial purposes
US11881210B2 (en) Speech synthesis prosody using a BERT model
US8594993B2 (en) Frame mapping approach for cross-lingual voice transformation
WO2018192424A1 (en) Statistical parameter model establishment method, speech synthesis method, server and storage medium
CN111433847B (en) Voice conversion method, training method, intelligent device and storage medium
CN102543081B (en) Controllable rhythm re-estimation system and method and computer program product
CN112820268A (en) Personalized voice conversion training method and device, computer equipment and storage medium
CN101901598A (en) Humming synthesis method and system
WO2021134581A1 (en) Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium
Shechtman et al. Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture.
WO2022141126A1 (en) Personalized speech conversion training method, computer device, and storage medium
RU61924U1 (en) STATISTICAL SPEECH MODEL
Hsieh et al. A speaking rate-controlled mandarin TTS system
CN111192566B (en) English speech synthesis method and device
RU2754920C1 (en) Method for speech synthesis with transmission of accurate intonation of the cloned sample
Kulkarni et al. Clartts: An open-source classical arabic text-to-speech corpus
WO2022140966A1 (en) Cross-language voice conversion method, computer device, and storage medium
Sudhakar et al. Development of Concatenative Syllable-Based Text to Speech Synthesis System for Tamil
Liu et al. Exploring effective speech representation via asr for high-quality end-to-end multispeaker tts
US20220068256A1 (en) Building a Text-to-Speech System from a Small Amount of Speech Data
WO2022133630A1 (en) Cross-language audio conversion method, computer device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967474

Country of ref document: EP

Kind code of ref document: A1