WO2022141126A1

WO2022141126A1 - Personalized speech conversion training method, computer device, and storage medium

Info

Publication number: WO2022141126A1
Application number: PCT/CN2020/141091
Authority: WO
Inventors: 黄东延; 王若童
Original assignee: 深圳市优必选科技股份有限公司
Priority date: 2020-12-29
Filing date: 2020-12-29
Publication date: 2022-07-07

Abstract

Provided is a personalized speech conversion training method, comprising: acquiring a speech parallel corpus of N speakers to train an initial speech conversion model and obtain an speech conversion average model (104); obtaining a speech parallel corpus of a specific speaker and combining with the speech parallel corpus of said N speakers to obtain N sets of training speech data (106); training said speech conversion average model on the basis of the N sets of training speech data to obtain a specific speech conversion average model (108); obtaining first sample speech data of a target speaker and obtaining second sample speech data corresponding to a specific speaker (110), and training the specific speech conversion average model to obtain a target speech conversion model for conversion of specific speech to target speech (112). The invention also relates to a computer device and a storage medium.

Description

Personalized voice conversion training method, computer equipment and storage medium

technical field

The present invention relates to the field of computer technology, in particular to a personalized voice conversion training method, computer equipment and storage medium.

Background technique

With the continuous development of multimedia communication technology, speech synthesis technology, which is one of the important ways of human-computer communication, has received extensive attention from researchers because of its convenience and speed. The goal of speech synthesis is to make the synthesized speech intelligible, clear, natural and expressive. In order to make the synthesized speech more clear, natural and expressive, the existing speech synthesis system generally selects a target speaker, records a large amount of pronunciation data of the target speaker, and uses these pronunciation data as the basic data of speech synthesis . The advantage of this method is that the sound quality and timbre of the synthesized speech will be more similar to the speech made by the speaker itself, and its clarity and naturalness will be greatly improved, but the disadvantage is that a large number of sample speech data of the target speaker needs to be obtained. The collection of these sample speech data will consume a lot of material and financial resources, which makes it very difficult to further develop a unique personalized speech synthesis model for each individual user.

SUMMARY OF THE INVENTION

Based on this, it is necessary to provide a personalized voice conversion training method, computer equipment and storage medium that only need to collect a small number of sample voice data of the target speaker in order to solve the above problems.

In a first aspect, the present invention provides a personalized speech conversion training method, the method comprising:

Acquiring the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of a plurality of people corresponding to the same voice text content;

The initial speech conversion model is trained based on the speech parallel corpus of N speakers, and the average speech conversion model is obtained;

Acquire the speech parallel corpus of a specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of N speakers to obtain N groups of training speech data;

The voice conversion average model is trained based on N groups of training voice data to obtain a specific voice conversion average model;

Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, and the first sample voice data The scale is much smaller than the scale of speech parallel corpus;

The specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.

In a second aspect, the present invention provides a computer device, comprising a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:

Acquiring the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of multiple people corresponding to the same voice text content;

Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, and the scale of the first sample voice data Much smaller than the size of the phonetic parallel corpus;

In a third aspect, the present invention provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, causes the processor to perform the following steps:

The present invention provides a personalized speech conversion training method, computer equipment and storage medium. The initial speech conversion model is trained by the acquired speech parallel corpus of N speakers to obtain an average speech conversion model; the speech parallelism of a specific speaker is obtained. The corpus is combined with the speech parallel corpus of N speakers respectively to obtain N groups of training speech data; the average speech conversion model is trained based on the N groups of training speech data to obtain a specific speech conversion average model; The sample speech data is obtained, the second sample speech data corresponding to the specific speaker is obtained, and the specific speech conversion average model is trained to obtain a target speech conversion model for converting the specific speech to the target speech. Since the scale of the first sample speech data of the target speaker is much smaller than the scale of the speech parallel corpus, the present invention only needs a few sample speech data of the target speaker to realize the synthesis of high-quality personalized speech, which greatly reduces the The production cost of personalized speech, so that a unique personalized speech synthesis model can be produced for each individual user to realize individualized speech synthesis for each individual user.

Description of drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the following briefly introduces the accompanying drawings that need to be used in the description of the embodiments or the prior art. Obviously, the accompanying drawings in the following description are only These are some embodiments of the present invention, and for those of ordinary skill in the art, other drawings can also be obtained according to the structures shown in these drawings without creative efforts.

1 is a flowchart of a personalized speech conversion training method in one embodiment;

2 is a flowchart of a personalized speech conversion training method in another embodiment;

3 is a schematic diagram of an algorithm for aligning source speech acoustic features with desired speech acoustic features in one embodiment;

4 is a schematic flowchart of aligning source speech acoustic features with desired speech acoustic features in one embodiment;

Fig. 5 is the flow chart of the personalized speech conversion method in one embodiment;

6 is a flowchart of a personalized speech conversion training method in yet another embodiment;

7 is a structural block diagram of a flowchart of a personalized speech conversion training device in one embodiment;

8 is a structural block diagram of a flowchart of a personalized speech conversion training apparatus in another embodiment;

Figure 9 is a diagram of the internal structure of a computer device in one embodiment.

Detailed ways

In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention, but not to limit the present invention.

As shown in Figure 1, the present invention provides a personalized voice conversion training method, the method includes:

Step 102: Acquire voice corpus data in the voice corpus, where the voice corpus data includes: parallel voice corpora of N speakers, and the parallel voice corpus refers to the voice corpus of multiple people corresponding to the same voice text content.

The speech corpus refers to a place where speech corpus data is stored. The speech corpus includes a sufficient amount of speech corpus, and the speech corpus may include speech samples and text samples corresponding to the speech samples. Speech-parallel corpus means that each speaker's spoken text content is the same. For example, for example, each speaker has 300 speech sentences, and the text content corresponding to the 300 speech sentences is the same.

In step 104, the initial speech conversion model is trained based on the speech parallel corpus of N speakers to obtain an average speech conversion model.

Among them, the parallel speech corpus of N speakers is combined in pairs to obtain

Group training speech data; based on

Group training speech data is used to train the initial speech conversion model, and the average speech conversion model is obtained. Specifically, the initial speech conversion model is obtained based on a neural network deep learning model.

In one embodiment, the neural network model can select the BiLSTM (Bi-directional Long Short-Term Memory, bidirectional long short-term memory) model, and utilize the speech parallel corpus of N speakers to train the initial speech conversion model obtained based on the BiLSTM model, Get the average model of speech conversion.

Step 106: Acquire the voice parallel corpus of the specific speaker, and combine the voice parallel corpus of the specific speaker with the voice parallel corpus of N speakers to obtain N groups of training voice data.

Wherein, the parallel speech corpus of a specific speaker is combined with the speech parallel corpus of N speakers to obtain N groups of training speech data, that is, N groups of training speech data converted from a specific speaker to N speakers are obtained. The N groups of training voice data can be stored in the cloud, and correspondingly, the N groups of training voice data can also be stored in the local device.

In one embodiment, the specific speaker is A, and there are 10 speakers, each of which has 300 speech parallel corpora. Obtain 300 speech parallel corpora of a specific speaker A, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of 10 speakers, and then 1x10x300=3000 sets of training speech data can be obtained.

Step 108 , train the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model.

The voice conversion average model is trained by using N groups of training speech data to obtain a specific voice conversion average model, that is, a specific voice conversion average model that can be converted from a specific speaker to N speakers.

Step 110: Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, and the first sample voice data The scale of the data is much smaller than the scale of the speech parallel corpus.

Among them, the second sample voice data corresponding to the specific speaker is obtained from the voice corpus, and the first sample data of the target speaker is obtained through a voice acquisition device. For example, the first sample data of the target speaker can be obtained through a recording studio and corresponding equipment. sample. Since the scale of the first sample speech data is much smaller than the scale of the speech parallel corpus, the first sample speech data of the target speaker is a small sample speech data of the target speaker.

Step 112 , train the average model of specific voice conversion based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting the specific voice to the target voice.

The first sample speech data of the target speaker and the second sample speech data corresponding to the specific speaker are combined to obtain training speech data converted from the specific speaker to the target speaker. As an example to illustrate, use the training speech data converted from a specific speaker to a target speaker to train a specific speech conversion average model, and obtain a target speech conversion model for converting a specific speech to a target speech.

The invention provides a personalized voice conversion training method, device and computer equipment. The initial voice conversion model is trained by the acquired voice parallel corpus of N speakers to obtain an average voice conversion model; the voice parallel corpus of a specific speaker is obtained. and combine them with the speech parallel corpus of N speakers respectively to obtain N groups of training speech data; train the average speech conversion model based on the N groups of training speech data to obtain a specific speech conversion average model; obtain the first sample of the target speaker For this voice data, the second sample voice data corresponding to the specific speaker is obtained, and the average model of specific voice conversion is trained to obtain a target voice conversion model for converting the specific voice to the target voice. Since the scale of the first sample speech data of the target speaker is much smaller than the scale of the speech parallel corpus, the present invention only needs a few sample speech data of the target speaker to realize the synthesis of high-quality personalized speech, which greatly reduces the The production cost of personalized speech, so that a unique personalized speech synthesis model can be produced for each individual user to realize individualized speech synthesis for each individual user.

In one embodiment, as shown in FIG. 2 , in the N groups of training speech data, the speech parallel corpus of a specific speaker is used as the source speech, and in the N groups of training speech data, the speech parallel corpus of N speakers is used as the desired speech; the Methods also include:

Step 202 , using a voice feature analyzer to extract acoustic features of the source voice and the desired voice respectively, to obtain the acoustic features of the source voice and the desired voice.

Among them, in order to reduce the computational complexity and storage complexity, before using the speech feature analyzer to extract the acoustic features of the source speech and the expected speech respectively, it also includes audio resampling of the source speech and the expected speech. The audio resampling algorithm can be used to achieve Convert between arbitrary sample rates of audio signals. Use the speech feature analyzer to extract the acoustic features of the source speech and the expected speech after audio resampling, respectively, and convert the speech signals of the source speech and the expected speech into spectrum (Spectrum), fundamental frequency (Fundamental Frequency), aperiodic frequency (Aperiod). Spectrum) and other acoustic features.

In step 204, the control aligns the acoustic features of the source speech with the acoustic features of the desired speech on the time axis.

In one embodiment, as shown in FIG. 3 and FIG. 4 , a dynamic programming time alignment (Dynamic Time Warping) method is used to align the acoustic features of the source speech to the acoustic feature length of the desired speech. Since the acoustic features are extracted frame by frame, it is necessary to measure the distance between frames at time t. The function to measure the distance between frames at time t is:

Among them, I and J are feature matrices, and the dimension is T (number of frames) x N (feature dimension).

Step 206 , using the aligned source speech acoustic features and expected speech acoustic features to train the preset neural network model to obtain an initial speech conversion model.

Among them, the aligned source speech acoustic features and the desired speech acoustic features are sent into the bidirectional long short-term memory recurrent neural network BLSTM model to obtain the initial speech conversion model, that is, the initial speech conversion model that can convert the source speech to the desired speech is obtained.

In one embodiment, the function of measuring the distance between frames at time t is:

Among them, I and J are feature matrices, and the dimension is T (number of frames) x N (feature dimension). When N is 130, the aligned source speech acoustic features x(T x N) are fed into the bidirectional long short-term memory recurrent neural network BLSTM model. At this time, the relevant parameters of the bidirectional long short-term memory recurrent neural network BLSTM model are shown in Table 1.

Table 1

The aligned source speech acoustic features x (T x N, N is 130 here) are sent into the bidirectional long short-term memory recurrent neural network BLSTM model and the output transformed acoustic features are

(T x N, where N is 130 here).

According to the formula

calculate

, where y is the correctly labeled acoustic feature. According to the calculated loss, gradient descent is performed to update the parameter weights of the bidirectional long short-term memory recurrent neural network BLSTM model, thereby obtaining the initial speech conversion model.

In one embodiment, the initial speech conversion model is trained based on the speech parallel corpus of N speakers to obtain an average speech conversion model, including: combining the speech parallel corpora of N people in pairs to obtain

Group training speech data; based on

Group training speech data is used to train the initial speech conversion model, and the average speech conversion model is obtained.

Among them, the parallel speech corpus of N people is combined in pairs to obtain the conversion between pairs of N speakers.

Group training speech data. Since everyone has the same number of multi-sentence parallel corpora, so

Each set of training voice data in the set of training voice data includes multiple sets of sub-training voice data. For example, when each of the N speakers has 300 speech-parallel corpora, combining the N-person speech-parallel corpus in pairs, we get

Group sub-training speech data. In a specific embodiment, when there are 300 parallel speech corpora of 10 different speakers, the 300 parallel corpora of 10 different speakers are combined in pairs to obtain

Group sub-training speech data.

In one embodiment, as shown in FIG. 5 , the method further includes: acquiring the speech text to be converted, and converting the speech text to be converted into speech data of a specific speaker through a speech synthesis model; converting the speech data of the specific speaker As the input of the target speech conversion model, the target speech data output by the target speech conversion model is obtained.

Wherein, after the speech text to be converted is converted into speech data of a specific speaker by the speech synthesis model, the speech data of the specific speaker is used as the input of the target speech conversion model, and the target speech data output by the target speech conversion model is obtained. The combination of speech synthesis technology and speech conversion technology, that is, adding a speech conversion model after the speech synthesis model, makes each target speaker only need to provide a small amount of sample speech data to achieve high-quality personalized speech synthesis.

In one embodiment, as shown in FIG. 6 , before converting the speech text to be converted into speech data of a specific speaker by using the speech synthesis model, the method further includes:

Step 602: Acquire target speech corpus data corresponding to a specific speaker.

The target speech corpus data is included in the target speech corpus, and the target speech corpus data includes target speech data and target text data corresponding to the target speech data.

Step 604: Perform text analysis and speech analysis on the target speech corpus data to obtain text features of the speech corpus and sound features of the speech corpus, respectively.

Wherein, the sound feature of the speech corpus is obtained by a speech analyzer, and includes at least one of a timbre parameter, a pitch parameter and a loudness parameter. The text analysis can be lexical analysis or syntactic analysis, and the text features include: phoneme sequence, part of speech, word length, and prosodic pause.

Step 606 , using the text features of the speech corpus and the voice features of the speech corpus to train the preset neural network model to obtain a speech synthesis model corresponding to the specific speaker.

The preset neural network model is trained by using the text features of the speech corpus and the sound features of the speech corpus to obtain a speech synthesis model corresponding to a specific speaker.

In one embodiment, in terms of the neural network deep learning model, a bi-directional long short-term memory neural network BiLSTM model is selected. The BiLSTM model of the bidirectional long short-term memory neural network is a deformation of the LSTM model, which is composed of a forward LSTM model and a backward LSTM model. There are three gate structures in LSTM: forget gate, input gate and output gate. The relationship between samples and samples in long-term sequences is obtained through the LSTM structure, and the historical state is retained and discarded through the input gate, forget gate, and output gate, thereby realizing effective caching of long-distance historical information. Therefore, the BiLSTM model is selected. to train to obtain a speech synthesis model corresponding to a specific speaker.

In one embodiment, the speech synthesis model includes a duration model and an acoustic model, acquires the speech text to be converted, converts the speech text to be converted into speech data of a specific speaker through the speech synthesis model, and further includes: converting the speech to be converted Perform text analysis on the text to obtain the features of the text to be converted; use the features of the text to be converted as the input of the duration model to obtain the duration features corresponding to the features of the text to be converted; input the duration features and the features of the text to be converted into the acoustic model to obtain the specific speaker's voice data.

Among them, the smallest phonetic unit is divided into phonemes according to the natural attributes of the voice, and the duration model is used to predict the length of each phoneme's pronunciation and to control the speed of the pronunciation. The acoustic model is used to obtain the speech data of a specific speaker through the feature of duration and the text to be converted. Specifically, the features of the text to be converted are obtained through the optimal front-end sub-module, which is obtained based on a neural network deep learning model, including a prosody prediction module, a part-of-speech module, a word length module, and a phonetic sequence module.

As shown in FIG. 7 , the present invention provides a personalized voice conversion training device, which includes:

The first obtaining module 702 is configured to obtain the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of multiple people corresponding to the same voice text content.

The first training module 704 is used for training the initial speech conversion model based on the speech parallel corpus of N speakers to obtain an average speech conversion model.

The second obtaining module 706 is configured to obtain the speech parallel corpus of a specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of N speakers to obtain N groups of training speech data.

The second training module 708 is configured to train the voice conversion average model based on the N groups of training voice data to obtain a specific voice conversion average model.

In one embodiment, the second training module 708 is further configured to combine the parallel speech corpora of N people in pairs to obtain

Group training speech data: based on

In one embodiment, the parallel speech corpus of a specific speaker is used as the source speech in the N groups of training speech data, and the speech parallel corpus of N speakers is used as the desired speech in the N groups of training speech data. The second training module 708 is further configured to extract the acoustic features of the source speech and the expected speech respectively by using the speech feature analyzer to obtain the acoustic features of the source speech and the desired speech acoustic features; control the time axis to compare the acoustic features of the source speech and the desired speech acoustics Feature alignment; use the aligned source speech acoustic features and expected speech acoustic features to train the preset neural network model to obtain an initial speech conversion model.

The third obtaining module 710 is configured to obtain the first sample voice data of the target speaker, and obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, The scale of the first sample speech data is much smaller than the scale of the speech parallel corpus.

The third training module 712 is configured to train a specific voice conversion average model based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting specific voices to target voices.

In one embodiment, as shown in Figure 8, a personalized voice conversion training device further includes:

The fourth obtaining module 714 is configured to obtain the speech text to be converted.

The speech synthesis module 716 is used for converting the speech text to be converted into speech data of a specific speaker through a speech synthesis model.

In one embodiment, the speech synthesis module 716 is further used to obtain the target speech corpus data corresponding to the specific speaker; perform text analysis and speech analysis on the target speech corpus data to obtain the text features of the speech corpus and the sound features of the speech corpus respectively; The text features of the speech corpus and the sound features of the speech corpus are used to train a preset neural network model to obtain a speech synthesis model corresponding to a specific speaker.

The speech conversion module 718 is configured to use the speech data of the specific speaker as the input of the target speech conversion model, and obtain the target speech data output by the target speech conversion model.

As shown in FIG. 9, the internal structure diagram of the computer device in one embodiment. The computer equipment may be a personalized voice conversion training device or a terminal or server connected to the personalized voice conversion training device. As shown in Figure 9, the computer device includes a processor, memory, and a network interface connected by a system bus. Wherein, the memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system, and also stores a computer program. When the computer program is executed by the processor, the processor can implement the personalized voice conversion training method. A computer program can also be stored in the internal memory, and when the computer program is executed by the processor, the processor can execute the personalized speech conversion training method. The network interface is used to communicate with external devices. Those skilled in the art can understand that the structure shown in FIG. 9 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied. Include more or fewer components than shown in the figures, or combine certain components, or have a different arrangement of components.

In one embodiment, the personalized speech conversion training method provided by the present application can be implemented in the form of a computer program, and the computer program can be executed on a computer device as shown in FIG. 9 . The individual program templates that make up the personalized speech conversion training device can be stored in the memory of the computer device. For example, the first acquisition module 702, the first training module 704, the second acquisition module 706, the second training module 708, the third acquisition module 710, the third training module 712, the fourth acquisition module 714, the speech synthesis module 716, the speech Conversion module 718 .

A computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor causes the processor to execute the above-mentioned personalized speech conversion training method.

A computer-readable storage medium stores a computer program, which, when executed by a processor, causes the processor to execute the above-mentioned personalized speech conversion training method.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, and the program can be stored in a non-volatile computer-readable storage medium , when the program is executed, it may include the processes of the foregoing method embodiments. Wherein, any reference to memory, storage, database or other medium used in the various embodiments provided in this application may include non-volatile and/or volatile memory. Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous chain Road (Synchlink) DRAM (SLDRAM), memory bus (Rambus) direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), etc.

The technical features of the above embodiments can be combined arbitrarily. In order to make the description simple, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features It is considered to be the range described in this specification.

The above embodiments only express several embodiments of the present application, and the descriptions thereof are specific and detailed, but should not be construed as a limitation on the scope of the patent of the present application. It should be pointed out that for those skilled in the art, without departing from the concept of the present application, several modifications and improvements can be made, which all belong to the protection scope of the present application. Therefore, the scope of protection of the patent of the present application shall be subject to the appended claims.

Claims

A personalized voice conversion training method, characterized in that the method comprises:

Acquiring the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of a plurality of people corresponding to the same voice text content;

The initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model;

Acquire the speech parallel corpus of a specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of the N speakers respectively to obtain N groups of training speech data;

The voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;

Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, and the The scale of the first sample speech data is much smaller than the scale of the speech parallel corpus;

The specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting specific voices to target voices.
The method according to claim 1, wherein in the N groups of training speech data, the speech parallel corpus of a specific speaker is used as the source speech, and in the N groups of training speech data, the speech parallel corpus of N speakers is used as the source speech. as the desired speech; the method further includes:

Use a voice feature analyzer to extract acoustic features of the source voice and the desired voice, respectively, to obtain the source voice acoustic features and the desired voice acoustic features;

controlling the aligning of the source speech acoustic feature with the desired speech acoustic feature on a time axis;

A preset neural network model is trained by using the aligned source speech acoustic features and the desired speech acoustic features to obtain an initial speech conversion model.
The method according to claim 1, wherein the initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model, including:

The voice parallel corpus of the N people is combined in pairs to obtain
Group training speech data;

based on the
Group training speech data is used to train the initial speech conversion model, and the average speech conversion model is obtained.
The method according to claim 1, wherein the method further comprises:

Acquiring the speech text to be converted, and converting the speech text to be converted into the speech data of the specific speaker through a speech synthesis model;

The voice data of the specific speaker is used as the input of the target voice conversion model, and the target voice data output by the target voice conversion model is obtained.
The method according to claim 4, wherein, before converting the speech text to be converted into the speech data of the specific speaker by using a speech synthesis model, the method further comprises:

obtaining the target speech corpus data corresponding to the specific speaker;

Carrying out text analysis and voice analysis on the target voice corpus data to obtain the text feature of the voice corpus and the voice feature of the voice corpus respectively;

The preset neural network model is trained by using the text features of the speech corpus and the sound features of the speech corpus to obtain a speech synthesis model corresponding to the specific speaker.
The method according to claim 4, wherein the speech synthesis model includes a duration model and an acoustic model, the acquiring speech text to be converted, and converting the speech text to be converted into the speech text by using the speech synthesis model Speech data for a specific speaker, including:

Carrying out text analysis on the speech text to be converted, to obtain text features to be converted;

Using the text feature to be converted as the input of the duration model, obtain the duration feature corresponding to the text feature to be converted;

Inputting the duration feature and the to-be-converted text feature into the acoustic model to obtain the speech data of the specific speaker.
A computer device includes a memory and a processor, the memory stores a computer program, and when the computer program is executed by the processor, the processor performs the following steps:

Acquire the voice corpus data in the voice corpus, and the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the same voice text content corresponding to the voice corpus of a plurality of people;

The initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model;

Acquire the speech parallel corpus of a specific speaker, and combine the speech parallel corpus of the specific speaker with the speech parallel corpus of the N speakers respectively to obtain N groups of training speech data;

The voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;

Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, the The scale of the first sample speech data is much smaller than the scale of the speech parallel corpus;

The specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting specific voices to target voices.
The device according to claim 7, wherein in the N groups of training speech data, the speech parallel corpus of a specific speaker is used as the source speech, and in the N groups of training speech data, the speech parallel corpus of N speakers is used as the source speech. as the desired speech; the method further includes:

Use a voice feature analyzer to extract acoustic features of the source voice and the desired voice, respectively, to obtain the source voice acoustic features and the desired voice acoustic features;

controlling the aligning of the source speech acoustic feature with the desired speech acoustic feature on a time axis;

The preset neural network model is trained by using the aligned source speech acoustic features and the desired speech acoustic features to obtain an initial speech conversion model.
The device according to claim 7, wherein the initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model, comprising:

The voice parallel corpus of the N people is combined in pairs to obtain
Group training speech data;

based on the
Group training speech data is used to train the initial speech conversion model, and the average speech conversion model is obtained.
The device according to claim 7, wherein the method further comprises:

Acquiring the speech text to be converted, and converting the speech text to be converted into the speech data of the specific speaker through a speech synthesis model;

The voice data of the specific speaker is used as the input of the target voice conversion model, and the target voice data output by the target voice conversion model is acquired.
The device according to claim 10, wherein before converting the to-be-converted speech text into the speech data of the specific speaker by using a speech synthesis model, the method further comprises:

obtaining the target speech corpus data corresponding to the specific speaker;

Carrying out text analysis and voice analysis on the target voice corpus data to obtain the text feature of the voice corpus and the voice feature of the voice corpus respectively;

The preset neural network model is trained by using the text feature of the speech corpus and the sound feature of the speech corpus to obtain a speech synthesis model corresponding to the specific speaker.
The device according to claim 10, wherein the speech synthesis model includes a duration model and an acoustic model, the acquiring speech text to be converted, and converting the speech text to be converted into the speech text through the speech synthesis model Speech data for a specific speaker, including:

Carrying out text analysis on the speech text to be converted, to obtain text features to be converted;

Using the text feature to be converted as the input of the duration model, obtain the duration feature corresponding to the text feature to be converted;

Inputting the duration feature and the to-be-converted text feature into the acoustic model to obtain the speech data of the specific speaker.
A computer-readable storage medium storing a computer program, when executed by a processor, the computer program causes the processor to perform the following steps:

Acquiring the voice corpus data in the voice corpus, the voice corpus data includes: the voice parallel corpus of N speakers, and the voice parallel corpus refers to the voice corpus of a plurality of people corresponding to the same voice text content;

The initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model;

Obtain the speech parallel corpus of the specific speaker, combine the speech parallel corpus of the specific speaker with the speech parallel corpus of the N speakers respectively, and obtain N groups of training speech data;

The voice conversion average model is trained based on the N groups of training voice data to obtain a specific voice conversion average model;

Obtain the first sample voice data of the target speaker, obtain the second sample voice data corresponding to the specific speaker, the first sample voice data and the text content corresponding to the second sample voice data are the same, the The scale of the first sample speech data is much smaller than the scale of the speech parallel corpus;

The specific voice conversion average model is trained based on the first sample voice data and the second sample voice data to obtain a target voice conversion model for converting specific voices to target voices.
The storage medium according to claim 13, wherein, in the N groups of training speech data, parallel speech corpus of a specific speaker is used as the source speech, and in the N groups of training speech data, the speech of N speakers is paralleled The corpus is used as the desired speech; the method further includes:

Use a voice feature analyzer to extract acoustic features of the source voice and the desired voice, respectively, to obtain the source voice acoustic features and the desired voice acoustic features;

controlling the aligning of the source speech acoustic feature with the desired speech acoustic feature on a time axis;

The preset neural network model is trained by using the aligned source speech acoustic features and the desired speech acoustic features to obtain an initial speech conversion model.
The storage medium according to claim 13, wherein the initial speech conversion model is trained based on the speech parallel corpus of the N speakers to obtain an average speech conversion model, comprising:

The voice parallel corpus of the N people is combined in pairs to obtain
Group training speech data;

based on the
Group training speech data is used to train the initial speech conversion model, and the average speech conversion model is obtained.
The storage medium according to claim 13, wherein the method further comprises:

Acquiring the speech text to be converted, and converting the speech text to be converted into the speech data of the specific speaker through a speech synthesis model;

The voice data of the specific speaker is used as the input of the target voice conversion model, and the target voice data output by the target voice conversion model is acquired.
The storage medium according to claim 16, wherein before converting the to-be-converted speech text into the speech data of the specific speaker by using a speech synthesis model, the method further comprises:

obtaining the target speech corpus data corresponding to the specific speaker;

Carrying out text analysis and voice analysis on the target voice corpus data to obtain the text feature of the voice corpus and the voice feature of the voice corpus respectively;

The preset neural network model is trained by using the text feature of the speech corpus and the sound feature of the speech corpus to obtain a speech synthesis model corresponding to the specific speaker.
The storage medium according to claim 16, wherein the speech synthesis model includes a duration model and an acoustic model, and the acquisition of the speech and text to be converted, the speech synthesis model is used to convert the speech and text to be converted into the desired speech and text. speech data for a specific speaker, including:

Carrying out text analysis on the speech text to be converted, to obtain text features to be converted;

Using the text feature to be converted as the input of the duration model, obtain the duration feature corresponding to the text feature to be converted;

Inputting the duration feature and the to-be-converted text feature into the acoustic model to obtain the speech data of the specific speaker.