CN116825081A - Speech synthesis method, device and storage medium based on small sample learning - Google Patents

Speech synthesis method, device and storage medium based on small sample learning

Info

Publication number
CN116825081A
CN116825081A
Authority
CN
China
Prior art keywords
target
voiceprint feature
voice
data
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311084629.4A
Other languages
Chinese (zh)
Other versions
CN116825081B (en)
Inventor
Name not disclosed at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202311084629.4A
Publication of CN116825081A
Application granted
Publication of CN116825081B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L17/00 Speaker identification or verification techniques
    • G10L17/02 Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/04 Training, enrolment or model building
    • G10L17/18 Artificial neural networks; Connectionist approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The present application relates to the field of speech synthesis technology, and in particular, to a method, an apparatus, and a storage medium for speech synthesis based on small sample learning. The method comprises the following steps: acquiring target voice data of a target object; extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to target voice data, wherein the target voiceprint feature vector indicates the tone of a target object; and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis. According to the embodiment of the application, through the method of combining the end-to-end voice synthesis model with the voiceprint feature extractor, natural voice can be synthesized under the condition of a small sample, the tone of the synthesized voice is ensured to be more similar to that of a target object, and the efficiency and effect of personalized voice synthesis are improved.

Description

Speech synthesis method, device and storage medium based on small sample learning
Technical Field
The present application relates to the field of speech synthesis technology, and in particular, to a method, an apparatus, and a storage medium for speech synthesis based on small sample learning.
Background
Speech synthesis technology is now widely applied in various voice interaction scenarios. Conventional speech synthesis techniques typically include two parts: an acoustic model and a vocoder. The acoustic model is used to convert an input text or phoneme sequence into spectral features (e.g., a mel spectrogram), and the vocoder is responsible for restoring the spectral features produced by the acoustic model into audio. Under this framework, because the acoustic model and the vocoder are trained independently, a mismatch arises between them during speech synthesis, which degrades the quality of the synthesized speech.
In view of the above-mentioned speech synthesis problems, end-to-end speech synthesis techniques have emerged. In end-to-end speech synthesis, the acoustic model and the vocoder are no longer separated; instead, audio is synthesized directly from a text or phoneme sequence. Compared with the conventional framework, end-to-end speech synthesis produces speech with higher naturalness. Built on this technology, personalized speech synthesis allows the timbre of a specific person to be reproduced. For example, in a vehicle-mounted application, if the human-computer interaction system can interact with the driver using the driver's own timbre, the human-computer interaction experience can be improved. In such applications, unlike professional voice actors who can provide tens of hours of training data, often only a small amount of a user's audio (e.g., several seconds to several minutes) can be collected. This requires that the speech synthesis model be trained under a small-sample condition while still synthesizing speech that is similar to the target speaker and highly natural.
The related art has not yet provided a suitable and effective method for improving the effect of personalized speech synthesis under small-sample conditions.
Disclosure of Invention
In view of this, a method, apparatus and storage medium for speech synthesis based on small sample learning are proposed. According to the embodiment of the application, through the method of combining the end-to-end voice synthesis model with the voiceprint feature extractor, natural voice can be synthesized under the condition of a small sample, the tone of the synthesized voice is ensured to be more similar to that of a target object, and the efficiency and effect of personalized voice synthesis are improved.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech based on small sample learning, the method comprising:
acquiring target voice data of a target object;
extracting a target voiceprint feature vector by a voiceprint feature extractor obtained through pre-training according to the target voice data, wherein the target voiceprint feature vector indicates the tone of the target object;
and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis.
In one possible implementation, the target speech data includes multiple sets of target text audio pairs, each set of the target text audio pairs including target text and audio data of the corresponding target object;
the extracting the target voiceprint feature vector by a voiceprint feature extractor obtained by pre-training according to the target voice data comprises the following steps:
for each group of target text audio pairs, converting the target text into a corresponding target phoneme sequence, and converting the audio data of the target object into the target voiceprint feature vector through the voiceprint feature extractor obtained through pre-training;
and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the small sample learning comprises the following steps:
and carrying out small sample learning according to a plurality of groups of target phoneme sequences and the target voiceprint feature vectors to obtain the end-to-end speech synthesis model.
In another possible implementation, the parameters of the normalization layer in the end-to-end speech synthesis model include: parameters of the respective normalization layers of a duration predictor, a prior posterior converter, a posterior encoder and a posterior decoder in the end-to-end speech synthesis model, wherein the duration predictor is used for predicting the duration of each input phoneme, the prior posterior converter is used for converting between a phoneme coding space and an audio coding space, the posterior encoder is used for converting input speech data into an audio coding space of a target dimension, and the posterior decoder is used for restoring an audio coding sequence into corresponding speech data.
In another possible implementation manner, after the performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, the method further includes:
and storing the target voiceprint feature vector and the adjusted parameters of the end-to-end voice synthesis model.
In another possible implementation manner, before the acquiring the target voice data of the target object, the method further includes:
acquiring a first training data set, wherein the first training data set comprises a plurality of groups of first sample data, and each group of first sample data comprises audio data and corresponding object identifiers;
training the voiceprint feature extractor according to the first training data set, wherein the voiceprint feature extractor is a neural network model for converting input audio data into voiceprint feature vectors;
acquiring a second training data set, wherein the second training data set comprises a plurality of groups of second sample data, each group of second sample data comprises sample text and audio data of a corresponding sample object, and the sample objects are other objects except the target object;
training the end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor;
After training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored.
In another possible implementation manner, the training the end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor includes:
for each set of the second sample data, converting the sample text into a corresponding sample phoneme sequence, and converting, by the voiceprint feature extractor, audio data of the sample object into a sample voiceprint feature vector, the sample voiceprint feature vector indicating a timbre of the sample object;
and training the end-to-end voice synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector, wherein the sample voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model.
In another possible implementation manner, the training the end-to-end speech synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector includes:
inputting the sample phoneme sequence and the sample voiceprint feature vector into a preset end-to-end voice synthesis model, and outputting to obtain synthesized voice data;
Inputting the synthesized voice data into the voiceprint feature extractor, outputting to obtain a first voice feature, inputting the reference voice data corresponding to the synthesized voice data into the voiceprint feature extractor, and outputting to obtain a second voice feature;
and training the end-to-end voice synthesis model according to the similarity between the first voice characteristic and the second voice characteristic.
In another possible implementation, the first speech feature includes: a first voiceprint feature vector of the synthesized speech data and a first output result of the synthesized speech data, which is output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor;
the second speech feature comprises: a second voiceprint feature vector of the reference speech data, and a second output result of the reference speech data output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor.
In another possible implementation manner, after the small sample learning according to the target voice data and the target voiceprint feature vector, the method further includes:
acquiring an input first text;
converting the first text into a corresponding first phoneme sequence;
And outputting and obtaining first voice data through the adjusted end-to-end voice synthesis model according to the first phoneme sequence and the stored target voiceprint feature vector, wherein the first voice data is voice data with the tone of the target object.
In a second aspect, embodiments of the present application provide a small sample learning-based speech synthesis apparatus, the apparatus comprising:
the acquisition module is used for acquiring target voice data of a target object;
the extraction module is used for extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to the target voice data, wherein the target voiceprint feature vector indicates the tone of the target object;
the adjustment module is used for carrying out small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for carrying out voice synthesis.
In one possible implementation, the target speech data includes multiple sets of target text audio pairs, each set of the target text audio pairs including target text and audio data of the corresponding target object;
The extraction module is further configured to:
for each group of target text audio pairs, converting the target text into a corresponding target phoneme sequence, and converting the audio data of the target object into the target voiceprint feature vector through the voiceprint feature extractor obtained through pre-training;
the adjusting module is further configured to:
and carrying out small sample learning according to a plurality of groups of target phoneme sequences and the target voiceprint feature vectors to obtain the end-to-end speech synthesis model.
In another possible implementation, the parameters of the normalization layer in the end-to-end speech synthesis model include: parameters of the respective normalization layers of a duration predictor, a prior posterior converter, a posterior encoder and a posterior decoder in the end-to-end speech synthesis model, wherein the duration predictor is used for predicting the duration of each input phoneme, the prior posterior converter is used for converting between a phoneme coding space and an audio coding space, the posterior encoder is used for converting input speech data into an audio coding space of a target dimension, and the posterior decoder is used for restoring an audio coding sequence into corresponding speech data.
In another possible implementation, the apparatus further includes: a storage module for:
and storing the target voiceprint feature vector and the adjusted parameters of the end-to-end voice synthesis model.
In another possible implementation, the apparatus further includes: training module for:
acquiring a first training data set, wherein the first training data set comprises a plurality of groups of first sample data, and each group of first sample data comprises audio data and corresponding object identifiers;
training the voiceprint feature extractor according to the first training data set, wherein the voiceprint feature extractor is a neural network model for converting input audio data into voiceprint feature vectors;
acquiring a second training data set, wherein the second training data set comprises a plurality of groups of second sample data, each group of second sample data comprises sample text and audio data of a corresponding sample object, and the sample objects are other objects except the target object;
training the end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor;
after training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored.
In another possible implementation manner, the training module is further configured to:
for each set of the second sample data, converting the sample text into a corresponding sample phoneme sequence, and converting, by the voiceprint feature extractor, audio data of the sample object into a sample voiceprint feature vector, the sample voiceprint feature vector indicating a timbre of the sample object;
and training the end-to-end voice synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector, wherein the sample voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model.
In another possible implementation manner, the training module is further configured to:
inputting the sample phoneme sequence and the sample voiceprint feature vector into a preset end-to-end voice synthesis model, and outputting to obtain synthesized voice data;
inputting the synthesized voice data into the voiceprint feature extractor, outputting to obtain a first voice feature, inputting the reference voice data corresponding to the synthesized voice data into the voiceprint feature extractor, and outputting to obtain a second voice feature;
and training the end-to-end voice synthesis model according to the similarity between the first voice characteristic and the second voice characteristic.
In another possible implementation, the first speech feature includes: a first voiceprint feature vector of the synthesized speech data and a first output result of the synthesized speech data, which is output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor;
the second speech feature comprises: a second voiceprint feature vector of the reference speech data, and a second output result of the reference speech data output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor.
In another possible implementation, the apparatus further includes: a use module for:
acquiring an input first text;
converting the first text into a corresponding first phoneme sequence;
and outputting and obtaining first voice data through the adjusted end-to-end voice synthesis model according to the first phoneme sequence and the stored target voiceprint feature vector, wherein the first voice data is voice data with the tone of the target object.
In a third aspect, embodiments of the present application provide a small sample learning-based speech synthesis apparatus, the apparatus comprising:
a processor;
A memory for storing processor-executable instructions;
wherein the processor is configured to implement the method provided by the first aspect or any one of the possible implementations of the first aspect when executing the instructions.
In a fourth aspect, embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method provided by the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying the computer readable code, which when run in an electronic device, a processor in the electronic device performs the method provided by the first aspect or any one of the possible implementations of the first aspect.
In summary, the embodiment of the application obtains the target voice data of the target object; extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to target voice data, wherein the target voiceprint feature vector indicates the tone of a target object; performing small sample learning according to target voice data and target voiceprint feature vectors to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vectors are used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis; based on the end-to-end voice synthesis model, the synthesized voice has higher naturalness compared with other voice synthesis technologies; and by combining the end-to-end voice synthesis model with the voiceprint feature extractor, the target voiceprint feature vector is extracted based on the voiceprint feature extractor, and the target voiceprint feature vector only adjusts the parameters of the normalization layer in the end-to-end voice synthesis model, namely a small number of parameters, so that voice data similar to the tone of the target object can be synthesized under the condition of a small sample, and the efficiency and the effect of personalized voice synthesis under the condition of the small sample are improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the application and together with the description, serve to explain the principles of the application.
Fig. 1 shows a schematic structural diagram of a computer device according to an exemplary embodiment of the present application.
Fig. 2 shows a flowchart of a small sample learning-based speech synthesis method according to an exemplary embodiment of the present application.
Fig. 3 is a schematic diagram of a speech synthesis method according to an exemplary embodiment of the present application.
FIG. 4 illustrates a flow chart of a model training method for a training phase provided by an exemplary embodiment of the present application.
Fig. 5 shows a schematic diagram of a voiceprint feature extractor provided in an exemplary embodiment of the present application during a training phase and a use phase.
Fig. 6 is a schematic diagram illustrating the structure of an end-to-end speech synthesis model according to an exemplary embodiment of the present application.
Fig. 7 is a schematic diagram of a training end-to-end speech synthesis model according to an exemplary embodiment of the present application.
Fig. 8 shows a flowchart of a small sample learning method of the registration phase provided by an exemplary embodiment of the present application.
Fig. 9 shows a flowchart of a speech synthesis method at the use stage provided by an exemplary embodiment of the application.
Fig. 10 is a schematic diagram illustrating a speech synthesis method according to another exemplary embodiment of the present application.
Fig. 11 is a schematic structural diagram of a speech synthesis apparatus based on small sample learning according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the application will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following description in order to provide a better illustration of the application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, well known methods, procedures, components, and circuits have not been described in detail so as not to obscure the present application.
In the related art, speech synthesis technology mainly has the following problems: a speech synthesis framework based on an acoustic model and a vocoder is adopted, so the naturalness of the synthesized speech still needs to be improved; the voiceprint feature vector is only used to introduce the speaker's timbre information, without effective control over the timbre similarity between the synthesized speech and the target object, so the synthesized speech does not sound like the target object; and reliance on additional models (e.g., audio conversion models) increases training difficulty and cost.
The embodiment of the application provides a voice synthesis method based on small sample learning, which combines an end-to-end voice synthesis model with a voiceprint feature extractor, so that voice data with high naturalness and high tone similarity with a target object can be synthesized under the condition of small samples, and the efficiency and effect of personalized voice synthesis under the condition of small samples are improved. On the one hand, based on the end-to-end speech synthesis model, the generated speech has higher naturalness than other speech synthesis technologies. On the other hand, the voiceprint feature vector is obtained through the voiceprint feature extractor and used for model training, and the tone similarity between the synthesized voice data and the reference voice data is continuously improved in the training process, so that the tone of the subsequently synthesized voice is more similar to the tone of the target object.
First, an application scenario according to the present application will be described.
The embodiment of the application provides a voice synthesis method based on small sample learning, wherein an execution subject is computer equipment. Referring to fig. 1, a schematic diagram of a computer device according to an exemplary embodiment of the present application is shown.
The computer device may be a terminal or a server. The terminal may be a mobile terminal or a fixed terminal, such as a cell phone, a tablet computer, a laptop computer, or a desktop computer. The server may be a single server, a server cluster comprising a plurality of servers, or a cloud computing service center.
As shown in fig. 1, the computer device includes a processor 10, a memory 20, and a communication interface 30. Those skilled in the art will appreciate that the architecture shown in fig. 1 is not limiting of the computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
the processor 10 is a control center of the computer device, connects various parts of the entire computer device using various interfaces and lines, and performs various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 20, and calling data stored in the memory 20, thereby controlling the computer device as a whole. The processor 10 may be implemented by a CPU or by a graphics processor (Graphics Processing Unit, GPU).
The memory 20 may be used to store software programs and modules. The processor 10 executes various functional applications and data processing by running software programs and modules stored in the memory 20. The memory 20 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system 21, an acquisition module 22, an extraction module 23, an adjustment module 24, an application program 25 required for at least one function, and the like; the storage data area may store data created according to the use of the computer device, etc. The Memory 20 may be implemented by any type or combination of volatile or non-volatile Memory devices, such as static random access Memory (Static Random Access Memory, SRAM), electrically erasable programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), erasable programmable Read-Only Memory (Erasable Programmable Read Only Memory, EPROM), programmable Read-Only Memory (Programmable Read-Only Memory, PROM), read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk. Accordingly, the memory 20 may also include a memory controller to provide access to the memory 20 by the processor 10.
Wherein the processor 10 performs the following functions by running the acquisition module 22: acquiring target voice data of a target object; the processor 10 performs the following functions by running the extraction module 23: extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to target voice data, wherein the target voiceprint feature vector indicates the tone of a target object; the processor 10 performs the following functions by running the adjustment module 24: and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis.
Referring to fig. 2, a flowchart of a small sample learning-based speech synthesis method according to an exemplary embodiment of the present application is shown, and the method is used in the computer device shown in fig. 1 for illustration. The method comprises the following steps.
Step 201, target voice data of a target object is acquired.
Optionally, before the computer device obtains the target voice data of the target object, model training is performed, that is, a voiceprint feature extractor and an end-to-end voice synthesis model are obtained through pre-training, the voiceprint feature extractor is used for converting the audio data of the object into a voiceprint feature vector of the object, and the end-to-end voice synthesis model is used for converting a phoneme sequence of the text and the voiceprint feature vector of the object into synthetic voice data, wherein the synthetic voice data is voice data with a tone of the object. After training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored. It should be noted that, the training process of the voiceprint feature extractor and the end-to-end speech synthesis model may refer to the details of the following embodiments, which are not described herein.
The computer device obtains target voice data of a target object. Optionally, the target speech data comprises a plurality of sets of target text audio pairs, each set of target text audio pairs comprising audio data of the target text and a corresponding target object.
Optionally, in the registration stage, the computer device displays a system prompt message, where the system prompt message includes a target text, and the system prompt message indicates that the target object speaks a voice according to the target text. The computer device collects speech spoken by the target object, i.e., audio data of the target object, thereby forming a set of target text audio pairs.
Step 202, extracting a target voiceprint feature vector by a voiceprint feature extractor obtained through pre-training according to target voice data, wherein the target voiceprint feature vector indicates the tone of a target object.
The computer equipment extracts a voiceprint feature vector of a target object, namely a target voiceprint feature vector, according to target voice data through a voiceprint feature extractor obtained through pre-training, wherein the target voiceprint feature vector indicates the tone color of the target object.
Optionally, the target speech data includes multiple sets of target text-audio pairs, and for each set of target text-audio pairs, the computer device converts the target text into a corresponding target phoneme sequence and converts the audio data of the target object into the target voiceprint feature vector through the voiceprint feature extractor obtained by pre-training. Illustratively, the computer device performs text normalization (Text Normalization, TN) on the target text, converts the target text into a target normalized text sequence, and then converts the target normalized text sequence into the corresponding target phoneme sequence.
Step 203, performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis.
The computer equipment performs small sample learning according to the target voice data and the target voiceprint feature vector, adjusts parameters of a normalization layer in the end-to-end voice synthesis model, and other parameters in the end-to-end voice synthesis model remain unchanged to obtain a final end-to-end voice synthesis model, wherein the end-to-end voice synthesis model is used for performing voice synthesis. Wherein the other parameters are all other parameters except parameters of the normalization layer in the end-to-end voice synthesis model.
Optionally, the computer device performs small sample learning according to the multiple groups of target phoneme sequences and the target voiceprint feature vectors to obtain the end-to-end speech synthesis model. That is, for each group of target phoneme sequence and target voiceprint feature vector, the parameters of the normalization layer in the end-to-end speech synthesis model are adjusted accordingly.
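The few-shot adjustment described above can be sketched as follows. This sketch is not part of the patent text; the model interface, loss function, optimizer and learning rate (`tts_model`, `compute_tts_loss`, Adam, `lr`) are illustrative assumptions. The key point it shows is that only normalization-layer parameters receive gradient updates while all other parameters stay frozen.

```python
import torch

def finetune_normalization_layers(tts_model, target_batches, voiceprint_vec,
                                  compute_tts_loss, steps=200, lr=1e-4):
    """Few-shot adaptation sketch: update only normalization-layer parameters.

    `tts_model`, `compute_tts_loss` and the batch format are assumptions;
    the patent only specifies that normalization-layer parameters are tuned
    while all other parameters remain unchanged.
    """
    # Freeze everything, then re-enable only parameters belonging to
    # normalization layers (identified here by a simple name heuristic).
    for name, param in tts_model.named_parameters():
        param.requires_grad = "norm" in name.lower()

    trainable = [p for p in tts_model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)

    for _ in range(steps):
        for phonemes, reference_audio in target_batches:
            optimizer.zero_grad()
            synthesized = tts_model(phonemes, voiceprint_vec)
            loss = compute_tts_loss(synthesized, reference_audio)
            loss.backward()
            optimizer.step()
    return tts_model
```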
Optionally, the computer device performs small sample learning according to the target voice data and the target voiceprint feature vector, and stores the target voiceprint feature vector and the parameters of the adjusted end-to-end voice synthesis model after obtaining the end-to-end voice synthesis model.
Optionally, in the using stage, the computer device invokes the stored target voiceprint feature vector and the parameters of the adjusted end-to-end speech synthesis model to perform speech synthesis.
Optionally, the computer device invokes the stored target voiceprint feature vector and the adjusted parameters of the end-to-end speech synthesis model to perform speech synthesis, including: the computer equipment acquires the input first text, converts the first text into a corresponding first phoneme sequence, and outputs first voice data through the adjusted end-to-end voice synthesis model according to the first phoneme sequence and the stored target voiceprint feature vector, wherein the first voice data is voice data with the tone of the target object.
In one illustrative example, as shown in fig. 3, the computer device inputs the first phoneme sequence and the stored target voiceprint feature vector into the adjusted end-to-end speech synthesis model 31, outputting speech data having the timbre of the target object, i.e., first speech data.
It should be noted that, the process of invoking the voiceprint feature extractor and synthesizing the speech by the end-to-end speech synthesis model by the computer device may refer to the relevant details in the following embodiments, which are not described herein.
In summary, the embodiment of the application obtains the target voice data of the target object; extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to target voice data, wherein the target voiceprint feature vector indicates the tone of a target object; performing small sample learning according to target voice data and target voiceprint feature vectors to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vectors are used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis; based on the end-to-end voice synthesis model, the synthesized voice has higher naturalness compared with other voice synthesis technologies; and by combining the end-to-end voice synthesis model with the voiceprint feature extractor, the target voiceprint feature vector is extracted based on the voiceprint feature extractor, and the target voiceprint feature vector only adjusts the parameters of the normalization layer in the end-to-end voice synthesis model, namely a small number of parameters, so that voice data similar to the tone of the target object can be synthesized under the condition of a small sample, and the efficiency and the effect of personalized voice synthesis under the condition of the small sample are improved.
Optionally, the small sample learning-based voice synthesis method provided by the embodiment of the application includes, but is not limited to, three stages: training phase, registration phase and use phase.
In the training stage, the voiceprint feature extractor and the end-to-end speech synthesis model are trained, and after the training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored.
In the registration stage, target voice data of a target object is acquired, a target voice print feature vector is extracted through a voice print feature extractor, small sample learning is carried out according to the target voice data and the target voice print feature vector, parameters of a normalization layer in an end-to-end voice synthesis model are adjusted, and after registration is finished, the target voice print feature vector and the adjusted parameters of the end-to-end voice synthesis model are stored.
In the using stage, the input first text is obtained, and according to the first text and the stored target voiceprint feature vector, first voice data is obtained through the output of the adjusted end-to-end voice synthesis model, wherein the first voice data is voice data with the tone of the target object.
In the following, several exemplary embodiments are used to describe a model training method in the training stage, a small sample learning method in the registration stage, and a speech synthesis method in the use stage, respectively.
Referring to FIG. 4, a flow chart of a model training method for training phases is shown, according to an exemplary embodiment of the present application, which is used in the computer device shown in FIG. 1 for illustration. The method comprises the following steps.
Step 401, a first training data set is obtained, the first training data set comprising a plurality of sets of first sample data, each set of first sample data comprising audio data and a corresponding object identification.
In a training phase, the computer device acquires a first training data set, wherein the first training data set comprises first sample data corresponding to a plurality of sample objects respectively, each group of first sample data comprises audio data and corresponding object identification, and the object identification indicates identity information of the sample object, namely the object identification is used for uniquely indicating the sample object in the plurality of sample objects, and the sample object is other objects except a target object.
Step 402, training a voiceprint feature extractor according to a first training data set, the voiceprint feature extractor being a neural network model for converting input audio data into voiceprint feature vectors.
The computer device trains the voiceprint feature extractor from the first training data set. The voiceprint feature extractor is a neural network model, for example, the voiceprint feature extractor is a ResNet34 model. The voiceprint feature extractor is configured to convert input audio data of an object into a voiceprint feature vector of the object. Illustratively, the voiceprint feature extractor is configured to extract, from input audio data of an object, tone information associated with the object in the audio data, and represent the tone information as a voiceprint feature vector.
In one possible implementation, as shown in fig. 5, the voiceprint feature extractor 51 acts as a classification model when training, where the input parameters are audio data and the output parameters are predicted object identity information. After model training is completed, the last classification layer of the classification model is removed to obtain the voiceprint feature extractor 51, the input parameter of the voiceprint feature extractor 51 is audio data, and the output parameter is a voiceprint feature vector corresponding to the audio data.
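As a concrete illustration of this train-as-classifier-then-remove-the-classification-layer design, the following PyTorch sketch is offered. It is not taken from the patent; the spectrogram front end, embedding size, and the use of torchvision's ResNet34 backbone are assumptions.

```python
import torch.nn as nn
from torchvision.models import resnet34

class VoiceprintClassifier(nn.Module):
    """Speaker classifier during training; the backbone alone serves as the
    voiceprint feature extractor afterwards. Input shape and embedding size
    are illustrative assumptions."""
    def __init__(self, num_speakers, embedding_dim=256):
        super().__init__()
        backbone = resnet34(weights=None)
        # Accept a single-channel spectrogram instead of an RGB image.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, embedding_dim)
        self.backbone = backbone                                   # outputs the voiceprint vector
        self.classifier = nn.Linear(embedding_dim, num_speakers)   # removed after training

    def forward(self, spectrogram):
        embedding = self.backbone(spectrogram)      # (batch, embedding_dim)
        return self.classifier(embedding)           # predicted speaker identity

    def as_extractor(self):
        """After training, drop the classification layer and keep the embedding network."""
        return self.backbone
```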
Step 403, obtaining a second training data set, where the second training data set includes a plurality of sets of second sample data, each set of second sample data includes sample text and audio data of a corresponding sample object, and the sample object is another object except the target object.
The computer device obtains a second training data set comprising a plurality of sets of second sample data, each set of second sample data comprising sample text and audio data of a corresponding sample object. Each set of second sample data, also referred to as a sample text audio pair, corresponds to one object identification.
Step 404, training an end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor.
Optionally, the second training data set comprises a plurality of sets of second sample data. For each set of second sample data, the computer device converts the sample text into a corresponding sample phoneme sequence and converts the audio data of the sample object into a sample voiceprint feature vector through the voiceprint feature extractor, the sample voiceprint feature vector indicating the timbre of the sample object. The sample voiceprint feature vector serves as an additional input parameter that instructs the end-to-end speech synthesis model to generate speech data having the timbre of the sample object.
Optionally, the computer device performs text normalization on the sample text and converts the sample text into a sample normalized text sequence. For example, the sample text "1990" is converted into a sample normalized text sequence in which each digit is written out in its spoken form (e.g., "one nine nine zero"). Because the embodiment of the present application uses a phoneme sequence as the input of the end-to-end speech synthesis model, the computer device then converts the sample normalized text sequence into the corresponding sample phoneme sequence.
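A minimal front-end sketch of this text normalization and grapheme-to-phoneme step is shown below. It is illustrative only: it covers just digit expansion, and the `g2p` callable is a hypothetical stand-in for a real pronunciation lexicon or G2P model.

```python
import re

# Digit-reading table used only for this example; real text normalization
# handles many more cases (dates, abbreviations, symbols, and so on).
DIGIT_WORDS = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
               "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine"}

def normalize_text(text: str) -> str:
    """Expand digit runs into spoken words, e.g. "1990" -> "one nine nine zero"."""
    def expand(match):
        return " ".join(DIGIT_WORDS[d] for d in match.group())
    return re.sub(r"\d+", expand, text)

def text_to_phonemes(text: str, g2p) -> list[str]:
    """`g2p` is an assumed grapheme-to-phoneme callable (e.g. a lexicon lookup)."""
    normalized = normalize_text(text)
    return [p for word in normalized.split() for p in g2p(word)]

# Example: normalize_text("born in 1990") -> "born in one nine nine zero"
```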
Optionally, the computer device trains the end-to-end speech synthesis model based on the sample phoneme sequence and the sample voiceprint feature vector. The extracted voiceprint feature vector and the sample phoneme sequence are used together to control the end-to-end speech synthesis model. Meanwhile, the voiceprint feature vector is also used to adjust the parameters of the normalization layer in the end-to-end speech synthesis model; because the normalization layer has relatively few parameters, this facilitates learning under the small-sample condition.
Optionally, during training, all parameters of the end-to-end speech synthesis model are adjusted so that: the training error of the end-to-end speech synthesis model is as small as possible; the similarity between the first voiceprint feature vector extracted from the synthesized speech data by the voiceprint feature extractor and the second voiceprint feature vector corresponding to the reference speech data is as large as possible; and the similarity between the outputs of each intermediate layer of the voiceprint feature extractor for the synthesized speech data and for the reference speech data is as large as possible. The synthesized speech data is machine-synthesized speech having the timbre of the sample object, and the reference speech data is the real speech collected from the sample object.
The end-to-end speech synthesis model is a neural network model, for example, a NaturalSpeech model. The end-to-end speech synthesis model is used to convert an input phoneme sequence and an object's voiceprint feature vector into synthesized speech data, that is, speech data having the timbre of the object.
In one possible implementation, a schematic diagram of the end-to-end speech synthesis model is shown in fig. 6; the end-to-end speech synthesis model includes a phoneme encoder 61, a duration predictor 62, a prior posterior converter 63, a posterior decoder 64, and a posterior encoder 65. The phoneme encoder 61 is configured to convert the input phoneme sequence into a phoneme coding space of a target dimension (for example, a high-dimensional space). The duration predictor 62 is configured to predict the duration of each input phoneme and to map the phoneme coding sequence of length N to a sequence of length M, where M is the same as the number of speech frames and both N and M are positive integers. The prior posterior converter 63 is used to convert between the phoneme coding space and the audio coding space, the posterior decoder 64 is used to restore the audio coding sequence to corresponding synthesized speech data audible to the human ear, and the posterior encoder 65 is used to convert the input reference speech data into the audio coding space of the target dimension.
Optionally, the sample voiceprint feature vector is input as an additional input parameter to a duration predictor, a priori posterior converter, a posterior encoder, and a posterior decoder in the end-to-end speech synthesis model. In addition to being an additional input parameter, the sample voiceprint feature vector is also used to adjust parameters of the normalization layer in the end-to-end speech synthesis model.
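The module layout of fig. 6 and the points where the voiceprint feature vector enters can be sketched as follows. The submodules are placeholders supplied by the caller, and the tensor shapes and batch-size-1 assumption are illustrative; this is not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class EndToEndTTS(nn.Module):
    """Skeleton mirroring fig. 6; every submodule is a placeholder module
    passed in by the caller, not the patent's concrete implementation."""
    def __init__(self, phoneme_encoder, duration_predictor,
                 prior_posterior_converter, posterior_encoder, posterior_decoder):
        super().__init__()
        self.phoneme_encoder = phoneme_encoder                      # phonemes -> phoneme coding space
        self.duration_predictor = duration_predictor                # per-phoneme durations (length N -> M)
        self.prior_posterior_converter = prior_posterior_converter  # phoneme coding space <-> audio coding space
        self.posterior_encoder = posterior_encoder                  # reference audio -> audio coding space (training only)
        self.posterior_decoder = posterior_decoder                  # audio coding sequence -> waveform

    def forward(self, phonemes, voiceprint_vec):
        # Inference path; batch size 1 is assumed for the duration expansion.
        h = self.phoneme_encoder(phonemes)                          # (1, N, dim)
        durations = self.duration_predictor(h, voiceprint_vec)      # (N,) integer frame counts
        expanded = torch.repeat_interleave(h, durations, dim=1)     # (1, M, dim), M = sum of durations
        audio_codes = self.prior_posterior_converter(expanded, voiceprint_vec)
        return self.posterior_decoder(audio_codes, voiceprint_vec)
```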
Optionally, the parameters of the normalization layer in the end-to-end speech synthesis model include: the parameters of the respective normalization layers of the duration predictor, the prior posterior converter, the posterior encoder and the posterior decoder in the end-to-end speech synthesis model.
Optionally, the normalization layer is a batch normalization (English: Batch Norm) layer or a layer normalization (English: Layer Norm) layer; the embodiment of the present application is not limited thereto.
Optionally, the sample voiceprint feature vector participates in the normalization operation according to the following formula, thereby adjusting the parameters of the normalization layer in the end-to-end speech synthesis model:

o = γ(v) · (x − m) / σ + β(v)

where x is the current input, o is the current output, m is the mean of all inputs, σ is the standard deviation of all inputs, v is the sample voiceprint feature vector, and γ and β are the parameters of the normalization layer to be adjusted, applied here as a scale and bias that depend on v. It can be seen that the normalization layer in the end-to-end speech synthesis model provided by the embodiment of the present application takes the influence of the voiceprint feature vector into account.
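One way to realize such a voiceprint-conditioned normalization layer is sketched below. The use of linear projections to produce the scale and bias from the voiceprint vector is an assumed parameterization, not necessarily the exact form used in the patent.

```python
import torch.nn as nn

class VoiceprintConditionedNorm(nn.Module):
    """Layer-norm-style normalization whose scale and bias depend on the
    voiceprint vector v; the linear projections stand in for the patent's
    adjustable gamma/beta parameters."""
    def __init__(self, feature_dim, voiceprint_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.to_gamma = nn.Linear(voiceprint_dim, feature_dim)  # gamma(v)
        self.to_beta = nn.Linear(voiceprint_dim, feature_dim)   # beta(v)

    def forward(self, x, v):
        # x: (batch, time, feature_dim), v: (batch, voiceprint_dim)
        m = x.mean(dim=-1, keepdim=True)
        s = x.std(dim=-1, keepdim=True)
        x_hat = (x - m) / (s + self.eps)
        gamma = self.to_gamma(v).unsqueeze(1)   # broadcast over the time axis
        beta = self.to_beta(v).unsqueeze(1)
        return gamma * x_hat + beta
```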
Optionally, the computer device trains an end-to-end speech synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector, and includes: inputting the sample phoneme sequence and the sample voiceprint feature vector into a preset end-to-end voice synthesis model, and outputting to obtain synthesized voice data; inputting the synthesized voice data into a voiceprint feature extractor, outputting to obtain a first voice feature, inputting the reference voice data corresponding to the synthesized voice data into the voiceprint feature extractor, and outputting to obtain a second voice feature; and training an end-to-end voice synthesis model according to the similarity between the first voice characteristic and the second voice characteristic.
Optionally, the first speech feature includes: a first voiceprint feature vector of the synthesized speech data and a first output result of the synthesized speech data output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor; the second speech feature includes: a second voiceprint feature vector of the reference speech data and a second output result of the reference speech data output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor.
Optionally, the computer device trains an end-to-end speech synthesis model according to a similarity between the first speech feature and the second speech feature, including: and training an end-to-end voice synthesis model according to the similarity between the first voice print feature vector of the synthesized voice data and the second voice print feature vector of the reference voice data and the similarity between the first output result and the second output result output from each middle layer after the processing of the voice print feature extractor of the synthesized voice data and the reference voice data.
In one possible implementation, as shown in fig. 7, the computer device trains the end-to-end speech synthesis model 31 according to the similarity between the voiceprint feature vectors output by the voiceprint feature extractor 51 for the synthesized speech data and for the reference speech data, and the similarity between the outputs of each intermediate layer of the voiceprint feature extractor 51 for the synthesized speech data and for the reference speech data.
Optionally, the similarity between the voiceprint feature vectors is represented by cosine distances of the two vectors, and the cosine distances and the similarity are in positive correlation, namely, the larger the cosine distance is, the higher the similarity is represented.
Optionally, the computer device uses the output after each block of ResNet34 as an intermediate-layer output of the voiceprint feature extractor. The synthesized speech and the real speech are passed through the intermediate layers of the voiceprint feature extractor, and the resulting outputs are pooled. The pooling operation uses mean pooling, i.e., the average of all output vectors produced over the entire speech segment is computed. The similarity between intermediate-layer outputs can then be expressed by the cosine distance between the two pooled results, and the cosine distance is positively correlated with the similarity, i.e., the larger the cosine distance, the higher the similarity.
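The timbre-similarity objective described above (cosine comparison of the final voiceprint vectors plus mean-pooled intermediate-layer outputs) might look like the following sketch; the extractor interface returning intermediate outputs is an assumption.

```python
import torch.nn.functional as F

def timbre_similarity_loss(extractor, synthesized_audio, reference_audio):
    """Encourage the synthesized speech to match the reference timbre.

    Assumes `extractor(audio, return_intermediates=True)` returns
    (voiceprint_vector, [intermediate_outputs]); this interface is an
    illustrative assumption, not the patent's API.
    """
    syn_vec, syn_layers = extractor(synthesized_audio, return_intermediates=True)
    ref_vec, ref_layers = extractor(reference_audio, return_intermediates=True)

    # Cosine similarity between the final voiceprint feature vectors.
    loss = 1.0 - F.cosine_similarity(syn_vec, ref_vec, dim=-1).mean()

    # Mean-pool each intermediate layer over time, then compare with cosine.
    for syn_feat, ref_feat in zip(syn_layers, ref_layers):
        syn_pooled = syn_feat.mean(dim=-1)   # average over the whole utterance
        ref_pooled = ref_feat.mean(dim=-1)
        loss += 1.0 - F.cosine_similarity(
            syn_pooled.flatten(1), ref_pooled.flatten(1), dim=-1).mean()
    return loss
```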
And step 405, after training is completed, storing parameters of the voiceprint feature extractor and the end-to-end voice synthesis model obtained by training.
After the end-to-end speech synthesis model is trained based on the second training dataset and the voiceprint feature extractor, the computer device stores parameters of the trained voiceprint feature extractor and the trained end-to-end speech synthesis model.
Referring to fig. 8, a flowchart of a small sample learning method of registration phase according to an exemplary embodiment of the present application is shown, and the present embodiment is illustrated by using the method in the computer device shown in fig. 1. The method comprises the following steps.
In step 801, target voice data of a target object is acquired, and a target voiceprint feature vector is extracted by a voiceprint feature extractor.
In the registration stage, the computer equipment acquires target voice data of a target object, and extracts a target voiceprint feature vector through a voiceprint feature extractor according to the target voice data of the target object.
Step 802, adjusting parameters of a normalization layer in the end-to-end speech synthesis model according to the target speech data and the target voiceprint feature vector to obtain an adjusted end-to-end speech synthesis model.
The computer device performs small sample learning according to the target speech data and the target voiceprint feature vector and adjusts the parameters of the normalization layer in the end-to-end speech synthesis model. The adjustment method is analogous to that of the training stage, with the difference that in the registration stage only the parameters of the normalization layer in the end-to-end speech synthesis model are adjusted, while the remaining parameters stay unchanged.
Step 803, after registration is completed, storing the target voiceprint feature vector and the parameters of the adjusted end-to-end speech synthesis model.
After the end-to-end speech synthesis model has been adjusted according to the target speech data of the target object, that is, after registration is completed, the computer device stores the target voiceprint feature vector and the parameters of the adjusted end-to-end speech synthesis model.
It should be noted that, for details of each step in the small sample learning method in the registration stage, reference may be made to the description of the foregoing embodiments, which is not repeated herein.
Referring to fig. 9, a flowchart of a speech synthesis method in the use phase according to an exemplary embodiment of the present application is shown; this embodiment is illustrated with the method applied to the computer device shown in fig. 1. The method comprises the following steps.
Step 901, obtaining an input first text.
In the use phase, the computer device obtains the entered first text and a synthesis instruction that instructs the computer device to read the entered first text aloud in the timbre of the registered target object.
In step 902, the first text is converted into a corresponding first phoneme sequence.
The computer device performs text normalization on the first text, converts the first text into a first normalized text sequence, and converts the first normalized text sequence into a corresponding first phoneme sequence.
In step 903, according to the first phoneme sequence and the stored target voiceprint feature vector, first voice data is obtained as the output of the adjusted end-to-end speech synthesis model, where the first voice data is voice data with the timbre of the target object.
In an illustrative example, based on the structure of the end-to-end speech synthesis model as shown in fig. 6, a schematic diagram of the speech synthesis method is shown in fig. 10. The computer device inputs the first phoneme sequence and the stored target voiceprint feature vector into the adjusted end-to-end speech synthesis model, where the target voiceprint feature vector is fed as an additional input parameter to the duration predictor 62, the a priori posterior converter 63 and the posterior decoder 64 in the end-to-end speech synthesis model, and the resulting first speech data is output.
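For illustration only, the use phase can be sketched as follows; the g2p front end, the infer method, and the speaker_embedding parameter name are assumed interfaces introduced for this example, not ones defined by the embodiment.

import torch

def synthesize(tts_model, g2p, text: str, target_voiceprint: torch.Tensor) -> torch.Tensor:
    # Use phase: normalize the input text, convert it into a phoneme sequence,
    # and synthesize speech in the registered target object's timbre.
    normalized = text.strip().lower()      # placeholder for full text normalization
    phoneme_ids = g2p(normalized)          # grapheme-to-phoneme front end (assumed)
    with torch.no_grad():
        return tts_model.infer(phoneme_ids, speaker_embedding=target_voiceprint)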
In summary, the method provided by the embodiments of the application offers several advantages. First, because it is based on end-to-end speech synthesis, the synthesized speech is more natural. Second, combining the end-to-end speech synthesis model with the voiceprint feature extractor improves the naturalness of personalized speech synthesis while also improving the timbre similarity to the target object. Third, the voiceprint feature extractor extracts a voiceprint feature vector that adjusts only a small number of parameters in the end-to-end speech synthesis model, so speech data similar to the target object's timbre can be synthesized even with few training samples. Fourth, the base end-to-end speech synthesis model is trained on the speech data of a large number of non-target objects, and only a small number of parameters, namely the parameters of the normalization layer, are adjusted for the target object, so the model can synthesize natural personalized speech from a small sample. Finally, the voiceprint similarity between the synthesized speech data and the reference speech data is optimized at different levels of the voiceprint feature extractor, making the timbre of the synthesized speech data more similar to that of the target object and further improving the personalized speech synthesis effect.
Referring to fig. 11, a schematic structural diagram of a speech synthesis apparatus based on small sample learning according to an embodiment of the present application is shown. The apparatus may be implemented as a dedicated hardware circuit or as a combination of hardware and software as all or part of the computer device in fig. 1, the apparatus comprising: an acquisition module 22, an extraction module 23 and an adjustment module 24.
An acquisition module 22, configured to acquire target voice data of a target object;
an extracting module 23, configured to extract, according to target voice data, a target voiceprint feature vector through a voiceprint feature extractor obtained by training in advance, where the target voiceprint feature vector indicates a tone of a target object;
the adjustment module 24 is configured to perform small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, where the target voiceprint feature vector is used to adjust parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used to perform voice synthesis.
In one possible implementation, the target speech data includes multiple sets of target text audio pairs, each set of target text audio pairs including a target text and audio data of the corresponding target object;
The extraction module 23 is further configured to:
for each group of target text audio pairs, converting the target text into a corresponding target phoneme sequence, and converting the audio data of the target object into a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training;
the adjustment module 24 is further configured to:
and carrying out small sample learning according to the multiple groups of target phoneme sequences and the target voiceprint feature vectors to obtain an end-to-end speech synthesis model.
In another possible implementation, the parameters of the normalization layer in the end-to-end speech synthesis model include: parameters of normalization layers in a duration predictor, an a priori posterior converter, a posterior encoder and a posterior decoder in the end-to-end speech synthesis model, where the duration predictor is used for predicting the duration of each input phoneme, the a priori posterior converter is used for converting between a phoneme coding space and an audio coding space, the posterior encoder is used for converting input speech data into an audio coding space of a target dimension, and the posterior decoder is used for restoring an audio coding sequence into corresponding speech data.
In another possible implementation, the apparatus further includes: a storage module for:
And storing the target voiceprint feature vector and the parameters of the adjusted end-to-end voice synthesis model.
In another possible implementation, the apparatus further includes: training module for:
acquiring a first training data set, wherein the first training data set comprises a plurality of groups of first sample data, and each group of first sample data comprises audio data and corresponding object identifiers;
training a voiceprint feature extractor according to the first training data set, wherein the voiceprint feature extractor is a neural network model for converting input audio data into voiceprint feature vectors;
acquiring a second training data set, wherein the second training data set comprises multiple groups of second sample data, each group of second sample data comprises sample text and audio data of a corresponding sample object, and the sample objects are objects other than the target object;
training an end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor;
after training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored.
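For illustration, a two-stage training skeleton consistent with the above description might look as follows; the classifier head, dataset formats, optimizer settings, and the training_loss interface are assumptions introduced for this sketch and are not specified by the embodiment.

import torch
from torch import nn

def train_voiceprint_extractor(extractor, classifier_head, first_dataset, epochs=10):
    # Stage 1: train the voiceprint feature extractor on (audio, object identifier) pairs,
    # here as a speaker classifier; the classifier head is discarded after training.
    opt = torch.optim.Adam(list(extractor.parameters()) + list(classifier_head.parameters()))
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for audio, object_id in first_dataset:
            loss = ce(classifier_head(extractor(audio)), object_id)
            opt.zero_grad(); loss.backward(); opt.step()

def train_tts(tts_model, extractor, second_dataset, epochs=10):
    # Stage 2: freeze the extractor and train the end-to-end speech synthesis model on
    # (sample phoneme sequence, audio) pairs conditioned on sample voiceprint feature vectors.
    extractor.eval()
    opt = torch.optim.Adam(tts_model.parameters())
    for _ in range(epochs):
        for phonemes, audio in second_dataset:
            with torch.no_grad():
                voiceprint = extractor(audio)
            loss = tts_model.training_loss(phonemes, voiceprint, audio)  # assumed interface
            opt.zero_grad(); loss.backward(); opt.step()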
In another possible implementation, the training module is further configured to:
for each set of second sample data, converting the sample text into a corresponding sample phoneme sequence, and converting the audio data of the sample object into a sample voiceprint feature vector by a voiceprint feature extractor, wherein the sample voiceprint feature vector indicates the tone of the sample object;
And training an end-to-end voice synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector, wherein the sample voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model.
In another possible implementation, the training module is further configured to:
inputting the sample phoneme sequence and the sample voiceprint feature vector into a preset end-to-end voice synthesis model, and outputting to obtain synthesized voice data;
inputting the synthesized voice data into a voiceprint feature extractor, outputting to obtain a first voice feature, inputting the reference voice data corresponding to the synthesized voice data into the voiceprint feature extractor, and outputting to obtain a second voice feature;
and training an end-to-end voice synthesis model according to the similarity between the first voice characteristic and the second voice characteristic.
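For illustration, a voiceprint-similarity loss that combines the embedding-level comparison and the intermediate-layer comparisons described above might be sketched as follows; the return_intermediate flag and the (batch, frames, dim) tensor shapes are assumptions made for this example.

import torch.nn.functional as F

def voiceprint_similarity_loss(extractor, synth_wav, ref_wav, layer_weight=1.0):
    # Compare synthesized and reference speech at the final voiceprint embedding and at
    # each intermediate layer of the extractor; turn the similarities into a loss to minimize.
    synth_emb, synth_layers = extractor(synth_wav, return_intermediate=True)  # assumed interface
    ref_emb, ref_layers = extractor(ref_wav, return_intermediate=True)

    loss = 1.0 - F.cosine_similarity(synth_emb, ref_emb, dim=-1).mean()
    for s, r in zip(synth_layers, ref_layers):
        s_vec, r_vec = s.mean(dim=1), r.mean(dim=1)   # mean pooling over frames
        loss = loss + layer_weight * (1.0 - F.cosine_similarity(s_vec, r_vec, dim=-1).mean())
    return loss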
In another possible implementation, the first speech feature comprises: a first voiceprint feature vector of the synthesized voice data and a first output result of the synthesized voice data, which are respectively output from the plurality of intermediate layers after being processed by the voiceprint feature extractor;
the second speech feature includes: a second voiceprint feature vector of the reference voice data and a second output result of the reference voice data output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor.
In another possible implementation, the apparatus further includes: a use module for:
acquiring an input first text;
converting the first text into a corresponding first phoneme sequence;
and outputting and obtaining first voice data through the adjusted end-to-end voice synthesis model according to the first phoneme sequence and the stored target voiceprint feature vector, wherein the first voice data is voice data with the tone of the target object.
It should be noted that, when the apparatus provided in the foregoing embodiment implements its functions, the division into the above functional modules is merely illustrative; in practical applications, the functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; their specific implementation processes are detailed in the method embodiments and are not repeated here.
An embodiment of the present application provides a small sample learning-based speech synthesis apparatus, the apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the methods performed by the computer device in the various method embodiments described above when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the methods performed by a computer device in the various method embodiments described above.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying the computer readable code; when the computer readable code runs in an electronic device, a processor in the electronic device performs the methods performed by the computer device in the various method embodiments described above.
The present application may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present application.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical encoding device such as a punch card or a raised structure in a groove having instructions stored thereon, and any suitable combination of the foregoing. Computer readable storage media, as used herein, are not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., optical pulses through fiber optic cables), or electrical signals transmitted through wires.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by personalizing electronic circuitry, such as programmable logic circuitry, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), with state information of the computer readable program instructions, and the electronic circuitry can execute the computer readable program instructions.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the application has been presented for purposes of illustration and description and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, and to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A method of speech synthesis based on small sample learning, the method comprising:
acquiring target voice data of a target object;
extracting a target voiceprint feature vector by a voiceprint feature extractor obtained through pre-training according to the target voice data, wherein the target voiceprint feature vector indicates the tone of the target object;
and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for performing voice synthesis.
2. The method of claim 1, wherein the target speech data comprises a plurality of sets of target text audio pairs, each set of target text audio pairs comprising target text and audio data of the corresponding target object;
the extracting the target voiceprint feature vector by a voiceprint feature extractor obtained by pre-training according to the target voice data comprises the following steps:
for each group of target text audio pairs, converting the target text into a corresponding target phoneme sequence, and converting the audio data of the target object into the target voiceprint feature vector through the voiceprint feature extractor obtained through pre-training;
and performing small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, wherein the small sample learning comprises the following steps:
and carrying out small sample learning according to a plurality of groups of target phoneme sequences and the target voiceprint feature vectors to obtain the end-to-end speech synthesis model.
3. The method of claim 1, wherein the parameters of the normalization layer in the end-to-end speech synthesis model comprise: parameters of normalization layers in a duration predictor, an a priori posterior converter, a posterior encoder and a posterior decoder in the end-to-end speech synthesis model, wherein the duration predictor is used for predicting the duration of each input phoneme, the a priori posterior converter is used for converting between a phoneme coding space and an audio coding space, the posterior encoder is used for converting input speech data into an audio coding space of a target dimension, and the posterior decoder is used for restoring an audio coding sequence into corresponding speech data.
4. A method according to any one of claims 1 to 3, wherein after the performing of small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, the method further comprises:
and storing the target voiceprint feature vector and the adjusted parameters of the end-to-end voice synthesis model.
5. A method according to any one of claims 1 to 3, further comprising, prior to said obtaining the target speech data of the target object:
acquiring a first training data set, wherein the first training data set comprises a plurality of groups of first sample data, and each group of first sample data comprises audio data and corresponding object identifiers;
training the voiceprint feature extractor according to the first training data set, wherein the voiceprint feature extractor is a neural network model for converting input audio data into voiceprint feature vectors;
acquiring a second training data set, wherein the second training data set comprises a plurality of groups of second sample data, each group of second sample data comprises sample text and audio data of a corresponding sample object, and the sample objects are objects other than the target object;
Training the end-to-end speech synthesis model according to the second training data set and the voiceprint feature extractor;
after training is completed, parameters of the voiceprint feature extractor and the end-to-end speech synthesis model obtained through training are stored.
6. The method of claim 5, wherein training the end-to-end speech synthesis model based on the second training data set and the voiceprint feature extractor comprises:
for each set of the second sample data, converting the sample text into a corresponding sample phoneme sequence, and converting, by the voiceprint feature extractor, audio data of the sample object into a sample voiceprint feature vector, the sample voiceprint feature vector indicating a timbre of the sample object;
and training the end-to-end voice synthesis model according to the sample phoneme sequence and the sample voiceprint feature vector, wherein the sample voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model.
7. The method of claim 6, wherein training the end-to-end speech synthesis model based on the sample phoneme sequence and the sample voiceprint feature vector comprises:
Inputting the sample phoneme sequence and the sample voiceprint feature vector into a preset end-to-end voice synthesis model, and outputting to obtain synthesized voice data;
inputting the synthesized voice data into the voiceprint feature extractor, outputting to obtain a first voice feature, inputting the reference voice data corresponding to the synthesized voice data into the voiceprint feature extractor, and outputting to obtain a second voice feature;
and training the end-to-end voice synthesis model according to the similarity between the first voice characteristic and the second voice characteristic.
8. The method of claim 7, wherein:
the first speech feature comprises: a first voiceprint feature vector of the synthesized speech data and a first output result of the synthesized speech data, which is output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor;
the second speech feature comprises: a second voiceprint feature vector of the reference voice data and a second output result of the reference voice data, which is output from each of the plurality of intermediate layers after being processed by the voiceprint feature extractor.
9. A method according to any one of claims 1 to 3, wherein after the small sample learning is performed according to the target speech data and the target voiceprint feature vector, the method further comprises:
Acquiring an input first text;
converting the first text into a corresponding first phoneme sequence;
and outputting and obtaining first voice data through the adjusted end-to-end voice synthesis model according to the first phoneme sequence and the stored target voiceprint feature vector, wherein the first voice data is voice data with the tone of the target object.
10. A small sample learning-based speech synthesis apparatus, the apparatus comprising:
the acquisition module is used for acquiring target voice data of a target object;
the extraction module is used for extracting a target voiceprint feature vector through a voiceprint feature extractor obtained through pre-training according to the target voice data, wherein the target voiceprint feature vector indicates the tone of the target object;
the adjustment module is used for carrying out small sample learning according to the target voice data and the target voiceprint feature vector to obtain an end-to-end voice synthesis model, the target voiceprint feature vector is used for adjusting parameters of a normalization layer in the end-to-end voice synthesis model, and the end-to-end voice synthesis model is used for carrying out voice synthesis.
11. A small sample learning-based speech synthesis apparatus, the apparatus comprising:
A processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any of claims 1-9 when executing the instructions.
12. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-9.
13. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying the computer readable code, characterized in that a processor in an electronic device performs the method of any one of claims 1-9 when the computer readable code is run in the electronic device.
CN202311084629.4A 2023-08-25 2023-08-25 Speech synthesis method, device and storage medium based on small sample learning Active CN116825081B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311084629.4A CN116825081B (en) 2023-08-25 2023-08-25 Speech synthesis method, device and storage medium based on small sample learning

Publications (2)

Publication Number Publication Date
CN116825081A true CN116825081A (en) 2023-09-29
CN116825081B CN116825081B (en) 2023-11-21

Family

ID=88113055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311084629.4A Active CN116825081B (en) 2023-08-25 2023-08-25 Speech synthesis method, device and storage medium based on small sample learning

Country Status (1)

Country Link
CN (1) CN116825081B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111681635A (en) * 2020-05-12 2020-09-18 深圳市镜象科技有限公司 Method, apparatus, device and medium for real-time cloning of voice based on small sample
CN114283777A (en) * 2020-09-18 2022-04-05 北京有限元科技有限公司 Speech synthesis method, apparatus and storage medium
KR20220070979A (en) * 2020-11-23 2022-05-31 서울대학교산학협력단 Style speech synthesis apparatus and speech synthesis method using style encoding network
CN114822497A (en) * 2022-05-13 2022-07-29 平安科技(深圳)有限公司 Method, apparatus, device and medium for training speech synthesis model and speech synthesis
CN114999442A (en) * 2022-05-27 2022-09-02 平安银行股份有限公司 Self-adaptive character-to-speech method based on meta learning and related equipment thereof
CN116092471A (en) * 2022-09-16 2023-05-09 西北师范大学 Multi-style personalized Tibetan language speech synthesis model oriented to low-resource condition
CN116206592A (en) * 2023-01-17 2023-06-02 思必驰科技股份有限公司 Voice cloning method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张小峰; 谢钧; 罗健欣; 俞璐: "Research on deep learning speech synthesis technology" (深度学习语音合成技术研究), 计算机时代 (Computer Era), no. 9 *

Also Published As

Publication number Publication date
CN116825081B (en) 2023-11-21

Similar Documents

Publication Publication Date Title
CN110223705B (en) Voice conversion method, device, equipment and readable storage medium
CN111933110B (en) Video generation method, generation model training method, device, medium and equipment
JP7427723B2 (en) Text-to-speech synthesis in target speaker's voice using neural networks
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
US11355097B2 (en) Sample-efficient adaptive text-to-speech
CN113205793B (en) Audio generation method and device, storage medium and electronic equipment
Chao et al. Speaker-targeted audio-visual models for speech recognition in cocktail-party environments
CN112837669A (en) Voice synthesis method and device and server
CN114678032B (en) Training method, voice conversion method and device and electronic equipment
CN113782042B (en) Speech synthesis method, vocoder training method, device, equipment and medium
CN113470688B (en) Voice data separation method, device, equipment and storage medium
CN113963715A (en) Voice signal separation method and device, electronic equipment and storage medium
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN116825081B (en) Speech synthesis method, device and storage medium based on small sample learning
CN112712789A (en) Cross-language audio conversion method and device, computer equipment and storage medium
CN114783407B (en) Speech synthesis model training method, device, computer equipment and storage medium
CN113299270B (en) Method, device, equipment and storage medium for generating voice synthesis system
CN114242035A (en) Speech synthesis method, apparatus, medium, and electronic device
CN116348953A (en) Frame level permutation invariant training for source separation
CN115700871A (en) Model training and speech synthesis method, device, equipment and medium
CN111798849A (en) Robot instruction identification method and device, electronic equipment and storage medium
Basir et al. U-NET: A Supervised Approach for Monaural Source Separation
CN118298836A (en) Tone color conversion method, device, electronic apparatus, storage medium, and program product
CN118298837A (en) Tone color conversion method, device, electronic apparatus, storage medium, and program product
CN116129867A (en) Speech synthesis method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant