CN114999442A - Adaptive text-to-speech method based on meta-learning and related device

Info

Publication number
CN114999442A
Authority
CN
China
Prior art keywords
data
acoustic model
sample data
target
style
Prior art date
Legal status
Pending
Application number
CN202210591183.3A
Other languages
Chinese (zh)
Inventor
杨焱麒
Current Assignee
Ping An Bank Co Ltd
Original Assignee
Ping An Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Bank Co Ltd
Priority to CN202210591183.3A
Publication of CN114999442A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L19/00 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/02 - Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using spectral analysis, e.g. transform vocoders or subband vocoders

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The application belongs to the field of artificial intelligence and relates to an adaptive text-to-speech method based on meta-learning, which comprises the following steps: pre-training is carried out according to the full data set to obtain initial values of a preset acoustic model; sound training sample data are sampled, feature training is carried out through the preset acoustic model to generate a mel frequency spectrum, and a style code is generated through a preset style encoder; adaptive instance normalization processing is carried out on the layer normalization of the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum; and finally, strange sample data are converted to output target voice data with the style code. The application also provides an adaptive text-to-speech apparatus based on meta-learning, a computer device and a storage medium. In addition, the present application also relates to blockchain technology, and the data involved in the conversion process can be stored in a blockchain. The method and the device can reduce training complexity and realize adaptive learning and conversion of small sample data.

Description

Adaptive text-to-speech method based on meta-learning and related device
Technical Field
The present application relates to the technical field of artificial intelligence, and in particular to an adaptive text-to-speech method based on meta-learning and a related device.
Background
With the success of neural networks in many applications, text-to-speech (TTS) systems based on neural networks have also improved significantly over the past few years. The speech fidelity and intelligibility of TTS have been greatly improved, and applications such as artificial intelligence voice assistant services and audio navigation systems have been widely developed and deployed. Beyond generating high-quality speech, clients increasingly demand personalization, which requires the TTS model to capture the voices of different speakers well while still generating high-quality speech. However, the existing adaptive TTS systems are mainly based on a pre-trained model: the model is first trained from scratch on the voices of multiple speakers and is then fine-tuned on the pre-trained model using a small portion of a target speaker's speech data. This method still requires the speaker to provide some speech data, the fine-tuning requires several thousand iteration steps to converge, and the synthesis effect is poor for speakers not present in the training data. Therefore, the adaptive TTS systems in the prior art suffer from high training complexity and a poor conversion effect on small sample data.
Disclosure of Invention
The embodiment of the application aims to provide an adaptive text-to-speech method based on meta-learning and a related device, in which the data volume of the voice training sample data is small, so that training complexity can be reduced and the adaptive learning capability and conversion effect for small sample data can be improved.
In order to solve the above technical problem, an embodiment of the present application provides an adaptive text-to-speech method based on meta learning, which adopts the following technical solution:
pre-training is carried out based on the obtained full data set of the speaker to obtain a pre-training acoustic model, wherein parameters included in the pre-training acoustic model are initial values of a preset acoustic model;
sampling voice training sample data from the full data set, performing feature training according to the voice training sample data through the preset acoustic model to generate a mel frequency spectrum, and generating a style code through a preset style encoder;
performing adaptive instance normalization processing on the layer normalization of the preset acoustic model, and injecting the style code into the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style code;
and acquiring strange sample data, and inputting the strange sample data into the target acoustic model to output target voice data with the style code corresponding to the strange sample data.
Further, the step of performing feature training according to the sound training sample data through the preset acoustic model to generate a mel frequency spectrum, and the step of generating a style code through a preset style encoder specifically include:
inputting the sound data in the sound training sample data into the preset acoustic model, and generating a mel frequency spectrum according to the sampling frequency of the sound data;
and inputting the voice data in the voice training sample data into the preset style encoder, and generating the style code according to the sampling frequency and the sample precision of the voice data.
Further, the step of performing adaptive instance normalization on the layer normalization of the preset acoustic model, and injecting the style code into the preset acoustic model to obtain a target acoustic model including a target mel frequency spectrum specifically includes:
calculating a first parameter of the style encoding by the adaptive instance normalization process;
calculating a second parameter of the mel frequency spectrum through the adaptive instance normalization processing;
and performing data matching based on the first parameter of the style code and the second parameter of the mel frequency spectrum, and outputting the target mel frequency spectrum with the style code.
Further, after the step of performing adaptive instance normalization on the layer normalization of the preset acoustic model and injecting the style code into the preset acoustic model, the method further includes the steps of:
sampling text request sample data from the full data set, inputting the text request sample data into the target acoustic model for conversion detection, and judging whether to output detection data corresponding to the text request sample data.
Further, the step of inputting the text request sample data into the target acoustic model for conversion detection and judging whether to output detection data corresponding to the text request sample data includes:
judging whether the target mel frequency spectrum contains the style code or not through a preset style discriminator;
and judging whether the target mel frequency spectrum is aligned with a phoneme corresponding to the input text request sample data or not through a preset phoneme discriminator.
In order to solve the above technical problem, an embodiment of the present application further provides an adaptive text-to-speech apparatus based on meta learning, which adopts the following technical solutions:
the first training module is used for model pre-training based on the full data set of the speaker, and taking model parameters obtained by pre-training as initial values of a preset acoustic model;
the second training module is used for sampling sound training sample data from the full data set, performing feature training according to the sound training sample data through the preset acoustic model to generate a mel frequency spectrum, and generating a style code through a preset style encoder;
the normalization processing module is used for carrying out self-adaptive example normalization processing on the layer normalization of the preset acoustic model and injecting the style code into the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style code;
and the conversion module is used for acquiring strange sample data and inputting the strange sample data into the target acoustic model so as to output target voice data with the style code corresponding to the strange sample data.
Further, the second training module comprises:
the first generation submodule is used for inputting the sound data in the sound training sample data into the preset acoustic model and generating a mel frequency spectrum according to the sampling frequency of the sound data;
and the second generation submodule is used for inputting the voice data in the voice training sample data into the preset style encoder and generating the style code according to the sampling frequency and the sample precision of the voice data.
Further, the normalization processing module includes:
a first calculation submodule, configured to calculate a first parameter of the style encoding through the adaptive instance normalization processing;
a second calculation submodule, configured to calculate a second parameter of the mel spectrum through the adaptive instance normalization processing;
and the third calculation submodule is used for outputting the target mel frequency spectrum with the style code according to the first parameter of the style code and the second parameter of the mel frequency spectrum.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
the method comprises a memory and a processor, wherein the memory stores computer readable instructions, and the processor implements the steps of the adaptive text-to-speech method based on meta learning in any one of the above embodiments when executing the computer readable instructions.
In order to solve the foregoing technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions, which when executed by a processor, implement the steps of the meta learning based adaptive text-to-speech method described in any of the above embodiments.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects: the method comprises the steps of training a preset acoustic model through extracted sound training sample data to obtain a corresponding mel frequency spectrum and a style code, then carrying out self-adaptive example normalization processing on layer normalization of the preset acoustic model, injecting the style code into the preset acoustic model to obtain a target acoustic model comprising the target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style code. In the acoustic model learning process, the data volume of the sampled voice training sample data of the speaker is small, and the training complexity can be reduced during the adaptive instance normalization processing; and the preset acoustic model performs feature learning according to the sampled sound training sample data, the style code is injected into the preset acoustic model, and the finally obtained target acoustic model is tested, so that when characters are converted into voice, the corresponding style code and the target voice data can be generated according to a small amount of strange sample data, and the small sample data has strong adaptability learning and conversion capability, and is more beneficial to realizing personalized requirements.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an adaptive text-to-speech method based on meta-learning according to the present application;
FIG. 3 is a flow diagram of one embodiment of step 202 of FIG. 2;
FIG. 4 is a flowchart of an embodiment of step 203 in FIG. 2;
FIG. 5 is a flow diagram illustrating an embodiment of an adaptive text-to-speech method based on meta learning according to the present application;
FIG. 6 is a flowchart of one embodiment of step 205 of FIG. 5;
FIG. 7 is a block diagram illustrating an embodiment of an adaptive text-to-speech apparatus based on meta learning according to the present application;
FIG. 8 is a schematic diagram of an embodiment of the second training module of FIG. 7;
FIG. 9 is a block diagram illustrating an embodiment of the normalization processing module shown in FIG. 7;
FIG. 10 is a block diagram illustrating an alternative embodiment of an adaptive text-to-speech apparatus based on meta-learning according to the present application;
FIG. 11 is a block diagram illustrating an embodiment of the determining module shown in FIG. 10;
fig. 12 is a block diagram of a basic configuration of the computer device provided in the present embodiment.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
The server 105 may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
It should be noted that the adaptive text-to-speech method based on meta learning provided in the embodiments of the present application is generally executed by the server/terminal device, and accordingly, the adaptive text-to-speech apparatus based on meta learning is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices 101, 102, 103, network 104 and server 105 in fig. 1 is merely illustrative. There may be any number of terminal devices 101, 102, 103, networks 104, and servers 105, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of an adaptive text-to-speech method based on meta-learning according to the present application is shown. The self-adaptive text-to-speech method based on meta learning comprises the following steps:
step S201, model pre-training is carried out based on the full data set of the speaker, and model parameters obtained through pre-training are used as initial values of a preset acoustic model.
In this embodiment, an electronic device (for example, the server/terminal device shown in fig. 1) on which an adaptive text-to-speech method based on meta learning is executed may obtain the full data set of the speaker and perform data transmission, etc. through a wired connection manner or a wireless connection manner. It should be noted that the wireless connection means may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (Ultra Wideband) connection, and other wireless connection means now known or developed in the future.
Specifically, the full-volume data set may be a data set of a plurality of speakers collected in advance, and the data volume in the full-volume data set is large and may be thousands of data volumes. In the full-volume data set, the data of each speaker may include a plurality of pieces of sound data and text data corresponding to the sound data, and the plurality of pieces of sound data are different. In the full data set, the sound data of each speaker is in one-to-one correspondence with the speaker, and the correspondence can be represented by identification.
The model pre-training based on the full data set of the speakers may refer to training performed before formal training in order to provide initial model parameters for a preset acoustic model (text2mel). In a deep learning neural network, the training process performs parameter optimization based on gradient descent, and the minimum loss function and the optimal model weights are obtained through step-by-step iteration. When gradient descent is performed, each parameter in the model needs to be assigned an initial value. Pre-training can accelerate the feature learning speed of the subsequent acoustic model and improve the efficiency of the model. Here, text2mel can be an acoustic model based on a Transformer, which does not rely on recurrence but processes all words or symbols in the sequence in parallel, using the self-attention mechanism to combine context from more distant words. By processing all words in parallel and letting each word attend to the other words in the sentence over multiple processing steps, the Transformer trains quickly and achieves high accuracy. Therefore, when text2mel undergoes subsequent training based on the initial values given by the pre-trained acoustic model, the convergence of gradient descent in model training is accelerated, and a target acoustic model with low error is more likely to be obtained.
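For illustration only, a minimal PyTorch-style sketch of this pre-training stage is given below; the `Text2Mel` architecture, the dataset format, and the hyperparameters are assumptions and do not reproduce the exact text2mel model of this disclosure.

```python
import torch
from torch import nn

class Text2Mel(nn.Module):
    """Hypothetical Transformer-based text2mel generator (details assumed)."""
    def __init__(self, vocab_size=256, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, tokens):                                 # tokens: (B, T)
        return self.to_mel(self.encoder(self.embed(tokens)))   # (B, T, n_mels)

def pretrain(model, full_loader, epochs=1, lr=1e-4):
    """Gradient-descent pre-training on the full multi-speaker data set;
    the returned weights serve as the initial values of the preset acoustic model."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for tokens, mel_target in full_loader:   # assumes pre-aligned (text, mel) pairs
            loss = nn.functional.l1_loss(model(tokens), mel_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model.state_dict()                    # initial values for subsequent training
```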
Step S202, sound training sample data are sampled from the full data set, feature training is carried out according to the sound training sample data through a preset acoustic model to generate a mel frequency spectrum, and a style code is generated through a preset style encoder.
Specifically, in the full data set, each speaker corresponds to a subset of data. Because the data volume is large, the subset data of some speakers may be extracted in a random extraction manner, for example: subset data of 50 speakers may be selected from a full data set of 5000 speakers for feature training of the acoustic model. The extraction method may also be to extract the subset data of the first/last speakers in order, and of course, other extraction methods may also be used, for example: interval extraction, and the like.
More specifically, the voice training sample data may be directly sampled from the full-size data set, or may be sampled from the subset data. The voice training sample data may include voice data of a speaker and text data corresponding to the voice data, and may also be referred to as a support set (Xs, ts) in this embodiment, where Xs represents the voice data of the speaker and ts represents the text data corresponding to the voice data. The acoustic model performs characteristic training on the sound training sample data to realize the conversion of characters to voice by the acoustic model.
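The support-set sampling described above might be sketched as follows, assuming (purely for illustration) that the full data set is a mapping from speaker ID to a list of (Xs, ts) pairs:

```python
import random

def sample_support_set(full_dataset, n_speakers=50, n_utterances=5, seed=0):
    """Draw a few speakers at random and, for each, a few (Xs, ts) pairs,
    where Xs is the speaker's sound data and ts the corresponding text."""
    rng = random.Random(seed)
    speakers = rng.sample(sorted(full_dataset), n_speakers)
    support = []
    for spk in speakers:
        pairs = rng.sample(full_dataset[spk], min(n_utterances, len(full_dataset[spk])))
        support.extend((spk, xs, ts) for xs, ts in pairs)
    return support
```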
Here, text2mel is used as a generator, and the frequencies of the voice in the sound data of the sound training sample data are used to generate a corresponding mel frequency spectrum. The mel frequency spectrum is obtained by a mathematical operation that converts the frequencies in the sound data to the mel scale: a spectrogram of the signal is first computed using a short-time Fourier transform, and this spectrogram is then dot-multiplied with a mel filter bank to obtain the spectrum on the mel scale. Each filter in the mel filter bank is a triangular filter, and the dot multiplication is applied filter by filter.
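As a concrete illustration of this computation, the sketch below builds a mel frequency spectrum by dot-multiplying an STFT spectrogram with a triangular mel filter bank; librosa is used here only as one possible toolkit, and the parameter values are assumptions.

```python
import numpy as np
import librosa

def mel_spectrum(wav, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    # Short-time Fourier transform -> magnitude spectrogram
    spec = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    # Triangular mel filter bank, applied as a dot (matrix) product
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_fb @ spec          # shape: (n_mels, n_frames)
```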
The preset style encoder (style encoder) may encode according to voice data of a speaker to generate a style code corresponding to the voice data, where the style code may include identity information, tone, and rhythm of the speaker.
Thus, the text2mel is used as a generator to generate the mel frequency spectrum corresponding to the voice training sample data, and the style code corresponding to the voice data in the voice training sample data is generated according to the StyleEncoder.
Step S203, carrying out self-adaptive example normalization processing on the layer normalization of the preset acoustic model, and injecting style codes into the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style codes.
Specifically, after the text2mel in the Transformer generates a corresponding mel spectrum according to the voice data, all Layer Normalization (Layer Normalization) in the text2mel of the Transformer structure may be processed by using an Adaptive Instance Normalization (AdaIN) processing method, and the style codes generated by the StyleEncoder according to the voice data may be injected into the text2mel, so that the generated mel spectrum carries the style information of the style codes, and a target acoustic model with the target mel spectrum is finally obtained.
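A minimal sketch of this injection step is shown below: each layer normalization of the generator is replaced by an adaptive instance normalization module whose scale and bias are predicted from the style code. The per-channel projection is an illustrative design choice, not necessarily the exact construction used in this application.

```python
import torch
from torch import nn

class AdaIN(nn.Module):
    """Layer-norm replacement: normalize the hidden features, then rescale and
    shift them with statistics predicted from the style code."""
    def __init__(self, d_model, style_dim):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)  # no fixed affine
        self.to_scale_bias = nn.Linear(style_dim, 2 * d_model)

    def forward(self, x, style):          # x: (B, T, d_model), style: (B, style_dim)
        scale, bias = self.to_scale_bias(style).chunk(2, dim=-1)
        return (1 + scale).unsqueeze(1) * self.norm(x) + bias.unsqueeze(1)
```

Swapping every layer normalization in text2mel for such a module lets the style information flow into the generated mel spectrum at every block.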
And S204, acquiring strange sample data, and inputting the strange sample data into the target acoustic model to output target voice data with style codes corresponding to the strange sample data.
In this embodiment, the strange sample data may be a small amount of text sample data that is not carried with sound data and is acquired by the target acoustic model in actual application. After the strange sample data is input into the target acoustic model, through testing, when the target acoustic model is used for converting characters into voice, corresponding style codes and target voice data can be generated according to a small amount of strange sample data, and adaptive learning and conversion of small sample data are achieved.
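At inference time, the flow might look like the following usage sketch, where `target_model`, `style_code`, `vocoder`, and `text_to_ids` are hypothetical handles standing for the trained target acoustic model, the learned style code, a vocoder, and a text front-end, respectively.

```python
# Hypothetical conversion of unseen ("strange") text sample data into speech.
tokens = text_to_ids("Welcome to the voice assistant.")   # text front-end (assumed)
mel_with_style = target_model(tokens, style_code)          # target mel spectrum carrying the style code
waveform = vocoder(mel_with_style)                          # vocoder renders the speech waveform
```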
In the embodiment of the invention, the extracted voice training sample data is used for training text2mel to obtain a corresponding mel frequency spectrum and style codes, then self-adaptive example normalization processing is carried out on layer normalization of the text2mel, the style codes are injected into the text2mel to obtain a target acoustic model comprising a target mel frequency spectrum, and the target mel frequency spectrum is provided with the style codes. In the learning process of the acoustic model, the data volume of the sound training sample data of the speaker is small, and the training complexity can be reduced during the self-adaptive example normalization processing; and the text2mel carries out feature learning according to the sampled voice training sample data, the style codes are injected into the text2mel, and the finally obtained target acoustic model is tested, so that when characters are converted into voice, the corresponding style codes and the target voice data can be generated according to a small amount of strange sample data, the adaptability learning and conversion capability of small sample data is high, and the individual requirements can be realized.
In some alternative implementations, as shown in fig. 3, fig. 3 is a flowchart of a specific embodiment of step 202 in fig. 2. The step 202 executed by the electronic device specifically includes the following steps:
step S2021, inputting the sound data in the sound training sample data into a preset acoustic model, and generating a mel frequency spectrum according to the sampling frequency of the sound data.
Specifically, the sound data in the sound training sample data of each speaker may be sequentially input into text2mel, and then converted according to the relationship between the frequency of the piece of sound data and the mel scale. In particular, the piece of sound data may be segmented into a plurality of short sound segments, and the frequencies of the segments may be the same or different. Conversion may then be performed for the frequency of each segment based on the relationship between frequency and the mel scale, thereby obtaining the mel frequency spectrum corresponding to the sound data. The relationship between frequency (f) and the mel scale (m) is given by the following formula (1):
m = 2595 * log10(1 + f/700)    (1)
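Formula (1) can be written directly in code; the example values below follow from the formula itself.

```python
import math

def hz_to_mel(f_hz):
    """Formula (1): m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# e.g. hz_to_mel(700.0) ~= 781.2 mel, hz_to_mel(8000.0) ~= 2840.0 mel
```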
step S2022, inputting the voice data in the voice training sample data into a preset style encoder, and generating a style code according to the sampling frequency and the sample precision of the voice data.
Specifically, audio coding mainly serves to compress the sound information. After the sound signal is digitized, its information volume is much larger than in the analog transmission state and cannot be transmitted directly in the way analog television sound is; therefore, a compression encoding process, namely audio encoding, needs to be applied to the sound. In this embodiment, the voice data may be input into the StyleEncoder, and the speaker identity, the pitch of the voice, and the prosody are encoded based on the sampling frequency and the sample precision of the voice data, so as to obtain a style code corresponding to each piece of voice data. The pitch and prosody of the voice differ from speaker to speaker. The encoding method used by the style encoder may include, but is not limited to, linear predictive coding, sub-band coding, and the like.
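One possible style encoder could be sketched as below, mapping a mel-feature sequence of an utterance to a single style vector; the convolution/GRU layout is an assumption made for illustration and is not claimed to be the encoder of this application.

```python
import torch
from torch import nn

class StyleEncoder(nn.Module):
    """Summarizes an utterance into one style code (speaker identity, tone, prosody)."""
    def __init__(self, n_mels=80, d_hidden=128, style_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(d_hidden, style_dim, batch_first=True)

    def forward(self, mel):                  # mel: (B, n_mels, T)
        h = self.conv(mel).transpose(1, 2)   # (B, T, d_hidden)
        _, last = self.gru(h)                # final hidden state summarizes the utterance
        return last.squeeze(0)               # (B, style_dim) style code
```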
In the embodiment of the invention, the voice data is input into text2mel, and a corresponding mel frequency spectrum can be generated according to the formula (1); and inputting the voice data into a StyleEncoder, and coding the identity of a speaker, the tone and the rhythm of the voice according to the sampling frequency and the sample precision of the voice data to obtain a style code corresponding to each voice data, so that the style code can be injected into text2mel when the adaptive instance normalization processing is carried out, and finally a target acoustic model is obtained.
In some alternative implementations, as shown in fig. 4, fig. 4 is a flowchart of a specific embodiment of step 203 in fig. 2. The step 203 executed by the electronic device specifically includes the following steps:
step S2031, a first parameter of the style encoding is calculated by adaptive instance normalization processing.
Specifically, performing the above adaptive instance normalization process may include performing mean and variance alignment processing on the style code. Since adaptive instance normalization (AdaIN) has the capability of learning the training mapping parameters, a first parameter may first be calculated for the input style code by AdaIN, where the first parameter includes the mean and the variance of the style code, and the mean and the variance are aligned to match the mean and the variance of the mel frequency spectrum.
Step S2032, a second parameter of the mel spectrum is calculated by the adaptive instance normalization process.
Specifically, since the adaptive instance normalization AdaIN has the ability to learn the training mapping parameters, the second parameter (variance and mean of the mel spectrum) can be adaptively calculated from the mel spectrum.
Step S2033, performing data matching based on the first parameter of the style code and the second parameter of the mel frequency spectrum, and outputting the target mel frequency spectrum with the style code.
Specifically, by combining the calculated variance and mean of the style code and the variance and mean of the mel-frequency spectrum, the result of inputting the style code and the mel-frequency spectrum into the adaptive instance normalization AdaIN for processing can be calculated, that is, the target mel-frequency spectrum with the style code is output, and the target acoustic model is obtained. In the text2mel learning process, the sampled voice training sample data of the user has small data volume, and the calculation complexity can be reduced in the adaptive instance normalization AdaIN.
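The mean/variance alignment of steps S2031 to S2033 follows the standard adaptive instance normalization formulation; a NumPy sketch is given below, treating the mel features and the style features as plain arrays purely for illustration.

```python
import numpy as np

def adain_align(mel, style, eps=1e-5):
    """Normalize the mel features by their own mean/variance, then rescale and
    shift them with the mean/variance computed from the style code."""
    mel_mean, mel_std = mel.mean(axis=-1, keepdims=True), mel.std(axis=-1, keepdims=True)
    sty_mean, sty_std = style.mean(axis=-1, keepdims=True), style.std(axis=-1, keepdims=True)
    return sty_std * (mel - mel_mean) / (mel_std + eps) + sty_mean
```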
In some alternative implementations, as shown in fig. 5, fig. 5 is a flowchart of a specific embodiment mode after step 203 in fig. 2. After step 203, the electronic device may be further configured to perform the following steps:
step S205, sampling the text request sample data from the full data set, inputting the text request sample data into the target acoustic model for conversion detection, and judging whether to output detection data corresponding to the text request sample data.
Specifically, the text request sample data may be directly sampled from the full data set, or may be sampled from the sub data set of the speaker. The text request sample data is text sample data which does not include sound data, and can be used for performing conversion detection on the target acoustic model and judging whether the text-to-speech function can be realized. The text request sample data and the style code are input into the target acoustic model; if the target acoustic model has the text-to-speech function, a target mel frequency spectrum M' (detection data) with the style code is generated; if the target mel frequency spectrum M' with the style code is not generated or is generated incorrectly, it may indicate that the conversion has failed.
More specifically, in order to determine whether the target acoustic model generates the target mel frequency spectrum M' corresponding to the text request sample data, the detection judgment may be performed by providing at least one discriminator. The data volume of the output detection data is consistent with that of the collected text request sample data, and each piece of text request sample data correspondingly outputs one target mel frequency spectrum M'. When one piece of text request sample data is collected, a corresponding discriminator can be set; when there are multiple pieces of text request sample data, a corresponding number of discriminators may be set. By detecting the target acoustic model in advance, debugging can be performed before practical application, so that the target acoustic model can be refined, which facilitates better practical application.
In the embodiment of the application, in order to judge whether the target acoustic model can complete the voice conversion, conversion detection is performed by inputting the text request sample data into the target acoustic model, and a plurality of discriminators are provided for judgment. Therefore, by detecting the target acoustic model in advance, debugging can be performed before the target acoustic model is put into practical use, the target acoustic model can be further refined, and better practical application is facilitated.
In some alternative implementations, as shown in fig. 6, fig. 6 is a flowchart of a specific embodiment of step 205 in fig. 5. The step 205 executed by the electronic device specifically includes the following steps:
step S2051, determining whether the target mel frequency spectrum contains a style code by a preset style discriminator.
Specifically, the style discriminator (style discriminator) may be a discriminator for identifying whether the target mel spectrum M' carries a style code. When the target mel frequency spectrum M' is generated, the target mel frequency spectrum M' is further input into the style discriminator for recognition.
Step S2052, determining whether the target mel spectrum is aligned to a phoneme corresponding to the input text request sample data by using a preset phoneme discriminator.
Specifically, the above-mentioned phoneme discriminator (phoneme discriminator) may be a discriminator for judging whether the target mel spectrum M' is aligned with the phonemes of the input text request sample data. After the target mel frequency spectrum M' is generated, the target mel frequency spectrum M' can be further input into the phoneme discriminator for judgment. When the style discriminator judges that the target mel frequency spectrum M' has the style code and the target mel frequency spectrum M' is aligned with the phonemes of the input text request sample data, it can be concluded that the generated target acoustic model can accurately realize the text-to-speech function.
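For illustration, the two checks could be combined as in the sketch below; `style_discriminator` and `phoneme_discriminator` are assumed to return scores, and the threshold value is arbitrary.

```python
def check_conversion(mel_pred, style_code, phonemes,
                     style_discriminator, phoneme_discriminator, threshold=0.5):
    """The conversion passes only if the generated target mel spectrum M' both
    carries the style code and is aligned with the phonemes of the input text."""
    has_style = style_discriminator(mel_pred, style_code) > threshold
    is_aligned = phoneme_discriminator(mel_pred, phonemes) > threshold
    return bool(has_style and is_aligned)
```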
In the embodiment of the present application, two discriminators (a style discriminator and a phoneme discriminator) are provided to respectively determine whether the target mel spectrum M' has a style code, and determine whether the target mel spectrum M' is aligned with the phonemes of the input text request sample data, so as to detect the text-to-speech function of the generated target acoustic model. Therefore, model optimization and the like can be performed in time according to the recognition accuracy of the target acoustic model.
It should be emphasized that, in order to further ensure the privacy and security of the information such as the voice training sample data, mel frequency spectrum, style code, target mel frequency spectrum, strange sample data, target voice data and the like involved in the voice conversion process, the information such as the voice training sample data, mel frequency spectrum, style code, target mel frequency spectrum, strange sample data, target voice data and the like involved in the voice conversion process may also be stored in a node of a block chain.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like. The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not performed in the exact order shown and may be performed in other orders unless otherwise indicated herein. Moreover, at least a portion of the steps in the flow chart of the figure may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed alternately or alternately with other steps or at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 7, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an adaptive text-to-speech apparatus based on meta learning, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 7, the adaptive text-to-speech apparatus 700 based on meta learning of the present embodiment includes: a first training module 701, a second training module 702, a normalization processing module 703, and a conversion module 704. Wherein:
the first training module 701 is configured to perform model pre-training based on a full-scale data set of a speaker, and use model parameters obtained by the pre-training as initial values of a preset acoustic model.
The second training module 702 is configured to sample voice training sample data from the full-size data set, perform feature training on the voice training sample data through a preset acoustic model to generate a mel spectrum, and generate a style code through a preset style encoder.
The normalization processing module 703 is configured to perform adaptive instance normalization processing on layer normalization of a preset acoustic model, and inject style codes into the preset acoustic model to obtain a target acoustic model including a target mel spectrum, where the target mel spectrum has style codes.
The conversion module 704 is used for acquiring strange sample data and inputting the strange sample data into the target acoustic model to output target voice data with style codes corresponding to the strange sample data.
Specifically, the full-volume data set may be a data set of a plurality of speakers collected in advance, and the data volume in the full-volume data set is large and may be thousands of data volumes. In the full-volume data set, the data of each speaker may include a plurality of pieces of sound data and text data corresponding to the sound data, and the plurality of pieces of sound data are different. In the full data set, the sound data of each speaker is in one-to-one correspondence with the speaker, and the correspondence can be represented by identification.
The model pre-training based on the full-scale data set of the speaker may refer to training performed before formal training in order to provide an initial model parameter for the text2mel, where the model parameter obtained by the first training module 701 through pre-training may be used as an initial value of the text2 mel. In the deep learning neural network, the training process is based on a gradient descent method to carry out parameter optimization, and a minimum loss function and an optimal model weight are obtained through step-by-step iteration. When gradient descent is performed, each parameter in the model needs to be assigned an initial value. The pre-training may be to accelerate the feature learning speed of the subsequent acoustic model and to improve the efficiency of the model. Wherein, text2mel can be an acoustic model based on transformer. Therefore, when the text2mel is subjected to subsequent training based on the initial value provided by the pre-training, the convergence speed of gradient reduction in the acoustic model training can be increased, and a target acoustic model with low error can be obtained more possibly.
Specifically, in the full-volume data set, each speaker corresponds to a subset of data, and because the data volume is large, the subset of data of some speakers may be extracted in a random extraction manner, for example: subset data of 50 speakers were selected from the 5000 full volume data sets for feature training of the acoustic model. The extraction method may also be to extract the subset data of the preceding/following vocalizers in order, and of course, other extraction methods may also be included, for example: interval decimation, and the like.
More specifically, the voice training sample data may be directly sampled from the full-size data set, or may be sampled from the subset data. The voice training sample data may include voice data of a speaker and text data corresponding to the voice data, and may also be referred to as a support set (Xs, ts) in this embodiment, where Xs represents the voice data of the speaker and ts represents the text data corresponding to the voice data. The second training module 702 performs feature training on the voice training sample data to realize conversion from text to speech by the acoustic model.
The text2mel can be used as a generator, and the frequencies of the sound in the sound data of the sound training sample data are used to generate a corresponding mel frequency spectrum. The mel frequency spectrum is obtained by a mathematical operation that converts the frequencies in the sound data to the mel scale: a spectrogram of the signal is computed using a short-time Fourier transform and then dot-multiplied with a mel filter bank, giving the spectrum on the mel scale. Each filter in the mel filter bank is a triangular filter.
The StyleEncoder may encode according to the voice data of the speaker, and generate a style code corresponding to the voice data, where the style code may include identity information, tone, and prosody of the speaker.
Specifically, after the text2mel in the above-mentioned transformer generates the corresponding mel spectrum according to the voice data, the Normalization processing module 703 may process all Layer normalizations in the text2mel of the transformer structure by using the adaptive instance Normalization AdaIN processing method. And injecting style codes generated by the StyleEncoder according to the sound data into the text2mel so that the generated mel frequency spectrum has style information of the style codes, and finally obtaining a target acoustic model with a target mel frequency spectrum.
After the target acoustic model is obtained, strange sample data can be obtained, and the strange sample data can be a small amount of character sample data which is obtained by the target acoustic model in actual application and does not carry sound data. After the strange sample data is input into the target acoustic model, the conversion module 704 may finally output target voice data corresponding to the strange sample data, and the target voice data further includes style codes corresponding to the strange sample data.
In the embodiment of the invention, the extracted voice training sample data is used for training text2mel to obtain a corresponding mel frequency spectrum and style codes, then self-adaptive example normalization processing is carried out on layer normalization of the text2mel, the style codes are injected into the text2mel to obtain a target acoustic model comprising a target mel frequency spectrum, and the target mel frequency spectrum is provided with the style codes. In the learning process of the acoustic model, the data volume of the sound training sample data of the speaker is small, and the training complexity can be reduced during the self-adaptive example normalization processing; and the text2mel carries out feature learning according to the sampled voice training sample data, the style codes are injected into the text2mel, and the finally obtained target acoustic model is tested, so that when characters are converted into voice, the corresponding style codes and the target voice data can be generated according to a small amount of strange sample data, the adaptability learning and conversion capability of small sample data is high, and the individual requirements can be realized.
Referring to fig. 8, which is a schematic structural diagram of an embodiment of the second training module, the second training module 702 includes a first generating sub-module 7021 and a second generating sub-module 7022. Wherein:
the first generating sub-module 7021 is configured to input the sound data in the sound training sample data into a preset acoustic model, and generate a mel spectrum according to a sampling frequency of the sound data.
The second generating sub-module 7022 is configured to input the voice data in the voice training sample data into a preset style encoder, and generate a style code according to the sampling frequency and the sample precision of the voice data.
In the embodiment of the present invention, the first generating sub-module 7021 may generate a corresponding mel frequency spectrum according to the above formula (1) by inputting the sound data into text2mel; and the second generating sub-module 7022 inputs the voice data into the StyleEncoder, and encodes the identity of the speaker, the pitch of the voice, and the prosody according to the sampling frequency and the sample precision of the voice data to obtain a style code corresponding to each piece of voice data, so that the style code can be injected into text2mel when performing adaptive instance normalization processing, and a target acoustic model is finally obtained.
Referring to fig. 9, which is a schematic structural diagram of a specific embodiment of the normalization processing module, the normalization processing module 703 includes a first computation submodule 7031, a second computation submodule 7032, and a third computation submodule 7033. Wherein:
the first computing submodule 7031 is configured to compute a first parameter of the style encoding by an adaptive instance normalization process.
The second calculating submodule 7032 is used to calculate a second parameter of the mel frequency spectrum by means of an adaptive instance normalization process.
The third calculation sub-module 7033 is configured to perform data matching based on the first parameter of the style code and the second parameter of the mel frequency spectrum, and output the target mel frequency spectrum with the style code.
In the embodiment of the present application, since the adaptive instance normalization AdaIN has the capability of learning and training mapping parameters, in this implementation, the first computing sub-module 7031 computes the mean value and the variance of the style coding for the input style coding through the adaptive instance normalization AdaIN, and computes the variance and the mean value of the mel frequency spectrum through the second computing sub-module 7032 in an adaptive manner according to the mel frequency spectrum, and then the third computing sub-module 7033 implements matching with the mean value and the variance of the mel frequency spectrum through alignment processing of the mean value and the variance of the style coding, and can output the target mel frequency spectrum with the style coding to obtain the target acoustic model. In the text2mel learning process, the sampled voice training sample data of the user has small data volume, and the calculation complexity can be reduced in the adaptive instance normalization AdaIN.
In some optional implementations of this embodiment, referring to fig. 10, the apparatus 700 further includes: the determining module 705 is configured to sample text request sample data from the full data set, input the text request sample data into the target acoustic model for conversion detection, and determine whether to output detection data corresponding to the text request sample data.
In this embodiment of the application, in order to determine whether the target acoustic model can complete the voice conversion, the determination module 705 inputs the text request sample data into the target acoustic model for conversion detection, and provides a plurality of discriminators for judgment. Therefore, by detecting the target acoustic model in advance, faults can be eliminated before the target acoustic model is put into practice, the target acoustic model can be further refined, and it is easier to put the target acoustic model into better practical use.
Referring to fig. 11, as a structural schematic diagram of an embodiment of the determining module, the determining module 705 includes a first determining sub-module 7051 and a second determining sub-module 7052. Wherein:
the first judging submodule 7051 is configured to judge whether the target mel frequency spectrum includes a style code by using a preset style discriminator.
The second judging sub-module 7052 is configured to judge, by using a preset phoneme discriminator, whether a target mel frequency spectrum is aligned with a phoneme corresponding to the input text request sample data.
In this embodiment, the style discriminator provided by the first determining sub-module 7051 is used to determine whether the target mel frequency spectrum M' has a style code, and the phoneme discriminator provided by the second determining sub-module 7052 is used to determine whether the target mel frequency spectrum M' is aligned with the phonemes of the input text request sample data, so as to detect the text-to-speech function of the generated target acoustic model. Therefore, model optimization and the like can be performed in time according to the recognition accuracy of the target acoustic model.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 12, fig. 12 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 120 includes a memory 121, a processor 122, and a network interface 123 communicatively connected to each other via a system bus. It is noted that only a computer device 120 having components 121 to 123 is shown, but it should be understood that not all of the shown components are required and that more or fewer components may alternatively be implemented. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device may be a desktop computer, a notebook, a palm computer, a cloud server, or other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 121 includes at least one type of readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Programmable Read-Only Memory (PROM), a magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 121 may be an internal storage unit of the computer device 120, such as a hard disk or an internal memory of the computer device 120. In other embodiments, the memory 121 may also be an external storage device of the computer device 120, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card equipped on the computer device 120. Of course, the memory 121 may also include both an internal storage unit and an external storage device of the computer device 120. In this embodiment, the memory 121 is generally used for storing the operating system and various application software installed on the computer device 120, such as the computer readable instructions of the meta-learning based adaptive text-to-speech method. Further, the memory 121 may also be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 122 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 122 is generally used to control the overall operation of the computer device 120. In this embodiment, the processor 122 is configured to execute the computer readable instructions stored in the memory 121 or to process data, for example to execute the computer readable instructions of the meta-learning based adaptive text-to-speech method.
Network interface 123 may include a wireless network interface or a wired network interface, with network interface 123 typically being used to establish communication connections between computer device 120 and other electronic devices.
According to the method and the device of the present application, the extracted voice training sample data are used to train text2mel to obtain the corresponding mel frequency spectrum and style code; adaptive instance normalization is then applied to the layer normalization of text2mel, and the style code is injected into text2mel to obtain a target acoustic model that includes a target mel frequency spectrum carrying the style code. In the learning process of the acoustic model, the amount of voice training sample data per speaker is small, so the adaptive instance normalization reduces the training complexity. Because text2mel performs feature learning on the sampled voice training sample data, the style code is injected into text2mel, and the resulting target acoustic model is tested, the corresponding style code and target voice data can be generated from a small amount of strange sample data when text is converted into speech; the adaptive learning and conversion capability for small sample data is therefore high, and personalized requirements can be met.
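As one possible reading of the adaptive instance normalization step, the sketch below (Python/PyTorch assumed) shows a layer normalization whose scale and shift are predicted from the style code rather than learned as fixed parameters; the class name, dimensions, and parameter names are illustrative and are not taken from the patent.

```python
import torch
import torch.nn as nn

class StyleAdaptiveLayerNorm(nn.Module):
    """Layer normalisation whose affine parameters come from the style code (illustrative)."""
    def __init__(self, hidden_dim, style_dim):
        super().__init__()
        # Layer norm without its own learnable scale/shift ...
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # ... which are instead predicted from the style code.
        self.to_gamma = nn.Linear(style_dim, hidden_dim)
        self.to_beta = nn.Linear(style_dim, hidden_dim)

    def forward(self, hidden, style_code):
        # hidden: (batch, frames, hidden_dim); style_code: (batch, style_dim)
        gamma = self.to_gamma(style_code).unsqueeze(1)     # (batch, 1, hidden_dim)
        beta = self.to_beta(style_code).unsqueeze(1)
        return gamma * self.norm(hidden) + beta

# Usage: each layer-norm inside the text2mel decoder could be swapped for this
# module, so the generated target mel frequency spectrum carries the style code.
layer = StyleAdaptiveLayerNorm(hidden_dim=256, style_dim=128)
hidden = torch.randn(2, 100, 256)        # decoder hidden states
style = torch.randn(2, 128)              # style codes from the style encoder
out = layer(hidden, style)               # (2, 100, 256)
```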
The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the meta learning based adaptive text-to-speech method as described above.
According to the embodiments of the present application, text2mel is trained with the extracted voice training sample data to obtain the corresponding mel frequency spectrum and style code; adaptive instance normalization is then applied to the layer normalization of text2mel, and the style code is injected into text2mel to obtain a target acoustic model that includes a target mel frequency spectrum carrying the style code. In the learning process of the acoustic model, the amount of voice training sample data per speaker is small, so the adaptive instance normalization reduces the training complexity. Because text2mel performs feature learning on the sampled voice training sample data, the style code is injected into text2mel, and the resulting target acoustic model is tested, the corresponding style code and target voice data can be generated from a small amount of strange sample data when text is converted into speech; the adaptive learning and conversion capability for small sample data is therefore high, and personalized requirements can be met.
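The inference path for strange (unseen) sample data could look like the following sketch, assuming a trained style encoder, target acoustic model, and vocoder are available; every object name here is a placeholder introduced for illustration rather than an interface defined by the patent.

```python
import torch

@torch.no_grad()
def synthesize_for_unseen_speaker(text_phonemes, reference_mels,
                                  style_encoder, target_acoustic_model, vocoder):
    """Few-shot conversion: a handful of mel frames from the unseen speaker
    yields a style code, which then conditions the text-to-mel generation."""
    # 1. Derive a style code from the small amount of strange sample data.
    style_code = style_encoder(reference_mels).mean(dim=0, keepdim=True)
    # 2. Generate the target mel frequency spectrum carrying that style code.
    target_mel = target_acoustic_model(text_phonemes, style_code)
    # 3. Render the waveform with the vocoder to obtain the target voice data.
    return vocoder(target_mel)
```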
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and certainly can also be implemented by hardware, although in many cases the former is the better implementation. Based on such understanding, the technical solution of the present application may be embodied, in essence or in the part contributing to the prior art, in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and including several instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the adaptive text-to-speech method based on meta learning according to the embodiments of the present application.
It should be understood that the above-described embodiments are merely some, and not all, embodiments of the present application, and that the drawings show preferred embodiments without limiting the scope of the appended claims. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the application will be thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their features. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.

Claims (10)

1. A meta-learning based adaptive text-to-speech method, characterized by comprising the following steps:
performing model pre-training based on a full data set of a speaker, and taking model parameters obtained by pre-training as initial values of a preset acoustic model;
sampling voice training sample data from the full data set, performing feature training according to the voice training sample data through the preset acoustic model to generate a mel frequency spectrum, and generating a style code through a preset style encoder;
performing adaptive instance normalization processing on the layer normalization of the preset acoustic model, and injecting the style code into the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style code;
and acquiring strange sample data, and inputting the strange sample data into the target acoustic model to output target voice data with the style code corresponding to the strange sample data.
2. The meta-learning based adaptive text-to-speech method according to claim 1, wherein the step of performing feature training according to the voice training sample data through the preset acoustic model to generate mel frequency spectrum, and the step of generating style code through a preset style encoder specifically comprises:
inputting the voice data in the voice training sample data into the preset acoustic model, and generating the mel frequency spectrum according to the sampling frequency of the voice data;
and inputting the voice data in the voice training sample data into the preset style encoder, and generating the style code according to the sampling frequency and the sample precision of the voice data.
3. The meta-learning based adaptive text-to-speech method according to claim 1, wherein the step of performing adaptive instance normalization on the layer normalization of the preset acoustic model, and injecting the style code into the preset acoustic model to obtain a target acoustic model including a target mel spectrum specifically includes:
calculating a first parameter of the style encoding by the adaptive instance normalization process;
calculating a second parameter of the mel frequency spectrum through the adaptive instance normalization processing;
and performing data matching based on the first parameter of the style code and the second parameter of the mel frequency spectrum, and outputting the target mel frequency spectrum with the style code.
4. The meta-learning based adaptive text-to-speech method according to claim 1, further comprising, after the step of performing adaptive instance normalization on the layer normalization of the preset acoustic model and injecting the style code into the preset acoustic model, the steps of:
sampling character request sample data from the full data set, inputting the character request sample data into the target acoustic model for conversion detection, and judging whether to output detection data corresponding to the character request sample data.
5. The meta-learning based adaptive text-to-speech method according to claim 4, wherein the step of inputting the text request sample data into the target acoustic model for conversion detection and determining whether to output detection data corresponding to the text request sample data comprises:
judging whether the target mel frequency spectrum contains the style code or not through a preset style discriminator;
and judging whether the target mel frequency spectrum is aligned with a phoneme corresponding to the input text request sample data or not through a preset phoneme discriminator.
6. An adaptive text-to-speech device based on meta-learning, comprising:
the first training module is used for model pre-training based on the full data set of the speaker, and taking model parameters obtained by pre-training as initial values of a preset acoustic model;
the second training module is used for sampling voice training sample data from the full data set, performing feature training according to the voice training sample data through the preset acoustic model to generate a mel frequency spectrum, and generating a style code through a preset style encoder;
the normalization processing module is used for carrying out self-adaptive example normalization processing on the layer normalization of the preset acoustic model and injecting the style code into the preset acoustic model to obtain a target acoustic model comprising a target mel frequency spectrum, wherein the target mel frequency spectrum is provided with the style code;
and the conversion module is used for acquiring strange sample data and inputting the strange sample data into the target acoustic model so as to output target voice data with the style code corresponding to the strange sample data.
7. The meta-learning based adaptive text-to-speech apparatus according to claim 6, wherein the second training module comprises:
the first generation submodule is used for inputting the voice data in the voice training sample data into the preset acoustic model and generating the mel frequency spectrum according to the sampling frequency of the voice data;
and the second generation submodule is used for inputting the voice data in the voice training sample data into the preset style encoder and generating the style code according to the sampling frequency and the sample precision of the voice data.
8. The meta-learning based adaptive text-to-speech apparatus according to claim 6, wherein the normalization processing module comprises:
a first calculation submodule, configured to calculate a first parameter of the style encoding through the adaptive instance normalization processing;
a second calculation submodule, configured to calculate a second parameter of the mel spectrum through the adaptive instance normalization processing;
and the third calculation submodule is used for outputting the target mel frequency spectrum with the style code according to the first parameter of the style code and the second parameter of the mel frequency spectrum.
9. A computer device, comprising a memory in which computer readable instructions are stored and a processor, wherein the processor, when executing the computer readable instructions, implements the steps of the meta learning based adaptive text-to-speech method according to any one of claims 1 to 5.
10. A computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of the meta learning based adaptive text to speech method according to any of claims 1 to 5.
CN202210591183.3A 2022-05-27 2022-05-27 Self-adaptive character-to-speech method based on meta learning and related equipment thereof Pending CN114999442A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210591183.3A CN114999442A (en) 2022-05-27 2022-05-27 Self-adaptive character-to-speech method based on meta learning and related equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210591183.3A CN114999442A (en) 2022-05-27 2022-05-27 Self-adaptive character-to-speech method based on meta learning and related equipment thereof

Publications (1)

Publication Number Publication Date
CN114999442A true CN114999442A (en) 2022-09-02

Family

ID=83029112

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210591183.3A Pending CN114999442A (en) 2022-05-27 2022-05-27 Self-adaptive character-to-speech method based on meta learning and related equipment thereof

Country Status (1)

Country Link
CN (1) CN114999442A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116825081A (en) * 2023-08-25 2023-09-29 摩尔线程智能科技(北京)有限责任公司 Speech synthesis method, device and storage medium based on small sample learning
CN116825081B (en) * 2023-08-25 2023-11-21 摩尔线程智能科技(北京)有限责任公司 Speech synthesis method, device and storage medium based on small sample learning

Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
CN109473106B (en) Voiceprint sample collection method, voiceprint sample collection device, voiceprint sample collection computer equipment and storage medium
CN111583900B (en) Song synthesis method and device, readable medium and electronic equipment
CN111899719A (en) Method, apparatus, device and medium for generating audio
WO2022156544A1 (en) Speech synthesis method and apparatus, and readable medium and electronic device
CN110706690A (en) Speech recognition method and device
CN108428446A (en) Audio recognition method and device
CN112466314A (en) Emotion voice data conversion method and device, computer equipment and storage medium
WO2022156464A1 (en) Speech synthesis method and apparatus, readable medium, and electronic device
WO2022151930A1 (en) Speech synthesis method and apparatus, synthesis model training method and apparatus, and medium and device
CN112489621B (en) Speech synthesis method, device, readable medium and electronic equipment
CN111369967A (en) Virtual character-based voice synthesis method, device, medium and equipment
US11132996B2 (en) Method and apparatus for outputting information
CN112967725A (en) Voice conversation data processing method and device, computer equipment and storage medium
US20230127787A1 (en) Method and apparatus for converting voice timbre, method and apparatus for training model, device and medium
CN113314150A (en) Emotion recognition method and device based on voice data and storage medium
CN114255740A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN112786013A (en) Voice synthesis method and device based on album, readable medium and electronic equipment
CN112634919A (en) Voice conversion method and device, computer equipment and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114999442A (en) Self-adaptive character-to-speech method based on meta learning and related equipment thereof
CN113421584A (en) Audio noise reduction method and device, computer equipment and storage medium
CN113421554B (en) Voice keyword detection model processing method and device and computer equipment
Wu et al. Non-parallel voice conversion system with WaveNet vocoder and collapsed speech suppression
CN113035230B (en) Authentication model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination