CN114898734A - Pre-training method and device based on speech synthesis model and electronic equipment - Google Patents

Pre-training method and device based on speech synthesis model and electronic equipment

Info

Publication number
CN114898734A
CN114898734A
Authority
CN
China
Prior art keywords
voice
sample data
text
matrix
synthesis model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210552552.8A
Other languages
Chinese (zh)
Inventor
樊晓然
郑人杰
陈俊坤
朱鹏飞
庞超
王硕寰
原湉
李昕同
孙宇
黄亮
陈泽裕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210552552.8A priority Critical patent/CN114898734A/en
Publication of CN114898734A publication Critical patent/CN114898734A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a pre-training method and apparatus based on a speech synthesis model, and an electronic device, and particularly relates to the technical field of artificial intelligence such as natural language processing, deep learning and speech technology. The scheme is as follows: acquiring voice sample data and text sample data corresponding to the voice sample data; performing feature extraction on the voice sample data to generate sample voice features; generating voice matrix features according to the sample voice features and a cross-language phoneme table; and performing joint mask learning according to the voice matrix features and the text sample data so as to pre-train the speech synthesis model. The speech synthesis model is thus pre-trained through joint mask learning over the voice matrix features and the text sample data; that is, joint training of voice features and text features is fully considered during pre-training, so that the resulting speech synthesis model is more accurate and reliable, providing conditions for improving speech synthesis quality.

Description

Pre-training method and device based on speech synthesis model and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, in particular to artificial intelligence technologies such as natural language processing, deep learning and speech technology, and more particularly to a pre-training method and apparatus based on a speech synthesis model, and an electronic device.
Background
With the development of computer technology, speech, as an important carrier through which people obtain information, has been widely used in daily life and work. Mainstream speech models have achieved remarkable improvements on many speech-understanding tasks, such as speech recognition, speech classification and speech-to-text translation. Generating high-quality speech, however, remains challenging for speech synthesis.
In the related art, a speech synthesis model can usually handle only a single language or a single type of speech synthesis task, and cross-language speech synthesis usually requires introducing prior knowledge, which may result in low speech synthesis quality. How to pre-train a speech synthesis model so as to improve speech synthesis quality is therefore very important.
Disclosure of Invention
The disclosure provides a pre-training method and device based on a speech synthesis model, electronic equipment and a storage medium.
In one aspect of the present disclosure, a pre-training method based on a speech synthesis model is provided, including:
acquiring voice sample data and text sample data corresponding to the voice sample data;
performing feature extraction on the voice sample data to generate sample voice features;
generating a voice matrix characteristic according to the sample voice characteristic and the cross-language phoneme table;
and performing joint mask learning according to the voice matrix characteristics and the text sample data so as to pre-train a voice synthesis model.
In another aspect of the present disclosure, a pre-training apparatus based on a speech synthesis model is provided, including:
the acquisition module is used for acquiring voice sample data and text sample data corresponding to the voice sample data;
the extraction module is used for extracting the characteristics of the voice sample data to generate sample voice characteristics;
the first generation module is used for generating a voice matrix characteristic according to the sample voice characteristic and the cross-language phoneme table;
and the processing module is used for performing joint mask learning according to the voice matrix characteristics and the text sample data so as to pre-train a voice synthesis model.
In another aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for pre-training based on a speech synthesis model as described in an embodiment of the above aspect.
In another aspect of the present disclosure, a non-transitory computer-readable storage medium storing thereon a computer program for causing a computer to execute a pre-training method based on a speech synthesis model according to an embodiment of the above aspect is provided.
In another aspect of the present disclosure, a computer program product is provided, which includes a computer program, and when executed by a processor, the computer program implements the pre-training method based on the speech synthesis model according to the embodiment of the above aspect.
According to the pre-training method and apparatus based on the speech synthesis model, the electronic device and the storage medium, voice sample data and corresponding text sample data may first be acquired; feature extraction may then be performed on the voice sample data to generate sample voice features; voice matrix features may then be generated according to the sample voice features and the cross-language phoneme table; and joint mask learning may then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, sample voice features are generated by feature extraction on the voice sample data, voice matrix features are generated using the sample voice features and the cross-language phoneme table, and the speech synthesis model is pre-trained through joint mask learning over the voice matrix features and the text sample data. That is, joint training of voice features and text features is fully considered during pre-training, so that the resulting speech synthesis model is more accurate and reliable, providing conditions for improving speech synthesis quality.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure;
fig. 1A is a schematic diagram illustrating a mapping relationship between phonemes in a first language and phonemes in a second language according to an embodiment of the disclosure;
FIG. 2 is a schematic flow chart illustrating a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure;
fig. 2A is a schematic view of an application scenario of a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure;
FIG. 2B is a schematic diagram of a pre-training process based on a speech synthesis model according to an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart illustrating a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing a speech synthesis model based pre-training method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning); it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing and knowledge graph technologies.
Natural language processing is the computer processing, understanding and use of human languages (such as Chinese and English). It is a cross-discipline between computer science and linguistics, often referred to as computational linguistics. Natural language is the fundamental mark that distinguishes humans from other animals, and without language there would be no human thought as we know it; natural language processing therefore embodies the highest task of artificial intelligence, in that a machine achieves real intelligence only when the computer has the capability to process natural language.
Deep learning refers to multi-layered artificial neural networks and the methods for training them. A layer of such a network takes a large number of matrix values as input, weights them through a nonlinear activation function, and generates another data set as output. With an appropriate number of matrices, multiple layers linked together form a neural network that can carry out accurate and complex processing, much as a person recognizes and labels pictures of objects.
Speech technology refers to key technologies in the computer field such as automatic speech recognition (ASR) and speech synthesis (TTS). Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, in which voice is expected to become one of the most promising interaction modes, having advantages over other modes.
A speech synthesis model-based pre-training method, apparatus, electronic device, and storage medium according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
The pre-training method based on the speech synthesis model according to the embodiments of the present disclosure may be performed by a pre-training apparatus based on the speech synthesis model according to the embodiments of the present disclosure, and the apparatus may be configured in an electronic device.
Fig. 1 is a schematic flowchart of a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure.
As shown in fig. 1, the pre-training method based on the speech synthesis model may include the following steps:
step 101, acquiring voice sample data and text sample data corresponding to the voice sample data.
The voice sample data may be any type of audio data, such as Chinese audio data, English audio data, and the like, which is not limited in this disclosure.
In addition, the text sample data may be text data corresponding to the voice sample data. For example, if the voice sample data is Chinese audio data, the corresponding text sample data may be Chinese text data; or, if the voice sample data is English audio data, the corresponding text sample data may be English text data, and the like, which is not limited in this disclosure.
It is to be understood that the voice sample data may be generated by processing audio data in any audio, video, etc., and the present disclosure is not limited thereto.
In addition, after the text sample data corresponding to the voice sample data is acquired, the text sample data may be further processed to acquire phoneme data and the like corresponding to the text sample data, which is not limited in this disclosure.
Optionally, the text sample data corresponding to the voice sample data may be generated from an open-source Chinese data set, English data set, and the like, which is not limited in this disclosure.
Step 102, performing feature extraction on the voice sample data to generate sample voice features.
The sample speech features may be understood as features that may reflect the speech sample data, such as a corresponding audio matrix, or a position number corresponding to each speech sample, or other types of speech features, and so on. The present disclosure is not limited thereto.
In addition, there are various methods for extracting features from voice sample data. For example, feature extraction may be performed on the voice sample data using MFCC (Mel-frequency cepstral coefficients) to generate sample voice features; or, a filter-bank (fbank) approach may be adopted to perform feature extraction on the voice sample data to generate sample voice features, and the like, which is not limited in this disclosure.
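For illustration only, the following is a minimal sketch of both extraction options using the open-source librosa library; the 16 kHz sample rate, frame settings and feature sizes are assumptions chosen for the example, not values fixed by this disclosure.

```python
# A minimal sketch (not from the disclosure) of the two feature-extraction
# options mentioned above. Sample rate and frame settings are assumptions.
import librosa
import numpy as np

def extract_sample_voice_features(wav_path: str, method: str = "fbank") -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)  # mono audio at an assumed 16 kHz
    if method == "mfcc":
        # Mel-frequency cepstral coefficients: (n_mfcc, frames)
        feats = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                     n_fft=400, hop_length=160)
    else:
        # filter-bank (fbank) features: log Mel energies, (n_mels, frames)
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80,
                                             n_fft=400, hop_length=160)
        feats = librosa.power_to_db(mel)
    return feats.T  # (frames, feature_dim), used as the sample voice features
```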
Step 103, generating voice matrix features according to the sample voice features and the cross-language phoneme table.
The cross-language phoneme table may be a phoneme table generated in advance, and may include a mapping relationship between two languages, for example, a correspondence between Chinese phonemes and English phonemes, and the like.
Optionally, the cross-language phoneme table may be traversed according to text sample data corresponding to the sample speech feature to determine one or more cross-language phonemes and the like corresponding to the sample speech feature, which is not limited in this disclosure.
It will be appreciated that the sample speech features and the cross-language phonemes can be spliced and fused to generate speech matrix features. For example, the sample speech features and the generated cross-language phonemes may be spliced and fused in an alignment manner to generate speech matrix features, and the like; alternatively, the sample speech feature and the generated cross-language phoneme at the same time may be aligned and spliced up and down, and the like, which is not limited in this disclosure.
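Since the disclosure leaves the exact fusion manner open, the following sketch shows just one way such alignment-based splicing could look; the frame-level alignment of phonemes to voice features and the phoneme embedding table are assumptions made for illustration.

```python
# Illustrative sketch of alignment-based splicing; the frame-to-phoneme
# alignment and the embedding table are assumptions, not the disclosed method.
import numpy as np

def splice_speech_matrix(frame_feats: np.ndarray,
                         phoneme_ids: np.ndarray,
                         phoneme_embeddings: np.ndarray) -> np.ndarray:
    # frame_feats: (T, D) sample voice features
    # phoneme_ids: (T,) cross-language phoneme aligned to each frame (assumed)
    # phoneme_embeddings: (V, E) one row per cross-language phoneme
    phone_feats = phoneme_embeddings[phoneme_ids]               # (T, E)
    # splice the time-aligned representations frame by frame
    return np.concatenate([frame_feats, phone_feats], axis=-1)  # (T, D + E)
```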
For example, if the voice sample data is Chinese and the cross-language phoneme table includes a mapping relationship between Chinese phonemes and English phonemes, a phoneme generated based on the cross-language phoneme table may be Chinese or English, which is not limited in this disclosure.
Therefore, in the embodiment of the disclosure, conversion among phonemes of multiple languages can be realized based on the cross-language phoneme table, providing conditions for improving the efficiency and accuracy of speech synthesis.
It will be appreciated that the cross-language phone table may be generated in advance and then used directly during the pre-training process.
Alternatively, the first language phoneme and the second language phoneme may be obtained first, and then the mapping relationship between the first language phoneme and the second language phoneme may be learned through the mapping learning model to form the cross-language phoneme table.
The first language phoneme and the second language phoneme may be language phonemes of different language types, for example, the first language phoneme is a chinese phoneme, and the second language phoneme may be any other language phoneme except the chinese phoneme, for example, the second language phoneme may be an english phoneme, a french phoneme, or the like, which is not limited in this disclosure.
In addition, the mapping learning model may be any learning model for predicting and mapping the language phonemes, which is not limited in this disclosure.
Optionally, any first language phoneme may be input to the mapping learning model, so as to determine, through processing of the mapping learning model, prediction probabilities corresponding to the second language phonemes corresponding to the any first language phoneme, and then, according to the respective prediction probabilities, a mapping relationship between the first language phoneme and the second language phoneme may be determined.
For example, a probability threshold may be set in advance, and in the case that the prediction probability is greater than the probability threshold, it may be considered that a correct mapping relationship exists between the first language phoneme and the second language phoneme; if the predicted probability is less than or equal to the probability threshold, it may be determined that the mapping relationship between the first language phoneme and the second language phoneme is less accurate, and then training may be continued on the first language phoneme and the second language phoneme to form a cross-language phoneme table.
For example, when the set probability threshold is 0.65, the prediction probability between the first language phoneme and the second language phoneme can be as shown in fig. 1A.
As shown in fig. 1A, the prediction probability between the first language phoneme "F" and the second language phoneme "f" is 0.71, which is greater than 0.65, so the mapping relationship between "F" and "f" is considered accurate and reliable. The prediction probability between the first language phoneme "JH" and the second language phoneme "zh" is 0.49, which is less than 0.65, so the reliability of the mapping relationship between "JH" and "zh" may be considered low; likewise, the prediction probability between the first language phoneme "AW2" and the second language phoneme "ao3" is 0.23, which is less than 0.65, so the mapping relationship between "AW2" and "ao3" is considered to have low reliability. Training on "JH", "zh", "AW2" and "ao3" may then be continued so that the mapping learning model can learn the mapping relationship between the first language phonemes and the second language phonemes to form the cross-language phoneme table.
Alternatively, any first language phoneme may be matched with each second language phoneme, and the second language phoneme with the highest probability value may be determined as the second language phoneme having a mapping relationship with the any first language phoneme, and so on.
It should be noted that the above examples are only illustrative, and should not be taken as limitations on the probability threshold, the first language phoneme, the probability value between the second language phonemes, and the like in the embodiments of the present disclosure.
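As an illustrative sketch of the threshold rule described above: the mapping learning model itself is stubbed out as a precomputed probability matrix, and 0.65 is simply the example threshold used with fig. 1A.

```python
# Illustrative sketch of building the cross-language phoneme table from the
# mapping model's prediction probabilities; probs is assumed precomputed.
import numpy as np

def build_cross_language_table(first_phones: list, second_phones: list,
                               probs: np.ndarray, threshold: float = 0.65):
    # probs[i, j]: predicted probability that first_phones[i] maps to
    # second_phones[j], as produced by the mapping learning model
    table, needs_more_training = {}, []
    for i, p1 in enumerate(first_phones):
        j = int(np.argmax(probs[i]))           # best-matching second phoneme
        if probs[i, j] > threshold:            # accept the mapping
            table[p1] = second_phones[j]
        else:                                  # low reliability: keep training
            needs_more_training.append((p1, second_phones[j]))
    return table, needs_more_training
```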
Therefore, in the embodiment of the present disclosure, since the cross-language phoneme table is generated based on the mapping relationship between the phonemes of the two languages, rather than on prior knowledge, its reliability may be higher, and the generated sound may preserve not only the timbre of the speaker but also the speaker's style and accent. Generating the voice matrix features based on this highly reliable cross-language phoneme table and the sample voice features then improves the accuracy and reliability of the voice matrix features.
Step 104, performing joint mask learning according to the voice matrix features and the text sample data so as to pre-train the speech synthesis model.
The speech synthesis model may be any model for performing speech processing, and may include any network structure, such as a conformer (a convolution-augmented Transformer), or other network structures, and the disclosure is not limited thereto.
Optionally, the conformer may utilize convolution operations and a self-attention mechanism to enhance the learning of feature representations and to fuse local and global feature representations; in addition, the conformer adopts a parallel structure, so that local features and global representations can be largely preserved, enhancing the feature interaction between language and speech. The present disclosure is not limited thereto.
Optionally, part of the content in the speech matrix features may be masked, and part of the content in the text sample data may be masked; the masked speech matrix features and text sample data may then be spliced and input into the speech synthesis model, so as to output the corresponding predicted speech features and predicted text through the processing of the speech synthesis model. The speech synthesis model may then be pre-trained according to the difference between the predicted speech features and the sample speech features and the difference between the predicted text and the text sample data, and so on.
Therefore, in the embodiment of the disclosure, joint mask training can be performed on the speech synthesis model based on the speech matrix features and the text sample data, so that the speech synthesis model can learn the alignment relationship between speech and text, and can generate more accurate and reliable speech features.
According to the embodiment of the disclosure, voice sample data and corresponding text sample data may first be acquired, feature extraction may then be performed on the voice sample data to generate sample voice features, voice matrix features may then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning may then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, joint training of voice features and text features is fully considered during the pre-training of the speech synthesis model, so that the resulting model is more accurate and reliable, providing conditions for improving speech synthesis quality.
Fig. 2 is a schematic flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the present disclosure, and as shown in fig. 2, the pre-training method based on the speech synthesis model may include the following steps:
step 201, acquiring voice sample data and text sample data corresponding to the voice sample data.
Step 202, performing feature extraction on the voice sample data to generate sample voice features.
Step 203, generating a speech matrix characteristic according to the sample speech characteristic and the cross-language phoneme table.
It should be noted that specific contents and implementation manners of step 201 to step 203 may refer to descriptions of other embodiments of the present disclosure, and are not described herein again.
Step 204, masking the audio matrix in the speech matrix features using a first masking method, masking the text features in the text sample data using a second masking method, and pre-training the speech synthesis model.
Optionally, the first masking method may be a continuous-interval-level masking method, and the second masking method may be a discrete masking method, which is not limited in this disclosure.
For example, if the text feature is "t ian1 q i4 h en3 h ao 3" which is masked in a discrete masking manner, the masked text feature can be expressed as: "t ian1 q i4[ MASK ] h ao 3", or "t ian1[ MASK ] i4 h en3 h ao 3", etc., which the disclosure does not limit.
Optionally, the mask positions of the first masking method and the second masking method may be different; that is, the two masking methods may adopt a non-overlapping mode, in which the positions masked in the audio matrix differ from the positions masked in the text features. The disclosure is not limited thereto.
The audio matrix may be a spectrogram of the voice sample data, or may also be other matrices capable of representing voice features, and the like; the text feature may be a phoneme corresponding to the text sample data, or may also be other features, and the like, which is not limited in this disclosure.
Therefore, in the embodiment of the present disclosure, a first masking method may be used to mask the audio matrix in the speech matrix features, yielding masked speech matrix features; a second masking method may be used to mask the text features in the text sample data, yielding masked text sample data; the masked speech matrix features and masked text sample data may then be input into the speech synthesis model to pre-train it. Because the first masking method differs from the second, and the masked portions of the audio matrix and of the text features differ, the speech synthesis model can be pre-trained on varied combinations of speech matrix features and text sample data; that is, the features learned by the model are more comprehensive and reliable, improving the accuracy and reliability of the corresponding prediction results as much as possible.
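A minimal sketch of the two masking styles follows; the span length, mask rate, zero-fill value and "[MASK]" symbol are illustrative assumptions, and the non-overlap constraint between the audio and text mask positions described above would be enforced on top of this.

```python
# Sketch of continuous-interval masking (audio) vs. discrete masking (text);
# all hyperparameters here are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def mask_audio_spans(audio: np.ndarray, span: int = 10, n_spans: int = 2) -> np.ndarray:
    # first masking method: hide continuous frame intervals of the audio matrix
    masked = audio.copy()
    for _ in range(n_spans):
        start = int(rng.integers(0, max(1, audio.shape[0] - span)))
        masked[start:start + span] = 0.0
    return masked

def mask_text_tokens(tokens: list, rate: float = 0.15) -> list:
    # second masking method: hide individual phonemes at discrete positions
    return ["[MASK]" if rng.random() < rate else t for t in tokens]
```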
Optionally, a residual error loss function may be used for pre-training the speech matrix features, and a cross entropy loss function may be used for pre-training the text sample data.
It can be understood that after the speech matrix features and the text sample data which are subjected to the masking processing are input into the speech synthesis model, the predicted speech features corresponding to the speech matrix features and the predicted text corresponding to the masked part in the text sample data can be determined through the processing of the speech synthesis model.
The predicted speech feature may then be matched to the sample speech feature, such as by using a residual loss function, or by using any L1-type loss function, to determine a first loss value between the predicted speech feature and the sample speech feature, and the speech synthesis model may then be pre-trained based on the first loss value to improve the learning capability and performance of the speech synthesis model.
And then, matching the predicted text with the text sample data, for example, determining a second loss value between the predicted text and the text sample data by adopting a cross entropy loss function, and then pre-training the speech synthesis model based on the second loss value to improve the learning capability and performance of the speech synthesis model.
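Assuming the model outputs a predicted spectrogram plus per-position phoneme logits, the first and second loss values could be combined as in the following PyTorch sketch; equal weighting of the two terms is an assumption for illustration.

```python
# Sketch of combining the residual (L1-type) loss and the cross entropy loss
# over the masked positions only; tensor shapes are assumptions.
import torch
import torch.nn.functional as F

def pretraining_loss(pred_spec: torch.Tensor, target_spec: torch.Tensor,
                     spec_mask: torch.Tensor, pred_logits: torch.Tensor,
                     target_ids: torch.Tensor, text_mask: torch.Tensor) -> torch.Tensor:
    # spec_mask / text_mask are boolean, True at positions hidden by [MASK]
    # first loss value: residual loss on the masked spectrogram frames
    loss_spec = F.l1_loss(pred_spec[spec_mask], target_spec[spec_mask])
    # second loss value: cross entropy on the masked phoneme positions
    loss_text = F.cross_entropy(pred_logits[text_mask], target_ids[text_mask])
    return loss_spec + loss_text
```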
Therefore, in the embodiment of the disclosure, by training with masks over the combined audio features and text features, the speech synthesis model can learn the alignment relationship between speech and text; further, when the speech synthesis model is used to synthesize speech, the generated spectrum has higher precision, the quality of the synthesized sound is higher, and various speech synthesis tasks can be realized.
Optionally, before the joint mask learning is performed on the speech matrix features and the text sample data to pre-train the speech synthesis model, speech coding may be performed on the speech matrix features so that their feature dimensions match those of the text sample data. The speech synthesis model can then fully learn both the speech matrix features and the text sample data, the feature interaction between speech and text is enhanced, and the output predicted speech features and predicted text can be more accurate and reliable.
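For example, the speech-coding step could be a learned linear projection, as in the following sketch; the class name, dimensions and usage are assumptions for illustration, not the disclosed architecture.

```python
# Sketch of speech coding as a linear projection that maps the speech matrix
# features to the text feature dimension; dimensions are assumptions.
import torch
import torch.nn as nn

class SpeechCoder(nn.Module):
    def __init__(self, speech_dim: int = 80, model_dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(speech_dim, model_dim)

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # (T, speech_dim) -> (T, model_dim), same width as the text embeddings
        return self.proj(speech_feats)

# usage sketch: splice the projected speech features with equally sized text
# embeddings along the time axis before feeding the speech synthesis model
# joint_input = torch.cat([SpeechCoder()(speech_feats), text_embeds], dim=0)
```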
It should be noted that the pre-training method based on the speech synthesis model provided by the present disclosure may be applied to any scenario such as speech recognition and speech synthesis, for example, speech editing, personalized speech synthesis, cross-language speech synthesis, and the like, as shown in fig. 2A. The present disclosure is not limited thereto.
The pre-training process based on the speech synthesis model provided by the present disclosure is described below with reference to fig. 2B.
As shown in fig. 2B, first, voice sample data and text sample data corresponding to the voice sample data may be obtained, for example, the text sample data may be "weather is good", then, feature extraction may be performed on the voice sample data to generate sample voice features, and then, voice matrix features may be generated according to the sample voice features and the cross-language phoneme table.
The cross-language phoneme table and the text sample data may first be used to generate the cross-language phonemes "t ian1 q i4 h en3 h ao 3" corresponding to the text sample data "weather is good". Then the number (segment index) of each voice feature in the sample voice features may be determined, and the audio matrix (spectrogram) may be masked using the first masking method; the cross-language phonemes, the [MASK]-processed audio matrix and the numbers may then be taken together as the speech matrix features. The speech matrix features may then be input into an acoustic encoder for processing to generate acoustic vectors (acoustic embeddings). Afterwards, the text features in the text sample data are masked using the second masking method to obtain [MASK]-processed text sample data.
Then, the acoustic vectors may be speech-coded so that their feature dimensions are the same as those of the masked text sample data; the two may then be spliced and input into the speech synthesis model, for example into a conformer block, so that a corresponding spectrogram is output after its processing; the spectrogram may then be input into a post-processing network (post-net) to obtain a corresponding predicted spectrogram. The predicted spectrogram may be matched with the audio matrix corresponding to the voice sample data, for example using a residual loss function, to determine a corresponding first loss value, and the speech synthesis model may then be pre-trained using the first loss value.
It can be understood that when the acoustic vectors and the masked text sample data are input into the speech synthesis model, the corresponding predicted text data "h HH" may also be output; the predicted text data may then be matched with the text sample data, for example using a cross entropy loss function, to determine the corresponding second loss value. The speech synthesis model may then be pre-trained using the second loss value.
Therefore, in the embodiment of the disclosure, joint mask learning can be performed according to the speech matrix features and the text sample data to pre-train the speech synthesis model, so that the model fully learns the speech matrix features and the text features and its performance is improved; when the speech synthesis model is subsequently used, a spectrogram with higher accuracy can be generated, providing conditions for improving speech synthesis quality.
It should be noted that the above examples are only illustrative, and should not be taken as a limitation on the pre-training process in the embodiments of the present disclosure.
According to the embodiment of the disclosure, voice sample data and corresponding text sample data may first be acquired, feature extraction may then be performed on the voice sample data to generate sample voice features, voice matrix features may be generated according to the sample voice features and the cross-language phoneme table, and the audio matrix in the voice matrix features may then be masked using a first masking method while the text features in the text sample data are masked using a second masking method, so as to pre-train the speech synthesis model. In this way, the audio features and the text features are fully considered during the pre-training of the speech synthesis model through joint mask learning over the audio matrix and the text features, so that the resulting model is more accurate and reliable, providing conditions for improving speech synthesis quality.
In order to implement the above embodiments, the present disclosure further provides a pre-training device based on a speech synthesis model.
Fig. 3 is a schematic structural diagram of a pre-training apparatus based on a speech synthesis model according to an embodiment of the present disclosure.
As shown in fig. 3, the pre-training apparatus 300 based on speech synthesis model includes: an acquisition module 310, an extraction module 320, a first generation module 330, and a processing module 340.
The obtaining module 310 is configured to obtain voice sample data and text sample data corresponding to the voice sample data.
The extracting module 320 is configured to perform feature extraction on the voice sample data to generate a sample voice feature.
A first generating module 330, configured to generate a speech matrix feature according to the sample speech feature and the cross-language phoneme table.
And the processing module 340 is configured to perform joint mask learning according to the speech matrix features and the text sample data, so as to pre-train a speech synthesis model.
Optionally, the processing module 340 is further configured to:
and performing voice coding on the voice matrix characteristics to enable the voice matrix characteristics and the characteristic dimensions of the text sample data to be the same.
Optionally, the processing module 340 is specifically configured to:
and masking the audio matrix in the voice matrix characteristics by adopting a first masking mode, and masking the text characteristics in the text sample data by adopting a second masking mode.
Optionally, the first masking manner is a masking manner at a continuous interval level, and the second masking manner is a discrete masking manner.
Optionally, a mask position of the first mask manner is different from a mask position of the second mask manner.
Optionally, the processing module 340 is further configured to:
and pre-training the voice matrix characteristics by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
Optionally, the apparatus further includes a second generating module, configured to:
acquiring a first language phoneme and a second language phoneme;
learning a mapping relationship between the first language phoneme and the second language phoneme through a mapping learning model to form the cross-language phoneme table.
The functions and specific implementation principles of the above modules in the embodiments of the present disclosure may refer to the above method embodiments, which are not described herein again.
The pre-training apparatus based on the speech synthesis model of the embodiment of the disclosure may acquire voice sample data and corresponding text sample data, perform feature extraction on the voice sample data to generate sample voice features, generate voice matrix features according to the sample voice features and the cross-language phoneme table, and perform joint mask learning according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, joint training of voice features and text features is fully considered during pre-training, so that the resulting speech synthesis model is more accurate and reliable, providing conditions for improving speech synthesis quality.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The computing unit 401, the ROM 402 and the RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 performs the various methods and processes described above, such as a pre-training method based on a speech synthesis model. For example, in some embodiments, the pre-training method based on the speech synthesis model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the pre-training method based on speech synthesis models described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the pre-training method based on the speech synthesis model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that remedies the defects of high management difficulty and weak service expansibility in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical scheme of the disclosure, voice sample data and corresponding text sample data may first be acquired, feature extraction may then be performed on the voice sample data to generate sample voice features, voice matrix features may then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning may then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, joint training of voice features and text features is fully considered during pre-training, so that the resulting speech synthesis model is more accurate and reliable, providing conditions for improving speech synthesis quality.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and substitutions can be made in accordance with design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A pre-training method based on a speech synthesis model comprises the following steps:
acquiring voice sample data and text sample data corresponding to the voice sample data;
performing feature extraction on the voice sample data to generate sample voice features;
generating a voice matrix characteristic according to the sample voice characteristic and the cross-language phoneme table;
and performing joint mask learning according to the voice matrix characteristics and the text sample data so as to pre-train a voice synthesis model.
2. The method of claim 1, wherein prior to said performing joint mask learning from said speech matrix features and said text sample data to pre-train a speech synthesis model, further comprising:
and performing voice coding on the voice matrix characteristics so that the voice matrix characteristics and the characteristic dimension of the text sample data are the same.
3. The method of claim 1, wherein said performing joint mask learning from said speech matrix features and said text sample data to pre-train a speech synthesis model, comprises:
and masking the audio matrix in the voice matrix characteristics by adopting a first masking mode, and masking the text characteristics in the text sample data by adopting a second masking mode.
4. The method of claim 3, wherein the first masking pattern is a continuous interval level masking pattern, and the second masking pattern is a discrete masking pattern.
5. The method of claim 3, wherein mask positions of the first masking manner are different from mask positions of the second masking manner.
6. The method of claim 3, further comprising:
and pre-training the voice matrix characteristics by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
7. The method of claim 1, wherein the cross-language phone list is obtained by:
acquiring a first language phoneme and a second language phoneme;
learning a mapping relationship between the first language phoneme and the second language phoneme through a mapping learning model to form the cross-language phoneme table.
8. A pre-training apparatus based on a speech synthesis model, comprising: the acquisition module is used for acquiring voice sample data and text sample data corresponding to the voice sample data;
the extraction module is used for extracting the characteristics of the voice sample data to generate sample voice characteristics;
the first generation module is used for generating a voice matrix characteristic according to the sample voice characteristic and the cross-language phoneme table;
and the processing module is used for performing joint mask learning according to the voice matrix characteristics and the text sample data so as to pre-train a voice synthesis model.
9. The apparatus of claim 8, wherein the processing module is further configured to:
and performing voice coding on the voice matrix characteristics to enable the voice matrix characteristics and the characteristic dimensions of the text sample data to be the same.
10. The apparatus of claim 8, wherein the processing module is specifically configured to:
and masking the audio matrix in the voice matrix characteristics by adopting a first masking mode, and masking the text characteristics in the text sample data by adopting a second masking mode.
11. The apparatus as claimed in claim 10, wherein the first mask pattern is a mask pattern of a continuous interval level, and the second mask pattern is a discrete mask pattern.
12. The apparatus of claim 10, wherein mask positions of the first masking manner are different from mask positions of the second masking manner.
13. The apparatus of claim 10, wherein the processing module is further configured to:
and pre-training the voice matrix characteristics by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
14. The apparatus of claim 8, further comprising a second generating module to:
acquiring a first language phoneme and a second language phoneme;
learning a mapping relationship between the first language phoneme and the second language phoneme through a mapping learning model to form the cross-language phoneme table.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202210552552.8A 2022-05-20 2022-05-20 Pre-training method and device based on speech synthesis model and electronic equipment Pending CN114898734A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210552552.8A CN114898734A (en) 2022-05-20 2022-05-20 Pre-training method and device based on speech synthesis model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210552552.8A CN114898734A (en) 2022-05-20 2022-05-20 Pre-training method and device based on speech synthesis model and electronic equipment

Publications (1)

Publication Number Publication Date
CN114898734A true CN114898734A (en) 2022-08-12

Family

ID=82722926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210552552.8A Pending CN114898734A (en) 2022-05-20 2022-05-20 Pre-training method and device based on speech synthesis model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114898734A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386145A (en) * 2023-04-17 2023-07-04 浙江金融职业学院 Method for identifying abnormal behaviors of personnel in bank based on double cameras
CN116386145B (en) * 2023-04-17 2023-11-03 浙江金融职业学院 Method for identifying abnormal behaviors of personnel in bank based on double cameras

Similar Documents

Publication Publication Date Title
CN113553864B (en) Translation model training method and device, electronic equipment and storage medium
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN114416934B (en) Multi-modal dialog generation model training method and device and electronic equipment
CN115309877B (en) Dialogue generation method, dialogue model training method and device
EP4109324A2 (en) Method and apparatus for identifying noise samples, electronic device, and storage medium
CN113674732B (en) Voice confidence detection method and device, electronic equipment and storage medium
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN112989970A (en) Document layout analysis method and device, electronic equipment and readable storage medium
CN113053367A (en) Speech recognition method, model training method and device for speech recognition
CN116050425A (en) Method for establishing pre-training language model, text prediction method and device
CN112365875A (en) Voice synthesis method, device, vocoder and electronic equipment
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN112507705B (en) Position code generation method and device and electronic equipment
CN114898734A (en) Pre-training method and device based on speech synthesis model and electronic equipment
CN114495977A (en) Speech translation and model training method, device, electronic equipment and storage medium
CN113160820A (en) Speech recognition method, and training method, device and equipment of speech recognition model
CN115565186B (en) Training method and device for character recognition model, electronic equipment and storage medium
CN114970666B (en) Spoken language processing method and device, electronic equipment and storage medium
JP7349523B2 (en) Speech recognition method, speech recognition device, electronic device, storage medium computer program product and computer program
CN113920987A (en) Voice recognition method, device, equipment and storage medium
CN112765973A (en) Scoring model training method and device and composition scoring method and device
CN112259089B (en) Speech recognition method and device
CN113553863B (en) Text generation method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination