CN114898734B - Pre-training method and device based on voice synthesis model and electronic equipment - Google Patents

Pre-training method and device based on voice synthesis model and electronic equipment

Info

Publication number
CN114898734B
CN114898734B (application CN202210552552.8A)
Authority
CN
China
Prior art keywords
voice
sample data
masking
matrix
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210552552.8A
Other languages
Chinese (zh)
Other versions
CN114898734A (en)
Inventor
樊晓然
郑人杰
陈俊坤
朱鹏飞
庞超
王硕寰
原湉
李昕同
孙宇
黄亮
陈泽裕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210552552.8A
Publication of CN114898734A
Application granted
Publication of CN114898734B
Active (current legal status)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a pre-training method and apparatus based on a speech synthesis model, and an electronic device, and particularly relates to artificial intelligence fields such as natural language processing, deep learning, and speech technology. The scheme is as follows: acquiring voice sample data and text sample data corresponding to the voice sample data; extracting features from the voice sample data to generate sample voice features; generating voice matrix features according to the sample voice features and a cross-language phoneme list; and performing joint mask learning according to the voice matrix features and the text sample data to pre-train the speech synthesis model. The speech synthesis model is thus pre-trained through joint mask learning over the speech matrix features and the text sample data; that is, the joint training of speech features and text features is fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.

Description

Pre-training method and device based on voice synthesis model and electronic equipment
Technical Field
The disclosure relates to the technical field of computers, in particular to the technical field of artificial intelligence such as natural language processing, deep learning, voice technology and the like, and particularly relates to a pre-training method and device based on a voice synthesis model and electronic equipment.
Background
With the development of computer technology, voice, as an important carrier through which people acquire information, is widely used in daily life and work. Existing mainstream voice models have achieved significant improvements on many voice-understanding tasks, such as voice recognition, voice classification, and voice-to-text translation. For speech synthesis, however, generating high-quality speech remains challenging.
In the related art, a speech synthesis model can typically only handle a single language or a single type of speech synthesis task, and cross-language speech synthesis generally needs to introduce prior knowledge, which may result in low speech synthesis quality. Therefore, how to pre-train a speech synthesis model so as to improve speech synthesis quality is important.
Disclosure of Invention
The disclosure provides a pre-training method, device, electronic equipment and storage medium based on a speech synthesis model.
In one aspect of the disclosure, a pre-training method based on a speech synthesis model is provided, including:
acquiring voice sample data and text sample data corresponding to the voice sample data;
Extracting features of the voice sample data to generate sample voice features;
generating a voice matrix feature according to the sample voice feature and the cross-language phoneme list;
and carrying out joint mask learning according to the voice matrix characteristics and the text sample data so as to pretrain a voice synthesis model.
In another aspect of the present disclosure, there is provided a pre-training apparatus based on a speech synthesis model, including:
The acquisition module is used for acquiring voice sample data and text sample data corresponding to the voice sample data;
the extraction module is used for extracting the characteristics of the voice sample data so as to generate sample voice characteristics;
The first generation module is used for generating voice matrix characteristics according to the sample voice characteristics and the cross-language phoneme list;
and the processing module is used for carrying out joint mask learning according to the voice matrix characteristics and the text sample data so as to pretrain the voice synthesis model.
In another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the speech synthesis model-based pre-training method of an embodiment of the above aspect.
In another aspect of the disclosure, a non-transitory computer readable storage medium storing computer instructions for causing a computer to execute the speech synthesis model-based pre-training method described in the above-described embodiments of the aspect is provided.
In another aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the speech synthesis model-based pre-training method described in the above-described embodiments of the aspect.
According to the pre-training method and apparatus based on a speech synthesis model, the electronic device, and the storage medium of the present disclosure, voice sample data and text sample data corresponding to the voice sample data may first be acquired, feature extraction may then be performed on the voice sample data to generate sample voice features, voice matrix features may then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning may then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, the speech synthesis model is pre-trained by extracting features from the voice sample data to generate sample voice features, generating voice matrix features using the sample voice features and the cross-language phoneme table, and then performing joint mask learning on the voice matrix features and the text sample data; that is, the joint training of speech features and text features is fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure;
FIG. 1A is a schematic diagram of a mapping relationship between a first language phoneme and a second language phoneme according to an embodiment of the present disclosure;
FIG. 2 is a flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure;
Fig. 2A is a schematic application scenario diagram of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure;
FIG. 2B is a schematic diagram of a pre-training process based on a speech synthesis model according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure;
FIG. 4 is a block diagram of an electronic device for implementing a speech synthesis model-based pre-training method in accordance with an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence is the discipline that studies how to make computers mimic certain human thinking processes and intelligent behaviors (e.g., learning, reasoning, thinking, and planning); it involves both hardware-level and software-level techniques. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning, deep learning, big data processing, knowledge graph technology, and the like.
Natural language processing is the use of computers to process, understand, and use human languages (e.g., Chinese, English); it is an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics. Natural language is the fundamental mark that distinguishes humans from other animals, and human thinking is inseparable from language; natural language processing therefore embodies one of the highest goals of artificial intelligence, that is, a machine achieves true intelligence only when the computer has the ability to process natural language.
Deep learning refers to multi-layer artificial neural networks and the methods used to train them. A neural network takes a large number of matrices of numbers as input, weights them through nonlinear activation functions, and then produces another data set as output. With suitable amounts of such matrices, multiple layers are linked together to form a neural network "brain" capable of precise and complex processing, much as people recognize objects and label pictures.
Speech technology refers to key technologies in the computer field, namely automatic speech recognition (ASR) and text-to-speech synthesis (TTS). Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and voice is expected to become one of the most convenient modes of human-computer interaction, with advantages over other interaction modes.
The following describes a pre-training method, apparatus, electronic device and storage medium based on a speech synthesis model according to an embodiment of the present disclosure with reference to the accompanying drawings.
The pre-training method based on the voice synthesis model provided by the embodiment of the disclosure can be executed by the pre-training device based on the voice synthesis model provided by the embodiment of the disclosure, and the device can be configured in electronic equipment.
Fig. 1 is a schematic flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure.
As shown in fig. 1, the pre-training method based on the speech synthesis model may include the following steps:
step 101, obtaining voice sample data and text sample data corresponding to the voice sample data.
The voice sample data may be any type of audio data, for example, may be chinese audio data, english audio data, and the like, which is not limited in this disclosure.
In addition, the text sample data may be text data corresponding to the voice sample data; for example, if the voice sample data is Chinese audio data, the corresponding text sample data may be Chinese text data, or if the voice sample data is English audio data, the corresponding text sample data may be English text data, and so on, which is not limited in this disclosure.
It will be appreciated that the voice sample data may be generated by processing any audio, audio data extracted from video, and the like, which is not limited in this disclosure.
In addition, after the text sample data corresponding to the voice sample data is obtained, the text sample data may be further processed to obtain phoneme data corresponding to the text sample data, and the disclosure is not limited thereto.
Alternatively, the text sample data corresponding to the voice sample data may be generated from open-source Chinese data sets, English data sets, or the like, which is not limited in this disclosure.
Step 102, extracting features from the voice sample data to generate sample voice features.
The sample voice features may be understood as features that may reflect the voice sample data, for example, may include a corresponding audio matrix, or may also include a location number corresponding to each voice sample, or may also include other types of voice features, and so on. The present disclosure is not limited in this regard.
In addition, there are various methods for extracting features from the voice sample data. For example, feature extraction may be performed on the voice sample data using MFCCs (Mel-frequency cepstral coefficients) to generate the sample voice features; alternatively, the voice sample data may be processed with fbank (filter bank) features to generate the sample voice features, which is not limited in this disclosure.
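Purely as an illustration (the disclosure does not prescribe a specific toolchain), the MFCC variant of this step could look like the following sketch; the sampling rate, number of coefficients, and hop length are assumptions made for the example, not values taken from the disclosure.

    # Illustrative sketch of the feature-extraction step (MFCC variant).
    import librosa
    import numpy as np

    def extract_sample_voice_features(wav_path: str, n_mfcc: int = 40) -> np.ndarray:
        """Return a (frames, n_mfcc) matrix of MFCC features for one utterance."""
        waveform, sample_rate = librosa.load(wav_path, sr=16000)   # assumed 16 kHz sample rate
        mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate,
                                    n_mfcc=n_mfcc, hop_length=160)  # 10 ms hop at 16 kHz
        return mfcc.T  # one row per speech frame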
Step 103, generating voice matrix features according to the sample voice features and the cross-language phoneme table.
The cross-language phoneme table may be a pre-generated phoneme table containing a mapping relationship between phonemes of two languages, for example, a correspondence between Chinese phonemes and English phonemes, which is not limited in the present disclosure.
Alternatively, the cross-language phoneme list may be traversed according to text sample data corresponding to the sample speech features to determine one or more cross-language phonemes corresponding to the sample speech features, and the like, which is not limited in this disclosure.
It will be appreciated that the sample speech features and cross-language phonemes may be spliced and fused to generate speech matrix features. For example, the sample speech features and the generated cross-language phonemes may be spliced and fused in an aligned manner to generate speech matrix features and the like; or the voice features of the sample voice features and the generated cross-language phonemes at the same time can be spliced in an up-down alignment manner, and the like, which is not limited in the disclosure.
For example, if the speech sample data is Chinese, and the cross-language phoneme table includes a mapping relationship between Chinese phonemes and English phonemes, the phonemes generated based on the cross-language phoneme table may be Chinese, English, or the like, which is not limited in the present disclosure.
Therefore, in the embodiment of the disclosure, conversion among phonemes of multiple languages can be realized based on the cross-language phoneme table, which provides conditions for improving the efficiency and accuracy of speech synthesis.
It will be appreciated that the cross-language phone table may be generated in advance and then used directly during the pre-training process.
Alternatively, the first language phone and the second language phone may be acquired first, and then a mapping relation between the first language phone and the second language phone may be learned through a mapping learning model to form a cross-language phone table.
The first language phonemes and the second language phonemes may be phonemes of different languages; for example, if the first language phonemes are Chinese phonemes, the second language phonemes may be phonemes of any language other than Chinese, such as English phonemes or French phonemes, which is not limited in this disclosure.
In addition, the mapping learning model may be any learning model that predicts and maps a language phoneme, which is not limited in this disclosure.
Optionally, any first language phoneme may be input to the mapping learning model, so that prediction probabilities corresponding to each second language phoneme corresponding to any first language phoneme are determined through processing of the mapping learning model, and then a mapping relationship between the first language phoneme and the second language phoneme may be determined according to each prediction probability.
For example, a probability threshold may be set in advance, and in the case where the predicted probability is greater than the probability threshold, the first language phoneme and the second language phoneme may be considered to have a correct mapping relationship; if the prediction probability is less than or equal to the probability threshold, the mapping relationship between the first language phoneme and the second language phoneme may be considered to have low accuracy, and training may be continued on the first language phoneme and the second language phoneme to form a cross-language phoneme table.
For example, in the case where the set probability threshold is 0.65, the prediction probability between the first language phoneme and the second language phoneme may be as shown in fig. 1A.
As can be seen from FIG. 1A, the prediction probability between the first language phoneme "F" and the second language phoneme "f" is 0.71, which is greater than 0.65, so the mapping relationship between "F" and "f" can be considered accurate and reliable. The prediction probability between the first language phoneme "JH" and the second language phoneme "zh" is 0.49, which is smaller than 0.65, so the reliability of the mapping relationship between "JH" and "zh" can be considered low; the prediction probability between the first language phoneme "AW2" and the second language phoneme "ao3" is 0.23, which is also smaller than 0.65, so the reliability of the mapping relationship between "AW2" and "ao3" is likewise considered low. Training on "JH", "zh", "AW2", and "ao3" may then continue, so that the mapping learning model can learn the mapping relationship between the first language phonemes and the second language phonemes to form the cross-language phoneme table.
Alternatively, any first language phoneme may be matched against each second language phoneme, and the second language phoneme with the largest probability value may be determined as the one having the mapping relationship with that first language phoneme.
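Purely for illustration, the following sketch shows one way the mapping model's prediction probabilities could be turned into a cross-language phoneme table; the 0.65 threshold follows the example above, and the probability matrix is assumed to have been produced by the mapping learning model.

    # Illustrative sketch: select, for each first-language phoneme, the
    # second-language phoneme with the largest prediction probability, and
    # keep the pair only if that probability exceeds the threshold.
    import numpy as np

    def build_cross_language_phoneme_table(first_phonemes, second_phonemes,
                                           probs: np.ndarray, threshold: float = 0.65):
        """probs[i, j] is the predicted probability that first_phonemes[i] maps to second_phonemes[j]."""
        table = {}
        for i, src in enumerate(first_phonemes):
            j = int(np.argmax(probs[i]))       # best-matching second-language phoneme
            if probs[i, j] > threshold:        # accept only confident mappings
                table[src] = second_phonemes[j]
            # otherwise the pair is left out and goes back into training
        return table

    # e.g. build_cross_language_phoneme_table(["F", "JH"], ["f", "zh"],
    #                                         np.array([[0.71, 0.10], [0.20, 0.49]]))
    # returns {"F": "f"}; "JH" stays unmapped because 0.49 < 0.65.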
It should be noted that the above examples are only schematic illustrations, and are not intended to limit probability thresholds, first language phonemes, second language phoneme-to-phoneme probability values, and the like in the embodiments of the present disclosure.
Therefore, in the embodiment of the disclosure, the cross-language phoneme table is generated based on the mapping relationship between phonemes of two languages rather than on prior knowledge, so the reliability of the cross-language phoneme table is higher, and the generated speech can retain not only the speaker's timbre but also the speaker's style and accent. Furthermore, generating the voice matrix features based on this highly reliable cross-language phoneme table and the sample voice features improves the accuracy and reliability of the voice matrix features.
And 104, performing joint mask learning according to the voice matrix characteristics and the text sample data to pretrain the voice synthesis model.
The speech synthesis model may be any model that performs speech processing and may include any network structure; for example, it may include the Conformer (convolution-augmented Transformer) encoder network, or it may include other network structures, which is not limited in this disclosure.
Optionally, the Conformer may use convolution operations and self-attention mechanisms to enhance the learned feature representations and to fuse local and global feature representations; in addition, the Conformer adopts a parallel structure, so that local features and global representations can be retained to a large extent and feature interaction between language and speech is enhanced. The present disclosure is not limited in this regard.
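The disclosure does not spell out which exact Conformer variant is used; purely as a reading aid, a simplified Conformer-style block (omitting the pointwise convolutions, gating, and relative positional encoding of the original Conformer) might look like the following PyTorch sketch, with all dimensions chosen arbitrarily.

    # Schematic, simplified Conformer-style block: feed-forward -> self-attention
    # (global context) -> depthwise convolution (local context) -> feed-forward.
    import torch
    import torch.nn as nn

    class SimplifiedConformerBlock(nn.Module):
        def __init__(self, dim: int = 256, heads: int = 4, kernel_size: int = 15):
            super().__init__()
            self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.SiLU(), nn.Linear(4 * dim, dim))
            self.attn_norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv_norm = nn.LayerNorm(dim)
            self.conv = nn.Conv1d(dim, dim, kernel_size,
                                  padding=kernel_size // 2, groups=dim)  # depthwise
            self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.SiLU(), nn.Linear(4 * dim, dim))

        def forward(self, x):                      # x: (batch, time, dim)
            x = x + 0.5 * self.ff1(x)              # first (macaron-style) feed-forward
            a = self.attn_norm(x)
            x = x + self.attn(a, a, a)[0]          # self-attention: global feature interaction
            c = self.conv_norm(x).transpose(1, 2)
            x = x + self.conv(c).transpose(1, 2)   # convolution: local feature interaction
            x = x + 0.5 * self.ff2(x)              # second feed-forward
            return x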
Optionally, part of the content in the voice matrix features and part of the content in the text sample data may each be masked; the masked voice matrix features and the masked text sample data may then be spliced and input into the speech synthesis model, which, through its processing, outputs corresponding predicted speech features and predicted text; the speech synthesis model may then be pre-trained according to the difference between the predicted speech features and the sample voice features and the difference between the predicted text and the text sample data.
Therefore, in the embodiment of the disclosure, the voice synthesis model can be subjected to joint mask training based on the voice matrix characteristics and the text sample data, so that the voice synthesis model can learn the alignment relation between voice and text and can generate more accurate and reliable voice characteristics, and when the voice characteristics are utilized for voice synthesis, the quality of voice synthesis generation can be improved, and conditions are provided for realization of various voice synthesis tasks.
According to the embodiment of the disclosure, voice sample data and text sample data corresponding to the voice sample data can be acquired first, feature extraction can then be performed on the voice sample data to generate sample voice features, voice matrix features can then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning can then be performed according to the voice matrix features and the text sample data to pre-train a speech synthesis model. In this way, the speech synthesis model is pre-trained by extracting features from the voice sample data to generate sample voice features, generating voice matrix features using the sample voice features and the cross-language phoneme table, and then performing joint mask learning on the voice matrix features and the text sample data; that is, the joint training of speech features and text features is fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.
Fig. 2 is a flow chart of a pre-training method based on a speech synthesis model according to an embodiment of the disclosure, as shown in fig. 2, the pre-training method based on the speech synthesis model may include the following steps:
step 201, obtaining voice sample data and text sample data corresponding to the voice sample data.
Step 202, extracting features from the voice sample data to generate sample voice features.
Step 203, generating the speech matrix features according to the sample speech features and the cross-language phoneme table.
It should be noted that, the specific content and implementation manner of the steps 201 to 203 may refer to the descriptions of other embodiments of the disclosure, and are not repeated herein.
Step 204, masking the audio matrix in the speech matrix features in a first masking mode, and masking the text features in the text sample data in a second masking mode to pretrain the speech synthesis model.
Alternatively, the first masking mode may be a masking mode of a continuous interval level, the second masking mode may be a discrete masking mode, and the like, which is not limited in this disclosure.
For example, if the text features are "t ian1 q i4 h en3 h ao3" and the discrete masking mode is used, the masked text features may be expressed as "t ian1 q i4 [MASK] h ao3", or "t ian1 [MASK] i4 h en3 h ao3", or the like, which is not limited by the present disclosure.
Alternatively, the masking positions of the first masking mode and of the second masking mode may be different; that is, the two modes may be non-overlapping, so that positions masked in the audio matrix are not masked in the text features, and vice versa. The present disclosure does not limit this.
The audio matrix may be a spectrogram of the voice sample data or another matrix capable of characterizing voice features; the text features may be phonemes corresponding to the text sample data or other features, which is not limited in this disclosure.
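A minimal sketch of the two masking modes is given below; the span length, mask ratio, and [MASK] placeholder id are assumptions for the example, and keeping the audio-side and text-side masked positions disjoint, as suggested above, is left to the caller.

    # Continuous-interval (span) masking for the audio matrix and discrete
    # (token-level) masking for the text features.
    import numpy as np

    MASK_ID = -1  # placeholder id for a masked phoneme token (assumption)

    def span_mask_audio(audio: np.ndarray, span: int = 10):
        """Zero out one contiguous block of frames in a (frames, dims) audio matrix."""
        masked = audio.copy()
        start = np.random.randint(0, max(1, len(audio) - span))
        masked[start:start + span] = 0.0
        mask = np.zeros(len(audio), dtype=bool)
        mask[start:start + span] = True
        return masked, mask

    def discrete_mask_phonemes(phoneme_ids, ratio: float = 0.15):
        """Replace a random subset of phoneme tokens with MASK_ID."""
        ids = list(phoneme_ids)
        n_mask = max(1, int(len(ids) * ratio))
        positions = np.random.choice(len(ids), size=n_mask, replace=False)
        for p in positions:
            ids[p] = MASK_ID
        mask = np.zeros(len(ids), dtype=bool)
        mask[positions] = True
        return ids, mask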
Therefore, in the embodiment of the disclosure, the audio matrix in the voice matrix features can be masked in the first masking mode to obtain masked voice matrix features, the text features in the text sample data can be masked in the second masking mode to obtain masked text sample data, and the masked voice matrix features and the masked text sample data can then be input into the speech synthesis model to pre-train it. Because the first masking mode differs from the second masking mode, the masked portions of the audio matrix and of the text features are different, so that when the speech synthesis model is pre-trained it can be trained jointly on differently masked voice matrix features and text sample data; the features the model learns are thus more comprehensive and reliable, and the accuracy and reliability of the corresponding prediction results are improved as much as possible.
Alternatively, the residual error loss function may be used for pre-training for the speech matrix features and the cross entropy loss function may be used for the text sample data.
It will be appreciated that after the speech matrix features and the text sample data subjected to masking processing are input to the speech synthesis model, the predicted speech features corresponding to the speech matrix features and the predicted text corresponding to the masked portion of the text sample data may be determined through processing by the speech synthesis model.
The predicted speech feature may then be matched with the sample speech feature, e.g., a residual loss function may be employed, or any L1 type loss function may be employed to determine a first loss value between the predicted speech feature and the sample speech feature, and then the speech synthesis model may be pre-trained based on the first loss value to improve the learning ability and performance of the speech synthesis model.
The predicted text and the text sample data may be matched, for example, a cross entropy loss function may be used to determine a second loss value between the predicted text and the text sample data, and then the speech synthesis model may be pre-trained based on the second loss value to improve learning ability and performance of the speech synthesis model.
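Putting the two objectives together, a sketch of the joint loss described above (a residual or L1-type loss on the masked speech frames plus a cross-entropy loss on the masked phoneme tokens) might look as follows in PyTorch; computing each loss only over masked positions and weighting the two terms equally are assumptions for the example.

    # First loss value: predicted vs. sample speech features (L1 / residual).
    # Second loss value: predicted text vs. text sample data (cross entropy).
    import torch
    import torch.nn.functional as F

    def joint_mask_loss(pred_spec, target_spec, spec_mask,
                        pred_logits, target_ids, text_mask):
        # pred_spec/target_spec: (batch, frames, dims); spec_mask: (batch, frames) bool
        first_loss = F.l1_loss(pred_spec[spec_mask], target_spec[spec_mask])
        # pred_logits: (batch, tokens, vocab); target_ids/text_mask: (batch, tokens)
        second_loss = F.cross_entropy(pred_logits[text_mask], target_ids[text_mask])
        return first_loss + second_loss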
Therefore, in the embodiment of the disclosure, by jointly masking and training on the audio features and the text features, the speech synthesis model can learn the alignment relationship between speech and text, so that when the speech synthesis model is used for speech generation, the generated spectrum has higher precision, the quality of the synthesized speech is higher, and various speech synthesis tasks can be realized.
Optionally, before the voice matrix features and the text sample data are subjected to joint mask learning to pretrain the voice synthesis model, the voice matrix features can be subjected to voice coding so that feature dimensions of the voice matrix features and the text sample data are the same, so that the voice synthesis model can fully learn the voice matrix features and the text sample data, feature interaction between voice and text is enhanced, and the output predicted voice features and predicted text can be more accurate and reliable.
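By way of illustration only, the "voice coding" that aligns the feature dimensions could be as simple as a learned linear projection; the module below and its dimensions are assumptions, not the mechanism prescribed by the disclosure.

    # Project frame-level speech features to the text embedding dimension.
    import torch.nn as nn

    class SpeechFeatureProjector(nn.Module):
        def __init__(self, speech_dim: int = 80, text_dim: int = 256):
            super().__init__()
            self.proj = nn.Linear(speech_dim, text_dim)

        def forward(self, speech_matrix_features):     # (batch, frames, speech_dim)
            return self.proj(speech_matrix_features)   # (batch, frames, text_dim)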
It should be noted that, the pre-training method based on the speech synthesis model provided in the present disclosure may be applicable to any scenario of speech recognition, speech synthesis, and the like, for example, speech editing, personalized speech synthesis, cross-language speech synthesis, and the like as shown in fig. 2A. The present disclosure is not limited in this regard.
The pre-training process based on the speech synthesis model provided by the present disclosure is described below in connection with fig. 2B.
As shown in fig. 2B, first, voice sample data and text sample data corresponding to the voice sample data may be acquired; for example, the text sample data may be "the weather is good". Feature extraction may then be performed on the voice sample data to generate sample voice features, and voice matrix features may then be generated according to the sample voice features and the cross-language phoneme table.
The cross-language phonemes "t ian1 q i4 h en3 h ao3" corresponding to the text sample data "the weather is good" can be generated using the cross-language phoneme table and the text sample data; the number (segment index) of each voice feature in the sample voice features can then be determined; the audio matrix (spectrogram) can then be masked in the first masking mode; and the cross-language phonemes, the [MASK]-processed audio matrix, and the numbers can then be used as the voice matrix features. The voice matrix features may then be input into an acoustic encoder to be processed by the acoustic encoder to generate acoustic vectors (acoustic embeddings). The text features in the text sample data may then be masked in the second masking mode to obtain the [MASK]-processed text sample data.
The acoustic vectors may then be speech-encoded so that the feature dimensions of the acoustic vectors and of the masked text sample data are the same; the acoustic vectors and the text sample data may then be spliced and input into the speech synthesis model, for example into a Conformer block, to be processed and output a corresponding spectrogram; the spectrogram may then be input into a post-processing network (post-net) to obtain a corresponding predicted spectrogram. The predicted spectrogram may then be matched against the audio matrix corresponding to the voice sample data, for example using a residual loss function, to determine a corresponding first loss value, and the speech synthesis model may then be pre-trained using the first loss value.
It may be appreciated that the acoustic vectors and the masked text sample data may be input into the speech synthesis model to output the corresponding predicted text data "HH"; the predicted text data may then be matched against the text sample data, for example using a cross-entropy loss function, to determine the corresponding second loss value. The speech synthesis model may then be pre-trained using the second loss value.
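Pulling the pieces of fig. 2B together, a schematic and deliberately simplified forward pass for the joint-mask pre-training step might look like the sketch below. The module names and sizes are assumptions, a plain Transformer encoder stands in for the Conformer blocks, and the segment-index embeddings and post-net details are omitted or reduced for brevity.

    # Masked spectrogram -> acoustic encoder -> concatenate with masked phoneme
    # embeddings -> encoder -> predicted spectrogram (speech branch, first loss)
    # and predicted phonemes (text branch, second loss).
    import torch
    import torch.nn as nn

    class JointMaskPretrainModel(nn.Module):
        def __init__(self, spec_dim=80, dim=256, n_phonemes=200, n_layers=4):
            super().__init__()
            self.acoustic_encoder = nn.Sequential(nn.Linear(spec_dim, dim), nn.ReLU(),
                                                  nn.Linear(dim, dim))
            self.phoneme_embed = nn.Embedding(n_phonemes + 1, dim)   # +1 for the [MASK] token
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # stand-in for Conformer blocks
            self.post_net = nn.Linear(dim, spec_dim)        # predicted spectrogram
            self.phoneme_head = nn.Linear(dim, n_phonemes)  # predicted phoneme tokens

        def forward(self, masked_spec, masked_phoneme_ids):
            # masked_spec: (batch, frames, spec_dim); masked_phoneme_ids: (batch, tokens)
            acoustic = self.acoustic_encoder(masked_spec)        # acoustic embeddings
            text = self.phoneme_embed(masked_phoneme_ids)        # phoneme embeddings
            joint = torch.cat([acoustic, text], dim=1)           # splice the two streams
            hidden = self.encoder(joint)
            n_frames = masked_spec.size(1)
            pred_spec = self.post_net(hidden[:, :n_frames])          # feeds the first loss value
            pred_phonemes = self.phoneme_head(hidden[:, n_frames:])  # feeds the second loss value
            return pred_spec, pred_phonemes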
Therefore, in the embodiment of the disclosure, joint mask learning can be performed according to the voice matrix features and the text sample data to pretrain the voice synthesis model, so that the voice synthesis model can fully learn the voice matrix features and the text features, the performance is improved, and when the voice synthesis model is reused, a spectrogram with higher accuracy can be generated, and further, the condition is provided for improving the quality of voice synthesis.
It should be noted that the above examples are illustrative only and should not be taken as limiting the pretraining process in the embodiments of the present disclosure.
According to the embodiment of the disclosure, voice sample data and text sample data corresponding to the voice sample data can be acquired first, feature extraction can then be performed on the voice sample data to generate sample voice features, voice matrix features can be generated according to the sample voice features and the cross-language phoneme table, and the audio matrix in the voice matrix features can then be masked in the first masking mode while the text features in the text sample data are masked in the second masking mode to pre-train the speech synthesis model. In this way, the speech synthesis model is pre-trained by extracting features from the voice sample data to generate sample voice features, generating voice matrix features using the sample voice features and the cross-language phoneme table, and then performing joint mask learning on the audio matrix and the text features; that is, the audio features and the text features are fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.
In order to implement the above embodiment, the present disclosure further proposes a pre-training device based on a speech synthesis model.
Fig. 3 is a schematic structural diagram of a pre-training device based on a speech synthesis model according to an embodiment of the disclosure.
As shown in fig. 3, the pre-training device 300 based on the speech synthesis model includes: the device comprises an acquisition module 310, an extraction module 320, a first generation module 330 and a processing module 340.
The obtaining module 310 is configured to obtain voice sample data and text sample data corresponding to the voice sample data.
The extracting module 320 is configured to perform feature extraction on the voice sample data to generate a sample voice feature.
A first generating module 330, configured to generate a speech matrix feature according to the sample speech feature and the cross-language phone table.
And the processing module 340 is configured to perform joint mask learning according to the speech matrix features and the text sample data, so as to pretrain the speech synthesis model.
Optionally, the processing module 340 is further configured to:
And carrying out voice coding on the voice matrix characteristics so that the characteristic dimensions of the voice matrix characteristics and the characteristic dimensions of the text sample data are the same.
Optionally, the processing module 340 is specifically configured to:
masking the audio matrix in the voice matrix features in a first masking mode, and masking the text features in the text sample data in a second masking mode.
Optionally, the first masking mode is a masking mode of a continuous interval level, and the second masking mode is a discrete masking mode.
Optionally, the masking positions of the first masking mode are different from the masking positions of the second masking mode.
Optionally, the processing module 340 is further configured to:
and pre-training the voice matrix features by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
Optionally, the apparatus further comprises a second generation module configured to:
Acquiring a first language phoneme and a second language phoneme;
and learning a mapping relationship between the first language phonemes and the second language phonemes through a mapping learning model to form the cross-language phoneme table.
The functions and specific implementation principles of the foregoing modules in the embodiments of the present disclosure may refer to the foregoing method embodiments, and are not repeated herein.
According to the pre-training apparatus based on a speech synthesis model of the present disclosure, voice sample data and text sample data corresponding to the voice sample data can be obtained first, feature extraction can then be performed on the voice sample data to generate sample voice features, voice matrix features can then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning can then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, the speech synthesis model is pre-trained by extracting features from the voice sample data to generate sample voice features, generating voice matrix features using the sample voice features and the cross-language phoneme table, and then performing joint mask learning on the voice matrix features and the text sample data; that is, the joint training of speech features and text features is fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 4 illustrates a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 400 includes a computing unit 401 that can perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In RAM 403, various programs and data required for the operation of device 400 may also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other by a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
Various components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, etc.; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408, such as a magnetic disk, optical disk, etc.; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 401 performs the various methods and processes described above, such as a pre-training method based on a speech synthesis model. For example, in some embodiments, the speech synthesis model based pre-training method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the speech synthesis model based pre-training method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the pre-training method based on the speech synthesis model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service expansibility found in traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to the technical scheme of the present disclosure, voice sample data and text sample data corresponding to the voice sample data can be acquired first, feature extraction can then be performed on the voice sample data to generate sample voice features, voice matrix features can then be generated according to the sample voice features and the cross-language phoneme table, and joint mask learning can then be performed according to the voice matrix features and the text sample data to pre-train the speech synthesis model. In this way, the speech synthesis model is pre-trained by extracting features from the voice sample data to generate sample voice features, generating voice matrix features using the sample voice features and the cross-language phoneme table, and then performing joint mask learning on the voice matrix features and the text sample data; that is, the joint training of speech features and text features is fully considered in the process of pre-training the speech synthesis model, so that the generated speech synthesis model is more accurate and reliable, which in turn provides conditions for improving speech synthesis quality.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A pre-training method based on a speech synthesis model, comprising:
acquiring voice sample data and text sample data corresponding to the voice sample data;
Extracting features of the voice sample data to generate sample voice features;
generating a voice matrix feature according to the sample voice feature and the cross-language phoneme list;
and carrying out joint mask learning according to the voice matrix characteristics and the text sample data so as to pretrain a voice synthesis model.
2. The method of claim 1, wherein prior to said joint mask learning from said speech matrix features and said text sample data to pre-train a speech synthesis model, further comprising:
And carrying out voice coding on the voice matrix characteristics so that the characteristic dimensions of the voice matrix characteristics and the characteristic dimensions of the text sample data are the same.
3. The method of claim 1, wherein the performing joint mask learning based on the speech matrix features and the text sample data to pre-train a speech synthesis model comprises:
masking the audio matrix in the voice matrix features in a first masking mode, and masking the text features in the text sample data in a second masking mode.
4. A method as claimed in claim 3, wherein the first masking is a continuous interval level masking and the second masking is a discrete masking.
5. The method of claim 3, wherein the masking positions of the first masking pattern are different from the masking positions of the second masking pattern.
6. A method as claimed in claim 3, further comprising:
and pre-training the voice matrix features by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
7. The method of claim 1 wherein the cross-language phone table is obtained by:
Acquiring a first language phoneme and a second language phoneme;
and learning a mapping relationship between the first language phonemes and the second language phonemes through a mapping learning model to form the cross-language phone table.
8. A speech synthesis model-based pretraining apparatus, comprising: the acquisition module is used for acquiring voice sample data and text sample data corresponding to the voice sample data;
the extraction module is used for extracting the characteristics of the voice sample data so as to generate sample voice characteristics;
The first generation module is used for generating voice matrix characteristics according to the sample voice characteristics and the cross-language phoneme list;
and the processing module is used for carrying out joint mask learning according to the voice matrix characteristics and the text sample data so as to pretrain the voice synthesis model.
9. The apparatus of claim 8, wherein the processing module is further to:
And carrying out voice coding on the voice matrix characteristics so that the characteristic dimensions of the voice matrix characteristics and the characteristic dimensions of the text sample data are the same.
10. The apparatus of claim 8, wherein the processing module is specifically configured to:
masking the audio matrix in the voice matrix features in a first masking mode, and masking the text features in the text sample data in a second masking mode.
11. The apparatus of claim 10, wherein the first masking mode is a continuous interval level masking mode and the second masking mode is a discrete masking mode.
12. The apparatus of claim 10, wherein the masking positions of the first masking pattern are different from the masking positions of the second masking pattern.
13. The apparatus of claim 10, wherein the processing module is further to:
and pre-training the voice matrix features by adopting a residual error loss function, and pre-training the text sample data by adopting a cross entropy loss function.
14. The apparatus of claim 8, further comprising a second generation module to:
Acquiring a first language phoneme and a second language phoneme;
and learning a mapping relationship between the first language phonemes and the second language phonemes through a mapping learning model to form the cross-language phoneme list.
15. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202210552552.8A 2022-05-20 2022-05-20 Pre-training method and device based on voice synthesis model and electronic equipment Active CN114898734B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210552552.8A CN114898734B (en) 2022-05-20 2022-05-20 Pre-training method and device based on voice synthesis model and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210552552.8A CN114898734B (en) 2022-05-20 2022-05-20 Pre-training method and device based on voice synthesis model and electronic equipment

Publications (2)

Publication Number Publication Date
CN114898734A CN114898734A (en) 2022-08-12
CN114898734B (en) 2024-07-16

Family

ID=82722926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210552552.8A Active CN114898734B (en) 2022-05-20 2022-05-20 Pre-training method and device based on voice synthesis model and electronic equipment

Country Status (1)

Country Link
CN (1) CN114898734B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116386145B (en) * 2023-04-17 2023-11-03 浙江金融职业学院 Method for identifying abnormal behaviors of personnel in bank based on double cameras

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084295A (en) * 2019-05-27 2020-12-15 微软技术许可有限责任公司 Cross-language task training
CN113870835A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium based on artificial intelligence

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9484045B2 (en) * 2012-09-07 2016-11-01 Nuance Communications, Inc. System and method for automatic prediction of speech suitability for statistical modeling
EP3662467B1 (en) * 2018-10-11 2021-07-07 Google LLC Speech generation using crosslingual phoneme mapping
CN110070852B (en) * 2019-04-26 2023-06-16 平安科技(深圳)有限公司 Method, device, equipment and storage medium for synthesizing Chinese voice
US11587551B2 (en) * 2020-04-07 2023-02-21 International Business Machines Corporation Leveraging unpaired text data for training end-to-end spoken language understanding systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112084295A (en) * 2019-05-27 2020-12-15 微软技术许可有限责任公司 Cross-language task training
CN113870835A (en) * 2021-09-27 2021-12-31 平安科技(深圳)有限公司 Speech synthesis method, apparatus, device and storage medium based on artificial intelligence

Also Published As

Publication number Publication date
CN114898734A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN112382271B (en) Voice processing method, device, electronic equipment and storage medium
CN112259089B (en) Speech recognition method and device
CN112528637B (en) Text processing model training method, device, computer equipment and storage medium
CN112750419A (en) Voice synthesis method and device, electronic equipment and storage medium
CN114416934B (en) Multi-modal dialog generation model training method and device and electronic equipment
CN115309877B (en) Dialogue generation method, dialogue model training method and device
CN114360557B (en) Voice tone conversion method, model training method, device, equipment and medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN109697978B (en) Method and apparatus for generating a model
US20230096805A1 (en) Contrastive Siamese Network for Semi-supervised Speech Recognition
CN112860871B (en) Natural language understanding model training method, natural language understanding method and device
CN112861548A (en) Natural language generation and model training method, device, equipment and storage medium
CN114495977B (en) Speech translation and model training method, device, electronic equipment and storage medium
CN114898734B (en) Pre-training method and device based on voice synthesis model and electronic equipment
CN113706669A (en) Animation synthesis method and device, electronic equipment and storage medium
CN115358243A (en) Training method, device, equipment and storage medium for multi-round dialogue recognition model
CN115565186B (en) Training method and device for character recognition model, electronic equipment and storage medium
CN113920987B (en) Voice recognition method, device, equipment and storage medium
US20230081543A1 (en) Method for synthetizing speech and electronic device
CN114758649B (en) Voice recognition method, device, equipment and medium
JP7349523B2 (en) Speech recognition method, speech recognition device, electronic device, storage medium computer program product and computer program
CN113889073B (en) Voice processing method and device, electronic equipment and storage medium
CN115240696A (en) Speech recognition method and readable storage medium
CN114170997A (en) Pronunciation skill detection method, pronunciation skill detection device, storage medium and electronic equipment
CN114333848A (en) Voiceprint recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant