CN116469371A - Speech synthesis method and device, electronic equipment and storage medium


Info

Publication number
CN116469371A
Authority
CN
China
Prior art keywords
target
duration
phoneme
data
phonemes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310427654.1A
Other languages
Chinese (zh)
Inventor
郭洋
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310427654.1A
Publication of CN116469371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction using predictive techniques
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a speech synthesis method and apparatus, an electronic device and a storage medium, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target text; inputting the target text into a preset encoder for acoustic encoding to obtain original phonemes; obtaining acoustic feature data according to the original phonemes; inputting the acoustic feature data into a preset target duration prediction model for duration prediction to obtain a target duration; inputting the target duration and the acoustic feature data into a preset target phoneme distribution prediction model for phoneme distribution prediction to obtain target phoneme distribution data, where the target phoneme distribution data characterize the distribution differences of all original phonemes within the target duration; up-sampling the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes; and performing speech synthesis according to the target phonemes to obtain the target speech. The embodiment of the application can improve the accuracy of speech synthesis.

Description

Speech synthesis method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method and apparatus for speech synthesis, an electronic device, and a storage medium.
Background
Currently, in speech synthesis tasks, the phoneme sequence is matched to the spectrum sequence by hard extension of the phoneme sequence. With this approach, any error in the duration prediction directly degrades the accuracy of speech synthesis.
Disclosure of Invention
The embodiment of the application mainly aims to provide a speech synthesis method and apparatus, an electronic device and a storage medium, so as to improve the accuracy of speech synthesis.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech synthesis method, including:
acquiring a target text;
inputting the target text to a preset encoder for acoustic encoding to obtain an original phoneme;
obtaining acoustic characteristic data according to the original phonemes;
inputting the acoustic characteristic data into a preset target duration prediction model for duration prediction to obtain target duration;
inputting the target duration and the acoustic characteristic data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data; wherein the target phoneme distribution data is used for representing distribution difference conditions of all original phonemes in the target duration;
performing up-sampling processing on the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes;
and performing voice synthesis according to the target phonemes to obtain target voices.
In some embodiments, the obtaining acoustic feature data from the original phonemes includes:
acquiring target identification data of a target speaking object;
and carrying out combination processing according to the target identification data and the original phonemes to obtain the acoustic characteristic data.
In some embodiments, before the acoustic feature data is input to a preset target duration prediction model to perform duration prediction, and a target duration is obtained, the method further includes training the target duration prediction model, and specifically includes:
acquiring a first sample phoneme of a first sample text and a first sample duration of the first sample phoneme;
acquiring first object identification data of a preset first sample speaking object;
inputting the first sample phonemes and the first object identification data into a preset original duration prediction model to predict duration, so as to obtain initial duration;
and carrying out parameter adjustment on the original duration prediction model according to the initial duration and the first sample duration to obtain the target duration prediction model.
In some embodiments, before the inputting the target duration and the acoustic feature data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data, the method further includes training the target phoneme distribution prediction model, and specifically includes:
acquiring a second sample phoneme of a second sample text and a second sample duration of the second sample phoneme;
acquiring second object identification data of a preset second sample speaking object;
obtaining sample phoneme distribution data according to the second object identification data and the second sample phonemes; wherein the sample phoneme distribution data is used to characterize the distribution difference situation of all the second sample phonemes within the second sample duration;
inputting the second object identification data, the second sample duration and the second sample phonemes into a preset original phoneme distribution prediction model to perform phoneme distribution prediction to obtain initial phoneme distribution data;
and carrying out parameter adjustment on the original phoneme distribution prediction model according to the initial phoneme distribution data and the sample phoneme distribution data to obtain the target phoneme distribution prediction model.
In some embodiments, the up-sampling processing is performed on the original phoneme according to the target phoneme distribution data and the target duration to obtain a target phoneme, including:
obtaining duration center data according to the target duration;
and carrying out up-sampling processing on the original phonemes according to the duration center data and the target phoneme distribution data to obtain the target phonemes.
In some embodiments, the performing speech synthesis according to the target phoneme to obtain a target speech includes:
inputting the target phonemes to a preset decoder for feature conversion to obtain a voice frequency spectrum;
and inputting the voice frequency spectrum to a preset vocoder to perform voice synthesis to obtain the target voice.
In some embodiments, the target phoneme distribution prediction model comprises a feature extraction layer and a feature conversion layer; wherein the feature extraction layer comprises a convolution layer and a normalization layer, and the feature conversion layer comprises a linear layer and an activation layer;
inputting the target duration and the acoustic feature data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data, wherein the method comprises the following steps:
inputting the target duration and the acoustic feature data into the feature extraction layer to perform feature extraction to obtain initial phoneme distribution data;
and inputting the initial phoneme distribution data into the feature conversion layer to perform feature conversion to obtain the target phoneme distribution data.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis apparatus, the apparatus comprising:
the text acquisition module is used for acquiring a target text;
the encoding module is used for inputting the target text into a preset encoder for acoustic encoding to obtain an original phoneme; obtaining acoustic characteristic data according to the original phonemes;
the duration prediction module is used for inputting the acoustic characteristic data into a preset target duration prediction model to predict the duration, so as to obtain the target duration;
the parameter range prediction module is used for inputting the target duration time and the acoustic characteristic data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction so as to obtain target phoneme distribution data; wherein the target phoneme distribution data is used for representing distribution difference conditions of all original phonemes in the target duration;
the up-sampling module is used for up-sampling the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes;
and the voice generation module is used for carrying out voice synthesis according to the target phonemes to obtain target voices.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
According to the speech synthesis method and apparatus, the electronic device and the storage medium provided by the embodiments of the application, the target duration of the original phonemes is predicted by the trained target duration prediction model. The distribution differences of the original phonemes within the target duration are then predicted by the target phoneme distribution prediction model, and the original phonemes are up-sampled according to the predicted target phoneme distribution data and the target duration, so that the sequence length of the target phonemes is aligned with the spectrum sequence length used for speech synthesis. The speech synthesis method provided by the embodiments of the application therefore takes into account the distribution difference of each phoneme in the extended phoneme sequence, avoiding the problem in the related art that hard extension degrades speech synthesis accuracy.
Drawings
FIG. 1 is a flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 2 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 3 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 4 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 5 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 6 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 7 is another flowchart of a speech synthesis method provided by an embodiment of the present application;
fig. 8 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 9 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information process of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Speech synthesis (TTS): also called text-to-speech conversion, is a technique of converting text information generated by a computer itself or input externally into intelligible and fluent speech output. In the speech synthesis technology, it is mainly divided into a language analysis section and an acoustic system section, also called front-end section and back-end section. The language analysis part is used for mainly analyzing the input text information to generate a corresponding linguistic specification; the acoustic system part mainly generates corresponding audio according to the linguistic specification provided by the linguistic analysis part, thereby realizing the function of sounding. Specifically, the language analysis part comprises character input, text structure and language judgment, text standardization, text-to-phoneme conversion, sentence reading prosody prediction and the like.
The text structure and language judgment determines the language of the input text to be synthesized, for example whether it is Chinese, English or Japanese, then segments the whole text to be synthesized into single sentences according to the grammar rules of the corresponding language, and passes the segmented sentences to the subsequent processing module. Text normalization normalizes all contents of the text to be synthesized; for example, when Arabic numerals or letters appear in the text, they need to be converted into characters according to set rules to facilitate the subsequent phonetic transcription work. Text-to-phoneme conversion determines the pronunciation of the text currently being synthesized; for example, in Chinese speech synthesis the text is mainly annotated with pinyin, so text-to-phoneme conversion converts the text into the corresponding pinyin. When some characters are polyphonic, their specific reading, including which tone they take, is determined through word segmentation, part-of-speech tagging, syntactic analysis and the like. Prosody prediction predicts the prosody of the text, i.e. determines where the synthesized speech should pause and for how long, which words should be stressed and which read lightly, so that the synthesized speech rises and falls with a natural cadence and more closely imitates a human voice. The acoustic system part is mainly implemented in three ways: waveform concatenation speech synthesis, parametric speech synthesis and end-to-end speech synthesis. Waveform concatenation speech synthesis splices syllables from an existing library to realize the speech synthesis function. Parametric speech synthesis mainly models the spectral feature parameters of existing recordings by data-driven methods, constructs a mapping from the text sequence to the speech features, and generates a parametric synthesizer. End-to-end speech synthesis learns through a neural network: it directly takes text or phonetic characters as input and outputs the synthesized audio, with the intermediate part acting as a black box.
One-hot Encoding: also known as one-bit-effective encoding, uses an N-bit register to encode N states; each state has its own register bit, and only one bit is valid at any time, i.e. only one bit is 1 and the remaining bits are 0. One-hot encoding thus represents categorical parameters with 0s and 1s, using an N-bit state register to encode N states. Before one-hot encoding appeared, the classifiers of machine learning algorithms could not handle unordered discrete categorical features, because the data a classifier typically handles is continuous and ordered. A mapping table can be built to make these discrete feature data ordered and continuous. For example, the attribute information of a person may be mapped as follows: the gender feature [male, female] can be mapped to the values (0, 1); the residence feature [place A, place B, place C] can be mapped to the values (0, 1, 2); the occupation feature [actor, teacher, crew member, engineer, firefighter] can be mapped to the values (0, 1, 2, 3, 4). Thus, sample A (female, place A, crew member) can be mapped to the features (1, 0, 2). When the attribute information is converted with one-hot encoding, according to the principle that an N-bit register encodes N states, the attribute information can be converted into the following ordered and continuous form: for the gender feature [male, female], "male" may be encoded as 10 and "female" as 01, where N=2; for the residence feature [place A, place B, place C], "place A" may be encoded as 100, "place B" as 010 and "place C" as 001, where N=3; for the occupation feature [actor, teacher, crew member, engineer, firefighter], "actor" may be encoded as 10000, "teacher" as 01000, "crew member" as 00100, "engineer" as 00010 and "firefighter" as 00001, where N=5. Correspondingly, the attribute information of sample A is encoded as (0,1,1,0,0,0,0,1,0,0).
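For illustration only, the following minimal Python sketch reproduces the mapping described above for sample A; the attribute lists are the same toy example, and the helper function is merely an example, not a limitation of the embodiments of the present application.

```python
# A minimal sketch of the one-hot mapping described above; the attribute lists
# and sample A follow the toy example in the text, everything else is illustrative.
def one_hot(index, num_states):
    """Encode one of num_states states with an N-bit register: exactly one bit is 1."""
    code = [0] * num_states
    code[index] = 1
    return code

genders = ["male", "female"]                                                  # N = 2
residences = ["place A", "place B", "place C"]                                # N = 3
occupations = ["actor", "teacher", "crew member", "engineer", "firefighter"]  # N = 5

# Sample A: (female, place A, crew member)
sample_a = (one_hot(genders.index("female"), 2)
            + one_hot(residences.index("place A"), 3)
            + one_hot(occupations.index("crew member"), 5))
print(sample_a)  # [0, 1, 1, 0, 0, 0, 0, 1, 0, 0]
```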
Vocoder: converts acoustic features into a playable speech waveform. Currently, vocoders can be classified into phase-reconstruction-based vocoders and neural-network-based vocoders. A phase-reconstruction-based vocoder is used mainly because the acoustic features used in TTS (such as the mel spectrum) have already lost the phase information, so an algorithm is used to estimate the phase features and reconstruct the speech waveform. A neural-network-based vocoder directly maps acoustic features to the speech waveform, so the synthesized speech has higher sound quality.
Currently, in speech synthesis tasks, the phoneme sequence is matched to the spectrum sequence by hard extension of the phoneme sequence. With this approach, any error in the duration prediction directly degrades the accuracy of speech synthesis.
Based on this, the embodiments of the application provide a speech synthesis method and apparatus, an electronic device and a storage medium, aiming to improve the accuracy of speech synthesis.
The speech synthesis method and apparatus, the electronic device and the storage medium provided in the embodiments of the present application are described in detail through the following embodiments; the speech synthesis method in the embodiments of the present application is described first.
The embodiment of the application can acquire and process the related data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a voice synthesis method, which relates to the technical field of artificial intelligence. The voice synthesis method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements a speech synthesis method, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 1 is an optional flowchart of a speech synthesis method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107:
step S101, acquiring a target text;
step S102, inputting a target text to a preset encoder for acoustic encoding to obtain an original phoneme;
step S103, acoustic characteristic data are obtained according to the original phonemes;
step S104, inputting acoustic feature data into a preset target duration prediction model for duration prediction to obtain target duration;
step S105, inputting target duration and acoustic feature data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data; the target phoneme distribution data are used for representing distribution difference conditions of all original phonemes in the target duration;
step S106, up-sampling processing is carried out on the original phonemes according to the target phoneme distribution data and the target duration time to obtain target phonemes;
step S107, performing voice synthesis according to the target phonemes to obtain target voices.
In steps S101 to S107 illustrated in the embodiments of the present application, the target duration of the original phonemes is predicted by the trained target duration prediction model. The distribution differences of the original phonemes within the target duration are then predicted by the target phoneme distribution prediction model, and the original phonemes are up-sampled according to the predicted target phoneme distribution data and the target duration, so that the sequence length of the target phonemes is aligned with the spectrum sequence length used for speech synthesis. The speech synthesis method provided by the embodiments of the application therefore takes into account the distribution difference of each phoneme in the extended phoneme sequence, avoiding the impact that inaccurate duration prediction has on speech synthesis accuracy in the hard extension of the related art.
In step S101 of some embodiments, the target text on which the speech synthesis operation is to be performed is obtained, by means of an API (Application Programming Interface) or the like, from a pre-matched terminal, server, or application program loaded in the terminal.
In step S102 of some embodiments, preprocessing operations such as text structure and language judgment, text normalization, text-to-phoneme conversion and sentence prosody prediction are performed on the target text by a preset encoder to obtain a plurality of original phonemes, and the plurality of original phonemes form the phoneme sequence corresponding to the target text. It is understood that a phoneme is the minimum unit or minimum speech segment constituting a syllable, and is the minimum linear speech unit divided from the viewpoint of sound quality. For example, when the target text is the word "mandarin" (putonghua), the preset encoder yields the eight original phonemes "p, u, t, o, ng, h, u, a".
In step S103 of some embodiments, the original phoneme is taken as acoustic feature data, which is input data of the target duration prediction model. It will be appreciated that, according to the requirements of speech synthesis, for example, when the target speech desired to be synthesized has specific voice features such as a pitch, a rhythm, an emotion, etc., since different speech features will have different effects on the duration prediction of the same original phoneme, the above features may also be used as acoustic feature data, that is, the acoustic feature data may include the original phoneme, or the original phoneme and the speech feature, which is not specifically limited in this embodiment of the present application.
Referring to fig. 2, in some embodiments, step S103 includes, but is not limited to, steps S201 through S202:
step S201, obtaining target identification data of a target speaking object;
step S202, carrying out combination processing according to the target identification data and the original phonemes to obtain acoustic characteristic data.
In step S201 of some embodiments, the speech synthesis method provided in the embodiments of the present application may be applied to a multi-speaker scene, that is, a target speech matching the characteristics of a selected target speaking object, such as its pitch and timbre, can be generated. Therefore, different object identification data are set in advance for different speaking objects, and the object identification data of the selected target speaking object are taken as the target identification data. It is understood that the object identification data may take the form of custom codes, one-hot codes, and the like, which is not specifically limited in this embodiment of the present application. Taking one-hot codes as an example, suppose each speaking object is described by two voice attributes (for example timbre and pitch), each with two possible values; a preset object A has value 1 of the first attribute and value 2 of the second, a preset object B has value 2 of the first attribute and value 1 of the second, the one-hot code corresponding to object A is (0, 1, 0), and the one-hot code corresponding to object B is (1, 0, 1). It will be appreciated that the speaking objects may also be encoded directly; for example, if the preset speaking objects include object A, object B and object C, the object identification data of object A may be 100, that of object B 010, and that of object C 001.
In step S202 of some embodiments, the target identification data and the original phoneme are used together as acoustic feature data, that is, the input data of the target duration prediction model includes the target identification data and the original phoneme, so that the target duration predicted by the target duration prediction model is the real duration of the target speaking object speaking the original phoneme. It will be appreciated that the target identification data and the original phonemes may be input to the target duration prediction model in the form of an array, matrix, or the like, and embodiments of the present application are not particularly limited.
According to the speech synthesis method, the target identification data and the original phonemes are used together as the input of the target duration prediction model, so that the target duration is the real duration of the matched target speaking object, which improves the naturalness and similarity of the synthesized target speech.
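It can be understood that the embodiments of the present application do not limit how the combination processing in step S202 is implemented. The following Python sketch shows only one plausible reading: it assumes the target identification data are turned into an embedding and added to every phoneme encoding; the embedding dimension and the additive combination are assumptions, and concatenation along the feature axis would be an equally valid reading of the combination processing.

```python
import torch

# A hedged sketch of step S202: combining target identification data (a speaker
# ID embedding) with the encoded phoneme sequence to form the acoustic feature data.
class AcousticFeatureCombiner(torch.nn.Module):
    def __init__(self, num_speakers: int, hidden_dim: int = 256):
        super().__init__()
        self.speaker_embedding = torch.nn.Embedding(num_speakers, hidden_dim)

    def forward(self, phoneme_encodings: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # phoneme_encodings: (batch, N_phonemes, hidden_dim) from the preset encoder
        # speaker_id:        (batch,) integer identifier of the target speaking object
        spk = self.speaker_embedding(speaker_id).unsqueeze(1)   # (batch, 1, hidden_dim)
        return phoneme_encodings + spk                          # broadcast over the phoneme axis
```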
Referring to fig. 3, in some embodiments, before step S104, the speech synthesis method provided in the embodiments of the present application further includes training a target duration prediction model, including, but not limited to, steps S301 to S304:
step S301, acquiring a first sample phoneme of a first sample text and a first sample duration of the first sample phoneme;
step S302, acquiring first object identification data of a preset first sample speaking object;
step S303, inputting a first sample phoneme and first object identification data into a preset original duration prediction model to predict duration, so as to obtain an initial duration;
and step S304, carrying out parameter adjustment on the original duration prediction model according to the initial duration and the first sample duration to obtain a target duration prediction model.
In step S301 of some embodiments, a preset first sample text is input to the corresponding encoder, so that preprocessing operations such as text structure and language judgment, text normalization, text-to-phoneme conversion and sentence prosody prediction are performed on the first sample text to obtain a plurality of first sample phonemes. A preset duration label (i.e., the first sample duration) of each first sample phoneme is then acquired.
In step S302 of some embodiments, a speaking object library is preset, and the speaking object library includes a plurality of speaking objects and the object identification data corresponding to each speaking object. It can be understood that the object identification data in the speaking object library may take the form of custom codes, one-hot codes, and the like, which is not specifically limited in this embodiment of the present application. Taking one-hot codes as an example, suppose each speaking object is described by two voice attributes (for example timbre and pitch), each with two possible values; a preset object A has value 1 of the first attribute and value 2 of the second, a preset object B has value 2 of the first attribute and value 1 of the second, the one-hot code corresponding to object A is (0, 1, 0), and the one-hot code corresponding to object B is (1, 0, 1). It will be appreciated that the speaking objects may also be encoded directly; for example, if the preset speaking objects include object A, object B and object C, the object identification data of object A may be 100, that of object B 010, and that of object C 001. Any one of the speaking objects is obtained from the speaking object library as the first sample speaking object, and the object identification data corresponding to the first sample speaking object is taken as the first object identification data.
In step S303 of some embodiments, the first sample phoneme and the first object identification data are used as input data of a preset original duration prediction model to perform duration prediction, so as to obtain a predicted duration (i.e., an initial duration) corresponding to the first sample phoneme and the first object identification data.
In step S304 of some embodiments, loss data are calculated from the initial duration, the first sample duration and a preset loss function to determine the gap between the predicted duration and the real duration (i.e., the first sample duration). The parameters of the original duration prediction model are then adjusted according to the loss data, thereby obtaining the target duration prediction model.
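For illustration only, the following sketch shows one possible parameter-adjustment step corresponding to steps S303 to S304; the embodiments of the present application only specify "a preset loss function", so the mean-squared error loss and the gradient-based optimizer used here are assumptions.

```python
import torch
import torch.nn.functional as F

# A hedged sketch of one training step for the original duration prediction model.
# The model signature and the choice of MSE loss are illustrative assumptions.
def duration_training_step(model, optimizer, sample_phonemes, object_id, sample_durations):
    initial_durations = model(sample_phonemes, object_id)    # predicted duration per phoneme
    loss = F.mse_loss(initial_durations, sample_durations)   # gap to the real (label) durations
    optimizer.zero_grad()
    loss.backward()                                           # adjust parameters from the loss data
    optimizer.step()
    return loss.item()
```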
It can be understood that, in order to ensure that the target duration prediction model can predict the duration of phonemes of different speaking objects, traversal operation needs to be performed on all speaking objects in the speaking object library, so as to improve the prediction capability of the target duration prediction model.
According to the speech synthesis method, the original duration prediction model is trained with real durations to obtain the target duration prediction model, which avoids the accumulation of duration prediction errors caused by a mismatch between the training process and the prediction process, and therefore improves the accuracy of speech synthesis to a certain extent.
In step S104 of some embodiments, the acoustic feature data are used as the input of a preset target duration prediction model, where the target duration prediction model is a model trained in advance that can output the corresponding target duration according to the acoustic feature data. It will be appreciated that the target duration characterizes the speaking duration of the corresponding phoneme in the target speech. The target duration prediction model comprises two feature extraction layers and a linear layer, where each feature extraction layer comprises a one-dimensional convolution layer and a layer normalization layer. It will be appreciated that the above structure of the target duration prediction model is merely exemplary, and the embodiments of the present application are not specifically limited thereto.
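A minimal sketch of this exemplary structure is given below; the two convolution + layer normalization feature extraction layers and the final linear layer follow the description above, while the channel width, kernel size and ReLU activations are assumptions not fixed by the text.

```python
import torch

# A hedged sketch of the duration predictor structure: two feature extraction
# layers (1-D convolution + layer normalization) followed by a linear layer.
class DurationPredictor(torch.nn.Module):
    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv1 = torch.nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = torch.nn.LayerNorm(hidden_dim)
        self.conv2 = torch.nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = torch.nn.LayerNorm(hidden_dim)
        self.linear = torch.nn.Linear(hidden_dim, 1)

    def forward(self, acoustic_features: torch.Tensor) -> torch.Tensor:
        # acoustic_features: (batch, N_phonemes, hidden_dim)
        x = torch.relu(self.conv1(acoustic_features.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x)
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm2(x)
        return self.linear(x).squeeze(-1)   # (batch, N): one target duration per phoneme
```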
Referring to fig. 4, in some embodiments, before step S105, the speech synthesis method provided in the embodiments of the present application further includes training a target phoneme distribution prediction model, including, but not limited to, steps S401 to S405:
step S401, acquiring a second sample phoneme of a second sample text and a second sample duration of the second sample phoneme;
step S402, obtaining second object identification data of a preset second sample speaking object;
step S403, obtaining sample phoneme distribution data according to the second object identification data and the second sample phonemes; wherein the sample phoneme distribution data is used for representing distribution difference conditions of all second sample phonemes in the second sample duration;
step S404, inputting the second object identification data, the second sample duration and the second sample phonemes into a preset original phoneme distribution prediction model to perform phoneme distribution prediction to obtain initial phoneme distribution data;
and step S405, carrying out parameter adjustment on the original phoneme distribution prediction model according to the initial phoneme distribution data and the sample phoneme distribution data to obtain a target phoneme distribution prediction model.
In step S401 of some embodiments, a preset second sample text is input to the corresponding encoder, so that preprocessing operations such as text structure and language judgment, text normalization, text-to-phoneme conversion and sentence prosody prediction are performed on the second sample text to obtain a plurality of second sample phonemes. A preset duration label (i.e., the second sample duration) of each second sample phoneme is then acquired.
In step S402 of some embodiments, a speaking object library is preset, and the speaking object library includes a plurality of speaking objects and the object identification data corresponding to each speaking object. It can be understood that the object identification data in the speaking object library may take the form of custom codes, one-hot codes, and the like, which is not specifically limited in this embodiment of the present application; as described above, the speaking objects may be represented by one-hot codes over their voice attributes or encoded directly (for example, object A as 100, object B as 010 and object C as 001). Any one of the speaking objects is obtained from the speaking object library as the second sample speaking object, and the object identification data corresponding to the second sample speaking object is taken as the second object identification data. It will be appreciated that, in order to ensure that the target phoneme distribution prediction model is trained synchronously with the target duration prediction model, the first sample speaking object and the second sample speaking object should be the same speaking object.
In step S403 of some embodiments, the corresponding sample phoneme distribution data is obtained by searching from a preset speaking object library according to the second object identification data and the second sample phonemes. It is understood that the sample phoneme distribution data is label data set in advance according to the second object identification data and the second sample phonemes, and is used for representing the sounding situation of the second sample phonemes by the second sample object in the second sample duration in the real situation.
In step S404 of some embodiments, the second object identification data, the second sample duration, and the second sample phoneme are used as input data of the original phoneme distribution prediction model, so as to obtain prediction distribution data (i.e., initial phoneme distribution data) corresponding to the second sample phoneme, the second object identification data, and the second sample duration.
In step S405 of some embodiments, the penalty data is calculated according to the initial phoneme distribution data, the sample phoneme distribution data, and the preset penalty function to determine a gap between the initial phoneme distribution data and the sample phoneme distribution data. And carrying out parameter adjustment on the original phoneme distribution prediction model according to the loss data, so as to obtain a target phoneme distribution prediction model.
According to the speech synthesis method, a target phoneme distribution prediction model for predicting the phoneme distribution is trained, so that the distribution differences of the original phonemes within the target duration can be predicted, which improves the naturalness of the target speech obtained from the target phonemes.
In step S105 of some embodiments, the target duration and the acoustic feature data are used as the input of a preset target phoneme distribution prediction model, where the target phoneme distribution prediction model is a model trained in advance that can predict the distribution differences of the original phonemes within the target duration. It will be appreciated that, since the sequence length of a phoneme sequence is typically smaller than the sequence length of a speech spectrum such as a mel spectrum, one phoneme should correspond to a plurality of mel-spectrum frames so that the length of the phoneme sequence matches the length of the mel-spectrum sequence. In the embodiments of the present application, the target phoneme distribution data obtained through the target phoneme distribution prediction model are used to add a disturbance to the plurality of original phonemes within the target duration, so that the synthesized target speech is closer to real human pronunciation and the naturalness of the target speech is improved.
Referring to fig. 5, in some embodiments, step S105 includes, but is not limited to, steps S501 through S502:
step S501, inputting target duration time and acoustic feature data into a feature extraction layer for feature extraction to obtain initial phoneme distribution data;
step S502, inputting the initial phoneme distribution data into a feature conversion layer for feature conversion to obtain target phoneme distribution data.
In step S501 of some embodiments, the target phoneme distribution prediction model includes two feature extraction layers and one feature conversion layer. Each feature extraction layer includes a one-dimensional convolution layer and a normalization layer. The one-dimensional convolution layer extracts features from the input data, and the normalization layer normalizes the output of the one-dimensional convolution layer to obtain the initial phoneme distribution data. It is understood that the normalization layer may adopt any normalization manner such as BN (Batch Normalization) or LN (Layer Normalization), which is not specifically limited in the embodiments of the present application.
In step S502 of some embodiments, the feature conversion layer includes a linear layer and an activation layer. The linear layer performs a dimension conversion operation on the initial phoneme distribution data, and the activation layer performs a nonlinear mapping on the output of the linear layer according to a preset activation function to obtain the target phoneme distribution data. It can be understood that the preset activation function may be selected adaptively according to actual needs, which is not specifically limited in this embodiment of the present application. For example, the Softplus activation function may be selected, because its output is suited to producing the parameters of a normal distribution, so that the target phoneme distribution prediction model can ultimately output target phoneme distribution data used to determine the Gaussian distribution variance values, providing sampling parameters for the subsequent upsampling operation.
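Putting steps S501 and S502 together, the following sketch shows one possible form of such a model; the two convolution + normalization feature extraction layers and the linear + Softplus feature conversion layer follow the description above, while the way the target duration is injected (concatenated as an extra feature channel), the channel width and the kernel size are assumptions.

```python
import torch

# A hedged sketch of the target phoneme distribution prediction model: two feature
# extraction layers (1-D convolution + normalization) and a feature conversion
# layer (linear layer + Softplus) whose positive output can serve as the Gaussian
# variance parameter of each original phoneme.
class PhonemeDistributionPredictor(torch.nn.Module):
    def __init__(self, hidden_dim: int = 256, kernel_size: int = 3):
        super().__init__()
        self.conv1 = torch.nn.Conv1d(hidden_dim + 1, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.norm1 = torch.nn.LayerNorm(hidden_dim)
        self.conv2 = torch.nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.norm2 = torch.nn.LayerNorm(hidden_dim)
        self.linear = torch.nn.Linear(hidden_dim, 1)
        self.activation = torch.nn.Softplus()

    def forward(self, acoustic_features: torch.Tensor, target_durations: torch.Tensor) -> torch.Tensor:
        # acoustic_features: (batch, N, hidden_dim); target_durations: (batch, N)
        x = torch.cat([acoustic_features, target_durations.unsqueeze(-1)], dim=-1)
        x = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm1(x)
        x = torch.relu(self.conv2(x.transpose(1, 2))).transpose(1, 2)
        x = self.norm2(x)
        return self.activation(self.linear(x)).squeeze(-1)   # (batch, N): one sigma per phoneme
```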
In step S106 of some embodiments, the original phonemes are extended according to the target phoneme distribution data and the target duration to obtain the target phonemes, so that the sequence length corresponding to the target phonemes matches the sequence length of the mel spectrum. For example, in the related art, when the target duration of the original phoneme p in the target text "mandarin" is predicted to be 3 by the target duration prediction model, the original phoneme p is extended to the sequence {p, p, p}. In such an extended sequence the distribution of each target phoneme within the target duration is the same, i.e., the target phonemes all have the same weight within the target duration. In the embodiments of the present application, a disturbance is instead added to the original phonemes within the target duration according to the target phoneme distribution data, so that distribution differences exist among the target phonemes in the extended sequence (i.e., the sequence corresponding to the target phonemes), improving the accuracy and authenticity of the target speech.
Referring to fig. 6, in some embodiments, step S106 includes, but is not limited to, steps S601 through S602:
step S601, obtaining duration center data according to target duration;
step S602, up-sampling processing is carried out on the original phonemes according to the duration center data and the target phoneme distribution data, and the target phonemes are obtained.
In step S601 of some embodiments, a Gaussian-distributed disturbance is added to the plurality of original phonemes within the target duration to improve the naturalness of the target speech obtained from the target phonemes. Therefore, the duration center data C_i are calculated from the target duration according to the following formula (1), and the duration center data C_i are taken as the mean of the Gaussian distribution, i.e., the center of the distribution of the original phonemes over the target duration:

C_i = \sum_{j=1}^{i-1} d_j + \frac{1}{2} d_i    (1)

where d_i denotes the target duration predicted for the i-th original phoneme by the target duration prediction model, and d_j denotes the target duration predicted for the j-th original phoneme by the target duration prediction model.

In step S602 of some embodiments, the distribution differences of the original phonemes within the target duration can be determined from the target phoneme distribution data, so the variance of the Gaussian distribution can be determined from the target phoneme distribution data, and the alignment basis for the up-sampling processing (i.e., the alignment matrix after normalization) is calculated according to the following formula (2):

W_{t,i} = \frac{\mathcal{N}(t; C_i, \sigma_i^2)}{\sum_{j=1}^{N} \mathcal{N}(t; C_j, \sigma_j^2)}    (2)

where W_{t,i} denotes the normalized alignment matrix; \sigma_i denotes the target phoneme distribution data predicted for the i-th original phoneme by the target phoneme distribution prediction model; \sigma_j denotes the target phoneme distribution data predicted for the j-th original phoneme by the target phoneme distribution prediction model; \mathcal{N} denotes a Gaussian distribution; and N denotes the number of original phonemes obtained by inputting the target text to the preset encoder.

It will be appreciated that inputting the target text into the preset encoder yields a vector H = (h_1, h_2, ..., h_N) containing the plurality of original phonemes, so the alignment matrix is used as the weight of the corresponding original phoneme to obtain the target phoneme W_{t,i} h_i. On this basis, the result vector U = (u_1, u_2, ..., u_T) obtained by upsampling the vector H can be computed according to the following formula (3):

u_t = \sum_{i=1}^{N} W_{t,i} h_i    (3)
According to the speech synthesis method provided by the embodiments of the application, the original phonemes are up-sampled with a Gaussian distribution according to the target duration and the target phoneme distribution data, so that the naturalness of the extended target phonemes is ensured while the original phonemes are extended. Moreover, Gaussian upsampling is a differentiable upsampling operation, i.e., the mel-spectrum loss produced by subsequent operations can be propagated back into the loss used when training the target duration prediction model, which improves the training effect of the target duration prediction model and further improves the accuracy of speech synthesis.
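For illustration, the following sketch implements formulas (1) to (3) above; it assumes one predicted duration d_i and one predicted sigma_i per original phoneme, and rounding the total duration to an integer number of frames is an implementation assumption.

```python
import torch

# A hedged sketch of Gaussian upsampling: the duration centers C_i are cumulative
# durations offset by half a phoneme duration (formula (1)), the alignment matrix W
# is the normalized Gaussian density over frame indices (formula (2)), and the
# upsampled sequence U is the weighted sum of phoneme encodings H (formula (3)).
def gaussian_upsample(H: torch.Tensor, durations: torch.Tensor, sigmas: torch.Tensor) -> torch.Tensor:
    # H: (N, dim) original phoneme encodings; durations, sigmas: (N,) per phoneme
    centers = torch.cumsum(durations, dim=0) - 0.5 * durations        # formula (1): C_i
    total_frames = int(torch.round(durations.sum()).item())
    t = torch.arange(total_frames, dtype=torch.float32).unsqueeze(1)  # frame index t, shape (T, 1)
    # log of the unnormalized Gaussian density N(t; C_i, sigma_i^2) for every (t, i) pair
    log_w = -0.5 * ((t - centers) / sigmas) ** 2 - torch.log(sigmas)
    W = torch.softmax(log_w, dim=1)                                   # formula (2): normalize over phonemes
    return W @ H                                                      # formula (3): U = (u_1, ..., u_T)
```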
In step S107 of some embodiments, a speech synthesis process is performed according to a plurality of target phonemes to obtain a target speech corresponding to a target text.
Referring to fig. 7, in some embodiments, step S107 includes, but is not limited to, steps S701 through S702:
step S701, inputting a target phoneme into a preset decoder to perform feature conversion to obtain a voice frequency spectrum;
step S702, inputting the voice spectrum to a preset vocoder for voice synthesis to obtain target voice.
In step S701 of some embodiments, the target phonemes are converted into a speech spectrum, such as a mel spectrum, by a preset decoder. It will be appreciated that the decoder and encoder preset in the embodiments of the present application are both built on the FastSpeech 2 model, and each of them is composed of six FFT (feed-forward Transformer) blocks, each of which includes a multi-head attention layer and a one-dimensional convolution layer, so that more acoustic features can be combined to obtain the speech spectrum.
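For illustration only, the following sketch shows one possible FFT (feed-forward Transformer) block of this kind; the head count, channel width and kernel size are assumptions, and details such as positional encoding and dropout used in actual FastSpeech 2 implementations are omitted.

```python
import torch

# A hedged sketch of one FFT block: multi-head self-attention followed by a 1-D
# convolutional feed-forward sub-layer, each with a residual connection and layer
# normalization. Six such blocks would be stacked in the encoder and decoder.
class FFTBlock(torch.nn.Module):
    def __init__(self, hidden_dim: int = 256, num_heads: int = 2, kernel_size: int = 9):
        super().__init__()
        self.attention = torch.nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.norm1 = torch.nn.LayerNorm(hidden_dim)
        self.conv = torch.nn.Sequential(
            torch.nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
            torch.nn.ReLU(),
            torch.nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2),
        )
        self.norm2 = torch.nn.LayerNorm(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, sequence_length, hidden_dim)
        attn_out, _ = self.attention(x, x, x)
        x = self.norm1(x + attn_out)
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)
```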
In step S702 of some embodiments, the voice spectrum obtained according to the above steps is used as input data of a preset vocoder, so as to obtain a playable voice waveform according to the voice spectrum, and further obtain the target voice. It is understood that the vocoder may be a vocoder based on phase reconstruction or a vocoder based on a neural network, which is not particularly limited in the embodiments of the present application.
According to the speech synthesis method provided by the embodiments of the application, the target duration of the original phonemes is predicted by the trained target duration prediction model. The distribution differences of the original phonemes within the target duration are then predicted by the target phoneme distribution prediction model, and the original phonemes are up-sampled according to the predicted target phoneme distribution data and the target duration, so that the sequence length of the target phonemes is aligned with the spectrum sequence length used for speech synthesis. The speech synthesis method provided by the embodiments of the application therefore takes into account the distribution difference of each phoneme in the extended phoneme sequence, avoiding the problem in the related art that hard extension degrades speech synthesis accuracy. Moreover, the target identification data and the original phonemes are used together as the input of the target duration prediction model, so that the target duration is the real duration of the matched target speaking object, which improves the naturalness and similarity of the synthesized target speech.
Referring to fig. 8, an embodiment of the present application further provides a speech synthesis apparatus, including:
A text acquisition module 801, configured to acquire a target text;
the encoding module 802 is configured to input a target text to a preset encoder for acoustic encoding, so as to obtain an original phoneme; obtaining acoustic characteristic data according to the original phonemes;
the duration prediction module 803 is configured to input acoustic feature data to a preset target duration prediction model to perform duration prediction, so as to obtain a target duration;
the parameter range prediction module 804 is configured to input the target duration and the acoustic feature data to a preset target phoneme distribution prediction model to perform phoneme distribution prediction, so as to obtain target phoneme distribution data; the target phoneme distribution data are used for representing distribution difference conditions of all original phonemes in the target duration;
the up-sampling module 805 is configured to perform up-sampling processing on the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes;
the speech generating module 806 is configured to perform speech synthesis according to the target phoneme, so as to obtain a target speech.
It can be seen that the foregoing embodiments of the speech synthesis method are applicable to the embodiments of the speech synthesis apparatus, the functions specifically implemented by the embodiments of the speech synthesis apparatus are the same as those of the embodiments of the speech synthesis method, and the beneficial effects achieved by the embodiments of the speech synthesis apparatus are the same as those achieved by the embodiments of the speech synthesis method.
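To relate the modules above to code, the following is a minimal sketch, assuming PyTorch, of how the modules of the speech synthesis apparatus could be wired together; all interfaces, the way the target identification data (speaker embedding) is combined with the phoneme features, and the reuse of the gaussian_upsample helper sketched earlier are assumptions for illustration, not a prescribed implementation.

```python
# A minimal wiring sketch of the speech synthesis apparatus (illustrative only).
import torch
import torch.nn as nn

class SpeechSynthesizer(nn.Module):
    def __init__(self, encoder, duration_predictor, distribution_predictor,
                 decoder, vocoder):
        super().__init__()
        self.encoder = encoder                                 # encoding module 802
        self.duration_predictor = duration_predictor           # duration prediction module 803
        self.distribution_predictor = distribution_predictor   # parameter range prediction module 804
        self.decoder = decoder                                  # target phonemes -> speech spectrum
        self.vocoder = vocoder                                  # speech spectrum -> waveform

    def forward(self, target_text, speaker_embedding):
        phonemes = self.encoder(target_text)                    # original phonemes, (N, D)
        # Acoustic feature data: phoneme features combined with the speaker identity.
        features = torch.cat(
            [phonemes, speaker_embedding.expand(phonemes.size(0), -1)], dim=-1)
        durations = self.duration_predictor(features)           # target duration per phoneme
        sigmas = self.distribution_predictor(features, durations)  # target phoneme distribution data
        frames = gaussian_upsample(phonemes, durations, sigmas)    # up-sampling module 805 (see earlier sketch)
        mel = self.decoder(frames)                               # speech spectrum
        return self.vocoder(mel)                                 # target speech waveform
```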
The embodiments of the present application also provide an electronic device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor implements the above speech synthesis method when executing the computer program. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment; the electronic device includes:
the processor 901, which may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902, which may be implemented in the form of a read-only memory (Read-Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 902 and are invoked by the processor 901 to perform the speech synthesis method of the embodiments of the present application;
An input/output interface 903 for inputting and outputting information;
the communication interface 904, which is configured to implement communication interaction between the device and other devices, where communication may be implemented in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth, etc.);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiments of the present application also provide a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program implements the above speech synthesis method when executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described above are intended to describe the technical solutions of the embodiments of the present application more clearly and do not constitute a limitation on the technical solutions provided by the embodiments of the present application; as those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and an implementation may include more or fewer steps than shown, may combine certain steps, or may split certain steps.
The above-described apparatus embodiments are merely illustrative, and the units illustrated as separate components may or may not be physically separate; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items. For example, at least one of a, b or c may mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units described above is merely a logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring a target text;
inputting the target text to a preset encoder for acoustic encoding to obtain an original phoneme;
obtaining acoustic characteristic data according to the original phonemes;
inputting the acoustic characteristic data into a preset target duration prediction model for duration prediction to obtain target duration;
inputting the target duration and the acoustic characteristic data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data; wherein the target phoneme distribution data is used for representing distribution difference conditions of all original phonemes in the target duration;
performing up-sampling processing on the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes;
and performing voice synthesis according to the target phonemes to obtain target voices.
2. The method according to claim 1, wherein the obtaining acoustic characteristic data according to the original phonemes comprises:
acquiring target identification data of a target speaking object;
and carrying out combination processing according to the target identification data and the original phonemes to obtain the acoustic characteristic data.
3. The method according to claim 2, wherein before the acoustic feature data is input into a preset target duration prediction model for duration prediction, the method further comprises training the target duration prediction model, and specifically comprises:
acquiring a first sample phoneme of a first sample text and a first sample duration of the first sample phoneme;
acquiring first object identification data of a preset first sample speaking object;
inputting the first sample phonemes and the first object identification data into a preset original duration prediction model to predict duration, so as to obtain initial duration;
and carrying out parameter adjustment on the original duration prediction model according to the initial duration and the first sample duration to obtain the target duration prediction model.
4. The method according to claim 2, wherein before inputting the target duration and the acoustic feature data into a preset target phoneme distribution prediction model for phoneme distribution prediction to obtain target phoneme distribution data, the method further comprises training the target phoneme distribution prediction model, specifically comprising:
Acquiring a second sample phoneme of a second sample text and a second sample duration of the second sample phoneme;
acquiring second object identification data of a preset second sample speaking object;
obtaining sample phoneme distribution data according to the second object identification data and the second sample phonemes; wherein the sample phoneme distribution data is used to characterize the distribution difference situation of all the second sample phonemes within the second sample duration;
inputting the second object identification data, the second sample duration and the second sample phonemes into a preset original phoneme distribution prediction model to perform phoneme distribution prediction to obtain initial phoneme distribution data;
and carrying out parameter adjustment on the original phoneme distribution prediction model according to the initial phoneme distribution data and the sample phoneme distribution data to obtain the target phoneme distribution prediction model.
5. The method according to claim 1, wherein the performing up-sampling processing on the original phonemes according to the target phoneme distribution data and the target duration to obtain target phonemes comprises:
obtaining duration center data according to the target duration;
And carrying out up-sampling processing on the original phonemes according to the duration center data and the target phoneme distribution data to obtain the target phonemes.
6. The method according to any one of claims 1 to 5, wherein the performing speech synthesis according to the target phonemes to obtain target speech includes:
inputting the target phonemes to a preset decoder for feature conversion to obtain a voice frequency spectrum;
and inputting the voice frequency spectrum to a preset vocoder to perform voice synthesis to obtain the target voice.
7. The method according to any one of claims 1 to 5, wherein the target phoneme distribution prediction model comprises a feature extraction layer and a feature conversion layer; wherein the feature extraction layer comprises a convolution layer and a normalization layer, and the feature conversion layer comprises a linear layer and an activation layer;
inputting the target duration and the acoustic feature data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction to obtain target phoneme distribution data, wherein the method comprises the following steps:
inputting the target duration and the acoustic feature data into the feature extraction layer to perform feature extraction to obtain initial phoneme distribution data;
And inputting the initial phoneme distribution data into the feature conversion layer to perform feature conversion to obtain the target phoneme distribution data.
8. A speech synthesis apparatus, comprising:
the text acquisition module is used for acquiring a target text;
the encoding module is used for inputting the target text into a preset encoder for acoustic encoding to obtain an original phoneme; obtaining acoustic characteristic data according to the original phonemes;
the duration prediction module is used for inputting the acoustic characteristic data into a preset target duration prediction model to predict the duration, so as to obtain the target duration;
the parameter range prediction module is used for inputting the target duration time and the acoustic characteristic data into a preset target phoneme distribution prediction model to perform phoneme distribution prediction so as to obtain target phoneme distribution data; wherein the target phoneme distribution data is used for representing distribution difference conditions of all original phonemes in the target duration;
the up-sampling module is used for up-sampling the original phonemes according to the target phoneme distribution data and the target duration time to obtain target phonemes;
and the voice generation module is used for carrying out voice synthesis according to the target phonemes to obtain target voices.
9. An electronic device comprising a memory storing a computer program and a processor implementing the speech synthesis method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method of any one of claims 1 to 7.
CN202310427654.1A 2023-04-12 2023-04-12 Speech synthesis method and device, electronic equipment and storage medium Pending CN116469371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310427654.1A CN116469371A (en) 2023-04-12 2023-04-12 Speech synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310427654.1A CN116469371A (en) 2023-04-12 2023-04-12 Speech synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116469371A true CN116469371A (en) 2023-07-21

Family

ID=87183858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310427654.1A Pending CN116469371A (en) 2023-04-12 2023-04-12 Speech synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116469371A (en)

Similar Documents

Publication Publication Date Title
CN111276120B (en) Speech synthesis method, apparatus and computer-readable storage medium
US20190362703A1 (en) Word vectorization model learning device, word vectorization device, speech synthesis device, method thereof, and program
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Liu et al. A novel method for Mandarin speech synthesis by inserting prosodic structure prediction into Tacotron2
CN113823259A (en) Method and device for converting text data into phoneme sequence
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN116631434A (en) Video and voice synchronization method and device based on conversion system and electronic equipment
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Hlaing et al. Phoneme based Myanmar text to speech system
CN113129862B (en) Voice synthesis method, system and server based on world-tacotron
CN116469371A (en) Speech synthesis method and device, electronic equipment and storage medium
Chen et al. A Mandarin Text-to-Speech System
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
Chowdhury et al. A review-based study on different Text-to-Speech technologies
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN115620702A (en) Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium
Lazaridis et al. Comparative evaluation of phone duration models for Greek emotional speech
Pakrashi et al. Analysis-By-Synthesis Modeling of Bengali Intonation
CN115392189B (en) Method and device for generating multi-language mixed corpus and training method and device
CN116469372A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116564274A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116665644A (en) Speech synthesis method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination