CN116564273A - Speech synthesis method, speech synthesis system, electronic device, and storage medium - Google Patents

Speech synthesis method, speech synthesis system, electronic device, and storage medium

Info

Publication number
CN116564273A
CN116564273A (Application CN202310727618.7A)
Authority
CN
China
Prior art keywords
data
phoneme
model
sample
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310727618.7A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
季圣鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310727618.7A
Publication of CN116564273A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques in which the extracted parameters are spectral information of each sub-band

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Signal Processing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The application provides a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium, and belongs to the technical field of financial technology. The method comprises the following steps: acquiring sample phoneme data and sample speech, inputting the sample phoneme data into an original synthesis model, and performing phoneme encoding on the sample phoneme data through a phoneme encoding sub-model to obtain phoneme hidden data; performing phoneme adaptation on the phoneme hidden data through a variance adaptation sub-model to obtain phoneme alignment data and phoneme feature data; carrying out spectrum prediction on the phoneme alignment data and the phoneme feature data through a noise reduction sub-model to obtain a predicted Mel spectrum; performing parameter adjustment on the original synthesis model according to the sample speech and the predicted Mel spectrum to obtain a speech synthesis model; and inputting the target text data into the speech synthesis model for speech synthesis processing to obtain the target synthesized speech. The embodiments of the application can improve the generation quality and generation efficiency of the synthesized speech and effectively reduce the computational cost of the speech synthesis process.

Description

Speech synthesis method, speech synthesis system, electronic device, and storage medium
Technical Field
The present disclosure relates to the field of financial technology, and in particular, to a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium.
Background
With the rapid development of financial technology and the socioeconomic environment, expectations for the level of banking services continue to rise. In scenarios such as intelligent customer service, multi-turn dialogue, and robot outbound calling, speech synthesis technology can be applied to specific scenarios such as daily business handling, business consultation, business recommendation, marketing, and revenue generation. Conveying the relevant information to the target object truthfully and accurately through speech is therefore one of the most effective and direct ways to improve customer experience and service level. Text-to-Speech (TTS) technology synthesizes a given text into audio that can simulate the pronunciation of a target object. A related-art TTS method generates a corresponding Mel spectrogram from text in an autoregressive manner and synthesizes speech from the generated Mel spectrogram with a pre-trained vocoder. Although this approach can generate high-fidelity audio, it is computationally intensive and its synthesis efficiency is low. Therefore, how to provide a speech synthesis method that improves the generation quality and efficiency of synthesized speech while effectively reducing the computational cost of the speech synthesis process is a technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium, which can improve the generation quality and efficiency of synthesized speech and effectively reduce the computational cost of the speech synthesis process.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech synthesis method, including:
obtaining sample data, wherein the sample data comprises sample phoneme data and sample voice, and the sample phoneme data is used for representing text content of the sample voice;
inputting the sample phoneme data into a preset original synthesis model, wherein the original synthesis model comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model;
performing phoneme coding processing on the sample phoneme data through the phoneme coding sub-model to obtain phoneme hiding data;
performing phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme characteristic data;
performing spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through the noise reduction sub-model to obtain a predicted Mel spectrum;
performing parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel spectrum to obtain a voice synthesis model;
and inputting the acquired target text data into the voice synthesis model for voice synthesis processing to obtain target synthesized voice.
In some embodiments, the performing, by the noise reduction sub-model, a spectrum prediction process on the phoneme alignment data and the phoneme feature data to obtain a predicted mel spectrum includes:
inputting the phoneme alignment data into the noise reduction sub-model, and performing data sampling on the phoneme alignment data to obtain candidate adaptation data and position information of the candidate adaptation data;
performing spectrum diffusion processing on the candidate adaptation data according to a preset time step to obtain spectrum diffusion data;
performing spectrum inverse sampling processing on the spectrum diffusion data according to the preset time step, the candidate adaptation data and the phoneme characteristic data to obtain predicted spectrum data;
and generating a frequency spectrum of the predicted spectrum data according to the position information to obtain the predicted Mel frequency spectrum.
In some embodiments, the performing spectrum diffusion processing on the candidate adaptation data according to a preset time step to obtain spectrum diffusion data includes:
acquiring a noise scheduling parameter of the preset time step;
performing data sampling on the candidate adaptation data to obtain first adaptation data;
carrying out noise adding processing on the first adaptive data according to the preset time step and the noise scheduling parameter to obtain second adaptive data;
and obtaining the spectrum diffusion data according to the first adapting data and the second adapting data.
In some embodiments, the performing parameter adjustment on the original synthesis model according to the sample voice and the predicted mel spectrum to obtain a voice synthesis model includes:
performing diffusion parameter calculation according to the noise scheduling parameters and the preset time steps to obtain diffusion process parameters;
acquiring noise distribution data, and performing prediction loss calculation according to the noise distribution data, the candidate adaptation data, the prediction spectrum data, the preset time step, the diffusion process parameters and the phoneme characteristic data to obtain prediction loss data;
and carrying out parameter adjustment on the original synthesis model according to the prediction loss data to obtain the voice synthesis model.
In some embodiments, the variance adaptation sub-model includes a duration predictor, a pitch predictor, and an energy predictor;
the performing phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme feature data includes:
performing phoneme alignment processing on the phoneme hidden data according to the duration predictor to obtain the phoneme alignment data;
performing pitch prediction processing on the phoneme alignment data according to the pitch predictor to obtain first condition data;
performing energy prediction processing on the phoneme alignment data according to the energy predictor to obtain second condition data;
and carrying out data combination on the first condition data and the second condition data to obtain the phoneme characteristic data.
In some embodiments, the pitch predictor includes a pitch activation layer, a normalization layer, and a pitch projection layer;
the pitch prediction processing is performed on the phoneme alignment data according to the pitch predictor to obtain first condition data, including:
nonlinear processing is carried out on the phoneme alignment data according to the pitch activation layer to obtain pitch activation data;
normalizing the pitch activation data according to the normalization layer to obtain normalized hidden data;
and carrying out linear projection processing on the normalized hidden data according to the pitch projection layer to obtain the first condition data.
In some embodiments, the phoneme encoding submodel comprises a phoneme convolution layer, a phoneme self-attention layer, and a phoneme projection layer;
the processing of phoneme coding is performed on the sample phoneme data through the phoneme coding submodel to obtain phoneme hiding data, including:
carrying out phoneme convolution processing on the sample phoneme data according to the phoneme convolution layer to obtain phoneme coding data;
performing self-attention processing on the phoneme coding data according to the phoneme self-attention layer to obtain phoneme attention data;
and carrying out linear projection processing on the phoneme attention data according to the phoneme projection layer to obtain the phoneme hiding data.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis system, the system comprising:
a sample acquisition module for acquiring sample data, the sample data comprising sample phoneme data and sample speech, the sample phoneme data being used for characterizing text content of the sample speech;
the model input module is used for inputting the sample phoneme data into a preset original synthesis model, and the original synthesis model comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model;
The phoneme coding module is used for carrying out phoneme coding processing on the sample phoneme data through the phoneme coding sub-model to obtain phoneme hiding data;
the phoneme adaptation module is used for carrying out phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme characteristic data;
the frequency spectrum prediction module is used for carrying out frequency spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through the noise reduction sub-model to obtain a predicted Mel frequency spectrum;
the parameter adjustment module is used for carrying out parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel frequency spectrum to obtain a voice synthesis model;
and the voice synthesis module is used for inputting the acquired target text data into the voice synthesis model to perform voice synthesis processing so as to obtain target synthesized voice.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory storing a computer program and a processor implementing a method according to any one of the first aspects of the embodiments of the present application when the processor executes the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application further proposes a computer readable storage medium storing a computer program, which when executed by a processor implements a method according to any one of the first aspects of the embodiments of the present application.
According to the voice synthesis method, the voice synthesis system, the electronic equipment and the storage medium, firstly, sample data are obtained, the sample data comprise sample phoneme data and sample voices, and the sample phoneme data are used for representing text contents of the sample voices. Then, the sample phoneme data is input to a preset original synthesis model including a phoneme encoding sub-model, a variance adaptation sub-model, and a noise reduction sub-model. And carrying out phoneme coding processing on the sample phoneme data through the phoneme coding submodel to obtain phoneme hiding data. And carrying out phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme characteristic data. And carrying out spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through the noise reduction submodel to obtain a predicted Mel spectrum. And then, carrying out parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel spectrum to obtain a voice synthesis model. And inputting the acquired target text data into a voice synthesis model for voice synthesis processing to obtain target synthesized voice. The embodiment of the invention can improve the generation quality and the generation efficiency of the synthesized voice and effectively simplify the calculated amount of the voice synthesis process.
Drawings
FIG. 1 is a first flowchart of a speech synthesis method provided in an embodiment of the present application;
FIG. 2 is a flowchart of a specific method of step S130 in FIG. 1;
FIG. 3 is a schematic diagram of a phoneme encoding sub-model according to an embodiment of the present application;
FIG. 4 is a flowchart of a specific method of step S140 in FIG. 1;
FIG. 5 is a schematic diagram of a variance adaptation sub-model provided in an embodiment of the present application;
FIG. 6 is a flowchart of a specific method of step S420 in FIG. 4;
FIG. 7 is a flowchart of a specific method of step S150 in FIG. 1;
FIG. 8 is a flowchart of a specific method of step S720 in FIG. 7;
FIG. 9 is a flowchart of a specific method of step S160 in FIG. 1;
FIG. 10 is a block diagram of a speech synthesis system according to an embodiment of the present application;
FIG. 11 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several nouns referred to in this application are parsed:
artificial intelligence (Artificial Intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (Natural Language Processing, NLP): NLP is a branch of artificial intelligence that is a interdisciplinary of computer science and linguistics, and is often referred to as computational linguistics, and is processed, understood, and applied to human languages (e.g., chinese, english, etc.). Natural language processing includes parsing, semantic analysis, chapter understanding, and the like. Natural language processing is commonly used in the technical fields of machine translation, handwriting and print character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and the natural language processing involves data mining related to language processing, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computing, and the like.
Speech synthesis technology (Text-To-Speech, TTS): a technique for converting text into speech. TTS generally comprises two steps: the first step is text processing, which mainly converts the text into a phoneme sequence and annotates information such as the start and end times and frequency changes of each phoneme; the second step is speech synthesis, which mainly generates speech from the phoneme sequence (a short illustrative sketch of this two-step pipeline follows these definitions).
Phoneme: the smallest speech unit obtained by dividing speech according to its natural attributes; analyzed in terms of the articulatory actions within a syllable, one articulatory action constitutes one phoneme.
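As a rough, hedged illustration of the two TTS steps defined above (not the method of this application), the following Python sketch separates text processing from speech synthesis; every function and variable name is an illustrative placeholder.

```python
# Minimal sketch of the two TTS stages; every name here is an illustrative placeholder.

def text_to_phonemes(text: str) -> list[str]:
    """Step 1: text processing - convert text into a phoneme sequence.
    A real front-end would also annotate start/end times, pitch changes, etc."""
    lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
    phonemes: list[str] = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, list(word)))  # fall back to spelling letter by letter
    return phonemes

def phonemes_to_speech(phonemes: list[str], acoustic_model, vocoder):
    """Step 2: speech synthesis - phoneme sequence -> Mel spectrogram -> waveform."""
    mel = acoustic_model(phonemes)  # e.g. a trained synthesis model
    return vocoder(mel)             # a pre-trained vocoder turns the spectrogram into audio
```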
With the rapid development of financial technology and the socioeconomic environment, expectations for the level of banking services continue to rise. In scenarios such as intelligent customer service, multi-turn dialogue, and robot outbound calling, speech synthesis technology can be applied to specific scenarios such as daily business handling, business consultation, business recommendation, marketing, and revenue generation. Conveying the relevant information to the target object truthfully and accurately through speech is therefore one of the most effective and direct ways to improve customer experience and service level. Text-to-Speech (TTS) technology synthesizes a given text into audio that can simulate the pronunciation of a target object. A related-art TTS method generates a corresponding Mel spectrogram from text in an autoregressive manner and synthesizes speech from the generated Mel spectrogram with a pre-trained vocoder. Although this approach can generate high-fidelity audio, it is computationally intensive and its synthesis efficiency is low. Therefore, how to provide a speech synthesis method that improves the generation quality and efficiency of synthesized speech while effectively reducing the computational cost of the speech synthesis process is a technical problem to be solved.
Based on the above, the voice synthesis method, the voice synthesis system, the electronic device and the storage medium provided by the embodiment of the application can improve the generation quality and the generation efficiency of the synthesized voice, and effectively simplify the calculation amount of the voice synthesis process.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a voice synthesis method, which relates to the technical field of artificial intelligence. The voice synthesis method provided by the embodiment of the application can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, or smart watch, etc.; the server can be an independent server, and can also be a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDNs), basic cloud computing services such as big data and artificial intelligent platforms, and the like; the software may be an application or the like that implements a speech synthesis method, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to data related to user identity or characteristics, such as user information, user behavior data, user voice data, user history data, and user location information, the permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data all comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Referring to fig. 1, fig. 1 is an optional flowchart of a speech synthesis method according to an embodiment of the present application, and in some embodiments of the present application, the speech synthesis method according to an embodiment of the present application includes, but is not limited to, steps S110 to S170, and these seven steps are described in detail below with reference to fig. 1.
Step S110, sample data is obtained, wherein the sample data comprises sample phoneme data and sample voices, and the sample phoneme data is used for representing text contents of the sample voices;
step S120, inputting sample phoneme data into a preset original synthesis model, wherein the original synthesis model comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model;
step S130, carrying out phoneme coding processing on the sample phoneme data through a phoneme coding sub-model to obtain phoneme hiding data;
step S140, carrying out phoneme adaptation processing on the phoneme hidden data through a variance adaptation sub model to obtain phoneme alignment data and phoneme characteristic data;
step S150, carrying out spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through a noise reduction sub-model to obtain a predicted Mel spectrum;
step S160, carrying out parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel spectrum to obtain a voice synthesis model;
Step S170, inputting the acquired target text data into a speech synthesis model for speech synthesis processing to obtain target synthesized speech.
In steps S110 to S170 of some embodiments, first, sample data including sample phoneme data and sample speech is acquired, the sample phoneme data being used for characterizing the text content of the sample speech. Then, the sample phoneme data is input to a preset original synthesis model including a phoneme encoding sub-model, a variance adaptation sub-model, and a noise reduction sub-model. Phoneme encoding processing is performed on the sample phoneme data through the phoneme encoding sub-model to obtain phoneme hidden data. Phoneme adaptation processing is performed on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme feature data. Spectrum prediction processing is performed on the phoneme alignment data and the phoneme feature data through the noise reduction sub-model to obtain a predicted Mel spectrum. Then, parameter adjustment is performed on the original synthesis model according to the sample speech and the predicted Mel spectrum to obtain a speech synthesis model. The acquired target text data is input into the speech synthesis model for speech synthesis processing to obtain the target synthesized speech. By predicting the Mel spectrum from the phoneme alignment data and the phoneme feature data, the embodiments of the present application can improve the generation quality and generation efficiency of the synthesized speech and effectively reduce the computational cost of the speech synthesis process.
In step S110 of some embodiments, in order to train to obtain a speech synthesis model, embodiments of the present application first obtain a training sample set, where the training sample set includes at least one sample data, and each sample data includes sample phoneme data and sample speech. Wherein, the sample voice is used for representing a reference voice of the sample phoneme data in model training. The sample phoneme data may be a text phoneme sequence obtained by performing text phoneme conversion on the acquired initial text through a preset phoneme conversion model. In order to improve the generation efficiency of the speech synthesis model, the sample phoneme data is used for representing the text content of the sample speech, i.e. the text content of the sample phoneme data is identical to the speech content of the sample speech. According to the embodiment of the application, the sample voice is used as a comparison label for model prediction so as to guide the training process of the sample phoneme data in a model.
It should be noted that the phoneme conversion model used in the embodiments of the present application may be constructed using model structures such as the Deep Voice 3 model or a grapheme-to-phoneme conversion model (Grapheme to Phoneme, G2P), which is not limited herein.
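As an illustration only (the application does not mandate a particular implementation), the open-source g2p_en package offers a grapheme-to-phoneme front-end of the kind mentioned above; the snippet below is a minimal sketch of producing sample phoneme data from an initial text, and the example sentence is arbitrary.

```python
# Illustrative sketch using the open-source g2p_en package as one possible G2P front-end.
from g2p_en import G2p

g2p = G2p()
initial_text = "please confirm the transfer amount"
sample_phoneme_data = g2p(initial_text)  # list of phoneme symbols, e.g. ['P', 'L', 'IY1', 'Z', ...]
print(sample_phoneme_data)
```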
Note that, the storage format of the sample voice in the present application may be MP3 format, CDA format, WAV format, WMA format, RA format, MIDI format, OGG format, APE format, AAC format, or the like, which is not limited in the present application.
It should be noted that the speech synthesis method of the present application may be used to assist applications such as car broadcasting and announcement, car navigation, electronic dictionary, consumer electronics, smart phones, smart speakers, voice assistant, electronic book reading, etc. For example, in a daily business consultation scenario of the intelligent customer service, after a target object dials a phone call, the intelligent customer service may generate target synthesized voice according to a preset phone text, and guide the target object to execute a corresponding operation by playing the synthesized voice, so as to obtain required information.
In step S120 of some embodiments, the related art speech synthesis system includes an acoustic model that can implement mapping from input text to speech features, and a vocoder that can synthesize speech based on the speech features. After inputting text for which speech synthesis is desired to the speech synthesis system, the acoustic model predicts speech features from the input text and inputs the predicted speech features to the vocoder. The vocoder is configured to synthesize speech based on the derived predicted speech characteristics. However, because of the loss of prediction from the acoustic model, there is a large mismatch between the predicted speech features received by the vocoder from the acoustic model and the actual speech features, which results in a non-ideal synthesized speech generated by the vocoder, e.g., often with significant stuffy or muffled noise problems. In order to be able to predict high-quality synthesized speech and to make the synthesized speech have the characteristic of diversity, the original synthesis model preset in the application comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model.
It should be noted that the original synthesis model further includes a vocoder for synthesizing speech according to the speech characteristics input thereto, wherein the input of the vocoder may be the predicted mel frequency spectrum output by the noise reduction sub-model.
In step S130 of some embodiments, the phoneme encoding submodel is used to perform a phoneme encoding process on the input sample phoneme data to obtain phoneme hiding data. Wherein the phoneme hiding data is used for representing deeper characteristic information of the sample phoneme data.
Referring to fig. 2, fig. 2 is a flowchart of a specific method of step S130 according to an embodiment of the present application. In some embodiments of the present application, the phoneme coding submodel includes a phoneme convolution layer, a phoneme self-attention layer, and a phoneme projection layer, and step S130 may specifically include, but is not limited to, steps S210 to S230, which are described in detail below in connection with fig. 2.
Step S210, carrying out phoneme convolution processing on the sample phoneme data according to the phoneme convolution layer to obtain phoneme coding data;
step S220, performing self-attention processing on the phoneme encoded data according to the phoneme self-attention layer to obtain phoneme attention data;
step S230, linear projection processing is carried out on the phoneme attention data according to the phoneme projection layer, so that phoneme hiding data are obtained.
In steps S210 to S230 of some embodiments, as shown in fig. 3, the phoneme encoding sub-model 310 proposed in the embodiments of the present application adopts the model structure of a feed-forward Transformer, and includes a phoneme convolution layer 311, a phoneme self-attention layer 312, and a phoneme projection layer 313. The phoneme convolution layer 311 is used to extract key features from the sample phoneme data, so as to eliminate the influence of noise features or redundant features on speech synthesis efficiency. The phoneme self-attention layer 312 applies different weights (attention scores) to the different pieces of information that need to be considered when solving a problem in a specific scene: information that contributes more to the solution receives a higher weight, and information that contributes less receives a lower weight, so that the available information is used more effectively. When self-attention processing is performed on the phoneme encoded data according to the phoneme self-attention layer 312, phoneme encoded data that contributes strongly to recognizing the emotion change of the target object is given a larger attention score, while phoneme encoded data that contributes little is given a smaller attention score. Sentences corresponding to phoneme encoded data with high attention scores are relatively more important, and the emotion they express represents the emotion of the target object more truly than other sentences, so the accuracy and efficiency of speech synthesis can be effectively improved. Thereafter, the input phoneme attention data is subjected to linear projection processing by the phoneme projection layer 313, and the resulting phoneme hidden data is transmitted to the variance adaptation sub-model 320. By integrating the feature data through the fusion network, the embodiments of the present application can effectively improve the diversity of the synthesized speech.
It should be noted that, in order to focus attention on the position of the phoneme encoded data itself, the phoneme self-attention layer proposed in the embodiment of the present application may be constructed by using a multi-head attention mechanism.
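The structure described above can be pictured with the hedged PyTorch sketch below (convolution, multi-head self-attention, and linear projection); the layer sizes and names are assumptions for illustration, not the application's actual configuration.

```python
import torch
import torch.nn as nn

class PhonemeEncoder(nn.Module):
    """Sketch of a phoneme encoding sub-model: convolution -> self-attention -> projection."""
    def __init__(self, vocab_size=80, d_model=256, n_heads=4, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Phoneme convolution layer: extracts local key features from the phoneme sequence.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        # Phoneme self-attention layer: weights phoneme positions by their contribution (multi-head).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Phoneme projection layer: linear projection to the phoneme hidden data.
        self.proj = nn.Linear(d_model, hidden_dim)

    def forward(self, phoneme_ids):                  # (batch, seq_len)
        x = self.embed(phoneme_ids)                  # (batch, seq_len, d_model)
        x = self.conv(x.transpose(1, 2)).transpose(1, 2)
        x, _ = self.attn(x, x, x)                    # self-attention over phoneme positions
        return self.proj(x)                          # phoneme hidden data
```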
In step S140 of some embodiments, the variance adaptation sub-model proposed in the embodiments of the present application is used to predict the duration of each phoneme, so as to adjust the length of the phoneme hidden features to the length and dimension of the predicted speech. For example, if the feature data obtained after the phoneme encoding sub-model has a size of 80×70, and the two-dimensional matrix of the Mel spectrum corresponding to the finally output predicted speech has a size of 80×140, the variance adaptation sub-model adjusts the matrix of the phoneme hidden data obtained from the phoneme encoding sub-model to 80×140, so as to obtain the synthesized speech with the required matrix size.
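For illustration, the length adjustment described above behaves like the length-regulator sketch below, assuming per-phoneme durations predicted by the duration predictor; it follows the 80×70 to 80×140 example, transposed to a frames-last-dimension layout for convenience, and is not the application's exact implementation.

```python
import torch

def length_regulate(phoneme_hidden, durations):
    """Expand each phoneme's hidden vector by its predicted duration (in Mel frames).

    phoneme_hidden: (num_phonemes, hidden_dim), e.g. 70 phonemes x 80 dims
    durations:      (num_phonemes,) integer frame counts, summing to e.g. 140
    returns:        (total_frames, hidden_dim), e.g. 140 x 80
    """
    return torch.repeat_interleave(phoneme_hidden, durations, dim=0)

hidden = torch.randn(70, 80)
durations = torch.full((70,), 2, dtype=torch.long)   # toy example: every phoneme lasts 2 frames
aligned = length_regulate(hidden, durations)
print(aligned.shape)                                 # torch.Size([140, 80])
```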
Referring to fig. 4, fig. 4 is a flowchart of a specific method of step S140 according to an embodiment of the present application. In some embodiments of the present application, the variance adaptation sub-model includes a duration predictor, a pitch predictor, and an energy predictor, and step S140 may specifically include, but is not limited to, steps S410 to S440, which are described in detail below in connection with fig. 4.
Step S410, performing phoneme alignment processing on the phoneme hidden data according to the duration predictor to obtain phoneme aligned data;
step S420, performing pitch prediction processing on the phoneme alignment data according to a pitch predictor to obtain first condition data;
step S430, performing energy prediction processing on the phoneme alignment data according to the energy predictor to obtain second condition data;
step S440, data combination is carried out on the first condition data and the second condition data to obtain phoneme characteristic data.
In steps S410 to S440 of some embodiments, in order to make the synthesized speech full in energy and accurate in pitch, i.e., to effectively improve the synthesis quality of the speech, the embodiments of the present application use a variance adaptation sub-model to model different kinds of variance information in the speech, such as energy variance and pitch variance. Specifically, as shown in fig. 5, the phoneme hidden data is input into the variance adaptation sub-model 510, and phoneme alignment processing is performed on the phoneme hidden data according to the duration predictor 511, resulting in phoneme alignment data. The phoneme alignment data is in the form of a matrix, so as to adjust the length of the phoneme hidden features to the length and dimension of the predicted speech. Pitch prediction processing is performed on the phoneme alignment data according to the pitch predictor 512 to obtain first condition data, which characterizes the pitch condition variance data of the output predicted speech. Energy prediction processing is performed on the phoneme alignment data according to the energy predictor 513 to obtain second condition data, which characterizes the energy condition variance data of the output predicted speech. The first condition data and the second condition data are then combined to obtain the phoneme feature data, which is used as a condition parameter in the noise reduction sub-model 520 to synthesize a predicted Mel spectrum with accurate pitch and better quality.
Referring to fig. 6, fig. 6 is a flowchart of a specific method of step S420 according to an embodiment of the present application. In some embodiments of the present application, the pitch predictor includes a pitch activation layer, a normalization layer, and a pitch projection layer, and step S420 may include, but is not limited to, steps S610 to S630, which are described in detail below in connection with fig. 6.
Step S610, nonlinear processing is carried out on the phoneme alignment data according to the pitch activation layer, so as to obtain pitch activation data;
step S620, carrying out normalization processing on the pitch activation data according to the normalization layer to obtain normalized hidden data;
step S630, linear projection processing is carried out on the normalized hidden data according to the pitch projection layer, and first condition data are obtained.
In steps S610 to S630 of some embodiments, pitch is used to characterize the frequency of the speech and energy is used to characterize the intensity of the speech. In order to estimate the pitch of the phoneme alignment data more accurately, the pitch predictor proposed in the embodiments of the present application includes a pitch activation layer, a normalization layer, and a pitch projection layer. Specifically, the pitch activation layer comprises rectified linear units (Rectified Linear Unit, ReLU) and a 2-layer one-dimensional convolution unit. The ReLU is a piecewise linear function that outputs the input directly if it is positive and outputs zero otherwise; it makes the model easier to train and often achieves better performance. The one-dimensional convolution unit is equivalent to a locally applied fully connected network and can change the number of channels without changing the size of the feature map, thereby effectively enhancing the abstract expression capability of the local network module and improving the accuracy of pitch prediction. After the pitch activation layer, the pitch predictor further comprises a normalization layer, which ensures that the convolved result again follows a normal distribution so that the gradient does not vanish when it is fed into the pitch projection layer. The pitch projection layer is used to project the normalized hidden data of the hidden states into the output sequence.
It should be noted that when a complex feed-forward neural network is trained on a small data set, or when the classification results on some data are too good, the bias of the training set becomes apparent and overfitting easily occurs, leading to large errors at test time. Therefore, in order to prevent overfitting, the pitch predictor provided by the embodiments of the present application further comprises a random discarding (dropout) layer after the normalization layer. The random discarding layer discards random units during training and sets them to zero; by doing this repeatedly during training, overfitting in the training phase is effectively prevented.
It should be noted that, the duration predictor and the energy predictor provided in the embodiments of the present application are similar to the model structure of the pitch predictor, but in the specific training process, the duration predictor mainly performs predictive training on the duration feature in the phoneme hiding feature, and the energy predictor mainly performs predictive training on the phoneme energy feature in the phoneme alignment data. According to the embodiment of the application, phoneme adaptation processing is carried out on the phoneme hidden data through the variance adaptation sub-model, two matrixes are obtained, wherein the matrixes are phoneme aligned data and phoneme characteristic data respectively, and the phoneme aligned data are used for representing data in an aligned form which is the same as the content of the phoneme hidden data. The phoneme characteristic data is used to characterize the condition information required to generate the predicted mel spectrum.
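A hedged PyTorch sketch of the pitch predictor described above (a 2-layer one-dimensional convolution with ReLU activations, layer normalization, a random discarding layer, and a linear projection) is given below; the kernel sizes and dimensions are illustrative assumptions.

```python
import torch.nn as nn

class PitchPredictor(nn.Module):
    """Sketch of the pitch predictor: ReLU + 1-D conv (x2) -> LayerNorm -> Dropout -> Linear."""
    def __init__(self, hidden_dim=256, kernel_size=3, dropout=0.1):
        super().__init__()
        # Pitch activation layer: two 1-D convolution blocks with ReLU activations.
        self.conv1 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(hidden_dim, hidden_dim, kernel_size, padding=kernel_size // 2)
        self.relu = nn.ReLU()
        # Normalization layer: keeps the convolved result well-conditioned before projection.
        self.norm = nn.LayerNorm(hidden_dim)
        # Random discarding (dropout) layer: zeroes random units during training to limit overfitting.
        self.dropout = nn.Dropout(dropout)
        # Pitch projection layer: projects the normalized hidden data to one pitch value per frame.
        self.proj = nn.Linear(hidden_dim, 1)

    def forward(self, x):                              # x: (batch, frames, hidden_dim)
        h = self.relu(self.conv1(x.transpose(1, 2)))
        h = self.relu(self.conv2(h)).transpose(1, 2)
        h = self.dropout(self.norm(h))
        return self.proj(h).squeeze(-1)                # first condition data (pitch per frame)
```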
In step S150 of some embodiments, in order to iteratively refine the hidden data after the length adjustment into a predicted mel spectrum, a spectrum prediction process is performed on the data output by the variance adaptation sub-model through the noise reduction sub-model, so as to obtain the predicted mel spectrum. According to the method and the device, the sample phoneme data are transformed through the scale of the Mel frequency spectrum, and the obtained predicted Mel frequency spectrum can learn the nonlinear transformation of the frequency spectrum.
Referring to fig. 7, fig. 7 is a flowchart of a specific method of step S150 according to an embodiment of the present application. In some embodiments of the present application, step S150 may specifically include, but is not limited to, steps S710 to S740, which are described in detail below in conjunction with fig. 7.
Step S710, inputting the phoneme alignment data into a noise reduction sub-model, and performing data sampling on the phoneme alignment data to obtain candidate adaptation data and position information of the candidate adaptation data;
step S720, performing spectrum diffusion processing on the candidate adaptation data according to a preset time step to obtain spectrum diffusion data;
step S730, performing spectrum inverse sampling processing on the spectrum diffusion data according to the preset time step, the candidate adaptation data and the phoneme characteristic data to obtain predicted spectrum data;
Step S740, performing spectrum generation on the predicted spectrum data according to the position information to obtain a predicted Mel spectrum.
In steps S710 to S740 of some embodiments, the parameterized noise reduction sub-model is trained to directly predict clean data, so as to avoid significant degradation of data quality and poor model convergence when only a small number of diffusion iterations are used during accelerated sampling. The embodiments of the present application improve on the progressive fast diffusion text-to-speech model (Progressive Fast Diffusion Model for High-Quality Text-to-Speech, ProDiff) to obtain the noise reduction sub-model, i.e., the data variance of the predicted Mel spectrum is reduced by a knowledge extraction method. Specifically, the noise reduction sub-model of the present application includes a model sampling layer and a model inverse sampling layer. In the model sampling layer, the phoneme alignment data is first input into the noise reduction sub-model, and data sampling is performed on the phoneme alignment data to obtain candidate adaptation data and the position information of the candidate adaptation data. The noise reduction sub-model comprises T preset time steps, which also represent the number of loop iterations of the model. The spectrum diffusion process is a forward diffusion process in the form of a Markov chain: over T time steps, a small amount of Gaussian noise is gradually added to the candidate adaptation data, generating a series of noisy samples, namely the spectrum diffusion data, which follow a Gaussian distribution. During the spectrum diffusion process, the candidate adaptation data gradually loses its distinguishable characteristics as the time step increases, and when T approaches infinity, the spectrum diffusion data is equivalent to isotropic Gaussian noise. In order to obtain the predicted Mel spectrum accurately in the model inverse sampling layer, the embodiments of the present application perform spectrum inverse sampling processing on the spectrum diffusion data according to the preset time step, the candidate adaptation data, and the phoneme feature data, i.e., the added noise is predicted from the candidate adaptation data, and a denoising process is performed on the candidate adaptation data in combination with the preset time step and the phoneme feature data, so that accurate predicted spectrum data is derived. Finally, spectrum synthesis is performed on the obtained predicted spectrum data according to the position information of the candidate adaptation data to obtain the predicted Mel spectrum. By sampling the candidate adaptation data according to the preset time steps in a parameterized manner and parameterizing the denoising model to directly predict clean data, the problem of obvious data quality degradation in the accelerated sampling process is avoided.
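The reverse (spectrum inverse sampling) pass can be sketched as below under the assumption of a standard Gaussian diffusion process whose denoising network directly predicts the clean data, as the application describes; the `denoiser` interface, tensor shapes, and posterior update are assumptions for illustration, not the application's exact design.

```python
import torch

def reverse_sample(denoiser, phoneme_features, betas, shape):
    """Hedged sketch of the inverse sampling pass with a clean-data-predicting denoiser."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # start from Gaussian noise
    for t in reversed(range(len(betas))):
        x0_pred = denoiser(x, t, phoneme_features)           # directly predict the clean data
        if t == 0:
            x = x0_pred
            break
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        # Posterior mean of q(x_{t-1} | x_t, x0) for a Gaussian diffusion process.
        mean = (torch.sqrt(ab_prev) * betas[t] / (1 - ab_t)) * x0_pred \
             + (torch.sqrt(alphas[t]) * (1 - ab_prev) / (1 - ab_t)) * x
        var = betas[t] * (1 - ab_prev) / (1 - ab_t)
        x = mean + torch.sqrt(var) * torch.randn_like(x)
    return x                                                 # predicted spectrum data
```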
Referring to fig. 8, fig. 8 is a flowchart of a specific method of step S720 according to an embodiment of the present application. In some embodiments of the present application, step S720 may specifically include, but is not limited to, steps S810 to S840, which are described in detail below in conjunction with fig. 8.
Step S810, obtaining a noise scheduling parameter of a preset time step;
step S820, data sampling is carried out on the candidate adaptation data to obtain first adaptation data;
step S830, performing noise adding processing on the first adaptive data according to a preset time step and a noise scheduling parameter to obtain second adaptive data;
in step S840, spectrum diffusion data is obtained according to the first adaptation data and the second adaptation data.
In steps S810 to S840 of some embodiments, when performing spectrum diffusion processing, the noise scheduling parameter of the preset time step is acquired first, denoted as β_t, where t represents the time step of a specific diffusion step, t ∈ [0, T]; the noise scheduling parameter characterizes the hyperparameters of the forward diffusion process. Data sampling is performed on the candidate adaptation data to obtain the first adaptation data, denoted as x_0. As shown in the following formulas (1) and (2), noise adding processing is performed on the first adaptation data according to the preset time step and the noise scheduling parameter to obtain the second adaptation data, denoted as x_1; by analogy, the second adaptation data is used as the new first adaptation data to derive the data of the next time step, until the hidden variable x_T is derived.

q(x_1, ..., x_T | x_0) = ∏_{t=1}^{T} q(x_t | x_{t-1})   (1)

q(x_t | x_{t-1}) = N(x_t; √(1 − β_t) · x_{t-1}, β_t · I)   (2)

where I represents the identity matrix, x_t represents the noised adaptation data at time step t, x_{t-1} represents the noised adaptation data at time step t−1, q represents the data distribution at the corresponding time step, and N represents a preset spectrum diffusion function, which may be a Gaussian distribution function.
It should be noted that, in the embodiments of the present application, the spectrum inverse sampling processing uses the pre-sampled first adaptation data x_0 as the condition of the denoising process. This avoids the problem in the prior art where a distillation learning method is used as the input of the denoising process and the predicted spectrum data and loss data are calculated from the resulting data, which reduces the synthesis efficiency of the model. The embodiments of the present application combine the spectrum sampling process and the spectrum inverse sampling process into one stage, i.e., both processes can be carried out based on the sampled candidate adaptation data, which can effectively improve the efficiency of speech synthesis.
It should be noted that, for any time step t, the noise scheduling parameter used in the present application may be calculated by using a cosine scheduling function, as shown in the following formula (3).
β_t = cos(0.5πt)   (3)
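Formulas (1) to (3) together can be illustrated with the short forward-diffusion sketch below; it assumes that t in formula (3) is normalized to [0, 1] and that β_t is clipped for numerical stability, both of which are assumptions rather than statements of the application.

```python
import math
import torch

def cosine_betas(T: int) -> torch.Tensor:
    """Noise scheduling parameters following formula (3), with t assumed normalized to [0, 1]."""
    ts = torch.linspace(0, 1, T)
    return torch.cos(0.5 * math.pi * ts).clamp(1e-4, 0.999)

def q_sample(x0: torch.Tensor, t: int, betas: torch.Tensor) -> torch.Tensor:
    """Forward diffusion of formulas (1)-(2): add Gaussian noise to the clean data x0 up to step t."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    noise = torch.randn_like(x0)
    # Closed form of applying q(x_t | x_{t-1}) repeatedly, starting from the clean data x0.
    return torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * noise
```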
In step S160 of some embodiments, the embodiments of the present application compute the prediction loss directly against the clean data in order to obtain synthesized speech of better quality. Specifically, when parameter adjustment is performed on the original synthesis model according to the sample speech and the predicted Mel spectrum, the initially extracted candidate adaptation data are used as the condition parameters, which avoids the problem in the prior art where a distillation learning method is used as the input of the denoising process and the predicted spectrum data and loss data are calculated from the resulting data, reducing the synthesis efficiency of the model.
Referring to fig. 9, fig. 9 is a flowchart of a specific method of step S160 according to an embodiment of the present application. In some embodiments of the present application, step S160 may specifically include, but is not limited to, steps S910 to S930, which are described in detail below in conjunction with fig. 9.
Step S910, performing diffusion parameter calculation according to the noise scheduling parameters and the preset time steps to obtain diffusion process parameters;
step S920, obtaining noise distribution data, and performing prediction loss calculation according to the noise distribution data, the candidate adaptation data, the prediction spectrum data, the preset time step, the diffusion process parameters and the phoneme characteristic data to obtain prediction loss data;
And step S930, carrying out parameter adjustment on the original synthesis model according to the prediction loss data to obtain a voice synthesis model.
In steps S910 to S930 of some embodiments, in order to compute the loss directly against the initial clean data, diffusion parameter calculation is performed according to the noise scheduling parameter and the preset time step to obtain the diffusion process parameter, denoted as α_t, which satisfies the following formula (4). Noise distribution data ε is acquired, where ε is noise that follows a normal distribution. As shown in the following formula (5), prediction loss calculation is performed according to the noise distribution data, the candidate adaptation data, the predicted spectrum data, the preset time step, the diffusion process parameter, and the phoneme feature data, to obtain the prediction loss data L.
α_t = ∏_{s=1}^{t} (1 − β_s)   (4)

L = E_{ε~N(0, I)} ‖ x_θ(√α_t · x_0 + √(1 − α_t) · ε, t, con) − x_0 ‖   (5)
where θ represents a shared parameter, which is used to represent the spectrum inverse sampling process as a Markov chain parameterized by the shared parameter; x_θ represents the data predicted when the parameter is θ; con represents the phoneme feature data, i.e., con includes the first condition data from pitch prediction and the second condition data from energy prediction.
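Since the application emphasizes directly predicting clean data, the prediction loss can be sketched as the reconstruction objective below; this is a hedged interpretation, and the exact norm and weighting are not specified in the text.

```python
import torch

def prediction_loss(denoiser, x0, t, betas, phoneme_features):
    """Sketch of the prediction loss: the denoiser sees a noised sample and must recover clean x0."""
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)                                   # noise distribution data
    x_t = torch.sqrt(alpha_bar) * x0 + torch.sqrt(1.0 - alpha_bar) * eps
    x0_pred = denoiser(x_t, t, phoneme_features)                 # predicted spectrum data
    return torch.mean(torch.abs(x0_pred - x0))                   # e.g. an L1 reconstruction loss
```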
It should be noted that, when the model after the parameter adjustment of the original synthesis model meets the preset end condition, the speech synthesis model is obtained. The preset training ending condition may be that similarity calculation is performed according to the sample voice and the predicted mel spectrum, that is, the sample mel spectrum of the sample voice is obtained, and similarity calculation is performed according to the sample mel spectrum and the predicted mel spectrum to obtain spectrum similarity data. And when the frequency spectrum similarity data is larger than or equal to a preset similarity threshold value, judging that the current model training is finished.
It should be noted that, the function for performing the similarity calculation may be selected according to actual needs, for example, cosine similarity calculation, a time axis comparison method, and the like, which are not limited herein.
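As one simple realization of the similarity check mentioned above (cosine similarity is only one of the options the text allows), the end-of-training condition could look like the sketch below; the threshold value is an assumption.

```python
import torch
import torch.nn.functional as F

def training_finished(sample_mel, predicted_mel, threshold=0.95):
    """Stop-criterion sketch: cosine similarity between the sample and predicted Mel spectra."""
    similarity = F.cosine_similarity(sample_mel.flatten(), predicted_mel.flatten(), dim=0)
    return similarity >= threshold
```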
In step S170 of some embodiments, in practical application the method runs on a terminal. When a target object needs to perform speech synthesis on the terminal, selecting the text content to be synthesized on the terminal page causes a pop-up box to be displayed. The target object triggers the speech synthesis process by touching the synthesize-speech button in the pop-up box, at which time the target text is sent to the speech synthesis system via a speech synthesis service request. The result is then broadcast through the loudspeaker of the terminal, so that the target object hears target synthesized speech whose content matches the target text. For example, a text-to-speech synthesis system in which the speech synthesis model trained in the present application is deployed may be installed on a smartphone. Upon detecting an operation to convert text to speech, the smartphone generates a speech synthesis service request and sends it to the speech synthesis system. In response to the speech synthesis service request, the speech synthesis system extracts the target text data from the request and synthesizes the target synthesized speech from the target text data via the trained speech synthesis model.
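As an illustrative sketch only (the request fields, class names and the synthesize() call below are assumptions made for illustration, not interfaces disclosed by the present application), the service flow described above could be organized roughly as follows:

```python
# Hypothetical handler for a speech synthesis service request sent by the terminal.
from dataclasses import dataclass

@dataclass
class SpeechSynthesisServiceRequest:
    target_text: str  # text content selected on the terminal page

def handle_request(request: SpeechSynthesisServiceRequest, model):
    """Extract the target text data and synthesize speech with the trained model."""
    target_text_data = request.target_text
    # synthesize() stands in for the full text-to-speech pipeline of the deployed
    # speech synthesis model; its signature is assumed for illustration.
    target_synthesized_speech = model.synthesize(target_text_data)
    return target_synthesized_speech  # played back through the terminal loudspeaker
```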
It should be noted that, for example in business handling scenarios under financial technology, the speech synthesis method provided in the present application can synthesize speech from the preset process texts of different business handling procedures, thereby generating target synthesized speech for multiple different procedures. Specifically, if the requirement of the target object is identified as "A card handling", the target synthesized speech that guides the handling of this service can be selected by requirement matching. After the target object is identified as having completed one step of the operation, the guiding speech for the next business handling step is played, so that the target object is guided through business handling by playing the target synthesized speech.
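Purely as a hypothetical sketch (the requirement label, the preset process texts and the callback names below are invented for illustration), the step-by-step guidance flow could look like this:

```python
# Hypothetical guidance loop: each business handling procedure maps to a list
# of preset process texts that are synthesized and played one step at a time.
PRESET_PROCESS_TEXTS = {
    "A card handling": [
        "Please insert your identity document.",
        "Please confirm your personal information on the screen.",
        "Please set a transaction password.",
    ],
}

def guide_business_handling(requirement, synthesize, play, wait_for_step_completion):
    # synthesize(text) -> target synthesized speech via the speech synthesis model
    # play(speech)     -> broadcast through the terminal loudspeaker
    # wait_for_step_completion() -> returns once the target object finishes the step
    for text in PRESET_PROCESS_TEXTS.get(requirement, []):
        play(synthesize(text))
        wait_for_step_completion()
```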
Referring to fig. 10, fig. 10 is a schematic block diagram of a speech synthesis system according to an embodiment of the present application. In some embodiments of the present application, the speech synthesis system includes a sample acquisition module 1010, a model input module 1020, a phoneme encoding module 1030, a phoneme adaptation module 1040, a spectrum prediction module 1050, a parameter adjustment module 1060, and a speech synthesis module 1070.
A sample acquisition module 1010 for acquiring sample data, the sample data including sample phoneme data and sample speech, the sample phoneme data being used to characterize text content of the sample speech;
The model input module 1020 is configured to input the sample phoneme data into a preset original synthesis model, where the original synthesis model includes a phoneme coding sub-model, a variance adaptation sub-model, and a noise reduction sub-model;
the phoneme encoding module 1030 is configured to perform phoneme encoding processing on the sample phoneme data through a phoneme encoding sub-model to obtain phoneme hiding data;
the phoneme adaptation module 1040 is configured to perform phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme feature data;
the spectrum prediction module 1050 is configured to perform spectrum prediction processing on the phoneme alignment data and the phoneme feature data through the noise reduction sub-model to obtain a predicted mel spectrum;
the parameter adjustment module 1060 is configured to perform parameter adjustment on the original synthesis model according to the sample speech and the predicted mel spectrum, so as to obtain a speech synthesis model;
the speech synthesis module 1070 is configured to input the obtained target text data to a speech synthesis model for speech synthesis processing, so as to obtain target synthesized speech.
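For illustration only, a minimal inference-side sketch of how these modules could be wired together is given below; the class and method names, and the final vocoder stage that turns the predicted Mel spectrum into a waveform, are assumptions introduced here and are not part of the module list above.

```python
# Illustrative sketch of the inference path through the modules of Fig. 10.
# All names are assumed for illustration; the vocoder is not part of Fig. 10.
import torch.nn as nn

class SpeechSynthesisSystem(nn.Module):
    def __init__(self, phoneme_encoder, variance_adaptor, denoiser, vocoder):
        super().__init__()
        self.phoneme_encoder = phoneme_encoder    # phoneme coding sub-model
        self.variance_adaptor = variance_adaptor  # variance adaptation sub-model
        self.denoiser = denoiser                  # noise reduction sub-model
        self.vocoder = vocoder                    # assumed mel-to-waveform stage

    def forward(self, phoneme_ids):
        hidden = self.phoneme_encoder(phoneme_ids)          # phoneme hiding data
        aligned, features = self.variance_adaptor(hidden)   # alignment + pitch/energy
        mel = self.denoiser(aligned, features)              # predicted Mel spectrum
        return self.vocoder(mel)                            # target synthesized speech
```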
It should be noted that the speech synthesis system in the embodiments of the present application is configured to execute the above speech synthesis method and corresponds to it one-to-one; for the specific processing, refer to the speech synthesis method described above, and details are not repeated here.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the voice synthesis method of the embodiment of the application when executing the computer program.
The electronic device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a car computer, etc.
The electronic device according to the embodiment of the present application is described in detail below with reference to fig. 11.
Referring to fig. 11, fig. 11 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 1110 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., for executing relevant programs to implement the technical solutions provided in the embodiments of the present application;
the memory 1120 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1120 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 1120 and invoked by the processor 1110 to perform the speech synthesis method of the embodiments of the present application;
An input/output interface 1130 for implementing information input and output;
the communication interface 1140 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.) or in a wireless manner (e.g. mobile network, Wi-Fi, Bluetooth, etc.);
a bus 1150 for transferring information between various components of the device (e.g., processor 1110, memory 1120, input/output interface 1130, and communication interface 1140);
wherein processor 1110, memory 1120, input/output interface 1130, and communication interface 1140 implement communication connections among each other within the device via bus 1150.
The embodiment of the application also provides a computer readable storage medium, and the computer readable storage medium stores a computer program, and the computer program is executed by a processor to realize the voice synthesis method of the embodiment of the application.
According to the speech synthesis method, speech synthesis system, electronic device and storage medium provided by the embodiments of the present application, the ProDiff model is improved in that the loss is predicted directly from the initial clean data, which effectively improves the generation quality of the synthesized speech. In addition, the embodiments of the present application sample the candidate adaptation data according to the preset time step in a parameterized manner and parameterize the denoising model by directly predicting the clean data, thereby avoiding the obvious degradation of data quality during accelerated sampling. By predicting the Mel spectrum from the phoneme alignment data and the phoneme feature data, the diversity of speech synthesis is improved, and by reducing the number of sampling steps to single digits, resource consumption is greatly reduced, which effectively simplifies the amount of computation in the speech synthesis process and further improves the generation efficiency of the synthesized speech.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" or similar expressions refer to any combination of the listed items, including any combination of a single item or plural items. For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units is merely a logical functional division, and there may be other division manners in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between components may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other form.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes multiple instructions for causing an electronic device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
obtaining sample data, wherein the sample data comprises sample phoneme data and sample voice, and the sample phoneme data is used for representing text content of the sample voice;
inputting the sample phoneme data into a preset original synthesis model, wherein the original synthesis model comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model;
performing phoneme coding processing on the sample phoneme data through the phoneme coding sub-model to obtain phoneme hiding data;
performing phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme characteristic data;
performing spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through the noise reduction sub-model to obtain a predicted Mel spectrum;
Performing parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel frequency spectrum to obtain a voice synthesis model;
and inputting the acquired target text data into the voice synthesis model for voice synthesis processing to obtain target synthesized voice.
2. The method of claim 1, wherein performing spectral prediction processing on the phoneme alignment data and the phoneme feature data by the noise reduction sub-model to obtain a predicted mel spectrum comprises:
inputting the phoneme alignment data into the noise reduction sub-model, and performing data sampling on the phoneme alignment data to obtain candidate adaptation data and position information of the candidate adaptation data;
performing spectrum diffusion processing on the candidate adaptation data according to a preset time step to obtain spectrum diffusion data;
performing spectrum inverse sampling processing on the spectrum diffusion data according to the preset time step, the candidate adaptation data and the phoneme characteristic data to obtain predicted spectrum data;
and generating a frequency spectrum of the predicted spectrum data according to the position information to obtain the predicted Mel frequency spectrum.
3. The method according to claim 2, wherein the performing spectrum diffusion processing on the candidate adaptation data according to a preset time step to obtain spectrum diffusion data includes:
Acquiring a noise scheduling parameter of the preset time step;
performing data sampling on the candidate adaptation data to obtain first adaptation data;
carrying out noise adding processing on the first adaptive data according to the preset time step and the noise scheduling parameter to obtain second adaptive data;
and obtaining the spectrum diffusion data according to the first adapting data and the second adapting data.
4. A method according to claim 3, wherein said performing parameter adjustment on said original synthesis model based on said sample speech and said predicted mel spectrum to obtain a speech synthesis model comprises:
performing diffusion parameter calculation according to the noise scheduling parameters and the preset time steps to obtain diffusion process parameters;
acquiring noise distribution data, and performing prediction loss calculation according to the noise distribution data, the candidate adaptation data, the prediction spectrum data, the preset time step, the diffusion process parameters and the phoneme characteristic data to obtain prediction loss data;
and carrying out parameter adjustment on the original synthesis model according to the prediction loss data to obtain the voice synthesis model.
5. The method of any one of claims 1 to 4, wherein the variance adaptation sub-model comprises a duration predictor, a pitch predictor, an energy predictor;
And performing phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme feature data, wherein the phoneme alignment data and the phoneme feature data comprise:
performing phoneme alignment processing on the phoneme hidden data according to the duration predictor to obtain the phoneme alignment data;
performing pitch prediction processing on the phoneme alignment data according to the pitch predictor to obtain first condition data;
performing energy prediction processing on the phoneme alignment data according to the energy predictor to obtain second condition data;
and carrying out data combination on the first condition data and the second condition data to obtain the phoneme characteristic data.
6. The method of claim 5, wherein the pitch predictor comprises a pitch activation layer, a normalization layer, and a pitch projection layer;
the pitch prediction processing is performed on the phoneme alignment data according to the pitch predictor to obtain first condition data, including:
nonlinear processing is carried out on the phoneme alignment data according to the pitch activation layer to obtain pitch activation data;
normalizing the pitch activation data according to the normalization layer to obtain normalized hidden data;
And carrying out linear projection processing on the normalized hidden data according to the pitch projection layer to obtain the first condition data.
7. The method of any one of claims 1 to 4, wherein the phoneme encoding submodel comprises a phoneme convolution layer, a phoneme self-attention layer, and a phoneme projection layer;
the processing of phoneme coding is performed on the sample phoneme data through the phoneme coding submodel to obtain phoneme hiding data, including:
carrying out phoneme convolution processing on the sample phoneme data according to the phoneme convolution layer to obtain phoneme coding data;
performing self-attention processing on the phoneme coding data according to the phoneme self-attention layer to obtain phoneme attention data;
and carrying out linear projection processing on the phoneme attention data according to the phoneme projection layer to obtain the phoneme hiding data.
8. A speech synthesis system, the system comprising:
a sample acquisition module for acquiring sample data, the sample data comprising sample phoneme data and sample speech, the sample phoneme data being used for characterizing text content of the sample speech;
the model input module is used for inputting the sample phoneme data into a preset original synthesis model, and the original synthesis model comprises a phoneme coding sub-model, a variance adaptation sub-model and a noise reduction sub-model;
The phoneme coding module is used for carrying out phoneme coding processing on the sample phoneme data through the phoneme coding sub-model to obtain phoneme hiding data;
the phoneme adaptation module is used for carrying out phoneme adaptation processing on the phoneme hidden data through the variance adaptation sub-model to obtain phoneme alignment data and phoneme characteristic data;
the frequency spectrum prediction module is used for carrying out frequency spectrum prediction processing on the phoneme alignment data and the phoneme characteristic data through the noise reduction sub-model to obtain a predicted Mel frequency spectrum;
the parameter adjustment module is used for carrying out parameter adjustment on the original synthesis model according to the sample voice and the predicted Mel frequency spectrum to obtain a voice synthesis model;
and the voice synthesis module is used for inputting the acquired target text data into the voice synthesis model to perform voice synthesis processing so as to obtain target synthesized voice.
9. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 7.
CN202310727618.7A 2023-06-16 2023-06-16 Speech synthesis method, speech synthesis system, electronic device, and storage medium Pending CN116564273A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310727618.7A CN116564273A (en) 2023-06-16 2023-06-16 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310727618.7A CN116564273A (en) 2023-06-16 2023-06-16 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116564273A true CN116564273A (en) 2023-08-08

Family

ID=87488178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310727618.7A Pending CN116564273A (en) 2023-06-16 2023-06-16 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116564273A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination