CN116682411A - Speech synthesis method, speech synthesis system, electronic device, and storage medium - Google Patents

Speech synthesis method, speech synthesis system, electronic device, and storage medium Download PDF

Info

Publication number
CN116682411A
CN116682411A (application CN202310636076.2A)
Authority
CN
China
Prior art keywords
sample
emotion
voice
feature
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310636076.2A
Other languages
Chinese (zh)
Inventor
郭洋 (Guo Yang)
王健宗 (Wang Jianzong)
程宁 (Cheng Ning)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310636076.2A priority Critical patent/CN116682411A/en
Publication of CN116682411A publication Critical patent/CN116682411A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium, belonging to the technical field of financial technology. The method comprises the following steps: acquiring a sample phoneme sequence of a sample speech text; encoding the sample phoneme sequence through a text encoding module to obtain a phoneme encoding feature; obtaining a global emotion feature of the sample phoneme sequence through a global recognition module; obtaining an emotion transformation feature of the sample initial speech through a sentence-level recognition module; determining a target sample feature from the phoneme encoding feature, the global emotion feature, and the emotion transformation feature; performing speech synthesis on the target sample feature through a speech synthesis module to obtain a predicted synthesized speech; adjusting parameters of the model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model; and inputting a target speech text into the emotion speech synthesis model to synthesize a target synthesized speech. The embodiment of the application can generate high-quality synthesized speech with richer emotional expression.

Description

Speech synthesis method, speech synthesis system, electronic device, and storage medium
Technical Field
The present application relates to the technical field of financial technology, and in particular to a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium.
Background
Speech synthesis is a technique for converting a specified text into synthesized speech of a target speaker. As intelligent speech technology is widely used in financial-technology scenarios such as voice interaction, information broadcasting, audio reading, and intelligent sales, target objects place increasingly high requirements on the quality of speech synthesis. With the rapid development of deep learning, the naturalness and audio quality of synthesized speech have improved greatly. However, human speech is rich in expressiveness and emotion, and making synthesized speech better imitate the emotional expression of human speech while remaining natural, fluent, and highly faithful is important for advancing the application of speech synthesis technology. Currently, speech synthesis methods in the related art typically use an explicit emotion type label as a condition to generate emotion-bearing synthesized speech from the original text. However, synthesized speech obtained in this way only learns an average emotional expression and cannot convey the fine stylistic variation of emotion within speech, so it cannot produce high-quality synthesized speech with richer emotional expression. Therefore, how to deeply mine the fine emotion information contained in text in order to generate high-quality synthesized speech with richer emotional expression is a technical problem to be solved.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium that can deeply mine the fine emotion information contained in text so as to generate high-quality synthesized speech with richer emotional expression.
To achieve the above object, a first aspect of an embodiment of the present application provides a speech synthesis method, including:
acquiring sample data, wherein the sample data comprises a sample speech text and a sample initial speech of the sample speech text;
performing text conversion on the sample speech text to obtain a sample phoneme sequence;
inputting the sample data into a preset initial speech synthesis model, wherein the initial speech synthesis model comprises a text encoding module, a global recognition module, a sentence-level recognition module and a speech synthesis module;
encoding the sample phoneme sequence through the text encoding module to obtain a phoneme encoding feature;
performing emotion recognition processing on the sample phoneme sequence through the global recognition module to obtain a global emotion feature;
performing emotion feature extraction on the sample initial speech through the sentence-level recognition module to obtain an emotion transformation feature;
performing feature stitching on the phoneme encoding feature, the global emotion feature and the emotion transformation feature to obtain a target sample feature;
performing speech synthesis processing on the target sample feature through the speech synthesis module to obtain a predicted synthesized speech;
performing parameter adjustment on the initial speech synthesis model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model;
and inputting a target speech text to be processed into the emotion speech synthesis model for speech synthesis processing to obtain a target synthesized speech.
In some embodiments, the performing, by the global recognition module, emotion recognition processing on the sample phoneme sequence to obtain a global emotion feature includes:
carrying out emotion recognition processing on the sample phoneme sequence according to the global recognition module to obtain an emotion type label and a classification predicted value of the emotion type label;
searching from a preset emotion vector lookup table according to the emotion type label to obtain an emotion embedded vector of the emotion type label;
and carrying out weighted calculation according to the classification predicted value and the emotion embedded vector to obtain the global emotion feature.
In some embodiments, the global recognition module includes a pre-training model and an emotion classifier, and the performing emotion recognition processing on the sample phoneme sequence according to the global recognition module to obtain an emotion type label and a classification predicted value of the emotion type label includes:
extracting emotion characteristics of the sample phoneme sequence according to the pre-training model to obtain sample prediction characteristics;
carrying out emotion classification prediction on the sample prediction features according to the emotion classifier to obtain classification prediction features;
and carrying out de-linearization processing on the classification prediction features according to a preset activation function to obtain the emotion type labels and the classification prediction values of the emotion type labels.
In some embodiments, the emotion classifier includes a multi-head attention unit and a global convolution unit, and the performing emotion classification prediction on the sample prediction feature according to the emotion classifier to obtain a classification prediction feature includes:
performing self-attention processing on the sample prediction features according to the multi-head attention unit to obtain attention features;
performing feature fusion on the sample prediction feature and the attention feature to obtain an attention fusion feature;
normalizing the attention fusion feature to obtain a first prediction feature;
carrying out global feature extraction on the first prediction feature according to the global convolution unit to obtain a global convolution feature;
performing feature fusion on the first prediction feature and the global convolution feature to obtain a second prediction feature;
and carrying out normalization processing on the second prediction features to obtain the classification prediction features.
In some embodiments, the sentence-level recognition module includes a sentence-level encoder, and the extracting, by the sentence-level recognition module, emotion features of the sample initial speech to obtain emotion transformation features includes:
performing audio conversion on the sample initial voice to obtain a sample Mel frequency spectrum;
extracting emotion characteristics of the sample Mel frequency spectrum through the sentence-level encoder to obtain sentence-level hidden characteristics;
and performing feature conversion on the sentence-level hidden features to obtain emotion conversion features.
In some embodiments, the sentence-level recognition module further includes a sentence-level convolution unit, a correction unit, and a feature mapping unit, the method further including: training the sentence-level encoder, specifically including:
performing sentence-level feature extraction on the phoneme coding features according to the sentence-level convolution unit to obtain sentence-level convolution features;
correcting the sentence-level convolution characteristic according to the correcting unit to obtain a corrected characteristic;
performing feature mapping processing on the correction features according to the feature mapping unit to obtain sentence-level prediction features;
carrying out loss calculation on the emotion transformation characteristics and the sentence-level prediction characteristics according to a preset loss function to obtain sentence-level prediction loss values;
and carrying out parameter adjustment on a preset initial encoder according to the sentence-level prediction loss value to obtain the sentence-level encoder.
In some embodiments, the speech synthesis module includes a prior encoder, a posterior encoder, a duration predictor and a decoder, and the performing, by the speech synthesis module, speech synthesis processing on the target sample feature to obtain the predicted synthesized speech includes:
performing feature coding processing on the target sample features according to the prior encoder to obtain prior coding features;
performing short-time Fourier transform on the sample initial voice to obtain a sample linear frequency spectrum;
extracting hidden variables from the sample linear frequency spectrum according to the posterior encoder to obtain sample hidden variable characteristics;
extracting phoneme duration from the target sample features according to the duration predictor to obtain a sample phoneme duration;
monotonically aligning and searching the sample hidden variable features and the target sample features according to the sample phoneme duration, and determining a target alignment matrix;
and decoding the target sample characteristics according to the target alignment matrix and the decoder to obtain the predicted synthesized voice.
To achieve the above object, a second aspect of an embodiment of the present application proposes a speech synthesis system, the system comprising:
the voice sample acquisition module is used for acquiring sample data, wherein the sample data comprises sample voice text and sample initial voice of the sample voice text;
the text conversion module is used for carrying out text conversion on the sample voice text to obtain a sample phoneme sequence;
the model input module is used for inputting the sample data into a preset initial voice synthesis model, wherein the initial voice synthesis model comprises a text coding module, a global recognition module, a sentence-level recognition module and a voice synthesis module;
the coding module is used for coding the sample phoneme sequence through the text coding module to obtain phoneme coding characteristics;
The global emotion recognition module is used for carrying out emotion recognition processing on the sample phoneme sequence through the global recognition module to obtain global emotion characteristics;
the sentence-level feature extraction module is used for extracting emotion features of the sample initial voice through the sentence-level recognition module to obtain emotion conversion features;
the feature splicing module is used for carrying out feature splicing on the phoneme coding feature, the global emotion feature and the emotion transformation feature to obtain a target sample feature;
the voice conversion module is used for carrying out voice synthesis processing on the target sample characteristics through the voice synthesis module to obtain predicted synthesized voice;
the parameter adjustment module is used for carrying out parameter adjustment on the initial speech synthesis model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model;
and the voice synthesis module is used for inputting the target voice text to be processed into the emotion voice synthesis model to perform voice synthesis processing so as to obtain target synthesized voice.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device comprising a memory and a processor, the memory storing a computer program, the processor implementing the method according to any one of the first aspect of the embodiments of the present application when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application also proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements a method according to any one of the first aspect of the embodiments of the present application.
According to the speech synthesis method, the speech synthesis system, the electronic device, and the storage medium of the embodiments of the present application, sample data is first acquired, where the sample data includes a sample speech text and a sample initial speech of the sample speech text, and the sample initial speech carries the speech emotion information with which the sample speech text needs to be synthesized. Text conversion is performed on the sample speech text to obtain a sample phoneme sequence. The sample data is input into a preset initial speech synthesis model, where the initial speech synthesis model includes a text encoding module, a global recognition module, a sentence-level recognition module, and a speech synthesis module. The sample phoneme sequence is encoded by the text encoding module to obtain a phoneme encoding feature; emotion recognition processing is performed on the sample phoneme sequence by the global recognition module to obtain a global emotion feature; and emotion feature extraction is performed on the sample initial speech by the sentence-level recognition module to obtain an emotion transformation feature. Then, feature stitching is performed on the phoneme encoding feature, the global emotion feature, and the emotion transformation feature to obtain a target sample feature. Speech synthesis processing is performed on the target sample feature by the speech synthesis module to obtain a predicted synthesized speech, which represents synthesized speech generated with the same speech emotion information as the sample initial speech. By constructing an initial speech synthesis model that includes a text encoding module, a global recognition module, a sentence-level recognition module, and a speech synthesis module, the embodiment of the application can deeply mine the fine emotion information contained in text when performing speech synthesis on the sample data. The parameters of the initial speech synthesis model are adjusted using the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model that has the same structure as the initial speech synthesis model but can generate high-quality synthesized speech with richer emotional expression. Therefore, when the emotion speech synthesis model provided by the embodiment of the application performs speech synthesis processing on a target speech text, it can deeply mine the fine emotion information contained in the text to generate high-quality synthesized speech with richer emotional expression.
Drawings
FIG. 1 is a first flowchart of a speech synthesis method according to an embodiment of the present application;
FIG. 2 is a flowchart of a specific method of step S1050 in FIG. 1;
FIG. 3 is a flowchart of a specific method of step S210 in FIG. 2;
FIG. 4 is a flowchart of a specific method of step S320 in FIG. 3;
FIG. 5 is a flowchart of a specific method of step S1060 in FIG. 1;
FIG. 6 is a second flowchart of a speech synthesis method provided by an embodiment of the present application;
FIG. 7 is a flow chart of a specific method of step S1080 of FIG. 1;
FIG. 8 is a block diagram of a speech synthesis system according to an embodiment of the present application;
fig. 9 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
First, several nouns involved in the present application are parsed:
artificial intelligence (Artificial Intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (Natural Language Processing, NLP): NLP is a branch of artificial intelligence at the intersection of computer science and linguistics, often referred to as computational linguistics, concerned with processing, understanding, and applying human languages (e.g., Chinese, English). Natural language processing includes syntactic analysis, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.
Speech synthesis (Text-To-Speech, TTS): TTS is a technology from text to speech and generally comprises two steps: the first step is text processing, which mainly converts text into a phoneme sequence and marks out the information of the start and stop time, frequency change and the like of each phoneme; the second step is speech synthesis, which mainly generates speech according to the phoneme sequence (and the marked information such as start and stop time, frequency change, etc.).
Transformer model: the method is widely applied to the fields of natural language processing, such as machine translation, question-answering systems, text abstracts, speech recognition and the like.
BERT (Bidirectional Encoder Representation from Transformers) model: a pre-trained language model constructed on the basis of the Transformer, used to further improve the generalization capability of word vector models and to fully characterize character-level, word-level, sentence-level, and even inter-sentence relationship features.
Speech synthesis is a technique for converting a specified text into synthesized speech of a target speaker, and is a core technology serving tasks such as voice interaction, information broadcasting, and audio reading. As intelligent speech technology is widely used in financial-technology scenarios such as voice interaction, information broadcasting, audio reading, and intelligent sales, the requirements on the effect of speech synthesis keep rising. With the rapid development of deep learning, the naturalness and audio quality of synthesized speech have improved greatly. However, human speech is rich in expressiveness and emotion, and presenting the proper emotion in synthesized speech is critical to building a diverse speech generation system. Currently, speech synthesis methods in the related art typically use an explicit emotion type label as a condition to generate emotion-bearing synthesized speech from the original text. However, synthesized speech obtained in this way only learns an average emotional expression and cannot convey the fine stylistic variation of emotion within speech, so it cannot produce high-quality synthesized speech with richer emotional expression. Therefore, how to deeply mine the fine emotion information contained in text in order to generate high-quality synthesized speech with richer emotional expression is a technical problem to be solved.
Based on the above, the voice synthesis method, the voice synthesis system, the electronic device and the storage medium provided by the embodiment of the application can deeply mine the fine emotion information contained in the text to generate high-quality synthesized voice with richer emotion expression.
The embodiment of the application provides a voice synthesis method, a voice synthesis system, an electronic device and a storage medium, and specifically, the following embodiment is used for explaining, and first describes the voice synthesis method in the embodiment of the application.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a voice synthesis method, relates to the technical field of artificial intelligence, and particularly relates to the technical field of animation processing. The voice synthesis method provided by the embodiment of the application can be applied to the terminal, can be applied to the server side, and can also be software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, or smart watch, etc.; the server can be an independent server, and can also be a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDNs), basic cloud computing services such as big data and artificial intelligent platforms, and the like; the software may be an application or the like that implements a speech synthesis method, but is not limited to the above form.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.
Referring to fig. 1, fig. 1 is an optional flowchart of a speech synthesis method according to an embodiment of the present application, and in some embodiments of the present application, the speech synthesis method includes, but is not limited to, steps S1010 to S1100, and these ten steps are described in detail below with reference to fig. 1.
Step S1010, obtaining sample data, wherein the sample data comprises sample voice text and sample initial voice of the sample voice text;
Step S1020, performing text conversion on the sample voice text to obtain a sample phoneme sequence;
step S1030, inputting sample data into a preset initial speech synthesis model, wherein the initial speech synthesis model comprises a text coding module, a global recognition module, a sentence-level recognition module and a speech synthesis module;
step S1040, coding the sample phoneme sequence through a text coding module to obtain phoneme coding characteristics;
step S1050, carrying out emotion recognition processing on the sample phoneme sequence through a global recognition module to obtain global emotion characteristics;
step S1060, extracting emotion characteristics of the initial voice sample through a sentence-level recognition module to obtain emotion conversion characteristics;
step S1070, performing feature stitching on the phoneme coding features, the global emotion features and the emotion transformation features to obtain target sample features;
step S1080, performing voice synthesis processing on the target sample characteristics through a voice synthesis module to obtain predicted synthesized voice;
step S1090, carrying out parameter adjustment on the initial speech synthesis model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model;
step S1100, inputting the target voice text to be processed into the emotion voice synthesis model for voice synthesis processing, and obtaining target synthesized voice.
It should be noted that, in an actual application environment, the speech synthesis method provided by the embodiment of the present application may be executed by a terminal, by a server, or cooperatively by the terminal and the server. Taking terminal-side execution as an example, the method specifically includes: the terminal or the server locally acquires a target speech text to be processed, obtains a corresponding target phoneme sequence based on the text, and performs speech synthesis processing on the target phoneme sequence through a pre-trained emotion speech synthesis model to obtain a target synthesized speech carrying the emotion of the target speech text. The emotion speech synthesis model is trained by the terminal or the server based on sample phoneme sequences of sample speech texts and the corresponding sample initial speech, and is deployed on the terminal. The speech synthesis method may also be deployed on a server, so that the server can likewise carry out the steps of the speech synthesis method described above.
In step S1010 of some embodiments, a training sample set is obtained, the training sample set comprising at least one sample data, the sample data comprising a sample speech text and a sample initial speech of the sample speech text. Wherein the sample data further comprises a sample emotion tag pre-tagged to the sample voice text.
It should be noted that the sample data may be obtained by performing voice recording for an acquired sample speech text and determining the corresponding sample initial speech, or by performing text conversion on an acquired sample initial speech to determine the corresponding sample speech text. The sample initial speech represents the real speech of the sample speech text, and the sample initial speech and the sample speech text share the same sample emotion label.
It should be noted that the preset emotion type labels may include, for example, questioning, happy, and the like, and the sample emotion label in the present application is any one of the preset emotion type labels.
It should be noted that, the sample voice text obtained by the embodiment of the present application may relate to various fields, such as science and technology, sports, leisure and entertainment, food and literature, that is, the voice synthesis method proposed by the present application may be applied to different fields. For example, in a voice interaction scene of financial science and technology, various tone voices with emotion colors such as happy, charming, sorry and the like are synthesized for the intelligent interaction robot, so that more vitality is given to emotion expression of the robot, and man-machine interaction experience is improved.
It should be noted that, the sample initial voice in the present application may be MP3 format, CDA format, WAV format, WMA format, RA format, MIDI format, OGG format, APE format, AAC format, etc., and the present application is not limited thereto.
In step S1020 of some embodiments, in order to make the obtained synthesized speech better match the actual pronunciation, text conversion is performed on the sample speech text to obtain a sample phoneme sequence. Specifically, a pre-trained acoustic model may be used to perform the text conversion: word segmentation is performed on the sample speech text to obtain the text segments to be synthesized; phoneme conversion is performed on each text segment to obtain the phonemes of each segment to be synthesized; and the obtained segment phonemes are combined to obtain the sample phoneme sequence of the sample speech text. A phoneme refers to a pronunciation unit of a character or word in the text to be synthesized, such as the initials and finals of Chinese characters. Correspondingly, a sample phoneme sequence refers to a sequence composed of multiple phonemes. For example, if the sample speech text is "which one do you like", the corresponding text segments are "you", "like", "which" and "one"; after the phonemes corresponding to each text segment are obtained, they are combined, and the sample phoneme sequence of the sample speech text is [ni xi huan na yi ge].
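The text-to-phoneme step above can be sketched roughly as follows. This is a minimal illustration in Python; the `segment_words` tokenizer and the small pinyin table are hypothetical placeholders, not the pre-trained acoustic model referred to in the embodiment.

```python
# Minimal sketch of the text-to-phoneme conversion described above.
# `segment_words` and `PINYIN_TABLE` are illustrative placeholders.
from typing import List

PINYIN_TABLE = {"你": "ni", "喜": "xi", "欢": "huan", "哪": "na", "一": "yi", "个": "ge"}

def segment_words(text: str) -> List[str]:
    # Placeholder word segmentation: one character per segment.
    return list(text)

def text_to_phonemes(text: str) -> List[str]:
    phonemes = []
    for segment in segment_words(text):
        # Phoneme conversion for each text segment, then concatenation.
        phonemes.append(PINYIN_TABLE.get(segment, segment))
    return phonemes

print(text_to_phonemes("你喜欢哪一个"))  # ['ni', 'xi', 'huan', 'na', 'yi', 'ge']
```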
In step S1030 of some embodiments, an initial speech synthesis model is constructed, the initial speech synthesis model including a text encoding module, a global recognition module, a sentence-level recognition module, and a speech synthesis module. The global recognition module is used for learning the emotion type of the text globally from the text, and the sentence-level recognition module is used for learning the emotion change trend of the intonation in the sentence. And inputting the sample data into a preset initial voice synthesis model, and deeply mining fine emotion information contained in the text.
In step S1040 of some embodiments, after the sample data is input into the preset initial speech synthesis model, the text encoding module encodes the sample phoneme sequence to obtain the phoneme encoding feature. It should be noted that the text encoding module may be constructed using a Transformer model, whose encoder portion may be formed by stacking n encoder layers, each consisting of two sub-layer connection structures. The first sub-layer connection structure may include a multi-head attention sub-layer, a normalization layer, and a residual connection; the second sub-layer connection structure may include a feed-forward fully connected sub-layer, a normalization layer, and a residual connection.
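As a rough illustration of such a text encoder, the sketch below stacks standard PyTorch Transformer encoder layers (multi-head attention and feed-forward sub-layers, each with a residual connection and layer normalization). The embedding size, head count, and layer count are assumed values, not figures from the patent.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of a Transformer-based text encoding module (assumed hyperparameters)."""
    def __init__(self, vocab_size: int, d_model: int = 256, n_heads: int = 4, n_layers: int = 4):
        super().__init__()
        self.phoneme_embedding = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model, batch_first=True
        )  # multi-head attention + feed-forward, each with residual connection and LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, phoneme_ids: torch.Tensor) -> torch.Tensor:
        # phoneme_ids: (batch, seq_len) -> phoneme encoding features (batch, seq_len, d_model)
        return self.encoder(self.phoneme_embedding(phoneme_ids))

# Usage: features = TextEncoder(vocab_size=100)(torch.randint(0, 100, (2, 12)))
```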
In step S1050 of some embodiments, in order to more accurately identify the emotion type of the sample speech text, the global emotion feature is obtained by performing emotion recognition processing on the sample phoneme sequence by using the global recognition module.
Specifically, referring to fig. 2, fig. 2 is a flowchart of a specific method of step S1050 according to an embodiment of the present application. In some embodiments of the present application, step S1050 may specifically include, but is not limited to, steps S210 to S230, which are described in detail below in conjunction with fig. 2.
Step S210, carrying out emotion recognition processing on the sample phoneme sequence according to the global recognition module to obtain emotion type labels and classification predicted values of the emotion type labels;
step S220, searching from a preset emotion vector lookup table according to the emotion type label to obtain an emotion embedded vector of the emotion type label;
and step S230, carrying out weighted calculation according to the classification predicted value and the emotion embedded vector to obtain global emotion characteristics.
In step S210 of some embodiments, in order to predict the emotion type to which the sample text sentence belongs, the sample phoneme sequence is input into the global recognition module, so that emotion recognition processing is performed on the sample phoneme sequence according to the global recognition module, and an emotion type label and a classification predicted value of the emotion type label are obtained, that is, a probability value corresponding to each emotion type label preset to which the sample text sentence belongs is predicted.
In step S220 of some embodiments, the global recognition module of the present application includes a global emotion feature extractor, where the global emotion feature extractor can train and construct an emotion vector lookup table according to a sample emotion label of each sample voice text in a training sample set, where the emotion vector lookup table is used to store a mapping relationship between emotion type labels and emotion embedded vectors, so that sample data belonging to the same emotion type label corresponds to the same emotion embedded vector, and the emotion embedded vector can represent information of the emotion type. Specifically, searching is performed from a preset emotion vector lookup table according to emotion type labels, and an emotion embedded vector of each emotion type label is determined and used for representing the characteristics of the embedded vector corresponding to the emotion type label.
It should be noted that the lookup in the emotion vector lookup table satisfies formula (1), so that the emotion embedding vector of each emotion type label is determined according to the emotion vector lookup table:

h_glo = f_1(e_i)    (1)

wherein f_1 denotes the lookup function of the emotion vector lookup table, e_i denotes the real i-th emotion type label, and h_glo denotes the emotion embedding vector corresponding to the emotion type label e_i.
In step S230 of some embodiments, in order to avoid inaccurate emotion expression caused by emotion type prediction errors, the embodiment of the present application performs global emotion recognition on the sample phoneme sequence by means of a weighted combination of emotion embeddings. Specifically, as shown in formula (2), a weighted calculation is performed on the classification predicted values and the emotion embedding vectors to obtain the global emotion feature, denoted h̃_glo:

h̃_glo = Σ_{i=1}^{M} p_i · f_1(e_i)    (2)

wherein M denotes the total number of preset emotion type labels, and p_i denotes the classification predicted value that the sample phoneme sequence is predicted to belong to emotion type label e_i.
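A compact sketch of formulas (1) and (2): an embedding table plays the role of the emotion vector lookup table, and the global emotion feature is the probability-weighted sum of the emotion embeddings. The label count and embedding size below are assumptions for illustration.

```python
import torch
import torch.nn as nn

NUM_EMOTIONS, EMB_DIM = 6, 256  # assumed: M emotion type labels, embedding size

# Emotion vector lookup table f_1: label index e_i -> emotion embedding h_glo (formula (1))
emotion_lookup = nn.Embedding(NUM_EMOTIONS, EMB_DIM)

def global_emotion_feature(class_probs: torch.Tensor) -> torch.Tensor:
    """Weighted combination of emotion embeddings (formula (2)).

    class_probs: (batch, M) classification predicted values p_i summing to 1.
    returns:     (batch, EMB_DIM) global emotion feature.
    """
    all_embeddings = emotion_lookup.weight   # (M, EMB_DIM), one row per label e_i
    return class_probs @ all_embeddings      # sum_i p_i * f_1(e_i)

probs = torch.softmax(torch.randn(2, NUM_EMOTIONS), dim=-1)
h_glo = global_emotion_feature(probs)        # (2, 256)
```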
Referring to fig. 3, fig. 3 is a flowchart of a specific method of step S210 according to an embodiment of the application. In some embodiments of the present application, the global recognition module further includes a pre-training model and an emotion classifier, and step S210 may specifically include, but is not limited to, steps S310 to S330, which are described in detail below in conjunction with fig. 3.
Step S310, carrying out emotion feature extraction on a sample phoneme sequence according to a pre-training model to obtain sample prediction features;
step S320, carrying out emotion classification prediction on the sample prediction features according to the emotion classifier to obtain classification prediction features;
Step S330, performing a de-linearization process on the classification prediction features according to a preset activation function to obtain emotion type labels and classification prediction values of the emotion type labels.
In step S310 of some embodiments, in order to improve the accuracy of emotion prediction by the global recognition module, emotion feature extraction is performed on the sample phoneme sequence according to the pre-training model to obtain the sample prediction features corresponding to the sample phoneme sequence. The pre-training model is trained as follows: an initial network model is constructed based on the BERT model structure, and, in order to prevent the trained pre-training model from overfitting when processing the sample data, a pre-training sample set is acquired that includes a plurality of pre-training text emotion samples and the emotion classification labels corresponding to these samples. The model parameters of the initial network model are fine-tuned according to the pre-training sample set to obtain the pre-training model, which can effectively improve the computation speed and classification accuracy of the pre-training model.
It should be noted that, in the embodiment of the present application, the BERT-based initial network model may be fine-tuned using data sets such as NLPCC2013 and NLPCC2014, and the emotion types contained in the pre-training sample set are consistent with those contained in the training sample set of the whole model.
In step S320 and step S330 of some embodiments, emotion classification prediction is performed on the sample prediction features according to the emotion classifier to obtain the classification prediction features, which characterize the feature information under each emotion type label. Then, the de-linearization processing (a non-linear activation) is applied to the classification prediction features according to the preset activation function, that is, the classification outputs are mapped into the [0,1] interval, and the classification predicted value corresponding to each emotion type label is obtained.
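The mapping into [0,1] can be realized, for example, with a softmax activation; the sketch below assumes softmax, which the embodiment does not name explicitly.

```python
import torch
import torch.nn.functional as F

# Classification prediction features -> per-label predicted values in [0,1].
# Softmax is an assumed choice of activation; the embodiment only requires the
# outputs to be mapped into the [0,1] interval.
logits = torch.randn(1, 6)                  # classification prediction features for M=6 labels
class_probs = F.softmax(logits, dim=-1)     # classification predicted values p_i, summing to 1
emotion_label = class_probs.argmax(dim=-1)  # predicted emotion type label e_i
```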
Referring to fig. 4, fig. 4 is a flowchart of a specific method of step S320 according to an embodiment of the application. In some embodiments of the present application, the emotion classifier includes a multi-head attention unit and a global convolution unit, and step S320 may specifically include, but is not limited to, steps S410 to S460, which are described in detail below in conjunction with fig. 4.
Step S410, performing self-attention processing on the sample prediction features according to the multi-head attention unit to obtain attention features;
step S420, carrying out feature fusion on the sample prediction features and the attention features to obtain attention fusion features;
step S430, carrying out normalization processing on the attention fusion characteristics to obtain first prediction characteristics;
Step S440, carrying out global feature extraction on the first predicted feature according to the global convolution unit to obtain a global convolution feature;
step S450, carrying out feature fusion on the first prediction feature and the global convolution feature to obtain a second prediction feature;
step S460, carrying out normalization processing on the second prediction features to obtain classified prediction features.
In steps S410 to S460 of some embodiments, in order to improve the classification accuracy of the emotion classifier, the emotion classifier of the embodiment of the present application may include a multi-head attention unit and a global convolution unit, so as to help the emotion classifier capture more abundant feature information, and effectively avoid overfitting of the model. Specifically, the sample prediction features are subjected to self-attention processing according to the multi-head attention unit to obtain attention features, and then the sample prediction features and the attention features are subjected to feature fusion in a residual connection mode to obtain attention fusion features. In order to limit the obtained characteristic data in a certain range so as to reduce adverse effects caused by singular sample data, the attention fusion characteristic is normalized to obtain a first prediction characteristic. And then, in order to improve the operation speed of the model, carrying out global feature extraction on the first prediction feature according to the global convolution unit to obtain a global convolution feature. The global convolution unit may operate in the form of a one-dimensional convolution. And carrying out feature fusion on the first prediction feature and the global convolution feature according to a residual error connection mode to obtain a second prediction feature. And carrying out normalization processing on the second prediction features to obtain the classification prediction features.
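A rough PyTorch sketch of such a classifier block follows: multi-head self-attention with a residual connection and normalization, then a one-dimensional convolution with another residual connection and normalization. All sizes, the pooling step, and the final linear head are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmotionClassifierBlock(nn.Module):
    """Sketch of the multi-head attention unit + global convolution unit (assumed sizes)."""
    def __init__(self, d_model: int = 256, n_heads: int = 4, num_emotions: int = 6):
        super().__init__()
        self.attention = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.global_conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.norm2 = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, num_emotions)  # maps to per-label outputs

    def forward(self, sample_pred: torch.Tensor) -> torch.Tensor:
        # Self-attention, residual fusion, normalization -> first prediction feature
        attn_out, _ = self.attention(sample_pred, sample_pred, sample_pred)
        first_pred = self.norm1(sample_pred + attn_out)
        # Global convolution (1-D), residual fusion, normalization -> second prediction feature
        conv_out = self.global_conv(first_pred.transpose(1, 2)).transpose(1, 2)
        second_pred = self.norm2(first_pred + conv_out)
        # Pool over the sequence and map to per-label logits (classification prediction features)
        return self.classifier(second_pred.mean(dim=1))

logits = EmotionClassifierBlock()(torch.randn(2, 12, 256))  # (2, 6)
```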
In step S1060 of some embodiments, in order to learn the prosody information specific to each sentence in the text, such as intra-sentence emotion change information and intonation transition information, the embodiment of the present application performs emotion feature extraction on the sample initial speech according to the sentence-level recognition module to obtain the emotion transformation feature.
Specifically, referring to fig. 5, fig. 5 is a flowchart of a specific method of step S1060 according to an embodiment of the present application. In some embodiments of the present application, the sentence-level recognition module includes a sentence-level encoder, and step S1060 may specifically include, but is not limited to, steps S510 to S530, which are described in detail below in conjunction with fig. 5.
Step S510, performing audio conversion on the initial voice of the sample to obtain a sample Mel frequency spectrum;
step S520, extracting emotion characteristics of a sample Mel frequency spectrum through a sentence-level encoder to obtain sentence-level hidden characteristics;
and step S530, performing feature conversion on the sentence-level hidden features to obtain emotion conversion features.
In step S510 and step S520 of some embodiments, in order to extract intra-sentence emotion change information, intonation conversion information, and the like of each sentence, audio conversion is performed on the sample initial speech, so as to obtain a sample mel spectrum corresponding to the sample initial speech. And extracting emotion characteristics of the sample Mel frequency spectrum through a preset sentence-level encoder to obtain sentence-level hidden characteristics, wherein the sentence-level hidden characteristics are used for representing emotion change information of sentence levels.
In step S530 of some embodiments, in order to preserve the feature information that is effective for predicting emotion change in speech, the present application may connect a Long Short-Term Memory (LSTM) network to the output of the sentence-level encoder, that is, perform feature conversion on the sentence-level hidden features according to the long short-term memory network, and take the output of the last time step of the LSTM as the emotion transformation feature, which is a fixed-length feature vector and may be denoted as h_utt.
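A minimal sketch of this sentence-level branch, assuming a small convolutional front end over the mel spectrum followed by an LSTM whose last time step gives the fixed-length emotion transformation feature h_utt; the mel dimension and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class SentenceLevelEncoder(nn.Module):
    """Sketch: mel spectrum -> sentence-level hidden features -> LSTM -> h_utt."""
    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.frontend = nn.Sequential(        # assumed convolutional front end
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        hidden_feats = self.frontend(mel).transpose(1, 2)   # sentence-level hidden features
        outputs, _ = self.lstm(hidden_feats)
        return outputs[:, -1, :]                            # last time step -> h_utt (batch, hidden)

h_utt = SentenceLevelEncoder()(torch.randn(2, 80, 120))     # (2, 256)
```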
Referring to fig. 6, fig. 6 is another alternative flowchart of a speech synthesis method according to an embodiment of the application. In some embodiments of the present application, the sentence-level recognition module further includes a sentence-level convolution unit, a correction unit, and a feature mapping unit, and after step S520, the speech synthesis method provided in the embodiment of the present application may further include the steps of: the sentence-level encoder is trained, and this step may include, but is not limited to, steps S610 through S650, which are described in detail below in conjunction with fig. 6.
Step S610, sentence level feature extraction is carried out on the phoneme coding features according to the sentence level convolution unit, so that sentence level convolution features are obtained;
step S620, correcting the sentence-level convolution characteristic according to the correcting unit to obtain a corrected characteristic;
Step S630, carrying out feature mapping processing on the corrected features according to the feature mapping unit to obtain sentence-level prediction features;
step S640, carrying out loss calculation on the emotion transformation characteristics and the sentence level prediction characteristics according to a preset loss function to obtain sentence level prediction loss values;
step S650, parameter adjustment is performed on the preset initial encoder according to the sentence-level prediction loss value, so as to obtain the sentence-level encoder.
In steps S610 to S650 of some embodiments, in order to improve the encoding capability of the sentence-level encoder to obtain feature information capable of accurately reflecting the overall intra-sentence emotion variation, first, sentence-level feature extraction is performed on the phoneme encoding feature according to a sentence-level convolution unit to obtain a sentence-level convolution feature, where the sentence-level convolution unit may use one-dimensional convolution or two-dimensional convolution. In order to limit the obtained characteristic data within a certain range so as to reduce adverse effects caused by singular sample data, the sentence-level convolution characteristic is corrected according to a correction unit to obtain a correction characteristic. In order to make the generalization capability of the model stronger, carrying out feature mapping processing on the corrected features according to a feature mapping unit to obtain sentence-level prediction features, wherein the feature mapping unit can be constructed by adopting a dropout network. And then, carrying out loss calculation on the emotion transformation characteristic and the sentence level prediction characteristic according to a preset loss function to obtain a sentence level prediction loss value, and carrying out parameter adjustment on a preset initial encoder according to the sentence level prediction loss value until the sentence level prediction loss value of the initial encoder reaches a model ending condition to obtain the sentence level encoder.
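The training of the sentence-level predictor against the encoder output could look roughly like the sketch below: a convolution over the phoneme encoding features, an activation as the correction unit, dropout plus a linear mapping as the feature mapping unit, and a loss against h_utt. The choice of ReLU, MSE loss, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

class SentenceLevelPredictor(nn.Module):
    """Sketch of the sentence-level convolution / correction / feature mapping units."""
    def __init__(self, d_model: int = 256, hidden: int = 256, p_drop: float = 0.1):
        super().__init__()
        self.conv = nn.Conv1d(d_model, hidden, kernel_size=3, padding=1)  # sentence-level convolution unit
        self.correction = nn.ReLU()                                       # correction unit (assumed ReLU)
        self.mapping = nn.Sequential(nn.Dropout(p_drop), nn.Linear(hidden, hidden))  # feature mapping unit

    def forward(self, phoneme_feats: torch.Tensor) -> torch.Tensor:
        # phoneme_feats: (batch, seq, d_model) -> sentence-level prediction feature (batch, hidden)
        x = self.correction(self.conv(phoneme_feats.transpose(1, 2)))
        return self.mapping(x.mean(dim=-1))

predictor, criterion = SentenceLevelPredictor(), nn.MSELoss()   # MSE is an assumed loss function
phoneme_feats, h_utt = torch.randn(2, 12, 256), torch.randn(2, 256)
loss = criterion(predictor(phoneme_feats), h_utt)               # sentence-level prediction loss value
loss.backward()                                                  # gradients used for parameter adjustment
```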
In step S1070 of some embodiments, to implement multi-scale emotion migration, feature stitching is performed on the phoneme coding features, the global emotion features, and the emotion transformation features to obtain target sample features, and the target sample features are input to the speech synthesis module for training.
In step S1080 of some embodiments, in order to make the synthesized speech more expressive, the embodiment of the present application constructs the speech synthesis model using a structure based on a conditional variational autoencoder, and performs speech synthesis processing on the target sample features through the speech synthesis module to obtain predicted synthesized speech, where the predicted synthesized speech represents the synthesized speech predicted by the initial speech synthesis model constructed by the present application.
Referring to fig. 7, fig. 7 is a flowchart of a specific method of step S1080 according to an embodiment of the application. In some embodiments of the present application, the speech synthesis module includes a prior encoder, a posterior encoder, a duration predictor, and a decoder, and step S1080 may include, but is not limited to, steps S710 to S760, which are described in detail below in conjunction with fig. 7.
Step S710, performing feature encoding processing on the target sample features according to the prior encoder to obtain prior encoding features;
Step S720, performing short-time Fourier transform on the sample initial voice to obtain a sample linear frequency spectrum;
Step S730, extracting hidden variables from the sample linear frequency spectrum according to the posterior encoder to obtain sample hidden variable features;
Step S740, extracting the phoneme duration from the target sample features according to the duration predictor to obtain the sample phoneme duration;
Step S750, performing monotonic alignment search on the sample hidden variable features and the target sample features according to the sample phoneme duration to determine a target alignment matrix;
Step S760, decoding the target sample features according to the target alignment matrix and the decoder to obtain the predicted synthesized speech.
In steps S710 to S760 of some embodiments, in order to generate high-quality synthesized speech with richer emotion expression, feature encoding processing is performed on the spliced target sample features according to the prior encoder to obtain prior encoding features. Short-time Fourier transform is performed on the sample initial voice to obtain a sample linear frequency spectrum. Hidden variables are extracted from the sample linear frequency spectrum according to the posterior encoder to obtain sample hidden variable features, where the posterior encoder may adopt the non-causal WaveNet residual module used in WaveGlow and Glow-TTS. The phoneme duration is extracted from the target sample features according to the duration predictor to obtain the sample phoneme duration, so as to estimate the phoneme duration distribution. Monotonic alignment search is performed on the sample hidden variable features and the target sample features according to the sample phoneme duration to determine a target alignment matrix. The target sample features are then decoded according to the target alignment matrix and the decoder to obtain the predicted synthesized speech.
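As a non-limiting illustration of step S720, the sample linear frequency spectrum can be obtained from the sample initial voice with a short-time Fourier transform as sketched below; the FFT size, hop length and window length are hypothetical and not specified by the present application.

import torch

def linear_spectrum(waveform, n_fft=1024, hop_length=256, win_length=1024):
    # Short-time Fourier transform of the sample initial voice; the magnitude
    # of the complex STFT is used as the sample linear frequency spectrum fed
    # to the posterior encoder. Parameter values are illustrative only.
    window = torch.hann_window(win_length, device=waveform.device)
    spec = torch.stft(waveform, n_fft=n_fft, hop_length=hop_length,
                      win_length=win_length, window=window,
                      center=True, return_complex=True)
    return spec.abs()   # (batch, n_fft // 2 + 1, frames)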
In step S1090 of some embodiments, parameter adjustment is performed on the initial speech synthesis model according to the sample initial voice and the predicted synthesized speech. That is, a model loss value is determined according to the sample initial voice and the predicted synthesized speech, the model parameters in the global recognition module, the sentence-level recognition module and the speech synthesis module are adjusted according to the model loss value, and when the initial speech synthesis model meets a preset training end condition, the emotion speech synthesis model is obtained.
It should be noted that the preset training end condition may be that the model loss value of the initial speech synthesis model is less than or equal to a preset loss threshold, or that the synthesis accuracy of the initial speech synthesis model is greater than or equal to a preset accuracy threshold.
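A minimal sketch of this training end check is given below; the threshold values are hypothetical assumptions, not values specified by the present application.

def training_finished(model_loss, synthesis_accuracy,
                      loss_threshold=0.05, accuracy_threshold=0.95):
    # Preset training end condition: training stops when the model loss value
    # is less than or equal to the preset loss threshold, or when the
    # synthesis accuracy is greater than or equal to the preset accuracy
    # threshold. Both threshold values are illustrative only.
    return model_loss <= loss_threshold or synthesis_accuracy >= accuracy_threshold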
In step S1100 of some embodiments, after training yields an emotion voice synthesis model that deeply mines the fine emotion information contained in a text and generates high-quality synthesized voice with richer emotion expression, the target voice text to be processed is input into the emotion voice synthesis model for voice synthesis processing, so as to obtain the target synthesized voice.
Specifically, a speech synthesis system for converting text into speech can be installed on the terminal, and the emotion voice synthesis model is deployed in the speech synthesis system. Therefore, when a text-to-speech operation is detected, the terminal generates a speech synthesis service request and sends the speech synthesis service request to the speech synthesis system. In response to the speech synthesis service request, the terminal uses the speech synthesis system to extract the target voice text to be processed from the request, and synthesizes the target voice text, through the emotion voice synthesis model, into target synthesized voice containing the emotion implied in the target voice text.
In a specific application, when the target object needs to synthesize speech at the terminal, the text content to be synthesized may be selected on the terminal page, and the terminal page may then display a pop-up box. The target object triggers the speech synthesis process by touching the synthesize-speech button in the pop-up box, at which time the text content is sent to the speech synthesis system via a speech synthesis service request. The synthesized result is then broadcast through a loudspeaker of the terminal, so that the target object hears synthesized voice with richer emotion expression.
In voice interaction scenarios in financial technology, the speech synthesis method provided by the present application can be used to synthesize target voice texts for different scenarios, and the resulting target synthesized voice is loaded into an intelligent interaction robot so that the robot can select suitable synthesized voice according to the specific scenario and dialogue. For example, when a target object begins to communicate with the intelligent interaction robot, the robot may play a voice with the emotion category "inquiry", such as "May I ask what you need help with?", to determine the customer's needs. When the target object finishes the voice communication with the intelligent interaction robot, the robot may play a voice with the emotion category "cheerful", such as "Looking forward to talking with you next time", which gives the robot's emotion expression more vitality and improves the human-computer interaction experience.
Referring to fig. 8, fig. 8 is a schematic block diagram of a speech synthesis system according to an embodiment of the application. In some embodiments of the present application, the speech synthesis system includes a speech sample acquisition module 8010, a text conversion module 8020, a model input module 8030, an encoding module 8040, a global emotion recognition module 8050, a sentence-level feature extraction module 8060, a feature stitching module 8070, a speech conversion module 8080, a parameter adjustment module 8090, and a speech synthesis module 8100.
The speech sample acquisition module 8010 is configured to acquire sample data, where the sample data includes a sample voice text and a sample initial voice of the sample voice text;
the text conversion module 8020 is configured to perform text conversion on the sample voice text to obtain a sample phoneme sequence;
the model input module 8030 is configured to input the sample data into a preset initial speech synthesis model, where the initial speech synthesis model includes a text encoding module, a global recognition module, a sentence-level recognition module, and a speech synthesis module;
the encoding module 8040 is configured to encode the sample phoneme sequence through the text encoding module to obtain phoneme coding features;
the global emotion recognition module 8050 is configured to perform emotion recognition processing on the sample phoneme sequence through the global recognition module to obtain global emotion features;
the sentence-level feature extraction module 8060 is configured to extract emotion features from the sample initial voice through the sentence-level recognition module to obtain emotion transformation features;
the feature stitching module 8070 is configured to perform feature stitching on the phoneme coding features, the global emotion features, and the emotion transformation features to obtain target sample features;
the speech conversion module 8080 is configured to perform speech synthesis processing on the target sample features through the speech synthesis module to obtain predicted synthesized speech;
the parameter adjustment module 8090 is configured to perform parameter adjustment on the initial speech synthesis model according to the sample initial voice and the predicted synthesized speech to obtain an emotion speech synthesis model;
the speech synthesis module 8100 is configured to input the target voice text to be processed into the emotion voice synthesis model for speech synthesis processing to obtain target synthesized voice.
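As a non-limiting structural sketch only, the modules of fig. 8 may be wired together as follows; the class and argument names are hypothetical, each argument is assumed to be an implementation of the corresponding module described above, and the sentence-level outputs are assumed to be already broadcast along the phoneme axis as in the stitching sketch above.

import torch
import torch.nn as nn

class InitialSpeechSynthesisModel(nn.Module):
    # Hypothetical wiring of the sub-modules shown in fig. 8.
    def __init__(self, text_encoder, global_recognizer,
                 sentence_recognizer, synthesizer):
        super().__init__()
        self.text_encoder = text_encoder                 # text encoding module
        self.global_recognizer = global_recognizer       # global recognition module
        self.sentence_recognizer = sentence_recognizer   # sentence-level recognition module
        self.synthesizer = synthesizer                   # speech synthesis module

    def forward(self, sample_phonemes, sample_initial_voice):
        phoneme_feat = self.text_encoder(sample_phonemes)
        global_emotion = self.global_recognizer(sample_phonemes)
        emotion_transform = self.sentence_recognizer(sample_initial_voice)
        target_feat = torch.cat([phoneme_feat, global_emotion, emotion_transform],
                                dim=-1)                  # feature stitching
        return self.synthesizer(target_feat, sample_initial_voice)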
It should be noted that the speech synthesis system in the embodiment of the present application is used to execute the above speech synthesis method and corresponds to it; for the specific training process, reference is made to the above speech synthesis method, which is not repeated here.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the voice synthesis method of the embodiment of the application when executing the computer program.
The electronic device may be any intelligent terminal including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA), a car computer, etc.
An electronic device according to an embodiment of the present application is described in detail below with reference to fig. 9.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, where the electronic device includes:
the processor 910 may be implemented by a general-purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the embodiments of the present application;
the memory 920 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 920 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 920 and called by the processor 910 to perform the speech synthesis method of the embodiments of the present application;
An input/output interface 930 for inputting and outputting information;
the communication interface 940 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.), or may implement communication in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
a bus 950 for transferring information between components of the device (e.g., processor 910, memory 920, input/output interface 930, and communication interface 940);
wherein processor 910, memory 920, input/output interface 930, and communication interface 940 implement communication connections among each other within the device via a bus 950.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the voice synthesis method of the embodiment of the application when being executed by a processor.
According to the speech synthesis method, speech synthesis system, electronic device and storage medium provided by the embodiments of the present application, sample data is acquired, where the sample data includes a sample voice text and a sample initial voice of the sample voice text, and the sample initial voice carries the voice emotion information with which the sample voice text needs to be synthesized. Text conversion is performed on the sample voice text to obtain a sample phoneme sequence. The sample data is input into a preset initial speech synthesis model, where the initial speech synthesis model includes a text encoding module, a global recognition module, a sentence-level recognition module and a speech synthesis module. The sample phoneme sequence is encoded by the text encoding module to obtain phoneme coding features; emotion recognition processing is performed on the sample phoneme sequence by the global recognition module to obtain global emotion features; and emotion features are extracted from the sample initial voice by the sentence-level recognition module to obtain emotion transformation features. Then, feature stitching is performed on the phoneme coding features, the global emotion features and the emotion transformation features to obtain target sample features. Speech synthesis processing is performed on the target sample features by the speech synthesis module to obtain predicted synthesized voice, where the predicted synthesized voice represents synthesized voice that carries the same voice emotion information as the sample initial voice. By constructing an initial speech synthesis model that includes the text encoding module, the global recognition module, the sentence-level recognition module and the speech synthesis module, the embodiment of the present application builds the model from voice emotion representations at different levels, so that the fine emotion information contained in the text can be deeply mined when speech synthesis is performed on the sample data according to the initial speech synthesis model. The embodiment of the present application can both extract emotion information from the sample initial voice and predict emotion information from the sample voice text, thereby realizing multi-scale emotion migration in the emotion speech synthesis task. Parameter adjustment is performed on the initial speech synthesis model according to the sample initial voice and the predicted synthesized voice to obtain an emotion voice synthesis model that has the same structure as the initial speech synthesis model but can generate diversified, high-fidelity synthesized voice. Therefore, when the emotion voice synthesis model provided by the embodiment of the present application performs speech synthesis processing on the target voice text, the fine emotion information contained in the text can be deeply mined to generate high-quality synthesized voice with richer emotion expression.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "And/or" is used to describe the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following items" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of elements is merely a logical functional division, and there may be additional divisions of actual implementation, e.g., multiple elements or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions for causing an electronic device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring sample data, wherein the sample data comprises sample voice text and sample initial voice of the sample voice text;
performing text conversion on the sample voice text to obtain a sample phoneme sequence;
inputting the sample data into a preset initial voice synthesis model, wherein the initial voice synthesis model comprises a text coding module, a global recognition module, a sentence level recognition module and a voice synthesis module;
coding the sample phoneme sequence through the text coding module to obtain phoneme coding characteristics;
carrying out emotion recognition processing on the sample phoneme sequence through the global recognition module to obtain global emotion characteristics;
extracting emotion characteristics of the sample initial voice through the sentence-level recognition module to obtain emotion conversion characteristics;
performing feature stitching on the phoneme coding feature, the global emotion feature and the emotion transformation feature to obtain a target sample feature;
performing voice synthesis processing on the target sample characteristics through the voice synthesis module to obtain predicted synthesized voice;
parameter adjustment is carried out on the initial speech synthesis model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model;
and inputting the target voice text to be processed into the emotion voice synthesis model to perform voice synthesis processing, so as to obtain target synthesized voice.
2. The method according to claim 1, wherein the performing, by the global recognition module, emotion recognition processing on the sample phoneme sequence to obtain global emotion features includes:
carrying out emotion recognition processing on the sample phoneme sequence according to the global recognition module to obtain an emotion type label and a classification predicted value of the emotion type label;
searching from a preset emotion vector lookup table according to the emotion type label to obtain an emotion embedded vector of the emotion type label;
and carrying out weighted calculation according to the classification predicted value and the emotion embedded vector to obtain the global emotion feature.
3. The method of claim 2, wherein the global recognition module includes a pre-training model and an emotion classifier, and the performing emotion recognition processing on the sample phoneme sequence according to the global recognition module to obtain an emotion type tag and a classification predicted value of the emotion type tag includes:
extracting emotion characteristics of the sample phoneme sequence according to the pre-training model to obtain sample prediction characteristics;
carrying out emotion classification prediction on the sample prediction features according to the emotion classifier to obtain classification prediction features;
and carrying out de-linearization processing on the classification prediction features according to a preset activation function to obtain the emotion type labels and the classification prediction values of the emotion type labels.
4. The method of claim 3, wherein the emotion classifier includes a multi-head attention unit and a global convolution unit, and wherein the performing emotion classification prediction on the sample prediction feature according to the emotion classifier to obtain a classification prediction feature includes:
performing self-attention processing on the sample prediction features according to the multi-head attention unit to obtain attention features;
performing feature fusion on the sample prediction feature and the attention feature to obtain an attention fusion feature;
normalizing the attention fusion characteristic to obtain a first prediction characteristic;
carrying out global feature extraction on the first prediction feature according to the global convolution unit to obtain a global convolution feature;
performing feature fusion on the first prediction feature and the global convolution feature to obtain a second prediction feature;
and carrying out normalization processing on the second prediction features to obtain the classification prediction features.
5. The method according to any one of claims 1 to 4, wherein the sentence-level recognition module includes a sentence-level encoder, and the extracting, by the sentence-level recognition module, emotion features from the sample initial speech to obtain emotion transformation features includes:
performing audio conversion on the sample initial voice to obtain a sample Mel frequency spectrum;
extracting emotion characteristics of the sample Mel frequency spectrum through the sentence-level encoder to obtain sentence-level hidden characteristics;
and performing feature conversion on the sentence-level hidden features to obtain emotion conversion features.
6. The method of claim 5, wherein the sentence-level recognition module further comprises a sentence-level convolution unit, a correction unit, and a feature mapping unit, the method further comprising: training the sentence-level encoder, specifically including:
performing sentence-level feature extraction on the phoneme coding features according to the sentence-level convolution unit to obtain sentence-level convolution features;
correcting the sentence-level convolution characteristic according to the correcting unit to obtain a corrected characteristic;
performing feature mapping processing on the correction features according to the feature mapping unit to obtain sentence-level prediction features;
carrying out loss calculation on the emotion transformation characteristics and the sentence-level prediction characteristics according to a preset loss function to obtain sentence-level prediction loss values;
and carrying out parameter adjustment on a preset initial encoder according to the sentence-level prediction loss value to obtain the sentence-level encoder.
7. The method according to any one of claims 1 to 4, wherein the speech synthesis module includes a prior encoder, a posterior encoder, a duration predictor and a decoder, and wherein the performing speech synthesis processing on the target sample feature by the speech synthesis module to obtain a predicted synthesized speech includes:
performing feature coding processing on the target sample features according to the prior encoder to obtain prior coding features;
performing short-time Fourier transform on the sample initial voice to obtain a sample linear frequency spectrum;
extracting hidden variables from the sample linear frequency spectrum according to the posterior encoder to obtain sample hidden variable characteristics;
extracting phoneme duration from the target sample characteristics according to the duration predictor to obtain sample phoneme duration;
monotonically aligning and searching the sample hidden variable features and the target sample features according to the sample phoneme duration, and determining a target alignment matrix;
and decoding the target sample characteristics according to the target alignment matrix and the decoder to obtain the predicted synthesized voice.
8. A speech synthesis system, the system comprising:
the voice sample acquisition module is used for acquiring sample data, wherein the sample data comprises sample voice text and sample initial voice of the sample voice text;
the text conversion module is used for carrying out text conversion on the sample voice text to obtain a sample phoneme sequence;
the model input module is used for inputting the sample data into a preset initial voice synthesis model, wherein the initial voice synthesis model comprises a text coding module, a global recognition module, a sentence-level recognition module and a voice synthesis module;
the coding module is used for coding the sample phoneme sequence through the text coding module to obtain phoneme coding characteristics;
the global emotion recognition module is used for carrying out emotion recognition processing on the sample phoneme sequence through the global recognition module to obtain global emotion characteristics;
the sentence-level feature extraction module is used for extracting emotion features of the sample initial voice through the sentence-level recognition module to obtain emotion conversion features;
the feature splicing module is used for carrying out feature splicing on the phoneme coding feature, the global emotion feature and the emotion transformation feature to obtain a target sample feature;
the voice conversion module is used for carrying out voice synthesis processing on the target sample characteristics through the voice synthesis module to obtain predicted synthesized voice;
the parameter adjustment module is used for carrying out parameter adjustment on the initial speech synthesis model according to the sample initial speech and the predicted synthesized speech to obtain an emotion speech synthesis model;
and the voice synthesis module is used for inputting the target voice text to be processed into the emotion voice synthesis model to perform voice synthesis processing so as to obtain target synthesized voice.
9. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the method according to any one of claims 1 to 7.
CN202310636076.2A 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis system, electronic device, and storage medium Pending CN116682411A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310636076.2A CN116682411A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310636076.2A CN116682411A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116682411A true CN116682411A (en) 2023-09-01

Family

ID=87781692

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310636076.2A Pending CN116682411A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis system, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116682411A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117132864A (en) * 2023-10-27 2023-11-28 深圳品阔信息技术有限公司 Multi-mode input digital character generation method, device, equipment and storage medium
CN117727290A (en) * 2024-02-18 2024-03-19 厦门她趣信息技术有限公司 Speech synthesis method, device, equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
US11929059B2 (en) Method, device, and computer readable storage medium for text-to-speech synthesis using machine learning on basis of sequential prosody feature
CN111566656B (en) Speech translation method and system using multi-language text speech synthesis model
EP3469592B1 (en) Emotional text-to-speech learning system
KR102582291B1 (en) Emotion information-based voice synthesis method and device
KR20230043084A (en) Method and computer readable storage medium for performing text-to-speech synthesis using machine learning based on sequential prosody feature
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
US10685644B2 (en) Method and system for text-to-speech synthesis
Dhanjal et al. An automatic machine translation system for multi-lingual speech to Indian sign language
CN111079423A (en) Method for generating dictation, reading and reporting audio, electronic equipment and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
WO2021231050A1 (en) Automatic audio content generation
Jamtsho et al. OCR and speech recognition system using machine learning
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
US11887583B1 (en) Updating models with trained model update objects
CN116665642A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN116564273A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN116469372A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination