CN116580691A - Speech synthesis method, speech synthesis device, electronic device, and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic device, and storage medium

Info

Publication number
CN116580691A
Authority
CN
China
Prior art keywords
prosody
model
emotion
voice
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310640902.0A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
程宁
唐浩彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310640902.0A priority Critical patent/CN116580691A/en
Publication of CN116580691A publication Critical patent/CN116580691A/en
Pending legal-status Critical Current

Classifications

    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L 13/033 Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L 13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L 13/047 Architecture of speech synthesisers
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G10L 25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G06F 16/35 Information retrieval of unstructured textual data: Clustering; Classification
    • G06N 3/044 Neural network architectures: Recurrent networks, e.g. Hopfield networks
    • G06N 3/0464 Neural network architectures: Convolutional networks [CNN, ConvNet]
    • G06N 3/08 Neural network learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a speech synthesis method, a speech synthesis device, an electronic device, and a storage medium, relating to the technical fields of artificial intelligence and digital healthcare. The speech synthesis method uses a cross-domain emotion recognition sub-model to obtain an auxiliary emotion identifier of the target text information; uses a prosody encoding sub-model to obtain a prosody embedding of the reference speech; uses a pitch predictor sub-model to obtain a pitch feature vector; uses a duration predictor sub-model to obtain a duration feature vector; and performs speech synthesis with the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information. By generating an auxiliary emotion identifier for the target text information with the cross-domain emotion recognition sub-model, exploiting the correlation between the auxiliary emotion identifier and prosody during synthesis, and selecting a reference speech accordingly, synthesized speech containing emotion is generated, the naturalness of the synthesized speech is improved, and the application range of text-to-speech synthesis technology is expanded.

Description

Speech synthesis method, speech synthesis device, electronic device, and storage medium
Technical Field
The present invention relates to the field of artificial intelligence and digital medical technology, and more particularly, to a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium.
Background
In recent years, speech synthesis technology, which synthesizes text into speech, has gradually been applied to speech signal processing systems such as voice interaction, audio broadcasting, and personalized voice generation. It is effective in improving the user experience of voice interaction and therefore has potentially wide application value. Meanwhile, with the rise of speech synthesis technology, speech synthesis can also support scenario requirements in digital healthcare fields such as health management and electronic medical records.
In the related art, in order to improve the naturalness of the synthesis result in a text-to-speech (TTS) process, speech content prediction needs to be performed by combining emotion information in the speech synthesis model. However, the related art does not perform speech content synthesis using the correlation between emotion information and prosody, resulting in low naturalness of the synthesized speech. Therefore, how to improve the naturalness of speech synthesis by exploiting the correlation between emotion information and prosody becomes a technical problem to be solved urgently.
Disclosure of Invention
The embodiment of the application mainly aims to provide a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium, improve the naturalness of a voice synthesis result and expand the application range of a voice synthesis technology.
To achieve the above object, a first aspect of an embodiment of the present application provides a speech synthesis method, including:
inputting the acquired target text information into a speech synthesis model, wherein the speech synthesis model comprises: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model;
carrying out emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain an auxiliary emotion mark of the target text information;
performing prosody coding on the reference voice based on the auxiliary emotion mark by utilizing the prosody coding submodel to obtain prosody embedding of the reference voice;
utilizing the pitch predictor sub-model to perform pitch prediction on the prosody embedding and the text encoding vector of the target text information to obtain a pitch feature vector;
performing duration prediction on the prosody embedding and the text encoding vector of the target text information by using the duration predictor sub-model to obtain a duration feature vector;
and performing voice synthesis by using the pitch feature vector, the duration feature vector and the text coding vector to obtain voice content corresponding to the target text information.
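For illustration only, the following is a minimal Python sketch of how the sub-models enumerated above might be composed. All class, method, and attribute names (cross_domain_emotion_recognizer, reference_library.select, vocoder, and so on) are assumptions made for this sketch and are not defined by the application.

    # Minimal sketch of the described pipeline; interfaces are illustrative assumptions.
    def synthesize(target_text, model, reference_library):
        # 1. Cross-domain emotion recognition: text -> auxiliary emotion identifier
        emotion_id = model.cross_domain_emotion_recognizer(target_text)

        # 2. Select a reference speech of that emotion category and encode its prosody
        reference_speech = reference_library.select(emotion_id)
        prosody_embedding = model.prosody_encoder(reference_speech)

        # 3. Encode the text, then predict pitch and duration from prosody + text encoding
        text_encoding = model.text_encoder(target_text)
        pitch_vector = model.pitch_predictor(prosody_embedding, text_encoding)
        duration_vector = model.duration_predictor(prosody_embedding, text_encoding)

        # 4. Fuse pitch/duration with the text encoding, decode to a Mel spectrum,
        #    and run a vocoder to obtain the speech content
        mel_spectrum = model.decoder(text_encoding, pitch_vector, duration_vector)
        return model.vocoder(mel_spectrum)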
In one embodiment, the cross-domain emotion recognition sub-model includes a first encoder, a second encoder, and a classifier; before the emotion recognition is performed on the target text information by using the cross-domain emotion recognition sub-model to obtain the auxiliary emotion identifier of the target text information, the method further includes the following steps:
obtaining a voice emotion recognition data set, wherein the voice emotion recognition data set comprises: a first data set comprising a first speech sample and a speech emotion tag and a second data set comprising a second text sample;
generating, with a first encoder, a first data feature based on the first speech sample;
generating, with a second encoder, a second data feature based on the second text sample;
adjusting parameters of the first encoder and the second encoder according to the distribution loss values of the first data feature and the second data feature until the distribution loss values reach a first preset convergence condition;
inputting the first data characteristics into a classifier to obtain a predicted emotion;
and obtaining a recognition loss value according to the predicted emotion and the voice emotion label, and adjusting parameters of the classifier until the recognition loss value reaches a second preset convergence condition, so as to obtain the trained cross-domain emotion recognition sub-model.
In an embodiment, the adjusting parameters of the first encoder and the second encoder according to the distribution loss values of the first data feature and the second data feature until the distribution loss values reach a first preset convergence condition includes:
acquiring a first expected value of the first data characteristic mapped by a preset mapping function;
obtaining a second expected value of the second data characteristic mapped by a preset mapping function;
obtaining the distribution loss value according to the first expected value and the second expected value;
and adjusting parameters of the first encoder and the second encoder until the distribution loss value reaches the first preset convergence condition, wherein the first preset convergence condition is that the distribution loss value is minimum.
In an embodiment, before the prosody encoding is performed on the reference speech based on the auxiliary emotion identifier by using the prosody encoding submodel to obtain prosody embedding of the reference speech, the method further includes: and selecting reference voices from a reference voice library according to the auxiliary emotion marks, wherein the reference voice library comprises reference voices of a plurality of voice emotion categories.
In an embodiment, the prosody encoding the reference speech based on the auxiliary emotion identifier by using the prosody encoding submodel to obtain prosody embedding of the reference speech includes:
The prosody coding submodel extracts prosody characteristics of the reference voice to obtain prosody characteristic vectors of the reference voice, wherein the prosody characteristics comprise speech speed characteristics and pitch characteristics;
the prosody encoding sub-model generates the prosody embedding according to the prosody feature vector.
In an embodiment, the performing speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information includes:
performing duration information fusion on the text encoding vector by using the duration feature vector to obtain a text duration vector;
the pitch feature vector and the text duration vector are fused and then used as the input of a decoder to obtain a predicted Mel spectrum;
and performing voice synthesis according to the predicted Mel frequency spectrum to generate the voice content.
In an embodiment, the prosody encoder sub-model is a Transformer model including an attention layer;
performing prosody encoding on the reference voice based on the auxiliary emotion mark by using the prosody encoding submodel to obtain prosody embedding of the reference voice, wherein the prosody embedding comprises the following steps:
extracting prosodic feature vectors of the reference speech of the auxiliary emotion mark by using the attention layer;
Generating the prosody embedding according to the prosody feature vector.
To achieve the above object, a second aspect of an embodiment of the present application provides a speech synthesis apparatus, including:
an acquisition unit configured to input the acquired target text information into a speech synthesis model, the speech synthesis model including: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model;
the auxiliary emotion identification generation unit is used for carrying out emotion identification on the target text information by utilizing the cross-domain emotion identification sub-model to obtain auxiliary emotion identification of the target text information;
the prosody embedding generating unit is used for performing prosody encoding on the reference voice based on the auxiliary emotion identification by utilizing the prosody encoding sub-model to obtain prosody embedding of the reference voice;
a pitch prediction unit, configured to perform pitch prediction on the prosody embedding and the text encoding vector of the target text information by using the pitch prediction sub-model, so as to obtain a pitch feature vector;
the duration prediction unit is used for carrying out duration prediction on the prosody embedding and the text coding vector of the target text information by utilizing the duration prediction sub-model to obtain a duration characteristic vector;
And the voice synthesis prediction unit is used for performing voice synthesis by using the pitch characteristic vector, the duration characteristic vector and the text coding vector to obtain voice content corresponding to the target text information.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect when the processor executes the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method according to the first aspect.
The embodiment of the application provides a speech synthesis method, a speech synthesis device, an electronic device, and a storage medium. The speech synthesis method inputs the acquired target text information into a speech synthesis model and performs emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain an auxiliary emotion identifier of the target text information; performs prosody encoding on the reference speech based on the auxiliary emotion identifier by using the prosody encoding sub-model to obtain a prosody embedding of the reference speech; performs pitch prediction on the prosody embedding and the text encoding vector of the target text information by using the pitch predictor sub-model to obtain a pitch feature vector; performs duration prediction on the prosody embedding and the text encoding vector of the target text information by using the duration predictor sub-model to obtain a duration feature vector; and performs speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information. In this embodiment, the cross-domain emotion recognition sub-model is used to generate an auxiliary emotion identifier for the target text information, the correlation between the auxiliary emotion identifier and prosody is exploited in the speech synthesis process, and a reference speech is selected to generate synthesized speech containing emotion, which improves the naturalness of the synthesized speech and expands the application range of text-to-speech synthesis technology.
Drawings
Fig. 1 is a schematic diagram of a speech synthesis model according to an embodiment of the present invention.
Fig. 2 is a flowchart of a speech synthesis method according to an embodiment of the present invention.
FIG. 3 is a cross-domain emotion recognition sub-model of a speech synthesis model provided in accordance with yet another embodiment of the present invention.
FIG. 4 is a flowchart of a training process for a cross-domain emotion recognition sub-model of a speech synthesis model according to another embodiment of the present invention.
Fig. 5 is a flowchart of step S440 in fig. 4.
Fig. 6 is a flowchart of step S130 in fig. 2.
Fig. 7 is a speech synthesis flow chart of a speech synthesis model according to still another embodiment of the present invention.
Fig. 8 is a flowchart of a speech synthesis method according to another embodiment of the present invention.
Fig. 9 is a block diagram of a speech synthesis apparatus according to another embodiment of the present invention.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several terms involved in the present invention are explained:
Artificial intelligence (artificial intelligence, AI): a branch of science and technology that researches and develops theories, methods, technologies, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Attention unit: the study of attention originated in psychology: when a person looks at a picture, although the whole picture is in view, attention is focused only on a certain region. Bahdanau et al. first introduced the attention unit into machine translation in 2014, combining it with a neural network and improving translation accuracy. Many subsequent studies have combined the attention unit with neural networks in different tasks and, to varying degrees, achieved notable advantages over conventional approaches.
Convolutional neural network (Convolutional Neural Networks, CNN): a feedforward neural network with a deep structure that involves convolution computations; it is one of the representative algorithms of deep learning. A convolutional neural network has feature learning capability and can perform translation-invariant classification of input information according to its hierarchical structure. Convolutional neural networks imitate the visual perception mechanism of living beings and can perform both supervised and unsupervised learning; the sharing of convolution kernel parameters and the sparsity of inter-layer connections in the hidden layers allow a convolutional neural network to extract features with a relatively small amount of computation. A common convolutional neural network structure is input layer - convolutional layer - pooling layer - fully connected layer - output layer.
Mel spectrum (mel spectrogram): i.e., the mel-frequency spectrum, a spectrum obtained by applying a Fourier transform to an acoustic signal and then mapping it onto the mel scale. The raw spectrogram is often large; to obtain sound features of a suitable size, the spectrogram can be passed through a mel-scale filter bank. In the mel frequency domain, the mel frequency of speech is in a linear relationship with human pitch perception, and the mel spectrum is obtained by combining the mel-frequency cepstrum with the spectrogram.
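As a concrete illustration of the Mel spectrum described above, the following is a minimal Python sketch using the librosa library; the file name and parameter values (sampling rate 22050 Hz, n_fft=1024, hop_length=256, n_mels=80) are common assumptions and are not taken from the application.

    import librosa

    # Load a waveform and compute its Mel spectrum: a short-time Fourier transform
    # followed by a mel-scale filter bank.
    waveform, sample_rate = librosa.load("reference.wav", sr=22050)
    mel_spectrum = librosa.feature.melspectrogram(
        y=waveform, sr=sample_rate, n_fft=1024, hop_length=256, n_mels=80)
    # Convert power to decibels, the scale typically fed to TTS acoustic models.
    log_mel = librosa.power_to_db(mel_spectrum)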
Embedding (Embedding): the method is a characteristic method commonly used in the field of deep learning, and maps high-dimensional original data (such as images, sentences, voices and the like) to a low-dimensional manifold, so that the high-dimensional original data becomes separable after being mapped to the low-dimensional manifold, and the mapping process is embedding.
Long Short-Term Memory network (LSTM): a type of recurrent neural network (RNN, Recurrent Neural Network). LSTM is well suited to modeling sequential data such as text. Its computation can be summarized as follows: by forgetting old information and memorizing new information in the cell state, information useful for later time steps is passed on and useless information is discarded; at each time step a hidden state is output, computed from the hidden state of the previous time step and the current input, and controlled by the forget gate, the memory (input) gate, and the output gate.
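The following is a minimal PyTorch sketch of the LSTM behaviour described above; the dimensions are illustrative assumptions.

    import torch
    import torch.nn as nn

    # An LSTM over a sequence of feature vectors; the forget, memory (input), and
    # output gates are handled internally by nn.LSTM.
    lstm = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
    sequence = torch.randn(4, 50, 256)       # (batch, time steps, features)
    outputs, (h_n, c_n) = lstm(sequence)     # hidden state at every time step
    print(outputs.shape)                     # torch.Size([4, 50, 128])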
In recent years, speech synthesis technology, which synthesizes text into speech, has gradually been applied to speech signal processing systems such as voice interaction, audio broadcasting, and personalized voice generation; it is effective in improving the user experience of voice interaction and therefore has potentially wide application value. Speech synthesis systems are widely used in daily life in a variety of contexts, including voice dialog systems and intelligent voice assistants. Intelligent voice assistants have applications such as Siri, news broadcasting, telephone information inquiry systems, car navigation, and audio e-books; voice dialog systems have applications such as language learning systems, real-time information broadcasting systems at airports and stations, and information acquisition and communication systems for visually or speech impaired persons.
The applicant has found that, in the related art, in order to enhance the naturalness of the synthesis result in a text-to-speech (TTS) process, speech content prediction needs to be performed by combining emotion information in the speech synthesis model. However, the related art does not perform speech content synthesis using the correlation between emotion information and prosody, resulting in low naturalness of the synthesized speech. For example, the audiobook reading provided by many online-novel platforms today produces synthesized speech that sounds mechanical and lacks emotion; it can hardly reproduce the expressive pauses and cadence of a human reader, and the user experience is poor. Therefore, how to improve the naturalness of speech synthesis by exploiting the correlation between emotion information and prosody becomes a technical problem to be solved urgently.
Based on the above, the embodiment of the invention provides a voice synthesis method, a voice synthesis device, electronic equipment and a storage medium, wherein the voice synthesis method utilizes a cross-domain emotion recognition sub-model to generate auxiliary emotion marks for target text information, combines the relevance between the auxiliary emotion marks and prosody in the voice synthesis process, selects reference voice to generate synthesized voice containing emotion, improves the naturalness of the synthesized voice, and expands the application range of text-to-voice synthesis technology.
The embodiment of the invention provides a voice synthesis method, a voice synthesis device, an electronic device and a storage medium, and specifically, the following embodiment is used for describing the voice synthesis method in the embodiment of the invention.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique, and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the invention provides a speech synthesis method, which relates to the technical field of artificial intelligence, and in particular to the technical field of data mining. The speech synthesis method provided by the embodiment of the invention can be applied to a terminal, to a server, or to a computer program running in a terminal or server. For example, the computer program may be a native program or a software module in an operating system; it may be a native application (APP), i.e., a program that needs to be installed in an operating system to run, such as a client that supports the speech synthesis method; it may be an applet, i.e., a program that only needs to be downloaded into a browser environment to run; or it may be an applet that can be embedded in any APP. In general, the computer program described above may be any form of application, module, or plug-in. The terminal communicates with the server through a network. The speech synthesis method may be performed by the terminal or the server alone, or by the terminal and the server in cooperation.
In some embodiments, the terminal may be a smartphone, a tablet, a notebook computer, a desktop computer, a smart watch, or the like. In addition, the terminal may also be an intelligent in-vehicle device, which applies the speech synthesis method of this embodiment to provide related services and improve the driving experience. The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms; it may also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P, Peer To Peer) network, and the P2P protocol is an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server may be provided with a server-side program of the speech synthesis method, through which it interacts with the terminal; for example, the server may be provided with corresponding software, which may be an application implementing the speech synthesis method, but is not limited to the above forms. The terminal and the server may be connected through a communication connection such as Bluetooth, USB (Universal Serial Bus), or a network, which is not limited herein.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In order to facilitate understanding of the embodiments of the present application, the following is first briefly introduced with reference to the text-to-speech synthesis concept in the example of a specific application scenario.
Text-to-speech synthesis is an important component of man-machine speech communication, and using text-to-speech synthesis technology allows a machine to speak like a human, so that some information expressed or stored in text form can be converted into speech, so that people can conveniently obtain the information through hearing.
Emotion recognition: emotion is a phenomenon that integrates human behaviors, ideas and feelings, and emotion recognition refers to obtaining corresponding emotion information expressed by text content from text information. To obtain emotion information of text content, features need to be extracted from text data and classified to obtain emotion information.
In one application scenario: user A sends corresponding text information and can select a voice object identifier as required (mainly used to identify different timbres, such as a particular person's voice or another machine voice); the user B at the other end receives the speech information corresponding to the text information, and when the speech information is played back it is rendered in the selected timbre and carries the emotion information of the text information. This process uses text-to-speech synthesis to convert user A's text information into the corresponding speech information.
Input: text information sent by the user A;
And (3) outputting: target speech.
In another application scenario: user A wants to listen to the speech corresponding to the novel being read, i.e., to realize audio reading. User A selects voiced reading, the speech synthesis system acquires the corresponding novel text information, judges the emotion information according to the novel text content, and fuses the corresponding emotion information during speech synthesis to generate reading speech with expressive cadence and pauses.
Input: text content selected by the user A;
and (3) outputting: target speech containing emotion information.
The following first describes a speech synthesis method in the embodiment of the present invention.
In one embodiment, referring to FIG. 1, a speech synthesis model 100 includes: cross-domain emotion recognition sub-model 200, prosody encoding sub-model 300, pitch predictor sub-model 400, and duration predictor sub-model 500. Fig. 2 is an alternative flowchart of a speech synthesis method according to an embodiment of the present invention, where speech synthesis is performed using the speech synthesis model 100 shown in fig. 1, and the method in fig. 2 may include, but is not limited to, steps S110 to S160. It should be understood that the order of steps S110 to S160 in fig. 2 is not particularly limited, and the order of steps may be adjusted, or some steps may be reduced or increased according to actual requirements.
Step S110: and inputting the acquired target text information into a speech synthesis model.
In one embodiment, the speech synthesis model comprises: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model.
In an embodiment, the target text information may be obtained by receiving user input, for example chat messages typed by the user, or by a user selection on a page; for example, when the user chooses to have the current page read aloud, the text content of the current page is the target text information. If speech synthesis of a whole paragraph is desired, the individual pieces of text in the paragraph can be synthesized and then combined. This embodiment is not particularly limited in this respect.
In an embodiment, a target voice identifier, i.e., the timbre of the synthesized speech, may be selected at the same time, and reference voices corresponding to different timbres are added to the reference voice library, so that the prosody embedding obtained by the prosody prediction module also carries the target voice identifier information.
Step S120: and carrying out emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain an auxiliary emotion identification of the target text information.
In one embodiment, the cross-domain emotion recognition sub-model is trained by deep learning in advance, and then the trained cross-domain emotion recognition sub-model is utilized to predict and obtain the auxiliary emotion identification according to the target text information. In one embodiment, a cross-domain emotion recognition sub-model, such as an FCN-8S structure, is constructed based on a convolutional recurrent neural network structure of an attention mechanism.
It can be understood that the text content of the target text information carries the emotion information of the writer: for example, when chatting about an event, emotions in the happiness range (happy, neutral, sad) may be expressed; when receiving another person's apology, emotions in the forgiveness range (forgive, indifferent, refuse to forgive) may be expressed; when reciting stirring, proclamation-like content, emotions related to agitation may be expressed; and so on, all of which convey emotion information. In an embodiment, the cross-domain emotion recognition sub-model of the embodiment of the disclosure performs emotion classification on the input target text information to obtain the auxiliary emotion identifier, i.e., the emotion information in the target text information is mapped to different emotion classification results according to a preset classification standard. The preset classification standard may be happiness, sadness, distress, anger, or the like. In this embodiment, the emotion classification standard is not specifically limited, and different classification standards can be set according to the actual use situation.
In one embodiment, referring to FIG. 3, cross-domain emotion recognition sub-model 200 includes: a first encoder 210, a second encoder 220, and a classifier 230. Referring to fig. 4, the training process of the cross-domain emotion recognition sub-model includes steps S410 to S460.
Step S410: a speech emotion recognition dataset is obtained.
In one embodiment, referring to FIG. 3, the speech emotion recognition dataset comprises: a first data set 241 and a second data set 242, wherein the first data set comprises a first speech sample and the speech emotion label corresponding to the first speech sample, and the second data set comprises a second text sample. Since many speech emotion recognition training data sets already exist in the related art, an existing speech emotion training data set may be used as the first data set in this embodiment, for example the Belfast emotion database, which includes 5 emotions: anger, sadness, happiness, fear, and neutrality. The second text sample may be any text sample. The method of acquiring the first data set and the second data set is not particularly limited in this embodiment.
Step S420: a first data feature is generated based on the first speech samples with a first encoder.
In an embodiment, referring to fig. 3, the first encoder 210 extracts a first data feature of a first speech sample in the first data set according to the model training requirement, wherein the first speech sample is speech format data, and the first data feature may be one or more of prosodic features, spectral features, timbre features, or features based on Teager Energy Operator (TEO).
Step S430: a second data feature is generated based on the second text sample using a second encoder.
In an embodiment, referring to fig. 3, the second encoder 220 extracts a second data feature of a second text sample in the second data set according to the model training requirement, where the second text sample is text format data, and the second data feature may be a statistical feature, such as a maximum value, a mean value, a standard deviation, and other statistical information.
Step S440: and adjusting parameters of the first encoder and the second encoder according to the distribution loss values of the first data characteristic and the second data characteristic until the distribution loss values reach a first preset convergence condition.
In an embodiment, since the first data set is a speech data set and the second data set is a text data set, the data distributions of the two data sets are different. The cross-domain emotion recognition sub-model in the embodiment of the application needs to relate the tasks on the two data sets and transfer the knowledge learned from the first data set to the second data set, so that corresponding emotion labels can be generated for the text data set through the emotion recognition process learned on the speech data set. Therefore, in this embodiment, the distribution loss value of the first data feature and the second data feature is used to make the distributions of the first data feature and the second data feature become similar, thereby improving the accuracy of the emotion labels for the second data set.
In an embodiment, referring to fig. 5, which is a flowchart showing a specific implementation of step S440, in this embodiment, adjusting parameters of the first encoder and the second encoder according to the distribution loss values of the first data feature and the second data feature until the distribution loss values reach a first preset convergence condition, step S440 includes:
step S441: and obtaining a first expected value of the first data characteristic mapped by a preset mapping function.
Step S442: and obtaining a second expected value of the second data characteristic mapped by the preset mapping function.
Step S443: and obtaining a distribution loss value according to the first expected value and the second expected value.
Step S444: and adjusting parameters of the first encoder and the second encoder until the distribution loss value reaches a first preset convergence condition, wherein the first preset convergence condition is that the distribution loss value is minimum.
In an embodiment, since the features of the speech sample set and the text sample set are different, i.e., the two data sets belong to two different domains, the features obtained by encoding in the two domains need to tend toward the same distribution in order for the cross-domain emotion recognition sub-model to migrate the capability of recognizing emotion in speech samples to emotion recognition of text samples, so that unlabeled text samples can be given emotion labels. It can be understood that tending toward the same distribution in this embodiment does not require exactly the same distribution; it is sufficient that the distributions are as close as possible. When the cross-domain emotion recognition sub-model is trained in this embodiment, in order to allow the sub-model trained on speech samples to perform cross-domain emotion recognition on text, the distribution of the first data feature generated by the first encoder based on the first speech sample and that of the second data feature generated by the second encoder based on the second text sample are compared, so that the first data feature and the second data feature tend toward a common distribution.
In one embodiment, the maximum mean discrepancy (Maximum Mean Discrepancy, MMD) is used to calculate the similarity between the first data feature and the second data feature. The MMD algorithm is used to measure whether two samples come from the same distribution.
In an embodiment, assuming that the preset mapping function f belongs to the function domain F, step S441 maps the first data feature with the preset mapping function f to obtain a first expected value F1, step S442 maps the second data feature with the preset mapping function f to obtain a second expected value F2, and step S443 obtains the distribution loss value according to the first expected value F1 and the second expected value F2, where the distribution loss value is the MMD distance between the two. In an embodiment, the MMD distance is the maximum, over the defined function domain F, of the difference between the expected values of the first data feature (containing distribution information) and the second data feature (containing distribution information) after mapping by the preset mapping function f, where the first expected value and the second expected value are the means of the two mapped distributions.
In an embodiment, in the MMD distance, the function domain F is defined as the set of vectors within the unit ball of a reproducing kernel Hilbert space, that is, the preset mapping function f satisfies the condition ||f|| <= 1; the MMD distance is then the distance between the means of the first data feature and the second data feature in the Hilbert space.
In an embodiment, in theory an MMD distance of zero means that the two distributions are identical. In order to improve the training efficiency of the cross-domain emotion recognition sub-model, in this embodiment the parameters of the first encoder and the second encoder are adjusted until the distribution loss value reaches the first preset convergence condition, where the first preset convergence condition is that the MMD distance is minimal. "Minimal" may be implemented by setting a small threshold: if the distribution loss value is smaller than the threshold, the first convergence condition is considered to be reached, the distribution loss value is regarded as minimal, and the first data feature generated by the first encoder based on the first speech sample and the second data feature generated by the second encoder based on the second text sample tend toward a common distribution. The first preset convergence condition is not particularly limited in this embodiment.
It will be appreciated that in this embodiment the MMD algorithm optimizes the parameters of the first encoder and the second encoder, rather than selecting or filtering the first data set and the second data set, so that after optimization the encoding features produced by the first encoder for speech information and the encoding features produced by the second encoder for text information follow a common distribution.
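The following is a minimal PyTorch sketch of the distribution loss described in steps S441 to S444, using a kernel-based MMD estimate. The Gaussian kernel and its bandwidth are assumptions; the application only requires a mapping function within the unit ball of a reproducing kernel Hilbert space.

    import torch

    def gaussian_kernel(x, y, sigma=1.0):
        # Pairwise Gaussian kernel values between two batches of features.
        squared_dist = torch.cdist(x, y) ** 2
        return torch.exp(-squared_dist / (2 * sigma ** 2))

    def mmd_loss(first_features, second_features, sigma=1.0):
        # Squared MMD estimate: E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], i.e. the
        # difference between the expected values of the two mapped distributions.
        k_xx = gaussian_kernel(first_features, first_features, sigma).mean()
        k_yy = gaussian_kernel(second_features, second_features, sigma).mean()
        k_xy = gaussian_kernel(first_features, second_features, sigma).mean()
        return k_xx + k_yy - 2 * k_xy

    # Training-step sketch: minimise the distribution loss w.r.t. both encoders
    # (first_encoder and second_encoder are assumed nn.Module instances):
    #   loss = mmd_loss(first_encoder(speech_batch), second_encoder(text_batch))
    #   loss.backward(); optimizer.step()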
Step S450: and inputting the first data characteristic into a classifier to obtain the predicted emotion.
In one embodiment, referring to fig. 3, after the parameters of the first encoder and the second encoder are adjusted, classifier 230 performs emotion recognition on the first data feature to obtain a predicted emotion. It will be appreciated that since the adjusted first and second encoder outputs are co-distributed, classifier 230 performs emotion recognition on the second data characteristic, which can result in an approximate predicted emotion.
Step S460: and obtaining a recognition loss value according to the predicted emotion and the voice emotion label, and adjusting parameters of the classifier until the recognition loss value reaches a second preset convergence condition to obtain a trained cross-domain emotion recognition sub-model.
In an embodiment, referring to fig. 3, after obtaining the predicted emotion, the predicted emotion and the speech emotion label corresponding to the first speech sample are compared and calculated to obtain the recognition loss value, where the recognition loss value may be a cross entropy loss value. And adjusting parameters of the classifier 230 according to the cross entropy loss value, and if the second preset convergence condition is met, finishing cross-domain emotion recognition sub-model training. It is understood that the second preset convergence condition may be that the cross entropy loss value is smaller than a threshold value or reaches a preset convergence number, and the embodiment does not specifically limit the second preset convergence condition.
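The following is a minimal PyTorch sketch of the classifier adjustment in step S460, assuming a linear classifier over the first data features and a cross-entropy recognition loss; the feature dimension, class count, and optimizer settings are assumptions.

    import torch
    import torch.nn as nn

    # Assumed: 128-dimensional first data features and 5 emotion classes
    # (e.g. anger, sadness, happiness, fear, neutrality as in the Belfast example).
    classifier = nn.Linear(128, 5)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-4)

    def classifier_step(first_features, emotion_labels):
        # first_features: (batch, 128) encoder output; emotion_labels: (batch,) int64
        logits = classifier(first_features.detach())   # encoders already adjusted
        recognition_loss = criterion(logits, emotion_labels)
        optimizer.zero_grad()
        recognition_loss.backward()
        optimizer.step()
        return recognition_loss.item()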
Through the above training of the cross-domain emotion recognition sub-model, the emotion recognition capability learned from speech samples can be transferred across domains to emotion recognition of text samples. The pre-trained cross-domain emotion recognition sub-model can then perform emotion recognition on any input target text information to obtain the auxiliary emotion identifier of the target text information.
In one embodiment, the training samples used to train the speech synthesis model do not contain labels, and the training samples contain text information. That is, in this embodiment, the training samples do not need to be manually labeled. The text information of the training sample may be collected using an input device (e.g., a touch panel or keyboard, etc.). Or may be obtained from a local memory or other devices, or may be obtained through downloading via the internet, and the method for obtaining the training sample in this embodiment is not particularly limited. The cross-domain emotion recognition sub-model is utilized to generate auxiliary emotion identification for unlabeled training samples, the problem of insufficient training sample size caused by low manual labeling efficiency is solved, and the training efficiency of the voice synthesis model can be improved in the training process of the voice synthesis model.
Step S130: and performing prosody coding on the reference voice based on the auxiliary emotion identification by using the prosody coding sub-model to obtain prosody embedding of the reference voice.
In one embodiment, step S130 further includes: and selecting reference voices from a reference voice library according to the auxiliary emotion marks, wherein the reference voice library comprises reference voices of a plurality of voice emotion categories.
In one embodiment, after the auxiliary emotion mark is obtained, a corresponding reference voice is selected. For example, if the auxiliary emotion mark is angry, then selecting a reference voice related to angry.
In an embodiment, reference speeches corresponding to different emotion labels are obtained in advance to construct a reference speech library, and the different reference speeches contain the feature information corresponding to their emotions. In order to improve the practicability and accuracy of the reference speech, in an embodiment a large number of original reference speeches under the same emotion are first obtained, for example 100 original reference speeches related to anger, and then the vectors of these original reference speeches are averaged to obtain a reference speech representing the average characteristics, thereby improving the accuracy of the reference speech.
In one embodiment, the prosody encoder sub-model is a Transformer model that includes an attention layer. In one embodiment, the prosody encoding sub-model consists of 3 Transformer layers, each of which contains a self-attention layer, where the convolution kernel size is 5 and the filter size is 384.
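The following is a minimal PyTorch sketch of a prosody encoder along these lines, i.e. three blocks, each with self-attention followed by a 1-D convolution with kernel size 5 and 384 filters. The exact wiring (residual connections, normalization, the number of attention heads, the mel input dimension, and mean pooling into a single prosody embedding) is an assumption; the application only states the layer count and sizes.

    import torch
    import torch.nn as nn

    class ProsodyEncoderBlock(nn.Module):
        # One block: self-attention followed by a 1-D convolution (kernel 5, 384 filters).
        def __init__(self, dim=384, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.conv = nn.Conv1d(dim, 384, kernel_size=5, padding=2)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)

        def forward(self, x):                       # x: (batch, frames, dim)
            attn_out, _ = self.attn(x, x, x)
            x = self.norm1(x + attn_out)
            conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
            return self.norm2(x + conv_out)

    class ProsodyEncoder(nn.Module):
        def __init__(self, mel_dim=80, dim=384):
            super().__init__()
            self.proj = nn.Linear(mel_dim, dim)
            self.blocks = nn.ModuleList(ProsodyEncoderBlock(dim) for _ in range(3))

        def forward(self, reference_mel):           # (batch, frames, mel_dim)
            x = self.proj(reference_mel)
            for block in self.blocks:
                x = block(x)
            return x.mean(dim=1)                    # prosody embedding: (batch, dim)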
In an embodiment, referring to fig. 6, a flowchart is shown for a specific implementation of step S130, in this embodiment, step S130 of generating prosody embedding of a reference speech corresponding to an auxiliary emotion identifier through a prosody coding sub-model includes:
step S131: the prosodic feature vectors of the reference speech are extracted by the prosodic coding sub-model.
In one embodiment, the prosodic coding submodel includes a reference encoder that includes an attention layer that is utilized to extract prosodic features of the reference speech to obtain prosodic feature vectors of the reference speech, the prosodic features including speech speed features and pitch features.
Step S132: the prosody code sub-model generates prosody embedding from the prosody feature vector.
In one embodiment, the prosodic coding sub-model includes an embedded encoder for generating prosodic embeddings from the prosodic feature vectors.
In an embodiment, a prosodic coding sub-model is trained by deep learning in advance, and then prosodic features of the reference speech are extracted according to the reference speech by using the trained prosodic coding sub-model to obtain prosodic feature vectors of the reference speech, and prosodic embeddings are generated according to the prosodic feature vectors.
In an embodiment, the prosodic features are obtained from the reference speech by the reference encoder of the prosody prediction model. For a piece of reference speech, the information it contains can be divided into two parts: the first part is the syllable pronunciation information corresponding to the reference speech; the second part comprises the other pronunciation characteristics apart from the syllable information, mainly the prosodic features, which include speech rate features and pitch features. It will be appreciated that the speech rate and pitch differ under different emotional states; for example, in an excited state the speech rate is faster and the pitch is higher than in a calm state. Since the prosodic features cannot be directly extracted from the reference speech, the prosody prediction model in this embodiment learns, during training, the ability to obtain the prosodic features from the speech features.
It can be appreciated that, during training, the training samples of the prosody coding sub-model are voice information and corresponding emotion labels. Since the prosody coding sub-model initially cannot be guaranteed to obtain prosody embeddings that are fully related to the emotion labels, the prosody coding sub-model also needs to be trained with the emotion labels. The training process of the prosody coding sub-model is not particularly limited in this embodiment; a pre-trained prosody coding sub-model is used. First, a corresponding reference voice is selected according to the auxiliary emotion mark, and then the prosody embedding of the reference voice is obtained by using the prosody coding sub-model.
Embedding is a feature processing method that maps high-dimensional raw data to a low-dimensional manifold, so that the high-dimensional raw data becomes separable after being mapped to the low-dimensional manifold. In one embodiment, step S133 performs an embedding operation on the prosodic feature vector using the prosodic coding sub-model to obtain a corresponding low-dimensional feature vector: prosody embedding.
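As a concrete illustration of this embedding operation, the sketch below pools the prosodic feature sequence over time and projects it to a low-dimensional prosody embedding; the mean pooling and the 128-dimensional output size are assumptions.

```python
# Sketch of the embedding step: a variable-length prosodic feature sequence is
# pooled over time and projected to a low-dimensional prosody embedding.
import torch
import torch.nn as nn

class EmbeddingEncoder(nn.Module):
    def __init__(self, in_dim=384, embed_dim=128):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim)

    def forward(self, prosody_features):               # (batch, time, in_dim)
        pooled = prosody_features.mean(dim=1)           # collapse the time axis
        return torch.tanh(self.proj(pooled))            # (batch, embed_dim) prosody embedding
```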
In one embodiment, the prosody embedding is first concatenated with the text encoding vector of the target text information to obtain a speech synthesis vector.
In one embodiment, if the target text information is Chinese, each Chinese character is usually composed of an initial and a final, which can be further subdivided according to linguistic principles; the same final combined with different initials can be represented as different phonemes. The initials and finals are therefore defined as a phoneme set according to linguistic principles, forming a preset phoneme dictionary.
In one embodiment, the initial consonants and final sounds are defined as 66 classes of phonemes to form a pre-set phoneme dictionary, which is expressed as follows:
"a", "aa", "ai", "an", "ang", "ao", "b", "c", "ch", "d", "e", "ee", "ei", "en", "eng", "er", "f", "g", "h", "i", "ia", "ian", "iang", "iao", "ie", "ii", "in", "ing", "iong", "iu", "ix", "iy", "iz", "j", "k", "l", "m", "n", "o", "ong", "oo", "ou", "p", "q", "r", "s", "sh", "t", "u", "ua", "uai", "uan", "uang", "ueng", "ui", "un", "uo", "uu", "v", "van", "ve", "vn", "vv", "x", "z", "zh".
As can be seen from the above, the speech synthesis model further includes a text encoder that encodes the target text information. The text encoding vector may be obtained by converting the text content of the input target text information into pinyin, selecting a preset phoneme dictionary, splitting the pinyin into individual phonemes according to that dictionary, and splicing the phonemes into a phoneme sequence, i.e., the text encoding vector.
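The sketch below illustrates this front end: pinyin syllables are split into initial and final phonemes against a (here abbreviated) preset phoneme dictionary. The conversion from Chinese characters to pinyin is assumed to be handled by an existing pinyin library and is not shown; the splitting rule is a simplified assumption.

```python
# Simplified sketch of the text-to-phoneme front end. PHONEME_DICT is an abbreviated
# stand-in for the 66-class preset phoneme dictionary listed above.
PHONEME_DICT = {"a", "ai", "an", "ang", "ao", "e", "ei", "en", "eng", "i", "ia",
                "ian", "iao", "in", "ing", "o", "ong", "ou", "u", "uo"}

INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")

def pinyin_to_phonemes(syllable):
    """Split one pinyin syllable into its initial and final phonemes."""
    for initial in INITIALS:
        if syllable.startswith(initial) and syllable[len(initial):] in PHONEME_DICT:
            return [initial, syllable[len(initial):]]
    return [syllable]  # zero-initial syllable kept as a single phoneme

def encode_text(pinyin_syllables):
    """Splice per-syllable phonemes into the phoneme sequence (text encoding)."""
    sequence = []
    for syllable in pinyin_syllables:
        sequence.extend(pinyin_to_phonemes(syllable))
    return sequence

# e.g. encode_text(["hen", "hao"]) -> ["h", "en", "h", "ao"]
```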
It can be seen from the above that the target text information is first encoded to obtain the text encoding vector, which is then concatenated with the prosody embedding obtained in the above step; the concatenated vector is used as the speech synthesis vector for subsequent speech synthesis, that is, the speech synthesis vector contains both the text information and the prosody information corresponding to the emotion.
Step S140: and carrying out pitch prediction on the prosody embedding and the text coding vector of the target text information by using a pitch predictor model to obtain a pitch feature vector.
In one embodiment, the pitch predictor sub-model is trained in advance by deep learning, and the trained pitch predictor sub-model is then used to obtain the pitch feature vector from the speech synthesis vector (the prosody embedding and the text encoding vector of the target text information).
In an embodiment, the pitch predictor sub-model may be a convolutional neural network or a recurrent neural network; its input is the speech synthesis vector, and its output is the pitch feature of each phoneme corresponding to the speech synthesis vector. The pitch feature of each phoneme is predicted by the pitch predictor sub-model in combination with the pitch characteristics carried by the prosody embedding within the speech synthesis vector, as illustrated in the sketch below.
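A minimal sketch of such a pitch predictor follows; the convolutional structure, hidden sizes, and the way the prosody embedding is broadcast over phonemes are assumptions for illustration. Under the same assumptions, the duration predictor sub-model of step S150 could share this structure, with its per-phoneme output interpreted as a duration instead of a pitch value.

```python
# Sketch of a convolutional pitch predictor: the prosody embedding is broadcast over
# the phoneme axis, concatenated with the text encoding, and one pitch value is
# predicted per phoneme. All dimensions are assumptions.
import torch
import torch.nn as nn

class PitchPredictor(nn.Module):
    def __init__(self, text_dim=384, prosody_dim=128, hidden=256, kernel_size=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(text_dim + prosody_dim, hidden, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size, padding=kernel_size // 2),
        )

    def forward(self, text_encoding, prosody_embedding):
        # text_encoding: (batch, num_phonemes, text_dim); prosody_embedding: (batch, prosody_dim)
        expanded = prosody_embedding.unsqueeze(1).expand(-1, text_encoding.size(1), -1)
        synthesis_vector = torch.cat([text_encoding, expanded], dim=-1)  # speech synthesis vector
        pitch = self.net(synthesis_vector.transpose(1, 2))               # (batch, 1, num_phonemes)
        return pitch.squeeze(1)                                          # one pitch value per phoneme
```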
Step S150: and carrying out duration prediction on the prosody embedding and the text coding vector of the target text information through a duration predictor model to obtain a duration feature vector.
In one embodiment, the duration predictor model is trained by deep learning in advance, and then the duration feature vector is obtained according to the speech synthesis vector (the prosody embedding and the text encoding vector of the target text information) by using the trained duration predictor model. The duration predictor model may be trained using training samples to obtain better model parameters for the scene, and the training process is not specifically limited in this embodiment.
It can be understood that the duration of pronunciation of different words by the speaker under different emotions is different, so that the duration predictor model predicts the duration of each phoneme corresponding to the speech synthesis vector by combining the speech speed characteristics in prosody embedding to form a duration characteristic vector.
For example, in one embodiment:
the text encoding vector is: { j, in, t, i, an, t, i, an, q, i, h, en, h, ao };
the duration feature vector of the text encoding vector is expressed as:
{j-0.06,in-0.09,t-0.03,i-0.03,an-0.06,t-0.03,i-0.03,an-0.06,q-0.03,i-0.09,h-0.03,en-0.06,h-0.03,ao-0.06};
According to the prosody embedding under the auxiliary emotion mark, different pitches and durations corresponding to the target text information are predicted; the durations are used to characterize the speech rate.
Step S160: and performing voice synthesis by using the pitch feature vector, the duration feature vector and the text coding vector to obtain voice content corresponding to the target text information.
In one embodiment, the pitch feature vector, the duration feature vector and the text encoding vector are used for speech synthesis to obtain the speech content corresponding to the target text information.
In an embodiment, referring to fig. 7, the specific process of step S160 of performing speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information includes:
step S171: and carrying out duration information fusion on the text encoding vector by using the duration feature vector to obtain a text duration vector.
In one embodiment, performing duration information fusion on the text encoding vector by using the duration feature vector refers to performing speech frame level alignment on each phoneme in the text encoding vector according to the duration of each phoneme to obtain a text duration vector.
For example, in one embodiment:
phoneme sequence: { j, in, t, i, an, t, i, an, q, i, h, en, h, ao };
the duration feature vector of the text encoding vector is expressed as:
{j-0.06,in-0.09,t-0.03,i-0.03,an-0.06,t-0.03,i-0.03,an-0.06,q-0.03,i-0.09,h-0.03,en-0.06,h-0.03,ao-0.06};
Assuming the duration of each speech frame is 0.03 s, the durations in the text encoding vector are converted into numbers of speech frames, expressed as:
{j-2 frames, in-3 frames, t-1 frame, i-1 frame, an-2 frames, t-1 frame, i-1 frame, an-2 frames, q-1 frame, i-3 frames, h-1 frame, en-2 frames, h-1 frame, ao-2 frames}.
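A minimal sketch of this frame-level alignment is given below, using the 0.03 s frame length from the example; the tensor shapes and the rounding rule are assumptions.

```python
# Sketch of duration information fusion: convert each phoneme's duration in seconds
# to a number of speech frames and repeat its encoding vector that many times.
import torch

def length_regulate(text_encoding, durations_sec, frame_sec=0.03):
    """text_encoding: (num_phonemes, dim) tensor; durations_sec: seconds per phoneme."""
    frame_counts = [max(1, round(d / frame_sec)) for d in durations_sec]
    # e.g. "j" with 0.06 s -> 2 frames, "in" with 0.09 s -> 3 frames.
    repeated = [text_encoding[i].unsqueeze(0).repeat(n, 1)
                for i, n in enumerate(frame_counts)]
    return torch.cat(repeated, dim=0)           # text duration vector: (total_frames, dim)
```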
Step S172: and fusing the pitch feature vector with the text duration vector, and using the fused result as the decoder input to obtain a predicted mel spectrum.
In one embodiment, fusing the pitch feature vector and the text duration vector refers to setting a corresponding pitch for each frame of the text duration vector; the fused result is then used as the decoder input and decoded to obtain the predicted mel spectrum.
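The embodiment does not fix the decoder architecture, so the sketch below uses a simple GRU decoder as a stand-in to show how a per-frame pitch value could be fused with the text duration vector before the mel spectrum is predicted; all layer sizes are assumptions.

```python
# Sketch of step S172 under assumed dimensions: the per-frame pitch is embedded and
# added to the text duration vector, and a stand-in GRU decoder predicts the mel spectrum.
import torch
import torch.nn as nn

class MelDecoder(nn.Module):
    def __init__(self, dim=384, n_mels=80):
        super().__init__()
        self.pitch_proj = nn.Linear(1, dim)         # embed one pitch value per frame
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, text_duration_vector, frame_pitch):
        # text_duration_vector: (batch, frames, dim); frame_pitch: (batch, frames)
        fused = text_duration_vector + self.pitch_proj(frame_pitch.unsqueeze(-1))
        hidden, _ = self.decoder(fused)
        return self.to_mel(hidden)                  # predicted mel spectrum: (batch, frames, n_mels)
```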
Step S173: and performing voice synthesis according to the predicted Mel frequency spectrum to generate voice content.
In one embodiment, a vocoder is used to convert the predicted mel spectrum into the corresponding speech content, for example a WAV file representing the speech signal as a waveform; the representation of the speech content is not limited in this embodiment.
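In practice a neural vocoder (for example HiFi-GAN) would typically perform this conversion; as a minimal stand-in, the sketch below inverts a mel spectrum with Griffin-Lim via librosa and writes a WAV file. The sample rate and output path are assumptions.

```python
# Stand-in for the vocoder step: Griffin-Lim inversion of a (power) mel spectrogram
# followed by writing the waveform to a WAV file.
import librosa
import soundfile as sf

def mel_to_wav(mel, sr=22050, out_path="synthesized.wav"):
    """mel: (n_mels, frames) power mel spectrogram."""
    audio = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
    sf.write(out_path, audio, sr)
    return out_path
```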
In one embodiment, the process of adjusting the parameters of the speech synthesis model according to the speech content to obtain the trained speech synthesis model is as follows: during training of the speech synthesis model, each training sample contains not only target text information but also training speech corresponding to that text information. The speech content obtained in the above steps is compared with the training speech to obtain a speech synthesis loss value, and the parameters of the speech synthesis model are then adjusted based on the speech synthesis loss value, so as to train the speech synthesis model and obtain the trained speech synthesis model.
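A minimal training-step sketch of this parameter adjustment follows; computing the speech synthesis loss as an L1 distance between mel spectrograms is an assumption, since the embodiment only states that the synthesized speech content is compared with the training speech.

```python
# Sketch of one training step: compare the predicted output with the training target,
# compute a loss value, and adjust the model parameters. The L1 mel loss is an assumption.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, text_batch, target_mel):
    predicted_mel = model(text_batch)               # forward pass of the speech synthesis model
    loss = F.l1_loss(predicted_mel, target_mel)     # speech synthesis loss value
    optimizer.zero_grad()
    loss.backward()                                 # backpropagate
    optimizer.step()                                # adjust the model parameters
    return loss.item()
```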
As can be seen from the above, referring to fig. 8, the speech synthesis method generates an auxiliary emotion mark for the target text information through the cross-domain emotion recognition sub-model 200; the reference encoder 310 of the prosody coding sub-model 300 generates the prosodic feature vector of the reference voice corresponding to the auxiliary emotion mark, and the embedded encoder 320 of the prosody coding sub-model 300 generates the prosody embedding from the prosodic feature vector. The prosody embedding is then concatenated with the text encoding vector of the target text information obtained by the text encoder 600 to obtain the speech synthesis vector. Pitch prediction and duration prediction are performed on the speech synthesis vector by the pitch prediction sub-model 400 and the duration prediction sub-model 500, respectively, to obtain the pitch feature vector and the duration feature vector. Finally, the decoder 700 performs speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information.
In an embodiment, in actual use, only the text information needs to be input, and the trained speech synthesis model performs speech synthesis to obtain the synthesized speech; further operations such as guidance can then be performed with the obtained speech data.
The speech synthesis method of the embodiment of the application can be used in the medical field, where the target text information is medical data collected with the informed consent of the patient, such as personal health records, prescriptions, and examination reports. An application scenario of the speech synthesis method of the embodiment of the application in the digital medical field is described below.
In medical institutions such as hospitals, speech synthesis technology allows a robot or other intelligent terminal to converse with patients like a person and provide necessary guidance, advice, and services. The speech synthesis method provided by the embodiment of the application gives the synthesized speech of the robot or other intelligent terminal high naturalness, which not only relieves the heavy workload of medical staff but also serves patients better, improving the experience of patient communication. Alternatively, medical staff can use speech synthesis technology to convert medical records into speech and store and back them up in a data center. This not only speeds up record keeping but also makes the files more vivid, thereby providing more accurate information for medical staff.
According to the above, the speech synthesis method inputs the obtained target text information into a speech synthesis model; performs emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain the auxiliary emotion mark of the target text information; performs prosody coding on the reference voice based on the auxiliary emotion mark by using the prosody coding sub-model to obtain the prosody embedding of the reference voice; performs pitch prediction on the prosody embedding and the text encoding vector of the target text information by using the pitch predictor sub-model to obtain the pitch feature vector; performs duration prediction on the prosody embedding and the text encoding vector of the target text information by using the duration predictor sub-model to obtain the duration feature vector; and performs speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information. In this embodiment, the cross-domain emotion recognition sub-model is used to generate the auxiliary emotion mark for the target text information, the relevance between the auxiliary emotion mark and prosody is exploited in the speech synthesis process, and the reference voice is selected to generate synthesized speech containing emotion, which improves the naturalness of the synthesized speech and expands the application range of text-to-speech synthesis technology.
The embodiment of the invention further provides a speech synthesis apparatus that can implement the above speech synthesis method. Referring to fig. 9, the apparatus includes:
an obtaining unit 910, configured to input the obtained target text information into a speech synthesis model, where the speech synthesis model includes: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model;
the auxiliary emotion mark generation unit 920 is configured to perform emotion recognition on the target text information by using the cross-domain emotion recognition sub-model, so as to obtain an auxiliary emotion mark of the target text information;
a prosody embedding generating unit 930, for performing prosody encoding on the reference voice based on the auxiliary emotion mark by using the prosody coding sub-model to obtain the prosody embedding of the reference voice;
a pitch prediction unit 940 for performing pitch prediction on the prosody embedding and the text encoding vector of the target text information by using a pitch predictor model to obtain a pitch feature vector;
a duration prediction unit 950 for performing duration prediction on the prosody embedding and the text encoding vector of the target text information using the duration predictor model, to obtain a duration feature vector;
The speech synthesis prediction unit 960 is configured to perform speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector, and obtain speech content corresponding to the target text information.
The specific implementation of the speech synthesis apparatus in this embodiment is substantially identical to the specific implementation of the speech synthesis method described above, and will not be described in detail here.
The embodiment of the invention also provides electronic equipment, which comprises:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the speech synthesis method of the present invention as described above. The electronic equipment can be any intelligent terminal including a mobile phone, a tablet personal computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a vehicle-mounted computer and the like.
Referring to fig. 10, fig. 10 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1001 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the embodiments of the present invention;
The memory 1002 may be implemented in the form of a ROM (read only memory), a static storage device, a dynamic storage device, or a RAM (random access memory). The memory 1002 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program code is stored in the memory 1002 and invoked by the processor 1001 to perform the speech synthesis method of the embodiments of the present disclosure;
an input/output interface 1003 for implementing information input and output;
the communication interface 1004 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.); and
information is transferred between components of the device (e.g., processor 1001, memory 1002, input/output interface 1003, and communication interface 1004) via bus 1005;
wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 realize communication connection between each other inside the device through a communication bus 1005.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the voice synthesis method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the application provides a speech synthesis method, a speech synthesis apparatus, an electronic device, and a storage medium. The speech synthesis method inputs the acquired target text information into a speech synthesis model; performs emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain the auxiliary emotion mark of the target text information; performs prosody coding on the reference voice based on the auxiliary emotion mark by using the prosody coding sub-model to obtain the prosody embedding of the reference voice; performs pitch prediction on the prosody embedding and the text encoding vector of the target text information by using the pitch predictor sub-model to obtain the pitch feature vector; performs duration prediction on the prosody embedding and the text encoding vector of the target text information by using the duration predictor sub-model to obtain the duration feature vector; and performs speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information. In this embodiment, the cross-domain emotion recognition sub-model is used to generate the auxiliary emotion mark for the target text information, the relevance between the auxiliary emotion mark and prosody is exploited in the speech synthesis process, and the reference voice is selected to generate synthesized speech containing emotion, which improves the naturalness of the synthesized speech and expands the application range of text-to-speech synthesis technology.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, and are not thereby limiting the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, comprising:
inputting the acquired target text information into a speech synthesis model, wherein the speech synthesis model comprises: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model;
carrying out emotion recognition on the target text information by using the cross-domain emotion recognition sub-model to obtain an auxiliary emotion mark of the target text information;
performing prosody coding on the reference voice based on the auxiliary emotion mark by utilizing the prosody coding submodel to obtain prosody embedding of the reference voice;
utilizing the pitch predictor model to conduct pitch prediction on the prosody embedding and the text coding vector of the target text information to obtain a pitch characteristic vector;
performing duration prediction on the prosody embedding and the text coding vector of the target text information by using the duration predictor model to obtain a duration feature vector;
and performing voice synthesis by using the pitch feature vector, the duration feature vector and the text coding vector to obtain voice content corresponding to the target text information.
2. The method of claim 1, wherein the cross-domain emotion recognition sub-model comprises a first encoder, a second encoder, and a classifier; and wherein, before the emotion recognition is performed on the target text information by using the cross-domain emotion recognition sub-model to obtain the auxiliary emotion mark of the target text information, the method further comprises:
obtaining a voice emotion recognition data set, wherein the voice emotion recognition data set comprises: a first data set comprising a first speech sample and a speech emotion tag and a second data set comprising a second text sample;
generating, with a first encoder, a first data feature based on the first speech sample;
generating, with a second encoder, a second data feature based on the second text sample;
adjusting parameters of the first encoder and the second encoder according to the distribution loss values of the first data feature and the second data feature until the distribution loss values reach a first preset convergence condition;
inputting the first data characteristics into a classifier to obtain a predicted emotion;
and obtaining a recognition loss value according to the predicted emotion and the voice emotion label, and adjusting parameters of the classifier until the recognition loss value reaches a second preset convergence condition, so as to obtain the trained cross-domain emotion recognition sub-model.
3. The method according to claim 2, wherein said adjusting parameters of said first encoder and said second encoder according to said distribution loss values of said first data characteristic and said second data characteristic until said distribution loss values reach a first predetermined convergence condition comprises:
acquiring a first expected value of the first data characteristic mapped by a preset mapping function;
obtaining a second expected value of the second data characteristic mapped by a preset mapping function;
obtaining the distribution loss value according to the first expected value and the second expected value;
and adjusting parameters of the first encoder and the second encoder until the distribution loss value reaches the first preset convergence condition, wherein the first preset convergence condition is that the distribution loss value is minimum.
4. The method according to claim 1, wherein the prosody encoding the reference speech based on the auxiliary emotion mark by using the prosody encoding submodel, before obtaining prosody embedding of the reference speech, further comprises: and selecting reference voices from a reference voice library according to the auxiliary emotion marks, wherein the reference voice library comprises reference voices of a plurality of voice emotion categories.
5. The method of claim 1, wherein the prosody encoding of the reference speech based on the auxiliary emotion mark by using the prosody coding submodel to obtain the prosody embedding of the reference speech comprises:
the prosody coding submodel extracts prosody characteristics of the reference voice to obtain prosody characteristic vectors of the reference voice, wherein the prosody characteristics comprise speech speed characteristics and pitch characteristics;
the prosody encoding sub-model generates the prosody embedding according to the prosody feature vector.
6. The method according to claim 1, wherein the performing speech synthesis using the pitch feature vector, the duration feature vector, and the text encoding vector to obtain the speech content corresponding to the target text information includes:
performing duration information fusion on the text encoding vector by using the duration feature vector to obtain a text duration vector;
the pitch characteristic vector and the text duration vector are fused and then input as a decoder, so that a predicted Mel frequency spectrum is obtained;
and performing voice synthesis according to the predicted Mel frequency spectrum to generate the voice content.
7. A method of speech synthesis according to any one of claims 1 to 6, wherein the prosody coding submodel is a Transformer model comprising an attention layer;
performing prosody encoding on the reference voice based on the auxiliary emotion mark by using the prosody encoding submodel to obtain prosody embedding of the reference voice, wherein the prosody embedding comprises the following steps:
extracting prosodic feature vectors of the reference speech of the auxiliary emotion mark by using the attention layer;
generating the prosody embedding according to the prosody feature vector.
8. A speech synthesis apparatus, comprising:
an acquisition unit configured to input the acquired target text information into a speech synthesis model, the speech synthesis model including: a cross-domain emotion recognition sub-model, a prosody coding sub-model, a pitch predictor sub-model, and a duration predictor sub-model;
the auxiliary emotion identification generation unit is used for carrying out emotion identification on the target text information by utilizing the cross-domain emotion identification sub-model to obtain auxiliary emotion identification of the target text information;
the prosody embedding generating unit is used for performing prosody encoding on the reference voice based on the auxiliary emotion identification by utilizing the prosody encoding sub-model to obtain prosody embedding of the reference voice;
A pitch prediction unit, configured to perform pitch prediction on the prosody embedding and the text encoding vector of the target text information by using the pitch prediction sub-model, so as to obtain a pitch feature vector;
the duration prediction unit is used for carrying out duration prediction on the prosody embedding and the text coding vector of the target text information by utilizing the duration prediction sub-model to obtain a duration characteristic vector;
and the voice synthesis prediction unit is used for performing voice synthesis by using the pitch characteristic vector, the duration characteristic vector and the text coding vector to obtain voice content corresponding to the target text information.
9. An electronic device comprising a memory storing a computer program and a processor implementing the speech synthesis method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method of any one of claims 1 to 7.
CN202310640902.0A 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium Pending CN116580691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310640902.0A CN116580691A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116580691A true CN116580691A (en) 2023-08-11

Family

ID=87543083


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711374A (en) * 2024-02-01 2024-03-15 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method
CN117711374B (en) * 2024-02-01 2024-05-10 广东省连听科技有限公司 Audio-visual consistent personalized voice synthesis system, synthesis method and training method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination