CN116434730A - Speech synthesis method, device, equipment and storage medium based on multi-scale emotion - Google Patents

Speech synthesis method, device, equipment and storage medium based on multi-scale emotion

Info

Publication number
CN116434730A
Authority
CN
China
Prior art keywords
emotion
information
sentence
module
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310269651.XA
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310269651.XA priority Critical patent/CN116434730A/en
Publication of CN116434730A publication Critical patent/CN116434730A/en
Pending legal-status Critical Current

Classifications

    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/063: Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 17/04: Speaker identification or verification; Training, enrolment or model building
    • G10L 25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a speech synthesis method, device, equipment and storage medium based on multi-scale emotion, belonging to the technical field of artificial intelligence. The method comprises the following steps: acquiring a target sentence composed of a plurality of sentence characters, and determining global text emotion information through a global emotion module; converting the target sentence into a corresponding phoneme sequence, performing speech emotion prediction on the phoneme sequence through a sentence emotion module to obtain speech emotion information, and adjusting a reference mel spectrogram accordingly to obtain intonation information; predicting syllable intensity information through a local emotion module; and synthesizing target speech according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence. With the technical scheme of this embodiment, the synthesized speech carries richer emotion information across multiple scales, the rhythm of each word can be reflected, and the realism of speech broadcasting is improved, thereby improving the user experience.

Description

Speech synthesis method, device, equipment and storage medium based on multi-scale emotion
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a speech synthesis method, device, equipment and storage medium based on multi-scale emotion.
Background
At present, speech synthesis technology can convert arbitrary text information into speech in real time and read it aloud, and is widely applied in various fields. Beyond ensuring the accuracy of speech synthesis, improving the user experience also requires that the synthesized speech carry emotional expression. A common approach is to perform semantic recognition on the text information to classify the overall emotion of the sentence, and then broadcast the speech with a preset intonation for each category. Although this adds an emotion dimension to the speech broadcast, the emotion at the current stage is usually assigned to the whole sentence, and multi-scale attributes such as the rhythm of speech are not considered. A considerable gap therefore remains between synthesized speech and real speech, so the emotion hierarchy of the speech broadcast is insufficient and the user experience is poor.
Disclosure of Invention
The embodiment of the application mainly aims to provide a voice synthesis method, device, equipment and storage medium based on multi-scale emotion, aiming to improve emotion level of synthesized voice and user experience.
To achieve the above object, a first aspect of an embodiment of the present application provides a speech synthesis method based on multi-scale emotion, where the method includes:
Acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module;
text emotion prediction is carried out on the target sentence through the global emotion module, and global text emotion information of the target sentence is determined;
converting the target sentence into a phoneme sequence, and respectively inputting the phoneme sequence into the sentence emotion module and the local emotion module, wherein each phoneme of the phoneme sequence corresponds to sentence characters one by one;
carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, wherein the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension;
acquiring a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information;
dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information;
and synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence.
In some embodiments, the performing emotion prediction on the target sentence by the global emotion module, determining global text emotion information of the target sentence includes:
acquiring a preset global emotion lookup table, wherein the global emotion lookup table comprises a plurality of selectable emotion categories;
carrying out emotion prediction on the target sentence through the global emotion module, and determining class prediction probability of the target sentence for each selectable emotion class;
and carrying out weighted summation on the selectable emotion categories according to the category prediction probability, and determining the selectable emotion category closest to the weighted summation result as the global text emotion information.
In some embodiments, the adjusting the reference mel-spectrogram according to the speech emotion information to obtain intonation information includes:
convolving, normalizing, dropout processing and mean value pooling the reference mel spectrogram through the sentence emotion module to obtain a reference output vector;
adjusting the reference output vector into a target output vector according to the voice emotion information;
and determining the target output vector as the intonation information.
In some embodiments, the adjusting the reference output vector to a target output vector according to the speech emotion information includes:
Time-aligning the speech emotion information with the reference output vector;
and determining the numerical difference between the reference output vector and the voice emotion information as the target output vector.
In some embodiments, the dividing the phoneme sequence into a plurality of phoneme groups by the local emotion module comprises:
carrying out semantic recognition on the target sentence, and combining the related sentence characters into a sentence phrase;
and determining a phoneme phrase corresponding to the sentence phrase according to the mapping relation between the sentence characters and the phonemes.
In some embodiments, the performing emotion intensity prediction on each of the phoneme sets to obtain syllable intensity information includes:
carrying out emotion intensity prediction on each phoneme set to obtain phoneme emotion intensity information;
carrying out emotion intensity prediction on each sentence phrase to obtain phrase emotion intensity information;
and determining the numerical difference between the phoneme emotion intensity information and the phrase emotion intensity information as the syllable intensity information.
In some embodiments, the speech synthesis model further includes a speech synthesis encoder, the synthesizing the target speech from the global text emotion information, the intonation information, the syllable intensity information, and the target sentence, comprising:
Inputting the global text emotion information, the intonation information and the syllable intensity information to the speech synthesis encoder, and determining a vector output by the speech synthesis encoder as a target mel spectrogram;
and performing voice conversion after the target Mel spectrogram and the target sentence are aligned in time to obtain the target voice.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis apparatus based on multi-scale emotion, the apparatus comprising:
the sentence acquisition module is used for acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module;
the global prediction module is used for predicting text emotion of the target sentence through the global emotion module and determining global text emotion information of the target sentence;
the sentence conversion module is used for converting the target sentence into a phoneme sequence, and inputting the phoneme sequence into the sentence emotion module and the local emotion module respectively, wherein each phoneme of the phoneme sequence corresponds to the sentence characters one by one;
The sentence prediction module is used for carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, and the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension;
the intonation determining module is used for obtaining a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information;
the syllable prediction module is used for dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information;
and the voice synthesis module is used for synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, storing a computer program, which when executed by a processor implements the method described in the first aspect.
The method, the device, the equipment and the storage medium for synthesizing the voice based on the multi-scale emotion, which are provided by the application, comprise the following steps: acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module; text emotion prediction is carried out on the target sentence through the global emotion module, and global text emotion information of the target sentence is determined; converting the target sentence into a phoneme sequence, and respectively inputting the phoneme sequence into the sentence emotion module and the local emotion module, wherein each phoneme of the phoneme sequence corresponds to sentence characters one by one; carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, wherein the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension; acquiring a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information; dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information; and synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence. According to the technical scheme of the embodiment, global emotion, sentence emotion and pitch emotion can be comprehensively considered when speech is synthesized, the synthesized speech has richer emotion information through multi-scale emotion, rhythm of each word can be embodied, reality of speech broadcasting is improved, and therefore user experience is improved.
Drawings
FIG. 1 is a flow chart of a speech synthesis method based on multi-scale emotion provided in one embodiment of the present application;
FIG. 2 is a block diagram of a speech synthesis model in another embodiment of the present application;
fig. 3 is a flowchart of step S102 in fig. 1;
fig. 4 is a flowchart of step S105 in fig. 1;
fig. 5 is a flowchart of step S402 in fig. 4;
fig. 6 is a flowchart of step S106 in fig. 1;
fig. 7 is a flowchart of step S106 in fig. 1;
fig. 8 is a flowchart of step S107 in fig. 1;
FIG. 9 is a schematic structural diagram of a speech synthesis device based on multi-scale emotion according to an embodiment of the present application;
fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
Artificial intelligence (AI): a technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. Artificial intelligence is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Natural language processing (NLP): NLP is a branch of artificial intelligence and an interdisciplinary field between computer science and linguistics, often referred to as computational linguistics; it processes, understands and applies human languages (e.g., the target language, English, etc.). Natural language processing includes syntactic analysis, semantic analysis, discourse understanding and the like. Natural language processing is commonly used in technical fields such as machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.
Machine learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and the like. It studies how a computer can simulate or implement human learning behaviour to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from instruction.
Information extraction (Information Extraction): a text-processing technique that extracts factual information of specified types, such as entities, relations and events, from natural-language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is composed of specific units such as sentences, paragraphs and chapters, and text information is composed of smaller specific units such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and so on from text data is all text information extraction, and the information extracted by text information extraction techniques can be of various types.
BERT model: a language model published by google in 2018 that trains deep bi-directional representations by combining bi-directional converters in all layers. The BERT model combines the advantages of a plurality of natural language processing models, and obtains better effects in a plurality of natural language processing tasks. In the related art, the model input vector of the BERT model is the vector sum of a word vector (Token) and a position vector (Position Embedding) and a sentence vector (Segment Embedding). The word vector is a vectorization representation of the text, the position vector is used for representing the position of the word in the text, and the sentence vector is used for representing the sequence of sentences in the text.
Based on the above, the embodiment of the application provides a voice synthesis method, device, equipment and storage medium based on multi-scale emotion, aiming at improving emotion level of synthesized voice and improving user experience.
The method, device, equipment and storage medium for synthesizing voice based on multi-scale emotion provided by the embodiment of the application are specifically described through the following embodiments, and the method for synthesizing voice based on multi-scale emotion in the embodiment of the application is described first.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a speech synthesis method based on multi-scale emotion, and relates to the technical field of artificial intelligence. The speech synthesis method based on the multi-scale emotion can be applied to a terminal, a server side and software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like that implements a speech synthesis method based on multi-scale emotion, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a speech synthesis method based on multi-scale emotion provided in an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, obtaining a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module;
step S102, text emotion prediction is carried out on a target sentence through a global emotion module, and global text emotion information of the target sentence is determined;
step S103, converting the target sentence into a phoneme sequence, and respectively inputting the phoneme sequence into a sentence emotion module and a local emotion module, wherein each phoneme of the phoneme sequence corresponds to sentence characters one by one;
step S104, carrying out voice emotion prediction on the phoneme sequence through a sentence emotion module to obtain voice emotion information, wherein the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension;
step S105, a preset reference Mel spectrogram is obtained, and the reference Mel spectrogram is adjusted according to the voice emotion information to obtain intonation information;
step S106, dividing the phoneme sequence into a plurality of phoneme groups by a local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information;
step S107, synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence.
It should be noted that, the target sentence may be a sentence composed of a plurality of sentence characters, and of course, the target sentence may also include sub-sentences, and the length of the target sentence is not limited in this embodiment.
It should be noted that, referring to fig. 2, the speech synthesis model includes a global emotion module, a sentence emotion module and a local emotion module, when the speech synthesis model is trained, the training data may include a training text sentence, a sample mel spectrogram corresponding to the training text sentence and emotion category, and each module of the speech synthesis model is cooperatively trained by using the labeled training data, and a specific training manner is a technology well known to those skilled in the art, which is not repeated herein.
It should be noted that, for speech, the emotion within a sentence is unified, for example expressing happiness or concern. The emotional basis of the target sentence can therefore be determined by the global emotion module. Since global emotion is usually a generalized emotion, emotion recognition can be performed directly on the text of the target sentence and determined through simple classification based on semantic recognition; the specific process is not unduly limited in this embodiment.
It should be noted that, in order to determine the speech prosody of the target sentence, emotion features need to be superimposed on the existing prosody of each word. The target sentence can therefore be converted into a phoneme sequence by the speech synthesis encoder, with each sentence character converted into one phoneme of the phoneme sequence, thereby realizing phoneme embedding.
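As an illustration of the character-to-phoneme conversion and phoneme embedding just described, a minimal sketch follows; the character-to-phoneme table, the pinyin-style phoneme notation and the embedding size are hypothetical and do not come from this application.

```python
import torch
import torch.nn as nn

# Hypothetical character-to-phoneme table; a real system would use a pronunciation lexicon.
CHAR_TO_PHONEME = {"你": "ni3", "好": "hao3", "吗": "ma5"}
PHONEME_IDS = {p: i for i, p in enumerate(sorted(set(CHAR_TO_PHONEME.values())))}

def sentence_to_phonemes(sentence: str):
    """Convert each sentence character to one phoneme, keeping a one-to-one correspondence."""
    return [CHAR_TO_PHONEME[ch] for ch in sentence if ch in CHAR_TO_PHONEME]

phonemes = sentence_to_phonemes("你好吗")
phoneme_ids = torch.tensor([[PHONEME_IDS[p] for p in phonemes]])
phoneme_embedding = nn.Embedding(len(PHONEME_IDS), 256)  # phoneme embedding table
print(phoneme_embedding(phoneme_ids).shape)  # torch.Size([1, 3, 256])
```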
It should be noted that, the sentence emotion module may further predict emotion of the sentence on the basis of global emotion, for example, the sentence emotion module may predict sentence change according to a phoneme sequence, so as to determine a change of intonation with time when broadcasting the target sentence.
It should be noted that the phoneme sequence is a sequence arranged in chronological order, so emotion prediction in the time dimension can be performed on the phoneme sequence by the sentence variation prediction module, the reference mel spectrogram is adjusted according to the emotion result predicted from the phoneme sequence, and the adjustment result of the reference mel spectrogram is determined as the intonation information. The principle and generation of mel spectrograms are well known to those skilled in the art and will not be described in detail herein.
It should be noted that, after the two scales of global emotion and sentence emotion are obtained, a local scale, for example at the level of individual sentence characters or phrases, can be further introduced to make the speech broadcast more realistic. The phoneme sequence can therefore be divided into a plurality of phoneme groups by the local emotion module, where each phoneme group may comprise a plurality of related phonemes (e.g., those of one phrase). Syllable intensity information is obtained by extracting the emotion intensity of each phoneme group, so that the rhythm of different phrases can be reflected during speech broadcasting and the realism of the phrases is improved.
It should be noted that after the global text emotion information, the intonation information and the syllable intensity information are obtained, the target emotion information of the target sentence can be obtained through an AND operation, and the target sentence is speech-synthesized with the target emotion information to obtain the target speech. Given the emotion information and the sentence, speech synthesis itself is a technology well known to those skilled in the art and will not be described in detail herein.
The steps S101 to S107 illustrated in the embodiment of the application can comprehensively consider global emotion, sentence emotion and pitch emotion when synthesizing voice, and the synthesized voice has richer emotion information through multi-scale emotion, so that the rhythm of each word can be embodied, the authenticity of voice broadcasting is improved, and therefore user experience is improved.
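For orientation only, the sketch below traces steps S101 to S107 as a single function; every callable, tensor shape and name is an illustrative stand-in, not the implementation of this application.

```python
import torch

def synthesize_with_multiscale_emotion(sentence, global_emotion, to_phonemes,
                                        sentence_emotion, adjust_reference_mel,
                                        local_emotion, decode, voice_conversion,
                                        reference_mel):
    """Illustrative trace of steps S101-S107; every callable is a stand-in, not the real model."""
    h_global = global_emotion(sentence)                                      # S102: global text emotion
    phonemes = to_phonemes(sentence)                                         # S103: one phoneme per character
    speech_emotion = sentence_emotion(phonemes)                              # S104: time-varying speech emotion
    intonation = adjust_reference_mel(reference_mel, speech_emotion)         # S105: intonation information
    syllable_intensity = local_emotion(phonemes)                             # S106: phrase/syllable intensity
    target_mel = decode(phonemes, h_global, intonation, syllable_intensity)  # S107: target mel spectrogram
    return voice_conversion(target_mel)                                      # S107: target speech

# Toy stand-ins with arbitrary shapes so the sketch runs end to end.
d = 8
wave = synthesize_with_multiscale_emotion(
    "你好",
    global_emotion=lambda s: torch.zeros(1, d),
    to_phonemes=lambda s: torch.arange(len(s)).unsqueeze(0),
    sentence_emotion=lambda p: torch.zeros(1, p.size(1), d),
    adjust_reference_mel=lambda ref, emo: ref - emo.mean(dim=1),
    local_emotion=lambda p: torch.zeros(1, p.size(1), d),
    decode=lambda p, g, i, s: torch.zeros(1, 80, 4 * p.size(1)),
    voice_conversion=lambda mel: torch.zeros(1, 256 * mel.size(-1)),
    reference_mel=torch.zeros(1, d),
)
print(wave.shape)  # torch.Size([1, 2048])
```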
In some embodiments, referring to fig. 3, step S102 may further include, but is not limited to, the following steps:
step S301, a preset global emotion lookup table is obtained, wherein the global emotion lookup table comprises a plurality of selectable emotion categories;
step S302, emotion prediction is carried out on a target sentence through a global emotion module, and the category prediction probability of the target sentence for each selectable emotion category is determined;
Step S303, weighting and summing the selectable emotion categories according to the category prediction probability, and determining the selectable emotion category closest to the weighted and summed result as global text emotion information.
It should be noted that the global emotion module performs emotion prediction on the text information of the target sentence. Because text does not uniquely determine emotion, for example the same text may correspond to different emotions in different scenes, a trainable global emotion lookup table may be set in the global emotion module to improve prediction efficiency. A plurality of selectable emotion categories may be set in the global emotion lookup table, and the specific emotion categories can be chosen according to actual requirements and are not limited herein.
It should be noted that a trained emotion classifier may be set in the global emotion module, for example by adding a linear layer and an activation function on top of a pre-trained BERT-Base model. Those skilled in the art may adjust the structure of the emotion classifier according to actual requirements, which is not limited herein.
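A minimal sketch of such an emotion classifier is shown below, assuming a HuggingFace-style pre-trained BERT-Base checkpoint; the checkpoint name, the number of emotion categories and the use of softmax as the activation are assumptions.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class GlobalEmotionClassifier(nn.Module):
    """Sketch: pre-trained BERT-Base with an added linear layer + softmax as emotion classifier."""

    def __init__(self, num_emotions=8, pretrained="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(pretrained)
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_emotions)

    def forward(self, input_ids, attention_mask):
        pooled = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        return torch.softmax(self.classifier(pooled), dim=-1)  # category prediction probabilities p_i

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = GlobalEmotionClassifier()
inputs = tokenizer("今天的天气真好", return_tensors="pt")
probs = model(inputs["input_ids"], inputs["attention_mask"])  # shape: (1, num_emotions)
```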
It should be noted that, since the result of emotion prediction is usually a set of category prediction probabilities, the global text emotion information may be determined by weighted summation, for example calculated by the following formula:

$h_{global} = \sum_{i=1}^{M} p_i \cdot f_1(e_i)$

where $h_{global}$ is the representation of the global text emotion information, $p_i$ is the category prediction probability of the i-th selectable emotion category, $f_1(e_i)$ is the weight of the i-th selectable emotion category, and M is the number of selectable emotion categories. After $h_{global}$ is obtained, the selectable emotion category whose weight value is closest to $h_{global}$ is determined as the global text emotion information.
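The weighted summation and nearest-category lookup can be sketched as follows; the table size, dimension and the use of Euclidean distance for "closest" are assumptions.

```python
import torch
import torch.nn as nn

M, d = 8, 4   # number of selectable emotion categories and lookup-table dimension (assumed)
global_emotion_lookup = nn.Embedding(M, d)  # trainable global emotion lookup table f_1(e_i)

def global_text_emotion(category_probs):
    """h_global = sum_i p_i * f_1(e_i); then return the category closest to h_global."""
    table = global_emotion_lookup.weight                            # (M, d)
    h_global = (category_probs.unsqueeze(-1) * table).sum(dim=0)    # weighted sum over categories
    nearest = torch.cdist(h_global.unsqueeze(0), table).argmin()    # selectable category closest to h_global
    return h_global, int(nearest)

probs = torch.softmax(torch.randn(M), dim=0)  # stand-in for the classifier's p_i
h_global, category_index = global_text_emotion(probs)
```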
In some embodiments, referring to fig. 4, step S105 may further include, but is not limited to, the following steps:
step S401, carrying out convolution, normalization, dropout processing and mean value pooling on a reference mel spectrogram through the sentence emotion module to obtain a reference output vector;
step S402, adjusting the reference output vector into a target output vector according to the voice emotion information;
in step S403, the target output vector is determined as intonation information.
It should be noted that a sentence variation encoder may be set in the sentence emotion module. The sentence variation encoder is composed of two 1-dimensional convolution layers: the reference mel spectrogram generated by the speech synthesis decoder is input to the sentence variation encoder and convolved twice, normalization and dropout are then applied, and mean pooling on the time axis is applied to the output to obtain a reference output vector describing the emotion variation of the utterance. The specific parameters of the sentence variation encoder only affect the values of the reference output vector, and a person skilled in the art can select the parameters of the neural network according to actual requirements, which is not limited herein.
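A minimal sketch of such a sentence variation encoder is given below; the channel sizes, kernel sizes and the use of LayerNorm and ReLU are assumptions, since the application deliberately leaves these parameters open.

```python
import torch
import torch.nn as nn

class SentenceVariationEncoder(nn.Module):
    """Sketch of the described encoder: two 1-D convolutions over the reference mel spectrogram,
    normalization and dropout, then mean pooling on the time axis. Channel sizes are assumed."""

    def __init__(self, n_mels=80, hidden=256, dropout=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, mel):  # mel: (batch, n_mels, time)
        x = torch.relu(self.conv1(mel))
        x = torch.relu(self.conv2(x))
        x = self.dropout(self.norm(x.transpose(1, 2)))  # (batch, time, hidden)
        return x.mean(dim=1)  # mean pooling over time -> reference output vector h_utt

h_utt = SentenceVariationEncoder()(torch.randn(2, 80, 120))  # (2, 256)
```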
It should be noted that the obtained reference output vector may serve as a reference: the sentence emotion module may be used to predict the sentence variation of the phoneme sequence so as to determine the emotion prediction result of the sentence, and the reference output vector is then adjusted according to this emotion prediction result. For example, if the emotion is sadness, the reference output vector is adjusted with the gain corresponding to sadness, so that the resulting target output vector reflects the emotion and the resulting intonation matches the actual emotion.
In some embodiments, referring to fig. 5, step S402 may further include, but is not limited to, the following steps:
step S501, aligning the voice emotion information with the reference output vector in time;
in step S502, a numerical difference between the reference output vector and the speech emotion information is determined as a target output vector.
It should be noted that, in order to obtain the target output vector from the reference output vector and the speech emotion information, the reference output vector and the speech emotion information need to be aligned in time, and the target output vector is calculated according to the following formula:

$L_{utt} = h_{utt} - \hat{h}_{utt}$

where $L_{utt}$ is the target output vector, $h_{utt}$ is the reference output vector, and $\hat{h}_{utt}$ is the speech emotion information; the adjustment of the reference mel spectrogram is thus realized through a difference calculation. During training of the sentence emotion module, supervision can be provided by minimizing an $l_1$ loss between the predicted $\hat{h}_{utt}$ and the $h_{utt}$ output by the sentence variation encoder; the $l_1$ loss function may be set according to the convergence requirements of the model and is not limited herein.
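The difference-based adjustment and the l1 supervision can be sketched as follows; averaging the time-aligned prediction to the utterance level is an assumption made only to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def target_output_vector(h_utt, speech_emotion):
    """L_utt = h_utt - h_hat_utt after aligning the predicted speech emotion in time.
    Here the time-aligned prediction is simply averaged to the utterance level (an assumption)."""
    h_hat_utt = speech_emotion.mean(dim=1)  # (batch, time, dim) -> (batch, dim)
    return h_utt - h_hat_utt                # intonation information

def sentence_emotion_loss(h_hat_utt, h_utt):
    """l1 supervision between the predicted utterance emotion and the encoder output h_utt."""
    return F.l1_loss(h_hat_utt, h_utt)

h_utt = torch.randn(2, 256)
speech_emotion = torch.randn(2, 120, 256)
l_utt = target_output_vector(h_utt, speech_emotion)
loss = sentence_emotion_loss(speech_emotion.mean(dim=1), h_utt)
```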
In some embodiments, referring to fig. 6, step S106 may further include, but is not limited to, the following steps:
step S601, carrying out semantic recognition on a target sentence, and combining related sentence characters into a sentence phrase;
step S602, determining a phoneme phrase corresponding to the sentence phrase according to the mapping relation between the sentence characters and the phonemes.
It should be noted that, in order to determine the emotion of different phrases, the target sentence can be divided into a plurality of sentence phrases by simple semantic recognition, and the phonemes of the phoneme sequence corresponding to each sentence phrase are then determined as a phoneme phrase. Emotion intensity extraction can then be performed on these phoneme groups, so that phrase-level emotion intensity is introduced when the speech is synthesized and the realism of the target speech is improved.
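A toy example of grouping phonemes by sentence phrases, using the one-to-one character-to-phoneme correspondence, might look like this; the sentence, the pinyin-style phonemes and the phrase segmentation are hypothetical.

```python
# Hypothetical example: group phonemes according to the sentence phrases obtained from
# semantic recognition, using the one-to-one character-to-phoneme mapping.
sentence = ["今", "天", "天", "气", "真", "好"]
phonemes = ["jin1", "tian1", "tian1", "qi4", "zhen1", "hao3"]  # one phoneme per character
phrase_spans = [(0, 2), (2, 4), (4, 6)]  # character index ranges of the sentence phrases (assumed segmentation)

phoneme_groups = [phonemes[start:end] for start, end in phrase_spans]
sentence_phrases = ["".join(sentence[start:end]) for start, end in phrase_spans]
print(sentence_phrases)  # ['今天', '天气', '真好']
print(phoneme_groups)    # [['jin1', 'tian1'], ['tian1', 'qi4'], ['zhen1', 'hao3']]
```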
In the training process of the local emotion module, a marked syllable segment can be input for training, so that the local emotion module can extract emotion intensity, and the training process is not limited.
In some embodiments, referring to fig. 7, step S106 may further include, but is not limited to, the following steps:
step S701, carrying out emotion intensity prediction on each phoneme set to obtain phoneme emotion intensity information;
step S702, carrying out emotion intensity prediction on each sentence phrase to obtain phrase emotion intensity information;
in step S703, the numerical difference between the phoneme emotion intensity information and the phrase emotion intensity information is determined as syllable intensity information.
It should be noted that, in order to determine the emotion intensity of each phrase, two sub-modules can be set in the local emotion module: text local emotion intensity prediction and speech local emotion intensity extraction. Emotion intensity prediction is performed on each sentence phrase by the text local emotion intensity prediction sub-module to obtain phrase emotion intensity information as the base emotion information, the phoneme emotion intensity information is used as adjustment information, and the syllable intensity information is obtained through a difference calculation; the syllable intensity information can then reflect the emotion intensity of the phrase.
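A minimal sketch of the two intensity predictors and their difference is shown below; modelling each predictor as a single linear layer is purely an assumption for illustration.

```python
import torch
import torch.nn as nn

class LocalEmotionModule(nn.Module):
    """Sketch: a text-side and a speech-side intensity predictor per phrase; their difference
    is used as the syllable intensity information. Both predictors are placeholder linear layers."""

    def __init__(self, dim=256):
        super().__init__()
        self.text_intensity = nn.Linear(dim, 1)     # text local emotion intensity prediction
        self.phoneme_intensity = nn.Linear(dim, 1)  # speech/phoneme local emotion intensity extraction

    def forward(self, phrase_features, phoneme_group_features):
        phrase_int = self.text_intensity(phrase_features)             # (batch, groups, 1)
        phoneme_int = self.phoneme_intensity(phoneme_group_features)  # (batch, groups, 1)
        return phoneme_int - phrase_int                               # syllable intensity information

syllable_intensity = LocalEmotionModule()(torch.randn(1, 3, 256), torch.randn(1, 3, 256))
```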
In some embodiments, the speech synthesis model further includes a speech synthesis encoder, referring to fig. 8, step S107 may further include, but is not limited to, the following steps:
step S801, inputting global text emotion information, intonation information and syllable intensity information into a speech synthesis encoder, and determining a vector output by the speech synthesis encoder as a target Mel spectrogram;
Step S802, performing voice conversion after aligning the target Mel spectrogram and the target sentence in time to obtain target voice.
It should be noted that after the global text emotion information, the intonation information and the syllable intensity information are obtained, this information may be combined through an AND operation so that the result carries multi-scale emotion information, which is then input to the speech synthesis decoder to obtain the target mel spectrogram. Given the target mel spectrogram, those skilled in the art are well aware of how to perform speech synthesis, and detailed description is omitted here. Through the technical scheme of this embodiment, the target speech carries multi-scale emotion and its realism is improved.
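A minimal sketch of fusing the three scales and decoding to a mel spectrogram follows; broadcasting the utterance-level vectors along the time axis, concatenation as the fusion operation and a single linear decoder are all assumptions rather than the decoder of this application.

```python
import torch
import torch.nn as nn

class MultiScaleFusionDecoder(nn.Module):
    """Sketch: concatenate the three emotion representations with the phoneme encoding and
    decode to a mel spectrogram. The decoder here is a single linear layer purely for illustration."""

    def __init__(self, dim=256, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(4 * dim, n_mels)

    def forward(self, phoneme_enc, h_global, intonation, syllable_intensity):
        T = phoneme_enc.size(1)
        # Broadcast utterance-level information along the time axis before fusion.
        fused = torch.cat([
            phoneme_enc,
            h_global.unsqueeze(1).expand(-1, T, -1),
            intonation.unsqueeze(1).expand(-1, T, -1),
            syllable_intensity,
        ], dim=-1)
        return self.proj(fused).transpose(1, 2)  # target mel spectrogram: (batch, n_mels, time)

mel = MultiScaleFusionDecoder()(torch.randn(1, 12, 256), torch.randn(1, 256),
                                torch.randn(1, 256), torch.randn(1, 12, 256))
```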
Referring to fig. 9, the embodiment of the present application further provides a speech synthesis device based on multi-scale emotion, which may implement the above speech synthesis method based on multi-scale emotion, where the speech synthesis device 900 based on multi-scale emotion includes:
the sentence acquisition module 901 is configured to acquire a target sentence composed of a plurality of sentence characters, input the target sentence into a trained speech synthesis model, and the speech synthesis model includes a global emotion module, a sentence emotion module and a local emotion module;
the global prediction module 902 is configured to perform text emotion prediction on the target sentence through the global emotion module, and determine global text emotion information of the target sentence;
The sentence conversion module 903 is configured to convert the target sentence into a phoneme sequence, and input the phoneme sequence to the sentence emotion module and the local emotion module respectively, where each phoneme of the phoneme sequence corresponds to sentence characters one by one;
the sentence prediction module 904 is configured to perform speech emotion prediction on the phoneme sequence through the sentence emotion module, so as to obtain speech emotion information, where the speech emotion information characterizes emotion variation of the phoneme sequence in a time dimension;
the intonation determining module 905 is configured to obtain a preset reference mel spectrogram, and adjust the reference mel spectrogram according to the speech emotion information to obtain intonation information;
the syllable prediction module 906 is configured to divide the phoneme sequence into a plurality of phoneme groups by using the local emotion module, and predict the emotion intensity of each phoneme group to obtain syllable intensity information;
the speech synthesis module 907 is configured to synthesize a target speech according to the global text emotion information, intonation information, syllable strength information, and target sentence.
The specific implementation manner of the speech synthesis device based on multi-scale emotion is basically the same as the specific embodiment of the speech synthesis method based on multi-scale emotion, and is not described herein.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the speech synthesis method based on the multi-scale emotion when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 10, fig. 10 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1001 may be implemented by a general purpose central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc. for executing related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 1002 may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1002 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 1002 and are invoked by the processor 1001 to perform the speech synthesis method based on multi-scale emotion of the embodiments of the present application;
an input/output interface 1003 for implementing information input and output;
the communication interface 1004 is configured to implement communication interaction between the present device and other devices, and may implement communication in a wired manner (e.g. USB, network cable, etc.), or may implement communication in a wireless manner (e.g. mobile network, WIFI, bluetooth, etc.);
A bus 1005 for transferring information between the various components of the device (e.g., the processor 1001, memory 1002, input/output interface 1003, and communication interface 1004);
wherein the processor 1001, the memory 1002, the input/output interface 1003, and the communication interface 1004 realize communication connection between each other inside the device through the bus 1005.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the speech synthesis method based on the multi-scale emotion when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the application provides a speech synthesis method, device, equipment and storage medium based on multi-scale emotion, wherein the method comprises the following steps: acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module; text emotion prediction is carried out on the target sentence through the global emotion module, and global text emotion information of the target sentence is determined; converting the target sentence into a phoneme sequence, and respectively inputting the phoneme sequence into the sentence emotion module and the local emotion module, wherein each phoneme of the phoneme sequence corresponds to sentence characters one by one; carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, wherein the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension; acquiring a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information; dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information; and synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence. According to the technical scheme of the embodiment, global emotion, sentence emotion and pitch emotion can be comprehensively considered when speech is synthesized, the synthesized speech has richer emotion information through multi-scale emotion, rhythm of each word can be embodied, reality of speech broadcasting is improved, and therefore user experience is improved.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing programs, such as a U disk, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.
The embodiments of the present application are operational with numerous general-purpose or special-purpose computing device environments or configurations, for example: personal computers, server computers, hand-held or portable electronic devices, tablet devices, multiprocessor devices, microprocessor-based devices, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments that include any of the above devices or electronic devices. The application may be described in the general context of computer programs, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices linked through a communications network; in such environments, program modules may be located in both local and remote computer storage media, including memory storage devices.
The units involved in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be provided in a processor. In some cases, the names of these units do not constitute a limitation on the units themselves.
It should be noted that although in the above detailed description several modules or units of an electronic device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit, in accordance with embodiments of the present application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (for example, a CD-ROM, a USB flash drive, or a removable hard disk) or on a network, and includes several instructions for causing a computing device (for example, a personal computer, a server, a touch terminal, or a network device) to perform the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations that follow the general principles of the application and include such departures from the present disclosure as come within known or customary practice in the art to which the application pertains.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.
While the preferred embodiment of the present invention has been described in detail, the present invention is not limited to the above embodiment, and various equivalent modifications and substitutions can be made by those skilled in the art without departing from the spirit of the present invention, and these equivalent modifications and substitutions are intended to be included in the scope of the present invention as defined in the appended claims.

Claims (10)

1. A speech synthesis method based on multi-scale emotion, the method comprising:
acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module;
carrying out text emotion prediction on the target sentence through the global emotion module, and determining global text emotion information of the target sentence;
converting the target sentence into a phoneme sequence, and inputting the phoneme sequence into the sentence emotion module and the local emotion module respectively, wherein the phonemes of the phoneme sequence correspond one-to-one to the sentence characters;
carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, wherein the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension;
acquiring a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information;
dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information;
and synthesizing target speech according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence.
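The steps of claim 1 can be read as a three-scale pipeline (sentence-level text emotion, utterance-level intonation, and group-level intensity). The following is a minimal, illustrative sketch of how such a pipeline could be wired together; all class, attribute, and function names (model.global_module, model.text_to_phonemes, vocoder, and so on) are hypothetical placeholders and do not reflect the patent's actual implementation.

```python
# Hypothetical wiring of the claim-1 pipeline; every name below is an
# assumed placeholder, not the patent's actual code.
def synthesize(target_sentence, model, vocoder, reference_mel):
    # Sentence scale: global text emotion for the whole target sentence.
    global_emotion = model.global_module(target_sentence)

    # Convert sentence characters to a phoneme sequence (one per character).
    phonemes = model.text_to_phonemes(target_sentence)

    # Utterance scale: speech emotion over time, then adjust the
    # reference mel spectrogram into intonation information.
    speech_emotion = model.sentence_module(phonemes)
    intonation = model.sentence_module.adjust(reference_mel, speech_emotion)

    # Local scale: split the phoneme sequence into groups and predict
    # per-group emotion intensity (syllable intensity information).
    phoneme_groups = model.local_module.group(phonemes, target_sentence)
    syllable_intensity = model.local_module.intensity(phoneme_groups)

    # Fuse the three scales and synthesize the target speech.
    target_mel = model.encoder(global_emotion, intonation,
                               syllable_intensity, phonemes)
    return vocoder(target_mel)
```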
2. The speech synthesis method based on multi-scale emotion according to claim 1, wherein the carrying out text emotion prediction on the target sentence through the global emotion module and determining global text emotion information of the target sentence comprises:
acquiring a preset global emotion lookup table, wherein the global emotion lookup table comprises a plurality of selectable emotion categories;
carrying out emotion prediction on the target sentence through the global emotion module, and determining class prediction probability of the target sentence for each selectable emotion class;
and carrying out weighted summation on the selectable emotion categories according to the category prediction probability, and determining the selectable emotion category closest to the weighted summation result as the global text emotion information.
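A minimal numerical sketch of the selection rule in claim 2, under the assumption (not stated in the claim) that each selectable emotion category is stored in the lookup table as an embedding vector:

```python
# Assumes the global emotion lookup table stores one embedding vector per
# selectable emotion category; names and dimensions are illustrative only.
import numpy as np

def pick_global_emotion(class_probs, emotion_table):
    """class_probs: (C,) category prediction probabilities.
    emotion_table: (C, D) one embedding per selectable emotion category."""
    weighted = class_probs @ emotion_table              # weighted summation, shape (D,)
    # The selectable category whose embedding is closest to the
    # weighted-summation result is taken as the global text emotion.
    distances = np.linalg.norm(emotion_table - weighted, axis=1)
    return int(np.argmin(distances))

probs = np.array([0.1, 0.7, 0.2])         # e.g. neutral / happy / sad
table = np.random.randn(3, 8)             # toy 8-dimensional embeddings
print(pick_global_emotion(probs, table))  # index of the chosen category
```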
3. The method for synthesizing speech based on multi-scale emotion according to claim 1, wherein said adjusting the reference mel spectrogram according to the speech emotion information to obtain intonation information comprises:
performing convolution, normalization, dropout processing, and mean pooling on the reference mel spectrogram through the sentence emotion module to obtain a reference output vector;
adjusting the reference output vector into a target output vector according to the voice emotion information;
and determining the target output vector as the intonation information.
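A hedged PyTorch sketch of the processing chain named in claim 3 (convolution, normalization, dropout, mean pooling); the layer sizes, ordering details, and class name are assumptions for illustration:

```python
# Illustrative only: layer widths and the exact ordering are assumptions.
import torch
import torch.nn as nn

class ReferenceMelEncoder(nn.Module):
    def __init__(self, n_mels=80, hidden=256, dropout=0.1):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.norm = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)

    def forward(self, mel):                  # mel: (batch, n_mels, frames)
        x = self.conv(mel)                   # convolution
        x = self.norm(x.transpose(1, 2))     # normalization
        x = self.dropout(x)                  # dropout processing
        return x.mean(dim=1)                 # mean pooling over frames

# Reference output vector for a 120-frame, 80-bin reference mel spectrogram.
ref_vec = ReferenceMelEncoder()(torch.randn(1, 80, 120))   # shape (1, 256)
```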
4. The method of speech synthesis based on multi-scale emotion according to claim 3, wherein said adjusting the reference output vector to a target output vector according to the speech emotion information comprises:
time-aligning the speech emotion information with the reference output vector;
and determining the numerical difference between the reference output vector and the voice emotion information as the target output vector.
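Claim 4 leaves the alignment and subtraction abstract; the sketch below assumes the reference output vector is a single pooled vector that is broadcast across the frames of the speech emotion information before the element-wise difference is taken. This interpretation is an assumption, not the patent's stated implementation.

```python
# Assumed interpretation of claim 4: broadcast, then subtract.
import torch

def adjust_reference(reference_vec, speech_emotion):
    """reference_vec: (batch, dim); speech_emotion: (batch, frames, dim)."""
    # Time alignment: repeat the pooled reference vector across frames so
    # both tensors share the same time dimension.
    aligned_ref = reference_vec.unsqueeze(1).expand_as(speech_emotion)
    # The numerical difference is taken as the target output vector
    # (i.e. the intonation information), one vector per frame.
    return aligned_ref - speech_emotion

intonation = adjust_reference(torch.randn(1, 256), torch.randn(1, 120, 256))
```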
5. The method of claim 1, wherein the dividing the phoneme sequence into a plurality of phoneme groups by the local emotion module comprises:
carrying out semantic recognition on the target sentence, and combining the related sentence characters into a sentence phrase;
and determining the phoneme group corresponding to each sentence phrase according to the mapping relation between the sentence characters and the phonemes.
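A small sketch of the regrouping in claim 5, assuming a hypothetical semantic-recognition step has already produced phrase boundaries as lists of character indices, and reusing the one-to-one character/phoneme mapping from claim 1:

```python
# Hypothetical regrouping; the phrase boundaries would come from a
# semantic-recognition step not shown here.
def group_phonemes(phonemes, phrase_index_groups):
    """phonemes: one phoneme per sentence character.
    phrase_index_groups: character indices per sentence phrase, e.g. [[0, 1], [2, 3]]."""
    return [[phonemes[i] for i in group] for group in phrase_index_groups]

phonemes = ["ni3", "hao3", "shi4", "jie4"]           # toy one-per-character phonemes
print(group_phonemes(phonemes, [[0, 1], [2, 3]]))    # [['ni3', 'hao3'], ['shi4', 'jie4']]
```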
6. The method for synthesizing speech based on multi-scale emotion according to claim 5, wherein said carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information comprises:
carrying out emotion intensity prediction on each phoneme group to obtain phoneme emotion intensity information;
carrying out emotion intensity prediction on each sentence phrase to obtain phrase emotion intensity information;
and determining the numerical difference between the phoneme emotion intensity information and the phrase emotion intensity information as the syllable intensity information.
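A minimal sketch of the difference in claim 6, assuming both intensity predictors are hypothetical callables that return one scalar per input and that claim 5 yields exactly one phoneme group per sentence phrase:

```python
# Both predictor arguments are hypothetical; only the subtraction is
# taken from the claim itself.
import numpy as np

def syllable_intensity(phoneme_groups, sentence_phrases,
                       phoneme_intensity_model, phrase_intensity_model):
    # One intensity score per phoneme group and per sentence phrase
    # (assumed to be in one-to-one correspondence, per claim 5).
    phoneme_scores = np.array([phoneme_intensity_model(g) for g in phoneme_groups])
    phrase_scores = np.array([phrase_intensity_model(p) for p in sentence_phrases])
    # Numerical difference between the two predictions, group by group.
    return phoneme_scores - phrase_scores
```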
7. The method of claim 1, wherein the speech synthesis model further comprises a speech synthesis encoder, and wherein synthesizing the target speech from the global text emotion information, the intonation information, the syllable strength information, and the target sentence comprises:
inputting the global text emotion information, the intonation information and the syllable intensity information to the speech synthesis encoder, and determining a vector output by the speech synthesis encoder as a target mel spectrogram;
and performing voice conversion after time-aligning the target mel spectrogram with the target sentence, to obtain the target speech.
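For the final step of claim 7, the sketch below assumes the time alignment is a simple length adjustment of the mel spectrogram and that a separately trained vocoder performs the voice conversion; neither assumption is specified by the claim.

```python
# Assumed length alignment plus a hypothetical vocoder call.
import torch
import torch.nn.functional as F

def mel_to_speech(target_mel, target_frames, vocoder):
    """target_mel: (batch, n_mels, frames); target_frames: aligned frame count."""
    aligned_mel = F.interpolate(target_mel, size=target_frames,
                                mode="linear", align_corners=False)  # time alignment
    return vocoder(aligned_mel)                                      # voice conversion
```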
8. A speech synthesis apparatus based on multi-scale emotion, the apparatus comprising:
the sentence acquisition module is used for acquiring a target sentence composed of a plurality of sentence characters, and inputting the target sentence into a trained speech synthesis model, wherein the speech synthesis model comprises a global emotion module, a sentence emotion module and a local emotion module;
the global prediction module is used for predicting text emotion of the target sentence through the global emotion module and determining global text emotion information of the target sentence;
the sentence conversion module is used for converting the target sentence into a phoneme sequence, and inputting the phoneme sequence into the sentence emotion module and the local emotion module respectively, wherein the phonemes of the phoneme sequence correspond one-to-one to the sentence characters;
the sentence prediction module is used for carrying out voice emotion prediction on the phoneme sequence through the sentence emotion module to obtain voice emotion information, and the voice emotion information characterizes emotion change of the phoneme sequence in a time dimension;
the intonation determining module is used for obtaining a preset reference Mel spectrogram, and adjusting the reference Mel spectrogram according to the voice emotion information to obtain intonation information;
the syllable prediction module is used for dividing the phoneme sequence into a plurality of phoneme groups through the local emotion module, and carrying out emotion intensity prediction on each phoneme group to obtain syllable intensity information;
and the voice synthesis module is used for synthesizing target voice according to the global text emotion information, the intonation information, the syllable intensity information and the target sentence.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program and the processor, when executing the computer program, implements the multi-scale emotion-based speech synthesis method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the multi-scale emotion-based speech synthesis method of any of claims 1 to 7.
CN202310269651.XA 2023-03-15 2023-03-15 Speech synthesis method, device, equipment and storage medium based on multi-scale emotion Pending CN116434730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310269651.XA CN116434730A (en) 2023-03-15 2023-03-15 Speech synthesis method, device, equipment and storage medium based on multi-scale emotion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310269651.XA CN116434730A (en) 2023-03-15 2023-03-15 Speech synthesis method, device, equipment and storage medium based on multi-scale emotion

Publications (1)

Publication Number Publication Date
CN116434730A true CN116434730A (en) 2023-07-14

Family

ID=87089907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310269651.XA Pending CN116434730A (en) 2023-03-15 2023-03-15 Speech synthesis method, device, equipment and storage medium based on multi-scale emotion

Country Status (1)

Country Link
CN (1) CN116434730A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118588085A (en) * 2024-08-05 2024-09-03 南京硅基智能科技有限公司 Voice interaction method, voice interaction system and storage medium

Similar Documents

Publication Publication Date Title
CN113792818B (en) Intention classification method and device, electronic equipment and computer readable storage medium
Vashisht et al. Speech recognition using machine learning
CN114676234A (en) Model training method and related equipment
CN113901191A (en) Question-answer model training method and device
CN115497477B (en) Voice interaction method, voice interaction device, electronic equipment and storage medium
CN116541493A (en) Interactive response method, device, equipment and storage medium based on intention recognition
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN114936274B (en) Model training method, dialogue generating method and device, equipment and storage medium
CN116719999A (en) Text similarity detection method and device, electronic equipment and storage medium
CN116564270A (en) Singing synthesis method, device and medium based on denoising diffusion probability model
CN116434730A (en) Speech synthesis method, device, equipment and storage medium based on multi-scale emotion
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN117373591A (en) Disease identification method and device for electronic medical record, electronic equipment and storage medium
CN114611529B (en) Intention recognition method and device, electronic equipment and storage medium
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116364054A (en) Voice synthesis method, device, equipment and storage medium based on diffusion
CN116541551A (en) Music classification method, music classification device, electronic device, and storage medium
CN116956925A (en) Electronic medical record named entity identification method and device, electronic equipment and storage medium
CN116469370A (en) Target language voice synthesis method and device, electronic equipment and storage medium
CN116645961A (en) Speech recognition method, speech recognition device, electronic apparatus, and storage medium
CN116580704A (en) Training method of voice recognition model, voice recognition method, equipment and medium
CN114786059B (en) Video generation method, video generation device, electronic device, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination