CN113948061A - Speech synthesis method, system, speech synthesis model and training method thereof - Google Patents

Speech synthesis method, system, speech synthesis model and training method thereof

Info

Publication number
CN113948061A
CN113948061A
Authority
CN
China
Prior art keywords
text
audio
training
bottleneck
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111205560.7A
Other languages
Chinese (zh)
Inventor
司马华鹏
毛志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suqian Silicon Based Intelligent Technology Co ltd
Original Assignee
Suqian Silicon Based Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suqian Silicon Based Intelligent Technology Co ltd filed Critical Suqian Silicon Based Intelligent Technology Co ltd
Priority to CN202111205560.7A
Publication of CN113948061A
Legal status: Pending

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.

Description

Speech synthesis method, system, speech synthesis model and training method thereof
Technical Field
The present application relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method, a speech synthesis system, a speech synthesis model, and a training method thereof.
Background
Speech synthesis, also known as text-to-speech, is mainly used to convert text into speech and to make the synthesized speech as intelligible and natural as possible. In recent years, with the progress of speech synthesis technology, synthesized speech has come ever closer to real human speech in terms of sound quality and naturalness. However, human speech carries a variety of speaking styles and is rich in emotional color. Therefore, how to synthesize speech with a distinctive style and emotional color is key to the further development of speech synthesis technology.
In order to solve the above problem, speech with different styles or emotions can be synthesized by embedding style or emotion information in the speech synthesis stage. Such synthesis allows a wide variety of styles or emotions to be embedded: for example, speech can be synthesized in different style categories such as recitation or chat according to a user's selection, and the synthesized speech can carry different emotions such as happiness, sadness or anger.
However, the same emotion category may be further divided into a plurality of emotion levels. Taking the "happy" emotion category as an example, it may be further divided into levels such as "pleased," "joyful" and "overjoyed." Style or emotion embedding at the speech synthesis stage alone cannot realize speech synthesis at these emotion levels, which is unfavorable for the user experience.
Disclosure of Invention
The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof, which aim to solve the problem that speech with deep emotion levels cannot be synthesized in the prior art, and to improve the user experience.
In a first aspect, the present application provides a speech synthesis method, including:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
Optionally, obtaining a first bottleneck characteristic corresponding to the target text includes:
acquiring text data related to emotion;
establishing an emotion encoding network model according to the text data, wherein the emotion encoding network model is used for acquiring emotion characteristics of an input text;
analyzing the target text according to the emotion coding network model, acquiring the emotion characteristics of the target text, and determining the emotion characteristics of the target text as first bottleneck characteristics.
Optionally, the obtaining of the second bottleneck feature corresponding to each of the reference texts includes:
analyzing each reference text according to the emotion coding network model, acquiring the emotion characteristics of each reference text, and determining the emotion characteristics of the reference texts as second bottleneck characteristics.
Optionally, obtaining a first bottleneck characteristic corresponding to the target text includes:
acquiring text data related to styles;
establishing a style coding network model according to the text data, wherein the style coding network model is used for acquiring style characteristics of an input text;
analyzing the target text according to the style coding network model, acquiring style characteristics of the target text, and determining the style characteristics of the target text as first bottleneck characteristics.
Optionally, the obtaining of the second bottleneck feature corresponding to each of the reference texts includes:
analyzing each reference text according to the style coding network model, acquiring the style characteristics of each reference text, and determining the style characteristics of the reference texts as second bottleneck characteristics.
In a second aspect, the present application provides a speech synthesis system configured to:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
In a third aspect, the present application provides a speech synthesis model applied to the above method and system, including an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module, and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text, wherein the text sequence of the target text is a phoneme set of the target text, and converting the text sequence into a corresponding text code;
the feature extraction module is used for acquiring a third bottleneck feature of the audio template according to the input audio template, wherein the third bottleneck feature at least comprises one of emotional feature and style feature of the audio template;
the duration prediction module is used for acquiring the predicted duration of the text code according to the text code and the third bottleneck characteristic, wherein the predicted duration of the text code is the pronunciation duration corresponding to each frame of the text code obtained through prediction;
the duration sampling module is used for performing upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module to obtain the text code subjected to the upsampling processing and a third bottleneck feature subjected to the upsampling processing;
the fundamental frequency prediction module is used for predicting the fundamental frequency characteristic of the text code according to the input text code subjected to the upsampling processing and the third bottleneck characteristic subjected to the upsampling processing;
the decoder module is used for acquiring the audio features of the audio to be synthesized according to the text codes subjected to the upsampling, the third bottleneck features subjected to the upsampling and the fundamental frequency features;
the vocoder module is used for obtaining the synthesized audio according to the audio characteristics of the audio to be synthesized.
In a fourth aspect, the present application provides a method for training a speech synthesis model, which is applied to the method and system described above, and includes:
acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio, and the training text is a text with one or more of emotional characteristics or style characteristics;
analyzing the training audio to obtain the audio features of the training audio;
analyzing the training text to obtain text features of the training text, wherein the text features of the training text comprise a collection of each phoneme of the training text, an emotional feature collection of the training text and a style feature collection of the training text;
matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, wherein the pronunciation duration information is a duration set corresponding to each phoneme in the training text;
inputting the audio features, the text features and the pronunciation duration information to the speech synthesis model to train the speech synthesis model.
Optionally, the audio features include at least one of mel-frequency spectrum features and mel-frequency cepstrum features of the training audio.
Optionally, matching, according to the training audio and the training text, the audio features of the training audio with the text features of the training text to obtain the pronunciation duration information of the training text, where the pronunciation duration information is a duration set corresponding to each phoneme in the training text, includes:
and establishing a corresponding relation between each phoneme in the training text and the audio features of the training audio, so that each phoneme corresponds to the audio features of a plurality of training audios.
According to the above technical solution, the application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.
Drawings
In order to explain the technical solution of the present application more clearly, the drawings needed in the embodiments are briefly described below; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
FIG. 1 is a flow chart of a speech synthesis method provided herein;
FIG. 2 is a speech synthesis model provided herein;
fig. 3 is a method for training a speech synthesis model according to the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all embodiments. Other embodiments based on the embodiments of the present application and obtained by a person of ordinary skill in the art without any creative effort belong to the protection scope of the present application.
To make the purpose and embodiments of the present application clearer, the following will clearly and completely describe the exemplary embodiments of the present application with reference to the attached drawings in the exemplary embodiments of the present application, and it is obvious that the described exemplary embodiments are only a part of the embodiments of the present application, and not all of the embodiments.
It should be noted that the brief descriptions of the terms in the present application are only for the convenience of understanding the embodiments described below, and are not intended to limit the embodiments of the present application. These terms should be understood in their ordinary and customary meaning unless otherwise indicated.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between similar or analogous objects or entities and not necessarily for describing a particular sequential or chronological order, unless otherwise indicated. It is to be understood that the terms so used are interchangeable under appropriate circumstances.
The terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a product or apparatus that comprises a list of elements is not necessarily limited to all elements expressly listed, but may include other elements not expressly listed or inherent to such product or apparatus.
The term "module" refers to any known or later developed hardware, software, firmware, artificial intelligence, fuzzy logic, or combination of hardware and/or software code that is capable of performing the functionality associated with that element.
In recent years, with the progress of speech synthesis technology, synthesized speech has come ever closer to real human speech in terms of sound quality and naturalness. However, human speech carries a variety of speaking styles and is rich in emotional color. Therefore, how to synthesize speech with a distinctive style and emotional color is key to the further development of speech synthesis technology. The application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof, which can synthesize speech with deep emotion levels and improve the user experience.
Referring to fig. 1, a flow chart of a speech synthesis method provided by the present application is shown in fig. 1, and the method includes the following steps:
s110: and acquiring a target text and a first bottleneck characteristic of the target text.
In some embodiments, the target text may be an electronic book, a chapter, a paragraph or a sentence of an electronic book, or another type of text, such as a news article, an article from a public platform, a short message record, or a chat record from an instant-messaging application.
In some embodiments, the first bottleneck characteristic of the target text comprises at least one of an emotional characteristic to be expressed by the content of the target text and a style characteristic that can be embodied by the content of the target text.
In some embodiments, the emotional features to be expressed by the content of the target text may be obtained using an emotion encoding network model. To establish the emotion encoding network model, existing text data is crawled from public networks using a web crawler. Each piece of text data is analyzed by reading the words related to emotional content in it and is labeled manually; for example, the analyzed text data may be labeled with emotion categories such as happiness, anger, sadness, joy, disgust, shock and fear. A convolutional neural network of the emotion encoding network model is then built from the labeled text data and trained with back-propagation until a convergence condition is reached, which completes the training of the convolutional neural network of the emotion encoding network model. The trained convolutional neural network of the emotion encoding network model can extract the emotional features of an input text.
In some embodiments, the first bottleneck characteristic of the target text may be a style feature embodied by the content of the target text, and this style feature may be obtained using a style coding network model. To establish the style coding network model, existing text data is crawled from public networks using a web crawler. The text data is analyzed by reading its content to obtain its style features and is labeled manually; for example, the analyzed text data may be labeled with style categories such as chat and recitation. The style coding network model is then built from the labeled text data and trained with back-propagation until a convergence condition is reached, which completes the training of the style coding network model. The trained style coding network model can extract the style features of an input text.
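For illustration only, the following is a minimal sketch of how such an emotion or style coding network could be organized so that the activation of its penultimate layer is used as the bottleneck feature; the class name, layer sizes, vocabulary size and number of label categories are assumptions made for the example and are not specified by the description.
```python
# Minimal sketch (PyTorch) of an emotion/style coding network whose penultimate
# layer serves as the bottleneck feature. Vocabulary size, layer sizes and the
# number of label categories are illustrative assumptions.
import torch
import torch.nn as nn

class TextBottleneckEncoder(nn.Module):
    def __init__(self, vocab_size=8000, embed_dim=128, bottleneck_dim=64, num_classes=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Sequential(
            nn.Conv1d(embed_dim, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool1d(1)            # collapse the time axis
        self.bottleneck = nn.Linear(256, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)  # e.g. 7 emotion categories

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)      # (batch, embed_dim, seq_len)
        x = self.pool(self.conv(x)).squeeze(-1)        # (batch, 256)
        bottleneck = torch.tanh(self.bottleneck(x))    # emotion/style bottleneck feature
        logits = self.classifier(bottleneck)           # only used while training with labels
        return bottleneck, logits
```
During training the classifier head is optimized with cross-entropy against the manually labeled categories; at inference the logits are discarded and the bottleneck vector is used as the first or second bottleneck feature.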
S120: a reference audio library is obtained, the reference audio library comprising a number of reference audios.
In some embodiments, the reference audio may be drawn from the corpus used to train the speech synthesis model, and the reference audio making up the reference audio library should be audio with strong emotional or stylistic features.
S130: and acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text.
In some embodiments, the second bottleneck characteristic of a reference text comprises at least one of an emotional feature to be expressed by the content of the reference text and a style feature embodied by the content of the reference text.
In some embodiments, the emotional characteristics to be expressed by the content of the reference text can be obtained by using the emotion encoding network model.
In some embodiments, the style characteristics that can be embodied by the content of the reference text can be obtained by using the style-coded network model described above.
S140: and calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template.
In some embodiments, the first bottleneck feature and the second bottleneck features may both be features related to emotion categories, and an emotion category may be further divided into a plurality of emotion levels.
More specifically, the first bottleneck feature of the target text may belong to the same emotion category as the second bottleneck features of the reference texts. For example, suppose the first and second bottleneck features all belong to the emotion category "happy," which is further divided into emotion levels such as "pleased," "joyful" and "overjoyed"; different levels correspond to different degrees of "happy," and the similarity between the "pleased" level and the "joyful" level is significantly greater than the similarity between the "pleased" level and the "overjoyed" level. Therefore, if the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "joyful" level, and reference text 2 corresponds to the "overjoyed" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and that of reference text 2, reference text 1 is selected as the text template.
Further, the first bottleneck feature of the target text may belong to a different emotion category from the second bottleneck features of the reference texts. For example, the first bottleneck feature belongs to the emotion category "happy" while the second bottleneck features belong to the emotion category "angry," which is further divided into emotion levels such as "irritated," "indignant" and "furious." If the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "indignant" level, and reference text 2 corresponds to the "furious" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and that of reference text 2, reference text 1 is selected as the text template.
Further, the first bottleneck feature of the target text may belong to the same emotion category as the second bottleneck features of some reference texts and to a different emotion category from those of the remaining reference texts. For example, the first bottleneck feature belongs to the emotion category "happy," part of the second bottleneck features belong to "happy" and the rest belong to "angry." If the first bottleneck feature of the target text corresponds to the "pleased" level, reference text 1 corresponds to the "joyful" level, reference text 2 corresponds to the "overjoyed" level, reference text 3 corresponds to the "indignant" level, and reference text 4 corresponds to the "furious" level, then, since the similarity between the emotion level of the target text and that of reference text 1 is greater than the similarity between the emotion level of the target text and the emotion levels of the other reference texts, reference text 1 is selected as the text template.
In some embodiments, the first bottleneck feature and the second bottleneck features may instead be features related to style categories. In that case, the similarity between the first bottleneck feature and the second bottleneck feature of each reference text is calculated in the same way as for the emotion-related features described above, and the details are not repeated here.
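As an illustration of S140 and S150, the following sketch selects the text template and the corresponding audio template by comparing bottleneck features with cosine similarity; the choice of cosine similarity and the layout of the reference audio library are assumptions made for this example.
```python
# Illustrative sketch of S140/S150: selecting the text template and audio template
# by similarity between bottleneck features. Cosine similarity and the dictionary
# layout of the reference audio library are assumptions.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_audio_template(first_bottleneck, reference_library):
    """reference_library: list of dicts with keys 'text', 'audio' and 'bottleneck'."""
    best = max(reference_library,
               key=lambda ref: cosine_similarity(first_bottleneck, ref["bottleneck"]))
    return best["text"], best["audio"]  # text template, audio template
```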
S150: determining the reference audio corresponding to the text template as an audio template;
s160: and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
More specifically, referring to fig. 2, the speech synthesis model provided by the present application includes an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text and converting the text sequence into a corresponding text code.
In some embodiments, the text sequence of the target text is a phone set of the target text, and the encoder module may convert the target text into an abstract text encoding recognizable by the speech synthesis model for use by other modules;
the feature extraction module is used for acquiring a third bottleneck feature of the audio template according to the input audio template;
in some embodiments, the third bottleneck feature comprises at least one of an emotional feature and a style feature of the audio template.
And the duration prediction module is used for acquiring the predicted duration of the text code according to the text code and the third bottleneck characteristic.
In some embodiments, the predicted duration of the text encoding is a pronunciation duration corresponding to each frame of the text encoding obtained through prediction.
And the duration sampling module is used for performing upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module to obtain the text code subjected to the upsampling processing and a third bottleneck feature subjected to the upsampling processing.
And the fundamental frequency prediction module is used for predicting the fundamental frequency characteristic of the text code according to the input text code subjected to the upsampling processing and the third bottleneck characteristic subjected to the upsampling processing.
And the decoder module is used for acquiring the audio features of the audio to be synthesized according to the text codes subjected to the upsampling processing, the third bottleneck features subjected to the upsampling processing and the fundamental frequency features.
The vocoder module is used for obtaining the synthesized audio according to the audio characteristics of the audio to be synthesized.
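The data flow through the seven modules can be summarized with the following structural sketch; each sub-module is treated as an opaque, already-trained network with the inputs and outputs stated above, since the description does not specify their internal architectures.
```python
# Structural sketch of how the seven modules are wired together. Sub-module
# internals are not specified by the description and are therefore omitted.
import torch.nn as nn

class SpeechSynthesisModel(nn.Module):
    def __init__(self, encoder, feature_extractor, duration_predictor,
                 duration_sampler, f0_predictor, decoder, vocoder):
        super().__init__()
        self.encoder = encoder                      # phoneme sequence -> text code
        self.feature_extractor = feature_extractor  # audio template -> third bottleneck feature
        self.duration_predictor = duration_predictor
        self.duration_sampler = duration_sampler    # up-samples per predicted durations
        self.f0_predictor = f0_predictor
        self.decoder = decoder                      # -> audio features (e.g. mel spectrum)
        self.vocoder = vocoder                      # audio features -> waveform

    def forward(self, phoneme_ids, audio_template):
        text_code = self.encoder(phoneme_ids)
        bottleneck = self.feature_extractor(audio_template)
        durations = self.duration_predictor(text_code, bottleneck)
        text_up, bottleneck_up = self.duration_sampler(text_code, bottleneck, durations)
        f0 = self.f0_predictor(text_up, bottleneck_up)
        audio_features = self.decoder(text_up, bottleneck_up, f0)
        return self.vocoder(audio_features)
```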
Further, exemplary implementations and usage of the speech synthesis method provided in the present application are described below in concrete application scenarios.
In some embodiments, the approach provided herein may synthesize audio with emotional features, as illustrated in exemplary embodiment 1 below.
This embodiment takes user A and user B chatting through social software as an example; an end-to-end code sketch follows the numbered steps below.
(1) The user A inputs a text A to the user B through social software;
(2) analyzing the text A, specifically, acquiring a first bottleneck characteristic of the text A, namely an emotional characteristic of the text A, through an emotional coding network model;
(3) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely the emotional characteristic of the reference text;
(4) respectively calculating the similarity between the emotional feature of each reference text and the emotional feature of the text A, selecting the reference text with the highest similarity with the emotional feature of the text A, and determining the reference text as a text template;
(5) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(6) inputting the audio template and the text A into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(7) the encoder module obtains the text sequence of the input text A, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(8) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires the emotional features of the audio template and inputs the emotional features of the audio template into a duration prediction module;
(9) the duration prediction module acquires the predicted duration of the text code according to the input text code and the emotional characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(10) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(11) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the emotional features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(12) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(13) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
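Tying the pieces together, steps (1) to (13) above can be sketched as a single function; emotion_encoder, reference_library, synthesis_model, select_audio_template and the g2p front end refer to the hypothetical components sketched earlier in this description, not to an existing API.
```python
# End-to-end sketch of steps (1)-(13) above for the chat scenario. All component
# names are placeholders for the hypothetical modules sketched earlier.
def synthesize_chat_message(text_a, emotion_encoder, reference_library,
                            synthesis_model, g2p):
    # steps (1)-(2): first bottleneck feature of text A from the emotion coding network
    first_bottleneck, _ = emotion_encoder(g2p.to_token_ids(text_a))
    # steps (3)-(5): pick the most similar reference text and take its reference audio
    _, audio_template = select_audio_template(first_bottleneck, reference_library)
    # steps (6)-(13): run the pre-trained speech synthesis model
    phoneme_ids = g2p.to_phoneme_ids(text_a)  # text sequence (phoneme set) of text A
    return synthesis_model(phoneme_ids, audio_template)
```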
In some embodiments, the approach provided herein may synthesize audio with style features, as illustrated in exemplary embodiment 2 below.
This embodiment likewise takes user A and user B chatting through social software as an example.
(1) The user A inputs a text A to the user B through social software;
(2) analyzing the text A, specifically, obtaining a first bottleneck characteristic of the text A, namely the style characteristic of the text A, through a style coding network model;
(3) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely a style characteristic of the reference text;
(4) respectively calculating the similarity between the style features of each reference text and the style features of the text A, selecting the reference text with the highest similarity with the style features of the text A, and determining the reference text as a text template;
(5) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(6) inputting the audio template and the text A into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(7) the encoder module obtains the text sequence of the input text A, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(8) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires style features of the audio template and inputs the style features of the audio template into a duration prediction module;
(9) the duration prediction module acquires the predicted duration of the text code according to the input text code and the style characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(10) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(11) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the style features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(12) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(13) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
In some embodiments, the present application may also be applied to converting text into audio, as illustrated in exemplary embodiment 3 below, which takes a novel that a user wishes to listen to as an example.
(1) Acquiring the text content of the novel, analyzing the text content of the novel, and specifically acquiring a first bottleneck characteristic of the text, namely an emotional characteristic of the text, through an emotional coding network model;
(2) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely the emotional characteristic of the reference text;
(3) respectively calculating the similarity of the emotional features of each reference text and the emotional features of the novel text, selecting the reference text with the highest similarity with the emotional features of the novel text, and determining the reference text as a text template;
(4) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(5) inputting the audio template and the novel text into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(6) the encoder module obtains the text sequence of the input novel text, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(7) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires the emotional features of the audio template and inputs the emotional features of the audio template into a duration prediction module;
(8) the duration prediction module acquires the predicted duration of the text code according to the input text code and the emotional characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(9) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(10) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the emotional features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(11) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(12) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
In some embodiments, the present application may also convert the target text into audio with style features, as illustrated in exemplary embodiment 4 below, which again takes a novel that a user wishes to listen to as an example.
(1) Acquiring the text content of a novel, analyzing the text content of the novel, and specifically acquiring a first bottleneck characteristic of the text, namely the style characteristic of the text, through a style coding network model;
(2) acquiring a reference text corresponding to each reference audio from a reference audio library, and a second bottleneck characteristic corresponding to each reference text, namely a style characteristic of the reference text;
(3) respectively calculating the similarity of the style features of each reference text and the style features of the novel text, selecting the reference text with the highest similarity with the style features of the novel text, and determining the reference text as a text template;
(4) acquiring a reference audio of the text template, and determining the reference audio as an audio template;
(5) inputting the audio template and the novel text into a pre-trained speech synthesis model, wherein the speech synthesis model comprises an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a fundamental frequency prediction module, a decoder module and a vocoder module;
(6) the encoder module obtains the text sequence of the input novel text, converts the text sequence into a corresponding text code, and inputs the text code to the duration prediction module;
(7) inputting an audio template into a feature extraction module, wherein the feature extraction module acquires style features of the audio template and inputs the style features of the audio template into a duration prediction module;
(8) the duration prediction module acquires the predicted duration of the text code according to the input text code and the style characteristics of the audio template, and inputs the predicted duration of the text code to the duration sampling module;
(9) the duration sampling module performs upsampling processing on the text code according to the output of the feature extraction module and the duration prediction module, outputs the text code subjected to the upsampling processing and the third bottleneck feature subjected to the upsampling processing, and inputs the output result to the fundamental frequency prediction module;
(10) the fundamental frequency prediction module predicts the fundamental frequency features of the text code according to the input text code subjected to the upsampling processing and the style features of the audio template subjected to the upsampling processing, obtains the text code subjected to the upsampling processing, the predicted duration and the fundamental frequency features, and inputs the output result to the decoder module;
(11) the decoder module acquires the audio features of the audio to be synthesized according to the text code subjected to the up-sampling processing, the third bottleneck feature subjected to the up-sampling processing and the fundamental frequency feature, and inputs the audio features of the audio to be synthesized into the vocoder module;
(12) the vocoder module obtains synthesized audio according to the input audio characteristics of the audio to be synthesized.
Referring to fig. 3, the present application provides a method for training a speech synthesis model, by which the speech synthesis model described above can be trained and optimized. The method includes:
s310, acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio;
in some embodiments, the training audio may be a set of emotional voice data with different emotional tendencies recorded by a professional sound recorder, or may be a set of a large number of emotional voice data with different emotional tendencies directly crawled on the internet.
S320, analyzing the training audio to obtain the audio characteristics of the training audio;
In some embodiments, the training audio may be subjected to a short-time Fourier transform; the resulting spectrum is then mapped to the mel scale with triangular filter banks, the logarithm is taken, and a discrete cosine transform is applied, to obtain the audio features of the training audio, where the audio features include at least one of the mel-spectrum features and the mel-cepstral features of the training audio.
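A possible realization of this feature extraction with the librosa library is sketched below; the sampling rate, FFT size, hop length and number of mel bands are illustrative assumptions rather than values given in the description.
```python
# One possible realization of S320 with librosa; frame parameters are assumptions.
import librosa
import numpy as np

def extract_audio_features(wav_path, sr=22050, n_fft=1024, hop_length=256, n_mels=80):
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)                                        # mel-spectrum feature
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel), n_mfcc=13)  # mel-cepstrum feature
    return log_mel, mfcc
```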
S330, analyzing the training text to obtain the text characteristics of the training text;
In some embodiments, the training text is sequentially subjected to sentence structure analysis, text regularization, word segmentation, part-of-speech prediction, prosody prediction, phoneme conversion and the like to obtain the text features of the training text, where the text features include a set of the phonemes of the training text, a set of the emotional features of the training text, and a set of the style features of the training text.
Here, sentence structure analysis is used to divide the training text into a set of individual sentences; optionally, it can be implemented with a neural-network-based model. Text regularization converts non-Chinese symbols in the training text, such as punctuation or digits, into their spoken expression in the Chinese context; for example, regularizing the training text "2.1" yields the training text "two point one", and this example is not limiting. Optionally, text regularization can be implemented with a neural-network-based model. Word segmentation splits the sentences of the training text according to their semantics, keeping the characters of one word together; optionally, word segmentation can be implemented with a neural-network-based model. Part-of-speech prediction predicts the part of speech of each word in the segmented training text; parts of speech include, but are not limited to, nouns, verbs, adjectives, quantifiers, pronouns, adverbs, prepositions, conjunctions, particles, interjections and onomatopoeia. Optionally, part-of-speech prediction can be implemented with a neural-network-based model. Prosody prediction predicts the prosody of each word after part-of-speech prediction, that is, the tonal pattern (level and oblique tones) and prosodic rules of each word; optionally, prosody prediction can be implemented with a neural-network-based model. Phoneme conversion converts the text after prosody prediction into the corresponding phonemes; for example, if the text to be converted is "good" (Chinese "hao", third tone), the result of phoneme conversion is "h, a, o, 3", which contains 3 phonemes and the tone mark, and this example is not limiting. Optionally, phoneme conversion can be implemented with a neural-network-based model.
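The text analysis of S330 can be sketched as the following chain of sub-steps; each frontend helper stands for one of the neural-network-based models mentioned above, and their names and interfaces are assumed for the example.
```python
# Sketch of the S330 text front end. Every frontend.* helper is a placeholder for
# one of the neural-network-based models mentioned in the description.
def extract_text_features(training_text, frontend):
    features = []
    for sentence in frontend.sentence_split(training_text):   # sentence structure analysis
        sentence = frontend.normalize_text(sentence)           # e.g. "2.1" -> "two point one"
        words = frontend.segment(sentence)                     # word segmentation
        pos = frontend.predict_pos(words)                      # part-of-speech prediction
        prosody = frontend.predict_prosody(words, pos)         # prosody prediction
        phonemes = frontend.to_phonemes(words, prosody)        # e.g. "good" -> ["h", "a", "o", "3"]
        features.append({"phonemes": phonemes, "pos": pos, "prosody": prosody})
    return features
```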
S340, matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain the pronunciation duration information of the training text, wherein the pronunciation duration information is a duration set corresponding to each phoneme in the training text;
in some embodiments, according to the training audio and the training text, aligning the audio features of the training audio with the text features of the training text by using an alignment tool to obtain pronunciation duration information of the training text.
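A minimal sketch of turning such an alignment into the per-phoneme duration information follows; it assumes the alignment tool outputs (phoneme, start time, end time) triples in seconds, and the hop length and sampling rate are illustrative.
```python
# Minimal sketch of S340: converting a phoneme-level alignment into per-phoneme
# frame durations. The alignment format, hop length and sampling rate are assumptions.
def durations_from_alignment(alignment, hop_length=256, sr=22050):
    frames_per_second = sr / hop_length
    return [(phoneme, int(round((end - start) * frames_per_second)))
            for phoneme, start, end in alignment]
```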
And S350, inputting the audio features, the text features and the pronunciation duration information into the speech synthesis model to train the speech synthesis model.
In some embodiments, the present application further provides a speech synthesis system configured to:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
According to the above technical solution, the application provides a speech synthesis method, a speech synthesis system, a speech synthesis model and a training method thereof. The method includes: obtaining a target text and a first bottleneck characteristic of the target text; acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios; acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text; calculating the similarity between the first bottleneck characteristic and the second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity to the first bottleneck characteristic as a text template; determining the reference audio corresponding to the text template as an audio template; and inputting the audio template and the target text into a pre-trained speech synthesis model to synthesize speech with deep emotion-level features.
In specific implementation, the present invention further provides a computer storage medium, where the computer storage medium may store a program, and the program may include some or all of the steps in the embodiments of the speech synthesis method provided by the present invention when executed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM) or a Random Access Memory (RAM).
Those skilled in the art will readily appreciate that the techniques of the embodiments of the present invention may be implemented as software plus a required general purpose hardware platform. Based on such understanding, the technical solutions in the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments provided in the present application are only a few examples of the general concept of the present application, and do not limit the scope of the present application. Any other embodiments extended according to the scheme of the present application without inventive efforts will be within the scope of protection of the present application for a person skilled in the art.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring a target text and a first bottleneck characteristic of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck characteristic of each reference text;
calculating the similarity of the first bottleneck characteristic and a second bottleneck characteristic of each reference text, and determining the reference text corresponding to the second bottleneck characteristic with the highest similarity with the first bottleneck characteristic as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
2. The method of claim 1, wherein obtaining a first bottleneck feature corresponding to the target text comprises:
acquiring text data related to emotion;
establishing an emotion encoding network model according to the text data, wherein the emotion encoding network model is used for acquiring emotion characteristics of an input text;
analyzing the target text according to the emotion coding network model, acquiring the emotion characteristics of the target text, and determining the emotion characteristics of the target text as first bottleneck characteristics.
3. The method according to claim 2, wherein obtaining a second bottleneck characteristic corresponding to each of the reference texts comprises:
analyzing each reference text according to the emotion coding network model, acquiring the emotion characteristics of each reference text, and determining the emotion characteristics of the reference texts as second bottleneck characteristics.
4. The method of claim 1, wherein obtaining a first bottleneck feature corresponding to the target text comprises:
acquiring text data related to styles;
establishing a style coding network model according to the text data, wherein the style coding network model is used for acquiring style characteristics of an input text;
analyzing the target text according to the style coding network model, acquiring style characteristics of the target text, and determining the style characteristics of the target text as first bottleneck characteristics.
5. The method of claim 4, wherein obtaining a second bottleneck characteristic corresponding to each of the reference texts comprises:
analyzing each reference text according to the style coding network model, acquiring the style characteristics of each reference text, and determining the style characteristics of the reference texts as second bottleneck characteristics.
6. A speech synthesis system, characterized in that the system is configured to:
acquiring a target text and a first bottleneck feature of the target text;
acquiring a reference audio library, wherein the reference audio library comprises a plurality of reference audios;
acquiring a reference text corresponding to each reference audio in the reference audio library, and acquiring a second bottleneck feature of each reference text;
calculating the similarity between the first bottleneck feature and the second bottleneck feature of each reference text, and determining the reference text whose second bottleneck feature has the highest similarity to the first bottleneck feature as a text template;
determining the reference audio corresponding to the text template as an audio template;
and inputting the audio template and the target text into a pre-trained speech synthesis model to obtain a synthesized audio.
7. A speech synthesis model for use in the method of any one of claims 1-5 and the system of claim 6, comprising an encoder module, a feature extraction module, a duration prediction module, a duration sampling module, a pitch prediction module, a decoder module, and a vocoder module, wherein:
the encoder module is used for acquiring a text sequence of an input target text, the text sequence being the phoneme set of the target text, and for converting the text sequence into a corresponding text code;
the feature extraction module is used for acquiring a third bottleneck feature of an input audio template, wherein the third bottleneck feature comprises at least one of an emotional feature and a style feature of the audio template;
the duration prediction module is used for acquiring a predicted duration of the text code according to the text code and the third bottleneck feature, the predicted duration being the predicted pronunciation duration corresponding to each frame of the text code;
the duration sampling module is used for upsampling the text code according to the outputs of the feature extraction module and the duration prediction module, to obtain an upsampled text code and an upsampled third bottleneck feature;
the pitch prediction module is used for predicting a fundamental frequency feature of the text code according to the upsampled text code and the upsampled third bottleneck feature;
the decoder module is used for acquiring audio features of the audio to be synthesized according to the upsampled text code, the upsampled third bottleneck feature and the fundamental frequency feature;
the vocoder module is used for obtaining the synthesized audio according to the audio features of the audio to be synthesized.
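Illustrative sketch (not part of the claims): claim 7 does not specify how the duration sampling module performs the upsampling. A common realization in non-autoregressive speech synthesis is a length regulator that repeats each phoneme-level vector by its predicted number of frames; the sketch below assumes that approach, and the tensor shapes and names are illustrative.

```python
import torch

def length_regulate(text_code: torch.Tensor,
                    bottleneck: torch.Tensor,
                    durations: torch.Tensor):
    """Upsample phoneme-level features to frame level.

    text_code:  (num_phonemes, encoder_dim) output of the encoder module
    bottleneck: (num_phonemes, bottleneck_dim) third bottleneck feature,
                broadcast to phoneme level by the feature extraction module
    durations:  (num_phonemes,) integer frame counts from the duration predictor
    """
    # Repeat each phoneme-level vector 'duration' times along the time axis.
    upsampled_text = torch.repeat_interleave(text_code, durations, dim=0)
    upsampled_bottleneck = torch.repeat_interleave(bottleneck, durations, dim=0)
    return upsampled_text, upsampled_bottleneck   # both (num_frames, dim)
```

The resulting frame-level tensors would then feed the pitch prediction module and the decoder module, whose output audio features in turn drive the vocoder module.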
8. A method for training a speech synthesis model, applied to the method of any one of claims 1 to 5 and the system of claim 6, comprising:
acquiring a training material, wherein the training material comprises a training audio and a training text corresponding to the training audio, the training text being a text having emotional features and/or style features;
analyzing the training audio to obtain audio features of the training audio;
analyzing the training text to obtain text features of the training text, wherein the text features of the training text comprise the set of phonemes of the training text, the set of emotional features of the training text and the set of style features of the training text;
matching the audio features of the training audio with the text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, wherein the pronunciation duration information is the set of durations corresponding to the phonemes of the training text;
inputting the audio features, the text features and the pronunciation duration information to the speech synthesis model to train the speech synthesis model.
9. The training method of claim 8, wherein the audio features comprise at least one of mel-frequency spectral features and mel-frequency cepstral features of the training audio.
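Illustrative sketch (not part of the claims): claim 9 leaves the exact audio features open. The snippet below extracts a log-mel spectrogram and mel-frequency cepstral coefficients with the librosa library; the sampling rate and frame parameters (n_fft, hop_length, n_mels, n_mfcc) are assumed values, not taken from the patent.

```python
import librosa
import numpy as np

def extract_audio_features(wav_path: str, sr: int = 22050):
    # Load the training audio at a fixed sampling rate.
    audio, sr = librosa.load(wav_path, sr=sr)

    # Mel spectrogram (80 mel bands per frame), converted to log scale.
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         n_fft=1024, hop_length=256, n_mels=80)
    log_mel = librosa.power_to_db(mel, ref=np.max)

    # Mel-frequency cepstral coefficients (13 coefficients per frame).
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13, hop_length=256)

    return log_mel.T, mfcc.T   # both shaped (num_frames, num_features)
```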
10. The training method according to claim 8, wherein matching audio features of the training audio with text features of the training text according to the training audio and the training text to obtain pronunciation duration information of the training text, the pronunciation duration information being a duration set corresponding to each phoneme in the training text, comprises:
and establishing a correspondence between each phoneme in the training text and the audio features of the training audio, so that each phoneme corresponds to the audio features of a plurality of frames of the training audio.
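Illustrative sketch (not part of the claims): the claims do not state how the phoneme-to-frame correspondence is established; in practice it is commonly obtained from a forced aligner. Assuming such an aligner has already produced (phoneme, start, end) intervals in seconds, the helper below converts them into per-phoneme frame counts consistent with the hop length used for the audio features; the sample rate and hop length are assumptions.

```python
def durations_from_alignment(intervals, sr=22050, hop_length=256):
    """intervals: list of (phoneme, start_sec, end_sec) from a forced aligner.
    Returns a list of (phoneme, n_frames) so that each phoneme is associated
    with the audio-feature frames it spans."""
    frames_per_second = sr / hop_length
    durations = []
    for phoneme, start, end in intervals:
        n_frames = max(1, round((end - start) * frames_per_second))
        durations.append((phoneme, n_frames))
    return durations

# Example: a phoneme spanning 0.10 s to 0.22 s covers about 10 feature frames.
print(durations_from_alignment([("a1", 0.10, 0.22)]))
```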
CN202111205560.7A 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof Pending CN113948061A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111205560.7A CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111205560.7A CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Publications (1)

Publication Number Publication Date
CN113948061A true CN113948061A (en) 2022-01-18

Family

ID=79331057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111205560.7A Pending CN113948061A (en) 2021-10-15 2021-10-15 Speech synthesis method, system, speech synthesis model and training method thereof

Country Status (1)

Country Link
CN (1) CN113948061A (en)

Similar Documents

Publication Publication Date Title
KR102582291B1 (en) Emotion information-based voice synthesis method and device
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
JP7228998B2 (en) speech synthesizer and program
KR101160193B1 (en) Affect and Voice Compounding Apparatus and Method therefor
CN112786007A (en) Speech synthesis method, device, readable medium and electronic equipment
US20230298564A1 (en) Speech synthesis method and apparatus, device, and storage medium
CN113658577A (en) Speech synthesis model training method, audio generation method, device and medium
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Gabdrakhmanov et al. Ruslan: Russian spoken language corpus for speech synthesis
CN110930975B (en) Method and device for outputting information
CN113903326A (en) Speech synthesis method, apparatus, device and storage medium
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN113539239B (en) Voice conversion method and device, storage medium and electronic equipment
CN114708848A (en) Method and device for acquiring size of audio and video file
CN113948061A (en) Speech synthesis method, system, speech synthesis model and training method thereof
CN114333903A (en) Voice conversion method and device, electronic equipment and storage medium
CN113948062A (en) Data conversion method and computer storage medium
Jamtsho et al. OCR and speech recognition system using machine learning
CN113450756A (en) Training method of voice synthesis model and voice synthesis method
Houidhek et al. Evaluation of speech unit modelling for HMM-based speech synthesis for Arabic
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN115457931B (en) Speech synthesis method, device, equipment and storage medium
Ajayi et al. Indigenuous Vocabulary Reformulation for Continuousyorùbá Speech Recognition In M-Commerce Using Acoustic Nudging-Based Gaussian Mixture Model
Hirose Modeling of fundamental frequency contours for HMM-based speech synthesis: Representation of fundamental frequency contours for statistical speech synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination