CN117854478B - Speech synthesis method, device and system based on controllable text - Google Patents

Speech synthesis method, device and system based on controllable text

Info

Publication number
CN117854478B
Authority
CN
China
Prior art keywords
style
voice
phoneme sequence
emotion
language
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410250738.7A
Other languages
Chinese (zh)
Other versions
CN117854478A (en)
Inventor
周若华
刘洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture filed Critical Beijing University of Civil Engineering and Architecture
Priority to CN202410250738.7A
Publication of CN117854478A
Application granted
Publication of CN117854478B


Abstract

The invention provides a controllable text-based speech synthesis method, device and system, comprising the following steps: acquiring voice content to be synthesized, and forming a first phoneme sequence based on an external speaker embedding module; identifying semantic information of an input text, and respectively acquiring a decoupled voice style, emotion type and language type; generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type; extracting style characteristics and time distribution characteristics of the second phoneme sequence; adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type; adjusting the style characteristics of the second phoneme sequence based on the voice style; fusing the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time correspondence before adjustment, to obtain a third phoneme sequence; and decoding the third phoneme sequence with a decoder to obtain synthesized speech. The quality of the synthesized speech is improved, and the controllability of its style and the applicability of the method are enhanced.

Description

Speech synthesis method, device and system based on controllable text
Technical Field
The present application relates to the field of speech synthesis technology, and in particular, to a method, apparatus, and system for speech synthesis based on controllable text.
Background
Style conversion techniques modify the expression style of audio or text. In speech synthesis, style conversion typically involves adjusting the pitch, speed, emotional color and other attributes of the audio to satisfy a user's more personalized requirements for the speech output. Style conversion techniques have made significant progress in converting text into speech of a particular style, but several challenges remain.
Existing models often require large amounts of training data and computational resources, which limits their wide application in practical scenarios. Furthermore, current systems focus mainly on processing single-speaker speech and have clear limitations in multi-speaker scenarios. Although the performance of style conversion has improved in recent years, style control in existing systems often depends on systems built from expressive speech recordings of specific, discrete style types: the generated speech style is essentially consistent with the style of the input speech, and if no speech sample is provided, speech of the corresponding style cannot be generated. Style-controllable speech generation models usually take phonemes as input, so for generation tasks that lack high-quality text-to-phoneme conversion it is difficult to produce high-quality speech on demand. Prior-art speech generation methods are therefore not flexible enough in practical applications: users prefer to define the style of the desired speech directly through a text description rather than providing a reference speech sample of a specific style, so such methods have difficulty meeting their needs. In addition, the synthesized speech is limited to the language type of the input speech (Chinese in, Chinese out), making it hard to meet requirements that involve switching the language type. For this reason, researchers have studied techniques for generating speech of a specified style under text control, and a text-description-guided cross-speaker style transfer scheme, PromptStyle, has been proposed. However, that scheme directly aligns the style prompt generated from text with a style prompt previously trained on speech in order to synthesize speech in the style specified by the text; model training still depends on speech prompts, the training samples have high requirements, and the training procedure is complicated. Moreover, aligning text styles alone cannot fully integrate the various kinds of style information into the finally synthesized speech, so the synthesized speech sounds stiff.
Disclosure of Invention
In view of the above, the application provides a method, a device and a system for controllable text-based speech synthesis, which are used for improving the naturalness and quality of the generated speech, reducing the dependence of model training on voice style prompt samples, and widening the application scenarios of the model.
Specifically, the application is realized by the following technical scheme:
the first aspect of the present application provides a controllable text-based speech synthesis method, which is characterized in that the controllable text-based speech synthesis method includes:
acquiring voice content to be synthesized, and forming a first phoneme sequence based on an external speaker embedding module;
Identifying semantic information of an input text, and respectively acquiring decoupled voice styles, emotion types and language types;
Generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type;
Extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation;
adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type;
Adjusting style characteristics of the second phoneme sequence based on the speech style;
Based on the time corresponding relation before adjustment, fusing the adjusted time distribution characteristics and the adjusted voice style characteristics to obtain a third phoneme sequence;
and decoding the third phoneme sequence based on a decoder to obtain a synthetic voice, wherein the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text.
Preferably, the identifying semantic information of the input text, respectively obtaining a decoupled speech style, emotion type and language type, includes:
Acquiring an input text;
inputting the input text into a semantic understanding module, and locating a style entity, an emotion entity and a language type entity, wherein the style entity, the emotion entity and the language type entity are not completely identical;
the style entity, the emotion entity and the language type entity are respectively input into a style identification module, an emotion classification module and a language classification module;
and the style recognition module, the emotion classification module and the language classification module respectively output recognition results based on the global semantics of the input text, and acquire decoupled voice styles, emotion types and language types corresponding to the input text.
Preferably, there are a plurality of style entities, and the step in which the style recognition module, the emotion classification module and the language classification module respectively output recognition results based on the global semantics of the input text specifically comprises:
The style identification module determines the relation among a plurality of style entities according to the global semantics of the input text;
The style identification module obtains embedded representations of the plurality of style entities and outputs an identified speech style based on relationships between the embedded representations and the plurality of style entities.
Preferably, the style recognition module, the emotion classification module and the language classification module output recognition results based on the global semantics of the input text respectively, and specifically include:
The emotion classification module identifies global emotion of the input text and local emotion of the emotion entity;
correcting the global emotion based on the local emotion;
And the emotion classification module classifies and identifies the corrected local emotion to obtain the emotion type.
Preferably, the style recognition module, the emotion classification module and the language classification module output recognition results based on the global semantics of the input text respectively, and specifically include:
the language classification module obtains the embedded representation of the language type entity;
and identifying the embedded representation of the language type entity to obtain a classification result of the language type.
Preferably, extracting the style feature and the time distribution feature of the second phoneme sequence includes:
Generating time distribution characteristics according to the existence of the phonemes at each time point in the time period of the second phoneme sequence, wherein the time distribution characteristics are used for representing whether the phonemes exist at each time point;
and extracting the style characteristics of the second phoneme sequence at each time point in the time period of the second phoneme sequence, wherein the style characteristics at each time point at least comprise pitch, timbre and loudness.
Preferably, the fusing the adjusted time distribution feature and the adjusted voice style feature based on the time correspondence before adjustment to obtain a third phoneme sequence specifically includes:
determining a phoneme time distribution of the third phoneme sequence by taking the adjusted time distribution characteristic as a reference, wherein the phoneme time distribution is used for indicating whether phonemes exist at each time point in the third phoneme sequence;
And matching style characteristics of each time point in the adjusted voice style characteristics according to the phoneme time distribution of the third phoneme sequence and the time corresponding relation before adjustment, and generating phonemes of each time point in the third phoneme sequence.
Preferably, the language type of the second phoneme sequence is the same as the language type defined by the input text, and the voice content represented by the second phoneme sequence is the same as the voice content to be synthesized;
the language, emotion and voice style of the third phoneme sequence are the same as the language type, emotion type and voice style corresponding to the input text, and the voice content represented by the third phoneme sequence is the same as the voice content to be synthesized.
A second aspect of the present invention provides a controllable text-based speech synthesis apparatus comprising:
The voice acquisition module is used for acquiring voice content to be synthesized and forming a first phoneme sequence based on the external speaker embedding module;
the decoupling module is used for identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
A language conversion module for generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type;
The extraction module is used for extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation;
The adjusting module is used for adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type; the adjusting module is further used for adjusting style characteristics of the second phoneme sequence based on the voice style;
The fusion module is used for fusing the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time corresponding relation before adjustment to obtain a third phoneme sequence;
The decoding module is used for decoding the third phoneme sequence based on the decoder to obtain synthetic voice, the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text.
A third aspect of the present invention provides a controllable text-based speech synthesis system, the controllable text-based speech synthesis system comprising at least:
The prompt layer comprises an external speaker embedding layer and a decoupling layer, and is used for receiving input voice and input text, wherein the external speaker embedding layer is used for receiving the input voice to acquire voice content to be synthesized, and a first phoneme sequence is formed by utilizing the voice content to be synthesized; the decoupling layer is used for respectively identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
the style adjustment layer is connected with the prompt layer, receives output signals of the external speaker embedding layer and the decoupling layer, realizes the voice style, emotion and language adjustment of input voice, and the output signals of the style adjustment layer are phoneme sequences, wherein the outputted phoneme sequence styles, emotion and language are matched with the input text;
The decoding layer is connected with the style adjustment layer and used for decoding the phoneme sequence output by the style adjustment layer to obtain synthesized voice, the voice content of the synthesized voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthesized voice correspond to the input text;
The prompt layer, the style adjustment layer and the decoding layer are independently trained.
According to the controllable text-based speech synthesis method, device and system provided by the invention, the voice style adjustment module is improved, and style, language, emotion and other kinds of characteristics are extracted from the input text in a decoupled manner, so that speech in a specified language is generated directly without sample training for that language, i.e. zero-shot speech generation. Secondly, from an auditory point of view, the style of speech is determined jointly by the voice style and the emotion; the information defined in the input text is understood through decoupling, the phoneme sequence is modified independently along each dimension, and the modified results are finally fused, thereby achieving decoupled adjustment and fusion of the adjustment effects, avoiding the poor controllability and strong linkage of sound-effect changes that arise when the adjustment is performed as a whole, and improving the quality of the finally synthesized speech. In addition, the model does not need to be trained on a voice-to-voice sample set; by working directly in a text-to-speech and sequence-to-speech manner, the dependence on such samples is reduced and the applicability of the method is improved. The invention generates speech consistent with the style specified by the input text, which helps capture the different attributes of the target speech more comprehensively, improves the diversity and flexibility of speech synthesis, has a wider application range, and improves the controllability of the target speech and the flexibility of switching styles and language types.
Drawings
FIG. 1 is a flowchart of a first embodiment of a controllable text-based speech synthesis method provided by the present application;
fig. 2 is a schematic structural diagram of a first embodiment of a controllable text-based speech synthesis apparatus according to the present application;
fig. 3 is a schematic structural diagram of a first embodiment of a speech synthesis system based on controllable text according to the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used herein to describe various information, this information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the application. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to a determination".
The application provides a voice synthesis method, device and system based on controllable text, which are used for improving the practical applicability of style conversion technology, improving the quality and naturalness of synthesized voice and providing a more visual and flexible voice customizing mode for users.
The controllable text-based speech synthesis method, device and system acquire the voice content to be synthesized, and form a first phoneme sequence based on an external speaker embedding module; identify semantic information of an input text, and respectively acquire a decoupled voice style, emotion type and language type; generate a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type; extract style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time correspondence; adjust the time distribution characteristics of the second phoneme sequence based on the emotion type; adjust the style characteristics of the second phoneme sequence based on the voice style; fuse the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time correspondence before adjustment to obtain a third phoneme sequence; and decode the third phoneme sequence with a decoder to obtain the synthesized speech, wherein the voice content of the synthesized speech is the same as the voice content to be synthesized, and the style, emotion and language of the synthesized speech correspond to the input text.
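The following Python sketch summarizes this flow under the assumption that each step is implemented as a callable module; all names (speaker_embedder, text_decoupler, converter, feature_extractor, adjuster, fuser, decoder) are hypothetical placeholders for the components described below, not an actual implementation of the invention.

```python
# Minimal, illustrative pipeline sketch; every callable here is a hypothetical
# placeholder for a module described in this embodiment.
def synthesize(input_speech, input_text, speaker_embedder, text_decoupler,
               converter, feature_extractor, adjuster, fuser, decoder):
    # 1. Serialize the input speech into the first phoneme sequence.
    first_phonemes = speaker_embedder(input_speech)

    # 2. Decouple voice style, emotion type and language type from the text prompt.
    style, emotion, language = text_decoupler(input_text)

    # 3. Convert to the requested language to obtain the second phoneme sequence.
    second_phonemes = converter(first_phonemes, target_language=language)

    # 4. Extract time-aligned style features and time distribution features.
    style_feats, time_feats = feature_extractor(second_phonemes)

    # 5. Adjust each feature stream independently (decoupled adjustment).
    time_feats_adj = adjuster.adjust_time(time_feats, emotion)
    style_feats_adj = adjuster.adjust_style(style_feats, style)

    # 6. Fuse using the pre-adjustment time correspondence -> third phoneme sequence.
    third_phonemes = fuser(time_feats_adj, style_feats_adj)

    # 7. Decode the third phoneme sequence into the synthesized waveform.
    return decoder(third_phonemes)
```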
Fig. 1 is a flowchart of a first embodiment of a controllable text-based speech synthesis method provided by the present application. Referring to fig. 1, the method provided in this embodiment specifically includes:
The speech content to be synthesized is obtained, and a first phoneme sequence is formed based on an external speaker embedding module.
The user provides an input speech and an input text. The speech expresses the content of the speech to be synthesized, e.g. "I am very happy today"; this content remains unchanged, while the overall style of the speech is changed as defined by the input text. The input text defines the overall style of the speech, e.g. "I really want to hear fluent Cantonese". The input speech is serialized using the external speaker embedding module to obtain the content of the speech to be synthesized; in the next step, this content is subjected to a style change so as to conform to the constraints of the input text.
And identifying semantic information of the input text, and respectively acquiring decoupled voice styles, emotion types and language types.
The input text typically does not express style, language type and emotion directly; semantic understanding is required first, and the desired target information is then extracted from the semantics. When changing the overall style, the goal is to change three aspects of the phonemes: the voice style, the emotion type and the language type. The voice style at least includes pitch, timbre and loudness, and may also include other factors that affect the overall auditory effect of the speech, without being limited thereto. The emotion type is the emotion expressed by the input text, which may include positive, negative and neutral, and may also include other types such as, but not limited to, sadness, low spirits and happiness. The language type can be any of various languages, such as Mandarin, English, Cantonese and the like.
As a preferred embodiment, identifying semantic information of an input text, and respectively obtaining a decoupled speech style, emotion type and language type, specifically including:
Acquiring an input text; inputting the input text to a semantic understanding module, and positioning a style entity, a emotion entity and a language type entity, wherein the style entity, the emotion entity and the language type entity are not completely identical; the style entity, the emotion entity and the language type entity are respectively input into a style identification module, an emotion classification module and a language classification module; and the style recognition module, the emotion classification module and the language classification module respectively output recognition results based on the global semantics of the input text, and acquire decoupled voice styles, emotion types and language types corresponding to the input text.
As an alternative embodiment, a Transformer encoder module may be used to recognize the semantics. Specifically, a pre-trained BERT is used to extract the semantic features of the input, and an adapter layer may be used to convert the semantic features into a prompt embedding of a preset standard size in the embedding space. It should be noted that BERT contains multiple layers, and different layers can learn semantic information of the input text sample at different levels. In this embodiment, the BERT layers are first trained on speech mel-spectrograms to obtain a first parameter set; the parameters of part of the BERT layers are then fixed, and the remaining parameters are trained with the input text, which maintains the stability of the speech synthesis model. These components are trained with speech samples annotated with style features and text samples annotated with semantic information, so that the model learns how to extract the semantic information and the style information from text and combine them for speech generation. As an alternative embodiment, the semantic understanding capability of the model may also be obtained through training on text alone.
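A minimal sketch of such a prompt encoder is given below, assuming the Hugging Face transformers library; the checkpoint name, adapter dimension and number of frozen layers are illustrative assumptions, not values prescribed by this embodiment.

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class PromptEncoder(nn.Module):
    """Extracts semantic features from the style prompt text and maps them to a
    fixed-size prompt embedding through an adapter layer (illustrative sketch)."""
    def __init__(self, prompt_dim=256, n_frozen_layers=8):
        super().__init__()
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        # Freeze the lower BERT layers (parameters kept from the first training
        # stage); only the remaining layers are fine-tuned on text.
        for layer in self.bert.encoder.layer[:n_frozen_layers]:
            for p in layer.parameters():
                p.requires_grad = False
        self.adapter = nn.Linear(self.bert.config.hidden_size, prompt_dim)

    def forward(self, text):
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        hidden = self.bert(**inputs).last_hidden_state   # (1, seq_len, hidden)
        # Use the [CLS] position as a global semantic summary of the prompt.
        return self.adapter(hidden[:, 0])                # (1, prompt_dim)
```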
It should be noted that the style and expressiveness of the generated speech can be guided by the input text: the style prompt text expresses the user's desired speech style in natural language. It may be a short natural-language description covering the user's expectations regarding intonation, speaking rate, emotion and so on, such as "a pleasant tone" or "a formal intonation". In this way, the user can control the style of the generated speech in a more intuitive and flexible manner. By simultaneously considering the language types of the style prompt text and of the target speech, the controllable text-based speech synthesis model can be guided more comprehensively to produce speech output that meets the user's expectations.
A piece of text contains many entities, but only some of them relate to style, emotion and language type, and the entities corresponding to these three aspects are not identical. For example, for "I really want to hear fluent, authentic Cantonese" there are many entities ("I", "really", "want", "hear", "fluent", "authentic", "Cantonese"), but only "Cantonese" relates to the language type, and "really want" corresponds to an eager emotion. It is therefore necessary to first extract the entity corresponding to each kind of change in order to understand what the target of the style change is.
For style entities, when there are several of them, the style recognition module determines the relationships among the plurality of style entities according to the global semantics of the input text; it then obtains embedded representations of the style entities and outputs the recognized voice style based on the embedded representations and the relationships between the entities. The style entity may also be a single one, in which case a unique style entity is determined; when there are a plurality of style entities, both the entities and the relationships between them are determined so that they can be understood accurately.
For emotion entities, the emotion classification module identifies the global emotion of the input text and the local emotion of the emotion entity, corrects the global emotion based on the local emotion, and then classifies and identifies the corrected local emotion to obtain the emotion type. Emotion entities are individual words with a strong emotional polarity, such as "really want", and a local emotion can be obtained from the understanding of such an entity. Furthermore, in order to further improve the accuracy of emotion understanding, the invention corrects the local emotion based on the global emotion understood from the whole input text. Specifically, the global emotion is obtained from the full-text semantic understanding of the input text, and the direction of the local understanding is adjusted according to the positive or negative direction of the global emotion; for example, global: positive, local: negative; after correction: positive.
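A minimal sketch of this polarity-correction rule follows; the label set ("positive" / "negative" / "neutral") and the exact rule are illustrative assumptions.

```python
# Illustrative sketch of correcting the local (entity-level) emotion polarity
# with the global polarity of the whole prompt text.
def correct_local_emotion(global_polarity: str, local_polarity: str) -> str:
    """If the local polarity contradicts a non-neutral global polarity,
    the global direction wins; otherwise the local polarity is kept."""
    if global_polarity != "neutral" and local_polarity != global_polarity:
        return global_polarity
    return local_polarity

# Example from the text: global positive, local negative -> corrected positive.
print(correct_local_emotion("positive", "negative"))
```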
For language type entities, the language classification module obtains the embedded representation of the language type entity and identifies this embedded representation to obtain the classification result of the language type. The language type entity is typically a noun, such as "Cantonese" or "English", and the target language type is obtained through the embedded representation and recognition of such an entity.
According to the method provided by the invention, constraining style adjustment in several aspects through the input text improves the controllability of the synthesized speech; the decoupled expression of each adjustment characteristic avoids mutual interference among the adjustment targets and improves the accuracy of the adjustment; and accurate understanding and correction of the entities improve the controllability of the adjustment. The style adjustment process is thus optimized from both the control target and the control manner, improving the final adjustment effect.
A second phoneme sequence is generated by converting the first phoneme sequence with the converter according to the language type. The converter is trained in advance on samples and performs the language translation process, i.e. speech in language A is converted into speech in language B; at this point the overall style and emotion of the speech are not changed, only the language type. The language type of the second phoneme sequence is the same as the language type defined by the input text, and the voice content represented by the second phoneme sequence is the same as the voice content to be synthesized.
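The following is an illustrative sketch of a phoneme-level language converter, assumed here to be a generic encoder-decoder Transformer conditioned on the target language; the architecture, vocabulary size and conditioning scheme are assumptions, not the converter actually prescribed by this embodiment.

```python
import torch.nn as nn

class PhonemeLanguageConverter(nn.Module):
    """Illustrative seq2seq converter: maps a phoneme-ID sequence of language A
    to a phoneme-ID sequence of language B, conditioned on the target language."""
    def __init__(self, vocab_size=512, n_languages=8, d_model=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(vocab_size, d_model)
        self.lang_emb = nn.Embedding(n_languages, d_model)
        self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_phonemes, tgt_phonemes, target_language):
        # Add the target-language embedding to every source position so that
        # the decoder produces phonemes of the requested language.
        src = self.phoneme_emb(src_phonemes) + self.lang_emb(target_language).unsqueeze(1)
        tgt = self.phoneme_emb(tgt_phonemes)
        hidden = self.transformer(src, tgt)
        return self.out(hidden)   # logits over the target phoneme vocabulary
```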
And extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation.
The generated phoneme sequence has a duration. As an alternative embodiment, the phoneme duration distribution is predicted from the input speech using a stochastic duration predictor layer, which predicts the duration of the resulting synthesized speech.
In order to accurately adjust information of different dimensions and obtain a better adjustment effect, the characteristics of the phoneme sequence along different dimensions are extracted separately, so that a single dimension can be adjusted accurately and without interference.
As an alternative embodiment, extracting the style characteristics and the time distribution characteristics of the second phoneme sequence includes: generating the time distribution characteristics according to whether a phoneme exists at each time point in the time period of the second phoneme sequence, the time distribution characteristics being used to represent whether a phoneme exists at each time point; and extracting the style characteristics of the second phoneme sequence at each time point in its time period, the style characteristics at each time point at least comprising pitch, timbre and loudness.
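A minimal sketch of this two-stream feature extraction follows, assuming an illustrative per-time-point data layout (a list with one entry per time point, None where no phoneme is present); the layout and feature names are assumptions for illustration only.

```python
import numpy as np

def extract_features(phoneme_frames):
    """Split a per-time-point phoneme representation into a time distribution
    stream and a style-feature stream (illustrative sketch).

    Returns:
        time_dist : (T,) 0/1 array  -- whether a phoneme exists at each time point
        style     : (T, 3) array    -- pitch, timbre and loudness per time point
    """
    T = len(phoneme_frames)
    time_dist = np.zeros(T, dtype=np.int8)
    style = np.zeros((T, 3), dtype=np.float32)
    for t, frame in enumerate(phoneme_frames):
        if frame is not None:
            time_dist[t] = 1
            style[t] = (frame["pitch"], frame["timbre"], frame["loudness"])
    return time_dist, style
```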
Adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type; and adjusting the style characteristics of the second phoneme sequence based on the voice style.
And fusing the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time corresponding relation before adjustment to obtain a third phoneme sequence. The language, emotion and voice style of the third phoneme sequence are the same as the language type, emotion type and voice style corresponding to the input text, and the voice content represented by the third phoneme sequence is the same as the voice content to be synthesized.
After the adjustments along each single dimension, the individually adjusted dimensions need to be fused accurately. Preferably, the fusion process specifically includes: determining the phoneme time distribution of the third phoneme sequence by taking the adjusted time distribution characteristics as the reference, the phoneme time distribution indicating whether a phoneme exists at each time point in the third phoneme sequence; and matching the style characteristics of each time point in the adjusted voice style characteristics according to the phoneme time distribution of the third phoneme sequence and the time correspondence before adjustment, to generate the phoneme at each time point in the third phoneme sequence.
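A minimal sketch of the fusion step, assuming the adjusted time distribution, the adjusted style features and an explicit pre-adjustment time-correspondence map are available as arrays; the data layout is an illustrative assumption.

```python
import numpy as np

def fuse(time_dist_adj, style_adj, correspondence):
    """Fuse the adjusted feature streams into the third phoneme sequence (sketch).

    time_dist_adj  : (T_new,) 0/1 array, adjusted time distribution (the reference)
    style_adj      : (T_old, D) array, adjusted per-time-point style features
    correspondence : (T_new,) int array mapping each new time point to the
                     original time point it derives from (pre-adjustment mapping)
    """
    T_new, D = len(time_dist_adj), style_adj.shape[1]
    third = np.zeros((T_new, D), dtype=style_adj.dtype)
    for t in range(T_new):
        if time_dist_adj[t]:                          # a phoneme exists at this point
            third[t] = style_adj[correspondence[t]]   # pull the matching style features
    return third
```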
And decoding the third phoneme sequence based on a decoder to obtain a synthetic voice, wherein the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text.
The decoder may be a vocoder for generating synthesized speech based on phonemes, and in particular, the vocoder may convert the sequence of phonemes into a final audio waveform using HiFi-GAN.
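As an illustration only, the final decoding step could look like the following sketch, assuming a pre-trained HiFi-GAN-style generator is available as a torch module that maps a batch of acoustic frames to a waveform; how such a checkpoint is obtained and the exact frame format are assumptions outside this sketch.

```python
import torch

def decode_to_waveform(acoustic_frames, hifigan_generator, sample_rate=22050):
    """Run an (assumed pre-trained) HiFi-GAN-style generator over acoustic frames
    to obtain the final audio waveform (illustrative sketch)."""
    with torch.no_grad():
        frames = torch.as_tensor(acoustic_frames, dtype=torch.float32)
        if frames.dim() == 2:            # (n_feats, T) -> add a batch dimension
            frames = frames.unsqueeze(0)
        waveform = hifigan_generator(frames)   # (1, 1, n_samples)
    return waveform.squeeze().cpu().numpy(), sample_rate
```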
Overall, the method combines a Transformer speech encoder to extract information in multiple dimensions, a flow-based decoder to change and adjust the style, and a HiFi-GAN vocoder for audio synthesis, so that the whole text-to-speech synthesis module can generate high-quality speech.
It should be noted that the different attributes of the synthesized speech may include tone of voice and speaking rate, emotional color, the speaker's personality and so on; for example, in terms of the speaker's personality, the synthesized speech may be humorous or formal. It should also be noted that the content of the finally obtained synthesized speech is derived from the content of the input speech, while its style is derived from the style expressed by the input text, and the language type likewise comes from the input text; that is, the input text serves as the prompt information for speech synthesis.
Fig. 2 is a schematic structural diagram of a first embodiment of a controllable text-based speech synthesis apparatus according to the present application. Referring to fig. 2, the apparatus at least includes:
The voice acquisition module is used for acquiring voice content to be synthesized and forming a first phoneme sequence based on the external speaker embedding module;
the decoupling module is used for identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
A language conversion module for generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type;
The extraction module is used for extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation;
The adjusting module is used for adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type; the adjusting module is further used for adjusting style characteristics of the second phoneme sequence based on the voice style;
The fusion module is used for fusing the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time corresponding relation before adjustment to obtain a third phoneme sequence;
The decoding module is used for decoding the third phoneme sequence based on the decoder to obtain synthetic voice, the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text.
Fig. 3 is a schematic structural diagram of a first embodiment of a controllable text-based speech synthesis system according to the present application, please refer to fig. 3, wherein the controllable text-based speech synthesis system at least includes:
The prompt layer comprises an external speaker embedding layer and a decoupling layer, and is used for receiving input voice and input text, wherein the external speaker embedding layer is used for receiving the input voice to acquire voice content to be synthesized, and a first phoneme sequence is formed by utilizing the voice content to be synthesized; the decoupling layer is used for respectively identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
the style adjustment layer is connected with the prompt layer, receives output signals of the external speaker embedding layer and the decoupling layer, realizes the voice style, emotion and language adjustment of input voice, and the output signals of the style adjustment layer are phoneme sequences, wherein the outputted phoneme sequence styles, emotion and language are matched with the input text;
The decoding layer is connected with the style adjustment layer and used for decoding the phoneme sequence output by the style adjustment layer to obtain synthesized voice, the voice content of the synthesized voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthesized voice correspond to the input text; the prompt layer, the style adjustment layer and the decoding layer are independently trained.
As an alternative embodiment, each module in each layer is also trained independently. After independent training, the parameters remain unchanged, so that the speech synthesis system retains its previously learned capabilities during subsequent training. A single training sample contains a plurality of data items, specifically: the speech input by the user, the input text, the corresponding style, language type and emotion entities, the recognition results, and the synthesized speech.
The external speaker embedding module extracts the external speaker embedding of the input speech, and the prompt layer, the style adjustment layer and the decoding layer are conditioned on this external speaker embedding as a global condition. For the external speaker embedding module, the final loss function is expressed using a Speaker Consistency Loss (SCL). For example, let y(·) be the function that outputs a speaker embedding, cos_sim a cosine similarity function, α a positive real number controlling the influence of the SCL in the final loss, and n the batch size; the SCL then takes the form

SCL = -(α / n) · Σ_{i=1..n} cos_sim(y(g_i), y(h_i)),

where g and h represent the reference real audio and the synthesized audio, respectively.
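A minimal sketch of this loss in PyTorch; the speaker encoder is assumed to be any module y(·) returning one embedding per audio clip, and the default α value is illustrative.

```python
import torch.nn.functional as F

def speaker_consistency_loss(speaker_encoder, real_audio, synth_audio, alpha=9.0):
    """Speaker Consistency Loss sketch: -(alpha / n) * sum_i cos_sim(y(g_i), y(h_i)).

    speaker_encoder : y(.), maps a batch of audio (n, samples) to embeddings (n, d)
    real_audio      : g, reference ground-truth audio batch
    synth_audio     : h, synthesized audio batch
    alpha           : positive weight of the SCL term (value is illustrative)
    """
    emb_real = speaker_encoder(real_audio)      # (n, d)
    emb_synth = speaker_encoder(synth_audio)    # (n, d)
    cos = F.cosine_similarity(emb_real, emb_synth, dim=-1)   # (n,)
    return -alpha * cos.mean()                  # mean over the batch = (1/n) * sum
```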
Corresponding to the embodiment of the controllable text-based speech synthesis method, the application also provides an embodiment of the controllable text-based speech synthesis device.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference is made to the description of the method embodiments for the relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purposes of the solution of the application. Those of ordinary skill in the art can understand and implement it without undue burden.
The foregoing descriptions are merely preferred embodiments of the application and are not intended to limit the application; any modification, equivalent replacement, improvement or the like made within the spirit and principles of the application shall fall within the scope of protection of the application.

Claims (8)

1. A method for controllable text-based speech synthesis, the method comprising:
acquiring voice content to be synthesized, and forming a first phoneme sequence based on an external speaker embedding module;
Identifying semantic information of an input text, and respectively acquiring decoupled voice styles, emotion types and language types;
Generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type;
Extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation;
adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type;
Adjusting style characteristics of the second phoneme sequence based on the speech style;
Based on the time corresponding relation before adjustment, fusing the adjusted time distribution characteristics and the adjusted voice style characteristics to obtain a third phoneme sequence;
Decoding a third phoneme sequence based on a decoder to obtain a synthetic voice, wherein the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text;
Extracting style characteristics and time distribution characteristics of the second phoneme sequence comprises the following steps:
Generating time distribution characteristics according to the existence of the phonemes at each time point in the time period of the second phoneme sequence, wherein the time distribution characteristics are used for representing whether the phonemes exist at each time point;
extracting style characteristics of the second phoneme sequence at each time point in the time period of the second phoneme sequence, wherein the style characteristics at each time point at least comprise pitch, timbre and loudness;
the step of obtaining a third phoneme sequence based on the adjusted time distribution characteristic and the adjusted voice style characteristic which are fused based on the time corresponding relation before adjustment specifically comprises the following steps:
determining a phoneme time distribution of the third phoneme sequence by taking the adjusted time distribution characteristic as a reference, wherein the phoneme time distribution is used for indicating whether phonemes exist at each time point in the third phoneme sequence;
And matching style characteristics of each time point in the adjusted voice style characteristics according to the phoneme time distribution of the third phoneme sequence and the time corresponding relation before adjustment, and generating phonemes of each time point in the third phoneme sequence.
2. The method of claim 1, wherein the identifying semantic information of the input text, respectively obtaining a decoupled speech style, emotion type, and language type, comprises:
Acquiring an input text;
inputting the input text into a semantic understanding module, and locating a style entity, an emotion entity and a language type entity, wherein the style entity, the emotion entity and the language type entity are not completely identical;
the style entity, the emotion entity and the language type entity are respectively input into a style identification module, an emotion classification module and a language classification module;
and the style recognition module, the emotion classification module and the language classification module respectively output recognition results based on the global semantics of the input text, and acquire decoupled voice styles, emotion types and language types corresponding to the input text.
3. The method of claim 2, wherein there are a plurality of style entities, and the step in which the style recognition module, the emotion classification module and the language classification module respectively output recognition results based on the global semantics of the input text specifically comprises:
The style identification module determines the relation among a plurality of style entities according to the global semantics of the input text;
The style identification module obtains embedded representations of the plurality of style entities and outputs an identified speech style based on relationships between the embedded representations and the plurality of style entities.
4. The method according to claim 2, wherein the style recognition module, the emotion classification module and the language classification module output recognition results based on global semantics of the input text, respectively, specifically comprising:
The emotion classification module identifies global emotion of the input text and local emotion of the emotion entity;
correcting the global emotion based on the local emotion;
And the emotion classification module classifies and identifies the corrected local emotion to obtain the emotion type.
5. The method according to claim 2, wherein the style recognition module, the emotion classification module and the language classification module output recognition results based on global semantics of the input text, respectively, specifically comprising:
the language classification module obtains the embedded representation of the language type entity;
and identifying the embedded representation of the language type entity to obtain a classification result of the language type.
6. The method of claim 1, wherein
The language type of the second phoneme sequence is the same as the language type defined by the input text, and the voice content represented by the second phoneme sequence is the same as the voice content to be synthesized;
the language, emotion and voice style of the third phoneme sequence are the same as the language type, emotion type and voice style corresponding to the input text, and the voice content represented by the third phoneme sequence is the same as the voice content to be synthesized.
7. A controllable text-based speech synthesis apparatus for performing the method of any of claims 1-6, the controllable text-based speech synthesis apparatus comprising:
The voice acquisition module is used for acquiring voice content to be synthesized and forming a first phoneme sequence based on the external speaker embedding module;
the decoupling module is used for identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
A language conversion module for generating a second phoneme sequence by converting the first phoneme sequence with a converter according to the language type;
The extraction module is used for extracting style characteristics and time distribution characteristics of the second phoneme sequence, wherein the style characteristics and the time distribution characteristics have a time corresponding relation;
The adjusting module is used for adjusting the time distribution characteristics of the second phoneme sequence based on the emotion type; the adjusting module is further used for adjusting style characteristics of the second phoneme sequence based on the voice style;
The fusion module is used for fusing the adjusted time distribution characteristics and the adjusted voice style characteristics based on the time corresponding relation before adjustment to obtain a third phoneme sequence;
The decoding module is used for decoding the third phoneme sequence based on the decoder to obtain synthetic voice, the voice content of the synthetic voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthetic voice correspond to the input text.
8. A controllable text-based speech synthesis system for performing the method of any of claims 1-6, the controllable text-based speech synthesis system comprising at least:
The prompt layer comprises an external speaker embedding layer and a decoupling layer, and is used for receiving input voice and input text, wherein the external speaker embedding layer is used for receiving the input voice to acquire voice content to be synthesized, and a first phoneme sequence is formed by utilizing the voice content to be synthesized; the decoupling layer is used for respectively identifying semantic information of an input text and respectively acquiring a decoupled voice style, emotion type and language type;
the style adjustment layer is connected with the prompt layer, receives output signals of the external speaker embedding layer and the decoupling layer, realizes the voice style, emotion and language adjustment of input voice, and the output signals of the style adjustment layer are phoneme sequences, wherein the outputted phoneme sequence styles, emotion and language are matched with the input text;
The decoding layer is connected with the style adjustment layer and used for decoding the phoneme sequence output by the style adjustment layer to obtain synthesized voice, the voice content of the synthesized voice is the same as the voice content to be synthesized, and the style, emotion and language of the synthesized voice correspond to the input text;
The prompt layer, the style adjustment layer and the decoding layer are independently trained.
CN202410250738.7A 2024-03-05 2024-03-05 Speech synthesis method, device and system based on controllable text Active CN117854478B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410250738.7A CN117854478B (en) 2024-03-05 2024-03-05 Speech synthesis method, device and system based on controllable text

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410250738.7A CN117854478B (en) 2024-03-05 2024-03-05 Speech synthesis method, device and system based on controllable text

Publications (2)

Publication Number Publication Date
CN117854478A (en) 2024-04-09
CN117854478B (en) 2024-05-03

Family

ID=90538593

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410250738.7A Active CN117854478B (en) 2024-03-05 2024-03-05 Speech synthesis method, device and system based on controllable text

Country Status (1)

Country Link
CN (1) CN117854478B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003302992A (en) * 2002-04-11 2003-10-24 Canon Inc Method and device for synthesizing voice
CN113327572A (en) * 2021-06-02 2021-08-31 清华大学深圳国际研究生院 Controllable emotion voice synthesis method and system based on emotion category label
CN113707125A (en) * 2021-08-30 2021-11-26 中国科学院声学研究所 Training method and device for multi-language voice synthesis model
CN114495956A (en) * 2022-02-08 2022-05-13 北京百度网讯科技有限公司 Voice processing method, device, equipment and storage medium
CN115620699A (en) * 2022-12-19 2023-01-17 深圳元象信息科技有限公司 Speech synthesis method, speech synthesis system, speech synthesis apparatus, and storage medium
CN116312463A (en) * 2023-03-15 2023-06-23 平安科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic device, and storage medium


Also Published As

Publication number Publication date
CN117854478A (en) 2024-04-09

Similar Documents

Publication Publication Date Title
EP3994683B1 (en) Multilingual neural text-to-speech synthesis
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
JP2022107032A (en) Text-to-speech synthesis method using machine learning, device and computer-readable storage medium
US9368104B2 (en) System and method for synthesizing human speech using multiple speakers and context
US20220013106A1 (en) Multi-speaker neural text-to-speech synthesis
JP4246790B2 (en) Speech synthesizer
CN112863483A (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
KR20210086974A (en) Cross-lingual voice conversion system and method
CN116092472A (en) Speech synthesis method and synthesis system
CN113470622B (en) Conversion method and device capable of converting any voice into multiple voices
Chen et al. Polyglot speech synthesis based on cross-lingual frame selection using auditory and articulatory features
CN117854478B (en) Speech synthesis method, device and system based on controllable text
CN116312471A (en) Voice migration and voice interaction method and device, electronic equipment and storage medium
CN116129868A (en) Method and system for generating structured photo
CN113539236B (en) Speech synthesis method and device
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
Aso et al. Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre
KR101920653B1 (en) Method and program for edcating language by making comparison sound
WO2021231050A1 (en) Automatic audio content generation
CN112863476A (en) Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing
CN114464151B (en) Sound repairing method and device
CN112992118B (en) Speech model training and synthesizing method with few linguistic data
CN112002302B (en) Speech synthesis method and device
Wu et al. Synthesis of spontaneous speech with syllable contraction using state-based context-dependent voice transformation
Yoon et al. Enhancing Multilingual TTS with Voice Conversion Based Data Augmentation and Posterior Embedding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant