CN116978353A - Speech synthesis method, apparatus, electronic device, storage medium, and program product - Google Patents

Speech synthesis method, apparatus, electronic device, storage medium, and program product

Info

Publication number
CN116978353A
Authority
CN
China
Prior art keywords
text
sample
pronunciation
network
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310274253.7A
Other languages
Chinese (zh)
Inventor
邹双圆
方鹏
刘恺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202310274253.7A
Publication of CN116978353A

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/02: Methods for producing synthetic speech; Speech synthesisers

Abstract

The application discloses a speech synthesis method, an apparatus, an electronic device, a storage medium, and a program product. A target text requiring speech synthesis is obtained and converted into a corresponding phoneme sequence; text semantic features of the target text and the corresponding context semantic features are then obtained, and the pronunciation prosody of the target text during speech synthesis is determined based on the obtained text semantic features, the context semantic features, and the converted phoneme sequence. In addition, the speech features of the target text, which indicate the pronunciation characteristics of the speaker of the speech synthesis, are determined according to the text semantic features of the target text. Finally, speech synthesis is performed according to the phoneme sequence, the speech features, and the determined pronunciation prosody to obtain the pronunciation audio of the target text. The application enables the synthesized pronunciation audio to express the semantics of the target text accurately and to simulate the pronunciation characteristics of a speaker, thereby improving speech synthesis quality.

Description

Speech synthesis method, apparatus, electronic device, storage medium, and program product
Technical Field
The present application relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, storage medium, and program product.
Background
Speech synthesis, also known as Text-to-Speech (TTS) technology, aims to convert internally generated or externally input text information into audible speech information, which is equivalent to equipping a device with an artificial mouth so that it can speak like a person. For example, the text-reading function provided by browser applications on electronic devices such as mobile phones and computers is based on speech synthesis technology and can convert text information in a webpage into audible speech information to be played.
In the related art, the characters/words in a text are usually converted one by one into corresponding speech audio, and the audio of all the characters/words is then combined to obtain the complete speech audio of the text. However, this approach results in poor audio quality of the synthesized speech.
Disclosure of Invention
Embodiments of the present application provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for speech synthesis, which can improve speech synthesis quality.
In a first aspect, the present application provides a speech synthesis method, including:
Acquiring a target text which needs to be subjected to voice synthesis, and converting the target text into a corresponding phoneme sequence;
acquiring text semantic features of a target text and acquiring context semantic features of a context corresponding to the target text;
determining the pronunciation rhythm of the target text according to the phoneme sequence, the text semantic features and the context semantic features;
determining the voice characteristics of the target text according to the text semantic characteristics, wherein the voice characteristics are used for indicating the pronunciation characteristics of a speaker in voice synthesis;
and performing voice synthesis according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain the pronunciation audio of the target text.
In a second aspect, the present application provides a speech synthesis apparatus, comprising:
the text acquisition module is used for acquiring a target text which needs to be subjected to voice synthesis and converting the target text into a corresponding phoneme sequence;
the semantic acquisition module is used for acquiring text semantic features of the target text and acquiring context semantic features of contexts corresponding to the target text;
the prosody determining module is used for determining the pronunciation prosody of the target text according to the phoneme sequence, the text semantic features and the context semantic features;
the voice determining module is used for determining the voice characteristics of the target text according to the text semantic characteristics, wherein the voice characteristics are used for indicating the pronunciation characteristics of a speaker in voice synthesis;
And the voice synthesis module is used for performing voice synthesis according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain the pronunciation audio of the target text.
In an alternative embodiment, the prosody determining module is configured to obtain a speech synthesis model, where the speech synthesis model includes a prosody prediction network, a feature extraction network, a length adjustment network, and a speech synthesis network; inputting the phoneme sequence, the text semantic features and the context semantic features into a prosody prediction network to perform prosody prediction, and determining the pronunciation prosody of the target text;
the speech synthesis module is used for inputting the phoneme sequence into the feature extraction network to perform feature extraction to obtain a first phoneme feature of the phoneme sequence; inputting the first phoneme feature and the speech features into the length adjustment network for duration prediction, and adjusting the length of the first phoneme feature according to the predicted pronunciation duration to obtain a first adjusted phoneme feature; and inputting the pronunciation prosody, the speech features and the first adjusted phoneme feature into the speech synthesis network to perform speech synthesis, so as to obtain the pronunciation audio of the target text.
In an alternative embodiment, the prosody prediction network includes a first attention sub-network, a second attention sub-network, a feature extraction sub-network, a feature fusion sub-network, and a prosody prediction sub-network, and the prosody determining module is configured to input the phoneme sequence into the feature extraction sub-network to perform feature extraction, so as to obtain a second phoneme feature of the phoneme sequence; input the second phoneme feature and the text semantic features into the first attention sub-network for attention enhancement to obtain a first enhancement feature of the second phoneme feature; input the second phoneme feature and the context semantic features into the second attention sub-network for attention enhancement to obtain a second enhancement feature of the second phoneme feature; input the second phoneme feature, the first enhancement feature and the second enhancement feature into the feature fusion sub-network to perform feature fusion to obtain a fusion feature; and input the fusion feature into the prosody prediction sub-network to perform prosody prediction, determining the pronunciation prosody of the target text.
In an alternative embodiment, the prosody determining module is configured to input the fusion feature into the prosody prediction sub-network, perform prosody prediction in the audio frame dimension through the prosody prediction sub-network, and determine the prosody of the target text in the audio frame dimension.
In an optional embodiment, the speech synthesis model further includes a speech feature prediction network, and the speech determination module is configured to input the text semantic features into the speech feature prediction network to predict the speech features, so as to obtain speech features corresponding to the text semantic features.
In an alternative embodiment, the speech synthesis apparatus further comprises a model training module for constructing a prosody encoding network and a speech feature encoding network based on a reference encoder; acquiring sample audio, and the sample text and sample pronunciation duration corresponding to the sample audio; inputting the sample audio into the prosody encoding network to perform prosody encoding to obtain a sample pronunciation prosody of the sample audio, and inputting the sample audio into the speech feature encoding network to perform speech feature encoding to obtain sample speech features of the sample audio; acquiring sample text semantic features of the sample text and sample context semantic features of the context corresponding to the sample text; inputting the sample text semantic features into the speech feature prediction network to predict speech features, obtaining predicted sample speech features, and obtaining a first loss according to the predicted sample speech features and the sample speech features; converting the sample text into a corresponding sample phoneme sequence, inputting the sample phoneme sequence, the sample context semantic features and the sample text semantic features into the prosody prediction network for prosody prediction to obtain a predicted sample pronunciation prosody, and obtaining a second loss according to the predicted sample pronunciation prosody and the sample pronunciation prosody; inputting the sample phoneme sequence into the feature extraction network to perform feature extraction to obtain sample phoneme features of the sample phoneme sequence, inputting the sample phoneme features and the predicted sample speech features into the length adjustment network for duration prediction, adjusting the length of the sample phoneme features according to the predicted sample pronunciation duration to obtain sample adjusted phoneme features, and obtaining a third loss according to the predicted sample pronunciation duration and the sample pronunciation duration; inputting the predicted sample pronunciation prosody, the predicted sample speech features and the sample adjusted phoneme features into the speech synthesis network to perform speech synthesis to obtain sample pronunciation audio of the sample text, and obtaining a fourth loss according to the sample pronunciation audio and the sample audio; and training the speech synthesis model according to the first loss, the second loss, the third loss and the fourth loss until a preset training stop condition is met.
In an alternative embodiment, the semantic acquisition module is configured to acquire a plurality of context texts of the target text; splicing the target text and adjacent texts in the multiple context texts in pairs to obtain multiple text pairs; and extracting semantic features from the plurality of text pairs to obtain context semantic features.
In an alternative embodiment, the semantic acquisition module is configured to input the plurality of text pairs into a pre-trained semantic representation model for semantic feature extraction to obtain the context semantic features.
In an alternative embodiment, the text acquisition module is used for responding to the input interactive voice and generating corresponding response text as the target text;
the speech synthesis module is also used for outputting pronunciation audio of the target text.
In a third aspect, the present application provides an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program in the memory, to implement the steps in the speech synthesis method provided by the present application.
In a fourth aspect, the present application provides a computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor for implementing the steps in the speech synthesis method provided by the present application.
In a fifth aspect, the present application provides a computer program product comprising a computer program or instructions which, when executed by a processor, implement the steps in the speech synthesis method provided by the present application.
In the application, a target text requiring speech synthesis is first obtained and converted into a corresponding phoneme sequence; the text semantic features of the target text and the context semantic features corresponding to the target text are then obtained, and the pronunciation prosody of the target text during speech synthesis is determined based on the obtained text semantic features, the context semantic features and the converted phoneme sequence. In addition, the speech features of the target text are determined according to its text semantic features, the speech features being used to indicate the pronunciation characteristics of the speaker of the speech synthesis. Finally, speech synthesis is performed according to the phoneme sequence, the speech features and the determined pronunciation prosody to obtain the pronunciation audio of the target text. Compared with the related art, the application not only introduces pronunciation prosody into speech synthesis, so that the synthesized pronunciation audio has prosodic variation, but also determines the pronunciation prosody according to the phoneme sequence of the target text, its text semantic features and its context semantic features, so that the pronunciation prosody of the synthesized audio matches both the semantics of the target text itself and the semantics of its context, and the semantics of the target text can thus be expressed more accurately.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a schematic diagram of a speech synthesis system according to an embodiment of the present application;
FIG. 1b is a flow chart of a speech synthesis method according to an embodiment of the present application;
FIG. 1c is a schematic diagram of a speech synthesis model according to an embodiment of the present application;
FIG. 1d is a schematic diagram of a refined structure of the prosody prediction network of FIG. 1c;
FIG. 1e is a schematic diagram of another embodiment of a speech synthesis model according to the present application;
FIG. 1f is a schematic diagram of training a speech synthesis model in an embodiment of the present application;
FIG. 2 is a schematic flow chart of a speech synthesis method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of generating and outputting responsive speech based on a speech synthesis model in an embodiment of the application;
FIG. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
It should be noted that the principles of the present application are illustrated as implemented in a suitable computing environment. The following description is based on illustrative embodiments of the application and should not be taken as limiting other embodiments of the application not described in detail herein.
In the following description of the present application reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or a different subset of all possible embodiments and can be combined with each other without conflict.
In the following description of the present application, the terms "first", "second", and "third" are merely used to distinguish similar objects and do not represent a particular ordering of the objects; it is understood that "first", "second", and "third" may be interchanged in a particular order or sequence, where permitted, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
In order to be able to improve the quality of speech synthesis, embodiments of the present application provide a speech synthesis method, a speech synthesis apparatus, an electronic device, a computer-readable storage medium, and a computer program product. Wherein the speech synthesis method may be performed by a speech synthesis apparatus or by an electronic device integrated with the speech synthesis apparatus.
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
Referring to fig. 1a, the present application further provides a speech synthesis system, as shown in fig. 1a, where the speech synthesis system includes an electronic device 100, and the speech synthesis apparatus provided by the present application is integrated in the electronic device 100. For example, the electronic device 100 may obtain a text that needs to be synthesized by speech, record the text as a target text, convert the target text into a corresponding phoneme sequence, obtain text semantic features of the target text, obtain context semantic features of a context corresponding to the target text, further determine a pronunciation rhythm of the target text during speech synthesis according to the phoneme sequence, the text semantic features and the context semantic features of the target text, determine speech features of the target text according to the text semantic features, and finally perform speech synthesis according to the phoneme sequence of the target text, the determined pronunciation rhythm and the determined speech features, so as to obtain pronunciation audio of the target text.
The electronic device 100 may be any device having processing capability, such as a mobile electronic device such as a smart phone, tablet computer, palmtop computer, notebook computer or smart speaker, or a stationary electronic device such as a desktop computer, television, server or industrial device.
In addition, as shown in fig. 1a, the speech synthesis system may further include a memory 200 for storing original data, intermediate data, and result data in a speech synthesis process, for example, the electronic device 100 stores the acquired target text (original data) required for speech synthesis, a phoneme sequence obtained by converting the target text, text semantic features of the target text, and context semantic features of the corresponding context, a pronunciation prosody determined by the phoneme sequence and the context semantic features, a speech feature (intermediate data) determined by the text semantic features, and pronunciation audio (result data) of the target text obtained by speech synthesis in the memory 200.
It should be noted that, the schematic view of the speech synthesis system shown in fig. 1a is only an example, and the speech synthesis system and the scene described in the embodiment of the present application are for more clearly describing the technical solution of the embodiment of the present application, and do not constitute a limitation on the technical solution provided by the embodiment of the present application, and those skilled in the art can know that, with the evolution of the speech synthesis system and the appearance of the new service scenario, the technical solution provided by the embodiment of the present application is equally applicable to similar technical problems.
The following will describe in detail. The numbers of the following examples are not intended to limit the preferred order of the examples.
Referring to fig. 1b, fig. 1b is a schematic flow chart of a speech synthesis method according to an embodiment of the present application, and as shown in fig. 1b, the flow chart of the speech synthesis method according to the present application is as follows:
at 110, a target text for which speech synthesis is desired is obtained and converted into a corresponding phoneme sequence.
In this embodiment, the language type of the target text is not particularly limited; it may be text in any language requiring speech synthesis, depending on the actual speech synthesis requirement. For example, a Chinese target text such as "早上好" ("Good morning") requiring speech synthesis may be obtained, or an English target text such as "Good morning" may be obtained, and so on. The target text corresponds to a sentence text of the chapter in which it is located, where a chapter can be colloquially understood as one or more passages formed by a plurality of sentence texts.
After the target text to be subjected to speech synthesis is obtained, the obtained target text is further converted into a corresponding phoneme sequence. The manner in which the conversion from text to phoneme sequence is performed is not particularly limited, and a suitable conversion manner may be selected by those skilled in the art according to actual needs.
For example, a pronunciation dictionary includes mapping relations between characters/words and phonemes, and the characters/words in the target text can be converted into corresponding phonemes one by one according to the pronunciation dictionary, so that the phonemes converted from the characters/words form the phoneme sequence of the target text.
Illustratively, the target text to be synthesized is the English text "Please turn on the light", which is converted into a corresponding phoneme sequence, e.g., P L IY Z T ER N AA N DH AH L AY T, according to the mapping relations between words and phonemes in the pronunciation dictionary.
In addition, in order to more clearly characterize the phoneme sequence, the present embodiment adds a start flag and an end flag before and after the phoneme sequence, respectively, and characterizes the beginning of the phoneme sequence by the start flag and the end of the phoneme sequence by the end flag. The specific configuration of the start flag and the end flag is not particularly limited herein, and may be configured by those skilled in the art according to actual needs.
For example, the start flag may be configured as "<bos>" and the end flag as "<eos>". After the start flag and the end flag are added, the phoneme sequence P L IY Z T ER N AA N DH AH L AY T becomes <bos> P L IY Z T ER N AA N DH AH L AY T <eos>.
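To illustrate this step, the following is a minimal sketch of dictionary-based text-to-phoneme conversion with the <bos>/<eos> boundary flags. The toy pronunciation dictionary stands in for a full lexicon such as the CMU Pronouncing Dictionary, and the function name and dictionary contents are illustrative assumptions, not the patent's implementation.

```python
# Toy stand-in for a pronunciation dictionary mapping words to phonemes.
PRONUNCIATION_DICT = {
    "please": ["P", "L", "IY", "Z"],
    "turn":   ["T", "ER", "N"],
    "on":     ["AA", "N"],
    "the":    ["DH", "AH"],
    "light":  ["L", "AY", "T"],
}

def text_to_phonemes(text: str) -> list[str]:
    """Convert a target text into a phoneme sequence word by word,
    then wrap it with the start/end flags described above."""
    phonemes = ["<bos>"]
    for word in text.lower().strip(".!?").split():
        phonemes.extend(PRONUNCIATION_DICT[word])  # lexicon lookup
    phonemes.append("<eos>")
    return phonemes

print(text_to_phonemes("Please turn on the light"))
# ['<bos>', 'P', 'L', 'IY', 'Z', 'T', 'ER', 'N', 'AA', 'N',
#  'DH', 'AH', 'L', 'AY', 'T', '<eos>']
```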
At 120, text semantic features of the target text are obtained, as well as contextual semantic features of the context to which the target text corresponds.
In this embodiment, after the target text to be subjected to speech synthesis is obtained, in addition to converting the target text into a corresponding phoneme sequence, the semantic features of the target text itself are obtained and recorded as text semantic features, and the semantic features of the context corresponding to the target text are obtained and recorded as context semantic features. A context of corresponding length can be intercepted from the chapter to which the target text belongs according to a configured context interception length; the semantic features of the intercepted context are then extracted and recorded as the context semantic features of the target text.
For example, a chapter includes 7 sentence texts, in order: sentence text A, sentence text B, sentence text C, sentence text D, sentence text E, sentence text F, and sentence text G. Assuming that sentence text D is acquired as the target text requiring speech synthesis, if the context interception length is configured as 1 sentence, sentence text C and sentence text E will be intercepted as the context of sentence text D; if the context interception length is configured as 2 sentences, sentence text B, sentence text C, sentence text E, and sentence text F will be intercepted as the context of sentence text D; if the context interception length is configured as 3 sentences, sentence text A, sentence text B, sentence text C, sentence text E, sentence text F, and sentence text G will be intercepted as the context of sentence text D.
It should be noted that, the target text and the context thereof may be respectively input into the pre-trained semantic representation model to extract semantic features, where the semantic representation model is not particularly limited, and may be selected by those skilled in the art according to actual needs. For example, a BERT (Bidirectional Encoder Representations from Transformers) model may be selected to enable extraction of semantic features.
In an embodiment, obtaining the context semantic feature of the context corresponding to the target text includes:
acquiring a plurality of context texts of a target text;
splicing the target text and adjacent texts in the multiple context texts in pairs to obtain multiple text pairs;
and extracting semantic features from the plurality of text pairs to obtain context semantic features.
To provide more rich context semantic information, the present embodiment provides an alternative context semantic feature acquisition scheme. The method comprises the steps of firstly intercepting a context text of a target text with corresponding length from chapters to which the target text belongs according to configured context intercepting lengths, so that a plurality of texts including the target text and the intercepted context text exist, further splicing adjacent texts in the plurality of texts in pairs to obtain a plurality of text pairs, and finally extracting semantic features of the spliced plurality of text pairs to obtain the context semantic features of the target text.
For example, continuing with the above chapter including 7 sentence texts, assuming that the context interception length is configured as 2 sentences, sentence text B, sentence text C, sentence text E, and sentence text F are intercepted as the context of sentence text D; accordingly, sentence text B and sentence text C are spliced into one text pair, sentence text C and sentence text D are spliced into one text pair, sentence text D and sentence text E are spliced into one text pair, and sentence text E and sentence text F are spliced into one text pair.
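To make the interception-and-splicing scheme concrete, here is a minimal sketch; the window size, the simple string concatenation used for splicing, and the function name are illustrative assumptions.

```python
def build_text_pairs(sentences: list[str], target_idx: int, window: int = 2):
    """Take `window` sentences on each side of the target, then splice
    every pair of adjacent sentences in that span into a text pair."""
    lo = max(0, target_idx - window)
    hi = min(len(sentences), target_idx + window + 1)
    span = sentences[lo:hi]  # intercepted context texts plus the target text
    return [span[i] + span[i + 1] for i in range(len(span) - 1)]

chapter = ["A", "B", "C", "D", "E", "F", "G"]
print(build_text_pairs(chapter, target_idx=3, window=2))
# ['BC', 'CD', 'DE', 'EF']
```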
In an embodiment, extracting semantic features from a plurality of text pairs to obtain contextual semantic features includes:
and inputting the plurality of text pairs into a pre-trained semantic representation model for semantic feature extraction to obtain the context semantic features.
The semantic representation model is not particularly limited and can be selected by those skilled in the art according to actual needs. For example, a BERT (Bidirectional Encoder Representations from Transformers) model can be selected to extract the semantic features, and an autoencoding pre-trained language model such as RoBERTa can also be adopted.
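A hedged sketch of extracting semantic features with a pre-trained BERT via the Hugging Face Transformers library follows. The checkpoint name "bert-base-chinese" and the use of the [CLS] hidden state as the sentence-level feature are assumptions; the patent only names BERT/RoBERTa-style models.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Checkpoint choice is an assumption for illustration only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def semantic_features(text_pairs: list[str]) -> torch.Tensor:
    """Return one feature vector per text pair (the [CLS] hidden state)."""
    batch = tokenizer(text_pairs, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = model(**batch)
    return out.last_hidden_state[:, 0]  # shape: [num_pairs, hidden_size]
```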
At 130, the prosody of the target text is determined from the phoneme sequence, the text semantic features, and the contextual semantic features.
Pronunciation prosody includes low-level features such as pitch, accent/emphasis, pause/sentence break, and rhythm, and it also affects speaking style, which describes higher-level characteristics such as emotion. It will be appreciated that the same sentence may express different semantics in different contexts, and its pronunciation prosody accordingly needs to differ in order to express the desired semantics more accurately.
In this embodiment, the pronunciation rhythm of the target text during speech synthesis is comprehensively determined according to the phoneme sequence of the target text, the text semantic features and the context semantic features thereof.
In 140, speech features of the target text are determined based on the text semantic features, the speech features being indicative of pronunciation characteristics of a speaker of the speech synthesis.
In this embodiment, besides being used to assist in determining the pronunciation prosody, the text semantic features of the target text are also used to determine the speech features of the target text. The speech features contain only features related to speaker identity and are used to indicate the pronunciation characteristics of the speaker in speech synthesis, so that different speakers can be distinguished. The speaker may be a real person or a virtual person.
In addition, it should be noted that the execution sequence of 130 and 140 is not affected by the sequence number, and 130 may be executed first, 140 may be executed first, or 130 and 140 may be executed simultaneously.
In 150, speech synthesis is performed according to the prosody, the speech features and the phoneme sequence to obtain the pronunciation audio of the target text.
As described above, after the pronunciation prosody and the speech features of the target text are determined, speech synthesis is performed according to the pronunciation prosody, the speech features, and the phoneme sequence obtained by converting the target text, so as to obtain the pronunciation audio of the target text. Playing the pronunciation audio achieves the effect of the simulated speaker speaking the target text with the determined pronunciation prosody and pronunciation characteristics.
In one embodiment, the speech synthesis of the target text is implemented by adopting an artificial intelligence speech synthesis mode based on natural language processing, and the pronunciation rhythm of the target text is determined according to the phoneme sequence, the text semantic features and the context semantic features, including:
acquiring a voice synthesis model, wherein the voice synthesis model comprises a prosody prediction network, a feature extraction network, a length adjustment network and a voice synthesis network;
inputting the phoneme sequence, the text semantic features and the context semantic features into a prosody prediction network to perform prosody prediction, and determining the pronunciation prosody of the target text;
Performing speech synthesis according to the pronunciation prosody, the speech features and the phoneme sequence to obtain pronunciation audio of the target text, including:
inputting the phoneme sequence into a feature extraction network to perform feature extraction to obtain a first phoneme feature of the phoneme sequence;
inputting the first phoneme features and the voice features into a length adjustment network for duration prediction, and adjusting the length of the first phoneme features according to the predicted pronunciation duration to obtain first adjusted phoneme features;
inputting the pronunciation prosody, the speech features and the first adjusted phoneme feature into the speech synthesis network to perform speech synthesis, obtaining the pronunciation audio of the target text.
Artificial intelligence (Artificial Intelligence, AI) is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and extend human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline spanning a wide range of fields, covering both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes Machine Learning (ML), of which Deep Learning (DL) is a new research direction, introduced into machine learning to bring it closer to its original goal, namely artificial intelligence. At present, deep learning is mainly applied to fields such as machine vision and natural language processing.
Deep learning learns the inherent regularities and representation hierarchies of sample data, and the information obtained during such learning greatly aids the interpretation of data such as text, images, and sound. Using deep learning technology and a corresponding training set, network models realizing different functions can be obtained through training; for example, a gender classification model can be trained on one training set, and an image optimization model can be trained on another. Accordingly, in the present embodiment, using sample audio and its corresponding sample text as a training set, a speech synthesis model is trained that is configured to map an input phoneme sequence to a mel spectrum. That is, the speech synthesis model outputs not directly playable audio but a frequency-domain mel spectrum sequence, which must be converted to the time domain to obtain the corresponding pronunciation audio.
Referring to fig. 1c, the speech synthesis model is composed of four parts, namely a prosody prediction network, a feature extraction network, a length adjustment network and a speech synthesis network.
The feature extraction network is the backbone of the model and is used to map the phoneme sequence into a hidden space, realizing modeling of the phoneme information and producing a hidden sequence.
The length adjustment network connects the feature extraction network and the speech synthesis network; it solves the length mismatch between the phoneme sequence and the mel spectrum sequence and controls the pronunciation duration. Note that the length of the phoneme sequence is generally smaller than that of the mel spectrum sequence: each phoneme corresponds to several mel spectrum frames, and the number of mel spectrum frames corresponding to a phoneme is referred to in this embodiment as the pronunciation duration of the phoneme. The length adjustment network predicts the pronunciation duration of the input hidden sequence according to the speech features and expands the length of the hidden sequence based on the predicted pronunciation duration.
The prosody prediction network is connected to the speech synthesis network and provides additional prosody information; it takes the phoneme sequence, the text semantic features, and the corresponding context semantic features as input, and outputs a pronunciation prosody that matches both the phoneme sequence and its corresponding semantics.
The voice synthesis network is used for mapping the hidden sequence with the length adjusted to the corresponding Mel frequency spectrum sequence according to the pronunciation rhythm and the voice characteristics.
Specifically, in this embodiment, a phoneme sequence obtained by converting a target text, text semantic features thereof and corresponding context semantic features are input into a prosody prediction network to perform prosody prediction, so as to obtain a pronunciation prosody required by speech synthesis.
In addition, the phoneme sequence is input into the feature extraction network for feature extraction, mapping the phoneme sequence into a hidden space and modeling the phoneme information to obtain a hidden sequence, which is recorded as the first phoneme feature of the phoneme sequence. Then, the speech features are used as a reference for pronunciation duration prediction and are input into the length adjustment network together with the first phoneme feature to perform duration prediction, obtaining the pronunciation duration corresponding to the first phoneme feature; the length of the first phoneme feature is adjusted according to this pronunciation duration so that the first phoneme feature is aligned with the pronunciation duration, yielding the first adjusted phoneme feature. Next, the pronunciation prosody, the speech features and the first adjusted phoneme feature are input into the speech synthesis network, where the pronunciation prosody and the speech features constrain the speech synthesis of the first adjusted phoneme feature, which is mapped into a frequency-domain mel spectrum sequence. Finally, the mel spectrum sequence is mapped from the frequency domain to the time domain to obtain the pronunciation audio of the target text.
Illustratively, the feature extraction network may be implemented based on the Encoder of the FastSpeech framework, the length adjustment network may be implemented based on the Length Regulator of the FastSpeech framework, and the speech synthesis network may be implemented based on the Decoder of the FastSpeech framework. The prosody prediction network is described later.
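As a concrete illustration of the length-adjustment step, the following PyTorch sketch shows a duration predictor conditioned on the speech features, followed by FastSpeech-style expansion of the hidden sequence. The layer sizes, the log-scale duration output, and the way the speech feature is injected are assumptions rather than the patented design.

```python
import torch
import torch.nn as nn

class LengthRegulator(nn.Module):
    """Duration prediction plus hidden-sequence expansion."""
    def __init__(self, hidden: int = 256, speaker_dim: int = 128):
        super().__init__()
        self.duration_predictor = nn.Sequential(
            nn.Linear(hidden + speaker_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, phoneme_feats: torch.Tensor, speech_feat: torch.Tensor):
        # phoneme_feats: [T_phonemes, hidden]; speech_feat: [speaker_dim]
        cond = speech_feat.unsqueeze(0).expand(phoneme_feats.size(0), -1)
        log_dur = self.duration_predictor(torch.cat([phoneme_feats, cond], dim=-1))
        # Round predicted (log-scale) durations to whole mel frames, minimum 1.
        durations = torch.clamp(torch.exp(log_dur).round().long().squeeze(-1), min=1)
        # Repeat each phoneme's hidden vector for its predicted frame count.
        adjusted = torch.repeat_interleave(phoneme_feats, durations, dim=0)
        return adjusted, log_dur  # adjusted: [T_frames, hidden]
```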
In an embodiment, please refer to fig. 1d, further provide an internal refinement structure of a prosody prediction network, the prosody prediction network includes a first attention sub-network, a second attention sub-network, a feature extraction sub-network, a feature fusion sub-network and a prosody prediction sub-network, the prosody prediction network is input with a phoneme sequence, text semantic features and context semantic features to perform prosody prediction, and determining a prosody of a target text includes:
inputting the phoneme sequence into a feature extraction sub-network to perform feature extraction to obtain a second phoneme feature of the phoneme sequence;
inputting the second phoneme feature and the text semantic feature into a first attention sub-network for attention calculation to obtain a first attention weight of the second phoneme feature, and carrying out weighted calculation on the second phoneme feature according to the first attention weight to obtain a first enhancement feature of the second phoneme feature;
Inputting the second phoneme features and the context semantic features into a second attention sub-network for attention calculation to obtain second attention weights of the second phoneme features, and carrying out weighted calculation on the second phoneme features according to the second attention weights to obtain second enhancement features of the second phoneme features;
inputting the second phoneme feature, the first enhancement feature and the second enhancement feature into a feature fusion sub-network to perform feature fusion to obtain fusion features;
and inputting the fusion feature into the prosody prediction sub-network to perform prosody prediction, determining the pronunciation prosody of the target text.
The first attention sub-network and the second attention sub-network may be implemented based on a multi-head self-attention network, the feature extraction sub-network may be implemented based on the Encoder of the FastSpeech framework, the feature fusion sub-network is configured to implement a fusion (Add) operation, and the prosody prediction sub-network may be implemented using N FFT (Feed-Forward Transformer) blocks and M convolution stacks, where N and M are positive integers greater than or equal to 1.
In this embodiment, when determining the pronunciation prosody of the target text, firstly, a phoneme sequence of the target text is input into a feature extraction sub-network to perform feature extraction, the phoneme sequence is mapped into a hidden space, modeling of phoneme information is achieved, a hidden sequence is obtained, and the hidden sequence is recorded as a second phoneme feature of the phoneme sequence.
In addition, the second phoneme feature and the text semantic feature are input into a first attention sub-network, the multi-head self-attention calculation is carried out on the second phoneme feature and the text semantic feature through the first attention sub-network by adopting a multi-head self-attention mechanism, a first weight parameter for enhancing the attention is obtained, and the second phoneme feature is weighted according to the first weight parameter, so that a first enhancement feature of the second phoneme feature is obtained.
In addition, the second phoneme feature and the context semantic feature are input into a second attention sub-network, the multi-head self-attention calculation is carried out on the second phoneme feature and the context semantic feature through the second attention sub-network by adopting a multi-head self-attention mechanism, a second weight parameter for attention enhancement is obtained, and the second phoneme feature is subjected to weighting operation according to the second weight parameter, so that a second enhancement feature of the second phoneme feature is obtained.
And after extracting the second phoneme characteristic of the phoneme sequence and carrying out attention enhancement on the second phoneme characteristic to obtain a first enhancement characteristic and a second enhancement characteristic, inputting the second phoneme characteristic, the first enhancement characteristic and the second enhancement characteristic into a characteristic fusion sub-network to carry out characteristic fusion, and obtaining the fusion characteristic of the three. Thus, the fusion features carry the features of the phoneme sequence itself, as well as the text semantic features and the context semantic features of the target text itself. And inputting the fusion characteristics into a prosody prediction sub-network to perform prosody prediction, so as to obtain the pronunciation prosody of the target text during speech synthesis.
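The dual-attention fusion described above can be sketched as follows; the use of PyTorch's MultiheadAttention in a cross-attention arrangement (phoneme features as query, semantic features as key/value) and the dimensions are assumptions consistent with, but not taken from, the patent.

```python
import torch
import torch.nn as nn

class ProsodyFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, phoneme_feats, text_sem, ctx_sem):
        # phoneme_feats: [B, T, d]; text_sem: [B, S1, d]; ctx_sem: [B, S2, d]
        enhanced_1, _ = self.text_attn(phoneme_feats, text_sem, text_sem)
        enhanced_2, _ = self.ctx_attn(phoneme_feats, ctx_sem, ctx_sem)
        # Add-style fusion of the phoneme features and both enhanced features.
        fused = phoneme_feats + enhanced_1 + enhanced_2
        return fused  # fed to the prosody prediction sub-network
```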
It can be understood that when a real person reads an article or a chapter paragraph, the prosodic variation of each sentence depends on the context information. Introducing text semantic features and context semantic features into the prediction of pronunciation prosody allows the prosodic information to be mined more fully, avoids the problem of every sentence having identical prosody, and improves the accuracy of prosody prediction.
In one embodiment, inputting the fusion feature into a prosody prediction sub-network to perform prosody prediction, determining a prosody of a target text, comprising:
inputting the fusion characteristics into a prosody prediction sub-network, performing prosody prediction in the audio frame dimension through the prosody prediction sub-network, and determining the pronunciation prosody of the target text in the audio frame dimension.
It should be noted that in the time-domain dimension, one phoneme generally corresponds to a plurality of audio frames when pronounced. In order to provide richer prosodic information and thereby improve speech synthesis quality, in this embodiment the prosody prediction sub-network is configured to predict pronunciation prosody in the audio frame dimension. Correspondingly, during prosody prediction, the fusion feature, which carries the features of the phoneme sequence together with the text semantic features and context semantic features of the target text, is input into the prosody prediction sub-network, which predicts the pronunciation prosody in the audio frame dimension to obtain the pronunciation prosody of the target text in the audio frame dimension.
In an embodiment, the speech synthesis model further includes a speech feature prediction network, and determining the speech features of the target text according to the text semantic features includes:
inputting the text semantic features into a voice feature prediction network to predict the voice features, and obtaining the voice features corresponding to the text semantic features.
Referring to fig. 1e, the speech synthesis model further includes a speech feature prediction network configured to take the text semantic features of the target text as input and the corresponding speech features as output. By way of example, the speech feature prediction network may consist of a set of bidirectional long short-term memory (BiLSTM) networks, an average pooling layer, and a fully connected layer.
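A minimal sketch of such a speech feature prediction network follows: BiLSTM layers, average pooling over time, then a fully connected layer. All dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeechFeaturePredictor(nn.Module):
    def __init__(self, in_dim: int = 768, hidden: int = 256, out_dim: int = 128):
        super().__init__()
        self.bilstm = nn.LSTM(in_dim, hidden, num_layers=2,
                              bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, out_dim)

    def forward(self, text_sem):          # text_sem: [B, S, in_dim]
        h, _ = self.bilstm(text_sem)      # [B, S, 2*hidden]
        pooled = h.mean(dim=1)            # average pooling over time
        return self.fc(pooled)            # speech feature: [B, out_dim]
```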
In an embodiment, referring to fig. 1f, an optional model training scheme is provided, and before obtaining the target text to be synthesized, the method further includes:
constructing a prosody coding network and a voice characteristic coding network based on a reference encoder;
acquiring sample audio, and sample text and sample pronunciation time corresponding to the sample audio;
inputting the sample audio into a prosody coding network to perform prosody coding to obtain a sample pronunciation prosody of the sample audio, and inputting the sample audio into a speech feature coding network to perform speech feature coding to obtain sample speech features of the sample audio;
Acquiring sample text semantic features of a sample text and sample context semantic features of contexts corresponding to the sample text;
inputting the semantic features of the sample text into a voice feature prediction network to predict voice features, obtaining predicted sample voice features, and obtaining first loss according to the predicted sample voice features and the sample voice features;
converting the sample text into a corresponding sample phoneme sequence, inputting the sample phoneme sequence, the sample context semantic features and the sample text semantic features into a prosody prediction network for prosody prediction to obtain a predicted sample pronunciation prosody, and obtaining a second loss according to the predicted sample pronunciation prosody and the sample pronunciation prosody;
inputting the sample phoneme sequence into a feature extraction network to perform feature extraction to obtain sample phoneme features of the sample phoneme sequence, inputting the sample phoneme features and the predicted sample speech features into a length adjustment network to perform time length prediction, adjusting the length of the sample phoneme features according to the predicted sample pronunciation time length obtained by prediction to obtain sample adjustment phoneme features, and obtaining a third loss according to the predicted sample pronunciation time length and the sample pronunciation time length;
inputting the predicted sample pronunciation rhythm, the predicted sample voice characteristics and the sample adjustment phoneme characteristics into a voice synthesis network to perform voice synthesis to obtain sample pronunciation audio of a sample text, and obtaining fourth loss according to the sample pronunciation audio and the sample audio;
Training the speech synthesis model according to the first loss, the second loss, the third loss and the fourth loss until a preset training stop condition is met.
This embodiment further provides an alternative model training scheme. A prosody encoding network and a speech feature encoding network are respectively constructed based on a reference encoder, wherein the prosody encoding network is configured to take the mel spectrum of speech as input and output the pronunciation prosody adopted by the speech, and the speech feature encoding network is configured to take the mel spectrum of speech as input and output the speech features of the speaker. Illustratively, the reference encoder may consist of L one-dimensional convolution blocks, a gated recurrent unit, and a fully connected layer, where every convolution block uses ReLU activation and batch normalization, and L is a positive integer greater than or equal to 1.
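A hedged sketch of such a reference encoder follows: L one-dimensional convolution blocks with ReLU activation and batch normalization, a GRU, and a fully connected layer. Channel counts and kernel sizes are assumptions.

```python
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, channels: int = 256,
                 out_dim: int = 128, num_blocks: int = 3):
        super().__init__()
        blocks, in_ch = [], n_mels
        for _ in range(num_blocks):  # L one-dimensional convolution blocks
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=3, padding=1),
                       nn.BatchNorm1d(channels),
                       nn.ReLU()]
            in_ch = channels
        self.convs = nn.Sequential(*blocks)
        self.gru = nn.GRU(channels, channels, batch_first=True)
        self.fc = nn.Linear(channels, out_dim)

    def forward(self, mel):                    # mel: [B, n_mels, T]
        h = self.convs(mel).transpose(1, 2)    # [B, T, channels]
        _, last = self.gru(h)                  # last: [1, B, channels]
        return self.fc(last.squeeze(0))        # [B, out_dim]
```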
In addition, sample audio and a text corresponding to the sample audio are acquired and recorded as sample text, and in addition, the sample pronunciation time length of the sample audio is also acquired, wherein the sample pronunciation time length is used for indicating the pronunciation time length of each phoneme in the sample audio. For the obtained sample text, further obtaining semantic features of the sample text, and marking the semantic features as sample text semantic features, and obtaining semantic features of a context corresponding to the sample text, and marking the semantic features as sample context semantic features.
In this embodiment, training of the prosody prediction network is constrained by the prosody encoding network, and training of the speech feature prediction network is constrained by the speech feature encoding network: the obtained sample audio is input into the prosody encoding network for prosody encoding to obtain the pronunciation prosody of the sample audio, recorded as the sample pronunciation prosody, and is input into the speech feature encoding network for speech feature encoding to obtain the speech features of the sample audio, recorded as the sample speech features.
And for the voice feature prediction network, inputting the obtained sample text semantic features into the voice feature prediction network to predict the voice features, marking the voice features output by the voice feature prediction network at the moment as predicted sample voice features, and obtaining a first loss according to the difference between the predicted sample voice features and the sample voice features. The first loss may be calculated by using a mean square error loss.
For the prosody prediction network, the obtained sample text is converted into a corresponding sample phoneme sequence, and the sample phoneme sequence, together with the obtained sample context semantic features and sample text semantic features, is input into the prosody prediction network for prosody prediction; the pronunciation prosody output by the prosody prediction network at this point is recorded as the predicted sample pronunciation prosody, and a second loss is obtained according to the difference between the predicted sample pronunciation prosody and the sample pronunciation prosody. In this embodiment, in the training phase, the prosody prediction network and the prosody encoding network form a variational autoencoder: the prosody encoding network corresponds to the encoder and the prosody prediction network corresponds to the decoder; the encoder maps the original data into a distribution, and the decoder maps the distribution back into reconstructed data that matches the original data as closely as possible. In this embodiment, the pronunciation prosody takes the form of a distribution, whether it is the pronunciation prosody, the sample pronunciation prosody, or the predicted sample pronunciation prosody. Accordingly, the following loss function may be used to obtain the second loss:
wherein NLoss represents the second loss, var represents the variance of the sample pronunciation prosody (distribution), μ represents the mean of the sample pronunciation prosody (distribution), and μ_pre represents the mean of the predicted sample pronunciation prosody (distribution).
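A sketch of one plausible reading of this loss follows, treating it as the Gaussian negative log-likelihood of the predicted prosody mean under the sample prosody distribution; the exact functional form is an assumption based only on the symbols defined above.

```python
import torch

def prosody_loss(mu: torch.Tensor, var: torch.Tensor,
                 mu_pre: torch.Tensor) -> torch.Tensor:
    """mu/var: mean and variance of the sample prosody distribution;
    mu_pre: mean of the predicted sample prosody distribution.
    Assumed form: Gaussian negative log-likelihood (constants dropped)."""
    nloss = (mu_pre - mu) ** 2 / (2 * var) + 0.5 * torch.log(var)
    return nloss.mean()
```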
For the length adjustment network, firstly, inputting the sample phoneme sequence obtained by converting the sample text into a feature extraction network for feature extraction to obtain sample phoneme features of the sample phoneme sequence, then inputting the predicted sample voice features predicted by the voice feature prediction network together with the sample phoneme features into the length adjustment network for duration prediction to obtain the pronunciation duration of the sample phoneme features, recording the pronunciation duration as predicted sample pronunciation duration, and obtaining a third loss according to the difference between the predicted sample pronunciation duration and the sample pronunciation duration. The third loss may be calculated by a mean square error loss.
And for the voice synthesis network, inputting the predicted sample pronunciation rhythm, the predicted sample voice characteristics and the sample adjustment phoneme characteristics obtained above into the voice synthesis network for voice synthesis, recording pronunciation audio output by the voice synthesis network at the moment as sample pronunciation audio, and acquiring a fourth loss according to the difference between the sample pronunciation audio and the sample audio. The fourth loss may be calculated by a mean square error loss.
As described above, the first loss reflecting the speech feature prediction capability of the speech feature prediction network, the second loss reflecting the prosody prediction capability of the prosody prediction network, the third loss reflecting the duration prediction capability of the length adjustment network, and the fourth loss reflecting the speech synthesis capability of the speech synthesis network are respectively obtained, so as to ensure that the networks in the speech synthesis model work in coordination and reach their best state. The preset training stop condition may be set by a person skilled in the art according to actual needs; for example, it may be set such that the number of iterations of the weight parameters of the networks in the speech synthesis model reaches a preset count, or such that the total loss of the speech synthesis model is less than or equal to a loss threshold.
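A minimal sketch of the joint training step follows; equal weighting of the four losses, the hypothetical `compute_losses` helper, and the use of a gradient-based optimizer are assumptions, since the patent only specifies training until the preset stop condition is met.

```python
import torch

def training_step(model, optimizer, batch) -> float:
    # `model.compute_losses` is a hypothetical helper returning the four
    # losses described above (speech feature, prosody, duration, synthesis).
    first, second, third, fourth = model.compute_losses(batch)
    total = first + second + third + fourth  # equal weighting assumed
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
```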
In one embodiment, obtaining target text for which speech synthesis is desired includes:
Responding to the input interactive voice, and generating a corresponding response text as a target text;
and after performing speech synthesis according to the pronunciation prosody, the speech features and the phoneme sequence to obtain the pronunciation audio of the target text, the method further includes:
and outputting pronunciation audio of the target text.
In this embodiment, the target text to be synthesized may come from external voice interaction, thereby improving the pronunciation quality of voice interaction.
When an input interactive voice is received, a response text corresponding to the interactive voice is generated according to a configured voice interaction policy, and the response text is taken as the target text requiring speech synthesis. It should be noted that the configuration of the voice interaction policy is not particularly limited in this embodiment and may be set by those skilled in the art according to actual needs. Correspondingly, after the pronunciation audio of the target text is obtained through speech synthesis, the pronunciation audio is further output as the response voice to the interactive voice.
It can be understood that the speech synthesis method provided by the above embodiments of the present application can be applied to speech synthesis in voice interaction scenarios such as smart speakers and virtual digital humans, and can also be applied to dubbing in non-interactive scenarios such as film and television, long/short videos, games, animation, and audiobooks. For example, the spoken text of a long/short video can be obtained as the target text requiring speech synthesis to realize dubbing of the video, and chapter text such as literature or novels can be obtained as the target text requiring speech synthesis to realize audiobook applications.
It should be noted that, in the above embodiments, the audio output by the speech synthesis model takes the form of a mel spectrogram in the frequency domain, which needs to be converted into the time domain, typically by a vocoder, to obtain audio that can be directly played.
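The patent does not specify the conversion method; as a stand-in, librosa's Griffin-Lim-based mel inversion can perform this frequency-to-time-domain step (in practice a neural vocoder such as HiFi-GAN is more common). A minimal sketch, with illustrative audio parameters:

```python
import numpy as np
import librosa
import soundfile as sf

def mel_to_wav(mel: np.ndarray, sr: int = 22050,
               n_fft: int = 1024, hop_length: int = 256) -> np.ndarray:
    """Invert a (mel_bins, frames) power mel spectrogram to a waveform
    via Griffin-Lim; sr, n_fft, and hop_length are assumed settings."""
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)

# Example: write the result to a directly playable file.
# sf.write("pronunciation.wav", mel_to_wav(mel), 22050)
```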
As can be seen from the foregoing, in the embodiments of the present application, a target text requiring speech synthesis is first obtained and converted into a corresponding phoneme sequence; the text semantic features of the target text and the context semantic features of its corresponding context are then obtained, and the pronunciation prosody of the target text during speech synthesis is determined based on the obtained text semantic features, the context semantic features, and the converted phoneme sequence. In addition, the speech features of the target text, which indicate the pronunciation characteristics of the speaker of the speech synthesis, are determined according to the text semantic features. Finally, speech synthesis is performed according to the phoneme sequence, the speech features, and the determined pronunciation prosody to obtain the pronunciation audio of the target text. Compared with the related art, the present application not only introduces pronunciation prosody into speech synthesis, so that the synthesized pronunciation audio has prosodic variation, but also determines that prosody from the phoneme sequence of the target text together with its text semantic features and context semantic features, so that the pronunciation prosody of the synthesized audio matches the semantics and context of the target text and the semantics of the target text can be expressed more accurately.
The speech synthesis method provided in the above embodiments is further described in detail below, taking as an example an application scenario in which the speech synthesis apparatus is integrated in an electronic device to implement voice interaction.
Referring to fig. 2 and 3 in combination, the flow of the speech synthesis method may further be as follows:
in 210, the electronic device, in response to an input interactive voice, generates a corresponding response text as the target text requiring speech synthesis, and converts the target text into a corresponding phoneme sequence.
In this embodiment, when an input interactive voice is received, the electronic device generates a response text corresponding to the interactive voice according to a configured voice interaction policy and takes the response text as the target text for speech synthesis. It should be noted that the configuration of the voice interaction policy is not particularly limited in this embodiment and may be set by those skilled in the art according to actual needs. The interactive voice can be in any language requiring interaction, depending on actual interaction requirements; for example, it may be English or Chinese interactive voice. The target text corresponds to a sentence text within the chapter where it is located, where a chapter can be colloquially understood as one or more passages composed of multiple sentence texts.
After the target text requiring speech synthesis is obtained, the electronic device further converts it into a corresponding phoneme sequence. The manner of conversion from text to phoneme sequence is not particularly limited, and a suitable conversion manner may be selected by those skilled in the art according to actual needs. For example, a pronunciation dictionary contains mapping relations between words and phonemes; the words in the target text can be converted one by one into the corresponding phonemes according to the pronunciation dictionary, and the resulting phonemes form the phoneme sequence of the target text.
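A minimal dictionary-lookup sketch of this conversion; the toy lexicon and the `<unk>` fallback are illustrative assumptions (real systems also handle out-of-vocabulary words, polyphones, and language-specific tokenization):

```python
def text_to_phonemes(text: str, lexicon: dict) -> list:
    """Convert words to phonemes one by one via a pronunciation dictionary
    and concatenate the results into the phoneme sequence of the text."""
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(lexicon.get(word, ["<unk>"]))
    return phonemes

# Example with a toy English lexicon:
lexicon = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}
print(text_to_phonemes("Hello world", lexicon))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```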
In 220, the electronic device obtains a speech synthesis model, the speech synthesis model including a prosody prediction network, a speech feature prediction network, a feature extraction network, a length adjustment network, and a speech synthesis network, the prosody prediction network including a first attention sub-network, a second attention sub-network, a feature extraction sub-network, a feature fusion sub-network, and a prosody prediction sub-network.
In 230, the electronic device obtains text semantic features of the target text and obtains context semantic features of a context to which the target text corresponds.
In this embodiment, after the target text requiring speech synthesis is obtained, in addition to converting it into the corresponding phoneme sequence, the text semantic features of the target text and the context semantic features of its corresponding context are also obtained.
For example, the electronic device may input the target text into a pre-trained semantic representation model to extract semantic features. The semantic representation model is not particularly limited here and may be selected by those skilled in the art according to actual needs; for example, a BERT (Bidirectional Encoder Representations from Transformers) model may be selected to extract text semantic features.
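A sketch of this extraction using the Hugging Face `transformers` library; the checkpoint name and the choice of returning per-token hidden states are assumptions, not the patent's configuration:

```python
import torch
from transformers import BertTokenizer, BertModel

def text_semantic_features(text: str) -> torch.Tensor:
    """Encode the target text with a pre-trained BERT model and return
    its hidden states as the text semantic features."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, num_tokens, hidden_dim)
```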
In addition, the electronic device can intercept a context of corresponding length from the chapter where the target text is located according to the configured context interception length, then extract semantic features from the intercepted context, and record the extracted semantic features as the context semantic features of the target text.
For example, suppose a chapter includes 7 sentence texts in order: sentence text A, sentence text B, sentence text C, sentence text D, sentence text E, sentence text F, and sentence text G, and suppose sentence text D is obtained as the target text requiring speech synthesis. If the context interception length is configured as 1 sentence, sentence texts C and E are intercepted as the context of sentence text D; if it is configured as 2 sentences, sentence texts B, C, E, and F are intercepted; and if it is configured as 3 sentences, sentence texts A, B, C, E, F, and G are intercepted.
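These examples correspond to a simple symmetric window over the chapter, clipped at its boundaries; a sketch:

```python
def intercept_context(sentences, target_idx, length):
    """Return `length` sentences on each side of the target sentence,
    excluding the target itself and clipping at the chapter boundaries."""
    before = sentences[max(0, target_idx - length):target_idx]
    after = sentences[target_idx + 1:target_idx + 1 + length]
    return before + after

chapter = ["A", "B", "C", "D", "E", "F", "G"]
print(intercept_context(chapter, 3, 2))  # ['B', 'C', 'E', 'F']
```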
In 240, the electronic device inputs the phoneme sequence into the feature extraction sub-network for feature extraction to obtain a second phoneme feature of the phoneme sequence, inputs the second phoneme feature and the text semantic feature into the first attention sub-network for attention enhancement to obtain a first enhancement feature of the second phoneme feature, and inputs the second phoneme feature and the context semantic feature into the second attention sub-network for attention enhancement to obtain a second enhancement feature of the second phoneme feature.
The electronic device first inputs the phoneme sequence of the target text into the feature extraction sub-network for feature extraction, mapping the phoneme sequence into a hidden space to model the phoneme information and obtaining a hidden sequence, which is recorded as the second phoneme feature of the phoneme sequence.
The second phoneme feature and the text semantic feature are then input into the first attention sub-network, which applies a multi-head self-attention mechanism to them to obtain a first weight parameter for attention enhancement; the second phoneme feature is weighted according to the first weight parameter to obtain the first enhancement feature of the second phoneme feature.

Similarly, the second phoneme feature and the context semantic feature are input into the second attention sub-network, which applies a multi-head self-attention mechanism to them to obtain a second weight parameter for attention enhancement; the second phoneme feature is weighted according to the second weight parameter to obtain the second enhancement feature of the second phoneme feature.
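A sketch of one attention sub-network built on `torch.nn.MultiheadAttention`, with the phoneme features as queries and the semantic features as keys and values; this wiring, and the shared feature dimension, are assumptions about how the described weighting could be realized:

```python
import torch
import torch.nn as nn

class AttentionSubNetwork(nn.Module):
    """Enhance phoneme features with (text or context) semantic features."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, phoneme_feats: torch.Tensor,
                semantic_feats: torch.Tensor) -> torch.Tensor:
        # Attention weights computed between phoneme and semantic features
        # re-weight the phoneme positions, yielding the enhancement feature.
        enhanced, _ = self.attn(phoneme_feats, semantic_feats, semantic_feats)
        return enhanced
```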
In 250, the electronic device inputs the second phoneme feature, the first enhancement feature and the second enhancement feature into a feature fusion sub-network to perform feature fusion, obtains a fusion feature, inputs the fusion feature into a prosody prediction sub-network to perform prosody prediction, and determines the pronunciation prosody of the target text.
As described above, after the second phoneme feature of the phoneme sequence is extracted and attention-enhanced to obtain the first and second enhancement features, the electronic device further inputs the second phoneme feature, the first enhancement feature, and the second enhancement feature into the feature fusion sub-network for feature fusion, obtaining a fusion feature of the three. The fusion feature thus carries the features of the phoneme sequence itself as well as the text semantic features and the context semantic features of the target text. The fusion feature is then input into the prosody prediction sub-network for prosody prediction, yielding the pronunciation prosody of the target text during speech synthesis.
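One plausible realization of the fusion sub-network is concatenation followed by a linear projection; a sketch under that assumption:

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuse the second phoneme feature with its two enhancement features."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(3 * dim, dim)

    def forward(self, phoneme, text_enhanced, context_enhanced):
        # Concatenate along the feature dimension, then project back,
        # so the fusion feature carries phoneme, text, and context cues.
        return self.proj(torch.cat([phoneme, text_enhanced,
                                    context_enhanced], dim=-1))
```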
It can be understood that when a real person reads an article or a chapter paragraph aloud, the prosodic rise and fall of each sentence depends on context information. Introducing text semantic features and context semantic features into the prediction of pronunciation prosody allows the prosodic information to be mined more fully, avoiding the problem of every sentence having uniform prosody and improving the accuracy of prosody prediction.
In 260, the electronic device inputs the text semantic features into the speech feature prediction network to predict the speech features, and obtains the speech features corresponding to the text semantic features.
The speech feature prediction network is configured to take the text semantic features of the target text as input and the corresponding speech features as output. The speech features include only features related to the identity of the speaker and are used to indicate the pronunciation characteristics of the speaker of the speech synthesis, so that different speakers can be distinguished. Accordingly, performing speech synthesis with the speech features of a specific speaker achieves the effect of simulating that speaker's voice; the speaker may be a real person or a virtual person. By way of example, the speech feature prediction network may consist of a stack of bidirectional long short-term memory (BiLSTM) networks, an average pooling layer, and a fully connected layer, as sketched below.
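A sketch of that composition; the layer sizes and number of LSTM layers are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SpeechFeaturePredictor(nn.Module):
    """BiLSTM layers -> average pooling over tokens -> fully connected
    layer, mapping text semantic features to a speaker-related vector."""
    def __init__(self, in_dim: int = 768, hidden: int = 256,
                 out_dim: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, out_dim)

    def forward(self, text_feats: torch.Tensor) -> torch.Tensor:
        h, _ = self.lstm(text_feats)   # (batch, tokens, 2 * hidden)
        pooled = h.mean(dim=1)         # average pooling over tokens
        return self.fc(pooled)         # (batch, out_dim)
```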
In 270, the electronic device inputs the phoneme sequence into the feature extraction network to perform feature extraction to obtain a first phoneme feature of the phoneme sequence, inputs the first phoneme feature and the voice feature into the length adjustment network to perform duration prediction, and adjusts the length of the first phoneme feature according to the predicted pronunciation duration to obtain a first adjusted phoneme feature.
The electronic device inputs the phoneme sequence into the feature extraction network for feature extraction to obtain the first phoneme feature of the phoneme sequence, then inputs the speech feature, which serves as a reference for pronunciation duration prediction, together with the first phoneme feature into the length adjustment network for duration prediction, obtaining the pronunciation duration corresponding to the first phoneme feature. The length of the first phoneme feature is then adjusted according to the pronunciation duration so that the first phoneme feature aligns with the pronunciation duration, yielding the first adjusted phoneme feature.
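The length adjustment can be sketched as a FastSpeech-style length regulator that repeats each phoneme feature for its predicted number of frames; the patent's exact adjustment procedure is not specified, so this is an assumption:

```python
import torch

def length_regulate(phoneme_feats: torch.Tensor,
                    durations: torch.Tensor) -> torch.Tensor:
    """Expand (num_phonemes, dim) features so each phoneme occupies
    `durations[i]` frames, aligning phonemes with the audio timeline."""
    return torch.repeat_interleave(phoneme_feats, durations, dim=0)

feats = torch.randn(3, 8)
durs = torch.tensor([2, 1, 3])
print(length_regulate(feats, durs).shape)  # torch.Size([6, 8])
```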
In 280, the electronic device inputs the prosody, the first adjusted phoneme feature, and the speech feature into a speech synthesis network for speech synthesis to obtain a pronunciation audio of the target text.
At 290, the electronic device outputs the pronunciation audio of the target text as a reply voice to the interactive voice.
As described above, the speech features indicate the pronunciation characteristics of the speaker of the speech synthesis, and the pronunciation prosody indicates the speaker's prosodic characteristics; both serve as references that constrain the speech synthesis network's synthesis of the first adjusted phoneme feature, so that the synthesized pronunciation audio, when played, sounds as if the speaker were speaking. As the response to the input interactive voice, the electronic device outputs the pronunciation audio of the target text as the response voice to the interactive voice, thereby achieving the effect of responding to the interactive voice.
To facilitate better implementation of the speech synthesis method provided by the embodiments of the present application, an embodiment of the present application also provides a speech synthesis apparatus based on the above speech synthesis method. The terms have the same meanings as in the speech synthesis method above; for specific implementation details, refer to the description of the method embodiments.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus may include a text obtaining module 310, a semantic obtaining module 320, a prosody determining module 330, a speech determining module 340, and a speech synthesis module 350:
a text obtaining module 310, configured to obtain a target text that needs to be synthesized by speech, and convert the target text into a corresponding phoneme sequence;
The semantic acquisition module 320 is configured to acquire text semantic features of the target text and acquire context semantic features of a context corresponding to the target text;
a prosody determining module 330, configured to determine the pronunciation prosody of the target text according to the phoneme sequence, the text semantic features, and the context semantic features;
the voice determining module 340 is configured to determine a voice feature of the target text according to the text semantic feature, where the voice feature is used to indicate a pronunciation characteristic of a speaker of the voice synthesis;
the speech synthesis module 350 is configured to perform speech synthesis according to the pronunciation prosody, the speech features and the phoneme sequence, so as to obtain pronunciation audio of the target text.
In an alternative embodiment, the prosody determination module 330 is configured to obtain a speech synthesis model, where the speech synthesis model includes a prosody prediction network, a feature extraction network, a length adjustment network, and a speech synthesis network; inputting the phoneme sequence, the text semantic features and the context semantic features into a prosody prediction network to perform prosody prediction, and determining the pronunciation prosody of the target text;
the speech synthesis module 350 is configured to input the phoneme sequence into the feature extraction network to perform feature extraction, so as to obtain a first phoneme feature of the phoneme sequence; inputting the first phoneme features and the voice features into a length adjustment network for duration prediction, and adjusting the length of the first phoneme features according to the predicted pronunciation duration to obtain first adjusted phoneme features; and inputting the pronunciation rhythm, the voice characteristics and the first regulation phoneme characteristics into a voice synthesis network to perform voice synthesis, so as to obtain pronunciation audio of the target text.
In an alternative embodiment, the prosody prediction network includes a first attention sub-network, a second attention sub-network, a feature extraction sub-network, a feature fusion sub-network, and a prosody prediction sub-network, and the prosody determining module 330 is configured to input the phoneme sequence into the feature extraction sub-network to perform feature extraction, so as to obtain a second phoneme feature of the phoneme sequence; inputting the second phoneme features and the text semantic features into a first attention sub-network for attention enhancement to obtain first enhancement features of the second phoneme features; inputting the second phoneme features and the context semantic features into a second attention sub-network to carry out attention enhancement, so as to obtain second enhancement features of the second phoneme features; inputting the second phoneme feature, the first enhancement feature and the second enhancement feature into a feature fusion sub-network to perform feature fusion to obtain fusion features; and inputting the fusion characteristics into a prosody predictor network to perform prosody prediction, and determining the pronunciation prosody of the target text.
In an alternative embodiment, the prosody determination module 330 is configured to input the fusion feature into a prosody prediction sub-network, perform prosody prediction in the audio frame dimension through the prosody prediction sub-network, and determine the prosody of the target text in the audio frame dimension.
In an alternative embodiment, the speech synthesis model further includes a speech feature prediction network, and the speech determining module 340 is configured to input the text semantic features into the speech feature prediction network to predict the speech features, so as to obtain the speech features corresponding to the text semantic features.
In an alternative embodiment, the speech synthesis apparatus further includes a model training module, configured to construct a prosody encoding network and a speech feature encoding network based on a reference encoder; acquire sample audio, together with the sample text and sample pronunciation duration corresponding to the sample audio; input the sample audio into the prosody encoding network for prosody encoding to obtain the sample pronunciation prosody of the sample audio, and input the sample audio into the speech feature encoding network for speech feature encoding to obtain the sample speech features of the sample audio; acquire the sample text semantic features of the sample text and the sample context semantic features of the context corresponding to the sample text; input the sample text semantic features into the speech feature prediction network for speech feature prediction to obtain predicted sample speech features, and obtain a first loss according to the predicted sample speech features and the sample speech features; convert the sample text into a corresponding sample phoneme sequence, input the sample phoneme sequence, the sample context semantic features, and the sample text semantic features into the prosody prediction network for prosody prediction to obtain a predicted sample pronunciation prosody, and obtain a second loss according to the predicted sample pronunciation prosody and the sample pronunciation prosody; input the sample phoneme sequence into the feature extraction network for feature extraction to obtain the sample phoneme features of the sample phoneme sequence, input the sample phoneme features and the predicted sample speech features into the length adjustment network for duration prediction, adjust the length of the sample phoneme features according to the predicted sample pronunciation duration to obtain the sample adjusted phoneme features, and obtain a third loss according to the predicted sample pronunciation duration and the sample pronunciation duration; input the predicted sample pronunciation prosody, the predicted sample speech features, and the sample adjusted phoneme features into the speech synthesis network for speech synthesis to obtain the sample pronunciation audio of the sample text, and obtain a fourth loss according to the sample pronunciation audio and the sample audio; and train the speech synthesis model according to the first loss, the second loss, the third loss, and the fourth loss until a preset training stop condition is met.
In an alternative embodiment, the semantic acquisition module 320 is configured to acquire a plurality of context texts of the target text; splicing the target text and adjacent texts in the multiple context texts in pairs to obtain multiple text pairs; and extracting semantic features from the plurality of text pairs to obtain context semantic features.
In an alternative embodiment, the semantic acquisition module 320 is configured to input the plurality of text pairs into a pre-trained semantic representation model for semantic feature extraction to obtain the context semantic features.
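A sketch combining the two steps above: splice the target text with each context text into sentence pairs and encode the pairs with a pre-trained model. The pairing scheme and the mean pooling are assumptions about the described splicing, not the patent's exact procedure:

```python
import torch
from transformers import BertTokenizer, BertModel

def context_semantic_features(target: str, contexts: list) -> torch.Tensor:
    """Encode (target, context) sentence pairs and pool each pair's
    hidden states into one vector of context semantic features."""
    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")
    feats = []
    for ctx in contexts:
        inputs = tokenizer(target, ctx, return_tensors="pt")  # pair input
        with torch.no_grad():
            feats.append(model(**inputs).last_hidden_state.mean(dim=1))
    return torch.cat(feats, dim=0)  # (num_pairs, hidden_dim)
```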
In an alternative embodiment, the text obtaining module 310 is configured to generate, in response to the input interactive voice, a corresponding response text as the target text;
the speech synthesis module 350 is also configured to output the pronunciation audio of the target text.
The specific implementation of each module can be referred to the previous embodiments, and will not be repeated here.
In this embodiment, the text obtaining module 310 obtains the target text requiring speech synthesis and converts it into a corresponding phoneme sequence; the semantic obtaining module 320 obtains the text semantic features of the target text and the context semantic features of its corresponding context; the prosody determining module 330 determines the pronunciation prosody of the target text during speech synthesis based on the obtained text semantic features, the context semantic features, and the converted phoneme sequence; the speech determining module 340 determines the speech features of the target text according to its text semantic features, the speech features indicating the pronunciation characteristics of the speaker of the speech synthesis; and finally the speech synthesis module 350 performs speech synthesis according to the phoneme sequence, the speech features, and the determined pronunciation prosody to obtain the pronunciation audio of the target text. Compared with the related art, the present application not only introduces pronunciation prosody into speech synthesis, so that the synthesized pronunciation audio has prosodic variation, but also determines that prosody from the phoneme sequence of the target text together with its text semantic features and context semantic features, so that the pronunciation prosody of the synthesized audio matches the semantics and context of the target text and the semantics of the target text can be expressed more accurately.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the processor is used for executing the steps in the voice synthesis method provided by the embodiment by calling the computer program stored in the memory.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the application.
The electronic device may include a processor 101 with one or more processing cores, a memory 102 with one or more computer-readable storage media, a power supply 103, and an input unit 104, among other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 5 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange the components differently. Wherein:
the processor 101 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 102, and invoking data stored in the memory 102. Optionally, processor 101 may include one or more processing cores; alternatively, the processor 101 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 101.
The memory 102 may be used to store software programs and modules, and the processor 101 executes various functional applications and data processing by running the software programs and modules stored in the memory 102. The memory 102 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the electronic device, etc. In addition, the memory 102 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 102 may also include a memory controller to provide the processor 101 with access to the memory 102.
The electronic device further includes a power supply 103 for powering the various components. Optionally, the power supply 103 may be logically connected to the processor 101 through a power management system, so that functions such as charging, discharging, and power consumption management are performed through the power management system. The power supply 103 may also include one or more of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 104, which input unit 104 may be used for receiving input digital or character information and for generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit, an image acquisition component, and the like, which are not described herein. In particular, in this embodiment, the processor 101 in the electronic device loads executable code corresponding to one or more computer programs into the memory 102, and the processor 101 then executes the steps in the speech synthesis method provided by the present application, for example:
acquiring a target text which needs to be subjected to voice synthesis, and converting the target text into a corresponding phoneme sequence;
acquiring text semantic features of a target text and acquiring context semantic features of a context corresponding to the target text;
determining the pronunciation rhythm of the target text according to the phoneme sequence, the text semantic features and the context semantic features;
determining the voice characteristics of the target text according to the text semantic characteristics, wherein the voice characteristics are used for indicating the pronunciation characteristics of a speaker in voice synthesis;
And performing voice synthesis according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain the pronunciation audio of the target text.
It should be noted that, the electronic device provided in the embodiment of the present application and the speech synthesis method in the above embodiment belong to the same concept, and detailed implementation processes of the electronic device are shown in the above related embodiments, which are not repeated herein.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed on a processor of an electronic device provided by an embodiment of the present application, causes the processor of the electronic device to perform the steps in the speech synthesis method provided by the present application. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform various alternative implementations of the speech synthesis method described above.
The speech synthesis method, apparatus, electronic device, storage medium, and program product provided by the present application have been described in detail above. Specific examples have been used herein to explain the principles and embodiments of the present application, and the above description of the embodiments is intended only to facilitate understanding of the method and its core idea. Meanwhile, those skilled in the art may make variations to the specific embodiments and application scope in light of the ideas of the present application; in view of the above, the content of this description should not be construed as limiting the present application.
It should be noted that when the above embodiments of the present application are applied to specific products or technologies, related data concerning users are required to obtain user approval or consent, and the collection, use and processing of the related data are required to comply with related laws and regulations and standards of related countries and regions.

Claims (13)

1. A method of speech synthesis, comprising:
acquiring a target text which needs to be subjected to voice synthesis, and converting the target text into a corresponding phoneme sequence;
acquiring text semantic features of the target text and acquiring context semantic features of contexts corresponding to the target text;
Determining the pronunciation rhythm of the target text according to the phoneme sequence, the text semantic features and the context semantic features;
determining the voice characteristics of the target text according to the text semantic characteristics, wherein the voice characteristics are used for indicating the pronunciation characteristics of a speaker in voice synthesis;
and performing voice synthesis according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain pronunciation audio of the target text.
2. The method of claim 1, wherein said determining the prosody of the target text from the phoneme sequence, the text semantic features, and the contextual semantic features comprises:
obtaining a voice synthesis model, wherein the voice synthesis model comprises a prosody prediction network, a feature extraction network, a length adjustment network and a voice synthesis network;
inputting the phoneme sequence, the text semantic features and the context semantic features into the prosody prediction network to perform prosody prediction, and determining the pronunciation prosody of the target text;
and performing speech synthesis according to the pronunciation prosody, the speech features and the phoneme sequence to obtain pronunciation audio of the target text, wherein the method comprises the following steps:
Inputting the phoneme sequence into the feature extraction network for feature extraction to obtain a first phoneme feature of the phoneme sequence;
inputting the first phoneme features and the voice features into the length adjustment network for duration prediction, and adjusting the length of the first phoneme features according to the predicted pronunciation duration to obtain first adjustment phoneme features;
inputting the pronunciation rhythm, the voice characteristics and the first regulation phoneme characteristics into the voice synthesis network to perform voice synthesis, and obtaining pronunciation audio of the target text.
3. The speech synthesis method according to claim 2, wherein the prosody prediction network includes a first attention sub-network, a second attention sub-network, a feature extraction sub-network, a feature fusion sub-network, and a prosody prediction sub-network, and the inputting the phoneme sequence, the text semantic features, and the contextual semantic features into the prosody prediction network performs prosody prediction, determining a prosody of the target text, comprising:
inputting the phoneme sequence into the feature extraction sub-network to perform feature extraction to obtain a second phoneme feature of the phoneme sequence;
Inputting the second phoneme characteristic and the text semantic characteristic into the first attention sub-network to carry out attention enhancement, so as to obtain a first enhancement characteristic of the second phoneme characteristic;
inputting the second phoneme characteristic and the context semantic characteristic into the second attention sub-network to carry out attention enhancement, so as to obtain a second enhancement characteristic of the second phoneme characteristic;
inputting the second phoneme feature, the first enhancement feature and the second enhancement feature into the feature fusion sub-network to perform feature fusion to obtain fusion features;
and inputting the fusion characteristics into the prosody prediction sub-network to perform prosody prediction, and determining the pronunciation prosody of the target text.
4. The method of claim 3, wherein inputting the fusion feature into the prosody prediction sub-network for prosody prediction, determining a prosody of the target text comprises:
inputting the fusion characteristics into the prosody prediction sub-network, performing prosody prediction in the audio frame dimension through the prosody prediction sub-network, and determining the pronunciation prosody of the target text in the audio frame dimension.
5. The method of claim 1, wherein the speech synthesis model further comprises a speech feature prediction network, wherein the determining the speech features of the target text based on the text semantic features comprises:
Inputting the text semantic features into the voice feature prediction network to predict the voice features, and obtaining the voice features corresponding to the text semantic features.
6. The method for synthesizing speech according to claim 5, wherein before the target text requiring speech synthesis is obtained, further comprising:
constructing a prosody coding network and a voice characteristic coding network based on a reference encoder;
acquiring sample audio, and sample text and sample pronunciation duration corresponding to the sample audio;
inputting the sample audio into the prosody coding network to perform prosody coding to obtain sample pronunciation prosody of the sample audio, and inputting the sample audio into the speech feature coding network to perform speech feature coding to obtain sample speech features of the sample audio;
acquiring sample text semantic features of the sample text, and sample context semantic features of contexts corresponding to the sample text;
inputting the sample text semantic features into a voice feature prediction network to predict voice features to obtain predicted sample voice features, and acquiring first losses according to the predicted sample voice features and the sample voice features;
Converting the sample text into a corresponding sample phoneme sequence, inputting the sample phoneme sequence, the sample context semantic features and the sample text semantic features into the prosody prediction network for prosody prediction to obtain predicted sample pronunciation prosody, and acquiring a second loss according to the predicted sample pronunciation prosody and the sample pronunciation prosody;
inputting a sample phoneme sequence into the feature extraction network to perform feature extraction to obtain sample phoneme features of the sample phoneme sequence, inputting the sample phoneme features and the predicted sample speech features into the length adjustment network to perform duration prediction, adjusting the length of the sample phoneme features according to predicted sample pronunciation duration obtained by prediction to obtain sample adjustment phoneme features, and obtaining a third loss according to the predicted sample pronunciation duration and the sample pronunciation duration;
inputting the predicted sample pronunciation prosody, the predicted sample voice characteristics and the sample adjustment phoneme characteristics into the voice synthesis network for voice synthesis to obtain sample pronunciation audio of the sample text, and obtaining fourth loss according to the sample pronunciation audio and the sample audio;
And training the speech synthesis model according to the first loss, the second loss, the third loss and the fourth loss until a preset training stopping condition is met.
7. The method for synthesizing speech according to claim 1, wherein the obtaining the context semantic feature of the context corresponding to the target text includes:
acquiring a plurality of context texts of the target text;
splicing the target text and adjacent texts in the multiple context texts in pairs to obtain multiple text pairs;
and extracting semantic features from the plurality of text pairs to obtain the context semantic features.
8. The method of speech synthesis according to claim 7, wherein extracting semantic features from a plurality of the text pairs to obtain the contextual semantic features comprises:
and extracting semantic features of the text pair input pre-trained semantic representation model to obtain the context semantic features.
9. The method for synthesizing speech according to any one of claims 1 to 8, wherein the obtaining the target text for which speech synthesis is required includes:
responding to the input interactive voice, and generating a corresponding response text as the target text;
And after the voice synthesis is carried out according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain the pronunciation audio of the target text, the method further comprises the following steps:
and outputting pronunciation audio of the target text.
10. A speech synthesis apparatus, comprising:
the text acquisition module is used for acquiring a target text which needs to be subjected to voice synthesis and converting the target text into a corresponding phoneme sequence;
the semantic acquisition module is used for acquiring text semantic features of the target text and acquiring context semantic features of a context corresponding to the target text;
the prosody determining module is used for determining the pronunciation prosody of the target text according to the phoneme sequence, the text semantic features and the context semantic features;
the voice determining module is used for determining voice characteristics of the target text according to the text semantic characteristics, wherein the voice characteristics are used for indicating pronunciation characteristics of a speaker of voice synthesis;
and the voice synthesis module is used for performing voice synthesis according to the pronunciation rhythm, the voice characteristics and the phoneme sequence to obtain pronunciation audio of the target text.
11. An electronic device comprising a memory storing a computer program and a processor for running the computer program in the memory to perform the steps of the speech synthesis method of any of claims 1 to 9.
12. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the speech synthesis method of any one of claims 1 to 9.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, carries out the steps of the speech synthesis method of any one of claims 1 to 9.