CN116597807A - Speech synthesis method, device, equipment and medium based on multi-scale style - Google Patents

Speech synthesis method, device, equipment and medium based on multi-scale style

Info

Publication number
CN116597807A
Authority
CN
China
Prior art keywords
style
vector
audio
target
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310707136.5A
Other languages
Chinese (zh)
Inventor
张旭龙 (Zhang Xulong)
王健宗 (Wang Jianzong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310707136.5A priority Critical patent/CN116597807A/en
Publication of CN116597807A publication Critical patent/CN116597807A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a speech synthesis method, apparatus, computer device and storage medium based on a multi-scale style, which address the strong machine feel and insufficient emotion of conventional speech synthesis schemes. The method comprises the following steps: extracting the target audio and target text corresponding to an original speech; performing style analysis on the target audio to obtain a first style embedded vector; performing style prediction on the target text to obtain a second style embedded vector; fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector; and synthesizing the target speech based on the target style embedded vector.

Description

Speech synthesis method, device, equipment and medium based on multi-scale style
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a speech synthesis method, apparatus, computer device, and storage medium based on a multi-scale style.
Background
Existing speech synthesis technology has made great progress, but in everyday production and life people can still easily tell whether the party at the other end of a conversation is a robot or a real person, because synthesized speech data is generally flat and stable and therefore not rich in emotion and expressiveness.
With the growing interest in and demand for emotional and personalized synthesis in recent years, emotional speech synthesis work has focused on building single-scale models that acquire context information from sentences, while ignoring the differences in speech style across scales. As a result, the style of the synthesized speech is monotonous, insufficiently rich, and noticeably machine-like.
Disclosure of Invention
The embodiments of the application provide a speech synthesis method, apparatus, computer device and storage medium based on a multi-scale style, which are used to solve the problem that the style of speech synthesized by conventional schemes is monotonous, insufficiently rich, and noticeably machine-like.
A speech synthesis method based on a multi-scale style, comprising:
extracting target audio and target text corresponding to the original voice;
performing style analysis on the target audio to obtain a first style embedded vector;
carrying out style prediction on the target text to obtain a second style embedded vector;
fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and synthesizing target voice based on the target style embedding vector.
A multi-scale style based speech synthesis apparatus comprising:
the extraction module is used for extracting target audio and target text corresponding to the original voice;
the style analysis module is used for carrying out style analysis on the target audio to obtain a first style embedded vector;
the style prediction module is used for performing style prediction on the target text to obtain a second style embedded vector;
the fusion module is used for fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and the synthesis module is used for synthesizing target voice based on the target style embedding vector.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the multi-scale style based speech synthesis method described above when the computer program is executed.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the multi-scale style based speech synthesis method described above.
Compared with conventional schemes, the speech synthesis method, apparatus, computer device and storage medium based on a multi-scale style provide a multi-scale style extraction and embedding method. Speech styles are fully extracted at different scales, highlighting the style and emotion of the synthesized speech data; style analysis and prediction at different scales are introduced to support the emotional expression of the synthesized speech and improve the synthesis quality of emotional speech. The final synthesized speech is rich in emotion, which solves the problems of strong machine feel and insufficient emotion in conventional speech synthesis schemes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application environment of a speech synthesis method based on a multi-scale style according to an embodiment of the present application;
FIG. 2 is a flow chart of a speech synthesis method based on a multi-scale style in an embodiment of the application;
FIG. 3 is a flowchart of a specific embodiment of step S20 in FIG. 2;
FIG. 4 is another flow chart of a speech synthesis method based on a multi-scale style in an embodiment of the application;
FIG. 5 is a flowchart of a specific embodiment of step S25 in FIG. 3;
FIG. 6 is a flowchart of a specific embodiment of step S30 in FIG. 2;
FIG. 7 is a schematic diagram of a speech synthesis apparatus based on a multi-scale style according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The speech synthesis method based on a multi-scale style provided by the embodiments of the application can be applied to an application environment such as that shown in FIG. 1, in which a client communicates with a server through a network. The client provides original speech; after obtaining the original speech, the server extracts the corresponding target audio and target text, performs style analysis on the target audio to obtain a first style embedded vector, performs style prediction on the target text to obtain a second style embedded vector, fuses the first and second style embedded vectors to obtain a target style embedded vector carrying multi-scale style information, and finally synthesizes the target speech based on the target style embedded vector. Compared with conventional schemes, a multi-scale style extraction and embedding method is provided: speech styles are fully extracted at different scales to highlight the style and emotion of the synthesized speech data, and a context-aware multi-scale style prediction module introduces style analysis and prediction at different scales to support the emotional expression of the synthesized speech and improve the synthesis quality of emotional speech, finally yielding synthesized speech that is rich in emotion. The client may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device. The server may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
First, before describing the embodiments of the application, several terms involved in the application are explained:
Artificial intelligence (AI): a technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Natural language processing (NLP): the branch of artificial intelligence devoted to analyzing human language. Its working principle is roughly as follows: receive natural language, which has evolved through natural human use and with which humans communicate every day; analyze the natural language with probability-based algorithms; and output the result.
Embedding: an embedding is a vector representation, i.e., an object is represented by a low-dimensional vector. The object may be a word, a commodity, a movie, and so on, or, in the embodiments of the application, an emotional style. Embeddings are widely used in machine learning: when building a machine learning model, the object is encoded into a low-dimensional dense vector and then passed to a network, or to an encoder and decoder, for processing, which improves processing efficiency.
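By way of illustration only (this is not part of the claimed scheme), the following Python sketch maps a hypothetical set of emotion-style labels to dense vectors with PyTorch; the label set and the dimensionality are arbitrary assumptions.

import torch
import torch.nn as nn

# Minimal illustration of an embedding: discrete objects (here, a hypothetical
# set of emotion-style labels) are mapped to low-dimensional dense vectors.
style_labels = {"neutral": 0, "sad": 1, "angry": 2, "happy": 3}  # hypothetical label set
embedding = nn.Embedding(num_embeddings=len(style_labels), embedding_dim=8)

ids = torch.tensor([style_labels["sad"], style_labels["angry"]])
vectors = embedding(ids)  # shape (2, 8): one dense vector per label
print(vectors.shape)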
In one embodiment, as shown in fig. 2, a speech synthesis method based on a multi-scale style is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
s10: extracting target audio and target text corresponding to the original voice;
In this embodiment, a passage of speech containing multiple sentences is first obtained, and any one sentence in the passage is recorded as an original speech. That is, the original speech is any sentence in the acquired passage of speech. After the original speech is obtained, the corresponding audio and text are extracted from it and recorded as the target audio and the target text.
In a specific implementation, the original speech may be input into a trained audio encoder and a trained text encoder to obtain the target audio and target text corresponding to the original speech. The extraction may be based on commonly used audio extraction and text extraction algorithms, which are not described in detail here.
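The application does not mandate particular extraction tools, so the following Python sketch is only one possible realization: it assumes librosa for loading the waveform and an off-the-shelf ASR model (OpenAI Whisper) as a stand-in text extractor; the file path and model size are hypothetical.

import librosa
import whisper  # stand-in ASR; the application does not specify a text extractor

def extract_audio_and_text(path, sr=22050):
    """Return (target_audio, target_text) for one original-speech utterance."""
    # Target audio: the waveform resampled to a fixed rate.
    target_audio, _ = librosa.load(path, sr=sr)
    # Target text: a transcript recovered from the same utterance.
    asr = whisper.load_model("base")
    target_text = asr.transcribe(path)["text"].strip()
    return target_audio, target_text

# Example (hypothetical file):
# audio, text = extract_audio_and_text("original_speech.wav")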
In addition, the same processing manner is adopted for any other original voice, and for convenience of description, the embodiment of the application uses one original voice as an example.
S20: performing style analysis on the target audio to obtain a first style embedded vector;
s30: carrying out style prediction on the target text to obtain a second style embedded vector;
After the target audio and the target text corresponding to the original speech are obtained, in order to obtain rich synthesized speech rather than speech that sounds mechanical and lacks emotional expression, the embodiment of the application processes the speech along two branches at different style scales. The first branch performs style analysis on the target audio to obtain a style embedded vector, recorded as the first style embedded vector, which represents style information at the audio level. It can be understood that the audio carries the most direct emotional style expression of the original speech, so style analysis is performed on the target audio to obtain the style corresponding to the audio and convert it into a style embedded vector. The second branch performs style prediction on the target text to obtain another style embedded vector, recorded as the second style embedded vector, which represents style information at the text level. It will be appreciated that the words in the original speech often also convey emotional style. For example, the sentence "Why do you treat me like this? I feel so hurt." carries a sad style, whereas "Why do you treat me like this? I feel so hurt!!!" carries anger or resentment in addition to sadness. The text therefore contains another layer of the emotional style of the original speech, so style prediction is performed on the target text corresponding to the original speech to predict the style behind the text and convert it into the second style embedded vector.
S40: fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
s50: and synthesizing target voice based on the target style embedding vector.
After the emotional style corresponding to the audio and the emotional style corresponding to the text have been obtained as style embedded vectors, the multi-scale styles are fused: the first style embedded vector and the second style embedded vector are fused to obtain the target style embedded vector, and the target speech is finally synthesized based on the target style embedded vector and the corresponding speech data. In an embodiment, the fusion refers to directly superimposing the first style embedded vector and the second style embedded vector to obtain the target style embedded vector. Other fusion methods are also possible, and the embodiments of the application are not limited in this respect.
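A minimal sketch of the superposition-style fusion described above, assuming both embeddings share the same dimensionality (the dimension of 256 is an arbitrary assumption); concatenation or gated fusion would equally satisfy the broader wording.

import torch

def fuse_style_embeddings(first_style, second_style):
    """Fuse the audio-branch and text-branch style embeddings by superposition."""
    return first_style + second_style  # element-wise addition

first_style_embedded = torch.randn(256)   # from style analysis of the target audio
second_style_embedded = torch.randn(256)  # from style prediction of the target text
target_style_embedded = fuse_style_embeddings(first_style_embedded, second_style_embedded)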
It can be seen that the embodiment of the application provides a speech synthesis method based on a multi-scale style. Compared with conventional schemes, it provides a multi-scale style extraction and embedding method that fully extracts the style and emotion of the speech data at different scales, and a context-aware multi-scale style prediction module that introduces speech style analysis and prediction at different scales. This supports the emotional expression of the synthesized speech, improves the synthesis quality of emotional speech, and finally yields synthesized speech rich in emotion, solving the problems of strong machine feel and insufficient emotion in conventional speech synthesis schemes.
On the basis of the above embodiment, in order to further improve the emotional expression and richness of the synthesized speech, the embodiment of the application also considers the correlation between information at different levels and further optimizes the construction of the style embedded vectors. Specifically, context information between different levels is combined to support the emotional expression of the synthesized speech and improve the synthesis quality of emotional speech.
In an embodiment, as shown in FIG. 3, step S20, namely performing style analysis on the target audio to obtain the first style embedded vector, specifically includes the following steps:
S21: extracting a mel spectrum of the target audio as a local mel spectrum;
S22: acquiring the Mel spectrums of the context audio of the target audio, and splicing the Mel spectrums of the context audio and the Mel spectrum of the target audio to obtain a global Mel spectrum;
s23: extracting a mel spectrum of sub-audio divided according to a sub-word phoneme boundary in the target audio to be used as a segment mel spectrum;
s24: respectively carrying out style coding on the global Mel spectrum, the local Mel spectrum and the segment Mel spectrum, and respectively inputting the coded style information into corresponding style label layers to obtain a global audio style vector, a local audio style vector and a segment audio style vector;
s25: and obtaining the total emotion style variable of the target audio as a first style embedded vector according to the global audio style vector, the local audio style vector and the segment audio style vector.
For steps S21-S25, when constructing the first style embedded vector corresponding to the target audio, the corresponding local mel spectrum, global mel spectrum and segment mel spectrum are first obtained. The local mel spectrum is the mel spectrum obtained by mel-spectrum conversion of the current target audio. The global mel spectrum is obtained by splicing the local mel spectrum of the target audio with the mel spectrums of its context audio; specifically, the mel spectrums corresponding to 2n+1 sentences are spliced together, where n is the number of context sentences before and after the target audio and can be set empirically without limitation. The segment mel spectrum is obtained by performing mel-spectrum conversion on the audio segments corresponding to the divided subwords, using the inter-phoneme boundaries of the target audio as the segmentation basis.
In a specific implementation, this embodiment introduces reference encoders for processing the global-, local- and segment-scale mel spectrums, namely a global reference encoder, a local reference encoder and a segment reference encoder. The spliced mel spectrums of the 2n+1 sentences (n being the number of context sentences) serve as the input of the global reference encoder; the mel spectrum of the current target audio serves as the input of the local reference encoder; and the audio segments corresponding to the subwords, obtained by dividing the target audio at inter-phoneme boundaries, serve as the input of the segment reference encoder. In this way the global, local and segment reference encoders process the global mel spectrum, the local mel spectrum and the segment mel spectrum, respectively.
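The following Python sketch shows one way the three mel-spectrum scales could be assembled, assuming librosa for the mel transform; the sample rate, hop length, boundary format (seconds per subword, e.g. from a forced alignment) and the split of context sentences into preceding/following lists are assumptions, not requirements of the application.

import numpy as np
import librosa

def log_mel(y, sr=22050, n_mels=80, hop_length=256):
    """Log-mel spectrogram of shape (n_mels, frames)."""
    m = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return np.log(np.clip(m, 1e-5, None))

def build_scale_mels(target_audio, preceding_audios, following_audios,
                     subword_boundaries, sr=22050, hop_length=256):
    """Return (global_mel, local_mel, segment_mels) for one target utterance."""
    local_mel = log_mel(target_audio, sr, hop_length=hop_length)
    # Global scale: splice the 2n+1 sentence-level mel spectrums along the time axis.
    global_mel = np.concatenate(
        [log_mel(a, sr, hop_length=hop_length) for a in preceding_audios]
        + [local_mel]
        + [log_mel(a, sr, hop_length=hop_length) for a in following_audios],
        axis=1,
    )
    # Segment scale: slice the local mel at subword phoneme boundaries (start_sec, end_sec).
    segment_mels = [
        local_mel[:, int(s * sr / hop_length): int(e * sr / hop_length)]
        for s, e in subword_boundaries
    ]
    return global_mel, local_mel, segment_mels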
It should be noted that the subwords may be divided in various ways; many methods exist for aligning the spectrum with phonemes or words. For example, the MFA forced-alignment tool used in FastSpeech 2 aligns the speech with the text, determining which segment of the spectrum each word corresponds to, and thereby yields the corresponding subword set.
After the global mel spectrum, the local mel spectrum and the segment mel spectrum are obtained, style coding is performed on each of them, and the coded style information is input into the corresponding style label layer, namely the global style label layer, the local style label layer and the segment style label layer, to obtain the global audio style vector, the local audio style vector and the segment audio style vector. Finally, the total emotional style variable of the target audio is obtained from the global, local and segment audio style vectors and serves as the first style embedded vector. Specifically, the global audio style vector, the local audio style vector and the segment audio style vector are superimposed to obtain the first style embedded vector.
In this embodiment, the style vectors corresponding to the audio are constructed according to the hierarchical structure of the target audio, taking into account the correlation between information at different levels and combining context information across levels. This supports the emotional expression of the synthesized speech, improves the synthesis quality of emotional speech, and fully exploits the information correlation among the extracted speech styles to obtain better emotional synthesized speech.
It should be noted that, in view of the problem of information redundancy, an embodiment further reduces the processing workload and improves speech processing efficiency. As shown in FIG. 5, in step S25, performing style coding on the global mel spectrum, the local mel spectrum and the segment mel spectrum and inputting the coded style information into the corresponding style label layers to obtain a global audio style vector, a local audio style vector and a segment audio style vector includes:
s251: performing style coding on the global Mel spectrum to obtain a global audio style as a first residual style, and performing style coding on the local Mel spectrum and the segment Mel spectrum respectively to obtain a local audio style and a segment audio style;
s252: subtracting the global audio style from the local audio style to obtain a second residual style;
s253: subtracting the local audio style from the segment audio style to obtain a third residual style;
S254: inputting the first residual style, the second residual style and the third residual style into the corresponding style label layers respectively to obtain a global audio style vector, a local audio style vector and a segment audio style vector.
In this embodiment, as shown in FIG. 4, each lower scale level subtracts the style embedding obtained at the scale level above it, which takes the problem of information redundancy into account. Residual styles embedded at different scales are thus obtained from the reference encoders of the three scales and recorded respectively as the first residual style R_global, the second residual style R_local and the third residual style R_segment. These three residual styles are passed into the corresponding style label layers, which output the corresponding style labels as style information for the subsequent scales. Processing by the style label layers yields the corresponding global audio style vector Em_global, local audio style vector Em_local and segment audio style vector Em_segment. Finally, for each target audio divided in the encoding stage, its multi-scale style is the sum of the three scale style embeddings, Em_total = Em_global + Em_local + Em_segment, where Em_total is the first style embedded vector.
In this embodiment, each lower scale level subtracts the style embedding obtained at the scale level above it in view of information redundancy, which effectively reduces the processing of redundant style information: using the output of the previous scale as a subtracted constraint reduces repeated modeling of the same information. The residual styles embedded at different scales, namely the first residual style R_global, the second residual style R_local and the third residual style R_segment, are obtained from the reference encoders of the three scales and passed into the corresponding style label layers, which output the corresponding style labels, providing style information for the subsequent predictive encoders and constructing the first style embedded vector.
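A minimal PyTorch sketch of the residual multi-scale audio-style extraction described above. The GRU-based reference encoders, the linear stand-ins for the style label layers, the embedding dimension and the mean-pooling of the per-subword segment styles are illustrative assumptions rather than the patented architecture.

import torch
import torch.nn as nn

class RefEncoder(nn.Module):
    """Toy reference encoder: mel spectrum (n_mels, T) -> style vector (d)."""
    def __init__(self, n_mels=80, d=128):
        super().__init__()
        self.gru = nn.GRU(n_mels, d, batch_first=True)

    def forward(self, mel):                  # mel: float tensor of shape (n_mels, T)
        _, h = self.gru(mel.T.unsqueeze(0))  # feed frames as a length-T sequence
        return h.squeeze(0).squeeze(0)       # (d,)

d = 128
enc_global, enc_local, enc_segment = RefEncoder(d=d), RefEncoder(d=d), RefEncoder(d=d)
stl_global, stl_local, stl_segment = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)  # stand-ins for style label layers

def first_style_embedded(global_mel, local_mel, segment_mels):
    style_g = enc_global(global_mel)                                       # global audio style
    style_l = enc_local(local_mel)                                         # local audio style
    style_s = torch.stack([enc_segment(m) for m in segment_mels]).mean(0)  # segment audio style (pooled here)
    r_global = style_g                                                     # first residual style
    r_local = style_l - style_g                                            # second residual style
    r_segment = style_s - style_l                                          # third residual style
    em_global, em_local, em_segment = stl_global(r_global), stl_local(r_local), stl_segment(r_segment)
    return em_global + em_local + em_segment                               # Em_total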
In an embodiment, as shown in fig. 6, in step S30, that is, performing style prediction on the target text to obtain a second style embedded vector, the method specifically includes the following steps:
s31: extracting the semantics of the target text as a local semantic sequence;
s32: connecting the context text of the target text with the spliced text of the target text, and extracting the semantics of the spliced text to obtain a global semantic sequence;
s33: extracting semantic sequences of sub-word sets divided in the target text to serve as fragment semantic sequences;
s34: carrying out style prediction on the global semantic sequence, the local semantic sequence and the fragment semantic sequence respectively to obtain a global text style vector, a local text style vector and a fragment text style vector;
s35: and superposing the global text style vector, the local text style vector and the fragment text style vector to obtain the total emotion style variable of the target text as a second style embedded vector.
For steps S31-S35, when constructing the second style embedded vector corresponding to the target text, the corresponding global semantic sequence, local semantic sequence and segment semantic sequence are first obtained; as shown in FIG. 4, the three semantic sequences may be produced by a hierarchical text encoder. The local semantic sequence is the sequence obtained by semantic conversion of the current target text. The global semantic sequence is obtained by semantic conversion of the spliced text formed by connecting the target text with its context texts; specifically, 2n+1 texts are spliced, where n is the number of context texts before and after the target text and can be set empirically without limitation. The segment semantic sequence is obtained by performing semantic conversion on the text segments corresponding to the divided subwords, using the inter-phoneme boundaries of the target text as the segmentation basis.
In a specific implementation, this embodiment introduces a hierarchical text encoder for processing the global-, local- and segment-scale semantics, as well as a global predictive encoder, a local predictive encoder and a segment predictive encoder that perform prediction based on the semantic sequences. It is worth noting that each predictive encoder consists of a fully connected layer and an activation function, like the reference encoders mentioned above. The spliced text serves as the input of the global predictive encoder; the current target text serves as the input of the local predictive encoder; and the text segments corresponding to the divided subwords serve as the input of the segment predictive encoder, so that the global, local and segment predictive encoders output the global text style vector P_global, the local text style vector P_local and the segment text style vector P_segment, respectively.
It should be noted, as above, that the subwords may be divided in various ways; many methods exist for aligning the spectrum with phonemes or words, for example the MFA forced-alignment tool used in FastSpeech 2, which aligns the speech with the text so that each word corresponds to a segment of the spectrum and yields the corresponding subword set.
After the global text style vector, the local text style vector and the segment text style vector are obtained, the total emotional style variable of the target text is obtained from them and serves as the second style embedded vector. Specifically, the global text style vector, the local text style vector and the segment text style vector are superimposed to obtain the second style embedded vector.
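A sketch of how the three text-side semantic sequences could be built is given below; the application names only a hierarchical text encoder, so the pretrained BERT backbone, the mean-pooling choices and the subword list are stand-in assumptions. The style predictors that turn these sequences into P_global, P_local and P_segment are sketched after the conditional-constraint description further below.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")   # stand-in text-encoder backbone
text_encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_text(text):
    """Semantic vector for a piece of text: mean-pooled hidden states, shape (768,)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = text_encoder(**inputs).last_hidden_state        # (1, T, 768)
    return hidden.mean(dim=1).squeeze(0)

def build_semantic_sequences(target_text, preceding_texts, following_texts, subwords):
    """Return (global_sem, local_sem, segment_sem) for one target text."""
    local_sem = encode_text(target_text)
    spliced = "".join(preceding_texts) + target_text + "".join(following_texts)  # 2n+1 texts
    global_sem = encode_text(spliced)
    segment_sem = torch.stack([encode_text(w) for w in subwords]).mean(dim=0)    # pooled subword semantics
    return global_sem, local_sem, segment_sem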
In this embodiment, the three-scale style embedding attempts to recover the multi-scale speaking style of human speech by considering context information at different levels. When style prediction is performed on the text, the embeddings are likewise superimposed to form the multi-scale style embedding of each target text in the current sentence, so that the text-based style prediction also takes context and hierarchical information into account.
In one embodiment, step S34, namely performing style prediction on the global semantic sequence, the local semantic sequence and the segment semantic sequence to obtain a global text style vector, a local text style vector and a segment text style vector, includes: inputting the global semantic sequence, the local semantic sequence and the segment semantic sequence into a global style predictor, a local style predictor and a segment style predictor respectively to obtain the global text style vector, the local text style vector and the segment text style vector, wherein the global text style serves as the style condition constraint of the local style predictor, and the local text style serves as the style condition constraint of the segment style predictor.
In this embodiment, with the style embedding of the scale above used as a conditional constraint on the lower-level style predictive encoder, the multi-scale style predictor generates the embedding styles of the different scales in turn, including the global text style vector P_global, the local text style vector P_local and the segment text style vector P_segment. The training targets of the predictors are the corresponding real style embeddings from the extractor. This further guarantees the hierarchical relationship while improving the accuracy and relevance of the style prediction: the correlation between information at different levels is considered and context information across levels is combined, which supports the emotional expression of the synthesized speech and improves the synthesis quality of emotional speech.
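A minimal sketch of the conditioned style predictors and their training objective, following the stated description that each predictive encoder is a fully connected layer plus an activation. The conditioning-by-concatenation scheme, the tanh activation, the dimensions and the detached extractor targets are assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StylePredictor(nn.Module):
    """Predictive encoder: one fully connected layer plus an activation,
    optionally conditioned on the style predicted at the scale above."""
    def __init__(self, d_sem=768, d_style=128, conditioned=False):
        super().__init__()
        self.fc = nn.Linear(d_sem + (d_style if conditioned else 0), d_style)

    def forward(self, semantic_vec, condition=None):
        if condition is not None:
            semantic_vec = torch.cat([semantic_vec, condition], dim=-1)
        return torch.tanh(self.fc(semantic_vec))

pred_global = StylePredictor()
pred_local = StylePredictor(conditioned=True)
pred_segment = StylePredictor(conditioned=True)

def predict_text_styles(global_sem, local_sem, segment_sem):
    p_global = pred_global(global_sem)
    p_local = pred_local(local_sem, condition=p_global)        # global style constrains the local predictor
    p_segment = pred_segment(segment_sem, condition=p_local)   # local style constrains the segment predictor
    return p_global, p_local, p_segment

def predictor_loss(predicted, extractor_targets):
    """Each prediction regresses onto the matching real style embedding from the audio-side extractor."""
    return sum(F.mse_loss(p, t.detach()) for p, t in zip(predicted, extractor_targets))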
The three-scale style embedding attempts to recover the multi-scale speaking style of human speech by considering context information at different levels. Finally, all the style embedding vectors are superimposed to form the multi-scale style embedding of each segment in the current sentence, which is processed by a variance adaptor and a decoder to obtain the final emotion-rich synthesized speech. Here, the pitch and duration required to synthesize correct speech are predicted by the variance adaptor; during training the predictors are guided by the true duration and pitch extracted from the real speech, and an MSE loss is optimized so that they learn to produce the correct pitch and duration, which can then be predicted correctly at inference time.
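A sketch of the variance-adaptor training objective described above, with toy pitch and duration predictors; the two-layer MLP architecture, the model dimension and the log-domain duration target are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class VariancePredictor(nn.Module):
    """Toy pitch or duration predictor inside the variance adaptor."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))

    def forward(self, h):                 # h: (T, d_model) style-conditioned hidden states
        return self.net(h).squeeze(-1)    # (T,)

pitch_predictor, duration_predictor = VariancePredictor(), VariancePredictor()

def variance_loss(hidden, true_pitch, true_duration):
    """MSE against pitch/duration extracted from the real speech during training."""
    loss_pitch = F.mse_loss(pitch_predictor(hidden), true_pitch)
    loss_duration = F.mse_loss(duration_predictor(hidden), torch.log1p(true_duration))  # log-domain durations
    return loss_pitch + loss_duration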
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of each process should be determined by its function and internal logic, and does not limit the implementation of the embodiments of the application.
In an embodiment, a speech synthesis apparatus based on a multi-scale style is provided, which corresponds one-to-one to the speech synthesis method based on a multi-scale style in the above embodiments. As shown in FIG. 7, the multi-scale style-based speech synthesis apparatus 10 includes an extraction module 101, a style analysis module 102, a style prediction module 103, a fusion module 104 and a synthesis module 105. The functional modules are described in detail as follows:
the extraction module 101 is used for extracting target audio and target text corresponding to the original voice;
the style analysis module 102 is configured to perform style analysis on the target audio to obtain a first style embedded vector;
the style prediction module 103 is configured to perform style prediction on the target text to obtain a second style embedded vector;
a fusion module 104, configured to fuse the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
a synthesis module 105, configured to synthesize a target speech based on the target style embedding vector.
In one embodiment, the style analysis module 102 is specifically configured to:
extracting a mel spectrum of the target audio as a local mel spectrum;
acquiring a Mel spectrum of the context voice of the target audio, and splicing the Mel spectrum of the context audio and the Mel spectrum of the target audio to obtain a global Mel spectrum;
extracting a mel spectrum of sub-audio divided according to a sub-word phoneme boundary in the target audio to be used as a segment mel spectrum;
respectively carrying out style coding on the global Mel spectrum, the local Mel spectrum and the segment Mel spectrum, and respectively inputting the coded style information into corresponding style label layers to obtain a global audio style vector, a local audio style vector and a segment audio style vector;
and obtaining the total emotion style variable of the target audio as a first style embedded vector according to the global audio style vector, the local audio style vector and the segment audio style vector.
In one embodiment, the style analysis module 102 is further specifically configured to:
performing style coding on the global Mel spectrum to obtain a global audio style as a first residual style, and performing style coding on the local Mel spectrum and the segment Mel spectrum respectively to obtain a local audio style and a segment audio style;
subtracting the global audio style from the local audio style to obtain a second residual style;
subtracting the local audio style from the segment audio style to obtain a third residual style;
and respectively inputting the first residual style, the second residual style and the third residual style into corresponding style label layers to obtain a global audio style vector, a local audio style vector and a segment audio style vector.
In one embodiment, the style prediction module 103 is specifically configured to:
extracting the semantics of the target text as a local semantic sequence;
connecting the context text of the target text with the spliced text of the target text, and extracting the semantics of the spliced text to obtain a global semantic sequence;
extracting semantic sequences of sub-word sets divided in the target text to serve as fragment semantic sequences;
carrying out style prediction on the global semantic sequence, the local semantic sequence and the fragment semantic sequence respectively to obtain a global text style vector, a local text style vector and a fragment text style vector;
and superposing the global text style vector, the local text style vector and the fragment text style vector to obtain the total emotion style variable of the target text as a second style embedded vector.
In an embodiment, the style prediction module 103 is further specifically configured to:
respectively inputting the global semantic sequence, the local semantic sequence and the fragment semantic sequence into a global style predictor, a local style predictor and a fragment style predictor to respectively obtain a global text style vector, a local text style vector and a fragment text style vector;
the global text style is used as the style condition constraint of the local style predictor, and the local text style is used as the style condition constraint of the segment style predictor.
In one embodiment, the fusion module 104 is specifically configured to:
and superposing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector.
It can be seen that, compared with conventional schemes, the embodiment of the application provides a speech synthesis apparatus based on a multi-scale style. A multi-scale style extraction and embedding method is provided that fully extracts the style and emotion of the synthesized speech data at different scales, and a context-aware multi-scale style prediction module introduces speech style analysis and prediction at different scales. This supports the emotional expression of the synthesized speech, improves the synthesis quality of emotional speech, and finally yields synthesized speech rich in emotion, solving the problems of strong machine feel and insufficient emotion in conventional speech synthesis schemes.
For specific limitations on the multi-scale style based speech synthesis apparatus, reference may be made to the above limitations on the multi-scale style based speech synthesis method, and no further description is given here. The above-described modules in the multi-scale style based speech synthesis apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the original speech and the resulting synthesized speech. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements a speech synthesis method based on a multi-scale style. Computer-readable storage media include volatile and/or non-volatile storage media.
In one embodiment, a computer device is provided comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of when executing the computer program:
extracting target audio and target text corresponding to the original voice;
performing style analysis on the target audio to obtain a first style embedded vector;
carrying out style prediction on the target text to obtain a second style embedded vector;
fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and synthesizing target voice based on the target style embedding vector.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor, performs the steps of:
extracting target audio and target text corresponding to the original voice;
performing style analysis on the target audio to obtain a first style embedded vector;
carrying out style prediction on the target text to obtain a second style embedded vector;
fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and synthesizing target voice based on the target style embedding vector.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A speech synthesis method based on a multi-scale style, comprising:
extracting target audio and target text corresponding to the original voice;
performing style analysis on the target audio to obtain a first style embedded vector;
carrying out style prediction on the target text to obtain a second style embedded vector;
fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and synthesizing target voice based on the target style embedding vector.
2. The method for synthesizing speech based on a multi-scale style according to claim 1, wherein performing style analysis on the target audio to obtain a first style embedded vector comprises:
extracting a mel spectrum of the target audio as a local mel spectrum;
acquiring a Mel spectrum of the context voice of the target audio, and splicing the Mel spectrum of the context audio and the Mel spectrum of the target audio to obtain a global Mel spectrum;
extracting a mel spectrum of sub-audio divided according to a sub-word phoneme boundary in the target audio to be used as a segment mel spectrum;
respectively carrying out style coding on the global Mel spectrum, the local Mel spectrum and the segment Mel spectrum, and respectively inputting the coded style information into corresponding style label layers to obtain a global audio style vector, a local audio style vector and a segment audio style vector;
and obtaining the total emotion style variable of the target audio as a first style embedded vector according to the global audio style vector, the local audio style vector and the segment audio style vector.
3. The method of speech synthesis according to claim 2, wherein the performing the style coding on the global mel spectrum, the local mel spectrum and the segment mel spectrum, and inputting the coded style information into the corresponding style tag layer, respectively, to obtain a global audio style vector, a local audio style vector and a segment audio style vector, includes:
performing style coding on the global Mel spectrum to obtain a global audio style as a first residual style, and performing style coding on the local Mel spectrum and the segment Mel spectrum respectively to obtain a local audio style and a segment audio style;
subtracting the global audio style from the local audio style to obtain a second residual style;
subtracting the local audio style from the segment audio style to obtain a third residual style;
and respectively inputting the first residual error style, the second residual error style and the third residual error style into corresponding style label layers to obtain a global audio style vector, a local audio style vector and a fragment audio style vector.
4. The method for synthesizing speech based on a multi-scale style according to claim 1, wherein performing style prediction on the target text to obtain a second style embedded vector comprises:
extracting the semantics of the target text as a local semantic sequence;
connecting the context text of the target text with the spliced text of the target text, and extracting the semantics of the spliced text to obtain a global semantic sequence;
extracting semantic sequences of sub-word sets divided in the target text to serve as fragment semantic sequences;
carrying out style prediction on the global semantic sequence, the local semantic sequence and the fragment semantic sequence respectively to obtain a global text style vector, a local text style vector and a fragment text style vector;
and superposing the global text style vector, the local text style vector and the fragment text style vector to obtain the total emotion style variable of the target text as a second style embedded vector.
5. The method of speech synthesis according to claim 4, wherein performing style prediction on the global semantic sequence, the local semantic sequence, and the segment semantic sequence to obtain a global text style vector, a local text style vector, and a segment text style vector, respectively, includes:
respectively inputting the global semantic sequence, the local semantic sequence and the fragment semantic sequence into a global style predictor, a local style predictor and a fragment style predictor to respectively obtain a global text style vector, a local text style vector and a fragment text style vector;
the global text style is used as the style condition constraint of the local style predictor, and the local text style is used as the style condition constraint of the segment style predictor.
6. The method of multi-scale style based speech synthesis according to any of claims 1-5, wherein fusing the first and second style embedded vectors to obtain a target style embedded vector comprises:
and superposing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector.
7. A speech synthesis apparatus based on a multi-scale style, comprising:
the extraction module is used for extracting target audio and target text corresponding to the original voice;
the style analysis module is used for carrying out style analysis on the target audio to obtain a first style embedded vector;
the style prediction module is used for performing style prediction on the target text to obtain a second style embedded vector;
the fusion module is used for fusing the first style embedded vector and the second style embedded vector to obtain a target style embedded vector;
and the synthesis module is used for synthesizing target voice based on the target style embedding vector.
8. The multi-scale style based speech synthesis apparatus of claim 7, wherein the style analysis module is specifically configured to:
extracting a mel spectrum of the target audio as a local mel spectrum;
acquiring a Mel spectrum of the context voice of the target audio, and splicing the Mel spectrum of the context audio and the Mel spectrum of the target audio to obtain a global Mel spectrum;
extracting a mel spectrum of sub-audio divided according to a sub-word phoneme boundary in the target audio to be used as a segment mel spectrum;
performing style coding on the global Mel spectrum, the local Mel spectrum and the segment Mel spectrum respectively to obtain a global audio style vector, a local audio style vector and a segment audio style vector;
and obtaining the total emotion style variable of the target audio as a first style embedded vector according to the global audio style vector, the local audio style vector and the segment audio style vector.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the multi-scale style based speech synthesis method according to any of claims 1 to 6 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the multi-scale style based speech synthesis method according to any of claims 1 to 6.
CN202310707136.5A 2023-06-15 2023-06-15 Speech synthesis method, device, equipment and medium based on multi-scale style Pending CN116597807A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310707136.5A CN116597807A (en) 2023-06-15 2023-06-15 Speech synthesis method, device, equipment and medium based on multi-scale style

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310707136.5A CN116597807A (en) 2023-06-15 2023-06-15 Speech synthesis method, device, equipment and medium based on multi-scale style

Publications (1)

Publication Number Publication Date
CN116597807A true CN116597807A (en) 2023-08-15

Family

ID=87590098

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310707136.5A Pending CN116597807A (en) 2023-06-15 2023-06-15 Speech synthesis method, device, equipment and medium based on multi-scale style

Country Status (1)

Country Link
CN (1) CN116597807A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117636842A (en) * 2024-01-23 2024-03-01 北京天翔睿翼科技有限公司 Voice synthesis system and method based on prosody emotion migration
CN117636842B (en) * 2024-01-23 2024-04-02 北京天翔睿翼科技有限公司 Voice synthesis system and method based on prosody emotion migration

Similar Documents

Publication Publication Date Title
Zhao et al. Automatic assessment of depression from speech via a hierarchical attention transfer network and attention autoencoders
CN110570876B (en) Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
WO2021127817A1 (en) Speech synthesis method, device, and apparatus for multilingual text, and storage medium
CN114245203B (en) Video editing method, device, equipment and medium based on script
CN112463942B (en) Text processing method, text processing device, electronic equipment and computer readable storage medium
Sterpu et al. How to teach DNNs to pay attention to the visual modality in speech recognition
CN113761841B (en) Method for converting text data into acoustic features
CN114091466B (en) Multimode emotion analysis method and system based on transducer and multitask learning
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN113450765B (en) Speech synthesis method, device, equipment and storage medium
US11322133B2 (en) Expressive text-to-speech utilizing contextual word-level style tokens
CN116597807A (en) Speech synthesis method, device, equipment and medium based on multi-scale style
CN114360502A (en) Processing method of voice recognition model, voice recognition method and device
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116092478A (en) Voice emotion conversion method, device, equipment and storage medium
CN114743539A (en) Speech synthesis method, apparatus, device and storage medium
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Barakat et al. Deep learning-based expressive speech synthesis: a systematic review of approaches, challenges, and resources
Zhang et al. Learning deep and wide contextual representations using BERT for statistical parametric speech synthesis
CN115240713A (en) Voice emotion recognition method and device based on multi-modal features and contrast learning
CN114358021A (en) Task type dialogue statement reply generation method based on deep learning and storage medium
CN118471266B (en) Pronunciation prediction method, pronunciation prediction device, electronic apparatus, and storage medium
Carson-Berndsen Multilingual time maps: portable phonotactic models for speech technology
CN114580389B (en) Chinese medical field causal relation extraction method integrating radical information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination