CN116469372A - Speech synthesis method, speech synthesis device, electronic device, and storage medium - Google Patents

Speech synthesis method, speech synthesis device, electronic device, and storage medium

Info

Publication number
CN116469372A
Authority
CN
China
Prior art keywords
vector
style
voice
embedded
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310632858.9A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
唐浩彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310632858.9A priority Critical patent/CN116469372A/en
Publication of CN116469372A publication Critical patent/CN116469372A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 Architecture of speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a speech synthesis method, a speech synthesis device, an electronic device and a storage medium, and belongs to the technical field of financial technology. The method comprises the following steps: acquiring target text data and reference voice data; vectorizing the reference voice data to obtain a reference embedded voice vector; extracting features of the target text data to obtain a target text representation vector; performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector; performing speech synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data; and performing spectral conversion on the synthesized spectrum data to obtain synthesized voice data. The embodiment of the application can improve the accuracy of speech synthesis.

Description

Speech synthesis method, speech synthesis device, electronic device, and storage medium
Technical Field
The present disclosure relates to the technical field of financial technology, and in particular, to a speech synthesis method, a speech synthesis device, an electronic device, and a storage medium.
Background
With the rapid development of artificial intelligence technology, intelligent voice interaction has been widely applied in fields such as finance, logistics and customer service, and functions such as intelligent marketing, intelligent collections and content navigation have improved the service level of enterprise customer service.
Currently, conversation robots are often adopted in financial service scenarios such as intelligent customer service and shopping guidance to provide corresponding service support for various objects. The conversational speech used by these conversation robots is usually generated by speech synthesis.
Most speech synthesis methods in the related art extract text content information from text data with convolutional neural networks and rely on the extracted text content information and a fixed prosody template to perform speech synthesis. This often results in synthesized speech data with poor emotional expressiveness and affects the accuracy of speech synthesis, so how to improve the accuracy of speech synthesis has become an urgent technical problem to be solved.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a speech synthesis method, a speech synthesis device, an electronic device and a storage medium, so as to improve the accuracy of speech synthesis.
To achieve the above object, a first aspect of an embodiment of the present application proposes a speech synthesis method, including:
acquiring target text data and reference voice data;
vectorizing the reference voice data to obtain a reference embedded voice vector;
extracting features of the target text data to obtain a target text representation vector;
performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;
performing voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;
and performing spectral conversion on the synthesized spectrum data to obtain synthesized voice data.
In some embodiments, the performing style marking on the reference embedded speech vector to obtain a target style embedded vector corresponding to the reference embedded speech vector includes:
acquiring a plurality of preset style tag vectors, wherein the style tag vectors are vector representations corresponding to the preset style tags;
performing attention calculation on each style tag vector and each reference embedded voice vector to obtain style similarity between each style tag vector and each reference embedded voice vector;
based on the style similarity, style tag weight of each style tag vector is obtained;
and carrying out weighted summation on the style tag vector based on the style tag weight to obtain the target style embedded vector.
In some embodiments, the performing attention computation on each of the style tag vector and the reference embedded speech vector to obtain a style similarity between each of the style tag vector and the reference embedded speech vector includes:
performing matrix multiplication on each style tag vector and the reference embedded voice vector to obtain a query vector, a key vector and a value vector corresponding to each style tag vector;
and performing attention calculation on the query vector, the key vector and the value vector based on a preset function to obtain the style similarity between the style tag vector and the reference embedded voice vector.
In some embodiments, the obtaining the style tag weight of each of the style tag vectors based on the plurality of style similarities includes:
summing the style similarities to obtain comprehensive style similarity;
and dividing the style similarity of each style tag vector by the comprehensive style similarity to obtain the style tag weight of that style tag vector.
In some embodiments, the performing speech synthesis based on the target style embedding vector and the target text representation vector to obtain synthesized spectrum data includes:
performing splicing processing on the target style embedded vector and the target text representation vector to obtain a combined embedded vector;
performing feature alignment on the combined embedded vector based on an attention mechanism to obtain a target phoneme vector;
and decoding the target phoneme vector to obtain the synthesized spectrum data.
In some embodiments, the feature alignment of the combined embedded vector based on the attention mechanism to obtain a target phoneme vector includes:
performing time prediction on the combined embedded vector based on a preset duration prediction model to obtain a phoneme duration;
and carrying out phoneme length adjustment on the combined embedded vector based on the attention mechanism and the phoneme duration to obtain the target phoneme vector.
In some embodiments, the performing spectral conversion on the synthesized spectrum data to obtain synthesized voice data includes:
inputting the synthesized spectrum data into a preset vocoder, wherein the vocoder comprises a deconvolution layer and a multi-receptive field fusion layer;
up-sampling the synthesized spectrum data based on the deconvolution layer to obtain target spectrum data;
and carrying out multi-scale feature fusion on the target spectrum data based on the multi-receptive field fusion layer to obtain the synthesized voice data.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech synthesis apparatus, the apparatus comprising:
the data acquisition module is used for acquiring target text data and reference voice data;
the vectorization module is used for vectorizing the reference voice data to obtain a reference embedded voice vector;
the feature extraction module is used for extracting features of the target text data to obtain a target text representation vector;
the style marking module is used for performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;
the voice synthesis module is used for carrying out voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;
and the spectrum conversion module is used for performing spectrum conversion on the synthesized spectrum data to obtain synthesized voice data.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the method described in the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of the first aspect.
The speech synthesis method, the speech synthesis device, the electronic device and the storage medium acquire target text data and reference voice data; vectorize the reference voice data to obtain a reference embedded voice vector; and extract features of the target text data to obtain a target text representation vector representing the text semantic content, so that the target text representation vector can be used for speech synthesis. Further, style marking is performed on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector, so that the prosodic style of the speech can be controlled and the target style embedded vector can be used for speech synthesis, improving the emotional accuracy of speech synthesis and the speech synthesis effect. Further, speech synthesis is performed based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data, which can better improve the data quality of the synthesized spectrum data. Finally, spectral conversion is performed on the synthesized spectrum data to obtain synthesized voice data, so that synthesized voice data in waveform form can be conveniently obtained. The synthesized voice data contains both the text content features of the target text data and the style features of the reference voice data, has good emotional expressiveness, and improves the accuracy of speech synthesis. Furthermore, in intelligent conversations about insurance products, financial products and the like, the synthesized speech expressed by a conversation robot can better fit the conversational style preference of the conversation object, and communication is conducted in a conversational mode and style that the conversation object finds more engaging, which improves conversation quality and effectiveness, enables intelligent voice conversation services, and improves service quality and customer satisfaction.
Drawings
FIG. 1 is a flow chart of a speech synthesis method provided in an embodiment of the present application;
FIG. 2 is a flowchart of step S104 in FIG. 1;
FIG. 3 is a flowchart of step S202 in FIG. 2;
FIG. 4 is a flowchart of step S203 in FIG. 2;
FIG. 5 is a flowchart of step S105 in FIG. 1;
FIG. 6 is a flowchart of step S502 in FIG. 5;
FIG. 7 is a flowchart of step S106 in FIG. 1;
FIG. 8 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of the hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (natural language processing, NLP): NLP is a branch of artificial intelligence that is an interdisciplinary field of computer science and linguistics, often referred to as computational linguistics, and deals with processing, understanding and applying human languages (e.g., Chinese, English, etc.). Natural language processing includes syntactic parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in technical fields such as machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Information extraction (Information Extraction, IE): a text processing technique that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units, such as sentences, paragraphs and chapters, and text information is made up of smaller specific units, such as words, phrases, sentences, paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and the like from text data is all text information extraction, and of course the information extracted by text information extraction techniques can be of various types.
Mel-frequency cepstral coefficients (Mel-Frequency Cepstral Coefficients, MFCC): a set of key coefficients used to build the mel-frequency cepstrum. From a segment of a music signal, a cepstrum sufficient to represent that signal is obtained, and the mel-frequency cepstral coefficients are the coefficients derived from this cepstrum. Unlike the ordinary cepstrum, the most distinctive feature of the mel-frequency cepstrum is that its frequency bands are evenly distributed on the mel scale, that is, such frequency bands are closer to the human nonlinear auditory system than the linearly spaced frequency bands of the ordinary cepstrum representation. For example, in audio compression techniques, the mel-frequency cepstrum is often used for processing.
Vocoder (vocoder): a speech analysis and synthesis system based on a model of the speech signal. Only the model parameters are used in transmission, and model parameter estimation and speech synthesis techniques are used in encoding and decoding. A vocoder is a codec that analyzes and synthesizes speech signals, and is also called a speech analysis-synthesis system or a speech band compression system.
Multi-head attention: multi-head attention uses multiple queries to compute, in parallel, multiple selections of information from the input, with each head focusing on a different part of the input information. Soft attention refers to the expectation of all input information under the attention distribution.
Embedding (embedding): an embedding is a vector representation, meaning that an object, which may be a word, a commodity, a movie, etc., is represented by a low-dimensional vector. The nature of an embedding vector is that objects corresponding to vectors that are close to each other can have similar meanings. An embedding is essentially a mapping from a semantic space to a vector space that preserves, as far as possible, the relationships of the original samples in the semantic space; for example, two words with similar semantics are located relatively close to each other in the vector space. Embeddings can encode an object with a low-dimensional vector while preserving its meaning. They are commonly applied in machine learning: in the construction of a machine learning model, an object is encoded into a low-dimensional dense vector and then passed to a DNN to improve efficiency.
Encoding (Encoder): converts the input sequence into a fixed-length vector.
Decoding (Decoder): converts the previously generated fixed-length vector into an output sequence, where the input sequence can be text, speech, images or video, and the output sequence can be text or images.
BERT (Bidirectional Encoder Representation from Transformers) model: the BERT model further improves the generalization capability of word vector models, fully capturing character-level, word-level, sentence-level and even inter-sentence relationship features, and is built on the Transformer architecture. There are three kinds of embeddings in BERT, namely token embeddings, segment embeddings and position embeddings. Token embeddings are the word vectors, and the first token is the CLS token, which can be used for subsequent classification tasks. Segment embeddings are used to distinguish two sentences, because pre-training involves not only language modeling but also classification tasks that take two sentences as input. The position embeddings here are not the trigonometric functions used in the Transformer; instead, BERT learns them through training: it directly trains a position embedding to retain position information by randomly initializing a vector at each position and adding it to model training, finally obtaining an embedding that contains position information, and direct concatenation is selected as the way to combine the position embeddings with the word embeddings.
Phoneme (Phone): the smallest speech unit divided according to the natural attributes of speech; it is analyzed according to the pronunciation actions within a syllable, and one action constitutes one phoneme.
Softmax function: the Softmax function is a normalized exponential function.
Speech synthesis refers to synthesizing intelligible, natural speech from text, and is also known as Text-To-Speech (TTS). Speech synthesis systems are widely used in everyday life in various contexts, including speech dialog systems; intelligent voice assistants; telephone information inquiry systems; vehicle navigation and auxiliary applications such as audio electronic books; language learning; real-time information broadcasting systems in airports, stations and the like; and information acquisition and communication for visually or speech impaired persons.
With the rapid development of artificial intelligence technology, intelligent voice interaction has been widely applied in fields such as finance, logistics and customer service, and functions such as intelligent marketing, intelligent collections and content navigation have improved the service level of enterprise customer service.
Currently, conversation robots are often adopted in financial service scenarios such as intelligent customer service and shopping guidance to provide corresponding service support for various objects. The conversational speech used by these conversation robots is usually generated by speech synthesis.
Taking an insurance service robot as an example, it is often necessary to fuse the descriptive text of an insurance product with the speaking style of a fixed object to generate descriptive speech of the insurance product in that object's voice. When the insurance service robot talks with interested objects, the descriptive speech is automatically invoked to introduce the insurance product to them.
Most current speech synthesis methods extract text content information from text data with convolutional neural networks and rely on the extracted text content information and a fixed prosody template to perform speech synthesis. This often results in synthesized speech data with poor emotional expressiveness and affects the accuracy of speech synthesis, so how to improve the accuracy of speech synthesis has become an urgent technical problem to be solved.
Based on this, the embodiments of the present application provide a speech synthesis method, a speech synthesis device, an electronic device and a storage medium, aiming at improving the accuracy of speech synthesis.
The speech synthesis method, the speech synthesis device, the electronic apparatus and the storage medium provided in the embodiments of the present application are specifically described by the following embodiments, and the speech synthesis method in the embodiments of the present application is first described.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a speech synthesis method, which relates to the technical field of artificial intelligence. The speech synthesis method provided by the embodiment of the application can be applied to a terminal, a server, or software running in the terminal or the server. In some embodiments, the terminal may be a smart phone, a tablet, a notebook, a desktop computer, etc.; the server may be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, big data and artificial intelligence platforms; the software may be an application implementing the speech synthesis method, but is not limited to the above forms.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to data related to user identity or characteristics, such as user information, user voice data, user behavior data, user history data, and user location information, the permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data all comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
Fig. 1 is an optional flowchart of a speech synthesis method provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S106.
Step S101, acquiring target text data and reference voice data;
step S102, vectorizing the reference voice data to obtain a reference embedded voice vector;
step S103, extracting features of the target text data to obtain a target text representation vector;
step S104, performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;
step S105, performing speech synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;
step S106, performing frequency spectrum conversion on the synthesized frequency spectrum data to obtain synthesized voice data.
In steps S101 to S106 illustrated in the embodiment of the present application, target text data and reference voice data are acquired; the reference voice data is vectorized to obtain a reference embedded voice vector; and features of the target text data are extracted to obtain a target text representation vector representing the text semantic content, so that the target text representation vector can be used for speech synthesis. Further, style marking is performed on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector, so that the prosodic style of the speech can be controlled and the target style embedded vector can be used for speech synthesis, improving the emotional accuracy of speech synthesis and the speech synthesis effect. Further, speech synthesis is performed based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data, which can better improve the data quality of the synthesized spectrum data. Finally, spectral conversion is performed on the synthesized spectrum data to obtain synthesized voice data, so that synthesized voice data in waveform form can be conveniently obtained; the synthesized voice data contains both the text content features of the target text data and the style features of the reference voice data, has good emotional expressiveness, and improves the accuracy of speech synthesis.
In step S101 of some embodiments, the target text data may be acquired from a public data set, or the target text data to be processed may be acquired from an existing text database, a network platform, or the like, without limitation. For example, the public data set may be the THCHS-30 data set or the LJSpeech data set.
It should be noted that the target text data may be text data containing proper nouns in the financial domain, wording from financial business templates, product descriptions of insurance products, product descriptions of financial products, and common conversational phrases in the financial domain.
Meanwhile, after a data source is set, data can be crawled in a targeted manner by writing a web crawler, so that the reference speech data of a reference speaking object can be obtained. The data source may be various types of network platforms or social media, or certain specific audio databases; the reference speaking object may be a network user, a lecturer, a singer, or the like; and the reference speech data may be singing material, a lecture report, a chat conversation, or the like, of the reference speaking object. The reference speech data may be an audio signal composed of phonemes, where a phoneme is the smallest unit or smallest speech fragment constituting a syllable. In this way, the reference voice data and the target text data can be conveniently acquired, improving the efficiency of data acquisition.
In step S102 of some embodiments, the reference speech data may be input into a preset reference encoder, and the reference encoder compresses the variable-length prosodic audio signal of the reference speech data into a fixed-length speech vector, thereby vectorizing the reference speech data and obtaining the reference embedded speech vector, which contains the speech prosodic feature information of the reference speech data. In this way, the reference speech data can be conveniently converted from a time-domain signal into frequency-domain features, and the prosodic feature information in the reference speech data can be extracted, so that this prosodic feature information can be integrated into the synthesized speech in the subsequent speech synthesis process, improving the accuracy of speech synthesis.
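For illustration only, the following is a minimal sketch, in PyTorch, of one way such a reference encoder could be realised; the class name ReferenceEncoder, the layer sizes and the mel-spectrogram input format are assumptions of this sketch and are not prescribed by the embodiment.

```python
# Hypothetical sketch of a reference encoder that compresses variable-length
# reference speech (as a mel-spectrogram) into a fixed-length embedded vector.
# Layer sizes and names are illustrative assumptions, not part of this embodiment.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels=80, conv_channels=128, embed_dim=256):
        super().__init__()
        # 1-D convolutions over time extract local prosodic features
        self.convs = nn.Sequential(
            nn.Conv1d(n_mels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(conv_channels, conv_channels, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # a GRU summarises the whole utterance; its final state serves as the
        # fixed-length reference embedded speech vector
        self.gru = nn.GRU(conv_channels, embed_dim, batch_first=True)

    def forward(self, mel):              # mel: (batch, n_mels, frames)
        h = self.convs(mel)              # (batch, conv_channels, frames)
        h = h.transpose(1, 2)            # (batch, frames, conv_channels)
        _, last = self.gru(h)            # last: (1, batch, embed_dim)
        return last.squeeze(0)           # (batch, embed_dim)

# usage sketch: a roughly 3-second clip at ~80 frames per second
ref_mel = torch.randn(1, 80, 240)
ref_embedding = ReferenceEncoder()(ref_mel)   # shape: (1, 256)
```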
In step S103 of some embodiments, the target text data may be input into a preset text model, which may be a text encoding model such as a BERT model, without limitation. Features of the target text data are extracted by the preset text model to obtain a text embedded representation for each text character in the target text data, and all text embedded representations corresponding to the target text data are spliced to obtain the target text representation vector. For example, the text model first performs word segmentation on the target text data to obtain a plurality of text characters, then the text embedded representation of each text character is looked up in a preset character dictionary, and all text embedded representations are connected into a vector to obtain the target text representation vector corresponding to the target text data; the target text representation vector contains the phoneme features corresponding to the target text data, so it can be used to represent the text semantic content of the target text data. In this way, the conversion of the target text data from data space to vector space can be realized accurately, and a target text representation vector representing the text semantic content is obtained, so that the target text representation vector can be used for speech synthesis to obtain the synthesized speech corresponding to the target text data, improving the accuracy of speech synthesis.
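As a rough illustration of the character-dictionary lookup and splicing described above, the sketch below builds a toy character dictionary and concatenates the per-character embeddings; the dictionary contents, the embedding dimension and the direct concatenation into a single vector are assumptions of this sketch, not requirements of the embodiment (a practical text model such as BERT would be far richer).

```python
# Hypothetical sketch of turning target text into a target text representation
# vector by looking up per-character embeddings and splicing them together.
# The character dictionary and embedding size are illustrative assumptions.
import torch
import torch.nn as nn

char_dict = {ch: idx for idx, ch in enumerate("abcdefghijklmnopqrstuvwxyz ,.")}
embed = nn.Embedding(num_embeddings=len(char_dict), embedding_dim=64)

def text_representation(text: str) -> torch.Tensor:
    ids = torch.tensor([char_dict[ch] for ch in text.lower() if ch in char_dict])
    char_embeddings = embed(ids)            # (num_chars, 64), one row per character
    # vector connection (splicing) of all text embedded representations
    return char_embeddings.reshape(-1)      # single target text representation vector

vec = text_representation("this product offers flexible coverage.")
print(vec.shape)
```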
Referring to fig. 2, in some embodiments, step S104 may include, but is not limited to, steps S201 to S204:
step S201, a plurality of preset style tag vectors are obtained, wherein the style tag vectors are vector representations corresponding to the preset style tags;
step S202, performing attention calculation on each style tag vector and the reference embedded voice vector to obtain style similarity between each style tag vector and the reference embedded voice vector;
step S203, based on the multiple style similarities, the style tag weight of each style tag vector is obtained;
and step S204, carrying out weighted summation on the style tag vectors based on the style tag weights to obtain target style embedded vectors.
In step S201 of some embodiments, a plurality of preset style tags may be extracted from a preset tag database, and the extracted style tags may be encoded to obtain a vector representation of each style tag, thereby obtaining a plurality of style tag vectors. The style tags cover different prosodic styles; for example, the style tags may include prosodic style types such as a deep voice, a high-pitched voice, and a hurried voice, without limitation, and the preset tag database can be constructed by reasoning from expert experience and the like.
In step S202 of some embodiments, a multi-head attention mechanism may be introduced to calculate the style similarity between each style tag vector and the reference embedded speech vector. The multi-head attention mechanism projects the reference embedded speech vector onto the different style tag vectors, so as to obtain the proportion of the reference embedded speech vector on each style tag vector, and the style tag weight of each style tag vector is determined according to these proportions. For example, matrix multiplication may be performed on each style tag vector and the reference embedded speech vector, and attention calculation may be performed on the query vector, the key vector and the value vector corresponding to each style tag vector based on a preset function (such as a softmax function), so as to obtain the style similarity between each style tag vector and the reference embedded speech vector.
In step S203 of some embodiments, the multiple style similarities may be summed to obtain a comprehensive style similarity, and then the style similarity of each style tag vector is divided by the comprehensive style similarity to obtain the style tag weight of that style tag vector.
In step S204 of some embodiments, the style tag vectors may be weighted and summed according to the style tag weights, so as to implement style marking on the reference embedded speech vector and obtain the target style embedded vector corresponding to the reference embedded speech vector. For example, if the style tag vectors include a style tag vector A, a style tag vector B and a style tag vector C, and the weights of the three style tag vectors are respectively 0.1, 0.3 and 0.6, the target style embedded vector M is M = 0.1×A + 0.3×B + 0.6×C. The target style embedded vector contains the prosodic style information of the reference speech data.
The above steps S201 to S204 can relatively conveniently determine the proportion of the reference embedded speech vector on the different style tag vectors, determine the style tag weight of each style tag vector according to that proportion, and generate a weighted style embedded representation of the reference embedded speech vector based on the different style tag weights, so that this weighted representation of the speech prosody space features can be used to control the speech prosody, and the target style embedded vector can be used for speech synthesis, improving the emotional accuracy of speech synthesis and the speech synthesis effect.
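The weighted summation of step S204 can be illustrated directly with the numeric example above (weights 0.1, 0.3 and 0.6 for style tag vectors A, B and C); the vector dimension in this sketch is an arbitrary assumption.

```python
# Hypothetical numeric sketch mirroring the example in the text: three style
# tag vectors A, B, C with weights 0.1, 0.3 and 0.6 give the target style
# embedded vector M = 0.1*A + 0.3*B + 0.6*C. The dimension 256 is illustrative.
import torch

A, B, C = (torch.randn(256) for _ in range(3))
M = 0.1 * A + 0.3 * B + 0.6 * C                      # target style embedded vector

# equivalently, as a weighted sum over a stack of style tag vectors
weights = torch.tensor([0.1, 0.3, 0.6])
M_alt = weights @ torch.stack([A, B, C])             # (256,)
print(torch.allclose(M, M_alt, atol=1e-6))           # True, up to float rounding
```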
Referring to fig. 3, in some embodiments, step S202 may include, but is not limited to, steps S301 to S302:
step S301, performing matrix multiplication on each style tag vector and the reference embedded voice vector to obtain a query vector, a key vector and a value vector corresponding to each style tag vector;
and step S302, performing attention calculation on the query vector, the key vector and the value vector based on a preset function to obtain the style similarity between the style tag vector and the reference embedded voice vector.
In step S301 of some embodiments, matrix multiplication may be performed on each style tag vector and the reference embedded speech vector by using a multi-head attention mechanism, so as to calculate a query vector, a key vector, and a value vector of each style tag vector, where the key vector may be denoted as K = N×W1, the value vector may be denoted as V = N×W2, and the query vector may be denoted as Q = N×W3, where N is the style tag vector, and W1, W2, and W3 are trainable parameters.
In step S302 of some embodiments, the preset function may be an activation function such as a softmax function. Taking the softmax function as an example, attention calculation is performed on the query vector, the key vector and the value vector through the softmax function to obtain the style similarity Z between the style tag vector and the reference embedded speech vector, where the style similarity Z may be represented as shown in formula (1), d is the feature dimension of the style tag vector, and T denotes the transpose of the key vector K:

Z = softmax(Q·K^T / √d)·V (1)
the step S301 to the step S302 can determine the proximity degree of the reference embedded voice vector and different style label vectors more conveniently, and the style similarity is obtained, so that the prosody style deviation of the reference embedded voice vector can be judged based on the style similarity, and the accuracy of prosody style configuration is improved.
Referring to fig. 4, in some embodiments, step S203 may include, but is not limited to, steps S401 to S402:
step S401, summing the plurality of style similarities to obtain a comprehensive style similarity;
step S402, dividing the style similarity of each style tag vector by the comprehensive style similarity to obtain the style tag weight of that style tag vector.
In step S401 of some embodiments, a sum function or other statistical function or statistical tool may be used to sum the multiple style similarities to obtain a comprehensive style similarity.
In step S402 of some embodiments, the style similarity of each style tag vector may be divided by the comprehensive style similarity to obtain the style tag weight of that style tag vector.
For example, if the style tag vectors include a style tag vector A, a style tag vector B and a style tag vector C, and the style similarities between the three style tag vectors and the reference embedded speech vector are 0.37, 0.66 and 0.8, the style similarities are summed to obtain the comprehensive style similarity, that is, 0.37 + 0.66 + 0.8 = 1.83. The style tag weight of the style tag vector A is 0.37/1.83 ≈ 0.2, the style tag weight of the style tag vector B is 0.66/1.83 ≈ 0.36, and the style tag weight of the style tag vector C is 0.8/1.83 ≈ 0.44.
The above steps S401 to S402 can relatively conveniently determine the proportion of the reference embedded speech vector on the different style tag vectors, determine the style tag weight of each style tag vector according to that proportion, and generate the weighted style embedded representation of the reference embedded speech vector based on the different style tag weights, thereby improving the accuracy of prosody control in speech synthesis.
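The division-based weighting of steps S401 to S402 can be reproduced with the numbers from the example above; this is a plain illustration of the arithmetic, nothing more.

```python
# Hypothetical sketch of steps S401-S402 using the numbers from the text:
# style similarities 0.37, 0.66 and 0.8 sum to 1.83, and each is divided by
# that total, giving style tag weights of roughly 0.2, 0.36 and 0.44.
similarities = {"A": 0.37, "B": 0.66, "C": 0.8}

total = sum(similarities.values())                       # comprehensive style similarity = 1.83
weights = {tag: s / total for tag, s in similarities.items()}

print(total)     # 1.83
print(weights)   # {'A': ~0.202, 'B': ~0.361, 'C': ~0.437}
```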
Referring to fig. 5, in some embodiments, step S105 may include, but is not limited to, steps S501 to S503:
step S501, performing splicing processing on the target style embedded vector and the target text representation vector to obtain a combined embedded vector;
step S502, performing feature alignment on the combined embedded vector based on an attention mechanism to obtain a target phoneme vector;
in step S503, decoding processing is performed on the target phoneme vector to obtain synthesized spectrum data.
In step S501 of some embodiments, when performing the splicing processing on the target style embedded vector and the target text representation vector, the target style embedded vector and the target text representation vector may be directly connected to obtain the combined embedded vector.
In step S502 of some embodiments, a preset duration prediction model may first be used to perform temporal prediction on the combined embedded vector to obtain a phoneme duration. Then, the phoneme length of the combined embedded vector is adjusted based on the attention mechanism and the phoneme duration to obtain the target phoneme vector. It should be noted that this process mainly achieves the alignment of phonemes with mel-cepstrum frames: since the length of a phoneme sequence is usually smaller than the length of its mel-cepstrum sequence, it is necessary to calculate, for each phoneme, the length of the mel-cepstrum sequence aligned to it, and this length is the phoneme duration. Based on a length regulator and the phoneme durations, the phoneme sequence can be tiled relatively conveniently so that it matches the length of the mel-cepstrum sequence.
In other embodiments, the phoneme durations may be lengthened or shortened in equal proportion to control the speaking speed of the synthesized speech during speech synthesis. Alternatively, the prosody of the synthesized speech may be adjusted by adjusting the durations of the space characters in the combined embedded vector, so as to control the pause duration between words in the synthesized speech.
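A minimal sketch of this kind of duration-based speed and pause control is shown below; the phoneme list, the frame counts and the <space> pause token are purely illustrative assumptions.

```python
# Hypothetical sketch of the speed and pause control described above: scaling
# all predicted phoneme durations changes the speaking rate, while scaling only
# the durations of space (pause) characters changes the pauses between words.
phonemes  = ["h", "e", "l", "o", "<space>", "w", "r", "l", "d"]
durations = [ 3,   4,   3,   5,     6,       4,   3,   4,   5 ]   # in mel frames

def adjust(durations, phonemes, speed=1.0, pause_scale=1.0):
    out = []
    for ph, d in zip(phonemes, durations):
        scale = pause_scale if ph == "<space>" else 1.0
        out.append(max(1, round(d / speed * scale)))
    return out

faster        = adjust(durations, phonemes, speed=1.25)       # shorter durations -> faster speech
longer_pauses = adjust(durations, phonemes, pause_scale=1.5)  # longer pauses between words
```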
In step S503 of some embodiments, the target phoneme vector may be converted into a form of mel-cepstral sequence by performing a decoding process on the target phoneme vector by a decoder, to obtain synthesized spectrum data.
Through the steps S501 to S503, the speech synthesis can be performed based on the text content features of the target text data and the prosodic style features of the reference speech data, so that the synthesized spectrum data contains the text content information and the prosodic style features meeting the requirements, and the data quality of the synthesized spectrum data can be better improved, thereby synthesizing the speech data containing the target prosody, and improving the accuracy of the speech synthesis.
Referring to fig. 6, in some embodiments, step S502 includes, but is not limited to, steps S601 to S602:
step S601, performing time prediction on the combined embedded vector based on a preset duration prediction model to obtain a phoneme duration;
Step S602, performing phoneme length adjustment on the combined embedded vector based on the attention mechanism and the phoneme duration to obtain a target phoneme vector.
In step S601 of some embodiments, the preset duration prediction model includes a two-layer one-dimensional convolutional network for extracting the spectral-temporal feature information in the combined embedded vector and a linear layer that outputs a scalar to predict the duration of each phoneme. The duration prediction model may use a mean squared error (MSE) function as the loss function. A prediction function (such as a softmax function or a sigmoid function) in the linear layer predicts, from the spectral-temporal features, the corresponding time length, and this time length is taken as the phoneme duration.
In step S602 of some embodiments, when the phoneme length of the combined embedded vector is adjusted based on the attention mechanism and the phoneme duration, the phoneme duration is used as the length for aligning the combined embedded vector to the mel-cepstrum sequence, and the phoneme sequence in the combined embedded vector is tiled by using the attention mechanism, so that the phoneme sequence of the combined embedded vector is consistent with the length of the mel-cepstrum sequence, and the target phoneme vector is obtained.
Through the steps S601 to S602, the alignment of the phonemes and frames of the combined embedded vector can be conveniently realized, so that the phoneme sequence of the combined embedded vector matches the length of the mel-cepstrum sequence to be synthesized, improving the prosodic soundness and the speech quality of the synthesized speech.
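As a rough sketch of steps S601 to S602, the snippet below pairs a two-convolution-plus-linear duration predictor with a simple length regulator that tiles each phoneme vector by its predicted duration; the layer widths, the rounding and clamping of durations, and the class and function names are assumptions of this sketch, not the exact construction of the embodiment.

```python
# Hypothetical sketch of steps S601-S602: a duration predictor made of two
# 1-D convolution layers plus a linear layer (which would be trained with an
# MSE loss, as in the text) predicts a duration per phoneme, and a length
# regulator tiles each phoneme vector by its duration so the sequence matches
# the mel-cepstrum length. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, dim=256, hidden=256):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.linear = nn.Linear(hidden, 1)               # outputs a scalar per phoneme

    def forward(self, x):                                # x: (batch, phonemes, dim)
        h = torch.relu(self.conv1(x.transpose(1, 2)))
        h = torch.relu(self.conv2(h)).transpose(1, 2)
        return torch.relu(self.linear(h)).squeeze(-1)    # non-negative durations

def length_regulate(x, durations):
    # repeat each phoneme vector according to its (rounded) duration in frames
    reps = durations.round().long().clamp(min=1)
    tiled = [x[0, i].repeat(int(reps[0, i]), 1) for i in range(x.size(1))]
    return torch.cat(tiled).unsqueeze(0)                 # (1, total_frames, dim)

combined = torch.randn(1, 12, 256)                       # combined embedded vector (12 phonemes)
durs = DurationPredictor()(combined)                     # predicted phoneme durations
aligned = length_regulate(combined, durs)                # frame-aligned target phoneme vector
```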
Referring to fig. 7, in some embodiments, step S106 may include, but is not limited to, steps S701 to S703:
step S701, inputting synthesized spectrum data into a preset vocoder, wherein the vocoder comprises a deconvolution layer and a multi-receptive field fusion layer;
step S702, up-sampling processing is carried out on synthetic spectrum data based on a deconvolution layer, so as to obtain target spectrum data;
step S703, performing multi-scale feature fusion on the target spectrum data based on the multi-receptive field fusion layer to obtain synthesized speech data.
In step S701 of some embodiments, the synthesized spectrum data may be input into a preset vocoder by using a preset computer program or script, where the vocoder may be HiFi-GAN, MelGAN, or the like; the vocoder includes a deconvolution layer and a multi-receptive-field fusion layer and is used to convert the synthesized spectrum data in the form of a mel-cepstrum sequence into synthesized speech data in the form of a waveform.
In step S702 of some embodiments, up-sampling processing is performed on the synthesized spectrum data based on the deconvolution layer, so as to implement convolution transposition of the synthesized spectrum data, and obtain target spectrum data with richer spectrum feature content.
In step S703 of some embodiments, the multi-receptive field fusion layer includes a plurality of residual blocks, when multi-scale feature fusion is performed on the target spectrum data based on the multi-receptive field fusion layer, feature reconstruction may be performed on the target spectrum data by using each residual block to obtain a plurality of scale speech waveform features, and the speech waveform features of all scales are fused to obtain synthesized speech data.
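For illustration, the sketch below shows a drastically reduced vocoder in the spirit of steps S701 to S703, with a transposed-convolution upsampling layer followed by a multi-receptive-field fusion of dilated residual blocks (in the style of HiFi-GAN); all kernel sizes, channel counts and the single upsampling stage are assumptions of this sketch rather than the actual vocoder configuration.

```python
# Hypothetical mini-vocoder sketch: a deconvolution (transposed convolution)
# layer upsamples the synthesized spectrum data, and a multi-receptive-field
# fusion of residual blocks with different dilations yields the waveform.
# All sizes are illustrative assumptions, not the embodiment's architecture.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)

    def forward(self, x):
        return x + self.conv(torch.relu(x))

class MiniVocoder(nn.Module):
    def __init__(self, n_mels=80, channels=64):
        super().__init__()
        # deconvolution layer: upsample the spectrum in time
        self.upsample = nn.ConvTranspose1d(n_mels, channels, kernel_size=16,
                                           stride=8, padding=4)
        # multi-receptive-field fusion layer: residual blocks with different dilations
        self.mrf = nn.ModuleList([ResBlock(channels, d) for d in (1, 3, 5)])
        self.out = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, mel):                        # mel: (batch, n_mels, frames)
        h = self.upsample(mel)                     # upsampled target spectrum data
        h = sum(block(h) for block in self.mrf) / len(self.mrf)   # multi-scale fusion
        return torch.tanh(self.out(h))             # synthesized waveform in [-1, 1]

wave = MiniVocoder()(torch.randn(1, 80, 200))      # (1, 1, ~1600) waveform samples
```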
In one specific example, the synthesized speech data is descriptive speech about an insurance product or financial product that carries the speaking style and speaking emotion of a certain animated character. Through the synthesized speech data, the conversation robot can attract potential objects with the distinctive speaking style and emotion of the animated character, making them more interested in the insurance products or financial products recommended by the synthesized speech data.
The above steps S701 to S703 enable the synthesized target speech data to simultaneously contain the text content feature of the target text data and the prosody feature of the reference speech data, thereby effectively improving the accuracy of speech synthesis.
According to the speech synthesis method, target text data and reference voice data are acquired; the reference voice data is vectorized to obtain a reference embedded voice vector; and features of the target text data are extracted to obtain a target text representation vector representing the text semantic content, so that the target text representation vector can be used for speech synthesis. Further, style marking is performed on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector, so that the prosodic style of the speech can be controlled and the target style embedded vector can be used for speech synthesis, improving the emotional accuracy of speech synthesis and the speech synthesis effect. Further, speech synthesis is performed based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data, which can better improve the data quality of the synthesized spectrum data. Finally, spectral conversion is performed on the synthesized spectrum data to obtain synthesized voice data, so that synthesized voice data in waveform form can be conveniently obtained; the synthesized voice data contains both the text content features of the target text data and the style features of the reference voice data, has good emotional expressiveness, and improves the accuracy of speech synthesis. Furthermore, in intelligent conversations about insurance products, financial products and the like, the synthesized speech expressed by a conversation robot can better fit the conversational style preference of the conversation object, and communication is conducted in a conversational mode and style that the conversation object finds more engaging, which improves conversation quality and effectiveness, enables intelligent voice conversation services, improves service quality and customer satisfaction, and increases the transaction success rate of insurance products and financial products.
Referring to fig. 8, an embodiment of the present application further provides a speech synthesis apparatus, which may implement the above speech synthesis method, where the apparatus includes:
a data acquisition module 801, configured to acquire target text data and reference voice data;
the vectorization module 802 is configured to perform vectorization processing on the reference voice data to obtain a reference embedded voice vector;
the feature extraction module 803 is configured to perform feature extraction on the target text data to obtain a target text representation vector;
the style marking module 804 is configured to perform style marking on the reference embedded speech vector to obtain a target style embedded vector corresponding to the reference embedded speech vector;
the speech synthesis module 805 is configured to perform speech synthesis based on the target style embedding vector and the target text representation vector, to obtain synthesized spectrum data;
the spectrum conversion module 806 is configured to perform spectrum conversion on the synthesized spectrum data to obtain synthesized speech data.
The specific implementation of the speech synthesis apparatus is substantially the same as the specific embodiment of the speech synthesis method described above, and will not be described herein.
The embodiment of the application also provides an electronic device, which includes: a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for implementing connection and communication between the processor and the memory, where the program, when executed by the processor, implements the above speech synthesis method. The electronic device may be any intelligent terminal, including a tablet computer, a vehicle-mounted computer, and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901, which may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 902 and invoked by the processor 901 to perform the speech synthesis method of the embodiments of the present application;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between this device and other devices, either in a wired manner (such as USB or a network cable) or in a wireless manner (such as a mobile network, Wi-Fi, or Bluetooth);
A bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to realize the voice synthesis method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the application provide a speech synthesis method, a speech synthesis apparatus, an electronic device and a computer readable storage medium. Target text data and reference voice data are acquired; the reference voice data is vectorized to obtain a reference embedded voice vector; feature extraction is performed on the target text data to obtain a target text representation vector that characterizes the semantic content of the text; style marking is performed on the reference embedded voice vector to obtain a target style embedded vector that controls the prosodic style of the voice; speech synthesis is performed based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data; and spectrum conversion is performed on the synthesized spectrum data to obtain synthesized voice data in waveform form. The synthesized voice data carries both the textual content features of the target text data and the style features of the reference voice data, which improves the accuracy and emotional expressiveness of speech synthesis. In intelligent conversation scenarios such as security products and financial products, the synthesized voice produced by a conversation robot can therefore better match the conversational style preferences of the conversation object, improving conversation quality and effectiveness, enabling intelligent voice conversation services, raising service quality and customer satisfaction, and increasing the transaction success rate of such products.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-7 do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions refer to any combination of the listed items, including any combination of a single item or plural items. For example, "at least one of a, b or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not thereby limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring target text data and reference voice data;
vectorizing the reference voice data to obtain a reference embedded voice vector;
extracting features of the target text data to obtain a target text representation vector;
performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;
performing voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;
and performing spectrum conversion on the synthesized spectrum data to obtain synthesized voice data.
2. The method of claim 1, wherein the performing style marking on the reference embedded speech vector to obtain a target style embedded vector corresponding to the reference embedded speech vector comprises:
acquiring a plurality of preset style tag vectors, wherein the style tag vectors are vector representations corresponding to the preset style tags;
performing attention calculation on each style tag vector and each reference embedded voice vector to obtain style similarity between each style tag vector and each reference embedded voice vector;
obtaining a style tag weight of each of the style tag vectors based on a plurality of the style similarities;
and carrying out weighted summation on the style tag vector based on the style tag weight to obtain the target style embedded vector.
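As a non-authoritative illustration of the style marking recited in claim 2, the sketch below treats the preset style tag vectors as rows of a matrix, scores each one against the reference embedded voice vector, normalizes the scores into style tag weights, and returns their weighted sum. The scaled dot product and the exponential are assumptions; the claim only requires an attention-based similarity and a weighting.

import numpy as np

def style_marking(ref_embed: np.ndarray, style_tags: np.ndarray) -> np.ndarray:
    # ref_embed: (d,) reference embedded voice vector; style_tags: (K, d) preset style tag vectors.
    scores = style_tags @ ref_embed / np.sqrt(ref_embed.shape[-1])  # attention score per style tag
    sims = np.exp(scores - scores.max())   # style similarities, kept positive and numerically stable
    weights = sims / sims.sum()            # style tag weights (similarity divided by the total, cf. claim 4)
    return weights @ style_tags            # weighted summation -> target style embedded vector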
3. The method of speech synthesis according to claim 2, wherein performing attention computation on each of the style tag vectors and the reference embedded speech vectors to obtain style similarities between each of the style tag vectors and the reference embedded speech vectors comprises:
performing matrix multiplication on each style tag vector and the reference embedded voice vector to obtain a query vector, a key vector and a value vector corresponding to each style tag vector;
and performing attention calculation on the query vector, the key vector and the value vector based on a preset function to obtain the style similarity between the style tag vector and the reference embedded voice vector.
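A hedged sketch of one common realization of the attention computation in claim 3: the reference embedded voice vector is projected to a query, each style tag vector to a key, and a scaled dot product followed by a softmax serves as the preset function. The projection matrices w_q and w_k, and the choice of which side supplies the query, are assumptions not fixed by the claim; a value projection is omitted here because claim 2 already uses the style tag vectors themselves in the weighted sum.

import numpy as np

def style_similarity(style_tags: np.ndarray, ref_embed: np.ndarray,
                     w_q: np.ndarray, w_k: np.ndarray) -> np.ndarray:
    # style_tags: (K, d); ref_embed: (d,); w_q, w_k: (d, d_att) learned projections (matrix multiplication).
    q = ref_embed @ w_q                     # query vector from the reference embedded voice vector
    k = style_tags @ w_k                    # one key vector per style tag vector
    scores = k @ q / np.sqrt(w_q.shape[1])  # preset function: scaled dot product
    e = np.exp(scores - scores.max())
    return e / e.sum()                      # style similarity of each style tag vector (softmax)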
4. The method of claim 2, wherein the deriving style tag weights for each of the style tag vectors based on the plurality of style similarities comprises:
summing the style similarities to obtain a comprehensive style similarity;
and dividing the style similarity of each style tag vector and the comprehensive style similarity to obtain the style tag weight of the style tag vector.
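A small worked example of the normalization in claim 4 (the numbers are invented purely for illustration):

sims = [0.5, 1.5, 2.0]               # style similarities for three style tag vectors
total = sum(sims)                    # comprehensive style similarity = 4.0
weights = [s / total for s in sims]  # style tag weights: [0.125, 0.375, 0.5], summing to 1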
5. The method according to claim 1, wherein the performing speech synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data comprises:
performing splicing processing on the target style embedded vector and the target text representation vector to obtain a combined embedded vector;
performing feature alignment on the combined embedded vector based on an attention mechanism to obtain a target phoneme vector;
and decoding the target phoneme vector to obtain the synthesized spectrum data.
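The splice-align-decode flow of claim 5 might look as follows; this is a sketch under the assumption that the style embedding is broadcast to every text position before concatenation, and align_fn and decoder are placeholders for the attention-based aligner and the spectrum decoder.

import torch

def synthesize_spectrum(style_embed, text_repr, align_fn, decoder):
    # style_embed: (d_s,) target style embedded vector; text_repr: (T, d_t) target text representation vector.
    style = style_embed.unsqueeze(0).expand(text_repr.size(0), -1)
    combined = torch.cat([text_repr, style], dim=-1)  # splicing -> combined embedded vector
    phonemes = align_fn(combined)                     # feature alignment -> target phoneme vector
    return decoder(phonemes)                          # decoding -> synthesized spectrum data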
6. The method of claim 5, wherein feature alignment of the combined embedded vectors based on an attention mechanism to obtain a target phoneme vector comprises:
performing time prediction on the combined embedded vector based on a preset duration prediction model to obtain a phoneme duration;
and carrying out phoneme length adjustment on the combined embedded vector based on the attention mechanism and the phoneme duration to obtain the target phoneme vector.
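The duration-based alignment of claim 6 resembles a length regulator: a duration model predicts how many spectrum frames each position should occupy, and the combined embedded vector is expanded accordingly. The sketch below assumes the duration model returns one duration per position; it is illustrative only.

import torch

def length_regulate(combined: torch.Tensor, duration_model) -> torch.Tensor:
    # combined: (T, d) combined embedded vector; duration_model: placeholder duration prediction model.
    durations = duration_model(combined)                  # (T,) predicted phoneme durations, in frames
    durations = durations.round().clamp(min=1).long()
    # Phoneme length adjustment: repeat each position according to its predicted duration.
    return torch.repeat_interleave(combined, durations, dim=0)  # target phoneme vector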
7. The method according to any one of claims 1 to 6, wherein the performing spectrum conversion on the synthesized spectrum data to obtain synthesized voice data comprises:
inputting the synthesized spectrum data into a preset vocoder, wherein the vocoder comprises a deconvolution layer and a multi-receptive field fusion layer;
up-sampling the synthesized spectrum data based on the deconvolution layer to obtain target spectrum data;
and carrying out multi-scale feature fusion on the target spectrum data based on the multi-receptive field fusion layer to obtain the synthesized voice data.
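The deconvolution and multi-receptive-field fusion layers of claim 7 can be pictured with the toy, single-stage vocoder below. Channel counts, kernel sizes, dilation rates and the averaging of the branches are all assumptions; a practical vocoder would stack several such stages.

import torch
import torch.nn as nn

class ToyVocoderStage(nn.Module):
    # One upsampling stage: a deconvolution layer followed by a multi-receptive-field fusion layer.
    def __init__(self, in_ch=80, out_ch=128, dilations=(1, 3, 5)):
        super().__init__()
        # Deconvolution layer: upsamples the synthesized spectrum data along time (x4 here).
        self.upsample = nn.ConvTranspose1d(in_ch, out_ch, kernel_size=8, stride=4, padding=2)
        # Multi-receptive-field fusion: parallel dilated convolutions whose outputs are averaged.
        self.branches = nn.ModuleList(
            [nn.Conv1d(out_ch, out_ch, kernel_size=3, dilation=d, padding=d) for d in dilations]
        )
        self.to_wave = nn.Conv1d(out_ch, 1, kernel_size=7, padding=3)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.upsample(mel))                          # target spectrum data (upsampled)
        x = sum(b(x) for b in self.branches) / len(self.branches)   # multi-scale feature fusion
        return torch.tanh(self.to_wave(x)).squeeze(1)               # synthesized voice data (waveform)

# Invented shapes: a batch of one 80-bin, 100-frame spectrogram yields roughly 400 waveform samples.
wave = ToyVocoderStage()(torch.randn(1, 80, 100))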
8. A speech synthesis apparatus, the apparatus comprising:
the data acquisition module is used for acquiring target text data and reference voice data;
the vectorization module is used for vectorizing the reference voice data to obtain a reference embedded voice vector;
the feature extraction module is used for extracting features of the target text data to obtain a target text representation vector;
the style marking module is used for performing style marking on the reference embedded voice vector to obtain a target style embedded vector corresponding to the reference embedded voice vector;
the voice synthesis module is used for carrying out voice synthesis based on the target style embedded vector and the target text representation vector to obtain synthesized spectrum data;
and the spectrum conversion module is used for performing spectrum conversion on the synthesized spectrum data to obtain synthesized voice data.
9. An electronic device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to implement the speech synthesis method according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the speech synthesis method of any one of claims 1 to 7.
CN202310632858.9A 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium Pending CN116469372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310632858.9A CN116469372A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310632858.9A CN116469372A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN116469372A true CN116469372A (en) 2023-07-21

Family

ID=87173929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310632858.9A Pending CN116469372A (en) 2023-05-31 2023-05-31 Speech synthesis method, speech synthesis device, electronic device, and storage medium

Country Status (1)

Country Link
CN (1) CN116469372A (en)

Similar Documents

Publication Publication Date Title
Johar Emotion, affect and personality in speech: The Bias of language and paralanguage
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116543768A (en) Model training method, voice recognition method and device, equipment and storage medium
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
López-Ludeña et al. LSESpeak: A spoken language generator for Deaf people
CN113823259A (en) Method and device for converting text data into phoneme sequence
CN116312463A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116580691A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116386594A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Chen et al. Speech technology
CN116580704A (en) Training method of voice recognition model, voice recognition method, equipment and medium
CN116884386A (en) Speech synthesis method, speech synthesis apparatus, device, and storage medium
CN114786059B (en) Video generation method, video generation device, electronic device, and storage medium
CN115273805A (en) Prosody-based speech synthesis method and apparatus, device, and medium
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN116469372A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116564274A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
Menshikova et al. Prosodic boundaries prediction in russian using morphological and syntactic features
CN116665639A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116469373A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116469371A (en) Speech synthesis method and device, electronic equipment and storage medium
CN116741141A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
CN116434730A (en) Speech synthesis method, device, equipment and storage medium based on multi-scale emotion
CN116543780A (en) Model updating method and device, voice conversion method and device and storage medium
CN115620702A (en) Speech synthesis method, speech synthesis device, electronic apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination