CN113257225A - Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics - Google Patents

Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Info

Publication number
CN113257225A
Authority
CN
China
Prior art keywords
text
phoneme
network
information
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110600732.4A
Other languages
Chinese (zh)
Other versions
CN113257225B (en)
Inventor
郑书凯
李太豪
裴冠雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110600732.4A priority Critical patent/CN113257225B/en
Publication of CN113257225A publication Critical patent/CN113257225A/en
Application granted granted Critical
Publication of CN113257225B publication Critical patent/CN113257225B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention belongs to the field of artificial intelligence, and particularly relates to an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics. The method comprises: collecting text and emotion labels through a recording collection device; preprocessing the text to obtain phonemes and phoneme alignment information and to generate word segmentation and word segmentation semantic information; respectively calculating word segmentation pronunciation duration, word segmentation speech rate, word segmentation pronunciation energy and phoneme fundamental frequency information; respectively training a word segmentation speech rate prediction network, a word segmentation energy prediction network and a phoneme fundamental frequency prediction network; obtaining and splicing phoneme implicit information, word segmentation speech rate implicit information, word segmentation energy implicit information and phoneme fundamental frequency implicit information; and synthesizing the emotional speech. By fusing the vocabulary and phoneme pronunciation characteristics related to emotional pronunciation into an end-to-end speech synthesis model, the invention makes the synthesized emotional speech more natural.

Description

Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics.
Background
Spoken interaction is one of the earliest forms of human communication, and speech is therefore a primary way for humans to express emotion. With the rise of human-computer interaction, there is an urgent need for conversation robots that possess human-like emotion and speak like a real person. At present, the mainstream emotion classification is the 7 emotions proposed by Ekman in the last century: neutral, happy, sad, angry, afraid, disgusted and surprised.
With the rise of deep learning in recent years, speech synthesis technology has matured, and a machine can now pronounce much like a human speaker. However, making a machine produce speech that carries emotion like a human remains a very difficult problem. Mainstream emotional speech synthesis currently falls into two categories: one is the traditional machine-learning approach based on hidden Markov models; the other is the end-to-end approach based on deep learning. Speech synthesized with the hidden Markov approach sounds mechanical and unnatural and is now rarely used, while speech synthesized with deep learning methods is relatively natural. However, current deep-learning emotional speech synthesis merely merges the emotion label into the text features, so the quality of the synthesized emotional speech cannot be effectively guaranteed.
In the prior art, the way emotion information is integrated is simplistic: the emotion label is usually just merged into the text features, and the characteristics of human emotional pronunciation are not considered, so the model cannot learn the emotion information well and the synthesized emotional speech sounds stiff and unnatural.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics, and the specific technical scheme is as follows:
an emotion voice synthesis method fusing vocabulary and phoneme pronunciation characteristics comprises the following steps:
acquiring a text and an emotion label through recording acquisition equipment;
preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word segmentation semantic information;
step three, respectively calculating and obtaining word segmentation pronunciation duration information, word segmentation pronunciation speed information, word segmentation pronunciation energy information and phoneme fundamental frequency information;
step four, respectively training a word segmentation speech rate prediction network Net_WordSpeed, a word segmentation energy prediction network Net_WordEnergy and a phoneme fundamental frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word segmentation speech rate implicit information through Net_WordSpeed, word segmentation energy implicit information through Net_WordEnergy, and phoneme fundamental frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, the word segmentation speech rate implicit information, the word segmentation energy implicit information and the phoneme fundamental frequency implicit information to synthesize the emotional voice.
Further, the step one specifically includes step S1: collecting, through the recording acquisition equipment, speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech.
Further, the second step specifically includes the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
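By way of illustration only, the text-to-phoneme conversion of step S2 might be sketched in Python as follows; the example sentence and the tone-numbered pinyin style are assumptions, and the HTK forced-alignment step that yields the phoneme durations is not shown.

    # Minimal sketch of the text-to-phoneme conversion in step S2 using the
    # pypinyin toolkit named above. The example sentence and the tone-numbered
    # style are illustrative assumptions; HTK forced alignment is not shown.
    from pypinyin import lazy_pinyin, Style

    text = "今天天气真好"  # example input text (assumption)

    # Convert each character to tone-numbered pinyin, e.g. ["jin1", "tian1", ...]
    phoneme_text = lazy_pinyin(text, style=Style.TONE3)
    print(phoneme_text)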
Step S3, for text
Figure 415427DEST_PATH_IMAGE004
Performing word segmentation by using a word segmentation tool, namely inserting word segmentation boundary identifiers into the original text to generate word segmentation text
Figure 939949DEST_PATH_IMAGE008
Text to be participled
Figure 350071DEST_PATH_IMAGE008
Inputting to a pre-training Bert network with the output width of D Chinese characters to obtain the segmentation characteristics with dimension of NxD
Figure 618241DEST_PATH_IMAGE009
In particular, the amount of the surfactant is,
Figure 548151DEST_PATH_IMAGE010
wherein the content of the first and second substances,
Figure 509154DEST_PATH_IMAGE011
is a vector of dimension D.
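The word segmentation and Bert feature extraction of step S3 could look roughly like the sketch below; the jieba segmenter, the bert-base-chinese checkpoint and the mean-pooling of character vectors into word vectors are assumptions, since the text only names "a word segmentation tool" and "a pre-trained Bert network".

    # Rough sketch of step S3: segment the text and map each segmented word to a
    # D-dimensional semantic vector with a pre-trained Bert. jieba, the
    # "bert-base-chinese" checkpoint and mean pooling are assumptions.
    import jieba
    import torch
    from transformers import BertModel, BertTokenizer

    text = "今天天气真好"
    words = list(jieba.cut(text))                      # N segmented words

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]   # per-character vectors, [CLS]/[SEP] dropped

    # Pool the character vectors of each word into one D-dimensional word vector.
    word_feats, i = [], 0
    for w in words:
        word_feats.append(hidden[i:i + len(w)].mean(dim=0))
        i += len(w)
    word_feats = torch.stack(word_feats)               # shape: N x D
    print(word_feats.shape)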
Further, the third step specifically includes the following steps:
step S4, calculating the pronunciation duration of each segmented word from the generated phoneme-duration text and the generated word segmentation text, to obtain a word segmentation-duration text;
step S5, calculating the speech rate of each segmented word from the obtained word segmentation-duration text, and classifying the speech rate into 5 classes ranging from slow to fast, to obtain the speech rate class label corresponding to the word segmentation text;
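As a hedged illustration of steps S4 and S5 above, the per-word duration and speech rate labelling might be computed as follows; the input layout and the quantile-based class boundaries are assumptions, since the text does not give numeric thresholds.

    # Sketch of steps S4-S5: sum phoneme durations into per-word durations,
    # derive a speech rate (phonemes per second) per word, and bucket it into
    # 5 classes. The data layout and quantile thresholds are assumptions.
    import numpy as np

    # phoneme-duration text: (phoneme, duration in seconds); illustrative values.
    phone_durs = [("j", 0.06), ("in1", 0.12), ("t", 0.05), ("ian1", 0.14),
                  ("t", 0.05), ("ian1", 0.13), ("q", 0.06), ("i4", 0.15)]
    phones_per_word = [4, 4]                  # phonemes per segmented word (illustrative)

    durs = np.array([d for _, d in phone_durs])
    word_durs, i = [], 0
    for n in phones_per_word:                 # step S4: per-word pronunciation duration
        word_durs.append(durs[i:i + n].sum())
        i += n

    rates = np.array(phones_per_word) / np.array(word_durs)   # phonemes per second

    # step S5: 5 speech-rate classes from slow to fast via corpus-level quantiles.
    bins = np.quantile(rates, [0.2, 0.4, 0.6, 0.8])
    rate_labels = np.digitize(rates, bins)    # values 0..4
    print(word_durs, rates, rate_labels)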
Step S6, for the audio frequency
Figure 858281DEST_PATH_IMAGE001
And word-length text
Figure 762783DEST_PATH_IMAGE015
Calculating pronunciation energy information of the participle through the sum of squares of the audio amplitude in the participle duration, and classifying the energy information into five categories, which are respectively as follows: low, medium, high and high, thereby obtaining energy labels corresponding to the word segmentation texts
Figure 65589DEST_PATH_IMAGE017
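The per-word energy of step S6 (sum of squared amplitudes inside the word's time span) could be computed as sketched below; the sample rate, the word time spans and the quantile class boundaries are assumptions.

    # Sketch of step S6: energy of a segmented word = sum of squared audio
    # samples inside that word's time span, bucketed into five classes.
    import numpy as np

    sr = 16000                                   # assumed sample rate
    audio = np.random.randn(sr) * 0.1            # stand-in for the recorded waveform

    # (start, end) in seconds per segmented word, from the word-duration text.
    word_spans = [(0.00, 0.37), (0.37, 0.76)]

    energies = []
    for start, end in word_spans:
        seg = audio[int(start * sr):int(end * sr)]
        energies.append(float(np.sum(seg ** 2)))     # sum of squared amplitudes

    bins = np.quantile(energies, [0.2, 0.4, 0.6, 0.8])
    energy_labels = np.digitize(energies, bins)      # five classes, low to high
    print(energies, energy_labels)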
Step S7, for the audio frequency
Figure 666160DEST_PATH_IMAGE001
And phoneme-duration text
Figure 79824DEST_PATH_IMAGE018
Calculating fundamental frequency information of phoneme pronunciation through a library toolkit, and classifying the fundamental frequency information into five categories according to the fundamental frequency, wherein the categories are as follows: low, medium, high and high, thereby obtaining the base frequency label corresponding to the phoneme text
Figure 104412DEST_PATH_IMAGE019
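Phoneme-level fundamental frequency as in step S7 is commonly extracted with librosa; whether librosa is the toolkit intended by the original is an assumption, as are the pitch range and the averaging of frame-level F0 over each phoneme span.

    # Sketch of step S7: estimate frame-level F0 with librosa's pYIN, average it
    # over each phoneme's time span, then bucket into five classes. The librosa
    # choice, pitch range and per-phoneme averaging are assumptions.
    import librosa
    import numpy as np

    audio, sr = librosa.load("sample.wav", sr=16000)      # illustrative file name

    f0, voiced, _ = librosa.pyin(audio, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    times = librosa.times_like(f0, sr=sr)

    # (start, end) in seconds per phoneme, from the phoneme-duration text.
    phone_spans = [(0.00, 0.06), (0.06, 0.18), (0.18, 0.23)]

    phone_f0 = []
    for start, end in phone_spans:
        mask = (times >= start) & (times < end) & ~np.isnan(f0)
        phone_f0.append(float(f0[mask].mean()) if mask.any() else 0.0)

    bins = np.quantile(phone_f0, [0.2, 0.4, 0.6, 0.8])
    f0_labels = np.digitize(phone_f0, bins)               # five classes, low to high
    print(phone_f0, f0_labels)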
Further, the fourth step specifically includes the following steps:
step S8, training the word segmentation speech rate prediction network Net_WordSpeed: taking the emotion type and the word segmentation features as the network input and the speech rate class labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation speech rate prediction network Net_WordSpeed through deep learning network training;
step S9, training the word segmentation energy prediction network Net_WordEnergy: taking the emotion type and the word segmentation features as the network input and the energy labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation energy prediction network Net_WordEnergy with the same processing method as step S8;
step S10, training the phoneme fundamental frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form with the One-Hot conversion technique as the network input, converting the fundamental frequency labels into vector form with the One-Hot conversion technique as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental frequency prediction network Net_PhonemeF0 with the same training method as step S8.
Further, the step S8 specifically includes the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 with the One-Hot vector conversion technique, and then converting it through a single-layer fully-connected network of width D into a label input implicit feature of dimension D;
step B: splicing the obtained word segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the label sequence, whose length is the word segmentation count N, into One-Hot vectors of width 5 with the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5, i.e. a sequence of N vectors, each of dimension 5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through the automatic learning of the network, the speech rate prediction network Net_WordSpeed capable of predicting the speech rate of text.
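Steps A to D might be realised roughly as in the PyTorch sketch below; the hidden size, the omission of the CRF layer (only BiLSTM emission scores are shown) and the training loop are assumptions beyond what the text specifies.

    # Rough PyTorch sketch of steps A-D: one-hot emotion -> width-D label
    # feature, splicing with the N x D word features, and a BiLSTM emitting
    # 5-class scores per word. The CRF layer and training loop are omitted;
    # the hidden size and D are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D, NUM_EMOTIONS, NUM_CLASSES = 768, 7, 5

    class WordSpeedNet(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.emo_fc = nn.Linear(NUM_EMOTIONS, D)        # step A: single-layer FC of width D
            self.bilstm = nn.LSTM(D, hidden, batch_first=True, bidirectional=True)
            self.emit = nn.Linear(2 * hidden, NUM_CLASSES)  # emission scores (a CRF would sit on top)

        def forward(self, word_feats, emotion_id):
            # step A: emotion id -> one-hot of width 7 -> D-dimensional label feature
            one_hot = F.one_hot(emotion_id, NUM_EMOTIONS).float()
            emo_feat = self.emo_fc(one_hot).unsqueeze(1)    # (B, 1, D)
            # step B: splice the label feature and the word features in the first dimension
            x = torch.cat([emo_feat, word_feats], dim=1)    # (B, N + 1, D)
            h, _ = self.bilstm(x)
            # step C's N x 5 one-hot label matrix would be the training target (omitted)
            return self.emit(h[:, 1:, :])                   # (B, N, 5) per-word class scores

    net = WordSpeedNet()
    scores = net(torch.randn(1, 4, D), torch.tensor([2]))   # 4 words, emotion id 2
    print(scores.shape)                                     # torch.Size([1, 4, 5])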
Further, the fifth step specifically includes the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the corresponding phoneme text into the Encoder network of the Tacotron2 network to obtain the output features of the Encoder network;
step S12, obtaining word segmentation speech rate implicit information through Net_WordSpeed: inputting the word segmentation features into the word segmentation speech rate prediction network Net_WordSpeed to obtain the speech rate implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the speech rate implicit features in the time dimension by copying, to obtain speech rate implicit features whose length equals the number of phonemes;
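The "length completion by copying" of step S12 (and likewise step S13) amounts to repeating each word-level vector once per phoneme of that word; a minimal sketch, assuming the phoneme count of each word is known from the alignment:

    # Sketch of the length completion in steps S12/S13: each word-level implicit
    # feature is repeated once per phoneme of that word, expanding the sequence
    # to phoneme length. Shapes are illustrative.
    import torch

    word_hidden = torch.randn(3, 512)            # N = 3 word-level implicit features
    phones_per_word = torch.tensor([2, 4, 2])    # phonemes per word, from the alignment

    phone_hidden = torch.repeat_interleave(word_hidden, phones_per_word, dim=0)
    print(phone_hidden.shape)                    # torch.Size([8, 512]), one row per phoneme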
Step S13, acquiring the implied word segmentation energy information through Net _ WordEnergy: feature of dividing words
Figure 100979DEST_PATH_IMAGE034
Inputting the energy into a word segmentation energy prediction network Net _ WordEnergy to obtain the energy implicit characteristic of the BiLSTM output
Figure 343742DEST_PATH_IMAGE037
According to the number of phonemes contained in each participle, length completion is carried out on the energy implicit characteristics in the time dimension through copying, and the energy implicit characteristics with the length being the number of phonemes are obtained
Figure 870581DEST_PATH_IMAGE038
Step S14, acquiring phoneme fundamental frequency implicit information through Net _ PhonemeF 0: text of phonemes
Figure 681542DEST_PATH_IMAGE039
Inputting the phoneme fundamental frequency prediction network Net _ PhonemF 0 to obtain the phoneme fundamental frequency implicit characteristics output by the BilSTM
Figure 796129DEST_PATH_IMAGE040
Further, the sixth step specifically includes the following steps:
step S15, splicing the phoneme implicit features, the speech rate implicit features, the energy implicit features and the phoneme fundamental frequency implicit features to obtain the final input of the Tacotron2 Decoder network;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
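Step S15's splicing is a concatenation of the four phoneme-aligned sequences along the feature dimension, as sketched below; the feature widths are assumptions, and Tacotron2 itself is not reimplemented here.

    # Sketch of step S15: concatenate the four phoneme-aligned feature sequences
    # along the feature dimension to form the Tacotron2 Decoder input. Widths
    # are illustrative.
    import torch

    T = 8                                      # number of phonemes
    enc_out  = torch.randn(T, 512)             # Tacotron2 Encoder output (phoneme implicit info)
    speed_h  = torch.randn(T, 512)             # speech rate implicit features, copied to phoneme length
    energy_h = torch.randn(T, 512)             # energy implicit features, copied to phoneme length
    f0_h     = torch.randn(T, 512)             # phoneme fundamental frequency implicit features

    decoder_input = torch.cat([enc_out, speed_h, energy_h, f0_h], dim=-1)
    print(decoder_input.shape)                 # torch.Size([8, 2048])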
An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to adopt a model to synthesize voice or call the synthesized voice from the database as output, and opens an http port for outputting the synthesized emotion voice.
Further, previously synthesized emotional speech is preferentially used as the output, and model synthesis is used as a fallback, so as to improve the response speed of the system.
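The cache-first behaviour of the synthesized voice scheduling module might look like the following sketch; the dictionary standing in for the MySQL database and the plain function standing in for the HTTP layer and the synthesis model are assumptions made for brevity.

    # Sketch of the scheduling module's cache-first policy: return previously
    # synthesized speech from storage when available, otherwise fall back to
    # model synthesis and store the result. The dict stands in for the MySQL
    # database and synthesize() for the network model; both are assumptions.
    speech_cache = {}                                  # (text, emotion) -> waveform bytes

    def synthesize(text, emotion):
        return b"..."                                  # placeholder for the Tacotron2-based model

    def get_emotional_speech(text, emotion):
        key = (text, emotion)
        if key in speech_cache:                        # prefer already-synthesized speech
            return speech_cache[key]
        wav = synthesize(text, emotion)                # fall back to model synthesis
        speech_cache[key] = wav
        return wav

    print(len(get_emotional_speech("今天天气真好", "happy")))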
The invention has the advantages that:
1. According to the emotion voice synthesis method, the emotion of the synthesized speech is indirectly controlled by controlling the pronunciation of words. Words are the basic units of pronunciation rhythm, and people express different emotions by controlling the volume, speed and fundamental frequency with which different words are pronounced; by imitating the way humans express emotion through pronunciation, the method synthesizes the emotion of speech better and makes the synthesized speech more natural;
2. According to the emotion voice synthesis method, the three elements related to emotional pronunciation are predicted by independent speech rate, energy and fundamental frequency prediction networks, so the final speech output can be conveniently controlled by multiplying the output of each independent network by a simple coefficient (see the sketch after this list);
3. According to the emotion voice synthesis method, Tacotron2 is used as the backbone network, which effectively improves the final speech synthesis quality;
4. The emotion voice synthesis system provides an emotional speech calling interface; high-quality emotional speech can be synthesized through a simple HTTP call, which can greatly improve the user experience in scenarios requiring human-computer voice interaction, such as intelligent telephone customer service dialogue, intelligent map navigation dialogue, conversation robots in children's education, and humanoid robot dialogue in banks, airports and the like.
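The coefficient-based control mentioned in advantage 2 could be exercised as in the sketch below; the coefficient values and the point of application (scaling the implicit features before splicing) are assumptions.

    # Sketch of advantage 2: scale the output of each independent prosody
    # network by a simple coefficient before splicing, e.g. to make the speech
    # slightly faster and more energetic. Values are illustrative assumptions.
    import torch

    speed_h  = torch.randn(8, 512)     # speech rate implicit features
    energy_h = torch.randn(8, 512)     # energy implicit features
    f0_h     = torch.randn(8, 512)     # fundamental frequency implicit features

    speed_coef, energy_coef, f0_coef = 1.2, 1.1, 1.0   # adjustment coefficients (illustrative)

    speed_h, energy_h, f0_h = speed_coef * speed_h, energy_coef * energy_h, f0_coef * f0_h
    # ...then splice with the Encoder output as in step S15.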
Drawings
FIG. 1 is a schematic diagram of an emotion speech synthesis system according to the present invention;
FIG. 2 is a flow chart of the emotion voice synthesis method of the present invention;
FIG. 3 is a schematic diagram of a network structure of the emotion speech synthesis method of the present invention;
FIG. 4 is a schematic diagram of the network structure of Tacotron2 in the speech synthesis system.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an emotion speech synthesis system with vocabulary and phoneme pronunciation features fused includes:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to synthesize speech with the model or to retrieve previously synthesized speech from the database as output, and opens an HTTP port for outputting the synthesized emotional speech; previously synthesized emotional speech is used preferentially and model synthesis is used as a fallback, in order to improve the response speed of the system.
As shown in fig. 2-4, an emotion speech synthesis method with vocabulary and phoneme pronunciation characteristics fused includes the following steps:
step S1, collecting text and emotion labels: through a recording collection device, collecting speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech;
step S2, preprocessing the text and acquiring phonemes and phoneme alignment information: converting the text collected in step S1 into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the audio obtained in step S1, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, preprocessing the text and generating word segmentation and word segmentation semantic information: performing word segmentation on the text with a word segmentation tool, i.e. inserting word segmentation boundary identifiers into the original text to generate the word segmentation text; then inputting the word segmentation text into a pre-trained Bert network whose output width per Chinese character is D, obtaining word segmentation features of dimension N×D, i.e. a sequence of N vectors, each of dimension D;
step S4, calculating word segmentation pronunciation duration information: from the phoneme-duration text generated in step S2 and the word segmentation text generated in step S3, calculating the pronunciation duration of each segmented word to obtain a word segmentation-duration text;
step S5, calculating word segmentation speech rate information: from the word segmentation-duration text obtained in step S4, calculating the speech rate of each segmented word and classifying the speech rate into 5 classes ranging from slow to fast, thereby obtaining the speech rate class label corresponding to the word segmentation text;
step S6, calculating word segmentation pronunciation energy information: for the audio obtained in step S1 and the word segmentation-duration text obtained in step S4, calculating the pronunciation energy of each segmented word as the sum of squares of the audio amplitude within that word's duration, and classifying the energy into five classes ranging from low to high, thereby obtaining the energy label corresponding to the word segmentation text;
step S7, calculating phoneme fundamental frequency information: for the audio obtained in step S1 and the phoneme-duration text obtained in step S2, calculating the fundamental frequency of each phoneme's pronunciation through an audio processing toolkit, and classifying the fundamental frequency into five classes ranging from low to high, thereby obtaining the fundamental frequency label corresponding to the phoneme text;
Step S8, training the participle speech rate prediction network Net _ WordSpeed: the emotion types obtained in the step S1
Figure 468975DEST_PATH_IMAGE064
And the word segmentation characteristics obtained in step S3
Figure 7272DEST_PATH_IMAGE065
The speech rate category label obtained in step S5 is used as the network input
Figure 890915DEST_PATH_IMAGE066
The reason why the BiLSTM bidirectional long-short term memory network is adopted as the network target and is input into a deep learning sequence prediction network BiLSTM-CRF is thatBecause BilSTM is particularly suitable for processing sequence-class tasks, such as speech signal processing, text signal processing, etc., and then obtaining the participle speech speed prediction network Net _ WordSpeed through deep learning network training. Specifically, the method comprises the following steps:
step A: emotional type
Figure 239987DEST_PATH_IMAGE067
Converting the signal into a One-Hot vector with the width of 7 by using an One-Hot vector conversion technology, and then converting the signal into a label input implicit characteristic with the dimension of D through a single-layer full-connection network with the width of D
Figure 209081DEST_PATH_IMAGE068
And B: obtained in step S3 and step A
Figure 285490DEST_PATH_IMAGE068
And
Figure 289218DEST_PATH_IMAGE069
splicing in the first dimension to obtain the network input
Figure 543613DEST_PATH_IMAGE070
In particular, the amount of the surfactant is,
Figure 2DEST_PATH_IMAGE071
and C: label with length of word segmentation N
Figure 880102DEST_PATH_IMAGE072
Converting the vector into an One-Hot vector with the width of 5 by using an One-Hot vector conversion technology to finally obtain a network label matrix with the dimension of Nx 5
Figure 738337DEST_PATH_IMAGE073
In particular, the amount of the surfactant is,
Figure 429212DEST_PATH_IMAGE074
wherein the content of the first and second substances,
Figure 107318DEST_PATH_IMAGE075
is a vector of dimension 5;
step D: inputting the network obtained in the step B
Figure 791109DEST_PATH_IMAGE070
And the network label matrix obtained in the step C
Figure 503850DEST_PATH_IMAGE073
Inputting the predicted text speech into BLSTM-CRF network for training, and obtaining the speech rate predicting network Net _ WordSpeed capable of predicting text speech rate through automatic learning of network.
Step S9, training the word segmentation energy prediction network Net _ WordEnergy: the emotion types obtained in the step S1
Figure 365627DEST_PATH_IMAGE076
And the word segmentation characteristics obtained in step S3
Figure 531029DEST_PATH_IMAGE069
The energy label obtained in step S6 is used as the network input
Figure 18511DEST_PATH_IMAGE077
The network object is input to the deep learning sequence prediction network BLSTM-CRF, and the participle energy prediction network Net _ wordrenergy is obtained by the same processing method as that in step S8.
Step S10, training the phoneme fundamental frequency prediction network Net _ PhonemEF 0: the emotion types obtained in the step S1
Figure 320180DEST_PATH_IMAGE076
And the phoneme text obtained in step S2
Figure 618437DEST_PATH_IMAGE078
All of the signals are converted into vector form by One-Hot conversion technology and then used as network input, and the fundamental frequency label obtained in step S7
Figure 536714DEST_PATH_IMAGE079
And converting the phoneme base frequency prediction network into a vector form by using an One-Hot conversion technology, inputting the vector form as a network target into a deep learning sequence prediction network BLS TM-CRF, and obtaining the phoneme base frequency prediction network Net _ PhonemF 0 by using a training method the same as the step S8.
Step S11, obtaining phoneme implicit information through the Encoder of Tacotron 2: subjecting the product obtained in step S2
Figure 296729DEST_PATH_IMAGE080
Inputting the data into an Encoder network of a Tacotron2 network to obtain the output characteristics of the Encoder network
Figure 984062DEST_PATH_IMAGE081
Step S12, obtaining the implied information of word speed through Net _ WordSpeed: the word segmentation characteristics obtained in the step S3
Figure 187642DEST_PATH_IMAGE082
Inputting the word-segmentation speech rate prediction network Net _ WordSpeed obtained in step S8 to obtain the speech rate implicit characteristic output by BilSTM
Figure 327636DEST_PATH_IMAGE083
According to the number of phonemes contained in each participle, length completion is carried out on the speech speed implicit characteristics in the time dimension through copying, and the speech speed implicit characteristics with the length being the number of phonemes are obtained
Figure 156920DEST_PATH_IMAGE084
Step S13, acquiring the implied word segmentation energy information through Net _ WordEnergy: the word segmentation characteristics obtained in the step S3
Figure 698760DEST_PATH_IMAGE082
Inputting the word segmentation energy prediction network Net _ WordEnergy obtained in the step S9 to obtain the energy implicit characteristic output by the BilSTM
Figure 73241DEST_PATH_IMAGE085
According to the number of phonemes contained in each participle, length completion is carried out on the energy implicit characteristics in the time dimension through copying, and the energy implicit characteristics with the length being the number of phonemes are obtained
Figure 700531DEST_PATH_IMAGE086
Step S14, acquiring phoneme fundamental frequency implicit information through Net _ PhonemeF 0: the phoneme text obtained in step S2
Figure 808208DEST_PATH_IMAGE087
Inputting the phoneme fundamental frequency prediction network Net _ PhonemF 0 obtained in the step S10 to obtain the phoneme fundamental frequency implicit characteristics output by the BilSTM
Figure 470133DEST_PATH_IMAGE088
Step S15, splicing phoneme implicit information, participle speed implicit information, participle energy implicit information and phoneme fundamental frequency implicit information: step S11 is obtained
Figure 15515DEST_PATH_IMAGE081
Step S12 obtains
Figure 395681DEST_PATH_IMAGE089
Step S13 obtains
Figure 301189DEST_PATH_IMAGE038
Step S14 obtains
Figure 552042DEST_PATH_IMAGE088
To be spliced to obtain the final Decoder network input of Tacotron2
Figure 517593DEST_PATH_IMAGE090
In particular, the amount of the surfactant is,
Figure 775267DEST_PATH_IMAGE091
step S16, synthesizing emotion voice: and inputting the result obtained in the step S15 into a Decoder network of a Decoder of a Tacotron2, and then decoding and synthesizing the result through the subsequent structure of the Tacotron2 network to obtain the final emotional voice.
In summary, the method provided by the present embodiment improves the rationality of emotion speech feature generation by controlling the pronunciation of the text vocabulary, and can improve the quality of finally synthesized emotion speech.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (10)

1. An emotion voice synthesis method fusing vocabulary and phoneme pronunciation characteristics is characterized by comprising the following steps:
acquiring a text and an emotion label through recording acquisition equipment;
preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word segmentation semantic information;
step three, respectively calculating and obtaining word segmentation pronunciation duration information, word segmentation pronunciation speed information, word segmentation pronunciation energy information and phoneme fundamental frequency information;
step four, respectively training a word segmentation speech rate prediction network Net_WordSpeed, a word segmentation energy prediction network Net_WordEnergy and a phoneme fundamental frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word segmentation speech rate implicit information through Net_WordSpeed, word segmentation energy implicit information through Net_WordEnergy, and phoneme fundamental frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, the word segmentation speech rate implicit information, the word segmentation energy implicit information and the phoneme fundamental frequency implicit information to synthesize the emotional voice.
2. The method as claimed in claim 1, wherein step one comprises step S1: collecting, through the recording acquisition equipment, speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech.
3. The method as claimed in claim 2, wherein the second step comprises the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, performing word segmentation on the text with a word segmentation tool, i.e. inserting word segmentation boundary identifiers into the original text to generate the word segmentation text; then inputting the word segmentation text into a pre-trained Bert network whose output width per Chinese character is D, obtaining word segmentation features of dimension N×D, i.e. a sequence of N vectors, each of dimension D.
4. The method as claimed in claim 3, wherein the third step comprises the following steps:
step S4, calculating the pronunciation duration of each segmented word from the generated phoneme-duration text and the generated word segmentation text, to obtain a word segmentation-duration text;
step S5, calculating the speech rate of each segmented word from the obtained word segmentation-duration text, and classifying the speech rate into 5 classes ranging from slow to fast, to obtain the speech rate class label corresponding to the word segmentation text;
step S6, for the audio and the word segmentation-duration text, calculating the pronunciation energy of each segmented word as the sum of squares of the audio amplitude within that word's duration, and classifying the energy into five classes ranging from low to high, to obtain the energy label corresponding to the word segmentation text;
step S7, for the audio and the phoneme-duration text, calculating the fundamental frequency of each phoneme's pronunciation through an audio processing toolkit, and classifying the fundamental frequency into five classes ranging from low to high, to obtain the fundamental frequency label corresponding to the phoneme text.
5. The method as claimed in claim 4, wherein the step four includes the following steps:
step S8, training the word segmentation speech rate prediction network Net_WordSpeed: taking the emotion type and the word segmentation features as the network input and the speech rate class labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation speech rate prediction network Net_WordSpeed through deep learning network training;
step S9, training the word segmentation energy prediction network Net_WordEnergy: taking the emotion type and the word segmentation features as the network input and the energy labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation energy prediction network Net_WordEnergy with the same processing method as step S8;
step S10, training the phoneme fundamental frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form with the One-Hot conversion technique as the network input, converting the fundamental frequency labels into vector form with the One-Hot conversion technique as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental frequency prediction network Net_PhonemeF0 with the same training method as step S8.
6. The method as claimed in claim 5, wherein the step S8 comprises the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 with the One-Hot vector conversion technique, and then converting it through a single-layer fully-connected network of width D into a label input implicit feature of dimension D;
step B: splicing the obtained word segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the label sequence, whose length is the word segmentation count N, into One-Hot vectors of width 5 with the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5, i.e. a sequence of N vectors, each of dimension 5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through the automatic learning of the network, the speech rate prediction network Net_WordSpeed capable of predicting the speech rate of text.
7. The method as claimed in claim 5, wherein the fifth step comprises the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the corresponding phoneme text into the Encoder network of the Tacotron2 network to obtain the output features of the Encoder network;
step S12, obtaining word segmentation speech rate implicit information through Net_WordSpeed: inputting the word segmentation features into the word segmentation speech rate prediction network Net_WordSpeed to obtain the speech rate implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the speech rate implicit features in the time dimension by copying, to obtain speech rate implicit features whose length equals the number of phonemes;
step S13, obtaining word segmentation energy implicit information through Net_WordEnergy: inputting the word segmentation features into the word segmentation energy prediction network Net_WordEnergy to obtain the energy implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the energy implicit features in the time dimension by copying, to obtain energy implicit features whose length equals the number of phonemes;
step S14, obtaining phoneme fundamental frequency implicit information through Net_PhonemeF0: inputting the phoneme text into the phoneme fundamental frequency prediction network Net_PhonemeF0 to obtain the phoneme fundamental frequency implicit features output by the BiLSTM.
8. The method as claimed in claim 7, wherein the sixth step comprises the following steps:
step S15, splicing the phoneme implicit features, the speech rate implicit features, the energy implicit features and the phoneme fundamental frequency implicit features to obtain the final input of the Tacotron2 Decoder network;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
9. An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to adopt a model to synthesize voice or call the synthesized voice from the database as output, and opens an http port for outputting the synthesized emotion voice.
10. The system of claim 9, wherein previously synthesized emotional speech is preferentially used as the output and model synthesis is used as a fallback, so as to improve the response speed of the system.
CN202110600732.4A 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics Active CN113257225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Publications (2)

Publication Number Publication Date
CN113257225A true CN113257225A (en) 2021-08-13
CN113257225B CN113257225B (en) 2021-11-02

Family

ID=77185459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110600732.4A Active CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Country Status (1)

Country Link
CN (1) CN113257225B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711413A (en) * 2023-11-02 2024-03-15 广东广信通信服务有限公司 Voice recognition data processing method, system, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
US20080082329A1 (en) * 2006-09-29 2008-04-03 Joseph Watson Multi-pass speech analytics
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
US20080082329A1 (en) * 2006-09-29 2008-04-03 Joseph Watson Multi-pass speech analytics
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711413A (en) * 2023-11-02 2024-03-15 广东广信通信服务有限公司 Voice recognition data processing method, system, device and storage medium

Also Published As

Publication number Publication date
CN113257225B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN106056207B (en) A kind of robot depth interaction and inference method and device based on natural language
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN109036371A (en) Audio data generation method and system for speech synthesis
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN115329779B (en) Multi-person dialogue emotion recognition method
CN113838448B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN107221344A (en) A kind of speech emotional moving method
CN111341293A (en) Text voice front-end conversion method, device, equipment and storage medium
CN111951781A (en) Chinese prosody boundary prediction method based on graph-to-sequence
Zhao et al. End-to-end-based Tibetan multitask speech recognition
Dongmei Design of English text-to-speech conversion algorithm based on machine learning
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN114821088A (en) Multi-mode depth feature extraction method and system based on optimized BERT model
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN116129868A (en) Method and system for generating structured photo
CN116092472A (en) Speech synthesis method and synthesis system
CN114446324A (en) Multi-mode emotion recognition method based on acoustic and text features
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN114121018A (en) Voice document classification method, system, device and storage medium
CN114973045A (en) Hierarchical multi-modal emotion analysis method based on multi-task learning
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
CN112992116A (en) Automatic generation method and system of video content
CN113628609A (en) Automatic audio content generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant