CN113257225A - Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics - Google Patents

Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Info

Publication number
CN113257225A
Authority
CN
China
Prior art keywords
text
phoneme
network
information
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110600732.4A
Other languages
Chinese (zh)
Other versions
CN113257225B (en)
Inventor
郑书凯
李太豪
裴冠雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202110600732.4A priority Critical patent/CN113257225B/en
Publication of CN113257225A publication Critical patent/CN113257225A/en
Application granted granted Critical
Publication of CN113257225B publication Critical patent/CN113257225B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The invention belongs to the field of artificial intelligence, and particularly relates to an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics. The method comprises: collecting text and emotion labels through a recording collection device; preprocessing the text to obtain phonemes and phoneme alignment information and to generate word segmentation and word segmentation semantic information; respectively calculating word segmentation pronunciation duration, word segmentation speech rate, word segmentation pronunciation energy and phoneme fundamental frequency information; respectively training a word segmentation speech rate prediction network, a word segmentation energy prediction network and a phoneme fundamental frequency prediction network; obtaining and splicing phoneme implicit information, word segmentation speech rate implicit information, word segmentation energy implicit information and phoneme fundamental frequency implicit information; and synthesizing the emotional speech. By fusing the vocabulary and phoneme pronunciation characteristics related to emotional pronunciation into an end-to-end speech synthesis model, the invention makes the synthesized emotional speech more natural.

Description

Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics.
Background
Spoken interaction is one of the earliest forms of human communication, and speech is therefore a primary way for humans to express emotion. With the rise of human-computer interaction, there is an urgent need for conversation robots that possess human-like emotion and speak like a real person. At present, the mainstream emotion classification is the 7 emotions proposed by Ekman in the last century: neutral, happy, sad, angry, afraid, disgusted and surprised.
With the rise of deep learning in recent years, speech synthesis technology has matured, and a machine can now pronounce much like a human speaker. However, making a machine produce speech that carries emotion like a human remains a very difficult problem. Mainstream emotional speech synthesis currently falls into two categories: one is the traditional machine-learning approach based on hidden Markov models; the other is the end-to-end approach based on deep learning. Speech synthesized with the hidden Markov approach sounds mechanical and unnatural and is now rarely used, while speech synthesized with deep learning methods is relatively natural. However, current deep-learning emotional speech synthesis merely merges the emotion label into the text features, so the quality of the synthesized emotional speech cannot be effectively guaranteed.
In the prior art, the way emotion information is integrated is simplistic: the emotion label is usually just merged into the text features, and the characteristics of human emotional pronunciation are not considered, so the model cannot learn the emotion information well and the synthesized emotional speech sounds stiff and unnatural.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics, and the specific technical scheme is as follows:
an emotion voice synthesis method fusing vocabulary and phoneme pronunciation characteristics comprises the following steps:
acquiring a text and an emotion label through recording acquisition equipment;
preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word segmentation semantic information;
step three, respectively calculating and obtaining word segmentation pronunciation duration information, word segmentation pronunciation speed information, word segmentation pronunciation energy information and phoneme fundamental frequency information;
step four, respectively training a word segmentation speech rate prediction network Net_WordSpeed, a word segmentation energy prediction network Net_WordEnergy and a phoneme fundamental frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word segmentation speech rate implicit information through Net_WordSpeed, word segmentation energy implicit information through Net_WordEnergy, and phoneme fundamental frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, the word segmentation speech rate implicit information, the word segmentation energy implicit information and the phoneme fundamental frequency implicit information to synthesize the emotional voice.
Further, the step one specifically includes step S1: collecting, through the recording acquisition equipment, speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech.
Further, the second step specifically includes the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
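By way of illustration only, the text-to-phoneme conversion of step S2 might be sketched in Python as follows; the example sentence and the tone-numbered pinyin style are assumptions, and the HTK forced-alignment step that yields the phoneme durations is not shown.

    # Minimal sketch of the text-to-phoneme conversion in step S2 using the
    # pypinyin toolkit named above. The example sentence and the tone-numbered
    # style are illustrative assumptions; HTK forced alignment is not shown.
    from pypinyin import lazy_pinyin, Style

    text = "今天天气真好"  # example input text (assumption)

    # Convert each character to tone-numbered pinyin, e.g. ["jin1", "tian1", ...]
    phoneme_text = lazy_pinyin(text, style=Style.TONE3)
    print(phoneme_text)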
Step S3, for text
Figure 415427DEST_PATH_IMAGE004
Performing word segmentation by using a word segmentation tool, namely inserting word segmentation boundary identifiers into the original text to generate word segmentation text
Figure 939949DEST_PATH_IMAGE008
Text to be participled
Figure 350071DEST_PATH_IMAGE008
Inputting to a pre-training Bert network with the output width of D Chinese characters to obtain the segmentation characteristics with dimension of NxD
Figure 618241DEST_PATH_IMAGE009
In particular, the amount of the surfactant is,
Figure 548151DEST_PATH_IMAGE010
wherein the content of the first and second substances,
Figure 509154DEST_PATH_IMAGE011
is a vector of dimension D.
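The word segmentation and Bert feature extraction of step S3 could look roughly like the sketch below; the jieba segmenter, the bert-base-chinese checkpoint and the mean-pooling of character vectors into word vectors are assumptions, since the text only names "a word segmentation tool" and "a pre-trained Bert network".

    # Rough sketch of step S3: segment the text and map each segmented word to a
    # D-dimensional semantic vector with a pre-trained Bert. jieba, the
    # "bert-base-chinese" checkpoint and mean pooling are assumptions.
    import jieba
    import torch
    from transformers import BertModel, BertTokenizer

    text = "今天天气真好"
    words = list(jieba.cut(text))                      # N segmented words

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0, 1:-1]   # per-character vectors, [CLS]/[SEP] dropped

    # Pool the character vectors of each word into one D-dimensional word vector.
    word_feats, i = [], 0
    for w in words:
        word_feats.append(hidden[i:i + len(w)].mean(dim=0))
        i += len(w)
    word_feats = torch.stack(word_feats)               # shape: N x D
    print(word_feats.shape)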
Further, the third step specifically includes the following steps:
step S4, calculating the pronunciation duration of each segmented word from the generated phoneme-duration text and the generated word segmentation text, to obtain a word segmentation-duration text;
step S5, calculating the speech rate of each segmented word from the obtained word segmentation-duration text, and classifying the speech rate into 5 classes ranging from slow to fast, to obtain the speech rate class label corresponding to the word segmentation text;
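As a hedged illustration of steps S4 and S5 above, the per-word duration and speech rate labelling might be computed as follows; the input layout and the quantile-based class boundaries are assumptions, since the text does not give numeric thresholds.

    # Sketch of steps S4-S5: sum phoneme durations into per-word durations,
    # derive a speech rate (phonemes per second) per word, and bucket it into
    # 5 classes. The data layout and quantile thresholds are assumptions.
    import numpy as np

    # phoneme-duration text: (phoneme, duration in seconds); illustrative values.
    phone_durs = [("j", 0.06), ("in1", 0.12), ("t", 0.05), ("ian1", 0.14),
                  ("t", 0.05), ("ian1", 0.13), ("q", 0.06), ("i4", 0.15)]
    phones_per_word = [4, 4]                  # phonemes per segmented word (illustrative)

    durs = np.array([d for _, d in phone_durs])
    word_durs, i = [], 0
    for n in phones_per_word:                 # step S4: per-word pronunciation duration
        word_durs.append(durs[i:i + n].sum())
        i += n

    rates = np.array(phones_per_word) / np.array(word_durs)   # phonemes per second

    # step S5: 5 speech-rate classes from slow to fast via corpus-level quantiles.
    bins = np.quantile(rates, [0.2, 0.4, 0.6, 0.8])
    rate_labels = np.digitize(rates, bins)    # values 0..4
    print(word_durs, rates, rate_labels)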
Step S6, for the audio frequency
Figure 858281DEST_PATH_IMAGE001
And word-length text
Figure 762783DEST_PATH_IMAGE015
Calculating pronunciation energy information of the participle through the sum of squares of the audio amplitude in the participle duration, and classifying the energy information into five categories, which are respectively as follows: low, medium, high and high, thereby obtaining energy labels corresponding to the word segmentation texts
Figure 65589DEST_PATH_IMAGE017
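The per-word energy of step S6 (sum of squared amplitudes inside the word's time span) could be computed as sketched below; the sample rate, the word time spans and the quantile class boundaries are assumptions.

    # Sketch of step S6: energy of a segmented word = sum of squared audio
    # samples inside that word's time span, bucketed into five classes.
    import numpy as np

    sr = 16000                                   # assumed sample rate
    audio = np.random.randn(sr) * 0.1            # stand-in for the recorded waveform

    # (start, end) in seconds per segmented word, from the word-duration text.
    word_spans = [(0.00, 0.37), (0.37, 0.76)]

    energies = []
    for start, end in word_spans:
        seg = audio[int(start * sr):int(end * sr)]
        energies.append(float(np.sum(seg ** 2)))     # sum of squared amplitudes

    bins = np.quantile(energies, [0.2, 0.4, 0.6, 0.8])
    energy_labels = np.digitize(energies, bins)      # five classes, low to high
    print(energies, energy_labels)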
Step S7, for the audio frequency
Figure 666160DEST_PATH_IMAGE001
And phoneme-duration text
Figure 79824DEST_PATH_IMAGE018
Calculating fundamental frequency information of phoneme pronunciation through a library toolkit, and classifying the fundamental frequency information into five categories according to the fundamental frequency, wherein the categories are as follows: low, medium, high and high, thereby obtaining the base frequency label corresponding to the phoneme text
Figure 104412DEST_PATH_IMAGE019
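Phoneme-level fundamental frequency as in step S7 is commonly extracted with librosa; whether librosa is the toolkit intended by the original is an assumption, as are the pitch range and the averaging of frame-level F0 over each phoneme span.

    # Sketch of step S7: estimate frame-level F0 with librosa's pYIN, average it
    # over each phoneme's time span, then bucket into five classes. The librosa
    # choice, pitch range and per-phoneme averaging are assumptions.
    import librosa
    import numpy as np

    audio, sr = librosa.load("sample.wav", sr=16000)      # illustrative file name

    f0, voiced, _ = librosa.pyin(audio, fmin=librosa.note_to_hz("C2"),
                                 fmax=librosa.note_to_hz("C6"), sr=sr)
    times = librosa.times_like(f0, sr=sr)

    # (start, end) in seconds per phoneme, from the phoneme-duration text.
    phone_spans = [(0.00, 0.06), (0.06, 0.18), (0.18, 0.23)]

    phone_f0 = []
    for start, end in phone_spans:
        mask = (times >= start) & (times < end) & ~np.isnan(f0)
        phone_f0.append(float(f0[mask].mean()) if mask.any() else 0.0)

    bins = np.quantile(phone_f0, [0.2, 0.4, 0.6, 0.8])
    f0_labels = np.digitize(phone_f0, bins)               # five classes, low to high
    print(phone_f0, f0_labels)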
Further, the fourth step specifically includes the following steps:
step S8, training the word segmentation speech rate prediction network Net_WordSpeed: taking the emotion type and the word segmentation features as the network input and the speech rate class labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation speech rate prediction network Net_WordSpeed through deep learning network training;
step S9, training the word segmentation energy prediction network Net_WordEnergy: taking the emotion type and the word segmentation features as the network input and the energy labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation energy prediction network Net_WordEnergy with the same processing method as step S8;
step S10, training the phoneme fundamental frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form with the One-Hot conversion technique as the network input, converting the fundamental frequency labels into vector form with the One-Hot conversion technique as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental frequency prediction network Net_PhonemeF0 with the same training method as step S8.
Further, the step S8 specifically includes the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 with the One-Hot vector conversion technique, and then converting it through a single-layer fully-connected network of width D into a label input implicit feature of dimension D;
step B: splicing the obtained word segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the label sequence, whose length is the word segmentation count N, into One-Hot vectors of width 5 with the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5, i.e. a sequence of N vectors, each of dimension 5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through the automatic learning of the network, the speech rate prediction network Net_WordSpeed capable of predicting the speech rate of text.
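Steps A to D might be realised roughly as in the PyTorch sketch below; the hidden size, the omission of the CRF layer (only BiLSTM emission scores are shown) and the training loop are assumptions beyond what the text specifies.

    # Rough PyTorch sketch of steps A-D: one-hot emotion -> width-D label
    # feature, splicing with the N x D word features, and a BiLSTM emitting
    # 5-class scores per word. The CRF layer and training loop are omitted;
    # the hidden size and D are assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    D, NUM_EMOTIONS, NUM_CLASSES = 768, 7, 5

    class WordSpeedNet(nn.Module):
        def __init__(self, hidden=256):
            super().__init__()
            self.emo_fc = nn.Linear(NUM_EMOTIONS, D)        # step A: single-layer FC of width D
            self.bilstm = nn.LSTM(D, hidden, batch_first=True, bidirectional=True)
            self.emit = nn.Linear(2 * hidden, NUM_CLASSES)  # emission scores (a CRF would sit on top)

        def forward(self, word_feats, emotion_id):
            # step A: emotion id -> one-hot of width 7 -> D-dimensional label feature
            one_hot = F.one_hot(emotion_id, NUM_EMOTIONS).float()
            emo_feat = self.emo_fc(one_hot).unsqueeze(1)    # (B, 1, D)
            # step B: splice the label feature and the word features in the first dimension
            x = torch.cat([emo_feat, word_feats], dim=1)    # (B, N + 1, D)
            h, _ = self.bilstm(x)
            # step C's N x 5 one-hot label matrix would be the training target (omitted)
            return self.emit(h[:, 1:, :])                   # (B, N, 5) per-word class scores

    net = WordSpeedNet()
    scores = net(torch.randn(1, 4, D), torch.tensor([2]))   # 4 words, emotion id 2
    print(scores.shape)                                     # torch.Size([1, 4, 5])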
Further, the fifth step specifically includes the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the corresponding phoneme text into the Encoder network of the Tacotron2 network to obtain the output features of the Encoder network;
step S12, obtaining word segmentation speech rate implicit information through Net_WordSpeed: inputting the word segmentation features into the word segmentation speech rate prediction network Net_WordSpeed to obtain the speech rate implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the speech rate implicit features in the time dimension by copying, to obtain speech rate implicit features whose length equals the number of phonemes;
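The "length completion by copying" of step S12 (and likewise step S13) amounts to repeating each word-level vector once per phoneme of that word; a minimal sketch, assuming the phoneme count of each word is known from the alignment:

    # Sketch of the length completion in steps S12/S13: each word-level implicit
    # feature is repeated once per phoneme of that word, expanding the sequence
    # to phoneme length. Shapes are illustrative.
    import torch

    word_hidden = torch.randn(3, 512)            # N = 3 word-level implicit features
    phones_per_word = torch.tensor([2, 4, 2])    # phonemes per word, from the alignment

    phone_hidden = torch.repeat_interleave(word_hidden, phones_per_word, dim=0)
    print(phone_hidden.shape)                    # torch.Size([8, 512]), one row per phoneme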
Step S13, acquiring the implied word segmentation energy information through Net _ WordEnergy: feature of dividing words
Figure 100979DEST_PATH_IMAGE034
Inputting the energy into a word segmentation energy prediction network Net _ WordEnergy to obtain the energy implicit characteristic of the BiLSTM output
Figure 343742DEST_PATH_IMAGE037
According to the number of phonemes contained in each participle, length completion is carried out on the energy implicit characteristics in the time dimension through copying, and the energy implicit characteristics with the length being the number of phonemes are obtained
Figure 870581DEST_PATH_IMAGE038
Step S14, acquiring phoneme fundamental frequency implicit information through Net _ PhonemeF 0: text of phonemes
Figure 681542DEST_PATH_IMAGE039
Inputting the phoneme fundamental frequency prediction network Net _ PhonemF 0 to obtain the phoneme fundamental frequency implicit characteristics output by the BilSTM
Figure 796129DEST_PATH_IMAGE040
Further, the sixth step specifically includes the following steps:
step S15, splicing the phoneme implicit features, the speech rate implicit features, the energy implicit features and the phoneme fundamental frequency implicit features to obtain the final input of the Tacotron2 Decoder network;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
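Step S15's splicing is a concatenation of the four phoneme-aligned sequences along the feature dimension, as sketched below; the feature widths are assumptions, and Tacotron2 itself is not reimplemented here.

    # Sketch of step S15: concatenate the four phoneme-aligned feature sequences
    # along the feature dimension to form the Tacotron2 Decoder input. Widths
    # are illustrative.
    import torch

    T = 8                                      # number of phonemes
    enc_out  = torch.randn(T, 512)             # Tacotron2 Encoder output (phoneme implicit info)
    speed_h  = torch.randn(T, 512)             # speech rate implicit features, copied to phoneme length
    energy_h = torch.randn(T, 512)             # energy implicit features, copied to phoneme length
    f0_h     = torch.randn(T, 512)             # phoneme fundamental frequency implicit features

    decoder_input = torch.cat([enc_out, speed_h, energy_h, f0_h], dim=-1)
    print(decoder_input.shape)                 # torch.Size([8, 2048])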
An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to adopt a model to synthesize voice or call the synthesized voice from the database as output, and opens an http port for outputting the synthesized emotion voice.
Further, previously synthesized emotional speech is preferentially used as the output, and model synthesis is used as a fallback, so as to improve the response speed of the system.
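The cache-first behaviour of the synthesized voice scheduling module might look like the following sketch; the dictionary standing in for the MySQL database and the plain function standing in for the HTTP layer and the synthesis model are assumptions made for brevity.

    # Sketch of the scheduling module's cache-first policy: return previously
    # synthesized speech from storage when available, otherwise fall back to
    # model synthesis and store the result. The dict stands in for the MySQL
    # database and synthesize() for the network model; both are assumptions.
    speech_cache = {}                                  # (text, emotion) -> waveform bytes

    def synthesize(text, emotion):
        return b"..."                                  # placeholder for the Tacotron2-based model

    def get_emotional_speech(text, emotion):
        key = (text, emotion)
        if key in speech_cache:                        # prefer already-synthesized speech
            return speech_cache[key]
        wav = synthesize(text, emotion)                # fall back to model synthesis
        speech_cache[key] = wav
        return wav

    print(len(get_emotional_speech("今天天气真好", "happy")))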
The invention has the advantages that:
1. According to the emotion voice synthesis method, the emotion of the synthesized speech is indirectly controlled by controlling the pronunciation of words. Words are the basic units of pronunciation rhythm, and people express different emotions by controlling the volume, speed and fundamental frequency with which different words are pronounced; by imitating the way humans express emotion through pronunciation, the method synthesizes the emotion of speech better and makes the synthesized speech more natural;
2. According to the emotion voice synthesis method, the three elements related to emotional pronunciation are predicted by independent speech rate, energy and fundamental frequency prediction networks, so the final speech output can be conveniently controlled by multiplying the output of each independent network by a simple coefficient (see the sketch after this list);
3. According to the emotion voice synthesis method, Tacotron2 is used as the backbone network, which effectively improves the final speech synthesis quality;
4. The emotion voice synthesis system provides an emotional speech calling interface; high-quality emotional speech can be synthesized through a simple HTTP call, which can greatly improve the user experience in scenarios requiring human-computer voice interaction, such as intelligent telephone customer service dialogue, intelligent map navigation dialogue, conversation robots in children's education, and humanoid robot dialogue in banks, airports and the like.
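The coefficient-based control mentioned in advantage 2 could be exercised as in the sketch below; the coefficient values and the point of application (scaling the implicit features before splicing) are assumptions.

    # Sketch of advantage 2: scale the output of each independent prosody
    # network by a simple coefficient before splicing, e.g. to make the speech
    # slightly faster and more energetic. Values are illustrative assumptions.
    import torch

    speed_h  = torch.randn(8, 512)     # speech rate implicit features
    energy_h = torch.randn(8, 512)     # energy implicit features
    f0_h     = torch.randn(8, 512)     # fundamental frequency implicit features

    speed_coef, energy_coef, f0_coef = 1.2, 1.1, 1.0   # adjustment coefficients (illustrative)

    speed_h, energy_h, f0_h = speed_coef * speed_h, energy_coef * energy_h, f0_coef * f0_h
    # ...then splice with the Encoder output as in step S15.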
Drawings
FIG. 1 is a schematic diagram of an emotion speech synthesis system according to the present invention;
FIG. 2 is a flow chart of the emotion voice synthesis method of the present invention;
FIG. 3 is a schematic diagram of a network structure of the emotion speech synthesis method of the present invention;
FIG. 4 is a schematic diagram of the network structure of Tacotron2 in the speech synthesis system.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an emotion speech synthesis system with vocabulary and phoneme pronunciation features fused includes:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to synthesize speech with the model or to retrieve previously synthesized speech from the database as output, and opens an HTTP port for outputting the synthesized emotional speech; previously synthesized emotional speech is used preferentially and model synthesis is used as a fallback, in order to improve the response speed of the system.
As shown in fig. 2-4, an emotion speech synthesis method with vocabulary and phoneme pronunciation characteristics fused includes the following steps:
step S1, collecting text and emotion labels: through a recording collection device, collecting speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech;
step S2, preprocessing the text and acquiring phonemes and phoneme alignment information: converting the text collected in step S1 into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the audio obtained in step S1, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, preprocessing the text and generating word segmentation and word segmentation semantic information: performing word segmentation on the text with a word segmentation tool, i.e. inserting word segmentation boundary identifiers into the original text to generate the word segmentation text; then inputting the word segmentation text into a pre-trained Bert network whose output width per Chinese character is D, obtaining word segmentation features of dimension N×D, i.e. a sequence of N vectors, each of dimension D;
step S4, calculating word segmentation pronunciation duration information: from the phoneme-duration text generated in step S2 and the word segmentation text generated in step S3, calculating the pronunciation duration of each segmented word to obtain a word segmentation-duration text;
step S5, calculating word segmentation speech rate information: from the word segmentation-duration text obtained in step S4, calculating the speech rate of each segmented word and classifying the speech rate into 5 classes ranging from slow to fast, thereby obtaining the speech rate class label corresponding to the word segmentation text;
step S6, calculating word segmentation pronunciation energy information: for the audio obtained in step S1 and the word segmentation-duration text obtained in step S4, calculating the pronunciation energy of each segmented word as the sum of squares of the audio amplitude within that word's duration, and classifying the energy into five classes ranging from low to high, thereby obtaining the energy label corresponding to the word segmentation text;
step S7, calculating phoneme fundamental frequency information: for the audio obtained in step S1 and the phoneme-duration text obtained in step S2, calculating the fundamental frequency of each phoneme's pronunciation through an audio processing toolkit, and classifying the fundamental frequency into five classes ranging from low to high, thereby obtaining the fundamental frequency label corresponding to the phoneme text;
Step S8, training the participle speech rate prediction network Net _ WordSpeed: the emotion types obtained in the step S1
Figure 468975DEST_PATH_IMAGE064
And the word segmentation characteristics obtained in step S3
Figure 7272DEST_PATH_IMAGE065
The speech rate category label obtained in step S5 is used as the network input
Figure 890915DEST_PATH_IMAGE066
The reason why the BiLSTM bidirectional long-short term memory network is adopted as the network target and is input into a deep learning sequence prediction network BiLSTM-CRF is thatBecause BilSTM is particularly suitable for processing sequence-class tasks, such as speech signal processing, text signal processing, etc., and then obtaining the participle speech speed prediction network Net _ WordSpeed through deep learning network training. Specifically, the method comprises the following steps:
step A: emotional type
Figure 239987DEST_PATH_IMAGE067
Converting the signal into a One-Hot vector with the width of 7 by using an One-Hot vector conversion technology, and then converting the signal into a label input implicit characteristic with the dimension of D through a single-layer full-connection network with the width of D
Figure 209081DEST_PATH_IMAGE068
And B: obtained in step S3 and step A
Figure 285490DEST_PATH_IMAGE068
And
Figure 289218DEST_PATH_IMAGE069
splicing in the first dimension to obtain the network input
Figure 543613DEST_PATH_IMAGE070
In particular, the amount of the surfactant is,
Figure 2DEST_PATH_IMAGE071
and C: label with length of word segmentation N
Figure 880102DEST_PATH_IMAGE072
Converting the vector into an One-Hot vector with the width of 5 by using an One-Hot vector conversion technology to finally obtain a network label matrix with the dimension of Nx 5
Figure 738337DEST_PATH_IMAGE073
In particular, the amount of the surfactant is,
Figure 429212DEST_PATH_IMAGE074
wherein the content of the first and second substances,
Figure 107318DEST_PATH_IMAGE075
is a vector of dimension 5;
step D: inputting the network obtained in the step B
Figure 791109DEST_PATH_IMAGE070
And the network label matrix obtained in the step C
Figure 503850DEST_PATH_IMAGE073
Inputting the predicted text speech into BLSTM-CRF network for training, and obtaining the speech rate predicting network Net _ WordSpeed capable of predicting text speech rate through automatic learning of network.
Step S9, training the word segmentation energy prediction network Net _ WordEnergy: the emotion types obtained in the step S1
Figure 365627DEST_PATH_IMAGE076
And the word segmentation characteristics obtained in step S3
Figure 531029DEST_PATH_IMAGE069
The energy label obtained in step S6 is used as the network input
Figure 18511DEST_PATH_IMAGE077
The network object is input to the deep learning sequence prediction network BLSTM-CRF, and the participle energy prediction network Net _ wordrenergy is obtained by the same processing method as that in step S8.
Step S10, training the phoneme fundamental frequency prediction network Net _ PhonemEF 0: the emotion types obtained in the step S1
Figure 320180DEST_PATH_IMAGE076
And the phoneme text obtained in step S2
Figure 618437DEST_PATH_IMAGE078
All of the signals are converted into vector form by One-Hot conversion technology and then used as network input, and the fundamental frequency label obtained in step S7
Figure 536714DEST_PATH_IMAGE079
And converting the phoneme base frequency prediction network into a vector form by using an One-Hot conversion technology, inputting the vector form as a network target into a deep learning sequence prediction network BLS TM-CRF, and obtaining the phoneme base frequency prediction network Net _ PhonemF 0 by using a training method the same as the step S8.
Step S11, obtaining phoneme implicit information through the Encoder of Tacotron 2: subjecting the product obtained in step S2
Figure 296729DEST_PATH_IMAGE080
Inputting the data into an Encoder network of a Tacotron2 network to obtain the output characteristics of the Encoder network
Figure 984062DEST_PATH_IMAGE081
Step S12, obtaining the implied information of word speed through Net _ WordSpeed: the word segmentation characteristics obtained in the step S3
Figure 187642DEST_PATH_IMAGE082
Inputting the word-segmentation speech rate prediction network Net _ WordSpeed obtained in step S8 to obtain the speech rate implicit characteristic output by BilSTM
Figure 327636DEST_PATH_IMAGE083
According to the number of phonemes contained in each participle, length completion is carried out on the speech speed implicit characteristics in the time dimension through copying, and the speech speed implicit characteristics with the length being the number of phonemes are obtained
Figure 156920DEST_PATH_IMAGE084
Step S13, acquiring the implied word segmentation energy information through Net _ WordEnergy: the word segmentation characteristics obtained in the step S3
Figure 698760DEST_PATH_IMAGE082
Inputting the word segmentation energy prediction network Net _ WordEnergy obtained in the step S9 to obtain the energy implicit characteristic output by the BilSTM
Figure 73241DEST_PATH_IMAGE085
According to the number of phonemes contained in each participle, length completion is carried out on the energy implicit characteristics in the time dimension through copying, and the energy implicit characteristics with the length being the number of phonemes are obtained
Figure 700531DEST_PATH_IMAGE086
Step S14, acquiring phoneme fundamental frequency implicit information through Net _ PhonemeF 0: the phoneme text obtained in step S2
Figure 808208DEST_PATH_IMAGE087
Inputting the phoneme fundamental frequency prediction network Net _ PhonemF 0 obtained in the step S10 to obtain the phoneme fundamental frequency implicit characteristics output by the BilSTM
Figure 470133DEST_PATH_IMAGE088
Step S15, splicing phoneme implicit information, participle speed implicit information, participle energy implicit information and phoneme fundamental frequency implicit information: step S11 is obtained
Figure 15515DEST_PATH_IMAGE081
Step S12 obtains
Figure 395681DEST_PATH_IMAGE089
Step S13 obtains
Figure 301189DEST_PATH_IMAGE038
Step S14 obtains
Figure 552042DEST_PATH_IMAGE088
To be spliced to obtain the final Decoder network input of Tacotron2
Figure 517593DEST_PATH_IMAGE090
In particular, the amount of the surfactant is,
Figure 775267DEST_PATH_IMAGE091
step S16, synthesizing emotion voice: and inputting the result obtained in the step S15 into a Decoder network of a Decoder of a Tacotron2, and then decoding and synthesizing the result through the subsequent structure of the Tacotron2 network to obtain the final emotional voice.
In summary, the method provided by the present embodiment improves the rationality of emotion speech feature generation by controlling the pronunciation of the text vocabulary, and can improve the quality of finally synthesized emotion speech.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.

Claims (10)

1. An emotion voice synthesis method fusing vocabulary and phoneme pronunciation characteristics is characterized by comprising the following steps:
acquiring a text and an emotion label through recording acquisition equipment;
preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word segmentation semantic information;
step three, respectively calculating and obtaining word segmentation pronunciation duration information, word segmentation pronunciation speed information, word segmentation pronunciation energy information and phoneme fundamental frequency information;
step four, respectively training a word segmentation speech rate prediction network Net_WordSpeed, a word segmentation energy prediction network Net_WordEnergy and a phoneme fundamental frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word segmentation speech rate implicit information through Net_WordSpeed, word segmentation energy implicit information through Net_WordEnergy, and phoneme fundamental frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, the word segmentation speech rate implicit information, the word segmentation energy implicit information and the phoneme fundamental frequency implicit information to synthesize the emotional voice.
2. The method as claimed in claim 1, wherein step one comprises step S1: collecting, through the recording acquisition equipment, speech audio of the 7 emotion types of neutral, happy, sad, angry, afraid, disgusted and surprised, the text corresponding to the speech, and the emotion type corresponding to the speech.
3. The method as claimed in claim 2, wherein the second step comprises the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time alignment information of the text through the speech processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, performing word segmentation on the text with a word segmentation tool, i.e. inserting word segmentation boundary identifiers into the original text to generate the word segmentation text; then inputting the word segmentation text into a pre-trained Bert network whose output width per Chinese character is D, obtaining word segmentation features of dimension N×D, i.e. a sequence of N vectors, each of dimension D.
4. The method as claimed in claim 3, wherein the third step comprises the following steps:
step S4, calculating the pronunciation duration of each segmented word from the generated phoneme-duration text and the generated word segmentation text, to obtain a word segmentation-duration text;
step S5, calculating the speech rate of each segmented word from the obtained word segmentation-duration text, and classifying the speech rate into 5 classes ranging from slow to fast, to obtain the speech rate class label corresponding to the word segmentation text;
step S6, for the audio and the word segmentation-duration text, calculating the pronunciation energy of each segmented word as the sum of squares of the audio amplitude within that word's duration, and classifying the energy into five classes ranging from low to high, to obtain the energy label corresponding to the word segmentation text;
step S7, for the audio and the phoneme-duration text, calculating the fundamental frequency of each phoneme's pronunciation through an audio processing toolkit, and classifying the fundamental frequency into five classes ranging from low to high, to obtain the fundamental frequency label corresponding to the phoneme text.
5. The method as claimed in claim 4, wherein the step four includes the following steps:
step S8, training the word segmentation speech rate prediction network Net_WordSpeed: taking the emotion type and the word segmentation features as the network input and the speech rate class labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation speech rate prediction network Net_WordSpeed through deep learning network training;
step S9, training the word segmentation energy prediction network Net_WordEnergy: taking the emotion type and the word segmentation features as the network input and the energy labels as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the word segmentation energy prediction network Net_WordEnergy with the same processing method as step S8;
step S10, training the phoneme fundamental frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form with the One-Hot conversion technique as the network input, converting the fundamental frequency labels into vector form with the One-Hot conversion technique as the network target, inputting them into the deep learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental frequency prediction network Net_PhonemeF0 with the same training method as step S8.
6. The method as claimed in claim 5, wherein the step S8 comprises the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 with the One-Hot vector conversion technique, and then converting it through a single-layer fully-connected network of width D into a label input implicit feature of dimension D;
step B: splicing the obtained word segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the label sequence, whose length is the word segmentation count N, into One-Hot vectors of width 5 with the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5, i.e. a sequence of N vectors, each of dimension 5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through the automatic learning of the network, the speech rate prediction network Net_WordSpeed capable of predicting the speech rate of text.
7. The method as claimed in claim 5, wherein the fifth step comprises the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the corresponding phoneme text into the Encoder network of the Tacotron2 network to obtain the output features of the Encoder network;
step S12, obtaining word segmentation speech rate implicit information through Net_WordSpeed: inputting the word segmentation features into the word segmentation speech rate prediction network Net_WordSpeed to obtain the speech rate implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the speech rate implicit features in the time dimension by copying, to obtain speech rate implicit features whose length equals the number of phonemes;
step S13, obtaining word segmentation energy implicit information through Net_WordEnergy: inputting the word segmentation features into the word segmentation energy prediction network Net_WordEnergy to obtain the energy implicit features output by the BiLSTM; then, according to the number of phonemes contained in each segmented word, completing the length of the energy implicit features in the time dimension by copying, to obtain energy implicit features whose length equals the number of phonemes;
step S14, obtaining phoneme fundamental frequency implicit information through Net_PhonemeF0: inputting the phoneme text into the phoneme fundamental frequency prediction network Net_PhonemeF0 to obtain the phoneme fundamental frequency implicit features output by the BiLSTM.
8. The method as claimed in claim 7, wherein the sixth step comprises the following steps:
step S15, splicing the phoneme implicit features, the speech rate implicit features, the energy implicit features and the phoneme fundamental frequency implicit features to obtain the final input of the Tacotron2 Decoder network;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
9. An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
the text acquisition module is used for acquiring, through HTTP transmission, the text content and emotion labels to be synthesized;
the text preprocessing module is used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, comprising: sequentially unifying the symbols in the text into English symbols, converting numeric formats into Chinese text, segmenting the Chinese text into words, converting the segmented text into a semantic vector representation through the pre-trained Bert, and converting the text into a phoneme text through the pypinyin toolkit; the emotion labels are converted into vector representations through One-Hot conversion, generating data that can be processed by the neural network;
the emotion voice synthesis module is used for processing the text and the emotion information through the designed network model and synthesizing emotion voice;
the data storage module is used for storing the synthesized emotion voice by utilizing a MySQL database;
and the synthesized voice scheduling module is used for deciding whether to adopt a model to synthesize voice or call the synthesized voice from the database as output, and opens an http port for outputting the synthesized emotion voice.
10. The system of claim 9, wherein previously synthesized emotional speech is preferentially used as the output and model synthesis is used as a fallback, so as to improve the response speed of the system.
CN202110600732.4A 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics Active CN113257225B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110600732.4A CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Publications (2)

Publication Number Publication Date
CN113257225A true CN113257225A (en) 2021-08-13
CN113257225B CN113257225B (en) 2021-11-02

Family

ID=77185459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110600732.4A Active CN113257225B (en) 2021-05-31 2021-05-31 Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics

Country Status (1)

Country Link
CN (1) CN113257225B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711413A (en) * 2023-11-02 2024-03-15 广东广信通信服务有限公司 Voice recognition data processing method, system, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
US20080082329A1 (en) * 2006-09-29 2008-04-03 Joseph Watson Multi-pass speech analytics
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6665644B1 (en) * 1999-08-10 2003-12-16 International Business Machines Corporation Conversational data mining
US20080082329A1 (en) * 2006-09-29 2008-04-03 Joseph Watson Multi-pass speech analytics
CN108364632A (en) * 2017-12-22 2018-08-03 东南大学 A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * 2020-04-21 2020-09-04 升智信息科技(南京)有限公司 Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * 2020-06-17 2020-09-22 厦门快商通科技股份有限公司 Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * 2020-12-30 2021-05-11 科大讯飞股份有限公司 Speech synthesis method, electronic device, and storage device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117711413A (en) * 2023-11-02 2024-03-15 广东广信通信服务有限公司 Voice recognition data processing method, system, device and storage medium

Also Published As

Publication number Publication date
CN113257225B (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN106056207B (en) A kind of robot depth interaction and inference method and device based on natural language
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN109036371A (en) Audio data generation method and system for speech synthesis
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN115329779B (en) Multi-person dialogue emotion recognition method
CN113838448B (en) Speech synthesis method, device, equipment and computer readable storage medium
CN107221344A (en) A kind of speech emotional moving method
CN111341293A (en) Text voice front-end conversion method, device, equipment and storage medium
CN111951781A (en) Chinese prosody boundary prediction method based on graph-to-sequence
Zhao et al. End-to-end-based Tibetan multitask speech recognition
Dongmei Design of English text-to-speech conversion algorithm based on machine learning
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
CN114821088A (en) Multi-mode depth feature extraction method and system based on optimized BERT model
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN116129868A (en) Method and system for generating structured photo
CN116092472A (en) Speech synthesis method and synthesis system
CN114446324A (en) Multi-mode emotion recognition method based on acoustic and text features
CN114898779A (en) Multi-mode fused speech emotion recognition method and system
CN114708848A (en) Method and device for acquiring size of audio and video file
CN114694633A (en) Speech synthesis method, apparatus, device and storage medium
CN114121018A (en) Voice document classification method, system, device and storage medium
CN114973045A (en) Hierarchical multi-modal emotion analysis method based on multi-task learning
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
CN112992116A (en) Automatic generation method and system of video content
CN113628609A (en) Automatic audio content generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant