CN113257225A - Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics - Google Patents
- Publication number: CN113257225A
- Application number: CN202110600732.4A
- Authority: CN (China)
- Prior art keywords: text, phoneme, network, information, word segmentation
- Prior art date: 2021-05-31
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
  - G10—MUSICAL INSTRUMENTS; ACOUSTICS
    - G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
      - G10L13/00—Speech synthesis; Text to speech systems
        - G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
          - G10L13/10—Prosody rules derived from text; Stress or intonation
  - G06—COMPUTING; CALCULATING OR COUNTING
    - G06F—ELECTRIC DIGITAL DATA PROCESSING
      - G06F40/00—Handling natural language data
        - G06F40/20—Natural language analysis
          - G06F40/279—Recognition of textual entities
            - G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
        - G06F40/30—Semantic analysis
Abstract
The invention belongs to the field of artificial intelligence and relates to an emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics. The method comprises the following steps: collecting speech, the corresponding text, and emotion labels through a recording acquisition device; preprocessing the text to obtain phonemes and phoneme alignment information and to generate word segmentation and word-segmentation semantic information; respectively calculating word pronunciation-duration information, word speech-rate information, word pronunciation-energy information, and phoneme fundamental-frequency information; respectively training a word speech-rate prediction network, a word energy prediction network, and a phoneme fundamental-frequency prediction network; obtaining and splicing phoneme implicit information, word speech-rate implicit information, word energy implicit information, and phoneme fundamental-frequency implicit information; and synthesizing the emotional speech. By fusing the vocabulary and phoneme pronunciation characteristics related to emotional pronunciation into an end-to-end speech synthesis model, the invention makes the synthesized emotional speech more natural.
Description
Technical Field
The invention belongs to the field of artificial intelligence, and particularly relates to an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics.
Background
Spoken interaction is one of the earliest forms of human communication, and speech is therefore a primary way for humans to express emotion. With the rise of human-computer interaction, there is an urgent need for conversation robots that can express emotion like a human and speak like a real person. The most widely used emotion taxonomy is the seven emotion types proposed by Ekman in the last century: neutral, happy, sad, angry, afraid, disgusted, and surprised.
With the rise of deep learning in recent years, speech synthesis technology has matured, and a machine can be made to pronounce like a human speaker. However, making a machine produce speech that carries emotion like a human remains very difficult. Mainstream emotional speech synthesis can currently be divided into two approaches: a segment-based method built on hidden Markov models, a traditional machine-learning technique, and an end-to-end method based on deep learning. Speech synthesized with the hidden-Markov approach sounds strongly mechanical and unnatural and is now rarely used, while speech synthesized with deep-learning methods is comparatively natural. However, current deep-learning emotional speech synthesis only fuses the emotion label into the text features in a simple way, so the quality of the synthesized emotional speech cannot be effectively guaranteed.
In the prior art, the way emotion information is fused is simple: the emotion label is generally just merged into the text features, and the characteristics of human emotional pronunciation are not considered. As a result, the model cannot learn the emotion information well, and the synthesized emotional speech sounds stiff and unnatural.
Disclosure of Invention
In order to solve the technical problems in the prior art, the invention provides an emotion voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics, and the specific technical scheme is as follows:
an emotional voice synthesis method fusing vocabulary and phoneme pronunciation characteristics comprises the following steps:
step one, acquiring speech audio, the corresponding text, and emotion labels through a recording acquisition device;
step two, preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word-segmentation semantic information;
step three, respectively calculating word pronunciation-duration information, word speech-rate information, word pronunciation-energy information, and phoneme fundamental-frequency information;
step four, respectively training a word speech-rate prediction network Net_WordSpeed, a word energy prediction network Net_WordEnergy, and a phoneme fundamental-frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word speech-rate implicit information through Net_WordSpeed, word energy implicit information through Net_WordEnergy, and phoneme fundamental-frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, word speech-rate implicit information, word energy implicit information, and phoneme fundamental-frequency implicit information, and synthesizing the emotional speech.
Further, step one specifically includes step S1: collecting, through the recording acquisition device, speech audio of the 7 emotion types (neutral, happy, sad, angry, afraid, disgusted, and surprised), the text corresponding to the speech, and the emotion type corresponding to the speech.
Further, the second step specifically includes the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time-alignment information of the text through the speech-processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, performing word segmentation on the text with a word-segmentation tool, i.e., inserting word-segmentation boundary identifiers into the original text to generate a word-segmentation text; inputting the word-segmentation text into a pre-trained Chinese Bert network with output width D to obtain word-segmentation features of dimension N×D, where N is the number of segmented words.
Further, the third step specifically includes the following steps:
step S4, using the generated phoneme-duration text and the generated word-segmentation text, calculating the pronunciation duration of each segmented word to obtain a word-duration text;
step S5, from the obtained word-duration text, calculating the speech-rate information of each segmented word and classifying the speech rate into 5 classes ranging from very slow to very fast, thereby obtaining the speech-rate class label corresponding to the word-segmentation text;
step S6, for the collected audio and the word-duration text, calculating the pronunciation-energy information of each segmented word as the sum of squares of the audio amplitude within the word's duration, and classifying the energy information into five classes ranging from very low to very high, thereby obtaining the energy label corresponding to the word-segmentation text;
step S7, for the collected audio and the phoneme-duration text, calculating the fundamental-frequency information of each phoneme's pronunciation through an audio analysis toolkit, and classifying it into five classes ranging from very low to very high according to the fundamental frequency, thereby obtaining the fundamental-frequency label corresponding to the phoneme text.
Further, the fourth step specifically includes the following steps:
step S8, training the word speech-rate prediction network Net_WordSpeed: taking the emotion type and the word-segmentation features as the network input and the speech-rate class labels as the network target, inputting them into a deep-learning sequence prediction network BiLSTM-CRF, and obtaining the word speech-rate prediction network Net_WordSpeed through deep-learning network training;
step S9, training the word energy prediction network Net_WordEnergy: taking the emotion type and the word-segmentation features as the network input and the energy labels as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the word energy prediction network Net_WordEnergy by the same processing method as step S8;
step S10, training the phoneme fundamental-frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form through the One-Hot conversion technique as the network input, converting the fundamental-frequency labels into vector form through the One-Hot conversion technique as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental-frequency prediction network Net_PhonemeF0 with the same training method as step S8.
Further, the step S8 specifically includes the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 through the One-Hot vector conversion technique, and then converting it, through a single-layer fully connected network of width D, into a label input implicit feature of dimension D;
step B: splicing the obtained word-segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the speech-rate labels, whose length is the number of segmented words N, into One-Hot vectors of width 5 through the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through automatic learning of the network, the speech-rate prediction network Net_WordSpeed capable of predicting text speech rate.
Further, the fifth step specifically includes the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the phoneme text into the Encoder network of the Tacotron2 network to obtain the Encoder output features;
step S12, obtaining word speech-rate implicit information through Net_WordSpeed: inputting the word-segmentation features into the word speech-rate prediction network Net_WordSpeed to obtain the speech-rate implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the speech-rate implicit features in the time dimension by copying, to obtain speech-rate implicit features whose length equals the number of phonemes;
step S13, obtaining word energy implicit information through Net_WordEnergy: inputting the word-segmentation features into the word energy prediction network Net_WordEnergy to obtain the energy implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the energy implicit features in the time dimension by copying, to obtain energy implicit features whose length equals the number of phonemes;
step S14, obtaining phoneme fundamental-frequency implicit information through Net_PhonemeF0: inputting the phoneme text into the phoneme fundamental-frequency prediction network Net_PhonemeF0 to obtain the phoneme fundamental-frequency implicit features output by the BiLSTM.
Further, the sixth step specifically includes the following steps:
step S15, splicing the phoneme implicit information, the word speech-rate implicit features, the word energy implicit features, and the phoneme fundamental-frequency implicit features to obtain the final Decoder network input of Tacotron2;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
a text acquisition module, used for acquiring, via http transmission, the text content and emotion label to be synthesized;
a text preprocessing module, used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, which sequentially: unifies the text symbols into English punctuation, converts digits into Chinese text form, performs Chinese word segmentation, converts the segmented words of the word-segmentation text into a semantic vector representation through a pre-trained Bert, and converts the text into a phoneme text through the pypinyin toolkit, while the emotion label is converted into its vector representation through One-Hot conversion, thereby generating data that can be processed by the neural networks;
an emotional speech synthesis module, used for processing the text and emotion information through the designed network model and synthesizing emotional speech;
a data storage module, used for storing the synthesized emotional speech in a MySQL database;
and a synthesized-speech scheduling module, used for deciding whether to synthesize speech with the model or to retrieve previously synthesized speech from the database as output, and which opens an http port for outputting the synthesized emotional speech.
Further, previously synthesized emotional speech is preferentially used as the output, with model synthesis as the fallback, so as to improve the response speed of the system.
The invention has the advantages that:
1. The emotional speech synthesis method indirectly controls the emotion of the synthesized speech by controlling the pronunciation of words. Words are the basic units of pronunciation rhythm, and people express different emotions by controlling the volume, speed, and fundamental frequency with which different words are pronounced; by imitating the way human pronunciation expresses emotion, the method synthesizes the emotion of speech better and makes the synthesized speech more natural.
2. The method predicts the three elements related to emotional pronunciation with independent speech-rate, energy, and fundamental-frequency prediction networks, so the final speech output can conveniently be controlled by simply multiplying the output of each independent network by an adjustment coefficient.
3. The method uses Tacotron2 as the backbone network, which effectively improves the final speech synthesis quality.
4. The emotional speech synthesis system provides an emotional speech calling interface; high-quality speech with emotion can be synthesized through a simple http call, which can greatly improve the user experience in scenarios requiring human-computer voice interaction. The method can be used in intelligent telephone customer-service dialogue scenarios, intelligent map navigation dialogue scenarios, conversation-robot interaction scenarios in children's education, and humanoid-robot dialogue interaction scenarios in banks, airports, and the like.
Drawings
FIG. 1 is a schematic diagram of an emotion speech synthesis system according to the present invention;
FIG. 2 is a flow chart of the emotion voice synthesis method of the present invention;
FIG. 3 is a schematic diagram of a network structure of the emotion speech synthesis method of the present invention;
FIG. 4 is a schematic diagram of the Tacotron2 network structure used by the speech synthesis system.
Detailed Description
In order to make the objects, technical solutions and technical effects of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an emotion speech synthesis system with vocabulary and phoneme pronunciation features fused includes:
a text acquisition module, used for acquiring, via http transmission, the text content and emotion label to be synthesized;
a text preprocessing module, used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, which sequentially: unifies the text symbols into English punctuation, converts digits into Chinese text form, performs Chinese word segmentation, converts the segmented words of the word-segmentation text into a semantic vector representation through a pre-trained Bert, and converts the text into a phoneme text through the pypinyin toolkit, while the emotion label is converted into its vector representation through One-Hot conversion, thereby generating data that can be processed by the neural networks;
an emotional speech synthesis module, used for processing the text and emotion information through the designed network model and synthesizing emotional speech;
a data storage module, used for storing the synthesized emotional speech in a MySQL database;
and a synthesized-speech scheduling module, used for deciding whether to synthesize speech with the model or to retrieve previously synthesized speech from the database as output, and which opens an http port for outputting the synthesized emotional speech; previously synthesized emotional speech is preferentially used as the output, with model synthesis as the fallback, to improve the response speed of the system (a minimal sketch of this decision follows).
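The cache-first scheduling decision can be illustrated with a minimal Python sketch. The table name, column names, and the synthesize_fn callable below are illustrative assumptions; the patent only specifies that a MySQL database stores previously synthesized speech and that the model is invoked when no stored result is available.

```python
# Minimal sketch of the scheduling module's cache-first decision.
# Table/column names and the synthesize_fn callable are assumptions for illustration.
def get_emotion_speech(text, emotion, db_connection, synthesize_fn):
    cur = db_connection.cursor()
    cur.execute(
        "SELECT audio FROM emotion_speech WHERE text=%s AND emotion=%s",
        (text, emotion),
    )
    row = cur.fetchone()
    if row is not None:
        return row[0]                        # previously synthesized speech: fastest response
    audio = synthesize_fn(text, emotion)     # fall back to model synthesis
    cur.execute(
        "INSERT INTO emotion_speech (text, emotion, audio) VALUES (%s, %s, %s)",
        (text, emotion, audio),
    )
    db_connection.commit()
    return audio
```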
As shown in fig. 2-4, an emotion speech synthesis method with vocabulary and phoneme pronunciation characteristics fused includes the following steps:
Step S1, collecting text and emotion labels: through a recording acquisition device, collecting speech audio of the 7 emotion types (neutral, happy, sad, angry, afraid, disgusted, and surprised), the text corresponding to the speech, and the emotion type corresponding to the speech;
Step S2, preprocessing the text and acquiring phonemes and phoneme alignment information: converting the text collected in step S1 into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the audio obtained in step S1, obtaining time-alignment information of the text through the speech-processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme, as sketched below;
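A minimal sketch of the text-to-phoneme conversion in step S2 is shown below. The patent does not specify the exact pinyin style, so tone-numbered pinyin is assumed here; the HTK forced-alignment step that produces the phoneme-duration text is an external tool run and is not shown.

```python
# Sketch of step S2: Chinese text -> phoneme (pinyin) text via pypinyin.
# Tone-numbered pinyin (Style.TONE3) is an assumption; HTK alignment is not shown.
from pypinyin import lazy_pinyin, Style

def text_to_phonemes(text):
    return lazy_pinyin(text, style=Style.TONE3)

print(text_to_phonemes("今天天气很好"))
# -> ['jin1', 'tian1', 'tian1', 'qi4', 'hen3', 'hao3']
```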
Step S3, preprocessing the text and generating word segmentation and word-segmentation semantic information: performing word segmentation on the text with a word-segmentation tool, i.e., inserting word-segmentation boundary identifiers into the original text to generate a word-segmentation text; inputting the word-segmentation text into a pre-trained Chinese Bert network with output width D to obtain word-segmentation features of dimension N×D, where N is the number of segmented words (a sketch follows);
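The word segmentation and word-feature extraction in step S3 can be sketched as follows. The patent names only "a word-segmentation tool" and "a pre-trained Bert network"; jieba, the Hugging Face bert-base-chinese checkpoint, and mean-pooling of sub-token states into one D-dimensional vector per word are illustrative assumptions.

```python
# Sketch of step S3: word segmentation plus an N x D word-feature matrix from BERT.
# jieba, bert-base-chinese and the mean-pooling are assumptions, not patent details.
import jieba
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")
bert.eval()

def word_segmentation_features(text):
    words = list(jieba.cut(text))                     # insert word boundaries
    feats = []
    with torch.no_grad():
        for w in words:
            enc = tokenizer(w, return_tensors="pt")
            hidden = bert(**enc).last_hidden_state    # (1, T, D) including [CLS]/[SEP]
            feats.append(hidden[:, 1:-1].mean(dim=1)) # one D-dim vector per word
    return words, torch.cat(feats, dim=0)             # (N, D)

words, E = word_segmentation_features("今天天气很好")
print(words, E.shape)    # e.g. ['今天', '天气', '很', '好'] and torch.Size([4, 768])
```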
Step S4, calculating word pronunciation-duration information: from the phoneme-duration text generated in step S2 and the word-segmentation text generated in step S3, calculating the pronunciation duration of each segmented word to obtain a word-duration text;
Step S5, calculating word speech-rate information: from the word-duration text obtained in step S4, calculating the speech-rate information of each segmented word and classifying the speech rate into 5 classes ranging from very slow to very fast, thereby obtaining the speech-rate class label corresponding to the word-segmentation text (steps S4 and S5 are sketched below);
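Steps S4 and S5 can be sketched as below, assuming the phoneme-duration text is available as (phoneme, duration-in-seconds) pairs and the number of phonemes per segmented word is known. The quantile-based bin boundaries for the five speech-rate classes are an assumption; the patent does not give thresholds.

```python
# Sketch of steps S4-S5: per-word pronunciation duration and 5-class speech-rate labels.
# Input format and the quantile binning are assumptions for illustration.
import numpy as np

def word_durations(phone_durations, phones_per_word):
    """Sum aligned phoneme durations into per-word durations (step S4)."""
    durations, i = [], 0
    for n in phones_per_word:
        durations.append(sum(d for _, d in phone_durations[i:i + n]))
        i += n
    return durations

def speech_rate_labels(phones_per_word, word_durs, n_classes=5):
    """Phonemes per second for each word, quantized into 5 classes (step S5)."""
    rates = np.array([n / max(d, 1e-6) for n, d in zip(phones_per_word, word_durs)])
    edges = np.quantile(rates, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(rates, edges)        # 0 = very slow ... 4 = very fast

phone_durs = [("jin1", 0.20), ("tian1", 0.22), ("tian1", 0.18),
              ("qi4", 0.25), ("hen3", 0.15), ("hao3", 0.30)]
durs = word_durations(phone_durs, [2, 2, 1, 1])     # words: 今天 / 天气 / 很 / 好
print(durs, speech_rate_labels([2, 2, 1, 1], durs))
```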
Step S6, calculating word pronunciation-energy information: for the audio obtained in step S1 and the word-duration text obtained in step S4, calculating the pronunciation-energy information of each segmented word as the sum of squares of the audio amplitude within the word's duration, and classifying the energy information into five classes ranging from very low to very high, thereby obtaining the energy label corresponding to the word-segmentation text, as sketched below;
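Step S6 (pronunciation energy as the sum of squared audio amplitudes within each word's duration, quantized into five classes) can be sketched as follows; the quantile bin edges are an assumption, since the patent gives no thresholds.

```python
# Sketch of step S6: per-word energy = sum of squared samples in the word's time span,
# then 5-class quantization. Quantile bin edges are an assumption.
import numpy as np

def word_energy_labels(wav, sample_rate, word_spans, n_classes=5):
    """word_spans: list of (start_sec, end_sec) for each segmented word."""
    energies = []
    for start, end in word_spans:
        segment = wav[int(start * sample_rate):int(end * sample_rate)]
        energies.append(float(np.sum(segment ** 2)))   # sum of squared amplitudes
    energies = np.array(energies)
    edges = np.quantile(energies, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(energies, edges)                # 0 = very low ... 4 = very high
```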
Step S7, calculating phoneme fundamental-frequency information: for the audio obtained in step S1 and the phoneme-duration text obtained in step S2, calculating the fundamental-frequency information of each phoneme's pronunciation through an audio analysis toolkit, and classifying it into five classes ranging from very low to very high according to the fundamental frequency, thereby obtaining the fundamental-frequency label corresponding to the phoneme text (sketched below);
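Step S7 can be sketched as below. The toolkit name is garbled in the translated text, so librosa's pyin pitch tracker is assumed purely for illustration; the fundamental frequency is averaged over the voiced frames inside each phoneme's aligned time span and then quantized into five classes.

```python
# Sketch of step S7: phoneme-level F0 via a pitch tracker (librosa.pyin assumed),
# averaged inside each phoneme's aligned time span and quantized into 5 classes.
import librosa
import numpy as np

def phoneme_f0_labels(wav, sample_rate, phone_spans, n_classes=5):
    """phone_spans: list of (start_sec, end_sec) for each phoneme."""
    f0, _, _ = librosa.pyin(wav, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C7"), sr=sample_rate)
    hop = 512                                   # librosa.pyin default hop_length
    means = []
    for start, end in phone_spans:
        frames = f0[int(start * sample_rate / hop):int(end * sample_rate / hop) + 1]
        voiced = frames[~np.isnan(frames)]
        means.append(float(voiced.mean()) if voiced.size else 0.0)
    means = np.array(means)
    edges = np.quantile(means, np.linspace(0, 1, n_classes + 1)[1:-1])
    return np.digitize(means, edges)            # 0 = very low ... 4 = very high
```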
Step S8, training the word speech-rate prediction network Net_WordSpeed: taking the emotion type obtained in step S1 and the word-segmentation features obtained in step S3 as the network input and the speech-rate class labels obtained in step S5 as the network target, and inputting them into a deep-learning sequence prediction network BiLSTM-CRF. The BiLSTM (bidirectional long short-term memory) network is adopted because it is particularly suitable for sequence tasks such as speech-signal and text-signal processing; the word speech-rate prediction network Net_WordSpeed is then obtained through deep-learning network training (a code sketch follows step D below). Specifically, the method comprises the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 through the One-Hot vector conversion technique, and then converting it, through a single-layer fully connected network of width D, into a label input implicit feature of dimension D;
step B: splicing the word-segmentation features obtained in step S3 and the label input implicit feature obtained in step A in the first dimension to obtain the network input;
step C: converting the speech-rate labels, whose length is the number of segmented words N, into One-Hot vectors of width 5 through the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5;
step D: inputting the network input obtained in step B and the network label matrix obtained in step C into the BiLSTM-CRF network for training, and obtaining, through automatic learning of the network, the speech-rate prediction network Net_WordSpeed capable of predicting text speech rate.
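A minimal PyTorch sketch of the Net_WordSpeed setup in steps A-D is given below. The BiLSTM-CRF structure follows the description; the pytorch-crf package, the hidden size, the optimizer, and the handling of the spliced emotion position (its output is dropped before the CRF so that emissions and labels both have length N) are illustrative assumptions rather than details fixed by the patent.

```python
# Sketch of steps A-D for Net_WordSpeed (BiLSTM-CRF). Package choice (pytorch-crf),
# hidden sizes, optimizer, and dropping of the emotion position are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchcrf import CRF          # assumption: pip install pytorch-crf

D, N_EMOTIONS, N_CLASSES = 768, 7, 5

class WordSpeedNet(nn.Module):
    def __init__(self, d=D, hidden=256):
        super().__init__()
        self.emo_fc = nn.Linear(N_EMOTIONS, d)            # step A: one-hot emotion -> D-dim feature
        self.bilstm = nn.LSTM(d, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, N_CLASSES)
        self.crf = CRF(N_CLASSES, batch_first=True)

    def forward(self, word_feats, emotion_onehot, tags=None):
        # word_feats: (B, N, D); emotion_onehot: (B, N_EMOTIONS); tags: (B, N)
        emo = self.emo_fc(emotion_onehot).unsqueeze(1)    # (B, 1, D)
        x = torch.cat([emo, word_feats], dim=1)           # step B: splice along the first dimension
        h, _ = self.bilstm(x)                             # implicit (hidden) features
        emissions = self.proj(h[:, 1:])                   # drop the emotion position -> length N
        if tags is None:
            return h[:, 1:], self.crf.decode(emissions)   # inference: hidden feats + predicted classes
        return -self.crf(emissions, tags)                 # step D: CRF negative log-likelihood

net = WordSpeedNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# One training step with dummy shapes: 2 sentences of 10 words each.
word_feats = torch.randn(2, 10, D)                        # BERT word-segmentation features
emotion = F.one_hot(torch.tensor([1, 3]), N_EMOTIONS).float()
tags = torch.randint(0, N_CLASSES, (2, 10))               # step C: 5-class speech-rate labels
loss = net(word_feats, emotion, tags)
loss.backward()
optimizer.step()
```

Net_WordEnergy in step S9 would follow the same structure with the energy labels as targets.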
Step S9, training the word energy prediction network Net_WordEnergy: taking the emotion type obtained in step S1 and the word-segmentation features obtained in step S3 as the network input and the energy labels obtained in step S6 as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the word energy prediction network Net_WordEnergy by the same processing method as step S8.
Step S10, training the phoneme fundamental-frequency prediction network Net_PhonemeF0: converting both the emotion type obtained in step S1 and the phoneme text obtained in step S2 into vector form through the One-Hot conversion technique as the network input, converting the fundamental-frequency labels obtained in step S7 into vector form through the One-Hot conversion technique as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental-frequency prediction network Net_PhonemeF0 with the same training method as step S8 (the One-Hot conversion is sketched below).
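The One-Hot conversion used as input for Net_PhonemeF0 in step S10 can be illustrated as follows; the tiny phoneme vocabulary is a stand-in, since in practice the inventory comes from the pypinyin conversion of step S2.

```python
# Sketch of the One-Hot conversion in step S10: the phoneme sequence and the emotion
# type both become one-hot vectors. The small vocabularies here are illustrative only.
import torch
import torch.nn.functional as F

phoneme_vocab = {"jin1": 0, "tian1": 1, "qi4": 2, "hen3": 3, "hao3": 4}
emotion_vocab = {"neutral": 0, "happy": 1, "sad": 2, "angry": 3,
                 "afraid": 4, "disgusted": 5, "surprised": 6}

phones = ["jin1", "tian1", "tian1", "qi4", "hen3", "hao3"]
phone_ids = torch.tensor([phoneme_vocab[p] for p in phones])
phone_onehot = F.one_hot(phone_ids, num_classes=len(phoneme_vocab)).float()          # (T, |V|)
emotion_onehot = F.one_hot(torch.tensor(emotion_vocab["happy"]), num_classes=7).float()  # (7,)
print(phone_onehot.shape, emotion_onehot.shape)
```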
Step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the phoneme text obtained in step S2 into the Encoder network of the Tacotron2 network to obtain the Encoder output features;
Step S12, obtaining word speech-rate implicit information through Net_WordSpeed: inputting the word-segmentation features obtained in step S3 into the word speech-rate prediction network Net_WordSpeed obtained in step S8 to obtain the speech-rate implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the speech-rate implicit features in the time dimension by copying, to obtain speech-rate implicit features whose length equals the number of phonemes;
Step S13, obtaining word energy implicit information through Net_WordEnergy: inputting the word-segmentation features obtained in step S3 into the word energy prediction network Net_WordEnergy obtained in step S9 to obtain the energy implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the energy implicit features in the time dimension by copying, to obtain energy implicit features whose length equals the number of phonemes (the length-completion step for S12 and S13 is sketched below);
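The length completion by copying in steps S12 and S13 can be sketched as follows: each word-level hidden vector is repeated once per phoneme the word contains, so the word-level sequence is expanded to phoneme length.

```python
# Sketch of the length completion in steps S12-S13: repeat each word's hidden
# feature along the time axis according to its phoneme count.
import torch

def expand_to_phoneme_length(word_hidden, phones_per_word):
    """word_hidden: (N, H); phones_per_word: phoneme count of each word."""
    counts = torch.tensor(phones_per_word)
    return torch.repeat_interleave(word_hidden, counts, dim=0)   # (sum(counts), H)

h_words = torch.randn(4, 512)                                    # 4 words
print(expand_to_phoneme_length(h_words, [2, 2, 1, 1]).shape)     # torch.Size([6, 512])
```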
Step S14, obtaining phoneme fundamental-frequency implicit information through Net_PhonemeF0: inputting the phoneme text obtained in step S2 into the phoneme fundamental-frequency prediction network Net_PhonemeF0 obtained in step S10 to obtain the phoneme fundamental-frequency implicit features output by the BiLSTM;
Step S15, splicing the phoneme implicit information, word speech-rate implicit features, word energy implicit features, and phoneme fundamental-frequency implicit features: splicing the Encoder output obtained in step S11, the speech-rate implicit features obtained in step S12, the energy implicit features obtained in step S13, and the phoneme fundamental-frequency implicit features obtained in step S14 to obtain the final Decoder network input of Tacotron2, as sketched below;
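One plausible reading of the splicing in step S15 is concatenation of the four phoneme-length feature sequences along the feature dimension, as sketched below with illustrative feature widths.

```python
# Sketch of step S15: concatenate the four phoneme-length feature sequences along
# the feature dimension to form the Tacotron2 Decoder input. Widths are illustrative.
import torch

T = 6                                  # number of phonemes
encoder_out = torch.randn(T, 512)      # phoneme implicit information (S11)
speed_h = torch.randn(T, 256)          # expanded speech-rate implicit features (S12)
energy_h = torch.randn(T, 256)         # expanded energy implicit features (S13)
f0_h = torch.randn(T, 256)             # phoneme F0 implicit features (S14)

decoder_input = torch.cat([encoder_out, speed_h, energy_h, f0_h], dim=-1)   # (T, 1280)
print(decoder_input.shape)
```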
Step S16, synthesizing the emotional speech: inputting the spliced features obtained in step S15 into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
In summary, the method provided by this embodiment improves the rationality of the generated emotional speech features by controlling the pronunciation of the text vocabulary, and can improve the quality of the finally synthesized emotional speech.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way. Although the foregoing has described the practice of the present invention in detail, it will be apparent to those skilled in the art that modifications may be made to the practice of the invention as described in the foregoing examples, or that certain features may be substituted in the practice of the invention. All changes, equivalents and modifications which come within the spirit and scope of the invention are desired to be protected.
Claims (10)
1. An emotion voice synthesis method fusing vocabulary and phoneme pronunciation characteristics is characterized by comprising the following steps:
step one, acquiring speech audio, the corresponding text, and emotion labels through a recording acquisition device;
step two, preprocessing the text, acquiring phonemes and phoneme alignment information, and generating word segmentation and word-segmentation semantic information;
step three, respectively calculating word pronunciation-duration information, word speech-rate information, word pronunciation-energy information, and phoneme fundamental-frequency information;
step four, respectively training a word speech-rate prediction network Net_WordSpeed, a word energy prediction network Net_WordEnergy, and a phoneme fundamental-frequency prediction network Net_PhonemeF0;
step five, acquiring phoneme implicit information through the Encoder of Tacotron2, word speech-rate implicit information through Net_WordSpeed, word energy implicit information through Net_WordEnergy, and phoneme fundamental-frequency implicit information through Net_PhonemeF0;
and step six, splicing the phoneme implicit information, word speech-rate implicit information, word energy implicit information, and phoneme fundamental-frequency implicit information, and synthesizing the emotional speech.
2. The method as claimed in claim 1, wherein step one specifically includes step S1: collecting, through a recording acquisition device, speech audio of the 7 emotion types (neutral, happy, sad, angry, afraid, disgusted, and surprised), the text corresponding to the speech, and the emotion type corresponding to the speech.
3. The method as claimed in claim 2, wherein the second step comprises the following steps:
step S2, converting the collected text into the corresponding phoneme text through the pypinyin toolkit; then, from the phoneme text and the collected audio, obtaining time-alignment information of the text through the speech-processing tool HTK, and generating a phoneme-duration text containing the pronunciation duration of each phoneme;
step S3, performing word segmentation on the text with a word-segmentation tool, i.e., inserting word-segmentation boundary identifiers into the original text to generate a word-segmentation text; inputting the word-segmentation text into a pre-trained Chinese Bert network with output width D to obtain word-segmentation features of dimension N×D, where N is the number of segmented words.
4. The method as claimed in claim 3, wherein the third step comprises the following steps:
step S4, using the generated phoneme-duration text and the generated word-segmentation text, calculating the pronunciation duration of each segmented word to obtain a word-duration text;
step S5, from the obtained word-duration text, calculating the speech-rate information of each segmented word and classifying the speech rate into 5 classes ranging from very slow to very fast, thereby obtaining the speech-rate class label corresponding to the word-segmentation text;
step S6, for the collected audio and the word-duration text, calculating the pronunciation-energy information of each segmented word as the sum of squares of the audio amplitude within the word's duration, and classifying the energy information into five classes ranging from very low to very high, thereby obtaining the energy label corresponding to the word-segmentation text;
step S7, for the collected audio and the phoneme-duration text, calculating the fundamental-frequency information of each phoneme's pronunciation through an audio analysis toolkit, and classifying it into five classes ranging from very low to very high according to the fundamental frequency, thereby obtaining the fundamental-frequency label corresponding to the phoneme text.
5. The method as claimed in claim 4, wherein the step four includes the following steps:
step S8, training the word speech-rate prediction network Net_WordSpeed: taking the emotion type and the word-segmentation features as the network input and the speech-rate class labels as the network target, inputting them into a deep-learning sequence prediction network BiLSTM-CRF, and obtaining the word speech-rate prediction network Net_WordSpeed through deep-learning network training;
step S9, training the word energy prediction network Net_WordEnergy: taking the emotion type and the word-segmentation features as the network input and the energy labels as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the word energy prediction network Net_WordEnergy by the same processing method as step S8;
step S10, training the phoneme fundamental-frequency prediction network Net_PhonemeF0: converting the emotion type and the phoneme text into vector form through the One-Hot conversion technique as the network input, converting the fundamental-frequency labels into vector form through the One-Hot conversion technique as the network target, inputting them into the deep-learning sequence prediction network BiLSTM-CRF, and obtaining the phoneme fundamental-frequency prediction network Net_PhonemeF0 with the same training method as step S8.
6. The method as claimed in claim 5, wherein the step S8 comprises the following steps:
step A: converting the emotion type into a One-Hot vector of width 7 through the One-Hot vector conversion technique, and then converting it, through a single-layer fully connected network of width D, into a label input implicit feature of dimension D;
step B: splicing the obtained word-segmentation features and the label input implicit feature in the first dimension to obtain the network input;
step C: converting the speech-rate labels, whose length is the number of segmented words N, into One-Hot vectors of width 5 through the One-Hot vector conversion technique, finally obtaining a network label matrix of dimension N×5;
step D: inputting the network input and the network label matrix into the BiLSTM-CRF network for training, and obtaining, through automatic learning of the network, the speech-rate prediction network Net_WordSpeed capable of predicting text speech rate.
7. The method as claimed in claim 5, wherein the fifth step comprises the following steps:
step S11, obtaining phoneme implicit information through the Encoder of Tacotron2: inputting the phoneme text into the Encoder network of the Tacotron2 network to obtain the Encoder output features;
step S12, obtaining word speech-rate implicit information through Net_WordSpeed: inputting the word-segmentation features into the word speech-rate prediction network Net_WordSpeed to obtain the speech-rate implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the speech-rate implicit features in the time dimension by copying, to obtain speech-rate implicit features whose length equals the number of phonemes;
step S13, obtaining word energy implicit information through Net_WordEnergy: inputting the word-segmentation features into the word energy prediction network Net_WordEnergy to obtain the energy implicit features output by the BiLSTM; according to the number of phonemes contained in each segmented word, completing the length of the energy implicit features in the time dimension by copying, to obtain energy implicit features whose length equals the number of phonemes;
step S14, obtaining phoneme fundamental-frequency implicit information through Net_PhonemeF0: inputting the phoneme text into the phoneme fundamental-frequency prediction network Net_PhonemeF0 to obtain the phoneme fundamental-frequency implicit features output by the BiLSTM.
8. The method as claimed in claim 7, wherein the sixth step comprises the following steps:
step S15, splicing the phoneme implicit information, the word speech-rate implicit features, the word energy implicit features, and the phoneme fundamental-frequency implicit features to obtain the final Decoder network input of Tacotron2;
step S16, inputting the spliced features into the Decoder network of Tacotron2, and then decoding and synthesizing through the subsequent structure of the Tacotron2 network to obtain the final emotional speech.
9. An emotion speech synthesis system fusing vocabulary and phoneme pronunciation characteristics, comprising:
a text acquisition module, used for acquiring, via http transmission, the text content and emotion label to be synthesized;
a text preprocessing module, used for preprocessing the acquired text and performing word segmentation and phoneme conversion on it, which sequentially: unifies the text symbols into English punctuation, converts digits into Chinese text form, performs Chinese word segmentation, converts the segmented words of the word-segmentation text into a semantic vector representation through a pre-trained Bert, and converts the text into a phoneme text through the pypinyin toolkit, while the emotion label is converted into its vector representation through One-Hot conversion, thereby generating data that can be processed by the neural networks;
an emotional speech synthesis module, used for processing the text and emotion information through the designed network model and synthesizing emotional speech;
a data storage module, used for storing the synthesized emotional speech in a MySQL database;
and a synthesized-speech scheduling module, used for deciding whether to synthesize speech with the model or to retrieve previously synthesized speech from the database as output, and which opens an http port for outputting the synthesized emotional speech.
10. The system of claim 9, wherein previously synthesized emotional speech is preferentially used as the output, with model synthesis as the fallback, so as to improve the response speed of the system.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110600732.4A (CN113257225B) | 2021-05-31 | 2021-05-31 | Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Publications (2)
Publication Number | Publication Date
---|---
CN113257225A | 2021-08-13
CN113257225B | 2021-11-02
Family: ID=77185459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202110600732.4A (Active, granted as CN113257225B) | Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics | 2021-05-31 | 2021-05-31
Country Status (1)
Country | Link
---|---
CN | CN113257225B (en)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US6665644B1 (en) * | 1999-08-10 | 2003-12-16 | International Business Machines Corporation | Conversational data mining
US20080082329A1 (en) * | 2006-09-29 | 2008-04-03 | Joseph Watson | Multi-pass speech analytics
CN108364632A (en) * | 2017-12-22 | 2018-08-03 | 东南大学 | A kind of Chinese text voice synthetic method having emotion
CN111627420A (en) * | 2020-04-21 | 2020-09-04 | 升智信息科技(南京)有限公司 | Specific-speaker emotion voice synthesis method and device under extremely low resources
CN111696579A (en) * | 2020-06-17 | 2020-09-22 | 厦门快商通科技股份有限公司 | Speech emotion recognition method, device, equipment and computer storage medium
CN112786004A (en) * | 2020-12-30 | 2021-05-11 | 科大讯飞股份有限公司 | Speech synthesis method, electronic device, and storage device
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN117711413A (en) * | 2023-11-02 | 2024-03-15 | 广东广信通信服务有限公司 | Voice recognition data processing method, system, device and storage medium
Also Published As
Publication number | Publication date
---|---
CN113257225B | 2021-11-02
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant