CN113990286A - Speech synthesis method, apparatus, device and storage medium - Google Patents

Speech synthesis method, apparatus, device and storage medium

Info

Publication number
CN113990286A
CN113990286A
Authority
CN
China
Prior art keywords
text
synthesized
emotion
voice
emotion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111272328.5A
Other languages
Chinese (zh)
Inventor
王昕
杨大明
聂吉昌
田维政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN202111272328.5A priority Critical patent/CN113990286A/en
Publication of CN113990286A publication Critical patent/CN113990286A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Abstract

The invention relates to speech synthesis technology and discloses a speech synthesis method comprising the following steps: acquiring a text to be synthesized and converting the text to be synthesized into basic audio data; performing emotion recognition on the text to be synthesized with a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized; identifying the role to which the text to be synthesized belongs with a semantic analysis model; querying pronunciation parameters corresponding to the role and the emotion type from a voice block chain node; and inputting the pronunciation parameters and the basic audio data into an audio synthesizer for synthesis to obtain emotion audio data. The invention also relates to blockchain technology, in particular to constructing voice block chain nodes to store the pronunciation parameters. The invention further provides a speech synthesis apparatus, an electronic device and a storage medium. The invention can solve the problem that synthesized speech sounds mechanical and stiff.

Description

Speech synthesis method, apparatus, device and storage medium
Technical Field
The present invention relates to the field of speech synthesis technologies, and in particular, to a speech synthesis method and apparatus, an electronic device, and a computer-readable storage medium.
Background
The existing speech synthesis method queries a pronunciation database to confirm the pronunciation of each character or word and then splices all the pronunciations together to output the synthesized speech. The resulting speech, however, is mechanical, stiff and inflexible, which greatly affects users' impression of speech synthesis products. As a result, many everyday applications, such as film and television dubbing of novels and textbooks, still consume a great deal of manpower, material resources and time to obtain audio data with rich emotion.
Disclosure of Invention
The invention provides a speech synthesis method, a speech synthesis apparatus and a computer-readable storage medium, and mainly aims to solve the problem that synthesized speech sounds mechanical and stiff.
In order to achieve the above object, the present invention provides a speech synthesis method, including:
acquiring a pre-constructed text to be synthesized, and converting the text to be synthesized into basic audio data;
carrying out emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
identifying the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
inquiring pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
Optionally, the converting the text to be synthesized into basic audio data includes:
performing phonetic transcription on the text to be synthesized by using a strictly phonetic transcription method to obtain a phoneme sequence;
extracting the audio frequency segment of each syllable in the phoneme sequence from a pre-constructed basic pronunciation database;
and splicing the audio segments of all the syllables according to the sequence of the phoneme sequence to obtain basic audio data.
Optionally, the identifying, by using a pre-trained semantic analysis model, the role to which the text to be synthesized belongs includes:
when the text to be synthesized comprises the voice-over text and the dialogue text, recognizing the semantics of the voice-over text by using a pre-trained semantic analysis model, and determining the belonging scores of the voice-over text belonging to each role according to the semantics of the voice-over text;
judging whether the affiliated score larger than a preset qualified threshold exists in each affiliated score;
when the belonging score which is greater than or equal to the preset qualified threshold value exists, taking the role corresponding to the belonging score which is greater than or equal to the qualified threshold value as the belonging role of the text to be synthesized;
when the scores which are larger than or equal to the preset qualified threshold value do not exist, recognizing the conversation text by using the semantic analysis model to obtain a score queue of the conversation text belonging to each role;
and performing score superposition operation of corresponding roles on the scores of the voice-over text belonging to the roles and the scores in the score queue, and extracting the role corresponding to the highest score from a superposition result as the role of the text to be synthesized.
Optionally, the performing emotion recognition on the text to be synthesized by using the pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized includes:
quantizing the text to be synthesized to obtain a text quantization matrix;
performing feature extraction on the text quantization matrix by using the convolution layer of the emotion recognition model to obtain a feature matrix set;
reducing the dimension of the characteristic matrix set by using a pooling layer and a flatten layer of the emotion recognition model to obtain a characteristic sequence;
and importing the characteristic sequence into a decision tree classification network of the emotion recognition model for classification to obtain the emotion type of the text to be synthesized.
Optionally, the quantizing the text to be synthesized to obtain a text quantization matrix includes:
performing word segmentation operation on the text to be synthesized to obtain a word set;
quantizing the Word set by utilizing a Word2Vec model and a position code pre-configured in the emotion recognition model to obtain a Word vector set;
and splitting the word vector set according to a preset combination strategy, and splicing split results to obtain a text quantization matrix.
Optionally, before performing emotion recognition on the text to be synthesized by using the pre-trained emotion recognition model to obtain an emotion type of the text to be synthesized, the method further includes:
step I, connecting a Bert neural network with a decision tree classification network to obtain an emotion recognition model to be trained;
step II, performing emotion recognition on the pre-constructed sentence sample set by using the emotion recognition model to be trained to obtain a recognition result;
step III, calculating a loss value of the recognition result and a real quality inspection result corresponding to the statement sample set by using a preset loss function;
step IV, when the loss value is larger than a preset qualified parameter, updating the model parameter of the emotion recognition model to be trained according to an Adaboost algorithm, and returning to the step II;
and V, when the loss value is less than or equal to the qualified parameter, obtaining the emotion recognition model after training.
Optionally, before querying pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes, the method further includes:
acquiring a pre-constructed multidimensional pronunciation parameter set, and calculating the pronunciation parameter set by using a message digest algorithm to obtain a character tag;
and encrypting and uploading the pronunciation parameter set to a pre-constructed voice block chain node by utilizing the character tag.
In order to solve the above problem, the present invention also provides a speech synthesis apparatus, comprising:
the audio acquisition module is used for acquiring a pre-constructed text to be synthesized and converting the text to be synthesized into basic audio data;
the emotion recognition module is used for carrying out emotion recognition on the text to be synthesized by utilizing a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
the role recognition module is used for recognizing the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
the pronunciation parameter query module is used for querying pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and the audio style setting module is used for inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the speech synthesis method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium having at least one computer program stored therein, the at least one computer program being executed by a processor in an electronic device to implement the speech synthesis method described above.
According to the method and the device, the emotion type and the belonging role of the text to be synthesized are analyzed by the emotion recognition model and the semantic analysis model, which can accurately and efficiently locate the emotional scene of the text to be synthesized. The pronunciation parameters corresponding to the belonging role and the emotion type can then be conveniently queried from the pre-constructed voice block chain node, and by synthesizing these pronunciation parameters with the basic audio data, the basic audio data is tuned according to the pronunciation parameters, finally producing emotion audio data with emotional characteristics. Therefore, the speech synthesis method, speech synthesis apparatus, electronic device and computer-readable storage medium provided by the invention can solve the problem that the speech generated by speech synthesis is mechanical and stiff, and improve the flexibility of the synthesized speech.
Drawings
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart illustrating a step of a speech synthesis method according to an embodiment of the present invention;
FIG. 3 is a detailed flowchart illustrating a step of a speech synthesis method according to an embodiment of the present invention;
FIG. 4 is a functional block diagram of a speech synthesis apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device for implementing the speech synthesis method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a speech synthesis method. The execution subject of the speech synthesis method includes, but is not limited to, at least one of electronic devices such as a server and a terminal that can be configured to execute the method provided by the embodiments of the present application. In other words, the speech synthesis method may be performed by software or hardware installed in the terminal device or the server device, and the software may be a block chain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Fig. 1 is a schematic flow chart of a speech synthesis method according to an embodiment of the present invention.
In this embodiment, the speech synthesis method includes:
and S1, acquiring a pre-constructed text to be synthesized, and converting the text to be synthesized into basic audio data.
In the embodiment of the invention, the text to be synthesized can be one sentence or several sentences. Specifically, the text to be synthesized may be a news text to be dubbed, or the text content of a novel or storybook, where a news text generally requires a neutral emotion while novels and storybooks contain rich emotion types.
In addition, the text to be synthesized can be converted into the basic audio data by a text-to-speech technology, for example a text-to-speech engine such as the pronunciation engine of an online dictionary, which converts the text to be synthesized into basic audio data that contains only the pronunciation of each word in the text and carries no overall emotional characteristics.
Specifically, the basic audio data includes default voice characteristic parameters such as tone, speech rate, and pitch.
In detail, in the embodiment of the present invention, the obtaining a pre-constructed text to be synthesized and converting the text to be synthesized into basic audio data includes:
performing phonetic transcription on the text to be synthesized by using a strictly phonetic transcription method to obtain a phoneme sequence;
extracting the audio frequency segment of each syllable in the phoneme sequence from a pre-constructed basic pronunciation database;
and splicing the audio segments of all the syllables according to the sequence of the phoneme sequence to obtain basic audio data.
The strict (narrow) phonetic transcription method generally uses the International Phonetic Alphabet (IPA) as the notation tool to convert each character in the text to be synthesized into phonemes, for example converting characters such as 'ah', 'love' and 'substitute' into the phonemes [ā, a-i, d-a-i, ...]. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech and is analyzed according to the pronunciation actions within a syllable, so phonemes can be divided into single phonemes and compound phonemes, corresponding to syllables such as monophones, diphones and triphones.
In addition, the basic pronunciation database can be any authority-issued phonetic dictionary database.
Specifically, the audio segments of the individual syllables in the phoneme sequence may be joined using the diphone (from the center of one phoneme to the center of the next phoneme) as the connecting unit to obtain the basic audio data, thereby improving the continuity of the synthesized speech.
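As a non-limiting illustration of the splicing described above, the following Python sketch concatenates the stored audio segment of each syllable in the phoneme sequence; the dictionary name base_pronunciation_db and the in-memory waveform representation are assumptions made for the example, not details prescribed by this embodiment.

    import numpy as np

    def synthesize_base_audio(phoneme_sequence, base_pronunciation_db):
        # base_pronunciation_db: assumed dict mapping each syllable/phoneme
        # string to a 1-D numpy array of waveform samples.
        segments = []
        for syllable in phoneme_sequence:
            clip = base_pronunciation_db.get(syllable)
            if clip is None:
                raise KeyError(f"no base recording for syllable {syllable!r}")
            segments.append(clip)
        # Plain end-to-end splicing; a fuller implementation would join at
        # diphone boundaries (center of one phoneme to center of the next).
        return np.concatenate(segments)

    # Usage: base_audio = synthesize_base_audio(["a", "i", "d-a-i"], db)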
S2, carrying out emotion recognition on the text to be synthesized by using the pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized.
In the embodiment of the invention, the emotion recognition model is a neural network model constructed from a Bert neural network and a decision tree classification network, and is used to judge the emotion type of the text to be synthesized according to the context of each word in it, where the emotion type includes the emotion category and the emotion intensity level of the text to be synthesized.
In detail, as shown in fig. 2, in the embodiment of the present invention, the performing emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain an emotion type of the text to be synthesized includes:
and S21, quantizing the text to be synthesized to obtain a text quantization matrix.
Further, in the embodiment of the present invention, the quantizing the text to be synthesized to obtain a text quantization matrix includes:
performing word segmentation operation on the text to be synthesized to obtain a word set;
quantizing the Word set by utilizing a Word2Vec model and a position code pre-configured in the emotion recognition model to obtain a Word vector set;
and splitting the word vector set according to a preset combination strategy, and splicing split results to obtain a text quantization matrix.
The position code is a position vector generated according to the order and number of the word vectors in the text to be synthesized, such as [E0, E1, E2, ...]. It is mainly configured in the initial (embedding) layer of the Bert neural network of the emotion recognition model and ensures that the text to be synthesized goes through the subsequent text recognition processing in order and in full.
Specifically, in the embodiment of the present invention, windows of 2, 3 and 4 characters in the jieba tool are used to intercept the text to be synthesized sequentially from front to back to obtain a set of word blocks; each word block in the set is then checked for existence against an authoritative common-word database, and the word blocks that exist in the common-word database are output to obtain the word set.
In the embodiment of the invention, the Word2Vec model is a group of related models used to generate word vectors, and each word in the word set can be mapped to a vector to obtain an initial word vector set. However, the initial word vectors generated by the Word2Vec model are discrete; to strengthen the features used in the subsequent text processing, the embodiment of the invention handles each sentence individually, so no inter-sentence coding is considered and only the pre-constructed intra-sentence position codes [E0, E1, E2, ...] are applied to the initial word vector set to obtain the word vector set, which may be, for example, [E0+E_we, E1+E_now, ...].
In the embodiment of the present invention, if the word vector set is long, it is cut into segments of L bytes in length to form an L × N text quantization matrix (N being the number of columns), where the remaining positions in the Nth row may be padded with '0' until the matrix is complete.
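A minimal sketch of the quantization step described above, assuming jieba for word segmentation and a gensim Word2Vec model whose vector size equals dim; the simple additive position code and the row/column sizes are illustrative stand-ins for the [E0, E1, E2, ...] codes and the L × N layout, not the exact scheme of this embodiment.

    import jieba
    import numpy as np
    from gensim.models import Word2Vec

    def build_text_quantization_matrix(text, w2v: Word2Vec, dim=64, rows=8):
        words = jieba.lcut(text)                          # word segmentation
        vectors = []
        for pos, word in enumerate(words):
            word_vec = w2v.wv[word] if word in w2v.wv else np.zeros(dim, dtype=np.float32)
            pos_code = np.full(dim, pos, dtype=np.float32) / max(len(words), 1)
            vectors.append(word_vec + pos_code)           # E_pos + E_word
        flat = np.concatenate(vectors) if vectors else np.zeros(dim, dtype=np.float32)
        cols = int(np.ceil(flat.size / rows))
        padded = np.zeros(rows * cols, dtype=np.float32)  # pad the tail with zeros
        padded[:flat.size] = flat
        return padded.reshape(rows, cols)                 # the text quantization matrix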
And S22, performing feature extraction on the text quantization matrix by using the convolution layer of the emotion recognition model to obtain a feature matrix set.
In the embodiment of the invention, the convolutional layer is provided with a preset number M of convolution kernels, and each convolution kernel performs one sliding convolution over the text quantization matrix to obtain a feature matrix set containing M feature matrices;
and S23, reducing the dimension of the feature matrix set by using the pooling layer and the flatten layer of the emotion recognition model to obtain a feature sequence.
The embodiment of the invention uses a max-pooling operation to pool the feature matrix set, retaining the main features and discarding the secondary features to obtain a dimension-reduced feature matrix set, and then uses the flatten layer to flatten the dimension-reduced feature matrix set into a feature sequence containing the one-dimensional features;
and S24, importing the characteristic sequence into a decision tree classification network of the emotion recognition model for classification to obtain the emotion type of the text to be synthesized.
Specifically, the embodiment of the present invention performs a fully-connected operation on the feature sequence, identifies the sentence scene corresponding to each text feature, and performs a decision-tree classification on each sentence scene to obtain the emotion type of the text to be synthesized.
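The convolution, pooling, flattening and decision-tree classification of S22 to S24 could be prototyped as below; the patent does not fix a concrete framework, so PyTorch and scikit-learn are used here only as assumed stand-ins, with an arbitrary kernel count and four illustrative emotion classes.

    import torch
    import torch.nn as nn
    from sklearn.tree import DecisionTreeClassifier

    class EmotionFeatureExtractor(nn.Module):
        # Convolution layer with M kernels, max pooling, then flattening (S22/S23).
        def __init__(self, n_kernels=16):
            super().__init__()
            self.conv = nn.Conv2d(1, n_kernels, kernel_size=3, padding=1)
            self.pool = nn.MaxPool2d(2)      # keep main features, drop secondary ones
            self.flatten = nn.Flatten()

        def forward(self, x):                # x: (batch, 1, L, N) quantization matrices
            return self.flatten(self.pool(torch.relu(self.conv(x))))

    extractor = EmotionFeatureExtractor()
    features = extractor(torch.randn(32, 1, 8, 64)).detach().numpy()  # dummy batch
    labels = torch.randint(0, 4, (32,)).numpy()                       # 4 emotion types (illustrative)
    clf = DecisionTreeClassifier(max_depth=8).fit(features, labels)   # decision-tree classification (S24)
    predicted_emotion = clf.predict(features[:1])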
In addition, in this embodiment of the present invention, before the step of S2, the method further includes:
step I, connecting a Bert neural network with a decision tree classification network to obtain an emotion recognition model to be trained;
step II, performing emotion recognition on the pre-constructed sentence sample set by using the emotion recognition model to be trained to obtain a recognition result;
For example, a preset number of sentences are randomly extracted from a large number of novels and stored to obtain the sentence sample set;
step III, calculating a loss value of the recognition result and a real quality inspection result corresponding to the statement sample set by using a preset loss function;
The real quality inspection result is the set of result labels obtained when professionals evaluate the sentences randomly extracted from the large number of novels.
In the embodiment of the present invention, the sentence sample set is imported into the emotion recognition model to be trained for text recognition, where the process is similar to the process from S21 to S24, and is not described herein again;
In the embodiment of the invention, the loss value is:

Loss = α·Loss_identification + β·Loss_decision tree

where Loss_identification is the loss function of the Bert neural network, Loss_decision tree is the loss function corresponding to the decision tree classification network, and α and β are weight coefficients configured according to the training effect in the embodiment of the invention.
Step IV, when the loss value is larger than a preset qualified parameter, automatically updating the model parameter of the emotion recognition model to be trained according to an Adaboost algorithm, and returning to the step II;
Adaboost is an iterative algorithm that, through iterative training, continuously adjusts the weight of each decision tree in the emotion recognition model during classification, making the classification result more accurate. Specifically, in each round of training with the Adaboost algorithm, the embodiment of the present invention changes the weight coefficient of each classifier in the decision tree classification network: the weight of correctly classified sentence samples is reduced and the weight of misclassified sentence samples is increased, and the model parameters of the classification network are updated according to the loss function, so that fewer and fewer sentence samples are misclassified and the loss value becomes lower and lower.
And V, when the loss value is less than or equal to the qualified parameter, obtaining the emotion recognition model after training.
According to the embodiment of the invention, training is carried out according to the operation steps from the step I to the step V, and finally the emotion recognition model which is trained is obtained.
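Steps I to V amount to a loss-gated training loop with Adaboost-style sample re-weighting; a schematic version is sketched below under the assumption that the model exposes scikit-learn-style fit(x, y, sample_weight) and predict(x) methods and that samples and labels are numpy arrays. The weight factors 1.5 and 0.8 are arbitrary illustrative values, not taken from the patent.

    import numpy as np

    def adaboost_style_training(model, samples, labels, loss_fn,
                                qualified_loss=0.1, max_rounds=50):
        weights = np.full(len(samples), 1.0 / len(samples))
        for _ in range(max_rounds):
            model.fit(samples, labels, sample_weight=weights)  # step II: recognize the sample set
            preds = model.predict(samples)
            loss = loss_fn(preds, labels)        # step III: α·Loss_identification + β·Loss_decision tree
            if loss <= qualified_loss:           # step V: training finished
                return model
            wrong = preds != labels              # step IV: re-weight and return to step II
            weights[wrong] *= 1.5                # raise weight of misclassified sentence samples
            weights[~wrong] *= 0.8               # lower weight of correctly classified samples
            weights /= weights.sum()
        return model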
And S3, identifying the role of the text to be synthesized by utilizing the pre-trained semantic analysis model.
The role may be the male or female voice of a news announcer, or a character with a particular timbre in a novel or storybook, such as 'elderly male voice, middle-aged female voice, young girl ("loli") voice, ...'.
In detail, as shown in fig. 3, in another embodiment of the present invention, the identifying the role of the text to be synthesized by using the pre-trained semantic analysis model includes:
s31, when the text to be synthesized contains the voice-over text and the dialogue text, recognizing the semantics of the voice-over text by using a pre-trained semantic analysis model, and determining the belonging scores of the voice-over text belonging to each role according to the semantics of the voice-over text;
s32, judging whether the affiliated scores are larger than a preset qualified threshold value or not;
s33, when the score is larger than or equal to the preset qualified threshold value, taking the role corresponding to the score larger than or equal to the qualified threshold value as the role of the text to be synthesized;
s34, when the score which is larger than or equal to the preset qualified threshold value does not exist, recognizing the dialog text by using the semantic analysis model to obtain a score queue of each role to which the dialog text belongs;
and S35, performing score superposition operation of corresponding roles on the scores of the voice-over text belonging to the roles and the scores in the score queue, and extracting the role corresponding to the highest score from the superposition result as the role of the text to be synthesized.
Specifically, when analyzing the role to which the text to be synthesized belongs, the voice-over text and the dialogue text in the text to be synthesized may be extracted first. For example, given the text [......, the scout said: "Report! General, the enemy army has been found thirty miles ahead."], extracting "the scout" from the voice-over text (i.e., the text outside the quotation marks) yields the belonging scores of the roles of the text to be synthesized as [Shenjun: 0%, general: 0%, scout: 100%]. Since the scout's belonging score of 100% is greater than the preset qualified threshold of 80%, the role of the text to be synthesized is the "scout".
Further, when the text to be synthesized is [......, a person hurried in from outside the tent: "General, the enemy army has been found thirty miles ahead."], the voice-over text (i.e., "a person hurried in from outside the tent") is recognized first, and the belonging scores obtained may be [Shenjun: 30%, general: 0%, scout: 50%, ......]. The highest score, [scout: 50%], is less than the 80% qualified threshold, so the dialogue text ("General, the enemy army has been found thirty miles ahead") needs to be recognized as well, yielding a score queue such as [Shenjun: 10%, general: 20%, scout: 40%, ......]. The scores of the corresponding roles are then superimposed, and the role with the highest combined score, namely the "scout", is selected as the role to which the text to be synthesized belongs.
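The threshold-and-superposition logic of S31 to S35 can be expressed compactly; the sketch below assumes the semantic analysis model has already produced the two score dictionaries for the voice-over text and the dialogue text.

    def identify_role(voiceover_scores, dialogue_scores, qualified_threshold=0.8):
        # Use the voice-over scores alone if any role clears the threshold (S32/S33).
        best_role = max(voiceover_scores, key=voiceover_scores.get)
        if voiceover_scores[best_role] >= qualified_threshold:
            return best_role
        # Otherwise superimpose voice-over and dialogue scores per role (S34/S35).
        combined = {
            role: voiceover_scores.get(role, 0.0) + dialogue_scores.get(role, 0.0)
            for role in set(voiceover_scores) | set(dialogue_scores)
        }
        return max(combined, key=combined.get)

    # e.g. identify_role({"Shenjun": 0.3, "general": 0.0, "scout": 0.5},
    #                    {"Shenjun": 0.1, "general": 0.2, "scout": 0.4})  -> "scout"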
And S4, inquiring pronunciation parameters corresponding to the characters and the emotion types from the pre-constructed voice block chain nodes.
The voice block chain node is a database which extracts and stores the multidimensional pronunciation parameters of the statement sample through a block chain technology.
In the embodiment of the invention, once the role and the emotion type of the text to be synthesized have been identified, they can be used as keywords for querying the voice block chain node to obtain the multidimensional pronunciation parameters respectively corresponding to the role and the emotion type, such as the fundamental frequency and harmonic frequencies of the voice, the unvoiced probability (between 0 and 1), the voiced probability (between 0 and 1), the speech rate or the duration of the pronunciation state, the tone and the rhythm.
Specifically, during the query, the pronunciation parameters corresponding to the belonging role and the pronunciation parameters corresponding to the emotion type are both queried. For example, in the embodiment of the present invention, if the role of the sentence text is the scout and the emotion type is a high-level tension type covering urgency and nervousness, the timbre information of the scout role is first queried through the voice block chain node, and the high-level tension entry under the scout role is then queried to obtain pronunciation parameters including speech rate and pause-habit information.
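The two-stage query (timbre by role, then parameters by emotion type under that role) could be organized as below; the nested dictionary and every numeric value are invented placeholders used only to show the lookup shape, not data from the patent or from any real voice block chain node.

    pronunciation_params = {
        "scout": {
            "timbre": {"fundamental_hz": 180, "harmonics": [2, 3, 5]},         # placeholder values
            "high-level tension": {"speech_rate": 1.3, "pause_ms": 80,
                                   "voiced_prob": 0.9, "unvoiced_prob": 0.4},  # placeholder values
        },
    }

    def query_pronunciation_params(role, emotion_type, store=pronunciation_params):
        role_entry = store[role]                                      # first: timbre information of the role
        return {**role_entry["timbre"], **role_entry[emotion_type]}  # then: emotion-type entry under that role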
In an embodiment of the present invention, before querying pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes, the method further includes:
acquiring a pre-constructed multidimensional pronunciation parameter set, and calculating the pronunciation parameter set by using a message digest algorithm to obtain a character tag;
and encrypting and uploading the pronunciation parameter set to a pre-constructed voice block chain node by utilizing the character tag.
The message digest algorithm is an important branch of cryptographic algorithms; it implements functions such as data signing and data integrity verification by extracting fingerprint information from the data. The algorithm is irreversible, which prevents decoding by brute-force enumeration.
Specifically, in the embodiment of the present invention, dubbing actors for the different roles of a preset text (such as a novel or storybook) are engaged in advance to dub preset sentence samples under each preset emotion type entry to obtain sample audio; multidimensional pronunciation parameters such as timbre, tone and inter-word pause intervals are extracted from the sample audio, and the multidimensional pronunciation parameters are encrypted through the message digest algorithm and uploaded to the block chain node for storage. The embodiment of the invention uses blockchain technology to store the multidimensional pronunciation parameters of various types of voices, preventing important data from being lost or tampered with.
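A minimal sketch of the tagging-and-upload step, using SHA-256 from Python's hashlib as one possible message digest algorithm; chain_client and its put() method are hypothetical, and the encryption of the payload is omitted here.

    import hashlib
    import json

    def upload_pronunciation_params(param_set, chain_client):
        payload = json.dumps(param_set, sort_keys=True, ensure_ascii=False).encode("utf-8")
        tag = hashlib.sha256(payload).hexdigest()   # message-digest "character tag"
        chain_client.put(key=tag, value=payload)    # push to the voice block chain node (encryption omitted)
        return tag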
And S5, inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
In particular, the audio synthesizer may be a TrakAxPC audio processor.
For example, the basic audio data is imported into a pre-constructed TrakAxPC audio processor; after pronunciation parameters including speech rate and pause-habit information are obtained from the queried high-level tension entry under the scout role, the basic audio data is tuned according to these pronunciation parameters to obtain pronunciation audio of an adult-male scout character speaking at a fast rate.
Further, when the text to be synthesized includes both voice-over text and non-voice-over text, the style reconstruction during speech synthesis may be applied only to the dialogue text belonging to the individual roles contained in the text to be synthesized; that is, only the dialogue text is adjusted according to its belonging role and emotion type, while the voice-over part is left unadjusted (its tone and emotion are not changed).
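As a rough stand-in for the audio synthesizer step (the embodiment names the TrakAxPC audio processor, which is not scripted here), the sketch below applies queried speech-rate and pitch parameters to the basic audio data with librosa; the parameter names speech_rate and pitch_steps are assumptions made for the example.

    import librosa

    def tune_base_audio(base_audio, sr, params):
        # Stretch or compress time to match the queried speech rate.
        y = librosa.effects.time_stretch(base_audio, rate=params.get("speech_rate", 1.0))
        # Shift pitch to approximate the queried timbre/tone setting.
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=params.get("pitch_steps", 0))
        return y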
According to the method and the device, the emotion type and the belonging role of the text to be synthesized are analyzed by the emotion recognition model and the semantic analysis model, which can accurately and efficiently locate the emotional scene of the text to be synthesized. The pronunciation parameters corresponding to the belonging role and the emotion type can then be conveniently queried from the pre-constructed voice block chain node, and by synthesizing these pronunciation parameters with the basic audio data, the basic audio data is tuned according to the pronunciation parameters, finally producing emotion audio data with emotional characteristics. Therefore, the speech synthesis method, speech synthesis apparatus, electronic device and computer-readable storage medium provided by the invention can solve the problem that the speech generated by speech synthesis is mechanical and stiff, and improve the flexibility of the synthesized speech.
Fig. 4 is a functional block diagram of a speech synthesis apparatus according to an embodiment of the present invention.
The speech synthesis apparatus 100 of the present invention can be installed in an electronic device. According to the implemented functions, the speech synthesis apparatus 100 may include an audio acquisition module 101, an emotion recognition module 102, a character recognition module 103, a pronunciation parameter query module 104, and an audio style setting module 105. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the audio acquiring module 101 is configured to acquire a pre-constructed text to be synthesized, and convert the text to be synthesized into basic audio data;
the emotion recognition module 102 is configured to perform emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain an emotion type of the text to be synthesized;
the role recognition module 103 is configured to recognize a role of the text to be synthesized by using a pre-trained semantic analysis model;
the pronunciation parameter query module 104 is configured to query pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and the audio style setting module 105 is configured to input the pronunciation parameters and the basic speech data to an audio synthesizer for synthesis, so as to obtain emotion audio data.
In detail, in the embodiment of the present invention, when the modules in the speech synthesis apparatus 100 are used, the same technical means as the speech synthesis method described in fig. 1 to fig. 3 are adopted, and the same technical effects can be produced, which is not described herein again.
Fig. 5 is a schematic structural diagram of an electronic device for implementing a speech synthesis method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a speech synthesis program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., executing a voice synthesis program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a speech synthesis program, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 5 only shows an electronic device with components, and it will be understood by a person skilled in the art that the structure shown in fig. 5 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than shown, or a combination of certain components, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The speech synthesis program stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed in the processor 10, enable:
acquiring a pre-constructed text to be synthesized, and converting the text to be synthesized into basic audio data;
carrying out emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
identifying the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
inquiring pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring a pre-constructed text to be synthesized, and converting the text to be synthesized into basic audio data;
carrying out emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
identifying the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
inquiring pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A method of speech synthesis, the method comprising:
acquiring a pre-constructed text to be synthesized, and converting the text to be synthesized into basic audio data;
carrying out emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
identifying the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
inquiring pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
2. The speech synthesis method of claim 1, wherein the converting the text to be synthesized into base audio data comprises:
performing phonetic transcription on the text to be synthesized by using a strictly phonetic transcription method to obtain a phoneme sequence;
extracting the audio frequency segment of each syllable in the phoneme sequence from a pre-constructed basic pronunciation database;
and splicing the audio segments of all the syllables according to the sequence of the phoneme sequence to obtain basic audio data.
3. The speech synthesis method of claim 1, wherein the identifying the belonging role of the text to be synthesized by using the pre-trained semantic analysis model comprises:
when the text to be synthesized comprises the voice-over text and the dialogue text, recognizing the semantics of the voice-over text by using a pre-trained semantic analysis model, and determining the belonging scores of the voice-over text belonging to each role according to the semantics of the voice-over text;
judging whether the affiliated score larger than a preset qualified threshold exists in each affiliated score;
when the belonging score which is greater than or equal to the preset qualified threshold value exists, taking the role corresponding to the belonging score which is greater than or equal to the qualified threshold value as the belonging role of the text to be synthesized;
when the scores which are larger than or equal to the preset qualified threshold value do not exist, recognizing the conversation text by using the semantic analysis model to obtain a score queue of the conversation text belonging to each role;
and performing score superposition operation of corresponding roles on the scores of the voice-over text belonging to the roles and the scores in the score queue, and extracting the role corresponding to the highest score from a superposition result as the role of the text to be synthesized.
4. The method of claim 1, wherein the obtaining the emotion type of the text to be synthesized by performing emotion recognition on the text to be synthesized by using a pre-trained emotion recognition model comprises:
quantizing the text to be synthesized to obtain a text quantization matrix;
performing feature extraction on the text quantization matrix by using the convolution layer of the emotion recognition model to obtain a feature matrix set;
reducing the dimension of the characteristic matrix set by using a pooling layer and a flatten layer of the emotion recognition model to obtain a characteristic sequence;
and importing the characteristic sequence into a decision tree classification network of the emotion recognition model for classification to obtain the emotion type of the text to be synthesized.
5. The speech synthesis method of claim 4, wherein the quantizing the text to be synthesized to obtain a text quantization matrix, comprises:
performing word segmentation operation on the text to be synthesized to obtain a word set;
quantizing the Word set by utilizing a Word2Vec model and a position code pre-configured in the emotion recognition model to obtain a Word vector set;
and splitting the word vector set according to a preset combination strategy, and splicing split results to obtain a text quantization matrix.
6. The speech synthesis method of claim 1, wherein before performing emotion recognition on the text to be synthesized by using the pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized, the method further comprises:
step I, connecting a Bert neural network with a decision tree classification network to obtain an emotion recognition model to be trained;
step II, performing emotion recognition on the pre-constructed sentence sample set by using the emotion recognition model to be trained to obtain a recognition result;
step III, calculating a loss value of the recognition result and a real quality inspection result corresponding to the statement sample set by using a preset loss function;
step IV, when the loss value is larger than a preset qualified parameter, updating the model parameter of the emotion recognition model to be trained according to an Adaboost algorithm, and returning to the step II;
and V, when the loss value is less than or equal to the qualified parameter, obtaining the emotion recognition model after training.
7. The method of speech synthesis according to claim 1, wherein before querying pronunciation parameters corresponding to the associated character and emotion type from a pre-constructed speech blockchain node, the method further comprises:
acquiring a pre-constructed multidimensional pronunciation parameter set, and calculating the pronunciation parameter set by using a message digest algorithm to obtain a character tag;
and encrypting and uploading the pronunciation parameter set to a pre-constructed voice block chain node by utilizing the character tag.
8. A speech synthesis apparatus, characterized in that the apparatus comprises:
the audio acquisition module is used for acquiring a pre-constructed text to be synthesized and converting the text to be synthesized into basic audio data;
the emotion recognition module is used for carrying out emotion recognition on the text to be synthesized by utilizing a pre-trained emotion recognition model to obtain the emotion type of the text to be synthesized;
the role recognition module is used for recognizing the role of the text to be synthesized by utilizing a pre-trained semantic analysis model;
the pronunciation parameter query module is used for querying pronunciation parameters corresponding to the roles and the emotion types from pre-constructed voice block chain nodes;
and the audio style setting module is used for inputting the pronunciation parameters and the basic voice data into an audio synthesizer for synthesis to obtain emotion audio data.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the speech synthesis method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a speech synthesis method according to any one of claims 1 to 7.
CN202111272328.5A 2021-10-29 2021-10-29 Speech synthesis method, apparatus, device and storage medium Pending CN113990286A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111272328.5A CN113990286A (en) 2021-10-29 2021-10-29 Speech synthesis method, apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111272328.5A CN113990286A (en) 2021-10-29 2021-10-29 Speech synthesis method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
CN113990286A true CN113990286A (en) 2022-01-28

Family

ID=79744405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111272328.5A Pending CN113990286A (en) 2021-10-29 2021-10-29 Speech synthesis method, apparatus, device and storage medium

Country Status (1)

Country Link
CN (1) CN113990286A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10475438B1 (en) * 2017-03-02 2019-11-12 Amazon Technologies, Inc. Contextual text-to-speech processing
CN108615524A (en) * 2018-05-14 2018-10-02 平安科技(深圳)有限公司 A kind of phoneme synthesizing method, system and terminal device
US20200169591A1 (en) * 2019-02-01 2020-05-28 Ben Avi Ingel Systems and methods for artificial dubbing
CN112908292A (en) * 2019-11-19 2021-06-04 北京字节跳动网络技术有限公司 Text voice synthesis method and device, electronic equipment and storage medium
CN111274807A (en) * 2020-02-03 2020-06-12 华为技术有限公司 Text information processing method and device, computer equipment and readable storage medium
CN111667811A (en) * 2020-06-15 2020-09-15 北京百度网讯科技有限公司 Speech synthesis method, apparatus, device and medium
CN112086086A (en) * 2020-10-22 2020-12-15 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN112527994A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Emotion analysis method, emotion analysis device, emotion analysis equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination