CN114299919A - Method and device for converting characters into voice, storage medium and computer equipment - Google Patents
- Publication number
- CN114299919A (application CN202111620527.0A)
- Authority
- CN
- China
- Prior art keywords
- recognition result
- information
- voice
- sender
- emotion
- Prior art date
- Legal status
- Pending
Abstract
The invention discloses a method, an apparatus, a storage medium and a computer device for converting text into voice. The method comprises: acquiring text information to be converted; performing multi-dimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the text information; and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result to convert the text information into voice information. The scheme converts text information sent during communication into voice information, helps the receiver understand what the sender intends to convey, and improves the user experience of communication.
Description
Technical Field
The invention relates to the technical field of information, in particular to a method and a device for converting characters into voice, a storage medium and computer equipment.
Background
In modern society, communication between people is increasingly frequent and close. The emergence of smartphones has made communication tools based on wireless communication widely available, and voice communication has become the most important mode of communication.
At present, when an information sender is in a noisy environment, voice input may be unclear, which affects communication quality, so the sender usually switches to text. However, text alone makes it difficult to accurately convey the sender's emotion or tone at that moment, and if the information receiver has a low literacy level, cannot read the text, or is visually impaired, text messages cause inconvenience to the receiver.
Disclosure of Invention
The invention provides a method, an apparatus, a storage medium and a computer device for converting text information into voice information during communication, so as to avoid causing inconvenience to the information receiver.
According to a first aspect of the present invention, there is provided a text-to-speech method, comprising:
acquiring character information to be converted;
carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information;
and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result, and converting the character information into voice information.
Optionally, the invoking a preset voice conversion model matched with the intention recognition result, the emotion recognition result, and the mood recognition result to convert the text information into voice information includes:
acquiring an initial voice conversion model and a plurality of groups of corresponding model parameters thereof;
determining target model parameters which correspond to the intention recognition result, the emotion recognition result and the tone recognition result together from the multiple groups of model parameters, wherein each group of emotion recognition result corresponds to one group of model parameters, and the group of emotion recognition results comprise the intention recognition result, the emotion recognition result and the tone recognition result;
adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result;
and calling the matched preset voice conversion model to convert the text information into voice information.
Optionally, the acquiring text information to be converted includes:
receiving character information input by a sender;
the calling of a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result converts the text information into voice information, and the method comprises the following steps:
if the text information contains special characters, verifying the identity of a sender, calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the tone recognition result and the identity information of the sender, and converting the text information input by the sender into voice information.
Optionally, the receiving text information input by the sender includes:
detecting the sound decibel of the current environment of the sender;
if the sound decibel is larger than the preset sound decibel, providing a character input interface, and receiving character information input by the sender based on the character input interface; or
Acquiring a historical conversation record between the sender and the receiver;
determining an input mode selected by the sender when the sender last talks with the receiver based on the historical conversation record;
and if the input mode selected when the sender has a conversation with the receiver last time is a character mode, outputting a character input interface and receiving character information input by the sender based on the character input interface.
Optionally, before the receiving the text message input by the sender, the method further includes:
collecting voice information of a plurality of groups of scene sentences read by the sender, wherein the corresponding intentions, emotions or tone of the scene sentences in different groups are different;
determining character information respectively corresponding to a plurality of groups of voice information read by the sender, and training an initial voice conversion model bound with identity information of the sender and a plurality of groups of model parameters corresponding to the initial voice conversion model based on the plurality of groups of voice information and the character information respectively corresponding to the voice information.
Optionally, after training the initial voice conversion model bound to the identity information of the sender and the plurality of sets of model parameters corresponding to the initial voice conversion model based on the plurality of sets of voice information and the text information corresponding to the plurality of sets of voice information, the method further includes:
acquiring real-time voice information of the sender through a preset device;
optimizing multiple groups of model parameters of an initial voice conversion model bound with the identity information of the sender based on the collected real-time voice information; and/or
Acquiring specific characters input by the sender and voice information input by the sender aiming at the specific characters;
and optimizing multiple groups of model parameters of the initial voice conversion model bound with the identity information of the sender based on the specific characters and the corresponding voice information thereof.
Optionally, after the acquiring, by a predetermined device, the real-time voice information of the sender, the method further includes:
performing character conversion on the real-time voice information to obtain real-time character information corresponding to the real-time voice information;
respectively identifying intention, emotion and tone of voice of the real-time character information to obtain an intention identification result, an emotion identification result and a tone identification result corresponding to the real-time character information;
and if determining that the emotion of the sender is abnormal according to the intention recognition result corresponding to the real-time character information, or determining that the emotion recognition result corresponding to the real-time character information is not matched with the tone recognition result, interrupting the acquisition of the real-time voice information of the sender.
Optionally, after training the initial voice conversion model bound to the identity information of the sender and the plurality of sets of model parameters corresponding to the initial voice conversion model based on the plurality of sets of voice information and the text information corresponding to the plurality of sets of voice information, the method further includes:
acquiring a trial reading sentence input by the sender;
converting the trial reading sentences into voice information and playing the voice information to the sender by utilizing a plurality of groups of model parameters of a trained initial voice conversion model bound with the identity information of the sender, and outputting a selection correction interface corresponding to the trial reading sentences;
acquiring revised voice corresponding to a target character in a trial reading statement selected by the sender based on the selection correction interface;
optimizing sets of model parameters of an initial speech conversion model bound to identity information of the sender based on the revised speech.
Optionally, after receiving the text message input by the sender, the method further includes:
when the text information is dialect, acquiring the output voice type selected by the sender;
if the output voice type is dialect voice, calling a preset dialect voice conversion model bound with the identity information of the sender, and converting the dialect input by the sender into dialect voice information;
if the output voice type is standard voice, carrying out mandarin conversion on the dialect by utilizing a preset dialect lexicon to obtain standard character information corresponding to the dialect;
carrying out multi-dimensional emotion recognition on the standard character information to obtain an emotion recognition result of the standard character information under multiple dimensions;
and calling a preset voice conversion model matched with the multi-dimensional emotion recognition result and the identity information of the sender, and converting the standard text information into standard voice information.
Optionally, after the calling a preset voice conversion model matching the intention recognition result, the emotion recognition result, and the mood recognition result to convert the text information into voice information, the method further includes:
and converting the audio sound wave in the voice information into bone conduction sound wave in response to the received sound wave conversion instruction.
Optionally, after the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result, and the identity information of the sender, and converting text information input by the sender into voice information, the method further includes:
responding to a received background sound adding instruction, outputting and displaying a background sound list, acquiring a selection instruction for selecting a target background sound from the background sound list, and adding the target background sound to the voice information by overlapping a sound wave in the target background sound with a sound wave in the voice information; or
Acquiring the current conversation record of the sender and the receiver, and determining the current scene of the sender according to the conversation record; and determining a target background sound matched with the current scene of the sender, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information.
Optionally, after the calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the mood recognition result, and the identity information of the sender, and converting text information input by the sender into voice information, the method further includes:
responding to an encrypted voice adding instruction triggered by the sender, and acquiring encrypted voice of the sender;
adjusting the frequency of the sound waves in the encrypted voice to a specific frequency which cannot be recognized by human ears;
and adding the adjusted encrypted voice to the voice information by overlapping the sound wave in the adjusted encrypted voice with the sound wave in the voice information.
Optionally, the method further comprises:
collecting operation data of the sender aiming at the communication device in the communication process;
matching the operation data in the communication process with historical operation data;
and if the operation data in the communication process does not match the historical operation data, verifying the identity information of the sender by starting the camera device.
According to a second aspect of the present invention, there is provided a text-to-speech apparatus, comprising:
the acquisition unit is used for acquiring character information to be converted;
the identification unit is used for carrying out multi-dimensional emotion identification on the character information to obtain an intention identification result, an emotion identification result and a tone identification result corresponding to the character information;
and the conversion unit is used for calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result and converting the text information into voice information.
Optionally, the conversion unit includes: a first obtaining module, a first determining module, an adding module and a converting module,
the first obtaining module is used for obtaining an initial voice conversion model and a plurality of groups of corresponding model parameters;
the first determining module is configured to determine target model parameters corresponding to the intention recognition result, the emotion recognition result and the mood recognition result from the multiple sets of model parameters, where each set of emotion recognition result corresponds to one set of model parameters, and the set of emotion recognition results includes the intention recognition result, the emotion recognition result and the mood recognition result;
the adding module is used for adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result;
and the conversion module is used for calling the matched preset voice conversion model to convert the text information into voice information.
Optionally, the obtaining unit is specifically configured to receive text information input by a sender;
the conversion unit is specifically configured to verify an identity of a sender if the text information includes a special character, and call a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the tone recognition result, and the identity information of the sender to convert the text information input by the sender into voice information.
Optionally, the obtaining unit includes: a detection module, a receiving module, a second obtaining module and a second determining module,
the detection module is used for detecting the sound decibel of the current environment of the sender;
the receiving module is used for providing a text input interface if the sound decibel is greater than a preset sound decibel, and receiving text information input by the sender based on the text input interface;
the second obtaining module is used for obtaining a historical conversation record between the sender and the receiver;
the second determining module is configured to determine, based on the historical conversation record, an input mode selected when the sender has last interacted with the receiver;
the receiving module is further configured to output a text input interface if the input mode selected by the sender when the sender has last interacted with the receiver is a text mode, and receive text information input by the sender based on the text input interface.
Optionally, the apparatus further comprises: a collection unit and a training unit, wherein,
the acquisition unit is used for acquiring the voice information of a plurality of groups of scene sentences read by the sender, wherein the corresponding intentions, moods or moods of the scene sentences in different groups are different;
the training unit is used for determining character information corresponding to a plurality of groups of voice information read by the sender respectively, and training an initial voice conversion model bound with identity information of the sender and a plurality of groups of model parameters corresponding to the initial voice conversion model based on the plurality of groups of voice information and the character information corresponding to the voice information respectively.
Optionally, the apparatus further comprises: an optimization unit for optimizing the operation of the system,
the acquisition unit is also used for acquiring the real-time voice information of the sender through a preset device;
the optimization unit is used for optimizing multiple groups of model parameters of an initial voice conversion model bound with the identity information of the sender based on the collected real-time voice information;
the acquiring unit is used for acquiring the specific characters input by the sender and the voice information input by the sender aiming at the specific characters;
the optimization unit is further configured to optimize, based on the specific text and the corresponding speech information thereof, a plurality of sets of model parameters of the initial speech conversion model bound to the identity information of the sender.
Optionally, the apparatus further comprises: the unit of interruption is used to interrupt the signal,
the conversion unit is further configured to perform text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information;
the identification unit is further used for identifying the intention, the emotion and the tone of voice of the real-time character information respectively to obtain an intention identification result, an emotion identification result and a tone of voice identification result corresponding to the real-time character information;
and the interruption unit is used for interrupting the acquisition of the real-time voice information of the sender if determining that the emotion of the sender is abnormal according to the intention recognition result corresponding to the real-time text information or determining that the emotion recognition result corresponding to the real-time text information is not matched with the tone recognition result.
Optionally, the obtaining unit is further configured to obtain a trial reading statement input by the sender;
the conversion unit is further configured to convert the trial reading sentences into voice information and play the voice information to the sender by using the trained multiple sets of model parameters of the initial voice conversion model bound with the identity information of the sender, and output a selection correction interface corresponding to the trial reading sentences;
the obtaining unit is further configured to obtain a revised voice corresponding to a target character in a trial reading sentence selected by the sender based on the selection correction interface;
the optimization unit is further configured to optimize, based on the revised voice, a plurality of sets of model parameters of the initial voice conversion model bound to the identity information of the sender.
Optionally, the obtaining unit is further configured to obtain, when the text information is a dialect, an output voice type selected by the sender;
the conversion unit is further configured to invoke a preset dialect voice conversion model bound with the identity information of the sender if the output voice type is dialect voice, and convert the dialect input by the sender into dialect voice information;
the conversion unit is further configured to, if the output voice type is a standard voice, perform mandarin conversion on the dialect by using a preset dialect lexicon to obtain standard text information corresponding to the dialect;
the identification unit is also used for carrying out multi-dimensional emotion identification on the standard character information to obtain an emotion identification result of the standard character information under multiple dimensions;
the conversion unit is further used for calling a preset voice conversion model matched with the multi-dimensional emotion recognition result and the identity information of the sender, and converting the standard text information into standard voice information.
Optionally, the converting unit is further configured to convert the audio sound wave in the voice message into a bone conduction sound wave in response to the received sound wave conversion instruction.
Optionally, the apparatus further comprises: a superimposing unit for superimposing the first and second images,
the superposition unit is used for responding to the received background sound adding instruction, outputting and displaying a background sound list, acquiring a selection instruction for selecting a target background sound from the background sound list, and adding the target background sound to the voice information by superposing a sound wave in the target background sound and a sound wave in the voice information; or obtaining the current conversation record of the sender and the receiver, and determining the current scene of the sender according to the conversation record; and determining a target background sound matched with the current scene of the sender, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information.
Optionally, the obtaining unit is further configured to obtain, in response to an encrypted voice adding instruction triggered by the sender, an encrypted voice of the sender;
the superposition unit is also used for adjusting the frequency of the sound waves in the encrypted voice to a specific frequency which cannot be identified by human ears; and adding the adjusted encrypted voice to the voice information by overlapping the sound wave in the adjusted encrypted voice with the sound wave in the voice information.
Optionally, the apparatus further comprises: a matching unit for matching the received signal with the received signal,
the acquisition unit is also used for acquiring the operation data of the sender aiming at the communication device in the communication process;
the matching unit is used for matching the operation data in the communication process with historical operation data; and if the operation data in the communication process does not match the historical operation data, verifying the identity information of the sender by starting the camera device.
According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described text-to-speech method.
According to a fourth aspect of the present invention, there is provided a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-described text-to-speech method when executing the program.
Compared with the current mode of communicating by text alone, the method, the apparatus, the storage medium and the computer device for converting text into voice provided by the invention can acquire the text information to be converted; perform multi-dimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the text information; and call a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result to convert the text information into voice information. The receiver can thus perceive the sender's current emotion, tone and intention through the converted voice information, and receivers with a low literacy level or a visual impairment can obtain the communication content more conveniently.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a schematic flow chart illustrating a text-to-speech method according to an embodiment of the present invention;
FIG. 2 is a flow chart of another text-to-speech method according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram illustrating a text-to-speech apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another text-to-speech apparatus according to an embodiment of the present invention;
fig. 5 shows a physical structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
At present, text-based communication cannot convey the emotion or tone of the information sender at that moment; in addition, if the information receiver has a low literacy level, cannot read the text, or has a visual impairment, this mode causes inconvenience to the information receiver.
In order to solve the above problem, an embodiment of the present invention provides a method for converting text to speech, as shown in fig. 1, where the method includes:
101. and acquiring the text information to be converted.
The character information to be converted is the character information which is sent to the receiver by the sender through the communication software. The embodiment of the invention is mainly suitable for a scene of converting the text information into the voice information in the communication process. The execution subject of the embodiment of the present invention is a device or an apparatus capable of performing voice conversion on text information, and may specifically be a client or a server.
In a specific application scenario, a client of a sender usually has two input modes, one is a text input mode, and the other is a voice input mode, when the sender inputs text information at the client and selects voice conversion, the client of the sender acquires the text information input by the sender and uses the text information as text information to be converted, and then the client of the sender directly converts the text information into voice information and sends the voice information to the client of a receiver. In addition, the sender can also directly send the input text information to the receiver, after the client of the receiver receives the text information, the receiver can select to perform voice conversion in the client, and at the moment, the client of the receiver can acquire the text information received by the receiver and convert the text information into voice information to be played to the receiver.
The client side of the sender can also send the text information input by the sender to the server, the server can send an information prompt to the receiver after receiving the text information sent by the sender, meanwhile, the text information sent by the sender is directly converted into voice information on the side of the server, the voice information and the text information are correspondingly stored, when the receiver sees the prompt information and knows that the communication information of the sender exists, an information acquisition request is sent to the server, and the server sends the text information or the voice information to the client side of the receiver based on the information receiving mode selected by the receiver.
Therefore, the obtaining process and the voice conversion process of the text information to be converted may be executed in the client or the server, which is not specifically limited in the embodiment of the present invention.
102. And carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information.
The multi-dimensional emotion recognition comprises intention recognition, emotion recognition and tone recognition. The intention recognition result includes, for example, asking for help, daily chatting, being asked for help by others, and the like; the tone recognition result includes statement, question, imperative, pleading, exclamation, hypothesis, emphasis, rhetorical question, euphemistic transition, and the like; and the emotion recognition result includes gratitude, joy, affection, complaint, anger, disgust, fear, and the like.
For the embodiment of the invention, in order to enable the receiving party to feel the tone, emotion and intention of the sending party at the moment, in the process of voice conversion, intention recognition, emotion recognition and tone recognition are required to be carried out on the character information. As an alternative embodiment, the method comprises, for the specific process of intent identification: determining semantic information vectors of each participle corresponding to the character information; and inputting the semantic information vector corresponding to each participle into a preset intention recognition model for intention recognition to obtain an intention recognition result corresponding to the character information. Further, the determining semantic information vectors of the respective participles corresponding to the text information includes: determining a query vector, a key vector and a value vector corresponding to any participle in each participle; multiplying the query vector corresponding to the arbitrary participle with the key vector corresponding to each participle to obtain the attention score of each participle for the arbitrary participle; and multiplying and summing the attention scores and the value vectors corresponding to the participles to obtain a semantic information vector corresponding to any participle. The preset intention recognition model may be a multilayer perceptron.
Specifically, word segmentation processing may be performed on the text information to obtain each word segmentation corresponding to the text information, then an embedded vector corresponding to each word segmentation is determined in a word2vec manner, and the embedded vector corresponding to each word segmentation is input into an attention layer of an encoder to perform feature extraction, during the processing of the attention layer, different linear transformations are performed on the embedded vector to obtain a query vector, a key vector, and a value vector corresponding to each word segmentation, then a semantic information vector corresponding to each word segmentation is determined according to the query vector, the key vector, and the value vector corresponding to each word segmentation, further, after the semantic information vector corresponding to each word segmentation is determined, the semantic information vector corresponding to each word segmentation is input into a multilayer perceptron to perform intent recognition, and the process of intent recognition is actually a classification process by using the multilayer perceptron, and finally, the multilayer perceptron outputs probability values of the text information belonging to different intentions, and the intention corresponding to the maximum probability value is determined as the target intention corresponding to the text information.
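To make the attention computation above concrete, the following is a minimal NumPy sketch of single-head scaled dot-product attention over the word-segment embeddings, followed by a small multilayer perceptron that outputs intent probabilities; the projection matrices, layer sizes and intent labels are illustrative assumptions rather than parameters disclosed in this application.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_vectors(embeddings, W_q, W_k, W_v):
    """Single-head attention over word-segment embeddings (shape: n_tokens x d).
    Each token's query is scored against every key; the scores weight the
    value vectors, giving one semantic information vector per token."""
    d = embeddings.shape[1]
    Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
    scores = Q @ K.T / np.sqrt(d)             # attention score of every token for every token
    return softmax(scores, axis=1) @ V        # semantic information vector per token

def intent_probabilities(sem_vectors, W_hidden, W_out):
    """A two-layer perceptron over the pooled semantic vectors; returns one
    probability per candidate intention class."""
    pooled = sem_vectors.mean(axis=0)
    hidden = np.maximum(0.0, pooled @ W_hidden)   # ReLU hidden layer
    return softmax(hidden @ W_out)

# Illustrative usage with random weights and three hypothetical intent labels.
rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 32))                            # 6 word segments, 32-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(32, 32)) for _ in range(3))
probs = intent_probabilities(semantic_vectors(emb, W_q, W_k, W_v),
                             rng.normal(size=(32, 16)), rng.normal(size=(16, 3)))
intent = ["asking_for_help", "daily_chat", "being_asked"][int(np.argmax(probs))]
```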
Further, when tone recognition is performed on the text information, the corresponding tone can be determined from the mood words, punctuation and input speed contained in the text information. For example, if the text information contains an exclamation mark, the tone recognition result is exclamation; if the text information contains an emphatic mood word such as "be sure to", the tone recognition result is emphasis; and if the input speed of the text information is slow, the tone recognition result is statement.
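A minimal rule-based sketch of the tone recognition just described, keyed on punctuation, mood words and input speed; the word list, the speed threshold and the default label are assumptions used only for illustration.

```python
def recognize_tone(text, chars_per_second):
    """Return a coarse tone label from punctuation, mood words and typing speed."""
    emphasis_words = ("be sure to", "must", "remember to")   # illustrative mood words
    if "!" in text or "！" in text:
        return "exclamation"
    if any(word in text for word in emphasis_words):
        return "emphasis"
    if "?" in text or "？" in text:
        return "question"
    if chars_per_second < 1.0:                 # slow, deliberate input
        return "statement"
    return "neutral"                           # fallback when no rule fires (assumption)
```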
Further, in the process of performing emotion recognition on the text information, feature vectors of the text information are extracted mainly by using a multi-head attention layer and a feedforward neural network layer of an encoder, and then the extracted feature vectors are input to a softmax layer for emotion recognition to obtain emotion recognition results corresponding to the text information, and based on the emotion recognition results, the method comprises the following steps: performing word segmentation processing on the character information to obtain each word segmentation corresponding to the character information; inputting the embedded vector corresponding to any word segmentation in each word segmentation into different attention subspaces in an attention layer of an encoder for feature extraction to obtain a first feature vector of any word segmentation in the different attention subspaces; multiplying and summing the first feature vector of any participle under the different attention subspaces and weights corresponding to the different attention subspaces to obtain an attention layer output vector corresponding to any participle; adding the attention layer output vector and the first feature vector to obtain a second feature vector corresponding to any participle; inputting the second feature vector into a feedforward neural network layer of an encoder to perform feature extraction, and obtaining a third feature vector corresponding to any participle; and inputting the third feature vector corresponding to each word segmentation into a softmax layer for emotion recognition to obtain an emotion recognition result corresponding to the character information.
Further, the inputting the embedded vector corresponding to any one of the participles into different attention subspaces in an attention layer of an encoder for feature extraction to obtain a first feature vector of the any one participle in the different attention subspaces includes: determining a query vector, a key vector and a value vector of any participle under the different attention subspaces according to the embedded vector corresponding to the any participle; multiplying the query vector of any participle under the different attention subspaces by the key vector of each participle under the different attention subspaces to obtain the attention score of each participle under the different attention subspaces for the any participle; and multiplying and summing the attention scores of the participles under different attention subspaces and the key vectors to obtain a first feature vector corresponding to any participle.
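The multi-head extraction and classification steps above can be sketched as follows: each attention subspace (head) produces first feature vectors, the per-head outputs are combined with learned subspace weights, a residual-style addition and a stand-in feed-forward layer yield the third feature vectors, and a softmax layer scores the emotion classes. Weights, dimensions and the pooling step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def emotion_features(embeddings, heads, head_weights):
    """heads: list of (W_q, W_k, W_v) projections, one per attention subspace.
    Returns the per-token feature vectors after the weighted multi-head sum,
    a residual addition and a ReLU stand-in for the feed-forward layer."""
    d = embeddings.shape[1]
    per_head = []
    for W_q, W_k, W_v in heads:
        Q, K, V = embeddings @ W_q, embeddings @ W_k, embeddings @ W_v
        attn = softmax(Q @ K.T / np.sqrt(d), axis=1)
        per_head.append(attn @ V)                                    # first feature vectors
    attn_out = sum(w * h for w, h in zip(head_weights, per_head))    # weighted subspace sum
    second = attn_out + embeddings                                   # residual-style addition
    return np.maximum(0.0, second)                                   # feed-forward stand-in

def emotion_probabilities(features, W_cls):
    """Pool the token features and apply the softmax classification layer."""
    return softmax(features.mean(axis=0) @ W_cls)
```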
Therefore, the intention recognition result, the emotion recognition result and the tone recognition result corresponding to the character information are obtained according to the mode. It should be noted that the multi-dimensional emotion recognition in the embodiment of the present invention is not limited to intent recognition, emotion recognition, and mood recognition, and may also include emotion recognition in other dimensions.
103. And calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result, and converting the character information into voice information.
In order to enable the receiving party to feel the current emotion, intention and tone of the sending party, the embodiment of the present invention needs to call a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result of the text information to perform voice conversion, so as to ensure that the converted voice information contains the current emotion, intention and tone of the sending party, and based on this, step 103 specifically includes: acquiring an initial voice conversion model and a plurality of groups of corresponding model parameters thereof; determining target model parameters which correspond to the intention recognition result, the emotion recognition result and the tone recognition result from the multiple groups of model parameters; adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result; and calling the matched preset voice conversion model to convert the text information into voice information. Each group of emotion recognition results corresponds to a group of model parameters, and the group of emotion recognition results comprise intention recognition results, emotion recognition results and tone recognition results.
For example, suppose the initial voice conversion model contains no model parameters, and the intention recognition result corresponding to the text information is asking for help, the tone recognition result is pleading, and the emotion recognition result is gratitude. The target model parameters corresponding to the group of recognition results "asking for help", "pleading" and "gratitude" are determined from the plurality of sets of model parameters, the target model parameters are added to the initial voice conversion model to obtain the preset voice conversion model, and the text information is converted into voice information by using the preset voice conversion model. From the converted voice information, the receiver can perceive that the sender's intention is asking for help, the tone is pleading, and the emotion is gratitude.
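A sketch of the parameter selection described above: one stored parameter set per group of recognition results, looked up by the (intention, tone, emotion) triple and loaded into the initial model. The dictionary keys, file names and the load_parameters method are hypothetical.

```python
# Hypothetical store: one parameter set per group of recognition results.
MODEL_PARAMETERS = {
    ("asking_for_help", "pleading", "gratitude"): "params_group_a.bin",
    ("daily_chat", "statement", "joy"): "params_group_b.bin",
    # ... one entry per (intention, tone, emotion) group
}

def build_preset_model(initial_model, intention, tone, emotion):
    """Select the target model parameters for the recognised group and load them
    into the initial voice conversion model, giving the matched preset model."""
    target = MODEL_PARAMETERS[(intention, tone, emotion)]
    initial_model.load_parameters(target)    # hypothetical method of the model object
    return initial_model
```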
Further, in the embodiment of the present invention, the emotion recognition results, intention recognition results and tone recognition results listed in step 102 may be refined by degree. For example, the emotion "gratitude" may be refined into "very thankful", "fairly thankful" and "mildly thankful" according to the interval in which the classifier's output value falls: the text is treated as expressing gratitude when the output value is greater than 0.5; when the output value is between 0.5 and 0.6, the refined emotion recognition result is "mildly thankful"; when the output value is between 0.6 and 0.8, it is "fairly thankful"; and when the output value is 0.8 or more, it is "very thankful". Emotion recognition results of different degrees also correspond to different model parameters; for example, "asking for help", "pleading" and "mildly thankful" correspond to the group A model parameters, while "asking for help", "pleading" and "very thankful" correspond to the group B model parameters. Refining the emotion recognition result to a finer granularity in this way improves the precision with which the target model parameters are determined.
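The degree-based refinement can be expressed as a simple threshold mapping on the classifier's output value; the labels mirror the example above and the interval boundaries are the ones stated there.

```python
def refine_gratitude(output_value):
    """Map the output value for the 'gratitude' class to a finer-grained label."""
    if output_value >= 0.8:
        return "very thankful"
    if output_value > 0.6:
        return "fairly thankful"
    if output_value > 0.5:
        return "mildly thankful"
    return None    # at or below 0.5 the text is not treated as expressing gratitude
```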
After the target model parameters are determined and the preset voice conversion model is obtained, the text information is input into the preset voice conversion model for voice conversion to obtain the voice information. The preset voice conversion model may specifically be a Tacotron model, which mainly consists of an encoder, an attention-based decoder and a post-processing network: the encoder extracts the feature vectors corresponding to the text information, the decoder converts the feature vectors into speech spectrum data, and the post-processing network converts the spectrum data into a waveform, so that the voice information can be output.
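The conversion itself then reduces to the three-stage Tacotron-style flow just described. The sketch below only shows the data flow; encoder, decoder and postnet are placeholder attributes of the loaded preset model, not a real Tacotron API.

```python
def text_to_waveform(text, preset_model):
    """Encoder -> attention decoder -> post-processing network, as described above."""
    features = preset_model.encoder(text)            # text -> feature vectors
    spectrogram = preset_model.decoder(features)     # feature vectors -> speech spectrum data
    waveform = preset_model.postnet(spectrogram)     # spectrum data -> audio waveform
    return waveform
```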
In a specific application scenario, in order to facilitate a receiver with hearing impairment to acquire communication content, the audio sound wave in the voice message can be converted into bone conduction sound wave, and the receiver with hearing impairment can capture the bone conduction sound wave by means of specific hardware equipment, so that corresponding communication content can be acquired. The audio sound wave is a sound wave which can be normally received by human ears.
Compared with the existing mode of communicating by text alone, the text-to-speech method provided by the embodiment of the invention can acquire the text information to be converted; perform multi-dimensional emotion recognition on the text information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the text information; and call a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result to convert the text information into voice information. The receiver can thus perceive the sender's current emotion, tone and intention through the converted voice information, and receivers with a low literacy level or a visual impairment can obtain the communication content more conveniently.
Further, in order to better explain the text-to-speech process, as a refinement and an extension to the foregoing embodiment, an embodiment of the present invention provides another text-to-speech method, as shown in fig. 2, where the method includes:
201. receiving the character information input by the sender.
For the embodiment of the invention, when a sender sends information to a receiver, a client of the sender can detect the sound decibel of the current environment, and when the sound decibel exceeds a certain value, the client can automatically switch to a text input interface to prompt the sender to input text information, and based on the detection, the method comprises the following steps: detecting the sound decibel of the current environment of the sender; if the sound decibel is larger than the preset sound decibel, providing a character input interface, and receiving character information input by the sender based on the character input interface; and if the sound decibel is less than or equal to the preset sound decibel, outputting a voice input interface, and receiving voice information input by the sender based on the voice input interface. The preset sound decibel can be set according to actual service requirements.
For example, the preset sound decibel is 70 decibels, the client of the sender can detect the sound decibel of the current environment by means of a sensor, and if the sound decibel of the environment where the sender is currently located is detected to be 90, the client can output a text input interface and acquire text information input by the sender through the text input interface because the sound decibel is greater than 70 decibels; if the sound decibel of the current environment of the sender is detected to be 50 decibels, the client side can output a voice input interface and receive voice information input by the sender based on the voice input interface because the sound decibel is less than 70 decibels and cannot cause interference to the voice input of the sender in the current environment.
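The input-mode switch can be sketched as a single threshold comparison; the 70 dB value follows the example above and would in practice be configurable.

```python
PRESET_DECIBEL = 70    # threshold from the example above

def choose_input_interface(ambient_decibel):
    """Return which input interface the client should present."""
    if ambient_decibel > PRESET_DECIBEL:
        return "text_input_interface"     # environment too noisy for reliable voice input
    return "voice_input_interface"
```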
In a specific application scenario, the client may also recommend a corresponding input mode for the sender according to a historical conversation record between the sender and a specific receiver, based on which the method includes: acquiring a historical conversation record between the sender and the receiver; determining an input mode selected by the sender when the sender last talks with the receiver based on the historical conversation record; if the input mode selected by the sender when the sender has a conversation with the receiver last time is a character mode, outputting a character input interface and receiving character information input by the sender based on the character input interface; and if the input model selected when the sender has a conversation with the receiver last time is a voice mode, outputting a voice input interface and receiving voice information input by the sender based on the voice input interface.
Further, in order to ensure the safety of information sending, when the sender inputs text information at the client, operation data of the sender on the communication device during the communication process is collected and matched against historical operation data; if the operation data during the communication process does not match the historical operation data, the identity information of the sender is verified by starting the camera device. The historical operation data includes the sender's text input habits, such as frequently mistyped characters or frequently used phrases, as well as the position in which the sender habitually holds the mobile phone.
Specifically, when a sender inputs text information at a client, the client acquires operation data of the sender for a communication device in the communication process, wherein the operation data comprises a mobile phone holding position, input wrong text, a used word group and the like, if the mobile phone holding position acquired by the client at this time is different from a habitual mobile phone holding position of the sender, or the input wrong text acquired at this time is different from the habitual input wrong text of the sender, the identity of the sender needs to be verified, for example, a current picture of a mobile phone holder is acquired through a camera device, the picture is compared with a picture of a client logger, if the pictures are consistent, the identity of the sender passes verification, and the text information is converted into voice information and sent to a receiver; if the pictures are not consistent, the identity authentication of the sender is not passed, and the text information is intercepted, namely, the text information is not subjected to voice conversion.
Further, when the identity authentication of the sender is not passed, the text message can be intercepted, and the text message can be normally subjected to voice conversion, before the voice message is sent to the receiver, a voice prompt can be sent to the receiver, and the specific content can be that the message sender is not a mobile phone holder, so that the receiver receives the prompt message before receiving the voice message, and the condition that the receiver is cheated can be avoided. It should be noted that the identity authentication of the sender may be performed before the voice message is sent, or may be performed after the voice message is sent, for example, after the voice message is sent to the receiver, the identity authentication of the sender is performed, and if the sender does not pass the identity authentication, a voice prompt message is sent to the receiver.
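A sketch of the verification flow described above, with all helpers (photo capture, face comparison) passed in and therefore clearly marked as assumptions; the operation-data fields are illustrative.

```python
def verify_sender(current_ops, historical_ops, capture_photo, registered_photo, faces_match):
    """Compare current operation data with the sender's habits; on a mismatch,
    fall back to camera-based identity verification."""
    mismatch = (current_ops.get("grip_position") != historical_ops.get("grip_position")
                or current_ops.get("typo_pattern") != historical_ops.get("typo_pattern"))
    if not mismatch:
        return "verified"
    photo = capture_photo()                        # start the camera device
    if faces_match(photo, registered_photo):
        return "verified"
    return "intercept_or_warn"                     # block conversion or warn the receiver first
```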
202. And carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information.
For the embodiment of the present invention, the specific processes of performing intent recognition, emotion recognition and tone recognition on the text information are substantially similar to those in step 102, and are not described herein again.
203. If the text information contains special characters, verifying the identity of a sender, calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the tone recognition result and the identity information of the sender, and converting the text information input by the sender into voice information.
The special characters include money, password authentication, account password and other characters related to financial transactions. For the embodiment of the invention, once the text information input by the sender contains special characters related to a financial transaction, the identity of the sender needs to be verified in order to prevent the receiver from being deceived and suffering a loss, and a preset voice model matched with the emotion recognition results and the identity information of the sender is called to convert the text information into voice information. The preset voice model matched with the identity information of the sender is the sender's own acoustic model; that is, after the receiver receives the voice information, the identity of the sender can be recognised from the voice, which ensures the safety of the communication information and prevents the receiver's interests from being harmed.
Specifically, an initial voice conversion model matched with identity information of a sender and a plurality of groups of model parameters corresponding to the initial voice conversion model are determined, then target model parameters matched with an intention recognition result, an emotion recognition result and a tone recognition result are determined from the plurality of groups of model parameters, the target model parameters are added into the matched initial voice conversion model to obtain a preset voice conversion model matched with both an emotion recognition result and the identity information of the sender, finally, the preset voice conversion model is utilized to convert text information into voice information, the voice information played by a client of a receiver is native voice of the sender, and therefore the receiver can determine the identity of the sender.
For the embodiment of the present invention, in order to more truly reflect the scene of the sender when inputting characters, a corresponding background sound may be added to the converted voice information, as an optional implementation manner for adding the background sound, the method includes: responding to the received background sound adding instruction, outputting and displaying a background sound list, acquiring a selection instruction for selecting a target background sound from the background sound list, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information. For example, the sender selects background music of christmas music from the background music list, the client superimposes the sound wave in the background music of christmas music with the sound wave in the voice message, and sends the superimposed voice message to the receiver, and the receiver can hear the background music of christmas music in the process of listening to the voice message.
Further, as another optional implementation of adding the background sound, the method further includes: acquiring the current conversation record of the sender and the receiver, and determining the current scene of the sender according to the conversation record; and determining a target background sound matched with the current scene of the sender, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information. For example, the client determines that the sender is currently at the sea by acquiring the conversation record of the sender and the receiver, so that the sea wave is added to the voice message as background sound, and the receiver can hear the sound of the sea wave in the process of hearing the voice message, thereby knowing that the sender is currently at the sea.
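The superposition of background sound and voice can be sketched as sample-wise addition of the two waveforms; both signals are assumed to share one sample rate, and the gain factor is an illustrative choice to keep the speech intelligible.

```python
import numpy as np

def add_background_sound(voice_wave, background_wave, background_gain=0.3):
    """Superimpose the background sound wave on the voice sound wave."""
    bg = np.resize(background_wave, len(voice_wave))    # loop or trim background to voice length
    mixed = voice_wave + background_gain * bg
    return np.clip(mixed, -1.0, 1.0)                    # keep samples within full scale
```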
In a specific application scenario, the sender may further embed the encrypted voice in the voice message to prevent the voice message from being utilized maliciously, for example, in the process of sending the identity document information to the receiver, the sender may embed the encrypted voice in the voice message, where "the identity information is only used for handling the credit card service", so that the voice becomes an evidence, which is convenient for the sender to prove and prevents the benefit from being damaged, based on which, the method includes: responding to an encrypted voice adding instruction triggered by the sender, and acquiring encrypted voice of the sender; adjusting the frequency of the sound waves in the encrypted voice to a specific frequency which cannot be recognized by human ears; and adding the adjusted encrypted voice to the voice information by overlapping the sound wave in the adjusted encrypted voice with the sound wave in the voice information.
Specifically, since the encrypted voice could interfere with the receiver listening to the normal voice information, the frequency of the sound wave in the encrypted voice is adjusted to a specific frequency that cannot be recognized by human ears before the adjusted encrypted voice is added to the voice information. In this way, the receiver cannot hear the adjusted encrypted voice while listening to the voice information and hears only the normal voice information, so the receiver is not affected.
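One possible way to approximate "a specific frequency that cannot be recognized by human ears" is to modulate the encrypted voice onto a near-ultrasonic carrier before superimposing it, as in the following sketch; the carrier frequency, the gain and the assumption of a sufficiently high sample rate (for example 48 kHz or more) are illustrative choices rather than part of the disclosure.

```python
import numpy as np


def embed_encrypted_voice(voice: np.ndarray, encrypted: np.ndarray,
                          sample_rate: int = 48000,
                          carrier_hz: float = 20000.0,
                          gain: float = 0.05) -> np.ndarray:
    """Shift the encrypted voice toward an inaudible band and superimpose it.

    Assumptions: mono float signals in [-1, 1], equal sample rates, and a
    sample rate high enough to carry the modulated band. Recovery would
    demodulate with the same carrier frequency.
    """
    encrypted = encrypted[:len(voice)]
    t = np.arange(len(encrypted)) / sample_rate
    # Amplitude-modulate the encrypted voice onto a near-ultrasonic carrier.
    shifted = encrypted * np.cos(2 * np.pi * carrier_hz * t)
    mixed = voice.copy()
    mixed[:len(shifted)] += gain * shifted
    return np.clip(mixed, -1.0, 1.0)
```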
For the embodiment of the present invention, before performing voice conversion with the initial voice conversion model bound to the identity information of the sender and its corresponding groups of model parameters, the initial voice conversion model and the groups of model parameters need to be trained. Based on this, the method includes: collecting voice information of a plurality of groups of scene sentences read aloud by the sender, wherein different groups of scene sentences correspond to different intentions, emotions or tones; determining the text information corresponding to each group of voice information read by the sender; and training, based on the groups of voice information and their corresponding text information, the initial voice conversion model bound to the identity information of the sender and the plurality of groups of model parameters corresponding to the initial voice conversion model. Each combination of intention, emotion and tone corresponds to one group of model parameters, so the plurality of groups of model parameters of the initial voice conversion model bound to the identity information of the sender can be obtained by training on the voice information of the plurality of groups of scene sentences read by the sender.
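As an illustration of how the collected scene sentences might be organized for training, the following sketch groups samples by their (intention, emotion, tone) combination so that each combination yields one group of model parameters; the tuple layout is an assumption, and the actual training procedure depends on the model and is therefore omitted.

```python
from collections import defaultdict


def group_training_samples(samples):
    """Group (audio, text, intention, emotion, tone) samples so that each
    (intention, emotion, tone) combination yields one training set, which in
    turn produces one group of model parameters."""
    groups = defaultdict(list)
    for audio, text, intention, emotion, tone in samples:
        groups[(intention, emotion, tone)].append((text, audio))
    return groups


# Each value in the returned mapping would then be used to train one group of
# parameters of the initial voice conversion model bound to the sender.
```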
Further, in the process of collecting the voice information, the reliability of the collected voice information needs to be verified, and the collection is interrupted immediately if a problem is found. Based on this, the method includes: performing text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information; performing intention, emotion and tone recognition on the real-time text information respectively to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the real-time text information; and interrupting the collection of the real-time voice information of the sender if it is determined from the intention recognition result that the emotion of the sender is abnormal, or if the emotion recognition result corresponding to the real-time text information does not match the tone recognition result.
For example, if the intention recognition result is extortion, the emotion of the sender is considered abnormal and the collection of the voice information is interrupted. For another example, if the emotion recognition result is anger while the tone recognition result is declarative, the emotion recognition result and the tone recognition result do not match, and the collection of the voice information is likewise interrupted.
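The interruption logic described above may, for example, be expressed as follows; the recognizer callables, the list of abnormal intentions and the emotion/tone compatibility table are assumptions made for illustration.

```python
def should_interrupt_collection(realtime_text,
                                recognize_intention,
                                recognize_emotion,
                                recognize_tone,
                                abnormal_intentions=("extortion", "threat"),
                                compatible=None):
    """Return True if collection of the sender's real-time voice should stop.

    The three recognizers are assumed to map text to a label; the disclosure
    only states that an abnormal intention, or a mismatch between the emotion
    and tone recognition results, interrupts the collection.
    """
    if compatible is None:
        # Assumed examples of emotion/tone pairs considered consistent.
        compatible = {("anger", "exclamatory"), ("calm", "declarative"),
                      ("joy", "exclamatory"), ("calm", "interrogative")}
    intention = recognize_intention(realtime_text)
    emotion = recognize_emotion(realtime_text)
    tone = recognize_tone(realtime_text)
    if intention in abnormal_intentions:
        return True
    return (emotion, tone) not in compatible
```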
Further, after the initial voice conversion model bound to the identity information of the sender and its plurality of groups of model parameters have been trained, the voice and intonation produced by the voice conversion model can still be refined, that is, the model parameters can be optimized. As an optional implementation of optimizing the model parameters, the method includes: collecting real-time voice information of the sender through a preset device; and optimizing, based on the collected real-time voice information, the plurality of groups of model parameters of the initial voice conversion model bound to the identity information of the sender.
Specifically, the real-time voice information of the sender is collected and converted into text information, and the real-time voice information together with its corresponding text information is used as a training sample set to optimize the plurality of groups of model parameters of the initial voice conversion model, so that the voice and intonation of the converted voice information come closer to the actual voice and intonation of the sender.
Further, as another optional implementation of optimizing the model parameters, the method includes: acquiring specific characters input by the sender and voice information recorded by the sender for the specific characters; and optimizing, based on the specific characters and the corresponding voice information, the plurality of groups of model parameters of the initial voice conversion model bound to the identity information of the sender. Specifically, the sender can input uncommon characters, or characters with particular pronunciations, at the client and read them aloud; the client collects the voice information of the specific characters read by the sender and uses it to optimize the plurality of groups of model parameters of the initial voice conversion model, thereby improving the voice conversion quality of the model.
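The two parameter-optimization implementations above (real-time voice collection and specific-character recordings) could both feed a generic fine-tuning step such as the following sketch; the fine_tune interface on the model object is an assumption, not part of the disclosure.

```python
def optimize_parameter_groups(model, new_samples, epochs=3):
    """Optimize the parameter groups of an initial voice conversion model with
    newly collected samples.

    `model` is assumed to expose fine_tune(group_key, pairs, epochs); each
    sample is (text, audio, intention, emotion, tone). The same routine covers
    real-time-collection samples and specific-character recordings.
    """
    by_group = {}
    for text, audio, intention, emotion, tone in new_samples:
        by_group.setdefault((intention, emotion, tone), []).append((text, audio))
    for group_key, pairs in by_group.items():
        model.fine_tune(group_key, pairs, epochs=epochs)
```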
In a specific application scenario, the trained initial voice conversion model and its plurality of groups of model parameters can be used to convert a trial-reading sentence input by the sender into corresponding voice information for playback. The sender can then revise the voice information, and the plurality of groups of model parameters of the initial voice conversion model are optimized based on the revised voice. Based on this, the method includes: acquiring a trial-reading sentence input by the sender; converting the trial-reading sentence into voice information and playing it to the sender using the plurality of groups of model parameters of the trained initial voice conversion model bound to the identity information of the sender, and outputting a selection-correction interface corresponding to the trial-reading sentence; acquiring revised voice corresponding to a target character in the trial-reading sentence selected by the sender on the selection-correction interface; and optimizing, based on the revised voice, the plurality of groups of model parameters of the initial voice conversion model bound to the identity information of the sender. The target character can be any one, two or more characters in the trial-reading sentence.
For example, if the trial-reading sentence is "we go to the seaside today" and, after hearing the corresponding voice information, the sender is not satisfied with the pronunciation of "seaside", the sender can select the target characters "seaside" on the selection-correction interface corresponding to the trial-reading sentence and correct the pronunciation, obtaining the revised voice for the trial-reading sentence. The plurality of groups of model parameters of the initial voice conversion model are then optimized based on the revised voice, so that the voice produced with the optimized model parameters is closer to the real voice of the sender.
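For illustration, the data produced by the selection-correction interface might be represented as follows; the field names and the character-index representation of the selected target characters are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    """One correction made on the selection-correction interface: the target
    characters selected inside the trial-reading sentence and the sender's
    revised speech for exactly those characters."""
    sentence: str          # e.g. "we go to the seaside today"
    target_start: int      # index of the first selected character
    target_end: int        # index one past the last selected character
    revised_audio: bytes   # recording of the corrected pronunciation


def collect_revision_samples(revisions):
    """Turn revisions into (text, audio) pairs usable for parameter optimization."""
    return [(r.sentence[r.target_start:r.target_end], r.revised_audio)
            for r in revisions]
```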
In a specific application scenario, a preset dialect voice conversion model bound to the identity information of the sender can also be constructed in advance, and when the text information input by the sender is dialect, the preset dialect voice conversion model can be called to convert the text information into dialect voice information. Based on this, the method includes: when the text information is dialect, acquiring the output voice type selected by the sender; if the output voice type is dialect voice, calling the preset dialect voice conversion model bound to the identity information of the sender to convert the dialect input by the sender into dialect voice information; if the output voice type is standard voice, converting the dialect into Mandarin using a preset dialect lexicon to obtain standard text information corresponding to the dialect, performing multi-dimensional emotion recognition on the standard text information to obtain emotion recognition results of the standard text information in multiple dimensions, and calling a preset voice conversion model matched with the multi-dimensional emotion recognition results and the identity information of the sender to convert the standard text information into standard voice information. The preset dialect lexicon contains each dialect phrase and the standard phrase corresponding to it.
Specifically, when the text information input by the sender is dialect, the sender can select the output voice type. When the output voice type is dialect voice, the preset dialect voice conversion model bound to the identity information of the sender is called to convert the dialect into dialect voice information; this model is trained on collected dialect voice information read aloud by the sender and the corresponding dialect text information. When the output voice type is standard voice, the input dialect is first converted into standard text information using the preset dialect lexicon, and the matched preset voice conversion model is then called to perform voice conversion and obtain standard voice information.
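The Mandarin conversion using the preset dialect lexicon may, for example, be implemented as a greedy longest-match phrase replacement; the matching strategy and the example lexicon entries below are assumptions, since the disclosure only states that the lexicon contains each dialect phrase and its standard counterpart.

```python
def dialect_to_standard(text: str, dialect_lexicon: dict) -> str:
    """Convert dialect text into standard text with a preset dialect lexicon
    that maps dialect phrases to standard phrases (greedy longest match)."""
    result = []
    i = 0
    max_len = max((len(k) for k in dialect_lexicon), default=1)
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            phrase = text[i:i + length]
            if phrase in dialect_lexicon:
                result.append(dialect_lexicon[phrase])
                i += length
                break
        else:
            # No dialect phrase starts here; keep the character unchanged.
            result.append(text[i])
            i += 1
    return "".join(result)


# Usage sketch with illustrative lexicon entries:
# dialect_to_standard("佢哋今日去海边", {"佢哋": "他们", "今日": "今天"})
```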
Compared with the current practice of communicating through text alone, another method for converting text into voice provided by the embodiment of the present invention can acquire the text information to be converted; perform multi-dimensional emotion recognition on the text information to obtain the intention recognition result, the emotion recognition result and the tone recognition result corresponding to the text information; and call a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result to convert the text information into voice information. The receiver can thereby perceive the current emotion, tone and intention of the sender through the converted voice information, and receivers with limited literacy or visual impairment can obtain the communication content more conveniently.
Further, as a specific implementation of fig. 1, an embodiment of the present invention provides a text-to-speech apparatus, as shown in fig. 3, the apparatus includes: an acquisition unit 31, a recognition unit 32 and a conversion unit 33.
The obtaining unit 31 may be configured to obtain text information to be converted.
The identification unit 32 may be configured to perform multidimensional emotion identification on the text information to obtain an intention identification result, an emotion identification result, and a mood identification result corresponding to the text information.
The conversion unit 33 may be configured to invoke a preset voice conversion model matched with the intention recognition result, the emotion recognition result, and the mood recognition result, and convert the text information into voice information.
In a specific application scenario, as shown in fig. 4, the converting unit 33 includes: a first obtaining module 331, a first determining module 332, an adding module 333 and a converting module 334.
The first obtaining module 331 may be configured to obtain an initial speech conversion model and a plurality of sets of model parameters corresponding to the initial speech conversion model.
The first determining module 332 may be configured to determine target model parameters corresponding to the intention recognition result, the emotion recognition result, and the mood recognition result from the multiple sets of model parameters, where each set of emotion recognition result corresponds to one set of model parameters, and the set of emotion recognition results includes the intention recognition result, the emotion recognition result, and the mood recognition result.
The adding module 333 may be configured to add the target model parameter to the initial voice conversion model to obtain a preset voice conversion model matching the intention recognition result, the emotion recognition result, and the mood recognition result.
The conversion module 334 may be configured to invoke the matched preset voice conversion model to convert the text information into voice information.
In a specific application scenario, the obtaining unit 31 may be specifically configured to receive text information input by a sender.
The conversion unit 33 may be specifically configured to verify the identity of the sender if the text information includes a special character, and call a preset voice conversion model that matches the intention recognition result, the emotion recognition result, the mood recognition result, and the identity information of the sender, so as to convert the text information input by the sender into voice information.
In a specific application scenario, as shown in fig. 4, the obtaining unit 31 includes: a detection module 311, a receiving module 312, a second obtaining module 313 and a second determining module 314.
The detecting module 311 may be configured to detect a sound decibel of an environment where the sender is currently located.
The receiving module 312 is configured to provide a text input interface if the sound decibel is greater than a preset sound decibel, and receive text information input by the sender based on the text input interface.
The second obtaining module 313 may be configured to obtain a history dialog record between the sender and the receiver.
The second determining module 314 may be configured to determine, based on the historical conversation record, an input mode selected when the sender last conversed with the receiver.
The receiving module 312 may be further configured to output a text input interface if the input mode selected when the sender has last interacted with the receiver is a text mode, and receive text information input by the sender based on the text input interface.
In a specific application scenario, the apparatus further includes: an acquisition unit 34 and a training unit 35.
The collecting unit 34 may be configured to collect voice information of a plurality of sets of scene sentences read by the sender, where corresponding intentions, emotions, or moods of different sets of scene sentences are different.
The training unit 35 may be configured to determine text information corresponding to each of the multiple sets of voice information read by the sender, and train an initial voice conversion model bound to the identity information of the sender and multiple sets of model parameters corresponding to the initial voice conversion model based on the multiple sets of voice information and the text information corresponding to the multiple sets of voice information.
In a specific application scenario, the apparatus further includes: an optimization unit 36.
The collecting unit 34 may be further configured to collect real-time voice information of the sender through a predetermined device.
The optimizing unit 36 may be configured to optimize, based on the collected real-time speech information, multiple sets of model parameters of the initial speech conversion model bound to the identity information of the sender.
The obtaining unit 31 may be further configured to obtain a specific character input by the sender and a voice message recorded by the sender for the specific character.
The optimizing unit 36 may be further configured to optimize, based on the specific text and the corresponding speech information, a plurality of sets of model parameters of the initial speech conversion model bound to the identity information of the sender.
In a specific application scenario, the apparatus further includes: an interrupt unit 37.
The conversion unit 33 may be further configured to perform text conversion on the real-time voice information to obtain real-time text information corresponding to the real-time voice information.
The identification unit 32 may be further configured to perform intention, emotion and tone identification on the real-time text information, respectively, so as to obtain an intention identification result, an emotion identification result, and a tone identification result corresponding to the real-time text information.
The interrupting unit 37 may be configured to interrupt the collection of the real-time voice information of the sender if it is determined that the emotion of the sender is abnormal according to the intention recognition result corresponding to the real-time text information, or it is determined that the emotion recognition result corresponding to the real-time text information is not matched with the mood recognition result.
In a specific application scenario, the obtaining unit 31 may be further configured to obtain a trial reading statement input by the sender.
The conversion unit 33 may be further configured to convert the trial reading sentence into voice information and play the voice information to the sender by using a plurality of sets of model parameters of the trained initial voice conversion model bound to the identity information of the sender, and output a selection correction interface corresponding to the trial reading sentence.
The obtaining unit 31 may be further configured to obtain a revised voice corresponding to a target character in a trial reading sentence selected by the sender based on the selection correction interface.
The optimizing unit 36 is further configured to optimize, based on the revised voice, a plurality of sets of model parameters of the initial voice conversion model bound to the identity information of the sender.
In a specific application scenario, the obtaining unit 31 may be further configured to obtain an output voice type selected by the sender when the text information is a dialect.
The converting unit 33 may be further configured to, if the output voice type is dialect voice, invoke a preset dialect voice conversion model bound with the identity information of the sender, and convert the dialect input by the sender into dialect voice information.
The conversion unit 33 may be further configured to, if the output voice type is a standard voice, perform mandarin conversion on the dialect by using a preset dialect lexicon to obtain standard text information corresponding to the dialect.
The identification unit 32 may be further configured to perform multidimensional emotion identification on the standard text information to obtain an emotion identification result of the standard text information in multiple dimensions.
The conversion unit 33 may be further configured to invoke a preset voice conversion model matched with the multi-dimensional emotion recognition result and the identity information of the sender, and convert the standard text information into standard voice information.
In a specific application scenario, the converting unit 33 may be further configured to convert the audio sound wave in the voice message into a bone conduction sound wave in response to the received sound wave conversion instruction.
In a specific application scenario, the apparatus further includes: a superimposing unit 38.
The superimposing unit 38 may be configured to output and display a background sound list in response to the received background sound adding instruction, acquire a selection instruction for selecting a target background sound from the background sound list, and add the target background sound to the voice information by superimposing a sound wave in the target background sound and a sound wave in the voice information; or obtaining the current conversation record of the sender and the receiver, and determining the current scene of the sender according to the conversation record; and determining a target background sound matched with the current scene of the sender, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information.
In a specific application scenario, the obtaining unit 31 may be further configured to obtain the encrypted voice of the sender in response to an encrypted voice adding instruction triggered by the sender.
The superposition unit 38 may be further configured to adjust the frequency of the sound wave in the encrypted speech to a specific frequency that cannot be recognized by human ears; and adding the adjusted encrypted voice to the voice information by overlapping the sound wave in the adjusted encrypted voice with the sound wave in the voice information.
In a specific application scenario, the apparatus further includes: a matching unit 39.
The collecting unit 34 may be further configured to collect operation data of the sender for the communication device in the communication process.
The matching unit 39 may be configured to match the operation data in the communication process with historical operation data, and if the operation data in the communication process does not match the historical operation data, verify the identity information of the sender by activating the camera device.
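For illustration, the matching of current operation data against historical operation data might look like the following sketch; the concrete features (such as typing speed) and the relative tolerance are assumptions, and a failed match would trigger the camera-based identity verification described above.

```python
def operation_data_matches(current: dict, historical: dict,
                           tolerance: float = 0.3) -> bool:
    """Compare the sender's operation data during the current communication
    with historical operation data.

    If any shared feature deviates from its historical baseline by more than
    the relative tolerance, the match fails.
    """
    for key in current.keys() & historical.keys():
        baseline = historical[key]
        if baseline == 0:
            continue
        if abs(current[key] - baseline) / abs(baseline) > tolerance:
            return False
    return True


# if not operation_data_matches(current_ops, history_ops):
#     start_camera_verification()   # assumed trigger for identity verification
```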
It should be noted that other corresponding descriptions of the functional modules related to the text-to-speech device provided in the embodiment of the present invention may refer to the corresponding description of the method shown in fig. 1, and are not described herein again.
Based on the method shown in fig. 1, correspondingly, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the following steps: acquiring character information to be converted; carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information; and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result, and converting the character information into voice information.
Based on the above embodiments of the method shown in fig. 1 and the apparatus shown in fig. 3, an embodiment of the present invention further provides an entity structure diagram of a computer device, as shown in fig. 5, where the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, wherein the memory 42 and the processor 41 are both arranged on a bus 43 such that when the processor 41 executes the program, the following steps are performed: acquiring character information to be converted; carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information; and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result, and converting the character information into voice information.
Through the technical solution of the present invention, the text information to be converted can be acquired; multi-dimensional emotion recognition is performed on the text information to obtain the intention recognition result, the emotion recognition result and the tone recognition result corresponding to the text information; and a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result is called to convert the text information into voice information. The receiver can thereby perceive the current emotion, tone and intention of the sender through the converted voice information, and receivers with limited literacy or visual impairment can obtain the communication content more conveniently.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method for converting text to speech, comprising:
acquiring character information to be converted;
carrying out multi-dimensional emotion recognition on the character information to obtain an intention recognition result, an emotion recognition result and a tone recognition result corresponding to the character information;
and calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result, and converting the character information into voice information.
2. The method according to claim 1, wherein said invoking a preset voice conversion model matching the intention recognition result, the emotion recognition result, and the mood recognition result to convert the text information into voice information comprises:
acquiring an initial voice conversion model and a plurality of groups of corresponding model parameters thereof;
determining target model parameters which correspond to the intention recognition result, the emotion recognition result and the tone recognition result together from the multiple groups of model parameters, wherein each group of emotion recognition result corresponds to one group of model parameters, and the group of emotion recognition results comprise the intention recognition result, the emotion recognition result and the tone recognition result;
adding the target model parameters into the initial voice conversion model to obtain a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result;
and calling the matched preset voice conversion model to convert the text information into voice information.
3. The method of claim 1, wherein the obtaining the text information to be converted comprises:
receiving character information input by a sender;
the calling of a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result converts the text information into voice information, and the method comprises the following steps:
if the text information contains special characters, verifying the identity of a sender, calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result, the tone recognition result and the identity information of the sender, and converting the text information input by the sender into voice information.
4. The method according to claim 1, wherein after the calling a preset voice conversion model matching the intention recognition result, the emotion recognition result, and the mood recognition result to convert the text information into voice information, the method further comprises:
and converting the audio sound wave in the voice information into bone conduction sound wave in response to the received sound wave conversion instruction.
5. The method according to claim 3, wherein after the calling a preset voice conversion model matching with the intention recognition result, the emotion recognition result, the mood recognition result, and the identity information of the sender to convert text information input by the sender into voice information, the method further comprises:
responding to a received background sound adding instruction, outputting and displaying a background sound list, acquiring a selection instruction for selecting a target background sound from the background sound list, and adding the target background sound to the voice information by overlapping a sound wave in the target background sound with a sound wave in the voice information; or
Acquiring the current conversation record of the sender and the receiver, and determining the current scene of the sender according to the conversation record; and determining a target background sound matched with the current scene of the sender, and adding the target background sound to the voice information by overlapping the sound wave in the target background sound with the sound wave in the voice information.
6. The method according to claim 3, wherein after the calling a preset voice conversion model matching with the intention recognition result, the emotion recognition result, the mood recognition result, and the identity information of the sender to convert text information input by the sender into voice information, the method further comprises:
responding to an encrypted voice adding instruction triggered by the sender, and acquiring encrypted voice of the sender;
adjusting the frequency of the sound waves in the encrypted voice to a specific frequency which cannot be recognized by human ears;
and adding the adjusted encrypted voice to the voice information by overlapping the sound wave in the adjusted encrypted voice with the sound wave in the voice information.
7. The method of claim 3, further comprising:
collecting operation data of the sender aiming at the communication device in the communication process;
matching the operation data in the communication process with historical operation data;
and if the operation data in the communication process does not match the historical operation data, verifying the identity information of the sender by activating the camera device.
8. A text-to-speech apparatus, comprising:
the acquisition unit is used for acquiring character information to be converted;
the identification unit is used for carrying out multi-dimensional emotion identification on the character information to obtain an intention identification result, an emotion identification result and a tone identification result corresponding to the character information;
and the conversion unit is used for calling a preset voice conversion model matched with the intention recognition result, the emotion recognition result and the tone recognition result and converting the text information into voice information.
9. A computer arrangement comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 7 when executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.