CN112908292B - Text voice synthesis method and device, electronic equipment and storage medium - Google Patents

Text voice synthesis method and device, electronic equipment and storage medium

Info

Publication number
CN112908292B
Authority
CN
China
Prior art keywords
text
role
style
voice
dialog
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911134833.6A
Other languages
Chinese (zh)
Other versions
CN112908292A (en
Inventor
潘俊杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN201911134833.6A priority Critical patent/CN112908292B/en
Publication of CN112908292A publication Critical patent/CN112908292A/en
Application granted granted Critical
Publication of CN112908292B publication Critical patent/CN112908292B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Abstract

In the text speech synthesis method and apparatus, electronic device, and storage medium provided in this embodiment, the dialog text of at least one dialog in the text to be processed, and the role to which each dialog text belongs, are obtained through recognition. The voice corpus of each role is determined according to a preset voice corpus, and the dialog text of each dialog is converted into corresponding voice data. The voice data of the dialogs of each role and the acquired style parameters of that role are input into a trained audio style migration model, which adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters to obtain the synthesized voice corresponding to the role. As a result, the deductive styles of the voices of different characters in an audio book based on the synthesized voice become more diverse, the audio book renders scenes more vividly, and users' interest in the audio book is improved.

Description

Text voice synthesis method and device, electronic equipment and storage medium
Technical Field
The embodiment of the disclosure relates to the field of big data processing, and in particular relates to a text speech synthesis method and device, an electronic device and a storage medium.
Background
Audio books are accepted by more and more people thanks to advantages such as being simple and convenient to use and not being limited by the environment of use, and have become one of the main ways people read.
In the prior art, audio books are mainly audio novels, and the generation of audio novels relies on speech synthesis techniques. Specifically, a voice corpus can be prerecorded and, based on the text content of the novel, the text is converted into voice and output to the user.
However, in the existing text-to-speech conversion process, the written characters can only be converted into speech with the corresponding pronunciation according to their pronunciation in the novel text, and the deductive styles (that is, the manner of delivery) of the speech of different roles, including speaking rate, tone and inflection, do not differ noticeably. As a result, when a user listens to an existing audio book, the audio book renders scenes poorly, its style is monotonous, and it offers too little interest and enjoyment, which harms the user experience.
Disclosure of Invention
In order to solve the above problems, the present disclosure provides a method and an apparatus for text speech synthesis, an electronic device, and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a text speech synthesis method, including:
identifying and obtaining a conversation text of at least one section of conversation in the text to be processed and a role to which the conversation text belongs;
determining the voice corpus of each role according to a preset voice corpus, and converting the dialogue text of each dialogue into corresponding voice data;
inputting the voice data of the dialogue of each role and the acquired style parameters of the role into a trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each voice data corresponding to the role according to the style parameters to obtain the synthetic voice corresponding to the role.
In a second aspect, an embodiment of the present disclosure provides a text speech synthesis apparatus, including:
the recognition module is used for recognizing the conversation text of at least one section of conversation in the text to be processed and the role to which the conversation text belongs;
the voice conversion module is used for determining the voice corpus of each role according to a preset voice corpus and converting the dialogue text of each dialogue into corresponding voice data;
and the style conversion module is used for inputting the voice data of the conversation of each role and the acquired style parameters of the roles into the trained audio style migration model so as to enable the trained audio style migration model to adjust the deduction style of each voice data corresponding to the roles according to the style parameters and obtain the synthesized voice corresponding to the roles.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method for speech synthesis of text as described above in the first aspect and in various possible designs of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, a speech synthesis method for a text is implemented, as described in the first aspect and various possible designs of the first aspect.
In the text speech synthesis method and apparatus, electronic device, and storage medium provided by this embodiment, the dialog text of at least one dialog in the text to be processed, and the role to which each dialog text belongs, are obtained through recognition. The voice corpus of each role is determined according to a preset voice corpus, and the dialog text of each dialog is converted into corresponding voice data. The voice data of the dialogs of each role and the acquired style parameters of that role are input into a trained audio style migration model, which adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters to obtain the synthesized voice corresponding to the role. As a result, the deductive styles of the voices of different characters in an audio book based on the synthesized voice become more diverse, the audio book renders scenes more vividly, and users' interest in the audio book is improved.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present disclosure, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of a network architecture on which the present disclosure is based;
Fig. 2 is a schematic flowchart of a text speech synthesis method according to an embodiment of the present disclosure;
Fig. 3 is an interface schematic diagram of a text speech synthesis method according to an embodiment of the present disclosure;
Fig. 4 is a block diagram of a speech synthesis apparatus for text according to an embodiment of the present disclosure;
Fig. 5 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
Audio books are accepted by more and more people thanks to advantages such as being simple and convenient to use and not being limited by the environment of use, and have become one of the main ways people read.
In the prior art, audio books are mainly audio novels, and the generation of audio novels relies on speech synthesis techniques. Specifically, a voice corpus can be prerecorded, and text can be converted into voice based on the text content of the novel and output to the user.
However, in the existing text-to-speech conversion process, the written characters can only be converted into speech with the corresponding pronunciation according to their pronunciation in the novel text, and the deductive styles (that is, the manner of delivery) of the speech of different roles, including speaking rate, tone and inflection, do not differ noticeably. As a result, when a user listens to an existing audio book, the audio book renders scenes poorly, its style is monotonous, and it offers too little interest and enjoyment, which harms the user experience.
In order to solve the above problems, the present disclosure provides a method and an apparatus for text speech synthesis, an electronic device, and a storage medium.
Referring to fig. 1, fig. 1 is a schematic diagram of a network architecture on which the present disclosure is based, and as shown in fig. 1, one network architecture on which the present disclosure is based may include a text speech synthesis apparatus 2 and terminals 1.
The text speech synthesis device 2 is hardware or software that can interact with each terminal 1 via a network, and can be used to execute a text speech synthesis method described in each embodiment described below.
When the text speech synthesis apparatus 2 is hardware, it includes a cloud server with a computing function. When the text speech synthesis apparatus 2 is software, it can be installed in electronic devices with computing functions, including but not limited to laptop computers, desktop computers, and the like.
The terminal 1 is a device, such as a smartphone, a tablet computer, or a desktop computer, that can communicate and interact with the text speech synthesis apparatus 2 via a network.
For example, in an actual scenario of speech synthesis for an audio book, the text speech synthesis apparatus 2 may be deployed in a running server for audio books, which generally stores a large amount of book text that can be converted into audio books. The running server can interact with the terminal 1 to receive a user's listening request and determine the target book that the user has selected and wants to listen to. Subsequently, the text speech synthesis apparatus 2 may process the text information of the target book using the text speech synthesis method provided by the present disclosure to determine the character to which each dialog in the text information belongs and the emotion tag of each dialog, and convert the text information into speech information, so that the running server can perform text-to-speech conversion on the text information based on the dialog attribution and send the resulting audio book to the terminal 1.
Of course, in an optional scenario, the text speech synthesis apparatus 2 may process all existing text information in the running server in advance, so that the running server can convert the text of each audio book into speech and store it ahead of time. When the running server then receives a listening request initiated by the terminal, it can directly send the speech and text of the corresponding audio book to the user for listening.
It should be noted that, based on different application scenarios, the running server may further store other types of text information, and other interaction modes may also exist among the running server, the text speech synthesis device, and the terminal, which is not limited in this disclosure.
In a first aspect, referring to fig. 2, fig. 2 is a schematic flowchart of a text speech synthesis method according to an embodiment of the present disclosure. The speech synthesis method of the text provided by the embodiment of the disclosure comprises the following steps:
step 101, identifying and obtaining a dialog text of at least one section of dialog in the text to be processed and a role to which the dialog text belongs.
It should be noted that the speech synthesis method provided in this example is executed by the aforementioned speech synthesis apparatus.
The speech synthesis device acquires the text to be processed of the audio book from the running server, and generally, the text data volume of the text of the audio book is large. In the example of the present disclosure, the text of the audio book may be split first to obtain a plurality of texts to be processed, which have a data volume suitable for processing, so that the voice synthesis apparatus processes the texts one by one. The splitting process may be based on a text structure of the text itself, such as a paragraph structure, a chapter structure, and the like, or may be based on a grammatical semantic, such as a plurality of continuous speech segments representing the same meaning or scenario, and the like.
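The structure-based splitting described above could be sketched as follows. This is only an illustration under stated assumptions: paragraphs are assumed to be separated by blank lines, and the function name `split_text` and the `max_chars` budget are hypothetical, not taken from the disclosure.

```python
import re
from typing import List


def split_text(book_text: str, max_chars: int = 2000) -> List[str]:
    """Split book text into chunks along paragraph boundaries.

    Assumption: paragraphs are separated by one or more blank lines.
    A single paragraph longer than max_chars is kept whole.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", book_text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk would then play the part of one "text to be processed". Splitting by grammatical semantics, which the paragraph also mentions, would need a different approach and is not shown here.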
The speech synthesis means will then perform a corresponding processing for each text to be processed. The speech synthesis device firstly identifies the text to be processed to obtain a dialog text of at least one dialog and at least one character in the text to be processed.
Specifically, a regular expression may be used to determine the position of each dialog in the text to be processed, and the corresponding dialog text is extracted according to each position. In this embodiment, the dialog text and the role in the text to be processed can be retrieved by using preset regular expressions. For example, dialog is generally enclosed in quotation marks, so a corresponding regular expression can be set to take the positions of the quotation marks in the text information as the position of a dialog. As another example, some dialogs are introduced by a cue such as "a person says", so a corresponding regular expression can be set to find the position of "a person says" in the text and determine the position of the dialog from it. After the speech synthesis apparatus obtains the position of each dialog, it extracts the dialog text of that dialog based on the position.
The speech synthesis apparatus also uses named entity recognition to determine at least one role that appears in the text to be processed. Named entity recognition is a technique for recognizing named content in text; its recognition range includes person names, place names, organization names, proper nouns, and so on. In this embodiment, the speech synthesis apparatus uses named entity recognition to identify the role information, that is, the names of the characters that appear in the text information.
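As a rough illustration of the two kinds of regular expression mentioned above, the sketch below locates quoted spans and "X says" cues. The concrete patterns, the English cue words, and the names `DIALOG_PATTERN`, `CUE_PATTERN`, `extract_dialogs` and `find_speaker_cues` are assumptions made for this example; the disclosure does not specify them.

```python
import re
from typing import List, Tuple

# Assumption: dialog is enclosed in straight or curly double quotes.
DIALOG_PATTERN = re.compile(r'["“]([^"“”]+)["”]')
# Assumption: a speaker cue looks like a capitalized name followed by "says"/"said".
CUE_PATTERN = re.compile(r"\b([A-Z]\w*) (?:says|said)\b")


def extract_dialogs(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, dialog_text) for every quoted span in the text."""
    return [(m.start(), m.end(), m.group(1)) for m in DIALOG_PATTERN.finditer(text)]


def find_speaker_cues(text: str) -> List[Tuple[int, str]]:
    """Return (position, name) for every 'Name says/said' cue in the text."""
    return [(m.start(), m.group(1)) for m in CUE_PATTERN.finditer(text)]
```

The offsets returned by both functions are positions in the same text, so they can be compared directly when deciding which cue belongs to which dialog.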
After the speech synthesis apparatus acquires the dialogs and roles appearing in the text to be processed, it also determines the association relationship between the dialogs and the roles, that is, the role to which each dialog belongs. Specifically, the speech synthesis apparatus may determine the association relationship between each dialog and each role according to the number of times each role appears and the positions where each role appears in the text to be processed, and thereby determine the role to which each dialog belongs. For example, for each dialog, the closer the position where a certain role appears in the text is to the position of the dialog, the stronger the association between the dialog and the role, that is, the more likely the dialog belongs to that role; conversely, the farther the role is from the dialog, the weaker the association and the less likely the dialog belongs to that role. As another example, the more times a certain role appears in the text, the stronger the association between each dialog and that role, that is, the more likely the dialogs belong to it; conversely, the fewer times a role appears, the weaker the association and the lower that likelihood. In this manner, the speech synthesis apparatus determines the role to which each dialog belongs.
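One way to combine the two signals described above (distance and frequency) is a simple score per role. The scoring formula below, proximity multiplied by mention count, is an assumption chosen for illustration; the disclosure only states that closer and more frequent roles are more strongly associated, not how the two are weighted. The function name `attribute_dialog` is likewise hypothetical.

```python
from typing import Dict, List


def attribute_dialog(dialog_pos: int, mentions: Dict[str, List[int]]) -> str:
    """Pick the role with the highest association score for one dialog.

    mentions maps each role name to the (non-empty) list of character
    offsets at which that role appears in the text to be processed.
    """
    def score(positions: List[int]) -> float:
        nearest = min(abs(p - dialog_pos) for p in positions)
        proximity = 1.0 / (1.0 + nearest)  # closer mention -> higher score
        frequency = len(positions)         # more mentions -> higher score
        return proximity * frequency

    return max(mentions, key=lambda role: score(mentions[role]))
```

For instance, with a dialog at offset 20 and mentions `{"Alice": [10], "Bob": [500]}`, the nearby mention of Alice wins. With a different weighting the trade-off between a near but rare role and a distant but frequent one would change.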
Step 102, determining the voice corpus of each role according to a preset voice corpus, and converting the dialogue text of each dialogue into corresponding voice data.
Specifically, similar to the prior art, the embodiment of the present disclosure takes into account that roles of different ages and genders have different timbre characteristics: for example, the timbre of children is generally crisp, the timbre of older people is muddier, the timbre of girls is higher, and the timbre of men is lower. Generally, corresponding voice corpora are preset for different timbre characteristics and stored in a voice corpus library, which includes but is not limited to a child timbre, a girl timbre, an old-man timbre, and so on.
When the speech synthesis apparatus determines the voice corpus of each role according to the preset voice corpus library and converts the dialog text of each dialog into corresponding voice data, the following approach can be adopted: a voice corpus having the timbre characteristics of each role is determined in the preset voice corpus library, and voice data for each dialog is generated from the dialog text of each dialog corresponding to the role, according to the voice corpus of that role.
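The corpus lookup in step 102 can be pictured as a table keyed by timbre characteristics. In the sketch below, the `Role` fields, the corpus identifiers, and the fallback corpus are all hypothetical; the disclosure does not say how a role's age or gender is obtained, so they are simply assumed to be known.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Role:
    name: str
    age_group: str  # assumed labels, e.g. "child", "adult", "elderly"
    gender: str     # assumed labels, e.g. "female", "male"


# Hypothetical voice corpus library: timbre characteristics -> corpus identifier.
CORPUS_LIBRARY: Dict[Tuple[str, str], str] = {
    ("child", "female"): "corpus_child",
    ("child", "male"): "corpus_child",
    ("adult", "female"): "corpus_girl",
    ("elderly", "male"): "corpus_old_man",
}
DEFAULT_CORPUS = "corpus_narrator"  # assumed fallback when no timbre matches


def select_corpus(role: Role) -> str:
    """Return the identifier of the corpus matching the role's timbre characteristics."""
    return CORPUS_LIBRARY.get((role.age_group, role.gender), DEFAULT_CORPUS)
```

The selected corpus would then be handed to whatever text-to-speech engine generates the voice data for that role's dialogs.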
Step 103, inputting the voice data of the dialog of each character and the acquired style parameters of the character into the trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each voice data corresponding to the character according to the style parameters to obtain the synthesized voice corresponding to the character.
The voice data generated in step 102 does not have any deductive style; that is, it only includes the original timbre characteristics of the role. The processing in step 103 gives the voice data of the role a certain deductive style, which includes but is not limited to: a broadcast-announcer style, a crosstalk (xiangsheng) style, a bel canto style, a theatrical style, and the like.
The embodiment of the present disclosure uses the trained audio style migration model to impart the style described by the style parameters to the voice data. Specifically, the speech synthesis apparatus may obtain style parameters for the target voice data at the level of the whole text of an audio book, of all dialogs of a certain role in the audio book, or of a single dialog in the audio book. Generally, the style parameters may include parameters such as pitch, speech rate, accent characteristics, inflection characteristics, and pronunciation characteristics. The voice data and the style parameters are input into the trained audio style migration model, which migrates the style described by the style parameters into the voice data, so that the migrated voice data has the deductive style described by those style parameters; that is, synthesized voice with the deductive style described by the style parameters is obtained.
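Because style parameters may be set for a whole book, for a role, or for a single dialog, some rule is needed to pick which one applies. The sketch below assumes that the most specific setting wins (dialog, then role, then book); that precedence, the `StyleParams` fields shown, and the function name `resolve_style` are assumptions for illustration and are not stated in the disclosure.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StyleParams:
    pitch: float        # relative pitch, 1.0 = unchanged (assumed convention)
    speech_rate: float  # relative rate, 1.0 = unchanged (assumed convention)
    style_name: str     # e.g. "broadcast", "theatrical"


def resolve_style(
    book_style: StyleParams,
    role_styles: Dict[str, StyleParams],
    dialog_styles: Dict[int, StyleParams],
    role: str,
    dialog_id: int,
) -> StyleParams:
    """Return the style parameters for one dialog, most specific level first."""
    if dialog_id in dialog_styles:
        return dialog_styles[dialog_id]
    if role in role_styles:
        return role_styles[role]
    return book_style
```

The resolved parameters, together with the voice data of the dialog, would form the input to the trained audio style migration model.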
In addition, the audio style migration model may be a neural network model. Correspondingly, the method further includes: establishing an audio style migration model and acquiring training voice data, where the training voice data comprises a plurality of style parameters and voice data of at least one dialog corresponding to each style parameter; and training the audio style migration model with the training voice data, so that the model extracts style characteristics from the voice data corresponding to each style parameter and establishes an association between the style parameters and the style characteristics, thereby obtaining the trained audio style migration model.
Fig. 3 is an interface schematic diagram of a text speech synthesis method according to an embodiment of the present disclosure. As shown in fig. 3, after obtaining the synthesized voice, the text speech synthesis apparatus may further send the text of the audio book and the audio generated from the synthesized voice to the terminal, so that the terminal displays the text on a display interface and plays the audio to the user.
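To make the idea of "associating style parameters with style characteristics" concrete, the toy sketch below averages feature vectors per style label. It is deliberately not a neural network and is not the model of the disclosure: the feature vectors are assumed to have been extracted from the training voice data already, and the function name `fit_style_centroids` is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def fit_style_centroids(
    samples: Sequence[Tuple[str, Sequence[float]]]
) -> Dict[str, List[float]]:
    """Map each style label to the mean feature vector of its training samples.

    Each sample is (style_label, feature_vector); all vectors for one
    label are assumed to have the same length.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for style, features in samples:
        if style not in sums:
            sums[style] = [0.0] * len(features)
            counts[style] = 0
        # Accumulate the feature vector element by element.
        sums[style] = [s + f for s, f in zip(sums[style], features)]
        counts[style] += 1
    return {style: [s / counts[style] for s in total] for style, total in sums.items()}
```

A real audio style migration model would learn both the feature extraction and the mapping jointly; the table produced here only illustrates the shape of the training data (style parameters paired with voice data) and of the learned association.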
In the text speech synthesis method provided by this embodiment, the dialog text of at least one dialog in the text to be processed, and the role to which each dialog text belongs, are obtained through recognition. The voice corpus of each role is determined according to a preset voice corpus, and the dialog text of each dialog is converted into corresponding voice data. The voice data of the dialogs of each role and the acquired style parameters of that role are input into a trained audio style migration model, which adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters to obtain the synthesized voice corresponding to the role. As a result, the deductive styles of the voices of different characters in an audio book based on the synthesized voice become more diverse, the audio book renders scenes more vividly, and users' interest in the audio book is improved.
Fig. 4 is a block diagram of a text speech synthesis apparatus according to an embodiment of the present disclosure. For ease of illustration, only portions relevant to embodiments of the present disclosure are shown. Referring to fig. 4, the text speech synthesis apparatus includes: a recognition module 10, a voice conversion module 20, and a style conversion module 30.
The recognition module 10 is used for recognizing the dialog text of at least one section of dialog in the text to be processed and the role to which the dialog text belongs;
the voice conversion module 20 is configured to determine a voice corpus of each role according to a preset voice corpus, and convert a dialog text of each dialog into corresponding voice data;
the style conversion module 30 is configured to input the voice data of the dialog of each character and the acquired style parameters of the character into the trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each voice data corresponding to the character according to the style parameters to obtain a synthesized voice corresponding to the character.
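The three modules can be read as stages of one pipeline. The sketch below wires them together as plain callables; the function name `synthesize`, the callable signatures, and the use of `bytes` as a stand-in for voice data are assumptions for illustration, and the stages themselves are left to the caller.

```python
from typing import Callable, Dict, List, Tuple


def synthesize(
    text: str,
    recognize: Callable[[str], List[Tuple[str, str]]],  # module 10: text -> (role, dialog text)
    to_voice: Callable[[str, str], bytes],              # module 20: (role, dialog text) -> voice data
    apply_style: Callable[[bytes, str], bytes],         # module 30: (voice data, style) -> synthesized voice
    styles: Dict[str, str],                             # style parameter per role (simplified to a label)
) -> List[Tuple[str, bytes]]:
    """Run recognition, voice conversion and style conversion in order."""
    results: List[Tuple[str, bytes]] = []
    for role, dialog_text in recognize(text):
        voice = to_voice(role, dialog_text)
        results.append((role, apply_style(voice, styles[role])))
    return results
```

Passing the stages in as parameters keeps the sketch independent of any particular recognizer, speech engine, or style migration model.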
In an optional embodiment provided by the present disclosure, further comprising: a training module;
the training module is used for establishing an audio style migration model and acquiring training voice data, where the training voice data comprises a plurality of style parameters and voice data of at least one dialog corresponding to each style parameter; and for training the audio style migration model with the training voice data, so that the model extracts style characteristics from the voice data corresponding to each style parameter and establishes an association between the style parameters and the style characteristics, thereby obtaining the trained audio style migration model.
In an optional embodiment provided by the present disclosure, voice corpora with a plurality of timbre characteristics are stored in the voice corpus library;
the voice conversion module 20 is specifically configured to determine, in the preset voice corpus library, a voice corpus having the timbre characteristics of each role; and to generate voice data for each dialog from the dialog text of each dialog corresponding to each role, according to the voice corpus of that role.
In an optional embodiment provided by the present disclosure, the recognition module 10 is specifically configured to: identify a dialog text of at least one dialog in a text to be processed and at least one role in the text to be processed; and determine the role to which each dialog belongs according to the association relationship between each dialog and each role.
In an optional embodiment provided by the present disclosure, the recognition module 10 is specifically configured to determine, by using a regular expression, the position of each dialog in the text to be processed, and extract the corresponding dialog text according to each position; and to determine, by using named entity recognition, at least one role that appears in the text to be processed.
The text speech synthesis apparatus provided by this embodiment obtains, through recognition, the dialog text of at least one dialog in the text to be processed and the role to which each dialog text belongs. The voice corpus of each role is determined according to a preset voice corpus, and the dialog text of each dialog is converted into corresponding voice data. The voice data of the dialogs of each role and the acquired style parameters of that role are input into a trained audio style migration model, which adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters to obtain the synthesized voice corresponding to the role. As a result, the deductive styles of the voices of different characters in an audio book based on the synthesized voice become more diverse, the audio book renders scenes more vividly, and users' interest in the audio book is improved.
The electronic device provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Referring to fig. 5, a schematic structural diagram of an electronic device 900 suitable for implementing the embodiment of the present disclosure is shown, where the electronic device 900 may be a terminal device or a server. The terminal device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a Personal Digital Assistant (PDA), a tablet computer (PAD), a Portable Multimedia Player (PMP), or an in-vehicle terminal (e.g., a car navigation terminal), and a fixed terminal such as a digital TV or a desktop computer. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, the electronic device 900 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 901, which may perform various suitable actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage device 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the electronic device 900 are also stored. The processing means 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Generally, the following devices may be connected to the I/O interface 905: input devices 906 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 907 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 908 including, for example, magnetic tape, hard disk, etc.; and a communication device 909. The communication device 909 may allow the electronic apparatus 900 to perform wireless or wired communication with other apparatuses to exchange data. While fig. 5 illustrates an electronic device 900 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 909, or installed from the storage device 908, or installed from the ROM 902. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing apparatus 901.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. In some cases, the name of a unit does not constitute a limitation on the unit itself; for example, the first obtaining unit may also be described as a "unit for obtaining at least two internet protocol addresses".
The functions described above may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, exemplary types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The following are some embodiments of the disclosure.
In a first aspect, according to one or more embodiments of the present disclosure, a method for speech synthesis of text comprises:
identifying and obtaining a dialog text of at least one dialog in a text to be processed and the role to which the dialog text belongs;
determining the voice corpus of each role according to a preset voice corpus, and converting the dialog text of each dialog into corresponding voice data;
inputting the voice data of the dialogs of each role and the acquired style parameters of the role into a trained audio style migration model, so that the trained audio style migration model adjusts the deductive style (that is, the style of delivery) of each piece of voice data corresponding to the role according to the style parameters, to obtain the synthesized voice corresponding to the role.
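The three steps of the first aspect can be read as a simple data flow. The following is a minimal sketch of that flow only, not the disclosed implementation: the function name `synthesize_text` and all four arguments are hypothetical, and the toy callables stand in for the recognition step, the corpus-based conversion step, and the trained audio style migration model.

```python
def synthesize_text(text, identify, to_voice, style_model, style_params):
    """Chain the three steps; every callable is a hypothetical stand-in.

    identify(text)            -> list of (role, dialog_text) pairs
    to_voice(role, dialog)    -> voice data for one dialog
    style_model(voice, style) -> voice data with its delivery style adjusted
    style_params              -> {role: style parameter}
    """
    result = {}
    for role, dialog_text in identify(text):
        voice = to_voice(role, dialog_text)               # step 2: text to voice data
        styled = style_model(voice, style_params[role])   # step 3: style adjustment
        result.setdefault(role, []).append(styled)
    return result


# Toy stand-ins: strings are used in place of audio so the flow is visible.
demo = synthesize_text(
    "ignored by the toy identifier",
    identify=lambda t: [("Alice", "Hello."), ("Bob", "Hi."), ("Alice", "Bye.")],
    to_voice=lambda role, dialog: f"<{role}:{dialog}>",
    style_model=lambda voice, style: f"{voice}[{style}]",
    style_params={"Alice": "cheerful", "Bob": "calm"},
)
```

Passing the steps in as callables keeps the sketch independent of any particular recognizer, synthesizer, or model.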
In an optional embodiment provided by the present disclosure, further comprising:
establishing an audio style migration model and acquiring training voice data; the training voice data comprises a plurality of style parameters and voice data of at least one section of dialogue corresponding to each style parameter;
and training the audio style migration model with the training voice data, so that the audio style migration model extracts the style features of the voice data corresponding to each style parameter and establishes an association relationship between the style parameters and the style features, to obtain the trained audio style migration model.
In an optional embodiment provided by the present disclosure, voice corpora having a plurality of timbre features are stored in the preset voice corpus;
correspondingly, the determining the voice corpus of each role according to the preset voice corpus, and converting the dialog text of each dialog into corresponding voice data, includes:
determining, in the preset voice corpus, the voice corpus having the timbre features of each role;
and generating, according to the voice corpus of each role, voice data for each dialog based on the dialog text of each dialog corresponding to that role.
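As an illustration of the timbre-based selection described above, the sketch below looks up one stored corpus per role and then generates voice data dialog by dialog. It is a sketch under stated assumptions: the `VoiceCorpus` record, the timbre labels, the role-to-timbre mapping, and the `tts` callable are all hypothetical, and the toy `tts` returns a tagged tuple rather than audio.

```python
from dataclasses import dataclass


@dataclass
class VoiceCorpus:
    timbre: str   # timbre feature label, e.g. "young_female" (hypothetical)
    source: str   # hypothetical handle to the stored recordings


def select_corpus(library, timbre):
    """Return the first stored corpus whose timbre feature matches."""
    for corpus in library:
        if corpus.timbre == timbre:
            return corpus
    raise LookupError(f"no voice corpus with timbre {timbre!r}")


def dialogs_to_voice_data(dialogs_by_role, role_timbres, library, tts):
    """Generate voice data for every dialog, using each role's own corpus.

    tts(corpus, dialog_text) is a caller-supplied synthesis function.
    """
    voice_data = {}
    for role, dialogs in dialogs_by_role.items():
        corpus = select_corpus(library, role_timbres[role])
        voice_data[role] = [tts(corpus, text) for text in dialogs]
    return voice_data


library = [VoiceCorpus("young_female", "corpus_a"), VoiceCorpus("elderly_male", "corpus_b")]
role_timbres = {"Alice": "young_female", "Bob": "elderly_male"}
dialogs_by_role = {"Alice": ["Is anyone home?"], "Bob": ["In the kitchen!"]}

# Toy synthesis function: tags the text with the corpus used instead of producing audio.
voice_data = dialogs_to_voice_data(
    dialogs_by_role, role_timbres, library,
    tts=lambda corpus, text: (corpus.source, text),
)
```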
In an optional embodiment provided by the present disclosure, the identifying and obtaining of a dialog text of at least one dialog in the text to be processed and the role to which the dialog text belongs includes:
identifying and obtaining a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed;
and determining the role to which each dialog belongs according to the association relationship between each dialog and each role.
In an optional embodiment provided by the present disclosure, the identifying and obtaining of a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed, includes:
determining the position of each dialog in the text to be processed by using a regular expression, and extracting the corresponding dialog text according to each position;
and identifying, by using named entity recognition, at least one role that appears in the text to be processed.
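The regular-expression step can be sketched as follows. This is an illustrative sketch, not the disclosed pattern: it assumes that dialogs are delimited by straight or curly double quotation marks, and it replaces named entity recognition with a plain lookup of a caller-supplied list of names, purely as a stand-in for a real recognizer. The names `DIALOG_RE`, `extract_dialogs`, and `find_roles` are hypothetical.

```python
import re

# Assumed convention: a dialog is any span enclosed in straight or curly double quotes.
DIALOG_RE = re.compile('"([^"]+)"|\u201c([^\u201d]+)\u201d')


def extract_dialogs(text):
    """Return (start, end, dialog_text) for each quoted span, in order of appearance."""
    dialogs = []
    for m in DIALOG_RE.finditer(text):
        dialogs.append((m.start(), m.end(), m.group(1) or m.group(2)))
    return dialogs


def find_roles(text, known_names):
    """Stand-in for named entity recognition: offsets of each known name in the text."""
    roles = {}
    for name in known_names:
        positions = [m.start() for m in re.finditer(r"\b" + re.escape(name) + r"\b", text)]
        if positions:
            roles[name] = positions
    return roles


sample = 'Alice opened the door. "Is anyone home?" she asked. Bob shouted, "In the kitchen!"'
dialogs = extract_dialogs(sample)
roles = find_roles(sample, ["Alice", "Bob", "Carol"])
```

Keeping the character offsets alongside the extracted text matters because the later attribution step works from positions in the text to be processed.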
In an optional embodiment provided by the present disclosure, the determining, according to the association relationship between each dialog and each role, a role to which the dialog belongs includes:
determining the association relationship between each dialog and each role according to the number of times each role appears and the positions at which each role appears in the text to be processed, and thereby determining the role to which each dialog belongs.
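The disclosure does not spell out how occurrence counts and positions are combined, so the sketch below uses one hypothetical scoring rule for illustration: each dialog goes to the role whose nearest mention outside the dialog is closest, and ties are broken in favor of the role that appears more often. The function name `attribute_dialogs` and the input shapes are assumptions.

```python
def attribute_dialogs(dialog_spans, role_positions):
    """Assign one role to each dialog.

    dialog_spans:   list of (start, end) character offsets of each dialog
    role_positions: {role: [offsets at which the role is mentioned]}
    """
    assigned = []
    for start, end in dialog_spans:
        best = None
        for role, positions in role_positions.items():
            # Mentions inside the dialog are ignored: a name spoken in a line
            # usually refers to the listener, not the speaker.
            outside = [p for p in positions if not (start <= p < end)]
            if not outside:
                continue
            distance = min(min(abs(p - start), abs(p - end)) for p in outside)
            # Closer mentions win; more frequent roles win ties.
            key = (distance, -len(positions))
            if best is None or key < best[0]:
                best = (key, role)
        assigned.append(best[1] if best else None)
    return assigned


spans = [(10, 30), (60, 80)]
mentions = {"Alice": [2, 100], "Bob": [52]}
speakers = attribute_dialogs(spans, mentions)
```

A nearest-mention rule of this kind is easy to fool (for example, by "she asked" following a line), which is one reason a real system would likely weigh additional cues.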
In a second aspect, according to one or more embodiments of the present disclosure, a speech synthesis apparatus for text includes:
a recognition module, configured to identify and obtain a dialog text of at least one dialog in a text to be processed and the role to which the dialog text belongs;
a voice conversion module, configured to determine the voice corpus of each role according to a preset voice corpus and to convert the dialog text of each dialog into corresponding voice data;
and a style conversion module, configured to input the voice data of the dialogs of each role and the acquired style parameters of the role into a trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters, to obtain the synthesized voice corresponding to the role.
In an optional embodiment provided by the present disclosure, further comprising: a training module;
the training module is configured to establish an audio style migration model and acquire training voice data, the training voice data comprising a plurality of style parameters and voice data of at least one dialog corresponding to each style parameter; and to train the audio style migration model with the training voice data, so that the audio style migration model extracts the style features of the voice data corresponding to each style parameter and establishes an association relationship between the style parameters and the style features, to obtain the trained audio style migration model.
In an optional embodiment provided by the present disclosure, voice corpora having a plurality of timbre features are stored in the preset voice corpus;
the voice conversion module is specifically configured to determine, in the preset voice corpus, the voice corpus having the timbre features of each role; and to generate, according to the voice corpus of each role, voice data for each dialog based on the dialog text of each dialog corresponding to that role.
In an optional embodiment provided by the present disclosure, the recognition module is specifically configured to: identify and obtain a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed; and determine the role to which each dialog belongs according to the association relationship between each dialog and each role.
In an optional embodiment provided by the present disclosure, the recognition module is specifically configured to determine the position of each dialog in the text to be processed by using a regular expression, and extract the corresponding dialog text according to each position; and to identify, by using named entity recognition, at least one role that appears in the text to be processed.
In a third aspect, according to one or more embodiments of the present disclosure, an electronic device comprises: at least one processor and a memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the method for speech synthesis of text as previously described.
In a fourth aspect, according to one or more embodiments of the present disclosure, a computer-readable storage medium has stored therein computer-executable instructions that, when executed by a processor, implement a method for speech synthesis of text as described above.
The foregoing description is merely an illustration of the preferred embodiments of the disclosure and of the technical principles employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, but also encompasses other technical solutions formed by any combination of those features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (14)

1. A method for speech synthesis of text, comprising:
identifying and obtaining a dialog text of at least one dialog in a text to be processed and the role to which the dialog text belongs;
determining the voice corpus of each role according to a preset voice corpus, and converting the dialog text of each dialog into corresponding voice data;
inputting the voice data of the dialogs of each role and the acquired style parameters of the role into a trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters, to obtain the synthesized voice corresponding to the role;
and the audio style migration model has an association relationship between style parameters and style features.
2. The method for speech synthesis of text according to claim 1, further comprising:
establishing an audio style migration model and acquiring training voice data, the training voice data comprising a plurality of style parameters and voice data of at least one dialog corresponding to each style parameter;
and training the audio style migration model with the training voice data, so that the audio style migration model extracts the style features of the voice data corresponding to each style parameter and establishes an association relationship between the style parameters and the style features, to obtain the trained audio style migration model.
3. The method for speech synthesis of text according to claim 1, wherein voice corpora having a plurality of timbre features are stored in the preset voice corpus;
correspondingly, the determining the voice corpus of each role according to the preset voice corpus, and converting the dialog text of each dialog into corresponding voice data, comprises:
determining, in the preset voice corpus, the voice corpus having the timbre features of each role;
and generating, according to the voice corpus of each role, voice data for each dialog based on the dialog text of each dialog corresponding to that role.
4. The method for speech synthesis of text according to claim 1, wherein the identifying and obtaining of a dialog text of at least one dialog in the text to be processed and the role to which the dialog text belongs comprises:
identifying and obtaining a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed;
and determining the role to which each dialog belongs according to the association relationship between each dialog and each role.
5. The method for speech synthesis of text according to claim 4, wherein the identifying and obtaining of a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed, comprises:
determining the position of each dialog in the text to be processed by using a regular expression, and extracting the corresponding dialog text according to each position;
and identifying, by using named entity recognition, at least one role that appears in the text to be processed.
6. The method for speech synthesis of text according to claim 4, wherein the determining the role to which each dialog belongs according to the association relationship between each dialog and each role comprises:
determining the association relationship between each dialog and each role according to the number of times each role appears and the positions at which each role appears in the text to be processed, and thereby determining the role to which each dialog belongs.
7. An apparatus for speech synthesis of text, comprising:
a recognition module, configured to identify and obtain a dialog text of at least one dialog in a text to be processed and the role to which the dialog text belongs;
a voice conversion module, configured to determine the voice corpus of each role according to a preset voice corpus and to convert the dialog text of each dialog into corresponding voice data;
a style conversion module, configured to input the voice data of the dialogs of each role and the acquired style parameters of the role into a trained audio style migration model, so that the trained audio style migration model adjusts the deductive style of each piece of voice data corresponding to the role according to the style parameters, to obtain the synthesized voice corresponding to the role;
and the audio style migration model has an association relationship between style parameters and style features.
8. The apparatus for speech synthesis of text according to claim 7, further comprising: a training module;
the training module is configured to establish an audio style migration model and acquire training voice data, the training voice data comprising a plurality of style parameters and voice data of at least one dialog corresponding to each style parameter; and to train the audio style migration model with the training voice data, so that the audio style migration model extracts the style features of the voice data corresponding to each style parameter and establishes an association relationship between the style parameters and the style features, to obtain the trained audio style migration model.
9. The apparatus for speech synthesis of text according to claim 7, wherein voice corpora having a plurality of timbre features are stored in the preset voice corpus;
the voice conversion module is specifically configured to determine, in the preset voice corpus, the voice corpus having the timbre features of each role; and to generate, according to the voice corpus of each role, voice data for each dialog based on the dialog text of each dialog corresponding to that role.
10. The apparatus for speech synthesis of text according to claim 7, wherein the recognition module is specifically configured to: identify and obtain a dialog text of at least one dialog in the text to be processed, and at least one role in the text to be processed; and determine the role to which each dialog belongs according to the association relationship between each dialog and each role.
11. The apparatus for speech synthesis of text according to claim 10, wherein the recognition module is specifically configured to determine the position of each dialog in the text to be processed by using a regular expression, and extract the corresponding dialog text according to each position; and to identify, by using named entity recognition, at least one role that appears in the text to be processed.
12. The apparatus for speech synthesis of text according to claim 10, wherein the recognition module is specifically configured to determine the association relationship between each dialog and each role according to the number of times each role appears and the positions at which each role appears in the text to be processed, and thereby determine the role to which each dialog belongs.
13. An electronic device, comprising: at least one processor and a memory;
the memory stores computer-executable instructions;
execution of the computer-executable instructions stored by the memory by the at least one processor causes the at least one processor to perform a method of speech synthesis of text according to any of claims 1-6.
14. A computer-readable storage medium having computer-executable instructions stored thereon which, when executed by a processor, implement a method for speech synthesis of text according to any one of claims 1-6.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911134833.6A CN112908292B (en) 2019-11-19 2019-11-19 Text voice synthesis method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112908292A CN112908292A (en) 2021-06-04
CN112908292B true CN112908292B (en) 2023-04-07

Family

ID=76103459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911134833.6A Active CN112908292B (en) 2019-11-19 2019-11-19 Text voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112908292B (en)

Also Published As

Publication number Publication date
CN112908292A (en) 2021-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant