CN113066473A - Voice synthesis method and device, storage medium and electronic equipment - Google Patents

Voice synthesis method and device, storage medium and electronic equipment

Info

Publication number
CN113066473A
Authority
CN
China
Prior art keywords
voice
emotion
style
text
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110347739.XA
Other languages
Chinese (zh)
Inventor
杨辰雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202110347739.XA
Publication of CN113066473A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The embodiment of the application discloses a voice synthesis method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring the current voice input by a user; obtaining a style label and/or an emotion label of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, the response synthesized voice being used to interact with the current voice. According to the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.

Description

Voice synthesis method and device, storage medium and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice synthesis method, a voice synthesis device, a storage medium and electronic equipment.
Background
With the rapid development of artificial intelligence technology, intelligent voice interaction is widely applied to the fields of finance, logistics, customer service and the like, and the service level of enterprise customer service is improved through functions of intelligent marketing, intelligent collection urging, content navigation and the like.
In current intelligent voice interaction, the common practice is to designate a general-purpose sound library for model training and voice synthesis.
This approach meets the basic requirements of intelligent voice interaction, but the style is monotonous, there is no emotional expression, and the customer experience is poor.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method, a voice synthesis device, a storage medium and electronic equipment, which can adjust style labels and emotion labels of response synthesized voice in real time according to input voice of a client, and improve expressive force of the response synthesized voice, so that client experience in intelligent voice interaction is improved.
In a first aspect, an embodiment of the present application provides a speech synthesis method, where the method includes:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the current voice acquiring unit is used for acquiring current voice input by a user;
the tag obtaining unit is used for obtaining a style tag and/or an emotion tag of the response text by using a preset style emotion matching model according to the current voice;
and the response synthetic voice obtaining unit is used for obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and the response synthetic voice is used for interacting with the current voice.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a speech synthesis method according to an embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable by the processor, and the processor executes the computer program to implement the speech synthesis method according to the embodiment of the present application.
According to the technical scheme provided by the embodiment of the application, the current voice input by a user is obtained; obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, and using the response synthesized voice to interact with the current voice. According to the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.
Drawings
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a speech synthesis method provided in an embodiment of the present application, where the present embodiment is applicable to a case of intelligent speech interaction, and the method may be executed by a speech synthesis apparatus provided in an embodiment of the present application, where the speech synthesis apparatus may be implemented by software and/or hardware, and may be integrated in a device such as an intelligent terminal for speech synthesis.
As shown in fig. 1, the speech synthesis method includes:
and S110, acquiring the current voice input by the user.
In this scheme, speech synthesis is the process of converting text into speech and outputting it. It mainly involves decomposing the input text into phonemes according to pronunciation, handling special symbols, and converting the phoneme sequence into digital audio through an acoustic model and a vocoder. The response synthesized voice can be obtained through speech synthesis, thereby realizing intelligent voice interaction.
The current voice may refer to the voice input by the user at the current moment. For example, the current voice may ask about today's weather conditions or tomorrow's temperature. The current voice input by the user may be obtained by using a speech recognition rule; the specific speech recognition rule is not limited in this embodiment.
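For illustration only, the following minimal Python sketch outlines the synthesis pipeline described above (text decomposed into phonemes, then converted to audio through an acoustic model and a vocoder). The callables g2p, acoustic_model and vocoder are hypothetical placeholders supplied by the caller and are not components named by this disclosure.

    # Minimal sketch of the speech-synthesis pipeline described above.
    # g2p, acoustic_model and vocoder are hypothetical caller-supplied
    # callables, not components of this disclosure.
    def synthesize(text, g2p, acoustic_model, vocoder):
        phonemes = g2p(text)                 # decompose text into phonemes, handle special symbols
        features = acoustic_model(phonemes)  # phoneme sequence -> acoustic features
        waveform = vocoder(features)         # acoustic features -> digital audio
        return waveform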
S120, obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice.
In this embodiment, the response text may be the text used for voice interaction with the current voice. For example, if the current voice asks about today's weather condition, the response text may be that today's weather is sunny.
The style label may refer to the different language material and manner of expression adopted during voice interaction according to the social occasion, purpose and task, and to the identity and qualities of the communicating parties. For example, the style label may be sweet or harsh. Sweet may mean that the voice is soft and pleasant; harsh may mean that the voice is stern.
In this scheme, the emotion label may refer to the emotional color of the language, that is, the feeling it conveys through clear imagery that appeals to the listener's aesthetic sense and imagination. For example, the emotion label may be happiness or anger.
The emotion label and/or the style label is determined by outputting the label with the maximum matching degree. The preset style emotion matching model may be a neural network model or a Bert model; preferably, it is a Bert model. The Bert model converts each word in the text into a one-dimensional vector by looking it up in a word-vector table and takes these vectors as the model input; the model output is, for each input word, a vector representation that incorporates the semantic information of the full text.
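As a hedged illustration of how a Bert model produces such vector representations, the sketch below uses the open-source Hugging Face transformers library; the checkpoint name bert-base-chinese is an assumption, and the disclosure does not prescribe a particular implementation.

    # Sketch: obtaining per-token contextual vectors from a Bert model.
    # The checkpoint name is an assumption; any Bert checkpoint could be used.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    text = "今天天气晴朗"  # example response text: "the weather is sunny today"
    inputs = tokenizer(text, return_tensors="pt")  # ids looked up in the vocabulary table
    with torch.no_grad():
        outputs = model(**inputs)

    # Each token is now a vector that reflects the semantic information of the full text.
    token_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)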
In this embodiment, after the current voice is obtained, the current voice is converted into an interactive text, the interactive text is processed to obtain a response text, and the interactive text and the response text are processed by using the preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text.
In this technical solution, optionally, obtaining a style tag and/or an emotion tag of a response text by using a preset style emotion matching model according to the current speech includes:
obtaining an interactive text and a response text of the current voice according to the current voice;
and processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
Wherein the interactive text may refer to a text form of the current speech.
In this scheme, the current voice is converted to obtain the interactive text and analyzed to obtain the response text. The interactive text and the response text are then taken as input and processed by the preset style emotion matching model, which outputs emotion labels and/or style labels; the emotion label and/or style label with the highest matching degree is taken as the style label and/or emotion label of the response text.
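A minimal sketch of this selection step is given below; match_model is a caller-supplied stand-in for the preset style emotion matching model and is assumed to return a matching degree for a single candidate label.

    # Sketch: keep the candidate label with the maximum matching degree.
    # match_model stands in for the preset style emotion matching model;
    # it returns a matching degree for one candidate label.
    def best_label(match_model, interactive_text, response_text, candidate_labels):
        scores = {
            label: match_model(interactive_text, response_text, label)
            for label in candidate_labels
        }
        return max(scores, key=scores.get)  # label with the maximum matching degree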
The interactive text and the response text are processed by utilizing the preset style emotion matching model, so that style labels and/or emotion labels of the response text can be obtained, and the customer experience in intelligent voice interaction can be improved.
In this technical solution, optionally, obtaining the interactive text and the response text of the current speech according to the current speech includes:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
In this scheme, the voice recognition module is used to recognize the voice and convert it into text form; the semantic understanding module is used to analyze the content of the text to obtain its key content. The semantic understanding process involves dialogue management, error correction, content management and context information. The dialogue management module is used to search within the corresponding domain according to the key content of the text to obtain the response content corresponding to that key content; the voice generation module is used to process the response content to obtain the response text. For example, the current voice asks about today's weather condition; the voice recognition module converts the current voice into the interactive text; the semantic understanding module analyzes the content of the interactive text, and the key content may be "today" and "weather"; the dialogue management module searches in the weather domain to obtain today's weather; and the voice generation module generates the response text from the retrieved content.
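The following sketch illustrates, under assumed interfaces, how these modules could be chained to turn the current voice into an interactive text and a response text; the class and method names (transcribe, parse, lookup, render) are hypothetical and are not prescribed by the disclosure.

    # Sketch of the module chain: voice recognition -> semantic understanding
    # -> dialogue management -> voice generation. All interfaces are assumed.
    from dataclasses import dataclass

    @dataclass
    class DialoguePipeline:
        recognizer: object        # voice recognition module: audio -> interactive text
        understanding: object     # semantic understanding module: text -> key content
        dialogue_manager: object  # dialogue management module: key content -> response content
        generator: object         # voice generation module: response content -> response text

        def respond(self, current_voice):
            interactive_text = self.recognizer.transcribe(current_voice)
            key_content = self.understanding.parse(interactive_text)      # e.g. ["today", "weather"]
            response_content = self.dialogue_manager.lookup(key_content)  # e.g. retrieved forecast
            response_text = self.generator.render(response_content)       # e.g. "The weather is sunny today."
            return interactive_text, response_text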
By processing the current voice, the response text matched with the current voice can be obtained, and intelligent voice interaction can be realized.
In this technical solution, optionally, the processing the interactive text and the response text by using a preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text includes:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
In this embodiment, the target text may be obtained by splicing the interactive text and the response text with a separator inserted between them. The separator may be a letter, a number, a special mark, or the like. For example, if the interactive text is "today's weather condition" and the response text is "today the weather is sunny", the target text may be these two texts joined by the separator.
In this scheme, the output of the preset style emotion matching model is a label matching degree: the first input and/or the second input is matched against the style/emotion label library one by one, and the label with the maximum matching degree is output, thereby obtaining the style label and/or the emotion label of the response text. For example, suppose the target text is "today's weather condition" spliced with "today the weather is sunny", the candidate style labels are sweet and harsh, and the candidate emotion labels are happiness and anger. The preset style emotion matching model scores the combinations (target text, sweet, happiness), (target text, sweet, anger), (target text, harsh, happiness) and (target text, harsh, anger); if (target text, sweet, happiness) has the maximum matching degree, the style label of the response text is sweet and the emotion label is happiness.
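A minimal sketch of this matching over all style/emotion combinations is given below; the score function stands in for the preset style emotion matching model, and the separator string "[SEP]" is an assumption, since the disclosure only requires that some separator be used.

    # Sketch: splice the texts, score every (style, emotion) combination,
    # and keep the pair with the maximum matching degree. score() stands in
    # for the preset style emotion matching model; "[SEP]" is an assumed separator.
    from itertools import product

    STYLE_LABELS = ["sweet", "natural", "harsh", "lively"]
    EMOTION_LABELS = ["happiness", "anger", "sadness", "joy"]

    def pick_style_and_emotion(score, interactive_text, response_text):
        target_text = interactive_text + "[SEP]" + response_text
        best_style, best_emotion = max(
            product(STYLE_LABELS, EMOTION_LABELS),
            key=lambda pair: score(target_text, pair[0], pair[1]),
        )
        return best_style, best_emotion  # labels with the maximum matching degree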
And performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, so that a response text with style tags and/or emotion tags can be obtained, the expressive force of response synthetic voice is improved, and the customer experience in intelligent voice interaction is improved.
In the technical solution, optionally, the preset style emotion matching model includes a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
In this scheme, a neural network with a classification function is added at the output of the Bert model. The Bert model and this classification network jointly process the first input and/or the second input; the output is the label matching degree, and the emotion label and the style label are determined by outputting the label with the maximum matching degree.
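The arrangement of a Bert encoder followed by a small classification network could be sketched as below. This is an assumed fine-tuning setup, not the patented model itself; the single sigmoid output is interpreted as the matching degree of one (target text, candidate label) input, and the checkpoint name is an assumption.

    # Sketch: Bert encoder plus a classification head whose single output
    # is read as the matching degree of one (target text, candidate label) input.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class StyleEmotionMatcher(nn.Module):
        def __init__(self, checkpoint="bert-base-chinese"):  # checkpoint name is an assumption
            super().__init__()
            self.bert = BertModel.from_pretrained(checkpoint)
            self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            # input_ids encode "target text [separator] candidate label"
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            pooled = outputs.pooler_output                 # sentence-level representation
            return torch.sigmoid(self.classifier(pooled))  # matching degree in (0, 1)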
The style and emotion matching model, built on the pre-trained Bert model, performs label prediction for the style and emotion of the synthesized speech. Because this matching model is obtained by fine-tuning the pre-trained Bert model and the labels are matched one by one, a good effect can be achieved with a small amount of data, which reduces the workload of data annotation.
In the technical scheme, optionally, the style labels include sweet, natural, harsh and lively; the emotion labels include happiness, anger, sadness and joy.
By determining the style label and/or emotion label of the response text, the expressive force of the response synthesized voice can be improved, and therefore the customer experience in intelligent voice interaction is improved.
In this technical solution, optionally, obtaining the response synthesized speech according to the response text with the style tag and/or the emotion tag includes:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
In this scheme, speech synthesis processing may be performed on the response text with the style label and/or the emotion label by speech synthesis rules such as a combination of DNN (Deep Neural Networks) and LSTM (Long Short-Term Memory), or a combination of an end-to-end model and a neural network vocoder, to obtain the response synthesized voice. Such speech synthesis rules may generate audio by means of unit concatenation. End-to-end speech synthesis may refer to converting text directly into speech, and involves a database, acoustic modeling and acoustic models.
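For illustration, the final synthesis step could look like the sketch below, where the style and emotion tags condition the acoustic model before a vocoder produces audio; tts_engine and its methods are hypothetical placeholders, since the disclosure allows either a DNN+LSTM engine or an end-to-end engine.

    # Sketch: the style/emotion tags condition the acoustic model, then a
    # vocoder converts the acoustic features into audio. tts_engine and its
    # methods are hypothetical placeholders, not a concrete engine.
    def synthesize_response(tts_engine, response_text, style_tag=None, emotion_tag=None):
        features = tts_engine.acoustic_features(
            text=response_text,
            style=style_tag,      # e.g. "sweet"
            emotion=emotion_tag,  # e.g. "happiness"
        )
        return tts_engine.vocode(features)  # digital audio of the response synthesized voice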
The voice synthesis module is used for carrying out voice synthesis processing on the response text with the style label and/or the emotion label, so that the expressive force of response synthetic voice can be improved, and the customer experience in intelligent voice interaction is improved.
S130, obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
The response synthesized voice may refer to a response voice interacting with a current voice, and may have a specific style and a specific emotion. For example, the response synthesized voice may be a response voice with a sweet-style tag and a favorite emotion tag. And obtaining the response synthetic voice matched with the style label and/or the emotion label according to the content of the response text and the style label and/or the emotion label.
According to the technical scheme provided by the embodiment of the application, the current voice input by a user is obtained; obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, and using the response synthesized voice to interact with the current voice. By executing the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.
Example two
Fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to a second embodiment of the present application, and as shown in fig. 2, the speech synthesis apparatus includes:
a current voice acquiring unit 210 configured to acquire a current voice input by a user;
a tag obtaining unit 220, configured to obtain a style tag and/or an emotion tag of the response text according to the current speech by using a preset style emotion matching model;
and a response synthesized speech obtaining unit 230, configured to obtain a response synthesized speech according to the response text with the style tag and/or the emotion tag, and configured to interact with the current speech.
In this embodiment, optionally, the tag obtaining unit 220 includes:
the text obtaining subunit is used for obtaining the interactive text and the response text of the current voice according to the current voice;
and the response text label obtaining subunit is used for processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
In this technical solution, optionally, the text obtaining subunit is specifically configured to:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
In this technical solution, optionally, the response text label obtaining subunit is specifically configured to:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
In the technical solution, optionally, the preset style emotion matching model includes a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
In the technical scheme, optionally, the style labels include sweet, natural, harsh and lively; the emotion labels include happiness, anger, sadness and joy.
In this technical solution, optionally, the response synthesized speech obtaining unit 230 is specifically configured to:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
The above apparatus can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method.
EXAMPLE III
Embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, perform a method of speech synthesis, the method comprising:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
Storage medium - any of various types of memory devices or storage devices. The term "storage medium" is intended to include: an installation medium such as a CD-ROM, a floppy disk, or a tape device; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, and the like; non-volatile memory such as flash memory or magnetic storage media (e.g., a hard disk or optical storage); registers or other similar types of memory elements, and the like. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or may be located in a different second computer system connected to the computer system through a network (such as the Internet). The second computer system may provide the program instructions to the computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the speech synthesis operation described above, and may also perform related operations in the speech synthesis method provided in any embodiments of the present application.
Example four
The embodiment of the application provides electronic equipment, and the voice synthesis device provided by the embodiment of the application can be integrated in the electronic equipment. Fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in fig. 3, the present embodiment provides an electronic device 300, which includes: one or more processors 320; a storage device 310, configured to store one or more programs, which when executed by the one or more processors 320, cause the one or more processors 320 to implement the speech synthesis method provided in the embodiment of the present application, where the method includes:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
Of course, those skilled in the art will understand that the processor 320 also implements the technical solution of the speech synthesis method provided in any embodiment of the present application.
The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 320, a storage device 310, an input device 330, and an output device 340; the number of the processors 320 in the electronic device may be one or more, and one processor 320 is taken as an example in fig. 3; the processor 320, the storage device 310, the input device 330, and the output device 340 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 350 in fig. 3.
The storage device 310 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and module units, such as program instructions corresponding to the speech synthesis method in the embodiment of the present application.
The storage device 310 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage device 310 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 310 may further include memory located remotely from processor 320, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display screen, a speaker, and other electronic devices.
The electronic equipment provided by the embodiment of the application can adjust the style label and the emotion label of the response synthesized voice in real time according to the input voice of the client, and improve the expressive force of the response synthesized voice, so that the client experience in intelligent voice interaction is improved.
The speech synthesis device, the storage medium and the electronic device provided in the above embodiments may execute the speech synthesis method provided in any embodiment of the present application, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to the speech synthesis method provided in any of the embodiments of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
2. The method of claim 1, wherein obtaining style tags and/or emotion tags of answer text using a preset style emotion matching model according to the current speech comprises:
obtaining an interactive text and a response text of the current voice according to the current voice;
and processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
3. The method of claim 2, wherein obtaining the interactive text and the response text of the current speech according to the current speech comprises:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
4. The method of claim 2, wherein processing the interactive text and the response text by using a preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text comprises:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
5. The method of claim 1, wherein the pre-defined style emotion matching model comprises a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
6. The method of claim 1, wherein the style label includes sweet, natural, harsh, and lively; the emotion labels include happiness, anger, sadness and joy.
7. The method of claim 1, wherein obtaining the response synthesized speech from the response text with the style tag and/or the emotion tag comprises:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
8. A speech synthesis apparatus, comprising:
the current voice acquiring unit is used for acquiring current voice input by a user;
the tag obtaining unit is used for obtaining a style tag and/or an emotion tag of the response text by using a preset style emotion matching model according to the current voice;
and the response synthetic voice obtaining unit is used for obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and the response synthetic voice is used for interacting with the current voice.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speech synthesis method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1-7 when executing the computer program.
CN202110347739.XA 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment Pending CN113066473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110347739.XA CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110347739.XA CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113066473A (en) 2021-07-02

Family

ID=76564841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110347739.XA Pending CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113066473A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160027452A1 (en) * 2014-07-28 2016-01-28 Sony Computer Entertainment Inc. Emotional speech processing
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109410913A (en) * 2018-12-13 2019-03-01 百度在线网络技术(北京)有限公司 A kind of phoneme synthesizing method, device, equipment and storage medium
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
US20200118544A1 (en) * 2019-07-17 2020-04-16 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device
CN110908631A (en) * 2019-11-22 2020-03-24 深圳传音控股股份有限公司 Emotion interaction method, device, equipment and computer readable storage medium
CN111696535A (en) * 2020-05-22 2020-09-22 百度在线网络技术(北京)有限公司 Information verification method, device, equipment and computer storage medium based on voice interaction
CN111916082A (en) * 2020-08-14 2020-11-10 腾讯科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN112099628A (en) * 2020-09-08 2020-12-18 平安科技(深圳)有限公司 VR interaction method and device based on artificial intelligence, computer equipment and medium
CN112270168A (en) * 2020-10-14 2021-01-26 北京百度网讯科技有限公司 Dialogue emotion style prediction method and device, electronic equipment and storage medium
CN112382287A (en) * 2020-11-11 2021-02-19 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱频频 (Zhu Pinpin): "智能客户服务技术与应用" (Intelligent Customer Service Technology and Applications), 31 January 2019, 中国铁道出版社有限公司 (China Railway Publishing House Co., Ltd.), pages 62-63 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115460166A (en) * 2022-09-06 2022-12-09 网易(杭州)网络有限公司 Instant voice communication method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107657017B (en) Method and apparatus for providing voice service
Huang et al. Pretraining techniques for sequence-to-sequence voice conversion
US9412359B2 (en) System and method for cloud-based text-to-speech web services
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN114830139A (en) Training models using model-provided candidate actions
Mamyrbayev et al. End-to-end speech recognition in agglutinative languages
CN110070859A (en) A kind of audio recognition method and device
US7069513B2 (en) System, method and computer program product for a transcription graphical user interface
Kumar et al. AutoSSR: an efficient approach for automatic spontaneous speech recognition model for the Punjabi Language
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN112233648A (en) Data processing method, device, equipment and storage medium combining RPA and AI
Honggai et al. Linguistic multidimensional perspective data simulation based on speech recognition technology and big data
Breen et al. Voice in the user interface
KR20190106011A (en) Dialogue system and dialogue method, computer program for executing the method
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
US20230410787A1 (en) Speech processing system with encoder-decoder model and corresponding methods for synthesizing speech containing desired speaker identity and emotional style
Deng Research on Online English Speech Interactive Recognition System Based on Nose Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination