CN113066473A - Voice synthesis method and device, storage medium and electronic equipment - Google Patents

Voice synthesis method and device, storage medium and electronic equipment

Info

Publication number
CN113066473A
Authority
CN
China
Prior art keywords
voice
emotion
style
text
response
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110347739.XA
Other languages
Chinese (zh)
Inventor
杨辰雨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202110347739.XA
Publication of CN113066473A
Legal status: Pending (current)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Abstract

The embodiment of the application discloses a voice synthesis method and device, a storage medium and electronic equipment. The method comprises the following steps: acquiring the current voice input by a user; obtaining a style label and/or an emotion label of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, the response synthesized voice being used to interact with the current voice. According to the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.

Description

Voice synthesis method and device, storage medium and electronic equipment
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to a voice synthesis method, a voice synthesis device, a storage medium and electronic equipment.
Background
With the rapid development of artificial intelligence technology, intelligent voice interaction is widely applied to the fields of finance, logistics, customer service and the like, and the service level of enterprise customer service is improved through functions of intelligent marketing, intelligent collection urging, content navigation and the like.
In current intelligent voice interaction, the common practice is to designate a general-purpose sound library for model training and voice synthesis.
This approach meets the basic requirements of intelligent voice interaction, but the style is monotonous, there is no emotional expression, and the customer experience is poor.
Disclosure of Invention
The embodiment of the application provides a voice synthesis method, a voice synthesis device, a storage medium and electronic equipment, which can adjust style labels and emotion labels of response synthesized voice in real time according to input voice of a client, and improve expressive force of the response synthesized voice, so that client experience in intelligent voice interaction is improved.
In a first aspect, an embodiment of the present application provides a speech synthesis method, where the method includes:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
In a second aspect, an embodiment of the present application provides a speech synthesis apparatus, including:
the current voice acquiring unit is used for acquiring current voice input by a user;
the tag obtaining unit is used for obtaining a style tag and/or an emotion tag of the response text by using a preset style emotion matching model according to the current voice;
and the response synthetic voice obtaining unit is used for obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and the response synthetic voice is used for interacting with the current voice.
In a third aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements a speech synthesis method according to an embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable by the processor, and the processor executes the computer program to implement the speech synthesis method according to the embodiment of the present application.
According to the technical scheme provided by the embodiment of the application, the current voice input by a user is obtained; obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, and using the response synthesized voice to interact with the current voice. According to the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.
Drawings
FIG. 1 is a flow chart of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to a second embodiment of the present application;
fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be further noted that, for the convenience of description, only some of the structures related to the present application are shown in the drawings, not all of the structures.
Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the steps as a sequential process, many of the steps can be performed in parallel, concurrently or simultaneously. In addition, the order of the steps may be rearranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.
Example one
Fig. 1 is a flowchart of a speech synthesis method provided in an embodiment of the present application, where the present embodiment is applicable to a case of intelligent speech interaction, and the method may be executed by a speech synthesis apparatus provided in an embodiment of the present application, where the speech synthesis apparatus may be implemented by software and/or hardware, and may be integrated in a device such as an intelligent terminal for speech synthesis.
As shown in fig. 1, the speech synthesis method includes:
and S110, acquiring the current voice input by the user.
In this scheme, speech synthesis is the process of converting text into speech and outputting it. It mainly involves decomposing the input text into phonemes according to pronunciation, handling special symbols, and converting the phoneme sequence into digital audio through an acoustic model and a vocoder. The response synthesized voice can be obtained through speech synthesis, thereby realizing intelligent voice interaction.
The current voice may refer to the voice input by the user at the current moment. For example, the current voice may ask about today's weather conditions or tomorrow's temperature. The current voice input by the user may be obtained by using a speech recognition rule; the specific speech recognition rule is not limited in this embodiment.
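For illustration only, the following minimal Python sketch outlines the synthesis pipeline described above (text decomposed into phonemes, then converted to audio through an acoustic model and a vocoder). The callables g2p, acoustic_model and vocoder are hypothetical placeholders supplied by the caller and are not components named by this disclosure.

    # Minimal sketch of the speech-synthesis pipeline described above.
    # g2p, acoustic_model and vocoder are hypothetical caller-supplied
    # callables, not components of this disclosure.
    def synthesize(text, g2p, acoustic_model, vocoder):
        phonemes = g2p(text)                 # decompose text into phonemes, handle special symbols
        features = acoustic_model(phonemes)  # phoneme sequence -> acoustic features
        waveform = vocoder(features)         # acoustic features -> digital audio
        return waveform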
S120, obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice.
In this embodiment, the response text may be the text used for voice interaction with the current voice. For example, if the current voice asks about today's weather condition, the response text may be that today's weather is sunny.
The style label may refer to the different language material and manner of expression adopted during voice interaction according to the social occasion, purpose and task, and to the identity and qualities of the communicating parties. For example, the style label may be sweet or harsh. Sweet may mean that the voice is soft and pleasant; harsh may mean that the voice is stern.
In this scheme, the emotion label may refer to the emotional color of the language, that is, the feeling it conveys through clear imagery that appeals to the listener's aesthetic sense and imagination. For example, the emotion label may be happiness or anger.
The emotion label and/or the style label is determined by outputting the label with the maximum matching degree. The preset style emotion matching model may be a neural network model or a Bert model; preferably, it is a Bert model. The Bert model converts each word in the text into a one-dimensional vector by looking it up in a word-vector table and takes these vectors as the model input; the model output is, for each input word, a vector representation that incorporates the semantic information of the full text.
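As a hedged illustration of how a Bert model produces such vector representations, the sketch below uses the open-source Hugging Face transformers library; the checkpoint name bert-base-chinese is an assumption, and the disclosure does not prescribe a particular implementation.

    # Sketch: obtaining per-token contextual vectors from a Bert model.
    # The checkpoint name is an assumption; any Bert checkpoint could be used.
    import torch
    from transformers import BertTokenizer, BertModel

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    text = "今天天气晴朗"  # example response text: "the weather is sunny today"
    inputs = tokenizer(text, return_tensors="pt")  # ids looked up in the vocabulary table
    with torch.no_grad():
        outputs = model(**inputs)

    # Each token is now a vector that reflects the semantic information of the full text.
    token_vectors = outputs.last_hidden_state  # shape: (1, sequence_length, hidden_size)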
In this embodiment, after the current voice is obtained, the current voice is converted into an interactive text, the interactive text is processed to obtain a response text, and the interactive text and the response text are processed by using the preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text.
In this technical solution, optionally, obtaining a style tag and/or an emotion tag of a response text by using a preset style emotion matching model according to the current speech includes:
obtaining an interactive text and a response text of the current voice according to the current voice;
and processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
Wherein the interactive text may refer to a text form of the current speech.
In this scheme, the current voice is converted to obtain the interactive text and analyzed to obtain the response text. The interactive text and the response text are then taken as input and processed by the preset style emotion matching model, which outputs emotion labels and/or style labels; the emotion label and/or style label with the highest matching degree is taken as the style label and/or emotion label of the response text.
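A minimal sketch of this selection step is given below; match_model is a caller-supplied stand-in for the preset style emotion matching model and is assumed to return a matching degree for a single candidate label.

    # Sketch: keep the candidate label with the maximum matching degree.
    # match_model stands in for the preset style emotion matching model;
    # it returns a matching degree for one candidate label.
    def best_label(match_model, interactive_text, response_text, candidate_labels):
        scores = {
            label: match_model(interactive_text, response_text, label)
            for label in candidate_labels
        }
        return max(scores, key=scores.get)  # label with the maximum matching degree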
The interactive text and the response text are processed by utilizing the preset style emotion matching model, so that style labels and/or emotion labels of the response text can be obtained, and the customer experience in intelligent voice interaction can be improved.
In this technical solution, optionally, obtaining the interactive text and the response text of the current speech according to the current speech includes:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
In this scheme, the voice recognition module is used to recognize the voice and convert it into text form; the semantic understanding module is used to analyze the content of the text to obtain its key content. The semantic understanding process involves dialogue management, error correction, content management and context information. The dialogue management module is used to search within the corresponding domain according to the key content of the text to obtain the response content corresponding to that key content; the voice generation module is used to process the response content to obtain the response text. For example, the current voice asks about today's weather condition; the voice recognition module converts the current voice into the interactive text; the semantic understanding module analyzes the content of the interactive text, and the key content may be "today" and "weather"; the dialogue management module searches in the weather domain to obtain today's weather; and the voice generation module generates the response text from the retrieved content.
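The following sketch illustrates, under assumed interfaces, how these modules could be chained to turn the current voice into an interactive text and a response text; the class and method names (transcribe, parse, lookup, render) are hypothetical and are not prescribed by the disclosure.

    # Sketch of the module chain: voice recognition -> semantic understanding
    # -> dialogue management -> voice generation. All interfaces are assumed.
    from dataclasses import dataclass

    @dataclass
    class DialoguePipeline:
        recognizer: object        # voice recognition module: audio -> interactive text
        understanding: object     # semantic understanding module: text -> key content
        dialogue_manager: object  # dialogue management module: key content -> response content
        generator: object         # voice generation module: response content -> response text

        def respond(self, current_voice):
            interactive_text = self.recognizer.transcribe(current_voice)
            key_content = self.understanding.parse(interactive_text)      # e.g. ["today", "weather"]
            response_content = self.dialogue_manager.lookup(key_content)  # e.g. retrieved forecast
            response_text = self.generator.render(response_content)       # e.g. "The weather is sunny today."
            return interactive_text, response_text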
By processing the current voice, the response text matched with the current voice can be obtained, and intelligent voice interaction can be realized.
In this technical solution, optionally, the processing the interactive text and the response text by using a preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text includes:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
In this embodiment, the target text may be obtained by splicing the interactive text and the response text with a separator inserted between them. The separator may be a letter, a number, a special mark, or the like. For example, if the interactive text is "today's weather condition" and the response text is "today the weather is sunny", the target text may be these two texts joined by the separator.
In this scheme, the output of the preset style emotion matching model is a label matching degree: the first input and/or the second input is matched against the style/emotion label library one by one, and the label with the maximum matching degree is output, thereby obtaining the style label and/or the emotion label of the response text. For example, suppose the target text is "today's weather condition" spliced with "today the weather is sunny", the candidate style labels are sweet and harsh, and the candidate emotion labels are happiness and anger. The preset style emotion matching model scores the combinations (target text, sweet, happiness), (target text, sweet, anger), (target text, harsh, happiness) and (target text, harsh, anger); if (target text, sweet, happiness) has the maximum matching degree, the style label of the response text is sweet and the emotion label is happiness.
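A minimal sketch of this matching over all style/emotion combinations is given below; the score function stands in for the preset style emotion matching model, and the separator string "[SEP]" is an assumption, since the disclosure only requires that some separator be used.

    # Sketch: splice the texts, score every (style, emotion) combination,
    # and keep the pair with the maximum matching degree. score() stands in
    # for the preset style emotion matching model; "[SEP]" is an assumed separator.
    from itertools import product

    STYLE_LABELS = ["sweet", "natural", "harsh", "lively"]
    EMOTION_LABELS = ["happiness", "anger", "sadness", "joy"]

    def pick_style_and_emotion(score, interactive_text, response_text):
        target_text = interactive_text + "[SEP]" + response_text
        best_style, best_emotion = max(
            product(STYLE_LABELS, EMOTION_LABELS),
            key=lambda pair: score(target_text, pair[0], pair[1]),
        )
        return best_style, best_emotion  # labels with the maximum matching degree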
And performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, so that a response text with style tags and/or emotion tags can be obtained, the expressive force of response synthetic voice is improved, and the customer experience in intelligent voice interaction is improved.
In the technical solution, optionally, the preset style emotion matching model includes a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
In this scheme, a neural network with a classification function is added at the output of the Bert model. The Bert model and this classification network jointly process the first input and/or the second input; the output is the label matching degree, and the emotion label and the style label are determined by outputting the label with the maximum matching degree.
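The arrangement of a Bert encoder followed by a small classification network could be sketched as below. This is an assumed fine-tuning setup, not the patented model itself; the single sigmoid output is interpreted as the matching degree of one (target text, candidate label) input, and the checkpoint name is an assumption.

    # Sketch: Bert encoder plus a classification head whose single output
    # is read as the matching degree of one (target text, candidate label) input.
    import torch
    import torch.nn as nn
    from transformers import BertModel

    class StyleEmotionMatcher(nn.Module):
        def __init__(self, checkpoint="bert-base-chinese"):  # checkpoint name is an assumption
            super().__init__()
            self.bert = BertModel.from_pretrained(checkpoint)
            self.classifier = nn.Linear(self.bert.config.hidden_size, 1)

        def forward(self, input_ids, attention_mask):
            # input_ids encode "target text [separator] candidate label"
            outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
            pooled = outputs.pooler_output                 # sentence-level representation
            return torch.sigmoid(self.classifier(pooled))  # matching degree in (0, 1)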
The style and emotion matching model, built on the pre-trained Bert model, performs label prediction for the style and emotion of the synthesized speech. Because this matching model is obtained by fine-tuning the pre-trained Bert model and the labels are matched one by one, a good effect can be achieved with a small amount of data, which reduces the workload of data annotation.
In the technical scheme, optionally, the style labels include sweet, natural, harsh and lively; the emotion labels include happiness, anger, sadness and joy.
By determining the style label and/or emotion label of the response text, the expressive force of the response synthesized voice can be improved, and therefore the customer experience in intelligent voice interaction is improved.
In this technical solution, optionally, obtaining the response synthesized speech according to the response text with the style tag and/or the emotion tag includes:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
In this scheme, speech synthesis processing may be performed on the response text with the style label and/or the emotion label by speech synthesis rules such as a combination of DNN (Deep Neural Networks) and LSTM (Long Short-Term Memory), or a combination of an end-to-end model and a neural network vocoder, to obtain the response synthesized voice. Such speech synthesis rules may generate audio by means of unit concatenation. End-to-end speech synthesis may refer to converting text directly into speech, and involves a database, acoustic modeling and acoustic models.
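For illustration, the final synthesis step could look like the sketch below, where the style and emotion tags condition the acoustic model before a vocoder produces audio; tts_engine and its methods are hypothetical placeholders, since the disclosure allows either a DNN+LSTM engine or an end-to-end engine.

    # Sketch: the style/emotion tags condition the acoustic model, then a
    # vocoder converts the acoustic features into audio. tts_engine and its
    # methods are hypothetical placeholders, not a concrete engine.
    def synthesize_response(tts_engine, response_text, style_tag=None, emotion_tag=None):
        features = tts_engine.acoustic_features(
            text=response_text,
            style=style_tag,      # e.g. "sweet"
            emotion=emotion_tag,  # e.g. "happiness"
        )
        return tts_engine.vocode(features)  # digital audio of the response synthesized voice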
The voice synthesis module is used for carrying out voice synthesis processing on the response text with the style label and/or the emotion label, so that the expressive force of response synthetic voice can be improved, and the customer experience in intelligent voice interaction is improved.
S130, obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
The response synthesized voice may refer to a response voice interacting with a current voice, and may have a specific style and a specific emotion. For example, the response synthesized voice may be a response voice with a sweet-style tag and a favorite emotion tag. And obtaining the response synthetic voice matched with the style label and/or the emotion label according to the content of the response text and the style label and/or the emotion label.
According to the technical scheme provided by the embodiment of the application, the current voice input by a user is obtained; obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice; and obtaining response synthesized voice according to the response text with the style label and/or the emotion label, and using the response synthesized voice to interact with the current voice. By executing the technical scheme, the style tag and the emotion tag of the response synthesized voice can be adjusted in real time according to the input voice of the client, and the expressive force of the response synthesized voice is improved, so that the client experience in intelligent voice interaction is improved.
Example two
Fig. 2 is a schematic structural diagram of a speech synthesis apparatus according to a second embodiment of the present application, and as shown in fig. 2, the speech synthesis apparatus includes:
a current voice acquiring unit 210 configured to acquire a current voice input by a user;
a tag obtaining unit 220, configured to obtain a style tag and/or an emotion tag of the response text according to the current speech by using a preset style emotion matching model;
and a response synthesized speech obtaining unit 230, configured to obtain a response synthesized speech according to the response text with the style tag and/or the emotion tag, and configured to interact with the current speech.
In this embodiment, optionally, the tag obtaining unit 220 includes:
the text obtaining subunit is used for obtaining the interactive text and the response text of the current voice according to the current voice;
and the response text label obtaining subunit is used for processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
In this technical solution, optionally, the text obtaining subunit is specifically configured to:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
In this technical solution, optionally, the response text label obtaining subunit is specifically configured to:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
In the technical solution, optionally, the preset style emotion matching model includes a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
In the technical scheme, optionally, the style labels include sweet, natural, harsh and lively; the emotion labels include happiness, anger, sadness and joy.
In this technical solution, optionally, the response synthesized speech obtaining unit 230 is specifically configured to:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
The above apparatus can execute the method provided by the embodiments of the present application, and has the corresponding functional modules and beneficial effects for executing the method.
EXAMPLE III
Embodiments of the present application also provide a storage medium containing computer-executable instructions that, when executed by a computer processor, perform a method of speech synthesis, the method comprising:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
Storage medium - any of various types of memory devices or storage devices. The term "storage medium" is intended to include: an installation medium such as a CD-ROM, a floppy disk, or a tape device; computer system memory or random access memory such as DRAM, DDR RAM, SRAM, EDO RAM, Rambus RAM, and the like; non-volatile memory such as flash memory or magnetic storage media (e.g., a hard disk or optical storage); registers or other similar types of memory elements, and the like. The storage medium may also include other types of memory or combinations thereof. In addition, the storage medium may be located in the computer system in which the program is executed, or may be located in a different second computer system connected to the computer system through a network (such as the Internet). The second computer system may provide the program instructions to the computer for execution. The term "storage medium" may include two or more storage media that may reside in different locations, such as in different computer systems connected by a network. The storage medium may store program instructions (e.g., embodied as a computer program) that are executable by one or more processors.
Of course, the storage medium provided in the embodiments of the present application contains computer-executable instructions, and the computer-executable instructions are not limited to the speech synthesis operation described above, and may also perform related operations in the speech synthesis method provided in any embodiments of the present application.
Example four
The embodiment of the application provides electronic equipment, and the voice synthesis device provided by the embodiment of the application can be integrated in the electronic equipment. Fig. 3 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application. As shown in fig. 3, the present embodiment provides an electronic device 300, which includes: one or more processors 320; a storage device 310, configured to store one or more programs, which when executed by the one or more processors 320, cause the one or more processors 320 to implement the speech synthesis method provided in the embodiment of the present application, where the method includes:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
Of course, those skilled in the art will understand that the processor 320 also implements the technical solution of the speech synthesis method provided in any embodiment of the present application.
The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 3, the electronic device 300 includes a processor 320, a storage device 310, an input device 330, and an output device 340; the number of the processors 320 in the electronic device may be one or more, and one processor 320 is taken as an example in fig. 3; the processor 320, the storage device 310, the input device 330, and the output device 340 in the electronic apparatus may be connected by a bus or other means, and are exemplified by a bus 350 in fig. 3.
The storage device 310 is a computer-readable storage medium, and can be used for storing software programs, computer-executable programs, and module units, such as program instructions corresponding to the speech synthesis method in the embodiment of the present application.
The storage device 310 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the storage device 310 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, storage 310 may further include memory located remotely from processor 320, which may be connected via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 330 may be used to receive input numbers, character information, or voice information, and to generate key signal inputs related to user settings and function control of the electronic apparatus. The output device 340 may include a display screen, a speaker, and other electronic devices.
The electronic equipment provided by the embodiment of the application can adjust the style label and the emotion label of the response synthesized voice in real time according to the input voice of the client, and improve the expressive force of the response synthesized voice, so that the client experience in intelligent voice interaction is improved.
The speech synthesis device, the storage medium and the electronic device provided in the above embodiments may execute the speech synthesis method provided in any embodiment of the present application, and have corresponding functional modules and beneficial effects for executing the method. For technical details that are not described in detail in the above embodiments, reference may be made to the speech synthesis method provided in any of the embodiments of the present application.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present application and the technical principles employed. It will be understood by those skilled in the art that the present application is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the application. Therefore, although the present application has been described in more detail with reference to the above embodiments, the present application is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present application, and the scope of the present application is determined by the scope of the appended claims.

Claims (10)

1. A method of speech synthesis, comprising:
acquiring current voice input by a user;
obtaining style labels and/or emotion labels of the response text by using a preset style emotion matching model according to the current voice;
and obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and using the response synthetic voice to interact with the current voice.
2. The method of claim 1, wherein obtaining style tags and/or emotion tags of answer text using a preset style emotion matching model according to the current speech comprises:
obtaining an interactive text and a response text of the current voice according to the current voice;
and processing the interactive text and the response text by using a preset style emotion matching model to obtain a style label and/or an emotion label of the response text.
3. The method of claim 2, wherein obtaining the interactive text and the response text of the current speech according to the current speech comprises:
converting the current voice into an interactive text through a voice recognition module;
and processing the interactive text by utilizing a semantic understanding module, a dialogue management module and a voice generation module to generate a response text.
4. The method of claim 2, wherein processing the interactive text and the response text by using a preset style emotion matching model to obtain a style tag and/or an emotion tag of the response text comprises:
splicing the interactive text and the response text to obtain a target text;
and taking the target text and the style label as a first input, and/or taking the target text and the emotion label as a second input, performing style emotion matching on the first input and/or the second input by using a preset style emotion matching model, and taking the style label and/or the emotion label with the maximum matching degree as a style label and/or an emotion label of the response text.
5. The method of claim 1, wherein the pre-defined style emotion matching model comprises a Bert model; and the Bert model is used for outputting the style label and/or the emotion label of the response text for matching.
6. The method of claim 1, wherein the style label includes sweet, natural, harsh, and lively; the emotion labels include happiness, anger, sadness and joy.
7. The method of claim 1, wherein obtaining the response synthesized speech from the response text with the style tag and/or the emotion tag comprises:
and performing voice synthesis processing on the response text with the style label and/or the emotion label by using a voice synthesis module to obtain response synthesis voice.
8. A speech synthesis apparatus, comprising:
the current voice acquiring unit is used for acquiring current voice input by a user;
the tag obtaining unit is used for obtaining a style tag and/or an emotion tag of the response text by using a preset style emotion matching model according to the current voice;
and the response synthetic voice obtaining unit is used for obtaining response synthetic voice according to the response text with the style label and/or the emotion label, and the response synthetic voice is used for interacting with the current voice.
9. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the speech synthesis method according to any one of claims 1 to 7.
10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the speech synthesis method according to any one of claims 1-7 when executing the computer program.
CN202110347739.XA 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment Pending CN113066473A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110347739.XA CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110347739.XA CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113066473A (en) 2021-07-02

Family

ID=76564841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110347739.XA Pending CN113066473A (en) 2021-03-31 2021-03-31 Voice synthesis method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113066473A (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160027452A1 (en) * 2014-07-28 2016-01-28 Sony Computer Entertainment Inc. Emotional speech processing
CN108962217A (en) * 2018-07-28 2018-12-07 华为技术有限公司 Phoneme synthesizing method and relevant device
CN109410913A (en) * 2018-12-13 2019-03-01 百度在线网络技术(北京)有限公司 A kind of phoneme synthesizing method, device, equipment and storage medium
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
US20200118544A1 (en) * 2019-07-17 2020-04-16 Lg Electronics Inc. Intelligent voice recognizing method, apparatus, and intelligent computing device
CN110908631A (en) * 2019-11-22 2020-03-24 深圳传音控股股份有限公司 Emotion interaction method, device, equipment and computer readable storage medium
CN111696535A (en) * 2020-05-22 2020-09-22 百度在线网络技术(北京)有限公司 Information verification method, device, equipment and computer storage medium based on voice interaction
CN111916082A (en) * 2020-08-14 2020-11-10 腾讯科技(深圳)有限公司 Voice interaction method and device, computer equipment and storage medium
CN112099628A (en) * 2020-09-08 2020-12-18 平安科技(深圳)有限公司 VR interaction method and device based on artificial intelligence, computer equipment and medium
CN112270168A (en) * 2020-10-14 2021-01-26 北京百度网讯科技有限公司 Dialogue emotion style prediction method and device, electronic equipment and storage medium
CN112382287A (en) * 2020-11-11 2021-02-19 北京百度网讯科技有限公司 Voice interaction method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
朱频频 (Zhu Pinpin): "智能客户服务技术与应用" (Intelligent Customer Service Technology and Applications), 31 January 2019, 中国铁道出版社有限公司 (China Railway Publishing House Co., Ltd.), pages 62-63 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115460166A (en) * 2022-09-06 2022-12-09 网易(杭州)网络有限公司 Instant voice communication method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN107657017B (en) Method and apparatus for providing voice service
Huang et al. Pretraining techniques for sequence-to-sequence voice conversion
US9412359B2 (en) System and method for cloud-based text-to-speech web services
US11961515B2 (en) Contrastive Siamese network for semi-supervised speech recognition
CN114830139A (en) Training models using model-provided candidate actions
Mamyrbayev et al. End-to-end speech recognition in agglutinative languages
CN110070859A (en) A kind of audio recognition method and device
US7069513B2 (en) System, method and computer program product for a transcription graphical user interface
Kumar et al. AutoSSR: an efficient approach for automatic spontaneous speech recognition model for the Punjabi Language
CN110852075B (en) Voice transcription method and device capable of automatically adding punctuation marks and readable storage medium
CN110647613A (en) Courseware construction method, courseware construction device, courseware construction server and storage medium
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
CN113066473A (en) Voice synthesis method and device, storage medium and electronic equipment
CN113314096A (en) Speech synthesis method, apparatus, device and storage medium
CN112397053B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN115050351A (en) Method and device for generating timestamp and computer equipment
CN112233648A (en) Data processing method, device, equipment and storage medium combining RPA and AI
Honggai et al. Linguistic multidimensional perspective data simulation based on speech recognition technology and big data
Breen et al. Voice in the user interface
KR20190106011A (en) Dialogue system and dialogue method, computer program for executing the method
Sartiukova et al. Remote Voice Control of Computer Based on Convolutional Neural Network
US20230410787A1 (en) Speech processing system with encoder-decoder model and corresponding methods for synthesizing speech containing desired speaker identity and emotional style
Deng Research on Online English Speech Interactive Recognition System Based on Nose Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination