CN113593521B - Speech synthesis method, device, equipment and readable storage medium

Info

Publication number: CN113593521B
Application number: CN202110863647.7A
Authority: CN (China)
Prior art keywords: data, emotion, information, session, session data
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN113593521A (application publication)
Inventor: 谢慧智
Current assignee: Beijing Sankuai Online Technology Co Ltd
Original assignee: Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN202110863647.7A
Application granted; publication of CN113593521B

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027: Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Abstract

The application discloses a speech synthesis method, device, equipment and readable storage medium, and belongs to the technical field of artificial intelligence. The method comprises the following steps: acquiring session data of a first object; determining emotion information of a second object according to the session data of the first object; searching a text database for text data matching the session data of the first object; and synthesizing voice data of the second object according to the text data and the emotion information of the second object, and replying to the session data of the first object based on the voice data of the second object. Because the voice data of the second object carries the emotion information of the second object, the expressiveness of the voice data is enhanced and the service quality of the intelligent robot is improved.

Description

Speech synthesis method, device, equipment and readable storage medium
Technical Field
The embodiments of the application relate to the technical field of artificial intelligence, and in particular to a speech synthesis method, device, equipment and readable storage medium.
Background
With the rapid development of artificial intelligence technology, more and more industries provide real-time, automatic and convenient question answering service for users through intelligent robots.
In the related art, when providing a question answering service for a user, an intelligent robot acquires the dialogue data of a first object (the user), queries a text library for text data matching that dialogue data, converts the queried text data into voice data, and uses the voice data as the reply to the dialogue data of the first object, that is, as the dialogue data of a second object (the intelligent robot). Because the text data is merely converted into voice data, the expressiveness of the dialogue data of the second object is poor, which reduces the service quality of the intelligent robot.
Disclosure of Invention
The embodiment of the application provides a speech synthesis method, a speech synthesis device, speech synthesis equipment and a readable storage medium, which can be used for solving the problems in the related art.
In one aspect, an embodiment of the present application provides a speech synthesis method, where the method includes:
acquiring session data of a first object;
determining emotion information of a second object according to the session data of the first object;
searching text data matched with the session data of the first object from a text database;
and synthesizing voice data of the second object according to the text data and the emotion information of the second object, and replying to the session data of the first object based on the voice data of the second object.
In one possible implementation, the determining emotion information of a second object according to the session data of the first object includes:
acquiring emotion information of the first object according to the session data of the first object;
generating emotional information of the second object according to the emotional information of the first object.
In one possible implementation manner, the obtaining emotion information of the first object according to the session data of the first object includes:
acquiring context data of session data of the first object;
obtaining emotion information of the first object according to the session data and the context data of the first object.
In one possible implementation manner, the obtaining emotion information of the first object according to the session data of the first object includes:
inputting the session data of the first object into an emotion classification model, and outputting emotion information of the first object by the emotion classification model.
In a possible implementation manner, before inputting the session data of the first object into the emotion classification model, the method further includes:
obtaining a plurality of first sample object data, the first sample object data comprising first session data with emotion tags, the first session data being session data of a first sample object;
determining emotion information corresponding to each first session data according to each first session data;
determining emotion classification results corresponding to the first session data according to emotion information corresponding to the first session data;
and obtaining the emotion classification model according to the emotion classification result and the emotion label corresponding to each piece of first session data.
In one possible implementation, the generating emotion information of the second object according to emotion information of the first object includes:
the emotion information of the first object is input to an emotion generation model, and the emotion information of the second object is output by the emotion generation model.
In a possible implementation manner, before the inputting the emotion information of the first object to the emotion generation model, the method further includes:
acquiring emotion labels corresponding to a plurality of second session data, wherein the second session data are session data of a second sample object corresponding to the first session data;
generating emotion information corresponding to each second session data according to the emotion information corresponding to each first session data;
determining emotion classification results corresponding to the second session data according to emotion information corresponding to the second session data;
and acquiring the emotion generation model according to the emotion classification result and the emotion label corresponding to each second session data.
In one possible implementation, the synthesizing of the speech data of the second object according to the text data and the emotion information of the second object includes:
splicing the text data and the emotion information of the second object to obtain first information;
generating first frequency spectrum information of the second object according to the first information;
and generating voice data of the second object according to the first frequency spectrum information of the second object.
In one possible implementation, the synthesizing of the speech data of the second object according to the text data and the emotion information of the second object includes:
generating second frequency spectrum information of the second object according to the text data;
splicing the second frequency spectrum information of the second object with the emotion information of the second object to obtain second information;
and generating voice data of the second object according to the second information.
In one possible implementation, the synthesizing of the speech data of the second object according to the text data and the emotion information of the second object includes:
splicing the text data and the emotion information of the second object to obtain first information;
generating first frequency spectrum information of the second object according to the first information;
splicing the first frequency spectrum information of the second object with the emotion information of the second object to obtain third information;
and generating voice data of the second object according to the third information.
In another aspect, an embodiment of the present application provides a speech synthesis apparatus, where the apparatus includes:
the acquisition module is used for acquiring session data of a first object;
a determining module for determining emotion information of a second object according to the session data of the first object;
the searching module is used for searching text data matched with the session data of the first object from a text database;
and the synthesis module is used for synthesizing the voice data of the second object according to the text data and the emotion information of the second object, and replying to the session data of the first object based on the voice data of the second object.
In a possible implementation manner, the determining module is configured to obtain emotion information of the first object according to session data of the first object; generating emotional information of the second object according to the emotional information of the first object.
In a possible implementation manner, the determining module is configured to obtain context data of session data of the first object; obtaining emotion information of the first object according to the session data and the context data of the first object.
In a possible implementation manner, the determining module is configured to input the session data of the first object to an emotion classification model, and output emotion information of the first object by the emotion classification model.
In a possible implementation manner, the obtaining module is further configured to obtain a plurality of first sample object data, where the first sample object data includes first session data with emotion tags, and the first session data is session data of a first sample object;
the determining module is further configured to determine emotion information corresponding to each first session data according to each first session data;
the determining module is further configured to determine an emotion classification result corresponding to each first session data according to the emotion information corresponding to each first session data;
the obtaining module is further configured to obtain the emotion classification model according to the emotion classification result and the emotion label corresponding to each piece of first session data.
In a possible implementation manner, the determining module is configured to input the emotion information of the first object into an emotion generation model, and output the emotion information of the second object by the emotion generation model.
In a possible implementation manner, the obtaining module is further configured to obtain emotion labels corresponding to a plurality of second session data, where the second session data is session data of a second sample object corresponding to the first session data;
the determining module is further configured to generate emotion information corresponding to each second session data according to the emotion information corresponding to each first session data;
the determining module is further configured to determine an emotion classification result corresponding to each second session data according to the emotion information corresponding to each second session data;
the obtaining module is further configured to obtain the emotion generation model according to the emotion classification result and the emotion label corresponding to each piece of second session data.
In a possible implementation manner, the synthesis module is configured to splice the text data and emotion information of the second object to obtain first information; generating first frequency spectrum information of the second object according to the first information; and generating voice data of the second object according to the first spectrum information of the second object.
In a possible implementation manner, the synthesis module is configured to generate second spectrum information of the second object according to the text data; splicing the second frequency spectrum information of the second object with the emotion information of the second object to obtain second information; and generating voice data of the second object according to the second information.
In a possible implementation manner, the synthesis module is configured to splice the text data and emotion information of the second object to obtain first information; generating first frequency spectrum information of the second object according to the first information; splicing the first frequency spectrum information of the second object with the emotion information of the second object to obtain third information; and generating voice data of the second object according to the third information.
In another aspect, an embodiment of the present application provides an electronic device, where the electronic device includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so that the electronic device implements any one of the above-mentioned speech synthesis methods.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to make a computer implement any of the above-mentioned speech synthesis methods.
In another aspect, a computer program or a computer program product is provided, in which at least one computer instruction is stored, and the at least one computer instruction is loaded and executed by a processor, so as to make a computer implement any of the above-mentioned speech synthesis methods.
The technical scheme provided by the embodiment of the application at least has the following beneficial effects:
the technical scheme provided by the embodiment of the application is that the voice data of the second object is synthesized according to the text data and the emotion information of the second object, the conversation data of the first object is replied based on the voice data of the second object, and the voice data of the second object contains the emotion information of the second object, so that the expressive force of the voice data is enhanced, and the service quality of the intelligent robot is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic diagram of an implementation environment of a speech synthesis method according to an embodiment of the present application;
fig. 2 is a flowchart of a speech synthesis method provided in an embodiment of the present application;
FIG. 3 is a schematic diagram of training data provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an emotion classification model provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an emotion generation model provided by an embodiment of the application;
FIG. 6 is a schematic diagram of a speech synthesis model provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an implementation environment of a speech synthesis method provided in an embodiment of the present application, where the implementation environment includes an electronic device 11 as shown in fig. 1, and the speech synthesis method in the embodiment of the present application may be executed by the electronic device 11. Illustratively, the electronic device 11 may include at least one of a terminal device or a server.
The terminal device may be at least one of a smart phone, a game console, a desktop computer, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, and a laptop computer.
The server may be one server, or a server cluster formed by multiple servers, or any one of a cloud computing platform and a virtualization center, which is not limited in this embodiment of the present application. The server can be in communication connection with the terminal device through a wired network or a wireless network. The server may have functions of data processing, data storage, data transceiving, and the like, and is not limited in the embodiment of the present application.
Based on the foregoing implementation environment, the embodiment of the present application provides a speech synthesis method, which may be executed by the electronic device 11 in fig. 1. Fig. 2 is a flowchart of the speech synthesis method provided in the embodiment of the present application; as shown in fig. 2, the method includes steps S21 to S24.
In step S21, session data of the first object is acquired.
The session data of the first object includes, but is not limited to, at least one of text data, picture data, and voice data. In the process of the conversation between the first object and the second object, any data sent by the first object can be used as the conversation data of the first object, and the conversation data of the first object is obtained by receiving the data sent by the first object.
The type of the first object is not limited. Illustratively, the first object includes, but is not limited to, a user, a smart robot.
In step S22, emotion information of the second object is determined based on the session data of the first object.
The type of the second object is not limited. Illustratively, the second object includes, but is not limited to, a smart robot.
In practical application, the session data of the first object may be a plurality of pieces of data, the emotion information of the second object corresponding to each piece of data is determined, and then the emotion information of the second object corresponding to each piece of data is integrated to obtain the emotion information of the second object corresponding to the session data of the first object.
For example, the session data of the first object includes data a1, data a2, and data A3, the emotion information of the second object corresponding to data a1 is determined as emotion information a, the emotion information of the second object corresponding to data a2 is determined as emotion information a, the emotion information of the second object corresponding to data A3 is determined as emotion information b, and the emotion information a, and the emotion information b are integrated to obtain the emotion information a of the second object corresponding to the session data of the first object.
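For illustration only, the integration step described above can be sketched in Python as follows, assuming a simple majority vote over the per-piece emotion information; the embodiment does not specify a particular integration rule, and the function name is hypothetical.

```python
from collections import Counter

def integrate_emotions(piece_emotions):
    # Integrate the per-piece emotion information of the second object into one
    # emotion by majority vote (an assumed integration rule; the embodiment does
    # not fix a specific rule).
    return Counter(piece_emotions).most_common(1)[0][0]

# Matches the example above: data A1, A2, A3 yield emotions a, a, b -> integrated emotion a.
print(integrate_emotions(["a", "a", "b"]))  # prints "a"
```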
Wherein determining emotional information of the second object from the session data of the first object comprises: acquiring emotion information of the first object according to session data of the first object; generating emotional information of the second object based on the emotional information of the first object.
When determining the emotion information of the second object according to the session data of the first object, two approaches are possible. In the first approach, the emotion information of the first object corresponding to each piece of data in the session data is determined, the per-piece emotion information is integrated to obtain the emotion information of the first object corresponding to the whole session data, and the emotion information of the second object corresponding to the session data of the first object is then determined from it. In the second approach, the emotion information of the first object corresponding to each piece of data is determined, the emotion information of the second object corresponding to each piece of data is determined from it, and the per-piece emotion information of the second object is then integrated to obtain the emotion information of the second object corresponding to the session data of the first object.
In the embodiment of the present application, acquiring emotion information of a first object according to session data of the first object includes: acquiring context data of session data of a first object; obtaining emotion information of the first object according to the session data and the context data of the first object.
Context data of the session data is the preceding and/or following content directly or indirectly adjacent to the session data, and the context data includes, but is not limited to, at least one of text data, picture data, and voice data.
Illustratively, for a conversation consisting of the utterances "why don't you answer my phone", "I called you so many times" and "not a single call was picked up", the context data of the session data "I called you so many times" is "why don't you answer my phone", or "not a single call was picked up", or both "why don't you answer my phone" and "not a single call was picked up".
In a possible implementation manner, the session data and the context data of the first object are spliced, and the spliced data is input into the emotion classification model. The emotion classification model transforms the data features of the spliced data to obtain the emotion features of the first object, and these emotion features are the emotion information. The manner in which the emotion classification model transforms the data features of data into emotion features is described in detail below in the content related to fig. 4 and is not repeated here.
In the embodiment of the present application, acquiring emotion information of a first object according to session data of the first object includes: the conversation data of the first object is input into the emotion classification model, and the emotion information of the first object is output by the emotion classification model.
Before inputting the session data of the first object into the emotion classification model, the method further comprises: obtaining a plurality of first sample object data, the first sample object data comprising first session data with emotion tags, the first session data being session data of the first sample object; determining emotion information corresponding to each first session data according to each first session data; determining emotion classification results corresponding to the first session data according to emotion information corresponding to the first session data; and obtaining an emotion classification model according to the emotion classification result and the emotion label corresponding to each piece of first session data.
The session data of the first object is input into the emotion classification model, and the emotion classification model transforms the data features of the session data of the first object to obtain the emotion features of the first object. The manner in which the emotion classification model transforms the data features of data into emotion features is described in detail below in the content related to fig. 4 and is not repeated here.
By acquiring a large amount of training data, the emotion classification model is acquired by using the training data. As shown in fig. 3, fig. 3 is a schematic diagram of training data provided in an embodiment of the present application. The training data comprises session data and corresponding emotion labels of the first sample object (i.e. the emotion labels of the first session data mentioned below), session data and corresponding emotion labels of the second sample object (i.e. the emotion labels of the second session data mentioned below). As shown in fig. 3, training data 1 comprises session data a1 and a corresponding emotion tag a1 for the first sample object, session data b1 and a corresponding emotion tag b1 for the second sample object; training data 2 comprises session data a2 and a corresponding emotion tag a2 for the first sample object, session data b2 and a corresponding emotion tag b2 for the second sample object; the training data n comprises session data an and a corresponding emotion label an of the first sample object, session data bn and a corresponding emotion label bn of the second sample object, where n is a positive integer. The session data of the first sample object includes, but is not limited to, at least one of text data, picture data, and voice data, and the session data of the second sample object includes, but is not limited to, at least one of text data, picture data, and voice data.
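For illustration only, the training-data layout of fig. 3 can be pictured as the following Python structure; the field names, example utterances and label values are hypothetical and are not taken from the embodiment.

```python
# Each training item pairs a first sample object's session data and emotion label
# with the corresponding second sample object's session data and emotion label,
# mirroring training data 1..n in fig. 3.
training_data = [
    {
        "first_session": "Why has my order still not arrived?",   # session data a1
        "first_emotion_label": "angry",                           # emotion label A1
        "second_session": "Sorry about that, let me check it for you right away.",  # session data b1
        "second_emotion_label": "apologetic",                     # emotion label B1
    },
    # ... training data 2 to n follow the same structure
]
```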
In practical application, the emotion classification model is obtained by using the session data of the first sample objects and the corresponding emotion labels in the training data; that is, the first classification model is trained with a plurality of first sample object data to obtain the emotion classification model. For each first sample object data, the first session data (namely, the session data of the first sample object) is input into the first classification model, the first classification model transforms the data features of the first session data to obtain the emotion features of the first session data, and the emotion classification result of the first session data is then obtained from these emotion features, where the emotion classification result is the probability that the first session data belongs to each emotion. Then, the loss value of the first classification model is calculated by using the emotion label of each first session data and the emotion classification result output by the first classification model, and the model parameters of the first classification model are optimized by using the loss value to obtain an optimized model. After the model is optimized multiple times in this manner, the emotion classification model is obtained.
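For illustration only, the following is a minimal PyTorch-style training-loop sketch of the procedure described above (classification result, loss value, parameter optimization); the model interface, feature shapes, optimizer and hyperparameters are assumptions and do not reflect the embodiment's exact configuration.

```python
import torch
import torch.nn as nn

def train_emotion_classifier(first_classification_model, data_loader, epochs=3, lr=1e-4):
    # Loss between the emotion classification result (probabilities per emotion)
    # and the emotion label of each first session data.
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(first_classification_model.parameters(), lr=lr)
    for _ in range(epochs):  # "after the model is optimized multiple times"
        for session_features, emotion_labels in data_loader:
            logits = first_classification_model(session_features)  # classification result
            loss = criterion(logits, emotion_labels)                # loss value
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()                                        # optimize model parameters
    return first_classification_model  # the optimized model is the emotion classification model
```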
The size and structure of the emotion classification model are not limited, and in a possible implementation manner, the emotion classification model is an attention model, as shown in fig. 4, fig. 4 is a schematic diagram of an emotion classification model provided in an embodiment of the present application, and the emotion classification model includes an encoding portion and a decoding portion. Specifically, the session data (including the aforementioned context data, first session data, session data of the first object, and the like) is input to the emotion classification model, the encoding portion converts the session data into data features, the decoding portion converts the data features into emotion features based on the attention principle, and then emotion classification results are obtained based on the emotion features.
In the embodiment of the application, the session data comprises a plurality of data segments. When the encoding part converts the session data into data features, each data segment is first converted into a data segment feature. For each data segment, its fusion feature is calculated from its own data segment feature and weight together with the data segment feature and weight of at least one adjacent data segment, where an adjacent data segment is a data segment directly or indirectly adjacent to it. The fusion features of all the data segments are calculated in this manner, and these fusion features are the data features referred to in the embodiment of the application.
If the session data is text data, the data segments are characters or words and the data segment features are character string features; if the session data is picture data, the data segments are pixels or picture regions and the data segment features are RGB (Red-Green-Blue) values; if the session data is voice data, the data segments are voice segments and the data segment features are audio features.
Taking text data as an example, suppose the text data includes five words. The adjacent words of the first word may be the second word; the second and third words; the second to fourth words; or even the second to fifth words. When calculating the fusion feature of the first word, the weight of the first word and the weight of each adjacent word are determined, the character string feature of the first word is multiplied by the weight of the first word, the character string feature of each adjacent word is multiplied by its corresponding weight, and the products are summed to obtain the fusion feature of the first word. The other words are calculated in a similar manner to the first word and are not described again here.
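For illustration only, the weighted-sum fusion described above can be sketched as follows; the weight values, the choice of adjacent words and the feature dimensions are hypothetical.

```python
import numpy as np

def fuse_segment(segment_features, weights, index, neighbor_indices):
    # Fusion feature of one data segment: its own feature times its weight plus
    # each adjacent segment's feature times that segment's weight.
    fused = weights[index] * segment_features[index]
    for j in neighbor_indices:
        fused = fused + weights[j] * segment_features[j]
    return fused

# Five words with 4-dimensional character string features; the first word is fused
# with the second and third words as its adjacent words, using assumed weights.
features = np.random.rand(5, 4)
weights = {0: 0.5, 1: 0.3, 2: 0.2}
fusion_of_first_word = fuse_segment(features, weights, index=0, neighbor_indices=[1, 2])
```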
Based on the attention principle, when the decoding part converts the data features into emotion features, the emotion feature corresponding to the first data segment is obtained from the fusion feature of the first data segment; for each data segment other than the first, the corresponding emotion feature is obtained from the fusion feature of that data segment and the emotion features corresponding to at least one preceding data segment. Finally, the final emotion feature is obtained from the emotion features corresponding to all the data segments.
Taking text data as an example, suppose the text data comprises five words. The emotional characteristic corresponding to the first word is obtained by conversion from the fusion characteristic of the first word and is denoted y1, where y1 = f(c1), f is a conversion function, and c1 is the fusion characteristic of the first word. For the second word, the emotional characteristic corresponding to the second word is obtained from the fusion characteristic of the second word and the emotional characteristic corresponding to the first word, and is denoted y2, where y2 = f(c2, y1) and c2 is the fusion characteristic of the second word. For the third word, the emotional characteristic corresponding to the third word is obtained from the fusion characteristic of the third word and the emotional characteristic corresponding to the second word, or from the fusion characteristic of the third word and the emotional characteristics corresponding to the first and second words, and is denoted y3, where y3 = f(c3, y2) or y3 = f(c3, y1, y2), c3 is the fusion characteristic of the third word, and so on. Finally, the final emotional characteristic is obtained from the emotional characteristics corresponding to all the words, and the emotion classification result corresponding to the session data is obtained from the final emotional characteristic.
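For illustration only, the decoding recurrence y1 = f(c1), yi = f(ci, y(i-1)) can be sketched as follows; the conversion function f and the aggregation into the final emotional characteristic are assumed placeholders.

```python
import numpy as np

def decode_emotions(fusion_features, f):
    # y1 = f(c1); yi = f(ci, y_{i-1}) for later segments (only the
    # "previous emotion feature" variant is shown here).
    emotion_features = []
    for i, c in enumerate(fusion_features):
        previous = emotion_features[i - 1] if i > 0 else None
        emotion_features.append(f(c, previous))
    final_emotion = np.mean(emotion_features, axis=0)  # assumed aggregation into the final feature
    return emotion_features, final_emotion

# Toy conversion function f: mixes the fusion feature with the previous emotion feature.
f = lambda c, prev: np.tanh(c) if prev is None else np.tanh(0.7 * c + 0.3 * prev)
per_word_emotions, final_emotion = decode_emotions(list(np.random.rand(5, 4)), f)
```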
It should be noted that the first classification model and the emotion classification model are only different in model parameters, and the size and the structure of the first classification model and the emotion classification model are the same, so the first classification model and the emotion classification model process features in the same way. The emotion classification model shown in fig. 4 is a possible implementation model, and the processing manners of the encoding portion and the decoding portion are also a possible implementation manner, and in practical application, different processing may be performed on the features based on the attention mechanism.
In the embodiment of the present application, generating emotion information of a second object according to emotion information of a first object includes: the emotion information of the first object is input to the emotion generation model, and the emotion information of the second object is output by the emotion generation model.
Wherein before inputting the emotion information of the first object into the emotion generation model, the method further comprises: acquiring emotion labels corresponding to a plurality of second session data, wherein the second session data are session data of a second sample object corresponding to the first session data; generating emotion information corresponding to each second session data according to the emotion information corresponding to each first session data; determining emotion classification results corresponding to the second session data according to emotion information corresponding to the second session data; and obtaining an emotion generation model according to the emotion classification result and the emotion label corresponding to each second session data.
The emotion information (i.e., the emotion features) of the first object output by the emotion classification model mentioned above is input into the emotion generation model, and the emotion generation model transforms the emotion features of the first object to obtain the emotion features of the second object. The manner in which the emotion generation model performs this transformation is described in detail below with reference to fig. 5 and is not repeated here.
The emotion generation model is obtained by using the training data shown in fig. 3. In actual application, the emotion generation model is obtained by using the emotion labels corresponding to the session data of the first sample objects and the emotion labels corresponding to the session data of the second sample objects in the training data. For each first sample object, the emotion label corresponding to its session data is converted into the emotion feature corresponding to that session data, and this emotion feature (namely, the emotion information of the first sample object) is input into the second classification model. The second classification model generates the emotion feature corresponding to the session data of the second sample object from the emotion feature corresponding to the session data of the first sample object, and the emotion classification result corresponding to the session data of the second sample object is obtained from that emotion feature; the emotion classification result is the probability that the session data of the second sample object belongs to each emotion. Then, the loss value of the second classification model is calculated by using the emotion label and the emotion classification result corresponding to the session data of each second sample object, and the model parameters of the second classification model are optimized by using the loss value to obtain an optimized model. After the model is optimized multiple times in this manner, the emotion generation model is obtained.
The size and the structure of the emotion generation model are not limited, and in a possible implementation manner, the emotion generation model is an attention model, as shown in fig. 5, fig. 5 is a schematic diagram of an emotion generation model provided in an embodiment of the present application, and the emotion generation model includes an encoding portion and a decoding portion. In the embodiment of the application, the emotion information (i.e., emotion characteristics) of the first object includes data segment characteristics of a plurality of data segments corresponding to the first object, the emotion characteristics of the first object are input to the emotion generation model, the encoding portion obtains fusion characteristics of the data segments according to the data segment characteristics of the data segments corresponding to the first object, the decoding portion obtains the emotion characteristics of the data segments corresponding to the second object based on the attention principle and the fusion characteristics of the data segments, and the final emotion characteristics (i.e., emotion information of the second object) is obtained based on the emotion characteristics of the data segments corresponding to the second object. The processing manner of the encoding part and the decoding part is described in the related description related to fig. 4, and is not described herein again.
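For illustration only, the following is a minimal PyTorch sketch of an attention-style encoder-decoder in the spirit of the emotion generation model described above; the layer types, dimensions and the single learned decoding query are assumptions and do not reproduce the embodiment's exact architecture.

```python
import torch
import torch.nn as nn

class EmotionGenerator(nn.Module):
    # Encoder turns the per-segment emotion features of the first object into
    # fusion features; decoder attends over them to produce an emotion feature
    # for the second object.
    def __init__(self, dim=64, heads=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=1)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=1)
        self.query = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, first_object_emotion):            # (batch, segments, dim)
        memory = self.encoder(first_object_emotion)     # fusion features
        query = self.query.expand(first_object_emotion.size(0), -1, -1)
        return self.decoder(query, memory)              # (batch, 1, dim): second object's emotion feature

model = EmotionGenerator()
second_object_emotion = model(torch.randn(2, 5, 64))
```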
It will be appreciated that the second classification model and the emotion generation model differ only in model parameters and are the same in size and structure, so the second classification model and the emotion generation model process features in the same way. The emotion generation model shown in fig. 5 is one possible implementation, and the processing manners of the encoding portion and the decoding portion are also one possible implementation; in practical applications, the features may be processed differently based on the attention mechanism.
In step S23, text data matching the session data of the first object is searched for in the text database.
In the embodiment of the application, a plurality of text data are stored in a text database, each text data comprises at least one keyword, and when the text data is stored in the text database, the storage mode at least comprises the following storage modes 1-3.
The storage mode 1 stores the corresponding relation between the question keywords and the answer keywords, and stores the corresponding relation between the answer keywords and the answer text data.
In the storage mode 1, the session data of the first object is used as question text data, a question keyword is extracted from the question text data, an answer keyword corresponding to the extracted question keyword is searched for according to the corresponding relationship between the stored question keyword and the answer keyword, then answer text data corresponding to the searched answer keyword is searched for according to the corresponding relationship between the stored answer keyword and the answer text data, and the searched answer text data is text data matched with the session data of the first object.
And the storage mode 2 is used for storing the corresponding relation between the question key words and the answer text data.
In the storage mode 2, the session data of the first object is used as question text data, the question keywords are extracted from the question text data, then answer text data corresponding to the extracted question keywords are searched according to the corresponding relation between the stored question keywords and the answer text data, and the searched answer text data is text data matched with the session data of the first object.
And the storage mode 3 is used for storing the corresponding relation between the question text data and the answer text data.
In storage mode 3, when the stored correspondence between question text data and answer text data contains question text data identical to the session data of the first object, the answer text data corresponding to that question text data is the text data matching the session data of the first object. When no identical question text data exists in the stored correspondence, the similarity between each question text data and the session data of the first object is calculated, and the answer text data corresponding to the question text data with the highest similarity is the text data matching the session data of the first object. The method for calculating the similarity between question text data and the session data of the first object is not limited.
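For illustration only, storage modes 2 and 3 above can be sketched as follows; the keyword extractor, the similarity measure and the example entries are hypothetical placeholders, as the embodiment leaves them unspecified.

```python
import difflib

# Storage mode 2: correspondence between question keywords and answer text data.
keyword_to_answer = {
    "refund": "Your refund will arrive within three business days.",
    "delivery": "Your order is out for delivery.",
}
# Storage mode 3: correspondence between question text data and answer text data.
question_to_answer = {
    "where is my order": "Your order is out for delivery.",
}

def extract_keywords(question_text):
    # Placeholder keyword extraction: keep the words that appear as stored keywords.
    return [word for word in question_text.lower().split() if word in keyword_to_answer]

def match_answer(session_text):
    # Mode 2: look up answer text data by an extracted question keyword.
    for keyword in extract_keywords(session_text):
        return keyword_to_answer[keyword]
    # Mode 3: exact question match first, otherwise the most similar stored question.
    if session_text in question_to_answer:
        return question_to_answer[session_text]
    best = max(question_to_answer,
               key=lambda q: difflib.SequenceMatcher(None, q, session_text).ratio())
    return question_to_answer[best]

print(match_answer("when will my refund arrive"))  # matched via the "refund" keyword
```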
In step S24, voice data of the second object is synthesized according to the text data and the emotion information of the second object, and the session data of the first object is replied to based on the voice data of the second object.
In the embodiment of the application, the text data and the emotion information of the second object may be input into a speech synthesis model, and the speech synthesis model synthesizes the voice data of the second object based on the text data and the emotion information of the second object, so that the voice data of the second object carries the emotion information, which improves the expressiveness and degree of personification of the voice data. The structure and size of the speech synthesis model are not limited; for example, the speech synthesis model is a Text-To-Speech (TTS) model. The speech synthesis model comprises a spectrum generator and a vocoder, where the spectrum generator is a model that converts text data into corresponding spectrum information, and the vocoder is a model that converts spectrum information into voice data.
In one possible implementation, synthesizing speech data of the second object from the text data and emotion information of the second object includes: splicing the text data with emotion information of a second object to obtain first information; generating first spectrum information of a second object according to the first information; and generating voice data of the second object according to the first spectrum information of the second object.
In practical application, text data is converted into data features, the data features are spliced with emotional features of a second object, the spliced features are first information, the first information is input into a spectrum generator, the spectrum generator outputs first spectrum features (namely first spectrum information) of the second object, the first spectrum features are input into a vocoder, and voice data of the second object is output by the vocoder, wherein the structure and the size of the vocoder are not limited.
In a possible implementation manner, the spectrum generator is an attention model and includes an encoding portion and a decoding portion. The first information includes data segment features of a plurality of data segments; the encoding portion obtains the fusion feature of each data segment from the data segment features, and the decoding portion, based on the attention principle, obtains the spectral feature of each data segment from its fusion feature and then obtains the first spectral feature, that is, the first spectrum information, from the spectral features of the data segments. The processing manner of the encoding portion and the decoding portion is described in the related description of fig. 4 and is not repeated here.
In one possible implementation, synthesizing speech data of the second object from the text data and emotion information of the second object includes: generating second frequency spectrum information of a second object according to the text data; splicing the second frequency spectrum information of the second object with the emotion information of the second object to obtain second information; and generating voice data of the second object according to the second information.
In practical application, text data is converted into data features, the data features are input into a spectrum generator, second spectrum features (namely second spectrum information) of a second object are output by the spectrum generator, the second spectrum features are spliced with emotion features of the second object, the spliced features are the second information, the second information is input into a vocoder, and voice data of the second object is output by the vocoder, wherein the structure and the size of the vocoder are not limited.
In a possible implementation manner, the spectrum generator is an attention model and includes an encoding portion and a decoding portion. The data features corresponding to the text data include data segment features of a plurality of data segments; the encoding portion obtains the fusion feature of each data segment from the data segment features, and the decoding portion, based on the attention principle, obtains the spectral feature of each data segment from its fusion feature and then obtains the second spectral feature, that is, the second spectrum information, from the spectral features of the data segments. The processing manner of the encoding portion and the decoding portion is described in the related description of fig. 4 and is not repeated here.
In one possible implementation, synthesizing speech data of the second object based on the text data and emotion information of the second object includes: splicing the text data with emotion information of a second object to obtain first information; generating first spectrum information of a second object according to the first information; splicing the first frequency spectrum information of the second object with the emotion information of the second object to obtain third information; and generating voice data of the second object according to the third information.
In practical application, as shown in fig. 6, which is a schematic diagram of a speech synthesis model provided in an embodiment of the present application, the text data is converted into data features, and the data features are spliced with the emotional features of the second object (i.e., the emotion information of the second object); the spliced features are the first information. The first information is input to the spectrum generator, and the spectrum generator outputs the first spectral features of the second object (i.e., the first spectrum information). The first spectral features are spliced with the emotional features of the second object; the spliced features are the third information. The third information is input to the vocoder, and the vocoder outputs the voice data of the second object, where the structure and size of the vocoder are not limited.
In a possible implementation manner, the spectrum generator is an attention model and includes an encoding portion and a decoding portion. The first information includes data segment features of a plurality of data segments; the encoding portion obtains the fusion feature of each data segment from the data segment features, and the decoding portion, based on the attention principle, obtains the spectral feature of each data segment from its fusion feature and then obtains the first spectral feature, that is, the first spectrum information, from the spectral features of the data segments. The processing manner of the encoding portion and the decoding portion is described in the related description of fig. 4 and is not repeated here.
It should be noted that, in practical applications, the emotional features of the second object may also be input into the decoding portion of the spectrum generator. In a possible implementation manner, the emotional features of the second object include emotional features of a plurality of data segments. The decoding portion splices the fusion feature of the first data segment with the emotional feature of the first data segment and obtains the spectral feature corresponding to the first data segment from the spliced feature. For each data segment other than the first, the decoding portion splices the fusion feature of that data segment with its emotional feature and obtains the corresponding spectral feature from the spliced feature and the spectral feature corresponding to at least one preceding data segment. Finally, the final spectral features (including the aforementioned first spectral features, second spectral features, and so on) are obtained from the spectral features corresponding to the data segments.
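For illustration only, the three splicing strategies described above (splicing before the spectrum generator, splicing before the vocoder, or splicing at both points as in fig. 6) can be sketched as follows; the feature shapes and the stand-in spectrum generator and vocoder are hypothetical placeholders.

```python
import numpy as np

def synthesize(text_features, emotion_features, spectrum_generator, vocoder, mode="both"):
    if mode == "before_generator":   # first manner: splice before the spectrum generator
        spectrum = spectrum_generator(np.concatenate([text_features, emotion_features]))
        return vocoder(spectrum)
    if mode == "before_vocoder":     # second manner: splice before the vocoder
        spectrum = spectrum_generator(text_features)
        return vocoder(np.concatenate([spectrum, emotion_features]))
    # third manner (fig. 6): splice at both points
    spectrum = spectrum_generator(np.concatenate([text_features, emotion_features]))
    return vocoder(np.concatenate([spectrum, emotion_features]))

# Dummy stand-ins for the spectrum generator and vocoder, with 8-dimensional
# feature vectors purely for illustration.
spectrum_generator = lambda x: np.tanh(x)
vocoder = lambda x: np.sin(x)
voice_data = synthesize(np.random.rand(8), np.random.rand(8), spectrum_generator, vocoder)
```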
In a possible implementation manner, the first object and the first sample object related to the embodiments of the present application may be users, and the second object and the second sample object may be smart customer services.
According to the technical scheme provided by the embodiments of the application, the voice data of the second object is synthesized according to the text data and the emotion information of the second object, and the session data of the first object is replied to based on the voice data of the second object. Because the voice data of the second object contains the emotion information of the second object, the expressiveness of the voice data is enhanced and the service quality of the intelligent robot is improved.
Fig. 7 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application, and as shown in fig. 7, the apparatus includes an obtaining module 71, a determining module 72, a searching module 73, and a synthesizing module 74.
An obtaining module 71, configured to obtain session data of the first object.
A determining module 72 for determining emotional information of the second object based on the session data of the first object.
And a searching module 73, configured to search the text database for text data matching the session data of the first object.
And a synthesis module 74 for synthesizing the voice data of the second object according to the text data and the emotion information of the second object, and replying to the conversation data of the first object based on the voice data of the second object.
In a possible implementation, the determining module 72 is configured to obtain emotion information of the first object according to session data of the first object; generating emotional information of the second object based on the emotional information of the first object.
In a possible implementation, the determining module 72 is configured to obtain context data of session data of the first object; obtaining emotion information of the first object according to the session data and the context data of the first object.
In one possible implementation, the determining module 72 is configured to input the session data of the first object into the emotion classification model, and output emotion information of the first object by the emotion classification model.
In a possible implementation manner, the obtaining module is further configured to obtain a plurality of first sample object data, where the first sample object data includes first session data with emotion tags, and the first session data is session data of the first sample object.
And the determining module is further used for determining emotion information corresponding to each first session data according to each first session data.
And the determining module is further used for determining the emotion classification result corresponding to each first session data according to the emotion information corresponding to each first session data.
And the obtaining module is further used for obtaining the emotion classification model according to the emotion classification result and the emotion label corresponding to each piece of first session data.
In one possible implementation, the determining module 72 is configured to input the emotion information of the first object to the emotion generating model, and output the emotion information of the second object by the emotion generating model.
In a possible implementation manner, the obtaining module is further configured to obtain emotion tags corresponding to a plurality of second session data, where the second session data is session data of a second sample object corresponding to the first session data.
And the determining module is further used for generating emotion information corresponding to each second session data according to the emotion information corresponding to each first session data.
And the determining module is further used for determining emotion classification results corresponding to the second session data according to the emotion information corresponding to the second session data.
And the obtaining module is further used for obtaining the emotion generation model according to the emotion classification result and the emotion label corresponding to each second session data.
In a possible implementation manner, the synthesizing module 74 is configured to splice the text data and the emotion information of the second object to obtain the first information; generating first spectrum information of a second object according to the first information; and generating voice data of the second object according to the first spectrum information of the second object.
In a possible implementation manner, the synthesizing module 74 is configured to generate second spectrum information of the second object according to the text data; splicing the second frequency spectrum information of the second object with the emotion information of the second object to obtain second information; and generating voice data of the second object according to the second information.
In a possible implementation manner, the synthesizing module 74 is configured to splice the text data and the emotion information of the second object to obtain the first information; generating first spectrum information of a second object according to the first information; splicing the first frequency spectrum information of the second object with the emotion information of the second object to obtain third information; and generating voice data of the second object according to the third information.
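The three variants above differ only in where the emotion information is spliced in: before spectrum generation (first information), onto the generated spectrum features (second information), or both (third information). The sketch below implements the third variant with a toy encoder/decoder; dropping the first concatenation gives the second variant. All module names and dimensions are assumptions, and a separate vocoder (not shown) would turn the spectrum into a waveform.

import torch
import torch.nn as nn

class TinyEmotionalTTS(nn.Module):
    def __init__(self, vocab=8000, text_dim=128, emo_dim=16, mel_bins=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, text_dim)
        self.pre = nn.Linear(text_dim + emo_dim, text_dim)    # splice emotion before spectrum generation
        self.decoder = nn.GRU(text_dim, 256, batch_first=True)
        self.to_mel = nn.Linear(256 + emo_dim, mel_bins)      # splice emotion onto the spectrum features

    def forward(self, token_ids, emotion):
        # token_ids: (B, T) text data; emotion: (B, emo_dim) emotion information of the second object
        emo = emotion.unsqueeze(1).expand(-1, token_ids.size(1), -1)
        x = torch.cat([self.text_embed(token_ids), emo], dim=-1)   # "first information"
        h, _ = self.decoder(torch.relu(self.pre(x)))
        return self.to_mel(torch.cat([h, emo], dim=-1))            # spectrum information of the second object

tts = TinyEmotionalTTS()
mel = tts(torch.randint(0, 8000, (1, 20)), torch.randn(1, 16))
print(mel.shape)   # (1, 20, 80): mel-spectrogram frames to be fed to a vocoder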
It should be understood that the division into the above functional modules is merely illustrative of how the apparatus provided in fig. 7 implements its functions; in practical applications, these functions may be allocated to different functional modules as required, that is, the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of their specific implementation, refer to the method embodiments, which are not repeated here.
According to the technical solution provided by the embodiments of the present application, the voice data of the second object is synthesized from the text data and the emotion information of the second object, and the session data of the first object is replied to based on that voice data. Because the voice data of the second object carries the emotion information of the second object, the expressiveness of the voice data is enhanced and the service quality of the intelligent robot is improved.
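Putting the modules together, a hedged end-to-end sketch of this flow might look like the function below; the four stage callables are hypothetical placeholders standing in for the modules of fig. 7, not APIs defined by the patent.

from typing import Any, Callable

def reply_to_first_object(
    session_data: str,
    classify_emotion: Callable[[str], Any],        # determining module: emotion of the first object
    generate_emotion: Callable[[Any], Any],        # determining module: emotion of the second object
    search_text: Callable[[str], str],             # searching module: matched text from the text database
    synthesize: Callable[[str, Any], bytes],       # synthesis module: emotional speech of the second object
) -> bytes:
    first_emotion = classify_emotion(session_data)
    second_emotion = generate_emotion(first_emotion)
    text = search_text(session_data)
    return synthesize(text, second_emotion)        # voice data used to reply to the first object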
Fig. 8 shows a block diagram of a terminal device 800 according to an exemplary embodiment of the present application. The terminal device 800 may be a portable mobile terminal such as a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The terminal device 800 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, the terminal device 800 includes: a processor 801 and a memory 802.
The processor 801 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 801 may be implemented in at least one hardware form of a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), or a PLA (Programmable Logic Array). The processor 801 may also include a main processor and a coprocessor, where the main processor is a processor for processing data in an awake state, also called a CPU (Central Processing Unit), and the coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 801 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the display screen. In some embodiments, the processor 801 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 802 may include one or more computer-readable storage media, which may be non-transitory. Memory 802 may also include high speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 802 is used to store at least one instruction for execution by processor 801 to implement the speech synthesis methods provided by method embodiments herein.
In some embodiments, the terminal device 800 may further include: a peripheral interface 803 and at least one peripheral. The processor 801, memory 802 and peripheral interface 803 may be connected by bus or signal lines. Various peripheral devices may be connected to peripheral interface 803 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 804, a display screen 805, a camera assembly 806, an audio circuit 807, a positioning assembly 808, and a power supply 809.
The peripheral interface 803 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 801 and the memory 802. In some embodiments, the processor 801, memory 802, and peripheral interface 803 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 801, the memory 802, and the peripheral interface 803 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The radio frequency circuit 804 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 804 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 804 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 804 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so on. The radio frequency circuit 804 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the World Wide Web, metropolitan area networks, intranets, the generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the radio frequency circuit 804 may further include NFC (Near Field Communication) related circuits, which is not limited in this application.
The display screen 805 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 805 is a touch display, the display 805 also has the ability to capture touch signals on or above its surface. The touch signal may be input to the processor 801 as a control signal for processing. At this point, the display 805 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, there may be one display 805, disposed on the front panel of the terminal device 800; in other embodiments, there may be at least two displays 805, respectively disposed on different surfaces of the terminal device 800 or in a folding design; in still other embodiments, the display 805 may be a flexible display disposed on a curved or folded surface of the terminal device 800. The display 805 may even be arranged in a non-rectangular irregular pattern, that is, an irregularly shaped screen. The display 805 may be made using an LCD (Liquid Crystal Display), an OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 806 is used to capture images or video. Optionally, the camera assembly 806 includes a front camera and a rear camera. Generally, the front camera is disposed on the front panel of the terminal, and the rear camera is disposed on the rear surface of the terminal. In some embodiments, there are at least two rear cameras, each being one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blurring function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, a VR (Virtual Reality) shooting function, or other fusion shooting functions. In some embodiments, the camera assembly 806 may also include a flash. The flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation at different color temperatures.
The audio circuit 807 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 801 for processing or inputting the electric signals to the radio frequency circuit 804 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different positions of the terminal device 800. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 801 or the radio frequency circuit 804 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 807 may also include a headphone jack.
The positioning component 808 is used to locate the current geographic location of the terminal device 800 to implement navigation or LBS (Location Based Service). The positioning component 808 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, or the Galileo system of the European Union.
The power supply 809 is used to supply power to various components in the terminal device 800. The power supply 809 can be ac, dc, disposable or rechargeable. When the power supply 809 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal device 800 also includes one or more sensors 810. The one or more sensors 810 include, but are not limited to: acceleration sensor 811, gyro sensor 812, pressure sensor 813, fingerprint sensor 814, optical sensor 815 and proximity sensor 816.
The acceleration sensor 811 can detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal apparatus 800. For example, the acceleration sensor 811 may be used to detect the components of the gravitational acceleration in three coordinate axes. The processor 801 may control the display 805 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 811. The acceleration sensor 811 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 812 may detect a body direction and a rotation angle of the terminal device 800, and the gyro sensor 812 may cooperate with the acceleration sensor 811 to acquire a 3D motion of the user on the terminal device 800. From the data collected by the gyro sensor 812, the processor 801 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 813 may be disposed on the side bezel of terminal device 800 and/or underneath display screen 805. When the pressure sensor 813 is arranged on the side frame of the terminal device 800, the holding signal of the user to the terminal device 800 can be detected, and the processor 801 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 813. When the pressure sensor 813 is disposed at a lower layer of the display screen 805, the processor 801 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 805. The operability control comprises at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 814 is used for collecting a fingerprint of the user, and the processor 801 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 814, or the fingerprint sensor 814 identifies the identity of the user according to the collected fingerprint. Upon identifying the user as a trusted identity, the processor 801 authorizes the user to perform relevant sensitive operations, including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 814 may be disposed on the front, back, or side of terminal device 800. When a physical button or a vendor Logo is provided on the terminal device 800, the fingerprint sensor 814 may be integrated with the physical button or the vendor Logo.
The optical sensor 815 is used to collect the ambient light intensity. In one embodiment, processor 801 may control the display brightness of display 805 based on the ambient light intensity collected by optical sensor 815. Specifically, when the ambient light intensity is high, the display brightness of the display screen 805 is increased; when the ambient light intensity is low, the display brightness of the display 805 is adjusted down. In another embodiment, the processor 801 may also dynamically adjust the shooting parameters of the camera assembly 806 based on the ambient light intensity collected by the optical sensor 815.
The proximity sensor 816, also called a distance sensor, is typically provided on the front panel of the terminal device 800. The proximity sensor 816 is used to collect the distance between the user and the front surface of the terminal device 800. In one embodiment, when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal device 800 gradually decreases, the processor 801 controls the display 805 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 816 detects that the distance between the user and the front surface of the terminal device 800 gradually increases, the processor 801 controls the display 805 to switch from the dark-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 8 is not limiting of terminal device 800 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
Fig. 9 is a schematic structural diagram of a server provided in this embodiment. The server 900 may vary greatly due to differences in configuration or performance, and may include one or more processors (CPUs) 901 and one or more memories 902, where the one or more memories 902 store at least one program code, and the at least one program code is loaded and executed by the one or more processors 901 to implement the speech synthesis method provided by each of the method embodiments. Of course, the server 900 may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for performing input and output, and the server 900 may also include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium having at least one program code stored therein, the at least one program code being loaded and executed by a processor to implement any of the above-described speech synthesis methods.
Alternatively, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, there is also provided a computer program or computer program product having at least one computer instruction stored therein, the at least one computer instruction being loaded and executed by a processor to implement any of the speech synthesis methods described above.
It should be understood that reference to "a plurality" herein means two or more. "And/or" describes the association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
The above description is only exemplary of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements and the like that are made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A method of speech synthesis, the method comprising:
acquiring session data of a first object;
determining emotion information of a second object according to the session data of the first object;
searching text data matched with the session data of the first object from a text database;
converting the text data into data characteristics, and splicing the data characteristics with emotion information of the second object to obtain first information, wherein the emotion information of the second object comprises emotion characteristics of a plurality of data segments, and the first information comprises data segment characteristics of the plurality of data segments;
obtaining the fusion characteristics of each data segment according to the data segment characteristics of each data segment;
splicing the fusion characteristics of the first data segment with the emotion characteristics of the first data segment, and converting the spliced characteristics to obtain the frequency spectrum characteristics corresponding to the first data segment;
for other data segments except the first data segment, splicing the fusion characteristics of the other data segments with the emotion characteristics of the other data segments, and obtaining the spectrum characteristics corresponding to the other data segments according to the spliced characteristics and the spectrum characteristics corresponding to at least one data segment before the other data segments;
obtaining first spectrum information of the second object according to the spectrum characteristics corresponding to the data segments;
generating voice data of the second object according to the first spectrum information of the second object;
replying to the session data of the first object based on the voice data of the second object.
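For readability only, the segment-by-segment spectrum generation described above can be pictured with the hedged sketch below: the fusion feature of each data segment is spliced with that segment's emotion feature, the first segment is converted directly, and every later segment additionally conditions on the previously generated spectrum feature. The recurrent cell, dimensions, and names are illustrative assumptions, not the claimed implementation.

import torch
import torch.nn as nn

class SegmentSpectrumDecoder(nn.Module):
    def __init__(self, fuse_dim=256, emo_dim=16, mel_bins=80):
        super().__init__()
        self.first = nn.Linear(fuse_dim + emo_dim, mel_bins)              # first data segment
        self.rest = nn.GRUCell(fuse_dim + emo_dim + mel_bins, mel_bins)   # later segments see the previous spectrum

    def forward(self, fused, emotion):
        # fused:   (num_segments, fuse_dim)  fusion feature of each data segment
        # emotion: (num_segments, emo_dim)   emotion feature of each data segment
        spectra = [self.first(torch.cat([fused[0], emotion[0]], dim=-1))]
        for t in range(1, fused.size(0)):
            spliced = torch.cat([fused[t], emotion[t], spectra[-1]], dim=-1)
            nxt = self.rest(spliced.unsqueeze(0), spectra[-1].unsqueeze(0)).squeeze(0)
            spectra.append(nxt)
        return torch.stack(spectra)     # first spectrum information: one spectrum feature per data segment

dec = SegmentSpectrumDecoder()
mel = dec(torch.randn(5, 256), torch.randn(5, 16))
print(mel.shape)   # (5, 80)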
2. The method of claim 1, wherein determining emotional information of a second object from the session data of the first object comprises:
acquiring emotion information of the first object according to the session data of the first object;
generating emotional information of the second object according to the emotional information of the first object.
3. The method according to claim 2, wherein the obtaining of the emotion information of the first object from the session data of the first object comprises:
acquiring context data of session data of the first object;
obtaining emotion information of the first object according to the session data and the context data of the first object.
4. The method of claim 2, wherein the obtaining emotional information of the first object from the session data of the first object comprises:
inputting the session data of the first object into an emotion classification model, and outputting emotion information of the first object by the emotion classification model.
5. The method of claim 4, wherein prior to inputting the session data of the first object into an emotion classification model, the method further comprises:
obtaining a plurality of first sample object data, the first sample object data comprising first session data with emotion tags, the first session data being session data of a first sample object;
determining emotion information corresponding to each first session data according to each first session data;
determining emotion classification results corresponding to the first session data according to emotion information corresponding to the first session data;
and acquiring the emotion classification model according to the emotion classification result and the emotion label corresponding to each piece of first session data.
6. The method of claim 5, wherein generating the emotional information of the second object based on the emotional information of the first object comprises:
the emotion information of the first object is input to an emotion generation model, and the emotion information of the second object is output by the emotion generation model.
7. The method of claim 6, wherein prior to inputting the emotion information of the first object into an emotion generation model, the method further comprises:
acquiring emotion labels corresponding to a plurality of second session data, wherein the second session data are session data of a second sample object corresponding to the first session data;
generating emotion information corresponding to each second session data according to the emotion information corresponding to each first session data;
determining emotion classification results corresponding to the second session data according to emotion information corresponding to the second session data;
and acquiring the emotion generation model according to the emotion classification result and the emotion label corresponding to each second session data.
8. The method of claim 1, wherein the generating the speech data of the second object according to the first spectral information of the second object comprises:
splicing the first frequency spectrum information of the second object with the emotion information of the second object to obtain third information;
and generating voice data of the second object according to the third information.
9. A speech synthesis apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring session data of a first object;
a determining module for determining emotion information of a second object according to the session data of the first object;
the searching module is used for searching text data matched with the session data of the first object from a text database;
the synthesis module is used for converting the text data into data characteristics and splicing the data characteristics with emotion information of the second object to obtain first information, wherein the emotion information of the second object comprises emotion characteristics of a plurality of data segments, and the first information comprises data segment characteristics of the plurality of data segments; obtaining the fusion characteristics of each data segment according to the data segment characteristics of each data segment; splicing the fusion characteristics of the first data segment with the emotion characteristics of the first data segment, and converting the spliced characteristics to obtain the frequency spectrum characteristics corresponding to the first data segment; for other data segments except the first data segment, splicing the fusion characteristics of the other data segments with the emotion characteristics of the other data segments, and obtaining the spectrum characteristics corresponding to the other data segments according to the spliced characteristics and the spectrum characteristics corresponding to at least one data segment before the other data segments; obtaining first spectrum information of the second object according to the spectrum characteristics corresponding to the data segments; generating voice data of the second object according to the first frequency spectrum information of the second object; replying to the session data of the first object based on the voice data of the second object.
10. An electronic device, comprising a processor and a memory, wherein at least one program code is stored in the memory, and wherein the at least one program code is loaded into and executed by the processor to cause the electronic device to implement the speech synthesis method according to any one of claims 1 to 8.
11. A computer-readable storage medium having stored therein at least one program code, the at least one program code being loaded and executed by a processor, to cause a computer to implement the speech synthesis method according to any one of claims 1 to 8.
CN202110863647.7A 2021-07-29 2021-07-29 Speech synthesis method, device, equipment and readable storage medium Active CN113593521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110863647.7A CN113593521B (en) 2021-07-29 2021-07-29 Speech synthesis method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110863647.7A CN113593521B (en) 2021-07-29 2021-07-29 Speech synthesis method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN113593521A CN113593521A (en) 2021-11-02
CN113593521B true CN113593521B (en) 2022-09-20

Family

ID=78251813

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110863647.7A Active CN113593521B (en) 2021-07-29 2021-07-29 Speech synthesis method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113593521B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105345818A (en) * 2015-11-04 2016-02-24 深圳好未来智能科技有限公司 3D video interaction robot with emotion module and expression module
CN108326855A (en) * 2018-01-26 2018-07-27 上海器魂智能科技有限公司 A kind of exchange method of robot, device, equipment and storage medium
WO2019144542A1 (en) * 2018-01-26 2019-08-01 Institute Of Software Chinese Academy Of Sciences Affective interaction systems, devices, and methods based on affective computing user interface
CN110211563A (en) * 2019-06-19 2019-09-06 平安科技(深圳)有限公司 Chinese speech synthesis method, apparatus and storage medium towards scene and emotion
CN111028827A (en) * 2019-12-10 2020-04-17 深圳追一科技有限公司 Interaction processing method, device, equipment and storage medium based on emotion recognition
CN111223498A (en) * 2020-01-10 2020-06-02 平安科技(深圳)有限公司 Intelligent emotion recognition method and device and computer readable storage medium
CN111506183A (en) * 2019-01-30 2020-08-07 阿里巴巴集团控股有限公司 Intelligent terminal and user interaction method
CN111916111A (en) * 2020-07-20 2020-11-10 中国建设银行股份有限公司 Intelligent voice outbound method and device with emotion, server and storage medium
CN112185389A (en) * 2020-09-22 2021-01-05 北京小米松果电子有限公司 Voice generation method and device, storage medium and electronic equipment
CN112379780A (en) * 2020-12-01 2021-02-19 宁波大学 Multi-mode emotion interaction method, intelligent device, system, electronic device and medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
BG24190A1 (en) * 1976-09-08 1978-01-10 Antonov Method of synthesis of speech and device for effecting same
AU7486200A (en) * 1999-09-22 2001-04-24 Conexant Systems, Inc. Multimode speech encoder
CN101064103B (en) * 2006-04-24 2011-05-04 中国科学院自动化研究所 Chinese voice synthetic method and system based on syllable rhythm restricting relationship
KR20130055429A (en) * 2011-11-18 2013-05-28 삼성전자주식회사 Apparatus and method for emotion recognition based on emotion segment
CN107481713B (en) * 2017-07-17 2020-06-02 清华大学 Mixed language voice synthesis method and device
CN112466276A (en) * 2020-11-27 2021-03-09 出门问问(苏州)信息科技有限公司 Speech synthesis system training method and device and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
《计算机工程与应用》 (Computer Engineering and Applications), 2018 (Vol. 54), annual table of contents; Computer Engineering and Applications; 2018-12-15 (No. 24); full text *

Also Published As

Publication number Publication date
CN113593521A (en) 2021-11-02

Similar Documents

Publication Publication Date Title
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110059686B (en) Character recognition method, device, equipment and readable storage medium
CN108922531B (en) Slot position identification method and device, electronic equipment and storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN112116904B (en) Voice conversion method, device, equipment and storage medium
CN111564152A (en) Voice conversion method and device, electronic equipment and storage medium
CN111524501A (en) Voice playing method and device, computer equipment and computer readable storage medium
CN112261491B (en) Video time sequence marking method and device, electronic equipment and storage medium
CN112052354A (en) Video recommendation method, video display method and device and computer equipment
CN113918767A (en) Video clip positioning method, device, equipment and storage medium
CN111339737A (en) Entity linking method, device, equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN115206305B (en) Semantic text generation method and device, electronic equipment and storage medium
CN110837557A (en) Abstract generation method, device, equipment and medium
CN111611414A (en) Vehicle retrieval method, device and storage medium
CN113361376B (en) Method and device for acquiring video cover, computer equipment and readable storage medium
CN113301444B (en) Video processing method and device, electronic equipment and storage medium
CN114360494A (en) Rhythm labeling method and device, computer equipment and storage medium
CN115658857A (en) Intelligent dialogue method, device, equipment and storage medium
CN113593521B (en) Speech synthesis method, device, equipment and readable storage medium
CN111145723B (en) Method, device, equipment and storage medium for converting audio
CN114925667A (en) Content classification method, device, equipment and computer readable storage medium
CN111063372B (en) Method, device and equipment for determining pitch characteristics and storage medium
CN115221888A (en) Entity mention identification method, device, equipment and storage medium
CN112487162A (en) Method, device and equipment for determining text semantic information and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant