CN111862938A - Intelligent response method, terminal and computer readable storage medium


Info

Publication number
CN111862938A
Authority
CN
China
Prior art keywords
feature
text
target response
style
features
Prior art date
Legal status
Pending
Application number
CN202010379072.7A
Other languages
Chinese (zh)
Inventor
郭庭炜
文成
赵帅江
Current Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd filed Critical Beijing Didi Infinity Technology and Development Co Ltd
Priority to CN202010379072.7A
Publication of CN111862938A

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 - Speech synthesis; Text to speech systems
    • G10L 13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention provides an intelligent response method, a terminal and a computer-readable storage medium. The method comprises the following steps: acquiring a response text; extracting one or more of a text feature, a first emotion feature and a first style feature of the response text; acquiring an object feature of a target response object; generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature, wherein the target response data comprises voice data; and outputting the target response data. The technical solution provided by the invention addresses the problem that a response voice with a single emotion produces a poor voice interaction effect, and makes the intelligent interaction process more engaging.

Description

Intelligent response method, terminal and computer readable storage medium
Technical Field
The present invention relates to computer technologies, and in particular, to an intelligent response method, a terminal, and a computer-readable storage medium.
Background
With the development of computer technology, voice response technology has become widespread. For example, on receiving a voice interaction instruction from a user, a terminal may output voice data in response to that instruction; as another example, the terminal may read out, in voice form, a text sent to the user by another person. Voice response technology makes the intelligent interaction process more flexible and engaging.
Voice response technology is built on speech synthesis: a voice is synthesized from the response text and output to the user as the response voice. At present, voice response systems generally synthesize the response voice with default acoustic characteristics. This yields a response voice with a single emotion and a single speaker, so the response voice conveys little information and the voice interaction effect is poor.
Disclosure of Invention
The invention provides an intelligent response method, a terminal and a computer-readable storage medium, which address the problem that a response voice with a single emotion produces a poor voice interaction effect, and which make the intelligent interaction process more engaging.
In a first aspect, the present invention provides an intelligent response method, including:
acquiring a response text;
extracting one or more of text features, first emotion features and first style features of the response text;
acquiring object characteristics of a target response object;
generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature; the target response data comprises voice data;
And outputting the target response data.
In a second aspect, the present invention provides a terminal, including a processing module and a transceiver module;
wherein the processing module is configured to:
acquiring a response text;
extracting one or more of text features, first emotion features and first style features of the response text;
acquiring object characteristics of a target response object;
generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature;
the transceiver module is used for outputting the target response data.
In a third aspect, the present invention provides a terminal, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any of the first aspects.
In a fourth aspect, the invention provides a computer readable storage medium having stored thereon a computer program for execution by a processor to perform the method according to the first aspect.
The invention provides an intelligent response method, a terminal and a computer-readable storage medium. In this solution, before the target response data is output, one or more of the text feature, the first emotion feature and the first style feature carried in the text data can be acquired based on the response text, the object feature of the target response object can be acquired, and the target response data can then be generated based on at least two of these features.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart of an intelligent response method according to an embodiment of the present invention;
fig. 2 is a schematic diagram of an intelligent response provided by an embodiment of the present invention;
FIG. 3 is a diagram illustrating an emotion prediction model according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a style prediction model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an avatar used in a praise scenario in a ride-hailing APP provided in an embodiment of the present invention;
FIG. 6 is a schematic diagram of another intelligent response provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a generative model provided in accordance with an embodiment of the present invention;
FIG. 8 is a schematic diagram of another generative model provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram of another generative model provided by an embodiment of the present invention;
FIG. 10 is a schematic diagram of another generative model provided by an embodiment of the present invention;
FIG. 11 is a schematic diagram of another generative model provided by an embodiment of the present invention;
FIG. 12 is a schematic diagram of another generative model provided by an embodiment of the present invention;
fig. 13 is a functional block diagram of a terminal according to an embodiment of the present invention;
fig. 14 is a schematic entity structure diagram of a terminal according to an embodiment of the present invention.
With the foregoing drawings in mind, certain embodiments of the disclosure have been shown and described in more detail below. These drawings and written description are not intended to limit the scope of the disclosed concepts in any way, but rather to illustrate the concepts of the disclosure to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The intelligent response method provided by the embodiment of the invention can be applied to any terminal device. The terminal device according to the embodiment of the present invention may be a wireless terminal or a wired terminal. A wireless terminal may be a device that provides voice and/or other service data connectivity to a user, a handheld device with wireless connection capability, or another processing device connected to a wireless modem. A wireless terminal may communicate with one or more core network devices via a Radio Access Network (RAN) and exchange voice and/or data with the RAN; it may be a mobile terminal such as a mobile telephone (or "cellular" telephone) or a computer with a mobile terminal, for example a portable, pocket-sized, handheld, computer-built-in or vehicle-mounted mobile device. The wireless terminal may also be a Personal Communication Service (PCS) phone, a cordless phone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA) or a similar device. A wireless terminal may also be referred to as a system, a Subscriber Unit, a Subscriber Station, a Mobile Station, a Remote Station, a Remote Terminal, an Access Terminal, a User Terminal, a User Agent, or User Equipment, which is not limited herein.
Optionally, the terminal device may also be a smart wearable device, a smart home device, or a vehicle-mounted device. Smart wearable devices may include, but are not limited to: smart headsets, smart bracelets, smart watches, wearable health monitoring devices, and the like, without being exhaustive. Smart home devices may include, but are not limited to: smart televisions, smart speakers, smart rice cookers, smart refrigerators, smart air conditioners, and the like, without being exhaustive. Vehicle-mounted devices may include, but are not limited to: vehicle-mounted speakers, vehicle-mounted praise robots, and the like, without being exhaustive.
The invention applies to any voice response scenario, that is, any scenario in which text data is output in voice form.
In an exemplary scenario, the embodiment of the present invention may be applied to a scenario in which a "praise robot" delivers spoken compliments to a user. The praise robot is a light-hearted, engaging AI application that outputs complimentary speech when it receives an instruction such as "praise me" from the user. It can be applied in any scenario; for example, it can be used in an application program (APP) to compliment a driver-end user or a passenger-end user, or to deliver to one end user, in voice form, the compliments composed for them by the user at the other end.
In another exemplary scenario, the embodiment of the present invention may be applied to a human-computer interaction process between a vehicle-mounted device (or a home smart device, a smart wearable device, etc.) and a user. Illustratively, after the vehicle-mounted device is awakened, when the awakening response voice is output, intelligent response can be realized according to the scheme.
In another exemplary scenario, the embodiment of the present invention may also be applied to a text-to-speech scenario. Illustratively, when the terminal receives an operation instruction for converting characters into voice, voice data (which is used as target response data) can be acquired and output according to the scheme.
However, in any of the foregoing voice response scenarios, prior-art terminals generally synthesize the response voice only with a default single speaker characteristic and a single emotion. This reduces the amount of information the response voice can convey and makes the response voice monotonous, which degrades the intelligent interaction effect and experience.
The technical scheme provided by the invention aims to solve the technical problems in the prior art.
The following describes the technical solutions of the present invention and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
The embodiment of the invention provides an intelligent response method. Referring to fig. 1, the method includes the following steps:
s102, acquiring a response text.
As noted above, the response text takes different forms depending on the voice response scenario to which the present solution is applied. It should be noted that, in some scenarios, the response text is obtained in response to receiving a voice interaction instruction.
The three scenarios described above are now exemplified.
In the scenario of complimenting a user (e.g., a driver-end user or a passenger-end user of the ride-hailing APP), the response text may be praise data for that user. In one possible case, the response text is obtained in response to receiving a voice interaction instruction containing "praise me". Alternatively, in another possible case, the terminal (here the driver end) receives the praise text (i.e., the response text) composed by the passenger-end user for the driver-end user, or, conversely, the terminal (here the passenger end) receives the praise text (i.e., the response text) composed by the driver-end user for the passenger-end user; in these cases the response text is not obtained from a voice interaction instruction containing "praise me".
In this scenario, the response text may be determined automatically by the terminal, or selected manually by the user or by another user (e.g., the peer user). The embodiment of the present invention does not limit the manner in which the response text is acquired. For example, the response text may be "This driver is the sunniest", or "Thank you for taking the trouble to pick me up, rain or shine".
In a human-computer interaction scenario with a vehicle-mounted device, a smart home device, or a smart wearable device, the response text may be a wake-up response phrase or another voice response phrase. In this scenario, the response text is obtained based on the received voice interaction instruction. For example, when a smart speaker receives a wake-up word (as the voice interaction instruction) from the user, the wake-up response phrase (as the response text) may be a short acknowledgement such as "I'm here"; when the smart speaker receives the voice interaction instruction "play music", the response phrase (as the response text) may be "OK, XX music will now be played for you".
In the text-to-speech scene, the response text is target text data selected by the user.
And S104, extracting one or more of the text feature, the first emotion feature and the first style feature of the response text.
In this embodiment, the answer text may be subjected to at least one of extraction of a text feature, extraction of the first emotional feature, and extraction of the first style feature. Wherein the first emotional feature is used for describing the emotional state of the response text; the first style feature is used to describe a language style of the answer text.
Illustratively, the first emotional feature and the first style feature may be expressed in the form of labels. For example, the types of first label included in the first emotional feature may include, but are not limited to: labels characterizing emotions such as joy, anger, sorrow, or sadness; the types of second label included in the first style feature may include, but are not limited to: labels characterizing language styles such as mature and steady, playful and cute, or cool and aloof.
The way in which these features are extracted is described in detail below, and is not expanded here.
And S106, acquiring the object characteristics of the target response object.
In the embodiment of the present invention, there may be a plurality of candidate response objects, each with different object features. Therefore, when this step is performed, a target response object needs to be determined from the candidate response objects, and the object feature of that target response object is then acquired. The target response object may be determined by the terminal itself, or specified by the user (the peer user or the user himself). For example, when a first user compliments a second user, the target response object for the praise data received by the second user may be specified by the first user through operations at the first user's end. Also, after the second user receives the praise data, if the second user is not satisfied with the current target response object, the target response object may be changed.
In the embodiment of the present invention, the object feature may include, but is not limited to, a voice feature; it may further include a facial feature.
Thus, in one possible embodiment of this step, the voice characteristics of the target response object may be obtained. In this manner, voice data can be subsequently synthesized as target response data.
In another possible embodiment of this step, the voice features and facial features of the target responding object may be obtained. In this way, video data containing voice data can be subsequently synthesized as target response data. The implementation mode is more vivid and has better interaction effect.
The manner in which the features of the object are obtained is detailed later.
In addition, it should be noted that, in the embodiment of the present invention, S102 and S104 may be executed sequentially; however, the execution order between S106 and the first two steps is not particularly limited. In an exemplary embodiment, S106 may be performed before S102, simultaneously with S102 or S104, between S102 and S104, or after S104.
S108, generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature; the target response data includes voice data.
And synthesizing voice data or video data based on the characteristics acquired in the previous steps to be used as target response data.
Illustratively, the target response data may be generated based on the text feature, the first emotion feature, the first style feature, and the object feature.
Illustratively, the target response data may be generated based on the text feature, the first emotional feature, and the object feature.
Illustratively, the target response data may be generated based on the text feature, the first style feature, and the object feature.
It is not exhaustive and will be described in detail.
S110, outputting the target response data.
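As a purely illustrative aid (not part of the claimed method), the following Python sketch shows how steps S102 to S110 fit together; the helper callables (text_frontend, emotion_model, style_model, object_db, synthesizer) are hypothetical placeholders for the models described in the embodiments below.

    # Hypothetical sketch of the S102-S110 flow; every callable passed in is a
    # placeholder for one of the models described in the embodiments below.
    def intelligent_response(answer_text, target_object,
                             text_frontend, emotion_model, style_model,
                             object_db, synthesizer):
        # S102: acquire the response text (passed in here as answer_text).
        # S104: extract one or more of the text / emotion / style features.
        text_feat = text_frontend(answer_text)      # text feature
        emotion_feat = emotion_model(text_feat)     # first emotion feature
        style_feat = style_model(text_feat)         # first style feature
        # S106: acquire the object feature of the target response object.
        object_feat = object_db[target_object]
        # S108: generate target response data from at least two of the features.
        target_response = synthesizer(text_feat, emotion_feat,
                                      style_feat, object_feat)
        # S110: output (return) the target response data - voice or video.
        return target_response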
When the scheme shown in fig. 1 is applied to a praise scenario, fig. 2 shows, as an example, a schematic diagram of the intelligent response method in a scenario in which the driver-end user is praised. As shown in fig. 2A, the driver-end user can tap the function control 201 on the driver-end display interface of the ride-hailing APP to enter the praise interface, and the terminal then displays the interface shown in fig. 2B. Fig. 2B is the display interface of the praise function, on which the driver-end user can speak; accordingly, the terminal collects real-time voice data, i.e., performs step S102. After the terminal collects the voice data, steps S104 and S106 may be performed, and it is determined whether the collected voice data contains a specified phrase. If the real-time voice data from the driver-end user is recognized to contain either "praise the driver" or "praise me", the terminal may display the interface shown in fig. 2C. As shown in fig. 2C, the target response data 203 for "praise me" is output on the current interface, namely the voice data "Thank you for taking the trouble to pick me up, rain or shine".
In addition, on the display interface shown in fig. 2B, the driver-end user may tap the praise control 202 to trigger the praise function, so that the terminal displays the interface shown in fig. 2C; details are not repeated. On the display interface shown in fig. 2A, a newly received compliment for the driver can be indicated in the function control 201.
Based on the intelligent response method shown in fig. 1, the embodiment of the invention can acquire one or more of the text feature, the first emotion feature and the first style feature carried in the text data based on the response text, acquire the object feature of the target response object, and then generate the target response data based on these features. Compared with the single emotion and single speaker of prior-art response voices, the target response data output by the terminal in this solution reflects the emotion of the text data and the object features of the target response object; in other words, the target response data conveys richer information, which improves the intelligent interaction effect and makes the intelligent interaction process more flexible and engaging.
On the basis of the embodiment shown in fig. 1, a specific implementation of the embodiment of the present invention will now be described.
In one aspect, one or more of a text feature, a first emotion feature, and a first style feature may be obtained based on the response text.
In an embodiment of the present invention, the text processing may be performed on the response text to obtain the text feature.
Specifically, the text feature refers to a feature vector extracted from the answer text and available for speech synthesis, and the text feature is used for indicating text information in the subsequent target answer data. Specifically, the text processing according to the embodiment of the present invention may include, but is not limited to, one or more of regularization processing, word segmentation processing, part of speech tagging processing, phoneme tagging processing, and prosody analysis processing.
The regularization process converts the response text into text data of a uniform language type. For example, when the response text is "You are the No. 1 most lovable!", it contains both Chinese characters and digits, so the language types are not uniform; regularization can convert the digits into the corresponding Chinese characters, giving a processed response text such as "You are the number one most lovable!". The conversion could also go the other way (characters to digits); the embodiment of the present invention places no particular limitation on the language type after regularization.
The word segmentation process splits the sentence text into individual characters, words and punctuation. Hereinafter, for convenience of description, a segmentation result is simply referred to as a phrase.
Part-of-speech tagging is used to tag the part of speech of each phrase, where the part of speech may include, but is not limited to: adjectives, nouns, verbs, etc., are not intended to be exhaustive.
The phoneme labeling process labels the pinyin (including the pinyin tones) of each phrase. In a specific implementation of this solution, the phoneme labeling process may be realized by a Grapheme-to-Phoneme (G2P) model. Specifically, the G2P model uses a Recurrent Neural Network (RNN) and a Long Short-Term Memory (LSTM) network to convert words into phonemes. The embodiment of the invention does not limit the concrete structure or training method of the G2P model; its input is text data and its output is the phoneme features of the text data.
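For illustration only, the following PyTorch sketch shows one possible shape of such a model; it is a simplified per-character tagger rather than a full sequence-to-sequence G2P system, and all dimensions and names are assumptions, not the patent's implementation.

    import torch
    import torch.nn as nn

    class SimpleG2P(nn.Module):
        # Toy grapheme-to-phoneme tagger: one phoneme prediction per input
        # character. A production G2P model would be a full RNN/LSTM
        # sequence-to-sequence model as described above.
        def __init__(self, n_graphemes, n_phonemes, emb_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(n_graphemes, emb_dim)
            self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                                bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_phonemes)

        def forward(self, grapheme_ids):        # (batch, seq_len) int64 ids
            x = self.embed(grapheme_ids)
            h, _ = self.lstm(x)
            return self.out(h)                  # (batch, seq_len, n_phonemes)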
The prosody analysis processing can be realized by a neural network model, wherein the input of the neural network model is text data, and the output of the neural network model is prosody characteristics of the text data.
After one or more of the above processing steps, the text features corresponding to the response text are obtained. For example, in an embodiment of the present invention, the text data may be passed sequentially through the above processing steps to obtain the text features, as sketched below.
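A minimal sketch of such a linear front end follows; the segment, tag_pos, to_phonemes and prosody callables are hypothetical stand-ins for a word segmenter, a part-of-speech tagger, the G2P model and the prosody model, and only the digit-regularization rule is spelled out.

    import re

    def regularize(text):
        # Example rule only: map Arabic digits to Chinese numerals so that the
        # text uses a single language type.
        digit_map = dict(zip("0123456789", "零一二三四五六七八九"))
        return re.sub(r"\d", lambda m: digit_map[m.group()], text)

    def text_frontend(text, segment, tag_pos, to_phonemes, prosody):
        text = regularize(text)                # regularization
        tokens = segment(text)                 # word segmentation
        pos_tags = tag_pos(tokens)             # part-of-speech labels
        phonemes = to_phonemes(tokens)         # pinyin with tones / phonemes
        prosody_feat = prosody(text)           # prosody features
        return {"tokens": tokens, "pos": pos_tags,
                "phonemes": phonemes, "prosody": prosody_feat}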
On the basis, the text features can be processed by utilizing the trained emotion prediction model to obtain first emotion features; and/or processing the text features by using the trained style prediction model to obtain the first style features.
Specifically, the extraction of the first emotion feature of the response text can be processed by a trained emotion prediction model. The input data of the emotion prediction model is text data. The output data of the emotion prediction model is an emotion feature, which may be specifically an emotion feature vector or an emotion tag (i.e., the aforementioned first tag).
In step S104, the response text or the text feature (vector) obtained by the foregoing processing may be used as an input of the emotion prediction model, and an output of the emotion prediction model may be specifically the first emotion feature. In addition, in a subsequent embodiment of the present invention, the emotion prediction model can be used to obtain a second emotion feature, which will be described in detail later.
Specifically, the emotion prediction model may be a deep learning network model comprising an input layer, a plurality of hidden layers, a bottleneck layer and an output layer. Fig. 3 is a schematic diagram of an emotion prediction model provided by an embodiment of the present invention, and shows hidden layer 1 and hidden layer 2. As shown in fig. 3, the input data of the input layer may be the text features, as before; the hidden layers raise the feature dimension and process the features; the bottleneck layer then reduces the feature dimension of the processed features, which ensures that the emotion feature finally output by the model contains the key information needed to effectively distinguish different emotions, so that the subsequently synthesized target response data (voice or video) carries richer emotional information.
In the scenario shown in fig. 3, "processing the text information" may mean performing the aforementioned text processing on the text information to obtain the text features and feeding the text features into hidden layer 1.
In some possible embodiments, the text data may be directly used as the input of the emotion prediction model, and the emotion prediction model performs the text processing to directly obtain the emotion characteristics. The model structure in this scenario is similar to that of fig. 3 and will not be described in detail.
The emotion prediction model can be trained well in advance, and the model training process is not detailed in the embodiment of the invention. The trained emotion prediction model can be stored in a storage position which can be read by the terminal; or the model can be deployed on the line, and the terminal can directly call the model to realize emotion recognition.
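As an illustration only, the following PyTorch sketch mirrors the described topology (input layer, hidden layers that raise the dimension, a bottleneck that compresses it, and an output layer); the dimensions and label counts are assumptions. The style prediction model described next can reuse the same structure with style labels as targets.

    import torch.nn as nn

    class EmotionPredictor(nn.Module):
        # Input layer -> hidden layers (raise the dimension) -> bottleneck
        # (compress to the emotion feature) -> output layer (emotion labels).
        def __init__(self, text_dim=256, n_emotions=4, bottleneck_dim=32):
            super().__init__()
            self.hidden = nn.Sequential(
                nn.Linear(text_dim, 512), nn.ReLU(),   # hidden layer 1
                nn.Linear(512, 512), nn.ReLU(),        # hidden layer 2
            )
            self.bottleneck = nn.Linear(512, bottleneck_dim)
            self.output = nn.Linear(bottleneck_dim, n_emotions)

        def forward(self, text_feat):
            h = self.hidden(text_feat)
            emotion_feat = self.bottleneck(h)   # compact first emotion feature
            logits = self.output(emotion_feat)  # emotion label scores
            return emotion_feat, logits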
Similarly, the text features may be processed by a trained style prediction model to obtain the first style features. The input data of the style prediction model is text data. The output data of the style prediction model is a style feature, which may be a style feature vector or a style label (i.e., the second label mentioned above).
In step S104, the response text or the text features (vectors) obtained by the foregoing processing may be used as the input of the style prediction model, and the output of the style prediction model may be specifically the first style feature.
Fig. 4 is a schematic structural diagram of the style prediction model provided in the present application. As shown in fig. 4, the style prediction model may be a deep learning network model comprising an input layer, a plurality of hidden layers, a bottleneck layer and an output layer; two hidden layers (hidden layer 1 and hidden layer 2) are shown in fig. 4 as an example. After the response text undergoes text processing, the resulting text features are fed as input data into the deep learning network model through the input layer; the hidden layers raise the feature dimension and process the features; the bottleneck layer then reduces the feature dimension of the processed features, which ensures that the style feature finally output by the model contains the key information needed to effectively distinguish different styles, so that the subsequently synthesized target response data (audio or video) carries the desired style information.
In some possible embodiments, the answer text data may also be directly used as an input of the style prediction model, and the style prediction model performs the text processing to directly obtain the style characteristics. The model structure in this scenario is similar to that of fig. 4 and will not be described in detail.
The style prediction model can be trained well in advance, and the model training process is not detailed in the embodiment of the application. The trained style prediction model can be stored in a readable storage position of the terminal equipment; or the model can be deployed on the line, and the terminal device can directly call the model to realize style recognition.
On the other hand, it is also necessary to acquire the object characteristics of the target response object. As before, the object features may be speech features; or voice features and facial features.
The target response object is explained first. In the embodiment of the present invention, the target response object may be an avatar or a person. The avatar may be a custom-designed figure or the likeness of a person. By way of example, fig. 5 shows a schematic diagram of an avatar used in a praise scenario in a ride-hailing APP. The person may be a public figure or a user. It is understood that when a public figure is the target response object, the enterprise operating this solution should have that public figure's authorization. The user may be the user himself or another user.
The target response object can be determined by the terminal in a self-defining mode, or can be selected or switched by the user subjectively.
Take the example of a driver-end user complimenting a passenger-end user. In one embodiment of this scenario, a preset avatar may be used as the target response object. In another embodiment, the passenger-end user (with the user's authorization) or an avatar of the passenger-end user may be the target response object. In yet another embodiment, the driver-end user (with the user's authorization) or an avatar of the driver-end user may be the target response object.
Alternatively, when the driver-end user is not satisfied with the currently selected (or default) target response object, the terminal can be operated to switch the target response object. In other words, the terminal may also receive an object switching instruction and obtain the one or more candidate response objects indicated by the object switching instruction, thereby obtaining the target response object.
Illustratively, fig. 6 shows one possible implementation in this scenario, in which fig. 6A shows the praise display interface on which the passenger-end user compliments the driver-end user. An object control 601 is displayed on the interface of fig. 6A, and within the object control 601 the user can slide left or right to switch the target response object. As shown in fig. 6A, if operation information indicating that the user slid leftward in the object control 601 is received, the terminal may recognize that operation information as an object switching instruction and switch the target response object on the praise display interface. The praise display interface after switching is shown in fig. 6B: the target response object determined in the object control 601 has been switched from the virtual object in fig. 6A to the person in fig. 6B.
In addition, fig. 6 also shows the response text 602; as shown in fig. 6A, the current response text is "This driver is the sunniest, warmest, kindest and most considerate!".
On the basis of fig. 6, the object control 601 in the embodiment of the present invention can also implement a phrase switching function. In an exemplary embodiment, the user can tap or long-press the object control 601, and the terminal recognizes that operation information as a phrase switching instruction. On receiving the phrase switching instruction, the terminal switches the praise phrase on the current praise display interface. When the user then taps the send control 603, the switched praise phrase is sent to the driver end; when the driver end outputs it, that phrase is the driver end's response text.
It should be noted that the manner of determining the target response object shown in fig. 6 is only one possible implementation of the embodiment of the present invention; in an actual implementation there may be various other selection or indication manners. For example, a plurality of candidate response objects may be displayed on the praise display interface, and the user may select one or more of them as the target response object.
When the target response object consists of a plurality of candidate response objects, the object features of these candidate response objects need to be fused, and the fused feature is used as the object feature of the target response object.
Based on the difference of the target response objects, the scheme can be at least realized in the following way when the object characteristics (voice characteristics or facial characteristics) are acquired.
In one possible embodiment, the object characteristics of each candidate responder may be pre-stored. In this way, when the step is executed, it is only necessary to acquire the target response object determined in the candidate response objects and extract the object feature of the target response object. The implementation mode is simple and reliable, and is beneficial to shortening the processing time, improving the processing efficiency and further being beneficial to improving the response efficiency.
In another possible embodiment, the object characteristics may be obtained by obtaining historical data of the target response object and performing characteristic extraction on the historical data. In particular, feature extraction may be achieved by a trained deep learning model.
When the voice feature of the target response object is obtained, the historical voice data of the target response object can be obtained, and therefore the trained voiceprint recognition model is used for processing the historical voice data to obtain the voice feature.
The input of the voiceprint recognition model is voice data, and its output is a voice feature. In the embodiment of the invention, the voice feature is independent of the text, external noise and the like, and is used to distinguish different speakers (response objects). Specifically, the voice features involved in the embodiments of the present invention may include, but are not limited to, a timbre feature. In addition, the voice features may include one or more of a pitch feature, a loudness feature, an intonation feature, a tone-of-voice feature and a language type feature.
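For illustration, a toy d-vector-style speaker encoder is sketched below: mel-spectrogram frames of the historical voice data go in and a single text-independent voice-feature vector comes out. This is an assumed stand-in, not the patent's voiceprint recognition model; real systems are trained on large speaker-verification corpora.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VoiceprintEncoder(nn.Module):
        def __init__(self, n_mels=80, hidden=256, embed_dim=128):
            super().__init__()
            self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
            self.proj = nn.Linear(hidden, embed_dim)

        def forward(self, mel_frames):          # (batch, time, n_mels)
            h, _ = self.rnn(mel_frames)
            pooled = h.mean(dim=1)              # average over time
            embedding = self.proj(pooled)
            # L2-normalise so embeddings from different recordings of the same
            # speaker are directly comparable.
            return F.normalize(embedding, dim=-1)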
When the facial features of the target response object are obtained, historical facial data of the target response object can be collected in the sounding process of the target response object, and therefore the trained facial recognition model is used for processing the historical facial data to obtain the facial features.
It should be noted that, when the target response object consists of a plurality of candidate response objects, one implementation is to obtain the object features of the candidate response objects and then fuse them with a feature fusion model, performing the subsequent steps with the fused object feature, as sketched below. In another implementation, after the history data of the candidate response objects is acquired, that history data may be combined and fed into the voiceprint recognition model (or the face recognition model), so that the object feature output by the model is already the fused object feature of the candidate response objects.
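A minimal sketch of the first implementation (fusing already-extracted object features) follows; the plain weighted mean here is an assumption standing in for the feature fusion model.

    import torch

    def fuse_object_features(embeddings, weights=None):
        # embeddings: list of per-candidate object feature tensors, all of the
        # same dimension; returns one fused object feature for the target.
        stacked = torch.stack(embeddings)            # (n_candidates, dim)
        if weights is None:
            return stacked.mean(dim=0)
        w = torch.tensor(weights, dtype=stacked.dtype).unsqueeze(1)
        return (w * stacked).sum(dim=0) / w.sum()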
In addition, the voiceprint recognition model and the face recognition model can be trained well in advance, and the model training process is not detailed in the embodiment of the invention. The trained voiceprint recognition model or the trained face recognition model can be stored in a storage position which can be read by the terminal; or, the models can be deployed on the online, and the terminal can directly call the models to realize the identification of the object features.
Based on the foregoing processing, one or more of the text feature, the first emotion feature and the first style feature, together with the object feature (the voice feature and possibly the facial feature), are obtained. Then, in performing S108, at least two of the text feature, the first emotion feature, the first style feature and the object feature may be processed by a generative model to obtain the target response data. In particular, the generative model may be an end-to-end recurrent network model.
Illustratively, FIGS. 7 and 8 show a schematic diagram of a generative model. Fig. 7 and 8 are examples of a scene of "generating target response data based on the text feature, the first emotional feature, and the object feature".
The target response data generated by the generative model shown in fig. 7 is voice data. As shown in fig. 7, the generative model is an end-to-end recurrent network model that specifically includes an encoder, a concatenation module, an attention mechanism module, a decoder and a vocoder.
The text features are fed into the encoder, which encodes them. The encoded text features are concatenated with the first emotion feature and the object feature (here the voice feature) in the concatenation module. Under the action of the attention mechanism, the concatenated features are then fed into the decoder, which processes them and outputs acoustic features to the vocoder. Finally, the vocoder synthesizes voice data as the target response data.
The target response data generated by the generative model shown in fig. 8 is voice data or video data. Compared with fig. 7, the generative model shown in fig. 8 further includes an image synthesizer. In the embodiment shown in fig. 8, if the object feature includes only the voice feature, each processing module works as in fig. 7 and the vocoder synthesizes voice data as the target response data. If the object feature also includes a facial feature, the object feature is concatenated with the other features in the concatenation module; the decoder then processes the concatenated features and outputs both acoustic features and image features, so that the acoustic features are fed into the vocoder, which outputs voice data, while the image features are fed into the image synthesizer, which outputs image data. The generated voice is then combined with the images as the finally output video data (the target response data).
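Purely as a structural illustration of the fig. 7-style pipeline (encoder, concatenation, attention, decoder, vocoder input), the PyTorch sketch below broadcasts the emotion and object features over the encoded text sequence and decodes mel-spectrogram frames for a vocoder. It is non-autoregressive and all dimensions are assumptions; the patent's model is an end-to-end recurrent network.

    import torch
    import torch.nn as nn

    class ResponseSynthesizer(nn.Module):
        def __init__(self, text_dim=256, cond_dim=160, model_dim=256, n_mels=80):
            super().__init__()
            self.encoder = nn.GRU(text_dim, model_dim, batch_first=True)
            self.cond_proj = nn.Linear(model_dim + cond_dim, model_dim)
            self.attention = nn.MultiheadAttention(model_dim, num_heads=4,
                                                   batch_first=True)
            self.decoder = nn.GRU(model_dim, model_dim, batch_first=True)
            self.to_mel = nn.Linear(model_dim, n_mels)   # acoustic features

        def forward(self, text_feat, emotion_feat, object_feat, n_frames):
            # text_feat: (B, T_text, text_dim); the emotion and object feature
            # dimensions must sum to cond_dim.
            enc, _ = self.encoder(text_feat)
            cond = torch.cat([emotion_feat, object_feat], dim=-1)
            cond = cond.unsqueeze(1).expand(-1, enc.size(1), -1)  # broadcast in time
            memory = self.cond_proj(torch.cat([enc, cond], dim=-1))
            queries = torch.zeros(enc.size(0), n_frames, memory.size(-1),
                                  device=memory.device)
            ctx, _ = self.attention(queries, memory, memory)
            dec, _ = self.decoder(ctx)
            return self.to_mel(dec)   # mel frames to hand to a vocoder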
Fig. 9 is a schematic structural diagram of a generative model provided in the present application; the target response data it generates is voice data. As shown in fig. 9, the generative model is an end-to-end recurrent network model that specifically includes an encoder, a concatenation module, an attention mechanism module, a decoder and a vocoder. The text features are fed into the encoder, which encodes them; the encoded text features are concatenated with the first style feature and the object feature (here the voice feature) in the concatenation module; under the action of the attention mechanism, the concatenated features are fed into the decoder, which processes them and outputs acoustic features to the vocoder. Finally, the vocoder synthesizes voice data as the target response data.
Fig. 10 is a schematic structural diagram of another generative model provided in the present application; the target response data it generates is voice data or video data. Compared with fig. 9, the generative model shown in fig. 10 further includes an image synthesizer. In the embodiment shown in fig. 10, if the object feature includes only the voice feature, each processing module works as in fig. 9 and the vocoder synthesizes voice data as the target response data. If the object feature also includes a facial feature, the object feature is concatenated with the other features in the concatenation module; the decoder then processes the concatenated features and outputs both acoustic features and image features, so that the acoustic features are fed into the vocoder, which outputs voice data, while the image features are fed into the image synthesizer, which outputs image data. The generated voice is then combined with the images as the finally output video data (the target response data).
For example, fig. 11 and 12 are schematic structural diagrams of a generative model provided in the present application.
As shown in fig. 11, compared with the embodiments shown in fig. 7 and fig. 9, the concatenation module in the embodiment of fig. 11 concatenates the text feature, the first emotion feature, the first style feature and the object feature, and the processed data is synthesized by the vocoder into the target response data.
As shown in fig. 12, compared with the embodiments shown in fig. 8 and fig. 10, the concatenation module in the embodiment of fig. 12 concatenates the text feature, the first emotion feature, the first style feature and the object feature; after processing by each module (as described above and not repeated here), the voice data synthesized by the vocoder is combined with the image data output by the image synthesizer, and video data is output as the target response data.
The intelligent response method provided by the embodiment of the invention can be triggered by a voice interaction instruction received from the user, that is, the response text is acquired in response to the received voice interaction instruction. In this case, in addition to generating the target response data in the aforementioned manner, the emotional state of the voice interaction instruction may further be taken into account when generating the target response data.
Specifically, a second emotion feature and/or a second style feature of the voice interaction instruction may be obtained. Then, a third emotional feature corresponding to the second emotional feature is obtained, and/or a third style feature corresponding to the second style feature is obtained. Further, target response data is generated based on at least two of the text feature, the first emotional feature, the object feature, the third emotional feature, and the third style feature.
For example, a second emotion feature of the voice interaction instruction may be acquired, then a third emotion feature corresponding to the second emotion feature may be acquired, and then the target response data may be generated based on the text feature, the first emotion feature, the object feature, and the third emotion feature.
When the second emotion feature is obtained, in a possible embodiment, the voice interaction instruction may be converted into an instruction text, and then the instruction text is input into the emotion prediction model for processing, so as to obtain the second emotion feature. Alternatively, in another possible embodiment, the speech interactive instruction can be processed by using a trained speech emotion prediction model, wherein the speech emotion prediction model inputs speech data and outputs emotion characteristics. Therefore, the second emotion characteristic can be obtained only by inputting the voice interaction instruction into the voice emotion prediction model.
The second emotional feature reflects the emotional state of the user, so when the terminal responds to the user it can select an appropriate emotion for the reply according to a preset emotion correspondence (a correspondence between second emotional features and third emotional features). The emotion correspondence may be configured in advance, and the embodiment of the present invention places no particular limitation on the specific correspondence. The second emotional feature may therefore be the same as or different from the third emotional feature. For example, if the second emotional feature is anger, the corresponding third emotional feature may be soothing, in which case the second and third emotional features differ. As another example, if the second emotional feature is excitement and the corresponding third emotional feature is also excitement, the two emotional features are the same. An illustrative mapping is sketched below.
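The mapping could be as simple as a lookup table; the entries below are assumptions, not values defined by the patent, and the style correspondence of the later embodiment can be configured the same way.

    # Hypothetical preset emotion correspondence: second emotion feature of the
    # user's instruction -> third emotion feature used for the reply.
    EMOTION_RESPONSE_MAP = {
        "anger": "soothing",        # angry user -> placating reply
        "sadness": "comforting",
        "excitement": "excitement"  # excited user -> equally upbeat reply
    }

    def third_emotion(second_emotion, default="neutral"):
        return EMOTION_RESPONSE_MAP.get(second_emotion, default)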
After the third emotional feature is obtained, the third emotional feature can be used as one input of the synthesis model, and is spliced with other features in the splicing module, and then the subsequent processing is carried out.
And after the target response data are obtained, the target response data can be directly output.
For example, a second style feature of the voice interaction instruction may be acquired, then a third style feature corresponding to the second style feature may be acquired, and then the target response data may be generated based on the text feature, the first style feature, the object feature, and the third style feature.
When the second style characteristic is obtained, in a possible embodiment, the voice interaction instruction may be converted into an instruction text, and then the instruction text is input into the style prediction model for processing, so as to obtain the second style characteristic. Alternatively, in another possible embodiment, the voice interaction instruction may be processed by using a trained style prediction model, wherein the input of the style prediction model is voice data, and the output is style characteristics. Thus, the second style characteristic can be obtained only by inputting the voice interaction instruction into the style prediction model.
The second style feature reflects the style of the user, so when the terminal device responds to the user it can select an appropriate style for the reply according to a preset style correspondence (a correspondence between second style features and third style features). The style correspondence may be configured in advance, and the embodiment of the present application places no particular limitation on the specific correspondence. The second style feature may therefore be the same as or different from the third style feature. For example, if the second style feature is "mature and steady", the corresponding third style feature may be "playful and cute", in which case the two differ. As another example, if the second style feature is "cool and aloof" and the corresponding third style feature is also "cool and aloof", the two are the same.
After the third style characteristic is obtained, the third style characteristic can be used as one input of the synthesis model, and the third style characteristic is spliced with other characteristics in a splicing module and then subjected to subsequent processing.
And after the target response data are obtained, the target response data can be directly output.
In an embodiment, when the target response data is output, it is also possible to detect whether an output environment of the target response data is available. Thus, when the output environment is available, the target response data is directly output. Conversely, when the output environment is not available, the target response data is output at the target time.
Wherein, the target moment includes: a time instant when the output environment is available is detected. In other words, if the output environment is not available, it may be continuously detected whether the output environment is available until the output environment is detected to be available, that is, the target response data is output.
Alternatively, the target time may be a time separated from the current time by a preset waiting duration. In other words, if the output environment is not available, the target response data may be output after waiting for the preset duration. In this embodiment, the waiting duration may be preset as needed; for example, it may be 1 hour, or 5 minutes, and so on, without being exhaustive or limiting. Note that, in this embodiment, after the preset waiting duration has elapsed, the target response data is output directly, without checking again whether the output environment is available.
In this embodiment, prompt information may also be output to prompt the output time of the target response data. For example, the prompt message may be: since the current output environment is not available, the target response data will be played for you after 5 minutes. Further, the user can also operate the prompt message to cancel or change the output time of the target response data.
Alternatively, the target time may be a preset time. In other words, if the output environment is not available, the target response data may be output at the preset time. As in the previous embodiment, the preset time is configured as needed, and at the preset time the target response data is output directly without checking again whether the output environment is available. Similarly, a prompt may be output, for example: "The current output environment is unavailable; the target response data will be played for you at 15:00." The user may likewise operate on the prompt to cancel or change the output time of the target response data.
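The target-time strategies above can be illustrated with the following sketch; output_available and play are hypothetical callables, and the constants are arbitrary.

    import time

    def output_at_target_time(target_response, play, output_available,
                              strategy="poll", wait_seconds=300,
                              poll_every=5):
        if output_available():
            play(target_response)      # environment available: output directly
            return
        if strategy == "poll":
            # Keep checking until the output environment becomes available.
            while not output_available():
                time.sleep(poll_every)
            play(target_response)
        else:
            # Fixed wait (or a preset clock time): play without re-checking.
            time.sleep(wait_seconds)
            play(target_response)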
And detecting whether the output environment is available can be implemented according to one or more of the following ways:
In one possible embodiment, the vehicle motion status is obtained when the target response data is associated with the vehicle. Thus, when the vehicle motion state is the running state, the output environment is not available; on the contrary, when the vehicle motion state is the non-driving state, the output environment is available.
In another possible embodiment, the current multimedia output state may also be obtained, the multimedia including: video or audio. Thus, when the multimedia output state is the play state, the output environment is not available; otherwise, if there is no multimedia output currently, the output environment is available.
For example, in the praise scenario for the driver end described above, if the current terminal is the driver end and the driver end is currently in the vehicle-driving state, the current output environment may be determined to be unavailable, to avoid the target response data interfering with the driver. Likewise, if the driver end is playing video or music, the current output environment is unavailable.
As another example, in one possible scenario the user utters the "praise me" voice, but before the terminal outputs the target response data the user opens a video; the terminal is then in a multimedia playing state, and the current output environment is unavailable. The video opened by the user may be a video in the current application (APP) or in another APP, which is not particularly limited.
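An illustrative combination of the two checks (vehicle motion state and multimedia output state) is sketched below; the state values are assumptions.

    def output_available(vehicle_state=None, multimedia_playing=False):
        # Unavailable while the associated vehicle is driving or while video or
        # audio is already playing; available otherwise.
        if vehicle_state == "driving":
            return False
        if multimedia_playing:
            return False
        return True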
It is to be understood that some or all of the steps or operations in the above-described embodiments are merely examples, and other operations or variations of various operations may be performed by the embodiments of the present application. Further, the various steps may be performed in a different order presented in the above-described embodiments, and it is possible that not all of the operations in the above-described embodiments are performed.
The words used in this application are words of description only and not of limitation of the claims. As used in the description of the embodiments and the claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this application is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, the terms "comprises" and/or "comprising", when used in this application, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Based on the intelligent response method provided by the method embodiment, the embodiment of the invention further provides device embodiments for realizing the steps and the method in the method embodiment.
An embodiment of the present invention provides a terminal. Referring to fig. 13, the terminal 1300 includes: a processing module 132 and a transceiver module 134;
wherein, the processing module 132 is configured to:
acquiring a response text;
extracting one or more of text features, first emotion features and first style features of the response text;
acquiring object characteristics of a target response object;
generating target response data based on at least two of the text feature, the first emotional feature, the first style feature and the object feature; the target response data comprises voice data;
and a transceiver module 134 for outputting the target response data.
In an embodiment of the present invention, the processing module 132 is specifically configured to:
performing text processing on the response text to obtain the text features;
processing the text features by using the trained emotion prediction model to obtain first emotion features; the first emotional feature is used for describing the emotional state of the response text;
processing the text features by using a trained style prediction model to obtain the first style features; the first style feature is used for describing a language style of the response text.
Wherein the text processing comprises: one or more of regularization processing, word segmentation processing, part-of-speech tagging processing, phoneme tagging processing, and prosody analysis processing.
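The listed text-processing steps can be pictured with the minimal sketch below; every routine is a trivial placeholder standing in for a real normalizer, tokenizer, part-of-speech tagger, grapheme-to-phoneme converter, and prosody analyzer, and the function name process_text is an assumption for illustration:

```python
import re

def process_text(response_text: str) -> dict:
    """Trivial placeholders for the listed steps; a real system would use proper
    normalization, segmentation, tagging, phoneme conversion, and prosody models."""
    normalized = re.sub(r"\s+", " ", response_text.strip())      # regularization (whitespace cleanup as a stand-in)
    tokens = normalized.split(" ")                                # word segmentation (whitespace split as a stand-in)
    pos_tags = [(tok, "UNK") for tok in tokens]                   # part-of-speech tagging placeholder
    phonemes = [list(tok.lower()) for tok in tokens]              # phoneme tagging placeholder (characters as a stand-in)
    prosody = {"phrase_count": normalized.count(",") + 1}         # prosody analysis placeholder
    return {"tokens": tokens, "pos": pos_tags, "phonemes": phonemes, "prosody": prosody}

# Example: process_text("Hello, how can I help you today?")
```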
The emotion prediction model and/or the style prediction model are/is a deep learning network model, and the deep learning network model comprises an output layer, a bottleneck layer, a plurality of hidden layers and an input layer.
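The described network shape (an input layer, several hidden layers, a bottleneck layer, and an output layer) can be sketched as follows, here in PyTorch; all dimensions, the number of hidden layers, and the class counts are assumptions made for illustration rather than values taken from this application:

```python
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    """Network with an input layer, several hidden layers, a bottleneck layer and an
    output layer; all sizes and class counts are assumptions for this sketch."""
    def __init__(self, in_dim=256, hidden_dim=512, bottleneck_dim=64, out_dim=8, num_hidden=3):
        super().__init__()
        layers = [nn.Linear(in_dim, hidden_dim), nn.ReLU()]           # input layer
        for _ in range(num_hidden - 1):                               # additional hidden layers
            layers += [nn.Linear(hidden_dim, hidden_dim), nn.ReLU()]
        layers += [nn.Linear(hidden_dim, bottleneck_dim), nn.ReLU()]  # bottleneck layer
        layers += [nn.Linear(bottleneck_dim, out_dim)]                # output layer
        self.net = nn.Sequential(*layers)

    def forward(self, text_features):
        return self.net(text_features)

# One instance per task (class counts are illustrative assumptions).
emotion_model = PredictionModel(out_dim=8)   # predicts the first emotion feature
style_model = PredictionModel(out_dim=5)     # predicts the first style feature
text_feat = torch.randn(1, 256)              # placeholder text features
first_emotion = emotion_model(text_feat)
first_style = style_model(text_feat)
```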
In another embodiment of the present invention, the processing module 132 is specifically configured to: acquire the voice features of the target response object; or acquire the voice features of the target response object and the facial features of the target response object.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
acquiring historical voice data of a target response object;
and processing historical voice data by using the trained voiceprint recognition model to obtain voice characteristics.
Wherein the voice features include: a timbre characteristic; the voice features further include: one or more of a tone characteristic, a loudness characteristic, an intonation characteristic, a mood characteristic, and a language type characteristic.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
collecting historical face data of the target response object while the target response object is speaking;
and processing the historical face data by using the trained face recognition model to obtain the facial features.
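A minimal sketch of deriving object features from historical data is shown below; the trained voiceprint and face recognition models are passed in as opaque callables, and the averaging of per-clip embeddings is an assumption made for the example:

```python
import numpy as np

def extract_voice_features(historical_speech, voiceprint_model):
    """Average a voiceprint embedding over the object's historical utterances;
    voiceprint_model is an opaque callable returning an embedding per clip."""
    embeddings = [voiceprint_model(clip) for clip in historical_speech]
    return {"timbre_embedding": np.mean(embeddings, axis=0)}

def extract_face_features(historical_frames, face_model):
    """Average face embeddings collected while the object is speaking;
    face_model is an opaque callable returning an embedding per frame."""
    return np.mean([face_model(frame) for frame in historical_frames], axis=0)

# Example with dummy models standing in for the trained recognizers.
dummy_voiceprint = lambda clip: np.ones(64)
dummy_face = lambda frame: np.ones(128)
voice_feat = extract_voice_features([np.zeros(16000)] * 3, dummy_voiceprint)
face_feat = extract_face_features([np.zeros((64, 64))] * 5, dummy_face)
```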
In another embodiment of the present invention, the transceiver module 134 is further configured to: receiving an object switching instruction;
at this time, the processing module 132 is specifically configured to: and acquiring one or more candidate response objects indicated by the object switching instruction to obtain a target response object.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
processing at least two of the text feature, the first emotion feature, the first style feature and the object feature by using the generation model to obtain the target response data; wherein the generation model is an end-to-end recurrent network model.
In another embodiment of the present invention, the target response data is: voice data or video data.
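The end-to-end recurrent generation step described above can be sketched as follows; conditioning a GRU on the concatenated features and emitting mel-spectrogram frames is one plausible reading, and the dimensions, frame count, and mel output are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class ResponseGenerator(nn.Module):
    """Conditions a GRU decoder on the concatenated text, emotion, style and object
    features and emits acoustic frames; dimensions and the mel output are assumptions."""
    def __init__(self, feat_dim=256 + 8 + 5 + 64, hidden_dim=512, mel_dim=80):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, mel_dim)

    def forward(self, text_feat, emotion_feat, style_feat, object_feat, num_frames=200):
        cond = torch.cat([text_feat, emotion_feat, style_feat, object_feat], dim=-1)
        seq = cond.unsqueeze(1).repeat(1, num_frames, 1)  # repeat conditioning over time (simplification)
        hidden, _ = self.rnn(seq)
        return self.proj(hidden)  # frames to be converted to speech by a vocoder

gen = ResponseGenerator()
mel = gen(torch.randn(1, 256), torch.randn(1, 8), torch.randn(1, 5), torch.randn(1, 64))
```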
In another embodiment of the invention, the response text is obtained in response to receiving the voice interaction instruction.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
acquiring a second emotion characteristic and/or a second style characteristic of the voice interaction instruction;
acquiring a third emotional feature corresponding to the second emotional feature; and/or acquiring a third style characteristic corresponding to the second style characteristic;
and generating target response data based on at least two of the text feature, the first emotion feature, the first style feature, the object feature, the third emotion feature, and the third style feature.
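One simple way to realize the correspondence between the second and third emotion/style features is a lookup table; the categories and pairings below are illustrative assumptions only, not the mapping disclosed by this application:

```python
# Categories and pairings are illustrative assumptions.
EMOTION_RESPONSE_MAP = {
    "sad": "comforting",
    "angry": "calm",
    "happy": "cheerful",
    "neutral": "friendly",
}
STYLE_RESPONSE_MAP = {
    "formal": "formal",
    "casual": "casual",
    "humorous": "playful",
}

def select_response_features(second_emotion: str, second_style: str):
    """Map the features detected in the user's instruction (second features) to the
    features used when generating the reply (third features)."""
    third_emotion = EMOTION_RESPONSE_MAP.get(second_emotion, "friendly")
    third_style = STYLE_RESPONSE_MAP.get(second_style, "casual")
    return third_emotion, third_style
```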
In another embodiment of the present invention, the processing module 132 is specifically configured to: detecting whether an output environment of the target response data is available;
at this time, the transceiver module 134 is specifically configured to: output the target response data at a target time when the output environment is unavailable; wherein the target time includes: the time at which the output environment is detected to be available, the time separated from the current time by a preset waiting period, or a preset time.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
when the target response data is associated with the vehicle, acquiring a vehicle motion state;
when the vehicle motion state is a driving state, the output environment is not available.
In another embodiment of the present invention, the processing module 132 is specifically configured to:
acquiring a current multimedia output state;
when the multimedia output state is the play state, the output environment is not available.
The terminal 1300 in the embodiment shown in fig. 13 may be configured to execute the technical solutions of the above method embodiments; for the implementation principle and technical effects, reference may be made to the relevant description in the method embodiments.
It should be understood that the division of the terminal 1300 shown in fig. 13 into modules is merely a logical division; in an actual implementation, the modules may be wholly or partially integrated into one physical entity or may be physically separated. These modules may all be implemented as software invoked by a processing element, may all be implemented as hardware, or some may be implemented as software invoked by a processing element while others are implemented as hardware. For example, the processing module 132 may be a separately provided processing element, may be integrated into a chip of the terminal 1300, or may be stored in a memory of the terminal 1300 in the form of a program that a processing element of the terminal 1300 calls to execute the functions of the above modules. The other modules are implemented similarly. In addition, all or some of these modules may be integrated together or implemented independently. The processing element referred to here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be implemented by an integrated logic circuit of hardware in a processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more digital signal processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented by a processing element scheduling a program, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling a program. As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
In addition, an embodiment of the present invention provides a terminal. Referring to fig. 14, the terminal 1300 includes:
a memory 1310;
a processor 1320; and
a computer program;
wherein the computer program is stored in the memory 1310 and configured to be executed by the processor 1320 to implement the methods of the embodiments described above.
The number of the processors 1320 in the terminal 1300 may be one or more, and the processors 1320 may also be referred to as processing units, which may implement a certain control function. Processor 1320 may be a general purpose processor or a special purpose processor, etc. In an alternative design, processor 1320 may also store instructions that are executable by processor 1320 to cause terminal 1300 to perform the methods described in the method embodiments above.
In yet another possible design, terminal 1300 may include circuitry that may perform the functions of transmitting or receiving or communicating in the foregoing method embodiments.
Optionally, the number of the memories 1310 in the terminal 1300 may be one or more, and the memory 1310 stores instructions or intermediate data, and the instructions can be executed on the processor 1320, so that the terminal 1300 performs the method described in the above method embodiments. Optionally, other related data may also be stored in the memory 1310. Optionally, instructions and/or data may also be stored in the processor 1320. The processor 1320 and the memory 1310 may be provided separately or may be integrated together.
In addition, as shown in fig. 14, a transceiver 1330 is further disposed in the terminal 1300, where the transceiver 1330 may be referred to as a transceiver unit, a transceiver circuit, a transceiver, or the like, and is used for data transmission or communication with the test device or other terminal devices, and details are not repeated here.
As shown in fig. 14, the memory 1310, the processor 1320, and the transceiver 1330 are connected by a bus and communicate.
If the terminal 1300 is configured to implement the method corresponding to fig. 1, the target response data may be output by the transceiver 1330, for example. And the processor 1320 is configured to perform corresponding determination or control operations, and optionally, may store corresponding instructions in the memory 1310. The specific processing manner of each component can be referred to the related description of the previous embodiment.
Furthermore, an embodiment of the present invention provides a readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method according to the method embodiment.
Since each module in this embodiment can execute the method shown in the method embodiment, reference may be made to the related description of the method embodiment for a part not described in detail in this embodiment.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (19)

1. An intelligent answering method, comprising:
acquiring a response text;
extracting one or more of text features, first emotion features and first style features of the response text;
acquiring object characteristics of a target response object;
generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature; the target response data comprises voice data;
and outputting the target response data.
2. The method of claim 1, wherein the extracting one or more of a text feature, a first emotion feature and a first style feature of the response text comprises:
performing text processing on the response text to obtain the text characteristics;
processing the text features by using a trained emotion prediction model to obtain the first emotion features; the first emotional feature is used for describing the emotional state of the response text;
processing the text features by using a trained style prediction model to obtain the first style features; the first style feature is used for describing a language style of the response text.
3. The method of claim 2, wherein the textual processing comprises: one or more of regularization processing, word segmentation processing, part-of-speech tagging processing, phoneme tagging processing, and prosody analysis processing.
4. The method of claim 2, wherein the emotion prediction model and/or the style prediction model is a deep learning network model, and the deep learning network model comprises an output layer, a bottleneck layer, a plurality of hidden layers and an input layer.
5. The method of claim 1, wherein the obtaining of the object characteristics of the target response object comprises:
acquiring voice characteristics of the target response object;
alternatively,
acquiring voice features of the target response object, and acquiring facial features of the target response object.
6. The method of claim 5, wherein the obtaining the voice feature of the target response object comprises:
acquiring historical voice data of the target response object;
and processing the historical voice data by using the trained voiceprint recognition model to obtain the voice characteristics.
7. The method of claim 6, wherein the speech features comprise: a timbre characteristic;
the speech features further comprise: one or more of a tone characteristic, a loudness characteristic, an intonation characteristic, a mood characteristic, and a language type characteristic.
8. The method of claim 5, wherein the obtaining facial features of the target response object comprises:
acquiring historical face data of the target response object while the target response object is speaking;
and processing the historical face data by using the trained face recognition model to obtain the facial features.
9. The method according to claim 1 or 5, characterized in that the method further comprises:
receiving an object switching instruction;
and acquiring one or more candidate response objects indicated by the object switching instruction to obtain the target response object.
10. The method of any one of claims 1-8, wherein generating target response data based on at least two of the text feature, the first emotion feature, the first style feature, and the object feature comprises:
processing at least two of the text feature, the first emotion feature, the first style feature and the object feature by using a generation model to obtain the target response data;
wherein the generation model is an end-to-end recurrent network model.
11. The method of claim 10, wherein the target response data is: voice data or video data.
12. The method of any of claims 1-8, wherein the response text is obtained in response to receiving a voice interaction instruction.
13. The method of claim 12, wherein generating target response data based on at least two of the text feature, the first emotion feature, the first style feature, and the object feature comprises:
acquiring a second emotion characteristic and/or a second style characteristic of the voice interaction instruction;
acquiring a third emotional feature corresponding to the second emotional feature; and/or acquiring a third style characteristic corresponding to the second style characteristic;
generating the target response data based on at least two of the text feature, the first emotion feature, the first style feature, the object feature, the third emotion feature and the third style feature.
14. The method of any one of claims 1-8, wherein said outputting said target response data comprises:
detecting whether an output environment of the target response data is available;
outputting the target response data at a target time when the output environment is unavailable;
wherein the target time comprises: and detecting the available time of the output environment, or the time separated from the current time by a preset waiting time length, or the preset time.
15. The method of claim 14, wherein the detecting whether an output environment of the target response data is available comprises:
when the target response data is associated with the vehicle, acquiring a vehicle motion state;
when the vehicle motion state is a driving state, the output environment is not available.
16. The method of claim 14, wherein the detecting whether an output environment of the target response data is available comprises:
acquiring a current multimedia output state;
when the multimedia output state is a play state, the output environment is unavailable.
17. A terminal, comprising: the processing module and the transceiver module;
wherein the processing module is configured to:
acquiring a response text;
extracting one or more of text features, first emotion features and first style features of the response text;
acquiring object characteristics of a target response object;
generating target response data based on at least two of the text feature, the first emotion feature, the first style feature and the object feature; the target response data comprises voice data;
the transceiver module is used for outputting the target response data.
18. A terminal, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of any one of claims 1-16.
19. A computer-readable storage medium, having stored thereon a computer program for execution by a processor to perform the method of any one of claims 1-16.