CN115309882A - Interactive information generation method, system and storage medium based on multi-modal characteristics - Google Patents

Interactive information generation method, system and storage medium based on multi-modal characteristics

Info

Publication number
CN115309882A
Authority
CN
China
Prior art keywords
information
modal
image
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210955672.2A
Other languages
Chinese (zh)
Inventor
陈锁
顾文元
张雪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Original Assignee
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmeng Human Intelligence International Co ltd, Shanghai Yuanmeng Intelligent Technology Co ltd filed Critical Yuanmeng Human Intelligence International Co ltd
Priority to CN202210955672.2A priority Critical patent/CN115309882A/en
Publication of CN115309882A publication Critical patent/CN115309882A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an interaction information generation method, system and storage medium based on multi-modal features. The method comprises the following steps: obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features; normalizing the multi-modal features to obtain a target feature vector; classifying the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature; inputting the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy; and inputting the target dialogue strategy into a natural language generation module to generate interaction information. The method and system can quantitatively analyze the multi-modal information conveyed by the user and generate more accurate, more intuitive and more humanized interaction information according to the user's current interaction needs.

Description

Interactive information generation method, system and storage medium based on multi-modal characteristics
Technical Field
The invention relates to the technical field of multi-modal information interaction, and in particular to an interaction information generation method, system and storage medium based on multi-modal features.
Background
With the development of virtual-human (avatar) technology, avatars are increasingly used in scenarios that require chat interaction, such as customer service, intelligent chat and content generation. At present, chat interaction between a virtual human and a user relies mainly on a dialogue system from the natural language field; as such systems become more intelligent and proactive, the interaction between a virtual human and a user comes ever closer to that between a real person and the user.
In current virtual-human technology, the user input received by most dialogue systems is text converted from speech recognition. The text is parsed and understood with NLU (Natural Language Understanding) technology, and the resulting intermediate representation is passed to a dialogue management (DM) module for policy management and then to natural language generation (NLG). This approach generally assumes that the input data is text only, and the dialogue system usually responds in text form only, so the virtual human's persona is single and rigid in expression and cannot capture the user's real-time state of mind quickly and accurately.
Therefore, there is a need for an interaction information generation method based on multi-modal features, which quantitatively analyzes the multi-modal information conveyed by a user, fully understands the user's current interaction needs, and generates more accurate, more intuitive and more humanized interaction information accordingly.
Disclosure of Invention
In order to solve the technical problems that the interaction mode of existing interaction methods is single and rigid, and that interaction feedback cannot be given quickly and accurately based on the information conveyed by the user, the invention provides an interaction information generation method, system and storage medium based on multi-modal features. The specific technical scheme is as follows:
the invention provides an interactive information generation method based on multi-modal characteristics, which comprises the following steps:
obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
normalizing the multi-modal features to obtain a target feature vector;
classifying the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature;
inputting the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy;
and inputting the target dialogue strategy into a natural language generation module to generate interaction information.
In this method, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. Because the interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
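The five steps above can be read as a single pipeline. The following Python sketch is an illustration only: every callable in it is a placeholder for whatever concrete fusion step, classifiers, dialogue management module and natural language generation module an implementation chooses; none of these names are fixed by the invention.

```python
# Illustrative sketch of the claimed five-step flow; every callable is a placeholder.
from typing import Callable, Dict

import numpy as np

ModalFeatures = Dict[str, np.ndarray]  # keys: "image", "voice", "text"


def generate_interaction_info(
    features: ModalFeatures,
    fuse: Callable[[ModalFeatures], np.ndarray],              # normalization / fusion step
    classifiers: Dict[str, Callable[[np.ndarray], str]],      # one preset classifier per modality
    dialogue_manager: Callable[[np.ndarray, Dict[str, str]], str],
    nlg: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    target_vector = fuse(features)                                     # normalize multi-modal features
    labels = {m: clf(features[m]) for m, clf in classifiers.items()}   # classify per modality
    strategy = dialogue_manager(target_vector, labels)                 # dialogue management module
    return nlg(strategy)                                               # natural language generation
```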
In some embodiments, the obtaining multimodal features of the interactive object specifically includes:
collecting image information, voice information and dialogue text information of the interactive object in a preset time period;
performing feature extraction on the image information through an image sequence analysis algorithm to obtain the image features;
performing feature extraction on the voice information through a voice feature extraction algorithm to obtain the voice features;
and performing feature extraction on the dialogue text information through a natural language processing algorithm to obtain the text feature.
In this way the collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that the subsequent information interaction based on the multi-modal features is more accurate.
In some embodiments, after collecting the image information of the interactive object within the preset time period, the method further comprises:
segmenting the image information with an image semantic segmentation model to generate expression image information and limb image information, wherein the image semantic segmentation model is generated by deep-learning training on an image information dataset annotated with expression image information and limb image information;
and extracting features from the image information with an image sequence analysis algorithm to obtain the image features specifically comprises:
extracting features from the expression image information and the limb image information respectively with the image sequence analysis algorithm to obtain expression image features and limb image features.
Segmenting the image information into expression image information and limb image information makes it convenient to analyze the interaction needs of the interactive object from the expression image features and the limb image features subsequently, which improves the accuracy of the generated interaction information.
In some embodiments, extracting features from the image information with an image sequence analysis algorithm further comprises:
extracting features from the expression image information and the limb image information within the preset time period with a preset CNN + LSTM model to obtain an expression image feature group and a limb image feature group, wherein the expression image feature group comprises each expression image feature within the preset time period and the limb image feature group comprises each limb image feature within the preset time period.
In some embodiments, extracting features from the voice information with a voice feature extraction algorithm specifically comprises:
extracting features from the voice information with an MFCC algorithm to obtain the voice features;
and extracting features from the dialogue text information with a natural language processing algorithm specifically comprises:
extracting features from the dialogue text information with a preset regular expression to obtain a first text feature vector;
extracting features from the dialogue text information with a preset LSTM model to obtain a second text feature vector;
and concatenating and fusing the first text feature vector and the second text feature vector to form the text features.
In some embodiments, inputting the target feature vector and the classification result of each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically comprises:
inputting the target feature vector and the classification results of the modal features into a preset dialogue management module, and performing dialogue state management and strategy selection with a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
In some embodiments, the natural language generation module is generated by deep-learning training on a dataset containing correspondences between target dialogue strategies and interaction information, where the interaction information comprises interaction content, voice style, facial expression, body movements and interaction persona.
With this design, the method interacts with the interactive object not only in text form but also through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback.
In some embodiments, after inputting the target dialogue strategy into the natural language generation module to generate the interaction information, the method further comprises:
generating a virtual character image according to the interaction content, voice style, facial expression, body movements and interaction persona, and interacting with the interactive object through the virtual character.
In this way the virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
According to another aspect of the present invention, there is also provided an interaction information generation system based on multi-modal features, comprising:
an obtaining module, configured to obtain multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
a processing module, connected to the obtaining module and configured to normalize the multi-modal features to obtain a target feature vector;
a classification module, connected to the obtaining module and configured to classify the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature;
a first generation module, connected to the processing module and the classification module respectively and configured to input the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy;
and a second generation module, connected to the first generation module and configured to input the target dialogue strategy into the natural language generation module to generate interaction information.
According to another aspect of the present invention, there is further provided a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed by the above interaction information generation method based on multi-modal features.
The invention provides at least the following technical effects:
(1) The multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from it. The interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, so the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized;
(2) The collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that subsequent information interaction based on the multi-modal features is more accurate;
(3) The image information is segmented into expression image information and limb image information, so that the interaction needs of the interactive object can be analyzed from the expression image features and the limb image features, which improves the accuracy of the generated interaction information;
(4) Besides text, information is exchanged with the interactive object through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback;
(5) The virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for generating interactive information based on multi-modal features according to the present invention;
FIG. 2 is a flowchart of obtaining multi-modal features of an interactive object in the interaction information generation method based on multi-modal features of the present invention;
FIG. 3 is another flowchart of obtaining multi-modal features of an interactive object in the interaction information generation method based on multi-modal features of the present invention;
FIG. 4 is a flowchart of generating a target dialog strategy in the method for generating interactive information based on multi-modal features according to the present invention;
FIG. 5 is another flow chart of a method for generating interaction information based on multi-modal features according to the present invention;
FIG. 6 is a diagram of an example of an interactive information generating system based on multi-modal features.
Reference numbers in the figures: obtaining module-10, processing module-20, classification module-30, first generation module-40, second generation module-50.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For simplicity, the drawings only schematically show the parts relevant to the present invention and do not represent the actual structure of a product. In addition, to keep the drawings concise and readable, only one of several components with the same structure or function may be depicted or labeled in some drawings. In this document, "one" does not mean "only one"; it also covers the case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, without inventive effort, other drawings and embodiments can be derived from them.
In one embodiment of the present invention, as shown in FIG. 1, an interaction information generation method based on multi-modal features is provided, comprising the following steps:
s100 multi-modal features of the interactive object are obtained.
Specifically, the multi-modal features include an image feature V1, a voice feature V2 and a text feature V3. When the interactive object is detected to come within a preset distance threshold, image information and voice information of the interactive object are captured at preset time intervals within a preset time period by a camera device and a voice acquisition device, the dialogue text information of the interactive object is recognized from the voice information, and the image feature V1, the voice feature V2 and the text feature V3 are extracted from the image information, the voice information and the dialogue text information.
S200, the multi-modal features are normalized to obtain a target feature vector.
Specifically, NLU (Natural Language Understanding) feature fusion is performed on the image feature V1, the voice feature V2 and the text feature V3 to obtain the target feature vector V.
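The text does not fix a concrete fusion network, so the following sketch shows only one possible reading of step S200, assuming each modality has already been reduced to a fixed-length vector: normalize each vector to unit length and concatenate them into the target feature vector V.

```python
# A minimal fusion step for S200; the unit-length normalization is an assumption,
# not a normalization scheme prescribed by the text.
import numpy as np


def fuse_features(features: dict) -> np.ndarray:
    """features: {"image": V1, "voice": V2, "text": V3} as 1-D arrays."""
    parts = []
    for name in ("image", "voice", "text"):          # fixed order keeps V reproducible
        v = np.asarray(features[name], dtype=np.float32)
        norm = np.linalg.norm(v)
        parts.append(v / norm if norm > 0 else v)    # normalize each modality
    return np.concatenate(parts)                     # target feature vector V
```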
S300, the multi-modal features are classified with the preset classifier of each modality to obtain a classification result for each modal feature.
For example, the image feature V1, the voice feature V2 and the text feature V3 are classified by an image classifier, a voice classifier and a text classifier respectively; the image feature V1 may be classified with the label "urgent", the voice feature V2 with the label "fast speech", and the text feature V3 with the label "looking for a toilet".
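As an illustration of step S300 only, the per-modality classifiers below are minimal linear models; the label sets echo the example above, while the feature dimensions and class lists are assumptions rather than values given in the text.

```python
# Hypothetical per-modality classifiers; dimensions and label sets are examples only.
import torch
import torch.nn as nn

IMAGE_LABELS = ["calm", "urgent"]
VOICE_LABELS = ["normal speech", "fast speech"]
TEXT_LABELS = ["small talk", "looking for a toilet", "asking for a stock quote"]


class ModalClassifier(nn.Module):
    """Maps one modality's feature vector to a label string."""

    def __init__(self, feat_dim: int, labels: list):
        super().__init__()
        self.labels = labels
        self.fc = nn.Linear(feat_dim, len(labels))

    def forward(self, feat: torch.Tensor) -> str:
        logits = self.fc(feat)                       # feat: (feat_dim,)
        return self.labels[int(torch.argmax(logits))]


image_classifier = ModalClassifier(feat_dim=256, labels=IMAGE_LABELS)
voice_classifier = ModalClassifier(feat_dim=39, labels=VOICE_LABELS)
text_classifier = ModalClassifier(feat_dim=131, labels=TEXT_LABELS)
```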
S400, the target feature vector and the classification results of the modal features are input into a preset dialogue management module to generate a target dialogue strategy.
S500, the target dialogue strategy is input into a natural language generation module to generate interaction information.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the interaction information "the toilet is 50 metres straight ahead, then turn right" is generated.
In the method provided by this embodiment, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. Because the interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
In one embodiment, as shown in FIG. 2, step S100 of obtaining the multi-modal features of the interactive object specifically includes:
s110, collecting image information, voice information and dialogue text information of the interactive object in a preset time period.
S121, extracting the features of the image information through an image sequence analysis algorithm to obtain image features.
S122, extracting the characteristics of the voice information through a voice characteristic extraction algorithm to obtain voice characteristics.
For example, the voice feature V2 is obtained by performing feature extraction on the voice information with the MFCC (Mel-Frequency Cepstral Coefficients) algorithm.
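A minimal sketch of this step, assuming the collected speech segment is available as a WAV file and using librosa as one common (but not mandated) MFCC implementation:

```python
# MFCC-based voice feature extraction; the sampling rate, number of coefficients
# and time-averaging are illustrative choices, not values given in the text.
import librosa
import numpy as np


def extract_voice_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)                  # load and resample the speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                  # fixed-length voice feature V2
```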
And S123, carrying out feature extraction on the dialogue text information through a natural language processing algorithm to obtain text features.
For example, feature extraction is performed on the dialogue text information with a preset regular expression to obtain a first text feature vector, feature extraction is performed on the dialogue text information with a preset LSTM model to obtain a second text feature vector, and the two vectors are concatenated and fused to form the text feature V3.
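A sketch of this two-branch text feature is given below, assuming PyTorch; the keyword patterns, vocabulary size and layer sizes are illustrative placeholders rather than values taken from the text.

```python
# Two-branch text feature: a regular-expression branch and an LSTM branch,
# concatenated into the text feature V3. Patterns and dimensions are assumptions.
import re

import torch
import torch.nn as nn

PATTERNS = [r"toilet|washroom", r"stock|quote", r"price"]     # example keyword patterns


def regex_features(text: str) -> torch.Tensor:
    """First text feature vector: one flag per preset regular expression."""
    return torch.tensor([1.0 if re.search(p, text) else 0.0 for p in PATTERNS])


class LstmTextEncoder(nn.Module):
    """Second text feature vector: last hidden state of an LSTM over token embeddings."""

    def __init__(self, vocab_size: int = 10000, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # token_ids: (1, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1].squeeze(0)                                 # (hid_dim,)


def text_features(text: str, token_ids: torch.Tensor, encoder: LstmTextEncoder) -> torch.Tensor:
    # Concatenate ("splice and fuse") the two branches into the text feature V3.
    return torch.cat([regex_features(text), encoder(token_ids)])
```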
In this embodiment, the collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that the subsequent information interaction based on the multi-modal features is more accurate.
In one embodiment, as shown in FIG. 3, after step S110 collects the image information, voice information and dialogue text information of the interactive object within the preset time period, the method further includes:
and S124, carrying out image segmentation on the image information through the image semantic segmentation model to generate expression image information and limb image information.
Specifically, the image semantic segmentation model is generated by deep learning training based on an image information dataset marked with expression image information and limb image information.
And S125, respectively extracting the characteristics of the expression image information and the limb image information through an image sequence analysis algorithm to obtain expression image characteristics and limb image characteristics.
Specifically, the image feature V 1 Including expressive image features V 4 And limb image feature V 5
Exemplarily, feature extraction is respectively performed on expression image information and limb image information within a preset time period through a preset CNN (conditional Neural Network) + LSTM (Long Short-Term Memory) model to obtain an expression image feature group and a limb image feature group, where the expression image feature group includes each expression image feature V within the preset time period 4 The limb image feature group comprises each limb image feature V in a preset time period 5
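One way to realize the CNN + LSTM model described here is sketched below in PyTorch: a small convolutional encoder processes each frame of the expression or limb image sequence, and an LSTM summarizes the frame sequence from the preset time period into one feature vector. All layer sizes are illustrative assumptions.

```python
# CNN + LSTM sequence encoder for expression or limb frame sequences; sizes are assumptions.
import torch
import torch.nn as nn


class CnnLstmEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                              # per-frame encoder
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, feat_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W), the image sequence from the preset time period
        b, t, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, 32)
        _, (h_n, _) = self.lstm(per_frame)                     # summarize the sequence
        return h_n[-1]                                         # one feature vector per sequence


expression_encoder = CnnLstmEncoder()                          # expression image features V4
limb_encoder = CnnLstmEncoder()                                # limb image features V5
```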
Segmenting the image information into expression image information and limb image information makes it convenient to analyze the interaction needs of the interactive object from the expression image features and the limb image features, which improves the accuracy of the generated interaction information.
In one embodiment, as shown in FIG. 4, step S400 of inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically includes:
S410, the target feature vector and the classification results of the modal features are input into a preset dialogue management module, and dialogue state management and strategy selection are performed with a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
Specifically, one dialogue decision mode is selected from several different dialogue decision modes as the target dialogue strategy according to the target feature vector and the modal features; for the same image feature of the interactive object, different dialogue decision modes generate different interaction content, voice style, facial expression, body movements and interaction persona.
For example, when dialogue decision mode A is used as the target dialogue strategy, a concise, businesslike persona and intuitive, simple content are used to answer the user's text "How is stock XX currently trading?"; when dialogue decision mode B is used as the target dialogue strategy, a gentle, amiable persona and detailed content are used for the same text, for example "The stock has been trending upward for the last 28 minutes, the current price is X, and today's turnover is down Z yuan from yesterday's; would you like to know more about the company's shares?".
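The dialogue management module described here combines a finite state machine, a Bayesian network and an LSTM network. The rule-based stand-in below only illustrates the outcome of strategy selection, i.e. choosing between decision modes such as A and B; the learned components are omitted, and all names and rules are assumptions.

```python
# Toy strategy selection; the finite state machine, Bayesian network and LSTM of the
# dialogue management module are omitted and replaced by a single hypothetical rule.
from dataclasses import dataclass
from typing import Dict

import numpy as np


@dataclass
class DialogueStrategy:
    mode: str        # e.g. "A" (concise, businesslike) or "B" (gentle, detailed)
    persona: str
    verbosity: str


def select_strategy(target_vector: np.ndarray, labels: Dict[str, str]) -> DialogueStrategy:
    # In a full implementation the target feature vector would drive the LSTM and
    # Bayesian components; here only the modality labels are consulted.
    if labels.get("image") == "urgent" or labels.get("voice") == "fast speech":
        return DialogueStrategy(mode="A", persona="businesslike", verbosity="concise")
    return DialogueStrategy(mode="B", persona="gentle", verbosity="detailed")
```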
In one embodiment, the natural language generation module is generated by deep-learning training on a dataset containing correspondences between target dialogue strategies and interaction information, where the interaction information includes interaction content, voice style, facial expression, body movements and interaction persona.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the following interaction information is generated: interaction content: "the toilet is 50 metres straight ahead, then turn right"; voice style: concise; facial expression: neutral; body movement: pointing in the target direction; interaction persona: warm.
In this embodiment, information is exchanged with the interactive object through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback.
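To make the structure of this interaction information concrete, the sketch below defines the output fields named above together with a toy lookup that stands in for the trained natural language generation model; the field values simply restate the toilet example and are not prescribed by the text.

```python
# Structure of the generated interaction information; the lookup is a stand-in for
# the trained NLG model, and the example values are illustrative only.
from dataclasses import dataclass


@dataclass
class InteractionInfo:
    content: str
    voice_style: str
    facial_expression: str
    body_movement: str
    persona: str


def toy_nlg(strategy_mode: str, text_label: str) -> InteractionInfo:
    if strategy_mode == "A" and text_label == "looking for a toilet":
        return InteractionInfo(
            content="The toilet is 50 metres straight ahead, then turn right.",
            voice_style="concise",
            facial_expression="neutral",
            body_movement="point toward the target direction",
            persona="warm",
        )
    return InteractionInfo("Sorry, could you say that again?", "calm", "smile", "none", "warm")
```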
In one embodiment, the interaction information further includes interaction scripts, i.e. interaction texts preset for different interaction situations of the user, for example fixed replies preset for fixed questions, or a fixed reply preset for the case where the user's interaction content cannot be determined clearly.
In one embodiment, as shown in FIG. 5, after step S500 inputs the target dialogue strategy into the natural language generation module to generate the interaction information, the method further includes:
S600, a virtual character image is generated according to the interaction content, voice style, facial expression, body movements and interaction persona, and interacts with the interactive object.
In this embodiment the virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
In one embodiment, as shown in FIG. 6, according to another aspect of the present invention, there is further provided an interaction information generation system based on multi-modal features, which includes an obtaining module 10, a processing module 20, a classification module 30, a first generation module 40 and a second generation module 50.
Wherein the obtaining module 10 is configured to obtain multimodal features of the interactive object.
Specifically, the multi-modal features comprise an image feature V1, a voice feature V2 and a text feature V3. When the interactive object is detected to come within the preset distance threshold, image information and voice information of the interactive object are captured at preset time intervals within a preset time period by a camera device and a voice acquisition device, the dialogue text information of the interactive object is recognized from the voice information, and the image feature V1, the voice feature V2 and the text feature V3 are extracted from the image information, the voice information and the dialogue text information.
The processing module 20 is connected to the obtaining module 10, and is configured to perform normalization processing on the multi-modal features to obtain a target feature vector.
Specifically, NLU (Natural Language Understanding) feature fusion is performed on the image feature V1, the voice feature V2 and the text feature V3 to obtain the target feature vector V.
The classification module 30 is connected to the obtaining module 10, and is configured to classify the multi-modal features according to a preset classifier corresponding to each modal, so as to obtain a classification result corresponding to each modal feature.
For example, the image feature V1, the voice feature V2 and the text feature V3 are classified by an image classifier, a voice classifier and a text classifier respectively; the image feature V1 may be classified with the label "urgent", the voice feature V2 with the label "fast speech", and the text feature V3 with the label "looking for a toilet".
The first generating module 40 is connected to the processing module 20 and the classifying module 30, respectively, and is configured to input the target feature vector and the classification result corresponding to each modal feature into a preset dialog management module to generate a target dialog policy.
The second generating module 50 is connected to the first generating module 40, and is used for inputting the target dialogue strategy into the natural language generating module to generate the interactive information.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the interaction information "the toilet is 50 metres straight ahead, then turn right" is generated.
In this interaction information generation system based on multi-modal features, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. The interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, so the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
In one embodiment, the present invention further provides a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed in the above embodiments of the interaction information generation method based on multi-modal features. For example, the storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In the foregoing embodiments, the descriptions of the respective embodiments have their respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed interaction information generation method, system and storage medium based on multi-modal features may be implemented in other ways. For example, the embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example multiple units or modules may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings or communication connections shown or discussed between components may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention. For those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for generating interaction information based on multi-modal features, comprising the steps of:
obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
normalizing the multi-modal features to obtain a target feature vector;
classifying the multi-modal features with a preset classifier corresponding to each modality to obtain a classification result corresponding to each modal feature;
inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy;
and inputting the target dialogue strategy into a natural language generation module to generate interaction information.
2. The method for generating interaction information based on multi-modal features according to claim 1, wherein the obtaining multi-modal features of the interaction object specifically comprises:
collecting image information, voice information and dialogue text information of the interactive object in a preset time period;
performing feature extraction on the image information through an image sequence analysis algorithm to obtain the image features;
performing feature extraction on the voice information through a voice feature extraction algorithm to obtain the voice features;
and performing feature extraction on the dialogue text information through a natural language processing algorithm to obtain the text features.
3. The method for generating interactive information based on multi-modal features according to claim 2, further comprising, after the acquiring the image information of the interactive object within a preset time period:
performing image segmentation on the image information through an image semantic segmentation model to generate expression image information and limb image information, wherein the image semantic segmentation model performs deep learning training generation based on an image information data set labeled with the expression image information and the limb image information;
the image feature extraction of the image information through an image sequence analysis algorithm to obtain the image feature specifically comprises:
and respectively carrying out feature extraction on the expression image information and the limb image information through the image sequence analysis algorithm to obtain expression image features and limb image features.
4. The method according to claim 3, wherein the image features are obtained by feature extraction of the image information through an image sequence analysis algorithm, and further comprising:
feature extraction is respectively carried out on the expression image information and the limb image information in the preset time period through a preset CNN + LSTM model, and an expression image feature group and a limb image feature group are obtained, wherein the expression image feature group comprises each expression image feature in the preset time period, and the limb image feature group comprises each limb image feature in the preset time period.
5. The method for generating interactive information based on multi-modal features according to claim 2, wherein the feature extraction of the speech information by the speech feature extraction algorithm to obtain the speech features specifically comprises:
performing feature extraction on the voice information through an MFCC algorithm to obtain the voice feature;
the extracting the features of the dialog text information through the natural language processing algorithm to obtain the text features specifically comprises the following steps:
performing feature extraction on the dialog text information through a preset regular expression to obtain a first text feature vector;
performing feature extraction on the dialogue text information through a preset LSTM model to obtain a second text feature vector;
and splicing and fusing the first text feature vector and the second text feature vector as the text features.
6. The method for generating interaction information based on multi-modal features according to claim 1, wherein inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically comprises:
and inputting the target feature vectors and the classification results corresponding to the modal features into a preset dialogue management module, and performing dialogue state management and strategy selection through a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
7. The multi-modal feature-based interaction information generation method according to claim 1,
the natural language generation module is generated by deep-learning training on a dataset containing correspondences between the target dialogue strategy and the interaction information, and the interaction information comprises interaction content, voice style, facial expression, body movements and interaction persona.
8. The method of claim 7, wherein after inputting the target dialog strategy into a natural language generating module to generate the interaction information, the method further comprises:
and generating a virtual character image according to the interaction content, the voice style, the facial expression, the body movements and the interaction persona, and interacting with the interactive object.
9. An interactive information generating system based on multi-modal features, comprising:
an acquisition module, configured to acquire multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
the processing module is connected with the acquisition module and used for carrying out normalization processing on the multi-modal features to obtain target feature vectors;
the classification module is connected with the acquisition module and is used for classifying the multi-modal characteristics according to a preset classifier corresponding to each mode to obtain a classification result corresponding to each mode characteristic;
the first generation module is respectively connected with the processing module and the classification module and is used for inputting the target feature vector and the classification result corresponding to each modal feature into a preset conversation management module to generate a target conversation strategy;
and the second generation module is connected with the first generation module and used for inputting the target dialogue strategy into the natural language generation module to generate interactive information.
10. A storage medium, characterized in that at least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor to implement the operations performed by the multi-modal feature-based interaction information generation method according to any one of claims 1 to 8.
CN202210955672.2A 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics Pending CN115309882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210955672.2A CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210955672.2A CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Publications (1)

Publication Number Publication Date
CN115309882A true CN115309882A (en) 2022-11-08

Family

ID=83861414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210955672.2A Pending CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Country Status (1)

Country Link
CN (1) CN115309882A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932043A (en) * 2024-03-22 2024-04-26 杭州食方科技有限公司 Dialogue style migration reply information display method, device, equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination