CN115309882A - Interactive information generation method, system and storage medium based on multi-modal characteristics - Google Patents

Interactive information generation method, system and storage medium based on multi-modal characteristics

Info

Publication number
CN115309882A
Authority
CN
China
Prior art keywords
information
modal
image
features
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210955672.2A
Other languages
Chinese (zh)
Inventor
陈锁
顾文元
张雪源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Original Assignee
Yuanmeng Human Intelligence International Co ltd
Shanghai Yuanmeng Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanmeng Human Intelligence International Co ltd, Shanghai Yuanmeng Intelligent Technology Co ltd filed Critical Yuanmeng Human Intelligence International Co ltd
Priority to CN202210955672.2A priority Critical patent/CN115309882A/en
Publication of CN115309882A publication Critical patent/CN115309882A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an interaction information generation method, system and storage medium based on multi-modal features. The method comprises the following steps: obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features; normalizing the multi-modal features to obtain a target feature vector; classifying the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature; inputting the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy; and inputting the target dialogue strategy into a natural language generation module to generate interaction information. The method and system can quantitatively analyze the multi-modal information conveyed by the user and generate more accurate, more intuitive and more humanized interaction information according to the user's current interaction needs.

Description

Interactive information generation method, system and storage medium based on multi-modal characteristics
Technical Field
The invention relates to the technical field of multi-modal information interaction, and in particular to an interaction information generation method, system and storage medium based on multi-modal features.
Background
With the development of virtual-human (avatar) technology, avatars are increasingly used in scenarios that require chat interaction, such as customer service, intelligent chat and content generation. At present, chat interaction between a virtual human and a user relies mainly on a dialogue system from the natural language field; as such systems become more intelligent and proactive, the interaction between a virtual human and a user comes ever closer to that between a real person and the user.
In current virtual-human technology, the user input received by most dialogue systems is text converted from speech recognition. The text is parsed and understood with NLU (Natural Language Understanding) technology, and the resulting intermediate representation is passed to a dialogue management (DM) module for policy management and then to natural language generation (NLG). This approach generally assumes that the input data is text only, and the dialogue system usually responds in text form only, so the virtual human's persona is single and rigid in expression and cannot capture the user's real-time state of mind quickly and accurately.
Therefore, there is a need for an interaction information generation method based on multi-modal features, which quantitatively analyzes the multi-modal information conveyed by a user, fully understands the user's current interaction needs, and generates more accurate, more intuitive and more humanized interaction information accordingly.
Disclosure of Invention
In order to solve the technical problems that the interaction mode of existing interaction methods is single and rigid, and that interaction feedback cannot be given quickly and accurately based on the information conveyed by the user, the invention provides an interaction information generation method, system and storage medium based on multi-modal features. The specific technical scheme is as follows:
the invention provides an interactive information generation method based on multi-modal characteristics, which comprises the following steps:
obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
normalizing the multi-modal features to obtain a target feature vector;
classifying the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature;
inputting the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy;
and inputting the target dialogue strategy into a natural language generation module to generate interaction information.
In this method, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. Because the interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
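The five steps above can be read as a single pipeline. The following Python sketch is an illustration only: every callable in it is a placeholder for whatever concrete fusion step, classifiers, dialogue management module and natural language generation module an implementation chooses; none of these names are fixed by the invention.

```python
# Illustrative sketch of the claimed five-step flow; every callable is a placeholder.
from typing import Callable, Dict

import numpy as np

ModalFeatures = Dict[str, np.ndarray]  # keys: "image", "voice", "text"


def generate_interaction_info(
    features: ModalFeatures,
    fuse: Callable[[ModalFeatures], np.ndarray],              # normalization / fusion step
    classifiers: Dict[str, Callable[[np.ndarray], str]],      # one preset classifier per modality
    dialogue_manager: Callable[[np.ndarray, Dict[str, str]], str],
    nlg: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    target_vector = fuse(features)                                     # normalize multi-modal features
    labels = {m: clf(features[m]) for m, clf in classifiers.items()}   # classify per modality
    strategy = dialogue_manager(target_vector, labels)                 # dialogue management module
    return nlg(strategy)                                               # natural language generation
```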
In some embodiments, the obtaining multimodal features of the interactive object specifically includes:
collecting image information, voice information and dialogue text information of the interactive object in a preset time period;
performing feature extraction on the image information through an image sequence analysis algorithm to obtain the image features;
performing feature extraction on the voice information through a voice feature extraction algorithm to obtain the voice features;
and performing feature extraction on the dialogue text information through a natural language processing algorithm to obtain the text feature.
In this way the collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that the subsequent information interaction based on the multi-modal features is more accurate.
In some embodiments, after collecting the image information of the interactive object within the preset time period, the method further comprises:
segmenting the image information with an image semantic segmentation model to generate expression image information and limb image information, wherein the image semantic segmentation model is generated by deep-learning training on an image information dataset annotated with expression image information and limb image information;
and extracting features from the image information with an image sequence analysis algorithm to obtain the image features specifically comprises:
extracting features from the expression image information and the limb image information respectively with the image sequence analysis algorithm to obtain expression image features and limb image features.
Segmenting the image information into expression image information and limb image information makes it convenient to analyze the interaction needs of the interactive object from the expression image features and the limb image features subsequently, which improves the accuracy of the generated interaction information.
In some embodiments, extracting features from the image information with an image sequence analysis algorithm further comprises:
extracting features from the expression image information and the limb image information within the preset time period with a preset CNN + LSTM model to obtain an expression image feature group and a limb image feature group, wherein the expression image feature group comprises each expression image feature within the preset time period and the limb image feature group comprises each limb image feature within the preset time period.
In some embodiments, extracting features from the voice information with a voice feature extraction algorithm specifically comprises:
extracting features from the voice information with an MFCC algorithm to obtain the voice features;
and extracting features from the dialogue text information with a natural language processing algorithm specifically comprises:
extracting features from the dialogue text information with a preset regular expression to obtain a first text feature vector;
extracting features from the dialogue text information with a preset LSTM model to obtain a second text feature vector;
and concatenating and fusing the first text feature vector and the second text feature vector to form the text features.
In some embodiments, inputting the target feature vector and the classification result of each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically comprises:
inputting the target feature vector and the classification results of the modal features into a preset dialogue management module, and performing dialogue state management and strategy selection with a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
In some embodiments, the natural language generation module is generated by deep-learning training on a dataset containing correspondences between target dialogue strategies and interaction information, where the interaction information comprises interaction content, voice style, facial expression, body movements and interaction persona.
With this design, the method interacts with the interactive object not only in text form but also through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback.
In some embodiments, after inputting the target dialogue strategy into the natural language generation module to generate the interaction information, the method further comprises:
generating a virtual character image according to the interaction content, voice style, facial expression, body movements and interaction persona, and interacting with the interactive object through the virtual character.
In this way the virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
According to another aspect of the present invention, there is also provided an interaction information generation system based on multi-modal features, comprising:
an obtaining module, configured to obtain multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
a processing module, connected to the obtaining module and configured to normalize the multi-modal features to obtain a target feature vector;
a classification module, connected to the obtaining module and configured to classify the multi-modal features with a preset classifier for each modality to obtain a classification result for each modal feature;
a first generation module, connected to the processing module and the classification module respectively and configured to input the target feature vector and the classification results of the modal features into a preset dialogue management module to generate a target dialogue strategy;
and a second generation module, connected to the first generation module and configured to input the target dialogue strategy into the natural language generation module to generate interaction information.
According to another aspect of the present invention, there is further provided a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed by the above interaction information generation method based on multi-modal features.
The invention provides at least the following technical effects:
(1) The multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from it. The interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, so the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized;
(2) The collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that subsequent information interaction based on the multi-modal features is more accurate;
(3) The image information is segmented into expression image information and limb image information, so that the interaction needs of the interactive object can be analyzed from the expression image features and the limb image features, which improves the accuracy of the generated interaction information;
(4) Besides text, information is exchanged with the interactive object through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback;
(5) The virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for generating interactive information based on multi-modal features according to the present invention;
FIG. 2 is a flowchart of obtaining multi-modal features of an interactive object in the interaction information generation method based on multi-modal features of the present invention;
FIG. 3 is another flowchart of obtaining multi-modal features of an interactive object in the interaction information generation method based on multi-modal features of the present invention;
FIG. 4 is a flowchart of generating a target dialog strategy in the method for generating interactive information based on multi-modal features according to the present invention;
FIG. 5 is another flow chart of a method for generating interaction information based on multi-modal features according to the present invention;
FIG. 6 is a diagram of an example of an interactive information generating system based on multi-modal features.
Reference numbers in the figures: obtaining module-10, processing module-20, classification module-30, first generation module-40, second generation module-50.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
For simplicity, the drawings only schematically show the parts relevant to the present invention and do not represent the actual structure of a product. In addition, to keep the drawings concise and readable, only one of several components with the same structure or function may be depicted or labeled in some drawings. In this document, "one" does not mean "only one"; it also covers the case of "more than one".
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
In addition, in the description of the present application, the terms "first", "second", and the like are used only for distinguishing the description, and are not intended to indicate or imply relative importance.
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description will be made with reference to the accompanying drawings. It is obvious that the drawings in the following description are only some examples of the invention, and that for a person skilled in the art, without inventive effort, other drawings and embodiments can be derived from them.
In one embodiment of the present invention, as shown in FIG. 1, an interaction information generation method based on multi-modal features is provided, comprising the following steps:
s100 multi-modal features of the interactive object are obtained.
Specifically, the multi-modal features include an image feature V1, a voice feature V2 and a text feature V3. When the interactive object is detected to come within a preset distance threshold, image information and voice information of the interactive object are captured at preset time intervals within a preset time period by a camera device and a voice acquisition device, the dialogue text information of the interactive object is recognized from the voice information, and the image feature V1, the voice feature V2 and the text feature V3 are extracted from the image information, the voice information and the dialogue text information.
S200, the multi-modal features are normalized to obtain a target feature vector.
Specifically, NLU (Natural Language Understanding) feature fusion is performed on the image feature V1, the voice feature V2 and the text feature V3 to obtain the target feature vector V.
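The text does not fix a concrete fusion network, so the following sketch shows only one possible reading of step S200, assuming each modality has already been reduced to a fixed-length vector: normalize each vector to unit length and concatenate them into the target feature vector V.

```python
# A minimal fusion step for S200; the unit-length normalization is an assumption,
# not a normalization scheme prescribed by the text.
import numpy as np


def fuse_features(features: dict) -> np.ndarray:
    """features: {"image": V1, "voice": V2, "text": V3} as 1-D arrays."""
    parts = []
    for name in ("image", "voice", "text"):          # fixed order keeps V reproducible
        v = np.asarray(features[name], dtype=np.float32)
        norm = np.linalg.norm(v)
        parts.append(v / norm if norm > 0 else v)    # normalize each modality
    return np.concatenate(parts)                     # target feature vector V
```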
S300, the multi-modal features are classified with the preset classifier of each modality to obtain a classification result for each modal feature.
For example, the image feature V1, the voice feature V2 and the text feature V3 are classified by an image classifier, a voice classifier and a text classifier respectively; the image feature V1 may be classified with the label "urgent", the voice feature V2 with the label "fast speech", and the text feature V3 with the label "looking for a toilet".
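As an illustration of step S300 only, the per-modality classifiers below are minimal linear models; the label sets echo the example above, while the feature dimensions and class lists are assumptions rather than values given in the text.

```python
# Hypothetical per-modality classifiers; dimensions and label sets are examples only.
import torch
import torch.nn as nn

IMAGE_LABELS = ["calm", "urgent"]
VOICE_LABELS = ["normal speech", "fast speech"]
TEXT_LABELS = ["small talk", "looking for a toilet", "asking for a stock quote"]


class ModalClassifier(nn.Module):
    """Maps one modality's feature vector to a label string."""

    def __init__(self, feat_dim: int, labels: list):
        super().__init__()
        self.labels = labels
        self.fc = nn.Linear(feat_dim, len(labels))

    def forward(self, feat: torch.Tensor) -> str:
        logits = self.fc(feat)                       # feat: (feat_dim,)
        return self.labels[int(torch.argmax(logits))]


image_classifier = ModalClassifier(feat_dim=256, labels=IMAGE_LABELS)
voice_classifier = ModalClassifier(feat_dim=39, labels=VOICE_LABELS)
text_classifier = ModalClassifier(feat_dim=131, labels=TEXT_LABELS)
```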
S400, the target feature vector and the classification results of the modal features are input into a preset dialogue management module to generate a target dialogue strategy.
S500, the target dialogue strategy is input into a natural language generation module to generate interaction information.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the interaction information "the toilet is 50 metres straight ahead, then turn right" is generated.
In the method provided by this embodiment, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. Because the interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
In one embodiment, as shown in FIG. 2, step S100 of obtaining the multi-modal features of the interactive object specifically includes:
s110, collecting image information, voice information and dialogue text information of the interactive object in a preset time period.
S121, extracting the features of the image information through an image sequence analysis algorithm to obtain image features.
S122, extracting the characteristics of the voice information through a voice characteristic extraction algorithm to obtain voice characteristics.
For example, the voice feature V2 is obtained by performing feature extraction on the voice information with the MFCC (Mel-Frequency Cepstral Coefficients) algorithm.
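A minimal sketch of this step, assuming the collected speech segment is available as a WAV file and using librosa as one common (but not mandated) MFCC implementation:

```python
# MFCC-based voice feature extraction; the sampling rate, number of coefficients
# and time-averaging are illustrative choices, not values given in the text.
import librosa
import numpy as np


def extract_voice_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)                  # load and resample the speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # shape: (n_mfcc, frames)
    return mfcc.mean(axis=1)                                  # fixed-length voice feature V2
```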
And S123, carrying out feature extraction on the dialogue text information through a natural language processing algorithm to obtain text features.
For example, feature extraction is performed on the dialogue text information with a preset regular expression to obtain a first text feature vector, feature extraction is performed on the dialogue text information with a preset LSTM model to obtain a second text feature vector, and the two vectors are concatenated and fused to form the text feature V3.
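A sketch of this two-branch text feature is given below, assuming PyTorch; the keyword patterns, vocabulary size and layer sizes are illustrative placeholders rather than values taken from the text.

```python
# Two-branch text feature: a regular-expression branch and an LSTM branch,
# concatenated into the text feature V3. Patterns and dimensions are assumptions.
import re

import torch
import torch.nn as nn

PATTERNS = [r"toilet|washroom", r"stock|quote", r"price"]     # example keyword patterns


def regex_features(text: str) -> torch.Tensor:
    """First text feature vector: one flag per preset regular expression."""
    return torch.tensor([1.0 if re.search(p, text) else 0.0 for p in PATTERNS])


class LstmTextEncoder(nn.Module):
    """Second text feature vector: last hidden state of an LSTM over token embeddings."""

    def __init__(self, vocab_size: int = 10000, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:   # token_ids: (1, seq_len)
        _, (h_n, _) = self.lstm(self.embed(token_ids))
        return h_n[-1].squeeze(0)                                 # (hid_dim,)


def text_features(text: str, token_ids: torch.Tensor, encoder: LstmTextEncoder) -> torch.Tensor:
    # Concatenate ("splice and fuse") the two branches into the text feature V3.
    return torch.cat([regex_features(text), encoder(token_ids)])
```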
In this embodiment, the collected image information, voice information and dialogue text information of the interactive object are processed by an image sequence analysis algorithm, a voice feature extraction algorithm and a natural language processing algorithm respectively, so that the subsequent information interaction based on the multi-modal features is more accurate.
In one embodiment, as shown in FIG. 3, after step S110 collects the image information, voice information and dialogue text information of the interactive object within the preset time period, the method further includes:
and S124, carrying out image segmentation on the image information through the image semantic segmentation model to generate expression image information and limb image information.
Specifically, the image semantic segmentation model is generated by deep learning training based on an image information dataset marked with expression image information and limb image information.
And S125, respectively extracting the characteristics of the expression image information and the limb image information through an image sequence analysis algorithm to obtain expression image characteristics and limb image characteristics.
Specifically, the image feature V 1 Including expressive image features V 4 And limb image feature V 5
Exemplarily, feature extraction is respectively performed on expression image information and limb image information within a preset time period through a preset CNN (conditional Neural Network) + LSTM (Long Short-Term Memory) model to obtain an expression image feature group and a limb image feature group, where the expression image feature group includes each expression image feature V within the preset time period 4 The limb image feature group comprises each limb image feature V in a preset time period 5
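One way to realize the CNN + LSTM model described here is sketched below in PyTorch: a small convolutional encoder processes each frame of the expression or limb image sequence, and an LSTM summarizes the frame sequence from the preset time period into one feature vector. All layer sizes are illustrative assumptions.

```python
# CNN + LSTM sequence encoder for expression or limb frame sequences; sizes are assumptions.
import torch
import torch.nn as nn


class CnnLstmEncoder(nn.Module):
    def __init__(self, feat_dim: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(                              # per-frame encoder
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.lstm = nn.LSTM(32, feat_dim, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, 3, H, W), the image sequence from the preset time period
        b, t, c, h, w = frames.shape
        per_frame = self.cnn(frames.reshape(b * t, c, h, w)).reshape(b, t, 32)
        _, (h_n, _) = self.lstm(per_frame)                     # summarize the sequence
        return h_n[-1]                                         # one feature vector per sequence


expression_encoder = CnnLstmEncoder()                          # expression image features V4
limb_encoder = CnnLstmEncoder()                                # limb image features V5
```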
Segmenting the image information into expression image information and limb image information makes it convenient to analyze the interaction needs of the interactive object from the expression image features and the limb image features, which improves the accuracy of the generated interaction information.
In one embodiment, as shown in FIG. 4, step S400 of inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically includes:
S410, the target feature vector and the classification results of the modal features are input into a preset dialogue management module, and dialogue state management and strategy selection are performed with a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
Specifically, one dialogue decision mode is selected from several different dialogue decision modes as the target dialogue strategy according to the target feature vector and the modal features; for the same image feature of the interactive object, different dialogue decision modes generate different interaction content, voice style, facial expression, body movements and interaction persona.
For example, when dialogue decision mode A is used as the target dialogue strategy, a concise, businesslike persona and intuitive, simple content are used to answer the user's text "How is stock XX currently trading?"; when dialogue decision mode B is used as the target dialogue strategy, a gentle, amiable persona and detailed content are used for the same text, for example "The stock has been trending upward for the last 28 minutes, the current price is X, and today's turnover is down Z yuan from yesterday's; would you like to know more about the company's shares?".
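The dialogue management module described here combines a finite state machine, a Bayesian network and an LSTM network. The rule-based stand-in below only illustrates the outcome of strategy selection, i.e. choosing between decision modes such as A and B; the learned components are omitted, and all names and rules are assumptions.

```python
# Toy strategy selection; the finite state machine, Bayesian network and LSTM of the
# dialogue management module are omitted and replaced by a single hypothetical rule.
from dataclasses import dataclass
from typing import Dict

import numpy as np


@dataclass
class DialogueStrategy:
    mode: str        # e.g. "A" (concise, businesslike) or "B" (gentle, detailed)
    persona: str
    verbosity: str


def select_strategy(target_vector: np.ndarray, labels: Dict[str, str]) -> DialogueStrategy:
    # In a full implementation the target feature vector would drive the LSTM and
    # Bayesian components; here only the modality labels are consulted.
    if labels.get("image") == "urgent" or labels.get("voice") == "fast speech":
        return DialogueStrategy(mode="A", persona="businesslike", verbosity="concise")
    return DialogueStrategy(mode="B", persona="gentle", verbosity="detailed")
```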
In one embodiment, the natural language generation module is generated by deep-learning training on a dataset containing correspondences between target dialogue strategies and interaction information, where the interaction information includes interaction content, voice style, facial expression, body movements and interaction persona.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the following interaction information is generated: interaction content: "the toilet is 50 metres straight ahead, then turn right"; voice style: concise; facial expression: neutral; body movement: pointing in the target direction; interaction persona: warm.
In this embodiment, information is exchanged with the interactive object through multiple interaction modes such as voice style, facial expression, body movements and interaction persona, which makes the interaction more vivid and gives the interactive object richer interaction feedback.
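To make the structure of this interaction information concrete, the sketch below defines the output fields named above together with a toy lookup that stands in for the trained natural language generation model; the field values simply restate the toilet example and are not prescribed by the text.

```python
# Structure of the generated interaction information; the lookup is a stand-in for
# the trained NLG model, and the example values are illustrative only.
from dataclasses import dataclass


@dataclass
class InteractionInfo:
    content: str
    voice_style: str
    facial_expression: str
    body_movement: str
    persona: str


def toy_nlg(strategy_mode: str, text_label: str) -> InteractionInfo:
    if strategy_mode == "A" and text_label == "looking for a toilet":
        return InteractionInfo(
            content="The toilet is 50 metres straight ahead, then turn right.",
            voice_style="concise",
            facial_expression="neutral",
            body_movement="point toward the target direction",
            persona="warm",
        )
    return InteractionInfo("Sorry, could you say that again?", "calm", "smile", "none", "warm")
```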
In one embodiment, the interaction information further includes interaction scripts, i.e. interaction texts preset for different interaction situations of the user, for example fixed replies preset for fixed questions, or a fixed reply preset for the case where the user's interaction content cannot be determined clearly.
In one embodiment, as shown in FIG. 5, after step S500 inputs the target dialogue strategy into the natural language generation module to generate the interaction information, the method further includes:
S600, a virtual character image is generated according to the interaction content, voice style, facial expression, body movements and interaction persona, and interacts with the interactive object.
In this embodiment the virtual character behaves more naturally and vividly during the interaction, the user's state of mind is captured accurately and quickly from the multi-modal features the user conveys in real time, and the virtual character gives the user a more natural and fluent interaction experience.
In one embodiment, as shown in FIG. 6, according to another aspect of the present invention, there is further provided an interaction information generation system based on multi-modal features, which includes an obtaining module 10, a processing module 20, a classification module 30, a first generation module 40 and a second generation module 50.
Wherein the obtaining module 10 is configured to obtain multimodal features of the interactive object.
Specifically, the multi-modal features comprise an image feature V1, a voice feature V2 and a text feature V3. When the interactive object is detected to come within the preset distance threshold, image information and voice information of the interactive object are captured at preset time intervals within a preset time period by a camera device and a voice acquisition device, the dialogue text information of the interactive object is recognized from the voice information, and the image feature V1, the voice feature V2 and the text feature V3 are extracted from the image information, the voice information and the dialogue text information.
The processing module 20 is connected to the obtaining module 10, and is configured to perform normalization processing on the multi-modal features to obtain a target feature vector.
Specifically, NLU (Natural Language Understanding) feature fusion is performed on the image feature V1, the voice feature V2 and the text feature V3 to obtain the target feature vector V.
The classification module 30 is connected to the obtaining module 10, and is configured to classify the multi-modal features according to a preset classifier corresponding to each modal, so as to obtain a classification result corresponding to each modal feature.
For example, the image feature V1, the voice feature V2 and the text feature V3 are classified by an image classifier, a voice classifier and a text classifier respectively; the image feature V1 may be classified with the label "urgent", the voice feature V2 with the label "fast speech", and the text feature V3 with the label "looking for a toilet".
The first generating module 40 is connected to the processing module 20 and the classifying module 30, respectively, and is configured to input the target feature vector and the classification result corresponding to each modal feature into a preset dialog management module to generate a target dialog policy.
The second generating module 50 is connected to the first generating module 40, and is used for inputting the target dialogue strategy into the natural language generating module to generate the interactive information.
For example, from the label "urgent" of the image feature V1, the label "fast speech" of the voice feature V2 and the label "looking for a toilet" of the text feature V3, the interaction information "the toilet is 50 metres straight ahead, then turn right" is generated.
In this interaction information generation system based on multi-modal features, the multi-modal features are normalized to obtain a target feature vector and are classified per modality; a target dialogue strategy is generated from the target feature vector and the classification results, and the interaction information is obtained from that strategy. The interaction takes all the multi-modal features conveyed by the interactive object into account and analyzes them quantitatively, so the user's current interaction needs are fully understood and the interaction process becomes more accurate, more intuitive and more humanized.
In one embodiment, the present invention further provides a storage medium in which at least one instruction is stored, and the instruction is loaded and executed by a processor to implement the operations performed in the above embodiments of the interaction information generation method based on multi-modal features. For example, the storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In the foregoing embodiments, the descriptions of the respective embodiments have their respective emphasis, and reference may be made to the related descriptions of other embodiments for parts that are not described or recited in detail in a certain embodiment.
Those of ordinary skill in the art will appreciate that the various illustrative elements and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed interaction information generation method, system and storage medium based on multi-modal features may be implemented in other ways. For example, the embodiments described above are merely illustrative; the division into modules or units is only a logical functional division, and there may be other divisions in actual implementation, for example multiple units or modules may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings or communication connections shown or discussed between components may be indirect couplings or communication connections through interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention. For those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also fall within the protection scope of the present invention.

Claims (10)

1. A method for generating interaction information based on multi-modal features, comprising the steps of:
obtaining multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
normalizing the multi-modal features to obtain a target feature vector;
classifying the multi-modal features with a preset classifier corresponding to each modality to obtain a classification result corresponding to each modal feature;
inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy;
and inputting the target dialogue strategy into a natural language generation module to generate interaction information.
2. The method for generating interaction information based on multi-modal features according to claim 1, wherein the obtaining multi-modal features of the interaction object specifically comprises:
collecting image information, voice information and dialogue text information of the interactive object in a preset time period;
performing feature extraction on the image information through an image sequence analysis algorithm to obtain the image features;
performing feature extraction on the voice information through a voice feature extraction algorithm to obtain the voice features;
and performing feature extraction on the dialogue text information through a natural language processing algorithm to obtain the text features.
3. The method for generating interactive information based on multi-modal features according to claim 2, further comprising, after the acquiring the image information of the interactive object within a preset time period:
performing image segmentation on the image information through an image semantic segmentation model to generate expression image information and limb image information, wherein the image semantic segmentation model performs deep learning training generation based on an image information data set labeled with the expression image information and the limb image information;
the image feature extraction of the image information through an image sequence analysis algorithm to obtain the image feature specifically comprises:
and respectively carrying out feature extraction on the expression image information and the limb image information through the image sequence analysis algorithm to obtain expression image features and limb image features.
4. The method according to claim 3, wherein the image features are obtained by feature extraction of the image information through an image sequence analysis algorithm, and further comprising:
feature extraction is respectively carried out on the expression image information and the limb image information in the preset time period through a preset CNN + LSTM model, and an expression image feature group and a limb image feature group are obtained, wherein the expression image feature group comprises each expression image feature in the preset time period, and the limb image feature group comprises each limb image feature in the preset time period.
5. The method for generating interactive information based on multi-modal features according to claim 2, wherein the feature extraction of the speech information by the speech feature extraction algorithm to obtain the speech features specifically comprises:
performing feature extraction on the voice information through an MFCC algorithm to obtain the voice feature;
the extracting the features of the dialog text information through the natural language processing algorithm to obtain the text features specifically comprises the following steps:
performing feature extraction on the dialog text information through a preset regular expression to obtain a first text feature vector;
performing feature extraction on the dialogue text information through a preset LSTM model to obtain a second text feature vector;
and splicing and fusing the first text feature vector and the second text feature vector as the text features.
6. The method for generating interaction information based on multi-modal features according to claim 1, wherein inputting the target feature vector and the classification result corresponding to each modal feature into a preset dialogue management module to generate a target dialogue strategy specifically comprises:
and inputting the target feature vectors and the classification results corresponding to the modal features into a preset dialogue management module, and performing dialogue state management and strategy selection through a finite state machine, a Bayesian network and an LSTM network carried in the dialogue management module to generate the target dialogue strategy.
7. The multi-modal feature-based interaction information generation method according to claim 1,
the natural language generation module is generated by deep-learning training on a dataset containing correspondences between the target dialogue strategy and the interaction information, and the interaction information comprises interaction content, voice style, facial expression, body movements and interaction persona.
8. The method of claim 7, wherein after inputting the target dialog strategy into a natural language generating module to generate the interaction information, the method further comprises:
and generating a virtual character image according to the interaction content, the voice style, the facial expression, the body movements and the interaction persona, and interacting with the interactive object.
9. An interactive information generating system based on multi-modal features, comprising:
an acquisition module, configured to acquire multi-modal features of an interactive object, wherein the multi-modal features comprise image features, voice features and text features;
the processing module is connected with the acquisition module and used for carrying out normalization processing on the multi-modal features to obtain target feature vectors;
the classification module is connected with the acquisition module and is used for classifying the multi-modal characteristics according to a preset classifier corresponding to each mode to obtain a classification result corresponding to each mode characteristic;
the first generation module is respectively connected with the processing module and the classification module and is used for inputting the target feature vector and the classification result corresponding to each modal feature into a preset conversation management module to generate a target conversation strategy;
and the second generation module is connected with the first generation module and used for inputting the target dialogue strategy into the natural language generation module to generate interactive information.
10. A storage medium, characterized in that at least one instruction is stored in the storage medium, and the instruction is loaded and executed by a processor to implement the operations performed by the multi-modal feature-based interaction information generation method according to any one of claims 1 to 8.
CN202210955672.2A 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics Pending CN115309882A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210955672.2A CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210955672.2A CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Publications (1)

Publication Number Publication Date
CN115309882A true CN115309882A (en) 2022-11-08

Family

ID=83861414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210955672.2A Pending CN115309882A (en) 2022-08-10 2022-08-10 Interactive information generation method, system and storage medium based on multi-modal characteristics

Country Status (1)

Country Link
CN (1) CN115309882A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117932043A (en) * 2024-03-22 2024-04-26 杭州食方科技有限公司 Dialogue style migration reply information display method, device, equipment and readable medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination