CN116301388B - Man-machine interaction scene system for intelligent multi-mode combined application

Man-machine interaction scene system for intelligent multi-mode combined application

Info

Publication number
CN116301388B
Authority
CN
China
Prior art keywords
decision
user
mode
module
probability
Prior art date
Legal status
Active
Application number
CN202310524033.5A
Other languages
Chinese (zh)
Other versions
CN116301388A (en)
Inventor
张卫平
米小武
吴茜
王丹
张伟
Current Assignee
Global Digital Group Co Ltd
Original Assignee
Global Digital Group Co Ltd
Priority date
Filing date
Publication date
Application filed by Global Digital Group Co Ltd
Priority to CN202310524033.5A
Publication of CN116301388A
Application granted
Publication of CN116301388B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 Eye tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/197 Matching; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module. The data acquisition module acquires multi-modal data of a user; the feature extraction module extracts the features of each modality from the multi-modal data and predicts, from those features, the decisions expressed by the user in each modality together with their probabilities; the decision fusion module analyzes and calculates the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module completes man-machine interaction according to the final decision instruction. By collecting and analyzing voice, gesture and eye movement information, the invention can make more accurate decisions and interpretations, overcoming the limitations of a single modality.

Description

Man-machine interaction scene system for intelligent multi-mode combined application
Technical Field
The invention relates to the field of multi-mode human-computer interaction, in particular to a human-computer interaction scene system for intelligent multi-mode combined application.
Background
In recent years, with the rapid development of computer vision, natural language processing, acoustic signal processing and related fields, multi-modal human-computer interaction has increasingly become a research hotspot and an application focus. A human-computer interaction system for multi-modal combined application effectively combines multiple input modalities such as voice, images and gestures, achieving more flexible, intelligent and natural human-computer interaction and greatly improving interaction efficiency.
Among related published technical solutions, the prior art CN114020153A discloses a multi-modal human-computer interaction method and device. The method comprises: acquiring interactive text information from a user; predicting a transitional phrase from the interactive text information; acquiring the multi-modal content corresponding to the transitional phrase, taking it as first reply content and pushing it to the virtual-person client; and generating the multi-modal content corresponding to the reply text of the interactive text information, taking it as second reply content and pushing it to the virtual-person client. By inserting transitional phrases before the formal reply content and processing the reply text in segments, a single-round reply becomes a multi-round reply, the response speed of the virtual person is improved, and a smooth human-computer interaction experience is achieved. Another typical prior art, publication number CN111554279A, discloses a Kinect-based multi-modal human-computer interaction system built as follows: construct a data acquisition system capable of receiving the multi-modal data collected by Kinect; train an acoustic model and a language model on monophones to obtain a speech recognition module; create a lip-movement dataset for machine learning from the acquired colour-image data; train a lip-reading recognition model on the lip-movement dataset using a convolutional neural network based on a residual neural network; and integrate the data acquisition system, the speech recognition model and the lip-reading recognition model into a multi-modal human-computer interaction system, which enhances the robustness of speech recognition. However, both modalities of the first solution are textual, so interactivity with the user is limited; in the second solution, multi-modal recognition is completed only by a single confidence comparison at the multi-modal decision layer, so adaptability and accuracy are low.
Disclosure of Invention
The invention aims to provide a man-machine interaction scene system for intelligent multi-mode combined application aiming at the defects existing at present.
The invention adopts the following technical scheme:
a man-machine interaction scene system for intelligent multi-mode combined application comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting the features of each modality from the multi-modal data and predicting, from those features, the decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module covers all the decisions acquired under each modality; there are $n$ decisions in total, and the set of decisions is $D=\{d_1,d_2,\ldots,d_n\}$; the decision fusion module generates the following probability matrix from the decisions and decision probabilities acquired under each modality:

$$P=\begin{bmatrix}p_{11}&p_{12}&\cdots&p_{1n}\\p_{21}&p_{22}&\cdots&p_{2n}\\p_{31}&p_{32}&\cdots&p_{3n}\end{bmatrix}$$

wherein $p_{1i}$ in $P$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, and for all $p_{ji}$, $i$ satisfies $1\le i\le n$;
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein, the liquid crystal display device comprises a liquid crystal display device,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein, the liquid crystal display device comprises a liquid crystal display device,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing decision making in gesture mode of user>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, each weight $w_{ji}$ in the weight matrix is computed from the corresponding deviation $s_{ji}$, such that a larger deviation yields a smaller weight.
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
The beneficial effects obtained by the invention are as follows:
the invention collects the voice, gesture and eye movement information of the user through the data acquisition module; extracting characteristics of voice, gesture and eye movement information of a user through a characteristic extraction module, and predicting decision and decision probability under each mode according to each characteristic; the decision fusion module is used for constructing a probability matrix for decisions and decision probabilities in all modes, and providing weights for the probability matrix according to the average probability of the same decision made in all modes, and the weighted probability matrix is used as a final decision judgment matrix, so that the final decision of judgment comprehensively considers multi-mode information, and the accuracy is higher.
Drawings
The invention will be further understood from the following description taken in conjunction with the accompanying drawings. The components in the figures are not necessarily to scale, emphasis instead being placed upon illustrating the principles of the embodiments. Like reference numerals designate corresponding parts throughout the different views.
FIG. 1 is a schematic diagram of the overall module of the present invention.
FIG. 2 is a schematic diagram of a decision making and probability obtaining process for each mode according to the present invention.
Fig. 3 is a schematic diagram of the interaction of the present invention in a banking scenario.
The meaning of the reference numerals in the figures: 1-camera, 2-interactive interface, 3-microphone.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples; it should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the invention; other systems, methods and/or features of the present embodiments will be or become apparent to one skilled in the art upon examination of the following detailed description; it is intended that all such additional systems, methods, features and advantages be included within this description, fall within the scope of the invention and be protected by the accompanying claims; additional features of the disclosed embodiments are described in, and will be apparent from, the following detailed description.
The same or similar reference numerals in the drawings of the embodiments of the invention correspond to the same or similar components. In the description of the invention, it should be understood that any orientation or positional relationship indicated by terms such as "upper", "lower", "left" and "right" is based on the orientation or positional relationship shown in the drawings, is used only for convenience of describing the invention and simplifying the description, and does not indicate or imply that the apparatus or component referred to must have a specific orientation or be constructed and operated in that orientation; such terms are merely illustrative, are not to be construed as limiting the patent, and their specific meanings can be understood by those skilled in the art according to the specific circumstances.
Embodiment one: as shown in fig. 1 and fig. 2, the present embodiment provides a human-computer interaction scene system for intelligent multi-mode combined application, which includes a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting the features of each modality from the multi-modal data and predicting, from those features, the decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module covers all the decisions acquired under each modality; there are $n$ decisions in total, and the set of decisions is $D=\{d_1,d_2,\ldots,d_n\}$; the decision fusion module generates the following probability matrix from the decisions and decision probabilities acquired under each modality:

$$P=\begin{bmatrix}p_{11}&p_{12}&\cdots&p_{1n}\\p_{21}&p_{22}&\cdots&p_{2n}\\p_{31}&p_{32}&\cdots&p_{3n}\end{bmatrix}$$

wherein $p_{1i}$ in $P$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, and for all $p_{ji}$, $i$ satisfies $1\le i\le n$;
Further, the decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained as follows:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein, the liquid crystal display device comprises a liquid crystal display device,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
according toGenerating a weight matrix, and giving weight to the decision in each mode, wherein the weight matrix has the following formula:
wherein, the liquid crystal display device comprises a liquid crystal display device,middle->Decision making in the voice modality of the corresponding user>Weight of->Representing decision making in the speech modality of the user>Weight of->Middle->Gesture modality of corresponding user->Decision making->Weight of->Representing a user's handDecision making ∈under the potential modality>Weight of->Middle->Eye movement modality of the corresponding user->Decision making->Weight of->Decision making in the eye movement modality of the representative user>Is for all +.>I in (2) satisfies->
Further, each weight $w_{ji}$ in the weight matrix is computed from the corresponding deviation $s_{ji}$, such that a larger deviation yields a smaller weight.
further, the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction.
This embodiment collects the user's voice, gesture and eye movement information through the data acquisition module; extracts features from the voice, gesture and eye movement information through the feature extraction module and predicts the decision and decision probability under each modality from those features; and acquires the user's final decision instruction through the decision fusion module. The combined use of several interaction modalities completes interaction tasks more conveniently and efficiently, so the user need not rely on a single interaction modality; it provides a more natural and intuitive way of interacting, makes the interaction experience smoother and more comfortable, overcomes the shortcomings of any single modality (such as the low accuracy of speech recognition or the poor reliability of gesture recognition), and improves the reliability and adaptability of the interaction. The decision fusion module builds a probability matrix from the decisions and decision probabilities of all modalities and assigns it weights based on the average probability of the same decision across the modalities; the weighted probability matrix serves as the final decision matrix, so the final decision comprehensively considers the multi-modal information and is more accurate.
Embodiment two: this embodiment should be understood to include at least all of the features of any one of the foregoing embodiments, and be further modified based thereon;
the embodiment provides a man-machine interaction scene system for intelligent multi-mode combined application, which comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting the features of each modality from the multi-modal data and predicting, from those features, the decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the system further comprises man-machine interaction equipment, wherein the man-machine interaction equipment is provided with a microphone and an infrared camera, the voice acquisition module acquires voice information of a user through the microphone, and the gesture acquisition module and the eye movement acquisition module acquire gesture action information and eye movement information of the user through the infrared camera;
the man-machine interaction equipment is also provided with an interaction interface and an interaction audio output device, and the interaction module completes interaction between the man and the machine through the interaction interface and the interaction audio output device;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module,
the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the voice feature extraction module extracts voice features and obtains a decision of a voice mode and a decision probability of the voice mode as follows:
s101: performing voice preprocessing on the acquired voice information of the user, wherein the voice preprocessing operation comprises noise removal and voice signal enhancement;
s102: extracting voice characteristics from the preprocessed voice information, wherein in the embodiment, the MFCC technology is used for extracting the voice characteristics;
s103: inputting the voice characteristics into a pre-trained voice recognition model, and outputting a voice mode decision and a voice mode decision probability;
for the speech recognition model in step S103, it includes:
input layer: for receiving input speech features;
an intermediate layer: comprises a plurality of recurrent neural network units that model the input speech features;
an output layer: maps the output of the intermediate layer to a label sequence, which is the probability distribution over the recognition results; the activation function of the output layer is a common classification function such as softmax;
a decoder: decodes the label sequence to obtain the recognition result, using a known decoding algorithm such as greedy decoding or beam search; the decision of the voice modality and its probability are thereby obtained;
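As a concrete illustration of steps S101 to S103, the sketch below assumes librosa for MFCC extraction and PyTorch for the recurrent recognition model; the network dimensions, the decision vocabulary size and the greedy argmax decoding are illustrative assumptions, and the label sequence is collapsed to a single per-utterance decision for brevity rather than decoded with beam search as the full model would allow.

```python
import librosa
import torch
import torch.nn as nn

class SpeechDecisionModel(nn.Module):
    """Input layer -> recurrent intermediate layer -> softmax output layer."""
    def __init__(self, n_mfcc=13, hidden=128, n_decisions=10):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_mfcc, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_decisions)

    def forward(self, feats):                     # feats: (batch, time, n_mfcc)
        h, _ = self.rnn(feats)                    # model the feature sequence
        logits = self.out(h[:, -1, :])            # last time step -> decision logits
        return torch.softmax(logits, dim=-1)      # probability distribution

def speech_decision(wav_path, model):
    # S101/S102: load the audio and extract MFCC features (preprocessing omitted).
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, time)
    feats = torch.tensor(mfcc.T, dtype=torch.float32).unsqueeze(0)
    # S103: predict the voice-modality decision and its probability (greedy decode).
    with torch.no_grad():
        probs = model(feats)[0]
    idx = int(probs.argmax())
    return idx, float(probs[idx])
```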
the gesture feature extraction module extracts gesture features and acquires a gesture mode decision and a gesture mode decision probability mode as follows:
s201: performing gesture preprocessing on the obtained gesture action information of the user, wherein the gesture preprocessing operation comprises denoising, binarizing and graying processing on the gesture action information;
s202: separating the hand parts in the preprocessed gesture motion information by an image segmentation technology;
s203: the extraction and selection of hand features of the hand part are completed through a CNN hand feature extraction model, wherein the CNN hand feature extraction model is a hand feature extraction model established by a technician in advance by using CNN based on experimental data;
s204: inputting the hand characteristics selected in the step S203 into a pre-trained gesture decision tree classifier, and outputting a decision of a gesture mode and a gesture mode decision probability;
the eye movement feature extraction module extracts eye movement features and acquires the decision of an eye movement mode and the decision probability of the eye movement mode as follows:
s301: performing eye movement pretreatment on the obtained eye movement information of the user, wherein the eye movement pretreatment operation comprises noise removal and eye movement error correction;
s302: the extraction and selection of the eye movement characteristics are completed through a CNN eye movement characteristic extraction model, wherein the CNN eye movement characteristic extraction model is an eye movement characteristic extraction model established by a technician in advance by using CNN based on experimental data;
s303: inputting the eye movement characteristics selected in the step S302 into a pre-trained eye movement decision tree classifier, and outputting the decision of the eye movement mode and the decision probability of the eye movement mode;
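Steps S201 to S204 and S301 to S303 share one pattern: a pre-trained CNN supplies a feature vector and a decision tree classifier turns it into a decision and its probability. The sketch below illustrates that pattern with OpenCV preprocessing and scikit-learn's DecisionTreeClassifier; the Otsu binarization, the 64-dimensional features and the random training data are illustrative assumptions standing in for the CNN feature extraction models and experimental data described above.

```python
import cv2
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def preprocess_gesture(frame):
    """S201/S202 sketch: denoise, convert to grayscale and binarize a frame."""
    blurred = cv2.GaussianBlur(frame, (5, 5), 0)
    gray = cv2.cvtColor(blurred, cv2.COLOR_BGR2GRAY)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary

def modality_decision(feature_vector, clf):
    """S204/S303: map a CNN feature vector to (decision index, probability)."""
    probs = clf.predict_proba([feature_vector])[0]
    idx = int(np.argmax(probs))
    return idx, float(probs[idx])

# Training sketch: features are assumed to come from the CNN feature extraction
# model; here random data stands in for the experimental data.
X_train = np.random.rand(100, 64)                  # (num_samples, feature_dim)
y_train = np.random.randint(0, 5, size=100)        # decision labels
clf = DecisionTreeClassifier(max_depth=8).fit(X_train, y_train)
decision, prob = modality_decision(np.random.rand(64), clf)
```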
the decision fusion module obtains a final decision instruction by weighting the decisions and the decision probabilities of the modes obtained in the feature extraction module, and the specific implementation mode is as follows:
There are $n$ decisions in the system in total; let the set of decisions be $D=\{d_1,d_2,\ldots,d_n\}$. The following probability matrix is generated from the decisions and decision probabilities acquired in each modality:

$$P=\begin{bmatrix}p_{11}&p_{12}&\cdots&p_{1n}\\p_{21}&p_{22}&\cdots&p_{2n}\\p_{31}&p_{32}&\cdots&p_{3n}\end{bmatrix}$$

wherein $p_{1i}$ in $P$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, and for all $p_{ji}$, $i$ satisfies $1\le i\le n$;
the average probability of the user making decision $d_i$ across the modalities is computed as:

$$\bar{p}_i=\frac{1}{3}\sum_{j=1}^{3}p_{ji}$$

wherein $\bar{p}_i$ is the average probability of the user making decision $d_i$ over the modalities, $p_{ji}$ is the probability of the user making decision $d_i$ in modality $j$, with $j=1$ representing the voice modality, $j=2$ the gesture modality and $j=3$ the eye movement modality; then compute:

$$s_{ji}=\left|p_{ji}-\bar{p}_i\right|$$

wherein $s_{ji}$ represents the distance between the probability of the user making decision $d_i$ in modality $j$ and the average probability; the larger this distance, the smaller the correlation between the decision made in that modality and the average decision;
a weight matrix is generated from $s_{ji}$, assigning a weight to the decision in each modality, with the weight matrix of the form:

$$W=\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n}\\w_{21}&w_{22}&\cdots&w_{2n}\\w_{31}&w_{32}&\cdots&w_{3n}\end{bmatrix}$$

wherein $w_{1i}$ in $W$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight of the user's decision $d_i$ in the gesture modality, and $w_{3i}$ is the weight of the user's decision $d_i$ in the eye movement modality, and for all $w_{ji}$, $i$ satisfies $1\le i\le n$;
Each weight $w_{ji}$ in the weight matrix is computed from the corresponding deviation $s_{ji}$, such that a larger deviation yields a smaller weight;
multiplying the probability matrix by the weight matrix to generate a final decision matrix, and extracting a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction;
the interaction module receives a final decision instruction of the decision fusion module and completes interaction between human and machine according to the final decision instruction;
when the interaction module receives two or more final decision instructions that contradict each other, the interaction audio output device in the interaction module interacts with the user to obtain more accurate modal information; for example, when the two final decision instructions received by the interaction module are "forward" and "backward" respectively, the interaction module asks the user through the interaction audio output device: "please confirm whether the next instruction is forward or backward", and generates a new final decision instruction from the user's subsequent modal information to carry out the interaction;
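A minimal sketch of this conflict-handling logic is given below; the instruction names, the conflict table and the prompt text are illustrative assumptions rather than the wording used by the system.

```python
# Pairs of final decision instructions that cannot be executed together (assumed).
CONFLICTS = {("forward", "backward"), ("backward", "forward")}

def resolve(instructions, ask_user):
    """Return a single instruction, querying the user when instructions conflict."""
    if len(instructions) >= 2:
        for a in instructions:
            for b in instructions:
                if (a, b) in CONFLICTS:
                    # The interaction audio output device asks for clarification.
                    return ask_user(f"Please confirm whether the next instruction is {a} or {b}")
    return instructions[0]

# Example: two contradictory instructions trigger a spoken confirmation request.
final = resolve(["forward", "backward"], ask_user=lambda prompt: input(prompt + ": "))
```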
the system can be applied to various interaction scenes, such as home, hospital or bank, etc.; the system can be used as a bank intelligent business handling counter, acquire voice information, gesture action information and eye movement information of a user through a camera and a microphone, acquire a decision instruction of the user through a feature extraction module and a decision fusion module, and make corresponding interaction according to the decision instruction of the user through an interaction module as shown in fig. 3; as in fig. 3, the user sends out voice message "i need to transact XXX service", the system obtains the decision instruction of the user, and displays "transact XXX service needs XX/XXX procedure, ask about to transact now? "complete man-machine interaction".
In this embodiment, the decision and decision probability derived from the voice information are output by building a speech recognition model; the decision and decision probability derived from the gesture information are output by building a CNN hand feature extraction model and a gesture decision tree classifier; and the decision and decision probability derived from the eye movement information are output by building a CNN eye movement feature extraction model and an eye movement decision tree classifier, thereby providing the basis for fusing all modalities. The final decision matrix is obtained by weighting the probability matrix, yielding the final decision instruction; the system therefore fuses a wider range of user interaction information and makes more accurate decisions.
The foregoing disclosure is only a preferred embodiment of the present invention and is not intended to limit the scope of the invention, so that all equivalent technical changes made by applying the description of the present invention and the accompanying drawings are included in the scope of the present invention, and in addition, elements in the present invention can be updated as the technology develops.

Claims (1)

1. A man-machine interaction scene system for intelligent multi-mode combined application comprises a data acquisition module, a feature extraction module, a decision fusion module and an interaction module;
the data acquisition module is used for acquiring multi-modal data of a user; the feature extraction module is used for extracting the features of each modality from the multi-modal data and predicting, from those features, the decisions expressed by the user in each modality and their probabilities; the decision fusion module is used for analyzing and calculating the decisions and decision probabilities expressed by the user in each modality to obtain a final decision instruction; and the interaction module is used for completing man-machine interaction according to the final decision instruction;
the data acquisition module comprises a voice acquisition module, a gesture acquisition module and an eye movement acquisition module, wherein the voice acquisition module is used for acquiring voice information of a user, the gesture acquisition module is used for acquiring gesture action information of the user, and the eye movement acquisition module is used for acquiring eye movement information of the user;
the feature extraction module comprises a voice feature extraction module, a gesture feature extraction module and an eye movement feature extraction module, wherein the voice feature extraction module is used for extracting voice features and predicting the decision of a user in a voice mode and the decision probability of the voice mode according to the voice features; the gesture feature extraction module is used for extracting gesture features and predicting the decision of a user in a gesture mode and the decision probability of the gesture mode according to the gesture features; the eye movement characteristic extraction module is used for extracting eye movement characteristics and predicting the decision of a user in an eye movement mode and the decision probability of the eye movement mode according to the eye movement characteristics;
the decision fusion module covers all the decisions acquired under each modality; there are $n$ decisions in total, and the set of decisions is $D=\{d_1,d_2,\ldots,d_n\}$; the decision fusion module generates the following probability matrix from the decisions and decision probabilities acquired under each modality:

$$P=\begin{bmatrix}p_{11}&p_{12}&\cdots&p_{1n}\\p_{21}&p_{22}&\cdots&p_{2n}\\p_{31}&p_{32}&\cdots&p_{3n}\end{bmatrix}$$

wherein $p_{1i}$ in $P$ is the probability of the user making decision $d_i$ in the voice modality, $p_{2i}$ is the probability of the user making decision $d_i$ in the gesture modality, and $p_{3i}$ is the probability of the user making decision $d_i$ in the eye movement modality, and for all $p_{ji}$, $i$ satisfies $1\le i\le n$;
The decision fusion module generates a weight matrix according to the probability matrix, and the weight matrix is obtained in the following manner:
computing a decision made by a user on each modalityIs a mean probability of (2):
wherein, the liquid crystal display device comprises a liquid crystal display device,decision making for the user on modalities +.>Average probability of +.>For the user at->Decision making ∈>Probability of->Representing the speech modality->Representing gesture modality->Represents an eye movement modality; and (3) calculating:
wherein, the liquid crystal display device comprises a liquid crystal display device,representing the user at->Decision making ∈>The larger the distance between the probabilities of (2) and the average probability, the smaller the correlation between the decision made in the mode and the average decision;
a weight matrix is generated from $s_{ji}$, assigning a weight to the decision in each modality, with the weight matrix of the form:

$$W=\begin{bmatrix}w_{11}&w_{12}&\cdots&w_{1n}\\w_{21}&w_{22}&\cdots&w_{2n}\\w_{31}&w_{32}&\cdots&w_{3n}\end{bmatrix}$$

wherein $w_{1i}$ in $W$ is the weight of the user's decision $d_i$ in the voice modality, $w_{2i}$ is the weight of the user's decision $d_i$ in the gesture modality, and $w_{3i}$ is the weight of the user's decision $d_i$ in the eye movement modality, and for all $w_{ji}$, $i$ satisfies $1\le i\le n$;
each weight $w_{ji}$ in the weight matrix is computed from the corresponding deviation $s_{ji}$, such that a larger deviation yields a smaller weight;
the decision fusion module multiplies the probability matrix and the weight matrix to generate a final decision matrix, and extracts a decision corresponding to a value larger than a threshold value in the final decision matrix as a final decision instruction;
the interaction module receives a final decision instruction of the decision fusion module and completes interaction between human and machine according to the final decision instruction;
when the interaction module receives two or more final decision instructions that contradict each other, the interaction audio output device in the interaction module interacts with the user to obtain more accurate modal information.
CN202310524033.5A 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application Active CN116301388B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310524033.5A CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Publications (2)

Publication Number Publication Date
CN116301388A CN116301388A (en) 2023-06-23
CN116301388B true CN116301388B (en) 2023-08-01

Family

ID=86789013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310524033.5A Active CN116301388B (en) 2023-05-11 2023-05-11 Man-machine interaction scene system for intelligent multi-mode combined application

Country Status (1)

Country Link
CN (1) CN116301388B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843381A (en) * 2016-03-18 2016-08-10 北京光年无限科技有限公司 Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN111460494A (en) * 2020-03-24 2020-07-28 广州大学 Multi-mode deep learning-oriented privacy protection method and system

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4220258A1 (en) * 2017-04-19 2023-08-02 Magic Leap, Inc. Multimodal task execution and text editing for a wearable system
CN108983636B (en) * 2018-06-20 2020-07-17 浙江大学 Man-machine intelligent symbiotic platform system
CN111722713A (en) * 2020-06-12 2020-09-29 天津大学 Multi-mode fused gesture keyboard input method, device, system and storage medium
CN114154549A (en) * 2021-08-30 2022-03-08 华北电力大学 Gas turbine actuating mechanism fault diagnosis method based on multi-element feature fusion

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105843381A (en) * 2016-03-18 2016-08-10 北京光年无限科技有限公司 Data processing method for realizing multi-modal interaction and multi-modal interaction system
CN111460494A (en) * 2020-03-24 2020-07-28 广州大学 Multi-mode deep learning-oriented privacy protection method and system

Also Published As

Publication number Publication date
CN116301388A (en) 2023-06-23

Similar Documents

Publication Publication Date Title
US10354362B2 (en) Methods and software for detecting objects in images using a multiscale fast region-based convolutional neural network
CN112767329B (en) Image processing method and device and electronic equipment
WO2019245768A1 (en) System for predicting articulated object feature location
JP2023541532A (en) Text detection model training method and apparatus, text detection method and apparatus, electronic equipment, storage medium, and computer program
CN111274372A (en) Method, electronic device, and computer-readable storage medium for human-computer interaction
CN111860362A (en) Method and device for generating human face image correction model and correcting human face image
CN110232340A (en) Establish the method, apparatus of video classification model and visual classification
CN113515942A (en) Text processing method and device, computer equipment and storage medium
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN114495916B (en) Method, device, equipment and storage medium for determining insertion time point of background music
CN111783557A (en) Wearable blind guiding equipment based on depth vision and server
CN113780326A (en) Image processing method and device, storage medium and electronic equipment
CN114863437A (en) Text recognition method and device, electronic equipment and storage medium
CN112308006A (en) Sight line area prediction model generation method and device, storage medium and electronic equipment
Wang et al. (2+ 1) D-SLR: an efficient network for video sign language recognition
CN112270246A (en) Video behavior identification method and device, storage medium and electronic equipment
CN113379877A (en) Face video generation method and device, electronic equipment and storage medium
CN112101204A (en) Training method of generative countermeasure network, image processing method, device and equipment
CN116301388B (en) Man-machine interaction scene system for intelligent multi-mode combined application
CN112016523A (en) Cross-modal face recognition method, device, equipment and storage medium
CN116402914A (en) Method, device and product for determining stylized image generation model
CN109857244B (en) Gesture recognition method and device, terminal equipment, storage medium and VR glasses
CN115828889A (en) Text analysis method, emotion classification model, device, medium, terminal and product
CN113269068B (en) Gesture recognition method based on multi-modal feature adjustment and embedded representation enhancement
CN110263743B (en) Method and device for recognizing images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant